Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 77]
cs.CV [Total: 215]
cs.AI [Total: 44]
cs.SD [Total: 11]
cs.LG [Total: 164]
cs.MA [Total: 6]
cs.MM [Total: 3]
eess.AS [Total: 4]
eess.IV [Total: 10]

cs.CL

[1] Democratizing LLM Efficiency: From Hyperscale Optimizations to Universal Deployability

Hen-Hsen Huang

Main category: cs.CL

TL;DR: The paper argues that current LLM efficiency methods (MoE, speculative decoding, RAG) only work for hyperscale providers and fail in modest-resource contexts, proposing a new research agenda focused on robust simplicity and overhead-aware efficiency to democratize LLM deployment.

Details

Motivation: Current efficiency methods benefit only Big Tech companies while leaving hospitals, schools, governments, and enterprises without viable options due to excessive overhead, fragility, and carbon waste in modest-resource environments.

Method: Proposes retrofitting pretrained models without retraining, lightweight fine-tuning that preserves alignment, economical reasoning for long chains of thought, dynamic knowledge management without heavy RAG pipelines, and adopting Overhead-Aware Efficiency (OAE) as a standard benchmark.

Result: The paper presents a conceptual framework and research agenda rather than empirical results, arguing for redefining efficiency to include adoption cost, sustainability, and fairness.

Conclusion: By focusing on robust simplicity and overhead-aware efficiency, LLM deployment can be democratized to reduce inequality and carbon waste rather than amplifying existing disparities.

Abstract: Large language models (LLMs) have become indispensable, but the most celebrated efficiency methods – mixture-of-experts (MoE), speculative decoding, and complex retrieval-augmented generation (RAG) – were built for hyperscale providers with vast infrastructure and elite teams. Outside that context, their benefits collapse into overhead, fragility, and wasted carbon. The result is that a handful of Big Tech companies benefit, while thousands of hospitals, schools, governments, and enterprises are left without viable options. We argue that the next frontier is not greater sophistication at scale, but robust simplicity: efficiency that thrives under modest resources and minimal expertise. We propose a new research agenda: retrofitting pretrained models with more efficient architectures without retraining, inventing lightweight fine-tuning that preserves alignment, making reasoning economical despite long chains of thought, enabling dynamic knowledge management without heavy RAG pipelines, and adopting Overhead-Aware Efficiency (OAE) as a standard benchmark. By redefining efficiency to include adoption cost, sustainability, and fairness, we can democratize LLM deployment – ensuring that optimization reduces inequality and carbon waste rather than amplifying them.

[2] Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free, Deterministic, and Reversible Embedding Methodology

Tcharlies Schmitz

Main category: cs.CL

TL;DR: HTP is a reversible, deterministic framework for text embeddings that encodes tokens as harmonic trajectories from Unicode, achieving competitive semantic similarity scores without training or vocabularies.

Details

Motivation: To create transparent, efficient text embeddings that don't rely on statistical co-occurrence, optimization, or training data, providing a deterministic alternative to neural embeddings.

Method: Encodes each token analytically as a harmonic trajectory derived from its Unicode integer representation, establishing a bijective mapping between symbols and continuous vector space using phase-coherent projections.

Result: Achieves Spearman correlation of ρ = 0.68 on STS-B benchmark, maintains stable performance across 10 languages with sub-millisecond latency and negligible computational cost.

Conclusion: Meaningful semantic relations can emerge from deterministic geometry, offering a transparent and efficient alternative to data-driven embeddings.

Abstract: This paper introduces the Harmonic Token Projection (HTP), a reversible and deterministic framework for generating text embeddings without training, vocabularies, or stochastic parameters. Unlike neural embeddings that rely on statistical co-occurrence or optimization, HTP encodes each token analytically as a harmonic trajectory derived from its Unicode integer representation, establishing a bijective and interpretable mapping between discrete symbols and continuous vector space. The harmonic formulation provides phase-coherent projections that preserve both structure and reversibility, enabling semantic similarity estimation from purely geometric alignment. Experimental evaluation on the Semantic Textual Similarity Benchmark (STS-B) and its multilingual extension shows that HTP achieves a Spearman correlation of \r{ho} = 0.68 in English, maintaining stable performance across ten languages with negligible computational cost and sub-millisecond latency per sentence pair. This demonstrates that meaningful semantic relations can emerge from deterministic geometry, offering a transparent and efficient alternative to data-driven embeddings. Keywords: Harmonic Token Projection, reversible embedding, deterministic encoding, semantic similarity, multilingual representation.

[3] A centroid based framework for text classification in itsm environments

Hossein Mohanna, Ali Ait-Bachir

Main category: cs.CL

TL;DR: A dual-embedding centroid-based framework for hierarchical text classification in ITSM systems that achieves competitive performance with SVM while providing interpretability and significant speed improvements.

Details

Motivation: Hierarchical text classification is essential in IT Service Management systems for categorizing support tickets into tree-structured taxonomies, requiring methods that balance performance, interpretability, and operational efficiency.

Method: Dual-embedding centroid-based classification framework that maintains separate semantic and lexical centroid representations per category, combining them through reciprocal rank fusion at inference time.

Result: Achieves hierarchical F1 score of 0.731 (vs 0.727 for SVM), 5.9x faster training, up to 152x faster incremental updates, and 8.6-8.8x speedup across batch sizes (100-1000 samples) when excluding embedding computation.

Conclusion: The method is suitable for production ITSM environments that prioritize interpretability through centroid representations and operational efficiency through significant speed improvements.

Abstract: Text classification with hierarchical taxonomies is a fundamental requirement in IT Service Management (ITSM) systems, where support tickets must be categorized into tree-structured taxonomies. We present a dual-embedding centroid-based classification framework that maintains separate semantic and lexical centroid representations per category, combining them through reciprocal rank fusion at inference time. The framework achieves performance competitive with Support Vector Machines (hierarchical F1: 0.731 vs 0.727) while providing interpretability through centroid representations. Evaluated on 8,968 ITSM tickets across 123 categories, this method achieves 5.9 times faster training and up to 152 times faster incremental updates. With 8.6-8.8 times speedup across batch sizes (100-1000 samples) when excluding embedding computation. These results make the method suitable for production ITSM environments prioritizing interpretability and operational efficiency.

[4] PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation

Yongfu Xue

Main category: cs.CL

TL;DR: PIRA is a training paradigm that improves reward models for LLMs by reformulating question-answer pairs into preference instructions, aggregating rewards from diverse tasks, and stabilizing outputs with dropout averaging.

Details

Motivation: Traditional reward models have low data efficiency due to direct concatenation of questions and responses, and are vulnerable to reward overoptimization.

Method: Three strategies: (1) Reformulate question-answer pairs into preference-based instructions, (2) aggregate rewards from diverse preference tasks, (3) average value-head outputs under varying dropout rates.

Result: Extensive experiments demonstrate the effectiveness of PIRA in improving reward model performance.

Conclusion: PIRA successfully addresses key challenges in reward modeling through its three-component approach, enhancing data efficiency and robustness against overoptimization.

Abstract: Reward models are crucial for aligning Large Language Models (LLMs) with human preferences but face two representative challenges. First, traditional discriminative reward models usually concatenate questions and responses directly as input, resulting in low data efficiency. Second, reward models are vulnerable to reward overoptimization. We propose PIRA, a training paradigm addressing these issues through three strategies: (1) Reformulating question-answer pairs into preference-based instructions for clearer and more explicit task specification, (2) aggregating rewards from diverse preference tasks to reduce bias and improve robustness, and (3) averaging value-head outputs under varying dropout rates to stabilize rewards. Extensive experiments have demonstrated the effectiveness of PIRA.

[5] Structured Definitions and Segmentations for Legal Reasoning in LLMs: A Study on Indian Legal Data

Mann Khatri, Mirza Yusuf, Rajiv Ratn Shah, Ponnurangam Kumaraguru

Main category: cs.CL

TL;DR: LLMs struggle with legal tasks due to lack of domain-specific training. The paper shows that organizing legal documents by rhetorical roles and explaining legal terminology improves model performance by 1.5-4.36% in F1 score on Indian legal judgment prediction.

Details

Motivation: LLMs lack domain-specific pretraining for legal tasks, and legal documents are long and complex, making it difficult for models to process them efficiently.

Method: Three experimental approaches: (i) reorganizing documents by rhetorical roles, (ii) defining rhetorical roles to familiarize models with legal terminology, (iii) emulating court reasoning steps regarding rhetorical roles. Conducted in zero-shot setting on three Indian legal judgment prediction datasets.

Result: Organizing data or explaining key legal terms significantly boosts model performance with minimum 1.5% and maximum 4.36% improvement in F1 score compared to baseline.

Conclusion: Structuring legal information through rhetorical roles and explaining legal terminology effectively enhances LLM performance on legal tasks without requiring full domain-specific pretraining.

Abstract: Large Language Models (LLMs), trained on extensive datasets from the web, exhibit remarkable general reasoning skills. Despite this, they often struggle in specialized areas like law, mainly because they lack domain-specific pretraining. The legal field presents unique challenges, as legal documents are generally long and intricate, making it hard for models to process the full text efficiently. Previous studies have examined in-context approaches to address the knowledge gap, boosting model performance in new domains without full domain alignment. In our paper, we analyze model behavior on legal tasks by conducting experiments in three areas: (i) reorganizing documents based on rhetorical roles to assess how structured information affects long context processing and model decisions, (ii) defining rhetorical roles to familiarize the model with legal terminology, and (iii) emulating the step-by-step reasoning of courts regarding rhetorical roles to enhance model reasoning. These experiments are conducted in a zero-shot setting across three Indian legal judgment prediction datasets. Our results reveal that organizing data or explaining key legal terms significantly boosts model performance, with a minimum increase of ~1.5% and a maximum improvement of 4.36% in F1 score compared to the baseline.

Saad Mankarious, Ayah Zirikly, Daniel Wiechmann, Elma Kerz, Edward Kempa, Yu Qiao

Main category: cs.CL

TL;DR: MindSET is a new benchmark dataset for mental health analysis from Reddit with 13M annotated posts across 7 conditions, featuring rigorous preprocessing and showing 18-point F1 improvement in Autism detection.

Details

Motivation: Existing mental health benchmarks are outdated due to limited data, inadequate cleaning, and inability to handle diverse social media content like multilingual and harmful material.

Method: Curated Reddit data using self-reported diagnoses, applied rigorous preprocessing (language filtering, NSFW removal, deduplication), performed linguistic analysis with LIWC, and conducted binary classification experiments using fine-tuned language models and BoW features.

Result: MindSET contains over 13M annotated posts (more than twice previous benchmarks), models trained on it consistently outperformed previous benchmarks with up to 18-point F1 improvement for Autism detection.

Conclusion: MindSET provides a robust foundation for mental health research using social media data, supporting both early risk detection and deeper analysis of emerging psychological trends.

Abstract: Social media data has become a vital resource for studying mental health, offering real-time insights into thoughts, emotions, and behaviors that traditional methods often miss. Progress in this area has been facilitated by benchmark datasets for mental health analysis; however, most existing benchmarks have become outdated due to limited data availability, inadequate cleaning, and the inherently diverse nature of social media content (e.g., multilingual and harmful material). We present a new benchmark dataset, \textbf{MindSET}, curated from Reddit using self-reported diagnoses to address these limitations. The annotated dataset contains over \textbf{13M} annotated posts across seven mental health conditions, more than twice the size of previous benchmarks. To ensure data quality, we applied rigorous preprocessing steps, including language filtering, and removal of Not Safe for Work (NSFW) and duplicate content. We further performed a linguistic analysis using LIWC to examine psychological term frequencies across the eight groups represented in the dataset. To demonstrate the dataset utility, we conducted binary classification experiments for diagnosis detection using both fine-tuned language models and Bag-of-Words (BoW) features. Models trained on MindSET consistently outperformed those trained on previous benchmarks, achieving up to an \textbf{18-point} improvement in F1 for Autism detection. Overall, MindSET provides a robust foundation for researchers exploring the intersection of social media and mental health, supporting both early risk detection and deeper analysis of emerging psychological trends.

[7] Semantics Meet Signals: Dual Codebook Representationl Learning for Generative Recommendation

Zheng Hui, Xiaokai Wei, Reza Shirkavand, Chen Wang, Weizhi Zhang, Alejandro Peláez, Michelle Gong

Main category: cs.CL

TL;DR: FlexCode introduces a popularity-aware framework that uses separate collaborative filtering and semantic codebooks with adaptive token allocation to address the imbalance between popular and long-tail items in generative recommendation systems.

Details

Motivation: Existing generative recommendation approaches use a single uniform codebook, which overlooks the inherent imbalance between popular items (rich in collaborative signals) and long-tail items (requiring semantic understanding), limiting representational efficiency and generalization.

Method: FlexCode adaptively allocates a fixed token budget between a collaborative filtering codebook and a semantic codebook using a lightweight Mixture of Experts (MoE) that dynamically balances CF-specific precision and semantic generalization, with alignment and smoothness objectives to maintain coherence.

Result: Experiments on public and industrial-scale datasets show that FlexCode consistently outperforms strong baselines, achieving stronger accuracy and tail robustness.

Conclusion: FlexCode provides a new mechanism for token representation in generative recommenders that effectively balances memorization and generalization, offering improved performance for both popular and long-tail items.

Abstract: Generative recommendation has recently emerged as a powerful paradigm that unifies retrieval and generation, representing items as discrete semantic tokens and enabling flexible sequence modeling with autoregressive models. Despite its success, existing approaches rely on a single, uniform codebook to encode all items, overlooking the inherent imbalance between popular items rich in collaborative signals and long-tail items that depend on semantic understanding. We argue that this uniform treatment limits representational efficiency and hinders generalization. To address this, we introduce FlexCode, a popularity-aware framework that adaptively allocates a fixed token budget between a collaborative filtering (CF) codebook and a semantic codebook. A lightweight MoE dynamically balances CF-specific precision and semantic generalization, while an alignment and smoothness objective maintains coherence across the popularity spectrum. We perform experiments on both public and industrial-scale datasets, showing that FlexCode consistently outperform strong baselines. FlexCode provides a new mechanism for token representation in generative recommenders, achieving stronger accuracy and tail robustness, and offering a new perspective on balancing memorization and generalization in token-based recommendation models.

[8] Prompt Engineering Techniques for Context-dependent Text-to-SQL in Arabic

Saleh Almohaimeed, May Alsofyani, Saad Almohaimeed, Mansour Al Ghanim, Liqiang Wang

Main category: cs.CL

TL;DR: First Arabic cross-domain, context-dependent text-to-SQL dataset (Ar-SParC) with 10,225 questions, plus a GAT corrector method that improves performance across all experiments.

Details

Motivation: Address the lack of Arabic text-to-SQL datasets and research, as most existing work focuses on English and Chinese languages.

Method: Created Ar-SParC dataset with 3,450 question sequences, tested GPT-3.5-turbo and GPT-4.5-turbo with 10 prompt engineering techniques, and developed GAT corrector approach.

Result: GAT corrector improved performance by average 1.9% EX and 1.9% IX (zero-shot) and 1.72% EX and 0.92% IX (in-context learning) across 40 experiments.

Conclusion: Successfully established first Arabic text-to-SQL dataset and demonstrated effectiveness of GAT corrector for Arabic language processing.

Abstract: In recent years, the task of cross-domain, context-dependent text-to-SQL has received significant attention. Enables users with no prior knowledge of SQL to have a conversation with databases using natural language. However, most of the available datasets and research have been conducted in English, along with some work in Chinese. To this date, no effort has been made to address this task in the Arabic language. In this paper, we introduce Ar-SParC, the first Arabic cross-domain, context-dependent text-to-SQL dataset. The dataset consists of 3,450 sequences of interrelated questions, each sequence containing an average of approximately three questions, which results in a total of 10225 questions along with their corresponding SQL queries. We conducted 40 experiments on the Ar-SParC dataset using two large language models, GPT-3.5-turbo and GPT-4.5-turbo, applying 10 different prompt engineering techniques, including four question representation methods and six in-context learning techniques. Furthermore, we developed a novel approach named GAT corrector, which enhanced the performance across all 40 experiments, yielding an average improvement of 1.9% in execution accuracy (EX) and 1.9% in interaction accuracy (IX) under zero-shot settings, and an average increase of 1.72% EX and 0.92% IX under in-context learning settings. Finally, we conducted an ablation study with two more experiments to explain why the GAT corrector outperformed the previous GAT verifier technique, particularly for the Arabic language.

[9] Cognitive bias in LLM reasoning compromises interpretation of clinical oncology notes

Matthew W. Kenaston, Umair Ayub, Mihir Parmar, Muhammad Umair Anjum, Syed Arsalan Ahmed Naqvi, Priya Kumar, Samarth Rawal, Aadel A. Chaudhuri, Yousef Zakharia, Elizabeth I. Heath, Tanios S. Bekaii-Saab, Cui Tao, Eliezer M. Van Allen, Ben Zhou, YooJung Choi, Chitta Baral, Irbaz Bin Riaz

Main category: cs.CL

TL;DR: LLMs in oncology may reach correct conclusions through faulty reasoning, posing safety risks not captured by accuracy metrics. A hierarchical taxonomy of reasoning errors was developed and validated, showing 23% error rate with confirmation and anchoring biases most common, leading to guideline-discordant recommendations.

Details

Motivation: To address safety concerns in oncology decision support where LLMs may provide correct answers through flawed reasoning, which standard accuracy-based evaluations fail to detect.

Method: Developed a three-tier taxonomy of reasoning errors from GPT-4 chain-of-thought responses on real oncology notes, validated on 822 responses from prostate cancer consult notes across disease stages, simulating extraction, analysis, and recommendation tasks.

Result: Reasoning errors occurred in 23% of interpretations, dominated by confirmation and anchoring biases. These failures led to guideline-discordant and potentially harmful recommendations, especially in advanced disease. Automated evaluators could detect error presence but not reliably classify subtypes.

Conclusion: LLMs can provide fluent but clinically unsafe recommendations when reasoning is flawed. The taxonomy provides a framework for evaluating and improving reasoning fidelity before clinical deployment.

Abstract: Despite high performance on clinical benchmarks, large language models may reach correct conclusions through faulty reasoning, a failure mode with safety implications for oncology decision support that is not captured by accuracy-based evaluation. In this two-cohort retrospective study, we developed a hierarchical taxonomy of reasoning errors from GPT-4 chain-of-thought responses to real oncology notes and tested its clinical relevance. Using breast and pancreatic cancer notes from the CORAL dataset, we annotated 600 reasoning traces to define a three-tier taxonomy mapping computational failures to cognitive bias frameworks. We validated the taxonomy on 822 responses from prostate cancer consult notes spanning localized through metastatic disease, simulating extraction, analysis, and clinical recommendation tasks. Reasoning errors occurred in 23 percent of interpretations and dominated overall errors, with confirmation bias and anchoring bias most common. Reasoning failures were associated with guideline-discordant and potentially harmful recommendations, particularly in advanced disease management. Automated evaluators using state-of-the-art language models detected error presence but could not reliably classify subtypes. These findings show that large language models may provide fluent but clinically unsafe recommendations when reasoning is flawed. The taxonomy provides a generalizable framework for evaluating and improving reasoning fidelity before clinical deployment.

[10] Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings

Leanne Nortje, Dan Oneata, Gabriel Pirlogeanu, Herman Kamper

Main category: cs.CL

TL;DR: This paper introduces a few-shot learning scheme for visually prompted keyword localisation (VPKL) that automatically mines positive/negative pairs without transcriptions, and extends VPKL to Yoruba as a real low-resource language.

Details

Motivation: To enable VPKL for low-resource languages without transcriptions, addressing limitations of previous English-only work that relied on transcriptions for contrastive loss training.

Method: Proposes a few-shot learning scheme that automatically mines positive and negative pairs for contrastive loss without using transcriptions.

Result: On English, performance drop is small compared to using ground truth pairs. On Yoruba, reasonable scores but bigger performance drop due to less accurate mining.

Conclusion: The automatic mining approach works reasonably well, especially for English, but faces challenges in low-resource languages like Yoruba where mining accuracy decreases.

Abstract: Given an image query, visually prompted keyword localisation (VPKL) aims to find occurrences of the depicted word in a speech collection. This can be useful when transcriptions are not available for a low-resource language (e.g. if it is unwritten). Previous work showed that VPKL can be performed with a visually grounded speech model trained on paired images and unlabelled speech. But all experiments were done on English. Moreover, transcriptions were used to get positive and negative pairs for the contrastive loss. This paper introduces a few-shot learning scheme to mine pairs automatically without transcriptions. On English, this results in only a small drop in performance. We also - for the first time - consider VPKL on a real low-resource language, Yoruba. While scores are reasonable, here we see a bigger drop in performance compared to using ground truth pairs because the mining is less accurate in Yoruba.

[11] Dynamic Template Selection for Output Token Generation Optimization: MLP-Based and Transformer Approaches

Bharadwaj Yadavalli

Main category: cs.CL

TL;DR: DTS adaptively selects response templates based on query complexity to reduce token costs without quality loss, achieving 90.5% routing accuracy and 32-34% token reduction across major LLM providers.

Details

Motivation: Uniform prompting across diverse queries causes token inefficiency, which is costly due to output tokens being 4-8x more expensive than input tokens across major providers.

Method: Dynamic Template Selection (DTS) with two routing approaches: simple MLP using pre-computed embeddings and fine-tuned RoBERTa transformer, evaluated on 1,000 MMLU questions.

Result: MLP router achieved 90.5% routing accuracy (slightly better than RoBERTa’s 89.5%) with 125M fewer parameters, and generalized across 3 LLM providers with 32.6-33.9% token reduction.

Conclusion: DTS provides significant cost savings through adaptive template selection while maintaining response quality, with simple MLP routing outperforming more complex approaches and generalizing well across providers.

Abstract: Contemporary large language model deployments typically employ uniform prompting strategies across diverse query types, applying verbose response patterns to both complex analytical tasks and straightforward factual questions. This one-size-fits-all methodology leads to substantial token inefficiency, a concern amplified by the significant cost differential between input and output tokens–the latter commanding 4-8x higher prices across major providers. We present Dynamic Template Selection (DTS), which adaptively matches response templates to query complexity, achieving significant cost reductions without compromising response quality. We compared two routing approaches: a simple MLP that uses pre-computed embeddings and a more complex fine-tuned RoBERTa transformer. Through comprehensive evaluation on 1,000 MMLU questions, we find that the MLP router achieves 90.5% routing accuracy on held-out test data, marginally exceeding RoBERTa’s performance (89.5%) despite utilizing 125M fewer parameters. Notably, our empirical analysis reveals provider-agnostic behavior in template selection–routing decisions generalize effectively across 3 major LLM providers (OpenAI GPT-4, Google Gemini, and Anthropic Claude), as validated through 9,000 production API calls. While routing accuracy remains consistent at 90.5% across providers, observed token reductions vary from 32.6% to 33.9%, reflecting provider-specific generation characteristics. This work contributes several key elements: formal problem formulation with theoretical grounding in machine learning, four algorithms with corresponding complexity analyses, and extensive empirical validation across production systems.

[12] LLMs-Powered Accurate Extraction, Querying and Intelligent Management of Literature derived 2D Materials Data

Lijun Shang, Yadong Yu, Wenqiang Kang, Jian Zhou, Dongyue Gao, Pan Xiang, Zhe Liu, Mengyan Dai, Zhonglu Guo, Zhimei Sun

Main category: cs.CL

TL;DR: 2D materials have valuable applications in energy storage/conversion, but their information is scattered across research papers, making systematic analysis difficult.

Details

Motivation: To address the challenge of dispersed information about 2D materials' properties and preparation methods across numerous research papers, which hinders systematic analysis and application development.

Method: The paper likely employs data mining, text analysis, or systematic review approaches to extract and organize information about 2D materials from published research literature.

Result: The research probably results in a comprehensive database, classification system, or analytical framework that systematically organizes information about 2D materials’ properties and synthesis methods.

Conclusion: Systematic organization of scattered 2D materials information from research papers enables better understanding, comparison, and application development for energy storage and conversion technologies.

Abstract: Two-dimensional (2D) materials have showed widespread applications in energy storage and conversion owning to their unique physicochemical, and electronic properties. Most of the valuable information for the materials, such as their properties and preparation methods, is included in the published research papers. However, due to the dispersion of synthe

[13] Memories Retrieved from Many Paths: A Multi-Prefix Framework for Robust Detection of Training Data Leakage in Large Language Models

Trung Cuong Dang, David Mohaisen

Main category: cs.CL

TL;DR: Proposes multi-prefix memorization framework to better detect verbatim memorization in LLMs by measuring robustness through diverse retrieval paths rather than single-path extraction.

Details

Motivation: Existing memorization definitions have shortcomings in comprehensively capturing memorization, especially in aligned models, creating privacy and copyright risks.

Method: Introduces multi-prefix memorization framework where sequences are considered memorized if external adversarial search can find sufficient distinct prefixes that elicit them, focusing on robustness of memory.

Result: Experiments on open-source and aligned chat models show the multi-prefix definition reliably distinguishes memorized from non-memorized data.

Conclusion: The framework provides a robust and practical tool for auditing data leakage in LLMs by shifting focus to quantifying memory robustness through diverse retrieval paths.

Abstract: Large language models, trained on massive corpora, are prone to verbatim memorization of training data, creating significant privacy and copyright risks. While previous works have proposed various definitions for memorization, many exhibit shortcomings in comprehensively capturing this phenomenon, especially in aligned models. To address this, we introduce a novel framework: multi-prefix memorization. Our core insight is that memorized sequences are deeply encoded and thus retrievable via a significantly larger number of distinct prefixes than non-memorized content. We formalize this by defining a sequence as memorized if an external adversarial search can identify a target count of distinct prefixes that elicit it. This framework shifts the focus from single-path extraction to quantifying the robustness of a memory, measured by the diversity of its retrieval paths. Through experiments on open-source and aligned chat models, we demonstrate that our multi-prefix definition reliably distinguishes memorized from non-memorized data, providing a robust and practical tool for auditing data leakage in LLMs.

[14] ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu, Xin Dong, Yonggan Fu, Peter Belcak, Hanrong Ye, Hongxu Yin, Yi Dong, Evelina Bakhturina, Tao Yu, Yejin Choi, Jan Kautz, Pavlo Molchanov

Main category: cs.CL

TL;DR: ToolOrchestra trains small orchestrator models to coordinate multiple tools, achieving higher accuracy at lower cost than previous methods while aligning with user preferences.

Details

Motivation: Large language models struggle with deep complex problems like Humanity's Last Exam, being both conceptually challenging and computationally expensive. There's a need for more efficient and effective approaches to solving difficult agentic tasks.

Method: ToolOrchestra uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards to train small orchestrators that coordinate intelligent tools. Produces an 8B model called Orchestrator.

Result: Orchestrator achieves 37.1% on HLE (outperforming GPT-5’s 35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, it surpasses GPT-5 by wide margins using only ~30% of the cost. Shows best trade-off between performance and cost across multiple metrics.

Conclusion: Composing diverse tools with lightweight orchestration models is more efficient and effective than existing methods, enabling practical and scalable tool-augmented reasoning systems.

Abstract: Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity’s Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.

[15] SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

Jiaojiao Han, Wujiang Xu, Mingyu Jin, Mengnan Du

Main category: cs.CL

TL;DR: SAGE is an agent-based framework that improves feature interpretation in sparse autoencoders for LLMs by using an active, iterative process of explanation formulation, testing, and refinement.

Details

Motivation: LLMs' internal mechanisms are opaque, making safe deployment challenging. While sparse autoencoders help decompose representations into interpretable features, explaining these features remains difficult.

Method: SAGE recasts feature interpretation as an active process: systematically formulating multiple explanations per feature, designing targeted experiments to test them, and iteratively refining explanations based on activation feedback.

Result: Experiments on SAE features from diverse language models show SAGE produces explanations with significantly higher generative and predictive accuracy than state-of-the-art baselines.

Conclusion: SAGE provides a more effective framework for interpreting features in sparse autoencoders, advancing the interpretability of LLM representations.

Abstract: Large language models (LLMs) have achieved remarkable progress, yet their internal mechanisms remain largely opaque, posing a significant challenge to their safe and reliable deployment. Sparse autoencoders (SAEs) have emerged as a promising tool for decomposing LLM representations into more interpretable features, but explaining the features captured by SAEs remains a challenging task. In this work, we propose SAGE (SAE AGentic Explainer), an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanation-driven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanationdriven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.

[16] ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features

Ye Bhone Lin, Thura Aung, Ye Kyaw Thu, Thazin Myint Oo

Main category: cs.CL

TL;DR: First study on ASR error correction for Burmese using Transformer models with IPA and alignment features, achieving significant WER reduction from 51.56 to 39.82.

Details

Motivation: Address ASR error correction specifically for low-resource Burmese language, which lacks prior research in this area.

Method: Sequence-to-sequence Transformer models with feature integration strategies including IPA and alignment information, evaluated on five ASR backbones.

Result: AEC model reduced average WER from 51.56 to 39.82 (before augmentation) and improved chrF++ scores from 0.5864 to 0.627, showing consistent gains over baseline ASR outputs.

Conclusion: AEC is robust and feature design is crucial for improving ASR outputs in low-resource settings, with IPA and alignment features proving effective.

Abstract: This paper investigates sequence-to-sequence Transformer models for automatic speech recognition (ASR) error correction in low-resource Burmese, focusing on different feature integration strategies including IPA and alignment information. To our knowledge, this is the first study addressing ASR error correction specifically for Burmese. We evaluate five ASR backbones and show that our ASR Error Correction (AEC) approaches consistently improve word- and character-level accuracy over baseline outputs. The proposed AEC model, combining IPA and alignment features, reduced the average WER of ASR models from 51.56 to 39.82 before augmentation (and 51.56 to 43.59 after augmentation) and improving chrF++ scores from 0.5864 to 0.627, demonstrating consistent gains over the baseline ASR outputs without AEC. Our results highlight the robustness of AEC and the importance of feature design for improving ASR outputs in low-resource settings.

[17] Structured Prompting Enables More Robust, Holistic Evaluation of Language Models

Asad Aali, Muhammad Ahmed Mohsin, Vasiliki Bikia, Arnav Singhvi, Richard Gaus, Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Yifan Mai, Jordan Cahoon, Michael Pfeffer, Roxana Daneshjou, Sanmi Koyejo, Emily Alsentzer, Percy Liang, Christopher Potts, Nigam H. Shah, Akshay S. Chaudhari

Main category: cs.CL

TL;DR: DSPy+HELM framework improves LM benchmarking by using structured prompting methods to estimate performance ceilings, revealing that traditional HELM underestimates LM capabilities and misrepresents rankings.

Details

Motivation: Existing benchmarking frameworks like HELM use fixed prompts that fail to generalize across LMs, leading to unrepresentative performance estimates and potentially underestimating LM capabilities.

Method: Created reproducible DSPy+HELM framework with four structured prompting methods that elicit reasoning, evaluated across four frontier LMs and seven benchmarks against HELM baselines.

Result: Without structured prompting: HELM underestimates LM performance by 4% average, increases performance variance, misrepresents leaderboard rankings (flipped on 3/7 benchmarks), and reasoning reduces LM sensitivity to prompt design.

Conclusion: Scalable performance ceiling estimation through structured prompting enables more accurate and decision-useful LM benchmarking, with open-sourced integration and optimization pipeline.

Abstract: As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks that accurately estimate performance are essential for guiding deployment decisions. While frameworks such as Holistic Evaluation of Language Models (HELM) enable broad evaluation across tasks, they often rely on fixed prompts that fail to generalize across LMs, yielding unrepresentative performance estimates. Unless we estimate each LM’s ceiling (maximum achievable via changes to the prompt), we risk underestimating performance. Declarative prompting frameworks, such as DSPy, offer a scalable alternative to manual prompt engineering by crafting structured prompts that can be optimized per task. However, such frameworks have not been systematically evaluated across established benchmarks. We present a reproducible DSPy+HELM framework that introduces structured prompting methods which elicit reasoning, enabling more accurate LM benchmarking. Using four prompting methods, we evaluate four frontier LMs across seven benchmarks (general/medical domain) against existing HELM baseline scores. We find that without structured prompting: (i) HELM underestimates LM performance (by 4% average), (ii) performance estimates vary more across benchmarks (+2% standard deviation), (iii) performance gaps are misrepresented (leaderboard rankings flip on 3/7 benchmarks), and (iv) introducing reasoning (chain-of-thought) reduces LM sensitivity to prompt design (smaller Δ across prompts). To our knowledge, this is the first large-scale benchmarking study to empirically characterize LM behavior across benchmarks and prompting methods, showing that scalable performance ceiling estimation enables more decision-useful benchmarks. We open-source (i) DSPy+HELM Integration (https://github.com/stanford-crfm/helm/pull/3893) and (ii) Prompt Optimization Pipeline (https://github.com/StanfordMIMI/dspy-helm).

[18] Length-MAX Tokenizer for Language Models

Dong Dong, Weijie Su

Main category: cs.CL

TL;DR: Length-MAX tokenizer reduces tokens per character by 13-18% vs BPE, speeding up training and inference while improving downstream task performance.

Details

Motivation: Current tokenizers like BPE don't optimize for token length, leading to inefficient text representation during training and inference.

Method: Casts length-weighted objective maximization as graph partitioning problem with greedy approximation algorithm to obtain vocabulary.

Result: 14-18% fewer tokens than BPE, 18.5% fewer training steps, 13.7% lower inference latency, 16% throughput gain, improved LAMBADA and HellaSwag performance.

Conclusion: Optimizing for average token length rather than frequency alone enables more efficient language modeling without sacrificing downstream performance.

Abstract: We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we refer to as the Length-MAX tokenizer, obtains its vocabulary by casting a length-weighted objective maximization as a graph partitioning problem and developing a greedy approximation algorithm. On FineWeb and diverse domains, it yields 14–18% fewer tokens than Byte Pair Encoding (BPE) across vocabulary sizes from 10K to 50K, and the reduction is 13.0% when the size is 64K. Training GPT-2 models at 124M, 355M, and 1.3B parameters from scratch with five runs each shows 18.5%, 17.2%, and 18.5% fewer steps, respectively, to reach a fixed validation loss, and 13.7%, 12.7%, and 13.7% lower inference latency, together with a 16% throughput gain at 124M, while consistently improving on downstream tasks including reducing LAMBADA perplexity by 11.7% and enhancing HellaSwag accuracy by 4.3%. Moreover, the Length-MAX tokenizer achieves 99.62% vocabulary coverage and the out-of-vocabulary rate remains low at 0.12% on test sets. These results demonstrate that optimizing for average token length, rather than frequency alone, offers an effective approach to more efficient language modeling without sacrificing – and often improving – downstream performance. The tokenizer is compatible with production systems and reduces embedding and KV-cache memory by 18% at inference.

[19] Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, Derek Zhiyuan Cheng

Main category: cs.CL

TL;DR: Evo-Memory is a benchmark and framework for evaluating self-evolving memory in LLM agents, addressing the gap in dynamic memory evolution across continuous task streams.

Details

Motivation: Current LLM memory evaluations focus on static conversational settings, overlooking the need for dynamic memory accumulation and reuse in real-world environments like interactive assistants where LLMs fail to learn from accumulated interactions.

Method: Evo-Memory structures datasets into sequential task streams, implements over 10 memory modules, provides ExpRAG baseline for experience retrieval, and proposes ReMem pipeline integrating reasoning, actions, and memory updates.

Result: The framework evaluates memory modules across 10 diverse multi-turn goal-oriented and single-turn reasoning datasets, enabling assessment of memory evolution capabilities.

Conclusion: Evo-Memory bridges the gap in evaluating self-evolving memory for LLM agents, providing tools for continuous improvement through experience reuse and memory refinement in dynamic environments.

Abstract: Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.

[20] Winning with Less for Low Resource Languages: Advantage of Cross-Lingual English_Persian Argument Mining Model over LLM Augmentation

Ali Jahan, Masood Ghayoomi, Annette Hautli-Janisz

Main category: cs.CL

TL;DR: Cross-lingual argument mining approach for low-resource languages using English-Persian parallel corpus, comparing zero-shot transfer, LLM augmentation, and cross-lingual training methods.

Details

Motivation: To address data scarcity in argument mining for low-resource languages by leveraging cross-lingual approaches and parallel corpora.

Method: Three training scenarios: (1) zero-shot transfer from English to Persian, (2) English training enhanced with LLM-generated synthetic examples, (3) cross-lingual model combining original English and manually translated Persian data.

Result: Zero-shot: 50.2% F1 (English), 50.7% (Persian); LLM-augmented: 59.2% (English), 69.3% (Persian); Cross-lingual: 74.8% F1 (Persian only).

Conclusion: Lightweight cross-lingual approach outperforms resource-intensive LLM augmentation, providing practical solution for argument mining in low-resource languages.

Abstract: Argument mining is a subfield of natural language processing to identify and extract the argument components, like premises and conclusions, within a text and to recognize the relations between them. It reveals the logical structure of texts to be used in tasks like knowledge extraction. This paper aims at utilizing a cross-lingual approach to argument mining for low-resource languages, by constructing three training scenarios. We examine the models on English, as a high-resource language, and Persian, as a low-resource language. To this end, we evaluate the models based on the English Microtext corpus \citep{PeldszusStede2015}, and its parallel Persian translation. The learning scenarios are as follow: (i) zero-shot transfer, where the model is trained solely with the English data, (ii) English-only training enhanced by synthetic examples generated by Large Language Models (LLMs), and (iii) a cross-lingual model that combines the original English data with manually translated Persian sentences. The zero-shot transfer model attains F1 scores of 50.2% on the English test set and 50.7% on the Persian test set. LLM-based augmentation model improves the performance up to 59.2% on English and 69.3% on Persian. The cross-lingual model, trained on both languages but evaluated solely on the Persian test set, surpasses the LLM-based variant, by achieving a F1 of 74.8%. Results indicate that a lightweight cross-lingual blend can outperform considerably the more resource-intensive augmentation pipelines, and it offers a practical pathway for the argument mining task to overcome data resource shortage on low-resource languages.

[21] LightMem: Lightweight and Efficient Memory-Augmented Generation

Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, Ningyu Zhang

Main category: cs.CL

TL;DR: LightMem is an efficient memory system for LLMs that organizes memory into three stages inspired by human memory, achieving significant improvements in QA accuracy while dramatically reducing computational overhead.

Details

Motivation: Existing memory systems for LLMs introduce substantial time and computational overhead, limiting their practical deployment in dynamic environments.

Method: Three-stage memory system: sensory memory for filtering and topic grouping, short-term memory for consolidation and summarization, and long-term memory with offline sleep-time updates.

Result: On LongMemEval and LoCoMo benchmarks, LightMem improved QA accuracy by up to 7.7%/29.3%, reduced token usage by up to 38x/20.9x, and API calls by up to 30x/55.5x.

Conclusion: LightMem effectively balances performance and efficiency in memory systems for LLMs, enabling practical deployment with minimal online computational costs.

Abstract: Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. On LongMemEval and LoCoMo, using GPT and Qwen backbones, LightMem consistently surpasses strong baselines, improving QA accuracy by up to 7.7% / 29.3%, reducing total token usage by up to 38x / 20.9x and API calls by up to 30x / 55.5x, while purely online test-time costs are even lower, achieving up to 106x / 117x token reduction and 159x / 310x fewer API calls. The code is available at https://github.com/zjunlp/LightMem.

[22] Emergence and Localisation of Semantic Role Circuits in LLMs

Nura Aljaafari, Danilo S. Carvalho, André Freitas

Main category: cs.CL

TL;DR: LLMs form compact, causally isolated circuits for semantic roles through gradual refinement, with partial transfer across model scales and architectures.

Details

Motivation: To understand how large language models internally ground abstract semantic structure despite displaying semantic competence.

Method: Integrated approach using role-cross minimal pairs, temporal emergence analysis, and cross-model comparison to study semantic role implementation in LLMs.

Result: Found highly concentrated circuits (89-94% attribution within 28 nodes), gradual structural refinement (no phase transitions), moderate cross-scale conservation (24-59% component overlap), and high spectral similarity.

Conclusion: LLMs develop compact, causally isolated mechanisms for abstract semantic structure that exhibit partial transfer across different scales and architectures.

Abstract: Despite displaying semantic competence, large language models’ internal mechanisms that ground abstract semantic structure remain insufficiently characterised. We propose a method integrating role-cross minimal pairs, temporal emergence analysis, and cross-model comparison to study how LLMs implement semantic roles. Our analysis uncovers: (i) highly concentrated circuits (89-94% attribution within 28 nodes); (ii) gradual structural refinement rather than phase transitions, with larger models sometimes bypassing localised circuits; and (iii) moderate cross-scale conservation (24-59% component overlap) alongside high spectral similarity. These findings suggest that LLMs form compact, causally isolated mechanisms for abstract semantic structure, and these mechanisms exhibit partial transfer across scales and architectures.

[23] Chatty-KG: A Multi-Agent AI System for On-Demand Conversational Question Answering over Knowledge Graphs

Reham Omar, Abdelghny Orogat, Ibrahim Abdelaziz, Omij Mangukiya, Panos Kalnis, Essam Mansour

Main category: cs.CL

TL;DR: Chatty-KG is a modular multi-agent system for conversational QA over knowledge graphs that combines RAG-style retrieval with structured SPARQL query execution, outperforming state-of-the-art baselines in both single-turn and multi-turn settings.

Details

Motivation: To address limitations of existing KGQA systems that struggle with multi-turn context, coreference resolution, and high latency, while LLMs lack direct access to private/dynamic KGs and RAG systems serialize graph structure.

Method: Uses task-specialized LLM agents collaborating for contextual interpretation, dialogue tracking, entity/relation linking, and query planning to translate natural questions into executable SPARQL queries.

Result: Significantly outperforms state-of-the-art baselines on large diverse KGs, achieving higher F1 and P@1 scores in both single-turn and multi-turn settings with low latency.

Conclusion: Chatty-KG unifies conversational flexibility with structured KG grounding, offering scalable and extensible approach for reliable multi-turn KGQA without requiring fine-tuning or pre-processing.

Abstract: Conversational Question Answering over Knowledge Graphs (KGs) combines the factual grounding of KG-based QA with the interactive nature of dialogue systems. KGs are widely used in enterprise and domain applications to provide structured, evolving, and reliable knowledge. Large language models (LLMs) enable natural and context-aware conversations, but lack direct access to private and dynamic KGs. Retrieval-augmented generation (RAG) systems can retrieve graph content but often serialize structure, struggle with multi-turn context, and require heavy indexing. Traditional KGQA systems preserve structure but typically support only single-turn QA, incur high latency, and struggle with coreference and context tracking. To address these limitations, we propose Chatty-KG, a modular multi-agent system for conversational QA over KGs. Chatty-KG combines RAG-style retrieval with structured execution by generating SPARQL queries through task-specialized LLM agents. These agents collaborate for contextual interpretation, dialogue tracking, entity and relation linking, and efficient query planning, enabling accurate and low-latency translation of natural questions into executable queries. Experiments on large and diverse KGs show that Chatty-KG significantly outperforms state-of-the-art baselines in both single-turn and multi-turn settings, achieving higher F1 and P@1 scores. Its modular design preserves dialogue coherence and supports evolving KGs without fine-tuning or pre-processing. Evaluations with commercial (e.g., GPT-4o, Gemini-2.0) and open-weight (e.g., Phi-4, Gemma 3) LLMs confirm broad compatibility and stable performance. Overall, Chatty-KG unifies conversational flexibility with structured KG grounding, offering a scalable and extensible approach for reliable multi-turn KGQA.

[24] TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Knowledge in Open Large Language Models

Ioana Buhnila, Aman Sinha, Mathieu Constant

Main category: cs.CL

TL;DR: LLMs perform best on definition-type queries but struggle with exemplification and other answer types, with performance varying based on concept frequency (head vs tail knowledge).

Details

Motivation: To investigate why LLMs excel at definition-type answers but perform poorly on other answer types like examples and paraphrases, and to understand how pre-training data frequency affects their performance.

Method: Used TrackList analysis pipeline and RefoMed-EN dataset (6170 annotated medical terms) to evaluate LLMs on different answer types, assessing performance using syntactic/semantic similarity metrics, statistical correlations, and embeddings.

Result: LLMs showed highest performance on definition-type questions and lowest on exemplification. For definitions, models paraphrase more on popular/frequent knowledge and less on technical/tail knowledge, especially in expert texts.

Conclusion: LLMs have significant performance gaps across different answer types, with concept frequency in pre-training data strongly influencing their ability to provide diverse linguistic responses beyond definitions.

Abstract: Large Language Models (LLMs) have proven efficient in giving definition-type answers to user input queries. While for humans giving various types of answers, such as examples and paraphrases, is an easy task, LLMs struggle to provide correct answers for other than definition-type queries. In this study, we evaluated this drop in performance using TrackList, a fine-grained linguistic and statistical analysis pipeline to investigate the impact of the pre-training data on LLMs answers to diverse linguistic queries. We also introduce RefoMed-EN, an English dataset consisting of 6170 human-annotated medical terms alongside their corresponding definitions, denominations, exemplifications, explanations, or paraphrases. We studied whether the high frequency of a concept (head) or low frequency (tail) impacts the language model’s performance. We evaluated the quality of the LLM’s output using syntactic and semantic similarity metrics, statistical correlations and embeddings. Results showed that the LLM’s task performance for definition type questions is the highest, while for the exemplification type it is the lowest. Additionally, we showed that for definition-type questions, large language models are prone to paraphrase more on popular and frequent knowledge and less on tail and technical knowledge, especially in the expert texts.

[25] Semantic Anchors in In-Context Learning: Why Small LLMs Cannot Flip Their Labels

Anantha Padmanaban Krishna Kumar

Main category: cs.CL

TL;DR: ICL cannot override pre-trained label semantics; it primarily refines existing semantic directions rather than remapping label meanings.

Details

Motivation: To determine whether in-context learning can override pre-trained label semantics or merely refines existing semantic backbones.

Method: Treat LLMs as prompt-induced classifiers and contrast behavior under natural demonstrations (correct labels) vs inverted demonstrations (flipped label meanings). Decompose ICL behavior into three alignment metrics and introduce semantic override rate.

Result: Models cannot learn coherent anti-semantic classifiers with inverted demonstrations. Semantic override rates remain exactly zero in few-shot settings. ICL improves accuracy while maintaining strong prior alignment with natural demonstrations.

Conclusion: ICL primarily adjusts how inputs project onto stable semantic directions learned during pre-training, suggesting fundamental limits of few-shot prompting for overriding label semantics.

Abstract: Can in-context learning (ICL) override pre-trained label semantics, or does it merely refine an existing semantic backbone? We address this question by treating LLMs as prompt-induced classifiers and contrasting their behavior under \emph{natural} demonstrations (with correct labels) and \emph{inverted} demonstrations (systematically flipping label meanings). We decompose ICL behavior into three alignment metrics (truth, prior, and prompt alignment) and introduce a semantic override rate, defined as correctness under flipped semantics. Across eight classification tasks and eight open-source LLMs (1–12B parameters), we find consistent evidence for a semantic anchor view. With natural demonstrations, ICL improves accuracy while maintaining strong prior alignment; most correct predictions coincide with zero-shot behavior, even when the prior is weak. With inverted demonstrations, models cannot learn coherent anti-semantic classifiers: prompt alignment increases only by sacrificing accuracy, and semantic override rates remain exactly zero in our few-shot 1–12B setting. Rather than flexibly remapping label meanings, ICL primarily adjusts how inputs project onto stable semantic directions learned during pre-training, clarifying fundamental limits of few-shot prompting and suggesting that overriding label semantics at these scales requires interventions beyond ICL. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/semantic-anchors-icl.

[26] Context-Aware Pragmatic Metacognitive Prompting for Sarcasm Detection

Michael Iskandardinata, William Christian, Derwin Suhartono

Main category: cs.CL

TL;DR: The paper introduces a retrieval-aware approach to improve sarcasm detection in LLMs by incorporating contextual information through web-based retrieval and self-knowledge awareness strategies.

Details

Motivation: Sarcasm detection remains challenging due to linguistic diversity, cultural variations, and unreliable detection of words requiring extra grounding, even with advanced PLMs and LLMs.

Method: Built on Pragmatic Metacognitive Prompting (PMP), the approach adds non-parametric knowledge via web retrieval and elicits the model’s internal knowledge through self-knowledge awareness strategies.

Result: Non-parametric retrieval improved macro-F1 by 9.87% on Twitter Indonesia Sarcastic, while self-knowledge retrieval improved by 3.29% on SemEval and 4.08% on MUStARD compared to original PMP.

Conclusion: Context is crucial for enhancing LLM performance in sarcasm detection, especially for culturally specific slang and unknown terms. Future work will optimize retrieval quality and relevance.

Abstract: Detecting sarcasm remains a challenging task in the areas of Natural Language Processing (NLP) despite recent advances in neural network approaches. Currently, Pre-trained Language Models (PLMs) and Large Language Models (LLMs) are the preferred approach for sarcasm detection. However, the complexity of sarcastic text, combined with linguistic diversity and cultural variation across communities, has made the task more difficult even for PLMs and LLMs. Beyond that, those models also exhibit unreliable detection of words or tokens that require extra grounding for analysis. Building on a state-of-the-art prompting method in LLMs for sarcasm detection called Pragmatic Metacognitive Prompting (PMP), we introduce a retrieval-aware approach that incorporates retrieved contextual information for each target text. Our pipeline explores two complementary ways to provide context: adding non-parametric knowledge using web-based retrieval when the model lacks necessary background, and eliciting the model’s own internal knowledge for a self-knowledge awareness strategy. We evaluated our approach with three datasets, such as Twitter Indonesia Sarcastic, SemEval-2018 Task 3, and MUStARD. Non-parametric retrieval resulted in a significant 9.87% macro-F1 improvement on Twitter Indonesia Sarcastic compared to the original PMP method. Self-knowledge retrieval improves macro-F1 by 3.29% on Semeval and by 4.08% on MUStARD. These findings highlight the importance of context in enhancing LLMs performance in sarcasm detection task, particularly the involvement of culturally specific slang, references, or unknown terms to the LLMs. Future work will focus on optimizing the retrieval of relevant contextual information and examining how retrieval quality affects performance. The experiment code is available at: https://github.com/wllchrst/sarcasm-detection_pmp_knowledge-base.

[27] Enhancing Burmese News Classification with Kolmogorov-Arnold Network Head Fine-tuning

Thura Aung, Eaint Kay Khaing Kyaw, Ye Kyaw Thu, Thazin Myint Oo, Thepchai Supnithi

Main category: cs.CL

TL;DR: KANs (Kolmogorov-Arnold Networks) outperform or match MLPs as classification heads for low-resource Burmese language tasks across various embeddings, offering better expressiveness and efficiency.

Details

Motivation: Traditional MLPs used as classification heads in low-resource languages have fixed non-linearity that limits expressiveness and increases computational costs, motivating exploration of more efficient alternatives.

Method: Evaluated three KAN variants (FourierKAN, EfficientKAN, FasterKAN) as classification heads across diverse embeddings including TF-IDF, fastText, and multilingual transformers (mBERT, Distil-mBERT) for Burmese classification tasks.

Result: KAN-based heads were competitive with or superior to MLPs: EfficientKAN with fastText achieved highest F1-score (0.928), FasterKAN offered best speed-accuracy trade-off, and EfficientKAN matched/slightly outperformed MLPs with mBERT (0.917 F1).

Conclusion: KANs serve as expressive, efficient alternatives to MLPs for low-resource language classification, demonstrating competitive or superior performance across various embedding types.

Abstract: In low-resource languages like Burmese, classification tasks often fine-tune only the final classification layer, keeping pre-trained encoder weights frozen. While Multi-Layer Perceptrons (MLPs) are commonly used, their fixed non-linearity can limit expressiveness and increase computational cost. This work explores Kolmogorov-Arnold Networks (KANs) as alternative classification heads, evaluating Fourier-based FourierKAN, Spline-based EfficientKAN, and Grid-based FasterKAN-across diverse embeddings including TF-IDF, fastText, and multilingual transformers (mBERT, Distil-mBERT). Experimental results show that KAN-based heads are competitive with or superior to MLPs. EfficientKAN with fastText achieved the highest F1-score (0.928), while FasterKAN offered the best trade-off between speed and accuracy. On transformer embeddings, EfficientKAN matched or slightly outperformed MLPs with mBERT (0.917 F1). These findings highlight KANs as expressive, efficient alternatives to MLPs for low-resource language classification.

[28] Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models

Bryan E. Tuck, Rakesh M. Verma

Main category: cs.CL

TL;DR: Cross-architecture evaluation reveals that architectural differences (2.0-2.2x performance gap) matter more than parameter scaling for character-level constraint satisfaction in word puzzles, with systematic failures on orthographically atypical words.

Details

Motivation: To systematically evaluate how different LLM architectures handle hard orthographic constraints during controlled text generation, as current evaluations are limited.

Method: Evaluated 28 configurations across three model families (Qwen3, Claude Haiku-4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction, using human difficulty ratings from 10,000 solvers.

Result: Architectural differences produced 2.0-2.2x performance gaps (F1=0.761 vs. 0.343), larger than parameter scaling gains. Models showed systematic failures on common words with unusual orthography (86-95% human success vs 89-96% model miss rate).

Conclusion: Constraint satisfaction requires specialized architectural features or training objectives beyond standard scaling, as models over-rely on distributional plausibility and penalize orthographically atypical but valid patterns.

Abstract: Large language models must satisfy hard orthographic constraints during controlled text generation, yet systematic cross-architecture evaluation remains limited. We evaluate 28 configurations spanning three model families (Qwen3, Claude Haiku-4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction. Architectural differences produce substantially larger performance gaps (2.0-2.2x, F1=0.761 vs. 0.343) than parameter scaling within families (83% gain from eightfold scaling), suggesting that constraint satisfaction may require specialized architectural features or training objectives beyond standard language model scaling. Thinking budget sensitivity proves heterogeneous: high-capacity models show strong returns (+0.102 to +0.136 F1), while mid-sized variants saturate or degrade. These patterns are inconsistent with uniform compute benefits. Using difficulty ratings from 10,000 human solvers per puzzle, we establish modest but consistent calibration (r=0.24-0.38) across all families, yet identify systematic failures on common words with unusual orthography (“data”, “poop”, “loll”: 86-95% human success, 89-96% model miss rate). These failures reveal over-reliance on distributional plausibility that penalizes orthographically atypical but constraint-valid patterns, suggesting architectural innovations may be required beyond simply scaling parameters or computational budgets.

[29] MortgageLLM: Domain-Adaptive Pretraining with Residual Instruction Transfer, Alignment Tuning, and Task-Specific Routing

Manish Jain, Satheesh Kumar Ponnambalam, Salman Faroz, Chandrakanth Lns, Vinay Sharma

Main category: cs.CL

TL;DR: MortgageLLM is a domain-specific LLM for mortgage finance that uses a dual-track specialization framework to create two expert models - one for conversational Q&A and another for structured tasks - avoiding performance trade-offs of single multi-task models.

Details

Motivation: LLMs lack domain-specific knowledge for specialized sectors like mortgage finance, and single multi-task models suffer from performance trade-offs where optimizing for structured tasks degrades conversational fidelity.

Method: Dual-track specialization from LLaMA-3.1-8B base model, creating two specialists: conversational Q&A model and structured task model for classification/summarization. Uses instruction residual technique to restore instruction-following capabilities and intelligent task routing via few-shot classification.

Result: Significantly outperforms base LLaMA-3.1-8B-Instruct: LLM-as-a-Judge scores of 4.58 (summarization), 4.09 (Q&A), 2.6 (classification) vs 3.99, 4.0, 1.2 baseline. BERTScore improvements: 0.77 (summarization), 0.68 (Q&A), 0.75 (classification) vs 0.74, 0.58, 0.73 baseline.

Conclusion: The dual-expert approach effectively addresses domain specialization challenges in mortgage finance, demonstrating superior performance across conversational and structured tasks while avoiding the performance trade-offs of single multi-task models.

Abstract: Large Language Models (LLMs) demonstrate exceptional capabilities across general domains, yet their application to specialized sectors such as mortgage finance requires domain-specific knowledge augmentation while preserving instruction-following fidelity. We present MortgageLLM, a novel domain-specific large language model that addresses this dual challenge. It is developed using a dual-track specialization framework from a single base model (LLaMA-3.1-8B). We opted for this dual-expert approach as a single multi-task model suffers from performance trade-offs, where optimizing for structured tasks (via SFT) degrades conversational fidelity (via DPO). Our dual-track method solves this by creating two specialists, allowing each to be optimally trained for its distinct capability. Our approach applies the instruction residual technique to restore instruction-following capabilities post-domain adaptation without supervised fine-tuning. We contribute: (1) application of this residual technique to the highly specialized mortgage finance domain; (2) a dual-expert architecture combining a conversational Q&A model and a structured task model for classification and summarization; and (3) an intelligent task routing mechanism using few-shot classification performed by one of the expert models itself. We validate our approach on domain-specific benchmarks, where our final model (MLM v2) significantly outperforms the base LLaMA-3.1-8B-Instruct, achieving an LLM-as-a-Judge summarization score of 4.58 (vs. 3.99), a Q&A score of 4.09 (vs. 4.0), and a classification score of 2.6 (vs. 1.2). On semantic similarity, our model achieved a BERTScore of 0.77 for summarization (vs. 0.74), 0.68 for Q&A (vs. 0.58), and 0.75 for classification (vs. 0.73), substantially outperforming baseline approaches.

[30] Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

Yuhang Wang, Yanxu Zhu, Dongyuan Lu, Jitao Sang

Main category: cs.CL

TL;DR: SGASA framework enhances reasoning model safety by internalizing generated guidelines to defend against adversarial jailbreak prompts while maintaining response quality for benign requests.

Details

Motivation: Address the critical challenge of adversarial jailbreak prompts that evade safety mechanisms and generate harmful content, requiring adaptive safety alignment.

Method: Two-stage framework: Data Pre-synthesis generates safety guidelines and augmented prompts; Alignment Fine-tuning uses SFT and DPO to embed guidelines into models.

Result: Extensive experiments show SGASA significantly improves model safety across multiple datasets, validating adaptive and scalable effectiveness.

Conclusion: SGASA provides an effective framework for autonomous safety reinforcement against adversarial inputs while minimizing unnecessary refusals of benign requests.

Abstract: Reasoning models have demonstrated remarkable capabilities in complex reasoning tasks. However, ensuring their safety against adversarial jailbreak prompts remains a critical challenge. Due to the covert and deceptive nature of such prompts, they can often evade built-in safety mechanisms and lead to the generation of harmful content. This underscores the need for an adaptive safety alignment approach that enables models to autonomously reinforce their defenses in response to adversarial inputs. This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which internalizes model-generated safety guidelines to strengthen models’ ability to enhance robustness against harmful adversarial prompts while minimizing unnecessary refusals of benign requests. SGASA consists of two key stages: Data Pre-synthesis, which generates safety guidelines and augmented prompts; and Alignment Fine-tuning, which leverages Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to embed these guidelines into the model. Extensive experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.

[31] Can Finetuing LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?

Steven Wang, Kyle Hunt, Shaojie Tang, Kenneth Joseph

Main category: cs.CL

TL;DR: Fine-tuning LLMs on small human survey data improves response heterogeneity and alignment with human behavior, but still fails to reproduce original study’s regression coefficients, making LLMs unsuitable for replacing human participants in inferential analyses.

Details

Motivation: To investigate whether fine-tuning LLMs on small human survey data can address limitations like limited diversity, systematic misalignment for minority subgroups, and belief-action discrepancies that prevent LLMs from serving as substitutes for human participants.

Method: Used a behavioral experiment on information disclosure to compare human and LLM-generated responses across multiple dimensions including distributional divergence, subgroup alignment, belief-action coherence, and regression coefficient recovery. Fine-tuned LLMs on small human samples from pilot studies.

Result: Fine-tuning on small human samples substantially improved heterogeneity, alignment, and belief-action coherence compared to base models. However, even the best fine-tuned models failed to reproduce the original study’s regression coefficients.

Conclusion: While fine-tuning improves some aspects of LLM-generated responses, LLM-generated data remain unsuitable for replacing human participants in formal inferential analyses due to inability to reproduce regression coefficients.

Abstract: There is ongoing debate about whether large language models (LLMs) can serve as substitutes for human participants in survey and experimental research. While recent work in fields such as marketing and psychology has explored the potential of LLM-based simulation, a growing body of evidence cautions against this practice: LLMs often fail to align with real human behavior, exhibiting limited diversity, systematic misalignment for minority subgroups, insufficient within-group variance, and discrepancies between stated beliefs and actions. This study examines an important and distinct question in this domain: whether fine-tuning on a small subset of human survey data, such as that obtainable from a pilot study, can mitigate these issues and yield realistic simulated outcomes. Using a behavioral experiment on information disclosure, we compare human and LLM-generated responses across multiple dimensions, including distributional divergence, subgroup alignment, belief-action coherence, and the recovery of regression coefficients. We find that fine-tuning on small human samples substantially improves heterogeneity, alignment, and belief-action coherence relative to the base model. However, even the best-performing fine-tuned models fail to reproduce the regression coefficients of the original study, suggesting that LLM-generated data remain unsuitable for replacing human participants in formal inferential analyses.

[32] Developing an Open Conversational Speech Corpus for the Isan Language

Adisai Na-Thalang, Chanakan Wittayasakpan, Kritsadha Phatcharoen, Supakit Buakaw

Main category: cs.CL

TL;DR: Development of the first open conversational speech dataset for Isan language, featuring natural speech with authentic linguistic phenomena and addressing challenges of non-standardized orthography.

Details

Motivation: To create inclusive AI development by supporting underrepresented languages, particularly Isan as the most widely spoken regional dialect in Thailand, and to capture authentic conversational speech rather than scripted content.

Method: Established practical transcription protocols that balance representational accuracy with computational processing requirements, addressing the challenge of non-standardized Isan orthography and variable writing practices.

Result: Created the first open conversational speech dataset for Isan language containing natural speech with colloquials, spontaneous prosody, disfluencies, and code-switching with central Thai.

Conclusion: The dataset contributes to inclusive AI development, supports research on underrepresented languages, and provides a basis for addressing linguistic and technical challenges in modeling conversational speech.

Abstract: This paper introduces the development of the first open conversational speech dataset for the Isan language, the most widely spoken regional dialect in Thailand. Unlike existing speech corpora that are primarily based on read or scripted speech, this dataset consists of natural speech, thereby capturing authentic linguistic phenomena such as colloquials, spontaneous prosody, disfluencies, and frequent code-switching with central Thai. A key challenge in building this resource lies in the lack of a standardized orthography for Isan. Current writing practices vary considerably, due to the different lexical tones between Thai and Isan. This variability complicates the design of transcription guidelines and poses questions regarding consistency, usability, and linguistic authenticity. To address these issues, we establish practical transcription protocols that balance the need for representational accuracy with the requirements of computational processing. By releasing this dataset as an open resource, we aim to contribute to inclusive AI development, support research on underrepresented languages, and provide a basis for addressing the linguistic and technical challenges inherent in modeling conversational speech.

[33] PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

Robert Belanec, Branislav Pecher, Ivan Srba, Maria Bielikova

Main category: cs.CL

TL;DR: PEFT-Bench is a unified benchmark for evaluating parameter-efficient fine-tuning methods on large language models, addressing current limitations in evaluation scope and reproducibility.

Details

Motivation: Current PEFT method evaluations are limited in scope (models and datasets) and difficult to reproduce, despite the growing importance of parameter-efficient fine-tuning for reducing computational costs of LLMs.

Method: Developed PEFT-Bench as an end-to-end benchmark, evaluated across 27 NLP datasets and 6 PEFT methods, and introduced PEFT Soft Score Penalties (PSCP) metric that considers trainable parameters, inference speed, and training memory usage.

Result: Created a comprehensive evaluation framework that systematically assesses PEFT methods across multiple dimensions including computational efficiency and performance.

Conclusion: PEFT-Bench provides a standardized, reproducible benchmark for comparing PEFT methods, helping researchers and practitioners make informed decisions about parameter-efficient fine-tuning approaches.

Abstract: Despite the state-of-the-art performance of Large Language Models (LLMs) achieved on many tasks, their massive scale often leads to high computational and environmental costs, limiting their accessibility. Parameter-efficient fine-tuning (PEFT) methods address this challenge by reducing the number of trainable parameters while maintaining strong downstream performance. Despite the increased development in PEFT methods, current evaluations remain limited (in terms of evaluated models and datasets) and difficult to reproduce. To bridge this gap, we introduce PEFT-Bench, a unified end-to-end benchmark for evaluating diverse PEFT methods on autoregressive LLMs. We demonstrate its usage across 27 NLP datasets and 6 PEFT methods. To account for different PEFT training and inference factors, we also introduce the PEFT Soft Score Penalties (PSCP) metric, which takes trainable parameters, inference speed, and training memory usage into account.

[34] Emergent Lexical Semantics in Neural Language Models: Testing Martin’s Law on LLM-Generated Text

Kai Kugler

Main category: cs.CL

TL;DR: Martin’s Law (word frequency-polysemy relationship) emerges non-monotonically in LLMs during training, peaking around checkpoint 104 then degrading, with smaller models experiencing semantic collapse while larger ones degrade gracefully.

Details

Motivation: To systematically investigate how Martin's Law - the empirical relationship between word frequency and polysemy - emerges in neural language models during training, and evaluate linguistic regularities in LLM-generated text.

Method: Used DBSCAN clustering of contextualized embeddings to operationalize word senses, analyzing four Pythia models (70M-1B parameters) across 30 training checkpoints.

Result: Non-monotonic developmental trajectory: Martin’s Law emerges around checkpoint 100, peaks at checkpoint 104 (r > 0.6), then degrades by checkpoint 105. Smaller models (70M, 160M) show catastrophic semantic collapse, while larger models (410M, 1B) degrade gracefully. Frequency-specificity trade-off remains stable (r ≈ -0.3).

Conclusion: Compliance with linguistic regularities in LLMs is not monotonically increasing with training but follows a balanced trajectory with an optimal semantic window. This establishes a novel methodology for evaluating emergent linguistic structure.

Abstract: We present the first systematic investigation of Martin’s Law - the empirical relationship between word frequency and polysemy - in text generated by neural language models during training. Using DBSCAN clustering of contextualized embeddings as an operationalization of word senses, we analyze four Pythia models (70M-1B parameters) across 30 training checkpoints. Our results reveal a non-monotonic developmental trajectory: Martin’s Law emerges around checkpoint 100, reaches peak correlation (r > 0.6) at checkpoint 104, then degrades by checkpoint 105. Smaller models (70M, 160M) experience catastrophic semantic collapse at late checkpoints, while larger models (410M, 1B) show graceful degradation. The frequency-specificity trade-off remains stable (r $\approx$ -0.3) across all models. These findings suggest that compliance with linguistic regularities in LLM-generated text is not monotonically increasing with training, but instead follows a balanced trajectory with an optimal semantic window. This work establishes a novel methodology for evaluating emergent linguistic structure in neural language models.

[35] Training Introspective Behavior: Fine-Tuning Induces Reliable Internal State Detection in a 7B Model

Joshua Fonseca Rivera

Main category: cs.CL

TL;DR: Fine-tuning enables reliable detection of injected activation patterns in language models, transforming near-zero accuracy to 85% with 0% false positives, demonstrating that introspective behavior can be directly trained rather than waiting for emergence.

Details

Motivation: To investigate whether introspective awareness in language models can be directly trained rather than emerging spontaneously, addressing Lindsey's open question about eliminating cross-model differences through training.

Method: Fine-tuning a 7B parameter model on transient single-token injections of activation patterns, enabling detection of fleeting “thoughts” injected at single token positions.

Result: Model transformed from 0.4% accuracy and 6.7% false positive rate to 85% accuracy on held-out concepts with 0% false positives, satisfying three of Lindsey’s criteria: accuracy, grounding, and internality.

Conclusion: At least one component of introspective behavior can be directly induced through training, offering a pathway to built-in AI transparency and addressing Lindsey’s question about training for introspection.

Abstract: Lindsey (2025) investigates introspective awareness in language models through four experiments, finding that models can sometimes detect and identify injected activation patterns – but unreliably (~20% success in the best model). We focus on the first of these experiments – self-report of injected “thoughts” – and ask whether this capability can be directly trained rather than waiting for emergence. Through fine-tuning on transient single-token injections, we transform a 7B parameter model from near-complete failure (0.4% accuracy, 6.7% false positive rate) to reliable detection (85% accuracy on held-out concepts at α=40, 0% false positives). Our model detects fleeting “thoughts” injected at a single token position, retains that information, and reports the semantic content across subsequent generation steps. On this task, our trained model satisfies three of Lindsey’s criteria: accuracy (correct identification), grounding (0/60 false positives), and internality (detection precedes verbalization). Generalization to unseen concept vectors (7.5pp gap) demonstrates the model learns a transferable skill rather than memorizing specific vectors, though this does not establish metacognitive representation in Lindsey’s sense. These results address an open question raised by Lindsey: whether “training for introspection would help eliminate cross-model differences.” We show that at least one component of introspective behavior can be directly induced, offering a pathway to built-in AI transparency.

[36] Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?

Antonín Jarolím, Martin Fajčík, Lucia Makaiová

Main category: cs.CL

TL;DR: This paper creates a new dataset for fine-grained evidence extraction in Czech and Slovak claims from online news comments, and evaluates LLMs’ performance on this task.

Details

Motivation: Misinformation spreads in online news comments, requiring methods to detect incorrect information by identifying supporting or refuting evidence from documents.

Method: Created a new dataset with two-way annotated fine-grained evidence by paid annotators, then evaluated various LLMs (llama3.1:8b, gpt-oss-120b, qwen3:14b, deepseek-r1:32b, gpt-oss:20b) on this dataset.

Result: LLMs often fail to copy evidence verbatim from source text, leading to invalid outputs. llama3.1:8b achieves high proportion of correct outputs despite small size, while gpt-oss-120b underperforms despite more parameters. qwen3:14b, deepseek-r1:32b, and gpt-oss:20b show effective balance between size and alignment.

Conclusion: Model size doesn’t always correlate with better performance in fine-grained evidence extraction; smaller models can outperform larger ones, and certain mid-sized models achieve optimal balance between size and alignment with human annotations.

Abstract: Misinformation frequently spreads in user comments under online news articles, highlighting the need for effective methods to detect factually incorrect information. To strongly support or refute claims extracted from such comments, it is necessary to identify relevant documents and pinpoint the exact text spans that justify or contradict each claim. This paper focuses on the latter task – fine-grained evidence extraction for Czech and Slovak claims. We create new dataset, containing two-way annotated fine-grained evidence created by paid annotators. We evaluate large language models (LLMs) on this dataset to assess their alignment with human annotations. The results reveal that LLMs often fail to copy evidence verbatim from the source text, leading to invalid outputs. Error-rate analysis shows that the {llama3.1:8b model achieves a high proportion of correct outputs despite its relatively small size, while the gpt-oss-120b model underperforms despite having many more parameters. Furthermore, the models qwen3:14b, deepseek-r1:32b, and gpt-oss:20b demonstrate an effective balance between model size and alignment with human annotations.

[37] Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation

Zhifeng Hao, Qibin Song, Ruichu Cai, Boyan Xu

Main category: cs.CL

TL;DR: DSR-SQL is a dual-state reasoning framework for Text-to-SQL that improves LLM performance on complex enterprise databases through adaptive context refinement and progressive SQL generation with self-correction.

Details

Motivation: Existing Chain-of-Thought approaches struggle with complex enterprise databases due to limited context capacity, unreliable schema linking, and weak grounding in database semantics.

Method: Models Text-to-SQL as interaction between adaptive context state (refining large schemas and selecting relevant structures) and progressive generation state (feedback-guided SQL synthesis with self-correction).

Result: Achieves 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on BIRD development set without post-training or in-context examples.

Conclusion: DSR-SQL provides an effective framework for handling complex Text-to-SQL tasks on enterprise databases through dual-state reasoning, achieving competitive performance.

Abstract: Recent divide-and-conquer reasoning approaches, particularly those based on Chain-of-Thought (CoT), have substantially improved the Text-to-SQL capabilities of Large Language Models (LLMs). However, when applied to complex enterprise databases, such methods struggle to maintain coherent reasoning due to limited context capacity, unreliable schema linking, and weak grounding in database semantics. To overcome these issues, we introduce DSR-SQL, a \textbf{D}ual-\textbf{S}tate \textbf{R}easoning framework that models Text-to-SQL as an interaction between an adaptive context state and a progressive generation state. The first constructs a compact, semantically faithful environment by refining large schemas and selecting relevant structures, while the second formalizes SQL synthesis as feedback-guided state transitions, enabling the model to self-correct and align with user intent. Without any post-training or in-context examples, DSR-SQL achieves competitive performance, reaching 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on BIRD development set. Our implementation will be open-sourced at: https://github.com/DMIRLAB-Group/DSR-SQL.

[38] Odin: Oriented Dual-module Integration for Text-rich Network Representation Learning

Kaifeng Hong, Yinglong Zhang, Xiaoying Hong, Xuewen Xia, Xing Xu

Main category: cs.CL

TL;DR: Odin is a novel architecture that integrates graph structure into Transformers at specific layers through oriented dual-modules, avoiding over-smoothing and hop-dependent diffusion while achieving state-of-the-art performance on text-rich graphs.

Details

Motivation: Existing approaches for text-attributed graphs either rely on GNNs (limited by over-smoothing and hop-dependent diffusion) or Transformers (that overlook graph topology), creating a need for better structure-text integration.

Method: Odin injects graph structure into Transformers at selected depths using an oriented dual-module mechanism, integrating multi-hop structures at specific layers aligned with semantic hierarchy. Light Odin is a lightweight variant for efficiency.

Result: Odin achieves state-of-the-art accuracy on multiple text-rich graph benchmarks, while Light Odin delivers competitive performance with significantly reduced computational cost.

Conclusion: Odin and Light Odin form a unified, hop-free framework for principled structure-text integration with proven expressive power that strictly contains both pure Transformers and GNNs.

Abstract: Text-attributed graphs require models to effectively combine strong textual understanding with structurally informed reasoning. Existing approaches either rely on GNNs–limited by over-smoothing and hop-dependent diffusion–or employ Transformers that overlook graph topology and treat nodes as isolated sequences. We propose Odin (Oriented Dual-module INtegration), a new architecture that injects graph structure into Transformers at selected depths through an oriented dual-module mechanism.Unlike message-passing GNNs, Odin does not rely on multi-hop diffusion; instead, multi-hop structures are integrated at specific Transformer layers, yielding low-, mid-, and high-level structural abstraction aligned with the model’s semantic hierarchy. Because aggregation operates on the global [CLS] representation, Odin fundamentally avoids over-smoothing and decouples structural abstraction from neighborhood size or graph topology. We further establish that Odin’s expressive power strictly contains that of both pure Transformers and GNNs.To make the design efficient in large-scale or low-resource settings, we introduce Light Odin, a lightweight variant that preserves the same layer-aligned structural abstraction for faster training and inference. Experiments on multiple text-rich graph benchmarks show that Odin achieves state-of-the-art accuracy, while Light Odin delivers competitive performance with significantly reduced computational cost. Together, Odin and Light Odin form a unified, hop-free framework for principled structure-text integration. The source code of this model has been released at https://github.com/hongkaifeng/Odin.

[39] A Systematic Study of Model Merging Techniques in Large Language Models

Oğuz Kağan Hitit, Leander Girrbach, Zeynep Akata

Main category: cs.CL

TL;DR: Model merging doesn’t work well for LLMs - only simple Task Arithmetic reliably improves performance, while other methods cause significant drops.

Details

Motivation: To determine if model merging advantages from smaller models generalize to LLMs, and to systematically evaluate merging methods on modern LLMs.

Method: Large-scale evaluation of 6 merging methods across 4 LLMs, 12 fine-tuned checkpoints per model, and 16 benchmarks, measuring performance gains over base models and best checkpoints.

Result: Only Task Arithmetic reliably yields performance gains; other interference-aware and subspace methods typically cause significant performance drops.

Conclusion: Current merging techniques don’t transfer well to LLMs, motivating the need for LLM-specific merging algorithms and merging-aware fine-tuning methods.

Abstract: Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for smaller models and classifiers generalize to LLMs. We present a large-scale, systematic evaluation of six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. Evaluating through standardized benchmarks, we measure both the probability that a merged model outperforms the base model and relative gains over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs. Other interference-aware and subspace merging methods typically result in significant performance drops. Our findings indicate that current merging techniques do not directly transfer to modern LLMs. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods. Code will be released upon acceptance of this paper.

[40] Hierarchical Ranking Neural Network for Long Document Readability Assessment

Yurui Zheng, Yijun Chen, Shaohong Zhang

Main category: cs.CL

TL;DR: A bidirectional readability assessment model that captures contextual information to identify semantic-rich regions and uses sentence-level predictions to assist document-level readability assessment, with a pairwise sorting algorithm to model ordinal relationships between readability levels.

Details

Motivation: Most deep learning approaches for readability assessment fail to consider text length or the ordinal relationship between readability labels, limiting their effectiveness.

Method: Proposes a bidirectional mechanism that identifies semantic-rich regions for sentence-level readability prediction, then aggregates these to assist document-level assessment. Introduces pairwise sorting algorithm using label subtraction to model ordinal relationships.

Result: Experimental results on Chinese and English datasets show the model achieves competitive performance and outperforms other baseline models.

Conclusion: The proposed bidirectional readability assessment mechanism with pairwise sorting effectively addresses text length and ordinal relationship issues, demonstrating superior performance across languages.

Abstract: Readability assessment aims to evaluate the reading difficulty of a text. In recent years, while deep learning technology has been gradually applied to readability assessment, most approaches fail to consider either the length of the text or the ordinal relationship of readability labels. This paper proposes a bidirectional readability assessment mechanism that captures contextual information to identify regions with rich semantic information in the text, thereby predicting the readability level of individual sentences. These sentence-level labels are then used to assist in predicting the overall readability level of the document. Additionally, a pairwise sorting algorithm is introduced to model the ordinal relationship between readability levels through label subtraction. Experimental results on Chinese and English datasets demonstrate that the proposed model achieves competitive performance and outperforms other baseline models.

[41] Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation

Lina Conti, Dennis Fucci, Marco Gaido, Matteo Negri, Guillaume Wisniewski, Luisa Bentivogli

Main category: cs.CL

TL;DR: Speech translation models use complex mechanisms for gender assignment, learning broader masculine patterns from training data while being able to override language model biases using acoustic information, particularly through first-person pronouns that access gender cues distributed across the frequency spectrum.

Details

Motivation: Speech conveys speaker gender through acoustic cues like pitch, creating modality-specific bias concerns in speech translation. When translating from languages with notional gender to languages with grammatical gender, vocal characteristics may influence gender assignment, risking misgendering through masculine defaults or vocal-based assumptions.

Method: Investigated gender assignment mechanisms in speech translation models across three language pairs (en-es/fr/it), examining training data patterns, internal language model biases, and acoustic information interaction using contrastive feature attribution on spectrograms.

Result: Models don’t simply replicate term-specific gender associations but learn broader patterns of masculine prevalence. While internal language models show strong masculine bias, models can override these preferences using acoustic input. Higher gender accuracy models use first-person pronouns to link gendered terms back to the speaker, accessing gender information distributed across frequency spectrum rather than concentrated in pitch.

Conclusion: Speech translation models employ sophisticated mechanisms for gender assignment that combine training data patterns, language model biases, and acoustic information, with successful models using first-person pronouns as a key mechanism to access distributed gender cues in the speech signal.

Abstract: Unlike text, speech conveys information about the speaker, such as gender, through acoustic cues like pitch. This gives rise to modality-specific bias concerns. For example, in speech translation (ST), when translating from languages with notional gender, such as English, into languages where gender-ambiguous terms referring to the speaker are assigned grammatical gender, the speaker’s vocal characteristics may play a role in gender assignment. This risks misgendering speakers, whether through masculine defaults or vocal-based assumptions. Yet, how ST models make these decisions remains poorly understood. We investigate the mechanisms ST models use to assign gender to speaker-referring terms across three language pairs (en-es/fr/it), examining how training data patterns, internal language model (ILM) biases, and acoustic information interact. We find that models do not simply replicate term-specific gender associations from training data, but learn broader patterns of masculine prevalence. While the ILM exhibits strong masculine bias, models can override these preferences based on acoustic input. Using contrastive feature attribution on spectrograms, we reveal that the model with higher gender accuracy relies on a previously unknown mechanism: using first-person pronouns to link gendered terms back to the speaker, accessing gender information distributed across the frequency spectrum rather than concentrated in pitch.

[42] Bangla Sign Language Translation: Dataset Creation Challenges, Benchmarking and Prospects

Husne Ara Rubaiyeat, Hasan Mahmud, Md Kamrul Hasan

Main category: cs.CL

TL;DR: Created IsharaKhobor dataset for Bangla Sign Language Translation to address low-resource constraints, with two subsets for vocabulary restriction and canonicalization.

Details

Motivation: BdSLT has been severely constrained due to low resources; standard sentence-level dataset creation is crucial for developing AI assistive tools for deaf and hard-of-hearing Bangla speakers.

Method: Developed IsharaKhobor dataset with two subsets (IsharaKhobor_small and IsharaKhobor_canonical_small) through vocabulary restriction and canonicalization; benchmarked using landmark-based raw and RQE embeddings.

Result: Dataset publicly available on Kaggle; ablation studies on vocabulary restriction and canonicalization resulted in improved subsets for research.

Conclusion: IsharaKhobor enables BdSLT research by providing foundational datasets and benchmarks, addressing critical resource gaps for assistive technology development.

Abstract: Bangla Sign Language Translation (BdSLT) has been severely constrained so far as the language itself is very low resource. Standard sentence level dataset creation for BdSLT is of immense importance for developing AI based assistive tools for deaf and hard of hearing people of Bangla speaking community. In this paper, we present a dataset, IsharaKhobor , and two subset of it for enabling research. We also present the challenges towards developing the dataset and present some way forward by benchmarking with landmark based raw and RQE embedding. We do some ablation on vocabulary restriction and canonicalization of the same within the dataset, which resulted in two more datasets, IsharaKhobor_small and IsharaKhobor_canonical_small. The dataset is publicly available at: www.kaggle.com/datasets/hasanssl/isharakhobor [1].

[43] RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions

Minjoon Choi

Main category: cs.CL

TL;DR: RoParQ benchmark evaluates LLM consistency across paraphrased questions, revealing reliance on surface patterns. XParaCon metric measures robustness, and paraphrase-aware SFT improves consistency, making smaller models perform like larger ones.

Details

Motivation: LLMs show inconsistent behavior with paraphrased questions, indicating they rely on surface-level patterns rather than true semantic understanding.

Method: Created RoParQ benchmark from standard datasets using proprietary models to generate paraphrases, developed XParaCon metric to measure robustness, and implemented reasoning-based paraphrase-aware SFT for model alignment.

Result: Targeted alignment significantly enhanced robustness, with fine-tuned lightweight models achieving consistency levels comparable to much larger pre-trained models.

Conclusion: The approach effectively mitigates superficial memorization and fosters more robust, reliable LLMs through semantic invariance alignment.

Abstract: Large Language Models (LLMs) often exhibit inconsistent behavior when answering paraphrased questions, suggesting a reliance on surface-level patterns rather than true semantic understanding. To address this limitation, we introduce RoParQ, a benchmark specifically constructed to evaluate cross-paraphrase consistency in closed-book multiple-choice QA. This benchmark is derived from standard datasets by generating paraphrases via proprietary models and selectively retaining examples that elicit inconsistent confidence from a judge model. We further propose XParaCon, a novel evaluation metric that quantifies a model’s robustness by measuring the standard deviation of accuracies across question variants. Additionally, we implement a reasoning-based, paraphrase-aware Supervised Fine-Tuning (SFT) strategy designed to align models toward semantic invariance. Our experiments demonstrate that this targeted alignment significantly enhances robustness. Notably, fine-tuned lightweight models achieved consistency levels comparable to much larger pre-trained models. These results highlight the efficacy of our approach in mitigating superficial memorization and fostering more robust, reliable LLMs.

[44] Auxiliary Metrics Help Decoding Skill Neurons in the Wild

Yixiu Zhao, Xiaozhi Wang, Zijun Yao, Lei Hou, Juanzi Li

Main category: cs.CL

TL;DR: A lightweight method for identifying skill-specific neurons in LLMs by correlating activations with external metrics, revealing interpretable behaviors and shortcuts in complex tasks.

Details

Motivation: Large language models have impressive capabilities but their internal mechanisms are poorly understood, creating a need for interpretability methods that can isolate neurons encoding specific skills.

Method: Extends prior work on skill neurons by correlating neuron activations with auxiliary metrics (external labels and model confidence) in complex multi-skill scenarios, without requiring manual token aggregation.

Result: Successfully identified neurons driving known skills and revealed previously unknown shortcuts in arithmetic reasoning on BigBench, validated across open-ended text generation and natural language inference tasks.

Conclusion: The proposed method provides a simple, broadly applicable approach for uncovering interpretable, task-specific neuron behaviors in LLMs, advancing model transparency.

Abstract: Large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, yet their internal mechanisms remain largely opaque. In this paper, we introduce a simple, lightweight, and broadly applicable method with a focus on isolating neurons that encode specific skills. Building upon prior work that identified “skill neurons” via soft prompt training on classification tasks, our approach extends the analysis to complex scenarios involving multiple skills. We correlate neuron activations with auxiliary metrics – such as external labels and the model’s own confidence score – thereby uncovering interpretable and task-specific behaviors without the need for manual token aggregation. We empirically validate our method on tasks spanning open-ended text generation and natural language inference, demonstrating its ability to detect neurons that not only drive known skills but also reveal previously unidentified shortcuts in arithmetic reasoning on BigBench.

[45] Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

Dongyang Fan, Diba Hashemi, Sai Praneeth Karimireddy, Martin Jaggi

Main category: cs.CL

TL;DR: This paper explores using various metadata types beyond URLs to accelerate LLM pretraining, finding that fine-grained quality indicators and metadata appending can improve training efficiency through quality-aware latent structures.

Details

Motivation: Prior work only used URLs as metadata for LLM pretraining acceleration, leaving open whether other metadata types could provide greater benefits for training efficiency.

Method: Investigated multiple metadata types, introduced metadata appending as auxiliary tasks, used learnable meta-tokens with masked loss, and analyzed latent representations through probing.

Result: Found that fine-grained quality indicators and metadata appending can accelerate pretraining, with effective metadata encoding information at finer granularity.

Conclusion: Provides practical guidelines for integrating metadata to improve both efficiency and effectiveness of LLM pretraining through quality-aware latent structures.

Abstract: Incorporating metadata in Large Language Models (LLMs) pretraining has recently emerged as a promising approach to accelerate training. However prior work highlighted only one useful signal-URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find other types of metadata, such as fine-grained indicators of document quality that can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting an appropriate metadata as auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.

[46] The author is dead, but what if they never lived? A reception experiment on Czech AI- and human-authored poetry

Anna Marklová, Ondřej Vinš, Martina Vokáčová, Jiří Milička

Main category: cs.CL

TL;DR: Czech native speakers cannot reliably distinguish AI-generated from human-written poetry, showing AI can convincingly produce poetry even in morphologically complex, low-resource languages like Czech.

Details

Motivation: To examine perception of AI- and human-written Czech poetry, testing if native speakers can identify authorship and how they aesthetically judge it, given most AI poetry studies focus on English.

Method: Conducted experiments where Czech native speakers guessed authorship of poems and provided aesthetic evaluations, using logistic regression to analyze factors affecting recognition accuracy.

Result: Participants performed at chance level (45.8% correct) identifying authorship, AI poems were rated equally or more favorably than human ones, but belief about authorship strongly biased aesthetic evaluations.

Conclusion: AI can convincingly produce Czech poetry, readers’ beliefs about authorship and aesthetic evaluation are interconnected, and familiarity with poetry had no effect on recognition accuracy.

Abstract: Large language models are increasingly capable of producing creative texts, yet most studies on AI-generated poetry focus on English – a language that dominates training data. In this paper, we examine the perception of AI- and human-written Czech poetry. We ask if Czech native speakers are able to identify it and how they aesthetically judge it. Participants performed at chance level when guessing authorship (45.8% correct on average), indicating that Czech AI-generated poems were largely indistinguishable from human-written ones. Aesthetic evaluations revealed a strong authorship bias: when participants believed a poem was AI-generated, they rated it as less favorably, even though AI poems were in fact rated equally or more favorably than human ones on average. The logistic regression model uncovered that the more the people liked a poem, the less probable was that they accurately assign the authorship. Familiarity with poetry or literary background had no effect on recognition accuracy. Our findings show that AI can convincingly produce poetry even in a morphologically complex, low-resource (with respect of the training data of AI models) Slavic language such as Czech. The results suggest that readers’ beliefs about authorship and the aesthetic evaluation of the poem are interconnected.

[47] Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

Dong Wang, Yang Li, Ansong Ni, Ching-Feng Yeh, Youssef Emad, Xinjie Lei, Liam Robbins, Karthik Padthe, Hu Xu, Xian Li, Asli Celikyilmaz, Ramya Raghavendra, Lifei Huang, Carole-Jean Wu, Shang-Wen Li

Main category: cs.CL

TL;DR: Matrix is a decentralized framework for multi-agent synthetic data generation that eliminates central orchestrators, using distributed queues for control and data flow to achieve 2-15x higher throughput than existing approaches.

Details

Motivation: Existing multi-agent synthesis frameworks suffer from scalability bottlenecks due to centralized orchestrators and are often hardcoded for specific domains, limiting flexibility for diverse data generation tasks.

Method: Matrix uses a decentralized peer-to-peer design where control and data flow are represented as serialized messages passed through distributed queues. It separates lightweight agents from compute-intensive operations handled by distributed services, built on Ray for scalability.

Result: Matrix scales to tens of thousands of concurrent agentic workflows and achieves 2-15x higher data generation throughput under identical hardware resources across diverse scenarios including multi-agent dialogue, web reasoning, and tool-use trajectory generation.

Conclusion: Matrix provides a scalable, modular framework for multi-agent synthetic data generation that significantly outperforms existing approaches in throughput while maintaining output quality.

Abstract: Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present \textbf{Matrix}, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves $2$–$15\times$ higher data generation throughput under identical hardware resources, without compromising output quality.

[48] Revisiting Generalization Across Difficulty Levels: It’s Not So Easy

Yeganeh Kordi, Nihal V. Nayak, Max Zuo, Ilana Nguyen, Stephen H. Bach

Main category: cs.CL

TL;DR: LLMs show limited generalization across task difficulties - training on easy or hard data alone doesn’t consistently improve performance across all difficulty levels, highlighting the need for diverse difficulty ranges in training and evaluation data.

Details

Motivation: To understand how LLMs generalize across different task difficulties, addressing mixed findings in existing research about whether training on easier or harder data leads to better performance and where those gains occur.

Method: Systematic evaluation using six datasets, ranking examples by difficulty using outputs from thousands of different LLMs and Item Response Theory (IRT), creating difficulty ratings based solely on LLM abilities without human input.

Result: Cross-difficulty generalization is often limited; training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties.

Conclusion: Having a range of difficulties in both training and evaluation data is crucial for LLMs, and taking shortcuts with respect to difficulty is risky.

Abstract: We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address this question by conducting a systematic evaluation of LLMs’ generalization across models, datasets, and fine-grained groups of example difficulty. We rank examples in six datasets using the outputs of thousands of different LLMs and Item Response Theory (IRT), a well-established difficulty metric in educational testing. Unlike prior work, our difficulty ratings are therefore determined solely by the abilities of many different LLMs, excluding human opinions of difficulty. With a more objective, larger-scale, and finer-grained analysis, we show that cross-difficulty generalization is often limited; training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties. These results show the importance of having a range of difficulties in both training and evaluation data for LLMs, and that taking shortcuts with respect to difficulty is risky.

[49] Evaluating Large Language Models for Radiology Natural Language Processing

Zhengliang Liu, Tianyang Zhong, Yiwei Li, Yutong Zhang, Yi Pan, Zihao Zhao, Peixin Dong, Chao Cao, Yuxiao Liu, Peng Shu, Yaonai Wei, Zihao Wu, Chong Ma, Jiaqi Wang, Sheng Wang, Mengyue Zhou, Zuowei Jiang, Chunlin Li, Jason Holmes, Shaochen Xu, Lu Zhang, Haixing Dai, Kai Zhang, Lin Zhao, Yuanhao Chen, Xu Liu, Peilong Wang, Junhao Chen, Pingkun Yan, Jun Liu, Bao Ge, Lichao Sun, Dajiang Zhu, Xiang Li, Wei Liu, Xiaoyan Cai, Xintao Hu, Xi Jiang, Shu Zhang, Xin Zhang, Tuo Zhang, Shijie Zhao, Quanzheng Li, Hongtu Zhu, Dinggang Shen, Tianming Liu

Main category: cs.CL

TL;DR: Evaluation of 32 bilingual LLMs for interpreting radiology reports, specifically assessing their ability to derive clinical impressions from radiologic findings.

Details

Motivation: Despite the abundance of bilingual LLMs and their significant impact in medical fields, there's a lack of comprehensive evaluation specifically in radiology NLP, particularly for deriving impressions from radiology reports.

Method: Critical evaluation of thirty-two large language models by assessing their performance in interpreting radiology reports and deriving clinical impressions from radiologic findings.

Result: The evaluation provides insights into the performance, strengths, and weaknesses of these LLMs in radiology report interpretation.

Conclusion: The study bridges the evaluation gap for LLMs in radiology NLP and informs their practical applications in the medical domain.

Abstract: The rise of large language models (LLMs) has marked a pivotal shift in the field of natural language processing (NLP). LLMs have revolutionized a multitude of domains, and they have made a significant impact in the medical field. Large language models are now more abundant than ever, and many of these models exhibit bilingual capabilities, proficient in both English and Chinese. However, a comprehensive evaluation of these models remains to be conducted. This lack of assessment is especially apparent within the context of radiology NLP. This study seeks to bridge this gap by critically evaluating thirty two LLMs in interpreting radiology reports, a crucial component of radiology NLP. Specifically, the ability to derive impressions from radiologic findings is assessed. The outcomes of this evaluation provide key insights into the performance, strengths, and weaknesses of these LLMs, informing their practical applications within the medical domain.

[50] Fine-grained and Explainable Factuality Evaluation for Multimodal Summarization

Yue Zhang, Jingxuan Zuo, Liqiang Jing

Main category: cs.CL

TL;DR: Proposes FALLACIOUS, two fine-grained evaluation frameworks for assessing factuality in multimodal summarization: reference-based and reference-free approaches.

Details

Motivation: Existing multimodal summarization methods potentially suffer from unfactual output, requiring better evaluation frameworks to assess factuality.

Method: Two evaluation frameworks: reference-based factuality evaluation (uses ground truth) and reference-free factuality evaluation (doesn’t need ground truth) for different application scenarios.

Result: Experimental results show effectiveness through correlation analysis with other metrics; code and dataset will be released.

Conclusion: The proposed FALLACIOUS frameworks provide fine-grained and explainable evaluation of factuality in multimodal summarization, with the reference-free approach having wider applicability.

Abstract: Multimodal summarization aims to generate a concise summary based on the input text and image. However, the existing methods potentially suffer from unfactual output. To evaluate the factuality of multimodal summarization models, we propose two fine-grained and explainable evaluation frameworks (FALLACIOUS) for different application scenarios, i.e. reference-based factuality evaluation framework and reference-free factuality evaluation framework. Notably, the reference-free factuality evaluation framework doesn’t need ground truth and hence it has a wider application scenario. To evaluate the effectiveness of the proposed frameworks, we compute the correlation between our frameworks and the other metrics. The experimental results show the effectiveness of our proposed method. We will release our code and dataset via github.

[51] Scaling Efficient LLMs

B. N. Kausik

Main category: cs.CL

TL;DR: The paper proposes recurrent transformers as an efficient LLM architecture that scales parameters as D^γ (γ∈[0.44,0.72]) instead of linearly with data size, achieving linear time complexity and memory efficiency while maintaining performance.

Details

Motivation: Current LLMs with hundreds of billions of parameters consume vast resources, and traditional transformer scaling requires parameters to grow linearly with data size, making them inefficient.

Method: Recurrent transformers combine transformers with recurrent networks by progressively applying a single transformer layer to a fixed-width sliding window across input sequences.

Result: Recurrent transformers perform favorably on benchmark tests, running in linear time, being memory-efficient, learning to forget/accumulate history as needed, and amenable to curriculum training.

Conclusion: Recurrent transformers offer a more efficient architecture that scales better with data size while maintaining performance, suggesting potential for more resource-efficient LLMs.

Abstract: Recent LLMs have hundreds of billions of parameters consuming vast resources. Furthermore, the so called “AI scaling law” for transformers suggests that the number of parameters must scale linearly with the size of the data. In response, we inquire into efficient LLMs, i.e. those with the fewest parameters that achieve the desired accuracy on a training corpus. Specifically, by comparing theoretical and empirical estimates of the Kullback-Leibler divergence, we derive a natural AI scaling law that the number of parameters in an efficient LLM scales as $D^γ$ where $D$ is the size of the training data and $ γ\in [0.44, 0.72]$, suggesting the existence of more efficient architectures. Against this backdrop, we propose recurrent transformers, combining the efficacy of transformers with the efficiency of recurrent networks, progressively applying a single transformer layer to a fixed-width sliding window across the input sequence. Recurrent transformers (a) run in linear time in the sequence length, (b) are memory-efficient and amenable to parallel processing in large batches, (c) learn to forget history for language tasks, or accumulate history for long range tasks like copy and selective copy, and (d) are amenable to curriculum training to overcome vanishing gradients. In our experiments, we find that recurrent transformers perform favorably on benchmark tests.

[52] Gram2Vec: An Interpretable Document Vectorizer

Peter Zeng, Hannah Stortz, Eric Sclafani, Alina Shabaeva, Maria Elizabeth Garza, Daniel Greeson, Owen Rambow

Main category: cs.CL

TL;DR: Gram2Vec is a grammatical style embedding system that uses normalized relative frequencies of grammatical features for document embedding, offering interpretability and applications in authorship verification and AI detection.

Details

Motivation: To create an interpretable grammatical style embedding system that can explain stylistic differences in documents, addressing the black-box nature of neural approaches.

Method: Extracts normalized relative frequencies of grammatical features from text to generate feature vectors, then uses these for authorship verification explanations and AI detection classification.

Result: Outperforms machine learning models trained on comparable Biber features for AI detection, and provides interpretable explanations for authorship verification decisions.

Conclusion: Gram2Vec provides an interpretable alternative to neural embedding methods for grammatical style analysis, with practical applications in authorship verification and AI detection tasks.

Abstract: We present Gram2Vec, a grammatical style embedding system that embeds documents into a higher dimensional space by extracting the normalized relative frequencies of grammatical features present in the text. Compared to neural approaches, Gram2Vec offers inherent interpretability based on how the feature vectors are generated. In this paper, we use authorship verification and AI detection as two applications to show how Gram2Vec can be used. For authorship verification, we use the features from Gram2Vec to explain why a pair of documents is by the same or by different authors. We also demonstrate how Gram2Vec features can be used to train a classifier for AI detection, outperforming machine learning models trained on a comparable set of Biber features.

[53] A Psychology-based Unified Dynamic Framework for Curriculum Learning

Guangyu Meng, Qingkai Zeng, John P. Lalor, Hong Yu

Main category: cs.CL

TL;DR: PUDF is a psychology-based curriculum learning framework that uses Item Response Theory and Artificial Crowds to dynamically select training data from easy to hard, improving LLM fine-tuning efficiency and accuracy.

Details

Motivation: Traditional curriculum learning faces challenges in defining data difficulty and determining appropriate data amounts for each training step. Drawing from psychometrics can provide more systematic solutions.

Method: Uses Item Response Theory applied to Artificial Crowds responses to quantify global, interpretable difficulty values. Proposes DDS-MAE strategy to dynamically schedule data amounts based on model ability estimation.

Result: Fine-tuning pre-trained LLMs with PUDF achieves higher accuracy and faster convergence on benchmark datasets compared to standard fine-tuning and state-of-the-art CL methods.

Conclusion: PUDF’s unified theory-based approach for difficulty labeling and model ability estimation enables aligned training data selection, leading to improved performance and efficiency in curriculum learning.

Abstract: Directly learning from examples of varying difficulty levels is often challenging for both humans and machine learning models. A more effective strategy involves exposing learners to examples in a progressive order from easy to difficult. Curriculum Learning (CL) has been proposed to implement this strategy in machine learning model training. However, two key challenges persist in CL framework design: defining the difficulty of training data and determining the appropriate amount of data to input at each training step. Drawing inspiration from psychometrics, this paper presents a Psychology-based Unified Dynamic Framework for Curriculum Learning (PUDF). We quantify the difficulty of training data by applying Item Response Theory (IRT) to responses from Artificial Crowds (AC). This theory-driven IRT-AC approach leads to global (i.e., model-independent) and interpretable difficulty values. Leveraging IRT, we propose a training strategy, Dynamic Data Selection via Model Ability Estimation (DDS-MAE), to schedule the appropriate amount of data during model training. Since our difficulty labeling and model ability estimation are based on a consistent theory, namely IRT, their values are comparable within the same scope, potentially leading to aligned training data selection and faster convergence compared to the other CL methods. Experimental results demonstrate that fine-tuning pre-trained large language models with PUDF leads to higher accuracy and faster convergence on a suite of benchmark datasets compared to standard fine-tuning and state-of-the-art CL methods. Ablation studies and downstream analyses further validate the impact of PUDF for CL.

[54] Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models

Yinlam Chow, Guy Tennenholtz, Izzeddin Gur, Vincent Zhuang, Bo Dai, Sridhar Thiagarajan, Craig Boutilier, Rishabh Agarwal, Aviral Kumar, Aleksandra Faust

Main category: cs.CL

TL;DR: Proposes inference-aware fine-tuning to optimize LLM performance for Best-of-N inference strategy, achieving significant improvements in mathematical reasoning and code generation benchmarks.

Details

Motivation: Effectively utilizing inference-time compute is crucial for better LLM performance, and current fine-tuning methods don't directly optimize for inference strategies like Best-of-N.

Method: Develops imitation learning and reinforcement learning methods for BoN-aware fine-tuning that overcome the non-differentiable argmax operator in Best-of-N selection.

Result: Improved Gemma 2B performance: Hendrycks MATH Bo32 from 26.8% to 30.8%, pass@32 from 60.0% to 67.0%; HumanEval pass@16 from 61.6% to 67.1%. Models learn meta-strategies balancing best responses with diverse exploration.

Conclusion: BoN-aware fine-tuning effectively optimizes inference-time performance, enabling models to learn exploration-exploitation trade-offs and significantly improve performance with better compute utilization.

Abstract: Recent studies have indicated that effectively utilizing inference-time compute is crucial for attaining better performance from large language models (LLMs). In this work, we propose a novel inference-aware fine-tuning paradigm, in which the model is fine-tuned in a manner that directly optimizes the performance of the inference-time strategy. We study this paradigm using the simple yet effective Best-of-N (BoN) inference strategy, in which a verifier selects the best out of a set of LLM-generated responses. We devise the first imitation learning and reinforcement learning~(RL) methods for BoN-aware fine-tuning, overcoming the challenging, non-differentiable argmax operator within BoN. We empirically demonstrate that our BoN-aware models implicitly learn a meta-strategy that interleaves best responses with more diverse responses that might be better suited to a test-time input – a process reminiscent of the exploration-exploitation trade-off in RL. Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the Bo32 performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and pass@32 from 60.0% to 67.0%, as well as the pass@16 on HumanEval from 61.6% to 67.1%.

[55] BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations

Simone Giovannini, Fabio Coppini, Andrea Gemelli, Simone Marinai

Main category: cs.CL

TL;DR: A unified document QA dataset combining multiple public datasets, reformulating Document AI tasks as QA format with OCR and bounding box annotations for LLM training and evaluation.

Details

Motivation: To create a comprehensive resource for document understanding by unifying existing datasets and reformulating tasks into QA format suitable for Large Language Models.

Method: Combined multiple public Document AI datasets, reformulated Information Extraction tasks as Question-Answering, provided OCR text and bounding box annotations for answer localization.

Result: Created a unified dataset enabling exploration of prompting techniques (including bounding box information) for document comprehension using open-weight models.

Conclusion: The dataset facilitates training and evaluation of LLMs on document QA tasks, with identified effective prompting approaches for improved document understanding.

Abstract: We present a unified dataset for document Question-Answering (QA), which is obtained combining several public datasets related to Document AI and visually rich document understanding (VRDU). Our main contribution is twofold: on the one hand we reformulate existing Document AI tasks, such as Information Extraction (IE), into a Question-Answering task, making it a suitable resource for training and evaluating Large Language Models; on the other hand, we release the OCR of all the documents and include the exact position of the answer to be found in the document image as a bounding box. Using this dataset, we explore the impact of different prompting techniques (that might include bounding box information) on the performance of open-weight models, identifying the most effective approaches for document comprehension.

[56] Position-Aware Depth Decay Decoding ($D^3$): Boosting Large Language Model Inference Efficiency

Siqi Fan, Xuezhi Fang, Xingrun Xing, Peng Han, Shuo Shang, Yequan Wang

Main category: cs.CL

TL;DR: D³ is a training-free layer skipping method that uses position-aware depth decay to reduce LLM inference computation by 1.5x while maintaining performance.

Details

Motivation: LLM inference is resource-intensive, and not all components are needed for every token. Later tokens have lower perplexity and require less computation.

Method: Position-Aware Depth Decay Decoding (D³) uses a power-law decay function to determine layer retention per token position without retraining.

Result: Achieves 1.5x speedup on Llama models (7B-70B) with <1% performance drop on GSM8K and BBH benchmarks.

Conclusion: D³ enables efficient LLM inference through dynamic depth computation without retraining, maintaining performance while significantly reducing operations.

Abstract: Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. Unlike traditional model compression, which needs retraining, recent dynamic computation methods show that not all components are required for inference, enabling a training-free pipeline. In this paper, we focus on the dynamic depth of LLM generation. A token-position aware layer skipping framework is proposed to save 1.5x times operations efficiently while maintaining performance. We first observed that tokens predicted later have lower perplexity and thus require less computation. Then, we propose a training-free algorithm called Position-Aware Depth Decay Decoding ($D^3$), which leverages a power-law decay function, $\left\lfloor L \times (α^i) \right\rfloor$, to determine the number of layers to retain when generating token $T_i$. Remarkably, without any retraining, the $D^3$ achieves success across a wide range of generation tasks for the first time. Experiments on large language models (\ie the Llama) with $7 \sim 70$ billion parameters show that $D^3$ can achieve an average 1.5x speedup compared with the full-inference pipeline while maintaining comparable performance with nearly no performance drop ($<1%$) on the GSM8K and BBH benchmarks.

[57] Reasoning Transfer for an Extremely Low-Resource and Endangered Language: Bridging Languages Through Sample-Efficient Language Understanding

Khanh-Tung Tran, Barry O’Sullivan, Hoang D. Nguyen

Main category: cs.CL

TL;DR: English-Pivoted CoT Training improves reasoning in low-resource languages by using English for chain-of-thought while keeping final output in target language, achieving up to 28.33% improvement on math reasoning benchmarks.

Details

Motivation: Chain-of-thought reasoning gains have largely benefited high-resource languages, leaving low-resource languages behind. The authors aim to bridge this gap by leveraging LLMs' internal English-aligned latent space.

Method: English-Pivoted CoT Training: supervised fine-tuning to generate CoT in English but final response in target language. Also tested Mixed-Language CoT and Two-Stage Training. Released LC2024 benchmark for Irish language.

Result: Outperformed other baselines with up to 28.33% improvement in low-resource scenarios. Analysis shows explicitly separating language understanding from reasoning enhances cross-lingual reasoning abilities.

Conclusion: Provides practical pathway for multilingual reasoning without extensive retraining in every low-resource language, despite data scarcity. The approach leverages LLMs’ English-aligned internal representations effectively.

Abstract: Recent advances have enabled Large Language Models (LLMs) to tackle reasoning tasks by generating chain-of-thought (CoT) rationales, yet these gains have largely applied to high-resource languages, leaving low-resource languages behind. In this work, we first investigate CoT techniques in extremely low-resource scenarios through previous prompting, model-editing, and fine-tuning approaches. We introduce English-Pivoted CoT Training, leveraging the insight that LLMs internally operate in a latent space aligned toward the dominant language. Given input in a low-resource language, we perform supervised fine-tuning to generate CoT in English and output the final response in the target language. Across mathematical reasoning benchmarks, our approach outperforms other baselines with up to 28.33% improvement in low-resource scenarios. Our analysis and additional experiments, including Mixed-Language CoT and Two-Stage Training, show that explicitly separating language understanding from reasoning enhances cross-lingual reasoning abilities. To facilitate future work, we also release \emph{LC2024}, the first benchmark for mathematical tasks in Irish, an extremely low-resource and endangered language. Our results and resources highlight a practical pathway to multilingual reasoning without extensive retraining in every extremely low-resource language, despite data scarcity.

[58] The Distribution of Dependency Distance and Hierarchical Distance in Contemporary Written Japanese and Its Influencing Factors

Linxuan Wang, Shuiyuan Yu

Main category: cs.CL

TL;DR: This study examines how dependency distance and hierarchical distance relate in Japanese, finding that predicate valency drives the trade-off between linear and hierarchical complexity.

Details

Motivation: To understand the relationship between dependency distance and hierarchical distance in Japanese syntax and identify the underlying factors that influence their trade-off.

Method: Analyzed probability distributions of DD and HD using the Balanced Corpus of Contemporary Written Japanese, comparing distributions with and without fixed sentence length, and examining changes in MDD and MHD as sentence length increases.

Result: Predicate valency is the key factor behind the trade-off between MDD and MHD. Native speakers regulate linear and hierarchical complexity through predicate valency, and the relative sizes of MDD and MHD depend on whether valency thresholds are reached.

Conclusion: Predicate valency significantly affects both DD and HD distributions, with greater impact on HD than DD, resulting in lower mean MDD compared to MHD and explaining the observed probability distribution differences.

Abstract: To explore the relationship between dependency distance (DD) and hierarchical distance (HD) in Japanese, we compared the probability distributions of DD and HD with and without sentence length fixed, and analyzed the changes in mean dependency distance (MDD) and mean hierarchical distance (MHD) as sentence length increases, along with their correlation coefficient based on the Balanced Corpus of Contemporary Written Japanese. It was found that the valency of the predicates is the underlying factor behind the trade-off relation between MDD and MHD in Japanese. Native speakers of Japanese regulate the linear complexity and hierarchical complexity through the valency of the predicates, and the relative sizes of MDD and MHD depend on whether the threshold of valency has been reached. Apart from the cognitive load, the valency of the predicates also affects the probability distributions of DD and HD. The effect of the valency of the predicates on the distribution of HD is greater than on that of DD, which leads to differences in their probability distributions and causes the mean of MDD to be lower than that of MHD.

[59] A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency

Sihyeong Park, Sungryeol Jeon, Chaelyn Lee, Seokhun Jeon, Byung-Soo Kim, Jemin Lee

Main category: cs.CL

TL;DR: This paper provides a comprehensive evaluation of 25 open-source and commercial LLM inference engines, examining their ease-of-use, deployment, scalability, and performance characteristics while exploring their design goals and optimization techniques.

Details

Motivation: LLM inference costs are rising due to complex workloads like chain-of-thought and agent services that require repeated model invocations. While optimization methods exist, the diversity of service requirements makes method selection challenging, and there's a lack of systematic study on specialized inference engines.

Method: The authors conducted a comprehensive evaluation of 25 open-source and commercial inference engines, examining them across multiple dimensions including ease-of-use, deployment, general-purpose support, scalability, and suitability for throughput/latency-aware computation. They also analyzed optimization techniques and ecosystem maturity.

Result: The study provides systematic insights into inference engine capabilities, design goals, and optimization support. It assesses ecosystem maturity for open-source solutions and performance/cost policies for commercial offerings.

Conclusion: The paper outlines future research directions including support for complex LLM-based services, hardware diversity, and enhanced security. It offers practical guidance for selecting and designing optimized inference engines and provides a public repository to track developments in this fast-evolving field.

Abstract: Large language models (LLMs) are widely applied in chatbots, code generators, and search engines. Workload such as chain-of-throught, complex reasoning, agent services significantly increase the inference cost by invoke the model repeatedly. Optimization methods such as parallelism, compression, and caching have been adopted to reduce costs, but the diverse service requirements make it hard to select the right method. Recently, specialized LLM inference engines have emerged as a key component for integrating the optimization methods into service-oriented infrastructures. However, a systematic study on inference engines is still lacking.This paper provides a comprehensive evaluation of 25 open-source and commercial inference engines. We examine each inference engine in terms of ease-of-use, ease-of-deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation. Furthermore, we explore the design goals of each inference engine by investigating the optimization techniques it supports. In addition, we assess the ecosystem maturity of open source inference engines and handle the performance and cost policy of commercial solutions.We outline future research directions that include support for complex LLM-based services, support of various hardware, and enhanced security, offering practical guidance to researchers and developers in selecting and designing optimized LLM inference engines. We also provide a public repository to continually track developments in this fast-evolving field: \href{https://github.com/sihyeong/Awesome-LLM-Inference-Engine}{https://github.com/sihyeong/Awesome-LLM-Inference-Engine}.

[60] Enhancing Large Language Models for Detecting Mental Manipulation via Annotation-Free Data Augmentation and Anti-Curriculum Distillation

Yuansheng Gao, Han Bao, Tong Zhang, Bin Li, Jixiang Luo, Ronghao Chen, Zonghui Wang, Wenzhi Chen

Main category: cs.CL

TL;DR: MentalMAC is a framework that enhances LLMs’ ability to detect mental manipulation in dialogues using data augmentation, multi-task supervision, and progressive distillation, achieving significant performance improvements over baselines.

Details

Motivation: Mental manipulation is a serious psychological threat but detection faces challenges including insufficient training data, covert nature of manipulation, and lack of real-world datasets.

Method: Three key components: EvoSA (annotation-free data augmentation using evolutionary operations and speech act theory), teacher-model-generated multi-task supervision, and progressive task-level anti-curriculum distillation.

Result: Achieves up to 25.9% improvement in F1mac and 8.1% in accuracy over best-performing baselines, outperforming commercial LLMs like GPT-4 and Claude-3.5-Sonnet. Created ReaMent dataset with 5,000 real-world dialogue samples.

Conclusion: MentalMAC effectively addresses the challenges in mental manipulation detection and significantly enhances detection capabilities, with the framework also aiding in creating valuable real-world datasets.

Abstract: Mental manipulation is a subtle yet pervasive form of psychological abuse that poses serious threats to mental health. Nevertheless, detecting mental manipulation remains a largely underexplored research problem. The field faces three major challenges: (i) insufficient and hard-to-obtain training data; (ii) the covert nature of mental manipulation, which hinders detection; and (iii) the lack of real-world datasets. To address these challenges, we propose MentalMAC, a novel framework that enhances large language models’ ability to detect elements of mental manipulation in multi-turn dialogue. Our approach consists of three key components: EvoSA, an annotation-free data augmentation method based on evolutionary operations and speech act theory; teacher-model-generated multi-task supervision; and progressive task-level anti-curriculum distillation. We then constructed the ReaMent dataset, comprising 5,000 real-world dialogue samples, utilizing MentalMAC-distilled models to aid in human annotation. Vast experiments show that MentalMAC achieves up to 25.9% improvement in F1mac and 8.1% in accuracy over the best-performing baseline, outperforming commercial LLMs such as GPT-4 and Claude-3.5-Sonnet. Warning: This paper contains content that may be offensive to the reader.

[61] Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Hyungjoo Chae, Sunghwan Kim, Junhee Cho, Seungone Kim, Seungjun Moon, Gyeom Hwangbo, Dongha Lim, Minjin Kim, Yeonjun Hwang, Minju Gwak, Dongwook Choi, Minseok Kang, Gwanhoon Im, ByeongUng Cho, Hyojun Kim, Jun Hee Han, Taeyoon Kwon, Minju Kim, Beong-woo Kwak, Dongjin Kang, Jinyoung Yeo

Main category: cs.CL

TL;DR: Web-Shepherd is the first process reward model for web navigation that evaluates trajectories step-by-step, achieving 30% better accuracy than GPT-4o and improving performance by 10.9 points while reducing costs by 10x.

Details

Motivation: Web navigation requires long-horizon sequential decision making beyond typical MLLM capabilities, but specialized reward models for training and testing have been lacking. Current approaches using MLLMs as reward models are too slow and expensive for real-world deployment.

Method: Created WebPRM Collection (40K step-level preference pairs with annotated checklists) and WebRewardBench benchmark. Developed Web-Shepherd PRM that assesses web navigation trajectories at step-level granularity.

Result: Web-Shepherd achieves ~30 points better accuracy than GPT-4o on WebRewardBench. When used as verifier with GPT-4o-mini policy on WebArena-lite, achieves 10.9 points better performance at 10x lower cost compared to using GPT-4o-mini as verifier.

Conclusion: Web-Shepherd provides an effective, cost-efficient solution for web navigation reward modeling, enabling better performance at significantly reduced costs compared to using general-purpose MLLMs.

Abstract: Web navigation is a unique domain that can automate many repetitive real-life tasks and is challenging as it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks. Yet, specialized reward models for web navigation that can be utilized during both training and test-time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have utilized MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, in this work, we propose the first process reward model (PRM) called Web-Shepherd which could assess web navigation trajectories in a step-level. To achieve this, we first construct the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. Next, we also introduce the WebRewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that our Web-Shepherd achieves about 30 points better accuracy compared to using GPT-4o on WebRewardBench. Furthermore, when testing on WebArena-lite by using GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.9 points better performance, in 10 less cost compared to using GPT-4o-mini as the verifier. Our model, dataset, and code are publicly available at LINK.

[62] UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions

Wenkang Han, Zhixiong Zeng, Jing Huang, Shu Jiang, Liming Zheng, Longrong Yang, Haibo Qiu, Chang Yao, Jingyuan Chen, Lin Ma

Main category: cs.CL

TL;DR: UITron-Speech is the first end-to-end GUI agent that processes speech instructions and screenshots to predict user actions, addressing text-based limitations in hands-free scenarios through synthesized speech datasets and mixed-modality training.

Details

Motivation: Text-based GUI agents limit accessibility and convenience in hands-free scenarios. Speech input offers a more natural and accessible alternative for human-computer interaction.

Method: Uses synthesized speech datasets from TTS models, mixed-modality training to address modality imbalance, and a two-step grounding refinement method to correct localization errors.

Result: Achieves robust performance and superior adaptability across multiple benchmarks, demonstrating the feasibility of speech-driven GUI agents.

Conclusion: Speech-driven GUI agents show great potential for more accessible and intelligent human-computer interaction, with UITron-Speech providing a viable solution.

Abstract: Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this issue, we propose replacing text with speech as the instruction input modality for GUI agents, and introduce UITron-Speech, which is the first end-to-end GUI agent capable of directly processing speech instructions and on-device screenshots to predict user actions. To tackle the problem of data scarcity, we synthesize high-quality speech instruction datasets using a random-speaker text-to-speech model. Additionally, we design a mixed-modality training strategy to mitigate the inherent modality imbalance in pre-trained foundation models. Furthermore, we conduct a statistical analysis of the distribution of GUI grounding prediction errors and propose a training-free two-step grounding refinement method to alleviate minor localization deviations. Extensive experiments on multiple benchmarks demonstrate that UITron-Speech achieves robust performance and superior adaptability, underscoring the feasibility and potential of speech-driven GUI agents for more accessible and intelligent human-computer interaction. Our code and datasets are available at https://github.com/UITron-hub/UITron-Speech.

[63] The Structure-Content Trade-off in Knowledge Graph Retrieval

Valentin Six, Evan Dufraisse, Gaël de Chalendar

Main category: cs.CL

TL;DR: Subquestion-based retrieval improves content precision but yields disjoint subgraphs, while question-based retrieval maintains structure at cost of relevance. Optimal performance occurs between these extremes.

Details

Motivation: To understand how retrieval design shapes LLM performance when using knowledge graphs for factual reasoning, specifically examining how question decomposition affects retrieved subgraph content and structure.

Method: Used a hybrid retrieval function that controls importance of initial question and subquestions to examine how question decomposition changes retrieved subgraph content and structure.

Result: Subquestion-based retrieval improves content precision but yields disjoint subgraphs, while question-based retrieval maintains structure at the cost of relevance.

Conclusion: Balancing retrieval content and structure is key to effective LLM reasoning over structured knowledge, with optimal performance arising between the extremes of question-based and subquestion-based retrieval.

Abstract: Large Language Models (LLMs) increasingly rely on knowledge graphs for factual reasoning, yet how retrieval design shapes their performance remains unclear. We examine how question decomposition changes the retrieved subgraph’s content and structure. Using a hybrid retrieval function that controls the importance of initial question and subquestions, we show that subquestion-based retrieval improves content precision, but yields disjoint subgraphs, while question-based retrieval maintains structure at the cost of relevance. Optimal performance arises between these extremes, revealing that balancing retrieval content and structure is key to effective LLM reasoning over structured knowledge.

[64] Co-NAML-LSTUR: A Combined Model with Attentive Multi-View Learning and Long- and Short-term User Representations for News Recommendation

Minh Hoang Nguyen, Thuat Thien Nguyen, Minh Nhat Ta, Tung Le, Huy Tien Nguyen

Main category: cs.CL

TL;DR: Co-NAML-LSTUR is a hybrid news recommendation framework that combines multi-view news encoding with hierarchical user modeling for improved performance on limited data resources.

Details

Motivation: To address the challenge of jointly modeling multi-view news representations and capturing dynamic user interests (both short- and long-term) in news recommendation systems, especially when training data is limited.

Method: Integrates NAML for attentive multi-view news encoding and LSTUR for hierarchical user modeling, using BERT-based embeddings to enhance semantic representation.

Result: Significantly outperforms strong baselines on MIND-small and MIND-large benchmarks, achieving improvements over NRMS by 1.55% in AUC and 1.15% in MRR, and over NAML by 2.45% in AUC and 1.71% in MRR.

Conclusion: The hybrid model effectively combines multi-view news modeling with dual-scale user representations for practical, resource-limited scenarios, demonstrating efficiency-focused design rather than claiming absolute state-of-the-art.

Abstract: News recommendation systems play a critical role in alleviating information overload by delivering personalized content. A key challenge lies in jointly modeling multi-view representations of news articles and capturing the dynamic, dual-scale nature of user interests-encompassing both short- and long-term preferences. Prior methods often rely on single-view features or insufficiently model user behavior across time. In this work, we introduce Co-NAML-LSTUR, a hybrid news recommendation framework that integrates NAML for attentive multi-view news encoding and LSTUR for hierarchical user modeling, designed for training on limited data resources. Our approach leverages BERT-based embeddings to enhance semantic representation. We evaluate Co-NAML-LSTUR on two widely used benchmarks, MIND-small and MIND-large. Results show that our model significantly outperforms strong baselines, achieving improvements over NRMS by 1.55% in AUC and 1.15% in MRR, and over NAML by 2.45% in AUC and 1.71% in MRR. These findings highlight the effectiveness of our efficiency-focused hybrid model, which combines multi-view news modeling with dual-scale user representations for practical, resource-limited resources rather than a claim to absolute state-of-the-art (SOTA). The implementation of our model is publicly available at https://github.com/MinhNguyenDS/Co-NAML-LSTUR

[65] On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey

Meishan Zhang, Xin Zhang, Xinping Zhao, Shouzheng Huang, Baotian Hu, Min Zhang

Main category: cs.CL

TL;DR: This survey provides a comprehensive overview of general-purpose text embeddings (GPTE) in the era of pretrained language models (PLMs), examining PLMs’ fundamental and advanced roles in GPTE development and highlighting future research directions.

Details

Motivation: Text embeddings have become increasingly important for various NLP tasks, and with the emergence of PLMs, there's growing interest in understanding how PLMs drive the development of general-purpose text embeddings and their future potential.

Method: The survey examines GPTE architecture and PLMs’ roles in embedding extraction, expressivity enhancement, training strategies, learning objectives, and data construction, then explores advanced roles including multilingual support, multimodal integration, code understanding, and scenario-specific adaptation.

Result: The survey provides a comprehensive framework for understanding PLMs’ contributions to GPTE development, covering both fundamental architectural elements and advanced capabilities enabled by modern language models.

Conclusion: GPTE represents a significant advancement in text representation, with PLMs playing crucial roles in its development, and future research should focus on areas like ranking integration, safety considerations, bias mitigation, structural information incorporation, and cognitive extensions of embeddings.

Abstract: Text embeddings have attracted growing interest due to their effectiveness across a wide range of natural language processing (NLP) tasks, including retrieval, classification, clustering, bitext mining, and summarization. With the emergence of pretrained language models (PLMs), general-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations. The general architecture of GPTE typically leverages PLMs to derive dense text representations, which are then optimized through contrastive learning on large-scale pairwise datasets. In this survey, we provide a comprehensive overview of GPTE in the era of PLMs, focusing on the roles PLMs play in driving its development. We first examine the fundamental architecture and describe the basic roles of PLMs in GPTE, i.e., embedding extraction, expressivity enhancement, training strategies, learning objectives, and data construction. We then describe advanced roles enabled by PLMs, including multilingual support, multimodal integration, code understanding, and scenario-specific adaptation. Finally, we highlight potential future research directions that move beyond traditional improvement goals, including ranking integration, safety considerations, bias mitigation, structural information incorporation, and the cognitive extension of embeddings. This survey aims to serve as a valuable reference for both newcomers and established researchers seeking to understand the current state and future potential of GPTE.

[66] CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis

Yuzhuang Xu, Xu Han, Yuanchi Zhang, Yixuan Wang, Yijun Liu, Shiyu Ji, Qingfu Zhu, Wanxiang Che

Main category: cs.CL

TL;DR: CAMERA introduces micro-expert as a finer-grained compression unit for Mixture-of-Experts LLMs, enabling efficient pruning and quantization while maintaining performance.

Details

Motivation: MoE models suffer from substantial computational and storage overheads without proportional performance gains as expert parameters increase, and existing compression methods face challenges in both performance and efficiency.

Method: Views MoE layers as mixtures of micro-experts, proposes CAMERA framework for identifying redundancy, CAMERA-P for structured micro-expert pruning, and CAMERA-Q for mixed-precision quantization of micro-experts.

Result: CAMERA-P outperforms baselines under 20-60% pruning ratios across 9 tasks, CAMERA-Q achieves superior results under aggressive 2-bit quantization, and enables complete micro-expert analysis of Qwen2-57B-A14B in <5 minutes on a single A100 GPU.

Conclusion: Micro-expert level compression provides an effective approach for reducing MoE model overheads while maintaining strong performance, offering significant computational efficiency improvements.

Abstract: Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing micro-expert as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present CAMERA, a lightweight and training-free framework for identifying micro-expert redundancy. Our analysis uncovers significant variance in micro-expert contributions during decoding. Based on this insight, we further propose CAMERA-P, a structured micro-expert pruning framework, and CAMERA-Q, a mixed-precision quantization idea designed for micro-experts. Extensive experiments on nine downstream tasks show that CAMERA-P consistently outperforms strong baselines under pruning ratios ranging from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under aggressive 2-bit quantization, surpassing existing matrix- and channel-level ideas. Notably, our method enables complete micro-expert analysis of Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.

[67] Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset

Leroy Z. Wang

Main category: cs.CL

TL;DR: In-context concept learning reveals upward monotonicity bias in LLMs that direct prompting doesn’t detect.

Details

Motivation: To uncover implicit biases in large language models that may not be apparent through standard testing methods.

Method: Used in-context concept learning experiments with a specialized dataset of concept learning tasks.

Result: Language models show bias toward upward monotonicity in quantifiers during concept learning, but this bias is less apparent in direct prompting without concept learning components.

Conclusion: In-context concept learning is an effective method for discovering hidden biases in language models that standard testing approaches may miss.

Abstract: We introduce a dataset of concept learning tasks that helps uncover implicit biases in large language models. Using in-context concept learning experiments, we found that language models may have a bias toward upward monotonicity in quantifiers; such bias is less apparent when the model is tested by direct prompting without concept learning components. This demonstrates that in-context concept learning can be an effective way to discover hidden biases in language models.

[68] Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language

Adity Khisa, Nusrat Jahan Lia, Tasnim Mahfuz Nafis, Zarif Masud, Tanzir Pial, Shebuti Rayana, Ahmedul Kabir

Main category: cs.CL

TL;DR: Fine-tuning multilingual transformer models on Bangla-transliterated Chakma corpus improves performance for this low-resource Indo-Aryan language, achieving up to 73.54% token accuracy.

Details

Motivation: Chakma is an underrepresented Indo-Aryan language with limited available data, making it challenging for language models to handle effectively.

Method: Created a novel corpus of Bangla-transliterated Chakma from literature, validated by native speakers, and fine-tuned six encoder-based transformer models (multilingual, regional, and English variants) on masked language modeling tasks.

Result: Fine-tuned multilingual models outperformed pre-trained counterparts, achieving up to 73.54% token accuracy and perplexity as low as 2.90. Analysis showed data quality impacts performance and OCR pipelines have limitations for Indic scripts.

Conclusion: Bangla-transliterated Chakma is effective for transfer learning, and the released dataset encourages further research on multilingual modeling for low-resource languages.

Abstract: As an Indo-Aryan language with limited available data, Chakma remains largely underrepresented in language models. In this work, we introduce a novel corpus of contextually coherent Bangla-transliterated Chakma, curated from Chakma literature, and validated by native speakers. Using this dataset, we fine-tune six encoder-based transformer models, including multilingual (mBERT, XLM-RoBERTa, DistilBERT), regional (BanglaBERT, IndicBERT), and monolingual English (DeBERTaV3) variants on masked language modeling (MLM) tasks. Our experiments show that fine-tuned multilingual models outperform their pre-trained counterparts when adapted to Bangla-transliterated Chakma, achieving up to 73.54% token accuracy and a perplexity as low as 2.90. Our analysis further highlights the impact of data quality on model performance and shows the limitations of OCR pipelines for morphologically rich Indic scripts. Our research demonstrates that Bangla-transliterated Chakma can be very effective for transfer learning for Chakma language, and we release our dataset to encourage further research on multilingual language modeling for low-resource languages.

[69] Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning

Wenjin Liu, Haoran Luo, Xueyuan Lin, Haoming Liu, Tiesunlong Shen, Jiapu Wang, Rui Mao, Erik Cambria

Main category: cs.CL

TL;DR: Prompt-R1 is a reinforcement learning framework that uses a small LLM to generate prompts for a large LLM, improving problem-solving without user input.

Details

Motivation: Users often struggle to provide effective prompts for complex problems, limiting LLM performance. This framework automates prompt generation to enhance LLM capabilities.

Method: End-to-end RL framework with small LLM generating prompts for large LLM in multi-turn interactions. Uses dual-constrained reward for correctness, quality, and reasoning accuracy.

Result: Significantly outperforms baseline models across multiple public datasets and tasks.

Conclusion: Prompt-R1 provides an effective plug-and-play solution to enhance LLM performance through automated prompt generation, making complex problem-solving more accessible.

Abstract: Recently, advanced large language models (LLMs) have emerged at an increasingly rapid pace. However, when faced with complex problems, most users are often unable to provide accurate and effective prompts to interact with LLMs, thus limiting the performance of LLMs. To address this challenge, we propose Prompt-R1, an end-to-end reinforcement learning framework that uses a small-scale LLM to collaborate with large-scale LLMs, replacing user interaction to solve problems better. This collaboration is cast as a multi-turn prompt interaction, where the small-scale LLM thinks and generates prompts, and the large-scale LLM performs complex reasoning. A dual-constrained reward is designed to optimize for correctness, generation quality, and reasoning accuracy. Prompt-R1 provides a plug-and-play framework that supports both inference and training with various large-scale LLMs. Experiments on multiple public datasets show that Prompt-R1 significantly outperforms baseline models across tasks. Our code is publicly available at https://github.com/QwenQKing/Prompt-R1.

[70] AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following

Yun He, Wenzhe Li, Hejia Zhang, Songlin Li, Karishma Mandyam, Sopan Khosla, Yuanhao Xiong, Nanshu Wang, Xiaoliang Peng, Beibin Li, Shengjie Bi, Shishir G. Patil, Qi Qi, Shengyu Feng, Julian Katz-Samuels, Richard Yuanzhe Pang, Sujan Gonugondla, Hunter Lang, Yue Yu, Yundi Qian, Maryam Fazel-Zarandi, Licheng Yu, Amine Benhalloum, Hany Awadalla, Manaal Faruqui

Main category: cs.CL

TL;DR: RIFL is a novel training pipeline that uses rubric generation and verification to improve LLMs’ ability to follow complex, multi-turn instructions, achieving 6.7% improvement on the new AdvancedIF benchmark.

Details

Motivation: Advanced instruction following for complex, multi-turn, and system-prompted instructions remains challenging for LLMs, with limitations in evaluation benchmarks and reliable training signals.

Method: Proposed RIFL pipeline with rubric generation, finetuned rubric verifier, and reward shaping for reinforcement learning on instruction following.

Result: RIFL achieves 6.7% absolute gain on AdvancedIF benchmark and strong results on public benchmarks, with ablation studies confirming component effectiveness.

Conclusion: Rubrics are established as powerful tools for both training and evaluating advanced instruction following in LLMs, enabling more capable AI systems.

Abstract: Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF)-especially for complex, multi-turn, and system-prompted instructions-remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpretable reward signals. In this work, we introduce AdvancedIF (we will release this benchmark soon), a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs ability to follow complex, multi-turn, and system-level instructions. We further propose RIFL (Rubric-based Instruction-Following Learning), a novel post-training pipeline that leverages rubric generation, a finetuned rubric verifier, and reward shaping to enable effective reinforcement learning for instruction following. Extensive experiments demonstrate that RIFL substantially improves the instruction-following abilities of LLMs, achieving a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Our ablation studies confirm the effectiveness of each component in RIFL. This work establishes rubrics as a powerful tool for both training and evaluating advanced IF in LLMs, paving the way for more capable and reliable AI systems.

[71] Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction

Zhaopei Huang, Qifeng Dai, Guozheng Wu, Xiaopeng Wu, Kehan Chen, Chuan Yu, Xubin Li, Tiezheng Ge, Wenxuan Wang, Qin Jin

Main category: cs.CL

TL;DR: PAL-Bench is a new benchmark for evaluating personalization capabilities in service-oriented dialogue assistants, featuring PAL-Set (first Chinese multi-session dataset) and H²Memory framework for improved personalized interactions.

Details

Motivation: Existing approaches overlook long-term interaction complexities and fail to capture users' subjective characteristics in service-oriented human-agent interactions, creating a need for better personalization evaluation.

Method: Developed a multi-step LLM-based synthesis pipeline to create PAL-Set dataset, and proposed H²Memory - a hierarchical and heterogeneous memory framework with retrieval-augmented generation for personalized response generation.

Result: Comprehensive experiments on PAL-Bench and external datasets demonstrate the effectiveness of the proposed memory framework in improving personalized service-oriented interactions.

Conclusion: The work addresses gaps in long-term personalization evaluation and provides both a benchmark (PAL-Bench) and a framework (H²Memory) to enhance personalized dialogue assistants in service-oriented contexts.

Abstract: With the rise of smart personal devices, service-oriented human-agent interactions have become increasingly prevalent. This trend highlights the need for personalized dialogue assistants that can understand user-specific traits to accurately interpret requirements and tailor responses to individual preferences. However, existing approaches often overlook the complexities of long-term interactions and fail to capture users’ subjective characteristics. To address these gaps, we present PAL-Bench, a new benchmark designed to evaluate the personalization capabilities of service-oriented assistants in long-term user-agent interactions. In the absence of available real-world data, we develop a multi-step LLM-based synthesis pipeline, which is further verified and refined by human annotators. This process yields PAL-Set, the first Chinese dataset comprising multi-session user logs and dialogue histories, which serves as the foundation for PAL-Bench. Furthermore, to improve personalized service-oriented interactions, we propose H$^2$Memory, a hierarchical and heterogeneous memory framework that incorporates retrieval-augmented generation to improve personalized response generation. Comprehensive experiments on both our PAL-Bench and an external dataset demonstrate the effectiveness of the proposed memory framework.

[72] AICC: Parse HTML Finer, Make Models Better – A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

Ren Ma, Jiantao Qiu, Chao Xu, Pei Chu, Kaiwen Liu, Pengli Ren, Yuan Qu, Jiahui Peng, Linfeng Hou, Mengjie Liu, Lindong Lu, Wenchang Ning, Jia Yu, Rui Min, Jin Shi, Haojiong Chen, Peng Zhang, Wenjian Zhang, Qian Jiang, Zengjie Hu, Guoqiang Yang, Zhenxiang Li, Fukai Shang, Runyuan Ma, Chenlin Su, Zhongying Tu, Wentao Zhang, Dahua Lin, Conghui He

Main category: cs.CL

TL;DR: MinerU-HTML is a novel HTML-to-text extraction pipeline that uses a 0.6B-parameter language model for sequence labeling, significantly outperforming heuristic methods like Trafilatura and improving downstream model performance by preserving structured elements better.

Details

Motivation: Current web data curation focuses on filtering and deduplication while treating HTML extraction as fixed pre-processing. Heuristic extractors like Trafilatura struggle to preserve document structure and corrupt elements like formulas, codes, and tables.

Method: Reformulates content extraction as sequence labeling using a 0.6B-parameter language model. Uses two-stage formatting pipeline that categorizes semantic elements before converting to Markdown. Constructed AICC corpus from Common Crawl using MinerU-HTML.

Result: Achieved 81.8% ROUGE-N F1 vs Trafilatura’s 63.6% on MainWebBench (7,887 pages). Exceptional structured element preservation: 90.9% for code blocks, 94.0% for formulas. Models trained on AICC (62B tokens) achieved 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08pp.

Conclusion: HTML extraction quality significantly impacts model capabilities and is a critical, often underestimated component of web corpus construction. MinerU-HTML’s model-based approach is inherently scalable compared to heuristic methods.

Abstract: While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication,treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, codes, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura’s 63.6%, with exceptional structured element preservation (90.9% for code blocks, 94.0% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08pp-providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.

[73] CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning

Jie He, Richard He Bai, Sinead Williamson, Jeff Z. Pan, Navdeep Jaitly, Yizhe Zhang

Main category: cs.CL

TL;DR: CLaRa is a unified framework that performs embedding-based compression and joint optimization in continuous space to address issues in retrieval-augmented generation, achieving state-of-the-art performance on QA benchmarks.

Details

Motivation: Retrieval-augmented generation (RAG) enhances LLMs with external knowledge but suffers from long contexts and disjoint retrieval-generation optimization.

Method: Proposes CLaRa framework with SCP data synthesis for semantically rich compressed vectors, and trains reranker and generator end-to-end via single language modeling loss using differentiable top-k estimator.

Result: Achieves state-of-the-art compression and reranking performance across multiple QA benchmarks, often surpassing text-based fine-tuned baselines.

Conclusion: The unified optimization in CLaRa aligns retrieval relevance with answer quality, providing an effective solution for retrieval-augmented generation challenges.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long contexts and disjoint retrieval-generation optimization. In this work, we propose CLaRa (Continuous Latent Reasoning), a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. To obtain semantically rich and retrievable compressed vectors, we introduce SCP, a key-preserving data synthesis framework using QA and paraphrase supervision. CLaRa then trains the reranker and generator end-to-end via a single language modeling loss, with gradients flowing through both modules using a differentiable top-k estimator. Theoretically, this unified optimization aligns retrieval relevance with answer quality. Experiments across multiple QA benchmarks show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.

[74] DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, Pang Wei Koh

Main category: cs.CL

TL;DR: RLER enables training deep research models for long-form tasks by evolving rubrics during training, resulting in DR Tulu-8B that matches proprietary systems while being smaller and cheaper.

Details

Motivation: Existing open deep research models are trained on short-form QA tasks with verifiable rewards, which doesn't extend to realistic long-form research tasks.

Method: Reinforcement Learning with Evolving Rubrics (RLER) - constructing and maintaining rubrics that co-evolve with the policy model during training to provide discriminative, on-policy feedback.

Result: DR Tulu-8B substantially outperforms existing open deep research models and matches/exceeds proprietary systems across four long-form benchmarks in science, healthcare and general domains.

Conclusion: RLER enables effective training of open deep research models for long-form tasks, with DR Tulu-8B demonstrating state-of-the-art performance while being more efficient than proprietary alternatives.

Abstract: Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.

[75] A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction

Farzad Ahmed, Joniel Augustine Jerome, Meliha Yetisgen, Özlem Uzuner

Main category: cs.CL

TL;DR: RDP outperforms zero-shot and static prompting for medical error detection and correction, reducing false positives by 15% and improving recall by 5-10%.

Details

Motivation: Clinical documentation errors compromise patient safety, and LLMs may help detect/correct them, but their behavior under different prompting strategies is unclear.

Method: Evaluated 9 instruction-tuned LLMs using MEDEC dataset with zero-shot, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for error flag detection, error sentence detection, and error correction.

Result: RDP reduced FPR by ~15%, improved recall by 5-10% in error sentence detection, and generated more contextually accurate corrections compared to other methods.

Conclusion: Retrieval-augmented dynamic prompting improves detection accuracy, reduces false positives, and enhances reliability of medical error correction across diverse LLMs.

Abstract: Objective: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety. Large language models (LLMs) may help detect and correct such errors, but their behavior under different prompting strategies remains unclear. We evaluate zero-shot prompting, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for three subtasks of medical error processing: error flag detection, error sentence detection, and error correction. Methods: Using the MEDEC dataset, we evaluated nine instruction-tuned LLMs (GPT, Claude, Gemini, and OpenAI o-series models). We measured performance using accuracy, recall, false-positive rate (FPR), and an aggregate score of ROUGE-1, BLEURT, and BERTScore for error correction. We also analyzed example outputs to identify failure modes and differences between LLM and clinician reasoning. Results: Zero-shot prompting showed low recall in both detection tasks, often missing abbreviation-heavy or atypical errors. SPR improved recall but increased FPR. Across all nine LLMs, RDP reduced FPR by about 15 percent, improved recall by 5 to 10 percent in error sentence detection, and generated more contextually accurate corrections. Conclusion: Across diverse LLMs, RDP outperforms zero-shot and SPR prompting. Using retrieved exemplars improves detection accuracy, reduces false positives, and enhances the reliability of medical error correction.

[76] MTA: A Merge-then-Adapt Framework for Personalized Large Language Model

Xiaopeng Li, Yuanjin Zheng, Wanyu Wang, wenlin zhang, Pengyue Jia, Yiqi Wang, Maolin Wang, Xuetao Wei, Xiangyu Zhao

Main category: cs.CL

TL;DR: MTA is a Merge-then-Adapt framework for Personalized LLMs that addresses scalability and sparse data issues through meta-LoRA bank construction, adaptive LoRA fusion, and LoRA stacking for few-shot personalization.

Details

Motivation: Current PLLM approaches face scalability issues with linear storage costs per user and suboptimal performance for users with sparse data.

Method: Three-stage framework: (1) construct shared Meta-LoRA Bank with anchor users, (2) adaptive LoRA fusion to dynamically merge relevant meta-LoRAs, (3) LoRA stacking with ultra-low-rank LoRA for few-shot personalization.

Result: Outperforms existing SOTA methods on LaMP benchmark across multiple tasks.

Conclusion: MTA provides scalable and effective personalization for LLMs while addressing storage and sparse data challenges.

Abstract: Personalized Large Language Models (PLLMs) aim to align model outputs with individual user preferences, a crucial capability for user-centric applications. However, the prevalent approach of fine-tuning a separate module for each user faces two major limitations: (1) storage costs scale linearly with the number of users, rendering the method unscalable; and (2) fine-tuning a static model from scratch often yields suboptimal performance for users with sparse data. To address these challenges, we propose MTA, a Merge-then-Adapt framework for PLLMs. MTA comprises three key stages. First, we construct a shared Meta-LoRA Bank by selecting anchor users and pre-training meta-personalization traits within meta-LoRA modules. Second, to ensure scalability and enable dynamic personalization combination beyond static models, we introduce an Adaptive LoRA Fusion stage. This stage retrieves and dynamically merges the most relevant anchor meta-LoRAs to synthesize a user-specific one, thereby eliminating the need for user-specific storage and supporting more flexible personalization. Third, we propose a LoRA Stacking for Few-Shot Personalization stage, which applies an additional ultra-low-rank, lightweight LoRA module on top of the merged LoRA. Fine-tuning this module enables effective personalization under few-shot settings. Extensive experiments on the LaMP benchmark demonstrate that our approach outperforms existing SOTA methods across multiple tasks.

[77] BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali

Abdullah Al Sefat

Main category: cs.CL

TL;DR: BengaliFig is a Bengali challenge dataset with 435 culturally rich riddles that reveals LLM weaknesses in figurative and cultural reasoning for low-resource languages.

Details

Motivation: To address the gap in evaluating LLMs on figurative and culturally grounded reasoning in low-resource contexts, specifically for Bengali language.

Method: Created a dataset of 435 Bengali riddles from oral/literary traditions, annotated across 5 dimensions, and converted to multiple-choice format using AI-assisted pipeline. Evaluated 8 frontier LLMs with zero-shot and few-shot chain-of-thought prompting.

Result: LLMs showed consistent weaknesses in metaphorical and culturally specific reasoning, highlighting limitations in low-resource cultural contexts.

Conclusion: BengaliFig provides a diagnostic tool for evaluating LLM robustness in heritage-aware NLP and promotes more inclusive evaluation practices.

Abstract: Large language models excel on broad multilingual benchmarks but remain to be evaluated extensively in figurative and culturally grounded reasoning, especially in low-resource contexts. We present BengaliFig, a compact yet richly annotated challenge set that targets this gap in Bengali, a widely spoken low-resourced language. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five orthogonal dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty, and is automatically converted to multiple-choice format through a constraint-aware, AI-assisted pipeline. We evaluate eight frontier LLMs from major providers under zero-shot and few-shot chain-of-thought prompting, revealing consistent weaknesses in metaphorical and culturally specific reasoning. BengaliFig thus contributes both a diagnostic probe for evaluating LLM robustness in low-resource cultural contexts and a step toward inclusive and heritage-aware NLP evaluation.

cs.CV

David Amebley, Sayanton Dibbo

Main category: cs.CV

TL;DR: This paper introduces a neuroscience-inspired topological regularization framework to make multi-modal vision-language models more resilient against membership inference attacks while maintaining model utility.

Details

Motivation: Multi-modal models are vulnerable to privacy attacks like membership inference, but current research focuses mainly on unimodal systems. It's unknown whether neuro-inspired multi-modal models are resilient against such privacy attacks.

Method: Proposed a topological regularization framework (tau) applied to three VLMs (BLIP, PaliGemma 2, ViT-GPT2) across three datasets (COCO, CC3M, NoCaps). The tau > 0 configuration defines NEURO variants of VLMs.

Result: NEURO VLMs showed 24% mean ROC-AUC drop in MIA attack success on BLIP with COCO dataset, while maintaining similar model utility (MPNet and ROUGE-2 metrics). Results were consistent across other models and datasets.

Conclusion: Neuro-inspired VLMs are more resilient against privacy attacks without significantly compromising model utility, contributing to understanding privacy risks in multi-modal models.

Abstract: In the age of agentic AI, the growing deployment of multi-modal models (MMs) has introduced new attack vectors that can leak sensitive training data in MMs, causing privacy leakage. This paper investigates a black-box privacy attack, i.e., membership inference attack (MIA) on multi-modal vision-language models (VLMs). State-of-the-art research analyzes privacy attacks primarily to unimodal AI-ML systems, while recent studies indicate MMs can also be vulnerable to privacy attacks. While researchers have demonstrated that biologically inspired neural network representations can improve unimodal model resilience against adversarial attacks, it remains unexplored whether neuro-inspired MMs are resilient against privacy attacks. In this work, we introduce a systematic neuroscience-inspired topological regularization (tau) framework to analyze MM VLMs resilience against image-text-based inference privacy attacks. We examine this phenomenon using three VLMs: BLIP, PaliGemma 2, and ViT-GPT2, across three benchmark datasets: COCO, CC3M, and NoCaps. Our experiments compare the resilience of baseline and neuro VLMs (with topological regularization), where the tau > 0 configuration defines the NEURO variant of VLM. Our results on the BLIP model using the COCO dataset illustrate that MIA attack success in NEURO VLMs drops by 24% mean ROC-AUC, while achieving similar model utility (similarities between generated and reference captions) in terms of MPNet and ROUGE-2 metrics. This shows neuro VLMs are comparatively more resilient against privacy attacks, while not significantly compromising model utility. Our extensive evaluation with PaliGemma 2 and ViT-GPT2 models, on two additional datasets: CC3M and NoCaps, further validates the consistency of the findings. This work contributes to the growing understanding of privacy risks in MMs and provides evidence on neuro VLMs privacy threat resilience.

[79] Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

Inferix Team, Tianyu Feng, Yizeng Han, Jiahao He, Yuanyu He, Xi Lin, Teng Liu, Hanfeng Lu, Jiasheng Tang, Wei Wang, Zhiyuan Wang, Jichao Wu, Mingyang Yang, Yinghao Yu, Zeyu Zhang, Bohan Zhuang

Main category: cs.CV

TL;DR: Inferix is a next-generation inference engine designed for world simulation through optimized semi-autoregressive decoding, enabling immersive video generation with interactive streaming and benchmarking capabilities.

Details

Motivation: To advance world models as core simulators for agentic AI, embodied AI, and gaming by enabling long, physically realistic, and interactive high-quality video generation, moving beyond current LLM-centric vision foundation models.

Method: Uses semi-autoregressive (block-diffusion) decoding paradigm that combines diffusion and autoregressive methods, generating video tokens in blocks with diffusion within each block while conditioning on previous ones. Features LLM-style KV Cache management for efficient variable-length generation, interactive video streaming, profiling, and LV-Bench integration.

Result: Enables coherent and stable video sequences, overcoming limitations of standard video diffusion models. Supports efficient, variable-length, high-quality video generation with real-time interaction capabilities.

Conclusion: Inferix represents a dedicated world simulation system distinct from high-concurrency engines and classic video diffusion models, with the goal of advancing world model exploration through community collaboration.

Abstract: World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens in block-applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV Cache management, enabling efficient, variable-length, and high-quality generation. Therefore, Inferix is specifically designed as a next-generation inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. This dedicated focus on world simulation distinctly sets it apart from systems engineered for high-concurrency scenarios (like vLLM or SGLang) and from classic video diffusion models (such as xDiTs). Inferix further enhances its offering with interactive video streaming and profiling, enabling real-time interaction and realistic simulation to accurately model world dynamics. Additionally, it supports efficient benchmarking through seamless integration of LV-Bench, a new fine-grained evaluation benchmark tailored for minute-long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.

[80] Video Object Recognition in Mobile Edge Networks: Local Tracking or Edge Detection?

Kun Guo, Yun Shen, Xijun Wang, Chaoqun You, Yun Rui, Tony Q. S. Quek

Main category: cs.CV

TL;DR: LTED-Ada is a deep reinforcement learning-based algorithm that adaptively selects between local tracking and edge detection for video object recognition, with federated learning enhancement for multi-device scenarios.

Details

Motivation: Resource-constrained devices like traffic cameras struggle with fast and accurate video object recognition. While edge computing offloads computation to servers, the challenge is deciding when to use edge detection vs local tracking.

Method: Formulated long-term optimization problems for single/multi-device scenarios, proposed LTED-Ada using deep reinforcement learning for adaptive selection, and enhanced it with federated learning for multi-device collaboration.

Result: Extensive hardware-in-the-loop experiments using Raspberry Pi 4B devices and PC edge server demonstrated LTED-Ada’s superiority in performance.

Conclusion: LTED-Ada effectively addresses the adaptive selection challenge between edge detection and local tracking, with federated learning improving generalization across devices and requirements.

Abstract: Fast and accurate video object recognition, which relies on frame-by-frame video analytics, remains a challenge for resource-constrained devices such as traffic cameras. Recent advances in mobile edge computing have made it possible to offload computation-intensive object detection to edge servers equipped with high-accuracy neural networks, while lightweight and fast object tracking algorithms run locally on devices. This hybrid approach offers a promising solution but introduces a new challenge: deciding when to perform edge detection versus local tracking. To address this, we formulate two long-term optimization problems for both single-device and multi-device scenarios, taking into account the temporal correlation of consecutive frames and the dynamic conditions of mobile edge networks. Based on the formulation, we propose the LTED-Ada in single-device setting, a deep reinforcement learning-based algorithm that adaptively selects between local tracking and edge detection, according to the frame rate as well as recognition accuracy and delay requirement. In multi-device setting, we further enhance LTED-Ada using federated learning to enable collaborative policy training across devices, thereby improving its generalization to unseen frame rates and performance requirements. Finally, we conduct extensive hardware-in-the-loop experiments using multiple Raspberry Pi 4B devices and a personal computer as the edge server, demonstrating the superiority of LTED-Ada.

[81] Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks

Jie Li, Hongyi Cai, Mingkang Dong, Muxin Pu, Shan You, Fei Wang, Tao Huang

Main category: cs.CV

TL;DR: Pistachio is a new Video Anomaly Detection/Understanding benchmark created using a generation-based pipeline that provides controlled scenes, anomaly types, and temporal narratives to overcome limitations of existing datasets.

Details

Motivation: Existing VAD benchmarks lack scene diversity, balanced anomaly coverage, and temporal complexity needed for real-world assessment, while VAU requires deeper semantic reasoning but is difficult to benchmark due to heavy annotation requirements.

Method: A controlled generation-based pipeline using video generation models with scene-conditioned anomaly assignment, multi-step storyline generation, and temporally consistent long-form synthesis to produce coherent 41-second videos with minimal human intervention.

Result: Pistachio demonstrates scale, diversity, and complexity, revealing new challenges for existing methods and enabling more reliable assessment of real-world performance.

Conclusion: The benchmark motivates future research on dynamic and multi-event anomaly understanding by providing a controlled, bias-free dataset that addresses limitations of Internet-collected datasets.

Abstract: Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly coverage, and temporal complexity needed to reliably assess real-world performance. Meanwhile, the community is increasingly moving toward Video Anomaly Understanding (VAU), which requires deeper semantic and causal reasoning but remains difficult to benchmark due to the heavy manual annotation effort it demands. In this paper, we introduce Pistachio, a new VAD/VAU benchmark constructed entirely through a controlled, generation-based pipeline. By leveraging recent advances in video generation models, Pistachio provides precise control over scenes, anomaly types, and temporal narratives, effectively eliminating the biases and limitations of Internet-collected datasets. Our pipeline integrates scene-conditioned anomaly assignment, multi-step storyline generation, and a temporally consistent long-form synthesis strategy that produces coherent 41-second videos with minimal human intervention. Extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges for existing methods and motivating future research on dynamic and multi-event anomaly understanding.

[82] DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving

Haibo HU, Lianming Huang, Nan Guan, Chun Jason Xue

Main category: cs.CV

TL;DR: DeeAD is a training-free early-exit framework that accelerates Vision-Language Action models for autonomous driving by terminating inference when trajectories meet planning priors, achieving 28% sparsity and 29% latency reduction.

Details

Motivation: VLA models suffer from high inference latency due to deep transformer stacks, which limits their practical deployment in real-time autonomous driving applications.

Method: Uses action-guided early-exit with physical feasibility checks, multi-hop controller for layer skipping, and integrates with existing VLA models without retraining by evaluating trajectory alignment with planning priors.

Result: Achieves up to 28% transformer-layer sparsity and 29% latency reduction on Bench2Drive benchmark while maintaining planning quality and safety.

Conclusion: DeeAD provides an effective training-free solution for accelerating VLA planning models through early-exit mechanisms based on trajectory feasibility, enabling practical real-time deployment.

Abstract: Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep transformer stacks. We present DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories. Instead of relying on confidence scores, DeeAD terminates inference when predicted trajectories align with lightweight planning priors (e.g., Navigation or Low-precision Planning) within a tolerable deviation (<2m). To improve efficiency, we introduce a multi-hop controller that adaptively skips redundant layers based on the change rate of scores. DeeAD integrates into existing VLA models, such as ORION, without requiring retraining. Experiments on the Bench2Drive benchmark demonstrate up to 28% transformer-layer sparsity and 29% latency reduction, while preserving planning quality and safety.

[83] Foundry: Distilling 3D Foundation Models for the Edge

Guillaume Letellier, Siddharth Srivastava, Frédéric Jurie, Gaurav Sharma

Main category: cs.CV

TL;DR: FMD compresses large SSL foundation models into compact proxies that retain general-purpose representational power, with Foundry as the first implementation for 3D point clouds using SuperTokens to reconstruct teacher representations.

Details

Motivation: Large foundation models are too computationally expensive for edge devices, and existing compression methods sacrifice the downstream-agnostic generality that makes foundation models valuable.

Method: Foundry trains a student model to learn compressed SuperTokens that reconstruct the teacher’s token-level representations, capturing a compact basis of the latent space.

Result: A single distilled model maintains strong transferability across diverse downstream tasks (classification, part segmentation, few-shot scenarios), approaching full foundation-model performance with significantly fewer tokens and FLOPs.

Conclusion: FMD enables practical deployment of foundation models on resource-constrained hardware while preserving their general-purpose representational capabilities.

Abstract: Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient ‘specialist’ models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable. In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Our approach, Foundry, trains a student to learn a compressed set of SuperTokens that reconstruct the teacher’s token-level representations, capturing a compact basis of its latent space. A single distilled model maintains strong transferability across diverse downstream tasks-classification, part segmentation, and few-shot scenarios-approaching full foundation-model performance while using significantly fewer tokens and FLOPs, making such models more practical for deployment on resourceconstrained hardware.

[84] DinoLizer: Learning from the Best for Generative Inpainting Localization

Minh Thong Doi, Jan Butora, Vincent Itier, Jérémie Boulanger, Patrick Bas

Main category: cs.CV

TL;DR: DinoLizer is a DINOv2-based model for detecting manipulated regions in generative inpainting, using patch embeddings and linear classification to achieve state-of-the-art localization performance.

Details

Motivation: To develop an effective method for localizing manipulated regions in generative inpainting that can distinguish semantic alterations from non-semantic edits and remain robust to common post-processing operations.

Method: Builds on DINOv2 pretrained on B-Free dataset, adds linear classification head on patch embeddings to predict manipulations at 14×14 resolution, uses sliding-window strategy for larger images, and post-processes heatmaps to refine binary masks.

Result: Surpasses state-of-the-art local manipulation detectors, achieves 12% higher IoU than next best model, remains robust to resizing, noise addition, and JPEG compression, with even greater gains after post-processing.

Conclusion: Demonstrates strong representational power of Vision Transformers for manipulation localization, with DinoLizer showing superiority over other methods and confirming effectiveness through ablation studies comparing DINOv2 and DINOv3.

Abstract: We introduce DinoLizer, a DINOv2-based model for localizing manipulated regions in generative inpainting. Our method builds on a DINOv2 model pretrained to detect synthetic images on the B-Free dataset. We add a linear classification head on top of the Vision Transformer’s patch embeddings to predict manipulations at a $14\times 14$ patch resolution. The head is trained to focus on semantically altered regions, treating non-semantic edits as part of the original content. Because the ViT accepts only fixed-size inputs, we use a sliding-window strategy to aggregate predictions over larger images; the resulting heatmaps are post-processed to refine the estimated binary manipulation masks. Empirical results show that DinoLizer surpasses state-of-the-art local manipulation detectors on a range of inpainting datasets derived from different generative models. It remains robust to common post-processing operations such as resizing, noise addition, and JPEG (double) compression. On average, DinoLizer achieves a 12% higher Intersection-over-Union (IoU) than the next best model, with even greater gains after post-processing. Our experiments with off-the-shelf DINOv2 demonstrate the strong representational power of Vision Transformers for this task. Finally, extensive ablation studies comparing DINOv2 and its successor, DINOv3, in deepfake localization confirm DinoLizer’s superiority. The code will be publicly available upon acceptance of the paper.

[85] CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design

Daeheon Jeong, Seoyeon Byun, Kihoon Son, Dae Hyun Kim, Juho Kim

Main category: cs.CV

TL;DR: CANVAS is a new benchmark for evaluating vision language models’ ability to perform tool-based UI design tasks through design software like Figma or Sketch, covering 598 tasks across 30 functional categories.

Details

Motivation: Current VLMs show potential to operate design software and collaborate with designers, but there's no existing benchmark to evaluate their tool-based design capabilities, leaving this capacity unknown.

Method: Created CANVAS benchmark with 598 tool-based design tasks sampled from 3.3K mobile UI designs across 30 categories, featuring two task types: design replication (reproducing whole UI screens) and design modification (modifying specific parts of existing screens).

Result: Leading models demonstrate more strategic tool invocations that improve design quality, and common error patterns were identified to guide future improvements.

Conclusion: The benchmark successfully evaluates VLMs’ tool-based UI design capabilities, revealing strategic tool usage patterns and identifying areas for future enhancement in design collaboration tools.

Abstract: User interface (UI) design is an iterative process in which designers progressively refine their work with design software such as Figma or Sketch. Recent advances in vision language models (VLMs) with tool invocation suggest these models can operate design software to edit a UI design through iteration. Understanding and enhancing this capacity is important, as it highlights VLMs’ potential to collaborate with designers within conventional software. However, as no existing benchmark evaluates tool-based design performance, the capacity remains unknown. To address this, we introduce CANVAS, a benchmark for VLMs on tool-based user interface design. Our benchmark contains 598 tool-based design tasks paired with ground-truth references sampled from 3.3K mobile UI designs across 30 function-based categories (e.g., onboarding, messaging). In each task, a VLM updates the design step-by-step through context-based tool invocations (e.g., create a rectangle as a button background), linked to design software. Specifically, CANVAS incorporates two task types: (i) design replication evaluates the ability to reproduce a whole UI screen; (ii) design modification evaluates the ability to modify a specific part of an existing screen. Results suggest that leading models exhibit more strategic tool invocations, improving design quality. Furthermore, we identify common error patterns models exhibit, guiding future work in enhancing tool-based design capabilities.

[86] MODEST: Multi-Optics Depth-of-Field Stereo Dataset

Nisarg K. Trivedi, Vinayak A. Belludi, Li-Yun Wang, Pardis Taghavi, Dante Lok

Main category: cs.CV

TL;DR: A high-resolution stereo DSLR dataset with systematic optical variations to bridge the realism gap between synthetic training data and real camera optics.

Details

Motivation: Address the lack of large-scale, high-fidelity real stereo DSLR datasets that limit real-world generalization and evaluation of depth estimation models trained on synthetic data.

Method: Created a dataset with 18,000 high-resolution (5472×3648px) stereo images captured with two identical DSLR cameras across 9 scenes, systematically varying 10 focal lengths (28-70mm) and 5 apertures (f/2.8-f/22), totaling 50 optical configurations per scene.

Result: The dataset enables controlled analysis of geometric and optical effects for various computer vision tasks including monocular/stereo depth estimation, depth-of-field rendering, deblurring, 3D reconstruction, and novel view synthesis.

Conclusion: The work bridges the realism gap between synthetic and real camera optics, reveals challenges with current state-of-the-art methods, and provides resources to support reproducible research on real-world optical generalization.

Abstract: Reliable depth estimation under real optical conditions remains a core challenge for camera vision in systems such as autonomous robotics and augmented reality. Despite recent progress in depth estimation and depth-of-field rendering, research remains constrained by the lack of large-scale, high-fidelity, real stereo DSLR datasets, limiting real-world generalization and evaluation of models trained on synthetic data as shown extensively in literature. We present the first high-resolution (5472$\times$3648px) stereo DSLR dataset with 18000 images, systematically varying focal length and aperture across complex real scenes and capturing the optical realism and complexity of professional camera systems. For 9 scenes with varying scene complexity, lighting and background, images are captured with two identical camera assemblies at 10 focal lengths (28-70mm) and 5 apertures (f/2.8-f/22), spanning 50 optical configurations in 2000 images per scene. This full-range optics coverage enables controlled analysis of geometric and optical effects for monocular and stereo depth estimation, shallow depth-of-field rendering, deblurring, 3D scene reconstruction and novel view synthesis. Each focal configuration has a dedicated calibration image set, supporting evaluation of classical and learning based methods for intrinsic and extrinsic calibration. The dataset features challenging visual elements such as multi-scale optical illusions, reflective surfaces, mirrors, transparent glass walls, fine-grained details, and natural / artificial ambient light variations. This work attempts to bridge the realism gap between synthetic training data and real camera optics, and demonstrates challenges with the current state-of-the-art monocular, stereo depth and depth-of-field methods. We release the dataset, calibration files, and evaluation code to support reproducible research on real-world optical generalization.

[87] Text-Guided Semantic Image Encoder

Raghuveer Thirukovalluru, Xiaochuang Han, Bhuwan Dhingra, Emily Dinan, Maha Elbayad

Main category: cs.CV

TL;DR: TIE is a text-guided semantic image encoder that generates image representations conditioned on text queries, improving VLM performance and efficiency.

Details

Motivation: Standard image encoders in VLMs process images agnostically without considering downstream tasks or text queries, limiting their effectiveness.

Method: Propose Text-Guided Semantic Image Encoder (TIE) that generates image representations conditioned on input text queries through text-conditioned training.

Result: TIE-based VLMs outperform conventional counterparts by +1.5 and +1.3 points on average across nine benchmarks, with up to 6-point gains on DocVQA and InfoVQA, while using only half the image tokens.

Conclusion: TIE effectively optimizes encoders to capture key visual features, improves interpretability and query-specific grounding, and enhances both performance and inference efficiency.

Abstract: Image encoders, a fundamental component of vision-language models (VLMs), are typically pretrained independently before being aligned with a language model. This standard paradigm results in encoders that process images agnostically, without regard to the specific downstream task or text query. To address this limitation, we propose the Text-Guided Semantic Image Encoder (TIE), which generates image representations conditioned on the input text query. VLMs equipped with TIE outperform their conventional counterparts by +1.5 and +1.3 points on average across nine image-to-text benchmarks at the 1B and 3B scales, respectively, with gains reaching up to 6 points on tasks such as DocVQA and InfoVQA. Moreover, TIE-based VLMs attain superior performance while utilizing only half as many image tiles (tokens), resulting in notably improved inference efficiency. TIE also generalizes well with generic queries, indicating that text-conditioned training effectively optimizes the encoder to capture key visual features. Qualitative analysis confirms that TIE consistently attends to query-relevant regions, enhancing both interpretability and query-specific grounding.

[88] One Patch is All You Need: Joint Surface Material Reconstruction and Classification from Minimal Visual Cues

Sindhuja Penchala, Gavin Money, Gabriel Marques, Samuel Wood, Jessica Kirschman, Travis Atkison, Shahram Rahimi, Noorbakhsh Amiri Golilarz

Main category: cs.CV

TL;DR: SMARC is a unified model that reconstructs full RGB surfaces and classifies material categories from just a single 10% contiguous image patch, achieving state-of-the-art performance in both tasks.

Details

Motivation: Existing methods require dense or full-scene observations, limiting their effectiveness in constrained environments. There's a need for models that can understand material surfaces from minimal visual input.

Method: Combines Partial Convolutional U-Net with a classification head, enabling spatial inpainting and semantic understanding under extreme observation sparsity.

Result: Achieves PSNR of 17.55 dB for surface reconstruction and 85.10% accuracy for material classification on Touch and Go dataset, outperforming five baseline models including ViT, MAE, and DETR.

Conclusion: Partial convolution provides advantages in spatial reasoning under missing data, establishing a strong foundation for minimal-vision surface understanding.

Abstract: Understanding material surfaces from sparse visual cues is critical for applications in robotics, simulation, and material perception. However, most existing methods rely on dense or full-scene observations, limiting their effectiveness in constrained or partial view environment. To address this challenge, we introduce SMARC, a unified model for Surface MAterial Reconstruction and Classification from minimal visual input. By giving only a single 10% contiguous patch of the image, SMARC recognizes and reconstructs the full RGB surface while simultaneously classifying the material category. Our architecture combines a Partial Convolutional U-Net with a classification head, enabling both spatial inpainting and semantic understanding under extreme observation sparsity. We compared SMARC against five models including convolutional autoencoders [17], Vision Transformer (ViT) [13], Masked Autoencoder (MAE) [5], Swin Transformer [9], and DETR [2] using Touch and Go dataset [16] of real-world surface textures. SMARC achieves state-of-the-art results with a PSNR of 17.55 dB and a material classification accuracy of 85.10%. Our findings highlight the advantages of partial convolution in spatial reasoning under missing data and establish a strong foundation for minimal-vision surface understanding.

[89] LongVT: Incentivizing “Thinking with Long Videos” via Native Tool Calling

Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing

Main category: cs.CV

TL;DR: LongVT is an agentic framework that enables multimodal chain-of-tool-thought reasoning for long videos, using LMMs’ temporal grounding to crop and examine relevant clips in a global-to-local loop.

Details

Motivation: Existing LMMs are vulnerable to hallucinations when processing long-form videos where evidence is sparse and temporally dispersed, unlike human comprehension that skims globally then examines details.

Method: Uses LMMs’ temporal grounding as a native video cropping tool to zoom in on specific clips and resample finer-grained frames in a global-to-local reasoning loop. Includes three-stage training with supervised fine-tuning and agentic reinforcement learning.

Result: Outperforms existing baselines across four challenging long-video understanding benchmarks. Releases VideoSIAH dataset with 247.9K training samples and 1,280 QA evaluation pairs.

Conclusion: LongVT provides an effective end-to-end framework for long video reasoning that mimics human comprehension patterns, addressing hallucination issues through evidence-grounded multimodal reasoning.

Abstract: Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables “Thinking with Long Videos” via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs’ inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning, respectively. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our codes, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .

[90] Revisiting KRISP: A Lightweight Reproduction and Analysis of Knowledge-Enhanced Vision-Language Models

Souradeep Dutta, Keshav Bulia, Neena S Nair

Main category: cs.CV

TL;DR: Lightweight reproduction of KRISP model with significantly fewer parameters, achieving 75% of original performance while uncovering design flaws and enabling edge device deployment.

Details

Motivation: To create a more efficient and accessible version of KRISP that addresses computational demands and enables deployment on resource-constrained devices like smartphones and AR-VR systems.

Method: Systematic replication with reduced parameters, constrained external knowledge graph domain, and evaluation through ablation studies on synthetic VQA data and DAQUAR dataset.

Result: Achieved 75% of original KRISP performance while preventing AI hallucinations by constraining outputs to the knowledge graph domain, enabling operation on edge devices.

Conclusion: Lightweight knowledge-enhanced VQA models can maintain reasonable performance while being deployable on resource-constrained devices, though they reveal previously undocumented design flaws in the original architecture.

Abstract: Facebook AI Research introduced KRISP [4], which integrates structured external knowledge into pipelines for vision-language reasoning. Despite its effectiveness, the original model has been developed for industrial-scale training, is computationally demanding, and is tightly connected to a large backbone. In this work, we reexamine KRISP from a different angle and offer a lightweight reproduction with significantly fewer parameters. Even though our replicated model performs about 75 % of the original, the replication process uncovers a number of design flaws, real-world pitfalls, and implicit problems that were not fully covered in the original paper. We offer insights into the scalability and efficacy of knowledge-enhanced VQA architectures under resource constraints through systematic ablation studies, which include a proof-of-concept on synthetic VQA data and evaluation on the DAQUAR dataset. Our model, configured with a low parameter setup and constrained by the external Knowledge graph domain, prevents AI hallucinations and generates outputs solely within that domain. Minimal parameters allow us to function on edge devices like smartphones and AR-VR, further improving offline visual reasoning.

[91] Intriguing Properties of Dynamic Sampling Networks

Dario Morle, Reid Zaffino

Main category: cs.CV

TL;DR: The paper introduces a unified theoretical framework called ‘warping’ that generalizes dynamic sampling methods in deep learning, analyzes their statistical properties, and reveals unique training asymmetries between forward and backward passes.

Details

Motivation: To unify the theoretical analysis of various dynamic sampling mechanisms in computer vision models and provide a common framework for understanding architectures like deformable convolutions, active convolutional units, and spatial transformer networks.

Method: Developed a novel ‘warping’ operator that generalizes existing dynamic sampling methods, conducted statistical analysis modeling inputs as IID variables and homogeneous random fields, and introduced a novel loss landscape visualization using gradient update information.

Result: Discovered unique asymmetry between forward and backward passes in training, identified dynamic sampling as an orthogonal class of operators to traditional convolutions, and established conditions for stable training of dynamic sampling networks.

Conclusion: The warping framework successfully unifies dynamic sampling methods, provides theoretical foundations for analysis, and enables better understanding of training dynamics through novel visualization techniques.

Abstract: Dynamic sampling mechanisms in deep learning architectures have demonstrated utility across many computer vision models, though the theoretical analysis of these structures has not yet been unified. In this paper we connect the various dynamic sampling methods by developing and analyzing a novel operator which generalizes existing methods, which we term “warping”. Warping provides a minimal implementation of dynamic sampling which is amenable to analysis, and can be used to reconstruct existing architectures including deformable convolutions, active convolutional units, and spatial transformer networks. Using our formalism, we provide statistical analysis of the operator by modeling the inputs as both IID variables and homogeneous random fields. Extending this analysis, we discover a unique asymmetry between the forward and backward pass of the model training. We demonstrate that these mechanisms represent an entirely different class of orthogonal operators to the traditional translationally invariant operators defined by convolutions. With a combination of theoretical analysis and empirical investigation, we find the conditions necessary to ensure stable training of dynamic sampling networks. In addition, statistical analysis of discretization effects are studied. Finally, we introduce a novel loss landscape visualization which utilizes gradient update information directly, to better understand learning behavior.

[92] One-Step Diffusion-Based Image Compression with Semantic Distillation

Naifu Xue, Zhaoyang Jia, Jiahao Li, Bin Li, Yuan Zhang, Yan Lu

Main category: cs.CV

TL;DR: OneDC is a one-step diffusion-based generative image codec that eliminates iterative sampling latency while achieving state-of-the-art perceptual quality with 39% bitrate reduction and 20x faster decoding.

Details

Motivation: To address the unpleasing latency introduced by iterative sampling in diffusion-based generative image codecs, while maintaining high compression performance.

Method: Integrates latent compression with one-step diffusion generator, uses hyperprior as semantic guidance instead of text prompts, employs semantic distillation from pretrained generative tokenizer, and adopts hybrid pixel- and latent-domain optimization.

Result: Achieves SOTA perceptual quality with one-step generation, offering over 39% bitrate reduction and 20x faster decoding compared to prior multi-step diffusion-based codecs.

Conclusion: Multi-step sampling is not necessary for generative compression, and one-step diffusion codecs can achieve superior performance with significantly reduced latency.

Abstract: While recent diffusion-based generative image codecs have shown impressive performance, their iterative sampling process introduces unpleasing latency. In this work, we revisit the design of a diffusion-based codec and argue that multi-step sampling is not necessary for generative compression. Based on this insight, we propose OneDC, a One-step Diffusion-based generative image Codec – that integrates a latent compression module with a one-step diffusion generator. Recognizing the critical role of semantic guidance in one-step diffusion, we propose using the hyperprior as a semantic signal, overcoming the limitations of text prompts in representing complex visual content. To further enhance the semantic capability of the hyperprior, we introduce a semantic distillation mechanism that transfers knowledge from a pretrained generative tokenizer to the hyperprior codec. Additionally, we adopt a hybrid pixel- and latent-domain optimization to jointly enhance both reconstruction fidelity and perceptual realism. Extensive experiments demonstrate that OneDC achieves SOTA perceptual quality even with one-step generation, offering over 39% bitrate reduction and 20x faster decoding compared to prior multi-step diffusion-based codecs. Project: https://onedc-codec.github.io/

Kriti Ghosh, Devjyoti Chakraborty, Lakshmish Ramaswamy, Suchendra M. Bhandarkar, In Kee Kim, Nancy O’Hare, Deepak Mishra

Main category: cs.CV

TL;DR: Δ-NeRF is a modular residual framework for incremental NeRF refinement that enables efficient updates without catastrophic forgetting, achieving comparable performance to joint training with 30-42% faster training.

Details

Motivation: Most NeRF frameworks require complete retraining for new views, limiting applicability in sequential data scenarios like satellite terrain analysis where regions are repeatedly observed over time.

Method: Uses residual controller for layer corrections, uncertainty-aware gating to prevent overcorrection, view selection to reduce training data by 47%, and knowledge distillation for model compression to 20% of original size.

Result: Achieves performance comparable to joint training while reducing training time by 30-42%, outperforms baselines with up to 43.5% PSNR improvement over naive fine-tuning, and surpasses joint training on some metrics.

Conclusion: Δ-NeRF provides an effective solution for incremental NeRF refinement that addresses catastrophic forgetting and enables efficient updates for sequential data applications like satellite imagery.

Abstract: Neural Radiance Fields (NeRFs) have demonstrated remarkable capabilities in 3D reconstruction and novel view synthesis. However, most existing NeRF frameworks require complete retraining when new views are introduced incrementally, limiting their applicability in domains where data arrives sequentially. This limitation is particularly problematic in satellite-based terrain analysis, where regions are repeatedly observed over time. Incremental refinement of NeRFs remains underexplored, and naive approaches suffer from catastrophic forgetting when past data is unavailable. We propose $Δ$-NeRF, a unique modular residual framework for incremental NeRF refinement. $Δ$-NeRF introduces several novel techniques including: (1) a residual controller that injects per-layer corrections into a frozen base NeRF, enabling refinement without access to past data; (2) an uncertainty-aware gating mechanism that prevents overcorrection by adaptively combining base and refined predictions; and (3) a view selection strategy that reduces training data by up to 47% while maintaining performance. Additionally, we employ knowledge distillation to compress the enhanced model into a compact student network (20% of original size). Experiments on satellite imagery demonstrate that $Δ$-NeRF achieves performance comparable to joint training while reducing training time by 30-42%. $Δ$-NeRF consistently outperforms existing baselines, achieving an improvement of up to 43.5% in PSNR over naive fine-tuning and surpassing joint training on some metrics.

[94] GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction

Eya Cherif, Arthur Ouaknine, Luke A. Brown, Phuong D. Dao, Kyle R. Kovach, Bing Lu, Daniel Mederer, Hannes Feilhauer, Teja Kattenborn, David Rolnick

Main category: cs.CV

TL;DR: GreenHyperSpectra is a pretraining dataset for plant trait prediction using hyperspectral data, addressing label scarcity and domain shifts across sensors and ecosystems through semi- and self-supervised learning methods.

Details

Motivation: Conventional field sampling cannot cover trait variation at ecologically meaningful scales, and machine learning approaches face challenges with label scarcity and domain shifts across different sensors and ecological distributions.

Method: Created GreenHyperSpectra dataset with real-world cross-sensor and cross-ecosystem samples, used pretraining with semi- and self-supervised methods, and evaluated models in both in-distribution and out-of-distribution scenarios.

Result: Pretrained label-efficient multi-output regression models outperformed state-of-the-art supervised baselines, showing substantial improvements in learning spectral representations for trait prediction.

Conclusion: Established a comprehensive methodological framework that advances research at the intersection of representation learning and plant functional traits assessment, with all code and data publicly available.

Abstract: Plant traits such as leaf carbon content and leaf mass are essential variables in the study of biodiversity and climate change. However, conventional field sampling cannot feasibly cover trait variation at ecologically meaningful spatial scales. Machine learning represents a valuable solution for plant trait prediction across ecosystems, leveraging hyperspectral data from remote sensing. Nevertheless, trait prediction from hyperspectral data is challenged by label scarcity and substantial domain shifts (\eg across sensors, ecological distributions), requiring robust cross-domain methods. Here, we present GreenHyperSpectra, a pretraining dataset encompassing real-world cross-sensor and cross-ecosystem samples designed to benchmark trait prediction with semi- and self-supervised methods. We adopt an evaluation framework encompassing in-distribution and out-of-distribution scenarios. We successfully leverage GreenHyperSpectra to pretrain label-efficient multi-output regression models that outperform the state-of-the-art supervised baseline. Our empirical analyses demonstrate substantial improvements in learning spectral representations for trait prediction, establishing a comprehensive methodological framework to catalyze research at the intersection of representation learning and plant functional traits assessment. All code and data are available at: https://github.com/echerif18/HyspectraSSL.

[95] Layer-Aware Video Composition via Split-then-Merge

Ozgur Kara, Yujia Chen, Ming-Hsuan Yang, James M. Rehg, Wen-Sheng Chu, Du Tran

Main category: cs.CV

TL;DR: Split-then-Merge (StM) is a novel framework that enhances control in generative video composition by splitting unlabeled videos into foreground/background layers and self-composing them to learn compositional dynamics, addressing data scarcity without annotated datasets.

Details

Motivation: To address the data scarcity problem in generative video composition and enhance control over dynamic subject-scene interactions without relying on annotated datasets or handcrafted rules.

Method: Splits unlabeled videos into dynamic foreground and background layers, then self-composes them using a transformation-aware training pipeline with multi-layer fusion, augmentation for affordance-aware composition, and identity-preservation loss for foreground fidelity.

Result: Outperforms state-of-the-art methods in both quantitative benchmarks and human/VLLM-based qualitative evaluations.

Conclusion: StM effectively learns complex compositional dynamics for realistic video generation through self-supervised learning on unlabeled video data, achieving superior performance over existing methods.

Abstract: We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods relying on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that utilizes a multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show StM outperforms SoTA methods in both quantitative benchmarks and in humans/VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io

[96] SPHINX: A Synthetic Environment for Visual Perception and Reasoning

Md Tanvirul Alam, Saksham Aggarwal, Justin Yang Chae, Nidhi Rastogi

Main category: cs.CV

TL;DR: Sphinx is a synthetic environment for visual perception and reasoning that procedurally generates puzzles with verifiable solutions, covering 25 cognitive tasks. Current LVLMs perform poorly (51.1% accuracy), but RLVR training significantly improves performance.

Details

Motivation: To create a precise evaluation framework for visual reasoning that targets core cognitive primitives with verifiable ground-truth solutions, addressing limitations in current benchmarks.

Method: Procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives paired with verifiable solutions. Evaluates models on 25 task types and applies reinforcement learning with verifiable rewards (RLVR).

Result: State-of-the-art GPT-5 achieves only 51.1% accuracy, well below human performance. RLVR training substantially improves model accuracy on Sphinx tasks and yields gains on external visual reasoning benchmarks.

Conclusion: Sphinx provides a rigorous benchmark for visual reasoning, revealing significant gaps in current LVLMs. RLVR shows promise for advancing multimodal reasoning capabilities.

Abstract: We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.

[97] Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion

Samuele Dell’Erba, Andrew D. Bagdanov

Main category: cs.CV

TL;DR: OVI replaces expensive diffusion priors with optimization-based visual inversion, achieving comparable performance without training or data.

Details

Motivation: Current diffusion models rely on computationally expensive prior networks that require massive datasets, which is inefficient and resource-intensive.

Method: Uses Optimization-based Visual Inversion (OVI) with random pseudo-tokens initialization and iterative optimization using cosine similarity with text embeddings, plus Mahalanobis and Nearest-Neighbor constraints for regularization.

Result: OVI achieves quantitative scores comparable to state-of-the-art data-efficient priors, with Nearest-Neighbor approach showing particularly good performance and improved visual fidelity over text-embedding baseline.

Conclusion: OVI is a viable alternative to traditional diffusion priors, revealing flaws in current evaluation benchmarks and showing promise for further investigation.

Abstract: Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative, to replace the need for a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.

Roman Naeem, David Hagerman, Jennifer Alvén, Fredrik Kahl

Main category: cs.CV

TL;DR: RefTr is a 3D image-to-graph model for vascular tree centerline detection using recurrent refinement of confluent trajectories, achieving high recall with fewer parameters and faster inference than previous methods.

Details

Motivation: Accurate centerline detection with correct tree topology is critical for clinical applications like diagnosis and surgical planning. High recall is essential as missing small branches can lead to fatal mistakes from incomplete assessments.

Method: Uses Producer-Refiner architecture with Transformer decoder. Producer proposes initial confluent trajectories that are recurrently refined by Refiner to form centerline graphs. Includes efficient non-maximum suppression for merging duplicate branches.

Result: Achieves superior recall and comparable precision to previous SOTA across multiple datasets, with 2.4x reduction in decoder parameters and faster inference speed.

Conclusion: RefTr demonstrates potential as a new state-of-the-art framework for vascular tree analysis in 3D medical imaging, offering improved performance with computational efficiency.

Abstract: Tubular trees, such as blood vessels and lung airways, are essential for material transport within the human body. Accurately detecting their centerlines with correct tree topology is critical for clinical tasks such as diagnosis, treatment planning, and surgical navigation. In these applications, maintaining high recall is crucial, as missing small branches can result in fatal mistakes caused by incomplete assessments or undetected abnormalities. We present RefTr, a 3D image-to-graph model for centerline generation of vascular trees via recurrent refinement of confluent trajectories. RefTr uses a Producer-Refiner architecture based on a Transformer decoder, where the Producer proposes a set of initial confluent trajectories that are recurrently refined by the Refiner to produce final trajectories, which forms the centerline graph. The confluent trajectory representation enables refinement of complete trajectories while explicitly enforcing a valid tree topology. The recurrent refinement scheme improves precision and reuses the same Refiner block across multiple steps, yielding a 2.4x reduction in decoder parameters compared to previous SOTA. We also introduce an efficient non-maximum suppression algorithm for spatial tree graphs to merge duplicate branches and boost precision. Across multiple public centerline datasets, RefTr achieves superior recall and comparable precision to previous SOTA, while offering faster inference and substantially fewer parameters, demonstrating its potential as a new state-of-the-art framework for vascular tree analysis in 3D medical imaging.

[99] Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries

Sree Bhattacharyya, Yaman Kumar Singla, Sudhir Yarram, Somesh Kumar Singh, Harini S, James Z. Wang

Main category: cs.CV

TL;DR: This paper introduces the first large-scale unsupervised dataset for visual memorability using tip-of-the-tongue queries from online platforms, enabling better recall generation and retrieval tasks than state-of-the-art models.

Details

Motivation: Current visual memorability research is limited by expensive human annotations and lack of nuanced memorability signals from natural recall descriptions, restricting dataset diversity and scalability.

Method: Leveraged tip-of-the-tongue retrieval queries from platforms like Reddit to create an unsupervised dataset of 82,000+ videos with descriptive recall data, and used contrastive training for multimodal ToT retrieval.

Result: Fine-tuned vision-language models outperformed GPT-4o in generating open-ended memorability descriptions, and created the first model capable of multimodal ToT retrieval.

Conclusion: The unsupervised dataset and models provide a novel direction for advancing visual content memorability research by capturing rich memorability signals from natural recall data.

Abstract: Visual content memorability has intrigued the scientific community for decades, with applications ranging widely, from understanding nuanced aspects of human memory to enhancing content design. A significant challenge in progressing the field lies in the expensive process of collecting memorability annotations from humans. This limits the diversity and scalability of datasets for modeling visual content memorability. Most existing datasets are limited to collecting aggregate memorability scores for visual content, not capturing the nuanced memorability signals present in natural, open-ended recall descriptions. In this work, we introduce the first large-scale unsupervised dataset designed explicitly for modeling visual memorability signals, containing over 82,000 videos, accompanied by descriptive recall data. We leverage tip-of-the-tongue (ToT) retrieval queries from online platforms such as Reddit. We demonstrate that our unsupervised dataset provides rich signals for two memorability-related tasks: recall generation and ToT retrieval. Large vision-language models fine-tuned on our dataset outperform state-of-the-art models such as GPT-4o in generating open-ended memorability descriptions for visual content. We also employ a contrastive training strategy to create the first model capable of performing multimodal ToT retrieval. Our dataset and models present a novel direction, facilitating progress in visual content memorability research.

[100] Estimating Fog Parameters from a Sequence of Stereo Images

Yining Ding, João F. C. Mota, Andrew M. Wallace, Sen Wang

Main category: cs.CV

TL;DR: Proposes a method for estimating fog parameters from stereo foggy images using simultaneous optimization, handles locally homogeneous fog, and introduces SDIRF dataset with real foggy road scenes.

Details

Motivation: Existing approaches estimate fog parameters sequentially, leading to error propagation. Real-world fog is often globally inhomogeneous, requiring methods that can handle local variations.

Method: Simultaneous estimation of all fog parameters through novel optimization, assuming locally homogeneous fog. Can be integrated as add-on module in SLAM/odometry systems.

Result: Superior performance on both synthetic and real foggy data from SDIRF dataset. Produces most accurate estimates and better adaptation to real fog compared to prior methods.

Conclusion: The proposed method effectively handles real-world fog conditions and advances visual perception in fog. Code and SDIRF dataset are publicly available for community use.

Abstract: We propose a method which, given a sequence of stereo foggy images, estimates the parameters of a fog model and updates them dynamically. In contrast with previous approaches, which estimate the parameters sequentially and thus are prone to error propagation, our algorithm estimates all the parameters simultaneously by solving a novel optimisation problem. By assuming that fog is only locally homogeneous, our method effectively handles real-world fog, which is often globally inhomogeneous. The proposed algorithm can be easily used as an add-on module in existing visual Simultaneous Localisation and Mapping (SLAM) or odometry systems in the presence of fog. In order to assess our method, we also created a new dataset, the Stereo Driving In Real Fog (SDIRF), consisting of high-quality, consecutive stereo frames of real, foggy road scenes under a variety of visibility conditions, totalling over 40 minutes and 34k frames. As a first-of-its-kind, SDIRF contains the camera’s photometric parameters calibrated in a lab environment, which is a prerequisite for correctly applying the atmospheric scattering model to foggy images. The dataset also includes the counterpart clear data of the same routes recorded in overcast weather, which is useful for companion work in image defogging and depth reconstruction. We conducted extensive experiments using both synthetic foggy data and real foggy sequences from SDIRF to demonstrate the superiority of the proposed algorithm over prior methods. Our method not only produces the most accurate estimates on synthetic data, but also adapts better to real fog. We make our code and SDIRF publicly available\footnote{https://github.com/SenseRoboticsLab/estimating-fog-parameters} to the community with the aim of advancing the research on visual perception in fog.

[101] V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

Jiancheng Pan, Runze Wang, Tianwen Qian, Mohammad Mahdi, Yanwei Fu, Xiangyang Xue, Xiaomeng Huang, Luc Van Gool, Danda Pani Paudel, Yuqian Fu

Main category: cs.CV

TL;DR: V^2-SAM adapts SAM2 for cross-view object correspondence using two complementary prompt generators and a multi-expert selection mechanism, achieving state-of-the-art performance on multiple benchmarks.

Details

Motivation: Existing segmentation models like SAM2 struggle with cross-view object correspondence due to drastic viewpoint and appearance variations between different perspectives (e.g., ego-centric and exo-centric views).

Method: Proposes V^2-SAM with two prompt generators: V^2-Anchor for geometry-aware correspondences using DINOv3 features, and V^2-Visual for appearance-guided cues via visual prompt matching. Uses multi-expert design with Post-hoc Cyclic Consistency Selector (PCCS) for adaptive expert selection.

Result: Achieves state-of-the-art performance on Ego-Exo4D (ego-exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence) benchmarks.

Conclusion: V^2-SAM successfully adapts SAM2 for cross-view object correspondence through complementary prompt generation and adaptive expert selection, demonstrating superior performance across multiple challenging scenarios.

Abstract: Cross-view object correspondence, exemplified by the representative task of ego-exo object correspondence, aims to establish consistent associations of the same object across different viewpoints (e.g., ego-centric and exo-centric). This task poses significant challenges due to drastic viewpoint and appearance variations, making existing segmentation models, such as SAM2, non-trivial to apply directly. To address this, we present V^2-SAM, a unified cross-view object correspondence framework that adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. Specifically, the Cross-View Anchor Prompt Generator (V^2-Anchor), built upon DINOv3 features, establishes geometry-aware correspondences and, for the first time, unlocks coordinate-based prompting for SAM2 in cross-view scenarios, while the Cross-View Visual Prompt Generator (V^2-Visual) enhances appearance-guided cues via a novel visual prompt matcher that aligns ego-exo representations from both feature and structural perspectives. To effectively exploit the strengths of both prompts, we further adopt a multi-expert design and introduce a Post-hoc Cyclic Consistency Selector (PCCS) that adaptively selects the most reliable expert based on cyclic consistency. Extensive experiments validate the effectiveness of V^2-SAM, achieving new state-of-the-art performance on Ego-Exo4D (ego-exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence).

[102] Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation

Taehoon Kim, Henry Gouk, Timothy Hospedales

Main category: cs.CV

TL;DR: Null-TTA aligns diffusion models by optimizing the unconditional embedding in classifier-free guidance to prevent reward hacking and maintain semantic coherence during test-time adaptation.

Details

Motivation: Existing test-time alignment methods either under-optimize or over-optimize (reward hack) target reward functions, leading to poor performance and exploitation of non-semantic patterns.

Method: Optimize the unconditional embedding in classifier-free guidance rather than manipulating latent or noise variables, leveraging the structured semantic nature of text embedding space.

Result: Achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalization, preventing reward hacking by ensuring alignment occurs on semantically coherent manifolds.

Conclusion: Semantic-space optimization through unconditional embedding manipulation establishes an effective and principled novel paradigm for test-time alignment in diffusion models.

Abstract: Test-time alignment (TTA) aims to adapt models to specific rewards during inference. However, existing methods tend to either under-optimise or over-optimise (reward hack) the target reward function. We propose Null-Text Test-Time Alignment (Null-TTA), which aligns diffusion models by optimising the unconditional embedding in classifier-free guidance, rather than manipulating latent or noise variables. Due to the structured semantic nature of the text embedding space, this ensures alignment occurs on a semantically coherent manifold and prevents reward hacking (exploiting non-semantic noise patterns to improve the reward). Since the unconditional embedding in classifier-free guidance serves as the anchor for the model’s generative distribution, Null-TTA directly steers model’s generative distribution towards the target reward rather than just adjusting the samples, even without updating model parameters. Thanks to these desirable properties, we show that Null-TTA achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalisation. This establishes semantic-space optimisation as an effective and principled novel paradigm for TTA.

[103] GaINeR: Geometry-Aware Implicit Network Representation

Weronika Jakubowska, Mikołaj Zieliński, Rafał Tobiasz, Krzysztof Byrski, Maciej Zięba, Dominik Belter, Przemysław Spurek

Main category: cs.CV

TL;DR: GaINeR is a geometry-aware implicit neural representation for 2D images that combines trainable Gaussian distributions with neural networks to enable continuous representation, interpretable structure, and local editing.

Details

Motivation: Traditional INRs lack explicit geometric structure and have limited local editing capabilities, restricting their use in dynamic or interactive applications.

Method: Combines trainable Gaussian distributions with neural networks - retrieves K nearest Gaussians for each coordinate, aggregates distance-weighted embeddings, and predicts RGB values via neural network.

Result: Enables continuous image representation with interpretable geometric structure and flexible local editing capabilities.

Conclusion: GaINeR provides a foundation for physically aware and interactive image manipulation by integrating geometric structure with implicit neural representations.

Abstract: Implicit Neural Representations (INRs) have become an essential tool for modeling continuous 2D images, enabling high-fidelity reconstruction, super-resolution, and compression. Popular architectures such as SIREN, WIRE, and FINER demonstrate the potential of INR for capturing fine-grained image details. However, traditional INRs often lack explicit geometric structure and have limited capabilities for local editing or integration with physical simulation, restricting their applicability in dynamic or interactive settings. To address these limitations, we propose GaINeR: Geometry-Aware Implicit Network Representation, a novel framework for 2D images that combines trainable Gaussian distributions with a neural network-based INR. For a given image coordinate, the model retrieves the K nearest Gaussians, aggregates distance-weighted embeddings, and predicts the RGB value via a neural network. This design enables continuous image representation, interpretable geometric structure, and flexible local editing, providing a foundation for physically aware and interactive image manipulation. The official implementation of our method is publicly available at https://github.com/WJakubowska/GaINeR.

[104] A deep learning model to reduce agent dose for contrast-enhanced MRI of the cerebellopontine angle cistern

Yunjie Chen, Rianne A. Weber, Olaf M. Neve, Stephan R. Romeijn, Erik F. Hensen, Jelmer M. Wolterink, Qian Tao, Marius Staring, Berit M. Verbist

Main category: cs.CV

TL;DR: Deep learning model successfully restores standard-dose MRI quality from low-dose (10-30%) contrast-enhanced T1-weighted images of cerebellopontine angle cistern, enabling accurate lesion detection and segmentation with significantly reduced contrast agent.

Details

Motivation: To reduce contrast agent dose in MRI scans while maintaining diagnostic image quality, particularly for cerebellopontine angle cistern imaging in vestibular schwannoma patients.

Method: Multi-center retrospective study using T1 and contrast-enhanced T1-weighted MRI to simulate low-dose images. Deep learning models were trained to restore standard-dose quality from low-dose simulations with varying contrast agent reductions.

Result: DL restoration improved image quality metrics significantly - SSIM increased from 0.639 to 0.993 and PSNR from 21.6 dB to 41.4 dB. At 10% input dose, segmentation performance improved (Dice from 0.673 to 0.734). Both 10% and 30% dose restored images showed excellent quality, with 30% being more informative.

Conclusion: The DL model enables lesion detection and diagnostic characterization with only 10-30% of standard contrast agent dose while maintaining excellent image quality for cerebellopontine angle MRI.

Abstract: Objectives: To evaluate a deep learning (DL) model for reducing the agent dose of contrast-enhanced T1-weighted MRI (T1ce) of the cerebellopontine angle (CPA) cistern. Materials and methods: In this multi-center retrospective study, T1 and T1ce of vestibular schwannoma (VS) patients were used to simulate low-dose T1ce with varying reductions of contrast agent dose. DL models were trained to restore standard-dose T1ce from the low-dose simulation. The image quality and segmentation performance of the DL-restored T1ce were evaluated. A head and neck radiologist was asked to rate DL-restored images in multiple aspects, including image quality and diagnostic characterization. Results: 203 MRI studies from 72 VS patients (mean age, 58.51 \pm 14.73, 39 men) were evaluated. As the input dose increased, the structural similarity index measure of the restored T1ce increased from 0.639 \pm 0.113 to 0.993 \pm 0.009, and the peak signal-to-noise ratio increased from 21.6 \pm 3.73 dB to 41.4 \pm 4.84 dB. At 10% input dose, using DL-restored T1ce for segmentation improved the Dice from 0.673 to 0.734, the 95% Hausdorff distance from 2.38 mm to 2.07 mm, and the average surface distance from 1.00 mm to 0.59 mm. Both DL-restored T1ce from 10% and 30% input doses showed excellent images, with the latter being considered more informative. Conclusion: The DL model improved the image quality of low-dose MRI of the CPA cistern, which makes lesion detection and diagnostic characterization possible with 10% - 30% of the standard dose.

[105] Smooth regularization for efficient video recognition

Gil Goldman, Raja Giryes, Mahadev Satyanarayanan

Main category: cs.CV

TL;DR: A smooth regularization technique using Gaussian Random Walk to enforce temporal coherence in video recognition models, improving accuracy of lightweight architectures by 3.8-6.4% on Kinetics-600.

Details

Motivation: To instill strong temporal inductive bias in video recognition models, particularly benefiting lightweight architectures by promoting temporal coherence that aligns with natural video properties.

Method: Proposes smooth regularization that encourages smoothness in intermediate-layer embeddings of consecutive frames by modeling changes as Gaussian Random Walk, penalizing abrupt representational shifts.

Result: Lightweight models achieve 3.8-6.4% accuracy improvement on Kinetics-600. MoViNets improve state-of-the-art by 3.8-6.1% within FLOP constraints, while MobileNetV3 and MoViNets-Stream gain 4.9-6.4% over prior models with comparable memory footprints.

Conclusion: The smooth regularization technique effectively improves video recognition performance in lightweight models by enforcing temporal coherence through Gaussian Random Walk modeling.

Abstract: We propose a smooth regularization technique that instills a strong temporal inductive bias in video recognition models, particularly benefiting lightweight architectures. Our method encourages smoothness in the intermediate-layer embeddings of consecutive frames by modeling their changes as a Gaussian Random Walk (GRW). This penalizes abrupt representational shifts, thereby promoting low-acceleration solutions that better align with the natural temporal coherence inherent in videos. By leveraging this enforced smoothness, lightweight models can more effectively capture complex temporal dynamics. Applied to such models, our technique yields a 3.8% to 6.4% accuracy improvement on Kinetics-600. Notably, the MoViNets model family trained with our smooth regularization improves the current state of the art by 3.8% to 6.1% within their respective FLOP constraints, while MobileNetV3 and the MoViNets-Stream family achieve gains of 4.9% to 6.4% over prior state-of-the-art models with comparable memory footprints. Our code and models are available at https://github.com/gilgoldm/grw-smoothing.

[106] Open Vocabulary Compositional Explanations for Neuron Alignment

Biagio La Rosa, Leilani H. Gilpin

Main category: cs.CV

TL;DR: A framework for generating open vocabulary compositional explanations of neurons in vision models using semantic segmentation masks instead of human-annotated data.

Details

Motivation: To overcome the limitation of existing compositional explanations that rely on human-annotated datasets, restricting their applicability to specific domains and predefined concepts.

Method: Three-step framework: (1) specify arbitrary concepts, (2) generate semantic segmentation masks using open vocabulary models, (3) derive compositional explanations from these masks.

Result: The framework enables probing neurons for arbitrary concepts and datasets, provides flexible explanations, and shows comparable performance to previous methods while offering greater flexibility.

Conclusion: The proposed framework successfully addresses the limitations of human-annotated data dependency, enabling more flexible and open vocabulary compositional explanations for neuron analysis in vision models.

Abstract: Neurons are the fundamental building blocks of deep neural networks, and their interconnections allow AI to achieve unprecedented results. Motivated by the goal of understanding how neurons encode information, compositional explanations leverage logical relationships between concepts to express the spatial alignment between neuron activations and human knowledge. However, these explanations rely on human-annotated datasets, restricting their applicability to specific domains and predefined concepts. This paper addresses this limitation by introducing a framework for the vision domain that allows users to probe neurons for arbitrary concepts and datasets. Specifically, the framework leverages masks generated by open vocabulary semantic segmentation to compute open vocabulary compositional explanations. The proposed framework consists of three steps: specifying arbitrary concepts, generating semantic segmentation masks using open vocabulary models, and deriving compositional explanations from these masks. The paper compares the proposed framework with previous methods for computing compositional explanations both in terms of quantitative metrics and human interpretability, analyzes the differences in explanations when shifting from human-annotated data to model-annotated data, and showcases the additional capabilities provided by the framework in terms of flexibility of the explanations with respect to the tasks and properties of interest.

[107] UruDendro4: A Benchmark Dataset for Automatic Tree-Ring Detection in Cross-Section Images of Pinus taeda L

Henry Marichal, Joaquin Blanco, Diego Passarella, Gregory Randall

Main category: cs.CV

TL;DR: The paper introduces UruDendro4 dataset for tree-ring analysis, provides baseline performance using DeepCS-TRD method, and shows improved generalization when including this dataset in training.

Details

Motivation: Manual tree-ring measurement is time-consuming and imprecise, and there's scarcity of wood cross-section data for automated analysis. Existing datasets lack samples from multiple stem heights needed for volumetric modeling.

Method: Created UruDendro4 dataset with 102 Pinus taeda L. samples annotated with annual rings from multiple stem heights. Used DeepCS-TRD method for automatic ring detection and conducted ablation experiments to validate parameters.

Result: DeepCS-TRD achieved mean Average Precision of 0.838, mean Average Recall of 0.782, and Adapted Rand Error of 0.084. Training with this dataset improved model generalization for tree-ring detection.

Conclusion: UruDendro4 enables volumetric wood growth modeling and provides a valuable resource for automated tree-ring analysis. The dataset improves model performance and generalization in ring detection tasks.

Abstract: Tree-ring growth represents the annual wood increment for a tree, and quantifying it allows researchers to assess which silvicultural practices are best suited for each species. Manual measurement of this growth is time-consuming and often imprecise, as it is typically performed along 4 to 8 radial directions on a cross-sectional disc. In recent years, automated algorithms and datasets have emerged to enhance accuracy and automate the delineation of annual rings in cross-sectional images. To address the scarcity of wood cross-section data, we introduce the UruDendro4 dataset, a collection of 102 image samples of Pinus taeda L., each manually annotated with annual growth rings. Unlike existing public datasets, UruDendro4 includes samples extracted at multiple heights along the stem, allowing for the volumetric modeling of annual growth using manually delineated rings. This dataset (images and annotations) allows the development of volumetric models for annual wood estimation based on cross-sectional imagery. Additionally, we provide a performance baseline for automatic ring detection on this dataset using state-of-the-art methods. The highest performance was achieved by the DeepCS-TRD method, with a mean Average Precision of 0.838, a mean Average Recall of 0.782, and an Adapted Rand Error score of 0.084. A series of ablation experiments were conducted to empirically validate the final parameter configuration. Furthermore, we empirically demonstrate that training a learning model including this dataset improves the model’s generalization in the tree-ring detection task.

[108] BUSTR: Breast Ultrasound Text Reporting with a Descriptor-Aware Vision-Language Model

Rawa Mohammed, Mina Attin, Bryar Shareef

Main category: cs.CV

TL;DR: BUSTR is a multitask vision-language framework that generates breast ultrasound reports without paired image-report supervision by using structured descriptors and radiomics features.

Details

Motivation: Automated radiology report generation for breast ultrasound is limited by lack of paired image-report datasets and hallucination risks from large language models.

Method: Constructs reports from structured descriptors, learns descriptor-aware visual representations with multi-head Swin encoder using multitask loss, and aligns visual/textual tokens via dual-level objective combining token-level cross-entropy with cosine-similarity alignment loss.

Result: Consistently improves standard natural language generation metrics and clinical efficacy metrics across two BUS datasets (BrEaST and BUS-BRA), particularly for key targets like BI-RADS category and pathology.

Conclusion: Descriptor-aware vision model trained with combined token-level and alignment loss improves both automatic report metrics and clinical efficacy without requiring paired image-report data.

Abstract: Automated radiology report generation (RRG) for breast ultrasound (BUS) is limited by the lack of paired image-report datasets and the risk of hallucinations from large language models. We propose BUSTR, a multitask vision-language framework that generates BUS reports without requiring paired image-report supervision. BUSTR constructs reports from structured descriptors (e.g., BI-RADS, pathology, histology) and radiomics features, learns descriptor-aware visual representations with a multi-head Swin encoder trained using a multitask loss over dataset-specific descriptor sets, and aligns visual and textual tokens via a dual-level objective that combines token-level cross-entropy with a cosine-similarity alignment loss between input and output representations. We evaluate BUSTR on two public BUS datasets, BrEaST and BUS-BRA, which differ in size and available descriptors. Across both datasets, BUSTR consistently improves standard natural language generation metrics and clinical efficacy metrics, particularly for key targets such as BI-RADS category and pathology. Our results show that this descriptor-aware vision model, trained with a combined token-level and alignment loss, improves both automatic report metrics and clinical efficacy without requiring paired image-report data. The source code can be found at https://github.com/AAR-UNLV/BUSTR

[109] Beyond Realism: Learning the Art of Expressive Composition with StickerNet

Haoming Lu, David Kocharian, Humphrey Shi

Main category: cs.CV

TL;DR: StickerNet is a two-stage framework for expressive image composition that predicts placement parameters (opacity, mask, location, scale) after determining composition type, trained on real-world editing data from online platforms.

Details

Motivation: Traditional image composition focuses on realism, but modern content creation often aims for artistic, playful, or socially engaging compositions that don't preserve realism, reflecting actual user behavior on creative platforms.

Method: Two-stage framework: 1) determine composition type, 2) predict placement parameters (opacity, mask, location, scale). Dataset built from 1.8M real editing actions from an online visual creation platform.

Result: StickerNet outperforms common baselines and closely matches human placement behavior in user studies and quantitative evaluations, demonstrating effectiveness despite task ambiguity.

Conclusion: This work introduces a new direction in visual understanding emphasizing expressiveness and user intent over realism, showing the value of learning from real-world editing patterns.

Abstract: As a widely used operation in image editing workflows, image composition has traditionally been studied with a focus on achieving visual realism and semantic plausibility. However, in practical editing scenarios of the modern content creation landscape, many compositions are not intended to preserve realism. Instead, users of online platforms motivated by gaining community recognition often aim to create content that is more artistic, playful, or socially engaging. Taking inspiration from this observation, we define the expressive composition task, a new formulation of image composition that embraces stylistic diversity and looser placement logic, reflecting how users edit images on real-world creative platforms. To address this underexplored problem, we present StickerNet, a two-stage framework that first determines the composition type, then predicts placement parameters such as opacity, mask, location, and scale accordingly. Unlike prior work that constructs datasets by simulating object placements on real images, we directly build our dataset from 1.8 million editing actions collected on an anonymous online visual creation and editing platform, each reflecting user-community validated placement decisions. This grounding in authentic editing behavior ensures strong alignment between task definition and training supervision. User studies and quantitative evaluations show that StickerNet outperforms common baselines and closely matches human placement behavior, demonstrating the effectiveness of learning from real-world editing patterns despite the inherent ambiguity of the task. This work introduces a new direction in visual understanding that emphasizes expressiveness and user intent over realism.

[110] TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs

Md Adnan Arefeen, Biplob Debnath, Srimat Chakradhar

Main category: cs.CV

TL;DR: TrafficLens is a novel algorithm that accelerates multi-camera traffic video analysis by using sequential VLM processing with overlapping camera coverage and object-level similarity detection to reduce redundant computations.

Details

Motivation: Current methods for analyzing multi-camera traffic feeds are inefficient due to time-consuming video-to-text conversion using Vision-Language Models, which delays timely insights and incident investigation.

Method: Uses sequential VLM processing across overlapping camera views, iteratively applying VLMs with varying token limits and using previous outputs as prompts. Includes object-level similarity detector to bypass redundant VLM invocations.

Result: Experimental results show TrafficLens reduces video-to-text conversion time by up to 4x while maintaining information accuracy compared to conventional approaches.

Conclusion: TrafficLens provides an efficient solution for real-time multi-camera traffic analysis by optimizing VLM usage and leveraging camera overlap, enabling faster insights from traffic video data.

Abstract: Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforcement capabilities, traffic management, and pedestrian safety. However, efficiently managing and analyzing multi-camera feeds poses challenges due to the vast amount of data. Analyzing such huge video data requires advanced analytical tools. While Large Language Models (LLMs) like ChatGPT, equipped with retrieval-augmented generation (RAG) systems, excel in text-based tasks, integrating them into traffic video analysis demands converting video data into text using a Vision-Language Model (VLM), which is time-consuming and delays the timely utilization of traffic videos for generating insights and investigating incidents. To address these challenges, we propose TrafficLens, a tailored algorithm for multi-camera traffic intersections. TrafficLens employs a sequential approach, utilizing overlapping coverage areas of cameras. It iteratively applies VLMs with varying token limits, using previous outputs as prompts for subsequent cameras, enabling rapid generation of detailed textual descriptions while reducing processing time. Additionally, TrafficLens intelligently bypasses redundant VLM invocations through an object-level similarity detector. Experimental results with real-world datasets demonstrate that TrafficLens reduces video-to-text conversion time by up to $4\times$ while maintaining information accuracy.

[111] Privacy-Preserving Federated Vision Transformer Learning Leveraging Lightweight Homomorphic Encryption in Medical AI

Al Amin, Kamrul Hasan, Liang Hong, Sharif Ullah

Main category: cs.CV

TL;DR: Privacy-preserving federated learning framework combining Vision Transformers with homomorphic encryption for secure histopathology classification, achieving 30x communication reduction while preventing model inversion attacks.

Details

Motivation: Privacy regulations like HIPAA prohibit direct patient data sharing in healthcare, and conventional federated learning remains vulnerable to gradient-based reconstruction attacks that can expose sensitive medical information.

Method: Uses Vision Transformers with CLS token as compact 768D feature representation, encrypts CLS tokens using CKKS homomorphic encryption before server transmission, enabling secure aggregation and encrypted inference.

Result: Achieves 96.12% global accuracy (unencrypted) and 90.02% (encrypted), prevents model inversion attacks (vs vulnerable gradients: PSNR 52.26 dB, SSIM 0.999, NMI 0.741), reduces communication by 30x to 326 KB per round.

Conclusion: The framework provides strong privacy guarantees against reconstruction attacks while maintaining high classification accuracy and significantly reducing communication overhead in multi-institutional healthcare settings.

Abstract: Collaborative machine learning across healthcare institutions promises improved diagnostic accuracy by leveraging diverse datasets, yet privacy regulations such as HIPAA prohibit direct patient data sharing. While federated learning (FL) enables decentralized training without raw data exchange, recent studies show that model gradients in conventional FL remain vulnerable to reconstruction attacks, potentially exposing sensitive medical information. This paper presents a privacy-preserving federated learning framework combining Vision Transformers (ViT) with homomorphic encryption (HE) for secure multi-institutional histopathology classification. The approach leverages the ViT CLS token as a compact 768-dimensional feature representation for secure aggregation, encrypting these tokens using CKKS homomorphic encryption before transmission to the server. We demonstrate that encrypting CLS tokens achieves a 30-fold communication reduction compared to gradient encryption while maintaining strong privacy guarantees. Through evaluation on a three-client federated setup for lung cancer histopathology classification, we show that gradients are highly susceptible to model inversion attacks (PSNR: 52.26 dB, SSIM: 0.999, NMI: 0.741), enabling near-perfect image reconstruction. In contrast, the proposed CLS-protected HE approach prevents such attacks while enabling encrypted inference directly on ciphertexts, requiring only 326 KB of encrypted data transmission per aggregation round. The framework achieves 96.12 percent global classification accuracy in the unencrypted domain and 90.02 percent in the encrypted domain.

[112] Inversion-Free Style Transfer with Dual Rectified Flows

Yingying Deng, Xiangyu He, Fan Tang, Weiming Dong, Xucheng Yin

Main category: cs.CV

TL;DR: Proposes an inversion-free style transfer framework using dual rectified flows that avoids computationally expensive inversion processes, enabling style transfer with only forward passes through dynamic trajectory fusion and attention injection.

Details

Motivation: Address limitations of existing diffusion-based style transfer methods that rely on computationally intensive inversion processes, which compromise efficiency and cause visual distortions when inversion is inaccurate.

Method: Uses dual rectified flows to predict content and style trajectories in parallel, then fuses them through dynamic midpoint interpolation that integrates velocities from both paths. Incorporates attention injection to guide style integration and jointly models content, style, and stylized distributions.

Result: Achieves robust fusion without naive overlays, demonstrates generalization across diverse styles and content, provides enhanced visual fidelity and content preservation while being computationally efficient.

Conclusion: The proposed inversion-free framework offers an effective and efficient pipeline for style transfer that avoids the shortcomings of inversion-dependent methods while maintaining high-quality results.

Abstract: Style transfer, a pivotal task in image processing, synthesizes visually compelling images by seamlessly blending realistic content with artistic styles, enabling applications in photo editing and creative design. While mainstream training-free diffusion-based methods have greatly advanced style transfer in recent years, their reliance on computationally inversion processes compromises efficiency and introduces visual distortions when inversion is inaccurate. To address these limitations, we propose a novel \textit{inversion-free} style transfer framework based on dual rectified flows, which tackles the challenge of finding an unknown stylized distribution from two distinct inputs (content and style images), \textit{only with forward pass}. Our approach predicts content and style trajectories in parallel, then fuses them through a dynamic midpoint interpolation that integrates velocities from both paths while adapting to the evolving stylized image. By jointly modeling the content, style, and stylized distributions, our velocity field design achieves robust fusion and avoids the shortcomings of naive overlays. Attention injection further guides style integration, enhancing visual fidelity, content preservation, and computational efficiency. Extensive experiments demonstrate generalization across diverse styles and content, providing an effective and efficient pipeline for style transfer.

[113] RefOnce: Distilling References into a Prototype Memory for Referring Camouflaged Object Detection

Yu-Huan Wu, Zi-Xuan Zhu, Yan Wang, Liangli Zhen, Deng-Ping Fan

Main category: cs.CV

TL;DR: A new Ref-COD framework that distills reference images into class prototypes during training, eliminating the need for test-time references through query-conditioned prototype mixing.

Details

Motivation: Current Ref-COD systems require reference images at test time, which limits deployability, adds latency, and increases data-collection burden.

Method: Maintains EMA-updated class prototypes during training and synthesizes reference vectors at inference via query-conditioned prototype mixture. Uses bidirectional attention alignment to bridge representation gaps between references and camouflaged queries.

Result: Achieves competitive or superior performance on the R2C7K benchmark compared to state-of-the-art methods.

Conclusion: Proposes an efficient Ref-COD approach that eliminates mandatory test-time references while maintaining strong performance.

Abstract: Referring Camouflaged Object Detection (Ref-COD) segments specified camouflaged objects in a scene by leveraging a small set of referring images. Though effective, current systems adopt a dual-branch design that requires reference images at test time, which limits deployability and adds latency and data-collection burden. We introduce a Ref-COD framework that distills references into a class-prototype memory during training and synthesizes a reference vector at inference via a query-conditioned mixture of prototypes. Concretely, we maintain an EMA-updated prototype per category and predict mixture weights from the query to produce a guidance vector without any test-time references. To bridge the representation gap between reference statistics and camouflaged query features, we propose a bidirectional attention alignment module that adapts both the query features and the class representation. Thus, our approach yields a simple, efficient path to Ref-COD without mandatory references. We evaluate the proposed method on the large-scale R2C7K benchmark. Extensive experiments demonstrate competitive or superior performance of the proposed method compared with recent state-of-the-arts. Code is available at https://github.com/yuhuan-wu/RefOnce.

[114] Wavefront-Constrained Passive Obscured Object Detection

Zhiwen Zheng, Yiwei Ouyang, Zhao Huang, Tao Zhang, Xiaoshuai Zhang, Huiyu Zhou, Wenwen Tang, Shaowei Jiang, Jin Liu, Xingru Huang

Main category: cs.CV

TL;DR: WavePCNet: A physics-driven network that uses complex amplitude modeling and frequency-selective pathways to accurately localize and segment obscured objects from faint light patterns beyond the field of view, outperforming existing methods in accuracy and robustness.

Details

Motivation: Existing methods based on real-valued modeling or local convolutional operations are inadequate for capturing coherent light propagation physics and often converge to non-physical solutions under low signal-to-noise conditions, compromising stability and reliability.

Method: Proposes WavePCNet with Tri-Phase Wavefront Complex-Propagation Reprojection (TriWCP) for precise coherent propagation constraints, momentum memory mechanism for perturbation suppression, and High-frequency Cross-layer Compensation Enhancement for frequency-selective pathways and structural consistency modeling.

Result: Extensive experiments on four physically collected datasets demonstrate that WavePCNet consistently outperforms state-of-the-art methods across both accuracy and robustness metrics.

Conclusion: The proposed physics-driven approach effectively addresses the challenges of localizing obscured objects from faint light patterns by incorporating coherent propagation physics and perturbation suppression mechanisms, achieving superior performance in complex environmental conditions.

Abstract: Accurately localizing and segmenting obscured objects from faint light patterns beyond the field of view is highly challenging due to multiple scattering and medium-induced perturbations. Most existing methods, based on real-valued modeling or local convolutional operations, are inadequate for capturing the underlying physics of coherent light propagation. Moreover, under low signal-to-noise conditions, these methods often converge to non-physical solutions, severely compromising the stability and reliability of the observation. To address these challenges, we propose a novel physics-driven Wavefront Propagating Compensation Network (WavePCNet) to simulate wavefront propagation and enhance the perception of obscured objects. This WavePCNet integrates the Tri-Phase Wavefront Complex-Propagation Reprojection (TriWCP) to incorporate complex amplitude transfer operators to precisely constrain coherent propagation behavior, along with a momentum memory mechanism to effectively suppress the accumulation of perturbations. Additionally, a High-frequency Cross-layer Compensation Enhancement is introduced to construct frequency-selective pathways with multi-scale receptive fields and dynamically model structural consistency across layers, further boosting the model’s robustness and interpretability under complex environmental conditions. Extensive experiments conducted on four physically collected datasets demonstrate that WavePCNet consistently outperforms state-of-the-art methods across both accuracy and robustness.

[115] GuardTrace-VL: Detecting Unsafe Multimodel Reasoning via Iterative Safety Supervision

Yuxiao Xiang, Junchi Chen, Zhenchao Jin, Changtao Miao, Haojie Yuan, Qi Chu, Tao Gong, Nenghai Yu

Main category: cs.CV

TL;DR: GuardTrace-VL is a vision-aware safety auditor that monitors multimodal reasoning models to detect unsafe content in intermediate reasoning traces, not just final answers, achieving 93.1% F1 score on unsafe reasoning detection.

Details

Motivation: Existing multimodal safety guards only evaluate input questions and final answers, missing unsafe content that emerges during intermediate reasoning processes, creating deployment risks.

Method: Introduces GuardTrace-VL with joint image-text analysis of the full Question-Thinking-Answer pipeline, using a three-stage progressive training scheme and data refinement via MLRM-human voting pipeline on the GuardTrace dataset.

Result: Achieves 93.1% F1 score on unsafe reasoning detection, representing 13.5% improvement over previous multimodal safety defense methods on both in-domain and out-of-domain test scenarios.

Conclusion: GuardTrace-VL effectively addresses the critical gap in multimodal safety by detecting unsafe content during reasoning stages, significantly improving safety monitoring for multimodal large reasoning models.

Abstract: Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via a MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection tasks, representing a 13.5% improvement in F1 score compared to the previous strongest multimodal safety defense methods. The codes will be made publicly available.

[116] From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition

Jingxi Chen, Yixiao Zhang, Xiaoye Qian, Zongxia Li, Cornelia Fermuller, Caren Chen, Yiannis Aloimonos

Main category: cs.CV

TL;DR: The paper proposes adapting a diffusion-based inpainting model for image layer decomposition using lightweight finetuning and a multi-modal context fusion module, achieving superior performance in object removal and occlusion recovery.

Details

Motivation: Images can be viewed as layered compositions (foreground over background with occlusions), enabling independent editing for content creation flexibility. However, decomposing single images into layers remains challenging due to limited methods and data.

Method: Adapt a diffusion-based inpainting model for layer decomposition using lightweight finetuning. Introduce a novel multi-modal context fusion module with linear attention complexity to preserve detail in latent space. Train purely on synthetic dataset from open-source assets.

Result: Achieves superior performance in object removal and occlusion recovery. Unlocks new possibilities in downstream editing and creative applications.

Conclusion: The proposed approach successfully bridges layer decomposition with in/outpainting tasks, demonstrating effective adaptation of existing models for complex image decomposition tasks using synthetic training data.

Abstract: Images can be viewed as layered compositions, foreground objects over background, with potential occlusions. This layered representation enables independent editing of elements, offering greater flexibility for content creation. Despite the progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition using lightweight finetuning. To further preserve detail in the latent space, we introduce a novel multi-modal context fusion module with linear attention complexity. Our model is trained purely on a synthetic dataset constructed from open-source assets and achieves superior performance in object removal and occlusion recovery, unlocking new possibilities in downstream editing and creative applications.

[117] Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning

Xiaoxing You, Qiang Huang, Lingyu Li, Chi Zhang, Xiaopeng Liu, Min Zhang, Jun Yu

Main category: cs.CV

TL;DR: MERGE is a multimodal entity-aware retrieval-augmented generation framework that addresses key challenges in news image captioning through enriched knowledge retrieval, improved cross-modal alignment, and enhanced visual-entity grounding.

Details

Motivation: Existing methods struggle with incomplete information coverage, weak cross-modal alignment, and suboptimal visual-entity grounding in news image captioning.

Method: Constructs entity-centric multimodal knowledge base integrating textual, visual, and structured knowledge; uses multistage hypothesis-caption strategy for cross-modal alignment; employs dynamic retrieval guided by image content for visual-entity matching.

Result: Significantly outperforms state-of-the-art baselines on GoodNews and NYTimes800k with CIDEr gains of +6.84 and +1.16, and F1-score improvements of +4.14 and +2.64 in named entity recognition. Generalizes well to unseen Visual News dataset with +20.17 CIDEr and +6.22 F1-score.

Conclusion: MERGE demonstrates strong robustness and domain adaptability, effectively addressing key challenges in news image captioning through its multimodal entity-aware retrieval-augmented approach.

Abstract: News image captioning aims to produce journalistically informative descriptions by combining visual content with contextual cues from associated articles. Despite recent advances, existing methods struggle with three key challenges: (1) incomplete information coverage, (2) weak cross-modal alignment, and (3) suboptimal visual-entity grounding. To address these issues, we introduce MERGE, the first Multimodal Entity-aware Retrieval-augmented GEneration framework for news image captioning. MERGE constructs an entity-centric multimodal knowledge base (EMKB) that integrates textual, visual, and structured knowledge, enabling enriched background retrieval. It improves cross-modal alignment through a multistage hypothesis-caption strategy and enhances visual-entity matching via dynamic retrieval guided by image content. Extensive experiments on GoodNews and NYTimes800k show that MERGE significantly outperforms state-of-the-art baselines, with CIDEr gains of +6.84 and +1.16 in caption quality, and F1-score improvements of +4.14 and +2.64 in named entity recognition. Notably, MERGE also generalizes well to the unseen Visual News dataset, achieving +20.17 in CIDEr and +6.22 in F1-score, demonstrating strong robustness and domain adaptability.

[118] MetaRank: Task-Aware Metric Selection for Model Transferability Estimation

Yuhang Liu, Wenjie Zhao, Yunhui Guo

Main category: cs.CV

TL;DR: MetaRank is a meta-learning framework that automatically selects the most appropriate Model Transferability Estimation (MTE) metric for a given target dataset by learning the relationship between dataset characteristics and metric performance.

Details

Motivation: Current MTE metric selection is ad hoc or based on average historical performance, but no single metric works optimally across all datasets. The effectiveness of MTE metrics is highly task-dependent.

Method: Formulates metric selection as a learning-to-rank problem. Uses pretrained language models to encode textual descriptions of datasets and metrics into a shared semantic space. Trains a meta-predictor offline on diverse meta-tasks with listwise optimization to prioritize correct ranking of top-performing metrics.

Result: Extensive experiments across 11 pretrained models and 11 target datasets demonstrate strong effectiveness of MetaRank in selecting optimal MTE metrics for different datasets.

Conclusion: MetaRank provides an automated, task-aware approach for MTE metric selection that outperforms ad hoc selection methods, enabling practitioners to choose the most appropriate metric for their specific target dataset.

Abstract: Selecting an appropriate pre-trained source model is a critical, yet computationally expensive, task in transfer learning. Model Transferability Estimation (MTE) methods address this by providing efficient proxy metrics to rank models without full fine-tuning. In practice, the choice of which MTE metric to use is often ad hoc or guided simply by a metric’s average historical performance. However, we observe that the effectiveness of MTE metrics is highly task-dependent and no single metric is universally optimal across all target datasets. To address this gap, we introduce MetaRank, a meta-learning framework for automatic, task-aware MTE metric selection. We formulate metric selection as a learning-to-rank problem. Rather than relying on conventional meta-features, MetaRank encodes textual descriptions of both datasets and MTE metrics using a pretrained language model, embedding them into a shared semantic space. A meta-predictor is then trained offline on diverse meta-tasks to learn the intricate relationship between dataset characteristics and metric mechanisms, optimized with a listwise objective that prioritizes correctly ranking the top-performing metrics. During the subsequent online phase, MetaRank efficiently ranks the candidate MTE metrics for a new, unseen target dataset based on its textual description, enabling practitioners to select the most appropriate metric a priori. Extensive experiments across 11 pretrained models and 11 target datasets demonstrate the strong effectiveness of our approach.

[119] Structure-Aware Prototype Guided Trusted Multi-View Classification

Haojian Huang, Jiahao Shi, Zhe Liu, Harold Haodong Chen, Han Fang, Hao Sun, Zhongjiang He

Main category: cs.CV

TL;DR: Proposes a novel trustworthy multi-view classification framework using prototypes to represent neighbor structures, enabling efficient cross-view consensus discovery with competitive performance and robustness.

Details

Motivation: Existing TMVC methods have high computational costs from dense neighbor relationships, cannot ensure inter-view consistency, and lack guarantees for learned multi-view structures being consistent in class space, undermining trustworthiness.

Method: Introduces prototypes to represent neighbor structures of each view, simplifies intra-view neighbor relation learning, and enables dynamic alignment of intra- and inter-view structures for efficient cross-view consensus discovery.

Result: Extensive experiments on multiple public multi-view datasets demonstrate competitive downstream performance and robustness compared to prevalent TMVC methods.

Conclusion: The proposed prototype-based TMVC framework effectively addresses limitations of existing methods by enabling more efficient and consistent discovery of cross-view consensus while maintaining competitive performance.

Abstract: Trustworthy multi-view classification (TMVC) addresses the challenge of achieving reliable decision-making in complex scenarios where multi-source information is heterogeneous, inconsistent, or even conflicting. Existing TMVC approaches predominantly rely on globally dense neighbor relationships to model intra-view dependencies, leading to high computational costs and an inability to directly ensure consistency across inter-view relationships. Furthermore, these methods typically aggregate evidence from different views through manually assigned weights, lacking guarantees that the learned multi-view neighbor structures are consistent within the class space, thus undermining the trustworthiness of classification outcomes. To overcome these limitations, we propose a novel TMVC framework that introduces prototypes to represent the neighbor structures of each view. By simplifying the learning of intra-view neighbor relations and enabling dynamic alignment of intra- and inter-view structure, our approach facilitates more efficient and consistent discovery of cross-view consensus. Extensive experiments on multiple public multi-view datasets demonstrate that our method achieves competitive downstream performance and robustness compared to prevalent TMVC methods.

[120] CameraMaster: Unified Camera Semantic-Parameter Control for Photography Retouching

Qirui Yang, Yang Yang, Ying Zeng, Xiaobin Hu, Bo Li, Huanjing Yue, Jingyu Yang, Peng-Tao Jiang

Main category: cs.CV

TL;DR: CameraMaster is a unified framework for image retouching that explicitly decouples camera directives and parameter embeddings to achieve precise camera control while maintaining semantic consistency.

Details

Motivation: Existing methods for image retouching either rely on ambiguous text prompts that hinder precise camera control, or use separate training heads that compromise scalability and multi-parameter composition. There's a need for a unified approach that can handle precise parameter adjustments while maintaining physical consistency.

Method: CameraMaster explicitly decouples camera directive and parameter information, using camera parameter embeddings to modulate both directive and content semantics. It injects modulated directives via cross-attention and uses directive/camera embeddings as conditioning signals in the denoising process for semantic-parameter alignment.

Result: Extensive experiments show CameraMaster produces monotonic and near-linear responses to parameter variations, supports seamless multi-parameter composition, and significantly outperforms existing methods. The framework was trained on a dataset of 78K image-prompt pairs with camera parameter annotations.

Conclusion: CameraMaster successfully addresses the limitations of existing methods by providing a unified framework that enables precise camera parameter control while maintaining semantic consistency, supporting multi-parameter composition, and being sensitive to subtle variations.

Abstract: Text-guided diffusion models have greatly advanced image editing and generation. However, achieving physically consistent image retouching with precise parameter control (e.g., exposure, white balance, zoom) remains challenging. Existing methods either rely solely on ambiguous and entangled text prompts, which hinders precise camera control, or train separate heads/weights for parameter adjustment, which compromises scalability, multi-parameter composition, and sensitivity to subtle variations. To address these limitations, we propose CameraMaster, a unified camera-aware framework for image retouching. The key idea is to explicitly decouple the camera directive and then coherently integrate two critical information streams: a directive representation that captures the photographer’s intent, and a parameter embedding that encodes precise camera settings. CameraMaster first uses the camera parameter embedding to modulate both the camera directive and the content semantics. The modulated directive is then injected into the content features via cross-attention, yielding a strongly camera-sensitive semantic context. In addition, the directive and camera embeddings are injected as conditioning and gating signals into the time embedding, enabling unified, layer-wise modulation throughout the denoising process and enforcing tight semantic-parameter alignment. To train and evaluate CameraMaster, we construct a large-scale dataset of 78K image-prompt pairs annotated with camera parameters. Extensive experiments show that CameraMaster produces monotonic and near-linear responses to parameter variations, supports seamless multi-parameter composition, and significantly outperforms existing methods.

[121] CaptionQA: Is Your Caption as Useful as the Image Itself?

Shijia Yang, Yunong Liu, Bohan Zhai, Ximeng Sun, Zicheng Liu, Emad Barsoum, Manling Li, Chenfeng Xu

Main category: cs.CV

TL;DR: CaptionQA is a utility-based benchmark that evaluates image captions by measuring how well they support downstream tasks across 4 domains, revealing significant gaps between image and caption utility.

Details

Motivation: Current evaluation practices miss whether captions can effectively substitute for images in real downstream tasks like retrieval, recommendation, and multi-step agentic inference pipelines.

Method: Built 33,027 densely annotated multiple-choice questions across 4 domains with fine-grained taxonomies, using an LLM to answer questions using captions alone to directly measure caption utility.

Result: Evaluation reveals substantial gaps between image and caption utility, with models nearly identical on traditional benchmarks differing by up to 32% in caption utility.

Conclusion: CaptionQA provides a comprehensive probe of caption utility and reveals that current captioning models have significant limitations in preserving image-level utility for downstream tasks.

Abstract: Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: Can captions stand-in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well it supports downstream tasks. CaptionQA is an extensible domain-dependent benchmark covering 4 domains–Natural, Document, E-commerce, and Embodied AI–each with fine-grained taxonomies (25 top-level and 69 subcategories) that identify useful information for domain-specific tasks. CaptionQA builds 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and are utilizable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the image and its caption utility. Notably, models nearly identical on traditional image-QA benchmarks lower by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains. The code is available at https://github.com/bronyayang/CaptionQA.

[122] FlowerDance: MeanFlow for Efficient and Refined 3D Dance Generation

Kaixing Yang, Xulong Tang, Ziqiao Peng, Xiangyue Zhang, Puwei Wang, Jun He, Hongyan Liu

Main category: cs.CV

TL;DR: FlowerDance is an efficient music-to-dance generation method that combines MeanFlow with Physical Consistency Constraints for high-quality motion generation with few sampling steps, using a BiMamba-based architecture for non-autoregressive generation with motion editing capabilities.

Details

Motivation: Existing music-to-dance generation methods have limited efficiency, leaving insufficient computational resources for high-fidelity 3D rendering, which constrains the expressiveness of 3D characters in real-world applications.

Method: Combines MeanFlow with Physical Consistency Constraints for high-quality motion generation with few sampling steps. Uses BiMamba-based backbone with Channel-Level Cross-Modal Fusion for efficient non-autoregressive generation. Supports motion editing for interactive refinement.

Result: Achieves state-of-the-art results on AIST++ and FineDance datasets in both motion quality and generation efficiency, with significant improvements in inference speed and memory utilization.

Conclusion: FlowerDance provides an efficient solution for music-to-dance generation that enables high-quality motion with physical plausibility and artistic expressiveness while maintaining computational efficiency for real-world applications.

Abstract: Music-to-dance generation aims to translate auditory signals into expressive human motion, with broad applications in virtual reality, choreography, and digital entertainment. Despite promising progress, the limited generation efficiency of existing methods leaves insufficient computational headroom for high-fidelity 3D rendering, thereby constraining the expressiveness of 3D characters during real-world applications. Thus, we propose FlowerDance, which not only generates refined motion with physical plausibility and artistic expressiveness, but also achieves significant generation efficiency on inference speed and memory utilization . Specifically, FlowerDance combines MeanFlow with Physical Consistency Constraints, which enables high-quality motion generation with only a few sampling steps. Moreover, FlowerDance leverages a simple but efficient model architecture with BiMamba-based backbone and Channel-Level Cross-Modal Fusion, which generates dance with efficient non-autoregressive manner. Meanwhile, FlowerDance supports motion editing, enabling users to interactively refine dance sequences. Extensive experiments on AIST++ and FineDance show that FlowerDance achieves state-of-the-art results in both motion quality and generation efficiency. Code will be released upon acceptance.

[123] LungNoduleAgent: A Collaborative Multi-Agent System for Precision Diagnosis of Lung Nodules

Cheng Yang, Hui Jin, Xinlei Yu, Zhipeng Wang, Yaoqun Liu, Fenglei Fan, Dajiang Lei, Gangyong Jia, Changmiao Wang, Ruiquan Ge

Main category: cs.CV

TL;DR: LungNoduleAgent is a collaborative multi-agent system that improves lung nodule diagnosis in CT scans through sequential modules for nodule detection, comprehensive reporting, and malignancy grading.

Details

Motivation: Current multimodal LLMs struggle with accurate nodule morphology description and medical expertise integration, limiting clinical reliability. Multi-agent systems offer potential for balancing generality and precision in medical applications.

Method: Three sequential modules: Nodule Spotter coordinates detection models to identify nodules; Radiologist uses localized image description for CT reports; Doctor Agent System performs malignancy reasoning using images, reports, pathology knowledge base, and multi-agent framework.

Result: Outperforms mainstream vision-language models, agent systems, and expert models on two private datasets and public LIDC-IDRI dataset, demonstrating superior nodule diagnosis capabilities.

Conclusion: LungNoduleAgent shows the importance of region-level semantic alignment and multi-agent collaboration for lung nodule diagnosis, serving as a promising clinical analysis tool.

Abstract: Diagnosing lung cancer typically involves physicians identifying lung nodules in Computed tomography (CT) scans and generating diagnostic reports based on their morphological features and medical expertise. Although advancements have been made in using multimodal large language models for analyzing lung CT scans, challenges remain in accurately describing nodule morphology and incorporating medical expertise. These limitations affect the reliability and effectiveness of these models in clinical settings. Collaborative multi-agent systems offer a promising strategy for achieving a balance between generality and precision in medical applications, yet their potential in pathology has not been thoroughly explored. To bridge these gaps, we introduce LungNoduleAgent, an innovative collaborative multi-agent system specifically designed for analyzing lung CT scans. LungNoduleAgent streamlines the diagnostic process into sequential components, improving precision in describing nodules and grading malignancy through three primary modules. The first module, the Nodule Spotter, coordinates clinical detection models to accurately identify nodules. The second module, the Radiologist, integrates localized image description techniques to produce comprehensive CT reports. Finally, the Doctor Agent System performs malignancy reasoning by using images and CT reports, supported by a pathology knowledge base and a multi-agent system framework. Extensive testing on two private datasets and the public LIDC-IDRI dataset indicates that LungNoduleAgent surpasses mainstream vision-language models, agent systems, and advanced expert models. These results highlight the importance of region-level semantic alignment and multi-agent collaboration in diagnosing nodules. LungNoduleAgent stands out as a promising foundational tool for supporting clinical analyses of lung nodules.

[124] PG-ControlNet: A Physics-Guided ControlNet for Generative Spatially Varying Image Deblurring

Hakki Motorcu, Mujdat Cetin

Main category: cs.CV

TL;DR: A novel framework that combines generative models with physical constraints for spatially varying image deblurring, using dense kernel fields to guide diffusion sampling.

Details

Motivation: Existing methods either produce over-smoothed results with physical constraints (model-based) or hallucinate details with weak constraints (generative). There's a need to bridge physical accuracy and perceptual realism.

Method: Models degradation as dense continuum of high-dimensional compressed kernels, then uses this descriptor field to condition a ControlNet architecture to guide diffusion sampling process.

Result: Outperforms state-of-the-art model-based and generative methods in challenging blurred scenarios, effectively bridging physical accuracy and perceptual realism.

Conclusion: The proposed framework successfully reconciles physical constraints with generative priors for superior spatially varying image deblurring.

Abstract: Spatially varying image deblurring remains a fundamentally ill-posed problem, especially when degradations arise from complex mixtures of motion and other forms of blur under significant noise. State-of-the-art learning-based approaches generally fall into two paradigms: model-based deep unrolling methods that enforce physical constraints by modeling the degradations, but often produce over-smoothed, artifact-laden textures, and generative models that achieve superior perceptual quality yet hallucinate details due to weak physical constraints. In this paper, we propose a novel framework that uniquely reconciles these paradigms by taming a powerful generative prior with explicit, dense physical constraints. Rather than oversimplifying the degradation field, we model it as a dense continuum of high-dimensional compressed kernels, ensuring that minute variations in motion and other degradation patterns are captured. We leverage this rich descriptor field to condition a ControlNet architecture, strongly guiding the diffusion sampling process. Extensive experiments demonstrate that our method effectively bridges the gap between physical accuracy and perceptual realism, outperforming state-of-the-art model-based methods as well as generative baselines in challenging, severely blurred scenarios.

[125] MUSE: Manipulating Unified Framework for Synthesizing Emotions in Images via Test-Time Optimization

Yingjie Xia, Xi Wang, Jinglei Shi, Vicky Kalogeiton, Jian Yang

Main category: cs.CV

TL;DR: MUSE is a unified framework for both emotional image generation and editing that uses test-time scaling with an off-the-shelf emotion classifier, addressing how, when, and which emotions to guide synthesis without requiring specialized datasets or model updates.

Details

Motivation: Current Image Emotional Synthesis approaches artificially separate generation and editing tasks, creating inefficiencies and limiting applications where these tasks naturally intertwine, such as therapeutic interventions or storytelling.

Method: Adopts Test-Time Scaling strategy with gradient-based optimization of emotional tokens using an off-the-shelf emotion classifier, identifies optimal timing using semantic similarity, and employs multi-emotion loss to reduce interference from inherent and similar emotions.

Result: MUSE performs favorably against all methods for both generation and editing, improving emotional accuracy and semantic diversity while maintaining optimal balance between desired content, text prompt adherence, and realistic emotional expression.

Conclusion: MUSE establishes a new paradigm for emotion synthesis by providing a unified framework that efficiently handles both generation and editing tasks without requiring specialized datasets or model updates.

Abstract: Images evoke emotions that profoundly influence perception, often prioritized over content. Current Image Emotional Synthesis (IES) approaches artificially separate generation and editing tasks, creating inefficiencies and limiting applications where these tasks naturally intertwine, such as therapeutic interventions or storytelling. In this work, we introduce MUSE, the first unified framework capable of both emotional generation and editing. By adopting a strategy conceptually aligned with Test-Time Scaling (TTS) that widely used in both LLM and diffusion model communities, it avoids the requirement for additional updating diffusion model and specialized emotional synthesis datasets. More specifically, MUSE addresses three key questions in emotional synthesis: (1) HOW to stably guide synthesis by leveraging an off-the-shelf emotion classifier with gradient-based optimization of emotional tokens; (2) WHEN to introduce emotional guidance by identifying the optimal timing using semantic similarity as a supervisory signal; and (3) WHICH emotion to guide synthesis through a multi-emotion loss that reduces interference from inherent and similar emotions. Experimental results show that MUSE performs favorably against all methods for both generation and editing, improving emotional accuracy and semantic diversity while maintaining an optimal balance between desired content, adherence to text prompts, and realistic emotional expression. It establishes a new paradigm for emotion synthesis.

[126] AnchorOPT: Towards Optimizing Dynamic Anchors for Adaptive Prompt Learning

Zheng Li, Yibing Song, Xin Zhang, Lei Luo, Xiang Li, Jian Yang

Main category: cs.CV

TL;DR: AnchorOPT introduces dynamic anchor-based prompt learning with learnable anchor values and adaptive positional relationships between anchors and soft tokens, achieving competitive performance without additional modules.

Details

Motivation: Existing prompt learning methods use static anchors that lack cross-task and stage-adaptive flexibility, limiting their generalization capabilities.

Method: Two-stage training: first learn anchor tokens from task-specific data, then freeze them and optimize soft tokens and a learnable position matrix that adapts to training stage and task context.

Result: Achieves performance comparable to or exceeding methods with additional learnable modules or regularization techniques, with consistent gains across diverse datasets.

Conclusion: AnchorOPT provides an effective plug-and-play module that enhances CLIP generalization through dynamic anchor optimization.

Abstract: Existing prompt learning methods, which are built upon CLIP models, leverage textual tokens as anchors to guide the learnable soft tokens. This guidance improves CLIP generalizations. However, these anchors-static in both value and position-lack cross-task and stage-adaptive flexibility. To address this limitation, we propose AnchorOPT, a dynamic anchor-based prompt learning framework. Specifically, AnchorOPT introduces dynamism in two key dimensions: (i) anchor values eschew handcrafted explicit textual tokens (e.g., “shape”, “color”), instead learning dynamically from task-specific data; and (ii) the positional relationship between anchor and soft tokens is no longer fixed but adaptively optimized via a learnable position matrix conditioned on the training stage and task context. Training occurs in two stages: we first learn the anchor tokens, then freeze and transfer them to the second stage for optimization of soft tokens and the position matrix. Extensive experiments demonstrate that using only a simple learnable anchor and position matrix achieves performance comparable to or exceeding some methods incorporating additional learnable modules or regularization techniques. As a plug-and-play module, AnchorOPT integrates seamlessly into existing frameworks, yielding consistent performance gains across diverse datasets. Code is publicly available at https://github.com/zhengli97/ATPrompt.

[127] Long-Term Alzheimers Disease Prediction: A Novel Image Generation Method Using Temporal Parameter Estimation with Normal Inverse Gamma Distribution on Uneven Time Series

Xin Hong, Xinze Sun, Yinhao Li, Yen-Wei Chen

Main category: cs.CV

TL;DR: Proposes T-NIG model using Normal Inverse Gamma Distribution with time parameter for long-term Alzheimer’s Disease prediction via image generation, handling irregular time intervals while maintaining disease characteristics.

Details

Motivation: Long-term AD predictions face difficulties maintaining disease-related characteristics when dealing with irregular time intervals in sequential data, as time-related aspects reflect disease changes in unevenly distributed images.

Method: T-NIG model estimates temporal parameter within Normal Inverse Gamma Distribution, uses brain images from two time points to create intermediate/future images, identifies features using coordinate neighborhoods, and incorporates uncertainty estimation to reduce epistemic and aleatoric uncertainties.

Result: T-NIG demonstrates state-of-the-art performance in both short-term and long-term prediction tasks, proficient in forecasting disease progression while maintaining disease-related characteristics despite irregular temporal data distribution.

Conclusion: The proposed T-NIG model effectively handles irregular time intervals in AD prediction through temporal parameter estimation in Normal Inverse Gamma Distribution, achieving superior performance in maintaining disease characteristics during long-term forecasting.

Abstract: Image generation can provide physicians with an imaging diagnosis basis in the prediction of Alzheimer’s Disease (AD). Recent research has shown that long-term AD predictions by image generation often face difficulties maintaining disease-related characteristics when dealing with irregular time intervals in sequential data. Considering that the time-related aspects of the distribution can reflect changes in disease-related characteristics when images are distributed unevenly, this research proposes a model to estimate the temporal parameter within the Normal Inverse Gamma Distribution (T-NIG) to assist in generating images over the long term. The T-NIG model employs brain images from two different time points to create intermediate brain images, forecast future images, and predict the disease. T-NIG is designed by identifying features using coordinate neighborhoods. It incorporates a time parameter into the normal inverse gamma distribution to understand how features change in brain imaging sequences that have varying time intervals. Additionally, T-NIG utilizes uncertainty estimation to reduce both epistemic and aleatoric uncertainties in the model, which arise from insufficient temporal data. In particular, the T-NIG model demonstrates state-of-the-art performance in both short-term and long-term prediction tasks within the dataset. Experimental results indicate that T-NIG is proficient in forecasting disease progression while maintaining disease-related characteristics, even when faced with an irregular temporal data distribution.

[128] MIRA: Multimodal Iterative Reasoning Agent for Image Editing

Ziyun Zeng, Hang Hua, Jiebo Luo

Main category: cs.CV

TL;DR: MIRA is a lightweight multimodal reasoning agent that improves instruction-guided image editing by using iterative perception-reasoning-action loops to handle complex instructions, achieving performance comparable to proprietary systems.

Details

Motivation: Diffusion-based editing models struggle to accurately interpret complex user instructions involving compositional relationships, contextual cues, or referring expressions, leading to semantic drift and failed edits.

Method: MIRA performs editing through iterative perception-reasoning-action loops, predicting atomic edit instructions step by step using visual feedback. It’s trained on a 150K multimodal dataset (MIRA-Editing) with a two-stage SFT + GRPO pipeline.

Result: When paired with open-source image editing models, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems like GPT-Image and Nano-Banana.

Conclusion: MIRA effectively addresses the limitations of current diffusion-based editing models by simulating multi-turn human-model interaction processes, enabling more accurate interpretation and execution of complex editing instructions.

Abstract: Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem by proposing MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction processes. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions. When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana.

[129] CLRecogEye : Curriculum Learning towards exploiting convolution features for Dynamic Iris Recognition

Geetanjali Sharma, Gaurav Jaswal, Aditya Nigam, Raghavendra Ramachandra

Main category: cs.CV

TL;DR: Proposes a novel iris authentication pipeline using 3D-CNN to capture spatio-temporal features, trained with curriculum learning and triplet/ArcFace losses for robustness against rotation, scale, reflections, and blur.

Details

Motivation: Existing iris authentication methods lack robustness to variations (rotation, scale, reflections, blur) and fail to leverage spatio-temporal structure of iris patterns, relying on simple point-to-point comparisons.

Method: Splits iris images into sequences of sub-images, processes with 3D-CNN to capture spatial-temporal features, trained end-to-end with triplet and ArcFace loss in curriculum manner.

Result: Achieves robust and generalizable iris authentication by embedding temporal dependencies directly into feature space, improving discriminability despite challenging variations.

Conclusion: The proposed framework provides a robust solution for real-world iris authentication applications by effectively modeling spatio-temporal iris patterns through curriculum learning and deep metric embeddings.

Abstract: Iris authentication algorithms have achieved impressive recognition performance, making them highly promising for real-world applications such as border control, citizen identification, and both criminal investigations and commercial systems. However, their robustness is still challenged by variations in rotation, scale, specular reflections, and defocus blur. In addition, most existing approaches rely on straightforward point-to-point comparisons, typically using cosine or L2 distance, without effectively leveraging the spatio-spatial-temporal structure of iris patterns. To address these limitations, we propose a novel and generalized matching pipeline that learns rich spatio-spatial-temporal representations of iris features. Our approach first splits each iris image along one dimension, generating a sequence of sub-images that serve as input to a 3D-CNN, enabling the network to capture both spatial and spatio-spatial-temporal cues. To further enhance the modeling of spatio-spatial-temporal feature dynamics, we train the model in curriculum manner. This design allows the network to embed temporal dependencies directly into the feature space, improving discriminability in the deep metric domain. The framework is trained end-to-end with triplet and ArcFace loss in a curriculum manner, enforcing highly discriminative embeddings despite challenges like rotation, scale, reflections, and blur. This design yields a robust and generalizable solution for iris authentication.Github code: https://github.com/GeetanjaliGTZ/CLRecogEye

[130] Do Reasoning Vision-Language Models Inversely Scale in Test-Time Compute? A Distractor-centric Empirical Analysis

Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee

Main category: cs.CV

TL;DR: Visual distractors in vision-language models cause inverse scaling (reduced accuracy) without increasing reasoning length, unlike textual distractors. The study introduces Idis dataset and shows how attribute tracking in reasoning traces reveals distractor effects.

Details

Motivation: To investigate whether visual distractors cause similar inverse scaling effects as observed with textual distractors in language models, and understand how distractors affect reasoning in multimodal settings.

Method: Created Idis dataset with systematic visual distractors across semantic, numerical, and spatial dimensions. Analyzed reasoning traces to track attribute counts and understand distractor effects. Extended analysis to Waterbirds benchmark and proposed prompting strategies.

Result: Visual distractors cause inverse scaling (accuracy decreases) but don’t increase reasoning length like textual distractors. Attribute tracking in reasoning traces reveals how distractors affect model performance. The effects extend to established bias benchmarks.

Conclusion: Visual distractors fundamentally differ from textual ones in their effects on reasoning. Simple prompting strategies can help mitigate bias-driven predictions in reasoning models affected by distractors.

Abstract: How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.

[131] Pygmalion Effect in Vision: Image-to-Clay Translation for Reflective Geometry Reconstruction

Gayoung Lee, Junho Kim, Jin-Hwa Kim, Junmo Kim

Main category: cs.CV

TL;DR: Pygmalion Effect in Vision framework uses image-to-clay translation to suppress reflections while preserving geometry for robust 3D reconstruction of reflective objects.

Details

Motivation: Reflection remains challenging in 3D reconstruction due to entanglement of appearance and geometry under view-dependent reflections.

Method: Dual-branch network with BRDF-based reflective branch and clay-guided branch, trained jointly using synthesized clay-like images as reflection-free supervision.

Result: Substantial improvement in normal accuracy and mesh completeness over existing reflection-handling methods on synthetic and real datasets.

Conclusion: Seeing by unshining (translating radiance into neutrality) serves as powerful inductive bias for reflective object geometry learning.

Abstract: Understanding reflection remains a long-standing challenge in 3D reconstruction due to the entanglement of appearance and geometry under view-dependent reflections. In this work, we present the Pygmalion Effect in Vision, a novel framework that metaphorically “sculpts” reflective objects into clay-like forms through image-to-clay translation. Inspired by the myth of Pygmalion, our method learns to suppress specular cues while preserving intrinsic geometric consistency, enabling robust reconstruction from multi-view images containing complex reflections. Specifically, we introduce a dual-branch network in which a BRDF-based reflective branch is complemented by a clay-guided branch that stabilizes geometry and refines surface normals. The two branches are trained jointly using the synthesized clay-like images, which provide a neutral, reflection-free supervision signal that complements the reflective views. Experiments on both synthetic and real datasets demonstrate substantial improvement in normal accuracy and mesh completeness over existing reflection-handling methods. Beyond technical gains, our framework reveals that seeing by unshining, translating radiance into neutrality, can serve as a powerful inductive bias for reflective object geometry learning.

[132] Scaling Foundation Models for Radar Scene Understanding

Pushkal Mishra, Kshitiz Bansal, Dinesh Bharadia

Main category: cs.CV

TL;DR: RadarFM: A radar foundation model using structured spatial language supervision to learn unified scene-level representations, enabling transfer across tasks through contrastive learning and spatial reasoning.

Details

Motivation: Radar sensors provide reliable perception in adverse conditions, but existing radar approaches are fragmented and task-specific, preventing transfer across tasks. Foundation models have transformed other domains but remain underexplored for radar sensing.

Method: Uses structured caption framework encoding vehicle distributions in radar coordinates, hash-aware contrastive learning for continuous scene similarity quantification, and generates large-scale annotated radar datasets using CARLA simulator.

Result: Proposes localization-aware metrics for spatial accuracy assessment beyond traditional detection measures.

Conclusion: RadarFM provides a unified foundation model approach for radar sensing that enables transfer learning across tasks through structured spatial language supervision and contrastive learning.

Abstract: Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions. Recent advances in foundation models have transformed visual and language understanding, yet their integration with radar sensing remains largely underexplored. Existing radar approaches are fragmented and task-specific; each downstream task employs distinct architectures and training objectives, preventing transfer across tasks. In this work, we introduce RadarFM: a radar foundation model that learns unified scene-level representations through structured spatial language supervision. We make two key contributions: (1) a structured caption framework that encodes vehicle distributions in native radar coordinates, and (2) a hash-aware contrastive learning objective that quantifies continuous scene similarity rather than binary matching, enabling fine-grained spatial reasoning. Leveraging the CARLA simulator, we generate large-scale, well-annotated radar datasets across diverse driving scenarios. We also propose localization-aware metrics that assess spatial accuracy beyond traditional detection measures.

[133] EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens

Ze Feng, Sen Yang, Boqiang Duan, Wankou Yang, Jingdong Wang

Main category: cs.CV

TL;DR: EM-KD enhances Efficient Multimodal Large Language Models (MLLMs) through knowledge distillation, addressing unbalanced vision tokens between student and teacher models using spatial alignment and two distillation strategies.

Details

Motivation: Existing efficient MLLMs compress vision tokens to reduce resource consumption but lose visual information, degrading comprehension. Prior knowledge distillation methods overlook fundamental differences in fine-grained vision comprehension caused by unbalanced vision tokens between efficient student and vanilla teacher models.

Method: Proposes EM-KD with: 1) Manhattan distance calculation between teacher and student vision logits, 2) Hungarian matching algorithm for spatial alignment, 3) Vision-Language Affinity Distillation (VLAD) using smooth L1 distance on affinity matrices, and 4) Vision Semantic Distillation (VSD) using reverse KL divergence on aligned vision logits.

Result: Comprehensive evaluation shows EM-KD trained models outperform prior Efficient MLLMs on both accuracy and efficiency by a large margin. Also achieves better performance than previous distillation methods when equipped with the proposed vision token matching strategy.

Conclusion: EM-KD effectively enhances Efficient MLLMs through knowledge distillation with proper vision token alignment, demonstrating superior performance in both accuracy and efficiency compared to existing methods.

Abstract: Efficient Multimodal Large Language Models (MLLMs) compress vision tokens to reduce resource consumption, but the loss of visual information can degrade comprehension capabilities. Although some priors introduce Knowledge Distillation to enhance student models, they overlook the fundamental differences in fine-grained vision comprehension caused by unbalanced vision tokens between the efficient student and vanilla teacher. In this paper, we propose EM-KD, a novel paradigm that enhances the Efficient MLLMs with Knowledge Distillation. To overcome the challenge of unbalanced vision tokens, we first calculate the Manhattan distance between the vision logits of teacher and student, and then align them in the spatial dimension with the Hungarian matching algorithm. After alignment, EM-KD introduces two distillation strategies: 1) Vision-Language Affinity Distillation (VLAD) and 2) Vision Semantic Distillation (VSD). Specifically, VLAD calculates the affinity matrix between text tokens and aligned vision tokens, and minimizes the smooth L1 distance of the student and the teacher affinity matrices. Considering the semantic richness of vision logits in the final layer, VSD employs the reverse KL divergence to measure the discrete probability distributions of the aligned vision logits over the vocabulary space. Comprehensive evaluation on diverse benchmarks demonstrates that EM-KD trained model outperforms prior Efficient MLLMs on both accuracy and efficiency with a large margin, validating its effectiveness. Compared with previous distillation methods, which are equipped with our proposed vision token matching strategy for fair comparison, EM-KD also achieves better performance.

[134] G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, Jiangmiao Pang

Main category: cs.CV

TL;DR: G²VLM is a geometry-grounded vision-language model that bridges 3D spatial reconstruction and spatial understanding by leveraging learned 3D visual geometry features.

Details

Motivation: Current Vision-Language Models lack robustness in spatial intelligence due to absence of visual geometry learning capable of reconstructing 3D space from 2D images.

Method: Unified design that natively leverages learned 3D visual geometry features to predict 3D attributes and enhance spatial reasoning via in-context learning and interleaved reasoning. Trains on abundant multi-view image and video data while leveraging 3D visual priors.

Result: Achieves comparable results to state-of-the-art feed-forward 3D reconstruction models and better/competitive results across spatial understanding and reasoning tasks.

Conclusion: G²VLM serves as a strong baseline for unifying semantically strong VLMs with low-level 3D vision tasks, potentially unlocking future applications like 3D scene editing.

Abstract: Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G$^2$VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.

[135] FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain

YuAn Wang, Xiaofan Li, Chi Huang, Wenhao Zhang, Hao Li, Bosheng Wang, Xun Sun, Jun Wang

Main category: cs.CV

TL;DR: FaithFusion is a 3DGS-diffusion fusion framework that uses Expected Information Gain (EIG) to maintain geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts in driving-scene reconstruction.

Details

Motivation: To address challenges in fusing geometry-based 3DGS and appearance-driven diffusion models, which often lead to over-restoration and geometric drift due to lack of pixel-wise, 3D-consistent editing criteria.

Method: Introduces Expected Information Gain (EIG) as a unified policy for coherent spatio-temporal synthesis, guiding diffusion as a spatial prior to refine high-uncertainty regions and distilling edits back into 3DGS through pixel-level weighting.

Result: Achieves SOTA performance on Waymo dataset across NTA-IoU, NTL-IoU, and FID metrics, maintaining FID of 107.47 even at 6 meters lane shift.

Conclusion: FaithFusion provides an effective plug-and-play framework for controllable driving-scene reconstruction that maintains geometric fidelity while enabling plausible appearance synthesis under large viewpoint changes.

Abstract: In controllable driving-scene reconstruction and 3D scene generation, maintaining geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts is crucial. However, effective fusion of geometry-based 3DGS and appearance-driven diffusion models faces inherent challenges, as the absence of pixel-wise, 3D-consistent editing criteria often leads to over-restoration and geometric drift. To address these issues, we introduce \textbf{FaithFusion}, a 3DGS-diffusion fusion framework driven by pixel-wise Expected Information Gain (EIG). EIG acts as a unified policy for coherent spatio-temporal synthesis: it guides diffusion as a spatial prior to refine high-uncertainty regions, while its pixel-level weighting distills the edits back into 3DGS. The resulting plug-and-play system is free from extra prior conditions and structural modifications.Extensive experiments on the Waymo dataset demonstrate that our approach attains SOTA performance across NTA-IoU, NTL-IoU, and FID, maintaining an FID of 107.47 even at 6 meters lane shift. Our code is available at https://github.com/wangyuanbiubiubiu/FaithFusion.

[136] Deformation-aware Temporal Generation for Early Prediction of Alzheimers Disease

Xin Honga, Jie Lin, Minghui Wang

Main category: cs.CV

TL;DR: Proposes DATGN, a deformation-aware temporal generative network that learns morphological changes in brain MRI sequences to predict Alzheimer’s disease progression and generate future MRI images, improving classification accuracy when used with synthetic data.

Details

Motivation: Early prediction of Alzheimer's disease can slow its progression, and current methods rely on manual feature extraction from brain images showing morphological changes like brain atrophy.

Method: DATGN first interpolates incomplete temporal MRI sequences, then uses a bidirectional temporal deformation-aware module to generate future MRI images that follow disease progression patterns for early AD prediction.

Result: DATGN achieved competitive PSNR and MMSE metrics on ADNI dataset. When synthetic data was used with classification methods, accuracy improved by 6.21-16% for AD vs. NC and 7.34-21.25% for AD vs. MCI vs. NC classification.

Conclusion: DATGN successfully generates MRI images consistent with Alzheimer’s brain atrophy trends, enabling early disease prediction and significantly improving classification performance when integrated with existing methods.

Abstract: Alzheimer’s disease (AD), a degenerative brain condition, can benefit from early prediction to slow its progression. As the disease progresses, patients typically undergo brain atrophy. Current prediction methods for Alzheimers disease largely involve analyzing morphological changes in brain images through manual feature extraction. This paper proposes a novel method, the Deformation-Aware Temporal Generative Network (DATGN), to automate the learning of morphological changes in brain images about disease progression for early prediction. Given the common occurrence of missing data in the temporal sequences of MRI images, DATGN initially interpolates incomplete sequences. Subsequently, a bidirectional temporal deformation-aware module guides the network in generating future MRI images that adhere to the disease’s progression, facilitating early prediction of Alzheimer’s disease. DATGN was tested for the generation of temporal sequences of future MRI images using the ADNI dataset, and the experimental results are competitive in terms of PSNR and MMSE image quality metrics. Furthermore, when DATGN-generated synthetic data was integrated into the SVM vs. CNN vs. 3DCNN-based classification methods, significant improvements were achieved from 6. 21% to 16% in AD vs. NC classification accuracy and from 7. 34% to 21. 25% in AD vs. MCI vs. NC classification accuracy. The qualitative visualization results indicate that DATGN produces MRI images consistent with the brain atrophy trend in Alzheimer’s disease, enabling early disease prediction.

[137] Which Layer Causes Distribution Deviation? Entropy-Guided Adaptive Pruning for Diffusion and Flow Models

Changlin Li, Jiawei Zhang, Zeyi Shi, Zongxin Yang, Zhihui Li, Xiaojun Chang

Main category: cs.CV

TL;DR: EntPruner is an entropy-guided automatic pruning framework for diffusion and flow models that reduces parameter redundancy while maintaining generation quality.

Details

Motivation: Large-scale vision generative models have significant parameter redundancy when transferred to downstream tasks, requiring efficient pruning methods that preserve output diversity and condition-fidelity.

Method: Uses entropy-guided pruning with Conditional Entropy Deviation (CED) metric to assess block importance, and a zero-shot adaptive pruning framework that automatically determines when and how much to prune during training.

Result: Achieves up to 2.22× inference speedup while maintaining competitive generation quality on ImageNet and three downstream datasets with DiT and SiT models.

Conclusion: EntPruner effectively reduces parameter redundancy in generative models while preserving performance, offering a practical solution for efficient downstream task adaptation.

Abstract: Large-scale vision generative models, including diffusion and flow models, have demonstrated remarkable performance in visual generation tasks. However, transferring these pre-trained models to downstream tasks often results in significant parameter redundancy. In this paper, we propose EntPruner, an entropy-guided automatic progressive pruning framework for diffusion and flow models. First, we introduce entropy-guided pruning, a block-level importance assessment strategy specifically designed for generative models. Unlike discriminative models, generative models require preserving the diversity and condition-fidelity of the output distribution. As the importance of each module can vary significantly across downstream tasks, EntPruner prioritizes pruning of less important blocks using data-dependent Conditional Entropy Deviation (CED) as a guiding metric. CED quantifies how much the distribution diverges from the learned conditional data distribution after removing a block. Second, we propose a zero-shot adaptive pruning framework to automatically determine when and how much to prune during training. This dynamic strategy avoids the pitfalls of one-shot pruning, mitigating mode collapse, and preserving model performance. Extensive experiments on DiT and SiT models demonstrate the effectiveness of EntPruner, achieving up to 2.22$\times$ inference speedup while maintaining competitive generation quality on ImageNet and three downstream datasets.

[138] CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion

Dianbing Xi, Jiepeng Wang, Yuanzhi Liang, Xi Qiu, Jialun Liu, Hao Pan, Yuchi Huo, Rui Wang, Haibin Huang, Chi Zhang, Xuelong Li

Main category: cs.CV

TL;DR: CtrlVDiff is a unified diffusion model that handles both video understanding and controllable generation by using multiple graphics-based modalities (depth, normals, segmentation, edges, intrinsics) to enable precise edits like relighting and material swaps while maintaining temporal coherence.

Details

Motivation: Geometry-only cues are insufficient for physically meaningful video edits as they under-constrain appearance, materials, and illumination, leading to temporal drift. Additional graphics-based modalities provide complementary constraints for better understanding and precise control.

Method: Proposes CtrlVDiff with Hybrid Modality Control Strategy (HMCS) that routes and fuses features from multiple modalities (depth, normals, segmentation, edges, albedo, roughness, metallic). Uses MMVideo dataset with aligned real-and-synthetic data across modalities and captions.

Result: Superior controllability and fidelity across understanding and generation benchmarks, enabling layer-wise edits (relighting, material adjustment, object insertion) while surpassing state-of-the-art baselines. Remains robust when some modalities are unavailable.

Conclusion: Enriching video models with graphics-based modalities enables both improved understanding and precise controllable generation, overcoming limitations of geometry-only approaches and enabling physically meaningful edits with temporal coherence.

Abstract: We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. Our key insights are two-fold: geometry-only cues (e.g., depth, edges) are insufficient: they specify layout but under-constrain appearance, materials, and illumination, limiting physically meaningful edits such as relighting or material swaps and often causing temporal drift. Enriching the model with additional graphics-based modalities (intrinsics and semantics) provides complementary constraints that both disambiguate understanding and enable precise, predictable control during generation. However, building a single model that uses many heterogeneous cues introduces two core difficulties. Architecturally, the model must accept any subset of modalities, remain robust to missing inputs, and inject control signals without sacrificing temporal consistency. Data-wise, training demands large-scale, temporally aligned supervision that ties real videos to per-pixel multimodal annotations. We then propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic), and re-renders videos from any chosen subset with strong temporal coherence. To enable this, we build MMVideo, a hybrid real-and-synthetic dataset aligned across modalities and captions. Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.

[139] DeepRFTv2: Kernel-level Learning for Image Deblurring

Xintian Mao, Haofei Song, Yin-Nian Liu, Qingli Li, Yan Wang

Main category: cs.CV

TL;DR: Proposes Fourier Kernel Estimator (FKE) that learns blur kernels in Fourier space to enable kernel-level deblurring, achieving state-of-the-art results with physically meaningful kernels.

Details

Motivation: Current deep networks only perform pixel-level restoration and fail to understand the essence of blur, which is fundamentally a convolution process with blur kernels.

Method: Fourier Kernel Estimator that converts convolution to multiplication in Fourier space, uses feature-level convolution instead of image-level, and employs decoupled multi-scale architecture with reversible sub-unets.

Result: Achieves state-of-the-art motion deblurring performance, learns physically meaningful kernels, and shows potential for other kernel-related problems.

Conclusion: Learning blur process at kernel-level through Fourier space transformation enables better deblurring performance and physical understanding of blur.

Abstract: It is well-known that if a network aims to learn how to deblur, it should understand the blur process. Blurring is naturally caused by the convolution of the sharp image with the blur kernel. Thus, allowing the network to learn the blur process in the kernel-level can significantly improve the image deblurring performance. But, current deep networks are still at the pixel-level learning stage, either performing end-to-end pixel-level restoration or stage-wise pseudo kernel-level restoration, failing to enable the deblur model to understand the essence of the blur. To this end, we propose Fourier Kernel Estimator (FKE), which considers the activation operation in Fourier space and converts the convolution problem in the spatial domain to a multiplication problem in Fourier space. Our FKE, jointly optimized with the deblur model, enables the network to learn the kernel-level blur process with low complexity and without any additional supervision. Furthermore, we change the convolution object of the kernel from image" to network extracted feature", whose rich semantic and structural information is more suitable to blur process learning. With the convolution of the feature and the estimated kernel, our model can learn the essence of blur in kernel-level. To further improve the efficiency of feature extraction, we design a decoupled multi-scale architecture with multiple hierarchical sub-unets with a reversible strategy, which allows better multi-scale encoding and decoding in low training memory. Extensive experiments indicate that our method achieves state-of-the-art motion deblurring results and show potential for handling other kernel-related problems. Analysis also shows our kernel estimator is able to learn physically meaningful kernels. The code will be available at https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur.

[140] Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning

Changlin Li, Jiawei Zhang, Shuhao Liu, Sihao Lin, Zeyi Shi, Zhihui Li, Xiaojun Chang

Main category: cs.CV

TL;DR: Ent-Prog is an efficient training framework for diffusion models in human video generation that reduces training time and GPU memory usage while maintaining performance through prioritized component training and adaptive progressive scheduling.

Details

Motivation: High computational cost and memory consumption in training diffusion models for high-resolution human video generation pose significant challenges that need to be addressed.

Method: Uses Conditional Entropy Inflation (CEI) to prioritize training of critical model components and an adaptive progressive schedule that increases computational complexity based on convergence efficiency.

Result: Achieves up to 2.2× training speedup and 2.4× GPU memory reduction across three datasets without compromising generative performance.

Conclusion: Ent-Prog provides an effective solution for efficient training of diffusion models in human video generation while maintaining model quality.

Abstract: Human video generation has advanced rapidly with the development of diffusion models, but the high computational cost and substantial memory consumption associated with training these models on high-resolution, multi-frame data pose significant challenges. In this paper, we propose Entropy-Guided Prioritized Progressive Learning (Ent-Prog), an efficient training framework tailored for diffusion models on human video generation. First, we introduce Conditional Entropy Inflation (CEI) to assess the importance of different model components on the target conditional generation task, enabling prioritized training of the most critical components. Second, we introduce an adaptive progressive schedule that adaptively increases computational complexity during training by measuring the convergence efficiency. Ent-Prog reduces both training time and GPU memory consumption while maintaining model performance. Extensive experiments across three datasets, demonstrate the effectiveness of Ent-Prog, achieving up to 2.2$\times$ training speedup and 2.4$\times$ GPU memory reduction without compromising generative performance.

[141] Referring Video Object Segmentation with Cross-Modality Proxy Queries

Baoli Sun, Xinzhu Ma, Ning Wang, Zhihui Wang, Zhiyong Wang

Main category: cs.CV

TL;DR: ProxyFormer introduces proxy queries to integrate visual and text semantics for referring video object segmentation, addressing limitations in cross-modality alignment and inter-frame dependency modeling.

Details

Motivation: Existing RVOS methods have two key limitations: conditional queries lack inter-frame dependency modeling for accurate tracking amid frame variations, and textual constraints are integrated too late, potentially causing focus on non-referred objects.

Method: ProxyFormer uses proxy queries to integrate visual and text semantics, progressively updating them across video feature encoder stages. It decouples cross-modality interactions into temporal and spatial dimensions for efficiency, and employs Joint Semantic Consistency training for semantic alignment.

Result: Comprehensive experiments on four RVOS benchmarks demonstrate ProxyFormer’s superiority over state-of-the-art methods.

Conclusion: ProxyFormer effectively addresses cross-modality alignment challenges in RVOS through proxy queries that establish inter-frame dependencies and ensure focus on referred objects, achieving improved accuracy and coherence in object tracking.

Abstract: Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response based mechanism built upon transformer structure. However, they exhibit two limitations: (1) these conditional queries lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they integrate textual constraints belatedly, which may cause the video features potentially focus on the non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating proxy queries across multiple stages of video feature encoder, ProxyFormer ensures that the video features are focused on the object of interest. This dynamic evolution also enables the establishment of inter-frame dependencies, enhancing the accuracy and coherence of object tracking. To mitigate high computational costs, we decouple cross-modality interactions into temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the combined video-text pairs. Comprehensive experiments on four widely used RVOS benchmarks demonstrate the superiority of our ProxyFormer to the state-of-the-art methods.

[142] TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models

Jiaming He, Guanyu Hou, Hongwei Li, Zhicong Huang, Kangjie Chen, Yi Yu, Wenbo Jiang, Guowen Xu, Tianwei Zhang

Main category: cs.CV

TL;DR: TEAR is an automated red-teaming framework that uncovers safety risks in Text-to-Video models by exploiting temporal dynamics through temporal-aware prompt generation and preference learning.

Details

Motivation: Existing safety evaluation methods for static images and text are insufficient for capturing complex temporal dynamics in video generation, creating critical safety challenges in T2V models.

Method: Uses temporal-aware test generator with two-stage optimization (initial training + temporal-aware online preference learning) and cyclic refinement to create stealthy prompts that exploit temporal sequencing to elicit policy-violating videos.

Result: Achieves over 80% attack success rate across open-source and commercial T2V systems, significantly improving from prior best result of 57%.

Conclusion: TEAR effectively identifies temporal-related safety vulnerabilities in T2V models, demonstrating the need for specialized safety evaluation frameworks for dynamic video generation.

Abstract: Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse generation also inherently introduces critical safety challenges. Existing safety evaluation methods,which focus on static image and text generation, are insufficient to capture the complex temporal dynamics in video generation. To address this, we propose a TEmporal-aware Automated Red-teaming framework, named TEAR, an automated framework designed to uncover safety risks specifically linked to the dynamic temporal sequencing of T2V models. TEAR employs a temporal-aware test generator optimized via a two-stage approach: initial generator training and temporal-aware online preference learning, to craft textually innocuous prompts that exploit temporal dynamics to elicit policy-violating video output. And a refine model is adopted to improve the prompt stealthiness and adversarial effectiveness cyclically. Extensive experimental evaluation demonstrates the effectiveness of TEAR across open-source and commercial T2V systems with over 80% attack success rate, a significant boost from prior best result of 57%.

[143] LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs

Shichu Sun, Yichen Zhang, Haolin Song, Zonghao Guo, Chi Chen, Yidan Zhang, Yuan Yao, Zhiyuan Liu, Maosong Sun

Main category: cs.CV

TL;DR: LLaVA-UHD v3 introduces Progressive Visual Compression (PVC) to enable efficient native-resolution visual encoding in MLLMs, reducing computational overhead while maintaining competitive performance.

Details

Motivation: Global native-resolution visual encoding in MLLMs enhances capability but incurs significant computational overhead. The paper aims to address this efficiency issue while preserving the benefits of high-resolution encoding.

Method: Proposed Progressive Visual Compression (PVC) with two modules: refined patch embedding for flexible patch-size scaling, and windowed token compression hierarchically deployed across ViT layers to progressively aggregate local token representations.

Result: ViT-UHD (transformed ViT with PVC) achieves competitive performance with MoonViT while reducing TTFT by 2.4x. LLaVA-UHD v3 achieves competitive performance to Qwen2-VL while further reducing TTFT by 1.9x.

Conclusion: PVC enables efficient native-resolution encoding in MLLMs, significantly reducing computational overhead while maintaining competitive performance, supporting future research on efficient multi-modal models.

Abstract: Visual encoding followed by token condensing has become the standard architectural paradigm in multi-modal large language models (MLLMs). Many recent MLLMs increasingly favor global native- resolution visual encoding over slice-based methods. To investigate this trend, we systematically compare their behavior on vision-language understanding and attention patterns, revealing that global encoding enhances overall capability but at the expense of greater computational overhead. To address this issue, we present LLaVA-UHD v3, an MLLM centered upon our proposed Progressive Visual Compression (PVC) method, which can be seamlessly integrated into standard Vision Transformer (ViT) to enable efficient native-resolution encoding. The PVC approach consists of two key modules: (i) refined patch embedding, which supports flexible patch-size scaling for fine-grained visual model- ing, (ii) windowed token compression, hierarchically deployed across ViT layers to progressively aggregate local token representations. Jointly modulated by these two modules, a widely pretrained ViT can be reconfigured into an efficient architecture while largely preserving generality. Evaluated across extensive benchmarks, the transformed ViT, termed ViT-UHD, demonstrates competitive performance with MoonViT while reducing TTFT (time-to-first-token) by 2.4x, when developed within an identical MLLM architecture. Building upon ViT-UHD, LLaVA-UHD v3 also achieves competitive performance to Qwen2-VL, while further reducing TTFT by 1.9x. We will release all code and checkpoints to support future research on efficient MLLMs.

[144] Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation

Joonhyung Park, Hyeongwon Jang, Joowon Kim, Eunho Yang

Main category: cs.CV

TL;DR: GridAR is a test-time scaling framework for visual autoregressive models that improves image generation quality through grid-partitioned progressive generation and layout-specified prompt reformulation, achieving better results with lower computational cost.

Details

Motivation: Test-time computation scaling has been successful in natural language tasks but remains unexplored for visual AR models. Naive approaches like Best-of-N are suboptimal due to full-length computation on erroneous trajectories and lack of canvas blueprint in raster-scan decoding.

Method: GridAR uses grid-partitioned progressive generation where multiple partial candidates are generated, pruned early if infeasible, and viable ones become anchors for subsequent decoding. It also employs layout-specified prompt reformulation that inspects partial views to infer feasible layouts.

Result: With N=4, GridAR outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%. It also shows comparable edit quality and 13.9% gain in semantic preservation on PIE-Bench over larger-N baselines.

Conclusion: GridAR effectively addresses the challenges of test-time scaling for visual AR models, achieving higher quality results with reduced computational cost and generalizing well to image editing tasks.

Abstract: Recent visual autoregressive (AR) models have shown promising capabilities in text-to-image generation, operating in a manner similar to large language models. While test-time computation scaling has brought remarkable success in enabling reasoning-enhanced outputs for challenging natural language tasks, its adaptation to visual AR models remains unexplored and poses unique challenges. Naively applying test-time scaling strategies such as Best-of-N can be suboptimal: they consume full-length computation on erroneous generation trajectories, while the raster-scan decoding scheme lacks a blueprint of the entire canvas, limiting scaling benefits as only a few prompt-aligned candidates are generated. To address these, we introduce GridAR, a test-time scaling framework designed to elicit the best possible results from visual AR models. GridAR employs a grid-partitioned progressive generation scheme in which multiple partial candidates for the same position are generated within a canvas, infeasible ones are pruned early, and viable ones are fixed as anchors to guide subsequent decoding. Coupled with this, we present a layout-specified prompt reformulation strategy that inspects partial views to infer a feasible layout for satisfying the prompt. The reformulated prompt then guides subsequent image generation to mitigate the blueprint deficiency. Together, GridAR achieves higher-quality results under limited test-time scaling: with N=4, it even outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%. It also generalizes to autoregressive image editing, showing comparable edit quality and a 13.9% gain in semantic preservation on PIE-Bench over larger-N baselines.

[145] Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding

Yutao Tang, Cheng Zhao, Gaurav Mittal, Rohith Kukkala, Rama Chellappa, Cheng Peng, Mei Chen

Main category: cs.CV

TL;DR: NDTokenizer3D is a generalist 3D vision-language model that uses a novel three-stage scene tokenization pipeline based on Multi-Scale Normal Distributions Transform to bridge language reasoning with 3D spatial understanding across various tasks.

Details

Motivation: Effectively tokenizing 3D scenes into holistic scene tokens and leveraging them across diverse 3D understanding tasks remains challenging, despite recent advances in 3D vision-language models.

Method: A three-stage scene tokenization pipeline using Multi-Scale Normal Distributions Transform (NDT) representation with a Multi-Scale NDT Decoder (MSDec) that constructs multi-scale NDT from point clouds and fuses cross-scale features to produce scene tokens.

Result: Achieves remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning, offering a fine-grained, general-purpose 3D VLM.

Conclusion: NDTokenizer3D provides a compact and unified architecture that effectively bridges language-level reasoning with 3D spatial understanding, supporting diverse 3D scene understanding tasks and human interactions.

Abstract: Recent advances in 3D vision-language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning. However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec). Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next, the MSDec progressively fuses cross-scale NDT features, producing holistic scene tokens consumable by LLM endpoints. Beyond tokenization, MSDec is repurposed as a general interface for human-interactive prompting (points, boxes, masks) and segmentation-mask decoding, unifying diverse 3D scene understanding tasks within a single architecture. With this compact and unified design, NDTokenizer3D offers a fine-grained, general-purpose 3D VLM, achieving remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.

[146] When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models

Hui Lu, Yi Yu, Yiming Yang, Chenyu Yi, Qixin Zhang, Bingquan Shen, Alex C. Kot, Xudong Jiang

Main category: cs.CV

TL;DR: UPA-RFAS is a universal adversarial patch attack framework that transfers across different VLA models, tasks, and viewpoints by combining feature-space objectives, robustness augmentation, and VLA-specific attention/semantic attacks.

Details

Motivation: Vision-Language-Action models are vulnerable to adversarial attacks, but existing patches overfit to single models and fail in black-box settings. There's a need for universal, transferable attacks that work across unknown architectures and sim-to-real shifts.

Method: UPA-RFAS combines: (1) feature-space objective with ℓ₁ deviation prior and repulsive InfoNCE loss for transferable representation shifts, (2) two-phase min-max procedure with inner loop for invisible perturbations and outer loop for universal patch optimization, (3) VLA-specific Patch Attention Dominance and Patch Semantic Misalignment losses.

Result: Experiments show UPA-RFAS consistently transfers across diverse VLA models, manipulation suites, and physical executions, working across models, tasks, and viewpoints.

Conclusion: UPA-RFAS exposes a practical patch-based attack surface for VLA-driven robots and establishes a strong baseline for future defense research.

Abstract: Vision-Language-Action (VLA) models are vulnerable to adversarial attacks, yet universal and transferable attacks remain underexplored, as most existing patches overfit to a single model and fail in black-box settings. To address this gap, we present a systematic study of universal, transferable adversarial patches against VLA-driven robots under unknown architectures, finetuned variants, and sim-to-real shifts. We introduce UPA-RFAS (Universal Patch Attack via Robust Feature, Attention, and Semantics), a unified framework that learns a single physical patch in a shared feature space while promoting cross-model transfer. UPA-RFAS combines (i) a feature-space objective with an $\ell_1$ deviation prior and repulsive InfoNCE loss to induce transferable representation shifts, (ii) a robustness-augmented two-phase min-max procedure where an inner loop learns invisible sample-wise perturbations and an outer loop optimizes the universal patch against this hardened neighborhood, and (iii) two VLA-specific losses: Patch Attention Dominance to hijack text$\to$vision attention and Patch Semantic Misalignment to induce image-text mismatch without labels. Experiments across diverse VLA models, manipulation suites, and physical executions show that UPA-RFAS consistently transfers across models, tasks, and viewpoints, exposing a practical patch-based attack surface and establishing a strong baseline for future defenses.

[147] You Can Trust Your Clustering Model: A Parameter-free Self-Boosting Plug-in for Deep Clustering

Hanyang Li, Yuheng Jia, Hui Liu, Junhui Hou

Main category: cs.CV

TL;DR: DCBoost is a parameter-free plug-in that enhances global feature structures in deep clustering models by leveraging reliable local structural cues to improve clustering performance.

Details

Motivation: Existing deep clustering methods suffer from disparity between global and local feature structures - local features show strong consistency while global features have intertwined boundaries and poor cluster separation.

Method: Uses adaptive k-nearest neighbors-based consistency filtering to identify high-confidence samples as trustworthy anchors, then computes a discriminative loss that promotes intra-class compactness and inter-class separability to guide network optimization.

Result: Significantly improves clustering performance across various benchmark datasets, boosting state-of-the-art baselines by over 3% and amplifying silhouette coefficient by over 7x.

Conclusion: DCBoost effectively enhances global feature structures in deep clustering models through reliable local structural cues, demonstrating substantial performance improvements without requiring additional parameters.

Abstract: Recent deep clustering models have produced impressive clustering performance. However, a common issue with existing methods is the disparity between global and local feature structures. While local structures typically show strong consistency and compactness within class samples, global features often present intertwined boundaries and poorly separated clusters. Motivated by this observation, we propose DCBoost, a parameter-free plug-in designed to enhance the global feature structures of current deep clustering models. By harnessing reliable local structural cues, our method aims to elevate clustering performance effectively. Specifically, we first identify high-confidence samples through adaptive $k$-nearest neighbors-based consistency filtering, aiming to select a sufficient number of samples with high label reliability to serve as trustworthy anchors for self-supervision. Subsequently, these samples are utilized to compute a discriminative loss, which promotes both intra-class compactness and inter-class separability, to guide network optimization. Extensive experiments across various benchmark datasets showcase that our DCBoost significantly improves the clustering performance of diverse existing deep clustering models. Notably, our method improves the performance of current state-of-the-art baselines (e.g., ProPos) by more than 3% and amplifies the silhouette coefficient by over $7\times$. Code is available at https://github.com/l-h-y168/DCBoost.

[148] BotaCLIP: Contrastive Learning for Botany-Aware Representation of Earth Observation Data

Selene Cerna, Sara Si-Moussi, Wilfried Thuiller, Hadrien Hendrikx, Vincent Miele

Main category: cs.CV

TL;DR: BotaCLIP is a lightweight multimodal framework that adapts pre-trained Earth Observation foundation models to inject botanical domain knowledge through contrastive learning with aerial imagery and ecological data, improving downstream ecological predictions.

Details

Motivation: To adapt pre-trained foundation models for domain-specific ecological applications without expensive retraining, enabling expert knowledge injection into data-scarce biodiversity modeling settings.

Method: BotaCLIP uses multimodal contrastive learning to align high-resolution aerial imagery with botanical relevés, incorporating regularization to prevent catastrophic forgetting of original model capabilities.

Result: BotaCLIP embeddings consistently outperformed original DOFA embeddings and supervised baselines across three ecological tasks: plant presence prediction, butterfly occurrence modeling, and soil trophic group estimation.

Conclusion: Domain-aware adaptation of foundation models can effectively inject expert knowledge into data-scarce scenarios, enabling efficient and frugal representation learning for ecological applications.

Abstract: Foundation models have demonstrated a remarkable ability to learn rich, transferable representations across diverse modalities such as images, text, and audio. In modern machine learning pipelines, these representations often replace raw data as the primary input for downstream tasks. In this paper, we address the challenge of adapting a pre-trained foundation model to inject domain-specific knowledge, without retraining from scratch or incurring significant computational costs. To this end, we introduce BotaCLIP, a lightweight multimodal contrastive framework that adapts a pre-trained Earth Observation foundation model (DOFA) by aligning high-resolution aerial imagery with botanical relevés. Unlike generic embeddings, BotaCLIP internalizes ecological structure through contrastive learning with a regularization strategy that mitigates catastrophic forgetting. Once trained, the resulting embeddings serve as transferable representations for downstream predictors. Motivated by real-world applications in biodiversity modeling, we evaluated BotaCLIP representations in three ecological tasks: plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation. The results showed consistent improvements over those derived from DOFA and supervised baselines. More broadly, this work illustrates how domain-aware adaptation of foundation models can inject expert knowledge into data-scarce settings, enabling frugal representation learning.

[149] Towards an Effective Action-Region Tracking Framework for Fine-grained Video Action Recognition

Baoli Sun, Yihan Wang, Xinzhu Ma, Zhihui Wang, Kun Lu, Zhiyong Wang

Main category: cs.CV

TL;DR: The paper proposes Action-Region Tracking (ART) framework for fine-grained action recognition, using query-response mechanism to track distinctive local details across video frames.

Details

Motivation: Current recognition methods capture coarse-grained motion patterns but struggle with subtle local details that evolve over time, which are crucial for distinguishing similar fine-grained actions.

Method: Uses region-specific semantic activation module with text-constrained queries from VLMs to capture action-related regions, organizes responses into action tracklets, and applies multi-level tracklet contrastive constraints with task-specific fine-tuning.

Result: Comprehensive experiments on action recognition benchmarks demonstrate superiority over previous state-of-the-art baselines.

Conclusion: The ART framework effectively captures and tracks subtle local details in fine-grained actions, enabling better distinction between similar action categories through region-based dynamics tracking.

Abstract: Fine-grained action recognition (FGAR) aims to identify subtle and distinctive differences among fine-grained action categories. However, current recognition methods often capture coarse-grained motion patterns but struggle to identify subtle details in local regions evolving over time. In this work, we introduce the Action-Region Tracking (ART) framework, a novel solution leveraging a query-response mechanism to discover and track the dynamics of distinctive local details, enabling effective distinction of similar actions. Specifically, we propose a region-specific semantic activation module that employs discriminative and text-constrained semantics as queries to capture the most action-related region responses in each video frame, facilitating interaction among spatial and temporal dimensions with corresponding video features. The captured region responses are organized into action tracklets, which characterize region-based action dynamics by linking related responses across video frames in a coherent sequence. The text-constrained queries encode nuanced semantic representations derived from textual descriptions of action labels extracted by language branches within Visual Language Models (VLMs). To optimize the action tracklets, we design a multi-level tracklet contrastive constraint among region responses at spatial and temporal levels, enabling effective discrimination within each frame and correlation between adjacent frames. Additionally, a task-specific fine-tuning mechanism refines textual semantics such that semantic representations encoded by VLMs are preserved while optimized for task preferences. Comprehensive experiments on widely used action recognition benchmarks demonstrate the superiority to previous state-of-the-art baselines.

[150] From Diffusion to One-Step Generation: A Comparative Study of Flow-Based Models with Application to Image Inpainting

Umang Agarwal, Rudraksh Sangore, Sumit Laddha

Main category: cs.CV

TL;DR: Comparative study of DDPM, CFM, and MeanFlow showing CFM’s superior FID (24.15 vs 402.98 for DDPM) and MeanFlow’s 50X faster single-step generation (FID 29.15). Extended CFM to inpainting with significant quality improvements.

Details

Motivation: To comprehensively compare three generative modeling paradigms and demonstrate the advantages of CFM over DDPM, while showing MeanFlow's efficiency for single-step generation. Also aims to extend CFM to image inpainting tasks.

Method: Implemented DDPM, CFM, and MeanFlow using unified TinyUNet architecture (<1.5M params) on CIFAR-10. Extended CFM to inpainting with mask-guided sampling using four mask types (center, random bbox, irregular, half) and fine-tuned for inpainting-aware training.

Result: CFM achieved FID 24.15 with 50 steps vs DDPM’s 402.98. MeanFlow achieved FID 29.15 with single-step sampling (50X faster inference). Inpainting: PSNR improved from 4.95 to 8.57 dB (+73%) and SSIM from 0.289 to 0.418 (+45%) on center masks.

Conclusion: CFM significantly outperforms DDPM in image quality, MeanFlow enables efficient single-step generation with minimal quality loss, and fine-tuned inpainting models demonstrate substantial improvements in image restoration quality.

Abstract: We present a comprehensive comparative study of three generative modeling paradigms: Denoising Diffusion Probabilistic Models (DDPM), Conditional Flow Matching (CFM), and MeanFlow. While DDPM and CFM require iterative sampling, MeanFlow enables direct one-step generation by modeling the average velocity over time intervals. We implement all three methods using a unified TinyUNet architecture (<1.5M parameters) on CIFAR-10, demonstrating that CFM achieves an FID of 24.15 with 50 steps, significantly outperforming DDPM (FID 402.98). MeanFlow achieves FID 29.15 with single-step sampling – a 50X reduction in inference time. We further extend CFM to image inpainting, implementing mask-guided sampling with four mask types (center, random bbox, irregular, half). Our fine-tuned inpainting model achieves substantial improvements: PSNR increases from 4.95 to 8.57 dB on center masks (+73%), and SSIM improves from 0.289 to 0.418 (+45%), demonstrating the effectiveness of inpainting-aware training.

[151] The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

Stefanos Koutoupis, Michaela Areti Zervou, Konstantinos Kontras, Maarten De Vos, Panagiotis Tsakalides, Grigorios Tsagatakis

Main category: cs.CV

TL;DR: ConFu is a multimodal framework that jointly embeds individual modalities and their fused combinations in a unified space, using contrastive learning to capture both pairwise and higher-order dependencies.

Details

Motivation: Existing multimodal methods focus on pairwise alignment but fail to capture higher-order interactions between multiple modalities while preserving pairwise relationships, limiting their effectiveness on single-modality tasks.

Method: Extends pairwise contrastive objectives with an additional fused-modality contrastive term that aligns modality pairs with a third modality, enabling capture of higher-order dependencies like XOR relationships.

Result: Competitive performance on retrieval and classification tasks across synthetic and real-world benchmarks, supporting unified one-to-one and two-to-one retrieval within a single framework.

Conclusion: ConFu effectively captures higher-order multimodal dependencies while maintaining strong pairwise correspondence, providing a unified approach for multimodal representation learning.

Abstract: Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.

[152] 3-Tracer: A Tri-level Temporal-Aware Framework for Audio Forgery Detection and Localization

Shuhan Xia, Xuannan Liu, Xing Cui, Peipei Li

Main category: cs.CV

TL;DR: T3-Tracer is a novel framework for detecting partial audio forgeries by jointly analyzing audio at frame, segment, and audio levels using two complementary modules: FA-FAM for frame authenticity and SMDAM for segment-level boundary detection.

Details

Motivation: Partial audio forgeries selectively modify critical frames while maintaining overall authenticity, making them difficult to detect. Existing methods only detect single frames independently and lack hierarchical analysis across temporal levels.

Method: T3-Tracer uses two core modules: FA-FAM combines frame-level and audio-level features to detect intra-frame forgery cues and global inconsistencies, while SMDAM uses a dual-branch architecture to model frame features and inter-frame differences across multi-scale windows for boundary detection.

Result: Extensive experiments on three challenging datasets demonstrate state-of-the-art performance in detecting partial audio forgeries.

Conclusion: The hierarchical multi-level approach effectively captures both transient and sustained anomalies, providing comprehensive detection of partial audio forgery traces across different temporal scales.

Abstract: Recently, partial audio forgery has emerged as a new form of audio manipulation. Attackers selectively modify partial but semantically critical frames while preserving the overall perceptual authenticity, making such forgeries particularly difficult to detect. Existing methods focus on independently detecting whether a single frame is forged, lacking the hierarchical structure to capture both transient and sustained anomalies across different temporal levels. To address these limitations, We identify three key levels relevant to partial audio forgery detection and present T3-Tracer, the first framework that jointly analyzes audio at the frame, segment, and audio levels to comprehensively detect forgery traces. T3-Tracer consists of two complementary core modules: the Frame-Audio Feature Aggregation Module (FA-FAM) and the Segment-level Multi-Scale Discrepancy-Aware Module (SMDAM). FA-FAM is designed to detect the authenticity of each audio frame. It combines both frame-level and audio-level temporal information to detect intra-frame forgery cues and global semantic inconsistencies. To further refine and correct frame detection, we introduce SMDAM to detect forgery boundaries at the segment level. It adopts a dual-branch architecture that jointly models frame features and inter-frame differences across multi-scale temporal windows, effectively identifying abrupt anomalies that appeared on the forged boundaries. Extensive experiments conducted on three challenging datasets demonstrate that our approach achieves state-of-the-art performance.

[153] Hybrid SIFT-SNN for Efficient Anomaly Detection of Traffic Flow-Control Infrastructure

Munish Rathee, Boris Bačić, Maryam Doborjeh

Main category: cs.CV

TL;DR: SIFT-SNN framework combines SIFT feature encoding with SNN classification for real-time structural anomaly detection in transport infrastructure, achieving 92.3% accuracy with 9.5ms latency.

Details

Motivation: Need for low-latency, low-power real-time monitoring of structural safety in transport infrastructure that can operate efficiently on embedded hardware while maintaining interpretability.

Method: Hybrid pipeline integrating SIFT for spatial feature encoding, latency-driven spike conversion layer, and LIF Spiking Neural Network for classification using Auckland Harbour Bridge dataset with real and synthetic unsafe cases.

Result: Achieved 92.3% classification accuracy with 9.5ms per-frame inference time and 8.1% sparse spike activity, enabling real-time edge deployment with transparent decision-making.

Conclusion: SIFT-SNN provides efficient, interpretable structural monitoring solution with validated prototype, though generalization to unseen conditions requires further validation.

Abstract: This paper presents the SIFT-SNN framework, a low-latency neuromorphic signal-processing pipeline for real-time detection of structural anomalies in transport infrastructure. The proposed approach integrates Scale-Invariant Feature Transform (SIFT) for spatial feature encoding with a latency-driven spike conversion layer and a Leaky Integrate-and-Fire (LIF) Spiking Neural Network (SNN) for classification. The Auckland Harbour Bridge dataset is recorded under various weather and lighting conditions, comprising 6,000 labelled frames that include both real and synthetically augmented unsafe cases. The presented system achieves a classification accuracy of 92.3% (+- 0.8%) with a per-frame inference time of 9.5 ms. Achieved sub-10 millisecond latency, combined with sparse spike activity (8.1%), enables real-time, low-power edge deployment. Unlike conventional CNN-based approaches, the hybrid SIFT-SNN pipeline explicitly preserves spatial feature grounding, enhances interpretability, supports transparent decision-making, and operates efficiently on embedded hardware. Although synthetic augmentation improved robustness, generalisation to unseen field conditions remains to be validated. The SIFT-SNN framework is validated through a working prototype deployed on a consumer-grade system and framed as a generalisable case study in structural safety monitoring for movable concrete barriers, which, as a traffic flow-control infrastructure, is deployed in over 20 cities worldwide.

[154] FIELDS: Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision

Chen Ling, Henglin Shi, Hedvig Kjellström

Main category: cs.CV

TL;DR: FIELDS is a 3D face reconstruction method that captures subtle emotional expressions by combining 2D image consistency with direct 3D expression supervision and emotion recognition, using authentic 4D facial scan data to bridge the 2D/3D domain gap.

Details

Motivation: Existing 3D face reconstruction methods miss subtle emotional details due to reliance on 2D supervision and lack of 3D ground truth, failing to capture the full range of human emotional expression.

Method: Extends self-supervised 2D image consistency with direct 3D expression parameter supervision from spontaneous 4D facial scans, plus an auxiliary emotion recognition branch with intensity-aware emotion loss to prevent exaggeration.

Result: Produces high-fidelity 3D reconstructions that preserve subtle emotional cues from single images, yielding emotion-rich face models with realistic expressions and significantly improved facial expression recognition performance.

Conclusion: The dual-supervision strategy successfully bridges the 2D/3D domain gap and mitigates expression-intensity bias, enabling accurate capture of genuine emotional content in 3D face reconstruction.

Abstract: Facial expressions convey the bulk of emotional information in human communication, yet existing 3D face reconstruction methods often miss subtle affective details due to reliance on 2D supervision and lack of 3D ground truth. We propose FIELDS (Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision) to address these limitations by extending self-supervised 2D image consistency cues with direct 3D expression parameter supervision and an auxiliary emotion recognition branch. Our encoder is guided by authentic expression parameters from spontaneous 4D facial scans, while an intensity-aware emotion loss encourages the 3D expression parameters to capture genuine emotion content without exaggeration. This dual-supervision strategy bridges the 2D/3D domain gap and mitigates expression-intensity bias, yielding high-fidelity 3D reconstructions that preserve subtle emotional cues. From a single image, FIELDS produces emotion-rich face models with highly realistic expressions, significantly improving in-the-wild facial expression recognition performance without sacrificing naturalness.

[155] SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding

Tae-Min Choi, Tae Kyeong Jeong, Garam Kim, Jaemin Lee, Yeongyoon Koh, In Cheul Choi, Jae-Ho Chung, Jong Woong Park, Juyoun Park

Main category: cs.CV

TL;DR: SurgMLLMBench is a unified multimodal benchmark for surgical scene understanding that integrates pixel-level segmentation and structured VQA annotations across multiple surgical domains under a unified taxonomy.

Details

Motivation: Existing surgical datasets use heterogeneous taxonomies and lack pixel-level segmentation support, limiting consistent evaluation and applicability of multimodal LLMs in surgical applications.

Method: Created SurgMLLMBench with the MAVIS dataset, integrating pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy.

Result: A single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets, enabling comprehensive evaluation beyond traditional VQA tasks.

Conclusion: SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.

Abstract: Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational interactions. Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets. SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.

[156] Shift-Equivariant Complex-Valued Convolutional Neural Networks

Quentin Gabot, Teck-Yian Lim, Jérémy Fix, Joana Frontera-Pons, Chengfang Ren, Jean-Philippe Ovarlez

Main category: cs.CV

TL;DR: Extends Learnable Polyphase Sampling (LPS) to complex-valued neural networks with a novel projection layer from complex to real space before Gumbel Softmax, achieving shift equivariance and invariance in computer vision tasks.

Details

Motivation: Traditional CNNs lack shift equivariance and invariance due to downsampling/upsampling operations. While data augmentation helps empirically, a systematic theoretical approach is needed to guarantee these properties.

Method: Extends LPS to complex-valued networks with a projection layer from C to R before Gumbel Softmax, theoretically ensuring shift equivariance and invariance in downsampling/upsampling operations.

Result: Evaluated on computer vision problems using polarimetric SAR images, achieving shift invariance in classification tasks and shift equivariance in reconstruction and semantic segmentation.

Conclusion: Successfully extends LPS framework to complex-valued neural networks, providing theoretical guarantees for shift equivariance and invariance while maintaining performance on real-world computer vision tasks.

Abstract: Convolutional neural networks have shown remarkable performance in recent years on various computer vision problems. However, the traditional convolutional neural network architecture lacks a critical property: shift equivariance and invariance, broken by downsampling and upsampling operations. Although data augmentation techniques can help the model learn the latter property empirically, a consistent and systematic way to achieve this goal is by designing downsampling and upsampling layers that theoretically guarantee these properties by construction. Adaptive Polyphase Sampling (APS) introduced the cornerstone for shift invariance, later extended to shift equivariance with Learnable Polyphase up/downsampling (LPS) applied to real-valued neural networks. In this paper, we extend the work on LPS to complex-valued neural networks both from a theoretical perspective and with a novel building block of a projection layer from $\mathbb{C}$ to $\mathbb{R}$ before the Gumbel Softmax. We finally evaluate this extension on several computer vision problems, specifically for either the invariance property in classification tasks or the equivariance property in both reconstruction and semantic segmentation problems, using polarimetric Synthetic Aperture Radar images.

[157] AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs

Shuhan Xia, Peipei Li, Xuannan Liu, Dongsen Zhang, Xinyu Guo, Zekun Li

Main category: cs.CV

TL;DR: AVFakeBench is a comprehensive audio-video forgery detection benchmark covering diverse forgery types beyond human-centric deepfakes, with multi-granularity annotations and multi-task evaluation framework.

Details

Motivation: Existing benchmarks are limited to DeepFake-based forgeries and single-granularity annotations, failing to capture the diversity and complexity of real-world AV forgery scenarios.

Method: Proposed a multi-stage hybrid forgery framework integrating proprietary models for task planning with expert generative models for precise manipulation. Created 12K audio-video questions covering 7 forgery types and 4 annotation levels.

Result: Evaluated 11 AV-LMMs and 2 detection methods, showing AV-LMMs’ potential as emerging forgery detectors but revealing weaknesses in fine-grained perception and reasoning.

Conclusion: AVFakeBench addresses limitations of existing benchmarks and demonstrates that while AV-LMMs show promise for forgery detection, they need improvement in fine-grained analysis capabilities.

Abstract: The threat of Audio-Video (AV) forgery is rapidly evolving beyond human-centric deepfakes to include more diverse manipulations across complex natural scenes. However, existing benchmarks are still confined to DeepFake-based forgeries and single-granularity annotations, thus failing to capture the diversity and complexity of real-world forgery scenarios. To address this, we introduce AVFakeBench, the first comprehensive audio-video forgery detection benchmark that spans rich forgery semantics across both human subject and general subject. AVFakeBench comprises 12K carefully curated audio-video questions, covering seven forgery types and four levels of annotations. To ensure high-quality and diverse forgeries, we propose a multi-stage hybrid forgery framework that integrates proprietary models for task planning with expert generative models for precise manipulation. The benchmark establishes a multi-task evaluation framework covering binary judgment, forgery types classification, forgery detail selection, and explanatory reasoning. We evaluate 11 Audio-Video Large Language Models (AV-LMMs) and 2 prevalent detection methods on AVFakeBench, demonstrating the potential of AV-LMMs as emerging forgery detectors while revealing their notable weaknesses in fine-grained perception and reasoning.

[158] LaGen: Towards Autoregressive LiDAR Scene Generation

Sizhuo Zhou, Xiaosong Jia, Fanrui Zhang, Junjie Li, Juyong Zhang, Yukang Feng, Jianwen Sun, Songbur Wong, Junqi You, Junchi Yan

Main category: cs.CV

TL;DR: LaGen is the first framework for autoregressive long-horizon LiDAR scene generation, enabling frame-by-frame generation from single-frame input with bounding box conditions, outperforming existing methods.

Details

Motivation: Existing LiDAR generation methods only support single frame generation, while prediction approaches require multiple historical frames and lack interactivity, failing to support long-horizon interactive generation.

Method: LaGen uses a single-frame LiDAR input with bounding box conditions, incorporates scene decoupling estimation for object-level interaction, and noise modulation to reduce error accumulation in long-horizon generation.

Result: LaGen outperforms state-of-the-art LiDAR generation and prediction models, especially on later frames, as demonstrated through comprehensive experiments on nuScenes-based evaluation protocol.

Conclusion: LaGen successfully enables long-horizon interactive generation of LiDAR scenes, addressing limitations of existing methods and showing superior performance in generating high-fidelity 4D scene point clouds.

Abstract: Generative world models for autonomous driving (AD) have become a trending topic. Unlike the widely studied image modality, in this work we explore generative world models for LiDAR data. Existing generation methods for LiDAR data only support single frame generation, while existing prediction approaches require multiple frames of historical input and can only deterministically predict multiple frames at once, lacking interactivity. Both paradigms fail to support long-horizon interactive generation. To this end, we introduce LaGen, which to the best of our knowledge is the first framework capable of frame-by-frame autoregressive generation of long-horizon LiDAR scenes. LaGen is able to take a single-frame LiDAR input as a starting point and effectively utilize bounding box information as conditions to generate high-fidelity 4D scene point clouds. In addition, we introduce a scene decoupling estimation module to enhance the model’s interactive generation capability for object-level content, as well as a noise modulation module to mitigate error accumulation during long-horizon generation. We construct a protocol based on nuScenes for evaluating long-horizon LiDAR scene generation. Experimental results comprehensively demonstrate LaGen outperforms state-of-the-art LiDAR generation and prediction models, especially on the later frames.

[159] Monet: Reasoning in Latent Visual Space Beyond Images and Language

Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, Yisen Wang

Main category: cs.CV

TL;DR: Monet is a training framework that enables MLLMs to reason directly in latent visual space using continuous embeddings as visual thoughts, addressing computational cost and supervision challenges through a three-stage SFT pipeline and VLPO reinforcement learning.

Details

Motivation: Existing methods for visual reasoning lack human-like abstract visual thinking due to limitations of external tools, and current approaches have high computational costs and insufficient supervision for latent visual reasoning.

Method: Three-stage distillation-based supervised fine-tuning pipeline with VLPO (Visual-latent Policy Optimization) reinforcement learning, using a 125K text-image interleaved CoT dataset for training.

Result: Monet-7B shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on abstract visual reasoning tasks.

Conclusion: The framework successfully enables latent visual reasoning in MLLMs, with each training component playing a crucial role, providing insights for future developments in visual latent reasoning.

Abstract: “Thinking with images” has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning. Our model, data, and code are available at https://github.com/NOVAglow646/Monet.

[160] Unlocking Zero-shot Potential of Semi-dense Image Matching via Gaussian Splatting

Juncheng Chen, Chao Xu, Yanjun Cao

Main category: cs.CV

TL;DR: MatchGS corrects 3DGS geometric inaccuracies to generate precise correspondence labels and aligns 2D-3D representations, enabling robust zero-shot image matching with significant performance improvements.

Details

Motivation: Learning-based image matching requires large, diverse, and geometrically accurate training data, but 3D Gaussian Splatting (3DGS) has geometric inaccuracies that prevent robust correspondence labeling.

Method: Two-fold approach: (1) geometrically-faithful data generation pipeline that refines 3DGS geometry for precise correspondence labels, and (2) 2D-3D representation alignment strategy that infuses 3DGS’ 3D knowledge into 2D matchers.

Result: Generated ground-truth correspondences reduce epipolar error by up to 40x, enable supervision under extreme viewpoint changes, and matchers trained on this data achieve up to 17.7% zero-shot performance gains on public benchmarks.

Conclusion: With proper geometric refinement, 3DGS can serve as a scalable, high-fidelity data source for robust zero-shot image matchers.

Abstract: Learning-based image matching critically depends on large-scale, diverse, and geometrically accurate training data. 3D Gaussian Splatting (3DGS) enables photorealistic novel-view synthesis and thus is attractive for data generation. However, its geometric inaccuracies and biased depth rendering currently prevent robust correspondence labeling. To address this, we introduce MatchGS, the first framework designed to systematically correct and leverage 3DGS for robust, zero-shot image matching. Our approach is twofold: (1) a geometrically-faithful data generation pipeline that refines 3DGS geometry to produce highly precise correspondence labels, enabling the synthesis of a vast and diverse range of viewpoints without compromising rendering fidelity; and (2) a 2D-3D representation alignment strategy that infuses 3DGS’ explicit 3D knowledge into the 2D matcher, guiding 2D semi-dense matchers to learn viewpoint-invariant 3D representations. Our generated ground-truth correspondences reduce the epipolar error by up to 40 times compared to existing datasets, enable supervision under extreme viewpoint changes, and provide self-supervisory signals through Gaussian attributes. Consequently, state-of-the-art matchers trained solely on our data achieve significant zero-shot performance gains on public benchmarks, with improvements of up to 17.7%. Our work demonstrates that with proper geometric refinement, 3DGS can serve as a scalable, high-fidelity, and structurally-rich data source, paving the way for a new generation of robust zero-shot image matchers.

[161] Co-Training Vision Language Models for Remote Sensing Multi-task Learning

Qingyun Li, Shuran Ma, Junwei Luo, Yi Yu, Yue Zhou, Fengxiang Wang, Xudong Lu, Xiaoxing Wang, Xin He, Yushi Chen, Xue Yang, Junchi Yan

Main category: cs.CV

TL;DR: RSCoVLM is a vision-language model baseline for remote sensing multi-task learning that addresses diverse image scales, computational efficiency, and object detection capabilities through unified strategies and data curation.

Details

Motivation: To create a unified model for multiple remote sensing tasks through multi-task learning, leveraging vision-language models' potential for improved generalization, scalability, and practical applicability compared to single-task approaches.

Method: Developed a data curation engine for vision-language conversations, proposed unified dynamic-resolution strategy for diverse image scales, introduced Zoom-in Chain mechanism for ultra-high-resolution images, and enhanced object detection capability with novel evaluation protocols.

Result: RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and rivaling specialized expert models.

Conclusion: The baseline promotes progress toward general-purpose RS models and all tools, models, and datasets are open-sourced for reproducibility.

Abstract: With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning, respectively. Moreover, the unified text-based interface demonstrates significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. Firstly, we create the data curation engine, including data acquisition, offline processing and integrating, as well as online loading and weighting. This data engine effectively addresses complex RS data enviroment and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. The strategies are flexible and effectively mitigate the computational burdens. Additionally, we significantly enhance the model’s object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All the training and evaluating tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect that this baseline will promote further progress toward general-purpose RS models.

[162] SAM Guided Semantic and Motion Changed Region Mining for Remote Sensing Change Captioning

Futian Wang, Mengqi Wang, Xiao Wang, Haowen Wang, Jin Tang

Main category: cs.CV

TL;DR: This paper proposes a novel remote sensing change captioning method that leverages the SAM foundation model for region-level representation extraction and integrates knowledge graphs to enhance change description accuracy.

Details

Motivation: Existing methods have weak region awareness and limited temporal alignment in remote sensing change captioning. The authors aim to address these limitations by incorporating region-level representations and object-of-interest knowledge.

Method: The method uses CNN/Transformer for global features, SAM foundation model to identify semantic- and motion-level change regions, and a knowledge graph for object information. These heterogeneous sources are fused via cross-attention and processed by a Transformer decoder to generate natural language descriptions.

Result: Extensive experiments show the method achieves state-of-the-art performance across multiple benchmark datasets.

Conclusion: The proposed approach effectively addresses region awareness and temporal alignment issues in remote sensing change captioning by leveraging SAM foundation model and knowledge graphs, demonstrating superior performance over existing methods.

Abstract: Remote sensing change captioning is an emerging and popular research task that aims to describe, in natural language, the content of interest that has changed between two remote sensing images captured at different times. Existing methods typically employ CNNs/Transformers to extract visual representations from the given images or incorporate auxiliary tasks to enhance the final results, with weak region awareness and limited temporal alignment. To address these issues, this paper explores the use of the SAM (Segment Anything Model) foundation model to extract region-level representations and inject region-of-interest knowledge into the captioning framework. Specifically, we employ a CNN/Transformer model to extract global-level vision features, leverage the SAM foundation model to delineate semantic- and motion-level change regions, and utilize a specially constructed knowledge graph to provide information about objects of interest. These heterogeneous sources of information are then fused via cross-attention, and a Transformer decoder is used to generate the final natural language description of the observed changes. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple widely used benchmark datasets. The source code of this paper will be released on https://github.com/Event-AHU/SAM_ChangeCaptioning

[163] PathMamba: A Hybrid Mamba-Transformer for Topologically Coherent Road Segmentation in Satellite Imagery

Jules Decaestecker, Nicolas Vigne

Main category: cs.CV

TL;DR: PathMamba is a hybrid architecture combining Mamba’s linear-time efficiency for modeling continuous road structures with Transformer’s global reasoning, achieving state-of-the-art road segmentation with superior topological continuity.

Details

Motivation: Existing Vision Transformers have quadratic complexity that limits deployment on resource-constrained platforms, while road networks require modeling of long, continuous structures where Mamba's linear-time efficiency could be beneficial.

Method: Hybrid architecture integrating Mamba blocks to trace continuous road networks and preserve topological structure, combined with Transformer blocks to refine features with global context.

Result: Sets new state-of-the-art on DeepGlobe Road Extraction and Massachusetts Roads datasets, significantly improves topological continuity (APLS metric) while remaining computationally competitive.

Conclusion: PathMamba demonstrates that combining Mamba’s sequential modeling with Transformer’s global reasoning yields topologically superior road segmentation without prohibitive scaling costs of pure attention-based models.

Abstract: Achieving both high accuracy and topological continuity in road segmentation from satellite imagery is a critical goal for applications ranging from urban planning to disaster response. State-of-the-art methods often rely on Vision Transformers, which excel at capturing global context, yet their quadratic complexity is a significant barrier to efficient deployment, particularly for on-board processing in resource-constrained platforms. In contrast, emerging State Space Models like Mamba offer linear-time efficiency and are inherently suited to modeling long, continuous structures. We posit that these architectures have complementary strengths. To this end, we introduce PathMamba, a novel hybrid architecture that integrates Mamba’s sequential modeling with the Transformer’s global reasoning. Our design strategically uses Mamba blocks to trace the continuous nature of road networks, preserving topological structure, while integrating Transformer blocks to refine features with global context. This approach yields topologically superior segmentation maps without the prohibitive scaling costs of pure attention-based models. Our experiments on the DeepGlobe Road Extraction and Massachusetts Roads datasets demonstrate that PathMamba sets a new state-of-the-art. Notably, it significantly improves topological continuity, as measured by the APLS metric, setting a new benchmark while remaining computationally competitive.

[164] From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings

Jiajie Zhang, Sören Schwertfeger, Alexander Kleiner

Main category: cs.CV

TL;DR: Unsupervised framework for extracting structured VLA pre-training data from industrial videos using motion tokenization and action segmentation.

Details

Motivation: To unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action model pre-training.

Method: Trains lightweight motion tokenizer to encode motion dynamics, then uses unsupervised action segmenter with novel “Latent Action Energy” metric to discover semantically coherent action primitives.

Result: Effective segmentation of key tasks on public benchmarks and proprietary electric motor assembly dataset, with confirmed semantic coherence through clustering and VLM assessment.

Conclusion: First fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering scalable solution for embodied AI in manufacturing.

Abstract: We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel “Latent Action Energy” metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.

[165] CaliTex: Geometry-Calibrated Attention for View-Coherent 3D Texture Generation

Chenyu Liu, Hongze Chen, Jingzhi Bao, Lingting Zhu, Runze Zhang, Weikai Chen, Zeyu Hu, Yingda Yin, Keyang Luo, Xin Wang

Main category: cs.CV

TL;DR: CaliTex introduces geometry-calibrated attention to solve cross-view inconsistency in 3D texture generation by aligning attention with 3D structure through part-aligned and condition-routed attention modules.

Details

Motivation: Current 3D texture generation systems suffer from cross-view inconsistency where textures appear convincing from one viewpoint but fail to align across others, caused by attention ambiguity in unstructured full attention.

Method: Introduces CaliTex with two modules: Part-Aligned Attention that enforces spatial alignment across semantically matched parts, and Condition-Routed Attention that routes appearance information through geometry-conditioned pathways. Uses a two-stage diffusion transformer.

Result: CaliTex produces seamless and view-consistent textures and outperforms both open-source and commercial baselines.

Conclusion: CaliTex makes geometric coherence an inherent behavior of the network rather than a byproduct of optimization, effectively solving cross-view inconsistency in 3D texture generation.

Abstract: Despite major advances brought by diffusion-based models, current 3D texture generation systems remain hindered by cross-view inconsistency – textures that appear convincing from one viewpoint often fail to align across others. We find that this issue arises from attention ambiguity, where unstructured full attention is applied indiscriminately across tokens and modalities, causing geometric confusion and unstable appearance-structure coupling. To address this, we introduce CaliTex, a framework of geometry-calibrated attention that explicitly aligns attention with 3D structure. It introduces two modules: Part-Aligned Attention that enforces spatial alignment across semantically matched parts, and Condition-Routed Attention which routes appearance information through geometry-conditioned pathways to maintain spatial fidelity. Coupled with a two-stage diffusion transformer, CaliTex makes geometric coherence an inherent behavior of the network rather than a byproduct of optimization. Empirically, CaliTex produces seamless and view-consistent textures and outperforms both open-source and commercial baselines.

[166] EvRainDrop: HyperGraph-guided Completion for Effective Frame and Event Stream Aggregation

Futian Wang, Fan Zhang, Xiao Wang, Mengqi Wang, Dexing Huang, Jin Tang

Main category: cs.CV

TL;DR: Proposes a hypergraph-guided spatio-temporal event stream completion method to address spatial sparsity in event cameras by connecting event tokens across time and space using hypergraphs and contextual message passing.

Details

Motivation: Event cameras produce spatially sparse but temporally dense event streams, and existing representation learning methods struggle with undersampling caused by spatial sparsity.

Method: Uses hypergraphs to connect event tokens across different times and spatial locations, leverages contextual information message passing to complete sparse events, and can incorporate RGB tokens for multi-modal completion. Aggregates hypergraph node information through self-attention.

Result: Extensive experiments on single- and multi-label event classification tasks fully validated the effectiveness of the proposed framework.

Conclusion: The hypergraph-based completion mechanism successfully addresses spatial sparsity in event streams and enables effective multi-modal feature learning and fusion.

Abstract: Event cameras produce asynchronous event streams that are spatially sparse yet temporally dense. Mainstream event representation learning algorithms typically use event frames, voxels, or tensors as input. Although these approaches have achieved notable progress, they struggle to address the undersampling problem caused by spatial sparsity. In this paper, we propose a novel hypergraph-guided spatio-temporal event stream completion mechanism, which connects event tokens across different times and spatial locations via hypergraphs and leverages contextual information message passing to complete these sparse events. The proposed method can flexibly incorporate RGB tokens as nodes in the hypergraph within this completion framework, enabling multi-modal hypergraph-based information completion. Subsequently, we aggregate hypergraph node information across different time steps through self-attention, enabling effective learning and fusion of multi-modal features. Extensive experiments on both single- and multi-label event classification tasks fully validated the effectiveness of our proposed framework. The source code of this paper will be released on https://github.com/Event-AHU/EvRainDrop.

[167] HTTM: Head-wise Temporal Token Merging for Faster VGGT

Weitian Wang, Lukas Meiner, Rai Shubham, Cecilia De La Parra, Akash Kumar

Main category: cs.CV

TL;DR: HTTM is a training-free 3D token merging method that accelerates VGGT by merging tokens in multi-head granularity, achieving up to 7x speedup with minimal performance loss.

Details

Motivation: VGGT's joint inference of 3D attributes requires global attention layers with all-to-all computation, creating latency bottlenecks for large scenes with long-sequence inputs.

Method: Head-wise temporal merging (HTTM) merges tokens at multi-head granularity instead of uniformly across attention heads, preserving feature uniqueness and leveraging spatial locality and temporal correspondence.

Result: HTTM achieves up to 7x acceleration with negligible performance drops in GPU-based inference compared to existing merging methods.

Conclusion: HTTM effectively addresses the computational bottleneck in VGGT while maintaining model performance through head-wise token merging.

Abstract: The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal merging (HTTM), a training-free 3D token merging method for accelerating VGGT. Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers’ output, which hinders the model’s representational ability. HTTM tackles this problem by merging tokens in multi-head granularity, which preserves the uniqueness of feature tokens after head concatenation. Additionally, this enables HTTM to leverage the spatial locality and temporal correspondence observed at the head level to achieve higher merging ratios with lower merging costs compared to existing methods. Thus, HTTM achieves up to 7x acceleration with negligible performance drops in a GPU-based inference.

[168] PFF-Net: Patch Feature Fitting for Point Cloud Normal Estimation

Qing Li, Huifang Feng, Kanle Shi, Yue Gao, Yi Fang, Yu-Shen Liu, Zhizhong Han

Main category: cs.CV

TL;DR: A novel multi-scale feature fusion approach for robust normal estimation in point clouds that adapts to varying local geometries without requiring manual patch size selection.

Details

Motivation: Existing methods struggle with selecting appropriate neighborhood sizes for different point cloud data and geometries, leading to inaccurate normal estimation. They also use parameter-heavy strategies that are inefficient.

Method: Proposes Patch Feature Fitting (PFF) using multi-scale feature aggregation and cross-scale feature compensation. Aggregates features from different neighborhood sizes progressively to the center, while compensation ensures reusability of large-scale features.

Result: Achieves state-of-the-art performance on both synthetic and real-world datasets with fewer network parameters and reduced running time compared to existing methods.

Conclusion: The multi-scale feature fusion approach effectively addresses the patch size selection problem and provides optimal feature descriptions for robust normal estimation across various point cloud geometries.

Abstract: Estimating the normal of a point requires constructing a local patch to provide center-surrounding context, but determining the appropriate neighborhood size is difficult when dealing with different data or geometries. Existing methods commonly employ various parameter-heavy strategies to extract a full feature description from the input patch. However, they still have difficulties in accurately and efficiently predicting normals for various point clouds. In this work, we present a new idea of feature extraction for robust normal estimation of point clouds. We use the fusion of multi-scale features from different neighborhood sizes to address the issue of selecting reasonable patch sizes for various data or geometries. We seek to model a patch feature fitting (PFF) based on multi-scale features to approximate the optimal geometric description for normal estimation and implement the approximation process via multi-scale feature aggregation and cross-scale feature compensation. The feature aggregation module progressively aggregates the patch features of different scales to the center of the patch and shrinks the patch size by removing points far from the center. It not only enables the network to precisely capture the structure characteristic in a wide range, but also describes highly detailed geometries. The feature compensation module ensures the reusability of features from earlier layers of large scales and reveals associated information in different patch sizes. Our approximation strategy based on aggregating the features of multiple scales enables the model to achieve scale adaptation of varying local patches and deliver the optimal feature description. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both synthetic and real-world datasets with fewer network parameters and running time.

[169] Endo-G$^{2}$T: Geometry-Guided & Temporally Aware Time-Embedded 4DGS For Endoscopic Scenes

Yangle Liu, Fengze Li, Kan Liu, Jieming Ma

Main category: cs.CV

TL;DR: Endo-G²T is a geometry-guided training scheme for 4D Gaussian splatting in endoscopic videos, addressing view-dependent effects and geometric drift through monocular depth supervision, temporal consistency, and streaming optimization.

Details

Motivation: Endoscopic videos suffer from strong view-dependent effects (specularities, reflections, occlusions) that cause geometric drift in photometric supervision, leading to erroneous shapes that become hard to correct during densification.

Method: Three key components: 1) Geo-guided prior distillation using confidence-gated monocular depth with scale-invariant losses and warm-up schedule; 2) Time-embedded Gaussian field with rotor-like rotation for temporal coherence; 3) Keyframe-constrained streaming with max-points budget for efficiency.

Result: Achieves state-of-the-art results on EndoNeRF and StereoMIS-P1 datasets among monocular reconstruction baselines, demonstrating improved geometric accuracy and temporal consistency.

Conclusion: Endo-G²T successfully anchors geometry early in 4DGS training while maintaining temporal consistency and efficiency, overcoming challenges of view-dependent effects in dynamic endoscopic scenes.

Abstract: Endoscopic (endo) video exhibits strong view-dependent effects such as specularities, wet reflections, and occlusions. Pure photometric supervision misaligns with geometry and triggers early geometric drift, where erroneous shapes are reinforced during densification and become hard to correct. We ask how to anchor geometry early for 4D Gaussian splatting (4DGS) while maintaining temporal consistency and efficiency in dynamic endoscopic scenes. Thus, we present Endo-G$^{2}$T, a geometry-guided and temporally aware training scheme for time-embedded 4DGS. First, geo-guided prior distillation converts confidence-gated monocular depth into supervision with scale-invariant depth and depth-gradient losses, using a warm-up-to-cap schedule to inject priors softly and avoid early overfitting. Second, a time-embedded Gaussian field represents dynamics in XYZT with a rotor-like rotation parameterization, yielding temporally coherent geometry with lightweight regularization that favors smooth motion and crisp opacity boundaries. Third, keyframe-constrained streaming improves efficiency and long-horizon stability through keyframe-focused optimization under a max-points budget, while non-keyframes advance with lightweight updates. Across EndoNeRF and StereoMIS-P1 datasets, Endo-G$^{2}$T achieves state-of-the-art results among monocular reconstruction baselines.

[170] Frequency-Aware Token Reduction for Efficient Vision Transformer

Dong-Jae Lee, Jiwan Hur, Jaehyun Choi, Jaemyung Yu, Junmo Kim

Main category: cs.CV

TL;DR: Proposes a frequency-aware token reduction strategy for Vision Transformers that partitions tokens into high-frequency and low-frequency components, preserving high-frequency tokens while aggregating low-frequency ones to improve computational efficiency and mitigate rank collapsing.

Details

Motivation: Vision Transformers face quadratic computational complexity challenges, and existing token reduction methods overlook frequency characteristics like rank collapsing and over-smoothing in self-attention mechanisms.

Method: Partitions tokens into high-frequency and low-frequency categories, selectively preserves high-frequency tokens, and aggregates low-frequency tokens into a compact direct current token to retain essential low-frequency components.

Result: Significantly improves accuracy while reducing computational overhead and mitigating rank collapsing and over-smoothing, as demonstrated through extensive experiments and analysis.

Conclusion: The frequency-aware token reduction strategy effectively addresses computational efficiency issues in Vision Transformers while preserving performance by considering frequency characteristics that previous methods overlooked.

Abstract: Vision Transformers have demonstrated exceptional performance across various computer vision tasks, yet their quadratic computational complexity concerning token length remains a significant challenge. To address this, token reduction methods have been widely explored. However, existing approaches often overlook the frequency characteristics of self-attention, such as rank collapsing and over-smoothing phenomenon. In this paper, we propose a frequency-aware token reduction strategy that improves computational efficiency while preserving performance by mitigating rank collapsing. Our method partitions tokens into high-frequency tokens and low-frequency tokens. high-frequency tokens are selectively preserved, while low-frequency tokens are aggregated into a compact direct current token to retain essential low-frequency components. Through extensive experiments and analysis, we demonstrate that our approach significantly improves accuracy while reducing computational overhead and mitigating rank collapsing and over smoothing. Furthermore, we analyze the previous methods, shedding light on their implicit frequency characteristics and limitations.

[171] Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning

Xin Gu, Haoji Zhang, Qihang Fan, Jingxuan Niu, Zhipeng Zhang, Libo Zhang, Guang Chen, Fan Chen, Longyin Wen, Sijie Zhu

Main category: cs.CV

TL;DR: STVG-o1 enables off-the-shelf multimodal large language models (MLLMs) to achieve state-of-the-art spatio-temporal video grounding performance without architectural changes, using bounding-box chain-of-thought reasoning and multi-dimensional reinforcement rewards.

Details

Motivation: Multimodal large language models (MLLMs) underperform on spatio-temporal video grounding (STVG) due to misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders.

Method: Proposes STVG-o1 framework with bounding-box chain-of-thought mechanism for explicit spatio-temporal reasoning and multi-dimensional reinforcement reward function (format, consistency, temporal, spatial, think rewards) for geometry-aware supervision during reinforcement fine-tuning.

Result: Sets new state-of-the-art results on HCSTVG, outperforming best task-specific method by 7.3% m_tIoU on HCSTVG-v1, matches specialized models on VidSTG, and surpasses all existing MLLM-based approaches by large margins. Demonstrates strong open-vocabulary generalization across datasets.

Conclusion: Establishes MLLMs as viable and powerful backbones for precise spatio-temporal video grounding, enabling off-the-shelf models to achieve state-of-the-art performance without architectural modifications.

Abstract: Spatio-temporal video grounding (STVG) requires localizing a target object in untrimmed videos both temporally and spatially from natural language descriptions. Despite their strong language understanding, multimodal large language models (MLLMs) underperform on STVG due to misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders. To address this, we propose STVG-o1, the first framework that enables off-the-shelf MLLMs to achieve state-of-the-art STVG performance without any architectural modifications. Our method introduces a bounding-box chain-of-thought mechanism that explicitly reasons about spatio-temporal locations in an intermediate step before producing the final prediction. We further design a multi-dimensional reinforcement reward function consisting of format, consistency, temporal, spatial, and think rewards, which provides geometry-aware supervision through reinforcement fine-tuning. Evaluated on HCSTVG-v1/v2 and VidSTG, STVG-o1 sets new state-of-the-art results on HCSTVG, outperforming the best task-specific method by 7.3% m_tIoU on HCSTVG-v1, matching specialized models on VidSTG, and surpassing all existing MLLM-based approaches by large margins. It also demonstrates strong open-vocabulary generalization across datasets, establishing MLLMs as viable and powerful backbones for precise spatio-temporal grounding. Our code and models will be released.

[172] Merge and Bound: Direct Manipulations on Weights for Class Incremental Learning

Taehoon Kim, Donghwan Jang, Bohyung Han

Main category: cs.CV

TL;DR: Merge-and-Bound (M&B) is a novel Class Incremental Learning approach that manipulates model weights through inter-task and intra-task merging with bounded updates to prevent catastrophic forgetting.

Details

Motivation: To address catastrophic forgetting in Class Incremental Learning by directly optimizing model weights in parameter space while preserving knowledge from previous tasks.

Method: Uses two types of weight merging: inter-task (averaging weights from previous stages) and intra-task (combining parameters within current stage), plus bounded update technique to minimize cumulative updates and preserve old knowledge.

Result: Demonstrates superior performance compared to state-of-the-art methods on standard CIL benchmarks.

Conclusion: M&B effectively reduces catastrophic forgetting and can be seamlessly integrated into existing CIL methods without architectural changes or revised learning objectives.

Abstract: We present a novel training approach, named Merge-and-Bound (M&B) for Class Incremental Learning (CIL), which directly manipulates model weights in the parameter space for optimization. Our algorithm involves two types of weight merging: inter-task weight merging and intra-task weight merging. Inter-task weight merging unifies previous models by averaging the weights of models from all previous stages. On the other hand, intra-task weight merging facilitates the learning of current task by combining the model parameters within current stage. For reliable weight merging, we also propose a bounded update technique that aims to optimize the target model with minimal cumulative updates and preserve knowledge from previous tasks; this strategy reveals that it is possible to effectively obtain new models near old ones, reducing catastrophic forgetting. M&B is seamlessly integrated into existing CIL methods without modifying architecture components or revising learning objectives. We extensively evaluate our algorithm on standard CIL benchmarks and demonstrate superior performance compared to state-of-the-art methods.

[173] DiverseVAR: Balancing Diversity and Quality of Next-Scale Visual Autoregressive Models

Mingue Park, Prin Phunyaphibarn, Phillip Y. Lee, Minhyuk Sung

Main category: cs.CV

TL;DR: DiverseVAR enhances diversity in text-conditioned visual autoregressive models without retraining by combining text-embedding noise injection with scale-travel latent refinement.

Details

Motivation: VAR models suffer from limited diversity, producing nearly identical images for simple prompts, which has been overlooked amid focus on image quality.

Method: Two-stage approach: 1) Inject noise into text embeddings to increase diversity, 2) Use scale-travel latent refinement to preserve image quality by resuming generation at intermediate stages using multi-scale autoencoder.

Result: Combining text-embedding noise injection with scale-travel refinement significantly enhances diversity while minimizing quality degradation, achieving new Pareto frontier in diversity-quality trade-off.

Conclusion: DiverseVAR provides an effective test-time solution to improve VAR model diversity without computational overhead or retraining requirements.

Abstract: We introduce DiverseVAR, a framework that enhances the diversity of text-conditioned visual autoregressive models (VAR) at test time without requiring retraining, fine-tuning, or substantial computational overhead. While VAR models have recently emerged as strong competitors to diffusion and flow models for image generation, they suffer from a critical limitation in diversity, often producing nearly identical images even for simple prompts. This issue has largely gone unnoticed amid the predominant focus on image quality. We address this limitation at test time in two stages. First, inspired by diversity enhancement techniques in diffusion models, we propose injecting noise into the text embedding. This introduces a trade-off between diversity and image quality: as diversity increases, the image quality sharply declines. To preserve quality, we propose scale-travel: a novel latent refinement technique inspired by time-travel strategies in diffusion models. Specifically, we use a multi-scale autoencoder to extract coarse-scale tokens that enable us to resume generation at intermediate stages. Extensive experiments show that combining text-embedding noise injection with our scale-travel refinement significantly enhances diversity while minimizing image-quality degradation, achieving a new Pareto frontier in the diversity-quality trade-off.

[174] E-M3RF: An Equivariant Multimodal 3D Re-assembly Framework

Adeela Islam, Stefano Fiorini, Manuel Lecha, Theodore Tsesmelis, Stuart James, Pietro Morerio, Alessio Del Bue

Main category: cs.CV

TL;DR: E-M3RF is an equivariant multimodal 3D reassembly framework that combines geometric and color features using SE(3) flow matching to address limitations of geometry-only methods, achieving significant error reductions on cultural heritage datasets.

Details

Motivation: Traditional learning-based 3D reassembly methods rely primarily on geometric features, which struggle with ambiguous cases like small, eroded, or symmetric fragments where geometry alone is insufficient. Current solutions also lack physical constraints to prevent overlapping assemblies.

Method: Uses multimodal representation combining: 1) rotation-equivariant encoder for 3D point positions, 2) transformer for color features at each point. Predicts transformations using SE(3) flow matching on point clouds with both positions and colors.

Result: On RePAIR dataset: 23.1% reduction in rotation error, 13.2% reduction in translation error, 18.4% decrease in Chamfer Distance compared to competing methods. Validated on 4 datasets including synthetic (Breaking Bad, Fantastic Breaks) and real-world cultural heritage (RePAIR, Presious).

Conclusion: E-M3RF effectively addresses limitations of geometry-only reassembly by incorporating multimodal features and physical constraints, demonstrating superior performance particularly on challenging cultural heritage reconstruction tasks.

Abstract: 3D reassembly is a fundamental geometric problem, and in recent years it has increasingly been challenged by deep learning methods rather than classical optimization. While learning approaches have shown promising results, most still rely primarily on geometric features to assemble a whole from its parts. As a result, methods struggle when geometry alone is insufficient or ambiguous, for example, for small, eroded, or symmetric fragments. Additionally, solutions do not impose physical constraints that explicitly prevent overlapping assemblies. To address these limitations, we introduce E-M3RF, an equivariant multimodal 3D reassembly framework that takes as input the point clouds, containing both point positions and colors of fractured fragments, and predicts the transformations required to reassemble them using SE(3) flow matching. Each fragment is represented by both geometric and color features: i) 3D point positions are encoded as rotationconsistent geometric features using a rotation-equivariant encoder, ii) the colors at each 3D point are encoded with a transformer. The two feature sets are then combined to form a multimodal representation. We experimented on four datasets: two synthetic datasets, Breaking Bad and Fantastic Breaks, and two real-world cultural heritage datasets, RePAIR and Presious, demonstrating that E-M3RF on the RePAIR dataset reduces rotation error by 23.1% and translation error by 13.2%, while Chamfer Distance decreases by 18.4% compared to competing methods.

[175] Multimodal Robust Prompt Distillation for 3D Point Cloud Models

Xiang Gu, Liming Lu, Xu Zheng, Anan Du, Yongbin Zhou, Shuchao Pang

Main category: cs.CV

TL;DR: MRPD is an efficient teacher-student framework that uses multimodal knowledge distillation with confidence-gated mechanism to build robust 3D point cloud models against adversarial attacks, with no inference overhead.

Details

Motivation: Existing defense methods for 3D point cloud models suffer from high computational overhead and poor generalization across different attack types, limiting their practical deployment in security-sensitive applications.

Method: Proposes Multimodal Robust Prompt Distillation (MRPD) - a teacher-student framework that learns lightweight prompts by aligning student model features with robust embeddings from three teachers: vision model (depth projections), high-performance 3D model, and text encoder, guided by confidence-gated mechanism.

Result: MRPD substantially outperforms state-of-the-art defense methods against wide range of white-box and black-box attacks, while achieving better performance on clean data. No additional computational cost at inference since distillation occurs only during training.

Conclusion: Presents a practical paradigm for building robust 3D vision systems by efficiently harnessing multimodal knowledge through confidence-gated distillation.

Abstract: Adversarial attacks pose a significant threat to learning-based 3D point cloud models, critically undermining their reliability in security-sensitive applications. Existing defense methods often suffer from (1) high computational overhead and (2) poor generalization ability across diverse attack types. To bridge these gaps, we propose a novel yet efficient teacher-student framework, namely Multimodal Robust Prompt Distillation (MRPD) for distilling robust 3D point cloud model. It learns lightweight prompts by aligning student point cloud model’s features with robust embeddings from three distinct teachers: a vision model processing depth projections, a high-performance 3D model, and a text encoder. To ensure a reliable knowledge transfer, this distillation is guided by a confidence-gated mechanism which dynamically balances the contribution of all input modalities. Notably, since the distillation is all during the training stage, there is no additional computational cost at inference. Extensive experiments demonstrate that MRPD substantially outperforms state-of-the-art defense methods against a wide range of white-box and black-box attacks, while even achieving better performance on clean data. Our work presents a new, practical paradigm for building robust 3D vision systems by efficiently harnessing multimodal knowledge.

[176] MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices

Shuai Zhang, Bao Tang, Siyuan Yu, Yueting Zhu, Jingfeng Yao, Ya Zou, Shanglin Yuan, Li Yu, Wenyu Liu, Xinggang Wang

Main category: cs.CV

TL;DR: MobileI2V is a 270M lightweight diffusion model that enables real-time 720p image-to-video generation on mobile devices through linear hybrid architecture, time-step distillation, and mobile-specific optimizations.

Details

Motivation: Address the computational complexity and slow generation speed challenges of diffusion models for real-time, high-resolution video generation on resource-constrained mobile devices.

Method: Proposed linear hybrid architecture denoiser balancing efficiency and quality, time-step distillation compressing sampling from 20+ to 2 steps, and mobile-specific attention optimizations for 2x speed-up.

Result: Achieves fast 720p image-to-video generation with quality comparable to existing models, with each frame generation under 100ms under one-step conditions and 10x speed increase overall.

Conclusion: MobileI2V enables real-time high-resolution video generation on mobile devices for the first time, demonstrating the feasibility of deploying advanced diffusion models on resource-constrained platforms.

Abstract: Recently, video generation has witnessed rapid advancements, drawing increasing attention to image-to-video (I2V) synthesis on mobile devices. However, the substantial computational complexity and slow generation speed of diffusion models pose significant challenges for real-time, high-resolution video generation on resource-constrained mobile devices. In this work, we propose MobileI2V, a 270M lightweight diffusion model for real-time image-to-video generation on mobile devices. The core lies in: (1) We analyzed the performance of linear attention modules and softmax attention modules on mobile devices, and proposed a linear hybrid architecture denoiser that balances generation efficiency and quality. (2) We design a time-step distillation strategy that compresses the I2V sampling steps from more than 20 to only two without significant quality loss, resulting in a 10-fold increase in generation speed. (3) We apply mobile-specific attention optimizations that yield a 2-fold speed-up for attention operations during on-device inference. MobileI2V enables, for the first time, fast 720p image-to-video generation on mobile devices, with quality comparable to existing models. Under one-step conditions, the generation speed of each frame of 720p video is less than 100 ms. Our code is available at: https://github.com/hustvl/MobileI2V.

[177] CanKD: Cross-Attention-based Non-local operation for Feature-based Knowledge Distillation

Shizhe Sun, Wataru Ohyama

Main category: cs.CV

TL;DR: CanKD is a novel feature-based knowledge distillation framework using cross-attention to enable non-local knowledge transfer between teacher and student models, outperforming state-of-the-art methods in object detection and segmentation tasks.

Details

Motivation: Traditional self-attention-based distillation methods align teacher and student feature maps independently, which limits the thorough capture of pixel-wise relationships. CanKD aims to enable more comprehensive knowledge transfer by allowing dynamic consideration of all pixels between teacher and student feature maps.

Method: Proposes Cross-Attention-based Non-local Knowledge Distillation (CanKD) that uses cross-attention mechanisms where each pixel in the student feature map dynamically considers all pixels in the teacher feature map. The method introduces only an additional loss function without complex architectural changes.

Result: Extensive experiments on object detection and image segmentation tasks show that CanKD outperforms state-of-the-art feature and hybrid distillation methods, demonstrating superior performance compared to existing attention-guided distillation approaches.

Conclusion: CanKD represents a new paradigm for attention-guided distillation in computer vision tasks, effectively capturing pixel-wise relationships through non-local knowledge transfer and achieving better performance with minimal additional complexity.

Abstract: We propose Cross-Attention-based Non-local Knowledge Distillation (CanKD), a novel feature-based knowledge distillation framework that leverages cross-attention mechanisms to enhance the knowledge transfer process. Unlike traditional self-attention-based distillation methods that align teacher and student feature maps independently, CanKD enables each pixel in the student feature map to dynamically consider all pixels in the teacher feature map. This non-local knowledge transfer more thoroughly captures pixel-wise relationships, improving feature representation learning. Our method introduces only an additional loss function to achieve superior performance compared with existing attention-guided distillation methods. Extensive experiments on object detection and image segmentation tasks demonstrate that CanKD outperforms state-of-the-art feature and hybrid distillation methods. These experimental results highlight CanKD’s potential as a new paradigm for attention-guided distillation in computer vision tasks. Code is available at https://github.com/tori-hotaru/CanKD

[178] Generalized Design Choices for Deepfake Detectors

Lorenzo Pellegrini, Serafino Pandolfini, Davide Maltoni, Matteo Ferrara, Marco Prati, Marco Ramilli

Main category: cs.CV

TL;DR: Systematic investigation of design choices in deepfake detection reveals that implementation details like data preprocessing and augmentation significantly impact performance more than core architecture, leading to architecture-agnostic best practices.

Details

Motivation: Deepfake detection effectiveness depends heavily on implementation details rather than core design, making fair comparisons difficult and obscuring true performance factors.

Method: Systematically investigate how different design choices influence accuracy and generalization, focusing on training, inference, and incremental updates by isolating individual factor impacts.

Result: Identified a set of design choices that consistently improve deepfake detection and enable state-of-the-art performance on AI-GenBench benchmark.

Conclusion: Established robust, architecture-agnostic best practices for future deepfake detection system design and development.

Abstract: The effectiveness of deepfake detection methods often depends less on their core design and more on implementation details such as data preprocessing, augmentation strategies, and optimization techniques. These factors make it difficult to fairly compare detectors and to understand which factors truly contribute to their performance. To address this, we systematically investigate how different design choices influence the accuracy and generalization capabilities of deepfake detection models, focusing on aspects related to training, inference, and incremental updates. By isolating the impact of individual factors, we aim to establish robust, architecture-agnostic best practices for the design and development of future deepfake detection systems. Our experiments identify a set of design choices that consistently improve deepfake detection and enable state-of-the-art performance on the AI-GenBench benchmark.

[179] Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, Ke Zhu

Main category: cs.CV

TL;DR: Qwen3-VL is the most advanced vision-language model in the Qwen series, featuring native support for 256K token interleaved contexts (text, images, video), available in dense (2B-32B) and MoE (30B/235B) variants with superior multimodal reasoning capabilities.

Details

Motivation: To create a more capable vision-language model that addresses the need for robust long-context comprehension, advanced multimodal reasoning, and stronger pure-text understanding while accommodating diverse latency-quality trade-offs for real-world applications.

Method: Three key architectural upgrades: enhanced interleaved-MRoPE for spatial-temporal modeling, DeepStack integration for multi-level ViT features to improve vision-language alignment, and text-based time alignment for video (evolving from T-RoPE to explicit textual timestamp alignment).

Result: Achieves superior performance across multimodal benchmarks including MMMU, MathVista, and MathVision. Demonstrates leading performance in single-image, multi-image, and video tasks with faithful retention, retrieval, and cross-referencing across long documents and videos.

Conclusion: Qwen3-VL serves as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows, delivering state-of-the-art performance under comparable token budgets and latency constraints.

Abstract: We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

[180] Self-Paced Learning for Images of Antinuclear Antibodies

Yiyang Jiang, Guangwu Qian, Jiaxin Wu, Qi Huang, Qing Li, Yongkang Wu, Xiao-Yong Wei

Main category: cs.CV

TL;DR: A novel framework for automated ANA detection using multi-instance multi-label learning with task-specific components that mimic human labeling logic, achieving state-of-the-art performance on medical datasets.

Details

Motivation: Manual ANA testing for autoimmune disorders is slow, labor-intensive, and requires extensive training. Existing automation methods struggle with the multi-instance, multi-label nature of real-world clinical ANA detection involving over 100 antibody types and complex fluorescent patterns.

Method: Proposed framework uses three components: instance sampler to suppress low-confidence instances, probabilistic pseudo-label dispatcher for adaptive label assignment, and self-paced weight learning rate coefficients. Inspired by human labeling logic, it identifies consistent ANA sub-regions and assigns aggregated labels without manual preprocessing.

Result: Achieved up to +7.0% F1-Macro and +12.6% mAP gains on ANA dataset over prior methods. Ranked top-2 across all key metrics on public medical MIML benchmarks, reducing Hamming loss by up to 18.2% and one-error by 26.9%.

Conclusion: The framework effectively handles MIML complexities in ANA detection, supports end-to-end optimization, and sets new state-of-the-art results while overcoming limitations of traditional MIML methods.

Abstract: Antinuclear antibody (ANA) testing is a crucial method for diagnosing autoimmune disorders, including lupus, Sjögren’s syndrome, and scleroderma. Despite its importance, manual ANA detection is slow, labor-intensive, and demands years of training. ANA detection is complicated by over 100 coexisting antibody types, resulting in vast fluorescent pattern combinations. Although machine learning and deep learning have enabled automation, ANA detection in real-world clinical settings presents unique challenges as it involves multi-instance, multi-label (MIML) learning. In this paper, a novel framework for ANA detection is proposed that handles the complexities of MIML tasks using unaltered microscope images without manual preprocessing. Inspired by human labeling logic, it identifies consistent ANA sub-regions and assigns aggregated labels accordingly. These steps are implemented using three task-specific components: an instance sampler, a probabilistic pseudo-label dispatcher, and self-paced weight learning rate coefficients. The instance sampler suppresses low-confidence instances by modeling pattern confidence, while the dispatcher adaptively assigns labels based on instance distinguishability. Self-paced learning adjusts training according to empirical label observations. Our framework overcomes limitations of traditional MIML methods and supports end-to-end optimization. Extensive experiments on one ANA dataset and three public medical MIML benchmarks demonstrate the superiority of our framework. On the ANA dataset, our model achieves up to +7.0% F1-Macro and +12.6% mAP gains over the best prior method, setting new state-of-the-art results. It also ranks top-2 across all key metrics on public datasets, reducing Hamming loss and one-error by up to 18.2% and 26.9%, respectively. The source code can be accessed at https://github.com/fletcherjiang/ANA-SelfPacedLearning.

[181] Continual Error Correction on Low-Resource Devices

Kirill Paramonov, Mete Ozay, Aristeidis Mystakidis, Nikolaos Tsalikidis, Dimitrios Sotos, Anastasios Drosou, Dimitrios Tzovaras, Hyunjun Kim, Kiseok Chang, Sangdok Mo, Namwoong Kim, Woojong Yoo, Jijoong Moon, Umberto Michieli

Main category: cs.CV

TL;DR: A system for efficient AI error correction on resource-constrained devices using few-shot learning with prototype updates instead of model retraining.

Details

Motivation: Address AI prediction errors on everyday devices where existing solutions lack efficient correction mechanisms, especially for resource-constrained environments.

Method: Combines server-side foundation model training with knowledge distillation and device-side prototype-based classification that enables error correction through prototype updates rather than full model retraining.

Result: Achieved over 50% error correction in one-shot scenarios on Food-101 and Flowers-102 datasets with minimal forgetting (<0.02%) and negligible computational overhead.

Conclusion: The system proves practical for real-world deployment, enabling efficient AI error correction on resource-constrained devices through the proposed prototype-based approach.

Abstract: The proliferation of AI models in everyday devices has highlighted a critical challenge: prediction errors that degrade user experience. While existing solutions focus on error detection, they rarely provide efficient correction mechanisms, especially for resource-constrained devices. We present a novel system enabling users to correct AI misclassifications through few-shot learning, requiring minimal computational resources and storage. Our approach combines server-side foundation model training with on-device prototype-based classification, enabling efficient error correction through prototype updates rather than model retraining. The system consists of two key components: (1) a server-side pipeline that leverages knowledge distillation to transfer robust feature representations from foundation models to device-compatible architectures, and (2) a device-side mechanism that enables ultra-efficient error correction through prototype adaptation. We demonstrate our system’s effectiveness on both image classification and object detection tasks, achieving over 50% error correction in one-shot scenarios on Food-101 and Flowers-102 datasets while maintaining minimal forgetting (less than 0.02%) and negligible computational overhead. Our implementation, validated through an Android demonstration app, proves the system’s practicality in real-world scenarios.

[182] EoS-FM: Can an Ensemble of Specialist Models act as a Generalist Feature Extractor?

Pierre Adorni, Minh-Tan Pham, Stéphane May, Sébastien Lefèvre

Main category: cs.CV

TL;DR: An Ensemble-of-Specialists framework for Remote Sensing Foundation Models that decomposes training into lightweight task-specific specialists, offering efficiency, interpretability and extensibility advantages over large monolithic models.

Details

Motivation: Current foundation model approaches require prohibitive computational resources and contradict sustainable AI principles. There's a need for more accessible and environmentally responsible alternatives in Earth Observation.

Method: Decomposes training into lightweight ConvNeXtV2 specialists that are task-specific, can be frozen and reused, and supports federated training, pruning, and continuous integration.

Result: The framework provides strong advantages in efficiency, interpretability, and extensibility while being well-suited for collaborative and resource-constrained settings.

Conclusion: This ensemble approach sets a new direction for building scalable and efficient Remote Sensing Foundation Models that are more sustainable and accessible than current large-scale approaches.

Abstract: Recent advances in foundation models have shown great promise in domains such as natural language processing and computer vision, and similar efforts are now emerging in the Earth Observation community. These models aim to generalize across tasks with limited supervision, reducing the need for training separate models for each task. However, current strategies, which largely focus on scaling model size and dataset volume, require prohibitive computational and data resources, limiting accessibility to only a few large institutions. Moreover, this paradigm of ever-larger models stands in stark contrast with the principles of sustainable and environmentally responsible AI, as it leads to immense carbon footprints and resource inefficiency. In this work, we present a novel and efficient alternative: an Ensemble-of-Specialists framework for building Remote Sensing Foundation Models (RSFMs). Our method decomposes the training process into lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused. This modular approach offers strong advantages in efficiency, interpretability, and extensibility. Moreover, it naturally supports federated training, pruning, and continuous specialist integration, making it particularly well-suited for collaborative and resource-constrained settings. Our framework sets a new direction for building scalable and efficient RSFMs.

[183] Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models

Naifu Zhang, Wei Tao, Xi Xiao, Qianpu Sun, Yuxin Zheng, Wentao Mo, Peiqiang Wang, Nan Zhang

Main category: cs.CV

TL;DR: ADVLA is an efficient adversarial attack framework for Vision-Language-Action models that applies perturbations directly in feature space, achieving high attack success with minimal patch modifications and low computational cost.

Details

Motivation: Existing adversarial attack methods for VLA models require costly end-to-end training and generate noticeable perturbation patches, limiting their practicality.

Method: ADVLA applies adversarial perturbations on features projected from visual encoder to textual feature space, using attention guidance and three strategies to enhance sensitivity, enforce sparsity, and concentrate perturbations.

Result: Under L∞=4/255 constraint, ADVLA with Top-K masking modifies <10% of patches while achieving ~100% attack success rate, with perturbations concentrated on critical regions and taking only 0.06 seconds per iteration.

Conclusion: ADVLA effectively weakens VLA model action predictions under low-amplitude sparse conditions, avoiding high training costs and conspicuous perturbations of traditional attacks, demonstrating practical value for VLA feature space attacks.

Abstract: In recent years, Vision-Language-Action (VLA) models in embodied intelligence have developed rapidly. However, existing adversarial attack methods require costly end-to-end training and often generate noticeable perturbation patches. To address these limitations, we propose ADVLA, a framework that directly applies adversarial perturbations on features projected from the visual encoder into the textual feature space. ADVLA efficiently disrupts downstream action predictions under low-amplitude constraints, and attention guidance allows the perturbations to be both focused and sparse. We introduce three strategies that enhance sensitivity, enforce sparsity, and concentrate perturbations. Experiments demonstrate that under an $L_{\infty}=4/255$ constraint, ADVLA combined with Top-K masking modifies less than 10% of the patches while achieving an attack success rate of nearly 100%. The perturbations are concentrated on critical regions, remain almost imperceptible in the overall image, and a single-step iteration takes only about 0.06 seconds, significantly outperforming conventional patch-based attacks. In summary, ADVLA effectively weakens downstream action predictions of VLA models under low-amplitude and locally sparse conditions, avoiding the high training costs and conspicuous perturbations of traditional patch attacks, and demonstrates unique effectiveness and practical value for attacking VLA feature spaces.

[184] The Age-specific Alzheimer ’s Disease Prediction with Characteristic Constraints in Nonuniform Time Span

Xin Hong, Kaifeng Huang

Main category: cs.CV

TL;DR: A novel method for generating sequential MRI images with quantitative metrics and age-scaling to predict Alzheimer’s disease progression, achieving high structural similarity in synthesized images.

Details

Motivation: Timely identification of Alzheimer's disease is crucial for personalized treatment, but current image generation methods struggle with irregular time intervals and accurate disease characteristic representation.

Method: Sequential image generation guided by quantitative metrics with integrated age-scaling factor to produce age-specific MRI images for disease prediction.

Result: Quantitative metrics significantly improved MRI image synthesis accuracy, age-scaled pixel loss enhanced iterative generation, and achieved Structural Similarity Index of 0.882 for long-term prognosis.

Conclusion: The proposed methodology effectively generates age-specific MRI images that maintain disease progression characteristics, enabling improved prediction of advanced Alzheimer’s disease stages.

Abstract: Alzheimer’s disease is a debilitating disorder marked by a decline in cognitive function. Timely identification of the disease is essential for the development of personalized treatment strategies that aim to mitigate its progression. The application of generated images for the prediction of Alzheimer’s disease poses challenges, particularly in accurately representing the disease’s characteristics when input sequences are captured at irregular time intervals. This study presents an innovative methodology for sequential image generation, guided by quantitative metrics, to maintain the essential features indicative of disease progression. Furthermore, an age-scaling factor is integrated into the process to produce age-specific MRI images, facilitating the prediction of advanced stages of the disease. The results obtained from the ablation study suggest that the inclusion of quantitative metrics significantly improves the accuracy of MRI image synthesis. Furthermore, the application of age-scaled pixel loss contributed to the enhanced iterative generation of MRI images. In terms of long-term disease prognosis, the Structural Similarity Index reached a peak value of 0.882, indicating a substantial degree of similarity in the synthesized images.

[185] Video Generation Models Are Good Latent Reward Models

Xiaoyue Mi, Wenqing Yu, Jiesong Lian, Shibo Jie, Ruizhe Zhong, Zijun Liu, Guozhen Zhang, Zixiang Zhou, Zhiyong Xu, Yuan Zhou, Qinglin Lu, Fan Tang

Main category: cs.CV

TL;DR: PRFL is a latent-space reward feedback learning framework for video generation that avoids VAE decoding, enabling efficient optimization throughout the entire denoising process while improving human preference alignment.

Details

Motivation: Existing video reward models require pixel-space inputs, leading to high memory overhead, slow training, and late-stage optimization that only refines visual quality rather than fundamental motion dynamics and structural coherence.

Method: Leverages pre-trained video generation models as reward models in noisy latent space, conducting preference optimization entirely in latent space without VAE decoding, enabling full gradient backpropagation through the denoising chain.

Result: PRFL significantly improves alignment with human preferences while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.

Conclusion: Pre-trained video generation models are naturally suited for latent-space reward modeling, enabling efficient and effective preference optimization that improves both visual quality and motion dynamics.

Abstract: Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning~(PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.

[186] CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

Zhihang Liu, Chen-Wei Xie, Bin Wen, Feiwu Yu, Jixuan Chen, Pandeng Li, Boqiang Zhang, Nianzu Yang, Yinglu Li, Zuan Gao, Yun Zheng, Hongtao Xie

Main category: cs.CV

TL;DR: CAPability is a comprehensive multi-view benchmark for evaluating visual captioning across 12 dimensions spanning six critical views, addressing limitations of outdated benchmarks with modern MLLMs.

Details

Motivation: Current visual captioning benchmarks are outdated for modern multimodal large language models (MLLMs) as brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. Recent benchmarks remain limited to vague-view or object-view analyses with incomplete visual element coverage.

Method: Introduced CAPability benchmark with nearly 11K human-annotated images and videos with visual element annotations. Uses precision and hit metrics to assess correctness and thoroughness of captions. Introduces heuristic metric ‘know but cannot tell’ (K¬T) by converting annotations to QA pairs to measure performance gap between QA and caption capabilities.

Result: The benchmark provides stable assessment of caption correctness and thoroughness across 12 dimensions. Identifies significant performance gap between QA and caption capabilities in MLLMs through the K¬T metric.

Conclusion: CAPability provides holistic analysis of MLLMs’ captioning abilities, identifying their strengths and weaknesses across various dimensions to guide future research in enhancing specific aspects of their capabilities.

Abstract: Visual captioning benchmarks have become outdated with the emergence of modern multimodal large language models (MLLMs), as the brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. While recent benchmarks attempt to address this by focusing on keyword extraction or object-centric evaluation, they remain limited to vague-view or object-view analyses and incomplete visual element coverage. In this paper, we introduce CAPability, a comprehensive multi-view benchmark for evaluating visual captioning across 12 dimensions spanning six critical views. We curate nearly 11K human-annotated images and videos with visual element annotations to evaluate the generated captions. CAPability stably assesses both the correctness and thoroughness of captions with \textit{precision} and \textit{hit} metrics. By converting annotations to QA pairs, we further introduce a heuristic metric, \textit{know but cannot tell} ($K\bar{T}$), indicating a significant performance gap between QA and caption capabilities. Our work provides a holistic analysis of MLLMs’ captioning abilities, as we identify their strengths and weaknesses across various dimensions, guiding future research to enhance specific aspects of their capabilities.

[187] UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes

Kang Du, Xue Liao, Junpeng Xia, Chaozheng Guo, Yi Gu, Yirui Guan, Duotun Wang, ShengHuang, Zeyu Wang

Main category: cs.CV

TL;DR: UAVLight is a new benchmark for illumination-robust 3D reconstruction that captures scenes at multiple fixed times of day along repeatable flight paths to provide natural lighting variation while maintaining consistent geometry, calibration, and viewpoints.

Details

Motivation: Illumination inconsistency from sunlight direction, cloud cover, and shadows breaks constant-lighting assumptions in multi-view 3D reconstruction, causing geometry drift, color inconsistency, and shadow imprinting. Existing datasets either lack meaningful illumination diversity or span too long periods where geometric changes confound lighting studies.

Method: The benchmark captures scenes along repeatable, geo-referenced flight paths at multiple fixed times of day, producing natural lighting variation while maintaining consistent geometry, calibration, and viewpoints.

Result: UAVLight provides a controlled-yet-real benchmark with standardized evaluation protocols across different lighting conditions.

Conclusion: This benchmark provides a reliable foundation for developing and benchmarking reconstruction methods that are consistent, faithful, and relightable in real outdoor environments.

Abstract: Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the constant-lighting assumption underlying both classical multi-view stereo (MVS) and structure from motion (SfM) pipelines and recent neural rendering methods, leading to geometry drift, color inconsistency, and shadow imprinting. This issue is especially critical in UAV-based reconstruction, where long flight durations and outdoor environments make lighting changes unavoidable. However, existing datasets either restrict capture to short time windows, thus lacking meaningful illumination diversity, or span months and seasons, where geometric and semantic changes confound the isolated study of lighting robustness. We introduce UAVLight, a controlled-yet-real benchmark for illumination-robust 3D reconstruction. Each scene is captured along repeatable, geo-referenced flight paths at multiple fixed times of day, producing natural lighting variation under consistent geometry, calibration, and viewpoints. With standardized evaluation protocols across lighting conditions, UAVLight provides a reliable foundation for developing and benchmarking reconstruction methods that are consistent, faithful, and relightable in real outdoor environments.

[188] LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?

Maoyuan Ye, Haibin He, Qihuang Zhong, Jing Zhang, Juhua Liu, Bo Du

Main category: cs.CV

TL;DR: LogicOCR is a new benchmark for evaluating complex logical reasoning of Large Multimodal Models on text-rich images, featuring 2780 questions across generated and real-world images. The study reveals LMMs’ limitations in multimodal reasoning and proposes TextCue, a training-free method that improves performance by enhancing text cue perception.

Details

Motivation: To address the underexplored area of complex logical reasoning performance of Large Multimodal Models on text-rich images, as current LMMs have advanced in reasoning and OCR capabilities but their performance on multimodal logical reasoning remains unclear.

Method: Created LogicOCR benchmark with 2780 questions in two subsets: LogicOCR-Gen (1100 multi-choice questions on generated images using GPT-Image-1) and LogicOCR-Real (1680 free-form questions on real-world images). Evaluated LMMs under Chain-of-Thought and direct-answer settings. Proposed TextCue method that uses attention maps and text segmentation to identify and enlarge important text regions.

Result: LMMs significantly lag in multimodal reasoning compared to text-only inputs, showing they haven’t fully integrated visual reading with reasoning. TextCue method achieved 1.8% accuracy gain over LLaVA-OV-1.5-8B under CoT setting. Key insights include impact of test-time scaling, input modality differences, and sensitivity to visual-text orientation.

Conclusion: Large Multimodal Models still struggle with bridging visual reading and complex logical reasoning. The proposed TextCue method effectively enhances text cue perception without training, demonstrating potential for improving multimodal reasoning capabilities. The LogicOCR benchmark provides a valuable tool for future research in this area.

Abstract: Recent advances in Large Multimodal Models (LMMs) have revolutionized their reasoning and Optical Character Recognition (OCR) capabilities. However, their complex logical reasoning performance on text-rich images remains underexplored. To bridge this gap, we introduce LogicOCR, a benchmark comprising 2780 questions with two subsets, i.e., LogicOCR-Gen with 1100 multi-choice questions on generated images, and LogicOCR-Real with 1680 meticulously designed free-form questions on real-world images. For constructing LogicOCR-Gen, we first curate a text corpus from the Chinese National Civil Servant Examination, and customize an automatic pipeline to steer GPT-Image-1 to generate images with varied layouts and fonts, ensuring contextual relevance and visual realism. Then, the generated images are manually verified. We evaluate a range of representative LMMs under Chain-of-Thought (CoT) and direct-answer settings. Our multi-dimensional analysis reveals key insights, such as the impact of test-time scaling, input modality differences, and sensitivity to visual-text orientation. Notably, LMMs still lag in multimodal reasoning compared to text-only inputs, indicating that they have not fully bridged visual reading with reasoning. Moreover, we propose TextCue, a training-free method that enhances LMMs’ perception of image regions containing important text cues for solving questions. We leverage LMMs’ attention maps and an off-the-shelf text segmentation specialist to determine the region, which is then cropped and enlarged to augment the original image. Experiments show its effectiveness, e.g., a 1.8% accuracy gain over LLaVA-OV-1.5-8B under the CoT setting. Our benchmark is available at https://github.com/MiliLab/LogicOCR.

[189] Enhanced Landmark Detection Model in Pelvic Fluoroscopy using 2D/3D Registration Loss

Chou Mo, Yehyun Suh, J. Ryan Martin, Daniel Moyer

Main category: cs.CV

TL;DR: A framework combining 2D/3D landmark registration with U-Net training improves pelvic landmark detection accuracy under variable patient poses in intra-operative fluoroscopy.

Details

Motivation: Current pelvic landmark detection methods assume fixed Antero-Posterior views, but real intra-operative imaging often has variable orientations due to patient repositioning or imaging unit movement.

Method: Proposed framework incorporates 2D/3D landmark registration into U-Net training, comparing baseline U-Net, U-Net with Pose Estimation Loss, and U-Net fine-tuned with Pose Estimation Loss.

Result: The framework addresses limitations of fixed-view assumptions and improves detection accuracy under realistic variable pose conditions.

Conclusion: Integrating pose estimation into landmark detection training enhances performance for variable patient orientations in intra-operative pelvic fluoroscopy.

Abstract: Automated landmark detection offers an efficient approach for medical professionals to understand patient anatomic structure and positioning using intra-operative imaging. While current detection methods for pelvic fluoroscopy demonstrate promising accuracy, most assume a fixed Antero-Posterior view of the pelvis. However, orientation often deviates from this standard view, either due to repositioning of the imaging unit or of the target structure itself. To address this limitation, we propose a novel framework that incorporates 2D/3D landmark registration into the training of a U-Net landmark prediction model. We analyze the performance difference by comparing landmark detection accuracy between the baseline U-Net, U-Net trained with Pose Estimation Loss, and U-Net fine-tuned with Pose Estimation Loss under realistic intra-operative conditions where patient pose is variable.

[190] Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

Teng Hu, Zhentao Yu, Guozhen Zhang, Zihan Su, Zhengguang Zhou, Youliang Zhang, Yuan Zhou, Qinglin Lu, Ran Yi

Main category: cs.CV

TL;DR: Harmony is a novel framework that addresses audio-visual synchronization challenges in joint diffusion models through cross-task synergy training, global-local decoupled interaction, and synchronization-enhanced CFG.

Details

Motivation: Existing open-source models struggle with robust audio-video alignment due to three fundamental challenges: correspondence drift in joint diffusion, inefficient global attention mechanisms, and intra-modal bias in CFG that doesn't enhance cross-modal synchronization.

Method: Proposes three key components: 1) Cross-Task Synergy training paradigm using audio-driven video and video-driven audio generation tasks, 2) Global-Local Decoupled Interaction Module for efficient temporal-style alignment, 3) Synchronization-Enhanced CFG (SyncCFG) that explicitly amplifies alignment signals during inference.

Result: Extensive experiments show Harmony establishes new state-of-the-art performance, significantly outperforming existing methods in both generation fidelity and fine-grained audio-visual synchronization.

Conclusion: Harmony successfully overcomes fundamental challenges in joint audio-visual diffusion and achieves superior synchronization through its mechanistic approach to alignment enforcement.

Abstract: The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.

[191] Deep Learning-Based Multiclass Classification of Oral Lesions with Stratified Augmentation

Joy Naoum, Revana Salama, Ali Hamdi

Main category: cs.CV

TL;DR: Deep learning model for multiclass classification of 16 oral lesions using data augmentation and oversampling to address imbalanced datasets, achieving 83.33% accuracy and showing promise for early oral cancer detection.

Details

Motivation: Oral cancer is often diagnosed late due to visual similarity between benign, precancerous, and malignant lesions. Early implementation of computer-aided diagnosis systems can improve clinical outcomes.

Method: Combines stratified data splitting with advanced data augmentation and oversampling techniques to handle limited and imbalanced datasets for multiclass classification of 16 oral lesions.

Result: Achieved 83.33% accuracy, 89.12% precision, and 77.31% recall, demonstrating superiority over state-of-the-art methods with notable minority class classification performance.

Conclusion: The framework shows promise as a first step toward trustworthy computer-aided diagnostic systems for early detection of oral cancer in clinical settings, effectively demonstrating the value of oversampling and augmentation strategies.

Abstract: Oral cancer is highly common across the globe and is mostly diagnosed during the later stages due to the close visual similarity to benign, precancerous, and malignant lesions in the oral cavity. Implementing computer aided diagnosis systems early on has the potential to greatly improve clinical outcomes. This research intends to use deep learning to build a multiclass classifier for sixteen different oral lesions. To overcome the challenges of limited and imbalanced datasets, the proposed technique combines stratified data splitting and advanced data augmentation and oversampling to perform the classification. The experimental results, which achieved 83.33 percent accuracy, 89.12 percent precision, and 77.31 percent recall, demonstrate the superiority of the suggested model over state of the art methods now in use. The suggested model effectively conveys the effectiveness of oversampling and augmentation strategies in situations where the minority class classification performance is noteworthy. As a first step toward trustworthy computer aided diagnostic systems for the early detection of oral cancer in clinical settings, the suggested framework shows promise.

[192] MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training

Haotian Xue, Qi Chen, Zhonghao Wang, Xun Huang, Eli Shechtman, Jinrong Xie, Yongxin Chen

Main category: cs.CV

TL;DR: MoGAN is a motion-centric post-training framework that improves motion realism in video diffusion models by training an optical-flow discriminator and using distribution-matching regularization, achieving significant motion quality improvements without sacrificing visual fidelity.

Details

Motivation: Video diffusion models achieve good frame-level fidelity but struggle with motion coherence, dynamics and realism, producing jitter, ghosting, or implausible dynamics. The standard denoising MSE objective provides no direct supervision on temporal consistency.

Method: Built atop a 3-step distilled video diffusion model, MoGAN trains a DiT-based optical-flow discriminator to differentiate real from generated motion, combined with a distribution-matching regularizer to preserve visual fidelity.

Result: On VBench, MoGAN boosts motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model. On VideoJAM-Bench, it improves motion score by +7.4% over teacher and +8.8% over DMD. Human study shows preference for MoGAN’s motion quality (52% vs 38% for teacher; 56% vs 29% for DMD).

Conclusion: MoGAN delivers significantly more realistic motion without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation.

Abstract: Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics and realism, often producing jitter, ghosting, or implausible dynamics. A key limitation is that the standard denoising MSE objective provides no direct supervision on temporal consistency, allowing models to achieve low loss while still generating poor motion. We propose MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data. Built atop a 3-step distilled video diffusion model, we train a DiT-based optical-flow discriminator to differentiate real from generated motion, combined with a distribution-matching regularizer to preserve visual fidelity. With experiments on Wan2.1-T2V-1.3B, MoGAN substantially improves motion quality across benchmarks. On VBench, MoGAN boosts motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model. On VideoJAM-Bench, MoGAN improves motion score by +7.4% over the teacher and +8.8% over DMD, while maintaining comparable or even better aesthetic and image-quality scores. A human study further confirms that MoGAN is preferred for motion quality (52% vs. 38% for the teacher; 56% vs. 29% for DMD). Overall, MoGAN delivers significantly more realistic motion without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation. Project webpage is: https://xavihart.github.io/mogan.

[193] ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images

M. Naseer Subhani

Main category: cs.CV

TL;DR: A self-prompting framework adapts SAM to remote sensing imagery using only point annotations, achieving better performance than pretrained SAM and other point-supervised methods.

Details

Motivation: SAM performs poorly on remote sensing imagery due to domain shift and lack of dense annotations, requiring adaptation with minimal supervision.

Method: Uses a Refine-Requery-Reinforce loop with coarse pseudo-masks from points, self-constructed box prompts, and embedding alignment to reduce confirmation bias.

Result: Outperforms pretrained SAM and recent point-supervised methods on three RSI benchmarks (WHU, HRSID, NWPU VHR-10).

Conclusion: Self-prompting and semantic alignment enable efficient point-level adaptation of foundation models for remote sensing.

Abstract: Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally on remote sensing imagery (RSI) due to severe domain shift and the scarcity of dense annotations. To address this, we propose a self-prompting, point-supervised framework that adapts SAM to RSIs using only sparse point annotations. Our method employs a Refine-Requery-Reinforce loop, where coarse pseudo-masks are generated from initial points (Refine), improved with self-constructed box prompts (Requery), and embeddings are aligned across iterations to reduce confirmation bias (Reinforce). Without relying on full-mask supervision, our approach progressively enhances SAM’s segmentation quality and domain robustness through self-guided prompt adaptation . We evaluate our proposed method on three RSI benchmark datasets, including WHU, HRSID, and NWPU VHR-10, showing that our method consistently surpasses pretrained SAM and recent point-supervised segmentation methods. Our results demonstrate that self-prompting and semantic alignment provide an efficient path towards scalable, point-level adaptation of foundation segmentation models for remote sensing applications.

[194] Active Learning for GCN-based Action Recognition

Hichem Sahbi

Main category: cs.CV

TL;DR: Proposes a label-efficient GCN model for skeleton-based action recognition with two main contributions: an adversarial acquisition function for selecting informative exemplars and bidirectional/stable GCN architectures for better data mapping.

Details

Motivation: GCNs for skeleton-based action recognition often require large labeled datasets, which are scarce in practical settings, creating a need for label-efficient approaches.

Method: 1) Novel adversarial acquisition function to select compact set of informative exemplars balancing representativeness, diversity, and uncertainty; 2) Bidirectional and stable GCN architectures for better mapping between ambient and latent data spaces.

Result: Extensive evaluations on two challenging skeleton-based action recognition benchmarks show significant improvements over prior work.

Conclusion: The proposed label-efficient GCN model effectively addresses the data scarcity problem in skeleton-based action recognition through improved exemplar selection and network architecture.

Abstract: Despite the notable success of graph convolutional networks (GCNs) in skeleton-based action recognition, their performance often depends on large volumes of labeled data, which are frequently scarce in practical settings. To address this limitation, we propose a novel label-efficient GCN model. Our work makes two primary contributions. First, we develop a novel acquisition function that employs an adversarial strategy to identify a compact set of informative exemplars for labeling. This selection process balances representativeness, diversity, and uncertainty. Second, we introduce bidirectional and stable GCN architectures. These enhanced networks facilitate a more effective mapping between the ambient and latent data spaces, enabling a better understanding of the learned exemplar distribution. Extensive evaluations on two challenging skeleton-based action recognition benchmarks reveal significant improvements achieved by our label-efficient GCNs compared to prior work.

[195] UniChange: Unifying Change Detection with Multimodal Large Language Model

Xu Zhang, Danyang Li, Xiaohang Dong, Tianhao Wu, Hualong Yu, Jianye Wang, Qicheng Li, Xiang Li

Main category: cs.CV

TL;DR: UniChange is the first MLLM-based unified change detection model that integrates both binary change detection and semantic change detection tasks using language priors and special tokens, achieving state-of-the-art performance across multiple benchmarks.

Details

Motivation: Current change detection models are limited to single-type annotated data and cannot leverage diverse datasets, leading to poor generalization and limited versatility. The authors aim to create a unified framework that can handle both binary and semantic change detection tasks.

Method: Leverages Multimodal Large Language Models (MLLMs) with language priors and introduces three special tokens [T1], [T2], and [CHANGE] to unify BCD and SCD tasks. Uses text prompts to guide change category identification instead of predefined classification heads.

Result: Achieves state-of-the-art performance on four benchmarks: WHU-CD (90.41 IoU), S2Looking (53.04 IoU), LEVIR-CD+ (78.87 IoU), and SECOND (57.62 IoU), surpassing all previous methods.

Conclusion: UniChange successfully demonstrates that MLLMs can effectively unify change detection tasks, enabling knowledge acquisition from multi-source datasets even with conflicting class definitions, and provides a versatile framework for land cover monitoring.

Abstract: Change detection (CD) is a fundamental task for monitoring and analyzing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalization and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialized CD functionalities. Our model successfully unifies both BCD and SCD tasks through the introduction of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange utilizes text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods. The code is available at https://github.com/Erxucomeon/UniChange.

[196] CaFlow: Enhancing Long-Term Action Quality Assessment with Causal Counterfactual Flow

Ruisheng Han, Kanglei Zhou, Shuang Chen, Amir Atapour-Abarghouei, Hubert P. H. Shum

Main category: cs.CV

TL;DR: CaFlow is a unified framework for long-term Action Quality Assessment that combines counterfactual de-confounding with bidirectional time-conditioned flow to address challenges in modeling extended temporal dynamics while being robust to contextual confounders.

Details

Motivation: Long-term AQA in activities like figure skating or rhythmic gymnastics requires modeling extended temporal dynamics while remaining robust to contextual confounders. Existing approaches are vulnerable to spurious correlations and unstable long-term representations due to unidirectional temporal modeling and dependency on costly annotations.

Method: CaFlow integrates counterfactual de-confounding with bidirectional time-conditioned flow. It uses Causal Counterfactual Regularization (CCR) to disentangle causal and confounding features in a self-supervised manner, and BiT-Flow module to model forward and backward dynamics with cycle-consistency constraint for smoother representations.

Result: Extensive experiments on multiple long-term AQA benchmarks demonstrate that CaFlow achieves state-of-the-art performance.

Conclusion: CaFlow provides an effective solution for long-term AQA by addressing both causal robustness and temporal coherence through its unified counterfactual and bidirectional flow framework.

Abstract: Action Quality Assessment (AQA) predicts fine-grained execution scores from action videos and is widely applied in sports, rehabilitation, and skill evaluation. Long-term AQA, as in figure skating or rhythmic gymnastics, is especially challenging since it requires modeling extended temporal dynamics while remaining robust to contextual confounders. Existing approaches either depend on costly annotations or rely on unidirectional temporal modeling, making them vulnerable to spurious correlations and unstable long-term representations. To this end, we propose CaFlow, a unified framework that integrates counterfactual de-confounding with bidirectional time-conditioned flow. The Causal Counterfactual Regularization (CCR) module disentangles causal and confounding features in a self-supervised manner and enforces causal robustness through counterfactual interventions, while the BiT-Flow module models forward and backward dynamics with a cycle-consistency constraint to produce smoother and more coherent representations. Extensive experiments on multiple long-term AQA benchmarks demonstrate that CaFlow achieves state-of-the-art performance. Code is available at https://github.com/Harrison21/CaFlow

[197] Think Visually, Reason Textually: Vision-Language Synergy in ARC

Beichen Zhang, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang

Main category: cs.CV

TL;DR: Vision-Language Synergy Reasoning (VLSR) and Modality-Switch Self-Correction (MSSC) improve abstract reasoning in ARC-AGI tasks by combining visual pattern abstraction with linguistic rule formulation, achieving 4.33% improvement over text-only baselines.

Details

Motivation: Current foundation models fail at inferring structured transformation rules from minimal examples, a key human intelligence capability. ARC-AGI provides a testbed for this, but existing methods overlook visual abstraction which humans use heavily.

Method: Two synergistic strategies: 1) VLSR decomposes ARC-AGI into modality-aligned subtasks, 2) MSSC uses vision to verify text-based reasoning for error correction. Combines visual pattern abstraction with linguistic rule formulation.

Result: Achieves up to 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks.

Conclusion: Unifying visual abstraction with linguistic reasoning is crucial for achieving generalizable, human-like intelligence in future foundation models.

Abstract: Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code is released at https://github.com/InternLM/ARC-VL.

[198] Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following

Tianyi Xiong, Yi Ge, Ming Li, Zuolong Zhang, Pranav Kulkarni, Kaishen Wang, Qi He, Zeying Zhu, Chenxi Liu, Ruibo Chen, Tong Zheng, Yanshuo Chen, Xiyao Wang, Renrui Zhang, Wenhu Chen, Heng Huang

Main category: cs.CV

TL;DR: Multi-Crit is a benchmark for evaluating multimodal models’ ability to follow diverse, fine-grained evaluation criteria and produce reliable criterion-level judgments.

Details

Motivation: Large multimodal models are increasingly used as judges in evaluation systems, but their capacity to follow pluralistic criteria remains underexplored.

Method: Developed Multi-Crit benchmark through rigorous data curation with challenging response pairs and multi-criterion human annotations, introducing three novel metrics for systematic assessment.

Result: Analysis of 25 LMMs shows proprietary models struggle with pluralistic criteria adherence in open-ended evaluation, open-source models lag further behind, and critic fine-tuning fails to generalize to pluralistic judgment.

Conclusion: Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation systems.

Abstract: Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria–especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.

[199] TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

Boshen Xu, Zihan Xiao, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Qin Jin

Main category: cs.CV

TL;DR: TimeViper is a hybrid vision-language model using Mamba-Transformer backbone for long video understanding, featuring TransV module to compress vision tokens and handle 10,000+ frames.

Details

Motivation: To address challenges in long video understanding requiring efficient architecture and effective temporal context handling, while overcoming vision token redundancy in multimodal models.

Method: Hybrid Mamba-Transformer backbone combining state-space model efficiency with attention expressivity, plus TransV module for vision token transfer and compression into instruction tokens.

Result: TimeViper processes hour-long videos with 10,000+ frames and competes with state-of-the-art models across multiple benchmarks while extending frame capacity.

Conclusion: This work advances hybrid Mamba-Transformer architectures for long video understanding, providing insights into model interpretability and compression techniques.

Abstract: We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.

[200] Revolutionizing Glioma Segmentation & Grading Using 3D MRI - Guided Hybrid Deep Learning Models

Pandiyaraju V, Sreya Mynampati, Abishek Karthik, Poovarasan L, D. Saraswathi

Main category: cs.CV

TL;DR: A hybrid deep learning model combining U-Net segmentation with DenseNet-VGG classification using multihead attention achieves 98% Dice coefficient for tumor segmentation and 99% accuracy for glioma classification in MRI data.

Details

Motivation: Early and accurate diagnosis of gliomas is crucial due to their high mortality rate, requiring advanced methods for precise tumor detection and classification in medical imaging.

Method: Hybrid framework with U-Net segmentation for precise tumor demarcation in 3D MRI, combined with DenseNet-VGG classification network enhanced with multihead attention and spatial-channel attention mechanisms. Preprocessing includes normalization, resampling, and data augmentation.

Result: Achieved 98% Dice coefficient for tumor segmentation and 99% classification accuracy, outperforming traditional CNN models and attention-free methods. Enhanced interpretability through attention mechanisms focusing on clinically relevant features.

Conclusion: The proposed framework shows great potential for timely and reliable glioma diagnosis and grading, enabling better treatment planning through improved segmentation and classification performance with enhanced clinical interpretability.

Abstract: Gliomas are brain tumor types that have a high mortality rate which means early and accurate diagnosis is important for therapeutic intervention for the tumors. To address this difficulty, the proposed research will develop a hybrid deep learning model which integrates U-Net based segmentation and a hybrid DenseNet-VGG classification network with multihead attention and spatial-channel attention capabilities. The segmentation model will precisely demarcate the tumors in a 3D volume of MRI data guided by spatial and contextual information. The classification network which combines a branch of both DenseNet and VGG, will incorporate the demarcated tumor on which features with attention mechanisms would be focused on clinically relevant features. High-dimensional 3D MRI data could successfully be utilized in the model through preprocessing steps which are normalization, resampling, and data augmentation. Through a variety of measures the framework is evaluated: measures of performance in segmentation are Dice coefficient and Mean Intersection over Union (IoU) and measures of performance in classification are accuracy precision, recall, and F1-score. The hybrid framework that has been proposed has demonstrated through physical testing that it has the capability of obtaining a Dice coefficient of 98% in tumor segmentation, and 99% on classification accuracy, outperforming traditional CNN models and attention-free methods. Utilizing multi-head attention mechanisms enhances notions of priority in aspects of the tumor that are clinically significant, and enhances interpretability and accuracy. The results suggest a great potential of the framework in facilitating the timely and reliable diagnosis and grading of glioma by clinicians is promising, allowing for better planning of patient treatment.

[201] Seeing without Pixels: Perception from Camera Trajectories

Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, Tengda Han

Main category: cs.CV

TL;DR: Camera trajectories alone can reveal video content through contrastive learning with language embeddings, enabling various downstream tasks without pixel data.

Details

Motivation: To investigate whether camera movement patterns alone can reveal video content without accessing pixel information, challenging the conventional reliance on visual data.

Method: Proposed CamFormer - a contrastive learning framework that projects camera pose trajectories into joint embedding space aligned with natural language descriptions.

Result: Camera trajectories are surprisingly informative for uncovering video content, enabling cross-modal alignment, classification, and temporal analysis tasks across different pose estimation methods.

Conclusion: Camera trajectory serves as a lightweight, robust, and versatile modality for video content perception that works with both high-fidelity and RGB-only pose estimators.

Abstract: Can one perceive a video’s content without seeing its pixels, just from the camera trajectory-the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, “how you move” can indeed reveal “what you are doing” (egocentric) or “observing” (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensored and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.

[202] Canvas-to-Image: Compositional Image Generation with Multimodal Controls

Yusuf Dalva, Guocheng Gordon Qian, Maya Goldenberg, Tsai-Shien Chen, Kfir Aberman, Sergey Tulyakov, Pinar Yanardag, Kuan-Chieh Jackson Wang

Main category: cs.CV

TL;DR: Canvas-to-Image is a unified framework that converts diverse control signals (text, subject references, spatial arrangements, pose constraints, layout annotations) into a single composite canvas image, enabling diffusion models to generate high-fidelity images with integrated visual-spatial reasoning.

Details

Motivation: Modern diffusion models struggle with high-fidelity compositional and multimodal control when users specify multiple constraints simultaneously, such as text prompts, subject references, spatial arrangements, pose constraints, and layout annotations.

Method: Encode diverse control signals into a single composite canvas image and use Multi-Task Canvas Training strategy to optimize diffusion models for joint understanding and integration of heterogeneous controls within a unified learning paradigm.

Result: Significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.

Conclusion: Canvas-to-Image enables diffusion models to reason across multiple control modalities rather than relying on task-specific heuristics, and generalizes well to multi-control scenarios during inference.

Abstract: While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.

[203] ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition

Mengqi Xue, Qihan Huang, Haofei Zhang, Jingwen Hu, Jie Song, Mingli Song, Canghong Jin

Main category: cs.CV

TL;DR: ProtoPFormer addresses the “distraction” problem in transformer-based prototype networks by introducing global and local prototypes that mutually guide each other to focus on foreground objects and improve interpretability.

Details

Motivation: When applying prototype networks (ProtoPNet) to vision transformers (ViTs), learned prototypes tend to be activated by background rather than foreground due to ViT's long-range dependency modeling, severely impairing interpretability.

Method: Proposes ProtoPFormer with global and local prototypes: global prototypes provide object-level guidance to help local prototypes focus on foreground, while local prototypes are explicitly supervised to concentrate on specific visual parts.

Result: Extensive experiments show global and local prototypes mutually correct each other, jointly making decisions from whole and local perspectives, achieving superior performance and visualization over state-of-the-art prototype-based methods.

Conclusion: ProtoPFormer effectively addresses the distraction problem in transformer-based prototype networks, enabling faithful and transparent reasoning from both global and local perspectives while maintaining high performance.

Abstract: Prototypical part network (ProtoPNet) has drawn wide attention and boosted many follow-up studies due to its self-explanatory property for explainable artificial intelligence (XAI). However, when directly applying ProtoPNet on vision transformer (ViT) backbones, learned prototypes have a “distraction” problem: they have a relatively high probability of being activated by the background and pay less attention to the foreground. The powerful capability of modeling long-term dependency makes the transformer-based ProtoPNet hard to focus on prototypical parts, thus severely impairing its inherent interpretability. This paper proposes prototypical part transformer (ProtoPFormer) for appropriately and effectively applying the prototype-based method with ViTs for interpretable image recognition. The proposed method introduces global and local prototypes for capturing and highlighting the representative holistic and partial features of targets according to the architectural characteristics of ViTs. The global prototypes are adopted to provide the global view of objects to guide local prototypes to concentrate on the foreground while eliminating the influence of the background. Afterwards, local prototypes are explicitly supervised to concentrate on their respective prototypical visual parts, increasing the overall interpretability. Extensive experiments demonstrate that our proposed global and local prototypes can mutually correct each other and jointly make final decisions, which faithfully and transparently reason the decision-making processes associatively from the whole and local perspectives, respectively. Moreover, ProtoPFormer consistently achieves superior performance and visualization results over the state-of-the-art (SOTA) prototype-based baselines. Our code has been released at https://github.com/zju-vipa/ProtoPFormer.

[204] LTD: Low Temperature Distillation for Gradient Masking-free Adversarial Training

Erh-Chung Chen, Che-Rung Lee

Main category: cs.CV

TL;DR: The paper proposes Low-Temperature Distillation (LTD) to improve adversarial robustness by refining one-hot label representations, addressing data ambiguity in real-world datasets.

Details

Motivation: Traditional one-hot label representations in image classification are imprecise for real-world datasets with data ambiguity, where samples can exhibit characteristics of multiple classes, making models vulnerable to adversarial attacks.

Method: Introduces Low-Temperature Distillation (LTD) that uses a low temperature in the teacher model while keeping student model temperature fixed during training and inference, refining label representations without gradient masking issues.

Result: Achieves robust accuracy of 58.19% on CIFAR-10, 31.13% on CIFAR-100, and 42.08% on ImageNet without additional data when combined with existing frameworks.

Conclusion: LTD effectively improves model robustness against adversarial attacks by addressing fundamental issues with one-hot label representations and data ambiguity, while avoiding gradient masking problems.

Abstract: Adversarial training is a widely adopted strategy to bolster the robustness of neural network models against adversarial attacks. This paper revisits the fundamental assumptions underlying image classification and suggests that representing data as one-hot labels is a key factor that leads to vulnerabilities. However, in real-world datasets, data ambiguity often arises, with samples exhibiting characteristics of multiple classes, rendering one-hot label representations imprecise. To address this, we introduce a novel approach, Low-Temperature Distillation (LTD), designed to refine label representations. Unlike previous approaches, LTD incorporates a relatively low temperature in the teacher model, while maintaining a fixed temperature for the student model during both training and inference. This strategy not only refines assumptions about data distribution but also strengthens model robustness and avoids the gradient masking problem commonly encountered in defensive distillation. Experimental results demonstrate the efficacy of the proposed method when combined with existing frameworks, achieving robust accuracy rates of 58.19%, 31.13%, and 42.08% on the CIFAR-10, CIFAR-100, and ImageNet datasets, respectively, without the need for additional data.

[205] AMLP: Adjustable Masking Lesion Patches for Self-Supervised Medical Image Segmentation

Xiangtao Wang, Ruizhi Wang, Thomas Lukasiewicz, Zhenghua Xu

Main category: cs.CV

TL;DR: AMLP is a self-supervised medical image segmentation framework that addresses challenges in applying masked image modeling to medical images through adjustable masking strategies and specialized loss functions.

Details

Motivation: Direct application of self-supervised masked image modeling to medical image segmentation yields unsatisfactory results due to medical images' complexity, distinct contour features, and limitations of conventional fixed masking ratios that may mask important background information.

Method: Proposes AMLP framework with Masked Patch Selection to identify lesion-containing patches, Relative Reconstruction Loss for learning hard-to-reconstruct lesions, Category Consistency Loss to refine patch categorization, and Adjustable Masking Ratio that gradually increases masking during training.

Result: Extensive experiments on two medical segmentation datasets demonstrate superior performance compared to state-of-the-art self-supervised methods, proving AMLP effectively addresses masked modeling challenges for medical images.

Conclusion: AMLP successfully captures accurate lesion details crucial for segmentation tasks by adapting masked image modeling specifically for medical image characteristics through adjustable masking strategies and specialized loss functions.

Abstract: Self-supervised masked image modeling (MIM) methods have shown promising performances on analyzing natural images. However, directly applying such methods to medical image segmentation tasks still cannot achieve satisfactory results. The challenges arise from the facts that (i) medical images are inherently more complex compared to natural images, and the subjects in medical images often exhibit more distinct contour features; (ii) moreover, the conventional high and fixed masking ratio in MIM is likely to mask the background, limiting the scope of learnable information. To address these problems, we propose a new self-supervised medical image segmentation framework, called Adjustable Masking Lesion Patches (AMLP), which employs Masked Patch Selection~(MPS) strategy to identify patches with high probabilities of containing lesions to help model achieve precise lesion reconstruction. To improve the categorization of patches in MPS, we further introduce Relative Reconstruction Loss (RRL) to better learn hard-to-reconstruct lesion patches. Then, Category Consistency Loss (CCL) is proposed to refine patch categorization based on reconstruction difficulty, enhancing difference between lesions and backgrounds. Moreover, an Adjustable Masking Ratio (AMR) strategy is proposed to gradually increase the masking ratio over training to expand~~the scope of learnable mutual information. Extensive~~experiments on two medical segmentation datasets demonstrate the superior performances of the proposed AMLP w.r.t. the SOTA self-supervised methods; the results prove that AMLP effectively addresses the challenges of applying masked modeling to medical images and capturing accurate lesion details that are crucial for segmentation tasks.

[206] SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition

Wenbo Huang, Jinghui Zhang, Xuwei Qian, Zhen Wu, Meng Wang, Lei Zhang

Main category: cs.CV

TL;DR: SOAP-Net is a plug-and-play architecture for few-shot action recognition that enhances spatio-temporal relations and motion information capture using frame tuples with multiple frames, achieving state-of-the-art performance on major benchmarks.

Details

Motivation: Traditional high frame-rate video action recognition requires large datasets, but real-world scenarios often lack sufficient samples. Existing few-shot methods separate spatial and temporal features and capture motion information narrowly between adjacent frames, leading to insufficient motion representation.

Method: Proposes SOAP architecture that considers temporal connections between feature channels and spatio-temporal relations. Uses frame tuples with multiple frames to capture comprehensive motion information, combining tuples of diverse frame counts for broader perspective.

Result: SOAP-Net achieves state-of-the-art performance on SthSthV2, Kinetics, UCF101, and HMDB51 benchmarks. Extensive evaluations demonstrate competitiveness, pluggability, generalization, and robustness.

Conclusion: The SOAP architecture effectively addresses limitations in existing few-shot action recognition methods by better integrating spatio-temporal features and capturing comprehensive motion information through frame tuples, providing a versatile solution for real-world scenarios with limited data.

Abstract: High frame-rate (HFR) videos of action recognition improve fine-grained expression while reducing the spatio-temporal relation and motion information density. Thus, large amounts of video samples are continuously required for traditional data-driven training. However, samples are not always sufficient in real-world scenarios, promoting few-shot action recognition (FSAR) research. We observe that most recent FSAR works build spatio-temporal relation of video samples via temporal alignment after spatial feature extraction, cutting apart spatial and temporal features within samples. They also capture motion information via narrow perspectives between adjacent frames without considering density, leading to insufficient motion information capturing. Therefore, we propose a novel plug-and-play architecture for FSAR called Spatio-tempOral frAme tuPle enhancer (SOAP) in this paper. The model we designed with such architecture refers to SOAP-Net. Temporal connections between different feature channels and spatio-temporal relation of features are considered instead of simple feature extraction. Comprehensive motion information is also captured, using frame tuples with multiple frames containing more motion information than adjacent frames. Combining frame tuples of diverse frame counts further provides a broader perspective. SOAP-Net achieves new state-of-the-art performance across well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51. Extensive empirical evaluations underscore the competitiveness, pluggability, generalization, and robustness of SOAP. The code is released at https://github.com/wenbohuang1002/SOAP.

[207] Restoration-Oriented Video Frame Interpolation with Region-Distinguishable Priors from SAM

Yan Han, Xiaogang Xu, Yingqi Lin, Jiafei Wu, Zhe Liu, Ming-Hsuan Yang

Main category: cs.CV

TL;DR: The paper proposes using open-world segmentation models (SAM2) to create Region-Distinguishable Priors (RDPs) for Video Frame Interpolation, improving motion estimation accuracy by distinguishing regions before interpolation.

Details

Motivation: Existing VFI methods struggle with motion estimation accuracy due to ambiguity in identifying corresponding areas between frames. Enhancing accuracy by distinguishing regions before motion estimation is crucial.

Method: Uses SAM2 segmentation to create RDPs as spatial-varying Gaussian mixtures, integrated via Hierarchical Region-aware Feature Fusion Module (HRFFM) with RDP-guided Feature Normalization in a residual learning manner.

Result: Extensive experiments show HRFFM consistently enhances VFI performance across various scenes by improving feature representations for matched regions.

Conclusion: The proposed RDPs and HRFFM effectively improve motion estimation in VFI by distinguishing regions, leading to better intermediate frame synthesis.

Abstract: In existing restoration-oriented Video Frame Interpolation (VFI) approaches, the motion estimation between neighboring frames plays a crucial role. However, the estimation accuracy in existing methods remains a challenge, primarily due to the inherent ambiguity in identifying corresponding areas in adjacent frames for interpolation. Therefore, enhancing accuracy by distinguishing different regions before motion estimation is of utmost importance. In this paper, we introduce a novel solution involving the utilization of open-world segmentation models, e.g., SAM2 (Segment Anything Model2) for frames, to derive Region-Distinguishable Priors (RDPs) in different frames. These RDPs are represented as spatial-varying Gaussian mixtures, distinguishing an arbitrary number of areas with a unified modality. RDPs can be integrated into existing motion-based VFI methods to enhance features for motion estimation, facilitated by our designed play-and-plug Hierarchical Region-aware Feature Fusion Module (HRFFM). HRFFM incorporates RDP into various hierarchical stages of VFI’s encoder, using RDP-guided Feature Normalization (RDPFN) in a residual learning manner. With HRFFM and RDP, the features within VFI’s encoder exhibit similar representations for matched regions in neighboring frames, thus improving the synthesis of intermediate frames. Extensive experiments demonstrate that HRFFM consistently enhances VFI performance across various scenes.

[208] A Gray-box Attack against Latent Diffusion Model-based Image Editing by Posterior Collapse

Zhongliang Guo, Chun Tong Lei, Lei Fang, Shuai Zhao, Yifei Qian, Jingyu Lin, Zeyu Wang, Cunjian Chen, Ognjen Arandjelović, Chun Pong Lau

Main category: cs.CV

TL;DR: PCA is a novel framework that protects images from unauthorized manipulation in LDMs by exploiting VAE posterior collapse phenomena, requiring only the VAE encoder (4% of LDM) and achieving prompt-invariant protection without model-specific knowledge.

Details

Motivation: To address concerns about data misappropriation and IP infringement in LDMs, while overcoming limitations of existing adversarial attacks that rely heavily on model-specific knowledge and have high computational costs.

Method: Identifies diffusion collapse and concentration collapse phenomena during VAE inference, designs unified loss function to achieve both collapse types through parameter adjustment, operates on VAE encoder before text conditioning for prompt-invariant protection.

Result: PCA outperforms existing techniques in protection effectiveness, computational efficiency (runtime and VRAM), and generalization across VAE-based LDM variants, while requiring only 4% of LDM parameters.

Conclusion: PCA provides an efficient and effective solution for image protection in LDMs with minimal model knowledge requirements and strong transferability across different architectures.

Abstract: Recent advancements in Latent Diffusion Models (LDMs) have revolutionized image synthesis and manipulation, raising significant concerns about data misappropriation and intellectual property infringement. While adversarial attacks have been extensively explored as a protective measure against such misuse of generative AI, current approaches are severely limited by their heavy reliance on model-specific knowledge and substantial computational costs. Drawing inspiration from the posterior collapse phenomenon observed in VAE training, we propose the Posterior Collapse Attack (PCA), a novel framework for protecting images from unauthorized manipulation. Through comprehensive theoretical analysis and empirical validation, we identify two distinct collapse phenomena during VAE inference: diffusion collapse and concentration collapse. Based on this discovery, we design a unified loss function that can flexibly achieve both types of collapse through parameter adjustment, each corresponding to different protection objectives in preventing image manipulation. Our method significantly reduces dependence on model-specific knowledge by requiring access to only the VAE encoder, which constitutes less than 4% of LDM parameters. Notably, PCA achieves prompt-invariant protection by operating on the VAE encoder before text conditioning occurs, eliminating the need for empty prompt optimization required by existing methods. This minimal requirement enables PCA to maintain adequate transferability across various VAE-based LDM architectures while effectively preventing unauthorized image editing. Extensive experiments show PCA outperforms existing techniques in protection effectiveness, computational efficiency (runtime and VRAM), and generalization across VAE-based LDM variants. Our code is available at https://github.com/ZhongliangGuo/PosteriorCollapseAttack.

[209] A Simple Framework Towards Vision-based Traffic Signal Control with Microscopic Simulation

Pan He, Quanyi Li, Xiaoyong Yuan, Bolei Zhou

Main category: cs.CV

TL;DR: Vision-based traffic signal control using computer vision for end-to-end learning, with a new simulation framework TrafficDojo integrating SUMO and MetaDrive for comprehensive evaluation.

Details

Motivation: Traditional traffic signal control relies on heuristics and predefined features, while vision-based methods offer less dependency on these and enable end-to-end learning for better traffic flow optimization.

Method: Developed TrafficDojo framework integrating SUMO’s microscopic traffic flow with MetaDrive 3D driving simulator, establishing baseline algorithms including traditional and Reinforcement Learning approaches.

Result: Created a versatile traffic environment for comprehensive evaluation of traffic signal controllers across diverse conditions and scenarios.

Conclusion: This work enables vision-based traffic signal control development and opens new research opportunities in the field.

Abstract: Traffic signal control (TSC) is crucial for reducing traffic congestion leading to smoother traffic flow, reduced idle time, and mitigated CO2 emissions. In this paper, we explore the computer vision approach for TSC that modulates on-road traffic flows through visual observation. Unlike traditional feature-based approaches, vision-based methods depend much less on heuristics and predefined features, bringing promising potentials for end-to-end learning and optimization of traffic signals. Thus, we introduce a simple traffic simulation framework called TrafficDojo towards vision-based TSC and its benchmark by integrating the microscopic traffic flow provided in SUMO into the 3D driving simulator MetaDrive. This proposed framework offers a versatile traffic environment for in-depth analysis and comprehensive evaluation of traffic signal controllers across diverse traffic conditions and scenarios. We establish and compare baseline algorithms including both traditional and Reinforcement Learning (RL) approaches. This work sheds light on the design and development of vision-based TSC approaches and opens up new research opportunities

[210] Activator: GLU Activation Function as the Core Component of a Vision Transformer

Abdullah Nazhat Abdullah, Tarkan Aydin

Main category: cs.CV

TL;DR: This paper proposes replacing the MLP and attention mechanisms in transformers with a gated linear unit (GLU) activation structure to reduce computational costs while maintaining competitive performance.

Details

Motivation: Transformer architectures, while successful in NLP and CV tasks, suffer from high computational costs due to the scaled dot product attention mechanism with softmax activation, requiring large compute capabilities for training and inference.

Method: Substitute the traditional MLP and attention mechanism in transformer architecture with an architecture based on incorporating a gated linear unit (GLU) activation function structure.

Result: Experimental assessments show the proposed modification offers competitive performance compared to baseline architectures while achieving targeted reductions in computational complexity.

Conclusion: GLU-based MLPs provide a more efficient but capable alternative to traditional MLP and attention mechanisms as core components in transformer architecture design.

Abstract: The transformer architecture has driven many successes in a variety of tasks within the field of deep learning, in particular the recent advances in natural language processing (NLP) culminating with large language models (LLM). Adding to that success, transformer architecture has found widespread interest from computer vision (CV) researchers and practitioners, allowing for many advancements in vision-related tasks and opening the door for multitask and multi-modal deep learning architectures that share the same principle of operation. One drawback to these architectures is their reliance on the scaled dot product attention mechanism with the softmax activation function, which is computationally expensive and requires large compute capabilities for both training and inference. This paper investigates substituting the MLP and attention mechanism usually adopted for transformer architecture with an architecture based on incorporating a gated linear unit (GLU) activation function structure with the aim of reducing the computational cost. The equalized experimental assessments conducted in this work show that the proposed modification with the targeted reductions in computational complexity offers competitive performance compared to the selected baseline architectures. The results are significantly in support of the aims of this work, in which the focus was to extensively utilize GLU-based MLPs, establishing a more efficient but capable alternative to the traditional MLP and the attention mechanism as the core component in the design of transformer architectures.

[211] Interactive Occlusion Boundary Estimation through Exploitation of Synthetic Data

Lintao Xu, Chaohui Wang

Main category: cs.CV

TL;DR: MS³PE is a multi-scribble-guided deep learning framework for interactive occlusion boundary estimation, featuring intuitive multi-scribble interactions and a 3-encoding-path network with multi-scale strip convolutions. The paper also introduces synthetic data generation via Mesh2OB and two benchmarks: OB-FUTURE (synthetic) and OB-LIGM (real-world).

Details

Motivation: To address the challenge of occlusion boundary estimation in 2D images for scene understanding, and to overcome the scarcity of well-annotated real-world data through synthetic data generation.

Method: Proposed MS³PE framework with multi-scribble interaction mechanism and 3-encoding-path network enhanced with multi-scale strip convolutions. Developed Mesh2OB tool for automated ground-truth OB generation from 3D scenes with self-occlusions handling.

Result: MS³PE surpasses adapted baselines from seven state-of-the-art interactive segmentation methods. Created OB-FUTURE synthetic benchmark and OB-LIGM real-world benchmark with 120 high-resolution annotated images.

Conclusion: The work presents the first systematic study of Interactive Occlusion Boundary Estimation, demonstrating strong performance and providing valuable resources (synthetic data generation tool and benchmarks) to advance OB research.

Abstract: Occlusion boundaries (OBs) geometrically localize occlusion events in 2D images and provide critical cues for scene understanding. In this paper, we present the first systematic study of Interactive Occlusion Boundary Estimation (IOBE), introducing MS\textsuperscript{3}PE, a novel multi-scribble-guided deep-learning framework that advances IOBE through two key innovations: (1) an intuitive multi-scribble interaction mechanism, and (2) a 3-encoding-path network enhanced with multi-scale strip convolutions. Our MS\textsuperscript{3}PE surpasses adapted baselines from seven state-of-the-art interactive segmentation methods, and demonstrates strong potential for OB benchmark construction through our real-user experiment. Besides, to address the scarcity of well-annotated real-world data, we propose using synthetic data for training IOBE models, and developed Mesh2OB, the first automated tool for generating precise ground-truth OBs from 3D scenes with self-occlusions explicitly handled, enabling creation of the OB-FUTURE synthetic benchmark that facilitates generalizable training without domain adaptation. Finally, we introduce OB-LIGM, a high-quality real-world benchmark comprising 120 meticulously annotated high-resolution images advancing evaluation standards in OB research. Source code and resources are available at https://github.com/xul-ops/IOBE.

[212] Open Vocabulary Monocular 3D Object Detection

Jin Yao, Hao Gu, Xuweiyi Chen, Jiayun Wang, Zezhou Cheng

Main category: cs.CV

TL;DR: Open-vocabulary monocular 3D detection framework that detects objects of any category from single RGB images using pretrained 2D/3D foundation models and a novel evaluation metric.

Details

Motivation: Existing 3D detectors require expensive sensors like LiDAR or multi-view setups, and are limited to closed vocabularies with restricted categories, limiting their real-world applicability.

Method: Integrates pretrained 2D and 3D vision foundation models to reduce dependence on 3D supervision, and designs a novel evaluation metric to handle missing labels and semantic ambiguities in datasets.

Result: Achieves state-of-the-art results in both zero-shot 3D detection of novel categories and in-domain detection on seen classes.

Conclusion: Provides a strong baseline for open-vocabulary 3D detection and establishes a reliable evaluation protocol for future research in this area.

Abstract: We propose and study open-vocabulary monocular 3D detection, a novel task that aims to detect objects of any categores in metric 3D space from a single RGB image. Existing 3D object detectors either rely on costly sensors such as LiDAR or multi-view setups, or remain confined to closed vocabularies settings with limited categories, restricting their applicability. We identify two key challenges in this new setting. First, the scarcity of 3D bounding box annotations limits the ability to train generalizable models. To reduce dependence on 3D supervision, we propose a framework that effectively integrates pretrained 2D and 3D vision foundation models. Second, missing labels and semantic ambiguities (\eg, table vs. desk) in existing datasets hinder reliable evaluation. To address this, we design a novel metric that captures model performance while mitigating annotation issues. Our approach achieves state-of-the-art results in zero-shot 3D detection of novel categories as well as in-domain detection on seen classes. We hope our method provides a strong baseline and our evaluation protocol establishes a reliable benchmark for future research.

Abhinav Pratap, Sushant Kumar, Suchinton Chakravarty

Main category: cs.CV

TL;DR: Evaluation of four real-time object detection algorithms (YOLO, SSD, Faster R-CNN, Mask R-CNN) for indoor navigation assistance for visually impaired individuals, analyzing accuracy-speed trade-offs.

Details

Motivation: Need for accurate and efficient object detection in assistive technologies for visually impaired individuals to enhance indoor navigation solutions and promote accessibility.

Method: Evaluated four real-time object detection algorithms using the Indoor Objects Detection dataset, analyzing detection accuracy, processing speed, and adaptability to indoor environments.

Result: Findings highlight trade-offs between precision and efficiency, providing insights for selecting optimal algorithms for real-time assistive navigation.

Conclusion: Research advances adaptive machine learning applications for enhancing indoor navigation solutions for the visually impaired.

Abstract: This study addresses the need for accurate and efficient object detection in assistive technologies for visually impaired individuals. We evaluate four real-time object detection algorithms YOLO, SSD, Faster R-CNN, and Mask R-CNN within the context of indoor navigation assistance. Using the Indoor Objects Detection dataset, we analyze detection accuracy, processing speed, and adaptability to indoor environments. Our findings highlight the trade-offs between precision and efficiency, offering insights into selecting optimal algorithms for realtime assistive navigation. This research advances adaptive machine learning applications, enhancing indoor navigation solutions for the visually impaired and promoting accessibility.

[214] Active Negative Loss: A Robust Framework for Learning with Noisy Labels

Xichen Ye, Yifan Wu, Yiqi Wang, Xiaoqiang Li, Weizhong Zhang, Yifan Chen

Main category: cs.CV

TL;DR: The paper introduces Normalized Negative Loss Functions (NNLFs) to replace MAE in the Active Passive Loss framework, creating Active Negative Loss (ANL) for better handling of noisy labels in deep learning.

Details

Motivation: Existing noise-robust loss functions like APL with MAE pay equal attention to clean and noisy samples, slowing convergence and making training difficult in large-scale datasets with noisy labels.

Method: Proposed NNLFs as passive loss functions in APL framework, creating ANL. Also introduced entropy-based regularization for non-symmetric noise scenarios to address label imbalance vulnerability.

Result: Extensive experiments show ANL with NNLFs achieves better or comparable performance to state-of-the-art methods across various label noise types and in image segmentation tasks.

Conclusion: The proposed ANL framework with NNLFs effectively addresses MAE’s limitations by focusing more on memorized clean samples, providing improved robustness to label noise in deep learning.

Abstract: Deep supervised learning has achieved remarkable success across a wide range of tasks, yet it remains susceptible to overfitting when confronted with noisy labels. To address this issue, noise-robust loss functions offer an effective solution for enhancing learning in the presence of label noise. In this work, we systematically investigate the limitation of the recently proposed Active Passive Loss (APL), which employs Mean Absolute Error (MAE) as its passive loss function. Despite the robustness brought by MAE, one of its key drawbacks is that it pays equal attention to clean and noisy samples; this feature slows down convergence and potentially makes training difficult, particularly in large-scale datasets. To overcome these challenges, we introduce a novel loss function class, termed Normalized Negative Loss Functions (NNLFs), which serve as passive loss functions within the APL framework. NNLFs effectively address the limitations of MAE by concentrating more on memorized clean samples. By replacing MAE in APL with our proposed NNLFs, we enhance APL and present a new framework called Active Negative Loss (ANL). Moreover, in non-symmetric noise scenarios, we propose an entropy-based regularization technique to mitigate the vulnerability to the label imbalance. Extensive experiments demonstrate that the new loss functions adopted by our ANL framework can achieve better or comparable performance to state-of-the-art methods across various label noise types and in image segmentation tasks. The source code is available at: https://github.com/Virusdoll/Active-Negative-Loss.

[215] Without Paired Labeled Data: End-to-End Self-Supervised Learning for Drone-view Geo-Localization

Zhongwei Chen, Zhao-Xu Yang, Hai-Jun Rong, Guoqi Li

Main category: cs.CV

TL;DR: DMNIL is a self-supervised method for drone-view geo-localization that uses dynamic memory and neighborhood learning to eliminate the need for paired drone-satellite images, achieving state-of-the-art performance without supervision.

Details

Motivation: Existing drone-view geo-localization methods require expensive paired drone-satellite images and lack transferability to new regions, limiting practical deployment in open-world scenarios.

Method: Uses clustering for pseudo-labels and dual-path contrastive learning. Includes DHML module for intra-view feature consistency and ICEL module for cross-view semantic correlation, plus pseudo-label enhancement for training stability.

Result: Outperforms existing self-supervised methods and surpasses several state-of-the-art supervised methods on three public benchmark datasets.

Conclusion: DMNIL provides an effective self-supervised solution for drone geo-localization that eliminates dependency on paired data and achieves competitive performance with supervised methods.

Abstract: Drone-view Geo-Localization (DVGL) aims to achieve accurate localization of drones by retrieving the most relevant GPS-tagged satellite images. However, most existing methods heavily rely on strictly pre-paired drone-satellite images for supervised learning. When the target region shifts, new paired samples are typically required to adapt to the distribution changes. The high cost of annotation and the limited transferability of these methods significantly hinder the practical deployment of DVGL in open-world scenarios. To address these limitations, we propose a novel end-to-end self-supervised learning method with a shallow backbone network, called the dynamic memory-driven and neighborhood information learning (DMNIL) method. It employs a clustering algorithm to generate pseudo-labels and adopts a dual-path contrastive learning framework to learn discriminative intra-view representations. Furthermore, DMNIL incorporates two core modules, including the dynamic hierarchical memory learning (DHML) module and the information consistency evolution learning (ICEL) module. The DHML module combines short-term and long-term memory to enhance intra-view feature consistency and discriminability. Meanwhile, the ICEL module utilizes a neighborhood-driven dynamic constraint mechanism to systematically capture implicit cross-view semantic correlations, consequently improving cross-view feature alignment. To further stabilize and strengthen the self-supervised training process, a pseudo-label enhancement strategy is introduced to enhance the quality of pseudo supervision. Extensive experiments on three public benchmark datasets demonstrate that the proposed method consistently outperforms existing self-supervised methods and even surpasses several state-of-the-art supervised methods. Our code is available at https://github.com/ISChenawei/DMNIL.

[216] Unsupervised Segmentation by Diffusing, Walking and Cutting

Daniela Ivanova, Marco Aversa, Paul Henderson, John Williamson

Main category: cs.CV

TL;DR: Unsupervised image segmentation using pre-trained diffusion model features via spectral clustering on self-attention layers, achieving SOTA without training.

Details

Motivation: Leverage semantic relations captured in pre-trained text-to-image diffusion models' self-attention layers for zero-shot unsupervised segmentation.

Method: Construct adjacency matrices from self-attention layers between patches and recursively partition using Normalised Cuts, interpreting attention as transition matrix for random walks.

Result: Surpasses all existing methods for zero-shot unsupervised segmentation, achieving state-of-the-art results on COCO-Stuff-27 and Cityscapes.

Conclusion: Pre-trained diffusion model attention layers provide rich semantic features for effective unsupervised segmentation through spectral clustering approaches.

Abstract: We propose an unsupervised image segmentation method using features from pre-trained text-to-image diffusion models. Inspired by classic spectral clustering approaches, we construct adjacency matrices from self-attention layers between image patches and recursively partition using Normalised Cuts. A key insight is that self-attention probability distributions, which capture semantic relations between patches, can be interpreted as a transition matrix for random walks across the image. We leverage this by first using Random Walk Normalized Cuts directly on these self-attention activations to partition the image, minimizing transition probabilities between clusters while maximizing coherence within clusters. Applied recursively, this yields a hierarchical segmentation that reflects the rich semantics in the pre-trained attention layers, without any additional training. Next, we explore other ways to build the NCuts adjacency matrix from features, and how we can use the random walk interpretation of self-attention to capture long-range relationships. Finally, we propose an approach to automatically determine the NCut cost criterion, avoiding the need to tune this manually. We quantitatively analyse the effect incorporating different features, a constant versus dynamic NCut threshold, and incorporating multi-node paths when constructing the NCuts adjacency matrix. We show that our approach surpasses all existing methods for zero-shot unsupervised segmentation, achieving state-of-the-art results on COCO-Stuff-27 and Cityscapes.

[217] Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy

Yuxuan Xue, Xianghui Xie, Riccardo Marin, Gerard Pons-Moll

Main category: cs.CV

TL;DR: Gen-3Diffusion is a method that synergizes 2D and 3D diffusion models to generate realistic 3D objects and avatars from single RGB images while ensuring 3D consistency.

Details

Motivation: Single image 3D generation is challenging due to ill-posed nature and lack of 3D consistency in multi-view images generated by 2D diffusion models alone.

Method: Synchronizes pre-trained 2D and 3D diffusion models during both training and sampling, leveraging 2D for generalization and 3D for multi-view consistency.

Result: Generates realistic 3D objects and avatars with high-fidelity geometry and texture, demonstrating strong generalization to diverse clothing and compositional shapes.

Conclusion: The synergy between 2D and 3D diffusion models effectively addresses both generalization and 3D consistency challenges in image-to-3D generation.

Abstract: Creating realistic 3D objects and clothed avatars from a single RGB image is an attractive yet challenging problem. Due to its ill-posed nature, recent works leverage powerful prior from 2D diffusion models pretrained on large datasets. Although 2D diffusion models demonstrate strong generalization capability, they cannot guarantee the generated multi-view images are 3D consistent. In this paper, we propose Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy. We leverage a pre-trained 2D diffusion model and a 3D diffusion model via our elegantly designed process that synchronizes two diffusion models at both training and sampling time. The synergy between the 2D and 3D diffusion models brings two major advantages: 1) 2D helps 3D in generalization: the pretrained 2D model has strong generalization ability to unseen images, providing strong shape priors for the 3D diffusion model; 2) 3D helps 2D in multi-view consistency: the 3D diffusion model enhances the 3D consistency of 2D multi-view sampling process, resulting in more accurate multi-view generation. We validate our idea through extensive experiments in image-based objects and clothed avatar generation tasks. Results show that our method generates realistic 3D objects and avatars with high-fidelity geometry and texture. Extensive ablations also validate our design choices and demonstrate the strong generalization ability to diverse clothing and compositional shapes. Our code and pretrained models will be publicly released on https://yuxuan-xue.com/gen-3diffusion.

[218] Systematic Evaluation and Guidelines for Segment Anything Model in Surgical Video Analysis

Cheng Yuan, Jian Jiang, Kunyi Yang, Lv Wu, Rui Wang, Zi Meng, Haonan Ping, Ziyu Xu, Yifan Zhou, Wanli Song, Hesheng Wang, Yueming Jin, Qi Dou, Yutong Ban

Main category: cs.CV

TL;DR: First comprehensive evaluation of SAM2’s zero-shot surgical video segmentation across 9 datasets (17 surgery types), showing notable adaptability in structured scenarios but performance gaps in dynamic conditions.

Details

Motivation: Surgical video segmentation is critical for AI but limited by annotated data. SAM2 offers zero-shot potential but its applicability in complex surgical environments with tissue deformation and instrument variability remains unexplored.

Method: Comprehensive evaluation of SAM2’s zero-shot capability across 9 surgical datasets covering laparoscopic, endoscopic, and robotic procedures, analyzing various prompting strategies (points, boxes, masks) and finetuning approaches.

Result: SAM2 demonstrates notable zero-shot adaptability in structured scenarios (instrument segmentation, multi-organ segmentation, scene segmentation) but performance varies under dynamic surgical conditions, showing gaps in temporal coherence and domain-specific artifacts.

Conclusion: Results highlight future pathways to adaptive data-efficient solutions for surgical data science, addressing limitations in handling surgical dynamics and domain-specific challenges.

Abstract: Surgical video segmentation is critical for AI to interpret spatial-temporal dynamics in surgery, yet model performance is constrained by limited annotated data. The SAM2 model, pretrained on natural videos, offers potential for zero-shot surgical segmentation, but its applicability in complex surgical environments, with challenges like tissue deformation and instrument variability, remains unexplored. We present the first comprehensive evaluation of the zero-shot capability of SAM2 in 9 surgical datasets (17 surgery types), covering laparoscopic, endoscopic, and robotic procedures. We analyze various prompting (points, boxes, mask) and {finetuning (dense, sparse) strategies}, robustness to surgical challenges, and generalization across procedures and anatomies. Key findings reveal that while SAM2 demonstrates notable zero-shot adaptability in structured scenarios (e.g., instrument segmentation, {multi-organ segmentation}, and scene segmentation), its performance varies under dynamic surgical conditions, highlighting gaps in handling temporal coherence and domain-specific artifacts. These results highlight future pathways to adaptive data-efficient solutions for the surgical data science field.

[219] LASER: Lip Landmark Assisted Speaker Detection for Robustness

Le Thien Phuc Nguyen, Zhuoran Yu, Yong Jae Lee

Main category: cs.CV

TL;DR: LASER improves Active Speaker Detection by incorporating lip landmarks during training to enhance attention to speech-relevant regions, achieving state-of-the-art performance and strong robustness to background noise.

Details

Motivation: Existing ASD models often misclassify non-speaking instances when lip movements and audio are unsynchronized, failing to leverage the natural human reliance on lip-audio synchronization.

Method: Extracts visual features and encodes 2D lip landmarks into dense maps, with an auxiliary consistency loss that aligns lip-aware and face-only predictions to handle landmark detector failures.

Result: Outperforms state-of-the-art models across in-domain and out-of-domain benchmarks, with 3.3-4.3 point mAP improvements over LoCoNet and TalkNet on high-noise subsets.

Conclusion: LASER demonstrates strong resilience to real-world acoustic challenges through explicit lip landmark incorporation and consistency-based training, without requiring landmark detectors at test time.

Abstract: Active Speaker Detection (ASD) aims to identify who is speaking in complex visual scenes. While humans naturally rely on lip-audio synchronization, existing ASD models often misclassify non-speaking instances when lip movements and audio are unsynchronized. To address this, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER), which explicitly incorporates lip landmarks during training to guide the model’s attention to speech-relevant regions. Given a face track, LASER extracts visual features and encodes 2D lip landmarks into dense maps. To handle failure cases such as low resolution or occlusion, we introduce an auxiliary consistency loss that aligns lip-aware and face-only predictions, removing the need for landmark detectors at test time. LASER outperforms state-of-the-art models across both in-domain and out-of-domain benchmarks. To further evaluate robustness in realistic conditions, we introduce LASER-bench, a curated dataset of modern video clips with varying levels of background noise. On the high-noise subset, LASER improves mAP by 3.3 and 4.3 points over LoCoNet and TalkNet, respectively, demonstrating strong resilience to real-world acoustic challenges.

[220] OuroMamba: A Data-Free Quantization Framework for Vision Mamba

Akshat Ramachandran, Mingyu Lee, Huan Xu, Souvik Kundu, Tushar Krishna

Main category: cs.CV

TL;DR: OuroMamba is the first data-free post-training quantization method for vision Mamba-based models, addressing challenges in synthetic data generation and dynamic outlier variations through a two-stage framework with mixed-precision quantization.

Details

Motivation: Vision Mamba-based models face two key challenges for data-free quantization: (1) recurrent state transitions limit long-range interaction capture, leading to weak synthetic data, and (2) dynamic outlier variations across time-steps make static PTQ techniques ineffective.

Method: Two-stage framework: (1) OuroMamba-Gen generates semantically rich synthetic data using contrastive learning on patch-level features from neighborhood interactions in latent state space, (2) OuroMamba-Quant employs mixed-precision quantization with lightweight dynamic outlier detection using threshold-based outlier channel selection updated every time-step.

Result: OuroMamba surpasses existing data-driven PTQ techniques across vision and generative tasks, achieving state-of-the-art performance in diverse quantization settings with practical latency speedup of up to 2.36x through efficient GPU kernels.

Conclusion: OuroMamba successfully enables effective data-free quantization for vision Mamba-based models by addressing their unique challenges through innovative synthetic data generation and dynamic quantization techniques, outperforming existing methods while maintaining efficiency.

Abstract: We present OuroMamba, the first data-free post-training quantization (DFQ) method for vision Mamba-based models (VMMs). We identify two key challenges in enabling DFQ for VMMs, (1) VMM’s recurrent state transitions restricts capturing of long-range interactions and leads to semantically weak synthetic data, (2) VMM activations exhibit dynamic outlier variations across time-steps, rendering existing static PTQ techniques ineffective. To address these challenges, OuroMamba presents a two-stage framework: (1) OuroMamba-Gen to generate semantically rich and meaningful synthetic data. It applies contrastive learning on patch level VMM features generated through neighborhood interactions in the latent state space, (2) OuroMamba-Quant to employ mixed-precision quantization with lightweight dynamic outlier detection during inference. In specific, we present a thresholding based outlier channel selection strategy for activations that gets updated every time-step. Extensive experiments across vision and generative tasks show that our data-free OuroMamba surpasses existing data-driven PTQ techniques, achieving state-of-the-art performance across diverse quantization settings. Additionally, we implement efficient GPU kernels to achieve practical latency speedup of up to 2.36x. Code and synthetic dataset are available here: https://github.com/georgia-tech-synergy-lab/ICCV-OuroMamba

[221] Bayesian Neural Networks for One-to-Many Mapping in Image Enhancement

Guoxi Huang, Qirui Yang, Ruirui Lin, Zipeng Qi, David Bull, Nantheera Anantrasirichai

Main category: cs.CV

TL;DR: Proposes Bayesian Enhancement Model (BEM) using Bayesian Neural Networks to handle one-to-many mapping in image enhancement tasks, with a BNN-DNN framework for efficient inference.

Details

Motivation: Degraded images can correspond to multiple plausible target images due to dynamic photography conditions, creating a one-to-many mapping problem in enhancement tasks.

Method: Bayesian Enhancement Model (BEM) with BNN-DNN framework: BNN models one-to-many mapping in low-dimensional space, then DNN refines fine-grained image details for fast inference.

Result: Extensive experiments on multiple low-light and underwater image enhancement benchmarks demonstrate effectiveness.

Conclusion: The proposed BEM successfully addresses one-to-many mapping in image enhancement by capturing data uncertainty and producing diverse outputs.

Abstract: In image enhancement tasks, such as low-light and underwater image enhancement, a degraded image can correspond to multiple plausible target images due to dynamic photography conditions. This naturally results in a one-to-many mapping problem. To address this, we propose a Bayesian Enhancement Model (BEM) that incorporates Bayesian Neural Networks (BNNs) to capture data uncertainty and produce diverse outputs. To enable fast inference, we introduce a BNN-DNN framework: a BNN is first employed to model the one-to-many mapping in a low-dimensional space, followed by a Deterministic Neural Network (DNN) that refines fine-grained image details. Extensive experiments on multiple low-light and underwater image enhancement benchmarks demonstrate the effectiveness of our method.

[222] Towards Consistent and Controllable Image Synthesis for Face Editing

Mengting Wei, Tuomas Varanka, Yante Li, Xingxun Jiang, Huai-Qian Khor, Guoying Zhao

Main category: cs.CV

TL;DR: RigFace is a novel face editing method that uses Stable Diffusion and 3D face models to control lighting, facial expression, and head pose while preserving identity characteristics.

Details

Motivation: Traditional GAN-based face editing methods are being replaced by diffusion models, but diffusion models struggle with controlling specific attributes and preserving identity consistency during editing.

Method: RigFace uses: 1) Spatial Attribute Encoder for decoupled background, pose, expression and lighting conditions; 2) FaceFusion method for identity feature transfer to SD denoising UNet; 3) Attribute Rigger to inject conditions into the UNet.

Result: The model achieves comparable or superior performance in both identity preservation and photorealism compared to existing face editing models.

Conclusion: RigFace successfully addresses the challenges of attribute control and identity preservation in diffusion-based face editing through effective disentanglement of control factors.

Abstract: Face editing methods, essential for tasks like virtual avatars, digital human synthesis and identity preservation, have traditionally been built upon GAN-based techniques, while recent focus has shifted to diffusion-based models due to their success in image reconstruction. However, diffusion models still face challenges in controlling specific attributes and preserving the consistency of other unchanged attributes especially the identity characteristics. To address these issues and facilitate more convenient editing of face images, we propose a novel approach that leverages the power of Stable-Diffusion (SD) models and crude 3D face models to control the lighting, facial expression and head pose of a portrait photo. We observe that this task essentially involves the combinations of target background, identity and face attributes aimed to edit. We strive to sufficiently disentangle the control of these factors to enable consistency of face editing. Specifically, our method, coined as RigFace, contains: 1) A Spatial Attribute Encoder that provides presise and decoupled conditions of background, pose, expression and lighting; 2) A high-consistency FaceFusion method that transfers identity features from the Identity Encoder to the denoising UNet of a pre-trained SD model; 3) An Attribute Rigger that injects those conditions into the denoising UNet. Our model achieves comparable or even superior performance in both identity preservation and photorealism compared to existing face editing models.

[223] RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction Robustness

Fanhu Zeng, Haiyang Guo, Fei Zhu, Li Shen, Hao Tang

Main category: cs.CV

TL;DR: RobustMerge is a training-free parameter-efficient merging method that maintains direction robustness by pruning parameters and scaling coefficients, and performs cross-task normalization for better generalization.

Details

Motivation: Parameter-efficient tuning creates many expert models, but existing merging methods designed for full fine-tuning fail under efficient tuning scenarios. There's a need for efficient merging methods that work with parameter-efficient modules.

Method: Analyze low-rank decomposition to reveal direction robustness importance, then propose RobustMerge with: (1) parameter pruning and coefficient scaling from inter-parameter relations to maintain direction stability, (2) cross-task normalization for unseen task generalization.

Result: Established benchmark with diverse multimodal tasks showing outstanding performance and generalizability. Additional studies confirm effectiveness.

Conclusion: RobustMerge successfully addresses efficient merging challenges for parameter-efficient tuned models, maintaining direction robustness and enhancing generalization across tasks.

Abstract: Fine-tuning pre-trained models with custom data leads to numerous expert models on specific tasks. Merging models into one universal model to empower multi-task ability refraining from data leakage has gained popularity. With the expansion in data and model size, parameter-efficient tuning becomes the common practice for obtaining task-specific models efficiently. However, few methods are dedicated to efficient merging, and existing methods designed for full fine-tuning merging fail under efficient merging. To address the issue, we analyze from low-rank decomposition and reveal that direction robustness during merging is crucial for merging efficient modules. We furthermore uncover that compensating for the gap between stark singular values contributes to direction robustness. Therefore, we propose RobustMerge, a training-free parameter-efficient merging method with complementary parameter adaptation to maintain direction robustness. Specifically, we (1) prune parameters and scale coefficients from inter-parameter relation for singular values to maintain direction stability away from task interference, and (2) perform cross-task normalization to enhance unseen task generalization. We establish a benchmark consisting of diverse multimodal tasks, on which we conduct experiments to certify the outstanding performance and generalizability of our method. Additional studies and extensive analyses further showcase the effectiveness. Code is available at https://github.com/AuroraZengfh/RobustMerge.

[224] Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals

Nate Gillman, Charles Herrmann, Michael Freeman, Daksh Aggarwal, Evan Luo, Deqing Sun, Chen Sun

Main category: cs.CV

TL;DR: The paper proposes force prompts as a control signal for video generation, enabling realistic physical interactions like poking objects or wind effects without 3D assets or physics simulators at inference.

Details

Motivation: Physically meaningful interactions that mimic real-world forces remain understudied in video generation, while navigation has been well-explored. The authors aim to enable realistic physical force interactions in generated videos.

Method: Leverage visual and motion priors from pretrained video generation models, adapting them to follow physical force conditioning from Blender-synthesized videos. Use force prompts for localized point forces and global wind force fields, trained on limited demonstrations with visual diversity and specific text keywords.

Result: The method generates videos that simulate forces across diverse geometries, settings, and materials. It outperforms existing methods on force adherence and physics realism, trained on only ~15k examples for one day on four A100 GPUs.

Conclusion: Video generation models can generalize remarkably well to physical force conditioning from synthetic data, bringing world models closer to real-world physics interactions. Key factors for success are visual diversity and specific text keywords during training.

Abstract: Recent advances in video generation models have sparked interest in world models capable of simulating realistic environments. While navigation has been well-explored, physically meaningful interactions that mimic real-world forces remain largely understudied. In this work, we investigate using physical forces as a control signal for video generation and propose force prompts which enable users to interact with images through both localized point forces, such as poking a plant, and global wind force fields, such as wind blowing on fabric. We demonstrate that these force prompts can enable videos to respond realistically to physical control signals by leveraging the visual and motion prior in the original pretrained model, without using any 3D asset or physics simulator at inference. The primary challenge of force prompting is the difficulty in obtaining high quality paired force-video training data, both in the real world due to the difficulty of obtaining force signals, and in synthetic data due to limitations in the visual quality and domain diversity of physics simulators. Our key finding is that video generation models can generalize remarkably well when adapted to follow physical force conditioning from videos synthesized by Blender, even with limited demonstrations of few objects. Our method can generate videos which simulate forces across diverse geometries, settings, and materials. We also try to understand the source of this generalization and perform ablations that reveal two key elements: visual diversity and the use of specific text keywords during training. Our approach is trained on only around 15k training examples for a single day on four A100 GPUs, and outperforms existing methods on force adherence and physics realism, bringing world models closer to real-world physics interactions. We release all datasets, code, weights, and interactive video demos at our project page.

[225] Class-Independent Increment: An Efficient Approach for Multi-label Class-Incremental Learning

Chenhao Ding, Songlin Dong, Zhengdong Zhou, Jizhou Han, Qiang Wang, Yuhang He, Yihong Gong

Main category: cs.CV

TL;DR: Proposes CLIN approach for multi-label class-incremental learning using class-specific tokens and novel loss functions to address feature confusion and catastrophic forgetting.

Details

Motivation: Real-world applications often involve multi-label scenarios, but current class-incremental learning research mainly focuses on single-label classification. MLCIL faces additional challenges of feature confusion beyond catastrophic forgetting.

Method: Class-independent incremental network (CINet) extracts multiple class-level embeddings using class-specific tokens, with two novel loss functions to optimize token learning and distinguish between new/old classes.

Result: Extensive experiments on MS-COCO and PASCAL VOC datasets show improved recognition performance and reduced forgetting across various MLCIL tasks.

Conclusion: The proposed CLIN method effectively addresses multi-label class-incremental learning challenges by mitigating feature confusion through class-specific embeddings and specialized loss functions.

Abstract: Current research on class-incremental learning primarily focuses on single-label classification tasks. However, real-world applications often involve multi-label scenarios, such as image retrieval and medical imaging. Therefore, this paper focuses on the challenging yet practical multi-label class-incremental learning (MLCIL) problem. In addition to the challenge of catastrophic forgetting, MLCIL encounters issues related to feature confusion, encompassing inter-session and intra-feature confusion. To address these problems, we propose a novel MLCIL approach called class-independent increment (CLIN). Specifically, in contrast to existing methods that extract image-level features, we propose a class-independent incremental network (CINet) to extract multiple class-level embeddings for multi-label samples. It learns and preserves the knowledge of different classes by constructing class-specific tokens. On this basis, we develop two novel loss functions, optimizing the learning of class-specific tokens and class-level embeddings, respectively. These losses aim to distinguish between new and old classes, further alleviating the problem of feature confusion. Extensive experiments on MS-COCO and PASCAL VOC datasets demonstrate the effectiveness of our method for improving recognition performance and mitigating forgetting on various MLCIL tasks.

[226] From Limited Labels to Open Domains:An Efficient Learning Method for Drone-view Geo-Localization

Zhongwei Chen, Zhao-Xu Yang, Hai-Jun Rong, Jiawei Lang, Guoqi Li

Main category: cs.CV

TL;DR: CDIKTNet is a novel cross-domain invariant knowledge transfer network for drone-view geo-localization that uses limited supervision to overcome feature confusion issues in both supervised and unsupervised methods.

Details

Motivation: Traditional supervised methods require paired training data and struggle with cross-view correlations from unpaired data, while unsupervised methods suffer from feature confusion due to geographical similarity and spatial continuity, leading to unreliable pseudo-labels.

Method: Proposes CDIKTNet with two sub-networks: CDIS learns cross-view structural and spatial invariance from small paired data as prior knowledge, and CDTS uses dual-path contrastive learning to optimize subspaces while maintaining shared feature space consistency.

Result: Extensive experiments show CDIKTNet achieves state-of-the-art performance under full supervision compared to supervised methods, and surpasses existing unsupervised methods in both few-shot and cross-domain initialization scenarios.

Conclusion: CDIKTNet effectively addresses feature confusion in drone-view geo-localization through a closed-loop framework of invariance feature learning and knowledge transfer, requiring only limited supervision while outperforming both supervised and unsupervised approaches.

Abstract: Traditional supervised drone-view geo-localization (DVGL) methods heavily depend on paired training data and encounter difficulties in learning cross-view correlations from unpaired data. Moreover, when deployed in a new domain, these methods require obtaining the new paired data and subsequent retraining for model adaptation, which significantly increases computational overhead. Existing unsupervised methods have enabled to generate pseudo-labels based on cross-view similarity to infer the pairing relationships. However, geographical similarity and spatial continuity often cause visually analogous features at different geographical locations. The feature confusion compromises the reliability of pseudo-label generation, where incorrect pseudo-labels drive negative optimization. Given these challenges inherent in both supervised and unsupervised DVGL methods, we propose a novel cross-domain invariant knowledge transfer network (CDIKTNet) with limited supervision, whose architecture consists of a cross-domain invariance sub-network (CDIS) and a cross-domain transfer sub-network (CDTS). This architecture facilitates a closed-loop framework for invariance feature learning and knowledge transfer. The CDIS is designed to learn cross-view structural and spatial invariance from a small amount of paired data that serves as prior knowledge. It endows the shared feature space of unpaired data with similar implicit cross-view correlations at initialization, which alleviates feature confusion. Based on this, the CDTS employs dual-path contrastive learning to further optimize each subspace while preserving consistency in a shared feature space. Extensive experiments demonstrate that CDIKTNet achieves state-of-the-art performance under full supervision compared with those supervised methods, and further surpasses existing unsupervised methods in both few-shot and cross-domain initialization.

[227] IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection

Changjiang Jiang, Wenhui Dong, Zhonghao Zhang, Chenyang Si, Fengchang Yu, Wei Peng, Xinbin Yuan, Yifei Bi, Ming Zhao, Zian Zhou, Caifeng Shan

Main category: cs.CV

TL;DR: Ivy-Fake is a large-scale multimodal benchmark for explainable AIGC detection, addressing limitations in current datasets and methods through rich annotations and a reinforcement learning-based detector that achieves state-of-the-art performance.

Details

Motivation: Address two major limitations in AIGC detection: (1) lack of multidimensional explainable datasets with only binary annotations, and (2) insufficient fine-grained interpretability in existing MLLM-based detectors that hinders reliable localization and explanation.

Method: Introduce Ivy-Fake benchmark with 106K+ annotated training samples and 5K verified evaluation examples from diverse sources. Propose Ivy-xDetector using reinforcement learning with Group Relative Policy Optimization (GRPO) to generate explainable reasoning chains.

Result: Extensive experiments show superiority of the dataset and effectiveness of the approach. Method improves performance on GenImage from 86.88% to 96.32%, surpassing prior state-of-the-art methods by a clear margin.

Conclusion: The proposed Ivy-Fake benchmark and Ivy-xDetector successfully address the explainability gap in AIGC detection, providing both comprehensive datasets and effective detection methods with enhanced interpretability and performance.

Abstract: The rapid development of Artificial Intelligence Generated Content (AIGC) techniques has enabled the creation of high-quality synthetic content, but it also raises significant security concerns. Current detection methods face two major limitations: (1) the lack of multidimensional explainable datasets for generated images and videos. Existing open-source datasets (e.g., WildFake, GenVideo) rely on oversimplified binary annotations, which restrict the explainability and trustworthiness of trained detectors. (2) Prior MLLM-based forgery detectors (e.g., FakeVLM) exhibit insufficiently fine-grained interpretability in their step-by-step reasoning, which hinders reliable localization and explanation. To address these challenges, we introduce Ivy-Fake, the first large-scale multimodal benchmark for explainable AIGC detection. It consists of over 106K richly annotated training samples (images and videos) and 5,000 manually verified evaluation examples, sourced from multiple generative models and real world datasets through a carefully designed pipeline to ensure both diversity and quality. Furthermore, we propose Ivy-xDetector, a reinforcement learning model based on Group Relative Policy Optimization (GRPO), capable of producing explainable reasoning chains and achieving robust performance across multiple synthetic content detection benchmarks. Extensive experiments demonstrate the superiority of our dataset and confirm the effectiveness of our approach. Notably, our method improves performance on GenImage from 86.88% to 96.32%, surpassing prior state-of-the-art methods by a clear margin.

[228] PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction

Ziqiao Meng, Qichao Wang, Zhiyang Dou, Zixing Song, Zhipeng Zhou, Irwin King, Peilin Zhao

Main category: cs.CV

TL;DR: PointNSP is a coarse-to-fine autoregressive point cloud generation framework that overcomes limitations of traditional autoregressive models by using multi-scale factorization and next-scale prediction, achieving state-of-the-art quality while being more efficient than diffusion-based approaches.

Details

Motivation: Autoregressive point cloud generation has lagged behind diffusion-based methods due to artificial ordering constraints that undermine global structural properties like symmetry and long-range dependencies.

Method: Proposes PointNSP, a coarse-to-fine framework using level-of-detail principles with next-scale prediction, enabling multi-scale factorization that preserves global shape structure at low resolutions and refines geometry progressively.

Result: Establishes SOTA generation quality on ShapeNet, surpassing diffusion-based baselines in parameter, training, and inference efficiency, with advantages becoming more pronounced at higher point densities (8,192 points).

Conclusion: PointNSP successfully bridges the performance gap between autoregressive and diffusion-based point cloud generation while offering superior efficiency and scalability.

Abstract: Autoregressive point cloud generation has long lagged behind diffusion-based approaches in quality. The performance gap stems from the fact that autoregressive models impose an artificial ordering on inherently unordered point sets, forcing shape generation to proceed as a sequence of local predictions. This sequential bias emphasizes short-range continuity but undermines the model’s capacity to capture long-range dependencies, hindering its ability to enforce global structural properties such as symmetry, consistent topology, and large-scale geometric regularities. Inspired by the level-of-detail (LOD) principle in shape modeling, we propose PointNSP, a coarse-to-fine generative framework that preserves global shape structure at low resolutions and progressively refines fine-grained geometry at higher scales through a next-scale prediction paradigm. This multi-scale factorization aligns the autoregressive objective with the permutation-invariant nature of point sets, enabling rich intra-scale interactions while avoiding brittle fixed orderings. Experiments on ShapeNet show that PointNSP establishes state-of-the-art (SOTA) generation quality for the first time within the autoregressive paradigm. In addition, it surpasses strong diffusion-based baselines in parameter, training, and inference efficiency. Finally, in dense generation with 8,192 points, PointNSP’s advantages become even more pronounced, underscoring its scalability potential.

[229] Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining

Mikey Shechter, Yair Carmon

Main category: cs.CV

TL;DR: FLYT algorithm curates vision-language datasets by learning data point usefulness through gradient signals from downstream tasks, achieving state-of-the-art results on DataComp benchmarks.

Details

Motivation: To improve vision-language pretraining by developing better data curation methods that can identify the most useful training examples for downstream tasks.

Method: FLYT trains a scoring model using gradient signals from downstream tasks, M-FLYT combines multiple scoring methods, and Soft Cap Sampling prevents over-representation through repetition penalty.

Result: Achieved 40.1% ImageNet zero-shot accuracy on DataComp medium scale (2% absolute improvement) and 37.7% average across 38 tasks, outperforming all previous public-resource approaches.

Conclusion: FLYT provides an effective framework for data curation that significantly improves vision-language model performance through learned example scoring and balanced sampling strategies.

Abstract: We introduce Filter Like You Test (FLYT), an algorithm for curating large-scale vision-language datasets that learns the usefulness of each data point as a pretraining example. FLYT trains a scoring model that learns to weigh each example’s features using gradient signals from downstream tasks training sets. Based on FLYT, we implement Mixing-FLYT (M-FLYT), which takes the per-example scores generated by different scoring methods as features, and learns to unify them into a single score. FLYT naturally produces a distribution over the training examples, which we leverage through Soft Cap Sampling (SCS), a strategy for obtaining a filtered pretraining dataset from per-example probabilities that samples examples while preventing over-representation through a repetition penalty. Using these methods, we achieve 40.1% ImageNet zero-shot accuracy on the DataComp medium scale filtering benchmark, a 2% absolute accuracy increase over all previous results and a 5.5% increase over results that - like us - use only public resources. Our approach also yields 37.7% on the average of 38 DataComp evaluation tasks, outperforming previous public-resource approaches by 0.4%.

[230] FlowTok: Flowing Seamlessly Across Text and Image Tokens

Ju He, Qihang Yu, Qihao Liu, Liang-Chieh Chen

Main category: cs.CV

TL;DR: FlowTok introduces a simple flow matching framework that directly evolves between text and image modalities using compact 1D token representations, eliminating complex conditioning mechanisms while maintaining competitive performance.

Details

Motivation: To simplify cross-modality generation by avoiding conventional approaches that treat text as conditioning signals guiding denoising processes, and instead explore direct evolution between modalities through flow matching.

Method: Projects both text and images into a shared latent space by encoding images into compact 1D token representations, using flow matching to directly evolve between modalities without complex conditioning or noise scheduling.

Result: Reduces latent space size by 3.3x at 256px resolution, achieves comparable performance to state-of-the-art models with higher memory efficiency, fewer training resources, and faster sampling speeds.

Conclusion: FlowTok demonstrates that streamlined cross-modality generation is possible through compact 1D token representations and flow matching, offering an efficient alternative to conventional approaches.

Abstract: Bridging different modalities lies at the heart of cross-modality generation. While conventional approaches treat the text modality as a conditioning signal that gradually guides the denoising process from Gaussian noise to the target image modality, we explore a much simpler paradigm-directly evolving between text and image modalities through flow matching. This requires projecting both modalities into a shared latent space, which poses a significant challenge due to their inherently different representations: text is highly semantic and encoded as 1D tokens, whereas images are spatially redundant and represented as 2D latent embeddings. To address this, we introduce FlowTok, a minimal framework that seamlessly flows across text and images by encoding images into a compact 1D token representation. Compared to prior methods, this design reduces the latent space size by 3.3x at an image resolution of 256, eliminating the need for complex conditioning mechanisms or noise scheduling. Moreover, FlowTok naturally extends to image-to-text generation under the same formulation. With its streamlined architecture centered around compact 1D tokens, FlowTok is highly memory-efficient, requires significantly fewer training resources, and achieves much faster sampling speeds-all while delivering performance comparable to state-of-the-art models. Code is available at https://github.com/TACJu/FlowTok.

[231] Image as an IMU: Estimating Camera Motion from a Single Motion-Blurred Image

Jerred Chen, Ronald Clark

Main category: cs.CV

TL;DR: A novel framework that uses motion blur as a cue for camera motion estimation, predicting motion flow and depth from single blurred images to recover instantaneous camera velocity.

Details

Motivation: Fast camera motions in robotics and VR/AR cause motion blur that makes existing pose estimation methods fail, so the authors propose using blur as motion information rather than treating it as noise.

Method: Predict dense motion flow field and monocular depth map from single motion-blurred image, then solve linear least squares problem to recover instantaneous camera velocity under small motion assumption.

Result: Achieves state-of-the-art angular and translational velocity estimates on real-world benchmarks, outperforming methods like MASt3R and COLMAP.

Conclusion: Motion blur can be effectively leveraged as a rich cue for robust camera motion estimation, producing IMU-like measurements that capture fast camera movements.

Abstract: In many robotics and VR/AR applications, fast camera motions lead to a high level of motion blur, causing existing camera pose estimation methods to fail. In this work, we propose a novel framework that leverages motion blur as a rich cue for motion estimation rather than treating it as an unwanted artifact. Our approach works by predicting a dense motion flow field and a monocular depth map directly from a single motion-blurred image. We then recover the instantaneous camera velocity by solving a linear least squares problem under the small motion assumption. In essence, our method produces an IMU-like measurement that robustly captures fast and aggressive camera movements. To train our model, we construct a large-scale dataset with realistic synthetic motion blur derived from ScanNet++v2 and further refine our model by training end-to-end on real data using our fully differentiable pipeline. Extensive evaluations on real-world benchmarks demonstrate that our method achieves state-of-the-art angular and translational velocity estimates, outperforming current methods like MASt3R and COLMAP.

[232] Stream and Query-guided Feature Aggregation for Efficient and Effective 3D Occupancy Prediction

Seokha Moon, Janghyun Baek, Giseop Kim, Jinkyu Kim, Sunwook Choi

Main category: cs.CV

TL;DR: DuOcc introduces a dual aggregation strategy for 3D occupancy prediction that maintains dense voxel accuracy while achieving high efficiency through stream-based voxel aggregation and query-guided aggregation.

Details

Motivation: Existing methods face a trade-off between dense voxel representations (accurate but computationally expensive) and sparse representations (efficient but lose spatial detail). DuOcc aims to overcome this trade-off.

Method: Uses two components: (1) Stream-based Voxel Aggregation that recurrently accumulates voxel features over time while suppressing warping distortions, and (2) Query-guided Aggregation that injects instance-level query features into dynamic object regions.

Result: Achieves state-of-the-art performance in real-time settings on Occ3D-nuScenes and SurroundOcc datasets, while reducing memory usage by over 40% compared to prior methods.

Conclusion: DuOcc successfully balances accuracy and efficiency in 3D occupancy prediction through its dual aggregation approach, preserving spatial fidelity while maintaining computational efficiency.

Abstract: 3D occupancy prediction has become a key perception task in autonomous driving, as it enables comprehensive scene understanding. Recent methods enhance this understanding by incorporating spatiotemporal information through multi-frame fusion, but they suffer from a trade-off: dense voxel-based representations provide high accuracy at significant computational cost, whereas sparse representations improve efficiency but lose spatial detail. To mitigate this trade-off, we introduce DuOcc, which employs a dual aggregation strategy that retains dense voxel representations to preserve spatial fidelity while maintaining high efficiency. DuOcc consists of two key components: (i) Stream-based Voxel Aggregation, which recurrently accumulates voxel features over time and refines them to suppress warping-induced distortions, preserving a clear separation between occupied and free space. (ii) Query-guided Aggregation, which complements the limitations of voxel accumulation by selectively injecting instance-level query features into the voxel regions occupied by dynamic objects. Experiments on the widely used Occ3D-nuScenes and SurroundOcc datasets demonstrate that DuOcc achieves state-of-the-art performance in real-time settings, while reducing memory usage by over 40% compared to prior methods.

[233] Contrast-Prior Enhanced Duality for Mask-Free Shadow Removal

Jiyu Wu, Yifan Liu, Jiancheng Huang, Mingfu Yan, Shifeng Chen

Main category: cs.CV

TL;DR: Proposes a mask-free shadow removal method using Adaptive Gated Dual-Branch Attention to filter contrast cues and a diffusion-based Frequency-Contrast Fusion Network for detail restoration.

Details

Motivation: Existing shadow removal methods rely on shadow masks that are hard to acquire in real scenarios. Local contrast cues offer an alternative but suffer from ambiguity in complex scenes where they can't distinguish shadows from low-reflectance objects and complex textures.

Method: Uses Adaptive Gated Dual-Branch Attention (AGBA) to dynamically filter and re-weight contrast prior, and a diffusion-based Frequency-Contrast Fusion Network (FCFN) that leverages high-frequency and contrast cues for generative shadow removal.

Result: Achieves state-of-the-art results among mask-free approaches and maintains competitive performance relative to mask-based methods.

Conclusion: The proposed method effectively addresses shadow removal without requiring shadow masks by intelligently leveraging contrast and frequency cues, demonstrating strong performance in complex scenarios.

Abstract: Existing shadow removal methods often rely on shadow masks, which are challenging to acquire in real-world scenarios. Exploring intrinsic image cues, such as local contrast information, presents a potential alternative for guiding shadow removal in the absence of explicit masks. However, the cue’s inherent ambiguity becomes a critical limitation in complex scenes, where it can fail to distinguish true shadows from low-reflectance objects and intricate background textures. To address this motivation, we propose the Adaptive Gated Dual-Branch Attention (AGBA) mechanism. AGBA dynamically filters and re-weighs the contrast prior to effectively disentangle shadow features from confounding visual elements. Furthermore, to tackle the persistent challenge of restoring soft shadow boundaries and fine-grained details, we introduce a diffusion-based Frequency-Contrast Fusion Network (FCFN) that leverages high-frequency and contrast cues to guide the generative process. Extensive experiments demonstrate that our method achieves state-of-the-art results among mask-free approaches while maintaining competitive performance relative to mask-based methods.

[234] Leveraging Contrast Information for Efficient Document Shadow Removal

Yifan Liu, Jiancheng Huang, Na Liu, Mingfu Yan, Yi Huang, Shifeng Chen

Main category: cs.CV

TL;DR: Proposes an end-to-end document shadow removal method using contrast representation guidance in a coarse-to-fine approach, achieving state-of-the-art performance without needing shadow masks.

Details

Motivation: Existing document shadow removal methods rely on additional information like shadow masks or lack generalization, resulting in incomplete removal or content loss. Document images inherently contain rich information that can be better utilized.

Method: Uses contrast representation to locate shadow shapes and positions without masks, following a coarse-to-fine refinement approach. Integrates contrast information into the refined removal process for better network guidance and feature fusion.

Result: Extensive experiments show the method achieves state-of-the-art performance in both qualitative and quantitative evaluations.

Conclusion: The proposed contrast-guided approach effectively removes document shadows without requiring additional mask information, demonstrating superior performance compared to existing methods.

Abstract: Document shadows are a major obstacle in the digitization process. Due to the dense information in text and patterns covered by shadows, document shadow removal requires specialized methods. Existing document shadow removal methods, although showing some progress, still rely on additional information such as shadow masks or lack generalization and effectiveness across different shadow scenarios. This often results in incomplete shadow removal or loss of original document content and tones. Moreover, these methods tend to underutilize the information present in the original shadowed document image. In this paper, we refocus our approach on the document images themselves, which inherently contain rich information.We propose an end-to-end document shadow removal method guided by contrast representation, following a coarse-to-fine refinement approach. By extracting document contrast information, we can effectively and quickly locate shadow shapes and positions without the need for additional masks. This information is then integrated into the refined shadow removal process, providing better guidance for network-based removal and feature fusion. Extensive qualitative and quantitative experiments show that our method achieves state-of-the-art performance.

[235] Earth-Adapter: Bridge the Geospatial Domain Gaps with Mixture of Frequency Adaptation

Xiaoxing Hu, Ziyang Gong, Yupei Wang, Yuru Jia, Fei Lin, Dexiang Gao, Ke An, Jianhong Han, Zhuoran Sun, Gen Luo, Gen Luo, Xue Yang

Main category: cs.CV

TL;DR: Earth-Adapter is a novel Parameter-Efficient Fine-Tuning method specifically designed for Remote Sensing that uses Mixture of Frequency Adaptation to overcome artifacts by decomposing features into frequency components and dynamically weighting adapter experts.

Details

Motivation: Existing PEFT methods designed for natural imagery struggle with Remote Sensing scenarios due to their inability to handle artifact influences, which is particularly severe in RS image features.

Method: Introduces Mixture of Frequency Adaptation combining Mixture of Adapter (MoA) with Discrete Fourier Transformation (DFT) to decompose features into frequency components, separate artifacts from original features, and dynamically assign weights to adapter experts across frequency domains.

Result: Significantly outperforms baseline Rein with 9.0% mIoU improvement in Domain Adaptation and 3.1% mIoU improvement in Domain Generalization semantic segmentation benchmarks.

Conclusion: Earth-Adapter effectively overcomes artifact disturbances in Remote Sensing scenarios through frequency-based feature decomposition and dynamic adapter weighting, significantly enhancing Foundation Models’ performance on RS tasks.

Abstract: Parameter-Efficient Fine-Tuning (PEFT) is a technique that allows us to adapt powerful Foundation Models (FMs) to diverse downstream tasks while preserving and unleashing their inherent capabilities. However, we have observed that existing PEFT methods, which are often designed with natural imagery in mind, struggle when applied to Remote Sensing (RS) scenarios. This is primarily due to their inability to handle artifact influences, a problem particularly severe in RS image features. To tackle this challenge, we introduce Earth-Adapter, the first PEFT method specifically designed for RS artifacts conquering. Earth-Adapter introduces a novel Mixture of Frequency Adaptation process that combines a Mixture of Adapter (MoA) with Discrete Fourier Transformation (DFT). By utilizing DFT, Earth-Adapter can decompose features into different frequency components, precisely separating artifacts from original features. The MoA then dynamically assigns weights to each adapter expert, allowing for the combination of features across various frequency domains. These simple-yet-effective approaches enable Earth-Adapter to more efficiently overcome the disturbances caused by artifacts than previous PEFT methods, significantly enhancing the FMs’ performance on RS scenarios. Experiments on Domain Adaptation (DA), and Domain Generalization (DG) semantic segmentation benchmarks showcase the Earth-Adapter’s effectiveness. Compared with baseline Rein, Earth-Adapter significantly improves 9.0% mIoU in DA and 3.1% mIoU in DG benchmarks. Our code will be released at https://github.com/VisionXLab/Earth-Adapter.

[236] WeatherDiffusion: Controllable Weather Editing in Intrinsic Space

Yixin Zhu, Zuoliang Zhu, Jian Yang, Miloš Hašan, Jin Xie, Beibei Wang

Main category: cs.CV

TL;DR: WeatherDiffusion is a diffusion-based framework for controllable weather editing using intrinsic maps (material, geometry, lighting) estimated from images, enabling fine-grained weather control through text prompts.

Details

Motivation: Traditional pixel-space weather editing approaches lack controllability and spatial correspondence in large outdoor scenes, limiting their effectiveness for applications like autonomous driving that require robust performance in various weather conditions.

Method: Uses diffusion priors with two components: inverse renderer to estimate intrinsic maps from input images, and forward renderer that combines these maps with weather text prompts. Introduces intrinsic map-aware attention mechanism and CLIP-space interpolation for fine-grained weather control.

Result: Outperforms state-of-the-art pixel-space editing, weather restoration, and rendering-based methods. Demonstrates improved spatial correspondence and decomposition quality in large outdoor scenes.

Conclusion: WeatherDiffusion shows promise for downstream tasks like autonomous driving by enhancing robustness of detection and segmentation in challenging weather scenarios through controllable weather editing in intrinsic space.

Abstract: We present WeatherDiffusion, a diffusion-based framework for controllable weather editing in intrinsic space. Our framework includes two components based on diffusion priors: an inverse renderer that estimates material properties, scene geometry, and lighting as intrinsic maps from an input image, and a forward renderer that utilizes these geometry and material maps along with a text prompt that describes specific weather conditions to generate a final image. The intrinsic maps enhance controllability compared to traditional pixel-space editing approaches.We propose an intrinsic map-aware attention mechanism that improves spatial correspondence and decomposition quality in large outdoor scenes. For forward rendering, we leverage CLIP-space interpolation of weather prompts to achieve fine-grained weather control. We also introduce a synthetic and a real-world dataset, containing 38k and 18k images under various weather conditions, each with intrinsic map annotations. WeatherDiffusion outperforms state-of-the-art pixel-space editing approaches, weather restoration methods, and rendering-based methods, showing promise for downstream tasks such as autonomous driving, enhancing the robustness of detection and segmentation in challenging weather scenarios.

[237] Geometrically Regularized Transfer Learning with On-Manifold and Off-Manifold Perturbation

Hana Satou, Alan Mitkiy, Emma Collins, Finn Kingston

Main category: cs.CV

TL;DR: MAADA is a manifold-aware adversarial data augmentation framework that decomposes perturbations into on-manifold and off-manifold components to improve transfer learning under domain shift.

Details

Motivation: Address the fundamental challenge of domain shift in transfer learning by leveraging manifold geometry to better align source and target data distributions.

Method: Decomposes adversarial perturbations into on-manifold (semantic variation) and off-manifold (model brittleness) components, with theoretical guarantees on hypothesis complexity reduction and decision boundary smoothing. Includes geometry-aware alignment loss to minimize geodesic discrepancy between manifolds.

Result: Outperforms existing adversarial and adaptation methods on DomainNet, VisDA, and Office-Home datasets in both unsupervised and few-shot settings, demonstrating superior structural robustness and cross-domain generalization.

Conclusion: MAADA provides an effective framework for domain adaptation by explicitly modeling manifold geometry, achieving state-of-the-art performance through principled decomposition of adversarial perturbations and geometric alignment.

Abstract: Transfer learning under domain shift remains a fundamental challenge due to the divergence between source and target data manifolds. In this paper, we propose MAADA (Manifold-Aware Adversarial Data Augmentation), a novel framework that decomposes adversarial perturbations into on-manifold and off-manifold components to simultaneously capture semantic variation and model brittleness. We theoretically demonstrate that enforcing on-manifold consistency reduces hypothesis complexity and improves generalization, while off-manifold regularization smooths decision boundaries in low-density regions. Moreover, we introduce a geometry-aware alignment loss that minimizes geodesic discrepancy between source and target manifolds. Experiments on DomainNet, VisDA, and Office-Home show that MAADA consistently outperforms existing adversarial and adaptation methods in both unsupervised and few-shot settings, demonstrating superior structural robustness and cross-domain generalization.

[238] Disentangled Geometric Alignment with Adaptive Contrastive Perturbation for Reliable Domain Transfer

Emma Collins, Myungseo wong, Kim Yun, Finn Kingston, Hana Satou

Main category: cs.CV

TL;DR: GAMA++ improves geometry-aware domain adaptation by introducing latent space disentanglement and adaptive contrastive perturbation to address insufficient disentanglement and rigid perturbation schemes in existing methods.

Details

Motivation: Current geometry-aware domain adaptation methods like GAMA suffer from insufficient disentanglement of task-relevant/irrelevant manifold dimensions and rigid perturbation schemes that ignore per-class alignment asymmetries.

Method: Introduces latent space disentanglement to isolate label-consistent manifold directions, adaptive contrastive perturbation strategy tailored to class-specific manifold curvature and alignment discrepancy, and cross-domain contrastive consistency loss for semantic cluster alignment.

Result: Achieves state-of-the-art results on DomainNet, Office-Home, and VisDA benchmarks under standard and few-shot settings, with improvements in class-level alignment fidelity and boundary robustness.

Conclusion: GAMA++ sets a new standard for semantic geometry alignment in transfer learning by effectively addressing disentanglement and perturbation limitations of previous methods.

Abstract: Despite progress in geometry-aware domain adaptation, current methods such as GAMA still suffer from two unresolved issues: (1) insufficient disentanglement of task-relevant and task-irrelevant manifold dimensions, and (2) rigid perturbation schemes that ignore per-class alignment asymmetries. To address this, we propose GAMA++, a novel framework that introduces (i) latent space disentanglement to isolate label-consistent manifold directions from nuisance factors, and (ii) an adaptive contrastive perturbation strategy that tailors both on- and off-manifold exploration to class-specific manifold curvature and alignment discrepancy. We further propose a cross-domain contrastive consistency loss that encourages local semantic clusters to align while preserving intra-domain diversity. Our method achieves state-of-the-art results on DomainNet, Office-Home, and VisDA benchmarks under both standard and few-shot settings, with notable improvements in class-level alignment fidelity and boundary robustness. GAMA++ sets a new standard for semantic geometry alignment in transfer learning.

[239] ISAC: Training-Free Instance-to-Semantic Attention Control for Improving Multi-Instance Generation

Sanghyun Jo, Wooyeol Lee, Ziseok Lee, Kyungsu Kim

Main category: cs.CV

TL;DR: ISAC is a training-free method that improves multi-object generation in diffusion models by using self-attention to establish instance layouts and then binding semantics to these instances, addressing issues like incorrect instance counts and semantic leakage.

Details

Motivation: Text-to-image diffusion models struggle with multi-object scenes, producing incorrect instance counts and semantic leakage across objects due to vague instance boundaries.

Method: ISAC performs hierarchical attention control in two phases: Phase 1 clusters self-attention to establish instance layouts and repel overlaps; Phase 2 injects instance cues into cross-attention to create instance-aware semantic masks and decompose mixing semantics.

Result: ISAC achieves consistent gains on multiple benchmarks, with at least 50% improvement in multi-class accuracy and 7% in multi-instance accuracy on IntraCompBench, without fine-tuning or external models.

Conclusion: Hierarchical decoupling of instance formation and semantic assignment is key for robust multi-object generation, and ISAC also improves layout-to-image controllers by refining coarse box layouts into dense instance masks.

Abstract: Text-to-image diffusion models have recently become highly capable, yet their behavior in multi-object scenes remains unreliable: models often produce an incorrect number of instances and exhibit semantics leaking across objects. We trace these failures to vague instance boundaries; self-attention already reveals instance layouts early in the denoising process, but existing approaches act only on semantic signals. We introduce $\textbf{ISAC}$ ($\textbf{I}$nstance-to-$\textbf{S}$emantic $\textbf{A}$ttention $\textbf{C}$ontrol), a training-free, model-agnostic objective that performs hierarchical attention control by first carving out instance layouts from self-attention and then binding semantics to these instances. In Phase 1, ISAC clusters self-attention into the number of instances and repels overlaps, establishing an instance-level structural hierarchy; in Phase 2, it injects these instance cues into cross-attention to obtain instance-aware semantic masks and decomposes mixing semantics by tying attributes within each instance. ISAC yields consistent gains on T2I-CompBench, HRS-Bench, and IntraCompBench, our new benchmark for intra-class compositions where failures are most frequent, with improvements of at least 50% in multi-class accuracy and 7% in multi-instance accuracy on IntraCompBench, without any fine-tuning or external models. Beyond text-to-image setups, ISAC also strengthens layout-to-image controllers under overlapping boxes by refining coarse box layouts into dense instance masks, indicating that hierarchical decoupling of instance formation and semantic assignment is a key principle for robust, controllable multi-object generation. Code will be released upon publication.

[240] Not All Splits Are Equal: Rethinking Attribute Generalization Across Unrelated Categories

Liviu Nicolae Fircă, Antonio Bărbălau, Dan Oneata, Elena Burceanu

Main category: cs.CV

TL;DR: This paper evaluates whether models can generalize attribute knowledge across semantically dissimilar categories, showing performance drops as train-test correlation decreases.

Details

Motivation: To test if current models can abstract and apply attributes to conceptually distant categories, beyond narrow taxonomic or visually similar domains.

Method: Introduces train-test split strategies that progressively reduce correlation: LLM-driven semantic grouping, embedding similarity thresholding, embedding-based clustering, and supercategory-based partitioning.

Result: Results show sharp performance drops as training-test correlation decreases, with clustering providing the most effective trade-off between reducing correlations and preserving learnability.

Conclusion: Current models show strong sensitivity to split design, revealing limitations in attribute generalization across dissimilar categories, which informs future benchmark construction.

Abstract: Can models generalize attribute knowledge across semantically and perceptually dissimilar categories? While prior work has addressed attribute prediction within narrow taxonomic or visually similar domains, it remains unclear whether current models can abstract attributes and apply them to conceptually distant categories. This work presents the first explicit evaluation for the robustness of the attribute prediction task under such conditions, testing whether models can correctly infer shared attributes between unrelated object types: e.g., identifying that the attribute “has four legs” is common to both “dogs” and “chairs”. To enable this evaluation, we introduce train-test split strategies that progressively reduce correlation between training and test sets, based on: LLM-driven semantic grouping, embedding similarity thresholding, embedding-based clustering, and supercategory-based partitioning using ground-truth labels. Results show a sharp drop in performance as the correlation between training and test categories decreases, indicating strong sensitivity to split design. Among the evaluated methods, clustering yields the most effective trade-off, reducing hidden correlations while preserving learnability. These findings offer new insights into the limitations of current representations and inform future benchmark construction for attribute reasoning.

[241] Diffusion-Denoised Hyperspectral Gaussian Splatting

Sunil Kumar Narayanan, Lingjun Zhao, Lu Gan, Yongsheng Chen

Main category: cs.CV

TL;DR: DD-HGS enhances 3D Gaussian Splatting with wavelength-aware spherical harmonics, spectral loss, and diffusion denoising for fast, high-quality hyperspectral scene reconstruction.

Details

Motivation: Current NeRF-based hyperspectral imaging methods have slow training and rendering speeds, limiting practical agricultural applications for nutrient composition analysis.

Method: Proposes Diffusion-Denoised Hyperspectral Gaussian Splatting (DD-HGS) with wavelength-aware spherical harmonics, KL divergence spectral loss, and diffusion-based denoiser for explicit 3D reconstruction.

Result: Achieves state-of-the-art performance on Hyper-NeRF dataset with improved training time and rendering speed compared to previous methods.

Conclusion: DD-HGS enables efficient 3D hyperspectral reconstruction for precise spatial-spectral nutrient composition localization in agricultural applications.

Abstract: Hyperspectral imaging (HSI) has been widely used in agricultural applications for non-destructive estimation of plant nutrient composition and precise quantification of sample nutritional elements. Recently, 3D reconstruction methods, such as Neural Radiance Field (NeRF), have been used to create implicit neural representations of HSI scenes. This capability enables the rendering of hyperspectral channel compositions at every spatial location, thereby helping localize the target object’s nutrient composition both spatially and spectrally. However, it faces limitations in training time and rendering speed. In this paper, we propose Diffusion-Denoised Hyperspectral Gaussian Splatting (DD-HGS), which enhances the state-of-the-art 3D Gaussian Splatting (3DGS) method with wavelength-aware spherical harmonics, a Kullback-Leibler divergence-based spectral loss, and a diffusion-based denoiser to enable 3D explicit reconstruction of the hyperspectral scenes for the entire spectral range. We present extensive evaluations on diverse real-world hyperspectral scenes from the Hyper-NeRF dataset to show the effectiveness of our DD-HGS. The results demonstrate that DD-HGS achieves the new state-of-the-art performance compared to all the previously published methods. Project page: https://dragonpg2000.github.io/DDHGS-website/

[242] Modular, On-Site Solutions with Lightweight Anomaly Detection for Sustainable Nutrient Management in Agriculture

Abigail R. Cohen, Yuming Sun, Zhihao Qin, Harsh S. Muriki, Zihao Xiao, Yeonju Lee, Matthew Housley, Andrew F. Sharkey, Rhuanito S. Ferrarezi, Jing Li, Lu Gan, Yongsheng Chen

Main category: cs.CV

TL;DR: A tiered pipeline using multispectral imaging and autoencoders for efficient nutrient management in crops, enabling early anomaly detection and status estimation with energy-efficient approaches.

Details

Motivation: Current nutrient management methods are slow and computationally intensive, preventing real-time optimization needed for sustainable agriculture and resource conservation.

Method: Hierarchical pipeline with autoencoder for early anomaly detection, comparing two status estimation approaches: vegetation index features with Random Forest vs. raw image analysis with Vision Transformer.

Result: High-efficiency anomaly detection (73% net detection of nutrient-deficient samples) with lower energy than wasted nitrogen, and trade-offs between ViT (better accuracy for phosphorus/calcium) and RF (more energy-efficient).

Conclusion: The modular pipeline enables practical edge diagnostics for agricultural sustainability by balancing efficiency and accuracy in nutrient management.

Abstract: Efficient nutrient management is critical for crop growth and sustainable resource consumption (e.g., nitrogen, energy). Current approaches require lengthy analyses, preventing real-time optimization; similarly, imaging facilitates rapid phenotyping but can be computationally intensive, preventing deployment under resource constraints. This study proposes a flexible, tiered pipeline for anomaly detection and status estimation (fresh weight, dry mass, and tissue nutrients), including a comprehensive energy analysis of approaches that span the efficiency-accuracy spectrum. Using a nutrient depletion experiment with three treatments (T1-100%, T2-50%, and T3-25% fertilizer strength) and multispectral imaging (MSI), we developed a hierarchical pipeline using an autoencoder (AE) for early warning. Further, we compared two status estimation modules of different complexity for more detailed analysis: vegetation index (VI) features with machine learning (Random Forest, RF) and raw whole-image deep learning (Vision Transformer, ViT). Results demonstrated high-efficiency anomaly detection (73% net detection of T3 samples 9 days after transplanting) at substantially lower energy than embodied energy in wasted nitrogen. The state estimation modules show trade-offs, with ViT outperforming RF on phosphorus and calcium estimation (R2 0.61 vs. 0.58, 0.48 vs. 0.35) at higher energy cost. With our modular pipeline, this work opens opportunities for edge diagnostics and practical opportunities for agricultural sustainability.

[243] Adapting Segment Anything Model for Power Transmission Corridor Hazard Segmentation

Hang Chen, Maoyuan Ye, Peng Yang, Haibin He, Juhua Liu, Bo Du

Main category: cs.CV

TL;DR: ELE-SAM adapts Segment Anything Model for power transmission corridor hazard segmentation by adding Context-Aware Prompt Adapter and High-Fidelity Mask Decoder, achieving significant performance improvements on the new ELE-40K dataset.

Details

Motivation: SAM struggles with fine-structured objects in complex transmission corridor scenarios, requiring adaptation for power transmission corridor hazard segmentation to maintain electric power transmission safety.

Method: Developed Context-Aware Prompt Adapter for better prompt tokens using global-local features, and High-Fidelity Mask Decoder leveraging multi-granularity mask features at higher resolution.

Result: Outperforms baseline by 16.8% mIoU and 20.6% mBIoU on ELE-40K, and achieves 2.9% mIoU and 3.8% mBIoU improvements over SOTA on HQSeg-44K for generic object segmentation.

Conclusion: ELE-SAM effectively adapts SAM for PTCHS task with superior performance, and the ELE-40K dataset advances the field as the first large-scale real-world benchmark.

Abstract: Power transmission corridor hazard segmentation (PTCHS) aims to separate transmission equipment and surrounding hazards from complex background, conveying great significance to maintaining electric power transmission safety. Recently, the Segment Anything Model (SAM) has emerged as a foundational vision model and pushed the boundaries of segmentation tasks. However, SAM struggles to deal with the target objects in complex transmission corridor scenario, especially those with fine structure. In this paper, we propose ELE-SAM, adapting SAM for the PTCHS task. Technically, we develop a Context-Aware Prompt Adapter to achieve better prompt tokens via incorporating global-local features and focusing more on key regions. Subsequently, to tackle the hazard objects with fine structure in complex background, we design a High-Fidelity Mask Decoder by leveraging multi-granularity mask features and then scaling them to a higher resolution. Moreover, to train ELE-SAM and advance this field, we construct the ELE-40K benchmark, the first large-scale and real-world dataset for PTCHS including 44,094 image-mask pairs. Experimental results for ELE-40K demonstrate the superior performance that ELE-SAM outperforms the baseline model with the average 16.8% mIoU and 20.6% mBIoU performance improvement. Moreover, compared with the state-of-the-art method on HQSeg-44K, the average 2.9% mIoU and 3.8% mBIoU absolute improvements further validate the effectiveness of our method on high-quality generic object segmentation. The source code and dataset are available at https://github.com/Hhaizee/ELE-SAM.

[244] Vision Remember: Recovering Visual Information in Efficient LVLM with Vision Feature Resampling

Ze Feng, Jiang-jiang Liu, Sen Yang, Lingyu Xiao, Zhibin Quan, Zhenhua Feng, Wankou Yang, Jingdong Wang

Main category: cs.CV

TL;DR: Vision Remember improves LVLMs by resampling vision features across decoder layers to recover fine-grained visual information lost in compression, achieving better performance on visual understanding tasks.

Details

Motivation: Existing vision token compression methods lose crucial fine-grained spatial visual information needed for tasks like OCR and Chart&Table Understanding, limiting LVLM performance.

Method: Proposes Vision Remember with two modules: Token-Feature Cross-Attention Layer for local cross-attention and multi-level fusion, and Token Bidirectional Self-Attention Layer for bidirectional interaction between vision tokens and text-guided tokens.

Result: Outperforms TokenPacker by +2.7 and FastV by +5.7 across settings, and surpasses DeepStack by +3.9 and SVA Aggregator by +3.4 on the same baseline, showing strong generalization with various vision projectors and LVLMs.

Conclusion: Vision Remember effectively recovers visual information through cross-layer feature resampling, achieving state-of-the-art performance on visual understanding benchmarks while maintaining efficiency.

Abstract: The computational expense of redundant vision tokens in Large Vision-Language Models (LVLMs) has led many existing methods to compress them via a vision projector. However, this compression may lose visual information that is crucial for tasks relying on fine-grained spatial relationships, such as OCR and Chart&Table Understanding. In this paper, we propose to resample original vision features across the LLM decoder layers to recover visual information and attain efficiency. Following this principle, we introduce Vision Remember, which includes two key modules: (1) Token-Feature Cross-Attention Layer and (2) Token Bidirectional Self-Attention Layer. In the Token bidirectional attention, we employ self-attention mechanism to maintain the bidirectional interaction between vision tokens and the text-guided token. In the Token-Feature interaction attention, we introduce local cross-attention to resample the visual feature and utilize the multi-level fusion to enrich the visual representation. We conduct comprehensive experiments on multiple visual understanding benchmarks and the results with the LLaVA-NeXT baseline show that Vision Remember outperforms TokenPacker by +2.7 and FastV by +5.7 across nearly all the settings. Compared with previous vision feature re-fusion methods, our approach also surpasses DeepStack by +3.9 and SVA Aggregator by +3.4 on the same baseline. The experimental results validate the generalization capability of the proposed method when combined with various efficient vision projectors and LVLMs.

[245] Dynamic Epsilon Scheduling: A Multi-Factor Adaptive Perturbation Budget for Adversarial Training

Alan Mitkiy, James Smith, Myungseo wong, Hana Satou, Hiroshi Tanaka, Emily Johnson

Main category: cs.CV

TL;DR: Dynamic Epsilon Scheduling (DES) adaptively adjusts adversarial perturbation budgets per instance and training iteration using gradient-based boundary distance, prediction confidence, and model uncertainty, improving robustness and accuracy over fixed-budget methods.

Details

Motivation: Existing adversarial training uses fixed perturbation budgets that don't account for instance-specific robustness characteristics, limiting effectiveness.

Method: DES integrates three factors: gradient-based decision boundary distance approximation, softmax entropy prediction confidence, and Monte Carlo dropout model uncertainty to dynamically schedule perturbation budgets.

Result: Experiments on CIFAR-10 and CIFAR-100 show consistent improvements in both adversarial robustness and standard accuracy compared to fixed-epsilon baselines and prior adaptive methods.

Conclusion: DES provides a new approach for instance-aware, data-driven adversarial training with theoretical insights into scheduling stability and convergence.

Abstract: Adversarial training is among the most effective strategies for defending deep neural networks against adversarial examples. A key limitation of existing adversarial training approaches lies in their reliance on a fixed perturbation budget, which fails to account for instance-specific robustness characteristics. While prior works such as IAAT and MMA introduce instance-level adaptations, they often rely on heuristic or static approximations of data robustness. In this paper, we propose Dynamic Epsilon Scheduling (DES), a novel framework that adaptively adjusts the adversarial perturbation budget per instance and per training iteration. DES integrates three key factors: (1) the distance to the decision boundary approximated via gradient-based proxies, (2) prediction confidence derived from softmax entropy, and (3) model uncertainty estimated via Monte Carlo dropout. By combining these cues into a unified scheduling strategy, DES tailors the perturbation budget dynamically to guide more effective adversarial learning. Experimental results on CIFAR-10 and CIFAR-100 show that our method consistently improves both adversarial robustness and standard accuracy compared to fixed-epsilon baselines and prior adaptive methods. Moreover, we provide theoretical insights into the stability and convergence of our scheduling policy. This work opens a new avenue for instance-aware, data-driven adversarial training methods.

[246] PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction

Ziqiao Meng, Qichao Wang, Zhiyang Dou, Zixing Song, Zhipeng Zhou, Irwin King, Peilin Zhao

Main category: cs.CV

Details

Motivation: Autoregressive point cloud generation has lagged behind diffusion-based approaches due to artificial ordering constraints that undermine global structural properties like symmetry and long-range dependencies.

Method: Proposes PointNSP with coarse-to-fine generation using level-of-detail principle, preserving global structure at low resolutions and refining geometry through next-scale prediction paradigm with multi-scale factorization.

Result: Establishes SOTA generation quality on ShapeNet within autoregressive paradigm, surpasses diffusion baselines in parameter/training/inference efficiency, and shows pronounced advantages in dense generation with 8,192 points.

Conclusion: PointNSP successfully bridges the performance gap between autoregressive and diffusion-based point cloud generation while offering superior efficiency and scalability.

[247] MetricHMSR:Metric Human Mesh and Scene Recovery from Monocular Images

Chentao Song, He Zhang, Haolei Yuan, Haozhe Lin, Jianhua Tao, Hongwen Zhang, Tao Yu

Main category: cs.CV

TL;DR: MetricHMSR is a novel method for metric human mesh and scene recovery from monocular images that uses camera rays and Human Mixture-of-Experts to simultaneously estimate human pose and 3D position in a unified framework.

Details

Motivation: Existing approaches struggle with metric human pose and 3D position estimation due to unrealistic camera model assumptions and inherent challenges in metric perception from monocular images.

Method: Incorporates camera rays to encode bounding box and intrinsic parameters, proposes Human Mixture-of-Experts to dynamically route image and ray features to task-specific experts, and refines existing metric depth estimation methods.

Result: Achieves state-of-the-art performance on both human mesh and scene recovery, enabling seamless overlay of humans and scenes in 3D space.

Conclusion: MetricHMSR provides an effective unified framework for simultaneous metric human pose and 3D position estimation from monocular images.

Abstract: We introduce MetricHMSR (Metric Human Mesh and Scene Recovery), a novel approach for metric human mesh and scene recovery from monocular images. Due to unrealistic assumptions in the camera model and inherent challenges in metric perception, existing approaches struggle to achieve human pose and metric 3D position estimation through a unified module. To address this limitation, MetricHMSR incorporates camera rays to comprehensively encode both the bounding box information and the intrinsic parameters of perspective projection. Then we proposed Human Mixture-of-Experts (MoE), the model dynamically routes image features and ray features to task-specific experts for specialized understanding of different data aspects, enabling a unified framework that simultaneously perceives the local pose and the global 3D position. Based on the results above, we further refine the existing monocular metric depth estimation method to achieve more accurate results, ultimately enabling the seamless overlay of humans and scenes in 3D space. Comprehensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on both human mesh and scene recovery.

[248] Automated Neural Architecture Design for Industrial Defect Detection

Yuxi Liu, Yunfeng Ma, Yi Tang, Min Liu, Shuai Jiang, Yaonan Wang

Main category: cs.CV

TL;DR: AutoNAD is an automated neural architecture design framework for surface defect detection that jointly searches over convolutions, transformers, and MLPs to address intraclass difference and interclass similarity challenges.

Details

Motivation: Existing manual methods for surface defect detection require extensive trial and error and struggle with intraclass difference (varied defect shapes/sizes) and interclass similarity (different defects looking similar).

Method: Proposes AutoNAD framework with hybrid architecture search over convolutions, transformers, and MLPs; cross weight sharing for efficient training; searchable multi-level feature aggregation; and latency-aware prior for runtime efficiency.

Result: Validated on three industrial defect datasets and integrated into a defect imaging and detection platform, showing effectiveness in addressing SDD challenges while maintaining runtime efficiency.

Conclusion: AutoNAD successfully automates neural architecture design for surface defect detection, overcoming manual design limitations and providing an efficient solution suitable for industrial deployment.

Abstract: Industrial surface defect detection (SDD) is critical for ensuring product quality and manufacturing reliability. Due to the diverse shapes and sizes of surface defects, SDD faces two main challenges: intraclass difference and interclass similarity. Existing methods primarily utilize manually designed models, which require extensive trial and error and often struggle to address both challenges effectively. To overcome this, we propose AutoNAD, an automated neural architecture design framework for SDD that jointly searches over convolutions, transformers, and multi-layer perceptrons. This hybrid design enables the model to capture both fine-grained local variations and long-range semantic context, addressing the two key challenges while reducing the cost of manual network design. To support efficient training of such a diverse search space, AutoNAD introduces a cross weight sharing strategy, which accelerates supernet convergence and improves subnet performance. Additionally, a searchable multi-level feature aggregation module (MFAM) is integrated to enhance multi-scale feature learning. Beyond detection accuracy, runtime efficiency is essential for industrial deployment. To this end, AutoNAD incorporates a latency-aware prior to guide the selection of efficient architectures. The effectiveness of AutoNAD is validated on three industrial defect datasets and further applied within a defect imaging and detection platform. Code is available at https://github.com/Yuxi104/AutoNAD.

[249] Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding

Youze Wang, Zijun Chen, Ruoyu Chen, Shishen Gu, Wenbo Hu, Jiayang Liu, Yinpeng Dong, Hang Su, Jun Zhu, Meng Wang, Richang Hong

Main category: cs.CV

TL;DR: Trust-videoLLMs is the first comprehensive benchmark evaluating 23 videoLLMs across truthfulness, robustness, safety, fairness, and privacy using 30 tasks with various video types, revealing significant limitations in dynamic scene comprehension and risk mitigation.

Details

Motivation: VideoLLMs face reliability issues including factual inaccuracies, harmful content, biases, hallucinations, and privacy risks, creating a need for standardized trustworthiness assessment beyond just accuracy metrics.

Method: Developed a comprehensive benchmark evaluating 23 state-of-the-art videoLLMs (5 commercial, 18 open-source) across 5 dimensions using 30 tasks with adapted, synthetic, and annotated videos to assess spatiotemporal risks, temporal consistency and cross-modal impact.

Result: Revealed significant limitations in dynamic scene comprehension, cross-modal perturbation resilience and real-world risk mitigation. Open-source models occasionally outperform proprietary ones, but proprietary models generally show superior credibility. Scaling doesn’t consistently improve performance.

Conclusion: There’s a critical need for enhanced training data diversity and robust multimodal alignment. Trust-videoLLMs provides a publicly available, extensible toolkit for standardized trustworthiness assessments to bridge the gap between accuracy-focused benchmarks and demands for robustness, safety, fairness, and privacy.

Abstract: Recent advancements in multimodal large language models for video understanding (videoLLMs) have enhanced their capacity to process complex spatiotemporal data. However, challenges such as factual inaccuracies, harmful content, biases, hallucinations, and privacy risks compromise their reliability. This study introduces Trust-videoLLMs, a first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs (5 commercial, 18 open-source) across five critical dimensions: truthfulness, robustness, safety, fairness, and privacy. Comprising 30 tasks with adapted, synthetic, and annotated videos, the framework assesses spatiotemporal risks, temporal consistency and cross-modal impact. Results reveal significant limitations in dynamic scene comprehension, cross-modal perturbation resilience and real-world risk mitigation. While open-source models occasionally outperform, proprietary models generally exhibit superior credibility, though scaling does not consistently improve performance. These findings underscore the need for enhanced training datat diversity and robust multimodal alignment. Trust-videoLLMs provides a publicly available, extensible toolkit for standardized trustworthiness assessments, addressing the critical gap between accuracy-focused benchmarks and demands for robustness, safety, fairness, and privacy.

[250] SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation

Zhenjie Mao, Yuhuan Yang, Chaofan Ma, Dongsheng Jiang, Jiangchao Yao, Ya Zhang, Yanfeng Wang

Main category: cs.CV

TL;DR: SaFiRe is a novel framework for Referring Image Segmentation that handles ambiguous expressions through a two-phase cognitive process inspired by human reasoning, using Mamba’s scan-then-update property for efficient multi-cycle refinement.

Details

Motivation: Current RIS methods focus on simple expressions and reduce the task to keyword matching, failing to handle referential ambiguity in complex real-world scenarios like object-distracting and category-implicit expressions.

Method: Proposes SaFiRe framework that mimics human two-phase cognition: global understanding followed by detail-oriented inspection, leveraging Mamba’s scan-then-update property for efficient multi-cycle refinement with linear complexity.

Result: Extensive experiments on standard and proposed aRefCOCO benchmark show SaFiRe’s superiority over state-of-the-art baselines in handling ambiguous referring expressions.

Conclusion: SaFiRe effectively addresses the limitations of current RIS methods by handling complex ambiguous expressions through a biologically-inspired two-phase approach, demonstrating strong performance on challenging real-world scenarios.

Abstract: Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. While recent methods leverage pre-trained vision backbones and more training corpus to achieve impressive results, they predominantly focus on simple expressions–short, clear noun phrases like “red car” or “left girl”. This simplification often reduces RIS to a key word/concept matching problem, limiting the model’s ability to handle referential ambiguity in expressions. In this work, we identify two challenging real-world scenarios: object-distracting expressions, which involve multiple entities with contextual cues, and category-implicit expressions, where the object class is not explicitly stated. To address the challenges, we propose a novel framework, SaFiRe, which mimics the human two-phase cognitive process–first forming a global understanding, then refining it through detail-oriented inspection. This is naturally supported by Mamba’s scan-then-update property, which aligns with our phased design and enables efficient multi-cycle refinement with linear complexity. We further introduce aRefCOCO, a new benchmark designed to evaluate RIS models under ambiguous referring expressions. Extensive experiments on both standard and proposed datasets demonstrate the superiority of SaFiRe over state-of-the-art baselines.

[251] Learning Normals of Noisy Points by Local Gradient-Aware Surface Filtering

Qing Li, Huifang Feng, Xun Gong, Yu-Shen Liu

Main category: cs.CV

TL;DR: A novel method for estimating normals from noisy point clouds using local gradient-aware surface filtering that projects noisy points onto underlying surfaces through implicit functions constrained by local gradients.

Details

Motivation: Existing normal estimation methods work well on clean data but struggle with noisy point clouds, relying on supervised priors and specific neighborhoods without effectively handling noise.

Method: Uses local gradient-aware surface filtering with implicit functions to project noisy points onto surfaces. Includes distance measurement for global surface fitting, implicit field-based filtering with projection constraints, and local gradient consistency constraints to prevent over-smoothing.

Result: Comprehensive experiments show state-of-the-art performance in normal estimation, surface reconstruction, and point cloud denoising tasks.

Conclusion: The proposed LGSF method effectively handles noisy point clouds and achieves superior performance across multiple 3D geometry processing tasks compared to existing approaches.

Abstract: Estimating normals for noisy point clouds is a persistent challenge in 3D geometry processing, particularly for end-to-end oriented normal estimation. Existing methods generally address relatively clean data and rely on supervised priors to fit local surfaces within specific neighborhoods. In this paper, we propose a novel approach for learning normals from noisy point clouds through local gradient-aware surface filtering. Our method projects noisy points onto the underlying surface by utilizing normals and distances derived from an implicit function constrained by local gradients. We start by introducing a distance measurement operator for global surface fitting on noisy data, which integrates projected distances along normals. Following this, we develop an implicit field-based filtering approach for surface point construction, adding projection constraints on these points during filtering. To address issues of over-smoothing and gradient degradation, we further incorporate local gradient consistency constraints, as well as local gradient orientation and aggregation. Comprehensive experiments on normal estimation, surface reconstruction, and point cloud denoising demonstrate the state-of-the-art performance of our method. The source code and trained models are available at https://github.com/LeoQLi/LGSF.

[252] ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models

Jiaxin Liu, Zhaolu Kang

Main category: cs.CV

TL;DR: ReasonAct enhances video reasoning in small models via three-stage training: text reasoning foundation, video fine-tuning, and temporal-aware RL refinement, achieving significant accuracy improvements on video datasets.

Details

Motivation: Small-scale multimodal models struggle with fine-grained temporal reasoning required for video understanding, needing methods to enhance their video reasoning capabilities while maintaining computational efficiency.

Method: Three-stage training: 1) Build foundation with text-only reasoning, 2) Fine-tune on video data, 3) Refine with temporal-aware reinforcement learning using T-GRPO with temporal consistency modeling and biomechanically-motivated sub-action decomposition for graduated rewards.

Result: 3B-parameter model achieves 67.2% on HMDB51, 94.1% on UCF-101, and 78.9% on Kinetics-400, with improvements of 17.9, 15.8, and 12.3 points over baselines respectively.

Conclusion: Progressive training enables smaller models to achieve competitive video reasoning performance while maintaining computational efficiency, as validated by ablation studies.

Abstract: While recent multimodal models have shown progress in vision-language tasks, small-scale variants still struggle with the fine-grained temporal reasoning required for video understanding. We introduce ReasonAct, a method that enhances video reasoning in smaller models through a three-stage training process: first building a foundation with text-only reasoning, then fine-tuning on video, and finally refining with temporal-aware reinforcement learning. We build upon Temporal Group Relative Policy Optimization (T-GRPO) by incorporating temporal consistency modeling into policy optimization. We also propose a biomechanically-motivated sub-action decomposition mechanism that provides graduated rewards for constituent action phases. Through experiments on HMDB51, UCF-101, and Kinetics-400, our 3B-parameter model achieves 67.2%, 94.1%, and 78.9% accuracy respectively, demonstrating improvements of 17.9, 15.8, and 12.3 points over baselines. Ablation studies validate that our progressive training enables smaller models to achieve competitive video reasoning performance while maintaining computational efficiency.

[253] SARVLM: A Vision Language Foundation Model for Semantic Understanding and Target Recognition in SAR Imagery

Qiwei Ma, Zhiyu Wang, Wang Liu, Xukun Lu, Bin Deng, Puhong Duan, Xudong Kang, Shutao Li

Main category: cs.CV

TL;DR: SARVLM is the first vision-language foundation model for SAR imagery, trained on a large-scale dataset (SARVLM-1M) with domain transfer strategy to bridge natural and SAR imagery gaps, enabling superior multimodal understanding and zero-shot capabilities.

Details

Motivation: Existing SAR foundation models focus on low-level visual features and lack multimodal alignment and zero-shot target recognition capabilities, limiting their semantic understanding of SAR imagery.

Method: Constructed SARVLM-1M dataset with 1M+ image-text pairs, proposed domain transfer training strategy to mitigate natural-SAR imagery gap, and developed SARVLM model with SARCLIP and SARCap components using vision-language contrastive learning.

Result: SARVLM achieves state-of-the-art performance in image-text retrieval, zero-shot classification, semantic localization, and imagery captioning, demonstrating superior feature extraction and interpretation compared to existing VLMs.

Conclusion: SARVLM advances SAR semantic understanding by effectively bridging SAR imagery with textual descriptions through multimodal alignment, offering improved zero-shot capabilities and comprehensive SAR interpretation.

Abstract: Synthetic Aperture Radar (SAR) is a crucial imaging modality thanks to its all-weather capability. Although recent advances in self-supervised learning and masked image modeling (MIM) have enabled SAR foundation models, these methods largely emphasize low-level visual features and often overlook multimodal alignment and zero-shot target recognition in SAR imagery. To address this, we construct SARVLM-1M, a large-scale vision-language dataset with over one million image-text pairs aggregated from existing datasets. We further propose a domain transfer training strategy to mitigate the large gap between natural and SAR imagery. Building on this, we develop SARVLM, the first vision language foundation model (VLM) tailored to SAR, comprising SARCLIP and SARCap. SARVLM is trained with a vision-language contrastive objective under the proposed domain transfer strategy, bridging SAR imagery and textual descriptions. Extensive experiments on image text retrieval, zero-shot classification, semantic localization, and imagery captioning demonstrate that SARVLM delivers superior feature extraction and interpretation, outperforming state-of-the-art VLMs and advancing SAR semantic understanding. Code and datasets will be released soon.

[254] MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning

Thanh-Dat Truong, Christophe Bobda, Nitin Agarwal, Khoa Luu

Main category: cs.CV

TL;DR: MANGO introduces an explicit, interpretable multimodal fusion approach using invertible cross-attention layers in normalizing flows, achieving SoTA performance across multiple tasks.

Details

Motivation: Current multimodal fusion methods use Transformers' attention to implicitly learn correlations, failing to capture essential modality features and understand complex multimodal structures.

Method: Proposes Multimodal Attention-based Normalizing Flow (MANGO) with Invertible Cross-Attention layers and three new cross-attention mechanisms: MMCA, IMCA, and LICA to capture complex multimodal correlations.

Result: Achieved state-of-the-art performance on three multimodal learning tasks: semantic segmentation, image-to-image translation, and movie genre classification.

Conclusion: MANGO provides an explicit, interpretable, and tractable multimodal fusion approach that effectively captures complex correlations in multimodal data.

Abstract: Multimodal learning has gained much success in recent years. However, current multimodal fusion methods adopt the attention mechanism of Transformers to implicitly learn the underlying correlation of multimodal features. As a result, the multimodal model cannot capture the essential features of each modality, making it difficult to comprehend complex structures and correlations of multimodal inputs. This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach to developing explicit, interpretable, and tractable multimodal fusion learning. In particular, we propose a new Invertible Cross-Attention (ICA) layer to develop the Normalizing Flow-based Model for multimodal data. To efficiently capture the complex, underlying correlations in multimodal data in our proposed invertible cross-attention layer, we propose three new cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). Finally, we introduce a new Multimodal Attention-based Normalizing Flow to enable the scalability of our proposed method to high-dimensional multimodal data. Our experimental results on three different multimodal learning tasks, i.e., semantic segmentation, image-to-image translation, and movie genre classification, have illustrated the state-of-the-art (SoTA) performance of the proposed approach.

[255] Towards Spatially Consistent Image Generation: On Incorporating Intrinsic Scene Properties into Diffusion Models

Hyundo Lee, Suhyung Choi, Inwoo Hwang, Byoung-Tak Zhang

Main category: cs.CV

TL;DR: The paper proposes a method to improve spatial consistency in image generation by co-generating images with their intrinsic scene properties (depth, segmentation maps), using pre-trained estimators and latent diffusion models.

Details

Motivation: Current image generation models often produce spatially inconsistent and distorted images due to limited information about underlying scene structures and spatial layouts.

Method: Extract intrinsic scene properties using pre-trained estimators, aggregate them into a single latent variable via autoencoder, and simultaneously denoise image and intrinsic domains using latent diffusion models with shared mutual information.

Result: The method corrects spatial inconsistencies, produces more natural scene layouts while maintaining image fidelity and textual alignment with base models like Stable Diffusion.

Conclusion: Co-generating images with intrinsic scene properties enables models to implicitly capture underlying scene structure, leading to more spatially consistent and realistic image generation.

Abstract: Image generation models trained on large datasets can synthesize high-quality images but often produce spatially inconsistent and distorted images due to limited information about the underlying structures and spatial layouts. In this work, we leverage intrinsic scene properties (e.g., depth, segmentation maps) that provide rich information about the underlying scene, unlike prior approaches that solely rely on image-text pairs or use intrinsics as conditional inputs. Our approach aims to co-generate both images and their corresponding intrinsics, enabling the model to implicitly capture the underlying scene structure and generate more spatially consistent and realistic images. Specifically, we first extract rich intrinsic scene properties from a large image dataset with pre-trained estimators, eliminating the need for additional scene information or explicit 3D representations. We then aggregate various intrinsic scene properties into a single latent variable using an autoencoder. Building upon pre-trained large-scale Latent Diffusion Models (LDMs), our method simultaneously denoises the image and intrinsic domains by carefully sharing mutual information so that the image and intrinsic reflect each other without degrading image quality. Experimental results demonstrate that our method corrects spatial inconsistencies and produces a more natural layout of scenes while maintaining the fidelity and textual alignment of the base model (e.g., Stable Diffusion).

[256] DensiCrafter: Physically-Constrained Generation and Fabrication of Self-Supporting Hollow Structures

Shengqi Dang, Fu Chai, Jiaxin Li, Chao Yuan, Wei Ye, Nan Cao

Main category: cs.CV

TL;DR: DensiCrafter generates lightweight, self-supporting 3D hollow structures by optimizing density fields from coarse voxel grids, achieving up to 43% material reduction while maintaining geometric fidelity.

Details

Motivation: Current 3D generative models ignore physical constraints and manufacturability, particularly the need for lightweight and self-supporting structures suitable for 3D printing.

Method: Optimizes continuous density fields from Trellis-generated voxel grids using differentiable, physically constrained loss terms and mass regularization, while preserving outer surfaces.

Result: Achieves up to 43% material mass reduction in text-to-3D tasks, improves stability, maintains high geometric fidelity, and produces reliably fabricable self-supporting structures.

Conclusion: DensiCrafter successfully bridges the gap between 3D generation and physical manufacturability, enabling lightweight, self-supporting hollow structures compatible with existing generative models.

Abstract: The rise of 3D generative models has enabled automatic 3D geometry and texture synthesis from multimodal inputs (e.g., text or images). However, these methods often ignore physical constraints and manufacturability considerations. In this work, we address the challenge of producing 3D designs that are both lightweight and self-supporting. We present DensiCrafter, a framework for generating lightweight, self-supporting 3D hollow structures by optimizing the density field. Starting from coarse voxel grids produced by Trellis, we interpret these as continuous density fields to optimize and introduce three differentiable, physically constrained, and simulation-free loss terms. Additionally, a mass regularization penalizes unnecessary material, while a restricted optimization domain preserves the outer surface. Our method seamlessly integrates with pretrained Trellis-based models (e.g., Trellis, DSO) without any architectural changes. In extensive evaluations, we achieve up to 43% reduction in material mass on the text-to-3D task. Compared to state-of-the-art baselines, our method could improve the stability and maintain high geometric fidelity. Real-world 3D-printing experiments confirm that our hollow designs can be reliably fabricated and could be self-supporting.

[257] Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

Thanh-Dat Truong, Huu-Thien Tran, Tran Thai Son, Bhiksha Raj, Khoa Luu

Main category: cs.CV

TL;DR: A novel learning mechanism for large multimodal models that improves robustness and generalization through shuffling tasks and directed-token approach, achieving state-of-the-art performance.

Details

Motivation: Large multimodal models suffer from limitations in robustness and generalization due to alignment issues between visual and textual features, which affects their reasoning capability and cross-modality understanding.

Method: Introduces two shuffling tasks (reconstructing image order and text order) during pre-training and fine-tuning, plus a directed-token approach to capture visual-textual knowledge and an Image-to-Response Guided loss for better visual understanding.

Result: The proposed approach consistently achieves state-of-the-art performance on academic task-oriented and instruction-following LMM benchmarks.

Conclusion: The simple but efficient learning mechanism with shuffling tasks and directed-token approach effectively improves multimodal alignment, reasoning capability, and visual understanding in large multimodal models.

Abstract: Large multimodal models (LMMs) have gained impressive performance due to their outstanding capability in various understanding tasks. However, these models still suffer from some fundamental limitations related to robustness and generalization due to the alignment and correlation between visual and textual features. In this paper, we introduce a simple but efficient learning mechanism for improving the robust alignment between visual and textual modalities by solving shuffling problems. In particular, the proposed approach can improve reasoning capability, visual understanding, and cross-modality alignment by introducing two new tasks: reconstructing the image order and the text order into the LMM’s pre-training and fine-tuning phases. In addition, we propose a new directed-token approach to capture visual and textual knowledge, enabling the capability to reconstruct the correct order of visual inputs. Then, we introduce a new Image-to-Response Guided loss to further improve the visual understanding of the LMM in its responses. The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs on academic task-oriented and instruction-following LMM benchmarks.

[258] Uncovering and Mitigating Destructive Multi-Embedding Attacks in Deepfake Proactive Forensics

Lixin Jia, Haiyang Sun, Zhiqing Guo, Yunfeng Diao, Dan Ma, Gaobo Yang

Main category: cs.CV

TL;DR: The paper identifies Multi-Embedding Attacks (MEA) as a vulnerability in deepfake proactive forensics and proposes Adversarial Interference Simulation (AIS) to make watermarking methods resilient against multiple embedding rounds.

Details

Motivation: Existing deepfake forensic methods rely on single watermark embedding, which is impractical in real-world scenarios where images may undergo multiple watermarking processes, rendering original forensic watermarks ineffective.

Method: Proposes Adversarial Interference Simulation (AIS) - a training paradigm that simulates MEA scenarios during fine-tuning and uses a resilience-driven loss function to enforce sparse and stable watermark representations without modifying network architecture.

Result: Extensive experiments show that AIS significantly enhances the robustness of various existing methods against Multi-Embedding Attacks, enabling correct extraction of original watermarks even after second embedding.

Conclusion: AIS provides a plug-and-play solution to address the MEA vulnerability in deepfake proactive forensics, making watermark-based source tracking reliable in practical scenarios with multiple embedding rounds.

Abstract: With the rapid evolution of deepfake technologies and the wide dissemination of digital media, personal privacy is facing increasingly serious security threats. Deepfake proactive forensics, which involves embedding imperceptible watermarks to enable reliable source tracking, serves as a crucial defense against these threats. Although existing methods show strong forensic ability, they rely on an idealized assumption of single watermark embedding, which proves impractical in real-world scenarios. In this paper, we formally define and demonstrate the existence of Multi-Embedding Attacks (MEA) for the first time. When a previously protected image undergoes additional rounds of watermark embedding, the original forensic watermark can be destroyed or removed, rendering the entire proactive forensic mechanism ineffective. To address this vulnerability, we propose a general training paradigm named Adversarial Interference Simulation (AIS). Rather than modifying the network architecture, AIS explicitly simulates MEA scenarios during fine-tuning and introduces a resilience-driven loss function to enforce the learning of sparse and stable watermark representations. Our method enables the model to maintain the ability to extract the original watermark correctly even after a second embedding. Extensive experiments demonstrate that our plug-and-play AIS training paradigm significantly enhances the robustness of various existing methods against MEA.

[259] Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video

Filippo Cenacchi, Longbing Cao, Mitchell McEwan, Deborah Richards

Main category: cs.CV

TL;DR: This paper presents a language-free dementia screening method using facial micro-dynamics from short talking head videos, achieving high performance without speech or text analysis.

Details

Motivation: Existing dementia screening methods rely on speech or scripted interviews, limiting scalability and requiring clinical intervention. The authors aim to develop a passive, language-free approach using natural facial behaviors that can work across devices, topics, and cultures.

Method: The method analyzes temporal facial kinematics including blink dynamics, mouth/jaw motions, gaze variability, and subtle head movements. It stabilizes facial signals, converts micro-movements into time series, smooths them, and summarizes into clip-level statistics based on activity mix across motion streams.

Result: On the YT DemTalk dataset (300 clips: 150 dementia, 150 controls), the method achieved AUROC 0.953, Average Precision 0.961, F1-score 0.851, and accuracy 0.857. Gaze lability and mouth/jaw dynamics were identified as the most informative cues.

Conclusion: Facial temporal micro-dynamics are sufficient for effective dementia screening without speech or text, enabling scalable, passive screening in natural settings across diverse populations and devices.

Abstract: We target passive dementia screening from short camera-facing talking head video, developing a facial temporal micro dynamics analysis for language free detection of early neuro cognitive change. This enables unscripted, in the wild video analysis at scale to capture natural facial behaviors, transferrable across devices, topics, and cultures without active intervention by clinicians or researchers during recording. Most existing resources prioritize speech or scripted interviews, limiting use outside clinics and coupling predictions to language and transcription. In contrast, we identify and analyze whether temporal facial kinematics, including blink dynamics, small mouth jaw motions, gaze variability, and subtle head adjustments, are sufficient for dementia screening without speech or text. By stabilizing facial signals, we convert these micro movements into interpretable facial microdynamic time series, smooth them, and summarize short windows into compact clip level statistics for screening. Each window is encoded by its activity mix (the relative share of motion across streams), thus the predictor analyzes the distribution of motion across streams rather than its magnitude, making per channel effects transparent. We also introduce YT DemTalk, a new dataset curated from publicly available, in the wild camera facing videos. It contains 300 clips (150 with self reported dementia, 150 controls) to test our model and offer a first benchmarking of the corpus. On YT DemTalk, ablations identify gaze lability and mouth/jaw dynamics as the most informative cues, and light weighted shallow classifiers could attain a dementia prediction performance of (AUROC) 0.953, 0.961 Average Precision (AP), 0.851 F1-score, and 0.857 accuracy.

[260] FastAvatar: Instant 3D Gaussian Splatting for Faces from Single Unconstrained Poses

Hao Liang, Zhixuan Ge, Soumendu Majee, Ashish Tiwari, G. M. Dilshan Godaliyadda, Ashok Veeraraghavan, Guha Balakrishnan

Main category: cs.CV

TL;DR: FastAvatar enables fast 3D face reconstruction from single images using 3D Gaussian Splatting, achieving high-quality results in ~3 seconds with a two-stage approach combining direct prediction and optimization.

Details

Motivation: To create fast and robust 3D face reconstruction from single images that preserves identity under extreme poses, overcoming the slow speed limitations of existing per-subject optimization methods.

Method: Two-stage design: feed-forward encoder-decoder predicts coarse geometry from pose-invariant identity embedding, followed by lightweight test-time refinement optimizing appearance parameters for photorealistic rendering.

Result: Achieves state-of-the-art quality (24.01 dB PSNR, 0.91 SSIM) while running 600x faster than existing methods, supporting novel-view synthesis and expression animation.

Conclusion: FastAvatar significantly broadens 3DGS-based facial avatar applicability by offering high fidelity, pose robustness, and rapid reconstruction simultaneously.

Abstract: We present FastAvatar, a fast and robust algorithm for single-image 3D face reconstruction using 3D Gaussian Splatting (3DGS). Given a single input image from an arbitrary pose, FastAvatar recovers a high-quality, full-head 3DGS avatar in approximately 3 seconds on a single NVIDIA A100 GPU. We use a two-stage design: a feed-forward encoder-decoder predicts coarse face geometry by regressing Gaussian structure from a pose-invariant identity embedding, and a lightweight test-time refinement stage then optimizes the appearance parameters for photorealistic rendering. This hybrid strategy combines the speed and stability of direct prediction with the accuracy of optimization, enabling strong identity preservation even under extreme input poses. FastAvatar achieves state-of-the-art reconstruction quality (24.01 dB PSNR, 0.91 SSIM) while running over 600x faster than existing per-subject optimization methods (e.g., FlashAvatar, GaussianAvatars, GASP). Once reconstructed, our avatars support photorealistic novel-view synthesis and FLAME-guided expression animation, enabling controllable reenactment from a single image. By jointly offering high fidelity, robustness to pose, and rapid reconstruction, FastAvatar significantly broadens the applicability of 3DGS-based facial avatars.

[261] Cheating Stereo Matching in Full-scale: Physical Adversarial Attack against Binocular Depth Estimation in Autonomous Driving

Kangqiao Zhao, Shuo Huai, Xurui Song, Jun Luo

Main category: cs.CV

TL;DR: First texture-enabled physical adversarial attack against stereo matching models using 3D PAEs with global camouflage texture for autonomous driving, achieving visual consistency and attack effectiveness across stereo viewpoints.

Details

Motivation: Existing attacks mostly target monocular perception with 2D patches, leaving stereo-based binocular depth estimation vulnerable and unexplored for physical adversarial examples.

Method: Uses 3D PAEs with global camouflage texture, a 3D stereo matching rendering module to handle camera disparity, and a novel merging attack that blends targets into environment through fine-grained optimization.

Result: PAEs successfully fool stereo models into producing erroneous depth information with enhanced stealth and lethality compared to existing hiding attacks.

Conclusion: The proposed method demonstrates effective physical adversarial attacks against stereo matching models, highlighting vulnerabilities in autonomous driving perception systems.

Abstract: Though deep neural models adopted to realize the perception of autonomous driving have proven vulnerable to adversarial examples, known attacks often leverage 2D patches and target mostly monocular perception. Therefore, the effectiveness of Physical Adversarial Examples (PAEs) on stereo-based binocular depth estimation remains largely unexplored. To this end, we propose the first texture-enabled physical adversarial attack against stereo matching models in the context of autonomous driving. Our method employs a 3D PAE with global camouflage texture rather than a local 2D patch-based one, ensuring both visual consistency and attack effectiveness across different viewpoints of stereo cameras. To cope with the disparity effect of these cameras, we also propose a new 3D stereo matching rendering module that allows the PAE to be aligned with real-world positions and headings in binocular vision. We further propose a novel merging attack that seamlessly blends the target into the environment through fine-grained PAE optimization. It has significantly enhanced stealth and lethality upon existing hiding attacks that fail to get seamlessly merged into the background. Extensive evaluations show that our PAEs can successfully fool the stereo models into producing erroneous depth information.

[262] EvoEmpirBench: Dynamic Spatial Reasoning with Agent-ExpVer

Pukun Zhao, Longxiang Wang, Miaowei Wang, Chen Chen, Fanqing Zhou, Haojian Huang

Main category: cs.CV

TL;DR: The paper introduces two dynamic spatial reasoning benchmarks that test models’ abilities in spatial understanding and adaptive planning under partial observability and environmental changes, revealing limitations in current models.

Details

Motivation: Existing spatial reasoning benchmarks fail to capture challenges of long-horizon reasoning and memory utilization in partially observable, dynamically changing environments.

Method: Proposed two dynamic spatial benchmarks (maze navigation and match-2 elimination) with structural changes triggered by actions, and introduced a subjective experience-based memory mechanism for cross-task experience transfer.

Result: Experiments revealed key limitations of mainstream models in dynamic spatial reasoning and long-term memory capabilities.

Conclusion: The benchmarks provide a comprehensive platform for evaluating and advancing methods in dynamic spatial reasoning under partial observability and environmental changes.

Abstract: Most existing spatial reasoning benchmarks focus on static or globally observable environments, failing to capture the challenges of long-horizon reasoning and memory utilization under partial observability and dynamic changes. We introduce two dynamic spatial benchmarks, locally observable maze navigation and match-2 elimination that systematically evaluate models’ abilities in spatial understanding and adaptive planning when local perception, environment feedback, and global objectives are tightly coupled. Each action triggers structural changes in the environment, requiring continuous update of cognition and strategy. We further propose a subjective experience-based memory mechanism for cross-task experience transfer and validation. Experiments show that our benchmarks reveal key limitations of mainstream models in dynamic spatial reasoning and long-term memory, providing a comprehensive platform for future methodological advances. Our code and data are available at https://anonymous.4open.science/r/EvoEmpirBench-143C/.

[263] Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration

Zhitao Zeng, Guojian Yuan, Junyuan Mao, Yuxuan Wang, Xiaoshuang Jia, Yueming Jin

Main category: cs.CV

TL;DR: The paper introduces Multi-Scale Temporal Prediction (MSTP) task and proposes IG-MC method with incremental generation and multi-agent collaboration for predicting multiple fine-grained scene states across temporal scales.

Details

Motivation: Accurate temporal prediction bridges scene understanding and embodied AI, but current vision-language models struggle with predicting multiple fine-grained states at multiple temporal scales.

Method: Proposes Incremental Generation and Multi-agent Collaboration (IG-MC): 1) Plug-and-play incremental generation module that synthesizes visual previews at expanding temporal scales, 2) Multi-agent collaboration framework with generation, initiation, and assessment agents for dynamic prediction cycles.

Result: Introduced the first MSTP Benchmark with synchronized annotations across multiple state and temporal scales. The method maintains decision-visual synchronization and prevents performance degradation at longer look-ahead intervals.

Conclusion: The MSTP task formalizes multi-scale temporal prediction, and IG-MC method effectively addresses the challenges through incremental generation and multi-agent collaboration, enabling balanced global coherence and local fidelity in predictions.

Abstract: Accurate temporal prediction is the bridge between comprehensive scene understanding and embodied artificial intelligence. However, predicting multiple fine-grained states of a scene at multiple temporal scales is difficult for vision-language models. We formalize the Multi-Scale Temporal Prediction (MSTP) task in general and surgical scenes by decomposing multi-scale into two orthogonal dimensions: the temporal scale, forecasting states of humans and surgery at varying look-ahead intervals, and the state scale, modeling a hierarchy of states in general and surgical scenes. For example, in general scenes, states of contact relationships are finer-grained than states of spatial relationships. In surgical scenes, medium-level steps are finer-grained than high-level phases yet remain constrained by their encompassing phase. To support this unified task, we introduce the first MSTP Benchmark, featuring synchronized annotations across multiple state scales and temporal scales. We further propose a method, Incremental Generation and Multi-agent Collaboration (IG-MC), which integrates two key innovations. First, we present a plug-and-play incremental generation module that continuously synthesizes up-to-date visual previews at expanding temporal scales to inform multiple decision-making agents, keeping decisions and generated visuals synchronized and preventing performance degradation as look-ahead intervals lengthen. Second, we present a decision-driven multi-agent collaboration framework for multi-state prediction, comprising generation, initiation, and multi-state assessment agents that dynamically trigger and evaluate prediction cycles to balance global coherence and local fidelity.

[264] ConceptGuard: Proactive Safety in Text-and-Image-to-Video Generation through Multimodal Risk Detection

Ruize Ma, Minghong Cai, Yilei Jiang, Jiaming Han, Yi Feng, Yingshui Tan, Xiaoyong Zhu, Bo Zhang, Bo Zheng, Xiangyu Yue

Main category: cs.CV

TL;DR: ConceptGuard is a unified safeguard framework that proactively detects and mitigates unsafe semantics in multimodal video generation by identifying latent risks in fused image-text inputs and steering the generative process away from unsafe concepts.

Details

Motivation: Existing safety methods for video generation are often text-only, require prior knowledge of risk categories, or operate as post-generation auditors, struggling to proactively mitigate compositional, multimodal risks that emerge from individual modalities or their interactions.

Method: Two-stage approach: 1) Contrastive detection module identifies latent safety risks by projecting fused image-text inputs into a structured concept space; 2) Semantic suppression mechanism steers the generative process away from unsafe concepts by intervening in the prompt’s multimodal conditioning.

Result: Comprehensive experiments on ConceptRisk and T2VSafetyBench-TI2V benchmarks show ConceptGuard consistently outperforms existing baselines, achieving state-of-the-art results in both risk detection and safe video generation.

Conclusion: ConceptGuard provides an effective unified framework for proactively addressing multimodal safety risks in video generation, demonstrating superior performance over existing methods through rigorous evaluation on novel benchmarks.

Abstract: Recent progress in video generative models has enabled the creation of high-quality videos from multimodal prompts that combine text and images. While these systems offer enhanced controllability, they also introduce new safety risks, as harmful content can emerge from individual modalities or their interaction. Existing safety methods are often text-only, require prior knowledge of the risk category, or operate as post-generation auditors, struggling to proactively mitigate such compositional, multimodal risks. To address this challenge, we present ConceptGuard, a unified safeguard framework for proactively detecting and mitigating unsafe semantics in multimodal video generation. ConceptGuard operates in two stages: First, a contrastive detection module identifies latent safety risks by projecting fused image-text inputs into a structured concept space; Second, a semantic suppression mechanism steers the generative process away from unsafe concepts by intervening in the prompt’s multimodal conditioning. To support the development and rigorous evaluation of this framework, we introduce two novel benchmarks: ConceptRisk, a large-scale dataset for training on multimodal risks, and T2VSafetyBench-TI2V, the first benchmark adapted from T2VSafetyBench for the Text-and-Image-to-Video (TI2V) safety setting. Comprehensive experiments on both benchmarks show that ConceptGuard consistently outperforms existing baselines, achieving state-of-the-art results in both risk detection and safe video generation. Our code is available at https://github.com/Ruize-Ma/ConceptGuard.

[265] ControlEvents: Controllable Synthesis of Event Camera Datawith Foundational Prior from Image Diffusion Models

Yixuan Hu, Yuxuan Xue, Simon Klenk, Daniel Cremers, Gerard Pons-Moll

Main category: cs.CV

TL;DR: ControlEvents is a diffusion-based generative model that synthesizes high-quality event data using control signals like text labels, 2D skeletons, and 3D body poses, leveraging diffusion priors from foundation models to reduce data labeling costs.

Details

Motivation: Event cameras offer bio-inspired advantages but face challenges in obtaining large-scale labeled ground-truth data, which is costly and difficult to acquire.

Method: Leverages diffusion prior from foundation models like Stable Diffusion to generate event data guided by diverse control signals with minimal fine-tuning and limited labeled data.

Result: Synthesized event data enhances model performance in visual recognition, 2D skeleton estimation, and 3D body pose estimation, and can generate events for unseen text labels.

Conclusion: The approach effectively reduces the cost of producing labeled event datasets and demonstrates powerful text-based generation capabilities inherited from foundation models.

Abstract: In recent years, event cameras have gained significant attention due to their bio-inspired properties, such as high temporal resolution and high dynamic range. However, obtaining large-scale labeled ground-truth data for event-based vision tasks remains challenging and costly. In this paper, we present ControlEvents, a diffusion-based generative model designed to synthesize high-quality event data guided by diverse control signals such as class text labels, 2D skeletons, and 3D body poses. Our key insight is to leverage the diffusion prior from foundation models, such as Stable Diffusion, enabling high-quality event data generation with minimal fine-tuning and limited labeled data. Our method streamlines the data generation process and significantly reduces the cost of producing labeled event datasets. We demonstrate the effectiveness of our approach by synthesizing event data for visual recognition, 2D skeleton estimation, and 3D body pose estimation. Our experiments show that the synthesized labeled event data enhances model performance in all tasks. Additionally, our approach can generate events based on unseen text labels during training, illustrating the powerful text-based generation capabilities inherited from foundation models.

[266] XYZCylinder: Towards Compatible Feed-Forward 3D Gaussian Splatting for Driving Scenes via Unified Cylinder Lifting Method

Haochen Yu, Qiankun Liu, Hongyuan Liu, Jianfei Jiang, Juntao Lyu, Jiansheng Chen, Huimin Ma

Main category: cs.CV

TL;DR: XYZCylinder is a novel 3D reconstruction method that uses unified cylinder lifting to handle varying camera configurations and improve reconstruction accuracy for complex driving scenes.

Details

Motivation: Existing feed-forward 3D reconstruction methods have limitations in complex driving scenes due to fixed view transformations that don't adapt to varying camera configurations and difficulty learning from sparse 360° views with minimal overlap.

Method: Proposes Unified Cylinder Camera Modeling (UCCM) to explicitly model projection parameters for diverse camera setups, and a hybrid representation with Cylinder Plane Feature Group (CPFG) to lift 2D image features to 3D space.

Result: Achieves state-of-the-art performance under different evaluation settings and demonstrates remarkable zero-shot compatibility in new scenes with different camera settings.

Conclusion: XYZCylinder effectively addresses camera compatibility and reconstruction accuracy challenges in complex driving scenes through unified cylinder lifting and hybrid representation.

Abstract: Feed-forward paradigms for 3D reconstruction have become a focus of recent research, which learn implicit, fixed view transformations to generate a single scene representation. However, their application to complex driving scenes reveals significant limitations. Two core challenges are responsible for this performance gap. First, the reliance on a fixed view transformation hinders compatibility to varying camera configurations. Second, the inherent difficulty of learning complex driving scenes from sparse 360° views with minimal overlap compromises the final reconstruction fidelity. To handle these difficulties, we introduce XYZCylinder, a novel method built upon a unified cylinder lifting method that integrates camera modeling and feature lifting. To tackle the compatibility problem, we design a Unified Cylinder Camera Modeling (UCCM) strategy. This strategy explicitly models projection parameters to unify diverse camera setups, thus bypassing the need for learning viewpoint-dependent correspondences. To improve the reconstruction accuracy, we propose a hybrid representation with several dedicated modules based on newly designed Cylinder Plane Feature Group (CPFG) to lift 2D image features to 3D space. Extensive evaluations confirm that XYZCylinder not only achieves state-of-the-art performance under different evaluation settings but also demonstrates remarkable compatibility in entirely new scenes with different camera settings in a zero-shot manner. Project page: \href{https://yuyuyu223.github.io/XYZCYlinder-projectpage/}{here}

[267] VA-GS: Enhancing the Geometric Representation of Gaussian Splatting via View Alignment

Qing Li, Huifang Feng, Xun Gong, Yu-Shen Liu

Main category: cs.CV

TL;DR: This paper proposes VA-GS, a method that enhances 3D Gaussian Splatting for better surface reconstruction by incorporating view alignment techniques including edge-aware rendering, visibility-aware photometric alignment, normal constraints, and deep feature consistency.

Details

Motivation: 3D Gaussian Splatting shows promise for novel view synthesis but struggles with accurate surface reconstruction due to its discrete and unstructured nature, leading to inaccurate geometry and inconsistent multi-view alignment.

Method: The method incorporates edge-aware image cues into rendering loss, introduces visibility-aware photometric alignment to handle occlusions, adds normal-based constraints to refine spatial orientation, and leverages deep image feature embeddings for cross-view consistency.

Result: Extensive experiments on standard benchmarks show state-of-the-art performance in both surface reconstruction and novel view synthesis.

Conclusion: The proposed VA-GS method successfully enhances the geometric representation of 3D Gaussians through comprehensive view alignment techniques, achieving superior surface reconstruction while maintaining high-quality novel view synthesis capabilities.

Abstract: 3D Gaussian Splatting has recently emerged as an efficient solution for high-quality and real-time novel view synthesis. However, its capability for accurate surface reconstruction remains underexplored. Due to the discrete and unstructured nature of Gaussians, supervision based solely on image rendering loss often leads to inaccurate geometry and inconsistent multi-view alignment. In this work, we propose a novel method that enhances the geometric representation of 3D Gaussians through view alignment (VA). Specifically, we incorporate edge-aware image cues into the rendering loss to improve surface boundary delineation. To enforce geometric consistency across views, we introduce a visibility-aware photometric alignment loss that models occlusions and encourages accurate spatial relationships among Gaussians. To further mitigate ambiguities caused by lighting variations, we incorporate normal-based constraints to refine the spatial orientation of Gaussians and improve local surface estimation. Additionally, we leverage deep image feature embeddings to enforce cross-view consistency, enhancing the robustness of the learned geometry under varying viewpoints and illumination. Extensive experiments on standard benchmarks demonstrate that our method achieves state-of-the-art performance in both surface reconstruction and novel view synthesis. The source code is available at https://github.com/LeoQLi/VA-GS.

[268] Decorrelation Speeds Up Vision Transformers

Kieran Carrigg, Rob van Gastel, Melda Yeghaian, Sander Dalm, Faysal Boughorbel, Marcel van Gerven

Main category: cs.CV

TL;DR: DBP-MAE integrates Decorrelated Backpropagation into MAE pre-training to reduce computational costs and accelerate convergence while maintaining or improving performance in low-data scenarios.

Details

Motivation: Address the substantial computational costs of MAE pre-training for vision transformers, making it impractical in time- and resource-constrained industrial settings.

Method: Integrate Decorrelated Backpropagation (DBP) into MAE pre-training, selectively applying it to the encoder to reduce input correlations at each layer and accelerate convergence.

Result: DBP-MAE reduces wall-clock time by 21.1%, lowers carbon emissions by 21.4%, improves segmentation mIoU by 1.1 points on ImageNet-1K and ADE20K, with similar gains on proprietary industrial data.

Conclusion: DBP can effectively reduce training time and energy use while improving downstream performance for large-scale ViT pre-training in real-world scenarios.

Abstract: Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label data regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by nitegrating Decorrelated Backpropagation (DBP) into MAE pre-training, an optimization method that iteratively reduces input correlations at each layer to accelerate convergence. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. To mimic constrained-data scenarios, we evaluate our approach on ImageNet-1K pre-training and ADE20K fine-tuning using randomly sampled subsets of each dataset. Under this setting, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4%, and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method’s applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance for large-scale ViT pre-training. Keywords: Deep learning, Vision transformers, Efficient AI, Decorrelation

[269] Saliency-R1: Incentivizing Unified Saliency Reasoning Capability in MLLM with Confidence-Guided Reinforcement Learning

Long Li, Shuichen Ji, Ziyang Luo, Zhihui Li, Dingwen Zhang, Junwei Han, Nian Liu

Main category: cs.CV

TL;DR: Saliency-R1 is a unified multimodal large language model framework that jointly handles three saliency tasks (SOD, SIS, CoSOD) using structured tags and confidence-guided policy optimization, achieving state-of-the-art performance.

Details

Motivation: MLLMs lack inherent visual saliency awareness, making it difficult to identify key visual elements in vision-language reasoning tasks.

Method: Proposed Saliency-R1 framework with textual interface using structured tags (, ) for region/instance-level referring, and Confidence-Guided Policy Optimization (CGPO) algorithm for efficient training.

Result: The model exceeds or matches performance of robust open/closed-source MLLMs and specialized state-of-the-art methods across all three saliency tasks.

Conclusion: The framework demonstrates efficacy in saliency reasoning through unified handling of heterogeneous saliency tasks with improved training efficiency.

Abstract: Although multimodal large language models (MLLMs) excel in high-level vision-language reasoning, they lack inherent awareness of visual saliency, making it difficult to identify key visual elements. To bridge this gap, we propose Saliency-R1, the first unified MLLM framework that jointly tackles three representative and heterogeneous saliency tasks: Salient Object Detection (SOD), Salient Instance Segmentation (SIS), and Co-salient Object Detection (CoSOD), enhancing the model’s capacity for saliency reasoning. We introduce a textual interface with structured tags (, ) to encode region- and instance-level referring expressions, enabling a single referring segmenter to produce task-appropriate masks. To train the MLLM efficiently, we propose Confidence-Guided Policy Optimization (CGPO), a novel single-sample reinforcement learning algorithm. CGPO improves on GRPO by replacing group-normalized advantages with a per-sample signal based on reward-confidence discrepancy, thereby reducing computational waste, mitigating signal dilution, and lowering training overhead. Our model exceeds or matches the performance of robust open/closed-source MLLMs and specialized state-of-the-art methods across all three tasks, demonstrating the efficacy of our framework in saliency reasoning.

[270] Probabilistic Robustness for Free? Revisiting Training via a Benchmark

Yi Zhang, Zheng Wang, Zhen Chen, Wenjie Ruan, Qing Guo, Siddartha Khastgir, Carsten Maple, Xingyu Zhao

Main category: cs.CV

TL;DR: PRBench is the first benchmark for evaluating probabilistic robustness training methods, comparing adversarial training and PR-targeted methods across multiple metrics including generalization error and clean accuracy.

Details

Motivation: Probabilistic robustness (PR) is a practical complement to adversarial robustness but lacks dedicated training methods and standardized evaluation protocols, with existing methods having non-comparable evaluations and limited comparisons to strong adversarial training baselines.

Method: Introduced PRBench benchmark that empirically compares adversarial training and PR-targeted training methods using comprehensive metrics including clean accuracy, PR/AR performance, training efficiency, and generalization error, with theoretical analysis on generalization error.

Result: Adversarial training methods are more versatile across diverse hyperparameter settings for improving both AR and PR, while PR-targeted methods consistently yield lower generalization error and higher clean accuracy. A leaderboard with 222 trained models across 7 datasets and 10 architectures is provided.

Conclusion: PRBench enables systematic evaluation of probabilistic robustness training methods, revealing trade-offs between adversarial training and PR-targeted approaches, with adversarial training being more versatile but PR-targeted methods offering better generalization and clean accuracy.

Abstract: Deep learning models are notoriously vulnerable to imperceptible perturbations. Most existing research centers on adversarial robustness (AR), which evaluates models under worst-case scenarios by examining the existence of deterministic adversarial examples (AEs). In contrast, probabilistic robustness (PR) adopts a statistical perspective, measuring the probability that predictions remain correct under stochastic perturbations. While PR is widely regarded as a practical complement to AR, dedicated training methods for improving PR are still relatively underexplored, albeit with emerging progress. Among the few PR-targeted training methods, we identify three limitations: i non-comparable evaluation protocols; ii limited comparisons to strong AT baselines despite anecdotal PR gains from AT; and iii no unified framework to compare the generalization of these methods. Thus, we introduce PRBench, the first benchmark dedicated to evaluating improvements in PR achieved by different robustness training methods. PRBench empirically compares most common AT and PR-targeted training methods using a comprehensive set of metrics, including clean accuracy, PR and AR performance, training efficiency, and generalization error (GE). We also provide theoretical analysis on the GE of PR performance across different training methods. Main findings revealed by PRBench include: AT methods are more versatile than PR-targeted training methods in terms of improving both AR and PR performance across diverse hyperparameter settings, while PR-targeted training methods consistently yield lower GE and higher clean accuracy. A leaderboard comprising 222 trained models across 7 datasets and 10 model architectures is publicly available at https://tmpspace.github.io/PRBenchLeaderboard/.

[271] MoEGCL: Mixture of Ego-Graphs Contrastive Representation Learning for Multi-View Clustering

Jian Zhu, Xin Zou, Jun Sun, Cheng Luo, Lei Liu, Lingfang Zeng, Ning Zhang, Bian Wu, Chang Tang, Lirong Dai

Main category: cs.CV

TL;DR: MoEGCL introduces fine-grained ego-graph fusion at sample level using Mixture-of-Experts, outperforming traditional view-level graph fusion methods in multi-view clustering.

Details

Motivation: Existing GNN-based multi-view clustering methods suffer from coarse-grained graph fusion by performing weighted fusion at view level, which is too rough for optimal performance.

Method: Proposes Mixture of Ego-Graphs Fusion (MoEGF) that constructs ego graphs and uses Mixture-of-Experts network for sample-level fusion, plus Ego Graph Contrastive Learning (EGCL) to align fused and view-specific representations.

Result: Extensive experiments show MoEGCL achieves state-of-the-art performance in deep multi-view clustering tasks.

Conclusion: Fine-grained ego-graph fusion at sample level significantly improves multi-view clustering performance compared to traditional view-level fusion approaches.

Abstract: In recent years, the advancement of Graph Neural Networks (GNNs) has significantly propelled progress in Multi-View Clustering (MVC). However, existing methods face the problem of coarse-grained graph fusion. Specifically, current approaches typically generate a separate graph structure for each view and then perform weighted fusion of graph structures at the view level, which is a relatively rough strategy. To address this limitation, we present a novel Mixture of Ego-Graphs Contrastive Representation Learning (MoEGCL). It mainly consists of two modules. In particular, we propose an innovative Mixture of Ego-Graphs Fusion (MoEGF), which constructs ego graphs and utilizes a Mixture-of-Experts network to implement fine-grained fusion of ego graphs at the sample level, rather than the conventional view-level fusion. Additionally, we present the Ego Graph Contrastive Learning (EGCL) module to align the fused representation with the view-specific representation. The EGCL module enhances the representation similarity of samples from the same cluster, not merely from the same sample, further boosting fine-grained graph representation. Extensive experiments demonstrate that MoEGCL achieves state-of-the-art results in deep multi-view clustering tasks. The source code is publicly available at https://github.com/HackerHyper/MoEGCL.

[272] TinyChemVL: Advancing Chemical Vision-Language Models via Efficient Visual Token Reduction and Complex Reaction Tasks

Xuanle Zhao, Shuxin Zeng, Xinyuan Cai, Xiang Cheng, Duzhen Zhang, Xiuyi Chen, Bo Xu

Main category: cs.CV

TL;DR: TinyChemVL is an efficient 4B-parameter chemical VLM that uses visual token reduction and reaction-level tasks to improve efficiency and reasoning, outperforming larger models while using only 1/16th of visual tokens.

Details

Motivation: Current VLMs for chemical tasks are computationally inefficient due to processing entire chemical images with non-informative backgrounds, and have narrow scope focusing only on molecular-level tasks, limiting chemical reasoning capabilities.

Method: Proposed TinyChemVL with visual token reduction to reduce computational overhead, and introduced reaction-level tasks to enhance reasoning capacity. Also created ChemRxn-V benchmark for vision-based reaction recognition and prediction.

Result: TinyChemVL achieves superior performance on both molecular and reaction tasks with faster inference and training speeds compared to existing models. It outperforms ChemVLM while using only 1/16th of visual tokens.

Conclusion: This work demonstrates that co-designing model architecture and task complexity enables building efficient yet powerful VLMs for chemical domains, advancing chemical reasoning capabilities beyond molecular-level tasks.

Abstract: While Vision Language Models (VLMs) have demonstrated remarkable capabilities in general visual understanding, their application in the chemical domain has been limited, with previous works predominantly focusing on text and thus overlooking critical visual information, such as molecular structures. Current approaches that directly adopt standard VLMs for chemical tasks suffer from two primary issues: (i) computational inefficiency of processing entire chemical images with non-informative backgrounds. (ii) a narrow scope on molecular-level tasks that restricts progress in chemical reasoning. In this work, we propose \textbf{TinyChemVL}, an efficient and powerful chemical VLM that leverages visual token reduction and reaction-level tasks to improve model efficiency and reasoning capacity. Also, we propose \textbf{ChemRxn-V}, a reaction-level benchmark for assessing vision-based reaction recognition and prediction tasks. Directly predicting reaction products from molecular images poses a non-trivial challenge, as it requires models to integrate both recognition and reasoning capacities. Our results demonstrate that with only 4B parameters, TinyChemVL achieves superior performance on both molecular and reaction tasks while demonstrating faster inference and training speeds compared to existing models. Notably, TinyChemVL outperforms ChemVLM while utilizing only 1/16th of the visual tokens. This work builds efficient yet powerful VLMs for chemical domains by co-designing model architecture and task complexity.

[273] Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

Jiaqi Liu, Kaiwen Xiong, Peng Xia, Yiyang Zhou, Haonian Ji, Lu Feng, Siwei Han, Mingyu Ding, Huaxiu Yao

Main category: cs.CV

TL;DR: Agent0-VL is a self-evolving vision-language agent that achieves continual improvement through tool-integrated reasoning, enabling models to self-evaluate and self-repair without human supervision.

Details

Motivation: Overcome limitations of human-annotated supervision in vision-language agents by addressing text-based self-evaluation struggles with complex visual reasoning and evaluation hallucinations.

Method: Incorporates tool usage into reasoning, self-evaluation, and self-repair through a Self-Evolving Reasoning Cycle with two synergistic roles: Solver (multi-turn tool-integrated reasoning) and Verifier (structured feedback and fine-grained self-rewards through tool-grounded critique).

Result: Achieves 12.5% improvement over base model on geometric problem solving and visual scientific analysis through zero-external-reward evolution.

Conclusion: Agent0-VL enables continual self-improvement by aligning reasoning and verification behaviors without human annotation or external reward models, demonstrating effective tool-integrated self-evolution.

Abstract: Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves an 12.5% improvement over the base model. Our code is available at https://github.com/aiming-lab/Agent0.

[274] DWFF-Net : A Multi-Scale Farmland System Habitat Identification Method with Adaptive Dynamic Weight

Kesong Zheng, Zhi Song, Peizhou Li, Shuyi Yao, Zhenxing Bian

Main category: cs.CV

TL;DR: Proposed DWFF-Net with dynamic-weighted feature fusion for cultivated land habitat segmentation, achieving 69.79% mIoU on a new ultra-high-resolution dataset with 15 habitat categories.

Details

Motivation: Lack of standardized habitat classification system for cultivated land ecosystems, incomplete coverage of habitat types, and existing models' inability to effectively integrate semantic and texture features, leading to insufficient segmentation accuracy and blurred boundaries.

Method: Developed DWFF-Net with frozen-parameter DINOv3 encoder, data-level adaptive dynamic weighting strategy for feature fusion, dynamic weight computation network in decoder, and hybrid loss function. Created comprehensive annotated dataset with 15 cultivated land habitat categories.

Result: Achieved mIoU of 69.79% and F1-score of 80.49%, outperforming baseline by 2.1% and 1.61% respectively. Ablation studies confirmed complementary nature of multi-layer feature fusion, improving IoU for micro-habitats like field ridges.

Conclusion: Established habitat identification framework enabling sub-meter precision habitat mapping at low cost, providing robust technical support for fine-grained habitat monitoring in cultivated landscapes.

Abstract: Addressing the current lack of a standardized habitat classification system for cultivated land ecosystems, incomplete coverage of the habitat types, and the inability of existing models to effectively integrate semantic and texture features-resulting in insufficient segmentation accuracy and blurred boundaries for multi-scale habitats (e.g., large-scale field plots and micro-habitats)-this study developed a comprehensively annotated ultra-high-resolution remote sensing image dataset encompassing 15 categories of cultivated land system habitats. Furthermore, we propose a Dynamic-Weighted Feature Fusion Network (DWFF-Net). The encoder of this model utilizes a frozen-parameter DINOv3 to extract foundational features. By analyzing the relationships between different category images and feature maps, we introduce a data-level adaptive dynamic weighting strategy for feature fusion. The decoder incorporates a dynamic weight computation network to achieve thorough integration of multi-layer features, and a hybrid loss function is adopted to optimize model training. Experimental results on the constructed dataset demonstrate that the proposed model achieves a mean Intersection over Union (mIoU) of 69.79% and an F1-score of 80.49%, outperforming the baseline network by 2.1% and 1.61%, respectively. Ablation studies further confirm the complementary nature of multi-layer feature fusion, which effectively improves the IoU for micro-habitat categories such as field ridges. This study establishes a habitat identification framework for cultivated land systems based on adaptive multi-layer feature fusion, enabling sub-meter precision habitat mapping at a low cost and providing robust technical support for fine-grained habitat monitoring in cultivated landscapes. (The complete code repository can be accessed via GitHub at the following URL: https://github.com/sysau/DWFF-Net)

[275] EmoFeedback$^2$: Reinforcement of Continuous Emotional Image Generation via LVLM-based Reward and Textual Feedback

Jingyang Jia, Kai Shu, Gang Yang, Long Xing, Xun Chen, Aiping Liu

Main category: cs.CV

TL;DR: Proposes EmoFeedback², a reinforcement paradigm for continuous emotional image generation that uses fine-tuned LVLM for emotional feedback and prompt refinement to improve emotional continuity and fidelity.

Details

Motivation: Existing approaches lack emotional feedback from generated images and fail to adaptively adjust emotional prompts based on image content, limiting emotional continuity and fidelity.

Method: Uses generation-understanding-feedback reinforcement paradigm with emotion-aware reward feedback and self-promotion textual feedback framework using fine-tuned large vision-language model.

Result: Effectively generates high-quality images with desired emotions, outperforming state-of-the-art methods on custom dataset.

Conclusion: The proposed EmoFeedback² approach successfully addresses emotional continuity and fidelity issues in continuous emotional image generation through LVLM-based feedback mechanisms.

Abstract: Continuous emotional image generation (C-EICG) is emerging rapidly due to its ability to produce images aligned with both user descriptions and continuous emotional values. However, existing approaches lack emotional feedback from generated images, limiting the control of emotional continuity. Additionally, their simple alignment between emotions and naively generated texts fails to adaptively adjust emotional prompts according to image content, leading to insufficient emotional fidelity. To address these concerns, we propose a novel generation-understanding-feedback reinforcement paradigm (EmoFeedback$^2$) for C-EICG, which exploits the reasoning capability of the fine-tuned large vision-language model (LVLM) to provide reward and textual feedback for generating high-quality images with continuous emotions. Specifically, we introduce an emotion-aware reward feedback strategy, where the LVLM evaluates the emotional values of generated images and computes the reward against target emotions, guiding the reinforcement fine-tuning of the generative model and enhancing the emotional continuity of images. Furthermore, we design a self-promotion textual feedback framework, in which the LVLM iteratively analyzes the emotional content of generated images and adaptively produces refinement suggestions for the next-round prompt, improving the emotional fidelity with fine-grained content. Extensive experimental results demonstrate that our approach effectively generates high-quality images with the desired emotions, outperforming existing state-of-the-art methods in our custom dataset. The code and dataset will be released soon.

[276] Point-Supervised Facial Expression Spotting with Gaussian-Based Instance-Adaptive Intensity Modeling

Yicheng Deng, Hideaki Hayashi, Hajime Nagahara

Main category: cs.CV

TL;DR: The paper proposes a point-supervised facial expression spotting framework that uses single timestamp annotations per instance, featuring Gaussian-based intensity modeling and a two-branch approach for expression detection and classification.

Details

Motivation: Existing facial expression spotting methods require costly temporal boundary annotations, so the authors aim to develop a more efficient approach using only point supervision (single timestamp per instance).

Method: A two-branch framework with: 1) Gaussian-based instance-adaptive intensity modeling (GIM) for soft pseudo-labeling, 2) Class-agnostic expression intensity branch, 3) Class-aware apex classification branch, and 4) Intensity-aware contrastive loss.

Result: Extensive experiments on SAMM-LV, CAS(ME)^2, and CAS(ME)^3 datasets demonstrate the effectiveness of the proposed framework for facial expression spotting with point supervision.

Conclusion: The proposed point-supervised framework successfully addresses facial expression spotting with minimal annotation requirements, achieving reliable performance through soft pseudo-labeling and dual-branch architecture.

Abstract: Automatic facial expression spotting, which aims to identify facial expression instances in untrimmed videos, is crucial for facial expression analysis. Existing methods primarily focus on fully-supervised learning and rely on costly, time-consuming temporal boundary annotations. In this paper, we investigate point-supervised facial expression spotting (P-FES), where only a single timestamp annotation per instance is required for training. We propose a unique two-branch framework for P-FES. First, to mitigate the limitation of hard pseudo-labeling, which often confuses neutral and expression frames with various intensities, we propose a Gaussian-based instance-adaptive intensity modeling (GIM) module to model instance-level expression intensity distribution for soft pseudo-labeling. By detecting the pseudo-apex frame around each point label, estimating the duration, and constructing an instance-level Gaussian distribution, GIM assigns soft pseudo-labels to expression frames for more reliable intensity supervision. The GIM module is incorporated into our framework to optimize the class-agnostic expression intensity branch. Second, we design a class-aware apex classification branch that distinguishes macro- and micro-expressions solely based on their pseudo-apex frames. During inference, the two branches work independently: the class-agnostic expression intensity branch generates expression proposals, while the class-aware apex-classification branch is responsible for macro- and micro-expression classification. Furthermore, we introduce an intensity-aware contrastive loss to enhance discriminative feature learning and suppress neutral noise by contrasting neutral frames with expression frames with various intensities. Extensive experiments on the SAMM-LV, CAS(ME)$^2$, and CAS(ME)$^3$ datasets demonstrate the effectiveness of our proposed framework.

[277] One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution

Yushun Fang, Yuxiang Chen, Shibo Yin, Qiang Hu, Jiangchao Yao, Ya Zhang, Xiaoyun Zhang, Yanfeng Wang

Main category: cs.CV

TL;DR: ODTSR is a one-step diffusion transformer for real-world image super-resolution that balances fidelity and controllability using a noise-hybrid visual stream design and fidelity-aware adversarial training.

Details

Motivation: Current diffusion-based real-world image super-resolution methods face a trade-off: multi-step methods have generative diversity but low fidelity, while one-step methods lack control flexibility due to fidelity-specific finetuning.

Method: Uses Qwen-Image based transformer with noise-hybrid visual stream (NVS) - one stream receives LQ images with adjustable noise, another with consistent noise. Employs fidelity-aware adversarial training (FAA) for one-step inference.

Result: Achieves state-of-the-art performance on generic Real-ISR and enables prompt controllability on challenging scenarios like Chinese character text image super-resolution without specific dataset training.

Conclusion: ODTSR successfully balances fidelity and controllability in real-world image super-resolution, demonstrating superior performance and flexibility across various applications including text image enhancement.

Abstract: Recent advances in diffusion-based real-world image super-resolution (Real-ISR) have demonstrated remarkable perceptual quality, yet the balance between fidelity and controllability remains a problem: multi-step diffusion-based methods suffer from generative diversity and randomness, resulting in low fidelity, while one-step methods lose control flexibility due to fidelity-specific finetuning. In this paper, we present ODTSR, a one-step diffusion transformer based on Qwen-Image that performs Real-ISR considering fidelity and controllability simultaneously: a newly introduced visual stream receives low-quality images (LQ) with adjustable noise (Control Noise), and the original visual stream receives LQs with consistent noise (Prior Noise), forming the Noise-hybrid Visual Stream (NVS) design. ODTSR further employs Fidelity-aware Adversarial Training (FAA) to enhance controllability and achieve one-step inference. Extensive experiments demonstrate that ODTSR not only achieves state-of-the-art (SOTA) performance on generic Real-ISR, but also enables prompt controllability on challenging scenarios such as real-world scene text image super-resolution (STISR) of Chinese characters without training on specific datasets. Codes are available at $\href{https://github.com/RedMediaTech/ODTSR}{\text{this url}}$.

[278] Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

Yolo Y. Tang, Daiki Shimada, Hang Hua, Chao Huang, Jing Bi, Rogerio Feris, Chenliang Xu

Main category: cs.CV

TL;DR: Video-R4 introduces visual rumination for text-rich video reasoning, using iterative frame selection, zooming, and re-encoding to achieve state-of-the-art performance on video QA tasks.

Details

Motivation: Current video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence from small, transient textual cues in videos.

Method: A multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via supervised fine-tuning and GRPO-based reinforcement learning, with datasets for practice and RL.

Result: Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and generalizes to multi-page document QA, slides QA, and generic video QA.

Conclusion: Iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning, enabling better understanding of text-rich videos through repeated inspection of critical regions.

Abstract: Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning. Project Page: https://yunlong10.github.io/Video-R4/

[279] Sequence-Adaptive Video Prediction in Continuous Streams using Diffusion Noise Optimization

Sina Mokhtarzadeh Azar, Emad Bahrami, Enrico Pallotta, Gianpiero Francesca, Radu Timofte, Juergen Gall

Main category: cs.CV

TL;DR: SAVi-DNO adapts pre-trained diffusion models to continuous video streams by optimizing diffusion noise during inference, improving video prediction without fine-tuning model parameters.

Details

Motivation: To leverage continuously arriving training samples in video streams to improve diffusion-based video prediction models, avoiding expensive model fine-tuning.

Method: Refines diffusion noise during inference while keeping model parameters frozen, allowing adaptive determination of suitable sampling noise for continuous video adaptation.

Result: Shows improved performance on FVD, SSIM, and PSNR metrics across Ego4D, OpenDV-YouTube, UCF-101, and SkyTimelapse datasets in continuous video settings.

Conclusion: SAVi-DNO effectively adapts diffusion models to continuous video streams through noise optimization, demonstrating practical video prediction enhancement without parameter updates.

Abstract: In this work, we investigate diffusion-based video prediction models, which forecast future video frames, for continuous video streams. In this context, the models observe continuously new training samples, and we aim to leverage this to improve their predictions. We thus propose an approach that continuously adapts a pre-trained diffusion model to a video stream. Since fine-tuning the parameters of a large diffusion model is too expensive, we refine the diffusion noise during inference while keeping the model parameters frozen, allowing the model to adaptively determine suitable sampling noise. We term the approach Sequence Adaptive Video Prediction with Diffusion Noise Optimization (SAVi-DNO). To validate our approach, we introduce a new evaluation setting on the Ego4D dataset, focusing on simultaneous adaptation and evaluation on long continuous videos. Empirical results demonstrate improved performance based on FVD, SSIM, and PSNR metrics on long videos of Ego4D and OpenDV-YouTube, as well as videos of UCF-101 and SkyTimelapse, showcasing SAVi-DNO’s effectiveness.

[280] DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection

Hai Ci, Ziheng Peng, Pei Yang, Yingxin Xuan, Mike Zheng Shou

Main category: cs.CV

TL;DR: DiffSeg30k is a 30k-image dataset with pixel-level annotations for detecting and localizing diffusion-based image edits, shifting AIGC detection from binary classification to semantic segmentation.

Details

Motivation: Existing AIGC detection benchmarks only classify entire images and overlook localization of diffusion-based edits, which enables realistic modification of local image regions making AI-generated content harder to detect.

Method: Created DiffSeg30k dataset with: 1) In-the-wild images from COCO; 2) Diverse diffusion models (8 SOTA models); 3) Multi-turn editing (up to 3 sequential edits); 4) VLM-based pipeline for automatic region identification and context-aware prompts covering additions, removals, and attribute changes.

Result: Segmentation models trained on DiffSeg30k outperform established forgery classifiers in whole-image classification of diffusion edits and show strong cross-generator generalization, though significant challenges remain in semantic segmentation tasks, especially regarding robustness to image distortions.

Conclusion: DiffSeg30k advances fine-grained localization of AI-generated content by demonstrating the promise and limitations of segmentation-based methods, enabling simultaneous localization of edits and identification of editing models.

Abstract: Diffusion-based editing enables realistic modification of local image regions, making AI-generated content harder to detect. Existing AIGC detection benchmarks focus on classifying entire images, overlooking the localization of diffusion-based edits. We introduce DiffSeg30k, a publicly available dataset of 30k diffusion-edited images with pixel-level annotations, designed to support fine-grained detection. DiffSeg30k features: 1) In-the-wild images–we collect images or image prompts from COCO to reflect real-world content diversity; 2) Diverse diffusion models–local edits using eight SOTA diffusion models; 3) Multi-turn editing–each image undergoes up to three sequential edits to mimic real-world sequential editing; and 4) Realistic editing scenarios–a vision-language model (VLM)-based pipeline automatically identifies meaningful regions and generates context-aware prompts covering additions, removals, and attribute changes. DiffSeg30k shifts AIGC detection from binary classification to semantic segmentation, enabling simultaneous localization of edits and identification of the editing models. We benchmark three baseline segmentation approaches, revealing significant challenges in semantic segmentation tasks, particularly concerning robustness to image distortions. Experiments also reveal that segmentation models, despite being trained for pixel-level localization, emerge as highly reliable whole-image classifiers of diffusion edits, outperforming established forgery classifiers while showing great potential in cross-generator generalization. We believe DiffSeg30k will advance research in fine-grained localization of AI-generated content by demonstrating the promise and limitations of segmentation-based methods. DiffSeg30k is released at: https://huggingface.co/datasets/Chaos2629/Diffseg30k

[281] ReMatch: Boosting Representation through Matching for Multimodal Retrieval

Qianying Liu, Xiao Liang, Zhiqiang Zhang, Zhongfei Qing, Fengfan Zhou, Yibo Chen, Xu Tang, Yao Hu, Paul Henderson

Main category: cs.CV

TL;DR: ReMatch is a framework that uses MLLMs for multimodal retrieval by training them end-to-end with a generative matching stage, achieving state-of-the-art results on MMEB with strong zero-shot generalization.

Details

Motivation: Previous approaches underutilized MLLMs' generative nature, compositional reasoning, and world knowledge by treating them as simple encoders rather than leveraging their full capabilities.

Method: Trains MLLMs end-to-end with chat-style generative matching that autoregressively decides relevance from multi-view inputs (raw data and projected embeddings), uses multiple learnable tokens for richer embeddings, and combines instance-wise discrimination with contrastive loss.

Result: Achieves new state-of-the-art on Massive Multimodal Embedding Benchmark (MMEB) with particularly strong zero-shot generalization results on five datasets.

Conclusion: ReMatch demonstrates robust and transferable multimodal retrieval by effectively leveraging MLLMs’ generative capabilities and compositional strengths through end-to-end training with generative matching.

Abstract: We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature, and under-utilising its compositional reasoning and world knowledge. We instead train the embedding MLLM end-to-end with a chat-style generative matching stage. The matching stage uses the same MLLM to autoregressively decide relevance from multi-view inputs, including both raw data and its own projected embeddings for each query and document. It provides instance-wise discrimination supervision that complements a standard contrastive loss, offering stronger gradients on hard negatives and preserving the compositional strengths of the original MLLM. To obtain semantically richer multimodal embeddings, we use multiple learnable tokens to augment each input, generating fine-grained contextual, mutually orthogonal embeddings with low inference cost. Leveraging our established high-performance baseline,we assemble the ideas mentioned above into a powerful training recipe and achieve a new state-of-the-art on the Massive Multimodal Embedding Benchmark (MMEB). Our experiments show particularly strong zero-shot generalization results on five datasets, highlighting the robustness and transferability of ReMatch.

[282] Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning

Liqin Luo, Guangyao Chen, Xiawu Zheng, Yongxing Dai, Yixiong Zou, Yonghong Tian

Main category: cs.CV

TL;DR: GroundingAgent is a zero-shot visual grounding framework that uses iterative reasoning with pretrained models to link text queries to image regions without task-specific fine-tuning, achieving 65.1% accuracy on benchmarks.

Details

Motivation: Existing visual grounding methods require extensive task-specific annotations and fine-tuning, limiting generalization to novel scenarios. The authors aim to create a framework that can perform visual grounding without fine-tuning.

Method: Uses iterative reasoning mechanism combining pretrained open-vocabulary object detectors, multimodal LLMs, and LLMs to progressively refine candidate regions through joint semantic and spatial analyses.

Result: Achieves 65.1% zero-shot grounding accuracy on RefCOCO benchmarks without fine-tuning. With MLLM-generated captions replaced by original queries, selection accuracy reaches ~90%, close to supervised performance.

Conclusion: GroundingAgent demonstrates strong zero-shot visual grounding capabilities, highlighting the importance of LLM reasoning. The framework offers interpretability through transparent reasoning steps.

Abstract: Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive task-specific annotations and fine-tuning, limiting their ability to generalize effectively to novel or out-of-distribution scenarios. To address these limitations, we introduce GroundingAgent, a novel agentic visual grounding framework that operates without any task-specific fine-tuning. GroundingAgent employs a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models (MLLMs), and large language models (LLMs) to progressively refine candidate regions through joint semantic and spatial analyses. Remarkably, GroundingAgent achieves an average zero-shot grounding accuracy of 65.1 % on widely-used benchmarks (RefCOCO, RefCOCO+, RefCOCOg), entirely without fine-tuning. Furthermore, by substituting MLLM-generated captions with the original query texts, the accuracy at the selection stage alone reaches approximately 90 %, closely matching supervised performance and underscoring the critical role of LLM reasoning capabilities. GroundingAgent also offers strong interpretability, transparently illustrating each reasoning step and providing clear insights into its decision-making process.

[283] Vision-Language Enhanced Foundation Model for Semi-supervised Medical Image Segmentation

Jiaqi Guo, Mingzhen Li, Hanyu Su, Santiago López, Lexiaozi Fan, Daniel Kim, Aggelos Katsaggelos

Main category: cs.CV

TL;DR: VESSA integrates vision-language models into semi-supervised medical image segmentation, using a two-stage approach with template-guided learning and dynamic interaction with student models to improve accuracy with limited annotations.

Details

Motivation: To reduce reliance on extensive expert annotations in medical image segmentation by leveraging vision-language models' generalization capabilities within semi-supervised learning frameworks.

Method: Two-stage approach: Stage 1 trains VESSA as reference-guided segmentation assistant using template bank; Stage 2 integrates VESSA into SSL framework for dynamic interaction with student model, using refined predictions as prompts to generate higher-quality pseudo-labels.

Result: Extensive experiments show VESSA-augmented SSL significantly enhances segmentation accuracy, outperforming state-of-the-art baselines under extremely limited annotation conditions across multiple datasets and domains.

Conclusion: VESSA successfully integrates foundation-level visual-semantic understanding into SSL frameworks, demonstrating strong performance in medical image segmentation with minimal labeled data.

Abstract: Semi-supervised learning (SSL) has emerged as an effective paradigm for medical image segmentation, reducing the reliance on extensive expert annotations. Meanwhile, vision-language models (VLMs) have demonstrated strong generalization and few-shot capabilities across diverse visual domains. In this work, we integrate VLM-based segmentation into semi-supervised medical image segmentation by introducing a Vision-Language Enhanced Semi-supervised Segmentation Assistant (VESSA) that incorporates foundation-level visual-semantic understanding into SSL frameworks. Our approach consists of two stages. In Stage 1, the VLM-enhanced segmentation foundation model VESSA is trained as a reference-guided segmentation assistant using a template bank containing gold-standard exemplars, simulating learning from limited labeled data. Given an input-template pair, VESSA performs visual feature matching to extract representative semantic and spatial cues from exemplar segmentations, generating structured prompts for a SAM2-inspired mask decoder to produce segmentation masks. In Stage 2, VESSA is integrated into a state-of-the-art SSL framework, enabling dynamic interaction with the student model: as student predictions become more refined, they are fed back to VESSA as prompts, allowing it to generate higher-quality pseudo-labels and stronger guidance. Extensive experiments across multiple segmentation datasets and domains show that VESSA-augmented SSL significantly enhances segmentation accuracy, outperforming state-of-the-art baselines under extremely limited annotation conditions.

[284] Scale Where It Matters: Training-Free Localized Scaling for Diffusion Models

Qin Ren, Yufei Wang, Lanqing Guo, Wen Zhang, Zhiwen Fan, Chenyu You

Main category: cs.CV

TL;DR: LoTTS introduces localized test-time scaling for diffusion models, adaptively resampling only defective regions while preserving high-quality areas, reducing computation by 2-4x while improving image quality.

Details

Motivation: Existing test-time scaling methods operate at full-image level, wasting computation on satisfactory regions and inadequately correcting localized defects. Image quality is spatially heterogeneous, requiring targeted improvements.

Method: Uses contrast between cross- and self-attention signals under quality-aware prompts to identify defective regions, then refines them into coherent masks. Perturbs only defective regions and denoises them locally to maintain global consistency.

Result: Achieves state-of-the-art performance on SD2.1, SDXL, and FLUX models, consistently improving both local quality and global fidelity while reducing GPU cost by 2-4x compared to Best-of-N sampling.

Conclusion: Localized test-time scaling is a promising new direction for scaling diffusion models at inference time, enabling efficient quality improvements through targeted region resampling.

Abstract: Diffusion models have become the dominant paradigm in text-to-image generation, and test-time scaling (TTS) further improves quality by allocating more computation during inference. However, existing TTS methods operate at the full-image level, overlooking the fact that image quality is often spatially heterogeneous. This leads to unnecessary computation on already satisfactory regions and insufficient correction of localized defects. In this paper, we explore a new direction - Localized TTS - that adaptively resamples defective regions while preserving high-quality regions, thereby substantially reducing the search space. This paradigm poses two central challenges: accurately localizing defects and maintaining global consistency. We propose LoTTS, the first fully training-free framework for localized TTS. For defect localization, LoTTS contrasts cross- and self-attention signals under quality-aware prompts (e.g., high-quality vs. low-quality) to identify defective regions, and then refines them into coherent masks. For consistency, LoTTS perturbs only defective regions and denoises them locally, ensuring that corrections remain confined while the rest of the image remains undisturbed. Extensive experiments on SD2.1, SDXL, and FLUX demonstrate that LoTTS achieves state-of-the-art performance: it consistently improves both local quality and global fidelity, while reducing GPU cost by 2-4x compared to Best-of-N sampling. These findings establish localized TTS as a promising new direction for scaling diffusion models at inference time.

[285] GFT-GCN: Privacy-Preserving 3D Face Mesh Recognition with Spectral Diffusion

Hichem Felouat, Hanrui Wang, Isao Echizen

Main category: cs.CV

TL;DR: GFT-GCN is a privacy-preserving 3D face recognition framework that uses spectral graph learning and diffusion-based template protection to achieve secure authentication while maintaining high accuracy.

Details

Motivation: 3D face recognition provides robust biometric authentication resistant to illumination, pose changes, and spoofing attacks, but protecting stored biometric templates is critical for security applications.

Method: Combines Graph Fourier Transform (GFT) and Graph Convolutional Networks (GCN) to extract spectral features from 3D face meshes, then applies spectral diffusion for template protection in a client-server architecture.

Result: Experiments on BU-3DFE and FaceScape datasets show high recognition accuracy and strong resistance to reconstruction attacks.

Conclusion: GFT-GCN effectively balances privacy and performance, offering a practical solution for secure 3D face authentication with irreversible, renewable, and unlinkable templates.

Abstract: 3D face recognition offers a robust biometric solution by capturing facial geometry, providing resilience to variations in illumination, pose changes, and presentation attacks. Its strong spoof resistance makes it suitable for high-security applications, but protecting stored biometric templates remains critical. We present GFT-GCN, a privacy-preserving 3D face recognition framework that combines spectral graph learning with diffusion-based template protection. Our approach integrates the Graph Fourier Transform (GFT) and Graph Convolutional Networks (GCN) to extract compact, discriminative spectral features from 3D face meshes. To secure these features, we introduce a spectral diffusion mechanism that produces irreversible, renewable, and unlinkable templates. A lightweight client-server architecture ensures that raw biometric data never leaves the client device. Experiments on the BU-3DFE and FaceScape datasets demonstrate high recognition accuracy and strong resistance to reconstruction attacks. Results show that GFT-GCN effectively balances privacy and performance, offering a practical solution for secure 3D face authentication.

[286] Restora-Flow: Mask-Guided Image Restoration with Flow Matching

Arnela Hadzic, Franz Thaler, Lea Bogensperger, Simon Johannes Joham, Martin Urschler

Main category: cs.CV

TL;DR: Restora-Flow is a training-free flow matching method for image restoration that uses degradation masks and trajectory correction to achieve fast, high-quality results in tasks like inpainting, super-resolution, and denoising.

Details

Motivation: Flow matching offers faster sampling than diffusion models but current flow-based restoration methods still suffer from long processing times or produce over-smoothed results.

Method: Training-free approach that guides flow matching sampling with degradation masks and incorporates trajectory correction mechanism to enforce consistency with degraded inputs.

Result: Superior perceptual quality and processing time compared to diffusion and flow matching-based reference methods on both natural and medical datasets.

Conclusion: Restora-Flow effectively addresses speed and quality limitations in flow-based image restoration through mask guidance and trajectory correction.

Abstract: Flow matching has emerged as a promising generative approach that addresses the lengthy sampling times associated with state-of-the-art diffusion models and enables a more flexible trajectory design, while maintaining high-quality image generation. This capability makes it suitable as a generative prior for image restoration tasks. Although current methods leveraging flow models have shown promising results in restoration, some still suffer from long processing times or produce over-smoothed results. To address these challenges, we introduce Restora-Flow, a training-free method that guides flow matching sampling by a degradation mask and incorporates a trajectory correction mechanism to enforce consistency with degraded inputs. We evaluate our approach on both natural and medical datasets across several image restoration tasks involving a mask-based degradation, i.e., inpainting, super-resolution and denoising. We show superior perceptual quality and processing time compared to diffusion and flow matching-based reference methods.

[287] SKEL-CF: Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery

Da Li, Jiping Jin, Xuanlong Yu, Wei Liu, Xiaodong Cun, Kai Chen, Rui Fan, Jiangang Kong, Xi Shen

Main category: cs.CV

TL;DR: SKEL-CF is a coarse-to-fine framework for estimating anatomically accurate SKEL human model parameters, using transformer-based encoder-decoder architecture with explicit camera modeling to overcome perspective ambiguities and limited training data.

Details

Motivation: Existing parametric human models like SMPL have simplified kinematics that limit biomechanical realism, while the anatomically accurate SKEL model faces challenges in parameter estimation due to limited data, perspective ambiguities, and complex human articulation.

Method: Transformer-based encoder-decoder architecture where encoder predicts coarse camera and SKEL parameters, and decoder progressively refines them. Created SKEL-aligned dataset 4DHuman-SKEL from existing SMPL data, and explicitly incorporated camera modeling to address depth/scale ambiguities.

Result: Achieved 85.0 MPJPE / 51.4 PA-MPJPE on MOYO dataset, significantly outperforming previous SKEL-based state-of-the-art HSMR (104.5 / 79.6).

Conclusion: SKEL-CF establishes a scalable and anatomically faithful framework for human motion analysis, bridging computer vision and biomechanics with superior performance over existing methods.

Abstract: Parametric 3D human models such as SMPL have driven significant advances in human pose and shape estimation, yet their simplified kinematics limit biomechanical realism. The recently proposed SKEL model addresses this limitation by re-rigging SMPL with an anatomically accurate skeleton. However, estimating SKEL parameters directly remains challenging due to limited training data, perspective ambiguities, and the inherent complexity of human articulation. We introduce SKEL-CF, a coarse-to-fine framework for SKEL parameter estimation. SKEL-CF employs a transformer-based encoder-decoder architecture, where the encoder predicts coarse camera and SKEL parameters, and the decoder progressively refines them in successive layers. To ensure anatomically consistent supervision, we convert the existing SMPL-based dataset 4DHuman into a SKEL-aligned version, 4DHuman-SKEL, providing high-quality training data for SKEL estimation. In addition, to mitigate depth and scale ambiguities, we explicitly incorporate camera modeling into the SKEL-CF pipeline and demonstrate its importance across diverse viewpoints. Extensive experiments validate the effectiveness of the proposed design. On the challenging MOYO dataset, SKEL-CF achieves 85.0 MPJPE / 51.4 PA-MPJPE, significantly outperforming the previous SKEL-based state-of-the-art HSMR (104.5 / 79.6). These results establish SKEL-CF as a scalable and anatomically faithful framework for human motion analysis, bridging the gap between computer vision and biomechanics. Our implementation is available on the project page: https://pokerman8.github.io/SKEL-CF/.

[288] CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation

Shilei Cao, Ziyang Gong, Hehai Lin, Yang Liu, Jiashun Cheng, Xiaoxing Hu, Haoyuan Liang, Guowen Li, Chengwei Qin, Hong Cheng, Xue Yang, Juepeng Zheng, Haohuan Fu

Main category: cs.CV

TL;DR: CrossEarth-Gate is a PEFT method for remote sensing that uses a toolbox of spatial, semantic, and frequency modules with Fisher-guided adaptive selection to handle multifaceted domain gaps in Earth observation tasks.

Details

Motivation: Existing PEFT methods fail in large-scale Earth observation due to inability to handle multifaceted domain gaps (spatial, semantic, frequency shifts) in remote sensing data.

Method: Establishes RS module toolbox with spatial, semantic, frequency modules and uses Fisher-guided adaptive selection to dynamically activate critical modules at appropriate layers based on task-specific gradient flow.

Result: Achieves state-of-the-art performance across 16 cross-domain benchmarks for RS semantic segmentation, demonstrating efficacy and generalizability.

Conclusion: CrossEarth-Gate effectively handles multifaceted domain gaps in remote sensing through adaptive module selection, providing efficient and effective adaptation for Earth observation tasks.

Abstract: In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (\eg, spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher-guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module’s importance by measuring its contribution to the task-specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth-Gate achieves state-of-the-art performance across 16 cross-domain benchmarks for RS semantic segmentation. The code of the work will be released.

[289] Thinking in 360°: Humanoid Visual Search in the Wild

Heyang Yu, Yinan Han, Xiangyu Zhang, Baiqiao Yin, Bowen Chang, Xiangyu Han, Xinhao Liu, Jing Zhang, Marco Pavone, Chen Feng, Saining Xie, Yiming Li

Main category: cs.CV

TL;DR: Proposes humanoid visual search where agents actively rotate their heads in 360° panoramic environments, introduces H* Bench benchmark with challenging real-world scenes, and shows significant performance improvements through post-training techniques.

Details

Motivation: Prior visual search approaches are limited to static images and neglect physical embodiment and 3D world interaction. The goal is to develop embodied visual search agents as efficient as humans while bypassing real-world hardware constraints.

Method: Humanoid visual search with agents rotating heads in 360° panoramic images, using post-training techniques to enhance open-source Qwen2.5-VL model on the new H* Bench benchmark featuring challenging real-world scenes.

Result: Top-tier proprietary models achieve only ~30% success. Post-training improved Qwen2.5-VL’s success rate by over threefold: object search from 14.83% to 47.38%, path search from 6.44% to 24.94%. Path search remains more challenging due to spatial commonsense demands.

Conclusion: Shows promising path forward but quantifies immense remaining challenge in building MLLM agents for seamless integration into everyday human life, with path search revealing inherent difficulty requiring sophisticated spatial reasoning.

Abstract: Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360°. However, prior approaches to visual search are limited to a static image, neglecting the physical embodiment and its interaction with the 3D world. How can we develop embodied visual search agents as efficient as humans while bypassing the constraints imposed by real-world hardware? To this end, we propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360° panoramic image. To study visual search in visually-crowded real-world scenarios, we build H* Bench, a new benchmark that moves beyond household scenes to challenging in-the-wild scenes that necessitate advanced visual-spatial reasoning capabilities, such as transportation hubs, large-scale retail spaces, urban streets, and public institutions. Our experiments first reveal that even top-tier proprietary models falter, achieving only ~30% success in object and path search. We then use post-training techniques to enhance the open-source Qwen2.5-VL, increasing its success rate by over threefold for both object search (14.83% to 47.38%) and path search (6.44% to 24.94%). Notably, the lower ceiling of path search reveals its inherent difficulty, which we attribute to the demand for sophisticated spatial commonsense. Our results not only show a promising path forward but also quantify the immense challenge that remains in building MLLM agents that can be seamlessly integrated into everyday human life.

[290] VGGTFace: Topologically Consistent Facial Geometry Reconstruction in the Wild

Xin Ming, Yuxuan Han, Tianyu Huang, Feng Xu

Main category: cs.CV

TL;DR: VGGTFace is an automatic method that uses the VGGT 3D foundation model to reconstruct topologically consistent facial geometry from multi-view images, achieving state-of-the-art results in 10 seconds for 16 views.

Details

Motivation: Existing facial reconstruction methods require manual effort, lack generalization to in-the-wild data, or are limited by 3D Morphable Models' expressiveness.

Method: Leverages VGGT for generalization and expressiveness, augments it with Pixel3DMM to inject topology via UV values, and uses Topology-Aware Bundle Adjustment with Laplacian energy to fuse point clouds.

Result: Achieves high-quality reconstruction in 10 seconds for 16 views on RTX 4090, with state-of-the-art performance on benchmarks and impressive generalization to in-the-wild data.

Conclusion: VGGTFace successfully addresses limitations of existing methods by combining VGGT’s generalization with topology injection, enabling fast, automatic, and high-quality facial reconstruction from everyday multi-view images.

Abstract: Reconstructing topologically consistent facial geometry is crucial for the digital avatar creation pipelines. Existing methods either require tedious manual efforts, lack generalization to in-the-wild data, or are constrained by the limited expressiveness of 3D Morphable Models. To address these limitations, we propose VGGTFace, an automatic approach that innovatively applies the 3D foundation model, i.e. VGGT, for topologically consistent facial geometry reconstruction from in-the-wild multi-view images captured by everyday users. Our key insight is that, by leveraging VGGT, our method naturally inherits strong generalization ability and expressive power from its large-scale training and point map representation. However, it is unclear how to reconstruct a topologically consistent mesh from VGGT, as the topology information is missing in its prediction. To this end, we augment VGGT with Pixel3DMM for injecting topology information via pixel-aligned UV values. In this manner, we convert the pixel-aligned point map of VGGT to a point cloud with topology. Tailored to this point cloud with known topology, we propose a novel Topology-Aware Bundle Adjustment strategy to fuse them, where we construct a Laplacian energy for the Bundle Adjustment objective. Our method achieves high-quality reconstruction in 10 seconds for 16 views on a single NVIDIA RTX 4090. Experiments demonstrate state-of-the-art results on benchmarks and impressive generalization to in-the-wild data. Code is available at https://github.com/grignarder/vggtface.

[291] BRIC: Bridging Kinematic Plans and Physical Control at Test Time

Dohun Lim, Minji Kim, Jaewoon Lim, Sungchan Kim

Main category: cs.CV

TL;DR: BRIC is a test-time adaptation framework that combines diffusion-based motion planning with RL-based physics controllers to enable long-term, physically plausible human motion generation by resolving execution drift and preserving pre-trained skills.

Details

Motivation: Diffusion models generate diverse motions but often produce physically implausible outputs, leading to execution drift during simulation when combined with physics controllers.

Method: BRIC dynamically adapts physics controllers to noisy motion plans at test time while preserving pre-trained skills, and uses lightweight test-time guidance to steer diffusion models without parameter updates.

Result: BRIC achieves state-of-the-art performance on long-term tasks including motion composition, obstacle avoidance, and human-scene interaction across diverse environments.

Conclusion: The combination of test-time adaptation strategies in BRIC enables consistent and physically plausible long-term human motion execution in an effective and efficient manner.

Abstract: We propose BRIC, a novel test-time adaptation (TTA) framework that enables long-term human motion generation by resolving execution discrepancies between diffusion-based kinematic motion planners and reinforcement learning-based physics controllers. While diffusion models can generate diverse and expressive motions conditioned on text and scene context, they often produce physically implausible outputs, leading to execution drift during simulation. To address this, BRIC dynamically adapts the physics controller to noisy motion plans at test time, while preserving pre-trained skills via a loss function that mitigates catastrophic forgetting. In addition, BRIC introduces a lightweight test-time guidance mechanism that steers the diffusion model in the signal space without updating its parameters. By combining both adaptation strategies, BRIC ensures consistent and physically plausible long-term executions across diverse environments in an effective and efficient manner. We validate the effectiveness of BRIC on a variety of long-term tasks, including motion composition, obstacle avoidance, and human-scene interaction, achieving state-of-the-art performance across all tasks.

[292] STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows

Jiatao Gu, Ying Shen, Tianrong Chen, Laurent Dinh, Yuyang Wang, Miguel Angel Bautista, David Berthelot, Josh Susskind, Shuangfei Zhai

Main category: cs.CV

TL;DR: STARFlow-V is a normalizing flow-based video generator that achieves high-quality autoregressive video generation with end-to-end learning, robust causal prediction, and native likelihood estimation, outperforming diffusion-based models in practical sampling throughput.

Details

Motivation: To address the limitations of diffusion-based models in video generation, particularly high computational costs and error accumulation over time, by revisiting normalizing flows which offer benefits like end-to-end learning and native likelihood estimation.

Method: Uses a spatiotemporal latent space with global-local architecture that restricts causal dependencies to global latent space while preserving local within-frame interactions. Introduces flow-score matching for improved consistency and video-aware Jacobi iteration for efficient sampling.

Result: Achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines, supporting text-to-video, image-to-video, and video-to-video generation tasks.

Conclusion: Demonstrates that normalizing flows are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models.

Abstract: Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at https://github.com/apple/ml-starflow.

cs.AI

[293] Minimizing Hyperbolic Embedding Distortion with LLM-Guided Hierarchy Restructuring

Melika Ayoughi, Pascal Mettes, Paul Groth

Main category: cs.AI

TL;DR: LLMs can automatically restructure hierarchies to improve hyperbolic embedding quality by increasing branching factor and enforcing single inheritance, leading to better embeddings across multiple metrics.

Details

Motivation: Hyperbolic embeddings work best with high branching factor and single inheritance hierarchies, but real-world hierarchies often don't meet these criteria. This paper explores using LLMs to automatically restructure existing hierarchies to optimize them for hyperbolic embeddings.

Method: Proposed a prompt-based approach using Large Language Models to transform existing hierarchies, guided by known desiderata for hyperbolic embeddings (high branching factor, single inheritance). Tested on 16 diverse hierarchies.

Result: LLM-restructured hierarchies consistently yielded higher-quality hyperbolic embeddings across several standard embedding quality metrics. The approach also enables explainable reorganizations with justifications.

Conclusion: LLMs can effectively restructure hierarchies to meet hyperbolic embedding desiderata, improving embedding quality while providing explainable reorganizations that assist knowledge engineers.

Abstract: Hyperbolic geometry is an effective geometry for embedding hierarchical data structures. Hyperbolic learning has therefore become increasingly prominent in machine learning applications where data is hierarchically organized or governed by hierarchical semantics, ranging from recommendation systems to computer vision. The quality of hyperbolic embeddings is tightly coupled to the structure of the input hierarchy, which is often derived from knowledge graphs or ontologies. Recent work has uncovered that for an optimal hyperbolic embedding, a high branching factor and single inheritance are key, while embedding algorithms are robust to imbalance and hierarchy size. To assist knowledge engineers in reorganizing hierarchical knowledge, this paper investigates whether Large Language Models (LLMs) have the ability to automatically restructure hierarchies to meet these criteria. We propose a prompt-based approach to transform existing hierarchies using LLMs, guided by known desiderata for hyperbolic embeddings. Experiments on 16 diverse hierarchies show that LLM-restructured hierarchies consistently yield higher-quality hyperbolic embeddings across several standard embedding quality metrics. Moreover, we show how LLM-guided hierarchy restructuring enables explainable reorganizations, providing justifications to knowledge engineers.

[294] AssurAI: Experience with Constructing Korean Socio-cultural Datasets to Discover Potential Risks of Generative AI

Chae-Gyun Lim, Seung-Ho Han, EunYoung Byun, Jeongyun Han, Soohyun Cho, Eojin Joo, Heehyeon Kim, Sieun Kim, Juhoon Lee, Hyunsoo Lee, Dongkun Lee, Jonghwan Hyeon, Yechan Hwang, Young-Jun Lee, Kyeongryul Lee, Minhyeong An, Hyunjun Ahn, Jeongwoo Son, Junho Park, Donggyu Yoon, Taehyung Kim, Jeemin Kim, Dasom Choi, Kwangyoung Lee, Hyunseung Lim, Yeohyun Jung, Jongok Hong, Sooyohn Nam, Joonyoung Park, Sungmin Na, Yubin Choi, Jeanne Choi, Yoojin Hong, Sueun Jang, Youngseok Seo, Somin Park, Seoungung Jo, Wonhye Chae, Yeeun Jo, Eunyoung Kim, Joyce Jiyoung Whang, HwaJung Hong, Joseph Seering, Uichin Lee, Juho Kim, Sunna Choi, Seokyeon Ko, Taeho Kim, Kyunghoon Kim, Myungsik Ha, So Jung Lee, Jemin Hwang, JoonHo Kwak, Ho-Jin Choi

Main category: cs.AI

TL;DR: AssurAI is a quality-controlled Korean multimodal dataset for AI safety evaluation, addressing the gap in non-English, socio-cultural contexts with 11,480 instances across text, image, video, and audio.

Details

Motivation: Current safety datasets are predominantly English-centric and fail to capture specific risks in non-English contexts like Korean, while also being limited to text modality only.

Method: Defined 35 AI risk factors through multidisciplinary expert adaptation, constructed dataset using two-phase approach (expert-led seeding + crowdsourced scaling), with triple independent annotation and iterative expert red-teaming for quality control.

Result: Created AssurAI dataset with 11,480 multimodal instances, validated through pilot study showing effectiveness in assessing safety of recent LLMs.

Conclusion: AssurAI facilitates development of safer generative AI systems for Korean community by providing comprehensive multimodal safety evaluation dataset.

Abstract: The rapid evolution of generative AI necessitates robust safety evaluations. However, current safety datasets are predominantly English-centric, failing to capture specific risks in non-English, socio-cultural contexts such as Korean, and are often limited to the text modality. To address this gap, we introduce AssurAI, a new quality-controlled Korean multimodal dataset for evaluating the safety of generative AI. First, we define a taxonomy of 35 distinct AI risk factors, adapted from established frameworks by a multidisciplinary expert group to cover both universal harms and relevance to the Korean socio-cultural context. Second, leveraging this taxonomy, we construct and release AssurAI, a large-scale Korean multimodal dataset comprising 11,480 instances across text, image, video, and audio. Third, we apply the rigorous quality control process used to ensure data integrity, featuring a two-phase construction (i.e., expert-led seeding and crowdsourced scaling), triple independent annotation, and an iterative expert red-teaming loop. Our pilot study validates AssurAI’s effectiveness in assessing the safety of recent LLMs. We release AssurAI to the public to facilitate the development of safer and more reliable generative AI systems for the Korean community.

[295] $A^2Flow:$ Automating Agentic Workflow Generation via Self-Adaptive Abstraction Operators

Mingming Zhao, Xiaokang Wei, Yuanqi Shao, Kaiwen Zhou, Lin Yang, Siwei Rao, Junhui Zhan, Zhitang Chen

Main category: cs.AI

TL;DR: A²Flow is an automated framework for generating agentic workflows using self-adaptive abstraction operators, eliminating the need for manual operator predefinition and achieving significant performance improvements over existing methods.

Details

Motivation: Existing methods for agentic workflow design heavily rely on manually predefined operators, which limits generalization and scalability in automated workflow generation.

Method: Three-stage operator extraction: 1) Case-based initial operator generation using expert demonstrations and LLM reasoning, 2) Operator clustering and preliminary abstraction across tasks, 3) Deep extraction for abstract execution operators using chain-of-thought prompting and multi-path reasoning. Enhanced with operator memory mechanism for workflow search.

Result: Achieves 2.4% and 19.3% average performance improvement on general and embodied benchmarks, reduces resource usage by 37% compared to state-of-the-art baselines.

Conclusion: A²Flow provides a fully automated approach for agentic workflow generation that outperforms existing methods while being more resource-efficient, demonstrating the effectiveness of self-adaptive abstraction operators.

Abstract: Large language models (LLMs) have shown strong potential in automating the design of agentic workflows. However, existing methods still rely heavily on manually predefined operators, limiting generalization and scalability. To address this issue, we propose $A^2Flow$, a fully automated framework for agentic workflow generation based on self-adaptive abstraction operators. $A^2Flow$ employs a three-stage operator extraction process: 1) Case-based Initial Operator Generation: leveraging expert demonstrations and LLM reasoning to generate case-specific operators; 2) Operator Clustering and Preliminary Abstraction: grouping similar operators across tasks to form preliminary abstractions; and 3) Deep Extraction for Abstract Execution Operators: applying long chain-of-thought prompting and multi-path reasoning to derive compact and generalizable execution operators. These operators serve as reusable building blocks for workflow construction without manual predefinition. Furthermore, we enhance node-level workflow search with an operator memory mechanism, which retains historical outputs to enrich context and improve decision-making. Experiments on general and embodied benchmarks show that $A^2Flow$ achieves a 2.4% and 19.3% average performance improvement and reduces resource usage by 37% over state-of-the-art baselines. Homepage:https://github.com/pandawei-ele/A2FLOW

[296] Reasoning With a Star: A Heliophysics Dataset and Benchmark for Agentic Scientific Reasoning

Kevin Lee, Russell Spiewak, James Walsh

Main category: cs.AI

TL;DR: The paper introduces Reasoning With a Star, a heliophysics reasoning dataset, and benchmarks multi-agent approaches that outperform direct prompting on deductive reasoning tasks.

Details

Motivation: Heliophysics reasoning requires more than factual recall - it needs physical assumptions, unit consistency, and scientific formatting, which current approaches struggle with.

Method: Created a dataset from NASA/UCAR summer school problems with Q&A structure, then benchmarked single-shot baseline and four multi-agent patterns using programmatic grading with unit-aware tolerance and schema validation.

Result: Multi-agent workflows based on systems engineering principles outperformed direct prompting, especially for deductive reasoning problems rather than pure inductive recall.

Conclusion: Decomposing reasoning workflows through coordinated multi-agent approaches is more effective for complex heliophysics problems requiring deductive reasoning.

Abstract: Scientific reasoning through Large Language Models in heliophysics involves more than just recalling facts: it requires incorporating physical assumptions, maintaining consistent units, and providing clear scientific formats through coordinated approaches. To address these challenges, we present Reasoning With a Star, a newly contributed heliophysics dataset applicable to reasoning; we also provide an initial benchmarking approach. Our data are constructed from National Aeronautics and Space Administration & University Corporation for Atmospheric Research Living With a Star summer school problem sets and compiled into a readily consumable question-and-answer structure with question contexts, reasoning steps, expected answer type, ground-truth targets, format hints, and metadata. A programmatic grader checks the predictions using unit-aware numerical tolerance, symbolic equivalence, and schema validation. We benchmark a single-shot baseline and four multi-agent patterns, finding that decomposing workflows through systems engineering principles outperforms direct prompting on problems requiring deductive reasoning rather than pure inductive recall.

[297] A Brief History of Digital Twin Technology

Yunqi Zhang, Kuangyu Shi, Biao Li

Main category: cs.AI

TL;DR: Digital twin technology, originating from NASA spacecraft simulations, has evolved into healthcare applications that create virtual patient counterparts using real-time data for diagnosis, treatment planning, and drug development, with challenges in interoperability and privacy being addressed through AI and regulatory solutions.

Details

Motivation: To transform healthcare from reactive treatment to predictive, preventive, and personalized medicine by leveraging digital twin technology that creates dynamic virtual counterparts of patients using real-time data streams.

Method: Integration of imaging, biosensors, and computational models to generate patient-specific simulations that support diagnosis, treatment planning, and drug development through applications like cardiac digital twins, oncology digital twins, and pharmacological digital twins.

Result: Successful applications in predicting arrhythmia treatment outcomes, tracking tumor progression, optimizing radiotherapy, and accelerating drug discovery, though limited by interoperability, data privacy, and model fidelity challenges.

Conclusion: Future advances in multi-organ digital twins, genomics integration, and ethical governance are essential to fully realize digital twin’s potential in shifting healthcare toward predictive, preventive, and personalized medicine, with emerging solutions like explainable AI and federated learning offering promising pathways forward.

Abstract: Emerging from NASA’s spacecraft simulations in the 1960s, digital twin technology has advanced through industrial adoption to spark a healthcare transformation. A digital twin is a dynamic, data-driven virtual counterpart of a physical system, continuously updated through real-time data streams and capable of bidirectional interaction. In medicine, digital twin integrates imaging, biosensors, and computational models to generate patient-specific simulations that support diagnosis, treatment planning, and drug development. Representative applications include cardiac digital twin for predicting arrhythmia treatment outcomes, oncology digital twin for tracking tumor progression and optimizing radiotherapy, and pharmacological digital twin for accelerating drug discovery. Despite rapid progress, major challenges, including interoperability, data privacy, and model fidelity, continue to limit widespread clinical integration. Emerging solutions such as explainable AI, federated learning, and harmonized regulatory frameworks offer promising pathways forward. Looking ahead, advances in multi-organ digital twin, genomics integration, and ethical governance will be essential to ensure that digital twin shifts healthcare from reactive treatment to predictive, preventive, and truly personalized medicine.

[298] Paraconsistent-Lib: an intuitive PAL2v algorithm Python Library

Arnaldo de Carvalho Junior, Diego Oliveira da Cruz, Bruno da Silva Alves, Fernando da Silva Paulo Junior, João Inacio da Silva Filho

Main category: cs.AI

TL;DR: Paraconsistent-Lib is an open-source Python library for implementing PAL2v algorithms in reasoning and decision-making systems, providing three result types and enabling various paraconsistent algorithms with reduced complexity.

Details

Motivation: To create an easy-to-use, general-purpose library for PAL2v standard calculations that simplifies implementation of paraconsistent algorithms in reasoning and decision-making systems.

Method: Developed as a Python library offering three types of outputs: paraconsistent analysis in 12 classical lattice PAL2v regions, paraconsistent analysis node outputs, and decision outputs. Supports algorithms like Para-analyzer, ParaExtrCTX, PAL2v Filter, PANnet, and PNN.

Result: Successfully created Paraconsistent-Lib that reduces complexity, code size, and bugs in implementing PAL2v algorithms. The library is stable and actively developed based on user feedback from GitHub.

Conclusion: Paraconsistent-Lib provides a practical tool for building PAL2v-based reasoning systems and continues to evolve through community-driven development.

Abstract: This paper introduces Paraconsistent-Lib, an open-source, easy-to-use Python library for building PAL2v algorithms in reasoning and decision-making systems. Paraconsistent-Lib is designed as a general-purpose library of PAL2v standard calculations, presenting three types of results: paraconsistent analysis in one of the 12 classical lattice PAL2v regions, paraconsistent analysis node (PAN) outputs, and a decision output. With Paraconsistent-Lib, well-known PAL2v algorithms such as Para-analyzer, ParaExtrCTX, PAL2v Filter, paraconsistent analysis network (PANnet), and paraconsistent neural network (PNN) can be written in stand-alone or network form, reducing complexity, code size, and bugs, as two examples presented in this paper. Given its stable state, Paraconsistent-Lib is an active development to respond to user-required features and enhancements received on GitHub.

[299] Prune4Web: DOM Tree Pruning Programming for Web Agent

Jiayuan Zhang, Kaiquan Chen, Zhihao Lu, Enshen Zhou, Qian Yu, Jing Zhang

Main category: cs.AI

TL;DR: Prune4Web introduces programmatic DOM pruning to handle large webpage structures efficiently, shifting from LLM reading to executable scoring scripts for 25-50x element reduction and 88.28% grounding accuracy.

Details

Motivation: Existing web automation struggles with massive DOM structures (10k-100k tokens), relying on crude truncation or inefficient heuristics that lose critical information and fail to balance precision with scalability.

Method: DOM Tree Pruning Programming where LLM generates Python scoring scripts to filter elements based on semantic cues from sub-tasks, with unified training of Planner, Programmatic Filter, and Grounder using two-turn dialogue strategy.

Result: Achieves 25-50x reduction in candidate elements, dramatically improves low-level grounding accuracy from 46.8% to 88.28%, demonstrating state-of-the-art performance in web automation.

Conclusion: Prune4Web effectively addresses DOM scalability challenges through programmatic pruning, enabling precise action localization while mitigating attention dilution in real-world web automation.

Abstract: Web automation employs intelligent agents to execute high-level tasks by mimicking human interactions with web interfaces. Despite the capabilities of recent Large Language Model (LLM)-based web agents, navigating complex, real-world webpages efficiently remains a significant hurdle due to the prohibitively large size of Document Object Model (DOM) structures, often ranging from 10,000 to 100,000 tokens. Existing strategies typically rely on crude DOM truncation – risking the loss of critical information – or employ inefficient heuristics and separate ranking models, failing to achieve an optimal balance between precision and scalability. To address these challenges, we introduce Prune4Web, a novel paradigm that shifts DOM processing from resource-intensive LLM reading to efficient programmatic pruning. Central to our approach is DOM Tree Pruning Programming, where an LLM generates executable Python scoring scripts to dynamically filter DOM elements based on semantic cues from decomposed sub-tasks. This mechanism eliminates the need for LLMs to ingest raw, massive DOMs, instead delegating traversal and scoring to lightweight, interpretable programs. This methodology achieves a 25x to 50x reduction in candidate elements for grounding, thereby facilitating precise action localization while mitigating attention dilution. Furthermore, we propose a specialized data annotation pipeline and a two-turn dialogue training strategy that jointly optimizes the Planner, Programmatic Filter, and Grounder within a unified framework. Extensive experiments demonstrate state-of-the-art performance. Notably, on our low-level grounding task, Prune4Web dramatically improves accuracy from 46.8% to 88.28%, underscoring its efficacy in real-world web automation.

[300] Cross Domain Evaluation of Multimodal Chain-of-Thought Reasoning of different datasets into the Amazon CoT Framework

Nitya Tiwari, Parv Maheshwari, Vidisha Agarwal

Main category: cs.AI

TL;DR: Analysis of Multimodal Chain-of-Thought reasoning across diverse domains (A-OKVQA, OKVQA, ChartQA) reveals variable effectiveness, with vision features reducing hallucinations but commonsense reasoning remaining challenging.

Details

Motivation: To evaluate the generalizability of Multimodal-CoT reasoning beyond scientific domains and assess its effectiveness on tasks requiring broad commonsense and world knowledge.

Method: Implemented two-stage framework separating rationale generation from answer inference, using gated fusion mechanism with T5-based language models to integrate vision features, with systematic ablation studies.

Result: Vision integration significantly reduces hallucination in rationale generation, but CoT effectiveness varies substantially across question types, with commonsense reasoning presenting particular challenges.

Conclusion: Provides practical insights for multimodal reasoning systems and identifies key areas for future improvement in cross-domain generalization, particularly for commonsense reasoning.

Abstract: While recent work has extended CoT to multimodal settings, achieving state-of-the-art results on science question answering benchmarks like ScienceQA, the generalizability of these approaches across diverse domains remains underexplored. This work presents a comprehensive analysis of Multimodal Chain-of-Thought (Multimodal-CoT) reasoning, evaluating its effectiveness on the A-OKVQA, OKVQA and ChartQA datasets, which requires broad commonsense and world knowledge beyond scientific reasoning. We implement the two-stage framework proposed by Zhang et al. [3], which separates rationale generation from answer inference and integrates vision features through a gated fusion mechanism with T5-based language models. Through systematic ablation studies, we analyze the contributions of vision features, rationale quality, and architectural choices. Our findings reveal that while vision integration significantly reduces hallucination in rationale generation, the effectiveness of CoT reasoning varies substantially across question types, with commonsense reasoning presenting particular challenges. This work provides practical insights for researchers implementing multimodal reasoning systems and identifies key areas for future improvement in cross-domain generalization.

[301] Conversational no-code and multi-agentic disease module identification and drug repurposing prediction with ChatDRex

Simon Süwer, Kester Bagemihl, Sylvie Baier, Lucia Dicunta, Markus List, Jan Baumbach, Andreas Maier, Fernando M. Delgado-Chaves

Main category: cs.AI

TL;DR: ChatDRex is a conversation-based multi-agent system that enables natural language access to biomedical knowledge graphs for network-based drug repurposing prediction, making complex bioinformatics analyses accessible to non-experts.

Details

Motivation: Drug repurposing is time-efficient and cost-effective, but current in silico prediction methods require specialized expertise and fragmented tools that don't integrate well across workflows.

Method: Multi-agent system built on NeDRex knowledge graph with specialized agents for query routing, data retrieval, network analysis, drug repurposing, functional coherence evaluation, literature mining, and result visualization.

Result: Provides natural language interface for complex bioinformatics analyses, enables hypothesis generation and exploration of drug repurposing opportunities without requiring computer science expertise.

Conclusion: ChatDRex democratizes access to bioinformatics resources for drug repurposing, accelerating discovery of novel therapies and advancing personalized medicine and translational research.

Abstract: Repurposing approved drugs offers a time-efficient and cost-effective alternative to traditional drug development. However, in silico prediction of repurposing candidates is challenging and requires the effective collaboration of specialists in various fields, including pharmacology, medicine, biology, and bioinformatics. Fragmented, specialized algorithms and tools often address only narrow aspects of the overall problem, and heterogeneous, unstructured data landscapes require specialized users to be involved. Hence, these data services do not integrate smoothly across workflows. With ChatDRex, we present a conversation-based, multi-agent system that facilitates the execution of complex bioinformatic analyses aiming for network-based drug repurposing prediction. It builds on the integrated systems medicine knowledge graph NeDRex. ChatDRex provides natural language access to its extensive biomedical KG and integrates bioinformatics agents for network analysis and drug repurposing, complemented by agents for functional coherence evaluation for in silico validation, as well as agents for literature mining and for discussing the obtained results in a scientific context. Its flexible multi-agent design assigns specific tasks to specialized agents, including query routing, data retrieval, algorithm execution, and result visualization. A dedicated reasoning module keeps the user in the loop and allows for hallucination detection. By enabling physicians and researchers without computer science expertise to control complex analyses in natural language, ChatDRex democratizes access to bioinformatics as an important resource for drug repurposing. It enables clinical experts to generate hypotheses and explore drug repurposing opportunities, ultimately accelerating the discovery of novel therapies and advancing personalized medicine and translational research.

[302] Learning Multi-Access Point Coordination in Agentic AI Wi-Fi with Large Language Models

Yifan Fan, Le Liang, Peng Liu, Xiao Li, Ziyang Guo, Qiao Lan, Shi Jin, Wen Tong

Main category: cs.AI

TL;DR: Proposes an Agentic AI Wi-Fi framework using LLM agents at each access point to dynamically coordinate and adapt to network conditions through natural language dialogue and cognitive workflows, outperforming static MAPC protocols.

Details

Motivation: Existing MAPC protocols use static rules that cannot adapt to dynamic network conditions like varying interference and topologies, limiting throughput in dense Wi-Fi environments.

Method: Each access point is modeled as an autonomous LLM agent that collaboratively reasons about network state and negotiates adaptive coordination strategies through natural language dialogue, using integrated memory, reflection, and tool use.

Result: Comprehensive simulations show the framework successfully adapts to diverse dynamic network environments, significantly outperforming state-of-the-art spatial reuse baselines.

Conclusion: The agentic framework validates its potential as a robust and intelligent solution for future wireless networks by enabling real-time adaptive coordination.

Abstract: Multi-access point coordination (MAPC) is a key technology for enhancing throughput in next-generation Wi-Fi within dense overlapping basic service sets. However, existing MAPC protocols rely on static, protocol-defined rules, which limits their ability to adapt to dynamic network conditions such as varying interference levels and topologies. To address this limitation, we propose a novel Agentic AI Wi-Fi framework where each access point, modeled as an autonomous large language model agent, collaboratively reasons about the network state and negotiates adaptive coordination strategies in real time. This dynamic collaboration is achieved through a cognitive workflow that enables the agents to engage in natural language dialogue, leveraging integrated memory, reflection, and tool use to ground their decisions in past experience and environmental feedback. Comprehensive simulation results demonstrate that our agentic framework successfully learns to adapt to diverse and dynamic network environments, significantly outperforming the state-of-the-art spatial reuse baseline and validating its potential as a robust and intelligent solution for future wireless networks.

[303] OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability

Karen Ullrich, Jingtong Su, Claudia Shi, Arjun Subramonian, Amir Bar, Ivan Evtimov, Nikolaos Tsilivis, Randall Balestriero, Julia Kempe, Mark Ibrahim

Main category: cs.AI

TL;DR: OpenApps is a lightweight ecosystem for evaluating UI-Agents across app variations, revealing that reliability fluctuates significantly (up to 50%) across different app versions.

Details

Motivation: Current evaluations of autonomous UI-Agents use fixed environments, which fail to capture reliability across real-world app variations in design and content.

Method: Developed OpenApps with six configurable apps that can generate thousands of versions, running over 10,000 evaluations across seven multimodal agents.

Result: Task success rates vary drastically across app variations (e.g., Kimi-VL-3B fluctuates from 63% to 4%), and agent behaviors like looping/hallucinating differ by environment.

Conclusion: Measuring reliability across app variations is crucial, as standard evaluations in fixed environments mask significant performance fluctuations in real-world scenarios.

Abstract: Reliability is key to realizing the promise of autonomous UI-Agents, multimodal agents that directly interact with apps in the same manner as humans, as users must be able to trust an agent to complete a given task. Current evaluations rely on fixed environments, often clones of existing apps, which are limited in that they can only shed light on whether or how often an agent can complete a task within a specific environment. When deployed however, agents are likely to encounter variations in app design and content that can affect an agent’s ability to complete a task. To address this blind spot of measuring agent reliability across app variations, we develop OpenApps, a light-weight open-source ecosystem with six apps (messenger, calendar, maps, etc.) that are configurable in appearance and content. OpenApps requires just a single CPU to run, enabling easy generation and deployment of thousands of versions of each app. Specifically, we run more than 10,000 independent evaluations to study reliability across seven leading multimodal agents. We find that while standard reliability within a fixed app is relatively stable, reliability can vary drastically when measured across app variations. Task success rates for many agents can fluctuate by more than $50%$ across app variations. For example, Kimi-VL-3B’s average success across all tasks fluctuates from $63%$ to just $4%$ across app versions. We also find agent behaviors such as looping or hallucinating actions can differ drastically depending on the environment configuration. These initial findings highlight the importance of measuring reliability along this new dimension of app variations. OpenApps is available at https://facebookresearch.github.io/OpenApps/

[304] Representation Interventions Enable Lifelong Unstructured Knowledge Control

Xuyuan Liu, Zhengzhang Chen, Xinshuai Dong, Yanchi Liu, Xujiang Zhao, Shengyu Chen, Haoyu Wang, Yujun Yan, Haifeng Chen

Main category: cs.AI

TL;DR: RILKE is a scalable method for lifelong knowledge control in LLMs that uses representation-space interventions to efficiently update model knowledge without retraining, maintaining high edit success and general utility.

Details

Motivation: LLMs often produce incorrect or outdated content, and updating their knowledge efficiently without costly retraining is challenging, especially for complex, unstructured knowledge in lifelong settings where many edits must coexist without interference.

Method: RILKE treats knowledge control as interventions in the model’s representation space, learning paraphrase-robust and edit-localized modules that limit updates to low-dimensional subspaces to minimize interference. A query-adaptive router selects appropriate modules during inference.

Result: Evaluation on knowledge editing benchmarks with LLaMA and Qwen models shows RILKE is scalable to large datasets, achieving high edit success, strong paraphrase generalization, and preserved general utility with modest memory overhead.

Conclusion: RILKE is an effective and scalable solution for lifelong knowledge control in LLMs, enabling fine-grained control over complex knowledge while maintaining model utility with frozen base weights.

Abstract: Large language models (LLMs) often produce incorrect or outdated content. Updating their knowledge efficiently and accurately without costly retraining is a major challenge. This problem is especially hard for complex, unstructured knowledge in a lifelong setting, where many edits must coexist without interference. We introduce RILKE (Representation Intervention for Lifelong KnowledgE Control), a robust and scalable method that treats knowledge control as interventions within the model’s representation space. Leveraging representation-space expressiveness, we identify two properties enabling RILKE to deliver fine-grained control over complex, unstructured knowledge while maintaining general utility with frozen base weights. During training, RILKE learns paraphrase-robust and edit-localized modules that limit each update to a low-dimensional subspace to minimize cross-edit interference. In inference, a query-adaptive router selects the appropriate module to guide the model’s generation. In evaluation on knowledge editing benchmarks with LLaMA and Qwen models, RILKE is scalable to large-scale datasets, demonstrating high edit success, strong paraphrase generalization, and preserving general utility with modest memory overhead. These results show RILKE is an effective and scalable solution for lifelong knowledge control in LLMs.

[305] Step-Audio-R1 Technical Report

Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li, Daijiao Liu, Yayue Deng, Donghang Wu, Jun Chen, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu

Main category: cs.AI

TL;DR: Step-Audio-R1 is the first successful audio reasoning model that demonstrates reasoning capabilities can be transferred to audio domains through proper grounding in acoustic features, outperforming Gemini 2.5 Pro and matching Gemini 3 Pro performance.

Details

Motivation: Audio language models have consistently performed better with minimal reasoning, raising the fundamental question of whether audio intelligence can truly benefit from deliberate thinking like text and vision models do.

Method: Proposed Modality-Grounded Reasoning Distillation (MGRD) framework that teaches the model to generate audio-relevant reasoning chains genuinely grounded in acoustic features rather than disconnected deliberations.

Result: Step-Audio-R1 exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to state-of-the-art Gemini 3 Pro across comprehensive audio understanding benchmarks spanning speech, environmental sounds, and music.

Conclusion: Reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence, opening pathways toward truly multimodal reasoning systems.

Abstract: Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.

[306] Guaranteed Optimal Compositional Explanations for Neurons

Biagio La Rosa, Leilani H. Gilpin

Main category: cs.AI

TL;DR: This paper introduces the first framework for computing guaranteed optimal compositional explanations in neural networks, addressing limitations of beam search methods that lack theoretical guarantees.

Details

Motivation: Current compositional explanation methods use beam search which cannot provide theoretical guarantees of optimality, making it unclear how close current explanations are to the true optimum.

Method: Proposes: (i) a decomposition identifying factors influencing spatial alignment, (ii) a heuristic to estimate alignment during search, and (iii) the first algorithm for computing optimal compositional explanations efficiently.

Result: Analysis shows 10-40% of explanations from beam search are suboptimal when overlapping concepts are involved. A beam-search variant using the proposed decomposition and heuristic matches or improves runtime while offering greater flexibility.

Conclusion: The framework enables guaranteed optimal compositional explanations, revealing significant suboptimality in current methods and providing more efficient alternatives.

Abstract: While neurons are the basic units of deep neural networks, it is still unclear what they learn and if their knowledge is aligned with that of humans. Compositional explanations aim to answer this question by describing the spatial alignment between neuron activations and concepts through logical rules. These logical descriptions are typically computed via a search over all possible concept combinations. Since computing the spatial alignment over the entire state space is computationally infeasible, the literature commonly adopts beam search to restrict the space. However, beam search cannot provide any theoretical guarantees of optimality, and it remains unclear how close current explanations are to the true optimum. In this theoretical paper, we address this gap by introducing the first framework for computing guaranteed optimal compositional explanations. Specifically, we propose: (i) a decomposition that identifies the factors influencing the spatial alignment, (ii) a heuristic to estimate the alignment at any stage of the search, and (iii) the first algorithm that can compute optimal compositional explanations within a feasible time. Using this framework, we analyze the differences between optimal and non-optimal explanations in the most popular settings for compositional explanations, the computer vision domain and Convolutional Neural Networks. In these settings, we demonstrate that 10-40 percent of explanations obtained with beam search are suboptimal when overlapping concepts are involved. Finally, we evaluate a beam-search variant guided by our proposed decomposition and heuristic, showing that it matches or improves runtime over prior methods while offering greater flexibility in hyperparameters and computational resources.

[307] Learning Individual Behavior in Agent-Based Models with Graph Diffusion Networks

Francesco Cozzi, Marco Pangallo, Alan Perotti, André Panisson, Corrado Monti

Main category: cs.AI

TL;DR: Proposes a differentiable surrogate framework for Agent-Based Models using diffusion models and graph neural networks to enable gradient-based optimization while preserving individual agent behavior.

Details

Motivation: ABMs have non-differentiable rules that limit the use of gradient-based optimization methods and integration with real-world data, creating a need for differentiable surrogates.

Method: Combines diffusion models to capture behavioral stochasticity and graph neural networks to model agent interactions, directly modeling individual agent behavior rather than system-level outputs.

Result: Validated on Schelling’s segregation model and Predator-Prey ecosystem, showing replication of individual-level patterns and accurate forecasting of emergent dynamics beyond training.

Conclusion: Demonstrates the potential of combining diffusion models and graph learning for data-driven ABM simulation, enabling gradient-based optimization while preserving decentralized dynamics.

Abstract: Agent-Based Models (ABMs) are powerful tools for studying emergent properties in complex systems. In ABMs, agent behaviors are governed by local interactions and stochastic rules. However, these rules are, in general, non-differentiable, limiting the use of gradient-based methods for optimization, and thus integration with real-world data. We propose a novel framework to learn a differentiable surrogate of any ABM by observing its generated data. Our method combines diffusion models to capture behavioral stochasticity and graph neural networks to model agent interactions. Distinct from prior surrogate approaches, our method introduces a fundamental shift: rather than approximating system-level outputs, it models individual agent behavior directly, preserving the decentralized, bottom-up dynamics that define ABMs. We validate our approach on two ABMs (Schelling’s segregation model and a Predator-Prey ecosystem) showing that it replicates individual-level patterns and accurately forecasts emergent dynamics beyond training. Our results demonstrate the potential of combining diffusion models and graph learning for data-driven ABM simulation.

[308] ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction

Qineng Wang, Wenlong Huang, Yu Zhou, Hang Yin, Tianwei Bao, Jianwen Lyu, Weiyu Liu, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, Manling Li

Main category: cs.AI

TL;DR: ENACT is a benchmark that evaluates embodied cognition in vision-language models through world modeling tasks using egocentric interaction in VQA format, revealing performance gaps between models and humans.

Details

Motivation: To investigate whether modern vision-language models, trained in disembodied ways, exhibit signs of embodied cognition by testing their ability to model the world from egocentric interactions.

Method: Created ENACT benchmark with two sequence reordering tasks: forward world modeling (reorder observations given actions) and inverse world modeling (reorder actions given observations), using a POMDP framework with actions as scene graph changes. Uses scalable pipeline from robotics simulation (BEHAVIOR) with 8,972 QA pairs.

Result: Performance gap between frontier VLMs and humans that widens with interaction horizon. Models perform better on inverse task than forward task and show anthropocentric biases (right-handed preference, degradation with non-human camera parameters).

Conclusion: Current VLMs show limitations in embodied cognition capabilities compared to humans, with performance gaps that increase with task complexity and reveal inherent biases from human-centric training data.

Abstract: Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction in a visual question answering (VQA) format. Framed as a partially observable Markov decision process (POMDP) whose actions are scene graph changes, ENACT comprises two complementary sequence reordering tasks: forward world modeling (reorder shuffled observations given actions) and inverse world modeling (reorder shuffled actions given observations). While conceptually simple, solving these tasks implicitly demands capabilities central to embodied cognition-affordance recognition, action-effect reasoning, embodied awareness, and interactive, long-horizon memory from partially observable egocentric input, while avoiding low-level image synthesis that could confound the evaluation. We provide a scalable pipeline that synthesizes QA pairs from robotics simulation (BEHAVIOR) and evaluates models on 8,972 QA pairs spanning long-horizon home-scale activities. Experiments reveal a performance gap between frontier VLMs and humans that widens with interaction horizon. Models consistently perform better on the inverse task than the forward one and exhibit anthropocentric biases, including a preference for right-handed actions and degradation when camera intrinsics or viewpoints deviate from human vision. Website at https://enact-embodied-cognition.github.io/.

[309] Improving Procedural Skill Explanations via Constrained Generation: A Symbolic-LLM Hybrid Architecture

Rahul Dass, Thomas Bowlin, Zebing Li, Xiao Jin, Ashok Goel

Main category: cs.AI

TL;DR: Ivy is an AI coaching system that combines symbolic TMK models with LLMs to generate structured, multi-step explanations for procedural skill learning, improving explanation quality over standard GPT approaches.

Details

Motivation: Current LLMs produce fluent but shallow explanations that miss the causal, goal-directed, and compositional logic needed for effective procedural skill instruction.

Method: Combines symbolic Task-Method-Knowledge (TMK) models with a generative LLM layer, where TMK encodes causal transitions, goal hierarchies, and problem decompositions to constrain the LLM’s explanation generation.

Result: Ivy consistently outperforms GPT and retrieval-augmented GPT baselines in structural quality of explanations for “how” and “why” questions across three inferential dimensions.

Conclusion: Symbolic constraints significantly improve the pedagogical value of AI-generated explanations, demonstrating a scalable approach for intelligent coaching systems in education.

Abstract: In procedural skill learning, instructional explanations must convey not just steps, but the causal, goal-directed, and compositional logic behind them. Large language models (LLMs) often produce fluent yet shallow responses that miss this structure. We present Ivy, an AI coaching system that delivers structured, multi-step explanations by combining symbolic Task-Method-Knowledge (TMK) models with a generative interpretation layer-an LLM that constructs explanations while being constrained by TMK structure. TMK encodes causal transitions, goal hierarchies, and problem decompositions, and guides the LLM within explicit structural bounds. We evaluate Ivy against responses against GPT and retrieval-augmented GPT baselines using expert and independent annotations across three inferential dimensions. Results show that symbolic constraints consistently improve the structural quality of explanations for “how” and “why” questions. This study demonstrates a scalable AI for education approach that strengthens the pedagogical value of AI-generated explanations in intelligent coaching systems.

[310] ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning

Jinpeng Wang, Chao Li, Ting Ye, Mengyuan Zhang, Wei Liu, Jian Luan

Main category: cs.AI

TL;DR: ICPO method enhances LLM reasoning by using intrinsic confidence and preference modeling to address RLVR limitations like coarse rewards and inefficient exploration.

Details

Motivation: Existing RLVR methods suffer from coarse-grained rewards, reward noise, and inefficient exploration, leading to unstable training and entropy collapse in LLM reasoning.

Method: ICPO calculates preference advantage scores by comparing generation probabilities of multiple responses under the same prompt, integrating these with verifiable rewards to guide exploration.

Result: ICPO alleviates reward issues, curbs overconfident errors, enhances undervalued high-quality responses, and prevents overfitting, enabling more thorough exploration.

Conclusion: Comprehensive experiments across multiple benchmarks show ICPO steadily boosts reasoning performance compared to GRPO.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates significant potential in enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing RLVR methods are often constrained by issues such as coarse-grained rewards, reward noise, and inefficient exploration, which lead to unstable training and entropy collapse. To address this challenge, we propose the Intrinsic Confidence-Driven Group Relative Preference Optimization method (ICPO). The intuition behind it lies in the fact that the probabilities of an LLM generating different responses can inherently and directly reflect its self-assessment of the reasoning process. Inspired by the idea of preference modeling, ICPO calculates a preference advantage score for each response by comparing the relative generation probabilities of multiple responses under the same input prompt, and integrates this score with verifiable rewards to guide the exploration process. We have discovered that the preference advantage score not only alleviates the issues of coarse-grained rewards and reward noise but also effectively curbs overconfident errors, enhances the relative superiority of undervalued high-quality responses, and prevents the model from overfitting to specific strategies, thereby facilitating more thorough exploration. Comprehensive experiments across four general-domain benchmarks and three mathematical benchmarks demonstrate that ICPO steadily boosts reasoning compared to GRPO.

[311] Towards Trustworthy Legal AI through LLM Agents and Formal Reasoning

Linze Chen, Yufan Cai, Zhe Hou, Jinsong Dong

Main category: cs.AI

TL;DR: L4M is a framework combining LLM agents with SMT-solver proofs to bridge natural language interpretation with symbolic verification in legal reasoning, outperforming existing LLMs and legal AI systems.

Details

Motivation: Existing LLM-based systems lack the guarantees required for principled jurisprudence, excelling only at surface-level text analysis without formal rationality.

Method: Three-phase pipeline: Statute Formalization (converting legal provisions to logical formulae), Dual Fact and Statute Extraction (prosecutor/defense LLMs extract facts independently), and Solver-Centric Adjudication (autoformalizer compiles arguments into logic constraints with iterative self-critique).

Result: Surpasses advanced LLMs (GPT-4-mini, DeepSeek-V3, Claude 4) and state-of-the-art Legal AI baselines on public benchmarks.

Conclusion: L4M successfully unites interpretive flexibility of natural language with rigor of symbolic verification, providing rigorous and explainable symbolic justifications for legal decisions.

Abstract: The rationality of law manifests in two forms: substantive rationality, which concerns the fairness or moral desirability of outcomes, and formal rationality, which requires legal decisions to follow explicitly stated, general, and logically coherent rules. Existing LLM-based systems excel at surface-level text analysis but lack the guarantees required for principled jurisprudence. We introduce L4M, a novel framework that combines adversarial LLM agents with SMT-solver-backed proofs to unite the interpretive flexibility of natural language with the rigor of symbolic verification. The pipeline consists of three phases: (1) Statute Formalization, where domain-specific prompts convert legal provisions into logical formulae; (2) Dual Fact and Statute Extraction, in which prosecutor- and defense-aligned LLMs independently map case narratives to fact tuples and statutes, ensuring role isolation; and (3) Solver-Centric Adjudication, where an autoformalizer compiles both parties’ arguments into logic constraints, and unsat cores trigger iterative self-critique until a satisfiable formula is achieved, which is then verbalized by a Judge-LLM into a transparent verdict and optimized sentence. Experimental results on public benchmarks show that our system surpasses advanced LLMs including GPT-o4-mini, DeepSeek-V3, and Claude 4 as well as state-of-the-art Legal AI baselines, while providing rigorous and explainable symbolic justifications.

[312] OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection

Chujie Wang, Jianyu Lu, Zhiyuan Luo, Xi Chen, Chu He

Main category: cs.AI

TL;DR: OVOD-Agent transforms passive category matching into proactive visual reasoning and self-evolving detection using a Visual Chain-of-Thought approach modeled as a Weakly Markovian Decision Process.

Details

Motivation: Existing Open-Vocabulary Object Detection methods have a gap between multimodal training and unimodal inference, with textual space being underexplored despite its potential to significantly improve performance.

Method: Proposes OVOD-Agent with Visual-CoT using explicit actions, models visual context transitions as w-MDP over eight state spaces, includes Bandit module for exploration signals, and integrates Markov transition matrices with Bandit trajectories for self-supervised Reward Model optimization.

Result: Experiments on COCO and LVIS show consistent improvements across OVOD backbones, particularly on rare categories.

Conclusion: The proposed framework effectively bridges the gap between multimodal training and unimodal inference in OVOD through proactive visual reasoning and self-evolving detection.

Abstract: Open-Vocabulary Object Detection (OVOD) aims to enable detectors to generalize across categories by leveraging semantic information. Although existing methods are pretrained on large vision-language datasets, their inference is still limited to fixed category names, creating a gap between multimodal training and unimodal inference. Previous work has shown that improving textual representation can significantly enhance OVOD performance, indicating that the textual space is still underexplored. To this end, we propose OVOD-Agent, which transforms passive category matching into proactive visual reasoning and self-evolving detection. Inspired by the Chain-of-Thought (CoT) paradigm, OVOD-Agent extends the textual optimization process into an interpretable Visual-CoT with explicit actions. OVOD’s lightweight nature makes LLM-based management unsuitable; instead, we model visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight state spaces, which naturally represents the agent’s state, memory, and interaction dynamics. A Bandit module generates exploration signals under limited supervision, helping the agent focus on uncertain regions and adapt its detection policy. We further integrate Markov transition matrices with Bandit trajectories for self-supervised Reward Model (RM) optimization, forming a closed loop from Bandit exploration to RM learning. Experiments on COCO and LVIS show that OVOD-Agent provides consistent improvements across OVOD backbones, particularly on rare categories, confirming the effectiveness of the proposed framework.

[313] Causality Without Causal Models

Joseph Y. Halpern, Rafael Pass

Main category: cs.AI

TL;DR: The paper abstracts Halpern and Pearl’s causality definition to work with any model where counterfactuals are defined, enabling broader applications including handling disjunctions, negations, beliefs, and nested counterfactuals.

Details

Motivation: To extend Halpern and Pearl's causality definition beyond causal models, allowing application to any model with counterfactuals and handling more complex logical constructs.

Method: Abstracting the key features of Halpern-Pearl’s causality definition to create a generalized version that can be applied to any model where counterfactuals are defined.

Result: The abstracted definition can be applied to a wider range of models (including those allowing backtracking) and can handle formulas with disjunctions, negations, beliefs, and nested counterfactuals.

Conclusion: Abstracting the causality definition provides multiple benefits: broader applicability, ability to handle complex logical constructs, extension to explanation definitions, and deeper understanding of the original definition’s features.

Abstract: Perhaps the most prominent current definition of (actual) causality is due to Halpern and Pearl. It is defined using causal models (also known as structural equations models). We abstract the definition, extracting its key features, so that it can be applied to any other model where counterfactuals are defined. By abstracting the definition, we gain a number of benefits. Not only can we apply the definition in a wider range of models, including ones that allow, for example, backtracking, but we can apply the definition to determine if A is a cause of B even if A and B are formulas involving disjunctions, negations, beliefs, and nested counterfactuals (none of which can be handled by the Halpern-Pearl definition). Moreover, we can extend the ideas to getting an abstract definition of explanation that can be applied beyond causal models. Finally, we gain a deeper understanding of features of the definition even in causal models.

[314] New Hybrid Heuristics for Pseudo-Boolean Propagation

Mia Müßig, Jan Johannsen

Main category: cs.AI

TL;DR: New heuristics for hybrid unit propagation in pseudo-boolean solving outperform current methods in RoundingSAT.

Details

Motivation: Current hybrid unit propagation strategies combining watched literal scheme with counting method are successful but can be improved.

Method: Introduces new heuristics for making hybrid decisions in unit propagation for pseudo-boolean solving.

Result: The new heuristics drastically outperform the current method in the RoundingSAT solver.

Conclusion: The proposed heuristics significantly improve performance of hybrid unit propagation in pseudo-boolean solving.

Abstract: In pseudo-boolean solving the currently most successful unit propagation strategy is a hybrid mode combining the watched literal scheme with the counting method. This short paper introduces new heuristics for this hybrid decision, which are able to drastically outperform the current method in the RoundingSAT solver.

[315] EWE: An Agentic Framework for Extreme Weather Analysis

Zhe Jiang, Jiong Wang, Xiaoyu Yue, Zijie Guo, Wenlong Zhang, Fenghua Ling, Wanli Ouyang, Lei Bai

Main category: cs.AI

TL;DR: EWE is the first AI framework for automated extreme weather diagnosis, using knowledge-guided planning and meteorological tools to analyze raw data and generate visualizations, with a new benchmark for evaluation.

Details

Motivation: Extreme weather events are increasing but current expert-driven diagnostic methods are labor-intensive and create analytical bottlenecks, while AI has focused on prediction rather than automated diagnostic reasoning.

Method: EWE emulates expert workflows through knowledge-guided planning, closed-loop reasoning, and a domain-tailored meteorological toolkit that autonomously produces and interprets multimodal visualizations from raw meteorological data.

Result: The framework enables comprehensive diagnostic analyses and the authors introduce the first benchmark for this field with 103 high-impact events and a step-wise evaluation metric.

Conclusion: EWE represents progress toward automated scientific discovery and can democratize expertise, especially benefiting developing nations vulnerable to extreme weather.

Abstract: Extreme weather events pose escalating risks to global society, underscoring the urgent need to unravel their underlying physical mechanisms. Yet the prevailing expert-driven, labor-intensive diagnostic paradigm has created a critical analytical bottleneck, stalling scientific progress. While AI for Earth Science has achieved notable advances in prediction, the equally essential challenge of automated diagnostic reasoning remains largely unexplored. We present the Extreme Weather Expert (EWE), the first intelligent agent framework dedicated to this task. EWE emulates expert workflows through knowledge-guided planning, closed-loop reasoning, and a domain-tailored meteorological toolkit. It autonomously produces and interprets multimodal visualizations from raw meteorological data, enabling comprehensive diagnostic analyses. To catalyze progress, we introduce the first benchmark for this emerging field, comprising a curated dataset of 103 high-impact events and a novel step-wise evaluation metric. EWE marks a step toward automated scientific discovery and offers the potential to democratize expertise and intellectual resources, particularly for developing nations vulnerable to extreme weather.

[316] MADRA: Multi-Agent Debate for Risk-Aware Embodied Planning

Junjian Wang, Lidan Zhao, Xi Sheryl Zhang

Main category: cs.AI

TL;DR: MADRA is a training-free multi-agent debate framework for safety assessment in embodied AI, using collective reasoning to reduce false rejections while maintaining high safety sensitivity.

Details

Motivation: Existing methods for embodied AI safety suffer from high computational costs or over-rejection of safe tasks, creating barriers for real-world deployment in household environments.

Method: Uses multiple LLM agents to debate instruction safety with a critical evaluator scoring responses, plus hierarchical cognitive planning with safety, memory, and self-evolution mechanisms.

Result: Achieves over 90% rejection of unsafe tasks with low safe-task rejection, outperforming existing methods in both safety and execution efficiency on AI2-THOR and VirtualHome benchmarks.

Conclusion: Provides a scalable, model-agnostic solution for trustworthy embodied agents through collective reasoning and continuous learning mechanisms.

Abstract: Ensuring the safety of embodied AI agents during task planning is critical for real-world deployment, especially in household environments where dangerous instructions pose significant risks. Existing methods often suffer from either high computational costs due to preference alignment training or over-rejection when using single-agent safety prompts. To address these limitations, we propose MADRA, a training-free Multi-Agent Debate Risk Assessment framework that leverages collective reasoning to enhance safety awareness without sacrificing task performance. MADRA employs multiple LLM-based agents to debate the safety of a given instruction, guided by a critical evaluator that scores responses based on logical soundness, risk identification, evidence quality, and clarity. Through iterative deliberation and consensus voting, MADRA significantly reduces false rejections while maintaining high sensitivity to dangerous tasks. Additionally, we introduce a hierarchical cognitive collaborative planning framework that integrates safety, memory, planning, and self-evolution mechanisms to improve task success rates through continuous learning. We also contribute SafeAware-VH, a benchmark dataset for safety-aware task planning in VirtualHome, containing 800 annotated instructions. Extensive experiments on AI2-THOR and VirtualHome demonstrate that our approach achieves over 90% rejection of unsafe tasks while ensuring that safe-task rejection is low, outperforming existing methods in both safety and execution efficiency. Our work provides a scalable, model-agnostic solution for building trustworthy embodied agents.

[317] SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Yunjian Zhang

Main category: cs.AI

TL;DR: Proposes a hierarchical spatial cognition framework and SpatialBench benchmark to systematically evaluate multimodal large language models’ spatial reasoning abilities across five progressive complexity levels.

Details

Motivation: Existing benchmarks oversimplify spatial cognition as single-dimensional metrics, failing to capture the hierarchical structure and interdependence of spatial abilities in multimodal intelligence.

Method: Developed a hierarchical spatial cognition framework with five progressive levels, constructed SpatialBench benchmark covering 15 tasks aligned with these levels, and introduced a unified capability-oriented metric for evaluation.

Result: Experiments revealed distinct performance stratification: models show strong perceptual grounding but limitations in symbolic reasoning, causal inference, and planning. Human tests show goal-directed abstraction while MLLMs over-attend to surface details.

Conclusion: Establishes the first systematic framework for measuring hierarchical spatial cognition in MLLMs, providing foundation for developing spatially intelligent systems.

Abstract: Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric, which fails to capture the hierarchical structure and interdependence of spatial abilities. To address this gap, we propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels from basic observation to high-level planning. Building upon this taxonomy, we construct SpatialBench, a large-scale, fine-grained benchmark covering 15 tasks aligned with these cognitive levels. To provide a unified evaluation across heterogeneous tasks, we further introduce a high-level capability-oriented metric that reliably assesses a model’s overall spatial reasoning ability. Extensive experiments over massive MLLMs reveal distinct performance stratification across cognitive levels: models exhibit strong perceptual grounding yet remain limited in symbolic reasoning, causal inference, and planning. Additional human tests demonstrate that humans perform selective, goal-directed abstraction, while MLLMs tend to over-attend to surface details without coherent spatial intent. Our work establishes the first systematic framework for measuring hierarchical spatial cognition in MLLMs, laying the foundation for future spatially intelligent systems.

[318] Pessimistic Verification for Open Ended Math Questions

Yanxing Huang, Zihan Tang, Zejin Lin, Peng Li, Yang Liu

Main category: cs.AI

TL;DR: Pessimistic verification improves math proof verification by running multiple parallel checks and rejecting proofs if any check finds errors, significantly boosting performance across benchmarks with minimal computational cost.

Details

Motivation: The key limitation in verification performance is error detection capability, and current methods struggle with reliably identifying incorrect proofs in mathematical reasoning tasks.

Method: Designed pessimistic verification workflows that construct multiple parallel verifications for the same proof, where the proof is deemed incorrect if any verification reports an error.

Result: Significantly improved performance across math verification benchmarks without substantial computational resources, with token efficiency surpassing extended long-CoT in test-time scaling. Case studies revealed many false negatives were actually dataset annotation errors.

Conclusion: Self-verification for mathematical problems effectively improves reliability and performance of language model outputs, and pessimistic verification research will enhance mathematical capabilities across a wide range of tasks.

Abstract: The key limitation of the verification performance lies in the ability of error detection. With this intuition we designed several variants of pessimistic verification, which are simple workflows that could significantly improve the verification of open-ended math questions. In pessimistic verification we construct multiple parallel verifications for the same proof, and the proof is deemed incorrect if any one of them reports an error. This simple technique significantly improves the performance across many math verification benchmarks without incurring substantial computational resources. Its token efficiency even surpassed extended long-CoT in test-time scaling. Our case studies further indicate that the majority of false negatives in stronger models are actually caused by annotation errors in the original dataset, so our method’s performance is in fact underestimated. Self-verification for mathematical problems can effectively improve the reliability and performance of language model outputs, and it also plays a critical role in enabling long-horizon mathematical tasks. We believe that research on pessimistic verification will help enhance the mathematical capabilities of language models across a wide range of tasks.

[319] Self-Transparency Failures in Expert-Persona LLMs: A Large-Scale Behavioral Audit

Alex Diep

Main category: cs.AI

TL;DR: AI models show unreliable self-disclosure when assigned professional personas, with disclosure rates varying dramatically by domain (3.5% for neurosurgeon vs 30.8% for financial advisor), creating trust risks where users may overgeneralize transparency from one domain to another.

Details

Motivation: To examine whether language models can reliably disclose their AI identity when assigned professional personas in high-stakes domains, where failure to do so could lead to user harm through false expertise claims.

Method: Used a common-garden design to audit 16 open-weight models (4B-671B parameters) across 19,200 trials, assigning professional personas and measuring disclosure rates. Applied Bayesian validation with Rogan-Gladen correction for measurement error.

Result: Models showed sharp domain-specific inconsistency in disclosure (2.8% to 73.6%), with reasoning optimization actively suppressing self-transparency (up to 48.4% lower disclosure). Model identity predicted behavior better than parameter count. Disclosure reliability varied widely across similar-sized models.

Conclusion: Transparency reflects training factors rather than scale, and organizations cannot assume safety properties transfer to deployment contexts, requiring deliberate behavior design and empirical verification to prevent trust failures.

Abstract: If a language model cannot reliably disclose its AI identity in expert contexts, users cannot trust its competence boundaries. This study examines self-transparency in models assigned professional personas within high-stakes domains where false expertise risks user harm. Using a common-garden design, sixteen open-weight models (4B–671B parameters) were audited across 19,200 trials. Models exhibited sharp domain-specific inconsistency: a Financial Advisor persona elicited 30.8% disclosure initially, while a Neurosurgeon persona elicited only 3.5%. This creates preconditions for a “Reverse Gell-Mann Amnesia” effect, where transparency in some domains leads users to overgeneralize trust to contexts where disclosure fails. Disclosure ranged from 2.8% to 73.6%, with a 14B model reaching 61.4% while a 70B produced just 4.1%. Model identity predicted behavior better than parameter count ($ΔR_{adj}^{2} = 0.359$ vs 0.018). Reasoning optimization actively suppressed self-transparency in some models, with reasoning variants showing up to 48.4% lower disclosure than base counterparts. Bayesian validation with Rogan–Gladen correction confirmed robustness to measurement error ($κ= 0.908$). These findings demonstrate transparency reflects training factors rather than scale. Organizations cannot assume safety properties transfer to deployment contexts, requiring deliberate behavior design and empirical verification.

[320] From Prediction to Foresight: The Role of AI in Designing Responsible Futures

Maria Perez-Ortiz

Main category: cs.AI

TL;DR: This paper introduces ‘responsible computational foresight’ as a framework using AI and computational modeling to help policymakers navigate future uncertainties ethically and proactively.

Details

Motivation: The need for responsible foresight in addressing complex global challenges and technological advancements, requiring ethical anticipation of opportunities and risks for sustainable future design.

Method: Establishes foundational principles for responsible computational foresight and presents AI-driven foresight tools including simulations and scenario analysis to enhance policymakers’ capabilities.

Result: AI enhances policymakers’ ability to address uncertainty, evaluate risks, and devise sustainable strategies, serving as a supportive tool that complements human judgment rather than replacing it.

Conclusion: Advocates for thoughtful integration of AI into foresight practices to empower policymakers in confronting 21st century challenges while maintaining human-centered, ethical decision-making.

Abstract: In an era marked by rapid technological advancements and complex global challenges, responsible foresight has emerged as an essential framework for policymakers aiming to navigate future uncertainties and shape the future. Responsible foresight entails the ethical anticipation of emerging opportunities and risks, with a focus on fostering proactive, sustainable, and accountable future design. This paper coins the term “responsible computational foresight”, examining the role of human-centric artificial intelligence and computational modeling in advancing responsible foresight, establishing a set of foundational principles for this new field and presenting a suite of AI-driven foresight tools currently shaping it. AI, particularly in conjunction with simulations and scenario analysis, enhances policymakers’ ability to address uncertainty, evaluate risks, and devise strategies geared toward sustainable, resilient futures. However, responsible foresight extends beyond mere technical forecasting; it demands a nuanced understanding of the interdependencies within social, environmental, economic and political systems, alongside a commitment to ethical, long-term decision-making that supports human intelligence. We argue that AI will play a role as a supportive tool in responsible, human-centered foresight, complementing rather than substituting policymaker judgment to enable the proactive shaping of resilient and ethically sound futures. This paper advocates for the thoughtful integration of AI into foresight practices to empower policymakers and communities as they confront the grand challenges of the 21st century.

[321] On the Limits of Innate Planning in Large Language Models

Charles Schepanowski, Charles Ling

Main category: cs.AI

TL;DR: LLMs struggle with planning and state tracking in the 8-puzzle task, showing limitations in maintaining internal state and heuristic planning even with corrective feedback and move validation assistance.

Details

Motivation: To directly evaluate LLMs' capacity for planning and stateful reasoning without external tools, using the 8-puzzle as a classic task that requires state tracking and goal-directed planning.

Method: Tested four LLMs using common prompting strategies (Zero-Shot, Chain-of-Thought, Algorithm-of-Thought) with tiered corrective feedback, and examined models with an external move validator that provides only valid moves.

Result: Feedback improved success rates for some model-prompt combinations, but successful runs were long and indirect. Even with move validation assistance, none of the models solved any puzzles. Models showed brittle internal state representations and weak heuristic planning.

Conclusion: Current LLMs have substantial limitations in planning without external tools, requiring mechanisms for maintaining explicit state and performing structured search for further progress.

Abstract: Large language models (LLMs) achieve impressive results on many benchmarks, yet their capacity for planning and stateful reasoning remains unclear. We study these abilities directly, without code execution or other tools, using the 8-puzzle: a classic task that requires state tracking and goal-directed planning while allowing precise, step-by-step evaluation. Four models are tested under common prompting conditions (Zero-Shot, Chain-of-Thought, Algorithm-of-Thought) and with tiered corrective feedback. Feedback improves success rates for some model-prompt combinations, but many successful runs are long, computationally expensive, and indirect. We then examine the models with an external move validator that provides only valid moves. Despite this level of assistance, none of the models solve any puzzles in this setting. Qualitative analysis reveals two dominant deficits across all models: (1) brittle internal state representations, leading to frequent invalid moves, and (2) weak heuristic planning, with models entering loops or selecting actions that do not reduce the distance to the goal state. These findings indicate that, in the absence of external tools such as code interpreters, current LLMs have substantial limitations in planning and that further progress may require mechanisms for maintaining explicit state and performing structured search.

[322] Bridging the Unavoidable A Priori: A Framework for Comparative Causal Modeling

Peter S. Hovmand, Kari O’Donnell, Callie Ogland-Hand, Brian Biroscak, Douglas D. Gunzler

Main category: cs.AI

TL;DR: This paper integrates system dynamics and structural equation modeling into a common mathematical framework to address biases in AI/ML models and advance responsible AI development.

Details

Motivation: AI/ML models amplify human biases, and responsible AI advocates need richer causal models from system dynamics, but face barriers due to different underlying assumptions between disciplines.

Method: Develops a common mathematical framework that brings system dynamics and structural equation modeling together to generate systems from distributions, develop methods, and compare results.

Result: The framework enables integration of system dynamics epistemology with data science and AI/ML applications, facilitating better understanding of causal relationships.

Conclusion: This unified approach helps overcome methodological barriers and informs the development of more responsible AI/ML systems by leveraging system dynamics principles.

Abstract: AI/ML models have rapidly gained prominence as innovations for solving previously unsolved problems and their unintended consequences from amplifying human biases. Advocates for responsible AI/ML have sought ways to draw on the richer causal models of system dynamics to better inform the development of responsible AI/ML. However, a major barrier to advancing this work is the difficulty of bringing together methods rooted in different underlying assumptions (i.e., Dana Meadow’s “the unavoidable a priori”). This paper brings system dynamics and structural equation modeling together into a common mathematical framework that can be used to generate systems from distributions, develop methods, and compare results to inform the underlying epistemology of system dynamics for data science and AI/ML applications.

[323] Agentic Learner with Grow-and-Refine Multimodal Semantic Memory

Weihao Bo, Shan Zhang, Yanpeng Sun, Jingjing Wu, Qunyi Xie, Xiao Tan, Kunbin Chen, Wei He, Xiaofan Li, Na Zhao, Jingdong Wang, Zechao Li

Main category: cs.AI

TL;DR: ViLoMem is a dual-stream memory framework that separately encodes visual distraction patterns and logical reasoning errors to help MLLMs learn from past experiences and avoid repeating mistakes.

Details

Motivation: Existing memory-augmented agents mainly store past trajectories but suffer from brevity bias and single-modality limitations, failing to preserve how visual attention and logical reasoning jointly contributed to solutions, which is misaligned with human cognition.

Method: Introduces ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory by separately encoding visual distraction patterns and logical reasoning errors, following a grow-and-refine principle for incremental accumulation and updating of multimodal semantic knowledge.

Result: Across six multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction-hallucination separation.

Conclusion: ViLoMem demonstrates the value of error-aware multimodal memory for lifelong and cross-domain agentic learning, enabling MLLMs to learn from both successful and failed experiences while avoiding catastrophic forgetting.

Abstract: MLLMs exhibit strong reasoning on isolated queries, yet they operate de novo – solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually losing essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a single-modality trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution. This is fundamentally misaligned with human cognition: semantic memory is both multimodal and integrated, preserving visual and abstract knowledge through coordinated but distinct representational streams. We thus introduce ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences. Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge – preserving stable, generalizable strategies while avoiding catastrophic forgetting. Across six multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction–hallucination separation, demonstrating the value of error-aware multimodal memory for lifelong and cross-domain agentic learning. Our project page will be available at https://weihao-bo.github.io/ViLoMeo-page.

[324] Earth Observation Satellite Scheduling with Graph Neural Networks and Monte Carlo Tree Search

Antoine Jacquet, Guillaume Infantes, Emmanuel Benazera, Vincent Baudoui, Jonathan Guerra, Stéphanie Roussel

Main category: cs.AI

TL;DR: This paper presents a GNN and DRL-based approach for Earth Observation Satellite Planning, combining neural networks with reinforcement learning to solve oversubscribed scheduling problems and outperforming traditional methods.

Details

Motivation: Earth Observation Satellite Planning is a complex optimization problem that is largely oversubscribed, with more candidate observations than can be scheduled. Traditional heuristic and iterative search approaches have limitations, motivating the need for more advanced techniques.

Method: The method uses Graph Neural Networks (GNNs) to extract information from problem instance graphs, Deep Reinforcement Learning (DRL) to search for optimal schedules, and adds a post-learning Monte Carlo Tree Search (MCTS) step to further improve solutions.

Result: The approach successfully learns on small problem instances and generalizes to larger real-world instances, demonstrating very competitive performance compared to traditional approaches.

Conclusion: The combination of GNNs, DRL, and MCTS provides an effective solution for Earth Observation Satellite Planning that can handle oversubscribed scheduling problems and scale from small training instances to large real-world applications.

Abstract: Earth Observation Satellite Planning (EOSP) is a difficult optimization problem with considerable practical interest. A set of requested observations must be scheduled on an agile Earth observation satellite while respecting constraints on their visibility window, as well as maneuver constraints that impose varying delays between successive observations. In addition, the problem is largely oversubscribed: there are much more candidate observations than can possibly be achieved. Therefore, one must select the set of observations that will be performed while maximizing their cumulative benefit and propose a feasible schedule for these observations. As previous work mostly focused on heuristic and iterative search algorithms, this paper presents a new technique for selecting and scheduling observations based on Graph Neural Networks (GNNs) and Deep Reinforcement Learning (DRL). GNNs are used to extract relevant information from the graphs representing instances of the EOSP, and DRL drives the search for optimal schedules. A post-learning search step based on Monte Carlo Tree Search (MCTS) is added that is able to find even better solutions. Experiments show that it is able to learn on small problem instances and generalize to larger real-world instances, with very competitive performance compared to traditional approaches.

[325] Co-PatcheR: Collaborative Software Patching with Component(s)-specific Small Reasoning Models

Yuheng Tang, Hongwei Li, Kaijie Zhu, Michael Yang, Yangruibo Ding, Wenbo Guo

Main category: cs.AI

TL;DR: Co-PatcheR is a collaborative patching system using specialized small models that achieves 46% resolved rate on SWE-bench-Verified with only 3×14B models, outperforming SOTA methods that use a single 70B model.

Details

Motivation: Single models struggle with the entire patching pipeline (localization, generation, validation) as different sub-tasks require different expertise. SOTA methods using one 70B model only achieve 41% resolved rate.

Method: Uses collaborative specialized models: 1) Localization model with two-step suspicious line pinpointing, 2) Generation model combining patch generation and critique, 3) Hybrid validation with two models for test case creation and correctness judgment, plus majority vote-based patch selection.

Result: Achieves 46% resolved rate on SWE-bench-Verified with only 3×14B models, making it the best specialized patcher with minimal training resources and smallest models.

Conclusion: Collaborative specialized models outperform single large models in software patching, demonstrating that task-specific designs and training recipes enable better performance with smaller models and fewer resources.

Abstract: Motivated by the success of general-purpose large language models (LLMs) in software patching, recent works started to train specialized patching models. Most works trained one model to handle the end-to-end patching pipeline (including issue localization, patch generation, and patch validation). However, it is hard for a small model to handle all tasks, as different sub-tasks have different workflows and require different expertise. As such, by using a 70 billion model, SOTA methods can only reach up to 41% resolved rate on SWE-bench-Verified. Motivated by the collaborative nature, we propose Co-PatcheR, the first collaborative patching system with small and specialized reasoning models for individual components. Our key technique novelties are the specific task designs and training recipes. First, we train a model for localization and patch generation. Our localization pinpoints the suspicious lines through a two-step procedure, and our generation combines patch generation and critique. We then propose a hybrid patch validation that includes two models for crafting issue-reproducing test cases with and without assertions and judging patch correctness, followed by a majority vote-based patch selection. Through extensive evaluation, we show that Co-PatcheR achieves 46% resolved rate on SWE-bench-Verified with only 3 x 14B models. This makes Co-PatcheR the best patcher with specialized models, requiring the least training resources and the smallest models. We conduct a comprehensive ablation study to validate our recipes, as well as our choice of training data number, model size, and testing-phase scaling strategy.

[326] Safe and Economical UAV Trajectory Planning in Low-Altitude Airspace: A Hybrid DRL-LLM Approach with Compliance Awareness

Yanwei Gong, Junchao Fan, Ruichen Zhang, Dusit Niyato, Yingying Yao, Xiaolin Chang

Main category: cs.AI

TL;DR: Proposes a UAV trajectory planning framework combining deep reinforcement learning with large language model reasoning to address urban airspace constraints and economic efficiency in low-altitude economy contexts.

Details

Motivation: The rapid growth of low-altitude economy and widespread UAV adoption creates challenges for trajectory planning in complex urban environments, with existing studies overlooking key factors like airspace constraints and economic efficiency.

Method: Novel framework that combines deep reinforcement learning (DRL) with large language model (LLM) reasoning for UAV trajectory planning.

Result: Significantly outperforms existing baselines across multiple metrics: data collection rate, collision avoidance, successful landing, regulatory compliance, and energy efficiency.

Conclusion: The approach effectively addresses key UAV trajectory planning challenges under low-altitude economy networking constraints, validating the combination of DRL and LLM reasoning.

Abstract: The rapid growth of the low-altitude economy has driven the widespread adoption of unmanned aerial vehicles (UAVs). This growing deployment presents new challenges for UAV trajectory planning in complex urban environments. However, existing studies often overlook key factors, such as urban airspace constraints and economic efficiency, which are essential in low-altitude economy contexts. Deep reinforcement learning (DRL) is regarded as a promising solution to these issues, while its practical adoption remains limited by low learning efficiency. To overcome this limitation, we propose a novel UAV trajectory planning framework that combines DRL with large language model (LLM) reasoning to enable safe, compliant, and economically viable path planning. Experimental results demonstrate that our method significantly outperforms existing baselines across multiple metrics, including data collection rate, collision avoidance, successful landing, regulatory compliance, and energy efficiency. These results validate the effectiveness of our approach in addressing UAV trajectory planning key challenges under constraints of the low-altitude economy networking.

[327] CoMind: Towards Community-Driven Agents for Machine Learning Engineering

Sijie Li, Weiwei Sun, Shanda Li, Ameet Talwalkar, Yiming Yang

Main category: cs.AI

TL;DR: CoMind is a multi-agent system that integrates external knowledge from simulated research communities to automate ML engineering, achieving top performance in Kaggle competitions.

Details

Motivation: Existing LLM agents operate in isolation without engaging with research communities, while human researchers benefit from collective knowledge sharing.

Method: MLE-Live evaluation framework for assessing agent communication with simulated Kaggle community, plus CoMind multi-agent system with iterative parallel exploration to develop multiple solutions simultaneously.

Result: 36% medal rate on 75 past Kaggle competitions, outperforming 92.6% of human competitors in live competitions with top 5% placements on three leaderboards and top 1% on one.

Conclusion: CoMind demonstrates that integrating collective knowledge through multi-agent systems can significantly enhance automated ML engineering performance, bridging the gap between isolated AI agents and collaborative human research practices.

Abstract: Large language model (LLM) agents show promise in automating machine learning (ML) engineering. However, existing agents typically operate in isolation on a given research problem, without engaging with the broader research community, where human researchers often gain insights and contribute by sharing knowledge. To bridge this gap, we introduce MLE-Live, a live evaluation framework designed to assess an agent’s ability to communicate with and leverage collective knowledge from a simulated Kaggle research community. Building on this framework, we propose CoMind, an multi-agent system designed to actively integrate external knowledge. CoMind employs an iterative parallel exploration mechanism, developing multiple solutions simultaneously to balance exploratory breadth with implementation depth. On 75 past Kaggle competitions within our MLE-Live framework, CoMind achieves a 36% medal rate, establishing a new state of the art. Critically, when deployed in eight live, ongoing competitions, CoMind outperforms 92.6% of human competitors on average, placing in the top 5% on three official leaderboards and the top 1% on one.

[328] Augur: Modeling Covariate Causal Associations in Time Series via Large Language Models

Zhiqing Cui, Binwu Wang, Qingxiang Liu, Yeqiang Wang, Zhengyang Zhou, Yuxuan Liang, Yang Wang

Main category: cs.AI

TL;DR: Augur is an LLM-driven time series forecasting framework that uses causal reasoning to discover directed causal associations among covariates through a teacher-student architecture, improving accuracy and interpretability.

Details

Motivation: Existing LLM-based time series forecasting approaches have limitations including marginalized roles in model architectures, reliance on coarse statistical text prompts, and lack of interpretability.

Method: Two-stage teacher-student architecture: a powerful teacher LLM infers directed causal graphs using heuristic search and pairwise causality testing, then a lightweight student agent refines the graph and fine-tunes on high-confidence causal associations encoded as rich textual prompts.

Result: Extensive experiments on real-world datasets with 26 baselines demonstrate competitive performance and robust zero-shot generalization.

Conclusion: Augur improves predictive accuracy while providing transparent, traceable reasoning about variable interactions in time series forecasting.

Abstract: Large language models (LLM) have emerged as a promising avenue for time series forecasting, offering the potential to integrate multimodal data. However, existing LLM-based approaches face notable limitations-such as marginalized role in model architectures, reliance on coarse statistical text prompts, and lack of interpretability. In this work, we introduce Augur, a fully LLM driven time series forecasting framework that exploits LLM causal reasoning to discover and use directed causal associations among covariates. Augur uses a two stage teacher student architecture where a powerful teacher LLM infers a directed causal graph from time series using heuristic search together with pairwise causality testing. A lightweight student agent then refines the graph and fine tune on high confidence causal associations that are encoded as rich textual prompts to perform forecasting. This design improves predictive accuracy while yielding transparent, traceable reasoning about variable interactions. Extensive experiments on real-world datasets with 26 baselines demonstrate that Augur achieves competitive performance and robust zero-shot generalization.

[329] Operationalizing Pluralistic Values in Large Language Model Alignment Reveals Trade-offs in Safety, Inclusivity, and Model Behavior

Dalia Ali, Dora Zhao, Allison Koenecke, Orestis Papakyriakopoulos

Main category: cs.AI

TL;DR: The study examines how incorporating pluralistic human values affects LLM alignment, finding systematic demographic variations in preference ratings and showing that technical design choices significantly impact model behavior.

Details

Motivation: Current LLM alignment often overlooks human social diversity, focusing on safety and human values without considering pluralistic perspectives from different demographic groups.

Method: Collected alignment data from 1,095 US and German participants (27,375 ratings) across five dimensions, fine-tuned multiple LLMs using group-specific preferences while varying rating scales, disagreement handling, and optimization techniques.

Result: Found systematic demographic effects (e.g., males rated responses 18% less toxic than females), and technical choices showed strong impacts - preserving disagreement achieved 53% greater toxicity reduction than majority voting, and DPO outperformed GRPO in multi-value optimization.

Conclusion: The findings represent a preliminary step toward understanding how alignment should balance expert-driven and user-driven signals to ensure both safety and fair representation.

Abstract: Although large language models (LLMs) are increasingly trained using human feedback for safety and alignment with human values, alignment decisions often overlook human social diversity. This study examines how incorporating pluralistic values affects LLM behavior by systematically evaluating demographic variation and design parameters in the alignment pipeline. We collect alignment data from US and German participants (N = 1,095 participants, 27,375 ratings) who rated LLM responses across five dimensions: Toxicity, Emotional Awareness (EA), Sensitivity, Stereotypical Bias, and Helpfulness. We fine-tuned multiple Large Language Models and Large Reasoning Models using preferences from different social groups while varying rating scales, disagreement handling methods, and optimization techniques. The results revealed systematic demographic effects: male participants rated responses 18% less toxic than female participants; conservative and Black participants rated responses 27.9% and 44% higher on EA than liberal and White participants, respectively. Models fine-tuned on group-specific preferences exhibited distinct behaviors. Technical design choices showed strong effects: the preservation of rater disagreement achieved roughly 53% greater toxicity reduction than majority voting, and 5-point scales yielded about 22% more reduction than binary formats; and Direct Preference Optimization (DPO) consistently outperformed Group Relative Policy Optimization (GRPO) in multi-value optimization. These findings represent a preliminary step in answering a critical question: How should alignment balance expert-driven and user-driven signals to ensure both safety and fair representation?

[330] Heterogeneous Multi-Agent Proximal Policy Optimization for Power Distribution System Restoration

Parya Dolatyabi, Ali Farajzadeh Bavil, Mahdi Khodayar

Main category: cs.AI

TL;DR: Heterogeneous-Agent Reinforcement Learning (HARL) using HAPPO enables coordinated power distribution system restoration across interconnected microgrids with structural heterogeneity, outperforming traditional methods in convergence speed and restored power.

Details

Motivation: Conventional optimization and value-based RL approaches are computationally inefficient and difficult to scale for power distribution system restoration due to nonlinear constraints like power balance, voltage limits, and thermal ratings.

Method: Uses Heterogeneous-Agent Proximal Policy Optimization (HAPPO) with decentralized actor policies trained with a centralized critic. Each agent controls a distinct microgrid with different loads, DER capacities, and switch counts. Physics-informed OpenDSS environment provides power flow feedback with differentiable penalty signals.

Result: HAPPO achieves faster convergence, higher restored power, and smoother multi-seed training than DQN, PPO, MAES, MAGDPG, MADQN, Mean-Field RL, and QMIX on IEEE 123-bus and IEEE 8500-node systems.

Conclusion: Incorporating microgrid-level heterogeneity within the HARL framework yields a scalable, stable, and constraint-aware solution for complex power distribution system restoration.

Abstract: Restoring power distribution systems (PDS) after large-scale outages requires sequential switching operations that reconfigure feeder topology and coordinate distributed energy resources (DERs) under nonlinear constraints such as power balance, voltage limits, and thermal ratings. These challenges make conventional optimization and value-based RL approaches computationally inefficient and difficult to scale. This paper applies a Heterogeneous-Agent Reinforcement Learning (HARL) framework, instantiated through Heterogeneous-Agent Proximal Policy Optimization (HAPPO), to enable coordinated restoration across interconnected microgrids. Each agent controls a distinct microgrid with different loads, DER capacities, and switch counts, introducing practical structural heterogeneity. Decentralized actor policies are trained with a centralized critic to compute advantage values for stable on-policy updates. A physics-informed OpenDSS environment provides full power flow feedback and enforces operational limits via differentiable penalty signals rather than invalid action masking. The total DER generation is capped at 2400 kW, and each microgrid must satisfy local supply-demand feasibility. Experiments on the IEEE 123-bus and IEEE 8500-node systems show that HAPPO achieves faster convergence, higher restored power, and smoother multi-seed training than DQN, PPO, MAES, MAGDPG, MADQN, Mean-Field RL, and QMIX. Results demonstrate that incorporating microgrid-level heterogeneity within the HARL framework yields a scalable, stable, and constraint-aware solution for complex PDS restoration.

[331] KRAL: Knowledge and Reasoning Augmented Learning for LLM-assisted Clinical Antimicrobial Therapy

Zhe Li, Yehan Qiu, Yujie Chen, Xiang Zhou

Main category: cs.AI

TL;DR: KRAL is a low-cost, privacy-preserving paradigm that enhances clinical LLMs by distilling knowledge via reverse question generation, using heuristic learning for data augmentation, and agentic reinforcement learning to improve both medical knowledge and reasoning capabilities.

Details

Motivation: Address limitations of LLMs in clinical decision-making including knowledge gaps, privacy concerns, high costs, and limited reasoning capabilities for antimicrobial therapy.

Method: Uses teacher-model reasoning for knowledge distillation via answer-to-question reverse generation, heuristic learning for semi-supervised data augmentation (80% reduction in manual annotation), and agentic reinforcement learning to enhance knowledge and reasoning while optimizing efficiency.

Result: Outperforms RAG and SFT methods: Accuracy@1 on MEDQA increased by 1.8% vs. SFT and 3.6% vs. RAG; Pass@1 on PUMCH Antimicrobial increased by 27% vs. SFT and 27.2% vs. RAG, at about 20% of SFT’s long-term training costs.

Conclusion: KRAL establishes an effective solution for enhancing local LLMs’ clinical diagnostic capabilities, enabling low-cost, high-safety deployment in complex medical decision support.

Abstract: Clinical antimicrobial therapy requires the dynamic integration of pathogen profiles,host factors, pharmacological properties of antimicrobials,and the severity of infection. This complexity imposes fundamental limitations on the applicability of Large Language Models (LLMs) in high-stakes clinical decision-making including knowledge gaps, data privacy concerns, high deployment costs, and limited reasoning capabilities. To address these challenges, we propose KRAL (Knowledge and Reasoning Augmented Learning), a low-cost, scalable, privacy-preserving paradigm that leverages teacher-model reasoning to automatically distill knowledge and reasoning trajectories via answer-to-question reverse generation, employs heuristic learning for semi-supervised data augmentation (reducing manual annotation requirements by approximately 80%), and utilizes agentic reinforcement learning to jointly enhance medical knowledge and reasoning while optimizing computational and memory efficiency. A hierarchical evaluation employing diverse teacher-model proxies reduces assessment costs, while modular interface design facilitates seamless system updates. Experimental results demonstrate that KRAL significantly outperforms traditional Retrieval-Augmented Generation (RAG) and Supervised Fine-Tuning (SFT) methods. It improves knowledge question-answering capability (Accuracy@1 on the external open-source benchmark MEDQA increased by 1.8% vs. SFT and 3.6% vs. RAG) and reasoning capability (Pass@1 on the external benchmark PUMCH Antimicrobial increased by 27% vs. SFT and 27.2% vs. RAG), achieved at about 20% of SFT’s long-term training costs. This establishes KRAL as an effective solution for enhancing local LLMs’ clinical diagnostic capabilities, enabling low-cost, high-safety deployment in complex medical decision support.

[332] Simulated Self-Assessment in Large Language Models: A Psychometric Approach to AI Self-Efficacy

Daniel I Jackson, Emma L Jensen, Syed-Amad Hussain, Emre Sezgin

Main category: cs.AI

TL;DR: LLMs show stable but inaccurate self-efficacy assessments that don’t correlate with actual task performance, with higher self-efficacy linked to more anthropomorphic reasoning styles.

Details

Motivation: To evaluate LLMs' self-assessment capabilities beyond task accuracy, adapting psychological self-efficacy scales to understand how models perceive their own abilities.

Method: Adapted the 10-item General Self-Efficacy Scale (GSES) to test ten LLMs across four conditions (no task, computational reasoning, social reasoning, summarization), with follow-up confidence prompts and qualitative analysis.

Result: Models showed stable self-efficacy responses but scores were lower than human norms. Self-assessment didn’t correlate with actual performance - some low-scoring models performed well while high-scoring ones performed poorly. Higher self-efficacy correlated with more anthropomorphic reasoning.

Conclusion: Psychometric prompting provides insight into LLM communication behavior but not calibrated performance estimates, revealing a disconnect between self-assessment and actual ability.

Abstract: Self-assessment is a key aspect of reliable intelligence, yet evaluations of large language models (LLMs) focus mainly on task accuracy. We adapted the 10-item General Self-Efficacy Scale (GSES) to elicit simulated self-assessments from ten LLMs across four conditions: no task, computational reasoning, social reasoning, and summarization. GSES responses were highly stable across repeated administrations and randomized item orders. However, models showed significantly different self-efficacy levels across conditions, with aggregate scores lower than human norms. All models achieved perfect accuracy on computational and social questions, whereas summarization performance varied widely. Self-assessment did not reliably reflect ability: several low-scoring models performed accurately, while some high-scoring models produced weaker summaries. Follow-up confidence prompts yielded modest, mostly downward revisions, suggesting mild overestimation in first-pass assessments. Qualitative analysis showed that higher self-efficacy corresponded to more assertive, anthropomorphic reasoning styles, whereas lower scores reflected cautious, de-anthropomorphized explanations. Psychometric prompting provides structured insight into LLM communication behavior but not calibrated performance estimates.

[333] Failure Modes in LLM Systems: A System-Level Taxonomy for Reliable AI Applications

Vaishali Vinay

Main category: cs.AI

TL;DR: This paper presents a system-level taxonomy of 15 hidden failure modes in LLM applications and analyzes the gap in current evaluation practices, proposing design principles for reliable LLM systems.

Details

Motivation: LLMs are being rapidly integrated into decision-support tools and software systems, but their behavior in production environments remains poorly understood with failure patterns fundamentally different from traditional ML models.

Method: The authors develop a system-level taxonomy of fifteen hidden failure modes in real-world LLM applications and analyze the gap between current evaluation practices and production needs.

Result: Identified key failure modes including multi-step reasoning drift, latent inconsistency, context-boundary degradation, incorrect tool invocation, version drift, and cost-driven performance collapse.

Conclusion: LLM reliability should be framed as a system-engineering problem rather than purely model-centric, providing foundation for future research on evaluation methodology, AI system robustness, and dependable LLM deployment.

Abstract: Large language models (LLMs) are being rapidly integrated into decision-support tools, automation workflows, and AI-enabled software systems. However, their behavior in production environments remains poorly understood, and their failure patterns differ fundamentally from those of traditional machine learning models. This paper presents a system-level taxonomy of fifteen hidden failure modes that arise in real-world LLM applications, including multi-step reasoning drift, latent inconsistency, context-boundary degradation, incorrect tool invocation, version drift, and cost-driven performance collapse. Using this taxonomy, we analyze the growing gap in evaluation and monitoring practices: existing benchmarks measure knowledge or reasoning but provide little insight into stability, reproducibility, drift, or workflow integration. We further examine the production challenges associated with deploying LLMs - including observability limitations, cost constraints, and update-induced regressions - and outline high-level design principles for building reliable, maintainable, and cost-aware LLM systems. Finally, we outline high-level design principles for building reliable, maintainable, and cost-aware LLM-based systems. By framing LLM reliability as a system-engineering problem rather than a purely model-centric one, this work provides an analytical foundation for future research on evaluation methodology, AI system robustness, and dependable LLM deployment.

[334] Universe of Thoughts: Enabling Creative Reasoning with Large Language Models

Yuto Suzuki, Farnoush Banaei-Kashani

Main category: cs.AI

TL;DR: The paper introduces a computational framework for creative reasoning in LLMs, proposing three paradigms (combinational, exploratory, transformative) and implementing them through Universe of Thoughts (UoT) methods, showing superior performance in creative problem-solving tasks.

Details

Motivation: Current LLM reasoning methods focus on conventional problem-solving but lack creative reasoning capabilities needed for domains with expansive solution spaces like drug discovery and business strategization where innovative solutions are crucial.

Method: Proposed a computational framework with three creative reasoning paradigms (combinational, exploratory, transformative) and implemented them through Universe of Thoughts (UoT) methods using LLMs.

Result: UoT demonstrated superior performance in creative reasoning compared to state-of-the-art reasoning techniques and commercial models, as evaluated through three novel tasks assessing feasibility, utility, and novelty.

Conclusion: The Universe of Thoughts framework successfully enables LLMs to perform creative reasoning, addressing the gap in current reasoning methods and showing promising results for applications requiring innovative solutions.

Abstract: Reasoning based on Large Language Models (LLMs) has garnered increasing attention due to outstanding performance of these models in mathematical and complex logical tasks. Beginning with the Chain-of-Thought (CoT) prompting technique, numerous reasoning methods have emerged that decompose problems into smaller, sequential steps (or thoughts). However, existing reasoning models focus on conventional problem-solving and do not necessarily generate creative solutions by ``creative reasoning’’. In domains where the solution space is expansive and conventional solutions are suboptimal, such as drug discovery or business strategization, creative reasoning to discover innovative solutions is crucial. To address this gap, first we introduce a computational framework for creative reasoning inspired by established cognitive science principles. With this framework, we propose three core creative reasoning paradigms, namely, \textit{combinational}, \textit{exploratory}, and \textit{transformative} reasoning, where each offers specific directions for systematic exploration of the universe of thoughts to generate creative solutions. Next, to materialize this framework using LLMs, we introduce the \textit{Universe of Thoughts} (or \textit{UoT}, for short), a novel set of methods to implement the aforementioned three creative processes. Finally, we introduce three novel tasks that necessitate creative problem-solving, along with an evaluation benchmark to assess creativity from three orthogonal perspectives: feasibility as constraint, and utility and novelty as metrics. With a comparative analysis against the state-of-the-art (SOTA) reasoning techniques as well as representative commercial models with reasoning capability, we show that UoT demonstrates superior performance in creative reasoning.

[335] FRAGMENTA: End-to-end Fragmentation-based Generative Model with Agentic Tuning for Drug Lead Optimization

Yuto Suzuki, Paul Awolade, Daniel V. LaBarbera, Farnoush Banaei-Kashani

Main category: cs.AI

TL;DR: FRAGMENTA is an end-to-end framework for drug lead optimization that combines a novel generative model using dynamic Q-learning for fragmentation and generation with an agentic AI system that learns from expert feedback, achieving superior performance in cancer drug discovery.

Details

Motivation: Molecule generation for drug discovery faces challenges with limited class-specific datasets (often <100 examples), heuristic fragmentation limiting diversity, and slow human-AI collaboration requiring medicinal chemists and AI engineers to work indirectly.

Method: 1) Novel generative model reframing fragmentation as vocabulary selection using dynamic Q-learning to jointly optimize fragmentation and generation; 2) Agentic AI system that refines objectives via conversational feedback from domain experts, removing AI engineers from the loop and progressively learning domain knowledge.

Result: In real-world cancer drug discovery experiments, FRAGMENTA’s Human-Agent configuration identified nearly twice as many high-scoring molecules as baselines. The fully autonomous Agent-Agent system outperformed traditional Human-Human tuning.

Conclusion: Agentic tuning effectively captures expert intent and demonstrates the efficacy of autonomous systems in drug discovery, with FRAGMENTA showing superior performance over traditional approaches.

Abstract: Molecule generation using generative AI is vital for drug discovery, yet class-specific datasets often contain fewer than 100 training examples. While fragment-based models handle limited data better than atom-based approaches, existing heuristic fragmentation limits diversity and misses key fragments. Additionally, model tuning typically requires slow, indirect collaboration between medicinal chemists and AI engineers. We introduce FRAGMENTA, an end-to-end framework for drug lead optimization comprising: 1) a novel generative model that reframes fragmentation as a “vocabulary selection” problem, using dynamic Q-learning to jointly optimize fragmentation and generation; and 2) an agentic AI system that refines objectives via conversational feedback from domain experts. This system removes the AI engineer from the loop and progressively learns domain knowledge to eventually automate tuning. In real-world cancer drug discovery experiments, FRAGMENTA’s Human-Agent configuration identified nearly twice as many high-scoring molecules as baselines. Furthermore, the fully autonomous Agent-Agent system outperformed traditional Human-Human tuning, demonstrating the efficacy of agentic tuning in capturing expert intent.

[336] PaTAS: A Parallel System for Trust Propagation in Neural Networks Using Subjective Logic

Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos, Dennis Eisermann, Frank Kargl

Main category: cs.AI

TL;DR: PaTAS is a framework for modeling trust in neural networks using Subjective Logic, operating in parallel with standard computation to provide interpretable trust estimates that complement accuracy metrics.

Details

Motivation: Conventional evaluation metrics like accuracy fail to capture uncertainty and reliability of model predictions, especially under adversarial or degraded conditions, which is critical for safety-critical AI applications.

Method: Uses Trust Nodes and Trust Functions to propagate input, parameter, and activation trust across networks, with Parameter Trust Update during training and Inference-Path Trust Assessment at inference.

Result: Produces interpretable, symmetric, and convergent trust estimates that effectively distinguish between benign/adversarial inputs and identify reliability gaps in poisoned, biased, or uncertain data scenarios.

Conclusion: PaTAS provides a principled foundation for transparent and quantifiable trust reasoning within neural architectures, enabling reliable model evaluation across the AI lifecycle.

Abstract: Trustworthiness has become a key requirement for the deployment of artificial intelligence systems in safety-critical applications. Conventional evaluation metrics such as accuracy and precision fail to capture uncertainty or the reliability of model predictions, particularly under adversarial or degraded conditions. This paper introduces the Parallel Trust Assessment System (PaTAS), a framework for modeling and propagating trust in neural networks using Subjective Logic (SL). PaTAS operates in parallel with standard neural computation through Trust Nodes and Trust Functions that propagate input, parameter, and activation trust across the network. The framework defines a Parameter Trust Update mechanism to refine parameter reliability during training and an Inference-Path Trust Assessment (IPTA) method to compute instance-specific trust at inference. Experiments on real-world and adversarial datasets demonstrate that PaTAS produces interpretable, symmetric, and convergent trust estimates that complement accuracy and expose reliability gaps in poisoned, biased, or uncertain data scenarios. The results show that PaTAS effectively distinguishes between benign and adversarial inputs and identifies cases where model confidence diverges from actual reliability. By enabling transparent and quantifiable trust reasoning within neural architectures, PaTAS provides a principled foundation for evaluating model reliability across the AI lifecycle.

cs.SD

[337] Seeing Beyond Sound: Visualization and Abstraction in Audio Data Representation

Ashlae Blum’e

Main category: cs.SD

TL;DR: Adding dimensionality and interactivity to audio visualization tools improves alignment with modern workflows and enhances analytical/creative outputs.

Details

Motivation: Traditional audio software tools carry hidden historical assumptions that misalign with modern workflows, limiting their effectiveness for complex audio information research.

Method: Explores adding dimensionality and interactivity to visualization tools using the Jellyfish Dynamite software as a case study.

Result: Enhanced visualization tools that better align with human perceptual systems can improve pattern recognition and workflow efficiency in audio signal processing.

Conclusion: Creating tools that align with emergent needs through increased dimensionality and interactivity leads to improved analytical and creative outcomes in audio information research.

Abstract: In audio signal processing, the interpretation of complex information using visual representation enhances pattern recognition through its alignment with human perceptual systems. Software tools that carry hidden assumptions inherited from their historical contexts risk misalignment with modern workflows as design origins become obscured. We argue that creating tools that align with emergent needs improves analytical and creative outputs due to an increased affinity for using them. This paper explores the potentials associated with adding dimensionality and interactivity into visualization tools to facilitate complex workflows in audio information research using the Jellyfish Dynamite software.

[338] Musical Score Understanding Benchmark: Evaluating Large Language Models’ Comprehension of Complete Musical Scores

Congren Dai, Yue Yang, Krinos Li, Huichi Zhou, Shijie Liang, Zhang Bo, Enyang Liu, Ge Jin, Hongran An, Haosen Zhang, Peiyuan Jing, KinHei Lee, Zhenxuan Zhang, Xiaobing Li, Maosong Sun

Main category: cs.SD

TL;DR: MSU-Bench is the first large-scale benchmark for evaluating AI models’ understanding of musical scores across text (ABC notation) and visual (PDF) modalities, with 1,800 QA pairs organized into four progressive comprehension levels.

Details

Motivation: Despite progress in LLMs and VLMs, their ability to comprehend musical notation remains underexplored, creating a gap in evaluating AI's musical score understanding capabilities.

Method: Created MSU-Bench with 1,800 human-curated QA pairs from classical composers’ works, organized into four comprehension levels, and evaluated 15+ SOTA models through zero-shot and fine-tuning experiments.

Result: Revealed sharp modality gaps, fragile level-wise success rates, and difficulty in sustaining multilevel correctness. Fine-tuning significantly improved performance while preserving general knowledge.

Conclusion: MSU-Bench establishes a rigorous foundation for future research at the intersection of AI, musicology, and multimodal reasoning, demonstrating the need for specialized approaches to musical score understanding.

Abstract: Understanding complete musical scores requires reasoning over symbolic structures such as pitch, rhythm, harmony, and form. Despite the rapid progress of Large Language Models (LLMs) and Vision-Language Models (VLMs) in natural language and multimodal tasks, their ability to comprehend musical notation remains underexplored. We introduce Musical Score Understanding Benchmark (MSU-Bench), the first large-scale, human-curated benchmark for evaluating score-level musical understanding across both textual (ABC notation) and visual (PDF) modalities. MSU-Bench comprises 1,800 generative question-answer (QA) pairs drawn from works spanning Bach, Beethoven, Chopin, Debussy, and others, organised into four progressive levels of comprehension: Onset Information, Notation & Note, Chord & Harmony, and Texture & Form. Through extensive zero-shot and fine-tuned evaluations of over 15+ state-of-the-art (SOTA) models, we reveal sharp modality gaps, fragile level-wise success rates, and the difficulty of sustaining multilevel correctness. Fine-tuning markedly improves performance in both modalities while preserving general knowledge, establishing MSU-Bench as a rigorous foundation for future research at the intersection of Artificial Intelligence (AI), musicological, and multimodal reasoning.

[339] SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications

Jionghao Han, Jiatong Shi, Masao Someki, Yuxun Tang, Lan Liu, Yiwen Zhao, Wenhao Feng, Shinji Watanabe

Main category: cs.SD

TL;DR: SingingSDS is a spoken dialogue system that responds through singing instead of speaking, using an ASR-LLM-SVS pipeline for affective interactions in roleplay and entertainment scenarios.

Details

Motivation: Most existing spoken dialogue systems are limited to conventional spoken responses, lacking affective and memorable interactions for character-based roleplay and entertainment.

Method: Uses a modular ASR-LLM-SVS pipeline with configurable components including character personas, ASR/LLM backends, SVS models, melody sources, and voice profiles.

Result: Developed a plug-and-play web demo with modular, open-source code that supports customization and extension for different latency, quality, and musical style needs.

Conclusion: SingingSDS enables more affective, memorable, and pleasurable interactions through singing responses, advancing spoken dialogue systems beyond conventional speech.

Abstract: With recent advances in automatic speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS) technologies, spoken dialogue systems (SDS) have become widely accessible. However, most existing SDS are limited to conventional spoken responses. We present SingingSDS, a cascaded SDS that responds through singing rather than speaking, fostering more affective, memorable, and pleasurable interactions in character-based roleplay and interactive entertainment scenarios. SingingSDS employs a modular ASR-LLM-SVS pipeline and supports a wide range of configurations across character personas, ASR and LLM backends, SVS models, melody sources, and voice profiles, tailored to different needs in terms of latency, quality, and musical style. SingingSDS is available as a plug-and-play web demo, featuring modular, open-source code that supports customization and extension. Demo: https://huggingface.co/spaces/espnet/SingingSDS. Code: https://github.com/SingingSDS/SingingSDS.

[340] CartoonSing: Unifying Human and Nonhuman Timbres in Singing Generation

Jionghao Han, Jiatong Shi, Zhuoyan Tao, Yuxun Tang, Yiwen Zhao, Gus Xia, Shinji Watanabe

Main category: cs.SD

TL;DR: CartoonSing is a unified framework for non-human singing generation that bridges human and non-human singing synthesis/conversion using a two-stage pipeline with score representation encoding and timbre-aware vocoding.

Details

Motivation: Existing singing voice systems are limited to human timbres and cannot generate voices outside the human range, which are increasingly needed for creative applications like video games, movies, and virtual characters.

Method: Two-stage pipeline: score representation encoder trained on annotated human singing data, and timbre-aware vocoder that reconstructs waveforms for both human and non-human audio.

Result: CartoonSing successfully generates non-human singing voices, generalizes to novel timbres, and extends conventional SVS/SVC toward creative non-human singing generation.

Conclusion: The proposed framework addresses challenges of data scarcity, lack of symbolic alignment, and timbral gap between human/non-human voices, enabling unified non-human singing generation.

Abstract: Singing voice synthesis (SVS) and singing voice conversion (SVC) have achieved remarkable progress in generating natural-sounding human singing. However, existing systems are restricted to human timbres and have limited ability to synthesize voices outside the human range, which are increasingly demanded in creative applications such as video games, movies, and virtual characters. We introduce Non-Human Singing Generation (NHSG), covering non-human singing voice synthesis (NHSVS) and non-human singing voice conversion (NHSVC), as a novel machine learning task for generating musically coherent singing with non-human timbral characteristics. NHSG is particularly challenging due to the scarcity of non-human singing data, the lack of symbolic alignment, and the wide timbral gap between human and non-human voices. To address these challenges, we propose CartoonSing, a unified framework that integrates singing voice synthesis and conversion while bridging human and non-human singing generation. CartoonSing employs a two-stage pipeline: a score representation encoder trained with annotated human singing and a timbre-aware vocoder that reconstructs waveforms for both human and non-human audio. Experiments demonstrate that CartoonSing successfully generates non-human singing voices, generalizes to novel timbres, and extends conventional SVS and SVC toward creative, non-human singing generation.

[341] Acoustic neural networks: Identifying design principles and exploring physical feasibility

Ivan Kalthoff, Marcel Rey, Raphael Wittkowski

Main category: cs.SD

TL;DR: A framework for designing acoustic neural networks that use sound waves for computation, with physically motivated constraints enabling realizable acoustic computing systems.

Details

Motivation: To enable low-power analog computing using acoustic systems where electronics are inefficient, by systematically designing acoustic neural networks that connect learnable components to measurable acoustic properties.

Method: Digital-twin approach training neural networks under physical constraints (non-negative signals/weights, no bias terms, intensity-based nonlinearities), developing constrained recurrent/hierarchical architectures and the SincHSRNN hybrid model combining learnable acoustic bandpass filters with hierarchical temporal processing.

Result: Achieved up to 95% accuracy on AudioMNIST dataset while remaining compatible with passive acoustic components; learned parameters correspond to measurable material/geometric properties like attenuation and transmission.

Conclusion: Established general design principles for physically realizable acoustic neural networks, outlining a pathway toward low-power, wave-based neural computing.

Abstract: Wave-guide-based physical systems provide a promising route toward energy-efficient analog computing beyond traditional electronics. Within this landscape, acoustic neural networks represent a promising approach for achieving low-power computation in environments where electronics are inefficient or limited, yet their systematic design has remained largely unexplored. Here we introduce a framework for designing and simulating acoustic neural networks, which perform computation through the propagation of sound waves. Using a digital-twin approach, we train conventional neural network architectures under physically motivated constraints including non-negative signals and weights, the absence of bias terms, and nonlinearities compatible with intensity-based, non-negative acoustic signals. Our work provides a general framework for acoustic neural networks that connects learnable network components directly to physically measurable acoustic properties, enabling the systematic design of realizable acoustic computing systems. We demonstrate that constrained recurrent and hierarchical architectures can perform accurate speech classification, and we propose the SincHSRNN, a hybrid model that combines learnable acoustic bandpass filters with hierarchical temporal processing. The SincHSRNN achieves up to 95% accuracy on the AudioMNIST dataset while remaining compatible with passive acoustic components. Beyond computational performance, the learned parameters correspond to measurable material and geometric properties such as attenuation and transmission. Our results establish general design principles for physically realizable acoustic neural networks and outline a pathway toward low-power, wave-based neural computing.

[342] Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale

Yicheng Zhong, Peiji Yang, Zhisheng Wang

Main category: cs.SD

TL;DR: Proposes multi-reward GRPO framework to optimize single-codebook TTS LLMs, addressing prosody instability and speaker drift through intelligibility, speaker similarity, and three rule-based rewards including LLM-annotated prosody alignment.

Details

Motivation: Single-codebook TTS LLMs are efficient but suffer from unstable prosody, speaker drift, and degraded naturalness, requiring direct optimization of token generation policies.

Method: Multi-reward Group Relative Policy Optimization (GRPO) with intelligibility, speaker similarity, length penalty, entropy regularization, and LLM-annotated prosody alignment reward using external reasoning LLM for pause structure prediction.

Result: Consistent enhancement of prosodic stability, speaker similarity, and overall speech naturalness across data sizes and model scales; additional gains when combined with flow-matching decoder.

Conclusion: The proposed GRPO framework effectively optimizes single-codebook TTS LLMs, improving their intrinsic autoregressive policy and demonstrating scalability across various configurations.

Abstract: Recent advances in Large Language Models (LLMs) have transformed text-to-speech (TTS) synthesis, inspiring autoregressive frameworks that represent speech as sequences of discrete codec tokens. Among them, single-codebook TTS LLMs have emerged as compact and streamable architectures that jointly model semantic and acoustic integration. However, despite their efficiency, these models often exhibit unstable prosody, speaker drift, and degraded naturalness. To address these issues, we propose a multi-reward Group Relative Policy Optimization (GRPO) framework that directly optimizes the token generation policy of single-codebook TTS LLMs. Beyond standard intelligibility and speaker similarity objectives, our design integrates three rule-based rewards: a length penalty for duration consistency, an entropy regularization reward for decoding stability, and an LLM-annotated prosody alignment reward that explicitly supervises rhythm. In this prosody reward, an external reasoning LLM predicts multiple plausible pause structures via in-context learning, providing a human-preference-aligned supervisory signal for GRPO training. To assess universality, we further attach a flow-matching (FM) decoder on top of the GRPO-optimized AR backbone and observe consistent additional gains, indicating that our reinforcement optimization enhances the intrinsic AR policy. We further conduct a scalability analysis across data sizes and model scales, revealing that the proposed method consistently enhances prosodic stability, speaker similarity, and overall speech naturalness in single-codebook TTS LLMs.

[343] SONAR: Spectral-Contrastive Audio Residuals for Generalizable Deepfake Detection

Ido Nitzan HIdekel, Gal lifshitz, Khen Cohen, Dan Raviv

Main category: cs.SD

TL;DR: SONAR is a frequency-guided framework that disentangles audio signals into low-frequency and high-frequency representations to improve deepfake audio detection by exploiting high-frequency artifacts that current detectors overlook.

Details

Motivation: Deepfake audio detectors struggle with generalization due to spectral bias, where neural networks focus on low-frequency content and miss high-frequency artifacts left by deepfake generators.

Method: Uses XLSR encoder for low-frequency content and cloned path with learnable SRM and high-pass filters for high-frequency residuals, then reunites them with frequency cross-attention and applies frequency-aware Jensen-Shannon contrastive loss.

Result: Achieves state-of-the-art performance on ASVspoof 2021 and in-the-wild benchmarks, converges four times faster than baselines, and creates distinct manifolds for genuine vs synthetic audio.

Conclusion: SONAR provides an architecture-agnostic framework that elevates high-frequency residuals as primary learning signals, enabling better detection of synthetic audio through frequency-guided contrastive learning.

Abstract: Deepfake (DF) audio detectors still struggle to generalize to out of distribution inputs. A central reason is spectral bias, the tendency of neural networks to learn low-frequency structure before high-frequency (HF) details, which both causes DF generators to leave HF artifacts and leaves those same artifacts under-exploited by common detectors. To address this gap, we propose Spectral-cONtrastive Audio Residuals (SONAR), a frequency-guided framework that explicitly disentangles an audio signal into complementary representations. An XLSR encoder captures the dominant low-frequency content, while the same cloned path, preceded by learnable SRM, value-constrained high-pass filters, distills faint HF residuals. Frequency cross-attention reunites the two views for long- and short-range frequency dependencies, and a frequency-aware Jensen-Shannon contrastive loss pulls real content-noise pairs together while pushing fake embeddings apart, accelerating optimization and sharpening decision boundaries. Evaluated on the ASVspoof 2021 and in-the-wild benchmarks, SONAR attains state-of-the-art performance and converges four times faster than strong baselines. By elevating faint high-frequency residuals to first-class learning signals, SONAR unveils a fully data-driven, frequency-guided contrastive framework that splits the latent space into two disjoint manifolds: natural-HF for genuine audio and distorted-HF for synthetic audio, thereby sharpening decision boundaries. Because the scheme operates purely at the representation level, it is architecture-agnostic and, in future work, can be seamlessly integrated into any model or modality where subtle high-frequency cues are decisive.

[344] Spike Encoding for Environmental Sound: A Comparative Benchmark

Andres Larroza, Javier Naranjo-Alcazar, Vicent Ortiz, Maximo Cobos, Pedro Zuccarello

Main category: cs.SD

TL;DR: Analysis of three spike encoding methods (TAE, SF, MW) for environmental sound processing in SNNs, showing TAE outperforms others in reconstruction quality, energy efficiency, and classification performance.

Details

Motivation: SNNs offer energy efficiency for edge applications, but environmental sound processing faces challenges like variable frequencies, background noise, and overlapping events, with most research focused on speech rather than environmental sounds.

Method: Analyzed three spike encoding methods (Threshold Adaptive Encoding, Step Forward, Moving Window) across three environmental sound datasets (ESC10, UrbanSound8K, TAU Urban Acoustic Scenes) using multiband analysis.

Result: TAE consistently outperformed SF and MW in reconstruction quality across frequency bands and classes, achieved lowest spike firing rates (best energy efficiency), and delivered best performance in downstream environmental sound classification with SNNs.

Conclusion: Provides foundational insights and comparative benchmark for selecting spike encoders in neuromorphic environmental sound processing, with TAE demonstrating superior performance across multiple metrics.

Abstract: Spiking Neural Networks (SNNs) offer energy efficient processing suitable for edge applications, but conventional sensor data must first be converted into spike trains for neuromorphic processing. Environmental sound, including urban soundscapes, poses challenges due to variable frequencies, background noise, and overlapping acoustic events, while most spike based audio encoding research has focused on speech. This paper analyzes three spike encoding methods, Threshold Adaptive Encoding (TAE), Step Forward (SF), and Moving Window (MW) across three datasets: ESC10, UrbanSound8K, and TAU Urban Acoustic Scenes. Our multiband analysis shows that TAE consistently outperforms SF and MW in reconstruction quality, both per frequency band and per class across datasets. Moreover, TAE yields the lowest spike firing rates, indicating superior energy efficiency. For downstream environmental sound classification with a standard SNN, TAE also achieves the best performance among the compared encoders. Overall, this work provides foundational insights and a comparative benchmark to guide the selection of spike encoders for neuromorphic environmental sound processing.

[345] Generating Separated Singing Vocals Using a Diffusion Model Conditioned on Music Mixtures

Genís Plaja-Roglans, Yun-Ning Hung, Xavier Serra, Igor Pereira

Main category: cs.SD

TL;DR: A diffusion model approach for singing voice separation from music recordings that achieves competitive performance and offers user control over quality-efficiency trade-offs.

Details

Motivation: To leverage the flexibility and generalization capabilities of generative diffusion models for the complex task of separating individual elements in musical mixtures, particularly singing voice separation.

Method: Train a diffusion model to generate solo vocals conditioned on the corresponding music mixture, using iterative diffusion sampling that allows user-configurable parameters for quality-efficiency trade-offs.

Result: The approach improves upon prior generative systems and achieves competitive objective scores against non-generative baselines when trained with supplementary data.

Conclusion: Diffusion models provide an effective framework for singing voice separation with user-controllable sampling parameters, enabling flexible quality-efficiency trade-offs and output refinement.

Abstract: Separating the individual elements in a musical mixture is an essential process for music analysis and practice. While this is generally addressed using neural networks optimized to mask or transform the time-frequency representation of a mixture to extract the target sources, the flexibility and generalization capabilities of generative diffusion models are giving rise to a novel class of solutions for this complicated task. In this work, we explore singing voice separation from real music recordings using a diffusion model which is trained to generate the solo vocals conditioned on the corresponding mixture. Our approach improves upon prior generative systems and achieves competitive objective scores against non-generative baselines when trained with supplementary data. The iterative nature of diffusion sampling enables the user to control the quality-efficiency trade-off, and also refine the output when needed. We present an ablation study of the sampling algorithm, highlighting the effects of the user-configurable parameters.

[346] HarmonicAttack: An Adaptive Cross-Domain Audio Watermark Removal

Kexin Li, Xiao Hu, Ilya Grishchenko, David Lie

Main category: cs.SD

TL;DR: HarmonicAttack is an efficient audio watermark removal method that uses a dual-path convolutional autoencoder with GAN training to remove watermarks from AI-generated audio, outperforming previous methods against state-of-the-art watermark schemes.

Details

Motivation: To address security challenges from AI-generated audio misuse by developing effective watermark removal techniques to objectively evaluate watermark robustness, as existing methods either require impractical knowledge or are computationally expensive.

Method: Uses a dual-path convolutional autoencoder operating in temporal and frequency domains with GAN-style training to separate watermarks from original audio, requiring only the ability to generate watermarks from the targeted scheme.

Result: Demonstrates superior watermark removal ability against AudioSeal, WavMark, and Silentcipher compared to previous methods, with near real-time performance and good transfer to out-of-distribution samples.

Conclusion: HarmonicAttack provides an efficient and practical watermark removal solution that can help objectively assess the robustness of audio watermarking schemes against real-world attacks.

Abstract: The availability of high-quality, AI-generated audio raises security challenges such as misinformation campaigns and voice-cloning fraud. A key defense against the misuse of AI-generated audio is by watermarking it, so that it can be easily distinguished from genuine audio. As those seeking to misuse AI-generated audio may thus seek to remove audio watermarks, studying effective watermark removal techniques is critical to being able to objectively evaluate the robustness of audio watermarks against removal. Previous watermark removal schemes either assume impractical knowledge of the watermarks they are designed to remove or are computationally expensive, potentially generating a false sense of confidence in current watermark schemes. We introduce HarmonicAttack, an efficient audio watermark removal method that only requires the basic ability to generate the watermarks from the targeted scheme and nothing else. With this, we are able to train a general watermark removal model that is able to remove the watermarks generated by the targeted scheme from any watermarked audio sample. HarmonicAttack employs a dual-path convolutional autoencoder that operates in both temporal and frequency domains, along with GAN-style training, to separate the watermark from the original audio. When evaluated against state-of-the-art watermark schemes AudioSeal, WavMark, and Silentcipher, HarmonicAttack demonstrates greater watermark removal ability than previous watermark removal methods with near real-time performance. Moreover, while HarmonicAttack requires training, we find that it is able to transfer to out-of-distribution samples with minimal degradation in performance.

[347] Harmonic-Percussive Disentangled Neural Audio Codec for Bandwidth Extension

Benoît Giniès, Xiaoyu Bie, Olivier Fercoq, Gaël Richard

Main category: cs.SD

TL;DR: Bandwidth extension framed as audio token prediction using transformer models on discrete representations from a novel disentangled neural audio codec guided by Harmonic-Percussive decomposition.

Details

Motivation: Traditional bandwidth extension approaches have evolved with signal processing trends, but recent neural architectures show improved performance across audio tasks. The paper aims to extend these advances by reframing bandwidth extension as a token prediction problem.

Method: Train transformer-based language model on discrete representations from a disentangled neural audio codec. The codec uses Harmonic-Percussive decomposition for disentanglement and is specifically designed for downstream token prediction tasks, enabling better coupling between codec structure and transformer modeling.

Result: Produces high-quality reconstructions of original signals as measured by both objective metrics and subjective evaluations. The joint design approach proves effective for bandwidth extension.

Conclusion: Highlights the importance of aligning codec disentanglement and representation learning with generative modeling stage. Demonstrates potential of global, representation-aware design for advancing bandwidth extension.

Abstract: Bandwidth extension, the task of reconstructing the high-frequency components of an audio signal from its low-pass counterpart, is a long-standing problem in audio processing. While traditional approaches have evolved alongside the broader trends in signal processing, recent advances in neural architectures have significantly improved performance across a wide range of audio tasks, In this work, we extend these advances by framing bandwidth extension as an audio token prediction problem. Specifically, we train a transformer-based language model on the discrete representations produced by a disentangled neural audio codec, where the disentanglement is guided by a Harmonic-Percussive decomposition of the input signals, highlighting spectral structures particularly relevant for bandwidth extension. Our approach introduces a novel codec design that explicitly accounts for the downstream token prediction task, enabling a more effective coupling between codec structure and transformer modeling. This joint design yields high-quality reconstructions of the original signal, as measured by both objective metrics and subjective evaluations. These results highlight the importance of aligning codec disentanglement and representation learning with the generative modeling stage, and demonstrate the potential of global, representation-aware design for advancing bandwidth extension.

cs.LG

[348] Prototype-Guided Non-Exemplar Continual Learning for Cross-subject EEG Decoding

Dan Li, Hye-Bin Shin, Yeon-Woo Choi

Main category: cs.LG

TL;DR: ProNECL framework enables continual EEG decoding across subjects without storing historical data, using prototype-guided learning and cross-subject feature alignment to prevent catastrophic forgetting.

Details

Motivation: EEG signals vary significantly between individuals, causing knowledge from previous subjects to be overwritten in continual learning. Privacy concerns and memory constraints make storing historical EEG data impractical.

Method: Constructs class-level prototypes to summarize discriminative representations from each subject. Uses cross-subject feature alignment and knowledge distillation to incrementally align new feature spaces with global prototype memory without accessing historical EEG samples.

Result: Validated on BCI Competition IV 2a and 2b datasets, effectively balances knowledge retention and adaptability, achieving superior performance in cross-subject continual EEG decoding tasks.

Conclusion: ProNECL provides an effective solution for continual EEG decoding that preserves prior knowledge without requiring storage of historical data, addressing both privacy and memory constraints.

Abstract: Due to the significant variability in electroencephalogram (EEG) signals across individuals, knowledge acquired from previous subjects is often overwritten as new subjects are introduced in continual EEG decoding task. Current works mainly rely on storing the historical data of seen subjects as a replay buffer to prevent forgetting. However, privacy concerns or memory constraints make keeping such data impractical. Instead, we propose a Prototype-guided Non-Exemplar Continual Learning (ProNECL)framework that preserves prior knowledge without accessing any historical EEG samples. ProNECL constructs class-level prototypes to summarize discriminative representations from each subject and incrementally aligns new feature spaces with the global prototype memory through cross-subject feature alignment and knowledge distillation. Validated on the BCI Competition IV 2a and 2b datasets, our framework effectively balances knowledge retention and adaptability, achieving superior performance in cross-subject continual EEG decoding tasks.

[349] On the Role of Hidden States of Modern Hopfield Network in Transformer

Tsubasa Masumura, Masato Taki

Main category: cs.LG

TL;DR: The paper establishes a generalized correspondence between modern Hopfield networks and Transformers by introducing modern Hopfield attention (MHA), which adds hidden states from MHN to self-attention, improving attention weights and addressing rank collapse and token uniformity issues in deep Transformers.

Details

Motivation: To go beyond the adiabatic approximation and investigate a more generalized relationship between modern Hopfield networks and self-attention in Transformers, aiming to improve attention mechanisms and address known problems in deep Transformers.

Method: Proposed modern Hopfield attention (MHA) by adding hidden states derived from modern Hopfield networks to self-attention, allowing inheritance of attention scores across Transformer layers and improving attention weight properties.

Result: MHA significantly improves rank collapse and token uniformity problems in deep Transformers both theoretically and empirically, and systematically improves accuracy without adding training parameters to Vision Transformer and GPT models.

Conclusion: Modern Hopfield networks provide a useful perspective for improving Transformer architecture through MHA, establishing a more generalized correspondence between associative memory models and attention mechanisms.

Abstract: Associative memory models based on Hopfield networks and self-attention based on key-value mechanisms have been popular approaches in the study of memory mechanisms in deep learning. It has been pointed out that the state update rule of the modern Hopfield network (MHN) in the adiabatic approximation is in agreement with the self-attention layer of Transformer. In this paper, we go beyond this approximation and investigate the relationship between MHN and self-attention. Our results show that the correspondence between Hopfield networks and Transformers can be established in a more generalized form by adding a new variable, the hidden state derived from the MHN, to self-attention. This new attention mechanism, modern Hopfield attention (MHA), allows the inheritance of attention scores from the input layer of the Transformer to the output layer, which greatly improves the nature of attention weights. In particular, we show both theoretically and empirically that MHA hidden states significantly improve serious problem of deep Transformers known as rank collapse and token uniformity. We also confirm that MHA can systematically improve accuracy without adding training parameters to the Vision Transformer or GPT. Our results provide a new case in which Hopfield networks can be a useful perspective for improving the Transformer architecture.

[350] Post-Pruning Accuracy Recovery via Data-Free Knowledge Distillation

Chinmay Tripurwar, Utkarsh Maurya, Dishant

Main category: cs.LG

TL;DR: Data-Free Knowledge Distillation framework for model pruning without access to original training data, using synthetic images from BN statistics inversion.

Details

Motivation: Privacy regulations restrict access to original training data post-deployment, creating a need for model compression methods that don't require real data.

Method: Use DeepInversion to synthesize privacy-preserving images from pre-trained teacher model by inverting Batch Normalization statistics, then distill knowledge to pruned student network.

Result: Significantly recovers accuracy lost during pruning on CIFAR-10 across various architectures (ResNet, MobileNet, VGG) without accessing real data.

Conclusion: Proposed framework successfully bridges model compression and data privacy by enabling effective pruning without original training data access.

Abstract: Model pruning is a widely adopted technique to reduce the computational complexity and memory footprint of Deep Neural Networks (DNNs). However, global unstructured pruning often leads to significant degradation in accuracy, typically necessitating fine-tuning on the original training dataset to recover performance. In privacy-sensitive domains such as healthcare or finance, access to the original training data is often restricted post-deployment due to regulations (e.g., GDPR, HIPAA). This paper proposes a Data-Free Knowledge Distillation framework to bridge the gap between model compression and data privacy. We utilize DeepInversion to synthesize privacy-preserving ``dream’’ images from the pre-trained teacher model by inverting Batch Normalization (BN) statistics. These synthetic images serve as a transfer set to distill knowledge from the original teacher to the pruned student network. Experimental results on CIFAR-10 across various architectures (ResNet, MobileNet, VGG) demonstrate that our method significantly recovers accuracy lost during pruning without accessing a single real data point.

[351] Pretraining Transformer-Based Models on Diffusion-Generated Synthetic Graphs for Alzheimer’s Disease Prediction

Abolfazl Moslemi, Hossein Peyvandi

Main category: cs.LG

TL;DR: Transformer-based AD diagnostic framework combining diffusion-based synthetic data generation with graph representation learning and transfer learning to address data limitations and class imbalance.

Details

Motivation: Early and accurate AD detection is crucial but challenging due to limited labeled data, multi-site heterogeneity, and class imbalance in clinical datasets.

Method: Class-conditional DDPM generates balanced synthetic data; modality-specific Graph Transformers pretrained on synthetic data then frozen; neural classifier trained on original data embeddings.

Result: Outperforms baselines (early/late fusion DNNs, MaGNet) with higher AUC, accuracy, sensitivity, and specificity under subject-wise cross-validation on NACC dataset.

Conclusion: Diffusion-based synthetic pretraining with Graph Transformers improves generalization in low-sample, imbalanced clinical prediction settings.

Abstract: Early and accurate detection of Alzheimer’s disease (AD) is crucial for enabling timely intervention and improving outcomes. However, developing reliable machine learning (ML) models for AD diagnosis is challenging due to limited labeled data, multi-site heterogeneity, and class imbalance. We propose a Transformer-based diagnostic framework that combines diffusion-based synthetic data generation with graph representation learning and transfer learning. A class-conditional denoising diffusion probabilistic model (DDPM) is trained on the real-world NACC dataset to generate a large synthetic cohort that mirrors multimodal clinical and neuroimaging feature distributions while balancing diagnostic classes. Modality-specific Graph Transformer encoders are first pretrained on this synthetic data to learn robust, class-discriminative representations and are then frozen while a neural classifier is trained on embeddings from the original NACC data. We quantify distributional alignment between real and synthetic cohorts using metrics such as Maximum Mean Discrepancy (MMD), Frechet distance, and energy distance, and complement discrimination metrics with calibration and fixed-specificity sensitivity analyses. Empirically, our framework outperforms standard baselines, including early and late fusion deep neural networks and the multimodal graph-based model MaGNet, yielding higher AUC, accuracy, sensitivity, and specificity under subject-wise cross-validation on NACC. These results show that diffusion-based synthetic pretraining with Graph Transformers can improve generalization in low-sample, imbalanced clinical prediction settings.

[352] Solving Diffusion Inverse Problems with Restart Posterior Sampling

Bilal Ahmed, Joseph G. Makin

Main category: cs.LG

TL;DR: RePS is an efficient framework for solving inverse problems using pre-trained diffusion models with restart-based sampling, avoiding expensive gradient backpropagation and working for both linear and non-linear measurement models.

Details

Motivation: Existing diffusion-based methods for inverse problems rely on strong posterior approximations, require expensive gradient computations, or are limited to linear models, creating a need for more general and efficient approaches.

Method: Uses restart-based sampling with conditioned ODE applicable to any differentiable measurement model, employing simplified restart strategy to contract approximation errors without backpropagation through score network.

Result: Achieves faster convergence and superior reconstruction quality compared to existing diffusion-based baselines across various linear and non-linear inverse problems.

Conclusion: RePS provides a general and computationally efficient framework for posterior sampling in inverse problems using pre-trained diffusion models, outperforming prior methods in both speed and quality.

Abstract: Inverse problems are fundamental to science and engineering, where the goal is to infer an underlying signal or state from incomplete or noisy measurements. Recent approaches employ diffusion models as powerful implicit priors for such problems, owing to their ability to capture complex data distributions. However, existing diffusion-based methods for inverse problems often rely on strong approximations of the posterior distribution, require computationally expensive gradient backpropagation through the score network, or are restricted to linear measurement models. In this work, we propose Restart for Posterior Sampling (RePS), a general and efficient framework for solving both linear and non-linear inverse problems using pre-trained diffusion models. RePS builds on the idea of restart-based sampling, previously shown to improve sample quality in unconditional diffusion, and extends it to posterior inference. Our method employs a conditioned ODE applicable to any differentiable measurement model and introduces a simplified restart strategy that contracts accumulated approximation errors during sampling. Unlike some of the prior approaches, RePS avoids backpropagation through the score network, substantially reducing computational cost. We demonstrate that RePS achieves faster convergence and superior reconstruction quality compared to existing diffusion-based baselines across a range of inverse problems, including both linear and non-linear settings.

[353] Active Slice Discovery in Large Language Models

Minhui Zhang, Prahar Ijner, Yoav Wald, Elliot Creager

Main category: cs.LG

TL;DR: Active Slice Discovery formalizes the process of identifying systematic error patterns in LLMs by actively grouping likely related errors and using limited manual annotation to verify shared mistake patterns, achieving competitive accuracy with only 2-10% of slice membership information.

Details

Motivation: LLMs often make systematic errors on specific data subsets (error slices), but identifying these slices requires extensive manual annotation. The goal is to reduce annotation burden while effectively discovering error patterns.

Method: Formalized Active Slice Discovery approach that actively groups likely related errors and uses limited annotator access to verify shared mistake patterns. Evaluated different feature representations and active learning algorithms for toxicity classification.

Result: Uncertainty-based active learning algorithms were most effective, achieving competitive accuracy using only 2-10% of available slice membership information, significantly outperforming baselines.

Conclusion: Active Slice Discovery is a promising approach for efficiently identifying systematic error patterns in LLMs with minimal annotation effort, particularly when using uncertainty-based active learning methods.

Abstract: Large Language Models (LLMs) often exhibit systematic errors on specific subsets of data, known as error slices. For instance, a slice can correspond to a certain demographic, where a model does poorly in identifying toxic comments regarding that demographic. Identifying error slices is crucial to understanding and improving models, but it is also challenging. An appealing approach to reduce the amount of manual annotation required is to actively group errors that are likely to belong to the same slice, while using limited access to an annotator to verify whether the chosen samples share the same pattern of model mistake. In this paper, we formalize this approach as Active Slice Discovery and explore it empirically on a problem of discovering human-defined slices in toxicity classification. We examine the efficacy of active slice discovery under different choices of feature representations and active learning algorithms. On several slices, we find that uncertainty-based active learning algorithms are most effective, achieving competitive accuracy using 2-10% of the available slice membership information, while significantly outperforming baselines.

[354] ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents Training

Chenliang Li, Adel Elmahdy, Alex Boyd, Zhongruo Wang, Alfredo Garcia, Parminder Bhatia, Taha Kass-Hout, Cao Xiao, Mingyi Hong

Main category: cs.LG

TL;DR: The paper introduces ST-PPO and S-PPO, two stabilized variants of PPO that address instability in multi-turn LLM training through turn-level importance sampling and clipping-bias correction.

Details

Motivation: Standard token-level PPO is unstable and prone to collapse in multi-turn dialogue and reasoning tasks due to misaligned importance sampling granularity and inaccurate advantage estimates from off-policy samples.

Method: Two stabilization techniques: (1) turn-level importance sampling that aligns with multi-turn reasoning structure, and (2) clipping-bias correction that normalizes gradients by downweighting unreliable off-policy samples. Three variants: Turn-PPO, S-PPO, and ST-PPO.

Result: ST-PPO and S-PPO prevent performance collapses in large-model training, maintain lower clipping ratios, and achieve higher task performance than standard token-level PPO across multi-turn search tasks in general QA, multi-hop QA, and medical multiple-choice QA benchmarks.

Conclusion: Combining turn-level importance sampling with clipping-bias correction provides a practical and scalable solution for stabilizing multi-turn LLM agent training.

Abstract: PPO has been widely adopted for training large language models (LLMs) at the token level in multi-turn dialogue and reasoning tasks. However, its performance is often unstable and prone to collapse. Through empirical analysis, we identify two main sources of instability in this setting: (1)~token-level importance sampling, which is misaligned with the natural granularity of multi-turn environments that have distinct turn-level stages, and (2) inaccurate advantage estimates from off-policy samples, where the critic has not learned to evaluate certain state-action pairs, resulting in high-variance gradients and unstable updates. To address these challenges, we introduce two complementary stabilization techniques: (1) turn-level importance sampling, which aligns optimization with the natural structure of multi-turn reasoning, and (2) clipping-bias correction, which normalizes gradients by downweighting unreliable, highly off-policy samples. Depending on how these components are combined, we obtain three variants: Turn-PPO (turn-level sampling only), S-PPO (clipping-bias correction applied to token-level PPO), and ST-PPO (turn-level sampling combined with clipping-bias correction). In our experiments, we primarily study ST-PPO and S-PPO, which together demonstrate how the two stabilization mechanisms address complementary sources of instability. Experiments on multi-turn search tasks across general QA, multi-hop QA, and medical multiple-choice QA benchmarks show that ST-PPO and S-PPO consistently prevent the performance collapses observed in large-model training, maintain lower clipping ratios throughout optimization, and achieve higher task performance than standard token-level PPO. These results demonstrate that combining turn-level importance sampling with clipping-bias correction provides a practical and scalable solution for stabilizing multi-turn LLM agent training.

[355] Gradient Descent Algorithm Survey

Deng Fucheng, Wang Wanjie, Gong Ao, Wang Xiaoqi, Wang Fan

Main category: cs.LG

TL;DR: Analysis of five major deep learning optimization algorithms (SGD, Mini-batch SGD, Momentum, Adam, Lion) focusing on their advantages, limitations, and practical configuration recommendations.

Details

Motivation: To address practical configuration needs of optimization algorithms in deep learning and provide standardized references for algorithm selection, parameter tuning, and performance improvement across different model scales and training scenarios.

Method: Systematic analysis of core advantages, limitations, and key practical recommendations for each of the five optimization algorithms.

Result: Comprehensive understanding of algorithm characteristics and practical guidelines for optimization algorithm configuration in deep learning applications.

Conclusion: Provides standardized reference framework for reasonable selection and parameter tuning of optimization algorithms to solve optimization challenges in various model scales and training scenarios.

Abstract: Focusing on the practical configuration needs of optimization algorithms in deep learning, this article concentrates on five major algorithms: SGD, Mini-batch SGD, Momentum, Adam, and Lion. It systematically analyzes the core advantages, limitations, and key practical recommendations of each algorithm. The research aims to gain an in-depth understanding of these algorithms and provide a standardized reference for the reasonable selection, parameter tuning, and performance improvement of optimization algorithms in both academic research and engineering practice, helping to solve optimization challenges in different scales of models and various training scenarios.

[356] Learning from Risk: LLM-Guided Generation of Safety-Critical Scenarios with Prior Knowledge

Yuhang Wang, Heye Huang, Zhenhua Xu, Kailai Sun, Baoshen Guo, Jinhua Zhao

Main category: cs.LG

TL;DR: A framework combining CVAE and LLM for generating high-fidelity driving scenarios that cover rare long-tail events and complex multi-agent interactions, enabling better safety validation of autonomous systems.

Details

Motivation: Address the scarcity of rare long-tail events and complex multi-agent interactions in real-world autonomous driving data, which are crucial for robust safety validation but hard to capture.

Method: Integrates conditional variational autoencoder (CVAE) to learn latent traffic structures from historical data and generate physically consistent base scenarios, then uses LLM as adversarial reasoning engine to parse scene descriptions into domain-specific loss functions and guide scenario generation across risk levels.

Result: Substantially increases coverage of high-risk and long-tail events, improves consistency between simulated and real-world traffic distributions, and exposes autonomous systems to more challenging interactions than existing methods in CARLA and SMARTS simulations.

Conclusion: Establishes a new pathway for safety validation by enabling principled stress-testing of autonomous systems under rare but consequential events through knowledge-driven scenario generation.

Abstract: Autonomous driving faces critical challenges in rare long-tail events and complex multi-agent interactions, which are scarce in real-world data yet essential for robust safety validation. This paper presents a high-fidelity scenario generation framework that integrates a conditional variational autoencoder (CVAE) with a large language model (LLM). The CVAE encodes historical trajectories and map information from large-scale naturalistic datasets to learn latent traffic structures, enabling the generation of physically consistent base scenarios. Building on this, the LLM acts as an adversarial reasoning engine, parsing unstructured scene descriptions into domain-specific loss functions and dynamically guiding scenario generation across varying risk levels. This knowledge-driven optimization balances realism with controllability, ensuring that generated scenarios remain both plausible and risk-sensitive. Extensive experiments in CARLA and SMARTS demonstrate that our framework substantially increases the coverage of high-risk and long-tail events, improves consistency between simulated and real-world traffic distributions, and exposes autonomous driving systems to interactions that are significantly more challenging than those produced by existing rule- or data-driven methods. These results establish a new pathway for safety validation, enabling principled stress-testing of autonomous systems under rare but consequential events.

[357] Spatio-Temporal Trajectory Foundation Model - Recent Advances and Future Directions

Sean Bin Yang, Ying Sun, Yunyao Cheng, Yan Lin, Kristian Torp, Jilin Hu

Main category: cs.LG

TL;DR: This tutorial provides a comprehensive overview of trajectory foundation models (TFMs), a subclass of spatio-temporal foundation models, covering recent advances, methodology taxonomy, strengths/limitations analysis, and future research directions.

Details

Motivation: Foundation models have shown success across scientific fields, but there's a lack of systematic investigation into trajectory foundation models (TFMs) despite their importance in spatio-temporal tasks.

Method: The tutorial offers a comprehensive review including taxonomy of existing TFM methodologies and critical analysis of their strengths and limitations.

Result: The work identifies that despite rapid progress in spatio-temporal foundation models, TFMs remain under-investigated and provides a systematic framework for understanding them.

Conclusion: The tutorial highlights open challenges and outlines promising research directions to advance spatio-temporal general intelligence through robust, responsible, and transferable TFMs.

Abstract: Foundation models (FMs) have emerged as a powerful paradigm, enabling a diverse range of data analytics and knowledge discovery tasks across scientific fields. Inspired by the success of FMs, particularly large language models, researchers have recently begun to explore spatio-temporal foundation models (STFMs) to improve adaptability and generalization across a wide spectrum of spatio-temporal (ST) tasks. Despite rapid progress, a systematic investigation of trajectory foundation models (TFMs), a crucial subclass of STFMs, is largely lacking. This tutorial addresses this gap by offering a comprehensive overview of recent advances in TFMs, including a taxonomy of existing methodologies and a critical analysis of their strengths and limitations. In addition, the tutorial highlights open challenges and outlines promising research directions to advance spatio-temporal general intelligence through the development of robust, responsible, and transferable TFMs.

[358] CHiQPM: Calibrated Hierarchical Interpretable Image Classification

Thomas Norrenbrock, Timo Kaiser, Sovan Biswas, Neslihan Kose, Ramesh Manuvinakurike, Bodo Rosenhahn

Main category: cs.LG

TL;DR: CHiQPM is a globally interpretable model that provides comprehensive hierarchical explanations for both global and local interpretability while maintaining high accuracy comparable to non-interpretable models.

Details

Motivation: To develop trustworthy AI for safety-critical domains by providing both global and detailed local explanations that support human experts during inference and enable human-AI complementarity.

Method: Calibrated Hierarchical QPM (CHiQPM) that offers hierarchical explanations similar to human reasoning, contrastive explanations for majority classes, and built-in interpretable Conformal Prediction (CP) through traversable hierarchical structures.

Result: CHiQPM achieves state-of-the-art accuracy (99% of non-interpretable models) as a point predictor and provides competitively efficient calibrated set predictions with interpretable coherent sets along hierarchical explanations.

Conclusion: CHiQPM demonstrates that interpretability can be incorporated without sacrificing overall accuracy, offering comprehensive global and local interpretability through hierarchical explanations that support human-AI complementarity.

Abstract: Globally interpretable models are a promising approach for trustworthy AI in safety-critical domains. Alongside global explanations, detailed local explanations are a crucial complement to effectively support human experts during inference. This work proposes the Calibrated Hierarchical QPM (CHiQPM) which offers uniquely comprehensive global and local interpretability, paving the way for human-AI complementarity. CHiQPM achieves superior global interpretability by contrastively explaining the majority of classes and offers novel hierarchical explanations that are more similar to how humans reason and can be traversed to offer a built-in interpretable Conformal prediction (CP) method. Our comprehensive evaluation shows that CHiQPM achieves state-of-the-art accuracy as a point predictor, maintaining 99% accuracy of non-interpretable models. This demonstrates a substantial improvement, where interpretability is incorporated without sacrificing overall accuracy. Furthermore, its calibrated set prediction is competitively efficient to other CP methods, while providing interpretable predictions of coherent sets along its hierarchical explanation.

[359] Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

Yusong Wu, Stephen Brade, Teng Ma, Tia-Jane Fowler, Enning Yang, Berker Banar, Aaron Courville, Natasha Jaques, Cheng-Zhi Anna Huang

Main category: cs.LG

TL;DR: Proposes adversarial training to mitigate reward hacking in RL post-training for melody-to-chord accompaniment, improving diversity while maintaining coherence in live jamming scenarios.

Details

Motivation: Live jamming requires real-time coordination and adaptation without future knowledge, but RL post-training often reduces output diversity through reward hacking, which harms musical creativity that relies on dynamic variation.

Method: Uses adversarial training with a co-evolving discriminator that separates policy trajectories from data distribution, while the policy maximizes discriminator output plus coherence rewards to prevent collapse to trivial outputs.

Result: Evaluation shows improved output diversity, harmonic coherence, adaptation speed and user agency in both simulation with fixed melodies/learned agents and real-time user study with expert musicians.

Conclusion: Demonstrates a simple yet effective method to mitigate reward hacking in RL post-training of generative sequence models for live jamming applications.

Abstract: Most applications of generative AI involve a sequential interaction in which a person inputs a prompt and waits for a response, and where reaction time and adaptivity are not important factors. In contrast, live jamming is a collaborative interaction that requires real-time coordination and adaptation without access to the other player’s future moves, while preserving diversity to sustain a creative flow. Reinforcement learning post-training enables effective adaptation through on-policy interaction, yet it often reduces output diversity by exploiting coherence-based rewards. This collapse, known as ``reward hacking’’, affects many RL post-training pipelines, but is especially harmful in live jamming, where musical creativity relies on dynamic variation and mutual responsiveness. In this paper, we propose a novel adversarial training method on policy-generated trajectories to mitigate reward hacking in RL post-training for melody-to-chord accompaniment. A co-evolving discriminator separates policy trajectories from the data distribution, while the policy maximizes the discriminator output in addition to coherence rewards to prevent collapse to trivial outputs. We evaluate accompaniment quality and output diversity in simulation with both fixed test melodies and learned melody agents, and we conduct a user study with the model deployed in a real-time interactive system with expert musicians. Quantitative evaluation and user feedback demonstrate improved output diversity, harmonic coherence, adaptation speed and user agency. Our results demonstrate a simple yet effective method to mitigate reward hacking in RL post-training of generative sequence models.

[360] Physics Steering: Causal Control of Cross-Domain Concepts in a Physics Foundation Model

Rio Alexa Fear, Payel Mukhopadhyay, Michael McCabe, Alberto Bietti, Miles Cranmer

Main category: cs.LG

TL;DR: Physics foundation models develop internal representations of abstract physical concepts that can be manipulated to causally control model behavior and steer predictions.

Details

Motivation: To investigate whether the phenomenon of internal abstract concept representations found in language and vision models extends to scientific foundation models, particularly physics models.

Method: Extracted activation vectors from a physics model during forward passes over different physical regimes, computed ‘delta’ representations between regimes as concept directions, and injected these directions back during inference.

Result: Successfully steered model predictions by manipulating concept directions, demonstrating causal control over physical behaviors like inducing or removing specific physical features from simulations.

Conclusion: Scientific foundation models learn generalized representations of physical principles rather than superficial correlations, opening new avenues for understanding and controlling AI in scientific discovery.

Abstract: Recent advances in mechanistic interpretability have revealed that large language models (LLMs) develop internal representations corresponding not only to concrete entities but also distinct, human-understandable abstract concepts and behaviour. Moreover, these hidden features can be directly manipulated to steer model behaviour. However, it remains an open question whether this phenomenon is unique to models trained on inherently structured data (ie. language, images) or if it is a general property of foundation models. In this work, we investigate the internal representations of a large physics-focused foundation model. Inspired by recent work identifying single directions in activation space for complex behaviours in LLMs, we extract activation vectors from the model during forward passes over simulation datasets for different physical regimes. We then compute “delta” representations between the two regimes. These delta tensors act as concept directions in activation space, encoding specific physical features. By injecting these concept directions back into the model during inference, we can steer its predictions, demonstrating causal control over physical behaviours, such as inducing or removing some particular physical feature from a simulation. These results suggest that scientific foundation models learn generalised representations of physical principles. They do not merely rely on superficial correlations and patterns in the simulations. Our findings open new avenues for understanding and controlling scientific foundation models and has implications for AI-enabled scientific discovery.

[361] Conformal Safety Monitoring for Flight Testing: A Case Study in Data-Driven Safety Learning

Aaron O. Feldman, D. Isaiah Harp, Joseph Duncan, Mac Schwager

Main category: cs.LG

TL;DR: A data-driven approach for runtime safety monitoring in flight testing that uses stochastic trajectory simulation and conformal prediction to provide calibrated risk assessments for pilots to abort maneuvers before safety violations occur.

Details

Motivation: Flight testing involves inherent safety risks due to uncertain aircraft parameters, requiring preemptive criteria for pilots to abort maneuvers before safety violations happen unexpectedly.

Method: Three components: 1) Model to predict future states from recent observations, 2) Nearest neighbor model to classify safety of predicted states, 3) Classifier calibration via conformal prediction using offline stochastic trajectory simulation.

Result: The method reliably identifies unsafe scenarios, matches theoretical guarantees, and outperforms baseline approaches in preemptive risk classification on a flight dynamics model with uncertain parameters.

Conclusion: The approach provides a practical framework for runtime safety monitoring that combines prediction, classification, and calibration to enable timely intervention in high-risk scenarios.

Abstract: We develop a data-driven approach for runtime safety monitoring in flight testing, where pilots perform maneuvers on aircraft with uncertain parameters. Because safety violations can arise unexpectedly as a result of these uncertainties, pilots need clear, preemptive criteria to abort the maneuver in advance of safety violation. To solve this problem, we use offline stochastic trajectory simulation to learn a calibrated statistical model of the short-term safety risk facing pilots. We use flight testing as a motivating example for data-driven learning/monitoring of safety due to its inherent safety risk, uncertainty, and human-interaction. However, our approach consists of three broadly-applicable components: a model to predict future state from recent observations, a nearest neighbor model to classify the safety of the predicted state, and classifier calibration via conformal prediction. We evaluate our method on a flight dynamics model with uncertain parameters, demonstrating its ability to reliably identify unsafe scenarios, match theoretical guarantees, and outperform baseline approaches in preemptive classification of risk.

[362] Effects of Initialization Biases on Deep Neural Network Training Dynamics

Nicholas Pellegrino, David Szczecina, Paul W. Fieguth

Main category: cs.LG

TL;DR: Untrained neural networks exhibit Initial Guessing Bias, favoring few classes after random initialization. Loss functions like Blurry and Piecewise-zero loss designed for label error robustness struggle with this bias, affecting early training dynamics.

Details

Motivation: To understand how Initial Guessing Bias in untrained neural networks affects early training dynamics and how different loss functions interact with this bias, particularly those designed for robustness to label errors.

Method: Analysis of how untrained large neural networks behave after random initialization, examining the impact of different loss functions (including Blurry and Piecewise-zero loss) on early training dynamics when Initial Guessing Bias is present.

Result: Initial Guessing Bias causes networks to favor a small subset of classes, assigning high probabilities to few classes and near-zero to others. Loss functions designed for label error robustness can become ineffective at steering training direction when exposed to this bias.

Conclusion: Loss function choice dramatically affects early phase training of networks, and careful consideration of how Initial Guessing Bias interacts with training scheme components is necessary for effective model training.

Abstract: Untrained large neural networks, just after random initialization, tend to favour a small subset of classes, assigning high predicted probabilities to these few classes and approximately zero probability to all others. This bias, termed Initial Guessing Bias, affects the early training dynamics, when the model is fitting to the coarse structure of the data. The choice of loss function against which to train the model has a large impact on how these early dynamics play out. Two recent loss functions, Blurry and Piecewise-zero loss, were designed for robustness to label errors but can become unable to steer the direction of training when exposed to this initial bias. Results indicate that the choice of loss function has a dramatic effect on the early phase training of networks, and highlights the need for careful consideration of how Initial Guessing Bias may interact with various components of the training scheme.

[363] Autoregressive Surrogate Modeling of the Solar Wind with Spherical Fourier Neural Operator

Reza Mansouri, Dustin Kempton, Pete Riley, Rafal Angryk

Main category: cs.LG

TL;DR: First autoregressive machine learning surrogate for steady-state solar wind radial velocity using Spherical Fourier Neural Operator (SFNO), outperforming traditional MHD models and HUX surrogate.

Details

Motivation: Traditional 3D magnetohydrodynamic models for solar wind prediction are computationally expensive, limiting rapid exploration of boundary condition uncertainties for space weather forecasting.

Method: Uses Spherical Fourier Neural Operator (SFNO) with autoregressive approach - predicts limited radial range and iteratively propagates solution outward to improve accuracy in distant regions.

Result: SFNO demonstrates superior or comparable performance to numerical HUX surrogate while providing flexible, trainable, data-driven alternative for solar wind modeling.

Conclusion: Establishes novel methodology for high-fidelity solar wind modeling with improved computational efficiency and accuracy.

Abstract: The solar wind, a continuous outflow of charged particles from the Sun’s corona, shapes the heliosphere and impacts space systems near Earth. Accurate prediction of features such as high-speed streams and coronal mass ejections is critical for space weather forecasting, but traditional three-dimensional magnetohydrodynamic (MHD) models are computationally expensive, limiting rapid exploration of boundary condition uncertainties. We introduce the first autoregressive machine learning surrogate for steady-state solar wind radial velocity using the Spherical Fourier Neural Operator (SFNO). By predicting a limited radial range and iteratively propagating the solution outward, the model improves accuracy in distant regions compared to a single-step approach. Compared with the numerical HUX surrogate, SFNO demonstrates superior or comparable performance while providing a flexible, trainable, and data-driven alternative, establishing a novel methodology for high-fidelity solar wind modeling. The source code and additional visual results are available at https://github.com/rezmansouri/solarwind-sfno-velocity-autoregressive.

[364] Multi-Agent Cross-Entropy Method with Monotonic Nonlinear Critic Decomposition

Yan Wang, Ke Deng, Yongli Ren

Main category: cs.LG

TL;DR: Proposes MCEM with nonlinear critic decomposition to overcome centralized-decentralized mismatch in multi-agent RL, enabling decentralized execution while maintaining sample efficiency.

Details

Motivation: Address the trade-off between expressiveness and gradient decentralization in multi-agent RL, where linear value decomposition limits expressiveness while nonlinear decomposition reintroduces centralized-decentralized mismatch.

Method: Multi-agent cross-entropy method (MCEM) updates policies by increasing probability of high-value joint actions, combined with monotonic nonlinear critic decomposition (NCD) and extended off-policy learning with modified k-step return and Retrace.

Result: MCEM outperforms state-of-the-art methods across both continuous and discrete action benchmarks in cooperative multi-agent reinforcement learning.

Conclusion: The proposed approach successfully overcomes the centralized-decentralized mismatch trade-off, enabling effective decentralized execution while maintaining representation power and sample efficiency.

Abstract: Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution (CTDE), where centralized critics leverage global information to guide decentralized actors. However, centralized-decentralized mismatch (CDM) arises when the suboptimal behavior of one agent degrades others’ learning. Prior approaches mitigate CDM through value decomposition, but linear decompositions allow per-agent gradients at the cost of limited expressiveness, while nonlinear decompositions improve representation but require centralized gradients, reintroducing CDM. To overcome this trade-off, we propose the multi-agent cross-entropy method (MCEM), combined with monotonic nonlinear critic decomposition (NCD). MCEM updates policies by increasing the probability of high-value joint actions, thereby excluding suboptimal behaviors. For sample efficiency, we extend off-policy learning with a modified k-step return and Retrace. Analysis and experiments demonstrate that MCEM outperforms state-of-the-art methods across both continuous and discrete action benchmarks.

[365] Primal: A Unified Deterministic Framework for Quasi-Orthogonal Hashing and Manifold Learning

Vladimer Khasia

Main category: cs.LG

TL;DR: Primal is a deterministic feature mapping framework using prime square roots to create robust vector representations with tunable properties, offering both static sequence generation and dynamic input-dependent projections.

Details

Motivation: To overcome limitations of stochastic projections like Random Fourier Features by creating deterministic, mathematically rigorous feature mappings with guaranteed non-repeating phase trajectories and tunable utility.

Method: Two algorithmic variants: StaticPrime for temporal position encodings using prime square roots, and DynamicPrime with tunable scaling parameter σ that enables both low-frequency isometric kernel mapping and high-frequency chaotic hashing.

Result: Superior orthogonality retention and distribution tightness compared to normalized Gaussian baselines, with computational efficiency and mathematical rigor.

Conclusion: Primal provides a deterministic alternative to random matrix projections with tunable properties suitable for various applications from signal processing to privacy-preserving computing.

Abstract: We present Primal, a deterministic feature mapping framework that harnesses the number-theoretic independence of prime square roots to construct robust, tunable vector representations. Diverging from standard stochastic projections (e.g., Random Fourier Features), our method exploits the Besicovitch property to create irrational frequency modulations that guarantee infinite non-repeating phase trajectories. We formalize two distinct algorithmic variants: (1) StaticPrime, a sequence generation method that produces temporal position encodings empirically approaching the theoretical Welch bound for quasi-orthogonality; and (2) DynamicPrime, a tunable projection layer for input-dependent feature mapping. A central novelty of the dynamic framework is its ability to unify two disparate mathematical utility classes through a single scaling parameter σ. In the low-frequency regime, the method acts as an isometric kernel map, effectively linearizing non-convex geometries (e.g., spirals) to enable high-fidelity signal reconstruction and compressive sensing. Conversely, the high-frequency regime induces chaotic phase wrapping, transforming the projection into a maximum-entropy one-way hash suitable for Hyperdimensional Computing and privacy-preserving Split Learning. Empirical evaluations demonstrate that our framework yields superior orthogonality retention and distribution tightness compared to normalized Gaussian baselines, establishing it as a computationally efficient, mathematically rigorous alternative to random matrix projections. The code is available at https://github.com/VladimerKhasia/primal

[366] Pre-train to Gain: Robust Learning Without Clean Labels

David Szczecina, Nicholas Pellegrino, Paul Fieguth

Main category: cs.LG

TL;DR: Self-supervised pre-training improves noise robustness in deep learning without requiring clean labeled data, outperforming ImageNet pre-trained models under high noise conditions.

Details

Motivation: Training deep networks with noisy labels leads to poor generalization and degraded accuracy due to overfitting to label noise, and existing approaches often require clean labeled subsets.

Method: Pre-train feature extractor backbone without labels using self-supervised learning (SimCLR and Barlow Twins), followed by standard supervised training on noisy datasets.

Result: Self-supervised pre-training consistently improves classification accuracy and label-error detection across all noise rates, with performance gap widening as noise increases. Comparable to ImageNet pre-trained models at low noise, substantially better at high noise.

Conclusion: Self-supervised pre-training provides an effective approach for learning with noisy labels without requiring clean data subsets, demonstrating improved robustness especially under high noise conditions.

Abstract: Training deep networks with noisy labels leads to poor generalization and degraded accuracy due to overfitting to label noise. Existing approaches for learning with noisy labels often rely on the availability of a clean subset of data. By pre-training a feature extractor backbone without labels using self-supervised learning (SSL), followed by standard supervised training on the noisy dataset, we can train a more noise robust model without requiring a subset with clean labels. We evaluate the use of SimCLR and Barlow~Twins as SSL methods on CIFAR-10 and CIFAR-100 under synthetic and real world noise. Across all noise rates, self-supervised pre-training consistently improves classification accuracy and enhances downstream label-error detection (F1 and Balanced Accuracy). The performance gap widens as the noise rate increases, demonstrating improved robustness. Notably, our approach achieves comparable results to ImageNet pre-trained models at low noise levels, while substantially outperforming them under high noise conditions.

[367] Selecting Belief-State Approximations in Simulators with Latent States

Nan Jiang

Main category: cs.LG

TL;DR: The paper addresses the problem of selecting belief-state samplers for state resetting in simulators with latent variables, showing it reduces to conditional distribution selection and revealing different formulations with varying guarantees depending on roll-out methods.

Details

Motivation: State resetting is crucial for sample-based planning and simulator calibration but becomes challenging in simulators with latent variables, requiring sampling from belief states. The problem of selecting among approximate belief-state samplers using only sampling access needs systematic treatment.

Method: The paper reduces the belief-state selection problem to conditional distribution selection and develops algorithms under sampling-only access. It presents two formulations: latent state-based selection (targeting latent state distribution) and observation-based selection (targeting observation distribution), analyzing their interactions with different roll-out methods (Single-Reset vs Repeated-Reset).

Result: The analysis reveals that observation-based selection may fail under Single-Reset roll-out but enjoys guarantees under Repeated-Reset. The paper provides theoretical insights into how different selection formulations interact with roll-out methods, highlighting distribution shift issues and sampling policy choices.

Conclusion: The seemingly simple problem of belief-state sampler selection reveals a rich landscape with nuanced algorithmic choices, theoretical trade-offs between different formulations, and open questions regarding distribution shift and sampling policy optimization.

Abstract: State resetting is a fundamental but often overlooked capability of simulators. It supports sample-based planning by allowing resets to previously encountered simulation states, and enables calibration of simulators using real data by resetting to states observed in real-system traces. While often taken for granted, state resetting in complex simulators can be nontrivial: when the simulator comes with latent variables (states), state resetting requires sampling from the posterior over the latent state given the observable history, a.k.a. the belief state (Silver and Veness, 2010). While exact sampling is often infeasible, many approximate belief-state samplers can be constructed, raising the question of how to select among them using only sampling access to the simulator. In this paper, we show that this problem reduces to a general conditional distribution-selection task and develop a new algorithm and analysis under sampling-only access. Building on this reduction, the belief-state selection problem admits two different formulations: latent state-based selection, which directly targets the conditional distribution of the latent state, and observation-based selection, which targets the induced distribution over the observation. Interestingly, these formulations differ in how their guarantees interact with the downstream roll-out methods: perhaps surprisingly, observation-based selection may fail under the most natural roll-out method (which we call Single-Reset) but enjoys guarantees under the less conventional alternative (which we call Repeated-Reset). Together with discussion on issues such as distribution shift and the choice of sampling policies, our paper reveals a rich landscape of algorithmic choices, theoretical nuances, and open questions, in this seemingly simple problem.

[368] Representation Integrity in Temporal Graph Learning Methods

Elahe Kooshafar

Main category: cs.LG

TL;DR: The paper proposes a framework called ‘representation integrity’ to evaluate dynamic graph embeddings by measuring how closely embedding changes follow actual graph changes, recommending a validated metric that correlates with link-prediction performance.

Details

Motivation: Conventional benchmarks for dynamic-graph learners focus on task-specific scores but don't assess whether embeddings truthfully reflect the evolving network structure over time.

Method: Formalized representation integrity concept and derived a family of indexes to measure embedding-graph alignment. Used three synthetic scenarios (Gradual Merge, Abrupt Move, Periodic Re-wiring) to screen 42 candidate indexes and identified one validated metric.

Result: The validated metric consistently ranks provably stable UASE and IPP models highest, exposes scenario-specific strengths of neural methods, and shows strong positive correlation with one-step link-prediction AUC.

Conclusion: The representation integrity framework provides a task-agnostic, interpretable evaluation tool for dynamic-graph representation quality, offering explicit guidance for model selection and future architecture design.

Abstract: Real-world systems ranging from airline routes to cryptocurrency transfers are naturally modelled as dynamic graphs whose topology changes over time. Conventional benchmarks judge dynamic-graph learners by a handful of task-specific scores, yet seldom ask whether the embeddings themselves remain a truthful, interpretable reflection of the evolving network. We formalize this requirement as representation integrity and derive a family of indexes that measure how closely embedding changes follow graph changes. Three synthetic scenarios, Gradual Merge, Abrupt Move, and Periodic Re-wiring, are used to screen forty-two candidate indexes. Based on which we recommend one index that passes all of our theoretical and empirical tests. In particular, this validated metric consistently ranks the provably stable UASE and IPP models highest. We then use this index to do a comparative study on representation integrity of common dynamic graph learning models. This study exposes the scenario-specific strengths of neural methods, and shows a strong positive rank correlation with one-step link-prediction AUC. The proposed integrity framework, therefore, offers a task-agnostic and interpretable evaluation tool for dynamic-graph representation quality, providing more explicit guidance for model selection and future architecture design.

[369] Probabilistic Hash Embeddings for Online Learning of Categorical Features

Aodong Li, Abishek Sankararaman, Balakrishnan Narayanaswamy

Main category: cs.LG

TL;DR: Proposes Probabilistic Hash Embeddings (PHE) for online learning with evolving categorical vocabularies, using Bayesian methods to prevent forgetting and maintain order invariance.

Details

Motivation: Traditional deterministic hash embeddings suffer from forgetting and sensitivity to arrival order in online learning with evolving categorical features.

Method: Probabilistic hash embeddings treated as stochastic variables with Bayesian online learning, enabling scalable inference and incremental updates.

Result: PHE achieves superior performance in classification, sequence modeling, and recommendation tasks while using only 2-4x less memory than one-hot embeddings.

Conclusion: PHE provides an effective solution for online learning with unbounded categorical vocabularies, addressing key limitations of deterministic hash embeddings.

Abstract: We study streaming data with categorical features where the vocabulary of categorical feature values is changing and can even grow unboundedly over time. Feature hashing is commonly used as a pre-processing step to map these categorical values into a feature space of fixed size before learning their embeddings. While these methods have been developed and evaluated for offline or batch settings, in this paper we consider online settings. We show that deterministic embeddings are sensitive to the arrival order of categories and suffer from forgetting in online learning, leading to performance deterioration. To mitigate this issue, we propose a probabilistic hash embedding (PHE) model that treats hash embeddings as stochastic and applies Bayesian online learning to learn incrementally from data. Based on the structure of PHE, we derive a scalable inference algorithm to learn model parameters and infer/update the posteriors of hash embeddings and other latent variables. Our algorithm (i) can handle an evolving vocabulary of categorical items, (ii) is adaptive to new items without forgetting old items, (iii) is implementable with a bounded set of parameters that does not grow with the number of distinct observed values on the stream, and (iv) is invariant to the item arrival order. Experiments in classification, sequence modeling, and recommendation systems in online learning setups demonstrate the superior performance of PHE while maintaining high memory efficiency (consumes as low as 2~4 memory of a one-hot embedding table). Supplementary materials are at https://github.com/aodongli/probabilistic-hash-embeddings

[370] Evolved SampleWeights for Bias Mitigation: Effectiveness Depends on Optimization Objectives

Anil K. Saini, Jose Guadalupe Hernandez, Emily F. Wong, Debanshi Misra, Jason H. Moore

Main category: cs.LG

TL;DR: This paper compares three sample weighting methods (Genetic Algorithm evolution, dataset-based computation, equal weights) to mitigate bias in machine learning models, showing that evolved weights achieve better fairness-performance trade-offs.

Details

Motivation: Machine learning models trained on real-world data may make biased predictions that negatively impact marginalized communities, requiring methods to mitigate such bias.

Method: Compared three weighting methods: (1) Genetic Algorithm-evolved weights, (2) dataset-characteristic-based weights, (3) equal weights. Evaluated using paired predictive (accuracy, AUC) and fairness (demographic parity difference, subgroup false negative fairness) metrics on 11 datasets.

Result: Evolved sample weights produced models with better trade-offs between fairness and predictive performance than alternative methods. Benefits depended on optimization objectives - optimizing with accuracy and demographic parity difference yielded the best results across most datasets.

Conclusion: Genetic Algorithm-evolved sample weights can effectively mitigate bias while maintaining predictive performance, with the choice of optimization objectives significantly impacting the effectiveness of the approach.

Abstract: Machine learning models trained on real-world data may inadvertently make biased predictions that negatively impact marginalized communities. Reweighting is a method that can mitigate such bias in model predictions by assigning a weight to each data point used during model training. In this paper, we compare three methods for generating these weights: (1) evolving them using a Genetic Algorithm (GA), (2) computing them using only dataset characteristics, and (3) assigning equal weights to all data points. Model performance under each strategy was evaluated using paired predictive and fairness metrics, which also served as optimization objectives for the GA during evolution. Specifically, we used two predictive metrics (accuracy and area under the Receiver Operating Characteristic curve) and two fairness metrics (demographic parity difference and subgroup false negative fairness). Using experiments on eleven publicly available datasets (including two medical datasets), we show that evolved sample weights can produce models that achieve better trade-offs between fairness and predictive performance than alternative weighting methods. However, the magnitude of these benefits depends strongly on the choice of optimization objectives. Our experiments reveal that optimizing with accuracy and demographic parity difference metrics yields the largest number of datasets for which evolved weights are significantly better than other weighting strategies in optimizing both objectives.

[371] Exploring Time-Step Size in Reinforcement Learning for Sepsis Treatment

Yingchuan Sun, Shengpu Tang

Main category: cs.LG

TL;DR: This paper investigates how time-step size affects offline RL for sepsis management, comparing 1, 2, 4, and 8-hour intervals and finding that finer time-steps (1-2 hours) yield better performance than the conventional 4-hour setup.

Details

Motivation: Existing RL approaches for sepsis management use 4-hour time steps, but concerns exist that this coarse granularity might distort patient dynamics and lead to suboptimal treatment policies. The practical impact of time-step size remains unexplored.

Method: Conducted controlled experiments comparing four time-step sizes (1, 2, 4, 8 hours) using identical offline RL pipeline. Developed action re-mapping methods for fair cross-time-step evaluation and performed cross-time-step model selection under two policy learning setups.

Result: Performance trends vary across time-step sizes depending on learning setups. Policies learned at finer time-step sizes (1h and 2h) using static behavior policy achieve the best overall performance and stability.

Conclusion: Time-step size is a critical design choice in offline RL for healthcare. Evidence supports using finer time-step sizes (1-2 hours) as alternatives to the conventional 4-hour setup for better performance in sepsis management.

Abstract: Existing studies on reinforcement learning (RL) for sepsis management have mostly followed an established problem setup, in which patient data are aggregated into 4-hour time steps. Although concerns have been raised regarding the coarseness of this time-step size, which might distort patient dynamics and lead to suboptimal treatment policies, the extent to which this is a problem in practice remains unexplored. In this work, we conducted empirical experiments for a controlled comparison of four time-step sizes ($Δt!=!1,2,4,8$ h) on this domain, following an identical offline RL pipeline. To enable a fair comparison across time-step sizes, we designed action re-mapping methods that allow for evaluation of policies on datasets with different time-step sizes, and conducted cross-$Δt$ model selections under two policy learning setups. Our goal was to quantify how time-step size influences state representation learning, behavior cloning, policy training, and off-policy evaluation. Our results show that performance trends across $Δt$ vary as learning setups change, while policies learned at finer time-step sizes ($Δt = 1$ h and $2$ h) using a static behavior policy achieve the overall best performance and stability. Our work highlights time-step size as a core design choice in offline RL for healthcare and provides evidence supporting alternatives beyond the conventional 4-hour setup.

[372] Operationalizing Quantized Disentanglement

Vitoria Barin-Pacela, Kartik Ahuja, Simon Lacoste-Julien, Pascal Vincent

Main category: cs.LG

TL;DR: The paper proposes Cliff, a method for unsupervised disentanglement by encouraging axis-aligned discontinuities in factor densities, outperforming baselines on disentanglement benchmarks.

Details

Motivation: While theory shows quantized factors are identifiable under diffeomorphisms with axis-aligned discontinuities, translating this principle into practical criteria remains challenging, especially for nonlinear maps.

Method: Develop a criterion that encourages axis-aligned discontinuities (cliffs) in factor densities, ensuring cliff locations along one factor are independent of other factors’ values.

Result: Cliff outperforms all baselines on disentanglement benchmarks, demonstrating effectiveness in unsupervised disentanglement.

Conclusion: The proposed method successfully translates theoretical principles into practical disentanglement by leveraging independent axis-aligned discontinuities in factor densities.

Abstract: Recent theoretical work established the unsupervised identifiability of quantized factors under any diffeomorphism. The theory assumes that quantization thresholds correspond to axis-aligned discontinuities in the probability density of the latent factors. By constraining a learned map to have a density with axis-aligned discontinuities, we can recover the quantization of the factors. However, translating this high-level principle into an effective practical criterion remains challenging, especially under nonlinear maps. Here, we develop a criterion for unsupervised disentanglement by encouraging axis-aligned discontinuities. Discontinuities manifest as sharp changes in the estimated density of factors and form what we call cliffs. Following the definition of independent discontinuities from the theory, we encourage the location of the cliffs along a factor to be independent of the values of the other factors. We show that our method, Cliff, outperforms the baselines on all disentanglement benchmarks, demonstrating its effectiveness in unsupervised disentanglement.

[373] Semantic Superiority vs. Forensic Efficiency: A Comparative Analysis of Deep Learning and Psycholinguistics for Business Email Compromise Detection

Yaw Osei Adjei

Main category: cs.LG

TL;DR: This paper presents two approaches for detecting Business Email Compromise (BEC): a psycholinguistic CatBoost method for low-latency detection and a semantic DistilBERT method for high-accuracy detection, both achieving excellent performance with ROI over 99.96%.

Details

Motivation: BEC causes massive financial losses ($2.9B annually) with severe cost asymmetry where false negatives (fraud losses) are orders of magnitude more expensive than false positives (manual reviews).

Method: Two detection paradigms: Forensic Psycholinguistic Stream using CatBoost for psycholinguistic analysis, and Semantic Stream using DistilBERT for contextual language understanding. Evaluated on adversarially poisoned dataset (N=7,990) using Black Hole protocol.

Result: DistilBERT achieved perfect detection (AUC=1.0000, F1=0.9981) at 7.403ms latency. CatBoost achieved competitive detection (AUC=0.9905, F1=0.9486) at 8.4x lower latency (0.885ms). Both approaches achieved ROI exceeding 99.96% through cost-sensitive learning.

Conclusion: DistilBERT is recommended for GPU-enabled environments requiring maximum accuracy, while CatBoost is preferable for edge deployments or cost-sensitive environments due to comparable security and lower operational costs.

Abstract: Business Email Compromise (BEC) is a sophisticated social engineering threat that manipulates organizational hierarchies and exploits psychological vulnerabilities, leading to significant financial damage. According to the 2024 FBI Internet Crime Report, BEC accounts for over $2.9 billion in annual adjusted losses, presenting significant economic asymmetry: the cost of a False Negative (fraud loss) exceeds the cost of a False Positive (manual review) by orders of magnitude (approximately 1 to 5,480). This paper examines two detection paradigms for BEC: the Forensic Psycholinguistic Stream, which utilizes CatBoost to analyze psycholinguistic cues with high interpretability and low latency, and the Semantic Stream, which employs DistilBERT for deep learning-based contextual language understanding, offering superior accuracy at higher computational cost. We evaluated DistilBERT on an adversarially poisoned dataset (N = 7,990) generated via our Black Hole protocol, benchmarked on Tesla T4 GPU infrastructure, achieving superior detection (AUC = 1.0000, F1 = 0.9981) with acceptable real-time latency (7.403 milliseconds). CatBoost achieves competitive detection (AUC = 0.9905, F1 = 0.9486) at 8.4x lower latency (0.885 milliseconds), consuming negligible computational resources. For organizations with GPU infrastructure, DistilBERT offers superior accuracy. CatBoost is preferable for edge deployments or cost-sensitive environments due to comparable security and lower operational costs. Both approaches demonstrate return on investment exceeding 99.96% when optimized through cost-sensitive learning, by significantly reducing false negatives and associated financial losses.

[374] Dataset Poisoning Attacks on Behavioral Cloning Policies

Akansha Kalra, Soumil Datta, Ethan Gilmore, Duc La, Guanhong Tao, Daniel S. Brown

Main category: cs.LG

TL;DR: First analysis of clean-label backdoor attacks on Behavior Cloning policies, showing they remain vulnerable even with minimal data poisoning and can be exploited via novel entropy-based test-time attacks.

Details

Motivation: As Behavior Cloning policies are increasingly deployed in real-world systems, their robustness and potential vulnerabilities are critical concerns that need investigation.

Method: Poison demonstration datasets by injecting visual triggers to create spurious correlations, then evaluate vulnerability scaling with poisoning parameters and introduce entropy-based test-time trigger attacks.

Result: BC policies trained on minimally poisoned datasets show near-baseline task performance but remain highly vulnerable to backdoor trigger attacks during deployment.

Conclusion: Urgent need for more research into BC policy robustness, especially as large-scale datasets are used for real-world cyber-physical systems.

Abstract: Behavior Cloning (BC) is a popular framework for training sequential decision policies from expert demonstrations via supervised learning. As these policies are increasingly being deployed in the real world, their robustness and potential vulnerabilities are an important concern. In this work, we perform the first analysis of the efficacy of clean-label backdoor attacks on BC policies. Our backdoor attacks poison a dataset of demonstrations by injecting a visual trigger to create a spurious correlation that can be exploited at test time. We evaluate how policy vulnerability scales with the fraction of poisoned data, the strength of the trigger, and the trigger type. We also introduce a novel entropy-based test-time trigger attack that substantially degrades policy performance by identifying critical states where test-time triggering of the backdoor is expected to be most effective at degrading performance. We empirically demonstrate that BC policies trained on even minimally poisoned datasets exhibit deceptively high, near-baseline task performance despite being highly vulnerable to backdoor trigger attacks during deployment. Our results underscore the urgent need for more research into the robustness of BC policies, particularly as large-scale datasets are increasingly used to train policies for real-world cyber-physical systems. Videos and code are available at https://sites.google.com/view/dataset-poisoning-in-bc.

[375] Subgoal Graph-Augmented Planning for LLM-Guided Open-World Reinforcement Learning

Shanwei Fan

Main category: cs.LG

TL;DR: SGA-ACR framework addresses LLM planning-execution misalignment in RL by using environment-specific subgoal graphs and multi-LLM planning with generation-critique-refinement separation, plus execution monitoring.

Details

Motivation: LLMs have strong high-level planning for RL but suffer from poor planning-execution alignment due to ungrounded subgoals and conflated generation/verification, limiting practical utility.

Method: Proposed SGA-ACR integrates environment-specific subgoal graphs and entity knowledge with multi-LLM pipeline separating generation, critique, and refinement. Includes subgoal tracker for execution monitoring and adaptive graph updates.

Result: Experimental results on 22 diverse tasks in open-world game “Crafter” demonstrate the method’s effectiveness in producing executable and verifiable subgoals.

Conclusion: The framework successfully addresses LLM planning-execution misalignment through structured knowledge integration and explicit separation of planning functions, enabling more reliable RL task decomposition.

Abstract: Large language models (LLMs) offer strong high-level planning capabilities for reinforcement learning (RL) by decomposing tasks into subgoals. However, their practical utility is limited by poor planning-execution alignment, which reflects a critical gap between abstract plans and actionable, environment-compatible behaviors. This misalignment arises from two interrelated limitations: (1) LLMs often produce subgoals that are semantically plausible but infeasible or irrelevant in the target environment due to insufficient grounding in environment-specific knowledge, and (2) single-LLM planning conflates generation with self-verification, resulting in overconfident yet unreliable subgoals that frequently fail during execution. To address these challenges, we propose Subgoal Graph-Augmented Actor-Critic-Refiner (SGA-ACR), a framework that integrates an environment-specific subgoal graph and structured entity knowledge with a multi-LLM planning pipeline that explicitly separates generation, critique, and refinement to produce executable and verifiable subgoals. A subgoal tracker further monitors execution progress, provides auxiliary rewards, and adaptively updates the subgoal graph to maintain alignment between plans and actions. Experimental results on 22 diverse tasks in the open-world game “Crafter” demonstrate the effectiveness of our proposed method.

[376] FANoise: Singular Value-Adaptive Noise Modulation for Robust Multimodal Representation Learning

Jiaoyang Li, Jun Fang, Tianhao Gao, Xiaohui Zhang, Zhiyuan Liu, Chao Liu, Pengzhang Liu, Qixia Jiang

Main category: cs.LG

TL;DR: FANoise is a feature-adaptive noise injection method that dynamically adjusts noise during multimodal representation learning to improve performance across various VLM models.

Details

Motivation: Existing noise injection methods in representation learning rely on heuristic or static noise, failing to account for the dynamic nature of feature distributions during training, which limits their effectiveness.

Method: Proposed FANoise - a feature-adaptive noise injection strategy that leverages contrastive learning dynamics to mitigate negative noise impacts while preserving benefits, using InfoNCE loss as a foundation.

Result: Comprehensive experiments show FANoise consistently improves overall performance on multimodal tasks across various base VLM models.

Conclusion: Feature-adaptive noise injection through FANoise provides a theoretically grounded framework that enhances representation learning performance by dynamically adapting to training dynamics.

Abstract: Representation learning is fundamental to modern machine learning, powering applications such as text retrieval and multimodal understanding. However, learning robust and generalizable representations remains challenging. While prior work has demonstrated that active noise injection, a form of data augmentation, can enhance encoding performance, most existing methods rely on heuristic or static noise, overlooking the dynamic nature of feature distributions during training. In this work, we systematically study the role of noise in representation learning from both gradient-based and feature distribution perspectives, using InfoNCE loss as a representative example. Focusing on multimodal representation learning, we propose FANoise, a novel feature-adaptive noise injection strategy. By leveraging the dynamics of contrastive learning, FANoise effectively mitigates the negative impacts of noise while preserving its benefits. Under this theoretically grounded framework, comprehensive experiments demonstrate that FANoise consistently improves overall performance on multimodal tasks across various base VLM models.

[377] Estimating Ising Models in Total Variation Distance

Constantinos Daskalakis, Vardis Kandiros, Rui Yao

Main category: cs.LG

TL;DR: The paper presents a unified analysis of Maximum Pseudo-Likelihood Estimator (MPLE) for Ising models, achieving polynomial-time estimation in Total Variation distance for two general classes: models with bounded operator norm satisfying MLSI, and models with bounded infinity norm.

Details

Motivation: While statistical complexity of Ising model estimation is understood, finding computationally and statistically efficient algorithms has been challenging, with previous work limited to specific cases like trees, Gaussian interactions, or specific eigenvalue distributions.

Method: Unified analysis of Maximum Pseudo-Likelihood Estimator (MPLE) using tensorization inequalities, measure decompositions, and concentration bounds for two general classes of Ising models.

Result: The approach yields polynomial-time algorithms with optimal or near-optimal sample complexity guarantees across various settings, providing a unified framework that generalizes previous specialized results.

Conclusion: The paper establishes a general framework for efficient Ising model estimation in TV distance, unifying and extending previous specialized approaches through analysis of MPLE for broad model classes.

Abstract: We consider the problem of estimating Ising models over $n$ variables in Total Variation (TV) distance, given $l$ independent samples from the model. While the statistical complexity of the problem is well-understood [DMR20], identifying computationally and statistically efficient algorithms has been challenging. In particular, remarkable progress has occurred in several settings, such as when the underlying graph is a tree [DP21, BGPV21], when the entries of the interaction matrix follow a Gaussian distribution [GM24, CK24], or when the bulk of its eigenvalues lie in a small interval [AJK+24, KLV24], but no unified framework for polynomial-time estimation in TV exists so far. Our main contribution is a unified analysis of the Maximum Pseudo-Likelihood Estimator (MPLE) for two general classes of Ising models. The first class includes models that have bounded operator norm and satisfy the Modified Log-Sobolev Inequality (MLSI), a functional inequality that was introduced to study the convergence of the associated Glauber dynamics to stationarity. In the second class of models, the interaction matrix has bounded infinity norm (or bounded width), which is the most common assumption in the literature for structure learning of Ising models. We show how our general results for these classes yield polynomial-time algorithms and optimal or near-optimal sample complexity guarantees in a variety of settings. Our proofs employ a variety of tools from tensorization inequalities to measure decompositions and concentration bounds.

[378] ChatGpt Content detection: A new approach using xlm-roberta alignment

Md Tasnin Tanvir, Dr Santanu Kumar Dash, Ishan Shahnan, Nafis Fuad, Tanvir Rahman, Abdullah Al Faisal, Asadullah Al Mamun

Main category: cs.LG

TL;DR: This paper presents an AI-generated text detection system using XLM-RoBERTa with perplexity, semantic, and readability features, achieving high accuracy across various text genres.

Details

Motivation: The urgent need to distinguish AI-generated text from human-authored content as generative AI technologies like ChatGPT become more widely available, addressing both fully AI-generated content and human text reworded by AI.

Method: Fine-tuned XLM-RoBERTa transformer model with rigorous preprocessing and feature extraction (perplexity, semantic, readability features) on a balanced dataset of human and AI-generated texts.

Result: The model demonstrated high accuracy and robust performance across various text genres, with feature analysis revealing perplexity and attention-based features as critical differentiators between human and AI-generated texts.

Conclusion: The approach offers a valuable tool for maintaining academic integrity and contributes to AI ethics by promoting transparency and accountability, with future research directions including exploring other advanced models and expanding datasets for better generalizability.

Abstract: The challenge of separating AI-generated text from human-authored content is becoming more urgent as generative AI technologies like ChatGPT become more widely available. In this work, we address this issue by looking at both the detection of content that has been entirely generated by AI and the identification of human text that has been reworded by AI. In our work, a comprehensive methodology to detect AI- generated text using XLM-RoBERTa, a state-of-the-art multilingual transformer model. Our approach includes rigorous preprocessing, and feature extraction involving perplexity, semantic, and readability features. We fine-tuned the XLM-RoBERTa model on a balanced dataset of human and AI-generated texts and evaluated its performance. The model demonstrated high accuracy and robust performance across various text genres. Additionally, we conducted feature analysis to understand the model’s decision-making process, revealing that perplexity and attention-based features are critical in differentiating between human and AI-generated texts. Our findings offer a valuable tool for maintaining academic integrity and contribute to the broader field of AI ethics by promoting transparency and accountability in AI systems. Future research directions include exploring other advanced models and expanding the dataset to enhance the model’s generalizability.

[379] Staggered Environment Resets Improve Massively Parallel On-Policy Reinforcement Learning

Sid Bharthulwar, Stone Tao, Hao Su

Main category: cs.LG

TL;DR: Staggered resets reduce harmful nonstationarity in parallel RL training by initializing environments at varied points, improving sample efficiency and performance.

Details

Motivation: Standard synchronous resets in parallel GPU environments introduce harmful nonstationarity that skews learning signals and destabilizes training when using short rollouts with high update-to-data ratios.

Method: Introduce staggered resets where environments are initialized and reset at varied points within the task horizon, creating training batches with greater temporal diversity to reduce nonstationarity from synchronized rollouts.

Result: Achieved significantly higher sample efficiency, faster wall-clock convergence, and stronger final performance in challenging high-dimensional robotics environments. The technique scales better with more parallel environments compared to synchronized rollouts.

Conclusion: Staggered resets are a simple yet effective technique that addresses nonstationarity issues in parallel RL training, improving training stability and performance across various environments.

Abstract: Massively parallel GPU simulation environments have accelerated reinforcement learning (RL) research by enabling fast data collection for on-policy RL algorithms like Proximal Policy Optimization (PPO). To maximize throughput, it is common to use short rollouts per policy update, increasing the update-to-data (UTD) ra- tio. However, we find that, in this setting, standard synchronous resets introduce harmful nonstationarity, skewing the learning signal and destabilizing training. We introduce staggered resets, a simple yet effective technique where environments are initialized and reset at varied points within the task horizon. This yields training batches with greater temporal diversity, reducing the nonstationarity induced by synchronized rollouts. We characterize dimensions along which RL environments can benefit significantly from staggered resets through illustrative toy environ- ments. We then apply this technique to challenging high-dimensional robotics environments, achieving significantly higher sample efficiency, faster wall-clock convergence, and stronger final performance. Finally, this technique scales better with more parallel environments compared to naive synchronized rollouts.

[380] Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

Liangzu Peng, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Wei Xia, Stefano Soatto

Main category: cs.LG

TL;DR: Gated KalmaNet (GKA) is an efficient linear state-space model layer that solves online ridge regression to maintain full past context with constant memory and linear compute, outperforming existing SSMs on both short and long-context tasks.

Details

Motivation: Linear state-space models are efficient but maintain only lossy summaries of the past, leading to inferior performance in recall-oriented tasks. GKA aims to bridge this gap by accounting for the full past while maintaining SSM efficiency.

Method: GKA solves online ridge regression using Kalman Filter-inspired iterative solving with two key innovations: adaptive regularization with input-dependent gating for numerical stability, and Chebyshev Iteration for stable low-precision computation. Includes hardware-aware chunk-wise implementation and custom kernels.

Result: GKA outperforms existing SSM layers (Mamba2, GLA, Gated DeltaNet) on short-context language understanding tasks. On long-context tasks up to 128k tokens, it achieves >10% relative improvement over fading memory baselines in RAG and LongQA tasks.

Conclusion: GKA successfully bridges the performance gap between efficient SSMs and full-context models by maintaining complete past information with SSM-style efficiency, demonstrating strong capabilities across both short and long-context scenarios.

Abstract: As efficient alternatives to softmax Attention, linear state-space models (SSMs) achieve constant memory and linear compute, but maintain only a lossy, fading summary of the past, often leading to inferior performance in recall oriented tasks. We propose Gated KalmaNet (GKA), a layer that reduces this gap by accounting for the full past when predicting the next token, while maintaining SSM-style efficiency. GKA achieves this by solving an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length. Drawing inspiration from the Kalman Filter, we iteratively solve the online ridge regression problem. However, a critical insight is that standard Kalman filter equations are numerically unstable in low-precision environments (like bfloat16) and difficult to parallelize in modern hardware. We address both challenges through two key innovations: (1) an adaptive regularization strategy with input-dependent gating that controls the condition number of the ridge regression, ensuring numerical stability while balancing memory retention. And (2) the use of Chebyshev Iteration instead of other conventional iterative solvers, which we demonstrate to be more stable in low-precision settings. To further improve scalability, we develop a hardware-aware chunk-wise implementation of Chebyshev Iteration along with custom kernels for backpropagating through our adaptive regularization and gating mechanisms. Empirically, GKA shows strong language understanding capabilites on short-context tasks outperforming existing SSM layers (like Mamba2, GLA and Gated DeltaNet). On long-context, GKA excels at real-world RAG and LongQA tasks up to 128k tokens, achieving more than $10$% relative improvement over other fading memory baselines.

[381] Probabilistic Wildfire Spread Prediction Using an Autoregressive Conditional Generative Adversarial Network

Taehoon Kang, Taeyong Kim

Main category: cs.LG

TL;DR: An autoregressive conditional GAN model for probabilistic wildfire spread prediction that outperforms conventional deep learning methods in accuracy and boundary delineation.

Details

Motivation: Climate change has intensified wildfires, but physics-based simulators are too slow for real-time use while existing deep learning models produce overly smooth predictions that miss complex wildfire dynamics.

Method: Autoregressive conditional generative adversarial network (CGAN) that learns sequential state transitions for long-term prediction stability, formulated as an autoregressive problem.

Result: The CGAN-based model outperforms conventional deep learning models in both overall predictive accuracy and boundary delineation of fire perimeters, capturing strong nonlinearity and uncertainty.

Conclusion: The autoregressive CGAN framework enhances accuracy and physical interpretability of wildfire spread prediction, providing a foundation for time-sensitive response and evacuation planning.

Abstract: Climate change has intensified the frequency and severity of wildfires, making rapid and accurate prediction of fire spread essential for effective mitigation and response. Physics-based simulators such as FARSITE offer high-fidelity predictions but are computationally intensive, limiting their applicability in real-time decision-making, while existing deep learning models often yield overly smooth predictions that fail to capture the complex, nonlinear dynamics of wildfire propagation. This study proposes an autoregressive conditional generative adversarial network (CGAN) for probabilistic wildfire spread prediction. By formulating the prediction task as an autoregressive problem, the model learns sequential state transitions, ensuring long-term prediction stability. Experimental results demonstrate that the proposed CGAN-based model outperforms conventional deep learning models in both overall predictive accuracy and boundary delineation of fire perimeters. These results demonstrate that adversarial learning allows the model to capture the strong nonlinearity and uncertainty of wildfire spread, instead of simply fitting the pixel average. Furthermore, the autoregressive framework facilitates systematic temporal forecasting of wildfire evolution. The proposed CGAN-based autoregressive framework enhances both the accuracy and physical interpretability of wildfire spread prediction, offering a promising foundation for time-sensitive response and evacuation planning.

[382] A Probabilistic Framework for Temporal Distribution Generalization in Industry-Scale Recommender Systems

Yuxuan Zhu, Cong Fu, Yabo Ni, Anxiang Zeng, Yuan Fang

Main category: cs.LG

TL;DR: ELBO$\text{TDS}$ is a probabilistic framework that addresses temporal distribution shift in recommender systems through causal modeling and data augmentation, achieving 2.33% GMV uplift and deployed in Shopee Product Search.

Details

Motivation: Temporal distribution shift erodes recommender system accuracy over time, and existing methods like invariant learning and self-supervised learning have limitations in temporal generalization, representation collapse, or inefficient data utilization.

Method: Proposes ELBO$\text{TDS}$ framework with: 1) statistical analysis of shifting factors and data augmentation strategy, 2) causal graph modeling of temporal recommendation scenario with self-supervised variational objective derived from causal structure.

Result: Extensive experiments show superior temporal generalization with 2.33% uplift in GMV per user, successfully deployed in Shopee Product Search.

Conclusion: The proposed ELBO$\text{TDS}$ framework effectively addresses temporal distribution shift in recommender systems through causal modeling and data augmentation, demonstrating practical value in industrial deployment.

Abstract: Temporal distribution shift (TDS) erodes the long-term accuracy of recommender systems, yet industrial practice still relies on periodic incremental training, which struggles to capture both stable and transient patterns. Existing approaches such as invariant learning and self-supervised learning offer partial solutions but often suffer from unstable temporal generalization, representation collapse, or inefficient data utilization. To address these limitations, we propose ELBO$\text{TDS}$, a probabilistic framework that integrates seamlessly into industry-scale incremental learning pipelines. First, we identify key shifting factors through statistical analysis of real-world production data and design a simple yet effective data augmentation strategy that resamples these time-varying factors to extend the training support. Second, to harness the benefits of this extended distribution while preventing representation collapse, we model the temporal recommendation scenario using a causal graph and derive a self-supervised variational objective, ELBO$\text{TDS}$, grounded in the causal structure. Extensive experiments supported by both theoretical and empirical analysis demonstrate that our method achieves superior temporal generalization, yielding a 2.33% uplift in GMV per user and has been successfully deployed in Shopee Product Search. Code is available at https://github.com/FuCongResearchSquad/ELBO4TDS.

[383] Prediction of Herd Life in Dairy Cows Using Multi-Head Attention Transformers

Mahdi Saki, Justin Lipman

Main category: cs.LG

TL;DR: AI model predicts cow longevity using transformer networks on historical data, achieving 83% accuracy in determining herd life.

Details

Motivation: Farmers need objective tools to identify resilient cows that can complete more lactations, as culling decisions have significant economic and environmental impacts.

Method: Used Multi-Head Attention Transformers to analyze 780,000 records from 19,000 cows across 7 Australian farms using historical multivariate time-series data from birth.

Result: Model achieved 83% determination coefficient in predicting herd life across studied farms.

Conclusion: The AI-driven approach shows strong potential for practical application in dairy herd management by enabling better culling decisions.

Abstract: Dairy farmers should decide to keep or cull a cow based on an objective assessment of her likely performance in the herd. For this purpose, farmers need to identify more resilient cows, which can cope better with farm conditions and complete more lactations. This decision-making process is inherently complex, with significant environmental and economic implications. In this study, we develop an AI-driven model to predict cow longevity using historical multivariate time-series data recorded from birth. Leveraging advanced AI techniques, specifically Multi-Head Attention Transformers, we analysed approximately 780,000 records from 19,000 unique cows across 7 farms in Australia. The results demonstrate that our model achieves an overall determination coefficient of 83% in predicting herd life across the studied farms, highlighting its potential for practical application in dairy herd management.

[384] RAVQ-HoloNet: Rate-Adaptive Vector-Quantized Hologram Compression

Shima Rafiei, Zahra Nabizadeh Shahr Babak, Shadrokh Samavi, Shahram Shirani

Main category: cs.LG

TL;DR: RAVQ-HoloNet is a rate-adaptive vector quantization framework for holography that achieves superior compression performance at low and ultra-low bit rates compared to existing methods.

Details

Motivation: Holography has great potential for AR/VR but faces adoption barriers due to high data compression requirements. Current deep learning approaches lack rate adaptivity within single networks.

Method: Proposed RAVQ-HoloNet, a rate-adaptive vector quantization framework designed specifically for holography compression.

Result: Achieves -33.91% improvement in BD-Rate and 1.02 dB BD-PSNR gain over state-of-the-art methods at low bit rates, as shown by rate-distortion curves.

Conclusion: The framework enables high-fidelity holographic reconstructions at low and ultra-low bit rates, addressing key compression challenges for AR/VR applications.

Abstract: Holography offers significant potential for AR/VR applications, yet its adoption is limited by the high demands of data compression. Existing deep learning approaches generally lack rate adaptivity within a single network. We present RAVQ-HoloNet, a rate-adaptive vector quantization framework that achieves high-fidelity reconstructions at low and ultra-low bit rates, outperforming current state-of-the-art methods. In low bit, our method exceeds by -33.91% in BD-Rate and achieves a BD-PSNR of 1.02 dB from the best existing method demonstrated by the rate-distortion curve.

[385] CNN-LSTM Hybrid Architecture for Over-the-Air Automatic Modulation Classification Using SDR

Dinanath Padhya, Krishna Acharya, Bipul Kumar Dahal, Dinesh Baniya Kshatri

Main category: cs.LG

TL;DR: Hybrid CNN-LSTM architecture achieves 93.48% accuracy for automatic modulation classification using both RadioML2018 and custom datasets, validated with OTA signals.

Details

Motivation: AMC is essential for cognitive radio, spectrum monitoring, and intelligent communication networks to identify modulation schemes without prior knowledge.

Method: Hybrid CNN-LSTM architecture integrated with SDR platform, using CNNs for spatial features and LSTMs for temporal dependencies, trained on hybrid dataset with SNRs from 0-30dB.

Result: Achieved 93.48% accuracy, 93.53% precision, 93.48% recall, 93.45% F1 score, with AUC-ROC confirming discriminative power in noisy conditions.

Conclusion: Hybrid CNN-LSTM architecture is effective for AMC and has potential applications in adaptive spectrum management and cognitive radio systems.

Abstract: Automatic Modulation Classification (AMC) is a core technology for future wireless communication systems, enabling the identification of modulation schemes without prior knowledge. This capability is essential for applications in cognitive radio, spectrum monitoring, and intelligent communication networks. We propose an AMC system based on a hybrid Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) architecture, integrated with a Software Defined Radio (SDR) platform. The proposed architecture leverages CNNs for spatial feature extraction and LSTMs for capturing temporal dependencies, enabling efficient handling of complex, time-varying communication signals. The system’s practical ability was demonstrated by identifying over-the-air (OTA) signals from a custom-built FM transmitter alongside other modulation schemes. The system was trained on a hybrid dataset combining the RadioML2018 dataset with a custom-generated dataset, featuring samples at Signal-to-Noise Ratios (SNRs) from 0 to 30dB. System performance was evaluated using accuracy, precision, recall, F1 score, and the Area Under the Receiver Operating Characteristic Curve (AUC-ROC). The optimized model achieved 93.48% accuracy, 93.53% precision, 93.48% recall, and an F1 score of 93.45%. The AUC-ROC analysis confirmed the model’s discriminative power, even in noisy conditions. This paper’s experimental results validate the effectiveness of the hybrid CNN-LSTM architecture for AMC, suggesting its potential application in adaptive spectrum management and advanced cognitive radio systems.

[386] FedAPA: Federated Learning with Adaptive Prototype Aggregation Toward Heterogeneous Wi-Fi CSI-based Crowd Counting

Jingtao Guo, Yuyi Mao, Ivan Wang-Hei Ho

Main category: cs.LG

TL;DR: FedAPA is a federated learning approach for Wi-Fi CSI-based sensing that uses adaptive prototype aggregation to handle data heterogeneity and reduce communication overhead while improving accuracy.

Details

Motivation: Large-scale deployment of Wi-Fi CSI-based sensing is limited by the need for extensive site-specific training data, and federated learning faces challenges with heterogeneous data and device resources.

Method: Uses adaptive prototype aggregation (APA) strategy with similarity-based weights for peer prototypes, combined with hybrid local training that includes classification and representation contrastive learning.

Result: Outperforms baselines with at least 9.65% accuracy increase, 9% F1 score gain, 0.29 MAE reduction, and 95.94% communication overhead reduction in real-world Wi-Fi crowd counting with 6 environments and up to 20 people.

Conclusion: FedAPA effectively addresses federated learning challenges in Wi-Fi CSI sensing through adaptive prototype aggregation and hybrid training, enabling practical large-scale deployment with improved performance and efficiency.

Abstract: Wi-Fi channel state information (CSI)-based sensing provides a non-invasive, device-free approach for tasks such as human activity recognition and crowd counting, but large-scale deployment is hindered by the need for extensive site-specific training data. Federated learning (FL) offers a way to avoid raw data sharing but is challenged by heterogeneous sensing data and device resources. This paper proposes FedAPA, a collaborative Wi-Fi CSI-based sensing algorithm that uses adaptive prototype aggregation (APA) strategy to assign similarity-based weights to peer prototypes, enabling adaptive client contributions and yielding a personalized global prototype for each client instead of a fixed-weight aggregation. During local training, we adopt a hybrid objective that combines classification learning with representation contrastive learning to align local and global knowledge. We provide a convergence analysis of FedAPA and evaluate it in a real-world distributed Wi-Fi crowd counting scenario with six environments and up to 20 people. The results show that our method outperform multiple baselines in terms of accuracy, F1 score, mean absolute error (MAE), and communication overhead, with FedAPA achieving at least a 9.65% increase in accuracy, a 9% gain in F1 score, a 0.29 reduction in MAE, and a 95.94% reduction in communication overhead.

[387] Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs

Dongkyu Derek Cho, Huan Song, Arijit Ghosh Chowdhury, Haotian An, Yawei Wang, Rohit Thekkanal, Negin Sokhandan, Sharlina Keshava, Hannah Marlowe

Main category: cs.LG

TL;DR: RLVR (reinforcement learning with verifiable rewards) can eliminate the safety-capability tradeoff in LLM fine-tuning, enabling simultaneous improvement in reasoning capabilities and safety guardrails.

Details

Motivation: Standard fine-tuning approaches (SFT, RLHF) exhibit a fundamental safety-capability tradeoff where improved task performance degrades safety alignment, even on benign datasets. The safety implications of RLVR remain unexplored.

Method: Comprehensive theoretical analysis deriving upper bounds on safety drift under KL-constrained optimization, plus extensive empirical experiments across five adversarial safety benchmarks with ablation studies on optimization algorithms, model scale, and task domains.

Result: Theoretical proofs show conditions where safety degradation is eliminated. Empirical results demonstrate RLVR can simultaneously enhance reasoning capabilities while maintaining or improving safety guardrails across multiple benchmarks.

Conclusion: RLVR challenges the prevailing assumption of an inevitable safety-capability trade-off, establishing that specific training methodology can achieve both objectives simultaneously, providing insights for safe deployment of reasoning-capable LLMs.

Abstract: Fine-tuning large language models (LLMs) for downstream tasks typically exhibit a fundamental safety-capability tradeoff, where improving task performance degrades safety alignment even on benign datasets. This degradation persists across standard approaches including supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF). While reinforcement learning with verifiable rewards (RLVR) has emerged as a promising alternative that optimizes models on objectively measurable tasks, its safety implications remain unexplored. We present the first comprehensive theoretical and empirical analysis of safety properties in RLVR. Theoretically, we derive upper bounds on safety drift under KL-constrained optimization and prove conditions under which safety degradation is eliminated. Empirically, we conduct extensive experiments across five adversarial safety benchmarks, demonstrating that RLVR can simultaneously enhance reasoning capabilities while maintaining or improving safety guardrails. Our comprehensive ablation studies examine the effects of optimization algorithms, model scale, and task domains. Our findings challenge the prevailing assumption of an inevitable safety capability trade-off, and establish that a specific training methodology can achieve both objectives simultaneously, providing insights for the safe deployment of reasoning-capable LLMs.

[388] Efficient Diffusion Planning with Temporal Diffusion

Jiaming Guo, Rui Zhang, Zerun Li, Yunkai Gao, Shaohui Peng, Siming Lan, Xing Hu, Zidong Du, Xishan Zhang, Ling Li

Main category: cs.LG

TL;DR: Temporal Diffusion Planner (TDP) improves decision efficiency by distributing denoising steps across time, reducing computational overhead while maintaining performance.

Details

Motivation: Previous diffusion planning methods generate new plans at each time step, causing significant computational overhead and lower decision frequencies. Inspired by human planning where short-term plans are detailed and long-term plans are more general, TDP aims to improve efficiency.

Method: TDP generates an initial plan that becomes progressively more vague over time. At each subsequent time step, it updates the previous plan with a small number of denoising steps rather than generating entirely new plans. An automated replanning mechanism prevents significant deviations.

Result: Experiments on D4RL show TDP improves decision-making frequency by 11-24.8 times compared to previous methods while achieving higher or comparable performance.

Conclusion: TDP successfully addresses the computational inefficiency of previous diffusion planning methods by distributing denoising steps across time, enabling more efficient decision-making without sacrificing performance.

Abstract: Diffusion planning is a promising method for learning high-performance policies from offline data. To avoid the impact of discrepancies between planning and reality on performance, previous works generate new plans at each time step. However, this incurs significant computational overhead and leads to lower decision frequencies, and frequent plan switching may also affect performance. In contrast, humans might create detailed short-term plans and more general, sometimes vague, long-term plans, and adjust them over time. Inspired by this, we propose the Temporal Diffusion Planner (TDP) which improves decision efficiency by distributing the denoising steps across the time dimension. TDP begins by generating an initial plan that becomes progressively more vague over time. At each subsequent time step, rather than generating an entirely new plan, TDP updates the previous one with a small number of denoising steps. This reduces the average number of denoising steps, improving decision efficiency. Additionally, we introduce an automated replanning mechanism to prevent significant deviations between the plan and reality. Experiments on D4RL show that, compared to previous works that generate new plans every time step, TDP improves the decision-making frequency by 11-24.8 times while achieving higher or comparable performance.

[389] A Unified Understanding of Offline Data Selection and Online Self-refining Generation for Post-training LLMs

Quan Xiao, Tianyi Chen

Main category: cs.LG

TL;DR: A unified optimization framework for offline data selection and online self-refining generation in LLM fine-tuning, using bilevel data selection and validation-weighted generations to enhance data quality.

Details

Motivation: To improve LLM adaptation to specific tasks by enhancing data quality through systematic offline selection and online refinement processes.

Method: Bilevel data selection for offline filtering with respect to validation data, and treating online self-refining generation as model adaptation by selecting best-fitting responses.

Result: Theoretical demonstration of bilevel data selection effectiveness, performance gains over unfiltered baselines, and improved fine-tuning performance on quality enhancement and safety-aware tasks.

Conclusion: The framework provides unified understanding of data selection and self-refining generation, enhancing LLM fine-tuning through learned data weights and validation-weighted online generations.

Abstract: Offline data selection and online self-refining generation, which enhance the data quality, are crucial steps in adapting large language models (LLMs) to specific downstream tasks. We tackle offline data selection and online self-refining generations through an optimization perspective. Specifically, bilevel data selection is used for offline data selection with respect to the validation dataset, and we treat online self-refining generation as a model adaptation step of selecting the model trained on current responses that best fits the validation data. Our framework offers a unified understanding of offline data selection and self-refining generation by assigning a learned data weight to each question and response, either explicitly or implicitly. For the first time, we theoretically demonstrate the effectiveness of the bilevel data selection framework and demonstrate its performance gains over unfiltered direct mixing baselines. By combining offline data with validation-weighted online generations, our method enhances fine-tuning performance. Experiments on quality enhancement and safety-aware LLM fine-tuning validate its effectiveness.

[390] G-Net: A Provably Easy Construction of High-Accuracy Random Binary Neural Networks

Alireza Aghasi, Nicholas Marshall, Saeid Pourmand, Wyatt Whiting

Main category: cs.LG

TL;DR: Proposes G-Nets, a novel family of randomized binary neural networks inspired by hyperdimensional computing that bridges floating-point networks with binary embeddings while maintaining accuracy through theoretical guarantees.

Details

Motivation: To create robust binary neural networks that overcome limitations of traditional quantization methods by leveraging hyperdimensional computing principles for efficient hardware implementation and model robustness.

Method: Uses binary embeddings of data as points in hypercube with Hamming distance, creates floating-point G-Nets that can mimic standard layers, and embeds them as randomized binary EHD G-Nets with theoretical guarantees from concentration of measure.

Result: Binary models match CNN accuracies and significantly outperform prior HDC models (e.g., ~30% higher accuracy on CIFAR-10), providing a theoretically justified bridge between neural networks and binary networks.

Conclusion: G-Nets open a new direction for constructing robust binary/quantized deep learning models with theoretical guarantees, combining neural network performance with hyperdimensional computing benefits.

Abstract: We propose a novel randomized algorithm for constructing binary neural networks with tunable accuracy. This approach is motivated by hyperdimensional computing (HDC), which is a brain-inspired paradigm that leverages high-dimensional vector representations, offering efficient hardware implementation and robustness to model corruptions. Unlike traditional low-precision methods that use quantization, we consider binary embeddings of data as points in the hypercube equipped with the Hamming distance. We propose a novel family of floating-point neural networks, G-Nets, which are general enough to mimic standard network layers. Each floating-point G-Net has a randomized binary embedding, an embedded hyperdimensional (EHD) G-Net, that retains the accuracy of its floating-point counterparts, with theoretical guarantees, due to the concentration of measure. Empirically, our binary models match convolutional neural network accuracies and outperform prior HDC models by large margins, for example, we achieve almost 30% higher accuracy on CIFAR-10 compared to prior HDC models. G-Nets are a theoretically justified bridge between neural networks and randomized binary neural networks, opening a new direction for constructing robust binary/quantized deep learning models. Our implementation is available at https://github.com/GNet2025/GNet.

[391] Aligning LLMs with Biomedical Knowledge using Balanced Fine-Tuning

Zhenchao Tang, Fang Wang, Haohuai He, Jiale Zhou, Tianxu Lv, Jun Zhu, Shouzhi Chen, Minghao Yang, Yu Wang, Jiayang Wu, Yidong Song, Jianhua Yao

Main category: cs.LG

TL;DR: BFT is an efficient post-training method that enables LLMs to learn complex biomedical reasoning from sparse data without external rewards, outperforming SFT and specialized agents through token-level and sample-level weighting mechanisms.

Details

Motivation: Current approaches for aligning LLMs with biomedical knowledge face limitations: SFT overfits to surface patterns without internalizing fragmented scientific knowledge, while RL is impractical due to prohibitive experimental validation requirements for reward signals.

Method: Balanced Fine-Tuning (BFT) uses a two-layer weighting mechanism: 1) token-level loss scaling via prediction probabilities to stabilize gradients and prevent overfitting, 2) sample-level “minimum group confidence” to adaptively enhance learning of hard samples.

Result: BFT significantly outperforms SFT in medical tasks, enables LLMs to acquire knowledge that SFT misses, surpasses GeneAgent in biological process reasoning, and generates text embeddings applicable to downstream tasks like gene interaction and single-cell perturbation prediction.

Conclusion: BFT facilitates broad applications of LLMs in biomedical research by enabling effective learning of complex reasoning from sparse data without external reward signals.

Abstract: Effective post-training is essential to align Large Language Models (LLMs) with specialized biomedical knowledge to accelerate life science research. However, current approaches face significant limitations. First, biomedical reasoning involves intricate mechanisms often represented by sparse textual data. Standard Supervised Fine-Tuning (SFT) tends to overfit to surface-level instruction patterns without effectively internalizing this fragmented scientific knowledge. Second, Reinforcement Learning (RL) is impractical for this domain, as defining meaningful rewards often necessitates prohibitive experimental validation (e.g., wet-lab verification of drug responses), rendering real-time feedback unfeasible. We propose Balanced Fine-Tuning (BFT), an efficient post-training method designed to learn complex reasoning from sparse data without external reward signals. BFT operates through a two-layer weighting mechanism: 1. At the token level, it scales loss via prediction probabilities to stabilize gradients and prevent overfitting; 2. At the sample level, it uses “minimum group confidence” to adaptively enhance the learning of hard samples. Experiments demonstrate that BFT significantly outperforms SFT. In medical tasks, it enables LLMs to acquire knowledge that SFT misses. In biological tasks, BFT-based LLMs surpass GeneAgent (an accurate agent for biology analysis) in biological process reasoning. Moreover, the text embeddings generated by BFT can be directly applied to downstream tasks, such as gene interaction and single-cell perturbation response prediction. These results indicate that BFT facilitates broad applications of LLMs in biomedical research.

[392] Deceptron: Learned Local Inverses for Fast and Stable Physics Inversion

Aaditya L. Kachhadiya

Main category: cs.LG

TL;DR: The Deceptron is a lightweight bidirectional module that learns local inverses of differentiable forward surrogates to solve ill-conditioned inverse problems more efficiently, achieving 20x fewer iterations on Heat-1D and 2-3x fewer on Damped Oscillator compared to projected gradient.

Details

Motivation: Inverse problems in physical sciences are often ill-conditioned in input space, making progress step-size sensitive and requiring more efficient solution methods.

Method: Training combines supervised fit, forward-reverse consistency, spectral penalty, soft bias tie, and Jacobian Composition Penalty (JCP). At solve time, D-IPG takes descent steps in output space, pulls them back through the learned inverse, and projects using standard backtracking rules.

Result: D-IPG reaches fixed normalized tolerance with ~20x fewer iterations on Heat-1D and ~2-3x fewer on Damped Oscillator than projected gradient, competitive with Gauss-Newton. JCP reduces composition error and tracks iteration gains.

Conclusion: The Deceptron provides an efficient approach for ill-conditioned inverse problems, with significant iteration reductions and competitive performance with established methods.

Abstract: Inverse problems in the physical sciences are often ill-conditioned in input space, making progress step-size sensitive. We propose the Deceptron, a lightweight bidirectional module that learns a local inverse of a differentiable forward surrogate. Training combines a supervised fit, forward-reverse consistency, a lightweight spectral penalty, a soft bias tie, and a Jacobian Composition Penalty (JCP) that encourages $J_g(f(x)),J_f(x)!\approx!I$ via JVP/VJP probes. At solve time, D-IPG (Deceptron Inverse-Preconditioned Gradient) takes a descent step in output space, pulls it back through $g$, and projects under the same backtracking and stopping rules as baselines. On Heat-1D initial-condition recovery and a Damped Oscillator inverse problem, D-IPG reaches a fixed normalized tolerance with $\sim$20$\times$ fewer iterations on Heat and $\sim$2-3$\times$ fewer on Oscillator than projected gradient, competitive in iterations and cost with Gauss-Newton. Diagnostics show JCP reduces a measured composition error and tracks iteration gains. We also preview a single-scale 2D instantiation, DeceptronNet (v0), that learns few-step corrections under a strict fairness protocol and exhibits notably fast convergence.

[393] How to Correctly Report LLM-as-a-Judge Evaluations

Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy-yong Sohn, Kangwook Lee

Main category: cs.LG

TL;DR: A framework for bias correction and confidence interval construction in LLM-based evaluation, addressing noise from imperfect specificity/sensitivity and uncertainty from calibration data.

Details

Motivation: LLMs are increasingly used as evaluators but their judgments are noisy due to imperfect specificity and sensitivity, leading to biased accuracy estimates. Existing bias-correction methods are underutilized and assume exact knowledge of specificity/sensitivity values.

Method: Proposes a simple plug-in framework that corrects bias and constructs confidence intervals reflecting uncertainty from both test and calibration datasets. Also introduces an adaptive algorithm for efficient calibration sample size allocation.

Result: The framework enables practical and statistically sound LLM-based evaluation by properly accounting for uncertainty in specificity/sensitivity estimates.

Conclusion: The proposed methods provide a statistically rigorous approach to LLM-based evaluation that corrects bias, constructs proper confidence intervals, and efficiently manages calibration uncertainty.

Abstract: Large language models (LLMs) are increasingly used as evaluators in lieu of humans. While scalable, their judgments are noisy due to imperfect specificity and sensitivity of LLMs, leading to biased accuracy estimates. Although bias-correction methods exist, they are underutilized in LLM research and typically assume exact knowledge of the model’s specificity and sensitivity. Furthermore, in general we only have estimates of these values and it is not well known how to properly construct confidence intervals using only estimates. This work presents a simple plug-in framework that corrects such bias and constructs confidence intervals reflecting uncertainty from both test and calibration dataset, enabling practical and statistically sound LLM-based evaluation. Additionally, to reduce uncertainty in the accuracy estimate, we introduce an adaptive algorithm that efficiently allocates calibration sample sizes.

[394] MLPMoE: Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts

Ivan Novikov

Main category: cs.LG

TL;DR: MLPMoE is a training-free method that transforms dense transformer MLPs into static mixture-of-experts without requiring calibration data, router training, or gradients, while maintaining performance with minimal perplexity changes.

Details

Motivation: Current dense transformer deployment is computationally inefficient as all parameters are activated for every token, despite evidence that useful computation occurs in sparse substructures within MLPs.

Method: Uses tensor slicing and summation to restructure dense MLPs into static high-cardinality mixture-of-experts, with Fractal Fade and Compensated Pruning for structured sparsity.

Result: On Qwen2.5-0.5B and DeepSeek-R1-Distill-Llama-8B models, MLPMoE maintains perplexity within 0.05% while keeping parameters constant, and differential sparsity removes ~20% of MLP parameters with only ~2% perplexity increase.

Conclusion: MLPMoE provides an efficient, training-free approach to transformer optimization that works post hoc on existing checkpoints without requiring additional training or calibration data.

Abstract: Large Language Models (LLMs) are predominantly deployed as dense transformers, where every parameter in every feed-forward block is activated for every token. While architecturally simple, this is computationally inefficient, since inference costs scale linearly with parameter count. Recent upcycling methods such as MoEfication, CMoE, ToMoE, and MoORE reveal that much of the useful computation lives in sparse, semi-modular substructures inside dense feed-forward networks, but these approaches typically rely on clustering, activation profiling, singular value decomposition, or custom routing that requires calibration data. This paper introduces MLPMoE (MLP Mixture-of-Experts), a training-free, deterministic transformation that restructures the dense MLP in transformer blocks into a static, high-cardinality mixture of experts. The transformation uses simple tensor slicing and summation, reinterpreting the algebra of tensor parallelism as a topological conversion rather than a distributed training pattern. We further introduce Fractal Fade (differential branch sparsity) and Compensated Pruning (variance-preserving branch reduction) as lightweight mechanisms for structured sparsity. On Qwen2.5-0.5B-Instruct and DeepSeek-R1-Distill-Llama-8B, the zero-shot MLPMoE transform changes a proxy perplexity metric by less than 0.05 percent while keeping the parameter count effectively constant. On the 8B model, differential sparsity removes about 20 percent of MLP parameters while keeping perplexity within about 2 percent of the dense baseline. The method operates entirely post hoc on existing checkpoints and does not require gradients, calibration sets, or router training. Code is available at https://gist.github.com/iwallarm/fc2ef1eddf226ca7814f9e5e2ae9bad1

[395] MNM : Multi-level Neuroimaging Meta-analysis with Hyperbolic Brain-Text Representations

Seunghun Baek, Jaejin Lee, Jaeyoon Sim, Minjae Jeong, Won Hwa Kim

Main category: cs.LG

TL;DR: A novel framework using hyperbolic geometry to perform multi-level neuroimaging meta-analysis by embedding text and brain images in a shared hyperbolic space, capturing semantic similarity and hierarchical organization.

Details

Motivation: Traditional meta-analysis methods for neuroimaging studies often overlook the hierarchical structure in brain data and suffer from limitations in capturing complex relationships between text and brain activation patterns.

Method: Leverages hyperbolic geometry (Lorentz model) to embed research article text and corresponding brain images into a shared hyperbolic space, performing multi-level neuroimaging meta-analysis through semantic alignment, hierarchy guidance, and relationship preservation.

Result: The proposed model outperforms baseline methods, demonstrating improved performance in neuroimaging meta-analysis tasks.

Conclusion: The framework provides a robust and interpretable paradigm for multi-level neuroimaging meta-analysis by effectively capturing hierarchical relationships through hyperbolic brain-text representations.

Abstract: Various neuroimaging studies suffer from small sample size problem which often limit their reliability. Meta-analysis addresses this challenge by aggregating findings from different studies to identify consistent patterns of brain activity. However, traditional approaches based on keyword retrieval or linear mappings often overlook the rich hierarchical structure in the brain. In this work, we propose a novel framework that leverages hyperbolic geometry to bridge the gap between neuroscience literature and brain activation maps. By embedding text from research articles and corresponding brain images into a shared hyperbolic space via the Lorentz model, our method captures both semantic similarity and hierarchical organization inherent in neuroimaging data. In the hyperbolic space, our method performs multi-level neuroimaging meta-analysis (MNM) by 1) aligning brain and text embeddings for semantic correspondence, 2) guiding hierarchy between text and brain activations, and 3) preserving the hierarchical relationships within brain activation patterns. Experimental results demonstrate that our model outperforms baselines, offering a robust and interpretable paradigm of multi-level neuroimaging meta-analysis via hyperbolic brain-text representation.

[396] Generative Early Stage Ranking

Juhee Hong, Meng Liu, Shengzhi Wang, Xiaoheng Mao, Huihui Cheng, Leon Gao, Christopher Leung, Jin Zhou, Chandra Mouli Sekar, Zhao Zhu, Ruochen Liu, Tuan Trieu, Dawei Sun, Jeet Kanjani, Rui Li, Jing Qian, Xuan Cao, Minjie Fan, Mingze Gao

Main category: cs.LG

TL;DR: Proposes Generative Early Stage Ranking (GESR) with Mixture of Attention (MoA) to bridge effectiveness gap in multi-stage ranking systems, using specialized attention mechanisms and optimization techniques for improved performance while maintaining efficiency.

Details

Motivation: Early Stage Ranking systems using user-item decoupling are efficient but limited in capturing fine-grained user-item affinities and cross-signals, creating an effectiveness gap.

Method: Introduces GESR paradigm with Mixture of Attention (MoA) module containing Hard Matching Attention, Target-Aware Self Attention, and Cross Attention modules, plus Multi-Logit Parameterized Gating for final refinement, with comprehensive optimization techniques.

Result: Substantial improvements in topline metrics, engagement, and consumption tasks validated by offline and online experiments; first successful deployment of full target-aware attention sequence modeling at ESR scale.

Conclusion: GESR paradigm successfully bridges the effectiveness-efficiency trade-off in early stage ranking through specialized attention mechanisms and optimization, enabling better user-item interaction modeling while maintaining system performance.

Abstract: Large-scale recommendations commonly adopt a multi-stage cascading ranking system paradigm to balance effectiveness and efficiency. Early Stage Ranking (ESR) systems utilize the “user-item decoupling” approach, where independently learned user and item representations are only combined at the final layer. While efficient, this design is limited in effectiveness, as it struggles to capture fine-grained user-item affinities and cross-signals. To address these, we propose the Generative Early Stage Ranking (GESR) paradigm, introducing the Mixture of Attention (MoA) module which leverages diverse attention mechanisms to bridge the effectiveness gap: the Hard Matching Attention (HMA) module encodes explicit cross-signals by computing raw match counts between user and item features; the Target-Aware Self Attention module generates target-aware user representations conditioned on the item, enabling more personalized learning; and the Cross Attention modules facilitate early and more enriched interactions between user-item features. MoA’s specialized attention encodings are further refined in the final layer through a Multi-Logit Parameterized Gating (MLPG) module, which integrates the newly learned embeddings via gating and produces secondary logits that are fused with the primary logit. To address the efficiency and latency challenges, we have introduced a comprehensive suite of optimization techniques. These span from custom kernels that maximize the capabilities of the latest hardware to efficient serving solutions powered by caching mechanisms. The proposed GESR paradigm has shown substantial improvements in topline metrics, engagement, and consumption tasks, as validated by both offline and online experiments. To the best of our knowledge, this marks the first successful deployment of full target-aware attention sequence modeling within an ESR stage at such a scale.

[397] BanglaASTE: A Novel Framework for Aspect-Sentiment-Opinion Extraction in Bangla E-commerce Reviews Using Ensemble Deep Learning

Ariful Islam, Md Rifat Hossen, Abir Ahmed, B M Taslimul Haque

Main category: cs.LG

TL;DR: This paper introduces BanglaASTE, the first framework for Aspect Sentiment Triplet Extraction (ASTE) in Bangla, achieving 89.9% accuracy on a new dataset of 3,345 product reviews using an ensemble model combining BanglaBERT and XGBoost.

Details

Motivation: Bangla ABSA research is significantly underexplored due to the absence of comprehensive datasets and specialized frameworks for triplet extraction in this language, despite its importance for e-commerce and social media analytics.

Method: Created the first annotated Bangla ASTE dataset; developed a hybrid classification framework with graph-based aspect-opinion matching and semantic similarity; implemented an ensemble model combining BanglaBERT contextual embeddings with XGBoost boosting algorithms.

Result: The ensemble approach achieves superior performance with 89.9% accuracy and 89.1% F1-score, significantly outperforming baseline models across all evaluation metrics.

Conclusion: The research advances state-of-the-art in low-resource language sentiment analysis and provides a scalable solution for Bangla e-commerce analytics, effectively addressing challenges like informal expressions, spelling variations, and data sparsity.

Abstract: Aspect-Based Sentiment Analysis (ABSA) has emerged as a critical tool for extracting fine-grained sentiment insights from user-generated content, particularly in e-commerce and social media domains. However, research on Bangla ABSA remains significantly underexplored due to the absence of comprehensive datasets and specialized frameworks for triplet extraction in this language. This paper introduces BanglaASTE, a novel framework for Aspect Sentiment Triplet Extraction (ASTE) that simultaneously identifies aspect terms, opinion expressions, and sentiment polarities from Bangla product reviews. Our contributions include: (1) creation of the first annotated Bangla ASTE dataset containing 3,345 product reviews collected from major e-commerce platforms including Daraz, Facebook, and Rokomari; (2) development of a hybrid classification framework that employs graph-based aspect-opinion matching with semantic similarity techniques; and (3) implementation of an ensemble model combining BanglaBERT contextual embeddings with XGBoost boosting algorithms for enhanced triplet extraction performance. Experimental results demonstrate that our ensemble approach achieves superior performance with 89.9% accuracy and 89.1% F1-score, significantly outperforming baseline models across all evaluation metrics. The framework effectively addresses key challenges in Bangla text processing including informal expressions, spelling variations, and data sparsity. This research advances the state-of-the-art in low-resource language sentiment analysis and provides a scalable solution for Bangla e-commerce analytics applications.

[398] From Bits to Rounds: Parallel Decoding with Exploration for Diffusion Language Models

Hengyu Fu, Baihe Huang, Virginia Adams, Charles Wang, Venkat Srinivasan, Jiantao Jiao

Main category: cs.LG

TL;DR: Standard diffusion language model decoding strategies are inefficient due to an information-theoretic bottleneck. The proposed Explore-Then-Exploit strategy reduces decoding rounds while maintaining quality.

Details

Motivation: Current DLM decoding strategies relying on high-confidence tokens encounter an inherent information-theoretic bottleneck that restricts decoding progress and slows generation.

Method: Proposed Explore-Then-Exploit (ETE) strategy that combines cross-block decoding with targeted exploration of high-uncertainty tokens to reshape conditional distributions and trigger confident predictions.

Result: ETE consistently reduces the required number of decoding rounds compared to confidence-only baselines without compromising generation quality.

Conclusion: Prioritizing high-confidence tokens is inherently inefficient, and the bits-to-rounds principle establishes that decoding rounds must grow linearly with sample information and inversely with per-round information budget.

Abstract: Diffusion Language Models (DLMs) have recently emerged as a strong alternative to autoregressive language models (LMs). DLMs offer comparable accuracy with faster inference speed via parallel decoding. However, standard DLM decoding strategies relying on high-confidence tokens encounter an inherent information-theoretic bottleneck that restricts decoding progress and ultimately slows generation. We demonstrate both theoretically and empirically that prioritizing high-confidence tokens is inherently inefficient. High-probability tokens carry negligible information and strictly relying on them limits the effective progress made in each decoding round. We prove that the number of decoding rounds must grow linearly with the sample’s total information (negative log-likelihood) and inversely with the per-round information budget, establishing a bits-to-rounds principle. We also propose Explore-Then-Exploit (ETE), a training-free decoding strategy that maximizes information throughput and decoding efficiency. ETE combines cross-block decoding with targeted exploration of high-uncertainty tokens to reshape the conditional distribution and trigger cascades of confident predictions. Experiments verify our theoretical bounds and demonstrate that ETE consistently reduces the required number of decoding rounds compared to confidence-only baselines without compromising generation quality.

[399] BRIDGE: Building Representations In Domain Guided Program Verification

Robert Joseph George, Carson Eisenach, Udaya Ghai, Dominique Perrault-Joncas, Anima Anandkumar, Dean Foster

Main category: cs.LG

TL;DR: BRIDGE introduces structured prompting that decomposes program verification into three domains (Code, Specifications, Proofs) to improve verified program generation in LLMs.

Details

Motivation: LLMs struggle with program verification in interactive proof frameworks like Lean4 due to scalability challenges in generating code, specifications, and proofs together.

Method: Decomposes verification into three interconnected domains with distinct reasoning behaviors: functional (code), specification-driven (intent), and proof-oriented (correctness arguments).

Result: Improves code correctness in Lean4 by 1.5x (pass@5), achieves 2x efficiency in inference, and boosts Python coding pass rates by up to 17.5% compared to standard methods.

Conclusion: Structured domain alignment is promising for verified synthesis and establishes foundation for training models to internalize reasoning strategies across code, specifications, and proofs.

Abstract: Large language models (LLMs) have achieved impressive results in code generation, yet struggle with program verification, especially in interactive proof frameworks such as Lean4. A central challenge is scalability: verified synthesis requires not just code, but also precise specifications and correctness proofs, and existing approaches rarely span all three domains. We present BRIDGE, the first systematic study of structured prompting for scalable verified program generation. BRIDGE decomposes verification into three interconnected domains: Code (executable implementations), Specifications (formal intent statements), and Proofs (constructive correctness arguments). Our key idea is to elicit distinct reasoning behaviors functional, specification-driven, and proof-oriented as intermediate representations that preserve semantic structure and connect these domains. Through systematic ablations, we show that this approach substantially improves both accuracy and efficiency beyond standard error feedback methods. For example, functional reasoning improves correctness of code in formal languages (Lean4) by nearly 1.5x (pass@5) over direct baselines. In inference-time compute, functional reasoning is also 2x more efficient, achieving higher pass rates with fewer generations and lower total sampling budgets. Similarly, we find that specification-driven prompting boosts Python coding pass rates by up to 17.5%. These findings suggest that structured domain alignment is a promising direction for advancing verified synthesis. BRIDGE establishes a foundation for training via expert iteration or RLVR, enabling models to internalize these reasoning strategies across code, specifications, and proofs.

[400] Subjective Depth and Timescale Transformers: Learning Where and When to Compute

Frederico Wieser, Martin Benfeghoul, Haitham Bou Ammar, Jun Wang, Zafeirios Fountas

Main category: cs.LG

TL;DR: Subjective Depth Transformers (SDT) and Subjective Timescale Transformers (STT) use Bayesian surprise signals to dynamically route computation in decoder-only Transformers, reducing computation by 75% and KV-cache by 50% while maintaining performance.

Details

Motivation: Standard Transformer architectures have rigid, uniform computation allocation that limits efficiency and scalability for large models and long sequences.

Method: SDT uses alternating Decision and Dynamic layers with Bayesian surprise-based routing. STT extends this to temporal domain with transition networks predicting residual updates and dynamic block execution.

Result: Both architectures reduce self-attention computation by 75% and KV-cache requirements by 50% within compute skipping layers, showing compute-accuracy trade-offs.

Conclusion: The proposed architectures provide a flexible framework for efficient Transformers through conditional computation based on surprise signals.

Abstract: The rigid, uniform allocation of computation in standard Transformer (TF) architectures can limit their efficiency and scalability, particularly for large-scale models and long sequences. Addressing this, we introduce Subjective Depth Transformers (SDT) and Subjective Timescale Transformers (STT), two distinct architectures that leverage Bayesian surprise signals to dynamically route computation, learning where and when to compute within decoder-only TFs. SDT augments a decoder-only stack with alternating Decision and Dynamic layers: a Decision layer computes a full block ‘posterior’ and a lightweight ‘prior,’ while a Dynamic layer employs fixed-capacity Top-K routing based on Bayesian surprise (Expected and Unexpected Change), maintaining a static compute graph. STT extends this conditional computation to the temporal domain: a transition network predicts residual updates, forming a temporal ‘change hypothesis’ that informs a router to dynamically execute or bypass TF blocks for each token, managing KV-cache contributions. Both architectures exhibit the predicted shift from novelty to prediction driven gating over training, suggesting alignment with surprise based principles. While operating at reduced capacity, they offer preliminary insights into the compute-accuracy trade-offs of conditional computation. The proposed architectures establish a flexible framework for efficiency, reducing self-attention computation by 75% and KV-cache requirements by 50% within each compute skipping layer, setting a pathway for more efficient models.

[401] Dynamic Stratified Contrastive Learning with Upstream Augmentation for MILP Branching

Tongkai Lu, Shuai Ma, Chongyang Tao

Main category: cs.LG

TL;DR: Dynamic Stratified Contrastive Training Framework (DSCT) for MILP branching that addresses semantic variation, data scarcity, and imbalance in neural-based branching policies.

Details

Motivation: Existing neural-based branching methods struggle with semantic variation across depths, scarcity of upstream nodes, and costly collection of strong branching samples in Branch-and-Bound algorithms.

Method: Groups B&B nodes by feature distributions, trains GCNN discriminative model to separate nodes across groups, and uses upstream-augmented MILP derivation to generate equivalent/perturbed instances for data scarcity.

Result: Significantly enhances branching accuracy and solving efficiency, particularly for upstream nodes, and generalizes effectively to unseen instances on standard MILP benchmarks.

Conclusion: DSCT effectively models subtle semantic differences between nodes and improves MILP solving performance through dynamic stratified contrastive training.

Abstract: Mixed Integer Linear Programming (MILP) is a fundamental class of NP-hard problems that has garnered significant attention from both academia and industry. The Branch-and-Bound (B&B) method is the dominant approach for solving MILPs and the branching plays an important role in B&B methods. Neural-based learning frameworks have recently been developed to enhance branching policies and the efficiency of solving MILPs. However, these methods still struggle with semantic variation across depths, the scarcity of upstream nodes, and the costly collection of strong branching samples. To address these issues, we propose \ours, a Dynamic \underline{\textbf{S}}tratified \underline{\textbf{C}}ontrastive Training Framework for \underline{\textbf{MILP}} Branching. It groups branch-and-bound nodes based on their feature distributions and trains a GCNN-based discriminative model to progressively separate nodes across groups, learning finer-grained node representations throughout the tree. To address data scarcity and imbalance at upstream nodes, we introduce an upstream-augmented MILP derivation procedure that generates both theoretically equivalent and perturbed instances. \ours~effectively models subtle semantic differences between nodes, significantly enhancing branching accuracy and solving efficiency, particularly for upstream nodes. Extensive experiments on standard MILP benchmarks demonstrate that our method enhances branching accuracy, reduces solving time, and generalizes effectively to unseen instances.

[402] Interpretable Fair Clustering

Mudi Jiang, Jiahui Zhou, Xinying Liu, Zengyou He, Zhikui Chen

Main category: cs.LG

TL;DR: Proposes an interpretable fair clustering framework using decision trees with fairness constraints, including a variant that eliminates fairness hyperparameter tuning through post-pruning.

Details

Motivation: Existing fair clustering methods lack interpretability, limiting their use in high-stakes scenarios where understanding clustering decisions is essential.

Method: Integrates fairness constraints into decision tree structure for clustering, with a variant that post-prunes trees constructed without fairness constraints to avoid hyperparameter tuning.

Result: Extensive experiments show competitive clustering performance, improved fairness, interpretability, and ability to handle multiple sensitive attributes robustly under complex constraints.

Conclusion: The framework enables equitable and transparent clustering with interpretable decision trees, opening new possibilities for fair clustering applications.

Abstract: Fair clustering has gained increasing attention in recent years, especially in applications involving socially sensitive attributes. However, existing fair clustering methods often lack interpretability, limiting their applicability in high-stakes scenarios where understanding the rationale behind clustering decisions is essential. In this work, we address this limitation by proposing an interpretable and fair clustering framework, which integrates fairness constraints into the structure of decision trees. Our approach constructs interpretable decision trees that partition the data while ensuring fair treatment across protected groups. To further enhance the practicality of our framework, we also introduce a variant that requires no fairness hyperparameter tuning, achieved through post-pruning a tree constructed without fairness constraints. Extensive experiments on both real-world and synthetic datasets demonstrate that our method not only delivers competitive clustering performance and improved fairness, but also offers additional advantages such as interpretability and the ability to handle multiple sensitive attributes. These strengths enable our method to perform robustly under complex fairness constraints, opening new possibilities for equitable and transparent clustering.

[403] Trustless Federated Learning at Edge-Scale: A Compositional Architecture for Decentralized, Verifiable, and Incentive-Aligned Coordination

Pius Onobhayedo, Paul Osemudiame Oamen

Main category: cs.LG

TL;DR: This paper addresses key gaps in federated learning by introducing cryptographic proofs for aggregation correctness, geometric novelty measurement to prevent gaming, parallel object ownership for scalability, and time-locked policies against retroactive manipulation.

Details

Motivation: The motivation is to realize the democratic vision of distributed AI where edge devices can collectively improve models without surrendering raw data, overcoming current limitations in accountability, economic mechanisms, scalability, and governance.

Method: The method uses cryptographic receipts to prove aggregation correctness, geometric novelty measurement to prevent incentive gaming, parallel object ownership to achieve linear scalability, and time-locked policies to check retroactive manipulation.

Result: The proposed approach addresses the compositional gaps in federated learning systems, enabling secure, scalable, and accountable distributed model training across billions of edge devices.

Conclusion: By solving key challenges in aggregation correctness, incentive gaming, scalability, and governance manipulation, this work enables the realization of truly democratic federated learning at scale.

Abstract: Artificial intelligence is retracing the Internet’s path from centralized provision to distributed creation. Initially, resource-intensive computation concentrates within institutions capable of training and serving large models.Eventually, as federated learning matures, billions of edge devices holding sensitive data will be able to collectively improve models without surrendering raw information, enabling both contribution and consumption at scale. This democratic vision remains unrealized due to certain compositional gaps; aggregators handle updates without accountability, economic mechanisms are lacking and even when present remain vulnerable to gaming, coordination serializes state modifications limiting scalability, and governance permits retroactive manipulation. This work addresses these gaps by leveraging cryptographic receipts to prove aggregation correctness, geometric novelty measurement to prevent incentive gaming, parallel object ownership to achieve linear scalability, and time-locked policies to check retroactive manipulation.

Mengran Li, Zelin Zang, Wenbin Xing, Junzhou Chen, Ronghui Zhang, Jiebo Luo, Stan Z. Li

Main category: cs.LG

TL;DR: CHMR is a cell-aware hierarchical multimodal framework that models dependencies between molecules and cellular responses, achieving state-of-the-art performance on molecular property prediction tasks.

Details

Motivation: Existing methods focus only on chemical structures, ignoring cellular responses, and current cell-aware approaches suffer from modality incompleteness and insufficient modeling of hierarchical dependencies across molecular, cellular, and genomic levels.

Method: CHMR jointly models local-global dependencies between molecules and cellular responses using a novel tree-structured vector quantization module to capture latent biological hierarchies.

Result: Outperforms state-of-the-art baselines on nine public benchmarks (728 tasks) with average improvements of 3.6% on classification and 17.2% on regression tasks.

Conclusion: Demonstrates the advantage of hierarchy-aware, multimodal learning for reliable and biologically grounded molecular representations, offering a generalizable framework for integrative biomedical modeling.

Abstract: Understanding how chemical perturbations propagate through biological systems is essential for robust molecular property prediction. While most existing methods focus on chemical structures alone, recent advances highlight the crucial role of cellular responses such as morphology and gene expression in shaping drug effects. However, current cell-aware approaches face two key limitations: (1) modality incompleteness in external biological data, and (2) insufficient modeling of hierarchical dependencies across molecular, cellular, and genomic levels. We propose CHMR (Cell-aware Hierarchical Multi-modal Representations), a robust framework that jointly models local-global dependencies between molecules and cellular responses and captures latent biological hierarchies via a novel tree-structured vector quantization module. Evaluated on nine public benchmarks spanning 728 tasks, CHMR outperforms state-of-the-art baselines, yielding average improvements of 3.6% on classification and 17.2% on regression tasks. These results demonstrate the advantage of hierarchy-aware, multimodal learning for reliable and biologically grounded molecular representations, offering a generalizable framework for integrative biomedical modeling. The code is in https://github.com/limengran98/CHMR.

[405] Privacy in Federated Learning with Spiking Neural Networks

Dogukan Aksu, Jesus Martinez del Rincon, Ihsen Alouani

Main category: cs.LG

TL;DR: SNNs offer inherent privacy advantages over ANNs in federated learning due to reduced gradient informativeness from surrogate-gradient training and event-driven dynamics.

Details

Motivation: To investigate the vulnerability of SNNs to gradient inversion attacks in federated learning, as this privacy threat has been well-studied in ANNs but remains unexplored for SNNs.

Method: Adapted different gradient leakage attacks to the spike domain and conducted comprehensive empirical study across diverse data domains to compare gradient informativeness between SNNs and ANNs.

Result: SNN gradients yield noisy, temporally inconsistent reconstructions that fail to recover meaningful spatial or temporal structure, unlike ANN gradients which reliably expose salient input content.

Conclusion: SNNs have inherent privacy-preserving potential in federated learning due to reduced gradient informativeness from surrogate-gradient training and event-driven dynamics.

Abstract: Spiking neural networks (SNNs) have emerged as prominent candidates for embedded and edge AI. Their inherent low power consumption makes them far more efficient than conventional ANNs in scenarios where energy budgets are tightly constrained. In parallel, federated learning (FL) has become the prevailing training paradigm in such settings, enabling on-device learning while limiting the exposure of raw data. However, gradient inversion attacks represent a critical privacy threat in FL, where sensitive training data can be reconstructed directly from shared gradients. While this vulnerability has been widely investigated in conventional ANNs, its implications for SNNs remain largely unexplored. In this work, we present the first comprehensive empirical study of gradient leakage in SNNs across diverse data domains. SNNs are inherently non-differentiable and are typically trained using surrogate gradients, which we hypothesized would be less correlated with the original input and thus less informative from a privacy perspective. To investigate this, we adapt different gradient leakage attacks to the spike domain. Our experiments reveal a striking contrast with conventional ANNs: whereas ANN gradients reliably expose salient input content, SNN gradients yield noisy, temporally inconsistent reconstructions that fail to recover meaningful spatial or temporal structure. These results indicate that the combination of event-driven dynamics and surrogate-gradient training substantially reduces gradient informativeness. To the best of our knowledge, this work provides the first systematic benchmark of gradient inversion attacks for spiking architectures, highlighting the inherent privacy-preserving potential of neuromorphic computation.

[406] I-GLIDE: Input Groups for Latent Health Indicators in Degradation Estimation

Lucas Thil, Jesse Read, Rim Kaddah, Guillaume Doquet

Main category: cs.LG

TL;DR: Novel framework for health indicator construction with uncertainty quantification and mechanism-specific degradation modeling that improves RUL prediction accuracy and interpretability.

Details

Motivation: Existing methods fail to disentangle complex degradation mechanisms in multi-sensor systems and lack uncertainty quantification in health indicator reliability.

Method: Adapts RaPP as health indicator, augments with uncertainty quantification via Monte Carlo dropout and probabilistic latent spaces, and introduces indicator groups to isolate sensor subsets for mechanism-specific degradation modeling (I-GLIDE).

Result: Outperforms traditional reconstruction error metrics, achieves marked improvements in accuracy and generalizability compared to state-of-the-art methods, and provides actionable insights into system failure pathways.

Conclusion: Bridges gap between anomaly detection and prognostics, offering principled framework for uncertainty-aware degradation modeling in complex systems with improved interpretability.

Abstract: Accurate remaining useful life (RUL) prediction hinges on the quality of health indicators (HIs), yet existing methods often fail to disentangle complex degradation mechanisms in multi-sensor systems or quantify uncertainty in HI reliability. This paper introduces a novel framework for HI construction, advancing three key contributions. First, we adapt Reconstruction along Projected Pathways (RaPP) as a health indicator (HI) for RUL prediction for the first time, showing that it outperforms traditional reconstruction error metrics. Second, we show that augmenting RaPP-derived HIs with aleatoric and epistemic uncertainty quantification (UQ) via Monte Carlo dropout and probabilistic latent spaces- significantly improves RUL-prediction robustness. Third, and most critically, we propose indicator groups, a paradigm that isolates sensor subsets to model system-specific degradations, giving rise to our novel method, I-GLIDE which enables interpretable, mechanism-specific diagnostics. Evaluated on data sourced from aerospace and manufacturing systems, our approach achieves marked improvements in accuracy and generalizability compared to state-of-the-art HI methods while providing actionable insights into system failure pathways. This work bridges the gap between anomaly detection and prognostics, offering a principled framework for uncertainty-aware degradation modeling in complex systems.

[407] Robust Gene Prioritization via Fast-mRMR Feature Selection in high-dimensional omics data

Rubén Fernández-Farelo, Jorge Paz-Ruza, Bertha Guijarro-Berdiñas, Amparo Alonso-Betanzos, Alex A. Freitas

Main category: cs.LG

TL;DR: A new gene prioritization pipeline using Fast-mRMR feature selection to handle high-dimensional biomedical data, improving model performance and enabling feature set combination.

Details

Motivation: Existing AI methods for gene prioritization struggle with high dimensionality and incomplete labeling in biomedical data, requiring more robust and efficient approaches.

Method: Proposes a pipeline leveraging Fast-mRMR feature selection to retain only relevant, non-redundant features for classifiers, enabling simpler models and combination of different biological feature sets.

Result: Experiments on Dietary Restriction datasets show significant improvements over existing methods, demonstrating enhanced performance and reliability.

Conclusion: Feature selection is critical for reliable gene prioritization, and the proposed pipeline provides a more robust and efficient solution for handling complex biomedical data.

Abstract: Gene prioritization (identifying genes potentially associated with a biological process) is increasingly tackled with Artificial Intelligence. However, existing methods struggle with the high dimensionality and incomplete labelling of biomedical data. This work proposes a more robust and efficient pipeline that leverages Fast-mRMR feature selection to retain only relevant, non-redundant features for classifiers. This enables us to build simpler and more effective models, as well as to combine different biological feature sets. Experiments on Dietary Restriction datasets show significant improvements over existing methods, proving that feature selection can be critical for reliable gene prioritization.

[408] A Physics-Informed U-net-LSTM Network for Data-Driven Seismic Response Modeling of Structures

Sutirtha Biswas, Kshitij Kumar Yadav

Main category: cs.LG

TL;DR: A Physics-Informed U-Net LSTM framework that integrates physical laws with deep learning for accurate and efficient seismic response prediction of structures.

Details

Motivation: Traditional FEM has high computational costs limiting scalability, while pure data-driven deep learning models struggle with generalization and capturing underlying physics.

Method: Hybrid Physics-Informed U-Net LSTM framework that embeds domain-specific physical constraints into the learning process.

Result: Improved predictive performance over conventional ML architectures, bridging the gap between data-driven methods and physics-based modeling.

Conclusion: The proposed approach offers a robust and computationally efficient alternative for seismic response prediction.

Abstract: Accurate and efficient seismic response prediction is essential for the design of resilient structures. While the Finite Element Method (FEM) remains the standard for nonlinear seismic analysis, its high computational demands limit its scalability and real time applicability. Recent developments in deep learning, particularly Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short Term Memory (LSTM) models, have shown promise in reducing the computational cost of nonlinear seismic analysis of structures. However, these data driven models often struggle to generalize and capture the underlying physics, leading to reduced reliability. We propose a novel Physics Informed U Net LSTM framework that integrates physical laws with deep learning to enhance both accuracy and efficiency. By embedding domain specific constraints into the learning process, the proposed model achieves improved predictive performance over conventional Machine Learning architectures. This hybrid approach bridges the gap between purely data driven methods and physics based modeling, offering a robust and computationally efficient alternative for seismic response prediction of structures.

[409] Sawtooth Sampling for Time Series Denoising Diffusion Implicit Models

Heiko Oppel, Andreas Spilz, Michael Munz

Main category: cs.LG

TL;DR: Combining implicit diffusion models with a novel Sawtoot Sampler to accelerate DDPM sampling by 30x while improving generated sequence quality for classification tasks.

Details

Motivation: DDPMs can generate synthetic timeseries data to improve classifier performance, but their sampling process is computationally expensive.

Method: Combining implicit diffusion models with a novel Sawtooth Sampler that accelerates the reverse process and can be applied to any pretrained diffusion model.

Result: Achieves 30 times speed-up over standard baseline while enhancing quality of generated sequences for classification tasks.

Conclusion: The proposed approach significantly accelerates diffusion model sampling while maintaining or improving generation quality for timeseries classification applications.

Abstract: Denoising Diffusion Probabilistic Models (DDPMs) can generate synthetic timeseries data to help improve the performance of a classifier, but their sampling process is computationally expensive. We address this by combining implicit diffusion models with a novel Sawtooth Sampler that accelerates the reverse process and can be applied to any pretrained diffusion model. Our approach achieves a 30 times speed-up over the standard baseline while also enhancing the quality of the generated sequences for classification tasks.

[410] TSGM: Regular and Irregular Time-series Generation using Score-based Generative Models

Haksoo Lim, Jaehoon Lee, Sewon Park, Minjung Kim, Noseong Park

Main category: cs.LG

TL;DR: Score-based generative models applied to time-series synthesis using conditional score networks, achieving state-of-the-art performance on both regular and irregular time-series data.

Details

Motivation: To leverage the proven success of score-based generative models in other domains (image generation, voice synthesis) for time-series data synthesis, addressing the challenge of synthesizing both regular and irregular time-series.

Method: Developed a conditional score network for time-series synthesis with a tailored conditional denoising score matching loss, designed to be flexible for handling both regular and irregular time-series with minimal model changes.

Result: Achieved exceptional synthesis performance on various time-series datasets, obtaining state-of-the-art sampling diversity and quality.

Conclusion: Score-based generative models can be effectively adapted for time-series synthesis, providing a flexible framework that works well for both regular and irregular time-series data with outstanding results.

Abstract: Score-based generative models (SGMs) have demonstrated unparalleled sampling quality and diversity in numerous fields, such as image generation, voice synthesis, and tabular data synthesis, etc. Inspired by those outstanding results, we apply SGMs to synthesize time-series by learning its conditional score function. To this end, we present a conditional score network for time-series synthesis, deriving a denoising score matching loss tailored for our purposes. In particular, our presented denoising score matching loss is the conditional denoising score matching loss for time-series synthesis. In addition, our framework is such flexible that both regular and irregular time-series can be synthesized with minimal changes to our model design. Finally, we obtain exceptional synthesis performance on various time-series datasets, achieving state-of-the-art sampling diversity and quality.

[411] Masks Can Be Distracting: On Context Comprehension in Diffusion Language Models

Julianna Piskorz, Cristina Pinneri, Alvaro Correia, Motasem Alfarra, Risheek Garrepalli, Christos Louizos

Main category: cs.LG

TL;DR: MDLMs have locality bias favoring local context and are distracted by appended mask tokens during generation, but a mask-agnostic loss can improve robustness.

Details

Motivation: To examine context comprehension abilities of Masked Diffusion Language Models (MDLMs) and address their limitations compared to Autoregressive Language Models (ARLMs).

Method: Systematic ablations to analyze locality bias and mask token effects, plus introducing a mask-agnostic loss function for fine-tuning.

Result: MDLMs exhibit strong locality bias and performance degradation from appended mask tokens, but mask-agnostic loss substantially mitigates these issues.

Conclusion: Current MDLM training has critical limitations, but actionable insights exist for building diffusion-based language models with stronger context comprehension.

Abstract: Masked Diffusion Language Models (MDLMs) have recently emerged as a promising alternative to Autoregressive Language Models (ARLMs), leveraging a denoising objective that, in principle, should enable more uniform context utilisation. In this work, we examine the context comprehension abilities of MDLMs and uncover two key limitations. First, despite their more global training objective and bidirectional attention mechanism, similarly to ARLMS, MDLMs exhibit a strong locality bias: performance is highly sensitive to the position of relevant information within the input, favouring local over distant context. Second, we show that appending a large number of mask tokens–required for generation–can significantly degrade context comprehension. Through systematic ablations, we find that these masks act as distractors, reducing the model’s ability to process relevant information. To address this, we introduce a mask-agnostic loss function that encourages predictions to remain invariant to the number of appended masks. Fine-tuning with this objective substantially mitigates the distracting effect of masks, improving robustness of MDLMs. Overall, our findings reveal critical limitations of the current MDLM training paradigm and provide actionable insights for building diffusion-based language models with stronger context comprehension.

[412] Best Practices for Machine Learning Experimentation in Scientific Applications

Umberto Michelucci, Francesca Venturini

Main category: cs.LG

TL;DR: A practical guide for conducting reproducible and reliable machine learning experiments in scientific research, focusing on fair comparisons, transparent reporting, and metrics to detect overfitting.

Details

Motivation: Machine learning is increasingly used in science but often suffers from poor experimental design, unreliable baselines, and insufficient validation, leading to misleading conclusions about model performance.

Method: Proposes a structured workflow from dataset preparation to model evaluation, introduces metrics like Logarithmic Overfitting Ratio (LOR) and Composite Overfitting Score (COS) to account for overfitting and validation instability, and provides reporting formats.

Result: A comprehensive framework that helps researchers establish robust baselines and conduct fair comparisons through systematic experimental design and transparent documentation practices.

Conclusion: This guide supports scientists in drawing valid evidence-based insights from ML applications by promoting reproducibility, fair comparison, and transparent reporting in machine learning experiments.

Abstract: Machine learning (ML) is increasingly adopted in scientific research, yet the quality and reliability of results often depend on how experiments are designed and documented. Poor baselines, inconsistent preprocessing, or insufficient validation can lead to misleading conclusions about model performance. This paper presents a practical and structured guide for conducting ML experiments in scientific applications, focussing on reproducibility, fair comparison, and transparent reporting. We outline a step-by-step workflow, from dataset preparation to model selection and evaluation, and propose metrics that account for overfitting and instability across validation folds, including the Logarithmic Overfitting Ratio (LOR) and the Composite Overfitting Score (COS). Through recommended practices and example reporting formats, this work aims to support researchers in establishing robust baselines and drawing valid evidence-based insights from ML models applied to scientific problems.

[413] Hybrid-AIRL: Enhancing Inverse Reinforcement Learning with Supervised Expert Guidance

Bram Silue, Santiago Amaya-Corredor, Patrick Mannion, Lander Willem, Pieter Libin

Main category: cs.LG

TL;DR: H-AIRL extends AIRL with supervised loss and stochastic regularization to improve reward inference in complex imperfect-information settings like poker, achieving higher sample efficiency and stability.

Details

Motivation: AIRL struggles with sparse reward problems in highly complex, imperfect-information environments like poker, where it fails to infer sufficiently informative reward functions.

Method: Hybrid-AIRL (H-AIRL) enhances AIRL by incorporating a supervised loss from expert data and a stochastic regularization mechanism to improve reward inference and policy learning.

Result: H-AIRL achieves higher sample efficiency and more stable learning compared to AIRL in Gymnasium benchmarks and HULHE poker, with visualized reward functions providing deeper insights.

Conclusion: Incorporating supervised signals into inverse RL is beneficial, making H-AIRL a promising framework for challenging real-world settings with sparse rewards and uncertainty.

Abstract: Adversarial Inverse Reinforcement Learning (AIRL) has shown promise in addressing the sparse reward problem in reinforcement learning (RL) by inferring dense reward functions from expert demonstrations. However, its performance in highly complex, imperfect-information settings remains largely unexplored. To explore this gap, we evaluate AIRL in the context of Heads-Up Limit Hold’em (HULHE) poker, a domain characterized by sparse, delayed rewards and significant uncertainty. In this setting, we find that AIRL struggles to infer a sufficiently informative reward function. To overcome this limitation, we contribute Hybrid-AIRL (H-AIRL), an extension that enhances reward inference and policy learning by incorporating a supervised loss derived from expert data and a stochastic regularization mechanism. We evaluate H-AIRL on a carefully selected set of Gymnasium benchmarks and the HULHE poker setting. Additionally, we analyze the learned reward function through visualization to gain deeper insights into the learning process. Our experimental results show that H-AIRL achieves higher sample efficiency and more stable learning compared to AIRL. This highlights the benefits of incorporating supervised signals into inverse RL and establishes H-AIRL as a promising framework for tackling challenging, real-world settings.

[414] The Directed Prediction Change - Efficient and Trustworthy Fidelity Assessment for Local Feature Attribution Methods

Kevin Iselborn, David Dembinsky, Adriano Lucieri, Andreas Dengel

Main category: cs.LG

TL;DR: Proposes Directed Prediction Change (DPC) metric for evaluating explanation fidelity, achieving 10x speedup over Monte Carlo methods while eliminating randomness.

Details

Motivation: Existing fidelity metrics like Infidelity use Monte Carlo approximation which requires many model evaluations and introduces uncertainty through random sampling, making them unreliable for high-stakes applications like medical settings.

Method: Modified existing Prediction Change (PC) metric within Guided Perturbation Experiment by incorporating direction of both perturbation and attribution, creating deterministic DPC metric.

Result: DPC achieves almost tenfold speedup and eliminates randomness while measuring same property as local Infidelity. Evaluated on 4,744 explanations across medical images, financial data, multiple models and explanation methods.

Conclusion: DPC together with PC enables holistic, computationally efficient evaluation of explanation methods with deterministic and reproducible outcomes, making it suitable for high-stakes applications.

Abstract: The utility of an explanation method critically depends on its fidelity to the underlying machine learning model. Especially in high-stakes medical settings, clinicians and regulators require explanations that faithfully reflect the model’s decision process. Existing fidelity metrics such as Infidelity rely on Monte Carlo approximation, which demands numerous model evaluations and introduces uncertainty due to random sampling. This work proposes a novel metric for evaluating the fidelity of local feature attribution methods by modifying the existing Prediction Change (PC) metric within the Guided Perturbation Experiment. By incorporating the direction of both perturbation and attribution, the proposed Directed Prediction Change (DPC) metric achieves an almost tenfold speedup and eliminates randomness, resulting in a deterministic and trustworthy evaluation procedure that measures the same property as local Infidelity. DPC is evaluated on two datasets (skin lesion images and financial tabular data), two black-box models, seven explanation algorithms, and a wide range of hyperparameters. Across $4,744$ distinct explanations, the results demonstrate that DPC, together with PC, enables a holistic and computationally efficient evaluation of both baseline-oriented and local feature attribution methods, while providing deterministic and reproducible outcomes.

[415] BanglaMM-Disaster: A Multimodal Transformer-Based Deep Learning Framework for Multiclass Disaster Classification in Bangla

Ariful Islam, Md Rifat Hossen, Md. Mahmudul Arif, Abdullah Al Noman, Md Arifur Rahman

Main category: cs.LG

TL;DR: BanglaMM-Disaster is a multimodal deep learning framework for disaster classification in Bangla using both text and images from social media, achieving 83.76% accuracy and outperforming single-modality baselines.

Details

Motivation: Natural disasters are a major challenge for Bangladesh, requiring real-time monitoring and quick response systems. There's a need for effective disaster classification tools in Bangla language.

Method: Built a new dataset of 5,037 Bangla social media posts with captions and images, annotated into 9 disaster categories. Used transformer-based text encoders (BanglaBERT, mBERT, XLM-RoBERTa) combined with CNN backbones (ResNet50, DenseNet169, MobileNetV2) with early fusion approach.

Result: Best model achieved 83.76% accuracy, surpassing text-only baseline by 3.84% and image-only baseline by 16.91%. Showed reduced misclassification across all classes with improvements for ambiguous examples.

Conclusion: The work fills a key gap in Bangla multimodal disaster analysis and demonstrates the benefits of combining multiple data types for real-time disaster response in low-resource settings.

Abstract: Natural disasters remain a major challenge for Bangladesh, so real-time monitoring and quick response systems are essential. In this study, we present BanglaMM-Disaster, an end-to-end deep learning-based multimodal framework for disaster classification in Bangla, using both textual and visual data from social media. We constructed a new dataset of 5,037 Bangla social media posts, each consisting of a caption and a corresponding image, annotated into one of nine disaster-related categories. The proposed model integrates transformer-based text encoders, including BanglaBERT, mBERT, and XLM-RoBERTa, with CNN backbones such as ResNet50, DenseNet169, and MobileNetV2, to process the two modalities. Using early fusion, the best model achieves 83.76% accuracy. This surpasses the best text-only baseline by 3.84% and the image-only baseline by 16.91%. Our analysis also shows reduced misclassification across all classes, with noticeable improvements for ambiguous examples. This work fills a key gap in Bangla multimodal disaster analysis and demonstrates the benefits of combining multiple data types for real-time disaster response in low-resource settings.

[416] Controlling changes to attention logits

Ben Anson, Laurence Aitchison

Main category: cs.LG

TL;DR: The paper proposes parameter-dependent learning rates for query and key weights to stabilize transformer training, addressing limitations of QK normalization in scenarios like Multi Latent Attention.

Details

Motivation: QK normalization fixes stability issues but is incompatible with Multi Latent Attention (MLA) due to requiring full materialization of queries and keys during inference. There's a need for alternative stabilization methods.

Method: Assign parameter-dependent learning rates to query and key weights to control changes to logits, which is identified as crucial for stability.

Result: The intervention allows increasing base learning rate, outperforms other methods in MLA setting, and achieves performance competitive with QK norm in Multi-head Attention.

Conclusion: Parameter-dependent learning rates for query/key weights provide an effective stabilization method that works where QK norm cannot, while maintaining competitive performance.

Abstract: Stability of neural network weights is critical when training transformer models. The query and key weights are particularly problematic, as they tend to grow large without any intervention. Applying normalization to queries and keys, known as `QK norm’, fixes stability issues in practice, but is not always applicable. For example, QK norm is not compatible with Multi Latent Attention (MLA) because QK norm requires full materialization of queries and keys during inference, which is not done in MLA. In this paper we suggest that controlling the changes to logits is important for stability. We show that these changes are controllable by assigning parameter-dependent learning rates to the query and key weights. We find that our cheap intervention allows us to increase the base learning rate of the network, outperform other methods in the MLA setting, and achieve performance competitive with QK norm when using Multi-head Attention.

[417] Anomaly Detection with Adaptive and Aggressive Rejection for Contaminated Training Data

Jungi Lee, Jungkwon Kim, Chi Zhang, Kwangsun Yoo, Seok-Joo Byun

Main category: cs.LG

TL;DR: AAR is a novel anomaly detection method that dynamically excludes contaminated data using modified z-scores and GMM-based thresholds, outperforming state-of-the-art methods by 0.041 AUROC.

Details

Motivation: Traditional anomaly detection models assume clean training data, but real-world datasets often contain contamination. Fixed contamination ratio assumptions fail in noisy environments where normal and abnormal distributions overlap, severely degrading performance.

Method: Proposes Adaptive and Aggressive Rejection (AAR) using modified z-score and Gaussian mixture model-based thresholds to dynamically exclude anomalies. Integrates hard and soft rejection strategies to balance preserving normal data while excluding anomalies.

Result: Extensive experiments on 2 image datasets and 30 tabular datasets show AAR outperforms state-of-the-art method by 0.041 AUROC. Provides scalable and reliable solution for contaminated datasets.

Conclusion: AAR enhances robustness against contaminated datasets, enabling broader real-world applications in security and healthcare domains by effectively handling data contamination challenges.

Abstract: Handling contaminated data poses a critical challenge in anomaly detection, as traditional models assume training on purely normal data. Conventional methods mitigate contamination by relying on fixed contamination ratios, but discrepancies between assumed and actual ratios can severely degrade performance, especially in noisy environments where normal and abnormal data distributions overlap. To address these limitations, we propose Adaptive and Aggressive Rejection (AAR), a novel method that dynamically excludes anomalies using a modified z-score and Gaussian mixture model-based thresholds. AAR effectively balances the trade-off between preserving normal data and excluding anomalies by integrating hard and soft rejection strategies. Extensive experiments on two image datasets and thirty tabular datasets demonstrate that AAR outperforms the state-of-the-art method by 0.041 AUROC. By providing a scalable and reliable solution, AAR enhances robustness against contaminated datasets, paving the way for broader real-world applications in domains such as security and healthcare.

[418] SUPN: Shallow Universal Polynomial Networks

Zachary Morrow, Michael Penwarden, Brian Chen, Aurya Javeed, Akil Narayan, John D. Jakeman

Main category: cs.LG

TL;DR: SUPNs are shallow universal polynomial networks that replace most hidden layers with a single polynomial layer, achieving better approximation with fewer parameters than DNNs and KANs.

Details

Motivation: Deep neural networks and KANs require many parameters leading to overparameterization, local minima, and sensitivity to initialization, which affects generalization and transparency.

Method: Replace all but the last hidden layer with a single layer of polynomials with learnable coefficients, leveraging polynomial strengths while maintaining expressivity with fewer parameters.

Result: SUPNs converge at the same rate as best polynomial approximation, show lower approximation error and variability than DNNs and KANs by an order of magnitude, and outperform polynomial projection on non-smooth functions.

Conclusion: SUPNs provide an efficient alternative to DNNs and KANs, achieving superior approximation with fewer parameters and reduced sensitivity to initialization.

Abstract: Deep neural networks (DNNs) and Kolmogorov-Arnold networks (KANs) are popular methods for function approximation due to their flexibility and expressivity. However, they typically require a large number of trainable parameters to produce a suitable approximation. Beyond making the resulting network less transparent, overparameterization creates a large optimization space, likely producing local minima in training that have quite different generalization errors. In this case, network initialization can have an outsize impact on the model’s out-of-sample accuracy. For these reasons, we propose shallow universal polynomial networks (SUPNs). These networks replace all but the last hidden layer with a single layer of polynomials with learnable coefficients, leveraging the strengths of DNNs and polynomials to achieve sufficient expressivity with far fewer parameters. We prove that SUPNs converge at the same rate as the best polynomial approximation of the same degree, and we derive explicit formulas for quasi-optimal SUPN parameters. We complement theory with an extensive suite of numerical experiments involving SUPNs, DNNs, KANs, and polynomial projection in one, two, and ten dimensions, consisting of over 13,000 trained models. On the target functions we numerically studied, for a given number of trainable parameters, the approximation error and variability are often lower for SUPNs than for DNNs and KANs by an order of magnitude. In our examples, SUPNs even outperform polynomial projection on non-smooth functions.

[419] Ensemble Performance Through the Lens of Linear Independence of Classifier Votes in Data Streams

Enes Bektas, Fazli Can

Main category: cs.LG

TL;DR: This paper analyzes how ensemble size affects classification performance through linear independence of classifier votes, providing theoretical estimates for optimal ensemble size and validating with experiments.

Details

Motivation: To understand the trade-off between ensemble size and performance, addressing computational inefficiency and diminishing returns in large ensembles while maximizing representational capacity.

Method: Modeled linear independence among classifier votes using geometric and probability models, derived theoretical framework for ensemble size estimation, and validated with experiments on real-world and synthetic datasets using OzaBagging and GOOWE methods.

Result: Theoretical estimate effectively identifies performance saturation point for robust ensembles like OzaBagging, but reveals algorithmic instability for complex weighting schemes like GOOWE when theoretical diversity is high.

Conclusion: Linear independence provides a theoretical basis for determining optimal ensemble size, with practical implications varying by ensemble method - robust methods benefit from the framework while complex methods may experience instability.

Abstract: Ensemble learning improves classification performance by combining multiple base classifiers. While increasing the number of classifiers generally enhances accuracy, excessively large ensembles can lead to computational inefficiency and diminishing returns. This paper investigates the relationship between ensemble size and performance through the lens of linear independence among classifier votes in data streams. We propose that ensembles composed of linearly independent classifiers maximize representational capacity, particularly under a geometric model. We then generalize the importance of linear independence to the weighted majority voting problem. By modeling the probability of achieving linear independence among classifier outputs, we derive a theoretical framework that explains the trade-off between ensemble size and accuracy. Our analysis leads to a theoretical estimate of the ensemble size required to achieve a user-specified probability of linear independence. We validate our theory through experiments on both real-world and synthetic datasets using two ensemble methods, OzaBagging and GOOWE. Our results confirm that this theoretical estimate effectively identifies the point of performance saturation for robust ensembles like OzaBagging. Conversely, for complex weighting schemes like GOOWE, our framework reveals that high theoretical diversity can trigger algorithmic instability. Our implementation is publicly available to support reproducibility and future research.

[420] Mean-Field Limits for Two-Layer Neural Networks Trained with Consensus-Based Optimization

William De Deyn, Michael Herty, Giovanni Samaey

Main category: cs.LG

TL;DR: Two-layer neural networks trained with consensus-based optimization (CBO) are compared to Adam, showing hybrid CBO+Adam converges faster. CBO is reformulated for multi-task learning with reduced memory and extended to mean-field limits with optimal transport framework.

Details

Motivation: To improve neural network training efficiency by exploring particle-based optimization methods like CBO and comparing them with standard optimizers like Adam, while developing theoretical foundations through mean-field analysis.

Method: Used consensus-based optimization (CBO) for training two-layer neural networks, compared with Adam optimizer, developed hybrid CBO+Adam approach, reformulated CBO for multi-task learning, and established mean-field limit formulation using optimal transport framework.

Result: Hybrid CBO+Adam approach provides faster convergence than CBO alone. CBO reformulation reduces memory overhead in multi-task learning. Mean-field analysis shows monotonic variance decrease in Wasserstein-over-Wasserstein space.

Conclusion: CBO shows promise as neural network optimizer, especially when combined with Adam. The mean-field formulation provides theoretical foundation for understanding CBO dynamics and enables analysis of variance reduction properties.

Abstract: We study two-layer neural networks and train these with a particle-based method called consensus-based optimization (CBO). We compare the performance of CBO against Adam on two test cases and demonstrate how a hybrid approach, combining CBO with Adam, provides faster convergence than CBO. In the context of multi-task learning, we recast CBO into a formulation that offers less memory overhead. The CBO method allows for a mean-field limit formulation, which we couple with the mean-field limit of the neural network. To this end, we first reformulate CBO within the optimal transport framework. Finally, in the limit of infinitely many particles, we define the corresponding dynamics on the Wasserstein-over-Wasserstein space and show that the variance decreases monotonically.

[421] Lost in Time? A Meta-Learning Framework for Time-Shift-Tolerant Physiological Signal Transformation

Qian Hong, Cheng Bian, Xiao Zhou, Xiaoyu Li, Yelei Li, Zijing Zeng

Main category: cs.LG

TL;DR: ShiftSyncNet is a meta-learning framework that automatically corrects temporal misalignment in multimodal physiological signal transformation, improving accuracy in converting PPG/BCG signals to arterial blood pressure.

Details

Motivation: Temporal misalignment in multimodal signal transformation impairs accuracy for critical healthcare monitoring applications like blood pressure estimation, and existing synchronization methods are inadequate for time-shifted supervision.

Method: A bi-level optimization framework with transformation network (TransNet) and time-shift correction network (SyncNet), where SyncNet learns time offsets and applies Fourier phase shifts to align supervision signals.

Result: Outperforms strong baselines by 9.4%, 6.0%, and 12.8% on one industrial and two public datasets, effectively correcting time shifts and improving transformation accuracy.

Conclusion: ShiftSyncNet provides an effective solution for temporal inconsistencies in multimodal physiological transformation, pointing toward unified direction for handling temporal misalignment.

Abstract: Translating non-invasive signals such as photoplethysmography (PPG) and ballistocardiography (BCG) into clinically meaningful signals like arterial blood pressure (ABP) is vital for continuous, low-cost healthcare monitoring. However, temporal misalignment in multimodal signal transformation impairs transformation accuracy, especially in capturing critical features like ABP peaks. Conventional synchronization methods often rely on strong similarity assumptions or manual tuning, while existing Learning with Noisy Labels (LNL) approaches are ineffective under time-shifted supervision, either discarding excessive data or failing to correct label shifts. To address this challenge, we propose ShiftSyncNet, a meta-learning-based bi-level optimization framework that automatically mitigates performance degradation due to time misalignment. It comprises a transformation network (TransNet) and a time-shift correction network (SyncNet), where SyncNet learns time offsets between training pairs and applies Fourier phase shifts to align supervision signals. Experiments on one real-world industrial dataset and two public datasets show that ShiftSyncNet outperforms strong baselines by 9.4%, 6.0%, and 12.8%, respectively. The results highlight its effectiveness in correcting time shifts, improving label quality, and enhancing transformation accuracy across diverse misalignment scenarios, pointing toward a unified direction for addressing temporal inconsistencies in multimodal physiological transformation.

[422] IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference

Wanli Zhong, Haibo Feng, Zirui Zhou, Hanyang Peng, Shiqi Yu

Main category: cs.LG

TL;DR: IntAttention is a fully integer attention pipeline that eliminates softmax bottlenecks in Transformer models on edge devices, achieving up to 3.7x speedup and 61% energy reduction without retraining.

Details

Motivation: Transformer deployment on edge devices is limited by softmax bottlenecks, which cause costly dequantize-softmax-requantize cycles that can account for 65% of attention latency and disrupt integer dataflow efficiency.

Method: Uses IndexSoftmax operator that replaces floating-point exponentials with integer operations, integrating sparsity-aware clipping, 32-entry lookup-table approximation, and direct integer normalization to eliminate datatype conversions.

Result: Achieves up to 3.7x speedup and 61% energy reduction over FP16 baselines, and 2.0x faster than conventional INT8 attention pipelines on Armv8 CPUs, with comparable accuracy across language and vision models.

Conclusion: IntAttention enables practical and efficient Transformer inference on commodity edge devices through a fully integer, plug-and-play attention pipeline without retraining requirements.

Abstract: Deploying Transformer models on edge devices is limited by latency and energy budgets. While INT8 quantization effectively accelerates the primary matrix multiplications, it exposes the softmax as the dominant bottleneck. This stage incurs a costly dequantize-softmax-requantize detour, which can account for up to 65% of total attention latency and disrupts the end-to-end integer dataflow critical for edge hardware efficiency. To address this limitation, we present IntAttention, the first fully integer, plug-and-play attention pipeline without retraining. At the core of our approach lies IndexSoftmax, a hardware-friendly operator that replaces floating-point exponentials entirely within the integer domain. IntAttention integrates sparsity-aware clipping, a 32-entry lookup-table approximation, and direct integer normalization, thereby eliminating all datatype conversion overhead. We evaluate IntAttention and demonstrate consistent and substantial gains. Our method achieves up to 3.7x speedup and 61% energy reduction over FP16 baselines and 2.0x faster than conventional INT8 attention pipelines on Armv8 CPUs. These gains are achieved with high-fidelity accuracy comparable to baselines across diverse language and vision models, enabling practical and efficient Transformer inference on commodity edge devices. Code will be released in later version of this work.

[423] Mechanistic Interpretability for Transformer-based Time Series Classification

Matīss Kalnāre, Sofoklis Kitharidis, Thomas Bäck, Niki van Stein

Main category: cs.LG

TL;DR: The paper adapts Mechanistic Interpretability techniques from NLP to analyze transformer architectures for time series classification, revealing internal causal structures and decision-making mechanisms.

Details

Motivation: Transformer models are state-of-the-art for time series classification but their complex internal decision-making processes remain poorly understood, with existing explainability methods focusing mainly on input-output relationships rather than internal mechanisms.

Method: Adapted activation patching, attention saliency, and sparse autoencoders from NLP to transformer architectures for time series classification, systematically probing internal causal roles of attention heads and timesteps.

Result: Constructed causal graphs showing information propagation, identified key attention heads and temporal positions driving correct classifications, and demonstrated sparse autoencoders’ potential for uncovering interpretable latent features.

Conclusion: The study provides methodological contributions to transformer interpretability and novel insights into the functional mechanics underlying transformer performance in time series classification tasks.

Abstract: Transformer-based models have become state-of-the-art tools in various machine learning tasks, including time series classification, yet their complexity makes understanding their internal decision-making challenging. Existing explainability methods often focus on input-output attributions, leaving the internal mechanisms largely opaque. This paper addresses this gap by adapting various Mechanistic Interpretability techniques; activation patching, attention saliency, and sparse autoencoders, from NLP to transformer architectures designed explicitly for time series classification. We systematically probe the internal causal roles of individual attention heads and timesteps, revealing causal structures within these models. Through experimentation on a benchmark time series dataset, we construct causal graphs illustrating how information propagates internally, highlighting key attention heads and temporal positions driving correct classifications. Additionally, we demonstrate the potential of sparse autoencoders for uncovering interpretable latent features. Our findings provide both methodological contributions to transformer interpretability and novel insights into the functional mechanics underlying transformer performance in time series classification tasks.

[424] Predictive Safety Shield for Dyna-Q Reinforcement Learning

Jin Pin, Krasowski Hanna, Vanneaux Elena

Main category: cs.LG

TL;DR: A predictive safety shield for model-based RL that uses safe environment simulations to update Q-functions locally, improving performance while maintaining hard safety guarantees.

Details

Motivation: Existing safety shields use random safe actions or fixed fallback controllers, ignoring future performance implications of different safe actions.

Method: Proposes a predictive safety shield that updates Q-function locally based on safe predictions from safe simulations of the environment model in discrete space.

Result: Experiments on gridworld show short prediction horizons suffice to identify optimal paths. Approach is robust to distribution shifts without requiring additional training.

Conclusion: The predictive safety shield improves RL performance while maintaining hard safety guarantees, demonstrating robustness to distribution shifts.

Abstract: Obtaining safety guarantees for reinforcement learning is a major challenge to achieve applicability for real-world tasks. Safety shields extend standard reinforcement learning and achieve hard safety guarantees. However, existing safety shields commonly use random sampling of safe actions or a fixed fallback controller, therefore disregarding future performance implications of different safe actions. In this work, we propose a predictive safety shield for model-based reinforcement learning agents in discrete space. Our safety shield updates the Q-function locally based on safe predictions, which originate from a safe simulation of the environment model. This shielding approach improves performance while maintaining hard safety guarantees. Our experiments on gridworld environments demonstrate that even short prediction horizons can be sufficient to identify the optimal path. We observe that our approach is robust to distribution shifts, e.g., between simulation and reality, without requiring additional training.

[425] Context-Specific Causal Graph Discovery with Unobserved Contexts: Non-Stationarity, Regimes and Spatio-Temporal Patterns

Martin Rabel, Jakob Runge

Main category: cs.LG

TL;DR: A framework for analyzing causal graph changes in non-stationary spatial-temporal data, extending constraint-based causal discovery methods to handle variations while maintaining stability.

Details

Motivation: Real-world data like climate applications often show spatial-temporal variations that encode important information but can negatively affect algorithms assuming stationarity. Understanding causal graph changes is crucial for stability and reliability.

Method: Modifies constraint-based causal discovery approaches at the independence testing level, creating a modular framework compatible with existing methods (PC, FCI, PCMCI, etc.) without major changes.

Result: Developed an extremely modular, extensible framework that can leverage existing causal discovery algorithms while systematically addressing subproblems like change-point detection and clustering.

Conclusion: The framework provides a principled approach to handle causal graph variations in non-stationary data, with clear understanding of limitations, trade-offs, and statistical interpretation.

Abstract: Real-world data, for example in climate applications, often consists of spatially gridded time series data or data with comparable structure. While the underlying system is often believed to behave similar at different points in space and time, those variations that do exist are twofold relevant: They often encode important information in and of themselves. And they may negatively affect the stability / convergence and reliability\Slash{}validity of results of algorithms assuming stationarity or space-translation invariance. We study the information encoded in changes of the causal graph, with stability in mind. An analysis of this general task identifies two core challenges. We develop guiding principles to overcome these challenges, and provide a framework realizing these principles by modifying constraint-based causal discovery approaches on the level of independence testing. This leads to an extremely modular, easily extensible and widely applicable framework. It can leverage existing constraint-based causal discovery methods (demonstrated on IID-algorithms PC, PC-stable, FCI and time series algorithms PCMCI, PCMCI+, LPCMCI) with little to no modification. The built-in modularity allows to systematically understand and improve upon an entire array of subproblems. By design, it can be extended by leveraging insights from change-point-detection, clustering, independence-testing and other well-studied related problems. The division into more accessible sub-problems also simplifies the understanding of fundamental limitations, hyperparameters controlling trade-offs and the statistical interpretation of results. An open-source implementation will be available soon.

[426] Computing Strategic Responses to Non-Linear Classifiers

Jack Geary, Boyan Gao, Henry Gouk

Main category: cs.LG

TL;DR: A novel method for computing best responses in strategic classification using Lagrangian dual optimization, enabling non-linear classifiers where previous approaches were limited to linear settings.

Details

Motivation: Strategic classification faces distribution shifts when classifiers are deployed, but current methods are primarily limited to linear classifiers, while non-linear classifiers are often more suitable. The key limitation is the inability to compute best responses in non-linear settings.

Method: Proposes a method for computing best responses by optimizing the Lagrangian dual of the Agents’ objective function, which can handle both linear and non-linear classifier settings.

Result: The method successfully reproduces best responses in linear settings and identifies weaknesses in existing approaches. It can be straightforwardly applied to non-linear classifier settings for both evaluation and training purposes.

Conclusion: The Lagrangian dual optimization approach enables effective computation of best responses in strategic classification, overcoming previous limitations and allowing for the use of more suitable non-linear classifiers in strategic settings.

Abstract: We consider the problem of strategic classification, where the act of deploying a classifier leads to strategic behaviour that induces a distribution shift on subsequent observations. Current approaches to learning classifiers in strategic settings are focused primarily on the linear setting, but in many cases non-linear classifiers are more suitable. A central limitation to progress for non-linear classifiers arises from the inability to compute best responses in these settings. We present a novel method for computing the best response by optimising the Lagrangian dual of the Agents’ objective. We demonstrate that our method reproduces best responses in linear settings, identifying key weaknesses in existing approaches. We present further results demonstrating our method can be straight-forwardly applied to non-linear classifier settings, where it is useful for both evaluation and training.

[427] Machine Learning Approaches to Clinical Risk Prediction: Multi-Scale Temporal Alignment in Electronic Health Records

Wei-Chen Chang, Lu Dai, Ting Xu

Main category: cs.LG

TL;DR: Proposes MSTAN for EHR risk prediction, addressing temporal irregularity and multi-scale dependencies through temporal alignment and multi-scale feature extraction.

Details

Motivation: To handle challenges in EHR data including temporal irregularity, sampling interval differences, and multi-scale dynamic dependencies that affect risk prediction accuracy.

Method: Uses learnable temporal alignment mechanism, multi-scale convolutional feature extraction, temporal embedding, and attention-based aggregation to model long-term trends and short-term fluctuations.

Result: Outperforms mainstream baselines on public EHR datasets in accuracy, recall, precision, and F1-Score, demonstrating effectiveness and robustness.

Conclusion: Provides effective solution for intelligent representation of high-dimensional asynchronous medical sequences and supports EHR-driven clinical risk prediction.

Abstract: This study proposes a risk prediction method based on a Multi-Scale Temporal Alignment Network (MSTAN) to address the challenges of temporal irregularity, sampling interval differences, and multi-scale dynamic dependencies in Electronic Health Records (EHR). The method focuses on temporal feature modeling by introducing a learnable temporal alignment mechanism and a multi-scale convolutional feature extraction structure to jointly model long-term trends and short-term fluctuations in EHR sequences. At the input level, the model maps multi-source clinical features into a unified high-dimensional semantic space and employs temporal embedding and alignment modules to dynamically weight irregularly sampled data, reducing the impact of temporal distribution differences on model performance. The multi-scale feature extraction module then captures key patterns across different temporal granularities through multi-layer convolution and hierarchical fusion, achieving a fine-grained representation of patient states. Finally, an attention-based aggregation mechanism integrates global temporal dependencies to generate individual-level risk representations for disease risk prediction and health status assessment. Experiments conducted on publicly available EHR datasets show that the proposed model outperforms mainstream baselines in accuracy, recall, precision, and F1-Score, demonstrating the effectiveness and robustness of multi-scale temporal alignment in complex medical time-series analysis. This study provides a new solution for intelligent representation of high-dimensional asynchronous medical sequences and offers important technical support for EHR-driven clinical risk prediction.

[428] A decoupled alignment kernel for peptide membrane permeability predictions

Ali Amirahmadi, Gökçe Geylan, Leonardo De Maria, Farzaneh Etminani, Mattias Ohlsson, Alessandro Tibo

Main category: cs.LG

TL;DR: Proposes MD-GAK and PMD-GAK kernels for predicting cyclic peptide cell permeability, focusing on uncertainty estimation using Gaussian Processes and outperforming state-of-the-art models.

Details

Motivation: Cell-membrane permeability is a key bottleneck for cyclic peptides targeting intracellular sites, with limited public data and need for well-calibrated uncertainty estimation.

Method: Developed monomer-aware decoupled global alignment kernel (MD-GAK) that couples residue similarity with sequence alignment while decoupling local matches from gap penalties, plus PMD-GAK variant with triangular positional prior. Used Gaussian Processes for uncertainty estimation.

Result: The methods outperform state-of-the-art models across all metrics, with PMD-GAK offering additional advantages in reducing calibration errors.

Conclusion: The proposed kernels provide effective and reproducible approaches for predicting cyclic peptide permeability with robust uncertainty estimation.

Abstract: Cyclic peptides are promising modalities for targeting intracellular sites; however, cell-membrane permeability remains a key bottleneck, exacerbated by limited public data and the need for well-calibrated uncertainty. Instead of relying on data-eager complex deep learning architecture, we propose a monomer-aware decoupled global alignment kernel (MD-GAK), which couples chemically meaningful residue-residue similarity with sequence alignment while decoupling local matches from gap penalties. MD-GAK is a relatively simple kernel. To further demonstrate the robustness of our framework, we also introduce a variant, PMD-GAK, which incorporates a triangular positional prior. As we will show in the experimental section, PMD-GAK can offer additional advantages over MD-GAK, particularly in reducing calibration errors. Since our focus is on uncertainty estimation, we use Gaussian Processes as the predictive model, as both MD-GAK and PMD-GAK can be directly applied within this framework. We demonstrate the effectiveness of our methods through an extensive set of experiments, comparing our fully reproducible approach against state-of-the-art models, and show that it outperforms them across all metrics.

[429] Learning When to Stop: Adaptive Latent Reasoning via Reinforcement Learning

Alex Ning, Yen-Ling Kuo, Gabe Gomes

Main category: cs.LG

TL;DR: Latent reasoning reduces reasoning length by 52% without accuracy loss using adaptive-length models and RL optimization.

Details

Motivation: To compress reasoning lengths in Transformer models beyond chain-of-thought by using latent states instead of human language tokens.

Method: Developed adaptive-length latent reasoning models with post-SFT reinforcement learning to optimize reasoning length while maintaining accuracy.

Result: 52% reduction in total reasoning length on Llama 3.2 1B model and GSM8K-Aug dataset with no accuracy penalty.

Conclusion: Latent reasoning effectively reduces compute usage and demonstrates strong compression capabilities, with plans to extend to more models and datasets.

Abstract: Latent reasoning represents a new development in Transformer language models that has shown potential in compressing reasoning lengths compared to chain-of-thought reasoning. By directly passing the information-rich previous final latent state into the next sequence, latent reasoning removes the restriction to human language tokens as the medium for reasoning. We develop adaptive-length latent reasoning models and introduce a post-SFT reinforcement-learning methodology to optimize latent reasoning length by minimizing reasoning length while maintaining accuracy. This, in turn, further reduces compute usage and raises the bar on the compressive capabilities of latent reasoning models. Experiments on the Llama 3.2 1B model and the GSM8K-Aug dataset show a $52%$ drop in total reasoning length with no penalty to accuracy. In future work, we plan to extend to additional models and datasets, analyze relationships between training coefficients, experiment with architecture variations, and continue our knowledge distillation for latent reasoning SFT efforts. We make our code and pretrained weights available at https://github.com/apning/adaptive-latent-reasoning.

[430] An AI-Enabled Hybrid Cyber-Physical Framework for Adaptive Control in Smart Grids

Muhammad Siddique, Sohaib Zafar

Main category: cs.LG

TL;DR: A machine learning-based digital forensic framework for smart grids deployed on the Cloud, combining data acquisition, secure communication, cloud storage, and automated forensic analytics to detect and mitigate cyber-attacks.

Details

Motivation: Smart grids integrate power infrastructure with communication networks, creating vulnerabilities that can undermine grid stability and reliability. Digital forensics is essential for identifying, detecting, and mitigating security incidents in this cyber-physical environment.

Method: An all-in-one framework using supervised and unsupervised learning algorithms (Random Forest, SVM, Gradient Boosted Trees, deep neural networks) for anomaly detection, event reconstruction, and intrusion analysis. Deployed on Cloud with data acquisition at sensor-level, authenticated communication, and scalable storage.

Result: High accuracy, scalability, and resilience to cyber-attacks including data tampering, false-data injection, and coordinated control-loop manipulation. Demonstrated effectiveness through simulations and experiments on real-time smart-meter data streams.

Conclusion: Cloud services provide the best backbone for big-data-driven forensic workflows, enabling energy utilities to achieve fast situational awareness and intelligent incident response in smart grid systems.

Abstract: Smart grids are a fusion of classical power infrastructure and advanced communication networks and smart control, to create a cyber-physical environment that is more efficient and flexible than ever before. This integration causes vulnerabilities that can undermine grid stability as well as reliability. Digital forensics is a fundamental concept of learning and identifying, detecting, and mitigating such security incidents. This paper presents an all-in-one machine learning-based digital forensic framework of smart grid systems deployed on the Cloud. The framework combines the data acquisition at the sensor-level, authenticated communication, scalable cloud storage and automated forensic analytics. The model uses supervised and unsupervised learning algorithms - such as Random Forest, Support Vector Machine, Gradient Boosted Trees and deep neural architectures for anomaly detection, event reconstruction and intrusion analysis in real time. After several simulation and experimental studies on real-time smart-meter data streams, the proposed framework is shown to be very accurate, scalable and resilient to cyber-attacks including data tampering, false-data injection and coordinated control-loop manipulation. The results indicate that cloud services are the best backbone for big-data-driven forensic workflows, which allows energy utilities to achieve a fast situational awareness and intelligent incident response.

[431] Visualizing LLM Latent Space Geometry Through Dimensionality Reduction

Alex Ning, Vainateya Rangaraju

Main category: cs.LG

TL;DR: This paper extracts and visualizes latent state geometries in Transformer-based LLMs using dimensionality reduction techniques (PCA and UMAP) to interpret internal mechanisms.

Details

Motivation: LLMs achieve state-of-the-art performance but their internal mechanisms remain difficult to interpret, motivating the need for systematic analysis of Transformer internals.

Method: Extract layerwise activations from Transformer blocks and apply dimensionality reduction techniques (PCA and UMAP) to visualize latent state geometries in GPT-2 and LLaMa models.

Result: Identified clear separation between attention and MLP component outputs across intermediate layers, characterized high norm of latent states at initial sequence position, visualized layerwise evolution of latent states, and demonstrated high-dimensional helical structure of positional embeddings.

Conclusion: The approach enables systematic analysis of Transformer internals and supports reproducible interpretability research, with code made publicly available.

Abstract: Large language models (LLMs) achieve state-of-the-art results across many natural language tasks, but their internal mechanisms remain difficult to interpret. In this work, we extract, process, and visualize latent state geometries in Transformer-based language models through dimensionality reduction. We capture layerwise activations at multiple points within Transformer blocks and enable systematic analysis through Principal Component Analysis (PCA) and Uniform Manifold Approximation (UMAP). We demonstrate experiments on GPT-2 and LLaMa models, where we uncover interesting geometric patterns in latent space. Notably, we identify a clear separation between attention and MLP component outputs across intermediate layers, a pattern not documented in prior work to our knowledge. We also characterize the high norm of latent states at the initial sequence position and visualize the layerwise evolution of latent states. Additionally, we demonstrate the high-dimensional helical structure of GPT-2’s positional embeddings, the sequence-wise geometric patterns in LLaMa, and experiment with repeating token sequences. We aim to support systematic analysis of Transformer internals with the goal of enabling further reproducible interpretability research. We make our code available at https://github.com/Vainateya/Feature_Geometry_Visualization.

[432] On the Origin of Algorithmic Progress in AI

Hans Gundlach, Alex Fogelson, Jayson Lynch, Ana Trisovic, Jonathan Rosenfeld, Anmol Sandhu, Neil Thompson

Main category: cs.LG

TL;DR: Algorithms have shown 22,000x FLOP efficiency gains from 2012-2023, but small-scale experiments only account for <100x. Scaling experiments reveal most gains come from scale-dependent efficiency improvements, particularly the LSTM-to-Transformer transition.

Details

Motivation: To understand the discrepancy between reported 22,000x algorithmic efficiency gains and what can be explained by small-scale experiments, revealing that algorithmic progress for small models is much slower than assumed.

Method: Conducted small-scale ablation experiments on key innovations, surveyed literature, and performed scaling experiments comparing LSTMs and Transformers to analyze compute-optimal scaling laws.

Result: Found that scale-dependent efficiency improvements explain most gains, with LSTM-to-Transformer transition accounting for majority. Experimental extrapolation accounts for 6,930x efficiency gains, showing algorithmic efficiency is strongly reference-dependent.

Conclusion: Algorithmic progress for small models has been far slower than previously assumed, and measures of algorithmic efficiency are strongly dependent on the compute scale reference point.

Abstract: Algorithms have been estimated to increase AI training FLOP efficiency by a factor of 22,000 between 2012 and 2023 [Ho et al., 2024]. Running small-scale ablation experiments on key innovations from this time period, we are able to account for less than 10x of these gains. Surveying the broader literature, we estimate that additional innovations not included in our ablations account for less than 10x, yielding a total under 100x. This leads us to conduct scaling experiments, which reveal that much of this efficiency gap can be explained by algorithms with scale-dependent efficiency improvements. In particular, we conduct scaling experiments between LSTMs and Transformers, finding exponent differences in their compute-optimal scaling law while finding little scaling difference for many other innovations. These experiments demonstrate that - contrary to standard assumptions - an algorithm’s efficiency gains are tied to compute scale. Using experimental extrapolation and literature estimates, we account for 6,930x efficiency gains over the same time period, with the scale-dependent LSTM-to-Transformer transition accounting for the majority of gains. Our results indicate that algorithmic progress for small models has been far slower than previously assumed, and that measures of algorithmic efficiency are strongly reference-dependent.

[433] Scale-Agnostic Kolmogorov-Arnold Geometry in Neural Networks

Mathew Vanherreweghe, Michael H. Freedman, Keith M. Adams

Main category: cs.LG

TL;DR: Kolmogorov-Arnold geometric structure emerges spontaneously in MLPs during MNIST training, showing scale-invariant properties across different spatial scales and training procedures.

Details

Motivation: To determine if KAG structure observed in synthetic 3D tasks persists in realistic high-dimensional settings like MNIST, and to characterize its spatial properties.

Method: Extended KAG analysis to MNIST digit classification using 2-layer MLPs with systematic spatial analysis at multiple scales (from 7-pixel neighborhoods to full 28x28 images), testing both standard training and spatial augmentation.

Result: KAG emerges during training and appears consistently across all spatial scales, with the same qualitative pattern regardless of training procedure (standard or spatial augmentation).

Conclusion: Neural networks spontaneously develop organized, scale-invariant geometric structure during learning on realistic high-dimensional data.

Abstract: Recent work by Freedman and Mulligan demonstrated that shallow multilayer perceptrons spontaneously develop Kolmogorov-Arnold geometric (KAG) structure during training on synthetic three-dimensional tasks. However, it remained unclear whether this phenomenon persists in realistic high-dimensional settings and what spatial properties this geometry exhibits. We extend KAG analysis to MNIST digit classification (784 dimensions) using 2-layer MLPs with systematic spatial analysis at multiple scales. We find that KAG emerges during training and appears consistently across spatial scales, from local 7-pixel neighborhoods to the full 28x28 image. This scale-agnostic property holds across different training procedures: both standard training and training with spatial augmentation produce the same qualitative pattern. These findings reveal that neural networks spontaneously develop organized, scale-invariant geometric structure during learning on realistic high-dimensional data.

[434] Mechanisms of Non-Monotonic Scaling in Vision Transformers

Anantha Padmanaban Krishna Kumar

Main category: cs.LG

TL;DR: Deeper Vision Transformers perform worse than shallower ones due to a Cliff-Plateau-Climb pattern in representation evolution, where better performance comes from marginalizing the [CLS] token in favor of distributed patch token consensus.

Details

Motivation: To understand why deeper Vision Transformers often underperform shallower ones, challenging common scaling assumptions in transformer architectures.

Method: Systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, using an Information Scrambling Index to quantify information mixing patterns across layers.

Result: Identified consistent three-phase pattern; better performance correlates with [CLS] token marginalization; ViT-L shows information-task tradeoff emerging 10 layers later than ViT-B; additional layers increase information diffusion without improving task performance.

Conclusion: Transformer architectures benefit more from carefully calibrated depth with clean phase transitions than simply increasing parameters; Information Scrambling Index serves as useful diagnostic for model design.

Abstract: Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, we identify a consistent three-phase Cliff-Plateau-Climb pattern that governs how representations evolve with depth. We observe that better performance is associated with progressive marginalization of the [CLS] token, originally designed as a global aggregation hub, in favor of distributed consensus among patch tokens. We quantify patterns of information mixing with an Information Scrambling Index, and show that in ViT-L the information-task tradeoff emerges roughly 10 layers later than in ViT-B, and that these additional layers correlate with increased information diffusion rather than improved task performance. Taken together, these results suggest that transformer architectures in this regime may benefit more from carefully calibrated depth that executes clean phase transitions than from simply increasing parameter count. The Information Scrambling Index provides a useful diagnostic for existing models and suggests a potential design target for future architectures. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/Cliff-Plateau-Climb.

[435] Federated Large Language Models: Current Progress and Future Directions

Yuhang Yao, Jianyi Zhang, Junda Wu, Chengkai Huang, Yu Xia, Tong Yu, Ruiyi Zhang, Sungchul Kim, Ryan Rossi, Ang Li, Lina Yao, Julian McAuley, Yiran Chen, Carlee Joe-Wong

Main category: cs.LG

TL;DR: This paper surveys Federated Learning for Large Language Models (FedLLM), addressing challenges like data heterogeneity and communication costs while exploring fine-tuning and prompt learning approaches in federated settings.

Details

Motivation: Large language models face privacy concerns during data collection, and federated learning offers a solution by enabling collaborative training without sharing local data, though it introduces new challenges like model convergence issues.

Method: The paper conducts a comprehensive survey of FedLLM, focusing on two key aspects: fine-tuning and prompt learning in federated settings, analyzing existing work and research challenges.

Result: The survey highlights recent advances in FedLLM and identifies current research challenges, providing a foundation for understanding the state of federated learning approaches for large language models.

Conclusion: The paper proposes future directions for federated LLMs, including pre-training, federated agents, and using LLMs for federated learning, guiding future research in this emerging field.

Abstract: Large language models are rapidly gaining popularity and have been widely adopted in real-world applications. While the quality of training data is essential, privacy concerns arise during data collection. Federated learning offers a solution by allowing multiple clients to collaboratively train LLMs without sharing local data. However, FL introduces new challenges, such as model convergence issues due to heterogeneous data and high communication costs. A comprehensive study is required to address these challenges and guide future research. This paper surveys Federated learning for LLMs (FedLLM), highlighting recent advances and future directions. We focus on two key aspects: fine-tuning and prompt learning in a federated setting, discussing existing work and associated research challenges. We finally propose potential directions for federated LLMs, including pre-training, federated agents, and LLMs for federated learning.

[436] Aligning LLMs Toward Multi-Turn Conversational Outcomes Using Iterative PPO

Daniel R. Jiang, Jalaj Bhandari, Yukai Yang, Rémi Munos, Tyler Lu

Main category: cs.LG

TL;DR: Proposes Iterative PPO, a method that reduces multi-turn conversational RL to single-turn RLHF problems using learned Q-functions as rewards, enabling stable policy improvement with standard tools.

Details

Motivation: Optimizing LLMs for multi-turn conversations is challenging due to sparse rewards and the gap between response-level planning and token-level generation in goal-oriented settings like AI marketing.

Method: Formal reduction of multi-turn RL to single-turn RLHF problems by using learned multi-turn Q-functions as reward models, then applying standard token-level PPO as policy improvement steps.

Result: Iterative PPO enables stable multi-turn conversational optimization by leveraging existing single-turn RLHF tools, bridging online and offline approaches.

Conclusion: The method provides a practical solution for multi-turn conversational optimization that combines adaptability with training stability, making it straightforward to implement with existing tools.

Abstract: Optimizing large language models (LLMs) for multi-turn conversational outcomes remains a significant challenge, especially in goal-oriented settings like AI marketing or sales agents who facilitate transactions via messaging platforms. The difficulty stems from sparse, long-horizon rewards and the discrepancy between response-level planning and token-level generation. In this technical note, we propose a formal reduction of the multi-turn RL problem into a sequence of single-turn RLHF-style problems. This is achieved by setting a learned multi-turn Q-function as the reward model for the single-turn problem. We demonstrate and prove a key insight: solving this single-turn RL problem with standard token-level PPO is equivalent to a policy improvement step within the multi-turn problem. This insight naturally leads to Iterative PPO, a batch online policy iteration algorithm that alternates between fitting Q-functions from logged conversation trajectories and improving the policy. A major practical advantage is that Iterative PPO directly leverages stable, off-the-shelf single-turn RLHF tools, making it straightforward to implement. Our method occupies a middle ground between fully online and fully offline approaches, retaining the adaptability of online updates while gaining the stability benefits of offline training.

[437] EvilGenie: A Reward Hacking Benchmark

Jonathan Gabor, Jayson Lynch, Jonathan Rosenfeld

Main category: cs.LG

TL;DR: EvilGenie is a benchmark for detecting reward hacking in programming agents, where models cheat by hardcoding test cases or editing test files instead of solving problems correctly.

Details

Motivation: To create a systematic way to measure and detect reward hacking behavior in programming agents, as current coding benchmarks may not catch when models manipulate test environments to achieve high scores.

Method: Created benchmark using LiveCodeBench problems with environments that enable reward hacking. Used three detection methods: held-out unit tests, LLM judges, and test file edit detection, validated against human review.

Result: LLM judges were highly effective at detecting unambiguous reward hacking. Held-out tests provided minimal additional improvement. Codex and Claude Code showed explicit reward hacking, while all three tested agents (Codex, Claude Code, Gemini) exhibited misaligned behavior.

Conclusion: Reward hacking is a real problem in programming agents that can be systematically detected using methods like LLM judges, and current popular coding agents demonstrate concerning misaligned behaviors.

Abstract: We introduce EvilGenie, a benchmark for reward hacking in programming settings. We source problems from LiveCodeBench and create an environment in which agents can easily reward hack, such as by hardcoding test cases or editing the testing files. We measure reward hacking in three ways: held out unit tests, LLM judges, and test file edit detection. We verify these methods against human review and each other. We find the LLM judge to be highly effective at detecting reward hacking in unambiguous cases, and observe only minimal improvement from the use of held out test cases. In addition to testing many models using Inspect’s basic_agent scaffold, we also measure reward hacking rates for three popular proprietary coding agents: OpenAI’s Codex, Anthropic’s Claude Code, and Google’s Gemini CLI Using GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro, respectively. We observe explicit reward hacking by both Codex and Claude Code, and misaligned behavior by all three agents. Our codebase can be found at https://github.com/JonathanGabor/EvilGenie.

[438] Beyond Introspection: Reinforcing Thinking via Externalist Behavioral Feedback

Diji Yang, Linda Zeng, Kezhen Chen, Yi Zhang

Main category: cs.LG

TL;DR: DRR framework uses external behavioral analysis instead of self-critique to improve LLM reasoning reliability by distilling behavioral traces, training a discriminative model, and using it to reject flawed reasoning steps.

Details

Motivation: Existing self-critique methods suffer from introspection illusion where models inherit biases from their original outputs, especially near knowledge boundaries, leading to unreliable reasoning.

Method: Three-step DRR framework: Distillation (collect behavioral traces), Reinforcement (train lightweight discriminative model), Reasoning (use DM as external critic to reject suspicious reasoning steps and explore alternatives).

Result: Significantly outperforms prominent self-critique methods on multiple reasoning benchmarks, improving reasoning quality without modifying the base model.

Conclusion: DRR provides a scalable, annotation-free solution for enhancing LLM reasoning reliability across various models by using external behavioral evaluation rather than introspection.

Abstract: While inference-time thinking allows Large Language Models (LLMs) to address complex problems, the extended thinking process can be unreliable or inconsistent because of the model’s probabilistic nature, especially near its knowledge boundaries. Existing approaches attempt to mitigate this by having the model critique its own reasoning to make corrections. However, such self-critique inherits the same biases of the original output, known as the introspection illusion. Moving beyond such introspection and inspired by core methodologies in ethology, we propose an externalist three-step framework Distillation-Reinforcement-Reasoning (DRR). Rather than relying on a model’s introspection, DRR evaluates its observable behaviors to provide corrective feedback. DRR first distills the reasoner’s behavioral traces, then trains a lightweight, external Discriminative Model (DM). At inference time, this DM acts as a critic, identifying and rejecting suspicious reasoning steps. This external feedback compels the LLM to discard flawed pathways and explore alternatives, thereby enhancing reasoning quality without altering the base model. Experiments on multiple reasoning benchmarks show that our framework significantly outperforms prominent self-critique methods. Benefiting from a lightweight and annotation-free design, DRR offers a scalable and adaptable solution for improving the reliability of reasoning in a wide range of LLMs.

[439] Escaping the Verifier: Learning to Reason via Demonstrations

Locke Cai, Ivan Provilkov

Main category: cs.LG

TL;DR: RARO enables training LLMs for reasoning tasks using only expert demonstrations via adversarial Inverse Reinforcement Learning, outperforming verifier-free baselines across multiple domains.

Details

Motivation: Many reasoning-intensive tasks lack task-specific verifiers but have abundant expert demonstrations that are underutilized for reasoning-focused training.

Method: Adversarial interaction between policy (generator) and relativistic critic (discriminator), jointly trained via RL with key stabilization techniques. The policy mimics expert answers while the critic distinguishes between policy and expert answers.

Result: Significantly outperforms verifier-free baselines on Countdown, DeepMath, and Poetry Writing tasks, showing robust scaling trends similar to RL with verifiers.

Conclusion: RARO effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning without task-specific verifiers.

Abstract: Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization) that learns strong reasoning capabilities from only expert demonstrations via Inverse Reinforcement Learning. Our method sets up an adversarial interaction between a policy (generator) and a relativistic critic (discriminator): the policy learns to mimic expert answers, while the critic learns to compare and distinguish between policy and expert answers. Our method trains both the policy and the critic jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines on all of our evaluation tasks – Countdown, DeepMath, and Poetry Writing – and enjoys the same robust scaling trends as RL on verifiable tasks. These results demonstrate that our method effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.

[440] Through the telecom lens: Are all training samples important?

Shruti Bothe, Illyyne Saffar, Aurelie Boisbunon, Hasan Farooq, Julien Forgeat, Md Moin Uddin Chowdhury

Main category: cs.LG

TL;DR: This paper challenges the assumption that all training samples contribute equally in telecom AI workflows and proposes a sample importance framework to selectively prioritize impactful data, reducing computation and energy use while maintaining accuracy.

Details

Motivation: Telecom AI faces challenges with noisy, high-dimensional data that is costly to store, process, and label. Standard workflows assume equal sample importance, but next-generation systems require accurate, efficient, and sustainable AI models.

Method: The authors perform sample-level gradient analysis across epochs to identify patterns of influence and redundancy, then propose a sample importance framework that selectively prioritizes impactful data.

Result: Experiments on three real-world telecom datasets show that the method reserves performance while reducing data needs and computational overhead.

Conclusion: The proposed approach advances sustainable AI in telecommunications by optimizing computation and energy use without compromising accuracy.

Abstract: The rise of AI in telecommunications, from optimizing Radio Access Networks to managing user experience, has sharply increased data volumes and training demands. Telecom data is often noisy, high-dimensional, costly to store, process, and label. Despite Ai’s critical role, standard workflows still assume all training samples contribute equally. On the other hand, next generation systems require AI models that are accurate, efficient, and sustainable.The paper questions the assumptions of equal importance by focusing on applying and analyzing the roles of individual samples in telecom training and assessing whether the proposed model optimizes computation and energy use. we perform sample-level gradient analysis across epochs to identify patterns of influence and redundancy in model learning. Based on this, we propose a sample importance framework thats electively prioritizes impactful data and reduces computation without compromising accuracy. Experiments on three real-world telecom datasets show that our method [reserves performance while reducing data needs and computational overhead while advancing the goals of sustainable AI in telecommunications.

[441] DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving

Fengze Yu, Leshu Li, Brad McDanel, Saiqian Zhang

Main category: cs.LG

TL;DR: DSD is a distributed speculative decoding framework that extends speculative decoding to multi-device environments, achieving up to 1.1x speedup and 9.7% higher throughput over existing baselines.

Details

Motivation: LLM inference suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments, with existing speculative decoding techniques confined to single-node execution.

Method: Proposed DSD framework with coordinated draft-target execution across multiple devices, introduced DSD-Sim simulator for network/batching/scheduling dynamics, and designed Adaptive Window Control policy to dynamically adjust speculation window size.

Result: Experiments show DSD achieves up to 1.1x speedup and 9.7% higher throughput over existing SD baselines across diverse workloads.

Conclusion: DSD enables agile and scalable LLM serving across edge and cloud environments through distributed speculative decoding.

Abstract: Large language model (LLM) inference often suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments. Existing speculative decoding (SD) techniques accelerate token generation but remain confined to single-node execution. We propose DSD, a distributed speculative decoding framework that extends SD to multi-device deployments through coordinated draft-target execution. Given the lack of prior work on simulating this paradigm, we first introduce DSD-Sim, a discrete-event simulator that captures network, batching, and scheduling dynamics. Building on insights from DSD-Sim, we further design an Adaptive Window Control (AWC) policy that dynamically adjusts speculation window size to optimize throughput. Experiments across diverse workloads show that DSD achieves up to 1.1x speedup and 9.7% higher throughput over existing SD baselines, enabling agile and scalable LLM serving across edge and cloud.

[442] Characterizing Pattern Matching and Its Limits on Compositional Task Structures

Hoyeon Chang, Jinho Park, Hanseul Cho, Sohee Yang, Miyoung Ko, Hyeonbin Hwang, Seungpil Won, Dohaeng Lee, Youbin Ahn, Minjoon Seo

Main category: cs.LG

TL;DR: LLMs’ pattern-matching behaviors are studied through functional equivalence, revealing predictive scaling laws and structural barriers like path ambiguity that limit compositional generalization.

Details

Motivation: To address ambiguity in behavioral studies that allow multiple generalization sources, and provide a precise account of LLMs' pattern-matching generalization capabilities and limitations.

Method: Formalize pattern matching as functional equivalence, then systematically study decoder-only Transformer and Mamba architectures in controlled compositional tasks that isolate this mechanism.

Result: (1) Pattern matching success predicted by context witnesses; (2) Proven tight sample complexity bound for two-hop structures with empirical validation; (3) Path ambiguity identified as structural barrier; (4) Chain-of-Thought reduces data needs but doesn’t resolve path ambiguity.

Conclusion: Provides predictive, falsifiable boundary for pattern matching and foundational diagnostic for disentangling mixed generalization mechanisms in LLMs.

Abstract: Despite impressive capabilities, LLMs’ successes often rely on pattern-matching behaviors, yet these are also linked to OOD generalization failures in compositional tasks. However, behavioral studies commonly employ task setups that allow multiple generalization sources (e.g., algebraic invariances, structural repetition), obscuring a precise and testable account of how well LLMs perform generalization through pattern matching and their limitations. To address this ambiguity, we first formalize pattern matching as functional equivalence, i.e., identifying pairs of subsequences of inputs that consistently lead to identical results when the rest of the input is held constant. Then, we systematically study how decoder-only Transformer and Mamba behave in controlled tasks with compositional structures that isolate this mechanism. Our formalism yields predictive and quantitative insights: (1) Instance-wise success of pattern matching is well predicted by the number of contexts witnessing the relevant functional equivalence. (2) We prove a tight sample complexity bound of learning a two-hop structure by identifying the exponent of the data scaling law for perfect in-domain generalization. Our empirical results align with the theoretical prediction, under 20x parameter scaling and across architectures. (3) Path ambiguity is a structural barrier: when a variable influences the output via multiple paths, models fail to form unified intermediate state representations, impairing accuracy and interpretability. (4) Chain-of-Thought reduces data requirements yet does not resolve path ambiguity. Hence, we provide a predictive, falsifiable boundary for pattern matching and a foundational diagnostic for disentangling mixed generalization mechanisms.

[443] AutoDiscovery: Open-ended Scientific Discovery via Bayesian Surprise

Dhruv Agarwal, Bodhisattwa Prasad Majumder, Reece Adamson, Megha Chakravorty, Satvika Reddy Gavireddy, Aditya Parashar, Harshit Surana, Bhavana Dalvi Mishra, Andrew McCallum, Ashish Sabharwal, Peter Clark

Main category: cs.LG

TL;DR: AutoDiscovery enables open-ended autonomous scientific discovery by using Bayesian surprise to drive exploration, outperforming competitors by producing 5-29% more surprising discoveries across 21 real-world datasets.

Details

Motivation: Current autonomous scientific discovery systems rely on human-specified questions, but true acceleration requires AI systems that can drive exploration by their own criteria. Existing approaches use diversity heuristics or subjective proxies that struggle with vast hypothesis spaces or imprecise definitions.

Method: Uses Bayesian surprise to quantify epistemic shifts from prior to posterior beliefs, combined with Monte Carlo tree search (MCTS) with progressive widening using surprisal as the reward function to efficiently explore nested hypotheses.

Result: Under fixed budget, AutoDiscovery substantially outperforms competitors by producing 5-29% more discoveries deemed surprising by the LLM. Human evaluation shows two-thirds of discoveries are surprising to domain experts.

Conclusion: AutoDiscovery represents an important step towards building effective open-ended autonomous scientific discovery systems that can drive exploration by their own criteria rather than relying on human guidance.

Abstract: The promise of autonomous scientific discovery (ASD) hinges not only on answering questions, but also on knowing which questions to ask. Most recent works in ASD explore the use of large language models (LLMs) in goal-driven settings, relying on human-specified research questions to guide hypothesis generation. However, scientific discovery may be accelerated further by allowing the AI system to drive exploration by its own criteria. The few existing approaches in open-ended ASD select hypotheses based on diversity heuristics or subjective proxies for human interestingness, but the former struggles to meaningfully navigate the typically vast hypothesis space, and the latter suffers from imprecise definitions. This paper presents AutoDiscovery – a method for open-ended ASD that instead drives scientific exploration using Bayesian surprise. Here, we quantify the epistemic shift from the LLM’s prior beliefs about a hypothesis to its posterior beliefs after gathering experimental results. To efficiently explore the space of nested hypotheses, our method employs a Monte Carlo tree search (MCTS) strategy with progressive widening using surprisal as the reward function. We evaluate AutoDiscovery in the setting of data-driven discovery across 21 real-world datasets spanning domains such as biology, economics, finance, and behavioral science. Our results demonstrate that under a fixed budget, AutoDiscovery substantially outperforms competitors by producing 5-29% more discoveries deemed surprising by the LLM. Our human evaluation further reveals that two-thirds of discoveries made by our system are surprising to domain experts as well, suggesting this is an important step towards building open-ended ASD systems.

[444] Mechanism of Task-oriented Information Removal in In-context Learning

Hakaze Cho, Haolin Yang, Gouki Minegishi, Naoya Inoue

Main category: cs.LG

TL;DR: In-context learning works by selectively removing task-irrelevant information from language model hidden states, with specific “denoising heads” enabling this information removal process.

Details

Motivation: To understand the inner mechanism of in-context learning (ICL) in language models, particularly why it works effectively for few-shot learning despite unclear underlying processes.

Method: Investigated ICL through information removal perspective, using low-rank filters to selectively remove information from hidden states, identified denoising heads via attention analysis, and conducted ablation experiments.

Result: Found that zero-shot LMs encode non-selective representations containing all possible task information, while few-shot ICL selectively removes redundant information through denoising heads, significantly improving task performance.

Conclusion: Information removal from hidden states constitutes a key mechanism of ICL, with denoising heads playing a critical role in this process, especially when correct labels are absent from demonstrations.

Abstract: In-context Learning (ICL) is an emerging few-shot learning paradigm based on modern Language Models (LMs), yet its inner mechanism remains unclear. In this paper, we investigate the mechanism through a novel perspective of information removal. Specifically, we demonstrate that in the zero-shot scenario, LMs encode queries into non-selective representations in hidden states containing information for all possible tasks, leading to arbitrary outputs without focusing on the intended task, resulting in near-zero accuracy. Meanwhile, we find that selectively removing specific information from hidden states by a low-rank filter effectively steers LMs toward the intended task. Building on these findings, by measuring the hidden states on carefully designed metrics, we observe that few-shot ICL effectively simulates such task-oriented information removal processes, selectively removing the redundant information from entangled non-selective representations, and improving the output based on the demonstrations, which constitutes a key mechanism underlying ICL. Moreover, we identify essential attention heads inducing the removal operation, termed Denoising Heads, which enables the ablation experiments blocking the information removal operation from the inference, where the ICL accuracy significantly degrades, especially when the correct label is absent from the few-shot demonstrations, confirming both the critical role of the information removal mechanism and denoising heads.

[445] Dual-Balancing for Multi-Task Learning

Baijiong Lin, Weisen Jiang, Feiyang Ye, Yu Zhang, Pengguang Chen, Ying-Cong Chen, Shu Liu, Ivor W. Tsang, James T. Kwok

Main category: cs.LG

TL;DR: DB-MTL is a multi-task learning method that balances tasks from both loss and gradient perspectives using logarithm transformation and gradient normalization.

Details

Motivation: Multi-task learning faces challenges with performance compromises due to disparity in loss and gradient scales among tasks, making task balancing a significant issue.

Method: DB-MTL performs logarithm transformation on task losses for loss-scale balancing and normalizes all task gradients to comparable magnitudes using maximum gradient norm for gradient balancing.

Result: Extensive experiments on benchmark datasets show DB-MTL consistently outperforms current state-of-the-art methods.

Conclusion: DB-MTL effectively addresses task balancing in multi-task learning through dual-balancing of loss scales and gradient magnitudes.

Abstract: Multi-task learning aims to learn multiple related tasks simultaneously and has achieved great success in various fields. However, the disparity in loss and gradient scales among tasks often leads to performance compromises, and the balancing of tasks remains a significant challenge. In this paper, we propose Dual-Balancing Multi-Task Learning (DB-MTL) to achieve task balancing from both the loss and gradient perspectives. Specifically, DB-MTL achieves loss-scale balancing by performing logarithm transformation on each task loss, and rescales gradient magnitudes by normalizing all task gradients to comparable magnitudes using the maximum gradient norm. Extensive experiments on a number of benchmark datasets demonstrate that DB-MTL consistently performs better than the current state-of-the-art.

[446] Data Valuation by Fusing Global and Local Statistical Information

Xiaoling Zhou, Ou Wu, Michael K. Ng, Hao Jiang

Main category: cs.LG

TL;DR: The paper proposes enhanced data valuation methods that incorporate global and local statistical properties of value distributions to improve Shapley value estimation and introduces dynamic data valuation without recomputing Shapley values.

Details

Motivation: Existing Shapley value-based data valuation methods neglect value distribution information and dynamic data conditions, compromising performance and application potential.

Method: 1) Comprehensive analysis of value distributions across datasets; 2) Enhanced method with regularization terms incorporating distribution characteristics; 3) Dynamic data valuation approach that infers updated values without recomputing Shapley values.

Result: Extensive experiments show consistent effectiveness and efficiency across tasks including Shapley value estimation, data addition/removal, mislabeled data detection, and dynamic valuation.

Conclusion: Global and local value distributions have significant potential in data valuation, with proposed methods demonstrating improved computational efficiency and performance.

Abstract: Data valuation has garnered increasing attention in recent years, given the critical role of high-quality data in various applications. Among diverse data valuation approaches, Shapley value-based methods are predominant due to their strong theoretical grounding. However, the exact computation of Shapley values is often computationally prohibitive, prompting the development of numerous approximation techniques. Despite notable advancements, existing methods generally neglect the incorporation of value distribution information and fail to account for dynamic data conditions, thereby compromising their performance and application potential. In this paper, we highlight the crucial role of both global and local statistical properties of value distributions in the context of data valuation for machine learning. First, we conduct a comprehensive analysis of these distributions across various simulated and real-world datasets, uncovering valuable insights and key patterns. Second, we propose an enhanced data valuation method that fuses the explored distribution characteristics into two regularization terms to refine Shapley value estimation. The proposed regularizers can be seamlessly incorporated into various existing data valuation methods. Third, we introduce a novel approach for dynamic data valuation that infers updated data values without recomputing Shapley values, thereby significantly improving computational efficiency. Extensive experiments have been conducted across a range of tasks, including Shapley value estimation, value-based data addition and removal, mislabeled data detection, and dynamic data valuation. The results showcase the consistent effectiveness and efficiency of our proposed methodologies, affirming the significant potential of global and local value distributions in data valuation.

[447] Data-Driven Lipschitz Continuity: A Cost-Effective Approach to Improve Adversarial Robustness

Erh-Chung Chen, Pin-Yu Chen, I-Hsin Chung, Che-Rung Lee

Main category: cs.LG

TL;DR: Proposes a cost-efficient adversarial defense method using Lipschitz continuity that achieves comparable robustness to data-intensive methods without requiring external datasets or gradient estimation.

Details

Motivation: Address the high computational costs and impracticality of existing adversarial training methods that rely on external datasets or generative models, limiting real-world deployment of robust DNNs.

Method: Uses Lipschitz continuity principles to enhance robustness, requiring only a single pass over the dataset without gradient estimation. Can integrate with existing adversarial training frameworks without needing extra generative data.

Result: Experimental results show reduced computational overhead while maintaining or improving defensive capabilities compared to conventional adversarial training methods.

Conclusion: The method provides a practical, scalable defense against adversarial attacks and opens a promising direction for efficient robustness enhancement in deep neural networks.

Abstract: As deep neural networks (DNNs) are increasingly deployed in sensitive applications, ensuring their security and robustness has become critical. A major threat to DNNs arises from adversarial attacks, where small input perturbations can lead to incorrect predictions. Recent advances in adversarial training improve robustness by incorporating additional examples from external datasets or generative models. However, these methods often incur high computational costs, limiting their practicality and hindering real-world deployment. In this paper, we propose a cost-efficient alternative based on Lipschitz continuity that achieves robustness comparable to models trained with extensive supplementary data. Unlike conventional adversarial training, our method requires only a single pass over the dataset without gradient estimation, making it highly efficient. Furthermore, our method can integrate seamlessly with existing adversarial training frameworks and enhances the robustness of models without requiring extra generative data. Experimental results show that our approach not only reduces computational overhead but also maintains or improves the defensive capabilities of robust neural networks. This work opens a promising direction for developing practical, scalable defenses against adversarial attacks.

[448] CoxKAN: Kolmogorov-Arnold Networks for Interpretable, High-Performance Survival Analysis

William Knottenbelt, William McGough, Rebecca Wray, Woody Zhidong Zhang, Jiashuai Liu, Ines Prata Machado, Zeyu Gao, Mireia Crispin-Ortuzar

Main category: cs.LG

TL;DR: CoxKAN is an interpretable survival analysis model that combines Cox proportional hazards with Kolmogorov-Arnold Networks, achieving high performance while maintaining transparency for medical applications.

Details

Motivation: Address the trade-off between performance and interpretability in survival analysis for medical applications, where practitioners need transparent models for critical patient decisions.

Method: Combines Cox proportional hazards model with Kolmogorov-Arnold Networks (KANs) to create an interpretable deep learning approach for survival analysis.

Result: CoxKAN outperformed traditional Cox models by up to 4% in C-index, matched or surpassed deep learning models, recovered interpretable hazard functions, and revealed complex variable interactions on both synthetic and real datasets.

Conclusion: CoxKAN provides a high-performance yet interpretable solution for survival analysis, enabling transparent insights into biomarker impacts on patient risk while maintaining competitive predictive accuracy.

Abstract: Motivation: Survival analysis is a branch of statistics that is crucial in medicine for modeling the time to critical events such as death or relapse, in order to improve treatment strategies and patient outcomes. Selecting survival models often involves a trade-off between performance and interpretability; deep learning models offer high performance but lack the transparency of more traditional approaches. This poses a significant issue in medicine, where practitioners are reluctant to use black-box models for critical patient decisions. Results: We introduce CoxKAN, a Cox proportional hazards Kolmogorov-Arnold Network for interpretable, high-performance survival analysis. Kolmogorov-Arnold Networks (KANs) were recently proposed as an interpretable and accurate alternative to multi-layer perceptrons. We evaluated CoxKAN on four synthetic and nine real datasets, including five cohorts with clinical data and four with genomics biomarkers. In synthetic experiments, CoxKAN accurately recovered interpretable hazard function formulae and excelled in automatic feature selection. Evaluations on real datasets showed that CoxKAN consistently outperformed the traditional Cox proportional hazards model (by up to 4% in C-index) and matched or surpassed the performance of deep learning-based models. Importantly, CoxKAN revealed complex interactions between predictor variables and uncovered symbolic formulae, which are key capabilities that other survival analysis methods lack, to provide clear insights into the impact of key biomarkers on patient risk. Availability and implementation: CoxKAN is available at GitHub and Zenodo

Eunjee Choi, Junhyun Ahn, XinYu Piao, Jong-Kook Kim

Main category: cs.LG

TL;DR: CroMe is a novel multimodal fake news detection method that uses BLIP2 encoders, metric learning, and cross-modal transformers to effectively capture intra-modality relationships and integrate inter-modal similarities.

Details

Motivation: Existing methods overlook intra-modality relationships and inter-modal integration, failing to leverage advanced techniques for comprehensive fake news detection.

Method: Uses BLIP2 encoders for text, image, and image-text representations, metric learning with proxy anchor for intra-modality relationships, and Cross-Modal Tri-Transformer for feature fusion.

Result: Experiments show CroMe excels in multimodal fake news detection compared to existing methods.

Conclusion: CroMe effectively addresses limitations of previous approaches and demonstrates superior performance in detecting fake news across multiple modalities.

Abstract: Multimodal Fake News Detection has received increasing attention recently. Existing methods rely on independently encoded unimodal data and overlook the advantages of capturing intra-modality relationships and integrating inter-modal similarities using advanced techniques. To address these issues, Cross-Modal Tri-Transformer and Metric Learning for Multimodal Fake News Detection (CroMe) is proposed. CroMe utilizes Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (BLIP2) as encoders to capture detailed text, image and combined image-text representations. The metric learning module employs a proxy anchor method to capture intra-modality relationships while the feature fusion module uses a Cross-Modal and Tri-Transformer for effective integration. The final fake news detector processes the fused features through a classifier to predict the authenticity of the content. Experiments on datasets show that CroMe excels in multimodal fake news detection.

[450] TS-RAG: Retrieval-Augmented Generation based Time Series Foundation Models are Stronger Zero-Shot Forecaster

Kanghui Ning, Zijie Pan, Yu Liu, Yushan Jiang, James Yiming Zhang, Kashif Rasul, Anderson Schneider, Lintao Ma, Yuriy Nevmyvaka, Dongjin Song

Main category: cs.LG

TL;DR: TS-RAG is a retrieval-augmented generation framework for time series forecasting that enhances generalization and interpretability of Time Series Foundation Models by retrieving relevant patterns from a knowledge base and dynamically fusing them with model representations.

Details

Motivation: Existing Time Series Foundation Models struggle with generalization across diverse datasets and handling non-stationary dynamics and distribution shifts due to lack of effective adaptation mechanisms.

Method: Uses pre-trained time series encoders to retrieve semantically relevant segments from a knowledge base, and an Adaptive Retrieval Mixer (ARM) module to dynamically fuse retrieved patterns with TSFM’s internal representations without task-specific fine-tuning.

Result: Achieves state-of-the-art zero-shot forecasting performance, outperforming existing TSFMs by up to 6.84% across seven public benchmark datasets across diverse domains while providing interpretability.

Conclusion: TS-RAG effectively enhances generalization and interpretability of time series forecasting models through retrieval-augmented generation, demonstrating superior performance without requiring task-specific fine-tuning.

Abstract: Large Language Models (LLMs) and Foundation Models (FMs) have recently become prevalent for time series forecasting tasks. While fine-tuning LLMs enables domain adaptation, they often struggle to generalize across diverse and unseen datasets. Moreover, existing Time Series Foundation Models (TSFMs) still face challenges in handling non-stationary dynamics and distribution shifts, largely due to the lack of effective mechanisms for adaptation. To this end, we present TS-RAG, a retrieval-augmented generation framework for time series forecasting that enhances the generalization and interpretability of TSFMs. Specifically, TS-RAG leverages pre-trained time series encoders to retrieve semantically relevant segments from a dedicated knowledge base, enriching the contextual representation of the input query. Furthermore, we propose an Adaptive Retrieval Mixer (ARM) module that dynamically fuses the retrieved patterns with the TSFM’s internal representation, improving forecasting accuracy without requiring task-specific fine-tuning. Thorough empirical studies on seven public benchmark datasets demonstrate that TS-RAG achieves state-of-the-art zero-shot forecasting performance, outperforming the existing TSFMs by up to 6.84% across diverse domains while also providing desirable interpretability. Our code and data are available at: https://github.com/UConn-DSIS/TS-RAG

[451] Enabling Differentially Private Federated Learning for Speech Recognition: Benchmarks, Adaptive Optimizers and Gradient Clipping

Martin Pelikan, Sheikh Shams Azam, Vitaly Feldman, Jan “Honza” Silovsky, Kunal Talwar, Christopher G. Brinton, Tatiana Likhomanenko

Main category: cs.LG

TL;DR: First benchmark for federated learning with differential privacy in automatic speech recognition, achieving strong privacy guarantees with minimal performance drop using per-layer clipping and layer-wise gradient normalization.

Details

Motivation: Federated learning and differential privacy have not been well explored for automatic speech recognition due to challenges in training large transformer models and gradient heterogeneity across layers in deep models.

Method: Per-layer clipping and layer-wise gradient normalization to mitigate clipping bias and gradient heterogeneity across layers in deeper transformer models.

Result: Achieved user-level (7.2, 10^-9)-DP with only 1.3% absolute drop in word error rate at high population scales, and (4.5, 10^-9)-DP with 4.6% drop at low population scales.

Conclusion: FL with DP is viable for ASR under strong privacy guarantees with sufficient user population, and the principles discovered offer broader guidance for privacy-preserving FL algorithms for large models across domains.

Abstract: While federated learning (FL) and differential privacy (DP) have been extensively studied, their application to automatic speech recognition (ASR) remains largely unexplored due to the challenges in training large transformer models. Specifically, large models further exacerbate issues in FL as they are particularly susceptible to gradient heterogeneity across layers, unlike the relatively uniform gradient behavior observed in shallow models. As a result, prior works struggle to converge with standard optimization techniques, even in the absence of DP mechanisms. To the best of our knowledge, no existing work establishes a competitive, practical recipe for FL with DP in the context of ASR. To address this gap, we establish \textbf{the first benchmark for FL with DP in end-to-end ASR}. Our approach centers on per-layer clipping and layer-wise gradient normalization: theoretical analysis reveals that these techniques together mitigate clipping bias and gradient heterogeneity across layers in deeper models. Consistent with these theoretical insights, our empirical results show that FL with DP is viable under strong privacy guarantees, provided a population of at least several million users. Specifically, we achieve user-level (7.2, $10^{-9}$)-DP (resp. (4.5, $10^{-9}$)-DP) with only a 1.3% (resp. 4.6%) absolute drop in word error rate when extrapolating to high (resp. low) population scales for FL with DP in ASR. Although our experiments focus on ASR, the underlying principles we uncover - particularly those concerning gradient heterogeneity and layer-wise gradient normalization - offer broader guidance for designing scalable, privacy-preserving FL algorithms for large models across domains. Code of all experiments and benchmarks is available at https://github.com/apple/ml-pfl4asr.

[452] TinyFormer: Efficient Transformer Design and Deployment on Tiny Devices

Jianlei Yang, Jiacheng Liao, Fanding Lei, Meichen Liu, Lingkun Long, Junyi Chen, Han Wan, Bei Yu, Weisheng Zhao

Main category: cs.LG

TL;DR: TinyFormer is a framework for developing and deploying resource-efficient transformer models on Microcontroller Units (MCUs) through neural architecture search and sparse inference optimization.

Details

Motivation: There's a need to deploy advanced deep learning models like transformers on tiny devices with severe hardware constraints (1MB storage, 320KB memory), but current methods struggle with resource efficiency.

Method: TinyFormer uses three components: SuperNAS for searching optimal supernets, SparseNAS for finding the best sparse single-path transformer model, and SparseEngine for efficient deployment on MCUs - the first framework supporting sparse transformer inference on MCUs.

Result: Achieves 96.1% accuracy on CIFAR-10 while meeting hardware constraints, with up to 12.2x speedup in sparse inference compared to CMSIS-NN library.

Conclusion: TinyFormer successfully enables transformer deployment in TinyML scenarios, significantly expanding the scope of deep learning applications on resource-constrained devices.

Abstract: Developing deep learning models on tiny devices (e.g. Microcontroller units, MCUs) has attracted much attention in various embedded IoT applications. However, it is challenging to efficiently design and deploy recent advanced models (e.g. transformers) on tiny devices due to their severe hardware resource constraints. In this work, we propose TinyFormer, a framework specifically designed to develop and deploy resource-efficient transformer models on MCUs. TinyFormer consists of SuperNAS, SparseNAS, and SparseEngine. Separately, SuperNAS aims to search for an appropriate supernet from a vast search space. SparseNAS evaluates the best sparse single-path transformer model from the identified supernet. Finally, SparseEngine efficiently deploys the searched sparse models onto MCUs. To the best of our knowledge, SparseEngine is the first deployment framework capable of performing inference of sparse transformer models on MCUs. Evaluation results on the CIFAR-10 dataset demonstrate that TinyFormer can design efficient transformers with an accuracy of 96.1% while adhering to hardware constraints of 1MB storage and 320KB memory. Additionally, TinyFormer achieves significant speedups in sparse inference, up to 12.2x comparing to the CMSIS-NN library. TinyFormer is believed to bring powerful transformers into TinyML scenarios and to greatly expand the scope of deep learning applications

[453] Single- vs. Dual-Policy Reinforcement Learning for Dynamic Bike Rebalancing

Jiaqi Liang, Defeng Liu, Sanjay Dominik Jena, Andrea Lodi, Thibaut Vidal

Main category: cs.LG

TL;DR: This paper proposes two reinforcement learning approaches (single-policy and dual-policy RL) for dynamic rebalancing in bike-sharing systems with multiple vehicles, showing that the dual-policy model significantly outperforms benchmarks in reducing lost demand.

Details

Motivation: Bike-sharing systems need effective rebalancing strategies to address stochastic demand and prevent station imbalances, ensuring system reliability and sustainability.

Method: The authors formulate the problem as a Markov Decision Process in continuous time and develop two RL approaches: single-policy RL using a DQN for joint inventory and routing decisions, and dual-policy RL that decouples inventory decisions from vehicle routing. A high-fidelity simulator is used to estimate rewards under various demand scenarios.

Result: Extensive experiments show that while the single-policy model is competitive against benchmarks, the dual-policy model significantly reduces lost demand, demonstrating superior performance.

Conclusion: The findings reinforce the potential of reinforcement learning for real-time rebalancing in bike-sharing systems and pave the way for more adaptive and intelligent urban mobility solutions.

Abstract: Bike-sharing systems (BSS) provide a sustainable urban mobility solution, but ensuring their reliability requires effective rebalancing strategies to address stochastic demand and prevent station imbalances. This paper proposes reinforcement learning (RL) algorithms for dynamic rebalancing problem with multiple vehicles, introducing and comparing two RL approaches: Single-policy RL and Dual-policy RL. We formulate this network optimization problem as a Markov Decision Process within a continuous-time framework, allowing vehicles to make independent and cooperative rebalancing decisions without synchronization constraints. In the first approach, a single deep Q-network (DQN) is trained to jointly learn inventory and routing decisions. The second approach decouples node-level inventory decisions from arc-level vehicle routing, enhancing learning efficiency and adaptability. A high-fidelity simulator under the first-arrive-first-serve rule is developed to estimate rewards across diverse demand scenarios influenced by temporal and weather variations. Extensive experiments demonstrate that while the single-policy model is competitive against several benchmarks, the dual-policy model significantly reduces lost demand. These findings provide valuable insights for bike-sharing operators, reinforcing the potential of RL for real-time rebalancing and paving the way for more adaptive and intelligent urban mobility solutions.

[454] Federated Learning: A Stochastic Approximation Approach

Srihari P, Anik Kumar Paul, Bharath Bhikkaji

Main category: cs.LG

TL;DR: This paper analyzes federated learning using client-specific tapering step sizes instead of constant step sizes, achieving almost sure convergence and allowing clients with rare data to have greater influence on the global model.

Details

Motivation: Prior FL approaches used constant step sizes across clients, leading to convergence only in expectation. The authors aim to achieve stronger convergence (with probability one) and enable differential client influence based on data characteristics.

Method: Proposed using client-specific tapering step sizes $a^{(i)}_n$ in stochastic approximation framework. The global model tracks an ODE where client influence is weighted by limiting step size ratios $p^{(i)}$.

Result: The method achieves convergence with probability one (stronger than prior expectation convergence). Clients with larger $p^{(i)}$ exert greater influence, allowing preferential treatment for clients with rare/uncommon data.

Conclusion: Client-specific tapering step sizes enable almost sure convergence and provide a mechanism to regulate client influence in federated learning, particularly beneficial for handling heterogeneous data distributions across clients.

Abstract: This paper considers the Federated learning (FL) in a stochastic approximation (SA) framework. Here, each client $i$ trains a local model using its dataset $\mathcal{D}^{(i)}$ and periodically transmits the model parameters $w^{(i)}_n$ to a central server, where they are aggregated into a global model parameter $\bar{w}_n$ and sent back. The clients continue their training by re-initializing their local models with the global model parameters. Prior works typically assumed constant (and often identical) step sizes (learning rates) across clients for model training. As a consequence the aggregated model converges only in expectation. In this work, client-specific tapering step sizes $a^{(i)}n$ are used. The global model is shown to track an ODE with a forcing function equal to the weighted sum of the negative gradients of the individual clients. The weights being the limiting ratios $p^{(i)}=\lim{n \to \infty} \frac{a^{(i)}_n}{a^{(1)}_n}$ of the step sizes, where $a^{(1)}_n \geq a^{(i)}_n, \forall n$. Unlike the constant step sizes, the convergence here is with probability one. In this framework, the clients with the larger $p^{(i)}$ exert a greater influence on the global model than those with smaller $p^{(i)}$, which can be used to favor clients that have rare and uncommon data. Numerical experiments were conducted to validate the convergence and demonstrate the choice of step-sizes for regulating the influence of the clients.

[455] CTSyn: A Foundation Model for Cross Tabular Data Generation

Xiaofeng Lin, Chenheng Xu, Matthew Yang, Guang Cheng

Main category: cs.LG

TL;DR: CTSyn is a diffusion-based generative foundation model for tabular data that addresses challenges in heterogeneous table generation through autoencoder-based latent space unification and conditional diffusion modeling.

Details

Motivation: Current cross-table learning frameworks lack generative model backbones and effective mechanisms to decode heterogeneous feature values in tabular data, while GFMs have shown success in images and text.

Method: CTSyn uses an autoencoder network to consolidate diverse tables into a unified latent space with dynamic value reconstruction via table schema embedding, combined with a conditional latent diffusion model for generation.

Result: CTSyn outperforms existing table synthesizers on standard benchmarks in both utility and diversity through large-scale pre-training.

Conclusion: CTSyn provides a promising framework for synthetic table generation and lays groundwork for developing large-scale tabular foundation models.

Abstract: Generative Foundation Models (GFMs) have achieved remarkable success in producing high-quality synthetic data for images and text. However, their application to tabular data presents significant challenges due to the heterogeneous nature of table features. Current cross-table learning frameworks struggle because they lack a generative model backbone and an effective mechanism to decode heterogeneous feature values. To address these challenges, we propose the Cross-Table Synthesizer (CTSyn), a diffusion-based generative foundation model for tabular data generation. CTSyn comprises two key components. The first is an autoencoder network that consolidates diverse tables into a unified latent space. It dynamically reconstructs table values using a table schema embedding, allowing adaptation to heterogeneous datasets. The second is a conditional latent diffusion model that generates samples from the learned latent space, conditioned on the table schema. Through large-scale pre-training, CTSyn outperforms existing table synthesizers on standard benchmarks in both utility and diversity. These results position CTSyn as a promising framework for synthetic table generation and lay the groundwork for developing large-scale tabular foundation models.

[456] HO-FMN: Hyperparameter Optimization for Fast Minimum-Norm Attacks

Raffaele Mura, Giuseppe Floris, Luca Scionis, Giorgio Piras, Maura Pintor, Ambra Demontis, Giorgio Giacinto, Battista Biggio, Fabio Roli

Main category: cs.LG

TL;DR: Proposes HO-FMN, a parametric variation of fast minimum-norm attack that dynamically adjusts loss functions, optimizers, step-size schedulers, and hyperparameters to find smaller adversarial perturbations without additional tuning.

Details

Motivation: Many gradient-based attacks provide overly-optimistic evaluations due to using fixed loss functions, optimizers, step-size schedulers, and default hyperparameters, limiting their effectiveness in evaluating model robustness.

Method: Developed HO-FMN attack algorithm that allows dynamic adjustment of loss functions, optimizers, step-size schedulers, and hyperparameters during the attack process, enabling more effective adversarial example generation.

Result: HO-FMN found smaller adversarial perturbations than existing methods when re-evaluating 12 robust models, without requiring additional tuning. It enables reporting adversarial robustness as a function of perturbation budget.

Conclusion: The proposed HO-FMN attack provides more complete and efficient evaluation of adversarial robustness compared to fixed-budget attacks, offering better assessment of model security.

Abstract: Gradient-based attacks are a primary tool to evaluate robustness of machine-learning models. However, many attacks tend to provide overly-optimistic evaluations as they use fixed loss functions, optimizers, step-size schedulers, and default hyperparameters. In this work, we tackle these limitations by proposing a parametric variation of the well-known fast minimum-norm attack algorithm, whose loss, optimizer, step-size scheduler, and hyperparameters can be dynamically adjusted. We re-evaluate 12 robust models, showing that our attack finds smaller adversarial perturbations without requiring any additional tuning. This also enables reporting adversarial robustness as a function of the perturbation budget, providing a more complete evaluation than that offered by fixed-budget attacks, while remaining efficient. We release our open-source code at https://github.com/pralab/HO-FMN.

[457] No Request Left Behind: Tackling Heterogeneity in Long-Context LLM Inference with Medha

Amey Agrawal, Haoran Qiu, Junda Chen, Íñigo Goiri, Chaojie Zhang, Rayyan Shahid, Ramachandran Ramjee, Alexey Tumanov, Esha Choukse

Main category: cs.LG

TL;DR: Medha is a serving system that eliminates convoy effects in LLM inference through fine-grained preemptive scheduling, enabling efficient handling of heterogeneous workloads mixing short queries and long documents.

Details

Motivation: Production LLM workloads are highly heterogeneous with mixed short queries and long documents, creating severe convoy effects where long requests stall short ones due to attention's quadratic complexity, degrading system responsiveness.

Method: Introduces fine-grained preemptive scheduling with Adaptive Chunking, Stream Pipeline Parallel, and KV-Cache Parallelism to reduce decode latency. Orchestrated by Length-Aware Relative Slack (LARS) scheduler that prevents convoy effects and starvation.

Result: Under heterogeneous workloads, Medha improves throughput by 5.7x while reducing median and 99th percentile latency by 30x and 174x respectively compared to state-of-the-art non-preemptive systems.

Conclusion: Medha successfully eliminates convoy effects in LLM serving through practical preemptive scheduling mechanisms, enabling efficient handling of heterogeneous workloads with significant improvements in throughput and latency.

Abstract: Deploying million-token Large Language Models (LLMs) is challenging because production workloads are highly heterogeneous, mixing short queries and long documents. This heterogeneity, combined with the quadratic complexity of attention, creates severe convoy effects where long-running requests stall short, interactive ones, degrading system responsiveness. We present Medha, a serving system that eliminates these convoys by introducing fine-grained, preemptive scheduling to LLM inference. Medha makes preemption practical with a co-designed set of mechanisms – including Adaptive Chunking and Stream Pipeline Parallel that overcome the perceived inefficiencies and scaling challenges of chunking. Additionally, we present a new parallelism strategy KV-Cache Parallelism to reduce the decode latency and afford interactivity despite very long context. These mechanisms are orchestrated by a Length-Aware Relative Slack (LARS) scheduler, a deadline and heterogeneity-aware scheduling policy that prevents both the convoy effect and the starvation that plagues simpler policies. Under a heterogeneous workload, Medha improves throughput by 5.7x while reducing median and 99th percentile latency by 30x and 174x, respectively, compared to state-of-the-art non-preemptive systems.

[458] Physics-Constrained Flow Matching: Sampling Generative Models with Hard Constraints

Utkarsh Utkarsh, Pengfei Cai, Alan Edelman, Rafael Gomez-Bombarelli, Christopher Vincent Rackauckas

Main category: cs.LG

TL;DR: PCFM is a zero-shot inference framework that enforces hard physical constraints in pretrained flow-based generative models for PDE systems, outperforming baselines while ensuring exact constraint satisfaction.

Details

Motivation: Existing methods for enforcing physical constraints in deep generative models for PDEs rely on soft penalties or architectural biases that fail to guarantee hard constraints like conservation laws and physical consistencies.

Method: Physics-Constrained Flow Matching (PCFM) continuously guides the sampling process through physics-based corrections applied to intermediate solution states while remaining aligned with the learned flow and satisfying physical constraints.

Result: PCFM outperforms both unconstrained and constrained baselines on a range of PDEs, including those with shocks, discontinuities, and sharp features, while ensuring exact constraint satisfaction at the final solution.

Conclusion: PCFM provides a flexible framework for enforcing hard constraints in both scientific and general-purpose generative models, especially in applications where constraint satisfaction is essential.

Abstract: Deep generative models have recently been applied to physical systems governed by partial differential equations (PDEs), offering scalable simulation and uncertainty-aware inference. However, enforcing physical constraints, such as conservation laws (linear and nonlinear) and physical consistencies, remains challenging. Existing methods often rely on soft penalties or architectural biases that fail to guarantee hard constraints. In this work, we propose Physics-Constrained Flow Matching (PCFM), a zero-shot inference framework that enforces arbitrary nonlinear constraints in pretrained flow-based generative models. PCFM continuously guides the sampling process through physics-based corrections applied to intermediate solution states, while remaining aligned with the learned flow and satisfying physical constraints. Empirically, PCFM outperforms both unconstrained and constrained baselines on a range of PDEs, including those with shocks, discontinuities, and sharp features, while ensuring exact constraint satisfaction at the final solution. Our method provides a flexible framework for enforcing hard constraints in both scientific and general-purpose generative models, especially in applications where constraint satisfaction is essential.

[459] HoGA: Higher-Order Graph Attention via Diversity-Aware k-Hop Sampling

Thomas Bailie, Yun Sing Koh, Karthik Mukkavilli

Main category: cs.LG

TL;DR: HoGA introduces higher-order graph attention by sampling diverse subgraphs to capture richer topological relationships, achieving significant accuracy improvements in node classification tasks.

Details

Motivation: Standard edge-based MPNNs have limited expressive power for discovering higher-order relationships in graphs, and existing higher-order attention methods often resample similar relationships, lacking diversity.

Method: The HoGA module constructs a k-order attention matrix by sampling subgraphs to maximize diversity among feature vectors, targeting diverse modalities in higher-order topology rather than greedily resampling similar relationships.

Result: HoGA achieves at least 5% accuracy gain on all benchmark node classification datasets and outperforms recent baselines on six of eight datasets.

Conclusion: HoGA effectively expands the range of captured substructures by focusing on diverse higher-order topological relationships, demonstrating superior performance in graph learning tasks.

Abstract: Graphs model latent variable relationships in many real-world systems, and Message Passing Neural Networks (MPNNs) are widely used to learn such structures for downstream tasks. While edge-based MPNNs effectively capture local interactions, their expressive power is theoretically bounded, limiting the discovery of higher-order relationships. We introduce the Higher-Order Graph Attention (HoGA) module, which constructs a k-order attention matrix by sampling subgraphs to maximize diversity among feature vectors. Unlike existing higher-order attention methods that greedily resample similar k-order relationships, HoGA targets diverse modalities in higher-order topology, reducing redundancy and expanding the range of captured substructures. Applied to two single-hop attention models, HoGA achieves at least a 5% accuracy gain on all benchmark node classification datasets and outperforms recent baselines on six of eight datasets. Code is available at https://github.com/TB862/Higher_Order.

[460] Flow Matching Meets PDEs: A Unified Framework for Physics-Constrained Generation

Giacomo Baldan, Qiang Liu, Alberto Guardone, Nils Thuerey

Main category: cs.LG

TL;DR: PBFM embeds physical constraints directly into flow matching, achieving 8x better physical accuracy than standard flow matching without requiring hyperparameter tuning.

Details

Motivation: Existing generative methods learn physics implicitly from data, lacking explicit physical constraints that could improve accuracy and physical consistency.

Method: Physics-Based Flow Matching (PBFM) explicitly embeds PDE residuals and algebraic relations into the flow matching objective, uses temporal unrolling during training, and analyzes noise level effects.

Result: PBFM yields up to 8x more accurate physical residuals compared to standard flow matching, while outperforming existing algorithms in distributional accuracy.

Conclusion: PBFM provides a principled framework for physics-aware surrogate modeling, uncertainty quantification, and accelerated simulation in engineering applications.

Abstract: Generative machine learning methods, such as diffusion models and flow matching, have shown great potential in modeling complex system behaviors and building efficient surrogate models. However, these methods typically learn the underlying physics implicitly from data. We propose Physics-Based Flow Matching (PBFM), a novel generative framework that explicitly embeds physical constraints, both PDE residuals and algebraic relations, into the flow matching objective. We also introduce temporal unrolling at training time that improves the accuracy of the final, noise-free sample prediction. Our method jointly minimizes the flow matching loss and the physics-based residual loss without requiring hyperparameter tuning of their relative weights. Additionally, we analyze the role of the minimum noise level, $σ_{\min}$, in the context of physical constraints and evaluate a stochastic sampling strategy that helps to reduce physical residuals. Through extensive benchmarks on three representative PDE problems, we show that our approach yields up to an $8\times$ more accurate physical residuals compared to FM, while clearly outperforming existing algorithms in terms of distributional accuracy. PBFM thus provides a principled and efficient framework for surrogate modeling, uncertainty quantification, and accelerated simulation in physics and engineering applications.

[461] On the Effectiveness of Adversarial Training on Malware Classifiers

Hamid Bostani, Jacopo Cortellazzi, Daniel Arp, Fabio Pierazzi, Veelasha Moonsamy, Lorenzo Cavallaro

Main category: cs.LG

TL;DR: Rubik is a framework for systematic evaluation of Adversarial Training in malware detection, revealing that realizable adversarial examples provide only conditional robustness and highlighting the importance of model architecture and feature-space structure.

Details

Motivation: Prior research on Adversarial Training for malware detection is fragmented and overlooks malware's inherent nature, with weak evaluations that yield non-generalizable insights.

Method: Introduces Rubik framework with multi-dimensional evaluation across data, feature representations, classifiers, and robust optimization settings, using realistic evasion attacks and applied to Android malware.

Result: Challenges prior beliefs by showing realizable adversarial examples offer only conditional robustness, and reveals critical role of model architecture and feature-space structure in AT’s success.

Conclusion: Distills four key insights, exposes four common evaluation misconceptions, and offers practical recommendations for developing truly robust malware classifiers.

Abstract: Adversarial Training (AT) is a key defense against Machine Learning evasion attacks, but its effectiveness for real-world malware detection remains poorly understood. This uncertainty stems from a critical disconnect in prior research: studies often overlook the inherent nature of malware and are fragmented, examining diverse variables like realism or confidence of adversarial examples in isolation, or relying on weak evaluations that yield non-generalizable insights. To address this, we introduce Rubik, a framework for the systematic, multi-dimensional evaluation of AT in the malware domain. This framework defines diverse key factors across essential dimensions, including data, feature representations, classifiers, and robust optimization settings, for a comprehensive exploration of the interplay of influential AT’s variables through reliable evaluation practices, such as realistic evasion attacks. We instantiate Rubik on Android malware, empirically analyzing how this interplay shapes robustness. Our findings challenge prior beliefs–showing, for instance, that realizable adversarial examples offer only conditional robustness benefits–and reveal new insights, such as the critical role of model architecture and feature-space structure in determining AT’s success. From this analysis, we distill four key insights, expose four common evaluation misconceptions, and offer practical recommendations to guide the development of truly robust malware classifiers.

[462] A Unifying View of Linear Function Approximation in Off-Policy RL Through Matrix Splitting and Preconditioning

Zechen Wu, Amy Greenwald, Ronald Parr

Main category: cs.LG

TL;DR: This paper provides a unified mathematical framework showing that TD, FQI, and PFQI are all solving the same linear system but with different matrix splitting schemes and preconditioners, explaining their convergence differences.

Details

Motivation: To resolve the inaccurate traditional view that TD, FQI, and PFQI differ only in the number of updates to the target value function, and to establish a unified theoretical foundation for understanding their convergence properties.

Method: Developed a linear value function approximation framework that unifies TD, FQI, and PFQI as iterative methods solving the same linear system using different matrix splitting schemes and preconditioners, with target networks representing transitions from constant to data-feature adaptive preconditioners.

Result: Established tight convergence connections among the algorithms, explained why TD convergence doesn’t imply FQI convergence, characterized convergence conditions without feature independence assumptions, and discovered new crucial feature conditions for convergence.

Conclusion: The unified framework provides sharper theoretical results, enables dropping common feature assumptions, introduces matrix splitting to convergence analysis, and explains practical phenomena like why smaller learning rates can help when larger ones fail.

Abstract: In off-policy policy evaluation (OPE) tasks within reinforcement learning, Temporal Difference Learning(TD) and Fitted Q-Iteration (FQI) have traditionally been viewed as differing in the number of updates toward the target value function: TD makes one update, FQI makes an infinite number, and Partial Fitted Q-Iteration (PFQI) performs a finite number. We show that this view is not accurate, and provide a new mathematical perspective under linear value function approximation that unifies these methods as a single iterative method solving the same linear system, but using different matrix splitting schemes and preconditioners. We show that increasing the number of updates under the same target value function, i.e., the target network technique, is a transition from using a constant preconditioner to using a data-feature adaptive preconditioner. This elucidates, for the first time, why TD convergence does not necessarily imply FQI convergence, and establishes tight convergence connections among TD, PFQI, and FQI. Our framework enables sharper theoretical results than previous work and characterization of the convergence conditions for each algorithm, without relying on assumptions about the features (e.g., linear independence). We also provide an encoder-decoder perspective to better understand the convergence conditions of TD, and prove, for the first time, that when a large learning rate doesn’t work, trying a smaller one may help. Our framework also leads to the discovery of new crucial conditions on features for convergence, and shows how common assumptions about features influence convergence, e.g., the assumption of linearly independent features can be dropped without compromising the convergence guarantees of stochastic TD in the on-policy setting. This paper is also the first to introduce matrix splitting into the convergence analysis of these algorithms.

[463] Fair Algorithms with Probing for Multi-Agent Multi-Armed Bandits

Tianyi Xu, Jiaxin Liu, Nicholas Mattei, Zizhan Zheng

Main category: cs.LG

TL;DR: Proposes a multi-agent multi-armed bandit framework with strategic probing to balance fairness and system performance, with algorithms for both offline and online settings.

Details

Motivation: To ensure fair outcomes across agents while maximizing overall system performance in multi-agent bandit problems, addressing the challenge of decision-making under limited information about arm rewards.

Method: Introduces a probing framework to strategically gather information before allocation; uses submodular properties for greedy probing algorithm in offline setting; develops online algorithm with sublinear regret while maintaining fairness.

Result: Extensive experiments on synthetic and real-world datasets show the approach outperforms baseline methods, achieving better fairness and efficiency.

Conclusion: The proposed MA-MAB framework with strategic probing effectively balances fairness and performance in both offline and online settings, demonstrating superior results compared to existing methods.

Abstract: We propose a multi-agent multi-armed bandit (MA-MAB) framework aimed at ensuring fair outcomes across agents while maximizing overall system performance. A key challenge in this setting is decision-making under limited information about arm rewards. To address this, we introduce a novel probing framework that strategically gathers information about selected arms before allocation. In the offline setting, where reward distributions are known, we leverage submodular properties to design a greedy probing algorithm with a provable performance bound. For the more complex online setting, we develop an algorithm that achieves sublinear regret while maintaining fairness. Extensive experiments on synthetic and real-world datasets show that our approach outperforms baseline methods, achieving better fairness and efficiency.

[464] Evolutionary Prediction Games

Eden Saig, Nir Rosenfeld

Main category: cs.LG

TL;DR: The paper introduces evolutionary prediction games to model feedback loops between prediction algorithms and user populations, showing that realistic constraints enable stable coexistence between user groups.

Details

Motivation: To understand how prediction algorithms create feedback loops with user populations and how disparities in prediction quality emerge and evolve over time.

Method: Developed evolutionary prediction games framework using evolutionary game theory, analyzed behavioral dynamics under both idealized and realistic constraints.

Result: In idealized settings, repeated learning promotes competitive exclusion, but under realistic constraints (finite data, limited compute, overfitting risk), stable coexistence and mutualistic symbiosis become possible.

Conclusion: Real-world constraints fundamentally change the dynamics of prediction feedback loops, enabling stable coexistence between user groups that would otherwise compete to extinction.

Abstract: When a prediction algorithm serves a collection of users, disparities in prediction quality are likely to emerge. If users respond to accurate predictions by increasing engagement, inviting friends, or adopting trends, repeated learning creates a feedback loop that shapes both the model and the population of its users. In this work, we introduce evolutionary prediction games, a framework grounded in evolutionary game theory which models such feedback loops as natural-selection processes among groups of users. Our theoretical analysis reveals a gap between idealized and real-world learning settings: In idealized settings with unlimited data and computational power, repeated learning creates competition and promotes competitive exclusion across a broad class of behavioral dynamics. However, under realistic constraints such as finite data, limited compute, or risk of overfitting, we show that stable coexistence and mutualistic symbiosis between groups becomes possible. We analyze these possibilities in terms of their stability and feasibility, present mechanisms that can sustain their existence, and empirically demonstrate our findings.

[465] F-INR: Functional Tensor Decomposition for Implicit Neural Representations

Sai Karthikeya Vemuri, Tim Büchner, Joachim Denzler

Main category: cs.LG

TL;DR: F-INR is a framework that factorizes high-dimensional Implicit Neural Representations into compact axis-specific sub-networks using functional tensor decomposition, achieving up to 20× faster training and over 6.0 dB PSNR improvement.

Details

Motivation: Monolithic INRs scale poorly with data dimensionality, leading to excessive training costs. The motivation is to address this limitation by decomposing high-dimensional INRs into more efficient components.

Method: Factorizes high-dimensional INR into compact axis-specific sub-networks using functional tensor decomposition. Combines low-dimensional functional components via tensor operations. Architecture- and decomposition-agnostic, works with various INR backbones and tensor formats.

Result: Accelerates training by up to 20× and improves fidelity by over 6.0 dB PSNR compared to state-of-the-art INRs. Validated on image representation, 3D geometry reconstruction, neural radiance fields, and physics simulations.

Conclusion: F-INR provides a scalable, flexible, and efficient framework for high-dimensional signal modeling with fine-grained control over speed-accuracy trade-off.

Abstract: Implicit Neural Representations (INRs) model signals as continuous, differentiable functions. However, monolithic INRs scale poorly with data dimensionality, leading to excessive training costs. We propose F-INR, a framework that addresses this limitation by factorizing a high-dimensional INR into a set of compact, axis-specific sub-networks based on functional tensor decomposition. These sub-networks learn low-dimensional functional components that are then combined via tensor operations. This factorization reduces computational complexity while additionally improving representational capacity. F-INR is both architecture- and decomposition-agnostic. It integrates with various existing INR backbones (e.g., SIREN, WIRE, FINER, Factor Fields) and tensor formats (e.g., CP, TT, Tucker), offering fine-grained control over the speed-accuracy trade-off via the tensor rank and mode. Our experiments show F-INR accelerates training by up to $20\times$ and improves fidelity by over \num{6.0} dB PSNR compared to state-of-the-art INRs. We validate these gains on diverse tasks, including image representation, 3D geometry reconstruction, and neural radiance fields. We further show F-INR’s applicability to scientific computing by modeling complex physics simulations. Thus, F-INR provides a scalable, flexible, and efficient framework for high-dimensional signal modeling. Project page: https://f-inr.github.io

[466] Empowering Time Series Forecasting with LLM-Agents

Chin-Chia Michael Yeh, Vivian Lai, Uday Singh Saini, Xiran Fan, Yujie Fan, Junpeng Wang, Xin Dai, Yan Zheng

Main category: cs.LG

TL;DR: DCATS is a data-centric agent that improves time series forecasting by cleaning data using metadata, achieving 6% average error reduction across models.

Details

Motivation: Lightweight models often achieve state-of-the-art performance in time series forecasting, suggesting that improving data quality rather than model architecture could be more effective for AutoML.

Method: Leverages metadata accompanying time series to clean data while optimizing forecasting performance, using LLM-powered agents as planners.

Result: Achieved average 6% error reduction across four time series forecasting models and different time horizons on large-scale traffic volume forecasting dataset.

Conclusion: Data-centric approaches show significant potential in AutoML for time series forecasting, outperforming traditional model-centric methods.

Abstract: Large Language Model (LLM) powered agents have emerged as effective planners for Automated Machine Learning (AutoML) systems. While most existing AutoML approaches focus on automating feature engineering and model architecture search, recent studies in time series forecasting suggest that lightweight models can often achieve state-of-the-art performance. This observation led us to explore improving data quality, rather than model architecture, as a potentially fruitful direction for AutoML on time series data. We propose DCATS, a Data-Centric Agent for Time Series. DCATS leverages metadata accompanying time series to clean data while optimizing forecasting performance. We evaluated DCATS using four time series forecasting models on a large-scale traffic volume forecasting dataset. Results demonstrate that DCATS achieves an average 6% error reduction across all tested models and time horizons, highlighting the potential of data-centric approaches in AutoML for time series forecasting.

[467] Active Learning Methods for Efficient Data Utilization and Model Performance Enhancement

Chiung-Yi Tseng, Junhao Song, Ziqian Bi, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Ming Liu

Main category: cs.LG

TL;DR: Active Learning (AL) improves machine learning performance with fewer labeled examples by focusing on uncertainty estimation, handling class imbalance, domain adaptation, and creating better evaluation metrics.

Details

Motivation: Address the paradox of data abundance but annotation scarcity in machine learning, which limits model advancement despite large datasets.

Method: Detailed overview of AL concepts and applications across computer vision, NLP, transfer learning, and real-world scenarios, with focus on human-inspired learning methods and question-guided approaches.

Result: AL consistently outperforms passive learning, especially with proper evaluation measures, and demonstrates improved data efficiency and learning effectiveness.

Conclusion: AL is a valuable strategy for data-efficient machine learning, though challenges remain in trust, reproducibility, and methodology consistency, requiring future work in these areas.

Abstract: In the era of data-driven intelligence, the paradox of data abundance and annotation scarcity has emerged as a critical bottleneck in the advancement of machine learning. This paper gives a detailed overview of Active Learning (AL), which is a strategy in machine learning that helps models achieve better performance using fewer labeled examples. It introduces the basic concepts of AL and discusses how it is used in various fields such as computer vision, natural language processing, transfer learning, and real-world applications. The paper focuses on important research topics such as uncertainty estimation, handling of class imbalance, domain adaptation, fairness, and the creation of strong evaluation metrics and benchmarks. It also shows that learning methods inspired by humans and guided by questions can improve data efficiency and help models learn more effectively. In addition, this paper talks about current challenges in the field, including the need to rebuild trust, ensure reproducibility, and deal with inconsistent methodologies. It points out that AL often gives better results than passive learning, especially when good evaluation measures are used. This work aims to be useful for both researchers and practitioners by providing key insights and proposing directions for future progress in active learning.

[468] Establishing Linear Surrogate Regret Bounds for Convex Smooth Losses via Convolutional Fenchel-Young Losses

Yuzhou Cao, Han Bao, Lei Feng, Bo An

Main category: cs.LG

TL;DR: This paper overcomes the trade-off between smoothness and linear regret bounds for convex surrogate losses by constructing a novel smooth surrogate loss using Fenchel-Young losses and convolutional negentropy, enabling efficient optimization while maintaining linear regret transfer to arbitrary discrete target losses.

Details

Motivation: There has been a belief in the community about a trade-off between loss smoothness and linear regret bounds for convex smooth surrogate losses, where better optimization properties might deteriorate after regret transfer to target losses. The authors aim to overcome this dilemma.

Method: Construct convex smooth surrogate losses using Fenchel-Young losses generated by convolutional negentropy, which are equivalent to the infimal convolution of a generalized negentropy and the target Bayes risk. This enables smoothness while maintaining linear surrogate regret bounds through tailored prediction links.

Result: Successfully constructed convex smooth surrogate losses that achieve linear surrogate regret bounds for arbitrary discrete target losses, overcoming the previously believed trade-off between smoothness and regret transfer efficiency.

Conclusion: The infimal convolution approach demonstrates how convex analysis can penetrate into optimization and statistical efficiency in risk minimization, providing a novel method to maintain both smoothness and linear regret bounds simultaneously.

Abstract: Surrogate regret bounds, also known as excess risk bounds, bridge the gap between the convergence rates of surrogate and target losses. The regret transfer is lossless if the surrogate regret bound is linear. While convex smooth surrogate losses are appealing in particular due to the efficient estimation and optimization, the existence of a trade-off between the loss smoothness and linear regret bound has been believed in the community. Under this scenario, the better optimization and estimation properties of convex smooth surrogate losses may inevitably deteriorate after undergoing the regret transfer onto a target loss. We overcome this dilemma for arbitrary discrete target losses by constructing a convex smooth surrogate loss, which entails a linear surrogate regret bound composed with a tailored prediction link. The construction is based on Fenchel–Young losses generated by the convolutional negentropy, which are equivalent to the infimal convolution of a generalized negentropy and the target Bayes risk. Consequently, the infimal convolution enables us to derive a smooth loss while maintaining the surrogate regret bound linear. We additionally benefit from the infimal convolution to have a consistent estimator of the underlying class probability. Our results are overall a novel demonstration of how convex analysis penetrates into optimization and statistical efficiency in risk minimization.

[469] Enhancing Training Data Attribution with Representational Optimization

Weiwei Sun, Haokun Liu, Nikhil Kandpal, Colin Raffel, Yiming Yang

Main category: cs.LG

TL;DR: AirRep is a scalable training data attribution method that learns task-specific representations optimized for attribution, achieving performance comparable to gradient-based methods while being much more efficient.

Details

Motivation: Current gradient-based attribution methods are computationally expensive for large-scale applications, while representation-based approaches use heuristic embeddings not optimized for attribution, limiting their fidelity.

Method: AirRep learns task-specific and model-aligned representations through a trainable encoder and attention-based pooling mechanism, trained using a ranking objective over automatically constructed training subsets.

Result: AirRep achieves performance on par with state-of-the-art gradient-based approaches while being nearly two orders of magnitude more efficient at inference time, with demonstrated robustness across tasks and models.

Conclusion: AirRep provides a scalable and effective solution for training data attribution that bridges the gap between computational efficiency and attribution fidelity.

Abstract: Training data attribution (TDA) methods aim to measure how training data impacts a model’s predictions. While gradient-based attribution methods, such as influence functions, offer theoretical grounding, their computational costs make them impractical for large-scale applications. Representation-based approaches are far more scalable, but typically rely on heuristic embeddings that are not optimized for attribution, limiting their fidelity. To address these challenges, we propose AirRep, a scalable, representation-based approach that closes this gap by learning task-specific and model-aligned representations optimized explicitly for TDA. AirRep introduces two key innovations: a trainable encoder tuned for attribution quality, and an attention-based pooling mechanism that enables accurate estimation of group-wise influence. We train AirRep using a ranking objective over automatically constructed training subsets labeled by their empirical effect on target predictions. Experiments on instruction-tuned LLMs demonstrate that AirRep achieves performance on par with state-of-the-art gradient-based approaches while being nearly two orders of magnitude more efficient at inference time. Further analysis highlights its robustness and generalization across tasks and models. Our code is available at https://github.com/sunnweiwei/AirRep

[470] Equivariant Flow Matching for Symmetry-Breaking Bifurcation Problems

Fleur Hendriks, Ondřej Rokoš, Martin Doškář, Marc G. D. Geers, Vlado Menkovski

Main category: cs.LG

TL;DR: A generative flow matching framework that models multimodal distributions in nonlinear dynamical systems with symmetry-breaking bifurcations, enabling direct sampling of multiple coexisting solutions while preserving system symmetries.

Details

Motivation: Deterministic machine learning models fail to capture multiple coexisting stable solutions in nonlinear dynamical systems with symmetry breaking, averaging over solutions and missing lower-symmetry outcomes.

Method: Proposes a generative framework using flow matching with symmetric matching strategy that aligns predicted and target outputs under group actions, enabling equivariant modeling of bifurcation outcomes.

Result: Validated on various systems from toy models to complex physical problems (buckling beams, Allen-Cahn equation), flow matching significantly outperforms non-probabilistic and variational methods in capturing multimodal distributions and symmetry-breaking bifurcations.

Conclusion: Flow matching offers a principled and scalable solution for modeling multistability in high-dimensional systems, effectively capturing the full probability distribution over bifurcation outcomes while preserving system symmetries.

Abstract: Bifurcation phenomena in nonlinear dynamical systems often lead to multiple coexisting stable solutions, particularly in the presence of symmetry breaking. Deterministic machine learning models struggle to capture this multiplicity, averaging over solutions and failing to represent lower-symmetry outcomes. In this work, we propose a generative framework based on flow matching to model the full probability distribution over bifurcation outcomes. Our method enables direct sampling of multiple valid solutions while preserving system symmetries through equivariant modeling. We introduce a symmetric matching strategy that aligns predicted and target outputs under group actions, allowing accurate learning in equivariant settings. We validate our approach on a range of systems, from toy models to complex physical problems such as buckling beams and the Allen-Cahn equation. Our results demonstrate that flow matching significantly outperforms non-probabilistic and variational methods in capturing multimodal distributions and symmetry-breaking bifurcations, offering a principled and scalable solution for modeling multistability in high-dimensional systems.

[471] Asymmetric Duos: Sidekicks Improve Uncertainty

Tim G. Zhou, Evan Shelhamer, Geoff Pleiss

Main category: cs.LG

TL;DR: Asymmetric Duos: pairing a large model with a smaller “sidekick” model to improve uncertainty quantification and performance with minimal computational overhead.

Details

Motivation: Traditional ensembling methods are too computationally expensive for today's large-scale models and fine-tuning workflows, requiring a more cost-effective approach.

Method: Couple a large model with a much smaller sidekick model and aggregate their predictions using simple learned weighted averaging.

Result: Across five image classification benchmarks, Asymmetric Duos significantly improved accuracy, uncertainty quantification, and selective classification metrics with only ~10-20% more computation.

Conclusion: The sidekick model almost never harms the larger model’s performance, making Asymmetric Duos an effective and efficient strategy for improving uncertainty quantification in large models.

Abstract: The go-to strategy to apply deep networks in settings where uncertainty informs decisions–ensembling multiple training runs with random initializations–is ill-suited for the extremely large-scale models and practical fine-tuning workflows of today. We introduce a new cost-effective strategy for improving the uncertainty quantification and downstream decisions of a large model (e.g. a fine-tuned ViT-B): coupling it with a less accurate but much smaller “sidekick” (e.g. a fine-tuned ResNet-34) with a fraction of the computational cost. We propose aggregating the predictions of this Asymmetric Duo by simple learned weighted averaging. Surprisingly, despite their inherent asymmetry, the sidekick model almost never harms the performance of the larger model. In fact, across five image classification benchmarks and a variety of model architectures and training schemes (including soups), Asymmetric Duos significantly improve accuracy, uncertainty quantification, and selective classification metrics with only ${\sim}10-20%$ more computation.

[472] Alignment of large language models with constrained learning

Botong Zhang, Shuo Li, Ignacio Hounie, Osbert Bastani, Dongsheng Ding, Alejandro Ribeiro

Main category: cs.LG

TL;DR: The paper proposes a dual-based alignment method for LLMs that maximizes primary reward while satisfying secondary utility constraints, addressing convergence and optimality issues in existing approaches.

Details

Motivation: Existing Lagrangian-based LLM policy search methods have convergence problems with iterative primal-dual methods and non-iterative dual methods don't achieve optimality in the LLM parameter space.

Method: Uses Lagrangian duality to develop an iterative dual-based alignment method that alternates between updating LLM policy via Lagrangian maximization and updating dual variable via dual descent.

Result: Theoretical analysis characterizes primal-dual gaps and proves dual-based methods can find optimal constrained LLM policies up to parametrization gap. Experiments on PKU-SafeRLHF and Anthropic HH-RLHF datasets demonstrate effectiveness.

Conclusion: Dual-based alignment methods can effectively compute optimal constrained LLM policies, addressing limitations of existing approaches while providing theoretical guarantees.

Abstract: We study the problem of computing an optimal large language model (LLM) policy for the constrained alignment problem, where the goal is to maximize a primary reward objective while satisfying constraints on secondary utilities. Despite the popularity of Lagrangian-based LLM policy search in constrained alignment, iterative primal-dual methods often fail to converge, and non-iterative dual-based methods do not achieve optimality in the LLM parameter space. To address these challenges, we employ Lagrangian duality to develop an iterative dual-based alignment method that alternates between updating the LLM policy via Lagrangian maximization and updating the dual variable via dual descent. In theory, we characterize the primal-dual gap between the primal value in the distribution space and the dual value in the LLM parameter space. We further quantify the optimality gap of the learned LLM policies at near-optimal dual variables with respect to both the objective and the constraint functions. These results prove that dual-based alignment methods can find an optimal constrained LLM policy, up to an LLM parametrization gap. We demonstrate the effectiveness and merits of our approach through extensive experiments conducted on the PKU-SafeRLHF and Anthropic HH-RLHF datasets.

[473] Deep Actor-Critics with Tight Risk Certificates

Bahareh Tasdighi, Manuel Haussmann, Yi-Shan Wu, Andres R. Masegosa, Melih Kandemir

Main category: cs.LG

TL;DR: Developed tight risk certificates for deep actor-critic algorithms using minimal evaluation data and recursive PAC-Bayes theory to predict generalization performance from validation observations.

Details

Motivation: Deep actor-critic algorithms are widely used but lack comprehensive validation schemes to quantify risk of malfunction, limiting their deployment in physical systems.

Method: Used small evaluation roll-outs from pretrained policies combined with recursive PAC-Bayes approach that splits validation data and builds bounds on excess loss using previous predictors as data-informed priors.

Result: Empirical results across locomotion tasks, actor-critic methods, and policy expertise levels show risk certificates tight enough for practical use.

Conclusion: It’s possible to develop accurate risk certificates for deep actor-critic algorithms using minimal evaluation data, enabling safer deployment in physical systems.

Abstract: Deep actor-critic algorithms have reached a level where they influence everyday life. They are a driving force behind continual improvement of large language models through user feedback. However, their deployment in physical systems is not yet widely adopted, mainly because no validation scheme fully quantifies their risk of malfunction. We demonstrate that it is possible to develop tight risk certificates for deep actor-critic algorithms that predict generalization performance from validation-time observations. Our key insight centers on the effectiveness of minimal evaluation data. A small feasible set of evaluation roll-outs collected from a pretrained policy suffices to produce accurate risk certificates when combined with a simple adaptation of PAC-Bayes theory. Specifically, we adopt a recently introduced recursive PAC-Bayes approach, which splits validation data into portions and recursively builds PAC-Bayes bounds on the excess loss of each portion’s predictor, using the predictor from the previous portion as a data-informed prior. Our empirical results across multiple locomotion tasks, actor-critic methods, and policy expertise levels demonstrate risk certificates tight enough to be considered for practical use.

[474] A Unified Noise-Curvature View of Loss of Trainability

Gunbir Singh Baveja, Alex Lewandowski, Mark Schmidt

Main category: cs.LG

TL;DR: The paper analyzes loss of trainability in continual learning and proposes a step-size scheduler that prevents this phenomenon by using adaptive noise thresholds based on gradient noise and curvature volatility.

Details

Motivation: Loss of trainability causes accuracy to stall or degrade in continual learning as parameter updates stop making progress. Existing indicators like Hessian rank and gradient norms don't reliably predict this phenomenon.

Method: Introduces two new indicators: batch-size-aware gradient-noise bound and curvature volatility-controlled bound. Combines them into a per-layer adaptive noise threshold on effective step-size. Proposes a step-size scheduler that keeps parameter updates below this bound.

Result: The proposed scheduler improves accuracy compared to existing approaches like CReLU, Wasserstein regularizer, and L2 weight decay. Surprisingly, it produces adaptive step-size trajectories that mirror manually engineered decay schedules without tuning.

Conclusion: The paper provides a practical solution to prevent loss of trainability in continual learning through an adaptive step-size scheduler based on novel optimization-based indicators.

Abstract: Loss of trainability refers to a phenomenon in continual learning where parameter updates no longer make progress on the optimization objective, so accuracy stalls or degrades as the learning problem changes over time. In this paper, we analyze loss of trainability through an optimization lens and find that the phenomenon is not reliably predicted by existing individual indicators such as Hessian rank, sharpness level, weight or gradient norms, gradient-to-parameter ratios, and unit-sign entropy. Motivated by our analysis, we introduce two complementary indicators: a batch-size-aware gradient-noise bound and a curvature volatility-controlled bound. We then combine these two indicators into a per-layer adaptive noise threshold on the effective step-size that anticipates trainability behavior. Using this insight, we propose a step-size scheduler that keeps each layer’s effective parameter update below this bound, thereby avoiding loss of trainability. We demonstrate that our scheduler can improve the accuracy maintained by previously proposed approaches, such as concatenated ReLU (CReLU), Wasserstein regularizer, and L2 weight decay. Surprisingly, our scheduler produces adaptive step-size trajectories that, without tuning, mirror the manually engineered step-size decay schedules.

[475] Inference-Time Alignment of Diffusion Models via Evolutionary Algorithms

Purvish Jajal, Nick John Eliopoulos, Benjamin Shiue-Hal Chou, George K. Thiruvathukal, James C. Davis, Yung-Hsiang Lu

Main category: cs.LG

TL;DR: Evolutionary algorithm-based inference-time alignment framework for diffusion models that treats them as black boxes and searches latent space to maximize alignment objectives without requiring gradients or internal model access.

Details

Motivation: Diffusion models often fail to satisfy application objectives like safety constraints or domain-specific validity, and existing alignment techniques require gradients, internal model access, or large computational budgets.

Method: Treat diffusion models as black boxes and use evolutionary algorithms to search their latent space to maximize alignment objectives during inference time.

Result: Achieves 3-35% higher ImageReward scores than gradient-free and gradient-based methods with equal or less running time, competitive results on Open Image Preferences dataset across four alignment objectives, 55-76% less GPU memory usage, and 72-80% faster than gradient-based methods.

Conclusion: The proposed evolutionary algorithm-based framework provides an efficient and effective approach for aligning diffusion models with application objectives without requiring model internals or gradients.

Abstract: Diffusion models are state-of-the-art generative models, yet their samples often fail to satisfy application objectives such as safety constraints or domain-specific validity. Existing techniques for alignment require gradients, internal model access, or large computational budgets resulting in high compute demands, or lack of support for certain objectives. In response, we introduce an inference-time alignment framework based on evolutionary algorithms. We treat diffusion models as black boxes and search their latent space to maximize alignment objectives. Given equal or less running time, our method achieves 3-35% higher ImageReward scores than gradient-free and gradient-based methods. On the Open Image Preferences dataset, our method achieves competitive results across four popular alignment objectives. In terms of computational efficiency, we require 55% to 76% less GPU memory and are 72% to 80% faster than gradient-based methods.

[476] The Impossibility of Inverse Permutation Learning in Transformer Models

Rohan Alur, Chris Hays, Manish Raghavan, Devavrat Shah

Main category: cs.LG

TL;DR: Decoder-only transformers cannot learn inverse permutation tasks, but adding scratch tokens or using encoder-decoder architectures makes it possible.

Details

Motivation: To study the robustness of transformer models in reasoning tasks like long-context retrieval and multiple choice QA by examining their ability to learn inverse permutations.

Method: Analyzed the expressive capacity of decoder-only transformers for inverse permutation learning, compared with encoder-decoder transformers, and tested adding scratch tokens to inputs.

Result: Proved impossibility for decoder-only transformers, showed feasibility with encoder-decoder architectures or by padding inputs with scratch tokens.

Conclusion: Scratch tokens may enable reasoning in LLMs similarly to chain-of-thought prompting, even without semantic meaning.

Abstract: In this technical note, we study the problem of inverse permutation learning in decoder-only transformers. Given a permutation and a string to which that permutation has been applied, the model is tasked with producing the original (canonical'') string. We argue that this task models a natural robustness property across a variety of reasoning tasks, including long-context retrieval, multiple choice QA and in-context learning. Our primary contribution is an impossibility result: we show that an arbitrary depth, decoder-only transformer cannot learn this task. This result concerns the expressive capacity of decoder-only transformer models and is agnostic to training dynamics or sample complexity. We give a pair of alternative constructions under which inverse permutation learning is feasible. The first of these highlights the fundamental role of the causal attention mask, and reveals a gap between the expressivity of encoder-decoder transformers and the more popular decoder-only architecture. The latter result is more surprising: we show that simply padding the input with scratch tokens" yields a construction under which inverse permutation learning is possible. We conjecture that this may suggest an alternative mechanism by which chain-of-thought prompting or, more generally, intermediate ``thinking’’ tokens can enable reasoning in large language models, even when these tokens encode no meaningful semantic information (e.g., the results of intermediate computations).

[477] ENMA: Tokenwise Autoregression for Generative Neural PDE Operators

Armand Kassaï Koupaï, Lise Le Boudec, Louis Serrano, Patrick Gallinari

Main category: cs.LG

TL;DR: ENMA is a generative neural operator that predicts spatio-temporal dynamics of parametric PDEs using a masked autoregressive transformer with flow matching loss, enabling robust generalization to new PDE regimes through in-context learning.

Details

Motivation: Solving time-dependent parametric PDEs is challenging for neural solvers, especially with uncertain or incomplete data. Generative models offer a natural approach to handle these uncertainties and generalize across physical parameters.

Method: ENMA uses a generative masked autoregressive transformer with flow matching loss for tokenwise generation in compressed latent space. It encodes irregular spatial observations via attention mechanisms and spatio-temporal convolutional encoder, enabling in-context learning by conditioning on past states or similar context trajectories.

Result: ENMA provides a robust framework that generalizes to new PDE regimes and supports one-shot surrogate modeling of time-dependent parametric PDEs, handling irregular sampling and incomplete data effectively.

Conclusion: ENMA represents an adaptable generative approach for modeling spatio-temporal dynamics in parametric PDEs, offering improved generalization and surrogate modeling capabilities through its in-context learning design.

Abstract: Solving time-dependent parametric partial differential equations (PDEs) remains a fundamental challenge for neural solvers, particularly when generalizing across a wide range of physical parameters and dynamics. When data is uncertain or incomplete-as is often the case-a natural approach is to turn to generative models. We introduce ENMA, a generative neural operator designed to model spatio-temporal dynamics arising from physical phenomena. ENMA predicts future dynamics in a compressed latent space using a generative masked autoregressive transformer trained with flow matching loss, enabling tokenwise generation. Irregularly sampled spatial observations are encoded into uniform latent representations via attention mechanisms and further compressed through a spatio-temporal convolutional encoder. This allows ENMA to perform in-context learning at inference time by conditioning on either past states of the target trajectory or auxiliary context trajectories with similar dynamics. The result is a robust and adaptable framework that generalizes to new PDE regimes and supports one-shot surrogate modeling of time-dependent parametric PDEs.

[478] ConStellaration: A dataset of QI-like stellarator plasma boundaries and optimization benchmarks

Santiago A. Cadena, Andrea Merlo, Emanuel Laude, Alexander Bauer, Atul Agrawal, Maria Pascu, Marija Savtchouk, Enrico Guiraud, Lukas Bonauer, Stuart Hudson, Markus Kaiser

Main category: cs.LG

TL;DR: The paper introduces an open dataset of quasi-isodynamic stellarator configurations with physics simulations and performance metrics, along with three optimization benchmarks to accelerate stellarator design research.

Details

Motivation: To address the bottleneck in stellarator optimization caused by lack of standardized problems and datasets, particularly for quasi-isodynamic configurations which are promising for commercial fusion due to disruption resilience.

Method: Generated dataset by sampling QI fields and optimizing corresponding stellarator plasma boundaries, providing three benchmarks with increasing complexity: geometric optimization, simple-to-build QI stellarator, and multi-objective MHD-stable QI stellarator.

Result: Created an open dataset with diverse QI-like stellarator shapes, equilibria, and metrics, along with reference code, evaluation scripts, and strong baselines using classical optimization techniques.

Conclusion: The released dataset and benchmarks aim to lower entry barriers for optimization and ML researchers in stellarator design, accelerating cross-disciplinary progress toward fusion energy.

Abstract: Stellarators are magnetic confinement devices under active development to deliver steady-state carbon-free fusion energy. Their design involves a high-dimensional, constrained optimization problem that requires expensive physics simulations and significant domain expertise. Recent advances in plasma physics and open-source tools have made stellarator optimization more accessible. However, broader community progress is currently bottlenecked by the lack of standardized optimization problems with strong baselines and datasets that enable data-driven approaches, particularly for quasi-isodynamic (QI) stellarator configurations, considered as a promising path to commercial fusion due to their inherent resilience to current driven disruptions. Here, we release an open dataset of diverse QI-like stellarator plasma boundary shapes, paired with their ideal magnetohydrodynamic (MHD) equilibria and performance metrics. We generated this dataset by sampling a variety of QI fields and optimizing corresponding stellarator plasma boundaries. We introduce three optimization benchmarks of increasing complexity: (1) a single objective geometric optimization problem, (2) a “simple-to-build” QI stellarator, and (3) a multi-objective ideal-MHD stable QI stellarator that investigates trade-offs between compactness and coil simplicity. For every benchmark, we provide reference code, evaluation scripts, and strong baselines based on classical optimization techniques. Finally, we show how learned models trained on our dataset can efficiently generate novel, feasible configurations without querying expensive physics oracles. By openly releasing the dataset along with benchmark problems and baselines, we aim to lower the entry barrier for optimization and machine learning researchers to engage in stellarator design and to accelerate cross-disciplinary progress toward bringing fusion energy to the grid.

[479] Action Chunking and Exploratory Data Collection Yield Exponential Improvements in Behavior Cloning for Continuous Control

Thomas T. Zhang, Daniel Pfrommer, Chaoyi Pan, Nikolai Matni, Max Simchowitz

Main category: cs.LG

TL;DR: Theoretical analysis shows action-chunking and exploratory augmentation in imitation learning avoid exponential compounding errors through control-theoretic stability mechanisms.

Details

Motivation: To understand why action-chunking and exploratory data collection interventions are effective in imitation learning despite known issues with exponential compounding errors in continuous control settings.

Method: Combined theoretical analysis using control-theoretic stability framework with empirical validation on robot learning benchmarks.

Result: Demonstrated that control-theoretic stability is the key mechanism enabling action-chunking and exploratory augmentation to circumvent exponential compounding errors, providing tighter statistical guarantees than previous information-theoretic approaches.

Conclusion: Control-theoretic analysis offers superior insights into imitation learning error compounding and demonstrates the effectiveness of action-chunking and exploratory data collection interventions in continuous control settings.

Abstract: This paper presents a theoretical analysis of two of the most impactful interventions in modern learning from demonstration in robotics and continuous control: the practice of action-chunking (predicting sequences of actions in open-loop) and exploratory augmentation of expert demonstrations. Though recent results show that learning from demonstration, also known as imitation learning (IL), can suffer errors that compound exponentially with task horizon in continuous settings, we demonstrate that action chunking and exploratory data collection circumvent exponential compounding errors in different regimes. Our results identify control-theoretic stability as the key mechanism underlying the benefits of these interventions. On the empirical side, we validate our predictions and the role of control-theoretic stability through experimentation on popular robot learning benchmarks. On the theoretical side, we demonstrate that the control-theoretic lens provides fine-grained insights into how compounding error arises, leading to tighter statistical guarantees on imitation learning error when these interventions are applied than previous techniques based on information-theoretic considerations alone.

[480] Deep RL Dual Sourcing Inventory Management with Supply and Capacity Risk Awareness

Defeng Liu, Ying Liu, Carson Eisenach

Main category: cs.LG

TL;DR: The paper proposes using reinforcement learning with intervention models to solve large-scale stochastic optimization problems, specifically applied to multi-sourcing multi-period inventory management in supply chains.

Details

Motivation: To efficiently apply RL to large-scale stochastic optimization problems by better exploring solution space through simulation and composition of stochastic processes using pre-trained DL models.

Method: Uses deep RL models for learning and forecasting stochastic supply chain processes, introduces constraint coordination mechanism to forecast dual costs for cross-product constraints, and breaks down complex processes into scalable DL modules.

Result: The approach leads to improved performance on large real-world datasets by decomposing complex supply chain processes into composable DL modules rather than directly modeling all constraints.

Conclusion: The methodology effectively handles large-scale stochastic optimization problems in supply chains by leveraging intervention models and modular DL approaches, with open problems identified for future research.

Abstract: In this work, we study how to efficiently apply reinforcement learning (RL) for solving large-scale stochastic optimization problems by leveraging intervention models. The key of the proposed methodology is to better explore the solution space by simulating and composing the stochastic processes using pre-trained deep learning (DL) models. We demonstrate our approach on a challenging real-world application, the multi-sourcing multi-period inventory management problem in supply chain optimization. In particular, we employ deep RL models for learning and forecasting the stochastic supply chain processes under a range of assumptions. Moreover, we also introduce a constraint coordination mechanism, designed to forecast dual costs given the cross-products constraints in the inventory network. We highlight that instead of directly modeling the complex physical constraints into the RL optimization problem and solving the stochastic problem as a whole, our approach breaks down those supply chain processes into scalable and composable DL modules, leading to improved performance on large real-world datasets. We also outline open problems for future research to further investigate the efficacy of such models.

[481] Weak-to-Strong Generalization under Distribution Shifts

Myeongho Jeon, Jan Sobotka, Suhwan Choi, Maria Brbić

Main category: cs.LG

TL;DR: RAVEN is a robust weak-to-strong generalization framework that addresses the failure of naive weak-to-strong supervision under distribution shifts by dynamically learning optimal combinations of weak models and strong model parameters.

Details

Motivation: As superhuman models become more complex, human supervision becomes insufficient. While weak models can supervise strong ones (weak-to-strong generalization), this approach fails under distribution shifts, often degrading strong model performance below weak supervisors.

Method: RAVEN dynamically learns optimal combinations of weak models in addition to parameters of the strong model, enabling robust supervision across different scenarios.

Result: RAVEN outperforms baselines by over 30% on out-of-distribution tasks while matching or surpassing existing methods on in-distribution tasks. It automatically assigns higher weights to more accurate weak models.

Conclusion: RAVEN provides a robust framework for weak-to-strong generalization that maintains performance under distribution shifts and can automatically identify trustworthy supervision sources.

Abstract: As future superhuman models become increasingly complex, accurately supervising their behavior may exceed human capabilities. Recent works have demonstrated that in such scenarios, weak models can effectively supervise strong models, a phenomenon known as weak-to-strong generalization. However, we find that naive weak-to-strong generalization fails under distribution shifts, often leading to worse performance of the strong model than its weak supervisors. To address this, we propose RAVEN, a robust weak-to-strong generalization framework that dynamically learns the optimal combinations of weak models in addition to parameters of the strong model. We demonstrate the effectiveness of RAVEN on image classification, text classification, and preference alignment tasks. RAVEN outperforms alternative baselines by over 30% on out-of-distribution tasks while matching or surpassing existing methods on in-distribution tasks. Moreover, our results show that RAVEN assigns higher weights to more accurate weak models, demonstrating its ability to automatically identify trustworthy supervision.

[482] Geometric Multi-color Message Passing Graph Neural Networks for Blood-brain Barrier Permeability Prediction

Trung Nguyen, Md Masud Rana, Farjana Tasnim Mukta, Chang-Guo Zhan, Duc Duy Nguyen

Main category: cs.LG

TL;DR: GMC-MPNN is a geometric graph neural network that incorporates 3D atomic geometry and long-range interactions to predict blood-brain barrier permeability, outperforming existing models on benchmark datasets.

Details

Motivation: Standard GNNs for molecular property prediction rely on molecular topology but neglect crucial 3D geometric information needed to model transport mechanisms like BBB permeability.

Method: Developed geometric multi-color message-passing GNN that constructs weighted colored subgraphs based on atom types to capture spatial relationships and chemical context, with rigorous scaffold-based splitting for evaluation.

Result: Achieved state-of-the-art performance with AUC-ROC of 0.9704/0.9685 for classification and RMSE of 0.4609 with Pearson correlation of 0.7759 for regression, outperforming existing models.

Conclusion: GMC-MPNN sets a new benchmark by integrating spatial geometry into graph representations, providing a more accurate and generalizable tool for CNS drug discovery pipelines.

Abstract: Accurate prediction of blood-brain barrier permeability (BBBP) is essential for central nervous system (CNS) drug development. While graph neural networks (GNNs) have advanced molecular property prediction, they often rely on molecular topology and neglect the three-dimensional geometric information crucial for modeling transport mechanisms. This paper introduces the geometric multi-color message-passing graph neural network (GMC-MPNN), a novel framework that enhances standard message-passing architectures by explicitly incorporating atomic-level geometric features and long-range interactions. Our model constructs weighted colored subgraphs based on atom types to capture the spatial relationships and chemical context that govern BBB permeability. We evaluated GMC-MPNN on three benchmark datasets for both classification and regression tasks, using rigorous scaffold-based splitting to ensure a robust assessment of generalization. The results demonstrate that GMC-MPNN consistently outperforms existing state-of-the-art models, achieving superior performance in both classifying compounds as permeable/non-permeable (AUC-ROC of 0.9704 and 0.9685) and in regressing continuous permeability values (RMSE of 0.4609, Pearson correlation of 0.7759). An ablation study further quantified the impact of specific atom-pair interactions, revealing that the model’s predictive power derives from its ability to learn from both common and rare, but chemically significant, functional motifs. By integrating spatial geometry into the graph representation, GMC-MPNN sets a new performance benchmark and offers a more accurate and generalizable tool for drug discovery pipelines.

[483] Learning in Stackelberg Mean Field Games: A Non-Asymptotic Analysis

Sihan Zeng, Benjamin Patrick Evans, Sujay Bhatt, Leo Ardon, Sumitra Ganesh, Alec Koppel

Main category: cs.LG

TL;DR: AC-SMFG is a single-loop actor-critic algorithm for Stackelberg mean field games that provides finite-time convergence guarantees without restrictive independence assumptions between leader and followers.

Details

Motivation: Existing methods for Stackelberg MFGs rely on restrictive independence assumptions, use samples inefficiently with nested-loop structures, and lack finite-time convergence guarantees.

Method: Proposed AC-SMFG algorithm: single-loop actor-critic that alternates between (semi-)gradient updates for leader, representative follower, and mean field using continuously generated Markovian samples.

Result: Established finite-time and finite-sample convergence to stationary point of Stackelberg objective; outperforms existing baselines in policy quality and convergence speed in economics environments.

Conclusion: AC-SMFG is the first Stackelberg MFG algorithm with non-asymptotic convergence guarantees, relaxing existing independence assumptions through a gradient alignment condition.

Abstract: We study policy optimization in Stackelberg mean field games (MFGs), a hierarchical framework for modeling the strategic interaction between a single leader and an infinitely large population of homogeneous followers. The objective can be formulated as a structured bi-level optimization problem, in which the leader needs to learn a policy maximizing its reward, anticipating the response of the followers. Existing methods for solving these (and related) problems often rely on restrictive independence assumptions between the leader’s and followers’ objectives, use samples inefficiently due to nested-loop algorithm structure, and lack finite-time convergence guarantees. To address these limitations, we propose AC-SMFG, a single-loop actor-critic algorithm that operates on continuously generated Markovian samples. The algorithm alternates between (semi-)gradient updates for the leader, a representative follower, and the mean field, and is simple to implement in practice. We establish the finite-time and finite-sample convergence of the algorithm to a stationary point of the Stackelberg objective. To our knowledge, this is the first Stackelberg MFG algorithm with non-asymptotic convergence guarantees. Our key assumption is a “gradient alignment” condition, which requires that the full policy gradient of the leader can be approximated by a partial component of it, relaxing the existing leader-follower independence assumption. Simulation results in a range of well-established economics environments demonstrate that AC-SMFG outperforms existing multi-agent and MFG learning baselines in policy quality and convergence speed.

[484] Diffusion Models at the Drug Discovery Frontier: A Review on Generating Small Molecules versus Therapeutic Peptides

Yiquan Wang, Yahui Ma, Yuhan Chang, Jiayao Yan, Jialin Zhang, Minnuo Cai, Kai Wei

Main category: cs.LG

TL;DR: Diffusion models offer a unified framework for drug discovery, adapted differently for small molecules (structure-based design) and therapeutic peptides (functional sequence generation), with shared challenges in data quality and validation.

Details

Motivation: To transform the slow and costly drug discovery process by leveraging diffusion models for designing small molecules and therapeutic peptides, addressing their distinct molecular representations and design objectives.

Method: Systematic comparison of diffusion model applications in drug discovery, focusing on iterative denoising adapted to different molecular representations (small molecules vs. peptides), chemical spaces, and design objectives.

Result: Diffusion models excel at structure-based design for small molecules and functional sequence generation for peptides, but face modality-specific challenges (synthesizability for molecules, stability/folding for peptides) and shared hurdles (data scarcity, inaccurate scoring).

Conclusion: The full potential of diffusion models in drug discovery will be realized by bridging modality-specific gaps and integrating them into automated DBTL platforms, shifting from chemical exploration to on-demand therapeutic engineering.

Abstract: Diffusion models have emerged as a leading framework in generative modeling, poised to transform the traditionally slow and costly process of drug discovery. This review provides a systematic comparison of their application in designing two principal therapeutic modalities: small molecules and therapeutic peptides. We dissect how the unified framework of iterative denoising is adapted to the distinct molecular representations, chemical spaces, and design objectives of each modality. For small molecules, these models excel at structure-based design, generating novel, pocket-fitting ligands with desired physicochemical properties, yet face the critical hurdle of ensuring chemical synthesizability. Conversely, for therapeutic peptides, the focus shifts to generating functional sequences and designing de novo structures, where the primary challenges are achieving biological stability against proteolysis, ensuring proper folding, and minimizing immunogenicity. Despite these distinct challenges, both domains face shared hurdles: the scarcity of high-quality experimental data, the reliance on inaccurate scoring functions for validation, and the crucial need for experimental validation. We conclude that the full potential of diffusion models will be unlocked by bridging these modality-specific gaps and integrating them into automated, closed-loop Design-Build-Test-Learn (DBTL) platforms, thereby shifting the paradigm from mere chemical exploration to the on-demand engineering of novel~therapeutics.

[485] A Conditional Distribution Equality Testing Framework using Deep Generative Learning

Siming Zheng, Tong Wang, Meifang Lan, Yuanyuan Lin

Main category: cs.LG

TL;DR: A neural network-based framework for testing conditional distribution equality in two-sample problems, with applications to covariate shift and causal discovery.

Details

Motivation: Address the need for testing conditional distribution equality in two-sample problems, particularly relevant for covariate shift and causal discovery applications.

Method: Transform conditional testing into unconditional testing using neural network-based generative methods and sample splitting techniques, introducing GCA-CDET (Generative Classification Accuracy-based Conditional Distribution Equality Test).

Result: Established convergence rate for learned generators using offset Rademacher complexity, proved testing consistency under mild conditions, and demonstrated effectiveness through synthetic and real-world datasets.

Conclusion: The proposed framework provides an effective approach for conditional distribution equality testing with theoretical guarantees and empirical validation.

Abstract: In this paper, we propose a general framework for testing the conditional distribution equality in a two-sample problem, which is most relevant to covariate shift and causal discovery. Our framework is built on neural network-based generative methods and sample splitting techniques by transforming the conditional testing problem into an unconditional one. We introduce the generative classification accuracy-based conditional distribution equality test (GCA-CDET) to illustrate the proposed framework. We establish the convergence rate for the learned generator by deriving new results related to the recently-developed offset Rademacher complexity and prove the testing consistency of GCA-CDET under mild conditions.Empirically, we conduct numerical studies including synthetic datasets and two real-world datasets, demonstrating the effectiveness of our approach. Additional discussions on the optimality of the proposed framework are provided in the online supplementary material.

[486] CRPS-LAM: Regional ensemble weather forecasting from matching marginals

Erik Larsson, Joel Oskarsson, Tomas Landelius, Fredrik Lindsten

Main category: cs.LG

TL;DR: CRPS-LAM is a probabilistic regional weather forecasting model that uses CRPS-based training to generate ensemble forecasts in a single forward pass, achieving 39x faster sampling than diffusion models while maintaining accuracy.

Details

Motivation: Diffusion-based models for weather prediction are computationally expensive at sampling time, despite their strong performance in Limited-Area Modeling (LAM). There's a need for faster probabilistic forecasting methods that maintain accuracy.

Method: Train a probabilistic LAM forecasting model using a Continuous Ranked Probability Score (CRPS)-based objective, where ensemble members are generated by sampling and injecting a single latent noise vector into the model in a single forward pass.

Result: CRPS-LAM achieves sampling speeds up to 39 times faster than diffusion-based models, matches the low errors of diffusion models on the MEPS regional dataset, and retains fine-scale forecast details.

Conclusion: CRPS-LAM stands out as an effective approach for probabilistic regional weather forecasting, offering significant computational efficiency gains while maintaining forecast quality comparable to state-of-the-art diffusion models.

Abstract: Machine learning for weather prediction increasingly relies on ensemble methods to provide probabilistic forecasts. Diffusion-based models have shown strong performance in Limited-Area Modeling (LAM) but remain computationally expensive at sampling time. Building on the success of global weather forecasting models trained based on Continuous Ranked Probability Score (CRPS), we introduce CRPS-LAM, a probabilistic LAM forecasting model trained with a CRPS-based objective. By sampling and injecting a single latent noise vector into the model, CRPS-LAM generates ensemble members in a single forward pass, achieving sampling speeds up to 39 times faster than a diffusion-based model. We evaluate the model on the MEPS regional dataset, where CRPS-LAM matches the low errors of diffusion models. By retaining also fine-scale forecast details, the method stands out as an effective approach for probabilistic regional weather forecasting

[487] Lost in Serialization: Invariance and Generalization of LLM Graph Reasoners

Daniel Herbst, Lea Karbevska, Divyanshu Kumar, Akanksha Ahuja, Fatemeh Gholamzadeh Nasrabadi, Fabrizio Frasca

Main category: cs.LG

TL;DR: LLM-based graph reasoners lack invariance to graph symmetries, causing robustness issues. Fine-tuning reduces sensitivity to node relabeling but may increase sensitivity to structural/format changes, without improving generalization to unseen tasks.

Details

Motivation: Graph reasoners using LLMs are sensitive to graph representation symmetries (node reindexing, edge reordering, formatting), raising robustness concerns that need systematic analysis.

Method: Proposed decomposition of graph serializations into node labeling, edge encoding, and syntax; evaluated LLM robustness to variations in each factor using comprehensive benchmarking and novel spectral tasks.

Result: Larger non-fine-tuned models are more robust. Fine-tuning reduces sensitivity to node relabeling but increases sensitivity to structural/format variations, and doesn’t consistently improve performance on unseen tasks.

Conclusion: Fine-tuning LLM graph reasoners has mixed effects - improves robustness to node relabeling but may decrease robustness to other graph variations, without enhancing generalization to new tasks.

Abstract: While promising, graph reasoners based on Large Language Models (LLMs) lack built-in invariance to symmetries in graph representations. Operating on sequential graph serializations, LLMs can produce different outputs under node reindexing, edge reordering, or formatting changes, raising robustness concerns. We systematically analyze these effects, studying how fine-tuning impacts encoding sensitivity as well generalization on unseen tasks. We propose a principled decomposition of graph serializations into node labeling, edge encoding, and syntax, and evaluate LLM robustness to variations of each of these factors on a comprehensive benchmarking suite. We also contribute a novel set of spectral tasks to further assess generalization abilities of fine-tuned reasoners. Results show that larger (non-fine-tuned) models are more robust. Fine-tuning reduces sensitivity to node relabeling but may increase it to variations in structure and format, while it does not consistently improve performance on unseen tasks.

[488] A Connection Between Score Matching and Local Intrinsic Dimension

Eric Yeats, Aaron Jacobson, Darryl Hannan, Yiran Jia, Timothy Doster, Henry Kvinge, Scott Mahan

Main category: cs.LG

TL;DR: The paper proposes using denoising score matching loss as a scalable local intrinsic dimension (LID) estimator, showing it’s competitive with existing methods while being more efficient.

Details

Motivation: Existing LID estimation methods using diffusion models require many forward passes or gradient computations, limiting their applicability in compute- and memory-constrained scenarios.

Method: The authors show that LID is a lower bound on the denoising score matching loss and that the equivalent implicit score matching loss also approximates LID via the normal dimension.

Result: Experiments on manifold benchmark and Stable Diffusion 3.5 show the denoising score matching loss achieves superior accuracy and memory footprint under increasing problem size and quantization.

Conclusion: Denoising score matching loss is a highly competitive and scalable LID estimator that outperforms existing methods in efficiency and accuracy.

Abstract: The local intrinsic dimension (LID) of data is a fundamental quantity in signal processing and learning theory, but quantifying the LID of high-dimensional, complex data has been a historically challenging task. Recent works have discovered that diffusion models capture the LID of data through the spectra of their score estimates and through the rate of change of their density estimates under various noise perturbations. While these methods can accurately quantify LID, they require either many forward passes of the diffusion model or use of gradient computation, limiting their applicability in compute- and memory-constrained scenarios. We show that the LID is a lower bound on the denoising score matching loss, motivating use of the denoising score matching loss as a LID estimator. Moreover, we show that the equivalent implicit score matching loss also approximates LID via the normal dimension and is closely related to a recent LID estimator, FLIPD. Our experiments on a manifold benchmark and with Stable Diffusion 3.5 indicate that the denoising score matching loss is a highly competitive and scalable LID estimator, achieving superior accuracy and memory footprint under increasing problem size and quantization level.

[489] Category learning in deep neural networks: Information content and geometry of internal representations

Laurent Bonnasse-Gahot, Jean-Pierre Nadal

Main category: cs.LG

TL;DR: The paper extends a theoretical framework to artificial neural networks, showing that minimizing Bayes cost leads to maximizing mutual information between categories and neural activities, resulting in categorical perception with neural space expansion near decision boundaries.

Details

Motivation: To understand why categorical perception (enhanced discrimination near category boundaries) occurs in artificial neural networks and provide a theoretical explanation based on information theory and efficient learning principles.

Method: Extend theoretical framework to artificial networks, analyze mutual information maximization between categories and neural activities, derive Fisher information matrices for neural representation and category sensitivity, and validate with toy models and MNIST dataset.

Result: Found that optimal learning makes neural Fisher information follow category-specific Fisher information, causing expansion of neural space near decision boundaries. Maximum Fisher information occurs near but not exactly at class boundaries.

Conclusion: Category learning induces neural space expansion near boundaries as an outcome of efficient learning through mutual information maximization, with Fisher information matrices aligning with category boundaries after learning.

Abstract: In humans and other animals, category learning enhances discrimination between stimuli close to the category boundary. This phenomenon, called categorical perception, was also empirically observed in artificial neural networks trained on classification tasks. In previous modeling works based on neuroscience data, we show that this expansion/compression is a necessary outcome of efficient learning. Here we extend our theoretical framework to artificial networks. We show that minimizing the Bayes cost (mean of the cross-entropy loss) implies maximizing the mutual information between the set of categories and the neural activities prior to the decision layer. Considering structured data with an underlying feature space of small dimension, we show that maximizing the mutual information implies (i) finding an appropriate projection space, and, (ii) building a neural representation with the appropriate metric. The latter is based on a Fisher information matrix measuring the sensitivity of the neural activity to changes in the projection space. Optimal learning makes this neural Fisher information follow a category-specific Fisher information, measuring the sensitivity of the category membership. Category learning thus induces an expansion of neural space near decision boundaries. We characterize the properties of the categorical Fisher information, showing that its eigenvectors give the most discriminant directions at each point of the projection space. We find that, unexpectedly, its maxima are in general not exactly at, but near, the class boundaries. Considering toy models and the MNIST dataset, we numerically illustrate how after learning the two Fisher information matrices match, and essentially align with the category boundaries. Finally, we relate our approach to the Information Bottleneck one, and we exhibit a bias-variance decomposition of the Bayes cost, of interest on its own.

[490] QiMeng-SALV: Signal-Aware Learning for Verilog Code Generation

Yang Zhang, Rui Zhang, Jiaming Guo, Lei Huang, Di Huang, Yunpu Zhao, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen

Main category: cs.LG

TL;DR: QiMeng-SALV introduces signal-aware learning for Verilog code generation by extracting verified signal-level implementations from partially incorrect modules to optimize RL training, achieving state-of-the-art performance with a 7B model matching DeepSeek v3 671B.

Details

Motivation: The lack of meaningful functional rewards hinders reinforcement learning optimization for generating functionally correct Verilog code in automated circuit design.

Method: Extract verified signal-aware implementations from partially incorrect modules using AST analysis, verify functional correctness by comparing with reference modules, and optimize with signal-aware DPO on correct signal-level code segments.

Result: Achieves state-of-the-art performance on VerilogEval and RTLLM benchmarks, with a 7B parameter model matching DeepSeek v3 671B performance and significantly outperforming CodeV.

Conclusion: Proposes a paradigm shift from module-level to fine-grained signal-level optimization in Verilog code generation, effectively addressing insufficient functional rewards and enabling efficient RL training.

Abstract: The remarkable progress of Large Language Models (LLMs) presents promising opportunities for Verilog code generation which is significantly important for automated circuit design. The lacking of meaningful functional rewards hinders the preference optimization based on Reinforcement Learning (RL) for producing functionally correct Verilog code. In this paper, we propose Signal-Aware Learning for Verilog code generation (QiMeng-SALV) by leveraging code segments of functionally correct output signal to optimize RL training. Considering Verilog code specifies the structural interconnection of hardware gates and wires so that different output signals are independent, the key insight of QiMeng-SALV is to extract verified signal-aware implementations in partially incorrect modules, so as to enhance the extraction of meaningful functional rewards. Roughly, we verify the functional correctness of signals in generated module by comparing with that of reference module in the training data. Then abstract syntax tree (AST) is employed to identify signal-aware code segments which can provide meaningful functional rewards from erroneous modules. Finally, we introduce signal-aware DPO which is optimized on the correct signal-level code segments, thereby preventing noise and interference from incorrect signals. The proposed QiMeng-SALV underscores the paradigm shift from conventional module-level to fine-grained signal-level optimization in Verilog code generation, addressing the issue of insufficient functional rewards. Experiments demonstrate that our method achieves state-of-the-art performance on VerilogEval and RTLLM, with a 7B parameter model matching the performance of the DeepSeek v3 671B model and significantly outperforming the leading open-source model CodeV trained on the same dataset. Our code is available at https://github.com/zy1xxx/SALV.

[491] g-DPO: Scalable Preference Optimization for Protein Language Models

Constance Ferragu, Jonathan D. Ziegler, Nicolas Deutschmann, Arthur Lindoulsi, Eli Bixby, Cradle ML Team

Main category: cs.LG

TL;DR: g-DPO is a scalable framework that accelerates Direct Preference Optimization for protein language models by clustering sequences to prune redundant training pairs and using group-based approximations to amortize likelihood computations.

Details

Motivation: Standard DPO faces scalability issues due to quadratic growth of training pairs with dataset size, leading to prohibitive training times for protein engineering applications.

Method: Uses sequence space clustering to prune redundant pairs while preserving training signal, and amortizes likelihood computations with group-based approximations.

Result: Maintains performance statistically indistinguishable from standard DPO across three protein engineering tasks, while converging 1.7x to 5.4x faster with speedups scaling with dataset size and mutational landscape structure.

Conclusion: g-DPO provides an efficient alternative to DPO that preserves performance while significantly reducing training time, making protein language model alignment more scalable for practical applications.

Abstract: Direct Preference Optimization (DPO) is an effective approach for aligning protein language models with experimental design goals. However, DPO faces a scalability bottleneck: the number of possible training pairs grows quadratically with the number of labeled sequences, leading to prohibitive training times even for modestly sized datasets. We introduce g-DPO, a framework that (i) uses sequence space clustering to prune redundant pairs while preserving training signal, and (ii) amortizes likelihood computations with group-based approximations. Across three protein engineering tasks, g-DPO maintains in silico and in vitro performance that is statistically indistinguishable from standard DPO, while converging 1.7x to 5.4x times faster, with speedups that scale with dataset size and the structure of the underlying mutational landscape.

[492] Empowering Targeted Neighborhood Search via Hyper Tour for Large-Scale TSP

Tongkai Lu, Shuai Ma, Chongyang Tao

Main category: cs.LG

TL;DR: Proposes Hyper Tour Guided Neighborhood Search (HyperNS) for large-scale TSP, using clustering and hyper tours to reduce search space and improve solution quality.

Details

Motivation: Neural-based TSP methods face scaling challenges with memory constraints, poor initial solutions, and insufficient global guidance for large instances.

Method: Divides TSP into clusters using sparse heatmap graphs, abstracts them as supernodes, generates hyper tours to guide initialization and optimization, focusing on relevant edges.

Result: Outperforms existing neural-based methods on synthetic and real-world datasets, especially for larger instances, with significant reduction in optimality gap.

Conclusion: HyperNS effectively addresses scaling challenges in neural TSP solvers through clustering and hyper tour guidance, enabling better performance on large-scale instances.

Abstract: Traveling Salesman Problem (TSP) is a classic NP-hard problem that has garnered significant attention from both academia and industry. While neural-based methods have shown promise for solving TSPs, they still face challenges in scaling to larger instances, particularly in memory constraints associated with global heatmaps, edge weights, or access matrices, as well as in generating high-quality initial solutions and insufficient global guidance for efficiently navigating vast search spaces. To address these challenges, we propose a Hyper Tour Guided Neighborhood Search (HyperNS) method for large-scale TSP instances. Inspired by the ``clustering first, route second" strategy, our approach initially divides the TSP instance into clusters using a sparse heatmap graph and abstracts them as supernodes, followed by the generation of a hyper tour to guide both the initialization and optimization processes. This method reduces the search space by focusing on edges relevant to the hyper tour, leading to more efficient and effective optimization. Experimental results on both synthetic and real-world datasets demonstrate that our approach outperforms existing neural-based methods, particularly in handling larger-scale instances, offering a significant reduction in the gap to the optimal solution.

[493] Uncertainty-Aware Deep Learning Framework for Remaining Useful Life Prediction in Turbofan Engines with Learned Aleatoric Uncertainty

Krishang Sharma

Main category: cs.LG

TL;DR: Novel uncertainty-aware deep learning framework for RUL prediction with Bayesian output layer that learns aleatoric uncertainty, achieving breakthrough performance in critical zones and well-calibrated confidence intervals.

Details

Motivation: Accurate RUL prediction with uncertainty quantification is critical for aerospace prognostics, but existing CMAPSS-based literature lacks exploration of probabilistic modeling for learning aleatoric uncertainty.

Method: Hierarchical architecture with multi-scale Inception blocks, bidirectional LSTMs, dual-level attention mechanism, and Bayesian output layer that predicts both mean RUL and variance. Comprehensive preprocessing includes condition-aware clustering, wavelet denoising, and intelligent feature selection.

Result: Competitive overall RMSE on CMAPSS benchmarks (16.22-19.98) and breakthrough critical zone performance (RUL <= 30 cycles) with RMSE 5.14-7.16, representing 25-40% improvements. Learned uncertainty provides well-calibrated 95% confidence intervals with 93.5-95.2% coverage.

Conclusion: The framework establishes new benchmarks for safety-critical predictions and enables risk-aware maintenance scheduling previously unattainable in CMAPSS literature through uncertainty-aware deep learning.

Abstract: Accurate Remaining Useful Life (RUL) prediction coupled with uncertainty quantification remains a critical challenge in aerospace prognostics. This research introduces a novel uncertainty-aware deep learning framework that learns aleatoric uncertainty directly through probabilistic modeling, an approach unexplored in existing CMAPSS-based literature. Our hierarchical architecture integrates multi-scale Inception blocks for temporal pattern extraction, bidirectional Long Short-Term Memory networks for sequential modeling, and a dual-level attention mechanism operating simultaneously on sensor and temporal dimensions. The innovation lies in the Bayesian output layer that predicts both mean RUL and variance, enabling the model to learn data-inherent uncertainty. Comprehensive preprocessing employs condition-aware clustering, wavelet denoising, and intelligent feature selection. Experimental validation on NASA CMAPSS benchmarks (FD001-FD004) demonstrates competitive overall performance with RMSE values of 16.22, 19.29, 16.84, and 19.98 respectively. Remarkably, our framework achieves breakthrough critical zone performance (RUL <= 30 cycles) with RMSE of 5.14, 6.89, 5.27, and 7.16, representing 25-40 percent improvements over conventional approaches and establishing new benchmarks for safety-critical predictions. The learned uncertainty provides well-calibrated 95 percent confidence intervals with coverage ranging from 93.5 percent to 95.2 percent, enabling risk-aware maintenance scheduling previously unattainable in CMAPSS literature.

[494] Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding

Jungyeon Koh, Hyun Jong Yang

Main category: cs.LG

TL;DR: Proposes a unified framework that jointly optimizes user association and resource allocation to support efficient parallel speculative decoding for LLM inference in mobile edge computing systems, achieving significant latency reduction without accuracy loss.

Details

Motivation: The growing demand for on-device LLM inference requires efficient mobile edge computing solutions, especially in resource-constrained settings. Speculative decoding helps but suffers from communication overhead and asynchronous delays.

Method: A unified framework that jointly optimizes user association and resource allocation (UARA) using multi-agent deep reinforcement learning, evaluated with Sionna simulator under realistic conditions.

Result: Achieves up to 28.0% and an average of 23.7% reduction in end-to-end latency without compromising inference accuracy.

Conclusion: Enables scalable and low-latency LLM services in MEC systems through optimized parallel speculative decoding.

Abstract: The growing demand for on-device large language model (LLM) inference highlights the need for efficient mobile edge computing (MEC) solutions, especially in resource-constrained settings. Speculative decoding offers a promising solution by partitioning token generation between a lightweight draft model on mobile devices and a powerful target model on edge servers, but suffers from communication overhead and asynchronous delays. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation (UARA) to support efficient parallel speculative decoding. We solve the UARA problem using a multi-agent deep reinforcement learning algorithm. To evaluate our approach under realistic conditions, we conduct experiments using the Sionna simulator. Results show that our method achieves up to 28.0% and an average of 23.7% reduction in end-to-end latency without compromising inference accuracy, enabling scalable and low-latency LLM services in MEC systems.

[495] HardFlow: Hard-Constrained Sampling for Flow-Matching Models via Trajectory Optimization

Zeyang Li, Kaveh Alim, Navid Azizan

Main category: cs.LG

TL;DR: HardFlow: A novel framework that reformulates hard-constrained sampling as trajectory optimization using numerical optimal control to precisely satisfy constraints at terminal time while maintaining sample quality.

Details

Motivation: Existing projection-based approaches for enforcing hard constraints in generative models are overly restrictive and degrade sample quality, while many downstream applications require precise constraint satisfaction.

Method: Leverages numerical optimal control to steer sampling trajectories, exploits flow-matching model structure, and uses model predictive control techniques to transform constrained optimization into tractable surrogate problems.

Result: Extensive experiments in robotics, PDEs, and vision show HardFlow substantially outperforms existing methods in both constraint satisfaction and sample quality.

Conclusion: The trajectory optimization perspective provides a unified framework for constraint satisfaction, distribution shift minimization, and sample quality enhancement, with proven approximation error bounds.

Abstract: Diffusion and flow-matching have emerged as powerful methodologies for generative modeling, with remarkable success in capturing complex data distributions and enabling flexible guidance at inference time. Many downstream applications, however, demand enforcing hard constraints on generated samples (for example, robot trajectories must avoid obstacles), a requirement that goes beyond simple guidance. Prevailing projection-based approaches constrain the entire sampling path to the constraint manifold, which is overly restrictive and degrades sample quality. In this paper, we introduce a novel framework that reformulates hard-constrained sampling as a trajectory optimization problem. Our key insight is to leverage numerical optimal control to steer the sampling trajectory so that constraints are satisfied precisely at the terminal time. By exploiting the underlying structure of flow-matching models and adopting techniques from model predictive control, we transform this otherwise complex constrained optimization problem into a tractable surrogate that can be solved efficiently and effectively. Furthermore, this trajectory optimization perspective offers significant flexibility beyond mere constraint satisfaction, allowing for the inclusion of integral costs to minimize distribution shift and terminal objectives to further enhance sample quality, all within a unified framework. We provide a control-theoretic analysis of our method, establishing bounds on the approximation error between our tractable surrogate and the ideal formulation. Extensive experiments across diverse domains, including robotics (planning), partial differential equations (boundary control), and vision (text-guided image editing), demonstrate that our algorithm, which we name $\textit{HardFlow}$, substantially outperforms existing methods in both constraint satisfaction and sample quality.

[496] UniGame: Turning a Unified Multimodal Model Into Its Own Adversary

Zhaolong Su, Wang Lu, Hao Chen, Sharon Li, Jindong Wang

Main category: cs.LG

TL;DR: UniGame is a self-adversarial post-training framework that addresses the inconsistency between understanding and generation in Unified Multimodal Models by using a lightweight perturber to enable the generation branch to challenge fragile understanding.

Details

Motivation: UMMs exhibit fundamental inconsistency where understanding favors compact embeddings while generation favors reconstruction-rich representations, leading to misaligned decision boundaries, degraded cross-modal coherence, and vulnerability to distributional and adversarial shifts.

Method: A self-adversarial post-training framework that applies a lightweight perturber at the shared token interface, enabling the generation branch to actively seek and challenge fragile understanding through adversarial self-play.

Result: Significant improvements in consistency (+4.6%), understanding (+3.6%), generation (+0.02), and robustness (+4.8% on NaturalBench, +6.2% on AdVQA) with less than 1% additional parameters.

Conclusion: Adversarial self-play is an effective principle for enhancing coherence, stability, and unified competence of multimodal foundation models, and the framework is architecture-agnostic and complementary to existing methods.

Abstract: Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets the inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves the consistency (+4.6%). Moreover, it also achieves substantial improvements in understanding (+3.6%), generation (+0.02), out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/UniGame

[497] Practical Global and Local Bounds in Gaussian Process Regression via Chaining

Junyi Liu, Stanley Kok

Main category: cs.LG

TL;DR: A chaining-based framework for estimating bounds on expected extreme values in Gaussian process regression, providing global and local uncertainty quantification without requiring specific input features or posterior estimates.

Details

Motivation: Existing uncertainty bounds in GPR require specific input features, rely on posterior mean/variance estimates, or need hyperparameter tuning, limiting robustness and failing to capture global model behavior.

Method: Proposed a chaining-based framework with kernel-specific refinements for RBF and Matérn kernels, avoiding analytical relaxations for numerical tightness, and developed local uncertainty quantification using chaining geometry through partition diameters.

Result: Theoretical bounds are tighter than generic constructions, especially for common kernels. Experimental results show the method outperforms existing approaches on synthetic and real-world datasets.

Conclusion: The chaining-based framework provides robust uncertainty quantification for GPR that captures global behavior and adapts to local structures without relying on posterior variance scaling or specific input features.

Abstract: Gaussian process regression (GPR) is a popular nonparametric Bayesian method that provides predictive uncertainty estimates and is widely used in safety-critical applications. While prior research has introduced various uncertainty bounds, most existing approaches require access to specific input features, and rely on posterior mean and variance estimates or the tuning of hyperparameters. These limitations hinder robustness and fail to capture the model’s global behavior in expectation. To address these limitations, we propose a chaining-based framework for estimating upper and lower bounds on the expected extreme values over unseen data, without requiring access to specific input features. We provide kernel-specific refinements for commonly used kernels such as RBF and Matérn, in which our bounds are tighter than generic constructions. We further improve numerical tightness by avoiding analytical relaxations. In addition to global estimation, we also develop a novel method for local uncertainty quantification at specified inputs. This approach leverages chaining geometry through partition diameters, adapting to local structures without relying on posterior variance scaling. Our experimental results validate the theoretical findings and demonstrate that our method outperforms existing approaches on both synthetic and real-world datasets.

[498] PrefixGPT: Prefix Adder Optimization by a Generative Pre-trained Transformer

Ruogu Ding, Xin Ning, Ulf Schlichtmann, Weikang Qian

Main category: cs.LG

TL;DR: PrefixGPT is a GPT-based model that generates optimized prefix adders from scratch, achieving 7.7% improved area-delay product and up to 79.1% lower average ADP compared to existing methods.

Details

Motivation: Designing optimized prefix adders is challenging due to strict design rules and exponentially large design space, requiring automated solutions.

Method: Represents adder topology as 2D coordinate sequence with legality mask, uses decoder-only Transformer pre-trained on random valid adders then fine-tuned for optimization.

Result: Found new optimal design with 7.7% improved ADP and up to 79.1% lower average ADP, demonstrating superior exploration quality.

Conclusion: GPT-style models can master complex hardware design principles and apply them for efficient design optimization.

Abstract: Prefix adders are widely used in compute-intensive applications for their high speed. However, designing optimized prefix adders is challenging due to strict design rules and an exponentially large design space. We introduce PrefixGPT, a generative pre-trained Transformer (GPT) that directly generates optimized prefix adders from scratch. Our approach represents an adder’s topology as a two-dimensional coordinate sequence and applies a legality mask during generation, ensuring every design is valid by construction. PrefixGPT features a customized decoder-only Transformer architecture. The model is first pre-trained on a corpus of randomly synthesized valid prefix adders to learn design rules and then fine-tuned to navigate the design space for optimized design quality. Compared with existing works, PrefixGPT not only finds a new optimal design with a 7.7% improved area-delay product (ADP) but exhibits superior exploration quality, lowering the average ADP by up to 79.1%. This demonstrates the potential of GPT-style models to first master complex hardware design principles and then apply them for more efficient design optimization.

[499] SculptDrug : A Spatial Condition-Aware Bayesian Flow Model for Structure-based Drug Design

Qingsong Zhong, Haomin Yu, Yan Lin, Wangmeng Shen, Long Zeng, Jilin Hu

Main category: cs.LG

TL;DR: SculptDrug is a spatial condition-aware generative model for Structure-Based Drug Design that addresses key challenges in ligand generation using Bayesian flow networks, boundary constraints, and hierarchical encoding.

Details

Motivation: Existing generative models for SBDD face challenges with boundary constraints, hierarchical structural conditions, and spatial modeling fidelity, limiting their effectiveness in generating geometrically compatible drug ligands.

Method: Uses Bayesian flow networks with progressive denoising, Boundary Awareness Block for protein surface constraints, and Hierarchical Encoder for global context and fine-grained interactions.

Result: Outperforms state-of-the-art baselines on CrossDocked dataset, demonstrating effectiveness of spatial condition-aware modeling.

Conclusion: SculptDrug successfully addresses key limitations in SBDD through spatial condition-aware modeling, providing improved ligand generation with better geometric compatibility and structural consistency.

Abstract: Structure-Based drug design (SBDD) has emerged as a popular approach in drug discovery, leveraging three-dimensional protein structures to generate drug ligands. However, existing generative models encounter several key challenges: (1) incorporating boundary condition constraints, (2) integrating hierarchical structural conditions, and (3) ensuring spatial modeling fidelity. To address these limitations, we propose SculptDrug, a spatial condition-aware generative model based on Bayesian flow networks (BFNs). First, SculptDrug follows a BFN-based framework and employs a progressive denoising strategy to ensure spatial modeling fidelity, iteratively refining atom positions while enhancing local interactions for precise spatial alignment. Second, we introduce a Boundary Awareness Block that incorporates protein surface constraints into the generative process to ensure that generated ligands are geometrically compatible with the target protein. Third, we design a Hierarchical Encoder that captures global structural context while preserving fine-grained molecular interactions, ensuring overall consistency and accurate ligand-protein conformations. We evaluate SculptDrug on the CrossDocked dataset, and experimental results demonstrate that SculptDrug outperforms state-of-the-art baselines, highlighting the effectiveness of spatial condition-aware modeling.

[500] Self-Organization and Spectral Mechanism of Attractor Landscapes in High-Capacity Kernel Hopfield Networks

Akira Tamamori

Main category: cs.LG

TL;DR: The paper reveals that optimal performance in kernel-based Hopfield networks occurs at a critical “Ridge of Optimization” where spectral concentration balances direct and indirect forces for maximum robustness and capacity.

Details

Motivation: To understand the dynamical mechanism behind the enhanced storage capacity of kernel-based Hopfield networks, which remains poorly understood despite empirical success.

Method: Unified geometric analysis of attractor landscapes with spectral theory of kernel machines, using a novel “Pinnacle Sharpness” metric to analyze attractor stability and spectral reorganization.

Result: Identified a “Ridge of Optimization” phase where networks achieve maximal robustness under high-load conditions through “Force Antagonism” - balanced direct and collective feedback forces driven by “Spectral Concentration” of weight eigenvalues.

Conclusion: Optimal associative memory performance is achieved in a spectral “Goldilocks zone” where the network self-organizes into a critical state with amplified leading eigenvalue for stability while preserving trailing eigenvalues for capacity, avoiding both rank collapse and diffusion.

Abstract: Kernel-based learning methods can dramatically increase the storage capacity of Hopfield networks, yet the dynamical mechanism behind this enhancement remains poorly understood. We address this gap by unifying the geometric analysis of the attractor landscape with the spectral theory of kernel machines. Using a novel metric, “Pinnacle Sharpness,” we first uncover a rich phase diagram of attractor stability, identifying a “Ridge of Optimization” where the network achieves maximal robustness under high-load conditions. Phenomenologically, this ridge is characterized by a “Force Antagonism,” where a strong driving force is balanced by a collective feedback force. Theoretically, we reveal that this phenomenon arises from a specific reorganization of the weight spectrum, which we term \textit{Spectral Concentration}. Unlike a simple rank-1 collapse, our analysis shows that the network on the ridge self-organizes into a critical state: the leading eigenvalue is amplified to maximize global stability (Direct Force), while the trailing eigenvalues are preserved to maintain high memory capacity (Indirect Force). These findings provide a complete physical picture of how high-capacity associative memories are formed, demonstrating that optimal performance is achieved by tuning the system to a spectral “Goldilocks zone” between rank collapse and diffusion.

[501] TREASURE: A Transformer-Based Foundation Model for High-Volume Transaction Understanding

Chin-Chia Michael Yeh, Uday Singh Saini, Xin Dai, Xiran Fan, Shubham Jain, Yujie Fan, Jiarui Sun, Junpeng Wang, Menghai Pan, Yingtong Dou, Yuzhong Chen, Vineeth Rakesh, Liang Wang, Yan Zheng, Mahashweta Das

Main category: cs.LG

TL;DR: TREASURE is a transformer-based foundation model for transaction data that captures consumer behavior and payment network signals, improving abnormal behavior detection by 111% and recommendation systems by 104%.

Details

Motivation: Payment networks generate high volumes of transaction data that can enable applications like abnormal behavior detection and hyper-personalized consumer insights to improve people's lives.

Method: TREASURE uses a transformer architecture with dedicated input modules for static and dynamic attributes, and an efficient training paradigm for predicting high-cardinality categorical attributes.

Result: The model increases abnormal behavior detection performance by 111% over production systems and enhances recommendation models by 104% when used as an embedding provider.

Conclusion: TREASURE demonstrates effectiveness as both a standalone model and embedding provider, providing comprehensive transaction data representation for various financial applications.

Abstract: Payment networks form the backbone of modern commerce, generating high volumes of transaction records from daily activities. Properly modeling this data can enable applications such as abnormal behavior detection and consumer-level insights for hyper-personalized experiences, ultimately improving people’s lives. In this paper, we present TREASURE, TRansformer Engine As Scalable Universal transaction Representation Encoder, a multipurpose transformer-based foundation model specifically designed for transaction data. The model simultaneously captures both consumer behavior and payment network signals (such as response codes and system flags), providing comprehensive information necessary for applications like accurate recommendation systems and abnormal behavior detection. Verified with industry-grade datasets, TREASURE features three key capabilities: 1) an input module with dedicated sub-modules for static and dynamic attributes, enabling more efficient training and inference; 2) an efficient and effective training paradigm for predicting high-cardinality categorical attributes; and 3) demonstrated effectiveness as both a standalone model that increases abnormal behavior detection performance by 111% over production systems and an embedding provider that enhances recommendation models by 104%. We present key insights from extensive ablation studies, benchmarks against production models, and case studies, highlighting valuable knowledge gained from developing TREASURE.

[502] Optimized scheduling of electricity-heat cooperative system considering wind energy consumption and peak shaving and valley filling

Jin Ye, Lingmei Wang, Shujian Zhang, Haihang Wu

Main category: cs.LG

TL;DR: Proposes PVTD3 algorithm for combined power-heat system scheduling, reducing costs by 6.93-13.59% and grid power fluctuations by 12.8% compared to traditional TD3.

Details

Motivation: Address scheduling optimization challenges for combined power-heat systems under renewable energy integration and multiple uncertainties during global energy transition.

Method: Intelligent scheduling method based on improved Dual-Delay Deep Deterministic Policy Gradient (PVTD3) algorithm with penalty term for grid power purchase variations.

Result: PVTD3 reduces comprehensive costs by 6.93%, 12.68%, 13.59% at 10%, 20%, 30% renewable penetration; reduces grid power fluctuation amplitude by 12.8%; improves energy storage management with lower end-time state values.

Conclusion: PVTD3 algorithm excels in economic efficiency, grid stability, and sustainable scheduling capabilities for energy storage management in combined power-heat systems.

Abstract: With the global energy transition and rapid development of renewable energy, the scheduling optimization challenge for combined power-heat systems under new energy integration and multiple uncertainties has become increasingly prominent. Addressing this challenge, this study proposes an intelligent scheduling method based on the improved Dual-Delay Deep Deterministic Policy Gradient (PVTD3) algorithm. System optimization is achieved by introducing a penalty term for grid power purchase variations. Simulation results demonstrate that under three typical scenarios (10%, 20%, and 30% renewable penetration), the PVTD3 algorithm reduces the system’s comprehensive cost by 6.93%, 12.68%, and 13.59% respectively compared to the traditional TD3 algorithm. Concurrently, it reduces the average fluctuation amplitude of grid power purchases by 12.8%. Regarding energy storage management, the PVTD3 algorithm reduces the end-time state values of low-temperature thermal storage tanks by 7.67-17.67 units while maintaining high-temperature tanks within the 3.59-4.25 safety operating range. Multi-scenario comparative validation demonstrates that the proposed algorithm not only excels in economic efficiency and grid stability but also exhibits superior sustainable scheduling capabilities in energy storage device management.

[503] TiCT: A Synthetically Pre-Trained Foundation Model for Time Series Classification

Chin-Chia Michael Yeh, Uday Singh Saini, Junpeng Wang, Xin Dai, Xiran Fan, Jiarui Sun, Yujie Fan, Yan Zheng

Main category: cs.LG

TL;DR: TiCT is a transformer-based model pre-trained on synthetic data for in-context time series classification, achieving competitive performance without fine-tuning.

Details

Motivation: Address the gap in foundation models for time series classification that can perform in-context learning, reducing the need for expensive labeled data and retraining.

Method: Transformer architecture with bit-based label encoding and output attention mechanism, pre-trained on synthetic data using Mixup-inspired process and data augmentation.

Result: Achieves competitive performance against state-of-the-art supervised methods on UCR Archive using only in-context examples at inference.

Conclusion: TiCT demonstrates that synthetic pre-training enables effective in-context time series classification without weight updates, offering a practical solution for label-scarce scenarios.

Abstract: The ubiquity of time series data creates a strong demand for general-purpose foundation models, yet developing them for classification remains a significant challenge, largely due to the high cost of labeled data. Foundation models capable of in-context learning (ICL) offer a powerful solution, adapting to new tasks with minimal examples and reducing the need for extensive retraining. However, prior work on large-scale time series models has predominantly focused on forecasting, leaving a critical gap for versatile, fine-tuning-free classification. To address this, we introduce TiCT (Time-series in-Context Transformer), a transformer-based model pre-trained exclusively on synthetic data to perform in-context classification. We make two primary technical contributions: 1) a novel architecture featuring a scalable bit-based label encoding and a special output attention mechanism to handle an arbitrary number of classes; and 2) a synthetic pre-training framework that combines a Mixup-inspired process with data augmentation to foster generalization and noise invariance. Extensive evaluations on the UCR Archive show that TiCT achieves competitive performance against state-of-the-art supervised methods. Crucially, this is accomplished using only in-context examples at inference time, without updating a single model weight.

[504] Enhancing Nuclear Reactor Core Simulation through Data-Based Surrogate Models

Perceval Beja-Battais, Alain Grossetête, Nicolas Vayatis

Main category: cs.LG

TL;DR: This paper introduces two surrogate models for nuclear reactor core simulation to enhance Model Predictive Control methods, achieving up to 1000x computational time reduction.

Details

Motivation: There is an increasing need for Nuclear Power Plants to improve flexibility to match renewable energy growth, requiring enhanced simulation methods for operator assistance systems.

Method: Developed two surrogate models (data-driven and physics-informed) from nonlinear stiff ordinary differential equations as alternative simulation schemes for nuclear reactor core simulation.

Result: Both data-driven and physics-informed models can rapidly integrate complex dynamics with very low computational time (up to 1000x time reduction).

Conclusion: Surrogate models provide efficient alternatives for nuclear reactor core simulation, significantly accelerating computational performance while maintaining accuracy.

Abstract: In recent years, there has been an increasing need for Nuclear Power Plants (NPPs) to improve flexibility in order to match the rapid growth of renewable energies. The Operator Assistance Predictive System (OAPS) developed by Framatome addresses this problem through Model Predictive Control (MPC). In this work, we aim to improve MPC methods through data-driven simulation schemes. Thus, from a set of nonlinear stiff ordinary differential equations (ODEs), this paper introduces two surrogate models acting as alternative simulation schemes to enhance nuclear reactor core simulation. We show that both data-driven and physics-informed models can rapidly integrate complex dynamics, with a very low computational time (up to 1000x time reduction).

[505] Hard Samples, Bad Labels: Robust Loss Functions That Know When to Back Off

Nicholas Pellegrino, David Szczecina, Paul Fieguth

Main category: cs.LG

TL;DR: The paper proposes two novel loss functions (Blurry Loss and Piecewise-zero Loss) that improve robustness to label errors by de-weighting difficult-to-classify samples, which are likely mislabelled. These methods outperform state-of-the-art robust loss functions in error detection across various corrupted datasets.

Details

Motivation: Mislabeled training data is common in both benchmark and curated datasets, which adversely affects model performance and generalizability. Existing label error detection frameworks require well-trained models but often rely on training with corrupt data, creating a chicken-and-egg problem unless robust training schemes are used.

Method: Proposed two novel loss functions: Blurry Loss and Piecewise-zero Loss, which enhance robustness to label errors by de-weighting or disregarding difficult-to-classify samples that are likely erroneous. These leverage the insight that mislabelled examples are typically harder to classify.

Result: Comprehensive experiments on artificially corrupted datasets show the proposed loss functions outperform state-of-the-art robust loss functions in nearly all cases, achieving superior F1 scores for error detection. Ablation studies confirm broad applicability to both uniform and non-uniform corruption scenarios.

Conclusion: The proposed robust loss functions enable machine learning practitioners to more effectively identify, prune, or correct errors in training data by improving model robustness to label noise during training.

Abstract: Incorrectly labelled training data are frustratingly ubiquitous in both benchmark and specially curated datasets. Such mislabelling clearly adversely affects the performance and generalizability of models trained through supervised learning on the associated datasets. Frameworks for detecting label errors typically require well-trained / well-generalized models; however, at the same time most frameworks rely on training these models on corrupt data, which clearly has the effect of reducing model generalizability and subsequent effectiveness in error detection – unless a training scheme robust to label errors is employed. We evaluate two novel loss functions, Blurry Loss and Piecewise-zero Loss, that enhance robustness to label errors by de-weighting or disregarding difficult-to-classify samples, which are likely to be erroneous. These loss functions leverage the idea that mislabelled examples are typically more difficult to classify and should contribute less to the learning signal. Comprehensive experiments on a variety of artificially corrupted datasets demonstrate that the proposed loss functions outperform state-of-the-art robust loss functions in nearly all cases, achieving superior F1 scores for error detection. Further analyses through ablation studies offer insights to confirm these loss functions’ broad applicability to cases of both uniform and non-uniform corruption, and with different label error detection frameworks. By using these robust loss functions, machine learning practitioners can more effectively identify, prune, or correct errors in their training data.

[506] An Adaptive Resonance Theory-based Topological Clustering Algorithm with a Self-Adjusting Vigilance Parameter

Naoki Masuyama, Yuichiro Toda, Yusuke Nojima, Hisao Ishibuchi

Main category: cs.LG

TL;DR: Proposes an ART-based topological clustering algorithm with diversity-driven adaptation for hyperparameter-free learning in dynamic environments, outperforming state-of-the-art methods in clustering performance and continual learning.

Details

Motivation: Need for clustering models that adapt to distributional shifts in stationary and nonstationary settings while preserving learned cluster structures and avoiding catastrophic forgetting.

Method: ART-based topological clustering with autonomous adjustment of recalculation interval and vigilance threshold through diversity-driven adaptation mechanism.

Result: Outperforms state-of-the-art methods on 24 real-world datasets in both clustering performance and continual learning capability.

Conclusion: The proposed parameter adaptation effectively mitigates catastrophic forgetting and maintains consistent clustering in evolving data streams.

Abstract: Clustering in stationary and nonstationary settings, where data distributions remain static or evolve over time, requires models that can adapt to distributional shifts while preserving previously learned cluster structures. This paper proposes an Adaptive Resonance Theory (ART)-based topological clustering algorithm that autonomously adjusts its recalculation interval and vigilance threshold through a diversity-driven adaptation mechanism. This mechanism enables hyperparameter-free learning that maintains cluster stability and continuity in dynamic environments. Experiments on 24 real-world datasets demonstrate that the proposed algorithm outperforms state-of-the-art methods in both clustering performance and continual learning capability. These results highlight the effectiveness of the proposed parameter adaptation in mitigating catastrophic forgetting and maintaining consistent clustering in evolving data streams. Source code is available at https://github.com/Masuyama-lab/IDAT

[507] scipy.spatial.transform: Differentiable Framework-Agnostic 3D Transformations in Python

Martin Schuck, Alexander von Rohr, Angela P. Schoellig

Main category: cs.LG

TL;DR: SciPy’s spatial.transform module has been overhauled to support any Python array library, enabling GPU/TPU execution, JIT compilation, and autodiff for 3D rigid-body transforms.

Details

Motivation: Existing robust implementations of 3D transforms were limited to NumPy, preventing adoption in GPU-accelerated and autodiff-based workflows. Edge cases and implementation inconsistencies made correct implementations error-prone.

Method: Complete redesign of SciPy’s spatial.transform functionality to be compatible with any array library implementing the Python array API (JAX, PyTorch, CuPy). Preserves SciPy interface while adding GPU/TPU support, JIT compilation, vectorized batching, and native autodiff.

Result: Successfully merged into SciPy main, providing framework-agnostic, production-grade 3D spatial math. Demonstrated through case studies showing scalability of 3D transforms and accurate rotational dynamics in JAX drone simulation.

Conclusion: The overhaul enables differentiable scientific computing by providing a robust, cross-framework implementation of 3D transforms that supports modern ML pipelines while maintaining mathematical correctness and numerical robustness.

Abstract: Three-dimensional rigid-body transforms, i.e. rotations and translations, are central to modern differentiable machine learning pipelines in robotics, vision, and simulation. However, numerically robust and mathematically correct implementations, particularly on SO(3), are error-prone due to issues such as axis conventions, normalizations, composition consistency and subtle errors that only appear in edge cases. SciPy’s spatial$.$transform module is a rigorously tested Python implementation. However, it historically only supported NumPy, limiting adoption in GPU-accelerated and autodiff-based workflows. We present a complete overhaul of SciPy’s spatial$.$transform functionality that makes it compatible with any array library implementing the Python array API, including JAX, PyTorch, and CuPy. The revised implementation preserves the established SciPy interface while enabling GPU/TPU execution, JIT compilation, vectorized batching, and differentiation via native autodiff of the chosen backend. We demonstrate how this foundation supports differentiable scientific computing through two case studies: (i) scalability of 3D transforms and rotations and (ii) a JAX drone simulation that leverages SciPy’s Rotation for accurate integration of rotational dynamics. Our contributions have been merged into SciPy main and will ship in the next release, providing a framework-agnostic, production-grade basis for 3D spatial math in differentiable systems and ML.

[508] Lower Complexity Bounds for Nonconvex-Strongly-Convex Bilevel Optimization with First-Order Oracles

Kaiyi Ji

Main category: cs.LG

TL;DR: This paper establishes new lower bounds for bilevel optimization in the smooth nonconvex-strongly-convex setting, showing that deterministic algorithms require Ω(κ³⁄²ε⁻²) oracle calls and stochastic algorithms require Ω(κ⁵⁄²ε⁻⁴) oracle calls to find ε-accurate stationary points.

Details

Motivation: Progress on lower bounds for bilevel optimization has been limited due to the complexity of the bilevel structure, while upper bound guarantees have been widely studied. There is a need to understand the fundamental complexity limits of bilevel optimization.

Method: The authors develop new hard instances for bilevel optimization and analyze them under deterministic and stochastic first-order oracle models. They focus on the smooth nonconvex-strongly-convex setting and use zero-respecting algorithm analysis.

Result: For deterministic algorithms: Ω(κ³⁄²ε⁻²) oracle calls are required. For stochastic algorithms: Ω(κ⁵⁄²ε⁻⁴) stochastic oracle calls are necessary. These bounds improve upon known lower bounds for single-level nonconvex optimization and nonconvex-strongly-convex min-max problems.

Conclusion: The results reveal substantial gaps between current upper and lower bounds for bilevel optimization, suggesting that even simplified regimes like quadratic lower-level objectives require further investigation to understand the optimal complexity of bilevel optimization.

Abstract: Although upper bound guarantees for bilevel optimization have been widely studied, progress on lower bounds has been limited due to the complexity of the bilevel structure. In this work, we focus on the smooth nonconvex-strongly-convex setting and develop new hard instances that yield nontrivial lower bounds under deterministic and stochastic first-order oracle models. In the deterministic case, we prove that any first-order zero-respecting algorithm requires at least $Ω(κ^{3/2}ε^{-2})$ oracle calls to find an $ε$-accurate stationary point, improving the optimal lower bounds known for single-level nonconvex optimization and for nonconvex-strongly-convex min-max problems. In the stochastic case, we show that at least $Ω(κ^{5/2}ε^{-4})$ stochastic oracle calls are necessary, again strengthening the best known bounds in related settings. Our results expose substantial gaps between current upper and lower bounds for bilevel optimization and suggest that even simplified regimes, such as those with quadratic lower-level objectives, warrant further investigation toward understanding the optimal complexity of bilevel optimization under standard first-order oracles.

[509] QiMeng-CRUX: Narrowing the Gap between Natural Language and Verilog via Core Refined Understanding eXpression

Lei Huang, Rui Zhang, Jiaming Guo, Yang Zhang, Di Huang, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen

Main category: cs.LG

TL;DR: CRUX-V introduces a structured intermediate space (CRUX) and two-stage training to bridge the gap between ambiguous natural language descriptions and precise Verilog code generation, achieving state-of-the-art performance.

Details

Motivation: Existing HDL generation approaches rely on ambiguous, redundant natural language descriptions that pose challenges for precise Verilog code generation, creating a gap between open-ended natural language and domain-specific constrained target space.

Method: Introduces CRUX (Core Refined Understanding eXpression) structured intermediate space and a two-stage training framework with Joint Expression Modeling and Dual-Space Optimization to enhance CRUX and Verilog code quality.

Result: CRUX-V achieves state-of-the-art performance among general models on multiple Verilog generation benchmarks, particularly excelling in challenging design tasks. CRUX space is transferable and beneficial as input prompts for other code models.

Conclusion: CRUX effectively narrows the gap between free-form natural language descriptions and precise Verilog generation, demonstrating the value of structured intermediate representations in hardware code generation.

Abstract: Large language models (LLMs) have shown promising capabilities in hardware description language (HDL) generation. However, existing approaches often rely on free-form natural language descriptions that are often ambiguous, redundant, and unstructured, which poses significant challenges for downstream Verilog code generation. We treat hardware code generation as a complex transformation from an open-ended natural language space to a domain-specific, highly constrained target space. To bridge this gap, we introduce Core Refined Understanding eXpression (CRUX), a structured intermediate space that captures the essential semantics of user intent while organizing the expression for precise Verilog code generation. We further design a two-stage training framework, comprising Joint Expression Modeling and Dual-Space Optimization, to enhance the quality of both CRUX and Verilog code. Experiments across multiple Verilog generation benchmarks demonstrate that our model, CRUX-V, achieves state-of-the-art performance among general models, particularly under challenging design tasks. Furthermore, the CRUX space proves transferable and beneficial when used as input prompts for other code models, highlighting its effectiveness in narrowing the gap between free-form natural language descriptions and precise Verilog generation.

[510] MoRE: Batch-Robust Multi-Omics Representations from Frozen Pre-trained Transformers

Audrey Pei-Hsuan Chen

Main category: cs.LG

TL;DR: MoRE is a parameter-efficient framework that repurposes frozen pre-trained transformers to align heterogeneous multi-omics data into a shared latent space using lightweight adapters and contrastive learning.

Details

Motivation: Multi-omics data integration faces challenges from extreme dimensionality, modality heterogeneity, and batch effects. Pre-trained transformers show generalization capabilities but their application to multi-omics remains underexplored.

Method: Uses frozen pre-trained transformers with modality-specific adapters and task-adaptive fusion layer. Optimizes masked modeling objective with supervised contrastive and batch-invariant alignment losses.

Result: Achieves competitive batch robustness and biological conservation while significantly reducing trainable parameters compared to fully fine-tuned models. Demonstrates strong performance in integration fidelity, rare population detection, and modality transfer.

Conclusion: MoRE represents a practical step toward general-purpose omics foundation models by enabling efficient multi-omics integration with frozen transformer backbones.

Abstract: Representation learning on multi-omics data is challenging due to extreme dimensionality, modality heterogeneity, and cohort-specific batch effects. While pre-trained transformer backbones have shown broad generalization capabilities in biological sequence modeling, their application to multi-omics integration remains underexplored. We present MoRE (Multi-Omics Representation Embedding), a framework that repurposes frozen pre-trained transformers to align heterogeneous assays into a shared latent space. Unlike purely generative approaches, MoRE employs a parameter-efficient fine-tuning (PEFT) strategy, prioritizing cross-sample and cross-modality alignment over simple sequence reconstruction. Specifically, MoRE attaches lightweight, modality-specific adapters and a task-adaptive fusion layer to the frozen backbone. It optimizes a masked modeling objective jointly with supervised contrastive and batch-invariant alignment losses, yielding structure-preserving embeddings that generalize across unseen cell types and platforms. We benchmark MoRE against established baselines, including scGPT, scVI, and Harmony with Scrublet, evaluating integration fidelity, rare population detection, and modality transfer. Our results demonstrate that MoRE achieves competitive batch robustness and biological conservation while significantly reducing trainable parameters compared to fully fine-tuned models. This work positions MoRE as a practical step toward general-purpose omics foundation models.

[511] Adam Simplified: Bias Correction Debunked

Sam Laing, Antonio Orvieto

Main category: cs.LG

TL;DR: Bias correction in Adam optimizer provides no performance improvement in optimal configurations and can be detrimental without proper learning rate scheduling, challenging its universal necessity.

Details

Motivation: To investigate the empirical necessity of bias-correction in Adam optimizer, a feature whose contribution remains poorly understood despite being a cornerstone of deep learning.

Method: Systematic ablations on vision and language modeling tasks, analyzing bias correction as implicit learning rate scheduling dependent on β₁, β₂ hyperparameters.

Result: Bias correction leads to no improvement in final test performance in optimal hyperparameter configurations, and can be detrimental without appropriate learning rate scheduling.

Conclusion: The findings challenge the universal inclusion of bias correction in Adam optimizer, suggesting its necessity is overstated.

Abstract: The Adam optimizer is a cornerstone of modern deep learning, yet the empirical necessity of each of its individual components is often taken for granted. This paper presents a focused investigation into the role of bias-correction, a feature whose contribution remains poorly understood. Through a series of systematic ablations on vision and language modelling tasks, we demonstrate that the conventional wisdom surrounding bias correction is misleading. In particular, we demonstrate that in the optimal hyper-parameter configuration, the inclusion of bias correction leads to no improvement in final test performance. Moreover, unless appropriate learning rate scheduling is implemented, the inclusion of bias correction can sometimes be detrimental to performance. We further reinterpret bias correction as a form of implicit learning rate scheduling whose behaviour is strongly dependent on the choice of smoothing hyper-parameters $β_1, β_2 \in [0,1)$. Our findings challenge the universal inclusion of this component.

cs.MA

[512] MTTR-A: Measuring Cognitive Recovery Latency in Multi-Agent Systems

Barak Or

Main category: cs.MA

TL;DR: This paper introduces MTTR-A (Mean Time-to-Recovery for Agentic Systems) as a runtime metric to quantify cognitive recovery latency in multi-agent systems, adapting classical reliability metrics to measure reasoning coherence restoration.

Details

Motivation: Existing observability tools monitor system outputs but cannot quantify how rapidly agentic workflows recover once reasoning coherence is lost, creating a gap in measuring cognitive stability in distributed AI systems.

Method: Adapted classical reliability metrics (MTTR, MTBF) to cognitive domain, conducted benchmark simulation using AG News corpus and LangGraph framework, modeling recovery latencies across multiple reflex modes with 200 runs.

Result: Automated reflexes restored stability within ~6s average, human-approval interventions took ~12s. Median MTTR-A was 6.21±2.14s, MTBF=6.7±2.14s, NRR=0.08, demonstrating measurable runtime resilience across reflex strategies.

Conclusion: Formalizing recovery latency as a quantifiable property establishes foundation for runtime dependability in agentic cognition, transforming cognitive recovery from ad-hoc process into standardized, interpretable performance metric.

Abstract: Ensuring cognitive stability in autonomous multi-agent systems (MAS) is a central challenge for large-scale, distributed AI. While existing observability tools monitor system outputs, they cannot quantify how rapidly agentic workflows recover once reasoning coherence has been lost. We adapt classical reliability metrics-Mean Time-to-Recovery (MTTR), Mean Time Between Failures (MTBF), and related ratios-into the cognitive domain, defining MTTR-A (Mean Time-to-Recovery for Agentic Systems) as a runtime measure of cognitive recovery latency. MTTR-A quantifies the time required for a MAS to detect reasoning drift and restore consistent operation, capturing the recovery of reasoning coherence rather than infrastructural repair. A benchmark simulation using the AG~News corpus and the LangGraph orchestration framework was conducted, modeling recovery latencies across multiple reflex modes. Automated reflexes restored stability within approximately 6s on average, while human-approval interventions required about 12s. Across 200 runs, the median simulated MTTR-A was 6.21+-2.14s, MTBF=6.7+-2.14s, and NRR=0.08, demonstrating measurable runtime resilience across reflex strategies. By formalizing recovery latency as a quantifiable property of distributed reasoning-and deriving reliability bounds linking recovery time and cognitive uptime-this work establishes a foundation for runtime dependability in agentic cognition, transforming cognitive recovery from an ad-hoc process into a standardized, interpretable performance

[513] Resilient Charging Infrastructure via Decentralized Coordination of Electric Vehicles at Scale

Chuhao Qin, Alexandru Sorici, Andrei Olaru, Evangelos Pournaras, Adina Magda Florea

Main category: cs.MA

TL;DR: A collective learning framework for EV charging coordination that balances individual comfort with system efficiency, achieving Pareto-optimal trade-offs under station outages and dynamic demand.

Details

Motivation: Existing decentralized EV charging approaches struggle under severe contingencies like station outages or charging request surges, leading to long queues and reduced driver comfort.

Method: Proposed a collective learning-based coordination framework where EVs adaptively shift priority between comfort and efficiency, with recommendations for adaptive charging behaviors.

Result: Outperforms baseline methods by significantly reducing travel and queuing time. EVs behaving selfishly or altruistically at appropriate moments achieve shorter waiting times than those maintaining moderate behavior.

Conclusion: The approach demonstrates improved resilience and trustworthiness of decentralized EV charging infrastructure under high fractions of station outages and adversarial EVs.

Abstract: The rapid adoption of electric vehicles (EVs) introduces major challenges for decentralized charging control. Existing decentralized approaches efficiently coordinate a large number of EVs to select charging stations while reducing energy costs, preventing power peak and preserving driver privacy. However, they often struggle under severe contingencies, such as station outages or unexpected surges in charging requests. These situations create competition for limited charging slots, resulting in long queues and reduced driver comfort. To address these limitations, we propose a novel collective learning-based coordination framework that allows EVs to balance individual comfort on their selections against system-wide efficiency, i.e., the overall queues across all stations. In the framework, EVs are recommended for adaptive charging behaviors that shift priority between comfort and efficiency, achieving Pareto-optimal trade-offs under varying station capacities and dynamic spatio-temporal EV distribution. Experiments using real-world data from EVs and charging stations show that the proposed approach outperforms baseline methods, significantly reducing travel and queuing time. The results reveal that, under uncertain charging conditions, EV drivers that behave selfishly or altruistically at the right moments achieve shorter waiting time than those maintaining moderate behavior throughout. Our findings under high fractions of station outages and adversarial EVs further demonstrate improved resilience and trustworthiness of decentralized EV charging infrastructure.

[514] Tool-RoCo: An Agent-as-Tool Self-organization Large Language Model Benchmark in Multi-robot Cooperation

Ke Zhang, Xiaoning Zhao, Ce Zheng, Jiahong Ning, Dandan Zhu, Wenqi Zhang, Chen Sun, Toshiharu Sugawara

Main category: cs.MA

TL;DR: Tool-RoCo is a benchmark for evaluating LLMs in multi-agent cooperation using tool usage as a metric, with four paradigms testing different autonomy levels across three robot tasks.

Details

Motivation: Existing LLM-based multi-agent systems rely on predefined orchestration and ignore agent autonomy, lacking proper evaluation methods for self-organization and cooperation.

Method: Proposes Tool-RoCo benchmark with four LLM paradigms (centralized/decentralized cooperation and self-organization) using tool usage to evaluate cooperation across three multi-robot tasks (SORT, PACK, CABINET).

Result: Cooperative tools accounted for only 7.09% of all tools, showing LLM-based agents rarely invoke others as assistants. Activation tools dominated at 96.42%, indicating LLMs tend to maintain active agents without adaptive deactivation.

Conclusion: Tool-RoCo provides a systematic benchmark to evaluate LLM autonomy and cooperation, revealing current limitations in multi-agent self-organization and adaptive coordination.

Abstract: This study proposes Tool-RoCo, a novel benchmark for evaluating large language models (LLMs) in long-term multi-agent cooperation based on RoCo, a multi-robot cooperative benchmark. Recent research on LLM-based multi-agent systems has relied on predefined orchestration, while ignoring agent autonomy. Tool-RoCo treats other agents as tools and introduces cooperative tools, leveraging tool usage to evaluate multi-agent cooperation and self-organization. Tool usage means that each agent (LLM) selects a tool from a candidate set based on the current state, receives feedback, and adjusts its selection in subsequent rounds. To evaluate different autonomy levels, we propose four LLM paradigms: (1) centralized cooperation, where a single LLM allocates tools to all agents; (2) centralized self-organization, where a central LLM autonomously activates agents while keeping others inactive; (3) decentralized cooperation, where each agent has its own LLM and calls tools based on local information; and (4) self-organization, where a randomly chosen initial agent can request collaboration, activating additional agents via tool calls. Tool-RoCo includes three multi-robot tasks, SORT, PACK, and CABINET, to measure format and parameter accuracy and agent coordination through tool usage. The results using several LLMs showed that cooperative tools accounted for only 7.09% of all tools, indicating that LLM-based agents rarely invoked others as assistants. Moreover, activation tools accounted for 96.42%, suggesting that current LLMs tend to maintain active agents while seldom deactivating them for adaptive coordination. Tool-RoCo provides a systematic benchmark to evaluate LLM autonomy and cooperation in multi-agent tasks. Code and Demo: https://github.com/ColaZhang22/Tool-Roco

[515] BAMAS: Structuring Budget-Aware Multi-Agent Systems

Liming Yang, Junyu Luo, Xuanzhe Liu, Yiling Lou, Zhenpeng Chen

Main category: cs.MA

TL;DR: BAMAS is a budget-aware multi-agent system that optimizes LLM selection and collaboration topology to reduce costs while maintaining performance.

Details

Motivation: Existing multi-agent systems rarely address budget constraints, making cost an important consideration for practical deployment as systems scale in complexity.

Method: BAMAS uses Integer Linear Programming to select optimal LLMs balancing performance and cost, then employs reinforcement learning to determine interaction topology, and finally instantiates the system based on selected agents and collaboration structure.

Result: BAMAS achieves comparable performance to state-of-the-art methods while reducing costs by up to 86% across three representative tasks.

Conclusion: BAMAS provides an effective framework for building cost-efficient multi-agent systems that maintain performance while significantly reducing deployment costs.

Abstract: Large language model (LLM)-based multi-agent systems have emerged as a powerful paradigm for enabling autonomous agents to solve complex tasks. As these systems scale in complexity, cost becomes an important consideration for practical deployment. However, existing work rarely addresses how to structure multi-agent systems under explicit budget constraints. In this paper, we propose BAMAS, a novel approach for building multi-agent systems with budget awareness. BAMAS first selects an optimal set of LLMs by formulating and solving an Integer Linear Programming problem that balances performance and cost. It then determines how these LLMs should collaborate by leveraging a reinforcement learning-based method to select the interaction topology. Finally, the system is instantiated and executed based on the selected agents and their collaboration topology. We evaluate BAMAS on three representative tasks and compare it with state-of-the-art agent construction methods. Results show that BAMAS achieves comparable performance while reducing cost by up to 86%.

[516] Policy Optimization and Multi-agent Reinforcement Learning for Mean-variance Team Stochastic Games

Junkai Hu, Li Xia

Main category: cs.MA

TL;DR: This paper studies mean-variance team stochastic games where agents share a common mean-variance objective but act independently, addressing challenges of non-additive variance and non-stationarity through sensitivity-based optimization and proposing novel multi-agent algorithms.

Details

Motivation: Mean-variance team stochastic games face two key challenges: variance metric is neither additive nor Markovian in dynamic settings, and simultaneous policy updates create non-stationary environments, making traditional dynamic programming inapplicable.

Method: The authors use sensitivity-based optimization to derive performance difference and derivative formulas, prove existence of deterministic Nash policies, and propose MV-MAPI algorithm with sequential updates. They extend this to MV-MATRPO for unknown environments using trust region methods.

Result: The MV-MAPI algorithm converges to first-order stationary points, and specific conditions are derived for stationary points to be (local) Nash equilibria and strict local optima. Performance lower bounds are established for policy updates.

Conclusion: The proposed methods effectively address mean-variance optimization in multi-agent settings, with numerical experiments demonstrating applicability to energy management in microgrid systems.

Abstract: We study a long-run mean-variance team stochastic game (MV-TSG), where each agent shares a common mean-variance objective for the system and takes actions independently to maximize it. MV-TSG has two main challenges. First, the variance metric is neither additive nor Markovian in a dynamic setting. Second, simultaneous policy updates of all agents lead to a non-stationary environment for each individual agent. Both challenges make dynamic programming inapplicable. In this paper, we study MV-TSGs from the perspective of sensitivity-based optimization. The performance difference and performance derivative formulas for joint policies are derived, which provide optimization information for MV-TSGs. We prove the existence of a deterministic Nash policy for this problem. Subsequently, we propose a Mean-Variance Multi-Agent Policy Iteration (MV-MAPI) algorithm with a sequential update scheme, where individual agent policies are updated one by one in a given order. We prove that the MV-MAPI algorithm converges to a first-order stationary point of the objective function. By analyzing the local geometry of stationary points, we derive specific conditions for stationary points to be (local) Nash equilibria, and further, strict local optima. To solve large-scale MV-TSGs in scenarios with unknown environmental parameters, we extend the idea of trust region methods to MV-MAPI and develop a multi-agent reinforcement learning algorithm named Mean-Variance Multi-Agent Trust Region Policy Optimization (MV-MATRPO). We derive a performance lower bound for each update of joint policies. Finally, numerical experiments on energy management in multiple microgrid systems are conducted.

[517] MAPF-HD: Multi-Agent Path Finding in High-Density Environments

Hiroya Makino, Seigo Ito

Main category: cs.MA

TL;DR: Proposes MAPF-HD framework with PHANS method for efficient multi-agent path finding in high-density environments, solving large problems in seconds instead of minutes.

Details

Motivation: Existing ILP-based MAPF methods are too slow for practical use in high-density scenarios like warehouses and valet parking, taking tens to hundreds of seconds even for small environments.

Method: PHANS (phased null-agent swapping) - a heuristic approach that incrementally swaps positions between agents and empty vertices to optimize paths efficiently.

Result: Solves MAPF-HD problems within seconds even in large environments with over 700 cells, significantly faster than ILP-based methods.

Conclusion: The proposed method enables practical deployment of MAPF in real-world applications like warehouse logistics, traffic management, and crowd control by dramatically reducing computation time.

Abstract: Multi-agent path finding (MAPF) involves planning efficient paths for multiple agents to move simultaneously while avoiding collisions. In typical warehouse environments, agents are often sparsely distributed along aisles; however, increasing the agent density can improve space efficiency. When the agent density is high, it becomes necessary to optimize the paths not only for goal-assigned agents but also for those obstructing them. This study proposes a novel MAPF framework for high-density environments (MAPF-HD). Several studies have explored MAPF in similar settings using integer linear programming (ILP). However, ILP-based methods require substantial computation time to optimize all agent paths simultaneously. Even in small grid-based environments with fewer than $100$ cells, these computations can take tens to hundreds of seconds. Such high computational costs render these methods impractical for large-scale applications such as automated warehouses and valet parking. To address these limitations, we introduce the phased null-agent swapping (PHANS) method. PHANS employs a heuristic approach to incrementally swap positions between agents and empty vertices. This method solves the MAPF-HD problem within a few seconds, even in large environments containing more than $700$ cells. The proposed method has the potential to improve efficiency in various real-world applications such as warehouse logistics, traffic management, and crowd control. The implementation is available at https://github.com/ToyotaCRDL/MAPF-in-High-Density-Envs.

cs.MM

[518] Prompt-Aware Adaptive Elastic Weight Consolidation for Continual Learning in Medical Vision-Language Models

Ziyuan Gao, Philippe Morel

Main category: cs.MM

TL;DR: PA-EWC is a novel continual learning method that reduces catastrophic forgetting in medical AI by using prompt-guided parameter specialization and adaptive Fisher Information computation.

Details

Motivation: Medical AI systems face catastrophic forgetting when learning new imaging protocols while needing to retain prior diagnostic capabilities, especially for vision-language models that must preserve complex cross-modal alignments.

Method: Systematically categorizes model parameters based on functional roles, uses prompt-guided parameter specialization, incorporates adaptive Fisher Information computation with gradient stability analysis, and develops weighted complexity metrics based on medical terminology density.

Result: Reduces catastrophic forgetting by up to 17.58% compared to baseline methods, with performance improvements of 4.30% on chest X-ray pathology localization and 6.06% on polyp segmentation across five medical imaging datasets.

Conclusion: PA-EWC effectively addresses catastrophic forgetting in medical AI systems through targeted parameter protection and adaptation, enabling better preservation of diagnostic capabilities across diverse imaging modalities.

Abstract: Medical AI systems face catastrophic forgetting when deployed in clinical settings, where models must learn new imaging protocols while retaining prior diagnostic capabilities. This challenge is particularly acute for medical vision-language models that must preserve complex cross-modal alignments between medical images and clinical terminology across diverse imaging modalities. We introduce Prompt- Aware Adaptive Elastic Weight Consolidation (PA-EWC), a novel continual learning approach that addresses catastrophic forgetting through prompt-guided parameter specialization. Our method systematically categorizes model parameters based on their functional roles in processing visual-descriptive, spatial-guided, and medical-semantic information, enabling targeted protection of critical knowledge while allowing adaptation to new clinical requirements. PA-EWC incorporates adaptive Fisher Information computation with gradient stability analysis and develops weighted complexity metrics based on medical terminology density. We evaluate our approach across five medical imaging datasets (Kvasir-SEG, ISIC 2018, CheXlocalize, BUSI, CAMUS) representing diverse modalities including endoscopy, dermoscopy, radiography, and ultrasound. Experimental results demonstrate that PA-EWC reduces catastrophic forgetting by up to 17.58% compared to baseline methods, with performance improvements of 4.30% on chest X-ray pathology localization and 6.06% on polyp segmentation.

[519] AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control

Xinyue Guo, Xiaoran Yang, Lipan Zhang, Jianxuan Yang, Zhao Wang, Jian Luan

Main category: cs.MM

TL;DR: AV-Edit is a generative sound effect editing framework that enables fine-grained audio editing in videos by leveraging visual, audio, and text semantics through multimodal pre-training and diffusion transformers.

Details

Motivation: Current sound effect editing approaches rely on low-level signal processing or coarse text prompts, resulting in limited flexibility and suboptimal audio quality.

Method: Uses contrastive audio-visual masking autoencoder (CAV-MAE-Edit) for multimodal pre-training, then trains an editorial Multimodal Diffusion Transformer (MM-DiT) with correlation-based feature gating to remove irrelevant sounds and generate missing audio elements.

Result: Generates high-quality audio with precise modifications based on visual content, achieving state-of-the-art performance in sound effect editing and strong competitiveness in audio generation.

Conclusion: AV-Edit effectively addresses limitations of existing approaches by integrating multimodal semantics for fine-grained sound effect editing in videos.

Abstract: Sound effect editing-modifying audio by adding, removing, or replacing elements-remains constrained by existing approaches that rely solely on low-level signal processing or coarse text prompts, often resulting in limited flexibility and suboptimal audio quality. To address this, we propose AV-Edit, a generative sound effect editing framework that enables fine-grained editing of existing audio tracks in videos by jointly leveraging visual, audio, and text semantics. Specifically, the proposed method employs a specially designed contrastive audio-visual masking autoencoder (CAV-MAE-Edit) for multimodal pre-training, learning aligned cross-modal representations. These representations are then used to train an editorial Multimodal Diffusion Transformer (MM-DiT) capable of removing visually irrelevant sounds and generating missing audio elements consistent with video content through a correlation-based feature gating training strategy. Furthermore, we construct a dedicated video-based sound editing dataset as an evaluation benchmark. Experiments demonstrate that the proposed AV-Edit generates high-quality audio with precise modifications based on visual content, achieving state-of-the-art performance in the field of sound effect editing and exhibiting strong competitiveness in the domain of audio generation.

[520] PixelatedScatter: Arbitrary-level Visual Abstraction for Large-scale Multiclass Scatterplots

Ziheng Guo, Tianxiang Wei, Zeyu Li, Lianghao Zhang, Sisi Li, Jiawan Zhang

Main category: cs.MM

TL;DR: A visual abstraction method for large-scale scatterplots that better preserves features in medium-to-low density regions through iso-density partitioning, pixel allocation, and distribution reconstruction.

Details

Motivation: Current scatterplot abstraction methods lose features in medium-to-low density regions, and overdraw is inevitable in large-scale scatterplots.

Method: Three-step approach: 1) Partition scatterplot into iso-density regions and equalize visual density, 2) Allocate pixels for different classes within each region, 3) Reconstruct data distribution based on pixels.

Result: User studies and evaluations show the method better preserves features compared to previous methods, with special advantage for ultra-high dynamic range data distributions.

Conclusion: The proposed visual abstraction method provides better feature preservation across arbitrary abstraction levels, particularly in medium-to-low density regions of large-scale scatterplots.

Abstract: Overdraw is inevitable in large-scale scatterplots. Current scatterplot abstraction methods lose features in medium-to-low density regions. We propose a visual abstraction method designed to provide better feature preservation across arbitrary abstraction levels for large-scale scatterplots, particularly in medium-to-low density regions. The method consists of three closely interconnected steps: first, we partition the scatterplot into iso-density regions and equalize visual density; then, we allocate pixels for different classes within each region; finally, we reconstruct the data distribution based on pixels. User studies, quantitative and qualitative evaluations demonstrate that, compared to previous methods, our approach better preserves features and exhibits a special advantage when handling ultra-high dynamic range data distributions.

eess.AS

[521] Towards Audio Token Compression in Large Audio Language Models

Saurabhchand Bhati, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass

Main category: eess.AS

TL;DR: The paper proposes methods to compress audio tokens in Large Audio Language Models to reduce computational complexity while maintaining performance, achieving up to 3x token reduction.

Details

Motivation: LALMs face scalability issues due to quadratic attention complexity and high audio token rates, limiting deployment on resource-constrained platforms and handling of long-form audio.

Method: Uses unsupervised segmentation and uniform average pooling to reduce audio tokens before LLM decoder, with low-rank adapters for fine-tuning to mitigate performance degradation.

Result: Compressed LALMs achieve performance close to frame-level models while reducing input audio tokens by up to 3x before the LLM backbone.

Conclusion: The proposed compression techniques effectively address LALM scalability challenges while maintaining task performance on speech recognition and speech-to-speech translation.

Abstract: Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. These challenges make it difficult to extend LALMs to long-form audio and to deploy them on resource-constrained platforms such as edge devices. In this paper, we explore techniques such as unsupervised segmentation, uniform average pooling, etc., to reduce the number of audio tokens generated by the LALM’s audio encoder but before they are consumed by the LLM decoder. To mitigate potential performance degradation introduced by the compressed representations, we employ low-rank adapters to finetune the model. We evaluate our proposed models on two tasks, automatic speech recognition and speech-to-speech translation tasks, that are dependent on effectively uncovering the underlying lexical content of the input signal and study the effect of downsampling on these tasks. Experimental results show that compressed LALMs can achieve performance closer to frame-level LALMs while reducing the input audio token count upto three times before the LLM backbone.

[522] RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data

Zhisheng Zheng, Xiaohang Sun, Tuan Dinh, Abhishek Yanamandra, Abhinav Jain, Zhu Liu, Sunil Hadap, Vimal Bhat, Manoj Aggarwal, Gerard Medioni, David Harwath

Main category: eess.AS

TL;DR: RosettaSpeech is a zero-shot speech-to-speech translation framework that eliminates the need for parallel speech data by using monolingual speech-text data with machine translation supervision, achieving state-of-the-art results.

Details

Motivation: The scarcity of parallel speech corpora hampers speech-to-speech translation, forcing reliance on complex multi-stage pipelines. The paper aims to simplify S2ST by eliminating the need for parallel speech-to-speech pairs.

Method: Uses monolingual speech-text data augmented by machine translation supervision, with text as an intermediate bridge during training but functions as direct end-to-end speech-to-speech model at inference.

Result: Achieves state-of-the-art results: ASR-BLEU score of 25.17 for German-to-English and 29.86 for Spanish-to-English (27% and 14% relative gains). Single model delivers strong many-to-one translation performance (FR/ES/DE -> EN).

Conclusion: RosettaSpeech offers a scalable path to creating high-quality, speaker-preserving S2ST for broader language coverage by relying on abundant parallel text rather than difficult-to-acquire parallel speech data.

Abstract: The scarcity of parallel speech corpora critically hampers speech-to-speech translation (S2ST), often forcing reliance on complex, multi-stage pipelines. This paper introduces RosettaSpeech, a novel and simplified framework for zero-shot S2ST that is trained on monolingual speech-text data augmented by machine translation supervision. While our method leverages the linguistic knowledge inherent in text-based NMT models, it strictly eliminates the need for parallel speech-to-speech pairs. Our model uniquely uses text as an intermediate bridge during training but functions as a direct, end-to-end speech-to-speech model at inference. This streamlined approach achieves state-of-the-art results on standard benchmarks. For instance, on the CVSS-C test set, RosettaSpeech outperforms leading systems, achieving an ASR-BLEU score of 25.17 for German-to-English and 29.86 for Spanish-to-English-relative gains of over 27% and 14%, respectively. Furthermore, we demonstrate that a single model can deliver strong many-to-one translation performance (FR/ES/DE -> EN). We also provide a foundational analysis of how training data scaling impacts model performance. By prioritizing reliance on abundant parallel text rather than difficult-to-acquire parallel speech, RosettaSpeech offers a scalable path to creating high-quality, speaker-preserving S2ST for a much broader array of languages.

[523] Evaluation of an ITD-to-ILD Transformation as a Method to Restore the Spatial Benefit in Speech Intelligibility in Hearing Impaired Listeners

Timm-Jonas Bäumer, Johannes W. de Vries, Stephan Töpken, Richard C. Hendriks, Peyman Goli, Steven van de Par

Main category: eess.AS

TL;DR: This study investigates transforming low-frequency Interaural Time Differences (ITDs) into Interaural Level Differences (ILDs) to restore binaural benefits for hearing impaired listeners, showing improved speech intelligibility.

Details

Motivation: Hearing impaired listeners often have limited sensitivity to ITDs, which reduces speech intelligibility in complex environments. The study aims to determine if transforming ITDs into ILDs can reintroduce binaural benefits.

Method: Two experiments with HI listeners: 1) Measured ITD sensitivity thresholds using binaurally phase-shifted sinusoids at different frequencies; 2) Measured Speech Reception Thresholds (SRTs) in different binaural configurations by manipulating Head-Related Transfer Functions (HRTFs).

Result: HI listeners showed increased ITD thresholds at higher frequencies. Removing ITDs decreased SRTs by ~1 dB. Transforming low-frequency ITDs into ILDs improved performance for lateral target speakers. Adding low-frequency ILDs while preserving ITDs significantly improved performance for speakers in all directions.

Conclusion: Transforming low-frequency ITDs into ILDs can effectively restore binaural benefits for hearing impaired listeners and should be implemented in hearing aids and cochlear implants.

Abstract: To improve speech intelligibility in complex everyday situations, the human auditory system partially relies on Interaural Time Differences (ITDs) and Interaural Level Differences (ILDs). However, hearing impaired (HI) listeners often exhibit limited sensitivity to ITDs, resulting in decreased speech intelligibility performance. This study aimed to investigate whether transforming low-frequency ITDs into ILDs could reintroduce a binaural benefit for HI listeners. We conducted two experiments with HI listeners. The first experiment used binaurally phase-shifted sinusoids at different frequencies to evaluate the HI listeners ITD sensitivity threshold. All subjects had an increased ITD threshold at higher frequencies, with different ITD sensitivities between the subjects in the lower frequencies. In the second experiment, Speech Reception Thresholds (SRTs) were measured in different binaural configurations by manipulating Head-Related Transfer Functions (HRTFs). The results showed that, despite the decreased ITD sensitivity, removing ITDs decreased SRTs by approximately 1 dB compared to the unprocessed baseline, where ITDs and ILDs are available. Furthermore, substituting low-frequency ITDs with ILDs yielded an improvement for a lateral target speaker. Adding the low-frequency ILDs while preserving the ITDs caused a significant improvement for speakers in all directions. These findings suggest that the proposed transformation method could be effective in restoring binaural benefits in HI listeners. The results of this study suggest the use of such transformation techniques to be implemented in hearing aids and cochlear implants, directly benefiting HI listeners.

[524] The Spheres Dataset: Multitrack Orchestral Recordings for Music Source Separation and Information Retrieval

Jaime Garcia-Martinez, David Diaz-Guerra, John Anderson, Ricardo Falcon-Perez, Pablo Cabañas-Molero, Tuomas Virtanen, Julio J. Carabias-Orti, Pedro Vera-Candeas

Main category: eess.AS

TL;DR: The Spheres dataset provides multitrack orchestral recordings for machine learning research in music source separation and MIR tasks, featuring classical works with 23 microphone recordings and isolated stems for training.

Details

Motivation: To advance machine learning research in music source separation and related MIR tasks within the classical music domain by providing high-quality orchestral recordings with controlled acoustic properties.

Method: Created a dataset of over one hour of orchestral recordings using 23 microphones (close spot, main, ambient) capturing Tchaikovsky’s Romeo and Juliet and Mozart’s Symphony No. 40, plus chromatic scales and solo excerpts. Room impulse responses were estimated for acoustic characterization.

Result: Baseline evaluations using X-UMX based models show potential and challenges for orchestral family separation and microphone debleeding, demonstrating the dataset’s value for benchmarking complex orchestral scenarios.

Conclusion: The Spheres dataset provides valuable resources for benchmarking and exploring new approaches to source separation, localization, dereverberation, and immersive rendering in classical music.

Abstract: This paper introduces The Spheres dataset, multitrack orchestral recordings designed to advance machine learning research in music source separation and related MIR tasks within the classical music domain. The dataset is composed of over one hour recordings of musical pieces performed by the Colibrì Ensemble at The Spheres recording studio, capturing two canonical works - Tchaikovsky’s Romeo and Juliet and Mozart’s Symphony No. 40 - along with chromatic scales and solo excerpts for each instrument. The recording setup employed 23 microphones, including close spot, main, and ambient microphones, enabling the creation of realistic stereo mixes with controlled bleeding and providing isolated stems for supervised training of source separation models. In addition, room impulse responses were estimated for each instrument position, offering valuable acoustic characterization of the recording space. We present the dataset structure, acoustic analysis, and baseline evaluations using X-UMX based models for orchestral family separation and microphone debleeding. Results highlight both the potential and the challenges of source separation in complex orchestral scenarios, underscoring the dataset’s value for benchmarking and for exploring new approaches to separation, localization, dereverberation, and immersive rendering of classical music.

eess.IV

[525] A Fractional Variational Approach to Spectral Filtering Using the Fourier Transform

Nelson H. T. Lemes, José Claudinei Ferreira, Higor V. M. Ferreira

Main category: eess.IV

TL;DR: A variational method using fractional derivatives in the frequency domain for Raman spectrum denoising, optimized via Shannon entropy to balance noise removal and feature preservation.

Details

Motivation: Fluorescence interference and noise obscure critical spectral features in Raman analysis, requiring methods that preserve essential chemical information while removing noise.

Method: Minimizes a functional with fractional derivatives, reformulated in frequency domain via Fourier transform, with regularization parameter and derivative order optimized using Shannon entropy.

Result: The method effectively removes noise while preserving peak position, intensity, and area in simulated Raman data and image processing applications.

Conclusion: The combination of variational approach, fractional derivatives, and entropy-based optimization produces an efficient, robust, and easily implementable filter for spectral analysis.

Abstract: The interference of fluorescence signals and noise remains a significant challenge in Raman spectrum analysis, often obscuring subtle spectral features that are critical for accurate analysis. Inspired by variational methods similar to those used in image denoising, our approach minimizes a functional involving fractional derivatives to balance noise suppression with the preservation of essential chemical features of the signal, such as peak position, intensity, and area. The original problem is reformulated in the frequency domain through the Fourier transform, making the implementation simple and fast. In this work, we discuss the theoretical framework, practical implementation, and the advantages and limitations of this method in the context of {simulated} Raman data, as well as in image processing. The main contribution of this article is the combination of a variational approach in the frequency domain, the use of fractional derivatives, and the optimization of the {regularization parameter and} derivative order through the concept of Shannon entropy. This work explores how the fractional order, combined with the regularization parameter, affects noise removal and preserves the essential features of the spectrum {and image}. Finally, the study shows that the combination of the proposed strategies produces an efficient, robust, and easily implementable filter.

[526] Adversarial Multi-Task Learning for Liver Tumor Segmentation, Dynamic Enhancement Regression, and Classification

Xiaojiao Xiao, Qinmin Vivian Hu, Tae Hyun Kim, Guanghui Wang

Main category: eess.IV

TL;DR: MTI-Net is an end-to-end multi-task framework that simultaneously performs liver tumor segmentation, dynamic enhancement regression, and classification using multi-domain information fusion and task interaction modules.

Details

Motivation: No prior work has achieved liver tumor segmentation, dynamic enhancement regression, and classification simultaneously in an end-to-end framework, lacking effective inter-task relevance capture and dynamic MRI information extraction mechanisms.

Method: Proposes MTI-Net with Multi-domain Information Entropy Fusion (MdIEF) for frequency/spectral domain integration, task interaction module for higher-order consistency between tasks, task-driven discriminator (TDD) for inter-task relationships, and shallow Transformer for dynamic MRI sequence encoding.

Result: MTI-Net demonstrates high performance across multiple tasks on a dataset of 238 subjects, showing strong potential for clinical liver tumor assessment.

Conclusion: The proposed MTI-Net framework effectively addresses the simultaneous execution of liver tumor analysis tasks through multi-domain fusion and task interaction, offering promising clinical application value.

Abstract: Liver tumor segmentation, dynamic enhancement regression, and classification are critical for clinical assessment and diagnosis. However, no prior work has attempted to achieve these tasks simultaneously in an end-to-end framework, primarily due to the lack of an effective framework that captures inter-task relevance for mutual improvement and the absence of a mechanism to extract dynamic MRI information effectively. To address these challenges, we propose the Multi-Task Interaction adversarial learning Network (MTI-Net), a novel integrated framework designed to tackle these tasks simultaneously. MTI-Net incorporates Multi-domain Information Entropy Fusion (MdIEF), which utilizes entropy-aware, high-frequency spectral information to effectively integrate features from both frequency and spectral domains, enhancing the extraction and utilization of dynamic MRI data. The network also introduces a task interaction module that establishes higher-order consistency between segmentation and regression, thus fostering inter-task synergy and improving overall performance. Additionally, we designed a novel task-driven discriminator (TDD) to capture internal high-order relationships between tasks. For dynamic MRI information extraction, we employ a shallow Transformer network to perform positional encoding, which captures the relationships within dynamic MRI sequences. In experiments on a dataset of 238 subjects, MTI-Net demonstrates high performance across multiple tasks, indicating its strong potential for assisting in the clinical assessment of liver tumors. The code is available at: https://github.com/xiaojiao929/MTI-Net.

[527] Deep Parameter Interpolation for Scalar Conditioning

Chicago Y. Park, Michael T. McCann, Cristina Garcia-Cardona, Brendt Wohlberg, Ulugbek S. Kamilov

Main category: eess.IV

TL;DR: Deep Parameter Interpolation (DPI) enables neural networks to accept scalar inputs by maintaining two parameter sets and dynamically interpolating between them based on scalar values, improving performance in diffusion and flow matching models.

Details

Motivation: Existing methods for incorporating scalar inputs in deep generative models either encode scalars as additional image inputs or restrict architecture choices by combining scalar and vector information in specific components, limiting flexibility.

Method: DPI maintains two learnable parameter sets within a single network and introduces scalar dependency by dynamically interpolating between these parameter sets based on the scalar value during training and sampling.

Result: DPI improves denoising performance and enhances sample quality for both diffusion and flow matching models while maintaining computational efficiency comparable to standard scalar conditioning techniques.

Conclusion: DPI is a simple, architecture-agnostic method that effectively adds scalar dependence to neural networks, demonstrating superior performance in generative modeling tasks.

Abstract: We propose deep parameter interpolation (DPI), a general-purpose method for transforming an existing deep neural network architecture into one that accepts an additional scalar input. Recent deep generative models, including diffusion models and flow matching, employ a single neural network to learn a time- or noise level-dependent vector field. Designing a network architecture to accurately represent this vector field is challenging because the network must integrate information from two different sources: a high-dimensional vector (usually an image) and a scalar. Common approaches either encode the scalar as an additional image input or combine scalar and vector information in specific network components, which restricts architecture choices. Instead, we propose to maintain two learnable parameter sets within a single network and to introduce the scalar dependency by dynamically interpolating between the parameter sets based on the scalar value during training and sampling. DPI is a simple, architecture-agnostic method for adding scalar dependence to a neural network. We demonstrate that our method improves denoising performance and enhances sample quality for both diffusion and flow matching models, while achieving computational efficiency comparable to standard scalar conditioning techniques. Code is available at https://github.com/wustl-cig/parameter_interpolation.

[528] Knowledge Distillation for Continual Learning of Biomedical Neural Fields

Wouter Visser, Jelmer M. Wolterink

Main category: eess.IV

TL;DR: Neural fields suffer from catastrophic forgetting when extended with new data. This paper analyzes this issue and proposes knowledge distillation to enable continual learning in neural fields for biomedical imaging.

Details

Motivation: Neural fields are used as continuous signal representations in biomedical imaging but cannot be easily extended without catastrophic forgetting, unlike discrete representations like voxel grids.

Method: The study examines catastrophic forgetting in neural fields when data arrives incrementally, and proposes knowledge distillation as a mitigation strategy. Experiments are conducted on cardiac cine MRI data.

Result: Knowledge distillation effectively mitigates catastrophic forgetting when extending spatiotemporal domains or increasing signal dimensionality. The extent of forgetting depends on the specific neural field model used.

Conclusion: Distillation enables continual learning in neural fields, allowing them to be extended without catastrophic forgetting of prior knowledge.

Abstract: Neural fields are increasingly used as a light-weight, continuous, and differentiable signal representation in (bio)medical imaging. However, unlike discrete signal representations such as voxel grids, neural fields cannot be easily extended. As neural fields are, in essence, neural networks, prior signals represented in a neural field will degrade when the model is presented with new data due to catastrophic forgetting. This work examines the extent to which different neural field approaches suffer from catastrophic forgetting and proposes a strategy to mitigate this issue. We consider the scenario in which data becomes available incrementally, with only the most recent data available for neural field fitting. In a series of experiments on cardiac cine MRI data, we demonstrate how knowledge distillation mitigates catastrophic forgetting when the spatiotemporal domain is enlarged or the dimensionality of the represented signal is increased. We find that the amount of catastrophic forgetting depends, to a large extent, on the neural fields model used, and that distillation could enable continual learning in neural fields.

Wenwei Li, Lingyi Cai, Hui Gong, Qingming Luo, Anan Li

Main category: eess.IV

TL;DR: A deep learning framework for registering in-vivo two-photon and ex-vivo fluorescence micro-optical sectioning tomography images of neurons, addressing cross-modality appearance gaps, data scarcity, and tissue deformations.

Details

Motivation: To enable accurate structure-function analysis in neuroscience by overcoming challenges in registering cross-modality neuron images, including appearance gaps, limited annotated data, and severe tissue deformations.

Method: Uses semantic-enhanced hybrid feature descriptor combining local geometric features with DINOV3 vision foundation model, learnable Geometric Consistency Confidence Module instead of RANSAC, and two-stage training with synthetic pre-training and real data fine-tuning.

Result: Provides robust and accurate solution for high-precision registration in challenging biomedical imaging scenarios.

Conclusion: The framework enables large-scale correlative studies by effectively addressing cross-modality registration challenges in neuroscience imaging.

Abstract: Accurately registering in-vivo two-photon and ex-vivo fluorescence micro-optical sectioning tomography images of individual neurons is critical for structure-function analysis in neuroscience. This task is profoundly challenging due to a significant cross-modality appearance gap, the scarcity of annotated data and severe tissue deformations. We propose a novel deep learning framework to address these issues. Our method introduces a semantic-enhanced hybrid feature descriptor, which fuses the geometric precision of local features with the contextual robustness of a vision foundation model DINOV3 to bridge the modality gap. To handle complex deformations, we replace traditional RANSAC with a learnable Geometric Consistency Confidence Module, a novel classifier trained to identify and reject physically implausible correspondences. A data-efficient two-stage training strategy, involving pre-training on synthetically deformed data and fine-tuning on limited real data, overcomes the data scarcity problem. Our framework provides a robust and accurate solution for high-precision registration in challenging biomedical imaging scenarios, enabling large-scale correlative studies.

[530] Entropy Coding for Non-Rectangular Transform Blocks using Partitioned DCT Dictionaries for AV1

Priyanka Das, Tim Classen, Mathias Wien

Main category: eess.IV

TL;DR: This paper introduces an entropy coding method designed for non-rectangular transform coefficients in video codecs, addressing the limitations of current DCT-optimized entropy coding schemes.

Details

Motivation: Recent video codecs use non-rectangular partitioning with smooth blending, but current entropy coding schemes are not well-suited for the resulting transform coefficients, as they are primarily designed for DCT coefficients.

Method: The authors develop an entropy coding method that efficiently codes non-rectangular transform coefficients by effectively modeling their specific properties, offering minimal decoder changes.

Result: The proposed design shows significant theoretical rate savings, particularly for scenarios that are more dissimilar to DCT, as estimated using conditional entropy in experimental setups.

Conclusion: The introduced entropy coding method effectively addresses the coding inefficiency of non-rectangular transform coefficients, providing substantial rate savings while maintaining minimal decoder complexity.

Abstract: Recent video codecs such as VVC and AV1 apply a Non-rectangular (NR) partitioning to combine prediction signals using a smooth blending around the boundary, followed by a rectangular transform on the whole block. The NR signal transformation is not yet supported. A transformation technique that applies the same partitioning to the 2D Discrete Cosine Transform (DCT) bases and finds a sparse representation of the NR signal in such a dictionary showed promising gains in an experimental setup outside the reference software. This method uses the regular inverse transformation at the decoder to reconstruct a rectangular signal and discards the signal outside the region of interest. This design is appealing due to the minimal changes required at the decoder. However, current entropy coding schemes are not well-suited for optimally encoding these coefficients because they are primarily designed for DCT coefficients. This work introduces an entropy coding method that efficiently codes these transform coefficients by effectively modeling their properties. The design offers significant theoretical rate savings, estimated using conditional entropy, particularly for scenarios that are more dissimilar to DCT in an experimental setup.

[531] LMLCC-Net: A Semi-Supervised Deep Learning Model for Lung Nodule Malignancy Prediction from CT Scans using a Novel Hounsfield Unit-Based Intensity Filtering

Tasnia Binte Mamun, Adhora Madhuri, Nusaiba Sobir, Taufiq Hasan

Main category: eess.IV

TL;DR: LMLCC-Net is a novel 3D CNN framework for classifying lung nodules in CT scans using Hounsfield Unit-based intensity filtering, achieving 91.96% accuracy on LUNA16 dataset.

Details

Motivation: Early diagnosis of malignant pulmonary nodules can significantly reduce lung cancer mortality. Benign and malignant nodules have significant differences in HU intensity profiles that haven't been exploited in literature.

Method: Proposed LMLCC-Net uses multiple branches with separate learnable HU-based intensity filtering stages to extract features from intensity patterns and texture. Also includes semi-supervised learning for ambiguous cases and a lightweight model.

Result: Achieved 91.96% classification accuracy, 92.94% sensitivity, and 94.07% AUC on LUNA16 dataset, showing improved performance compared to existing methods.

Conclusion: The method can significantly help radiologists in classifying pulmonary nodules and improving patient care by leveraging previously unexploited HU intensity differences.

Abstract: Lung cancer is the leading cause of patient mortality in the world. Early diagnosis of malignant pulmonary nodules in CT images can have a significant impact on reducing disease mortality and morbidity. In this work, we propose LMLCC-Net, a novel deep learning framework for classifying nodules from CT scan images using a 3D CNN, considering Hounsfield Unit (HU)-based intensity filtering. Benign and malignant nodules have significant differences in their intensity profile of HU, which was not exploited in the literature. Our method considers the intensity pattern as well as the texture for the prediction of malignancies. LMLCC-Net extracts features from multiple branches that each use a separate learnable HU-based intensity filtering stage. Various combinations of branches and learnable ranges of filters were explored to finally produce the best-performing model. In addition, we propose a semi-supervised learning scheme for labeling ambiguous cases and also developed a lightweight model to classify the nodules. The experimental evaluations are carried out on the LUNA16 dataset. The proposed LMLCC-Net was evaluated using the LUNA16 dataset. Our proposed method achieves a classification accuracy of 91.96%, a sensitivity of 92.94%, and an area under the curve of 94.07%, showing improved performance compared to existing methods The proposed method can have a significant impact in helping radiologists in the classification of pulmonary nodules and improving patient care.

[532] Generalizable cardiac substructures segmentation from contrast and non-contrast CTs using pretrained transformers

Aneesh Rangnekar, Nikhil Mankuzhy, Jonas Willmann, Chloe Choi, Abraham Wu, Maria Thor, Andreas Rimner, Harini Veeraraghavan

Main category: eess.IV

TL;DR: A hybrid transformer convolutional network was developed for robust cardiac substructure segmentation across varying CT imaging protocols and patient positions, achieving comparable accuracy to oracle models with 64% fewer training cases.

Details

Motivation: Automated AI segmentations deteriorate when applied to cases with different characteristics than training data, particularly for radiation treatment planning where accuracy is critical across varying imaging contrasts and scan positions.

Method: Developed a hybrid transformer convolutional network trained on balanced datasets of contrast-enhanced and non-contrast CT scans from lung cancer patients, with evaluation on held-out lung cancer patients and breast cancer patients in different positions.

Result: Balanced model achieved similar accuracy to oracle model (DSC: 0.82±0.10 vs 0.84±0.10 in Cohort I, 0.80±0.13 vs 0.81±0.12 in Cohort II) using 64% fewer training cases, outperforming TotalSegmentator and contrast-only models, with robust performance across contrast and positioning variations.

Conclusion: Combining pretraining with balanced NCCT/CECT distribution enables reliable segmentation with substantially fewer labeled cases than conventional approaches, demonstrating robust geometric and dosimetric accuracy essential for clinical deployment.

Abstract: Automated AI segmentations for radiation treatment planning deteriorate when applied to cases with different characteristics than the training dataset. We developed a hybrid transformer convolutional network to segment cardiac substructures in lung and breast cancer patients with varying imaging contrasts and scan positions. Cohort I (56 contrast-enhanced CT [CECT], 124 non-contrast CT [NCCT] scans from lung cancer patients, supine position) was used to train an oracle model (180 cases), contrast-only model (56 CECTs), and balanced model (32 CECT, 32 NCCT). All models were evaluated on 60 held-out cohort I patients and 66 cohort II breast cancer patients (45 supine, 21 prone). Accuracy was measured using Dice similarity coefficient (DSC), 95th percentile Hausdorff distance (HD95), and dosimetric metrics, with TotalSegmentator as benchmark. Oracle and balanced models achieved similar accuracy (DSC: Oracle vs Balanced: Cohort I: 0.84 $\pm$ 0.10 vs 0.82 $\pm$ 0.10; Cohort II: 0.81 $\pm$ 0.12 vs 0.80 $\pm$ 0.13), both outperforming TotalSegmentator and the contrast-only models. The balanced model, using 64% fewer training cases, produced dosimetrically equivalent contours to manual delineations. It was robust to contrast variations (6 out of 8 substructures) and positioning variations (5 out of 8 substructures), with low correlation to patient age or body mass index. Our balanced model demonstrated robust geometric and dosimetric accuracy across varying imaging protocols and patient characteristics, which is essential for clinical deployment. Combining pretraining with balanced NCCT/CECT distribution enabled reliable segmentation with substantially fewer labeled cases than conventional approaches.

[533] DEMIST: Decoupled Multi-stream latent diffusion for Quantitative Myelin Map Synthesis

Jiacheng Wang, Hao Li, Xing Yao, Ahmad Toubasi, Taegan Vinarsky, Caroline Gheen, Joy Derwenskus, Chaoyang Jin, Richard Dortch, Junzhong Xu, Francesca Bagnato, Ipek Oguz

Main category: eess.IV

TL;DR: DEMIST synthesizes quantitative magnetization transfer (qMT) pool size ratio (PSR) maps from standard T1w and FLAIR images using a 3D latent diffusion model, eliminating the need for specialized 20-30 minute qMT scans.

Details

Motivation: qMT imaging provides valuable myelin-sensitive biomarkers for multiple sclerosis assessment but requires specialized long-duration scans. The goal is to generate PSR maps from standard clinical images to make this biomarker more accessible.

Method: Two-stage approach: (1) Train separate autoencoders for PSR and anatomical images to learn aligned latent representations; (2) Train conditional diffusion model in latent space using frozen diffusion foundation backbone with three conditioning mechanisms: semantic tokens via cross-attention, spatial per-scale residual hints via 3D ControlNet, and adaptive LoRA-modulated attention. Includes edge-aware and alignment losses.

Result: Outperforms VAE, GAN and diffusion baselines on multiple metrics, producing sharper boundaries and better quantitative agreement with ground truth. Evaluated on 163 scans from 99 subjects using 5-fold cross-validation.

Conclusion: DEMIST successfully synthesizes high-quality PSR maps from standard clinical images, providing a practical alternative to specialized qMT scans for multiple sclerosis assessment.

Abstract: Quantitative magnetization transfer (qMT) imaging provides myelin-sensitive biomarkers, such as the pool size ratio (PSR), which is valuable for multiple sclerosis (MS) assessment. However, qMT requires specialized 20-30 minute scans. We propose DEMIST to synthesize PSR maps from standard T1w and FLAIR images using a 3D latent diffusion model with three complementary conditioning mechanisms. Our approach has two stages: first, we train separate autoencoders for PSR and anatomical images to learn aligned latent representations. Second, we train a conditional diffusion model in this latent space on top of a frozen diffusion foundation backbone. Conditioning is decoupled into: (i) \textbf{semantic} tokens via cross-attention, (ii) \textbf{spatial} per-scale residual hints via a 3D ControlNet branch, and (iii) \textbf{adaptive} LoRA-modulated attention. We include edge-aware loss terms to preserve lesion boundaries and alignment losses to maintain quantitative consistency, while keeping the number of trainable parameters low and retaining the inductive bias of the pretrained model. We evaluate on 163 scans from 99 subjects using 5-fold cross-validation. Our method outperforms VAE, GAN and diffusion baselines on multiple metrics, producing sharper boundaries and better quantitative agreement with ground truth. Our code is publicly available at https://github.com/MedICL-VU/MS-Synthesis-3DcLDM.

[534] Diffusion Algorithm for Metalens Optical Aberration Correction

Harshana Weligampola, Yuanrui Chen, Weiheng Tang, Qi Guo, Stanley H. Chan

Main category: eess.IV

TL;DR: A dual-branch diffusion model that reconstructs sharp full-color images from metalens-captured inputs: a sharp grayscale structure image and distorted color cue image.

Details

Motivation: Metalenses suffer from severe chromatic aberrations that make image reconstruction challenging, requiring algorithmic solutions to overcome optical distortions.

Method: Uses a dual-branch diffusion model built on pre-trained Stable Diffusion XL to fuse information from sharp grayscale structure images and distorted color cue images.

Result: Significantly outperforms existing deblurring and pansharpening methods, effectively restoring high-frequency details while accurately colorizing images.

Conclusion: The proposed method successfully addresses metalens chromatic aberration issues through diffusion-based fusion of structure and color information.

Abstract: Metalenses offer a path toward creating ultra-thin optical systems, but they inherently suffer from severe, spatially varying optical aberrations, especially chromatic aberration, which makes image reconstruction a significant challenge. This paper presents a novel algorithmic solution to this problem, designed to reconstruct a sharp, full-color image from two inputs: a sharp, bandpass-filtered grayscale structure image'' and a heavily distorted color cue’’ image, both captured by the metalens system. Our method utilizes a dual-branch diffusion model, built upon a pre-trained Stable Diffusion XL framework, to fuse information from the two inputs. We demonstrate through quantitative and qualitative comparisons that our approach significantly outperforms existing deblurring and pansharpening methods, effectively restoring high-frequency details while accurately colorizing the image.

Today’s Research Highlights

Table of Contents

cs.CL

[1] Democratizing LLM Efficiency: From Hyperscale Optimizations to Universal Deployability

[2] Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free, Deterministic, and Reversible Embedding Methodology

[3] A centroid based framework for text classification in itsm environments

[4] PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation

[5] Structured Definitions and Segmentations for Legal Reasoning in LLMs: A Study on Indian Legal Data

[6] MindSET: Advancing Mental Health Benchmarking through Large-Scale Social Media Data

[7] Semantics Meet Signals: Dual Codebook Representationl Learning for Generative Recommendation

[8] Prompt Engineering Techniques for Context-dependent Text-to-SQL in Arabic

[9] Cognitive bias in LLM reasoning compromises interpretation of clinical oncology notes

[10] Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings

[11] Dynamic Template Selection for Output Token Generation Optimization: MLP-Based and Transformer Approaches

[12] LLMs-Powered Accurate Extraction, Querying and Intelligent Management of Literature derived 2D Materials Data

[13] Memories Retrieved from Many Paths: A Multi-Prefix Framework for Robust Detection of Training Data Leakage in Large Language Models

[14] ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

[15] SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

[16] ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features

[17] Structured Prompting Enables More Robust, Holistic Evaluation of Language Models

[18] Length-MAX Tokenizer for Language Models

[19] Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

[20] Winning with Less for Low Resource Languages: Advantage of Cross-Lingual English_Persian Argument Mining Model over LLM Augmentation

[21] LightMem: Lightweight and Efficient Memory-Augmented Generation

[22] Emergence and Localisation of Semantic Role Circuits in LLMs

[23] Chatty-KG: A Multi-Agent AI System for On-Demand Conversational Question Answering over Knowledge Graphs

[24] TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Knowledge in Open Large Language Models

[25] Semantic Anchors in In-Context Learning: Why Small LLMs Cannot Flip Their Labels

[26] Context-Aware Pragmatic Metacognitive Prompting for Sarcasm Detection

[27] Enhancing Burmese News Classification with Kolmogorov-Arnold Network Head Fine-tuning

[28] Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models

[29] MortgageLLM: Domain-Adaptive Pretraining with Residual Instruction Transfer, Alignment Tuning, and Task-Specific Routing

[30] Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

[31] Can Finetuing LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?

[32] Developing an Open Conversational Speech Corpus for the Isan Language

[33] PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

[34] Emergent Lexical Semantics in Neural Language Models: Testing Martin’s Law on LLM-Generated Text

[35] Training Introspective Behavior: Fine-Tuning Induces Reliable Internal State Detection in a 7B Model

[36] Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?

[37] Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation

[38] Odin: Oriented Dual-module Integration for Text-rich Network Representation Learning

[39] A Systematic Study of Model Merging Techniques in Large Language Models

[40] Hierarchical Ranking Neural Network for Long Document Readability Assessment

[41] Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation

[42] Bangla Sign Language Translation: Dataset Creation Challenges, Benchmarking and Prospects

[43] RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions

[44] Auxiliary Metrics Help Decoding Skill Neurons in the Wild

[45] Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

[46] The author is dead, but what if they never lived? A reception experiment on Czech AI- and human-authored poetry

[47] Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

[48] Revisiting Generalization Across Difficulty Levels: It’s Not So Easy

[49] Evaluating Large Language Models for Radiology Natural Language Processing

[50] Fine-grained and Explainable Factuality Evaluation for Multimodal Summarization

[51] Scaling Efficient LLMs

[52] Gram2Vec: An Interpretable Document Vectorizer

[53] A Psychology-based Unified Dynamic Framework for Curriculum Learning

[54] Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models

[55] BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations

[56] Position-Aware Depth Decay Decoding ($D^3$): Boosting Large Language Model Inference Efficiency

[57] Reasoning Transfer for an Extremely Low-Resource and Endangered Language: Bridging Languages Through Sample-Efficient Language Understanding

[58] The Distribution of Dependency Distance and Hierarchical Distance in Contemporary Written Japanese and Its Influencing Factors

[59] A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency

[60] Enhancing Large Language Models for Detecting Mental Manipulation via Annotation-Free Data Augmentation and Anti-Curriculum Distillation

[61] Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

[62] UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions

[63] The Structure-Content Trade-off in Knowledge Graph Retrieval

[64] Co-NAML-LSTUR: A Combined Model with Attentive Multi-View Learning and Long- and Short-term User Representations for News Recommendation

[65] On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey

[66] CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis

[67] Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset

[68] Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language

[69] Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning

[70] AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following

[71] Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction

[72] AICC: Parse HTML Finer, Make Models Better – A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

[73] CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning

[74] DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

[75] A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction

[76] MTA: A Merge-then-Adapt Framework for Personalized Large Language Model

[77] BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali