Daily arXiv Papers - 2025-12-18

AI-enhanced summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Incentives or Ontology? A Structural Rebuttal to OpenAI’s Hallucination Thesis

Richard Ackermann, Simeon Emanuilov

Main category: cs.CL

TL;DR: Hallucination in LLMs is not an optimization problem but an architectural inevitability of transformers, requiring hybrid systems with external truth-validation.

Motivation: To challenge OpenAI's view that hallucinations are primarily due to misaligned evaluation incentives and demonstrate that hallucination is a structural property of transformer architecture itself.

Method: Drawing on previous work on structural hallucination and empirical experiments using a Licensing Oracle to test transformer behavior at ontological boundary conditions.
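
The summary does not include the Licensing Oracle's internals; a minimal sketch of what an external truth-validation and abstention wrapper could look like (all names and facts below are illustrative, not the authors' implementation):

```python
# Hypothetical sketch of a "Licensing Oracle"-style wrapper: a generated claim
# is only released if an external ground-truth store licenses it; otherwise
# the system abstains instead of emitting a fluent guess.

KNOWLEDGE_BASE = {
    ("Paris", "capital_of"): "France",   # toy world-referential facts
    ("Water", "formula"): "H2O",
}

def license_claim(subject: str, relation: str, value: str) -> bool:
    """Return True only if the claim is grounded in the knowledge base."""
    grounded = KNOWLEDGE_BASE.get((subject, relation))
    return grounded is not None and grounded == value

def answer_with_abstention(subject: str, relation: str, generated_value: str) -> str:
    if license_claim(subject, relation, generated_value):
        return generated_value
    return "I don't know."  # abstain at the ontological boundary

print(answer_with_abstention("Paris", "capital_of", "France"))     # France
print(answer_with_abstention("Paris", "population", "3 million"))  # I don't know.
```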

Result: Hallucination can only be eliminated through external truth-validation and abstention modules, not through changes to incentives, prompting, or fine-tuning. Licensing Oracle achieves perfect abstention precision.

Conclusion: Hallucination is a structural property of generative architectures, and reliable AI requires hybrid systems that distinguish linguistic fluency from epistemic responsibility.

Abstract: OpenAI has recently argued that hallucinations in large language models result primarily from misaligned evaluation incentives that reward confident guessing rather than epistemic humility. On this view, hallucination is a contingent behavioral artifact, remediable through improved benchmarks and reward structures. In this paper, we challenge that interpretation. Drawing on previous work on structural hallucination and empirical experiments using a Licensing Oracle, we argue that hallucination is not an optimization failure but an architectural inevitability of the transformer model. Transformers do not represent the world; they model statistical associations among tokens. Their embedding spaces form a pseudo-ontology derived from linguistic co-occurrence rather than world-referential structure. At ontological boundary conditions - regions where training data is sparse or incoherent - the model necessarily interpolates fictional continuations in order to preserve coherence. No incentive mechanism can modify this structural dependence on pattern completion. Our empirical results demonstrate that hallucination can only be eliminated through external truth-validation and abstention modules, not through changes to incentives, prompting, or fine-tuning. The Licensing Oracle achieves perfect abstention precision across domains precisely because it supplies grounding that the transformer lacks. We conclude that hallucination is a structural property of generative architectures and that reliable AI requires hybrid systems that distinguish linguistic fluency from epistemic responsibility.

[2] T5Gemma 2: Seeing, Reading, and Understanding Longer

Biao Zhang, Paul Suganthan, Gaël Liu, Ilya Philippov, Sahil Dua, Ben Hora, Kat Black, Gus Martins, Omar Sanseviero, Shreya Pathak, Cassidy Hardin, Francesco Visin, Jiageng Zhang, Kathleen Kenealy, Qin Yin, Olivier Lacombe, Armand Joulin, Tris Warkentin, Adam Roberts

Main category: cs.CL

TL;DR: T5Gemma 2 is a new lightweight open encoder-decoder model with multilingual, multimodal, and long-context capabilities, built by adapting Gemma 3 decoder-only models into encoder-decoder architecture with efficiency improvements.

Motivation: To extend the successful T5Gemma adaptation approach from text-only to multimodal domains while maintaining efficiency, and to demonstrate the advantages of encoder-decoder architectures for long-context modeling across different architectures and modalities.

Method: Adapts pretrained Gemma 3 decoder-only models into encoder-decoder architecture using UL2 adaptation recipe, extends to multimodal, introduces tied word embedding (sharing embeddings across encoder/decoder) and merged attention (unifying decoder self- and cross-attention).
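
Neither efficiency trick requires exotic machinery; a toy PyTorch sketch of the two ideas (this is an illustration, not the released T5Gemma 2 code, and masking is omitted for brevity):

```python
import torch
import torch.nn as nn

# One embedding table serves the encoder input, decoder input, and output
# projection ("tied word embedding"), and the decoder runs a single attention
# over the concatenation of encoder memory and decoder states instead of
# separate self- and cross-attention ("merged attention").

class TinyTiedEncDec(nn.Module):
    def __init__(self, vocab=32000, d=256, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)        # shared everywhere
        self.encoder = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.merged_attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, src_ids, tgt_ids):
        memory = self.encoder(self.embed(src_ids))
        tgt = self.embed(tgt_ids)
        kv = torch.cat([memory, tgt], dim=1)       # one joint key/value set
        out, _ = self.merged_attn(tgt, kv, kv)
        return out @ self.embed.weight.T           # tied output projection

model = TinyTiedEncDec()
logits = model(torch.randint(0, 32000, (2, 16)), torch.randint(0, 32000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 32000])
```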

Result: Shows generality of adaptation strategy across architectures/modalities, encoder-decoder strength in long-context modeling, comparable/better pretraining performance and significantly improved post-training performance vs Gemma 3 counterparts. Releases 270M-270M, 1B-1B, and 4B-4B models.

Conclusion: T5Gemma 2 successfully extends the adaptation approach to multimodal domains with efficiency improvements, demonstrating the versatility of encoder-decoder architectures and providing valuable open models for research.

Abstract: We introduce T5Gemma 2, the next generation of the T5Gemma family of lightweight open encoder-decoder models, featuring strong multilingual, multimodal and long-context capabilities. T5Gemma 2 follows the adaptation recipe (via UL2) in T5Gemma – adapting a pretrained decoder-only model into an encoder-decoder model, and extends it from text-only regime to multimodal based on the Gemma 3 models. We further propose two methods to improve the efficiency: tied word embedding that shares all embeddings across encoder and decoder, and merged attention that unifies decoder self- and cross-attention into a single joint module. Experiments demonstrate the generality of the adaptation strategy over architectures and modalities as well as the unique strength of the encoder-decoder architecture on long context modeling. Similar to T5Gemma, T5Gemma 2 yields comparable or better pretraining performance and significantly improved post-training performance than its Gemma 3 counterpart. We release the pretrained models (270M-270M, 1B-1B and 4B-4B) to the community for future research.

[3] Integrating Large Language Models and Knowledge Graphs to Capture Political Viewpoints in News Media

Massimiliano Fadda, Enrico Motta, Francesco Osborne, Diego Reforgiato Recupero, Angelo Salatino

Main category: cs.CL

TL;DR: The paper improves a viewpoint classification pipeline for news analysis by fine-tuning LLMs and enriching claim representations with Wikidata actor descriptions, achieving best results when integrated.

Motivation: News sources shape political discourse through specific topics and viewpoints, and understanding these dynamics is essential for assessing media balance and fairness in public debate. The authors previously developed a pipeline for identifying viewpoints in news, but seek to improve its classification performance.

Method: Two improvements to existing pipeline: 1) fine-tuning Large Language Models (LLMs) for viewpoint classification, and 2) enriching claim representations with semantic descriptions of relevant actors from Wikidata. The approach is evaluated on a UK immigration debate benchmark.
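
The Wikidata enrichment step can be approximated with the public Wikidata search API; a sketch of the kind of lookup involved (the prompt format and the enrichment template are assumptions, not the authors' code):

```python
import requests

# Fetch a short semantic description of an actor from Wikidata and append it
# to the claim text before viewpoint classification.

def wikidata_description(name: str, lang: str = "en") -> str | None:
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbsearchentities", "search": name,
                "language": lang, "format": "json"},
        timeout=10,
    )
    hits = resp.json().get("search", [])
    return hits[0].get("description") if hits else None

def enrich_claim(claim: str, actors: list[str]) -> str:
    notes = [f"{a}: {wikidata_description(a)}" for a in actors]
    return claim + "\n[Actor context] " + "; ".join(notes)

print(enrich_claim("The Home Office tightened visa rules.", ["Home Office"]))
```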

Result: Both mechanisms independently improve classification performance, but their integration yields the best results, particularly when using LLMs capable of processing long inputs.

Conclusion: The enhanced pipeline combining fine-tuned LLMs with Wikidata-enriched claim representations provides superior viewpoint classification for analyzing media discourse, especially for complex topics like immigration debates.

Abstract: News sources play a central role in democratic societies by shaping political and social discourse through specific topics, viewpoints and voices. Understanding these dynamics is essential for assessing whether the media landscape offers a balanced and fair account of public debate. In earlier work, we introduced a pipeline that, given a news corpus, i) uses a hybrid human-machine approach to identify the range of viewpoints expressed about a given topic, and ii) classifies relevant claims with respect to the identified viewpoints, defined as sets of semantically and ideologically congruent claims (e.g., positions arguing that immigration positively impacts the UK economy). In this paper, we improve this pipeline by i) fine-tuning Large Language Models (LLMs) for viewpoint classification and ii) enriching claim representations with semantic descriptions of relevant actors drawn from Wikidata. We evaluate our approach against alternative solutions on a benchmark centred on the UK immigration debate. Results show that while both mechanisms independently improve classification performance, their integration yields the best results, particularly when using LLMs capable of processing long inputs.

[4] DrugRAG: Enhancing Pharmacy LLM Performance Through A Novel Retrieval-Augmented Generation Pipeline

Houman Kazemzadeh, Kiarash Mokhtari Dizaji, Seyed Reza Tavakoli, Farbod Davoodi, MohammadReza KarimiNejad, Parham Abed Azad, Ali Sabzi, Armin Khosravi, Siavash Ahmadi, Mohammad Hossein Rohban, Gholamali Aminian, Tahereh Javaheri

Main category: cs.CL

TL;DR: DrugRAG, a retrieval-augmented generation pipeline using external structured drug knowledge, improves LLM accuracy on pharmacy QA tasks without model modifications.

Motivation: To evaluate LLM performance on pharmacy licensure-style QA tasks and develop a method to improve their accuracy through external knowledge integration.

Method: Benchmarked 11 LLMs (8B to 70B+ parameters) on 141-question pharmacy dataset, then developed DrugRAG - a three-step RAG pipeline that retrieves structured drug knowledge from validated sources and augments prompts with evidence-based context.
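
The pipeline's three steps (retrieve, augment, query) are model-agnostic; a hedged sketch with a stand-in knowledge store (the paper's validated sources and prompt templates are not public in this summary, so everything below is illustrative):

```python
# DrugRAG-style pipeline sketch: retrieve structured facts, build an
# evidence-augmented prompt, and query the unmodified model.

DRUG_DB = {  # stand-in for validated structured drug knowledge
    "warfarin": {"class": "vitamin K antagonist",
                 "interactions": ["NSAIDs", "amiodarone"],
                 "monitoring": "INR"},
}

def retrieve(question: str) -> list[str]:
    """Step 1: pull structured facts for any drug mentioned in the question."""
    facts = []
    for drug, info in DRUG_DB.items():
        if drug in question.lower():
            facts += [f"{drug} | {k}: {v}" for k, v in info.items()]
    return facts

def augment(question: str, facts: list[str]) -> str:
    """Step 2: prepend evidence-based context to the prompt."""
    context = "\n".join(facts) or "No structured evidence found."
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

def drug_rag(question: str, llm) -> str:
    """Step 3: query the model externally; no weights are modified."""
    return llm(augment(question, retrieve(question)))

q = "Which lab value is monitored for warfarin?"
print(augment(q, retrieve(q)))
```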

Result: Baseline accuracy ranged from 46% to 92% (GPT-5: 92%, o3: 89%). Models with <8B parameters scored <50%. DrugRAG improved accuracy across all models by 7-21 percentage points (e.g., Gemma 3 27B: 61%→71%, Llama 3.1 8B: 46%→67%).

Conclusion: External structured drug knowledge integration through DrugRAG measurably improves LLM accuracy on pharmacy tasks without modifying underlying models, providing a practical pipeline for enhancing pharmacy-focused AI applications.

Abstract: Objectives: To evaluate large language model (LLM) performance on pharmacy licensure-style question-answering (QA) tasks and develop an external knowledge integration method to improve their accuracy. Methods: We benchmarked eleven existing LLMs with varying parameter sizes (8 billion to 70+ billion) using a 141-question pharmacy dataset. We measured baseline accuracy for each model without modification. We then developed a three-step retrieval-augmented generation (RAG) pipeline, DrugRAG, that retrieves structured drug knowledge from validated sources and augments model prompts with evidence-based context. This pipeline operates externally to the models, requiring no changes to model architecture or parameters. Results: Baseline accuracy ranged from 46% to 92%, with GPT-5 (92%) and o3 (89%) achieving the highest scores. Models with fewer than 8 billion parameters scored below 50%. DrugRAG improved accuracy across all tested models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61% to 71%, Llama 3.1 8B: 46% to 67%) on the 141-item benchmark. Conclusion: We demonstrate that external structured drug knowledge integration through DrugRAG measurably improves LLM accuracy on pharmacy tasks without modifying the underlying models. This approach provides a practical pipeline for enhancing pharmacy-focused AI applications with evidence-based information.

[5] Multiscale Aggregated Hierarchical Attention (MAHA): A Game Theoretic and Optimization Driven Approach to Efficient Contextual Modeling in Large Language Models

Caner Erden

Main category: cs.CL

TL;DR: MAHA is a hierarchical attention framework that mitigates the quadratic complexity of standard attention through multi-scale decomposition and optimal aggregation via convex optimization or game theory, cutting computational cost by 81% at a sequence length of 4096.

Motivation: The quadratic computational complexity of Multi-Head Self-Attention (MHSA) is a fundamental bottleneck for scaling LLMs to long-context tasks. Existing sparse and linearized attention methods compromise global dependencies or fail to capture multi-scale semantic granularity effectively.

Method: MAHA reformulates attention through hierarchical decomposition: dynamically partitions input sequences into hierarchical scales via learnable downsampling operators. The core innovation is modeling scale-specific attention matrix fusion as a resource allocation problem, solved via convex optimization or Nash equilibrium-based game theory. Implemented within a hybrid dilated-convolutional transformer backbone with differentiable optimization layers for end-to-end training.
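
A stripped-down sketch of the multi-scale idea follows; here the fusion weights come from a plain softmax, whereas the paper solves a convex/game-theoretic allocation problem, and the learnable downsampling is replaced by average pooling:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def maha_lite(x, scales=(1, 2, 4), weights=None):
    """x: (batch, seq, dim). Attend at several resolutions, then fuse."""
    if weights is None:
        weights = torch.ones(len(scales))
    alpha = F.softmax(weights, dim=0)            # convex combination of scales
    out = torch.zeros_like(x)
    for a, s in zip(alpha, scales):
        # Paper: learnable downsampling operator; here: average pooling.
        pooled = F.avg_pool1d(x.transpose(1, 2), s, stride=s).transpose(1, 2)
        out = out + a * attention(x, pooled, pooled)
    return out

y = maha_lite(torch.randn(2, 4096, 64))
print(y.shape)  # keys/values shrink per scale, so cost drops vs. full attention
```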

Result: MAHA achieves superior scalability with 81% reduction in computational cost at sequence length of 4096 compared to standard attention, as confirmed by empirical FLOPs analysis.

Conclusion: MAHA bridges optimization theory and sequence modeling, offering a scalable solution for next-generation LLMs by theoretically balancing local nuance and global context fidelity through mathematically rigorous aggregation.

Abstract: The quadratic computational complexity of Multi-Head Self-Attention (MHSA) remains a fundamental bottleneck in scaling Large Language Models (LLMs) for long-context tasks. While sparse and linearized attention mechanisms attempt to mitigate this, they often compromise the representation of global dependencies or fail to capture multi-scale semantic granularity effectively. In this paper, we propose Multiscale Aggregated Hierarchical Attention (MAHA), a novel architectural framework that reformulates the attention mechanism through hierarchical decomposition and mathematically rigorous aggregation. Unlike conventional approaches that treat token interactions at a single resolution, MAHA dynamically partitions the input sequence into hierarchical scales via learnable downsampling operators. The core innovation lies in its aggregation strategy: we model the fusion of scale-specific attention matrices as a resource allocation problem, solved via a convex optimization framework or a Nash equilibrium-based game-theoretic approach. This ensures a theoretically optimal balance between local nuance and global context fidelity. Implemented within a hybrid dilated-convolutional transformer backbone, MAHA utilizes differentiable optimization layers to enable end-to-end training. Experimental evaluations demonstrate that MAHA achieves superior scalability; empirical FLOPs analysis confirms an 81% reduction in computational cost at a sequence length of 4096 compared to standard attention. This work bridges the gap between optimization theory and sequence modeling, offering a scalable solution for next-generation LLMs.

[6] Parameter Efficient Multimodal Instruction Tuning for Romanian Vision Language Models

George-Andrei Dima, Dumitru-Clementin Cercel

Main category: cs.CL

TL;DR: Researchers translate Flickr30k to Romanian and extend it for visual QA, then fine-tune open-source VLMs (LLaMA, LLaVA, Qwen2) using LoRA, achieving improved Romanian capabilities in visual QA and image description with reduced grammatical errors.

Motivation: To reduce the multimodal NLP resource gap for low-resource languages like Romanian by creating Romanian visual QA datasets and improving Romanian capabilities in vision-language models.

Method: 1) Translate Flickr30k dataset to Romanian and extend it for visual QA using open-source LLMs; 2) Fine-tune three VLM families (LLaMA 3.2, LLaVA 1.6, Qwen2) using parameter-efficient LoRA method on Romanian visual QA data.
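
The LoRA setup can be reproduced with the Hugging Face PEFT library; a representative sketch follows (the checkpoint name is real, but the rank, alpha, and target modules are illustrative guesses rather than the paper's reported configuration, and loading the 7B checkpoint requires substantial memory):

```python
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

# Representative parameter-efficient fine-tuning setup for a VLM.
model = AutoModelForVision2Seq.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

lora = LoraConfig(
    r=16,                    # rank of the low-rank update matrices
    lora_alpha=32,           # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only the adapter weights are trainable
```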

Result: Models show improved Romanian capabilities in visual QA and zero-shot image description generation. Qwen2-VL-RoVQA (7B) achieves +6.05% and +2.61% BERTScore F1 improvements over original version. Substantial reduction in grammatical errors indicates improved Romanian fluency.

Conclusion: The work successfully reduces Romanian multimodal NLP resource gap by creating datasets and fine-tuning VLMs, demonstrating effective cross-lingual transfer and improved Romanian language understanding and fluency in vision-language tasks.

Abstract: Focusing on low-resource languages is an essential step toward democratizing generative AI. In this work, we contribute to reducing the multimodal NLP resource gap for Romanian. We translate the widely known Flickr30k dataset into Romanian and further extend it for visual question answering by leveraging open-source LLMs. We demonstrate the usefulness of our datasets by fine-tuning open-source VLMs on Romanian visual question answering. We select VLMs from three widely used model families: LLaMA 3.2, LLaVA 1.6, and Qwen2. For fine-tuning, we employ the parameter-efficient LoRA method. Our models show improved Romanian capabilities in visual QA, as well as on tasks they were not trained on, such as Romanian image description generation. The seven-billion-parameter Qwen2-VL-RoVQA obtains top scores on both tasks, with improvements of +6.05% and +2.61% in BERTScore F1 over its original version. Finally, the models show substantial reductions in grammatical errors compared to their original forms, indicating improvements not only in language understanding but also in Romanian fluency.

[7] Cross-Tokenizer Likelihood Scoring Algorithms for Language Model Distillation

Buu Phan, Ashish Khisti, Karen Ullrich

Main category: cs.CL

TL;DR: A method for computing cross-tokenizer likelihood ratios between language models with different vocabularies, enabling knowledge distillation when teacher and student models use different tokenizers.

Motivation: Knowledge distillation requires computing likelihood ratios between teacher and student models, but this becomes challenging when models use different tokenizers (e.g., edge devices need smaller vocabularies to reduce memory overhead).

Method: Leverages the implicit recursive structure of Byte-Pair Encoding (BPE) to create a probabilistic framework for cross-tokenizer likelihood scoring. Handles two scenarios: 1) student vocabulary is a subset of teacher vocabulary (exact likelihoods with O(1) evaluations), and 2) general arbitrary vocabularies (lossless procedure with fast approximation).
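
The summary does not spell out the scoring math, but the kernel of the subset case is that a student-vocabulary segmentation is also a valid sequence of teacher tokens, so the teacher can score it token by token with the chain rule. A bare-bones scorer along those lines (the checkpoint name is real; the paper's handling of BPE's preference for its canonical segmentation is omitted here):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
teacher = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")

@torch.no_grad()
def sequence_logprob(token_ids: list[int]) -> float:
    """Sum of log p(t_i | t_<i) under the teacher for a fixed segmentation."""
    ids = torch.tensor([token_ids])
    logits = teacher(ids).logits.log_softmax(-1)
    steps = logits[0, :-1].gather(1, ids[0, 1:, None]).squeeze(1)
    return steps.sum().item()

print(sequence_logprob(tok.encode("The capital of France is Paris.")))
```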

Result: For subset vocabularies: up to 12% memory reduction for the Qwen2.5-1.5B model with up to 4% performance improvement. For general vocabularies: more than 2% accuracy improvement on GSM8K mathematical reasoning over the state of the art.

Conclusion: The method successfully addresses vocabulary misalignment in knowledge distillation, enabling efficient cross-tokenizer likelihood computation with practical benefits for model compression and performance improvement.

Abstract: Computing next-token likelihood ratios between two language models (LMs) is a standard task in training paradigms such as knowledge distillation. Since this requires both models to share the same probability space, it becomes challenging when the teacher and student LMs use different tokenizers, for instance, when edge-device deployment necessitates a smaller vocabulary size to lower memory overhead. In this work, we address this vocabulary misalignment problem by uncovering an implicit recursive structure in the commonly deployed Byte-Pair Encoding (BPE) algorithm and utilizing it to create a probabilistic framework for cross-tokenizer likelihood scoring. Our method enables sequence likelihood evaluation for vocabularies different from the teacher model native tokenizer, addressing two specific scenarios: when the student vocabulary is a subset of the teacher vocabulary, and the general case where it is arbitrary. In the subset regime, our framework computes exact likelihoods and provides next-token probabilities for sequential sampling with only O(1) model evaluations per token. When used for distillation, this yields up to a 12% reduction in memory footprint for the Qwen2.5-1.5B model while also improving baseline performance up to 4% on the evaluated tasks. For the general case, we introduce a rigorous lossless procedure that leverages BPE recursive structure, complemented by a fast approximation that keeps large-vocabulary settings practical. Applied to distillation for mathematical reasoning, our approach improves GSM8K accuracy by more than 2% over the current state of the art.

[8] Evaluating Large Language Models on Multimodal Chemistry Olympiad Exams

Yiming Cui, Xin Yao, Yuxuan Qin, Xin Li, Shijin Wang, Guoping Hu

Main category: cs.CL

TL;DR: Current multimodal LLMs struggle with chemistry problems requiring visual-textual integration, with some performing better when images are removed, highlighting vision-language misalignment issues.

Motivation: Multimodal scientific reasoning, especially in chemistry, remains challenging for LLMs as it requires integrating symbolic diagrams, molecular structures, and visual data with textual information. There's a need to systematically evaluate how well current models handle these complex multimodal reasoning tasks.

Method: Systematically evaluated 40 proprietary and open-source multimodal LLMs (including GPT-5, o3, Gemini-2.5-Pro, Qwen2.5-VL) on a curated benchmark of Olympiad-style chemistry questions from over two decades of US National Chemistry Olympiad exams. Used Chain-of-Thought prompting and conducted ablation studies with occlusion-based interpretability methods.
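
The image-ablation finding is easy to operationalize; a hedged sketch of the comparison loop (the `ask` callable and dataset fields are stand-ins, not the authors' harness):

```python
# Score each multimodal question once with and once without its image to
# probe whether vision-language fusion helps or hurts.

def accuracy(dataset, ask):
    correct = 0
    for item in dataset:
        pred = ask(question=item["question"], image=item.get("image"))
        correct += int(pred == item["answer"])
    return correct / len(dataset)

def ablate_images(dataset, ask):
    with_img = accuracy(dataset, ask)
    no_img = accuracy([{**d, "image": None} for d in dataset], ask)
    # If no_img > with_img, the model is misaligned in vision-language fusion.
    return {"with_image": with_img, "without_image": no_img}
```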

Result: Many models struggle with modality fusion - in some cases, removing images actually improves accuracy, indicating misalignment in vision-language integration. Chain-of-Thought prompting consistently enhances both accuracy and visual grounding. Models show critical limitations in scientific reasoning abilities.

Conclusion: Current MLLMs have significant limitations in scientific reasoning, particularly in chemistry. The work provides a benchmark for measuring progress and identifies actionable strategies for developing more robust, interpretable multimodal systems, highlighting the need for advances at the intersection of AI and scientific reasoning.

Abstract: Multimodal scientific reasoning remains a significant challenge for large language models (LLMs), particularly in chemistry, where problem-solving relies on symbolic diagrams, molecular structures, and structured visual data. Here, we systematically evaluate 40 proprietary and open-source multimodal LLMs, including GPT-5, o3, Gemini-2.5-Pro, and Qwen2.5-VL, on a curated benchmark of Olympiad-style chemistry questions drawn from over two decades of U.S. National Chemistry Olympiad (USNCO) exams. These questions require integrated visual and textual reasoning across diverse modalities. We find that many models struggle with modality fusion, where in some cases, removing the image even improves accuracy, indicating misalignment in vision-language integration. Chain-of-Thought prompting consistently enhances both accuracy and visual grounding, as demonstrated through ablation studies and occlusion-based interpretability. Our results reveal critical limitations in the scientific reasoning abilities of current MLLMs, providing actionable strategies for developing more robust and interpretable multimodal systems in chemistry. This work provides a timely benchmark for measuring progress in domain-specific multimodal AI and underscores the need for further advances at the intersection of artificial intelligence and scientific reasoning.

[9] DASH: Dialogue-Aware Similarity and Handshake Recognition for Topic Segmentation in Public-Channel Conversations

Sijin Sun, Liangbin Zhao, Ming Deng, Xiuju Fu

Main category: cs.CL

TL;DR: DASH-DTS is an LLM-based framework for dialogue topic segmentation that uses dialogue handshake recognition, similarity-guided example selection, and selective sample generation to improve segmentation accuracy in task-oriented communications like maritime VHF dialogues.

Motivation: Traditional methods struggle with dialogue topic segmentation in task-oriented public-channel communications like maritime VHF dialogues, which feature informal speech and implicit topic transitions. There's a need for more effective segmentation methods to support operational monitoring and decision-making.

Method: DASH-DTS uses three core techniques: (1) topic shift detection via dialogue handshake recognition, (2) contextual enhancement through similarity-guided example selection, and (3) generation of selective positive and negative samples to improve model discrimination and robustness. The framework also provides interpretable reasoning and confidence scores.
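
Similarity-guided example selection typically means embedding the query dialogue and retrieving the nearest labeled examples for the prompt; a sketch under that assumption (the embedding model and example pool are illustrative, not the paper's):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
example_pool = [
    ("Vessel A, this is Vessel B, over. -- Go ahead. -- Request passing "
     "on your port side.", "same_topic"),
    ("...confirm ETA 0800. -- Roger, out. || Harbour Control, this is "
     "Vessel C, radio check.", "topic_shift"),
]

def top_k_examples(query: str, k: int = 1):
    """Return the k labeled dialogues most similar to the query."""
    texts = [t for t, _ in example_pool]
    emb = encoder.encode([query] + texts, normalize_embeddings=True)
    sims = emb[1:] @ emb[0]                      # cosine similarity
    order = np.argsort(-sims)[:k]
    return [example_pool[i] for i in order]

print(top_k_examples("Vessel D, this is Vessel E, request overtaking, over."))
```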

Result: The framework achieves state-of-the-art segmentation accuracy on both the new VHF-Dial dataset (first public dataset of real-world maritime VHF communications) and standard benchmarks. It establishes a strong foundation for stable monitoring and decision support in operational dialogues.

Conclusion: DASH-DTS effectively addresses the challenges of dialogue topic segmentation in informal, task-oriented communications through its innovative LLM-based approach, providing both high accuracy and interpretability for operational applications.

Abstract: Dialogue Topic Segmentation (DTS) is crucial for understanding task-oriented public-channel communications, such as maritime VHF dialogues, which feature informal speech and implicit transitions. To address the limitations of traditional methods, we propose DASH-DTS, a novel LLM-based framework. Its core contributions are: (1) topic shift detection via dialogue handshake recognition; (2) contextual enhancement through similarity-guided example selection; and (3) the generation of selective positive and negative samples to improve model discrimination and robustness. Additionally, we release VHF-Dial, the first public dataset of real-world maritime VHF communications, to advance research in this domain. DASH-DTS provides interpretable reasoning and confidence scores for each segment. Experimental results demonstrate that our framework achieves several sota segmentation trusted accuracy on both VHF-Dial and standard benchmarks, establishing a strong foundation for stable monitoring and decision support in operational dialogues.

[10] SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification

Hongbo Wang, MaungMaung AprilPyone, Isao Echizen

Main category: cs.CL

TL;DR: SGM is a white-box neuron-level intervention that selectively recalibrates toxic expert neurons via soft suppression to detoxify multimodal LLMs without parameter updates, reducing harmful rates from 48.2% to 2.5% while preserving model performance.

Motivation: Multimodal LLMs inherit toxic, biased, and NSFW content from weakly curated pretraining data, creating safety risks. Existing training-free detoxification methods struggle with adversarial triggers and lack interpretability.

Method: SGM (Safety Glasses for Multimodal models) uses expertise-weighted soft suppression to selectively recalibrate a small set of toxic expert neurons. It’s a white-box neuron-level intervention that neutralizes harmful cross-modal activations without parameter updates. The paper also introduces the MM-TOXIC-QA evaluation framework.
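
Soft suppression of specific neurons can be implemented with a forward hook; a toy sketch (the neuron indices and expertise weights below are made up, whereas the paper identifies toxic expert neurons empirically and derives suppression strength from their expertise):

```python
import torch
import torch.nn as nn

toxic_neurons = {3: 0.9, 17: 0.6}   # unit index -> expertise weight in [0, 1]

def soft_suppress_hook(module, inputs, output):
    """Dampen (not zero) the activations of flagged neurons."""
    scaled = output.clone()
    for idx, expertise in toxic_neurons.items():
        scaled[..., idx] = scaled[..., idx] * (1.0 - expertise)
    return scaled

layer = nn.Linear(32, 32)           # stand-in for an MLP layer inside an MLLM
handle = layer.register_forward_hook(soft_suppress_hook)

x = torch.randn(2, 32)
y = layer(x)                        # units 3 and 17 are recalibrated on the fly
handle.remove()                     # no parameters were updated at any point
```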

Result: SGM reduces toxicity rates from 48.2% to 2.5% in standard and adversarial conditions while preserving fluency and multimodal reasoning. SGM* (combined defenses) integrates with existing methods for stronger safety performance.

Conclusion: SGM provides an interpretable, low-cost solution for toxicity-controlled multimodal generation through selective neuron-level intervention, offering effective detoxification without compromising model capabilities.

Abstract: Disclaimer: Samples in this paper may be harmful and cause discomfort. Multimodal large language models (MLLMs) enable multimodal generation but inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora, causing safety risks, especially under adversarial triggers that late-stage, opaque, training-free detoxification methods struggle to handle. We propose SGM, a white-box neuron-level multimodal intervention that acts like safety glasses for toxic neurons: it selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. We establish MM-TOXIC-QA, a multimodal toxicity evaluation framework, and compare SGM with existing detoxification techniques. Experiments on open-source MLLMs show that SGM mitigates toxicity in standard and adversarial conditions, cutting harmful rates from 48.2% to 2.5% while preserving fluency and multimodal reasoning. SGM is extensible, and its combined defenses, denoted as SGM*, integrate with existing detoxification methods for stronger safety performance, providing an interpretable, low-cost solution for toxicity-controlled multimodal generation.

[11] LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward

Yi Zhao, Siqi Wang, Jing Li

Main category: cs.CL

TL;DR: LaF-GRPO uses LLM-simulated VI user feedback to train VLMs for generating precise navigation instructions, with new NIG4VI dataset showing significant improvements over baselines.

Motivation: Navigation instruction generation for visually impaired individuals is critical but underexplored, with challenges in generating precise, in-situ instructions and scarcity of dedicated benchmarks.

Method: Propose LaF-GRPO (LLM-as-Follower GRPO) where an LLM simulates VI user responses to provide feedback rewards for post-training VLMs, plus introduce NIG4VI dataset with 27k samples for training/evaluation.
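
An LLM-as-Follower reward can be sketched as a judge model role-playing the user; the prompt wording and scoring values below are illustrative assumptions, not the paper's:

```python
# A judge LLM simulates a visually impaired pedestrian following the generated
# instruction; the simulated outcome is converted into a scalar reward that
# feeds a GRPO-style policy update, replacing costly human feedback.

FOLLOWER_PROMPT = """You are simulating a visually impaired pedestrian.
Scene: {scene}
Instruction: {instruction}
Could you follow this safely and reach the goal? Answer SUCCESS, UNSAFE,
or LOST, then explain briefly."""

def laf_reward(scene: str, instruction: str, judge_llm) -> float:
    verdict = judge_llm(FOLLOWER_PROMPT.format(scene=scene,
                                               instruction=instruction))
    if verdict.startswith("SUCCESS"):
        return 1.0
    if verdict.startswith("LOST"):
        return 0.2
    return 0.0   # UNSAFE: worst outcome for a navigation instruction
```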

Result: Zero-shot LaF-GRPO boosts BLEU by 14%; SFT+LaF-GRPO achieves METEOR of 0.542 vs. GPT-4o’s 0.323; qualitative analysis confirms more intuitive and safer instructions.

Conclusion: LaF-GRPO effectively enhances navigation instruction accuracy and usability for VI users while reducing real-world data collection costs, with NIG4VI dataset facilitating future research.

Abstract: Navigation instruction generation for visually impaired (VI) individuals (NIG-VI) is critical yet relatively underexplored. This study focuses on generating precise, in-situ, step-by-step navigation instructions that are practically usable for VI users. Specifically, we propose LaF-GRPO (LLM-as-Follower GRPO), where an LLM simulates VI user responses to navigation instructions, thereby providing feedback rewards to guide the post-training of a Vision-Language Model (VLM). This enhances instruction accuracy and usability while reducing costly real-world data collection needs. To address the scarcity of dedicated benchmarks in this field, we introduce NIG4VI, a 27k-sample open-source dataset to facilitate training and evaluation. It comprises diverse navigation scenarios with accurate spatial coordinates, supporting detailed and open-ended in-situ instruction generation. Experiments on NIG4VI demonstrate the effectiveness of LaF-GRPO through quantitative metrics (e.g., Zero-(LaF-GRPO) boosts BLEU 14%; SFT+(LaF-GRPO) METEOR 0.542 vs. GPT-4o 0.323), and qualitative analysis further confirms that our method yields more intuitive and safer instructions.

[12] The Meta-Prompting Protocol: Orchestrating LLMs via Adversarial Feedback Loops

Fanzhe Fu

Main category: cs.CL

TL;DR: The paper proposes Meta-Prompting Protocol, a formal framework using Adversarial Trinity (Generator, Auditor, Optimizer) to make LLMs deterministic and reliable for mission-critical applications.

Motivation: Current prompt engineering is heuristic-based and lacks deterministic guarantees needed for reliable software components in mission-critical applications.

Method: Introduces Meta-Prompting Protocol with Adversarial Trinity topology (Generator, Auditor, Optimizer), treats natural language instructions as differentiable variables in semantic computation graphs, uses textual critiques as gradients.
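
The Adversarial Trinity reduces to a generate-audit-optimize loop; a toy sketch (the roles are taken from the paper, but the stopping rule and callable interfaces are illustrative, with DSPy/TextGrad-style machinery abstracted away):

```python
# P generates, A critiques (the textual "gradient"), O rewrites the prompt.

def trinity_loop(task: str, generator, auditor, optimizer, max_rounds: int = 3):
    prompt = task
    draft = generator(prompt)                  # P: produce a candidate output
    for _ in range(max_rounds):
        critique = auditor(task, draft)        # A: textual critique
        if critique.strip() == "PASS":
            return draft                       # audited output released
        prompt = optimizer(prompt, critique)   # O: apply the textual gradient
        draft = generator(prompt)
    return draft
```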

Result: Demonstrates theoretical viability using declarative programming paradigms (DSPy) and automatic textual differentiation (TextGrad), mitigates hallucination and prevents model collapse.

Conclusion: Establishes foundation for “Observable Software Engineering” in probabilistic computing era, enabling LLMs to transition from stochastic chat interfaces to reliable software components.

Abstract: The transition of Large Language Models (LLMs) from stochastic chat interfaces to reliable software components necessitates a fundamental re-engineering of interaction paradigms. Current methodologies, predominantly heuristic-based “prompt engineering,” fail to provide the deterministic guarantees required for mission-critical applications. We introduce the Meta-Prompting Protocol, a rigorous theoretical framework that formalizes the orchestration of LLMs as a programmable, self-optimizing system. Central to this protocol is the Adversarial Trinity, a tripartite topology comprising a Generator (P), an Auditor (A), and an Optimizer (O). By treating natural language instructions as differentiable variables within a semantic computation graph and utilizing textual critiques as gradients, this architecture mitigates hallucination and prevents model collapse. We demonstrate the theoretical viability of this approach using declarative programming paradigms (DSPy) and automatic textual differentiation (TextGrad), establishing a foundation for “Observable Software Engineering” in the era of probabilistic computing.

[13] Beyond Majority Voting: Towards Fine-grained and More Reliable Reward Signal for Test-Time Reinforcement Learning

Weiqin Wang, Yile Wang, Kehao Chen, Hui Huang

Main category: cs.CL

TL;DR: SCOPE improves test-time reinforcement learning for LLMs by using step-wise confidence-weighted pseudo-labels and dynamic subgroup partitioning to address confirmation bias and sparse rewards, achieving significant performance gains on challenging benchmarks.

Motivation: Existing test-time RL methods using majority voting suffer from confirmation bias and sparse rewards, limiting LLM reasoning improvement. There's a need for better pseudo-label estimation that balances reasoning quality with exploration diversity.

Method: SCOPE integrates step-wise confidence into pseudo-label deduction (prioritizing high-quality reasoning paths) and dynamically partitions candidate outputs into subgroups balancing reasoning quality against exploration diversity. It derives local consensus via repeat sampling for each subgroup.
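
The contrast with majority voting is easiest to see in code; a minimal sketch of confidence-weighted pseudo-labeling (the confidence values and aggregation are placeholders for the paper's step-level scores, and the subgroup partitioning is omitted):

```python
from collections import defaultdict

# Each sampled solution votes with its aggregated step-wise confidence
# instead of counting equally, so high-quality paths carry more weight.

samples = [
    {"answer": "42", "step_confidences": [0.9, 0.8, 0.95]},
    {"answer": "42", "step_confidences": [0.7, 0.6, 0.8]},
    {"answer": "41", "step_confidences": [0.99, 0.98, 0.97]},
]

def pseudo_label(samples):
    votes = defaultdict(float)
    for s in samples:
        confs = s["step_confidences"]
        path_conf = sum(confs) / len(confs)   # aggregate step-wise confidence
        votes[s["answer"]] += path_conf
    return max(votes, key=votes.get)

print(pseudo_label(samples))  # "42": two medium-confidence paths outweigh one
```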

Result: SCOPE consistently outperforms recent baselines across various models and benchmarks, achieving relative improvements of 13.1% on AIME 2025 and 8.1% on AMC.

Conclusion: SCOPE effectively addresses limitations of majority voting in test-time RL by integrating model confidence and dynamic subgroup partitioning, providing diverse supervision targets to encourage broader exploration and improve LLM reasoning ability.

Abstract: Test-time reinforcement learning mitigates the reliance on annotated data by using majority voting results as pseudo-labels, emerging as a complementary direction to reinforcement learning with verifiable rewards (RLVR) for improving the reasoning ability of large language models (LLMs). However, this voting strategy often induces confirmation bias and suffers from sparse rewards, limiting the overall performance. In this work, we propose subgroup-specific step-wise confidence-weighted pseudo-label estimation (SCOPE), a framework integrating model confidence and dynamic subgroup partitioning to address these issues. Specifically, SCOPE integrates the proposed step-wise confidence into pseudo-label deduction, prioritizing high-quality reasoning paths over simple frequency counts. Furthermore, it dynamically partitions the candidate output pool into independent subgroups by balancing reasoning quality against exploration diversity. By deriving local consensus via repeat sampling for each subgroup, SCOPE provides diverse supervision targets to encourage broader exploration. We conduct experiments across various models and benchmarks, and experimental results show that SCOPE consistently outperforms recent baselines. Notably, SCOPE achieves relative improvements of 13.1% on the challenging AIME 2025 and 8.1% on AMC. The code is released at https://github.com/szu-tera/SCOPE.

[14] Rakuten Data Release: A Large-Scale and Long-Term Reviews Corpus for Hotel Domain

Yuki Nakayama, Koki Hikichi, Yun Ching Liu, Yu Hirate

Main category: cs.CL

TL;DR: Large-scale Rakuten Travel Reviews corpus with 7.3M reviews spanning 2009-2024, featuring rich metadata and an analysis of data drift between 2019 and 2024.

Motivation: To provide a comprehensive, large-scale dataset of travel reviews for research in natural language processing, recommendation systems, and hospitality analytics, enabling longitudinal studies and understanding of data drift patterns.

Method: Collection of 7.3 million customer reviews from Rakuten Travel spanning 16 years (2009-2024), with detailed metadata extraction including review text, responses, user/anonymized IDs, dates, accommodation details, plan information, room types, purpose, accompanying groups, and multi-aspect ratings.
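
One standard statistical drift check for such a corpus is comparing per-year distributions; a small illustrative example (the counts below are invented, and the summary does not specify which statistics the authors used):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Compare the overall-score distribution of two years with the
# Jensen-Shannon distance: 0 means identical distributions.

ratings_2019 = np.array([120, 340, 900, 2100, 3500], dtype=float)  # 1..5 stars
ratings_2024 = np.array([200, 410, 820, 1900, 4100], dtype=float)

p = ratings_2019 / ratings_2019.sum()
q = ratings_2024 / ratings_2024.sum()
print(f"JS distance 2019 vs 2024: {jensenshannon(p, q):.4f}")
```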

Result: Created a comprehensive corpus with rich statistical information and insights into data drift patterns between 2019-2024 using statistical approaches, providing valuable resource for temporal analysis and understanding review dynamics.

Conclusion: The Rakuten Travel Reviews corpus represents a valuable longitudinal dataset for hospitality and NLP research, with demonstrated utility for analyzing data drift and temporal patterns in customer feedback over extended periods.

Abstract: This paper presents a large-scale corpus of Rakuten Travel Reviews. Our collection contains 7.3 million customer reviews for 16 years, ranging from 2009 to 2024. Each record in the dataset contains the review text, its response from an accommodation, an anonymized reviewer ID, review date, accommodation ID, plan ID, plan title, room type, room name, purpose, accompanying group, and user ratings from different aspect categories, as well as an overall score. We present statistical information about our corpus and provide insights into factors driving data drift between 2019 and 2024 using statistical approaches.

[15] MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers

Xuanjun Zong, Zhiqi Shen, Lei Wang, Yunshi Lan, Chao Yang

Main category: cs.CL

TL;DR: MCP-SafetyBench is a comprehensive safety benchmark for evaluating LLM agents using Model Context Protocol (MCP), covering realistic multi-turn workflows across five domains with 20 attack types, revealing significant safety vulnerabilities in current models.

Motivation: As LLMs evolve into agentic systems using MCP to connect with external tools, the openness and multi-server workflows introduce new safety risks that existing benchmarks fail to capture, requiring a more comprehensive evaluation framework.

Method: Built MCP-SafetyBench on real MCP servers with realistic multi-turn evaluation across five domains (browser automation, financial analysis, location navigation, repository management, web search). Incorporated unified taxonomy of 20 MCP attack types spanning server, host, and user sides, including tasks requiring multi-step reasoning and cross-server coordination under uncertainty.

Result: Systematic evaluation of leading open- and closed-source LLMs revealed large disparities in safety performance and escalating vulnerabilities as task horizons and server interactions grow, highlighting significant safety gaps.

Conclusion: The results demonstrate urgent need for stronger defenses in MCP deployments, and establish MCP-SafetyBench as a foundation for diagnosing and mitigating safety risks in real-world LLM agent systems.

Abstract: Large language models (LLMs) are evolving into agentic systems that reason, plan, and operate external tools. The Model Context Protocol (MCP) is a key enabler of this transition, offering a standardized interface for connecting LLMs with heterogeneous tools and services. Yet MCP’s openness and multi-server workflows introduce new safety risks that existing benchmarks fail to capture, as they focus on isolated attacks or lack real-world coverage. We present MCP-SafetyBench, a comprehensive benchmark built on real MCP servers that supports realistic multi-turn evaluation across five domains: browser automation, financial analysis, location navigation, repository management, and web search. It incorporates a unified taxonomy of 20 MCP attack types spanning server, host, and user sides, and includes tasks requiring multi-step reasoning and cross-server coordination under uncertainty. Using MCP-SafetyBench, we systematically evaluate leading open- and closed-source LLMs, revealing large disparities in safety performance and escalating vulnerabilities as task horizons and server interactions grow. Our results highlight the urgent need for stronger defenses and establish MCP-SafetyBench as a foundation for diagnosing and mitigating safety risks in real-world MCP deployments.

[16] From NLG Evaluation to Modern Student Assessment in the Era of ChatGPT: The Great Misalignment Problem and Pedagogical Multi-Factor Assessment (P-MFA)

Mika Hämäläinen, Kimmo Leiviskä

Main category: cs.CL

TL;DR: Paper explores parallels between NLG evaluation and student grading, identifies “Great Misalignment Problem” where traditional assessment fails with AI tools, and proposes P-MFA model as solution.

Motivation: The motivation stems from the growing similarity between evaluating natural language generation systems and assessing student work in academic settings. Both domains face a "Great Misalignment Problem" where traditional assessment methods focusing on final outputs have become invalid as students increasingly use AI tools like ChatGPT to produce sophisticated work.

Method: The paper introduces the Pedagogical Multi-Factor Assessment (P-MFA) model, which is a process-based, multi-evidence framework inspired by multi-factor authentication logic. This approach shifts focus from evaluating final products to assessing learning processes through multiple evidence sources.

Result: The abstract doesn’t present empirical results, but proposes a conceptual framework (P-MFA) to address assessment validity issues in both NLG evaluation and student grading contexts.

Conclusion: Traditional assessment methods that focus on final outputs are misaligned with current realities where AI tools can produce sophisticated work. A process-based, multi-evidence approach like P-MFA is needed to restore validity to assessment in both NLG evaluation and academic grading.

Abstract: This paper explores the growing epistemic parallel between NLG evaluation and grading of students in a Finnish University. We argue that both domains are experiencing a Great Misalignment Problem. As students increasingly use tools like ChatGPT to produce sophisticated outputs, traditional assessment methods that focus on final products rather than learning processes have lost their validity. To address this, we introduce the Pedagogical Multi-Factor Assessment (P-MFA) model, a process-based, multi-evidence framework inspired by the logic of multi-factor authentication.

[17] RFKG-CoT: Relation-Driven Adaptive Hop-count Selection and Few-Shot Path Guidance for Knowledge-Aware QA

Chao Zhang, Minghan Li, Tianrui Lv, Guodong Zhou

Main category: cs.CL

TL;DR: RFKG-CoT improves knowledge graph reasoning for LLMs by dynamically adjusting hop counts based on relations and using few-shot path guidance, reducing hallucinations in QA tasks.

Motivation: LLMs often generate hallucinations in knowledge-intensive QA due to parametric knowledge limitations. Existing methods like KG-CoT have rigid hop-count selection (solely question-driven) and underutilize reasoning paths (lack of guidance).

Method: 1) Relation-driven adaptive hop-count selector that dynamically adjusts reasoning steps by activating KG relations (e.g., 1-hop for direct relations, 2-hop for indirect chains) via relation mask. 2) Few-shot in-context learning path guidance with CoT that constructs examples in “question-paths-answer” format to enhance LLMs’ path understanding.
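
The hop-count idea can be illustrated with a toy lookup; note the paper learns a relation mask rather than using hand-written rules like these:

```python
# Relation-driven hop selection: the relation implied by the question
# determines how many KG edges the reasoning path should traverse.

DIRECT_RELATIONS = {"brother", "spouse", "capital"}
CHAIN_RELATIONS = {"grandfather": ["father", "father"],
                   "uncle": ["father", "brother"]}

def select_hops(question_relation: str) -> int:
    if question_relation in DIRECT_RELATIONS:
        return 1                       # a single KG edge answers the question
    if question_relation in CHAIN_RELATIONS:
        return len(CHAIN_RELATIONS[question_relation])
    return 2                           # fallback for unseen relations

print(select_hops("brother"))       # 1
print(select_hops("grandfather"))   # 2
```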

Result: Experiments on four KGQA benchmarks show RFKG-CoT improves accuracy by up to 14.7 percentage points (Llama2-7B on WebQSP) over KG-CoT. Ablations confirm the hop-count selector and path prompt are complementary.

Conclusion: RFKG-CoT transforms KG evidence into more faithful answers by addressing rigid hop selection and path underutilization through adaptive relation-driven reasoning and guided path learning.

Abstract: Large language models (LLMs) often generate hallucinations in knowledge-intensive QA due to parametric knowledge limitations. While existing methods like KG-CoT improve reliability by integrating knowledge graph (KG) paths, they suffer from rigid hop-count selection (solely question-driven) and underutilization of reasoning paths (lack of guidance). To address this, we propose RFKG-CoT: First, it replaces the rigid hop-count selector with a relation-driven adaptive hop-count selector that dynamically adjusts reasoning steps by activating KG relations (e.g., 1-hop for direct “brother” relations, 2-hop for indirect “father-son” chains), formalized via a relation mask. Second, it introduces a few-shot in-context learning path guidance mechanism with CoT (think) that constructs examples in a “question-paths-answer” format to enhance LLMs’ ability to understand reasoning paths. Experiments on four KGQA benchmarks show RFKG-CoT improves accuracy by up to 14.7 pp (Llama2-7B on WebQSP) over KG-CoT. Ablations confirm the hop-count selector and the path prompt are complementary, jointly transforming KG evidence into more faithful answers.

[18] Yes-MT’s Submission to the Low-Resource Indic Language Translation Shared Task in WMT 2024

Yash Bhaskar, Parameswari Krishnamurthy

Main category: cs.CL

TL;DR: The Yes-MT team’s WMT 2024 submission explored multiple approaches for low-resource Indic language translation between English and Assamese, Mizo, Khasi, and Manipuri, including fine-tuning pre-trained models, LLM prompting, and training from scratch.

Motivation: To address the challenges of low-resource machine translation for Indic languages and explore various approaches including pre-trained models, LLMs, and traditional methods to improve translation quality for these under-resourced languages.

Method: Multiple approaches were tested: 1) fine-tuning pre-trained models (mT5, IndicBart) in multilingual/monolingual settings, 2) LoRA fine-tuning of IndicTrans2, 3) zero-shot/few-shot prompting with LLMs (Llama 3, Mixtral 8x7b), 4) LoRA supervised fine-tuning of Llama 3, and 5) training Transformer models from scratch.

Result: Results were evaluated on WMT23 test data using SacreBLEU and CHRF metrics, highlighting the challenges of low-resource translation and showing the potential of LLMs, particularly when fine-tuned, for these tasks.

Conclusion: Low-resource Indic language translation remains challenging, but LLMs show promise, especially with fine-tuning approaches like LoRA, though further work is needed to improve translation quality for these under-resourced languages.

Abstract: This paper presents the systems submitted by the Yes-MT team for the Low-Resource Indic Language Translation Shared Task at WMT 2024 (Pakray et al., 2024), focusing on translating between English and the Assamese, Mizo, Khasi, and Manipuri languages. The experiments explored various approaches, including fine-tuning pre-trained models like mT5 (Xue et al., 2020) and IndicBart (Dabre et al., 2021) in both multilingual and monolingual settings, LoRA (Hu et al., 2021) fine-tuning IndicTrans2 (Gala et al., 2023), zero-shot and few-shot prompting (Brown, 2020) with large language models (LLMs) like Llama 3 (Dubey et al., 2024) and Mixtral 8x7b (Jiang et al., 2024), LoRA supervised fine-tuning of Llama 3 (Mecklenburg et al., 2024), and training Transformer models (Vaswani, 2017) from scratch. The results were evaluated on the WMT23 Low-Resource Indic Language Translation Shared Task test data using SacreBLEU (Post, 2018) and CHRF (Popovic, 2015), highlighting the challenges of low-resource translation and the potential of LLMs for these tasks, particularly with fine-tuning.

[19] FAME: Fictional Actors for Multilingual Erasure

Claudio Savelli, Moreno La Quatra, Alkis Koudounas, Flavio Giobergia

Main category: cs.CL

TL;DR: FAME is a synthetic multilingual benchmark for evaluating machine unlearning in LLMs across 5 languages, supporting both entity-level and instance-level forgetting with fictional actor biographies.

Motivation: Existing unlearning benchmarks for LLMs are limited to English and only support entity-level forgetting, lacking multilingual evaluation and finer-grained instance-level forgetting capabilities.

Method: Created FAME benchmark with 1,000 fictional actor biographies and 20,000 QA pairs across English, French, German, Italian, and Spanish. Biographies include 20 topics organized into structured categories, with dataset splits for entity-level and instance-level unlearning scenarios.

Result: Provides a controlled evaluation benchmark that ensures information was never encountered during pretraining, enabling systematic comparison of unlearning techniques across languages and both forgetting granularities.

Conclusion: FAME addresses limitations of existing benchmarks by offering multilingual support and both entity/instance-level unlearning evaluation, facilitating better assessment of machine unlearning techniques in LLMs.

Abstract: LLMs trained on web-scale data raise concerns about privacy and the right to be forgotten. To address these issues, Machine Unlearning provides techniques to remove specific information from trained models without retraining from scratch. However, existing benchmarks for evaluating unlearning in LLMs face two major limitations: they focus only on English and support only entity-level forgetting (removing all information about a person). We introduce FAME (Fictional Actors for Multilingual Erasure), a synthetic benchmark for evaluating Machine Unlearning across five languages: English, French, German, Italian, and Spanish. FAME contains 1,000 fictional actor biographies and 20,000 question-answer pairs. Each biography includes information on 20 topics organized into structured categories (biography, career, achievements, personal information). This design enables both entity-level unlearning (i.e., forgetting entire identities) and instance-level unlearning (i.e., forgetting specific facts while retaining others). We provide two dataset splits to support these two different unlearning scenarios and enable systematic comparison of unlearning techniques across languages. Since FAME uses entirely fictional data, it ensures that the information was never encountered during model pretraining, allowing for a controlled evaluation of unlearning methods.

[20] The Moralization Corpus: Frame-Based Annotation and Analysis of Moralizing Speech Acts across Diverse Text Genres

Maria Becker, Mirko Sommer, Lars Tapken, Yi Wan Teh, Bruno Brocai

Main category: cs.CL

TL;DR: This paper introduces the Moralization Corpus, a multi-genre German dataset for analyzing moral arguments, develops a frame-based annotation scheme, and evaluates LLMs for moralization detection and component extraction.

Motivation: Moralizations (arguments using moral values to justify positions) are an underexplored form of persuasive communication that is pragmatically complex and often implicit, posing challenges for both human annotators and NLP systems.

Method: Developed a frame-based annotation scheme capturing moral values, demands, and discourse protagonists; applied it to diverse German texts (political debates, news articles, online discussions); evaluated LLMs under varied prompting conditions for moralization detection and component extraction.

Result: Detailed prompt instructions had a greater effect than few-shot or explanation-based prompting; moralization detection remains highly subjective and context-sensitive; the corpus enables fine-grained analysis across communicative formats and domains.

Conclusion: The Moralization Corpus provides resources for interdisciplinary research on moral discourse; moralization analysis is challenging due to subjectivity and context-sensitivity; released all data, guidelines, and code to foster future research.

Abstract: Moralizations - arguments that invoke moral values to justify demands or positions - are a yet underexplored form of persuasive communication. We present the Moralization Corpus, a novel multi-genre dataset designed to analyze how moral values are strategically used in argumentative discourse. Moralizations are pragmatically complex and often implicit, posing significant challenges for both human annotators and NLP systems. We develop a frame-based annotation scheme that captures the constitutive elements of moralizations - moral values, demands, and discourse protagonists - and apply it to a diverse set of German texts, including political debates, news articles, and online discussions. The corpus enables fine-grained analysis of moralizing language across communicative formats and domains. We further evaluate several large language models (LLMs) under varied prompting conditions for the task of moralization detection and moralization component extraction and compare them to human annotations in order to investigate the challenges of automatic and manual analysis of moralizations. Results show that detailed prompt instructions have a greater effect than few-shot or explanation-based prompting, and that moralization remains a highly subjective and context-sensitive task. We release all data, annotation guidelines, and code to foster future interdisciplinary research on moral discourse and moral reasoning in NLP.

[21] SynGP500: A Clinically-Grounded Synthetic Dataset of Australian General Practice Medical Notes

Piyawoot Songsiritat

Main category: cs.CL

TL;DR: SynGP500 is a clinician-curated collection of 500 synthetic Australian general practice medical notes designed to address data scarcity and privacy issues in clinical NLP research.

Motivation: The paper addresses the critical gap in Australian general practice data for clinical NLP research, where real patient data faces privacy restrictions and existing datasets are often sanitized or don't reflect the full breadth of conditions GPs must recognize.

Method: Created 500 synthetic medical notes using clinician curation, integrating RACGP 2022 Curriculum for clinical breadth, BEACH study for epidemiologically-calibrated prevalence, and diverse consultation contexts. Designed to be “messy” with authentic complexities like telegraphic documentation, typos, patient non-adherence, and socioeconomic barriers.

Result: Multi-faceted validation shows epidemiological alignment with real Australian GP patterns, high linguistic variation, broad semantic diversity, and exploratory downstream evaluation demonstrates F1 improvements in self-supervised medical concept extraction.

Conclusion: SynGP500 provides a valuable resource for developing and evaluating clinical NLP methods for Australian general practice while inherently protecting patient privacy, addressing both data scarcity and privacy concerns.

Abstract: We introduce SynGP500, a clinician-curated collection of 500 synthetic Australian general practice medical notes. The dataset integrates curriculum-based clinical breadth (RACGP 2022 Curriculum), epidemiologically-calibrated prevalence (BEACH study), and diverse consultation contexts. This approach systematically includes both common presentations and less-common curriculum-specified conditions that GPs must recognize but appear infrequently in single practice populations, potentially supporting more generalizable model training than datasets constrained by naturally occurring case distributions. SynGP500 is messy by design, reflecting the authentic complexity of healthcare delivery: telegraphic documentation, typos, patient non-adherence, socioeconomic barriers, and clinician-patient disagreements, unlike sanitized synthetic datasets that obscure clinical realities. Multi-faceted validation demonstrates dataset quality through epidemiological alignment with real Australian GP consultation patterns (BEACH study), stylometric analysis confirming high linguistic variation, semantic diversity analysis demonstrating broad coverage, and exploratory downstream evaluation using self-supervised medical concept extraction, showing F1 improvements. SynGP500 addresses a critical national gap, providing researchers and educators with a resource for developing and evaluating clinical NLP methods for Australian general practice while inherently protecting patient privacy.

[22] Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning

Yiliu Sun, Zicheng Zhao, Yang Wei, Yanfang Zhang, Chen Gong

Main category: cs.CL

TL;DR: PPPO is a new RLVR method that focuses optimization on prefix tokens in LLM reasoning, achieving better performance with fewer training tokens by leveraging the Beginning Lock-in Effect.

DetailsMotivation: Current RLVR approaches uniformly train all generated tokens, wasting effort on low-return tokens and missing opportunities to optimize high-impact prefix tokens that significantly influence subsequent reasoning.

Method: PPPO focuses on prefix token optimization using two strategies: (1) Progressive Prefix Retention - gradually increasing retained prefix tokens during training, and (2) Continuation Accumulated Reward - sampling multiple continuations for each prefix and accumulating their scores as reward signals.
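
As a concrete illustration of the two strategies, here is a minimal Python sketch of how a prefix-level reward could be accumulated over sampled continuations and how the retained prefix could grow over training; the `policy` and `verifier` interfaces are hypothetical stand-ins, not the paper's implementation.

```python
def continuation_accumulated_reward(prefix_ids, policy, verifier, num_continuations=4):
    """Score one reasoning prefix by sampling several continuations and
    accumulating their verifiable rewards (hypothetical interfaces)."""
    total = 0.0
    for _ in range(num_continuations):
        continuation = policy.sample(prefix_ids)             # roll the prefix out to a final answer
        total += verifier.score(prefix_ids + continuation)   # e.g., 1.0 if the answer checks out
    return total / num_continuations

def retained_prefix_length(full_length, step, total_steps, start_frac=0.1):
    """Progressive Prefix Retention: linearly grow the optimized prefix from
    start_frac of the output toward the full output as training proceeds."""
    frac = start_frac + (1.0 - start_frac) * (step / total_steps)
    return max(1, int(full_length * frac))
```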

Result: PPPO outperforms existing RLVR methods on various reasoning tasks, achieving 18.02% accuracy improvement while using only 26.17% of training tokens.

Conclusion: Targeting prefix tokens in RLVR training is more effective than uniform token optimization, leveraging the Beginning Lock-in Effect to improve LLM reasoning with significantly reduced computational cost.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically conduct training across all generated tokens, but neglect to explore which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort on optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs. Specifically, inspired by the well-established human thinking theory of Path Dependence, where early-stage thoughts substantially constrain the subsequent thinking trajectory, we identify an analogous phenomenon in LLM reasoning termed the Beginning Lock-in Effect (BLE). PPPO leverages this finding by focusing its optimization objective on the prefix reasoning process of LLMs. This targeted optimization strategy can positively influence subsequent reasoning processes, and ultimately improve final results. To help LLMs learn to start their reasoning with high quality, PPPO introduces two training strategies: (a) Progressive Prefix Retention, which shapes a progressive learning process by increasing the proportion of retained prefix tokens during training; (b) Continuation Accumulated Reward, which mitigates reward bias by sampling multiple continuations for one prefix token sequence and accumulating their scores as the reward signal. Extensive experimental results on various reasoning tasks demonstrate that our proposed PPPO outperforms representative RLVR methods, with accuracy improvements of 18.02% while using only 26.17% of the training tokens.

[23] Towards Proactive Personalization through Profile Customization for Individual Users in Dialogues

Xiaotian Zhang, Yuan Wang, Ruizhe Chen, Zeya Wang, Runchen Hou, Zuozhu Liu

Main category: cs.CL

TL;DR: PersonalAgent is a lifelong conversational agent that continuously learns and adapts to individual user preferences by decomposing dialogues into single-turn interactions and framing preference inference as sequential decision-making.

DetailsMotivation: Current LLM alignment techniques focus on universal human values or static single-turn preferences, failing to address long-term personalization needs and the user cold-start problem in interactive systems.

Method: PersonalAgent constructs and dynamically refines a unified user profile by decomposing dialogues into single-turn interactions and framing preference inference as a sequential decision-making task.
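
A minimal sketch of the profile-update loop, assuming a hypothetical `infer_preference` LLM call that returns an updated preference (or nothing) for a single turn; the actual agent's profile schema and decision policy are not specified in the summary.

```python
def update_profile(profile: dict, turn: dict, infer_preference) -> dict:
    """One sequential decision step: inspect a single-turn interaction and
    decide whether to add or revise a preference in the unified profile."""
    pref = infer_preference(profile, turn)   # hypothetical LLM call -> (key, value) or None
    if pref is not None:
        key, value = pref
        profile[key] = value                 # later evidence overrides earlier entries
    return profile

def build_profile(dialogue_turns, infer_preference):
    """Decompose a dialogue into single-turn interactions and fold them
    into one continuously refined user profile."""
    profile = {}
    for turn in dialogue_turns:
        profile = update_profile(profile, turn, infer_preference)
    return profile
```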

Result: PersonalAgent outperforms strong prompt-based and policy optimization baselines in both ideal and noisy conversational contexts while maintaining cross-session preference consistency. Human evaluation confirms it captures user preferences naturally and coherently.

Conclusion: Lifelong personalization is crucial for developing more inclusive and adaptive conversational agents, and PersonalAgent demonstrates effective continuous preference adaptation.

Abstract: The deployment of Large Language Models (LLMs) in interactive systems necessitates a deep alignment with the nuanced and dynamic preferences of individual users. Current alignment techniques predominantly address universal human values or static, single-turn preferences, thereby failing to address the critical needs of long-term personalization and the initial user cold-start problem. To bridge this gap, we propose PersonalAgent, a novel user-centric lifelong agent designed to continuously infer and adapt to user preferences. PersonalAgent constructs and dynamically refines a unified user profile by decomposing dialogues into single-turn interactions, framing preference inference as a sequential decision-making task. Experiments show that PersonalAgent achieves superior performance over strong prompt-based and policy optimization baselines, not only in idealized but also in noisy conversational contexts, while preserving cross-session preference consistency. Furthermore, human evaluation confirms that PersonalAgent excels at capturing user preferences naturally and coherently. Our findings underscore the importance of lifelong personalization for developing more inclusive and adaptive conversational agents. Our code is available here.

[24] Evaluating LLMs for Zeolite Synthesis Event Extraction (ZSEE): A Systematic Analysis of Prompting Strategies

Charan Prakash Rathore, Saumi Ray, Dhruv Kumar

Main category: cs.CL

TL;DR: LLMs show strong performance on high-level event classification (80-90% F1) but limited effectiveness on fine-grained parameter extraction (50-65% F1) for zeolite synthesis procedures, with advanced prompting providing minimal improvements over zero-shot approaches.

DetailsMotivation: To systematically evaluate LLMs for extracting structured information from zeolite synthesis experimental procedures, addressing the gap in domain-specific scientific information extraction methods.

Method: Evaluated four prompting strategies (zero-shot, few-shot, event-specific, reflection-based) across six state-of-the-art LLMs on four subtasks: event type classification, trigger text identification, argument role extraction, and argument text extraction using the ZSEE dataset of 1,530 annotated sentences.

Result: Strong performance on event type classification (80-90% F1) but modest performance on fine-grained extraction tasks (50-65% F1). GPT-5-mini showed extreme prompt sensitivity (11-79% F1 variation). Advanced prompting strategies provided minimal improvements over zero-shot approaches.

Conclusion: While LLMs achieve high-level understanding, precise extraction of experimental parameters requires domain-adapted models due to systematic hallucination, over-generalization, and inability to capture synthesis-specific nuances.

Abstract: Extracting structured information from zeolite synthesis experimental procedures is critical for materials discovery, yet existing methods have not systematically evaluated Large Language Models (LLMs) for this domain-specific task. This work addresses a fundamental question: what is the efficacy of different prompting strategies when applying LLMs to scientific information extraction? We focus on four key subtasks: event type classification (identifying synthesis steps), trigger text identification (locating event mentions), argument role extraction (recognizing parameter types), and argument text extraction (extracting parameter values). We evaluate four prompting strategies - zero-shot, few-shot, event-specific, and reflection-based - across six state-of-the-art LLMs (Gemma-3-12b-it, GPT-5-mini, O4-mini, Claude-Haiku-3.5, DeepSeek reasoning and non-reasoning) using the ZSEE dataset of 1,530 annotated sentences. Results demonstrate strong performance on event type classification (80-90% F1) but modest performance on fine-grained extraction tasks, particularly argument role and argument text extraction (50-65% F1). GPT-5-mini exhibits extreme prompt sensitivity with 11-79% F1 variation. Notably, advanced prompting strategies provide minimal improvements over zero-shot approaches, revealing fundamental architectural limitations. Error analysis identifies systematic hallucination, over-generalization, and inability to capture synthesis-specific nuances. Our findings demonstrate that while LLMs achieve high-level understanding, precise extraction of experimental parameters requires domain-adapted models, providing quantitative benchmarks for scientific information extraction.

[25] Why Your Academic Field Is Everywhere at Once: A Case Study of Arabic Linguistics

Ayman Eddakrouri, Amani Ramadan

Main category: cs.CL

TL;DR: Arabic Applied Linguistics research shows extreme thematic dispersion (Δ=0.194), indicating high heterogeneity across eight sub-disciplines rather than concentration in any single area.

DetailsMotivation: To analyze the thematic structure of contemporary Arabic Applied Linguistics research using Brookes' Measure of Categorical Dispersion to understand whether the field is concentrated or dispersed across sub-disciplines.

Method: Applied Brookes’ Measure of Categorical Dispersion (Δ) to a comprehensive dataset of 1,564 publications (2019-2025), classified into eight core sub-disciplines, calculating dispersion index and analyzing thematic distribution.
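
The summary does not reproduce Brookes' formula itself, so the sketch below uses a generic normalized concentration index with the same qualitative behavior (0 for a perfectly uniform spread across sub-disciplines, 1 for total concentration in one); it is a stand-in for illustration, not Brookes' Δ.

```python
from collections import Counter

def concentration_index(category_labels):
    """Generic concentration index in [0, 1]: 0 when publications are spread
    uniformly across categories (extreme dispersion), 1 when one category
    holds everything. A stand-in, not Brookes' exact formula."""
    counts = Counter(category_labels)
    k, n = len(counts), sum(counts.values())
    uniform = 1.0 / k
    deviation = sum(abs(c / n - uniform) for c in counts.values())
    return deviation / (2.0 * (1.0 - uniform))   # max-concentration case normalizes to 1
```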

Result: Found extremely low dispersion index of Δ=0.194, indicating extreme thematic dispersion. Computational Linguistics is dominant but not hegemonic, coexisting with robust research in Sociolinguistics, Language Teaching, and other subfields.

Conclusion: Arabic Applied Linguistics is characterized by pronounced heterogeneity rather than concentration. The study clarifies Brookes’ formula application, demonstrates its utility for field characterization, and provides a replicable bibliometric methodology for assessing disciplinary structure.

Abstract: This study applies Brookes’ Measure of Categorical Dispersion (Δ) to analyze the thematic structure of contemporary Arabic Applied Linguistics research. Using a comprehensive, real-world dataset of 1,564 publications from 2019 to 2025, classified into eight core sub-disciplines, we calculate a dispersion index of Δ = 0.194. This remarkably low value indicates extreme thematic dispersion, revealing that the field is characterized by pronounced heterogeneity rather than concentration. The analysis identifies Computational Linguistics as a dominant but non-hegemonic force, coexisting with robust research in Sociolinguistics, Language Teaching, and other subfields. This study clarifies the correct application of Brookes’ original formula, demonstrates its utility for field characterization, and provides a replicable bibliometric methodology for assessing disciplinary structure across domains.

[26] Adversarial versification in portuguese as a jailbreak operator in LLMs

Joao Queiroz

Main category: cs.CL

TL;DR: Versification (rewriting prompts as poetry) serves as an effective universal jailbreak mechanism against aligned LLMs, causing up to 18x more safety failures by exploiting structural vulnerabilities in current alignment methods.

DetailsMotivation: To demonstrate that current LLM alignment methods are vulnerable to structural attacks through versification, and to highlight the critical gap in evaluating these vulnerabilities in Portuguese, a language with rich poetic traditions and complex morphosyntactic features.

Method: The study uses versification (rewriting prose instructions as poetry) as an adversarial mechanism, testing it against aligned LLMs using benchmarks derived from MLCommons AILuminate. Both manually written poems and automated versions were tested in single-turn interactions.

Result: Versification causes significant safety failures: manually written poems achieve ~62% attack success rate (ASR), automated versions ~43%, with some models exceeding 90% success. The effect is structural and consistent across RLHF, constitutional AI, and hybrid pipelines, producing up to 18x more safety failures.

Conclusion: Current alignment methods are excessively dependent on surface patterns and vulnerable to structural attacks like versification, which displaces prompts into sparsely supervised regions. The absence of Portuguese evaluations represents a critical gap, requiring experimental protocols that parameterize scansion, metre, and prosodic variation specific to Lusophone patterns.

Abstract: Recent evidence shows that the versification of prompts constitutes a highly effective adversarial mechanism against aligned LLMs. The study ‘Adversarial poetry as a universal single-turn jailbreak mechanism in large language models’ demonstrates that instructions routinely refused in prose become executable when rewritten as verse, producing up to 18 x more safety failures in benchmarks derived from MLCommons AILuminate. Manually written poems reach approximately 62% ASR, and automated versions 43%, with some models surpassing 90% success in single-turn interactions. The effect is structural: systems trained with RLHF, constitutional AI, and hybrid pipelines exhibit consistent degradation under minimal semiotic formal variation. Versification displaces the prompt into sparsely supervised latent regions, revealing guardrails that are excessively dependent on surface patterns. This dissociation between apparent robustness and real vulnerability exposes deep limitations in current alignment regimes. The absence of evaluations in Portuguese, a language with high morphosyntactic complexity, a rich metric-prosodic tradition, and over 250 million speakers, constitutes a critical gap. Experimental protocols must parameterise scansion, metre, and prosodic variation to test vulnerabilities specific to Lusophone patterns, which are currently ignored.

[27] Dual-Density Inference for Efficient Language Model Reasoning

Zhengyi Zhao, Shubo Zhang, Yuxi Zhang, Huimin Wang, Binyang Li, Kam-Fai Wong

Main category: cs.CL

TL;DR: Denser is a framework that uses compressed, symbol-rich language for intermediate reasoning computations while maintaining human-readable final answers, reducing token usage by up to 62% compared to standard Chain-of-Thought methods.

DetailsMotivation: Current LLM approaches use uniform language density for both reasoning and answering, which is computationally inefficient. The authors observed that reasoning serves a computational function for the model while answering serves a communicative function for humans, enabling different density optimizations.

Method: Denser uses dual-density inference with three components: 1) query processing module to analyze input problems, 2) high-density compressed reasoning mechanism for efficient intermediate computations, and 3) answer generation component that translates compressed reasoning into human-readable solutions.
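
A minimal sketch of the dual-density idea using a hypothetical `llm(prompt)` callable: one call reasons in a compressed, symbol-rich register, and a second call translates only the outcome into prose. The prompts are illustrative, not the paper's.

```python
def dual_density_answer(question, llm):
    """Two-phase inference: dense symbolic reasoning for the model,
    human-readable prose only for the final answer."""
    dense_trace = llm(
        "Solve step by step in maximally compressed, symbol-rich notation "
        "(abbreviations, equations, no full sentences):\n" + question
    )
    return llm(
        "Given this compressed working, state the final answer in clear prose "
        "for a human reader:\n" + dense_trace
    )
```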

Result: Experimental evaluation across multiple reasoning QA benchmarks shows Denser reduces token consumption by up to 62% compared to standard Chain-of-Thought methods while preserving or improving accuracy. Efficiency gains are particularly significant for complex multi-step reasoning problems.

Conclusion: Separating language density for reasoning (computational) and answering (communicative) functions enables significant efficiency improvements in LLM inference without sacrificing accuracy, especially for complex reasoning tasks.

Abstract: Large Language Models (LLMs) have shown impressive capabilities in complex reasoning tasks. However, current approaches employ uniform language density for both intermediate reasoning and final answers, leading to computational inefficiency. We observe that the reasoning process serves a computational function for the model itself, while answering serves a communicative function for human understanding. This distinction enables the use of compressed, symbol-rich language for intermediate computations while maintaining human-readable final explanations. To address this inefficiency, we present Denser (Dual-density inference), a novel framework that optimizes information density separately for reasoning and answering phases. Our framework implements this through three components: a query processing module that analyzes input problems, a high-density compressed reasoning mechanism for efficient intermediate computations, and an answer generation component that translates compressed reasoning into human-readable solutions. Experimental evaluation across multiple reasoning question answering benchmarks demonstrates that Denser reduces token consumption by up to 62% compared to standard Chain-of-Thought methods while preserving or improving accuracy. These efficiency gains are particularly significant for complex multi-step reasoning problems where traditional methods generate extensive explanations.

[28] ORACLE: Time-Dependent Recursive Summary Graphs for Foresight on News Data Using LLMs

Lev Kharlashkin, Eiaki Morooka, Yehor Tereshchenko, Mika Hämäläinen

Main category: cs.CL

TL;DR: ORACLE is a system that transforms daily news into weekly, actionable insights for a Finnish University of Applied Sciences using news crawling, relevance filtering, PESTEL classification, and a Time-Dependent Recursive Summary Graph.

DetailsMotivation: To provide decision-ready, week-over-week insights from daily news for university stakeholders by processing and summarizing relevant information in a structured, timely manner.

Method: The platform crawls and versions news, applies university-specific relevance filtering, embeds content, classifies items into PESTEL dimensions, and builds a Time-Dependent Recursive Summary Graph (TRSG) with two clustering layers summarized by an LLM. A change detector highlights new, removed, or changed items, grouping differences into themes for PESTEL-aware analysis.
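
A rough sketch of the two-layer summary construction with scikit-learn k-means and a hypothetical `summarize` LLM call; the cluster counts, the reuse of centroids in place of re-embedding the layer-1 summaries, and all interfaces are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def two_layer_summary(embeddings, texts, summarize, k1=20, k2=5):
    """Cluster news items, summarize each cluster with an LLM, then cluster
    and summarize the summaries (the two TRSG layers, recomputed weekly)."""
    first = KMeans(n_clusters=k1, n_init=10).fit_predict(embeddings)
    layer1 = [summarize([t for t, c in zip(texts, first) if c == i]) for i in range(k1)]
    centroids = np.stack([embeddings[first == i].mean(axis=0) for i in range(k1)])
    second = KMeans(n_clusters=k2, n_init=10).fit_predict(centroids)
    layer2 = [summarize([s for s, c in zip(layer1, second) if c == i]) for i in range(k2)]
    return layer1, layer2
```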

Result: A production-stable system that generates weekly insights with concrete design choices discussed, including a curriculum-intelligence use case with an evaluation plan.

Conclusion: ORACLE successfully transforms daily news into structured, decision-ready weekly insights for university stakeholders through a robust pipeline with PESTEL-aware analysis and change detection capabilities.

Abstract: ORACLE turns daily news into week-over-week, decision-ready insights for a Finnish University of Applied Sciences. The platform crawls and versions news, applies university-specific relevance filtering, embeds content, classifies items into PESTEL dimensions and builds a concise Time-Dependent Recursive Summary Graph (TRSG): two clustering layers summarized by an LLM and recomputed weekly. A lightweight change detector highlights what is new, removed or changed, then groups differences into themes for PESTEL-aware analysis. We detail the pipeline, discuss concrete design choices that make the system stable in production and present a curriculum-intelligence use case with an evaluation plan.

[29] Toward expert-level motivational interviewing for health behavior improvement with LLMs

Run-ze Hu, Yang Yang, Yi-hang Yang, Jing-qi Kong, Jia-hui Luo, Wen-yu Yang, Jing Chen, Jing-yao Liu, Hui-qun Zeng, Lei Zhang, Zheng Liu

Main category: cs.CL

TL;DR: Fine-tuning Chinese LLMs on MI-style counseling dialogues improves their ability to perform motivational interviewing, approaching human counselor performance on some metrics but still lacking in complex reflection skills.

DetailsMotivation: Motivational interviewing is effective for health behavior change but limited by the need for highly trained human counselors. The study aims to develop scalable AI alternatives using large language models.

Method: Curated Chinese counseling corpora, used GPT-4 to transcribe dialogues into MI-style conversations, fine-tuned three Chinese LLMs (Baichuan2-7B-Chat, ChatGLM-4-9B-Chat, Llama-3-8B-Chinese-Chat-v2) on 2,000 training dialogues, and evaluated using automatic metrics (BLEU-4, ROUGE) and expert manual coding with MITI Coding Manual.
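
For the automatic part of the evaluation, a minimal sketch with the sacrebleu and rouge-score packages (an assumption; the paper's exact metric implementation is not specified, and Chinese text would additionally need a suitable tokenizer):

```python
import sacrebleu
from rouge_score import rouge_scorer

def round_based_scores(hypotheses, references):
    """Corpus BLEU-4 and mean ROUGE-L F1 over per-turn model replies."""
    bleu4 = sacrebleu.corpus_bleu(hypotheses, [references]).score
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    rouge_l = sum(
        scorer.score(ref, hyp)["rougeL"].fmeasure
        for hyp, ref in zip(hypotheses, references)
    ) / len(hypotheses)
    return bleu4, rouge_l
```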

Result: Fine-tuning significantly improved BLEU-4 and ROUGE scores across all models. MI-LLMs achieved technical/relational global scores and MI-adherent ratios approaching real MI dialogues, though complex reflections and reflection-to-question ratios remained less frequent.

Conclusion: MI-oriented fine-tuning can endow general-purpose LLMs with core MI-consistent counseling behaviors, suggesting a scalable pathway for AI-assisted health behavior change support, but requires further work on data scale, complex MI skills, and real-world trials.

Abstract: Background: Motivational interviewing (MI) is an effective counseling approach for promoting health behavior change, but its impact is constrained by the need for highly trained human counselors. Objective: This study aimed to explore a scalable alternative by developing and evaluating Large Language Models for Motivational Interviewing (MI-LLMs). Methods: We first curated five Chinese psychological counseling corpora and, using GPT-4 with an MI-informed prompt, transcribed multi-turn dialogues from the two highest-quality datasets (CPsyCounD and PsyDTCorpus) into 2,040 MI-style counseling conversations, of which 2,000 were used for training and 40 for testing. Three Chinese-capable open-source LLMs (Baichuan2-7B-Chat, ChatGLM-4-9B-Chat and Llama-3-8B-Chinese-Chat-v2) were fine-tuned on this corpus and are referred to as MI-LLMs. We evaluated MI-LLMs using round-based automatic metrics and expert manual coding with the Motivational Interviewing Treatment Integrity (MITI) Coding Manual 4.2.1. Results: Across all three models, fine-tuning substantially improved BLEU-4 and ROUGE scores compared with the base models, and manual coding showed that MI-LLMs achieved technical and relational global scores, and MI-adherent ratios, that approached those of real MI dialogues, although complex reflections and reflection-to-question ratios remained less frequent. Conclusions: These findings provide initial evidence that MI-oriented fine-tuning can endow general-purpose LLMs with core MI-consistent counseling behaviors, suggesting a scalable pathway toward AI-assisted health behavior change support while underscoring the need for further work on data scale, complex MI skills and real-world intervention trials.

[30] When a Nation Speaks: Machine Learning and NLP in People’s Sentiment Analysis During Bangladesh’s 2024 Mass Uprising

Md. Samiul Alim, Mahir Shahriar Tamim, Maisha Rahman, Tanvir Ahmed Khan, Md Mushfique Anwar

Main category: cs.CL

TL;DR: This paper pioneers sentiment analysis in Bangla during Bangladesh’s 2024 mass uprising, creating a dataset of 2,028 annotated news headlines and showing that BanglaBERT outperforms multilingual models in capturing public emotions during civil unrest.

DetailsMotivation: There's a significant research gap in understanding emotional dynamics during civil unrest in the Bangla language, despite sentiment analysis being explored in other contexts like elections and social media trends.

Method: Created a unique dataset of 2,028 annotated Bangla news headlines from Facebook news portals, classified into Outrage, Hope, and Despair. Used Latent Dirichlet Allocation (LDA) for theme identification and compared BanglaBERT against multilingual transformers (mBERT, XLM-RoBERTa) and traditional ML methods (SVM, Logistic Regression).
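
The LDA step can be sketched with scikit-learn as below; the vectorizer settings are illustrative, and real Bangla headlines would need language-appropriate tokenization rather than the default analyzer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def headline_themes(headlines, n_topics=10, top_n=8):
    """Fit LDA over headlines and return the top words per theme."""
    vectorizer = CountVectorizer(max_features=5000)
    doc_term = vectorizer.fit_transform(headlines)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(doc_term)
    vocab = vectorizer.get_feature_names_out()
    return [[vocab[i] for i in topic.argsort()[::-1][:top_n]] for topic in lda.components_]
```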

Result: BanglaBERT achieved 74% accuracy, outperforming multilingual transformers (mBERT: 67%, XLM-RoBERTa: 71%) and traditional ML methods (both 70%). LDA identified prevalent themes like political corruption and public protests, and analysis showed how events like internet blackouts shaped sentiment patterns.

Conclusion: Language-specific models like BanglaBERT are more effective for sentiment analysis during political turmoil in Bangla, offering valuable insights into public sentiment dynamics during national crises.

Abstract: Sentiment analysis, an emerging research area within natural language processing (NLP), has primarily been explored in contexts like elections and social media trends, but there remains a significant gap in understanding emotional dynamics during civil unrest, particularly in the Bangla language. Our study pioneers sentiment analysis in Bangla during a national crisis by examining public emotions amid Bangladesh’s 2024 mass uprising. We curated a unique dataset of 2,028 annotated news headlines from major Facebook news portals, classifying them into Outrage, Hope, and Despair. Through Latent Dirichlet Allocation (LDA), we identified prevalent themes like political corruption and public protests, and analyzed how events such as internet blackouts shaped sentiment patterns. A fine-tuned BanglaBERT model achieved 74% accuracy, outperforming multilingual transformers (mBERT: 67%, XLM-RoBERTa: 71%) and traditional machine learning methods (SVM and Logistic Regression: both 70%). These results highlight the effectiveness of language-specific models and offer valuable insights into public sentiment during political turmoil.

[31] CTKVR: KV Cache Retrieval for Long-Context LLMs via Centroid then Token Indexing

Kuan Lu, Shuhang Lin, Sai Wu, Yichen Yao, Junhan Yang, Huan Li, Wei Chu, Xu Yinghui, Yuan Qi, Gang Chen

Main category: cs.CL

TL;DR: CTKVR is a two-stage KV cache retrieval method that uses centroid-then-token indexing to improve inference efficiency for long-context LLMs, achieving 3-4x throughput speedups with minimal accuracy loss.

DetailsMotivation: Long contexts in LLMs cause high memory overhead from KV cache and increased latency. Existing dynamic KV selection methods face trade-offs: block-level indexing loses accuracy by retrieving irrelevant entries, while token-level indexing has high latency from inefficient retrieval.

Method: CTKVR uses a two-stage retrieval scheme based on the observation that adjacent query vectors after RoPE share similar top-k KV cache entries. First, lightweight centroids are precomputed during prefilling for coarse-grained indexing. Second, token-level refinement provides precise KV retrieval. An optimized CPU-GPU co-execution system handles indexing construction and search.
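
The two-stage lookup can be sketched in a few lines of PyTorch: mean-pooled block centroids give a coarse index, and exact scoring runs only inside the winning blocks. The block size, block count, and k are illustrative, and the real system's RoPE handling and CPU-GPU co-execution are omitted.

```python
import torch

def centroid_then_token_topk(q, keys, block=64, n_blocks=8, k=256):
    """Two-stage KV retrieval sketch for a long context. q: (d,), keys: (T, d)."""
    T, d = keys.shape
    centroids = keys[: T - T % block].reshape(-1, block, d).mean(dim=1)   # coarse index
    top_blocks = torch.topk(centroids @ q, min(n_blocks, len(centroids))).indices
    token_ids = torch.cat([torch.arange(b * block, (b + 1) * block) for b in top_blocks])
    scores = keys[token_ids] @ q                                          # token-level refinement
    return token_ids[torch.topk(scores, min(k, len(token_ids))).indices]
```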

Result: CTKVR achieves superior performance across multiple benchmarks with less than 1% accuracy degradation. It delivers 3x throughput speedup on Llama-3-8B and 4x on Yi-9B at 96K context length across diverse GPU hardware.

Conclusion: CTKVR effectively balances retrieval efficiency and accuracy for long-context LLM inference by combining centroid-based coarse indexing with token-level refinement, significantly improving throughput while maintaining model quality.

Abstract: Large language models (LLMs) are increasingly applied in long-context scenarios such as multi-turn conversations. However, long contexts pose significant challenges for inference efficiency, including high memory overhead from Key-Value (KV) cache and increased latency due to excessive memory accesses. Recent methods for dynamic KV selection struggle with trade-offs: block-level indexing degrades accuracy by retrieving irrelevant KV entries, while token-level indexing incurs high latency from inefficient retrieval mechanisms. In this paper, we propose CTKVR, a novel centroid-then-token KV retrieval scheme that addresses these limitations. CTKVR leverages a key observation: query vectors adjacent in position exhibit high similarity after Rotary Position Embedding (RoPE) and share most of their top-k KV cache entries. Based on this insight, CTKVR employs a two-stage retrieval strategy: lightweight centroids are precomputed during prefilling for centroid-grained indexing, followed by token-level refinement for precise KV retrieval. This approach balances retrieval efficiency and accuracy. To further enhance performance, we implement an optimized system for indexing construction and search using CPU-GPU co-execution. Experimentally, CTKVR achieves superior performance across multiple benchmarks with less than 1% accuracy degradation. Meanwhile, CTKVR delivers 3 times and 4 times throughput speedups on Llama-3-8B and Yi-9B at 96K context length across diverse GPU hardware.

[32] Learning inflection classes using Adaptive Resonance Theory

Peter Dekker, Heikki Rasilo, Bart de Boer

Main category: cs.CL

TL;DR: The paper studies learnability of verbal inflection classes using Adaptive Resonance Theory neural network for unsupervised clustering of lexemes, applied to Latin, Portuguese, and Estonian languages.

DetailsMotivation: Inflection classes are important for morphological acquisition and processing, but their learnability by individual language users needs investigation. The research aims to understand how language users can discover inflection class patterns through unsupervised learning.

Method: Uses Adaptive Resonance Theory (ART) neural network with vigilance parameter controlling generalization degree. Performs unsupervised clustering of lexemes into inflection classes. Applied to Latin, Portuguese, and Estonian verbal inflection systems.
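
A minimal fuzzy-ART sketch showing the role of the vigilance parameter; the paper's morphological feature encoding and complement coding are omitted, and inputs are assumed to be vectors in [0, 1].

```python
import numpy as np

def fuzzy_art(inputs, vigilance=0.8, alpha=0.001, beta=1.0):
    """Cluster lexeme feature vectors; higher vigilance yields more,
    narrower inflection-class clusters."""
    weights, labels = [], []
    for x in inputs:
        scores = [np.minimum(x, w).sum() / (alpha + w.sum()) for w in weights]
        for j in np.argsort(scores)[::-1]:
            if np.minimum(x, weights[j]).sum() / x.sum() >= vigilance:   # resonance: update winner
                weights[j] = beta * np.minimum(x, weights[j]) + (1 - beta) * weights[j]
                labels.append(j)
                break
        else:                                                            # no match: open new class
            weights.append(x.copy())
            labels.append(len(weights) - 1)
    return labels
```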

Result: Clustering similarity to attested inflection classes varies with inflectional system complexity. Best performance found in narrow region of generalization parameter. Learned features show similarity with linguistic descriptions of inflection classes.

Conclusion: ART provides cognitively plausible model for learning inflection classes. The model could be used to study historical change in inflection classes through agent-based modeling approaches.

Abstract: The concept of inflection classes is an abstraction used by linguists, and provides a means to describe patterns in languages that give an analogical base for deducing previously unencountered forms. This ability is an important part of morphological acquisition and processing. We study the learnability of a system of verbal inflection classes by the individual language user by performing unsupervised clustering of lexemes into inflection classes. As a cognitively plausible and interpretable computational model, we use Adaptive Resonance Theory, a neural network with a parameter that determines the degree of generalisation (vigilance). The model is applied to Latin, Portuguese and Estonian. The similarity of clustering to attested inflection classes varies depending on the complexity of the inflectional system. We find the best performance in a narrow region of the generalisation parameter. The learned features extracted from the model show similarity with linguistic descriptions of the inflection classes. The proposed model could be used to study change in inflection classes in the future, by including it in an agent-based model.

[33] From Data to Dialogue: Unlocking Language for All

Dakota Ellis, Samy Bakikerali, Wanshan Chen, Bao Dinh, Uyen Le

Main category: cs.CL

TL;DR: The paper presents an automated method for creating Specialized Word Lists (SWLs) that outperform the industry standard NGSL in achieving 95% coverage with fewer words, making vocabulary learning more efficient and scalable.

DetailsMotivation: Traditional General Service Lists (GSLs) require linguistic expertise, subjective input, and significant time to create. The authors aim to develop a more practical, automated approach to help language learners identify the most important vocabulary words efficiently.

Method: The authors developed an automated model to create Specialized Word Lists (SWLs) - word lists specific to subsets of a corpus. The process uses only objective criteria, eliminating subjective input and enabling automation, scalability, and customization for different learner needs.
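
The coverage criterion behind such lists is simple to state in code: rank words by frequency within the chosen sub-corpus and cut the list once cumulative coverage reaches the 95% comprehension threshold. A sketch, not the authors' exact pipeline:

```python
from collections import Counter

def specialized_word_list(tokens, target=0.95):
    """Frequency-ranked word list cut at `target` running-token coverage."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered, words = 0, []
    for word, freq in counts.most_common():
        words.append(word)
        covered += freq
        if covered / total >= target:
            break
    return words
```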

Result: The SWLs created using the authors’ model outperformed the industry standard NGSL, achieving the 95% coverage required for language comprehension with fewer words. This demonstrates greater efficiency in vocabulary selection for language learners.

Conclusion: Creating specialized word lists through automated, objective criteria is more practical than traditional GSL approaches. This method can be scaled and tailored to meet the needs of language learners worldwide, offering an efficient alternative to subjective, expert-dependent vocabulary selection processes.

Abstract: Traditional linguists have proposed the use of a General Service List (GSL) to assist new language learners in identifying the most important words in English. This process requires linguistic expertise, subjective input, and a considerable amount of time. We attempt to create our own GSL and evaluate its practicality against the industry standard (The NGSL). We found creating a Specialized Word List (SWL), or a word list specific to a subset of the overall corpus, to be the most practical way for language-learners to optimize the process. The SWLs that we created using our model outperformed the industry standard, reaching the 95% coverage required for language comprehension with comparatively fewer words. By restricting the SWL process to objective criteria only, it can be automated, scaled, and tailored to the needs of language-learners across the globe.

[34] An Empirical Study on Chinese Character Decomposition in Multiword Expression-Aware Neural Machine Translation

Lifeng Han, Gareth J. F. Jones, Alan F. Smeaton

Main category: cs.CL

TL;DR: This paper systematically studies Chinese character decomposition technology for MWE-aware neural machine translation, addressing the unique challenges of Chinese MWEs that can’t be solved by Western sub-word methods like BPE.

DetailsMotivation: MWEs create fundamental difficulties in NLU/NLP/NLG tasks through ambiguity, idiomatic expressions, and variations. While Western languages have made progress with techniques like BPE, Chinese and related Asian languages lag behind because sub-word modeling can't be directly applied to ideographic scripts like Chinese.

Method: The paper conducts a systematic study of Chinese character decomposition technology in the context of MWE-aware neural machine translation. It examines how character decomposition contributes to representing original meanings of Chinese words/characters and addresses MWE translation challenges.

Result: The paper reports experiments examining the effectiveness of Chinese character decomposition technology for representing word meanings and addressing MWE translation challenges, though specific quantitative results aren’t provided in the abstract.

Conclusion: Chinese character decomposition technology offers a promising approach to address MWE challenges in Chinese NMT, providing an alternative to Western sub-word methods that are incompatible with ideographic scripts.

Abstract: Word meaning, representation, and interpretation play fundamental roles in natural language understanding (NLU), natural language processing (NLP), and natural language generation (NLG) tasks. Many of the inherent difficulties in these tasks stem from Multi-word Expressions (MWEs), which complicate the tasks by introducing ambiguity, idiomatic expressions, infrequent usage, and a wide range of variations. Significant effort and substantial progress have been made in addressing the challenging nature of MWEs in Western languages, particularly English. This progress is attributed in part to the well-established research communities and the abundant availability of computational resources. However, the same level of progress is not true for language families such as Chinese and closely related Asian languages, which continue to lag behind in this regard. While sub-word modelling has been successfully applied to many Western languages to address rare words, improve phrase comprehension, and enhance machine translation (MT) through techniques like byte-pair encoding (BPE), it cannot be applied directly to ideographic scripts like Chinese. In this work, we conduct a systematic study of Chinese character decomposition technology in the context of MWE-aware neural machine translation (NMT). Furthermore, we report experiments to examine how Chinese character decomposition technology contributes to the representation of the original meanings of Chinese words and characters, and how it can effectively address the challenges of translating MWEs.

[35] Bolmo: Byteifying the Next Generation of Language Models

Benjamin Minixhofer, Tyler Murray, Tomasz Limisiewicz, Anna Korhonen, Luke Zettlemoyer, Noah A. Smith, Edoardo M. Ponti, Luca Soldaini, Valentin Hofmann

Main category: cs.CL

TL;DR: Bolmo is the first competitive fully open byte-level language model family (1B/7B) trained by converting existing subword models via “byteification” with less than 1% of typical pretraining tokens, matching subword model performance while improving character understanding.

DetailsMotivation: To overcome limitations of subword tokenization (insufficient character understanding, fixed vocabulary constraints) while maintaining competitive performance, and to make byte-level LMs practical alternatives to subword models.

Method: Byteification: converting existing subword LMs to byte-level models using a novel architecture that resolves expressivity mismatches, enabling exact distillation with less than 1% of typical pretraining tokens. Uses higher token compression ratios for competitive inference speeds.
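
At its core, exact distillation is a KL objective between aligned prediction points of student and teacher; the sketch below assumes the hard alignment problem (which Bolmo's architecture is designed to solve) is already handled and that the two distributions live over a shared prediction space.

```python
import torch.nn.functional as F

def exact_distillation_loss(student_logits, teacher_logprobs):
    """KL(teacher || student) at aligned prediction points.
    student_logits, teacher_logprobs: (T, V) over a shared prediction space."""
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),   # student log-probabilities
        teacher_logprobs,                        # frozen teacher log-probabilities
        log_target=True,
        reduction="batchmean",
    )
```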

Result: Bolmo outperforms all prior byte-level LMs of comparable size, beats source subword models on character understanding and sometimes coding, matches original LMs on other tasks, achieves competitive inference speeds, and can be cheaply post-trained using existing subword LM ecosystem.

Conclusion: Byte-level LMs can now be practical, competitive alternatives to subword-level LMs across wide use cases through efficient byteification of existing models, overcoming subword limitations while maintaining performance.

Abstract: We introduce Bolmo, the first family of competitive fully open byte-level language models (LMs) at the 1B and 7B parameter scales. In contrast to prior research on byte-level LMs, which focuses predominantly on training from scratch, we train Bolmo by byteifying existing subword-level LMs. Byteification enables overcoming the limitations of subword tokenization - such as insufficient character understanding and efficiency constraints due to the fixed subword vocabulary - while performing at the level of leading subword-level LMs. Bolmo is specifically designed for byteification: our architecture resolves a mismatch between the expressivity of prior byte-level architectures and subword-level LMs, which makes it possible to employ an effective exact distillation objective between Bolmo and the source subword model. This allows for converting a subword-level LM to a byte-level LM by investing less than 1% of a typical pretraining token budget. Bolmo substantially outperforms all prior byte-level LMs of comparable size, and outperforms the source subword-level LMs on character understanding and, in some cases, coding, while coming close to matching the original LMs’ performance on other tasks. Furthermore, we show that Bolmo can achieve inference speeds competitive with subword-level LMs by training with higher token compression ratios, and can be cheaply and effectively post-trained by leveraging the existing ecosystem around the source subword-level LM. Our results finally make byte-level LMs a practical choice competitive with subword-level LMs across a wide set of use cases.

[36] You Never Know a Person, You Only Know Their Defenses: Detecting Levels of Psychological Defense Mechanisms in Supportive Conversations

Hongbin Na, Zimu Wang, Zhaoming Chen, Peilin Zhou, Yining Hua, Grace Ziqi Zhou, Haiyang Zhang, Tao Shen, Wei Wang, John Torous, Shaoxiong Ji, Ling Chen

Main category: cs.CL

TL;DR: PsyDefConv: A dialogue corpus with defense mechanism annotations and DMRS Co-Pilot pipeline that reduces annotation time by 22.4% while providing clinically plausible evidence-based pre-annotations.

DetailsMotivation: Psychological defenses are crucial for mental health but difficult to measure reliably in clinical dialogues. Current methods lack standardized annotation tools and datasets for studying defensive functioning in language.

Method: Created PsyDefConv corpus with 200 dialogues and 4709 utterances (2336 help seeker turns) labeled for defense level. Developed DMRS Co-Pilot, a four-stage pipeline providing evidence-based pre-annotations. Conducted counterbalanced study and expert review.

Result: Corpus labeling achieved Cohen’s kappa 0.639. Co-pilot reduced annotation time by 22.4%. Expert ratings: 4.62/7 for evidence, 4.44 for clinical plausibility, 4.40 for insight. Best language model macro F1-score around 30%, tendency to overpredict mature defenses. Mature defenses most common with emotion-specific deviations.

Conclusion: The PsyDefConv corpus and DMRS Co-Pilot provide valuable resources for studying defensive functioning in language, with demonstrated efficiency improvements and clinical utility. Release of corpus, annotations, code, and prompts will support further research.

Abstract: Psychological defenses are strategies, often automatic, that people use to manage distress. Rigid use or overuse of defenses is negatively linked to mental health and shapes what speakers disclose and how they accept or resist help. However, defenses are complex and difficult to reliably measure, particularly in clinical dialogues. We introduce PsyDefConv, a dialogue corpus with help seeker utterances labeled for defense level, and DMRS Co-Pilot, a four-stage pipeline that provides evidence-based pre-annotations. The corpus contains 200 dialogues and 4709 utterances, including 2336 help seeker turns, with labeling reaching a Cohen’s kappa of 0.639. In a counterbalanced study, the co-pilot reduced average annotation time by 22.4%. In expert review, it averaged 4.62 for evidence, 4.44 for clinical plausibility, and 4.40 for insight on a seven-point scale. Benchmarks with strong language models in zero-shot and fine-tuning settings demonstrate clear headroom, with the best macro F1-score around 30% and a tendency to overpredict mature defenses. Corpus analyses confirm that mature defenses are most common and reveal emotion-specific deviations. We will release the corpus, annotations, code, and prompts to support research on defensive functioning in language.

[37] Evaluating Metrics for Safety with LLM-as-Judges

Kester Clegg, Richard Hawkins, Ibrahim Habli, Tom Lawton

Main category: cs.CL

TL;DR: The paper argues for using weighted metrics and context-sensitive confidence thresholds in LLM-as-Judge frameworks to improve safety in critical information flows, rather than relying on deterministic evaluations.

DetailsMotivation: LLMs are increasingly used in critical information processing roles (like healthcare triage or nuclear facility access), but they make mistakes. There's a need to make LLMs safe and reliable when replacing human roles in safety-critical workflows.

Method: Proposes focusing on evidence from evaluation points in LLM processes, particularly in LLM-as-Judge frameworks. Suggests using a basket of weighted metrics, context sensitivity to define error severity, and designing confidence thresholds that trigger human review when evaluator concordance is low.
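
The proposal reduces to a small amount of glue logic; the sketch below combines a basket of weighted judge scores and escalates to a human when concordance across evaluators is low. All interfaces and thresholds are hypothetical.

```python
def judge_with_escalation(item, judges, weights, pass_threshold=0.75, max_spread=0.2):
    """Weighted LaJ basket with a human-review trigger on low concordance."""
    scores = [judge(item) for judge in judges]                 # each judge returns a score in [0, 1]
    combined = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    if max(scores) - min(scores) > max_spread:                 # evaluators disagree: escalate
        return "human_review", combined
    return ("pass" if combined >= pass_threshold else "fail"), combined
```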

Result: The approach acknowledges that deterministic evaluations aren’t possible for many NLP tasks, but argues that weighted metrics and confidence thresholds can lower error risks and improve safety in critical applications.

Conclusion: Safety arguments for LLMs in critical information flows should focus on evaluation evidence and confidence thresholds rather than performative claims, using weighted metrics and human review triggers to manage risks.

Abstract: LLMs (Large Language Models) are increasingly used in text processing pipelines to intelligently respond to a variety of inputs and generation tasks. This raises the possibility of replacing human roles that bottleneck existing information flows, either due to insufficient staff or process complexity. However, LLMs make mistakes and some processing roles are safety critical. For example, triaging post-operative care to patients based on hospital referral letters, or updating site access schedules in nuclear facilities for work crews. If we want to introduce LLMs into critical information flows that were previously performed by humans, how can we make them safe and reliable? Rather than make performative claims about augmented generation frameworks or graph-based techniques, this paper argues that the safety argument should focus on the type of evidence we get from evaluation points in LLM processes, particularly in frameworks that employ LLM-as-Judges (LaJ) evaluators. Although we cannot get deterministic evaluations from many natural language processing tasks, we argue that by adopting a basket of weighted metrics it may be possible to lower the risk of errors within an evaluation, use context sensitivity to define error severity, and design confidence thresholds that trigger human review of critical LaJ judgments when concordance across evaluators is low.

[38] How Much is Too Much? Exploring LoRA Rank Trade-offs for Retaining Knowledge and Domain Robustness

Darshita Rathore, Vineet Kumar, Chetna Bansal, Anindya Moitra

Main category: cs.CL

TL;DR: LoRA achieves competitive/superior performance vs full SFT on reasoning tasks at optimal ranks, with distinct generalization patterns and task-specific forgetting.

DetailsMotivation: While PEFT methods like LoRA are computationally efficient, their configuration implications (especially rank) remain under-explored for downstream Q&A tasks and generalization behavior.

Method: Comprehensive evaluation across reasoning/recall datasets with rank sweeps; comparison of PEFT vs SFT on in-domain/out-of-domain adaptation; analysis of internal representations via spectral features and layer-wise attention structures.
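
A rank sweep of this kind is straightforward with the Hugging Face peft library; the target modules and alpha schedule below are common choices for LLaMA-style models, not necessarily the authors' configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

def lora_rank_sweep(model_name, ranks=(4, 8, 16, 32, 64)):
    """Yield one LoRA-wrapped model per rank, keeping alpha/r fixed."""
    for r in ranks:
        base = AutoModelForCausalLM.from_pretrained(model_name)
        config = LoraConfig(
            r=r,
            lora_alpha=2 * r,                     # constant alpha/r ratio across the sweep
            target_modules=["q_proj", "v_proj"],
            lora_dropout=0.05,
            task_type="CAUSAL_LM",
        )
        yield r, get_peft_model(base, config)
```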

Result: LoRA achieves competitive and sometimes superior performance compared to SFT, particularly on reasoning tasks at specific rank values; shows distinct generalization behavior and task-specific forgetting patterns.

Conclusion: LoRA is a viable alternative to full SFT with optimal rank selection, offering computational efficiency while maintaining or improving performance, with insights into representational drift and attention pattern changes.

Abstract: Large language models are increasingly adapted to downstream tasks through fine-tuning. Full supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), are two dominant approaches. While PEFT methods are widely used for their computational efficiency, the implications of their configurations (e.g., rank) remain under-explored in downstream Q&A tasks and generalisation. In this work, we perform a comprehensive evaluation across multiple reasoning and recall datasets, conducting a rank sweep to quantify the trade-off between SFT and PEFT. We also compare the accuracy of PEFT and SFT models across in-domain and out-of-domain adaptation, highlighting distinct generalisation behaviour and task-specific forgetting. We demonstrate that LoRA achieves competitive and in some cases superior performance compared to SFT, particularly on reasoning tasks at specific rank values. Additionally, we analyze the internal representations via spectral features and layer-wise attention structures, offering insights into representational drift and structural changes in attention patterns.

[39] Characterizing Mamba’s Selective Memory using Auto-Encoders

Tamanna Hossain, Robert L. Logan, Ganesh Jagadeesan, Sameer Singh, Joel Tetreault, Alejandro Jaimes

Main category: cs.CL

TL;DR: SSM language models like Mamba tend to forget specific types of tokens (math symbols, organization entities, non-standard English dialects) more frequently, with less prevalent tokens in pretraining data being most likely forgotten.

DetailsMotivation: While SSMs offer fixed memory advantages over transformers, they suffer from information loss in hidden states when processing long sequences. Previous work only studied when this loss occurs, not what types of information SSM LMs tend to forget.

Method: Trained an auto-encoder to reconstruct sequences from SSM’s hidden state, then measured information loss by comparing inputs with reconstructions. Experiments used Mamba SSM LMs (130M-1.4B parameters) on sequences of 4-256 tokens.
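
The probe itself can be as simple as a linear decoder from the final hidden state back to per-position targets; the dimensions below are illustrative, and the paper's actual auto-encoder design is not specified in the summary.

```python
import torch.nn as nn

class HiddenStateDecoder(nn.Module):
    """Reconstruct a fixed-length sequence representation from one SSM
    hidden state; reconstruction error then measures information loss."""
    def __init__(self, state_dim=1536, embed_dim=768, max_len=256):
        super().__init__()
        self.embed_dim, self.max_len = embed_dim, max_len
        self.proj = nn.Linear(state_dim, max_len * embed_dim)

    def forward(self, hidden_state):                  # (B, state_dim)
        out = self.proj(hidden_state)                 # (B, max_len * embed_dim)
        return out.view(-1, self.max_len, self.embed_dim)
```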

Result: Significantly higher information loss on math-related tokens (numbers, variables), organization entity mentions, and alternative dialects to Standard American English. Less prevalent tokens in pretraining data are most likely to be forgotten.

Conclusion: The study identifies specific patterns of information loss in SSM LMs, providing clear direction for future research to develop methods that better control Mamba’s ability to retain important information.

Abstract: State space models (SSMs) are a promising alternative to transformers for language modeling because they use fixed memory during inference. However, this fixed memory usage requires some information loss in the hidden state when processing long sequences. While prior work has studied the sequence length at which this information loss occurs, it does not characterize the types of information SSM language models (LMs) tend to forget. In this paper, we address this knowledge gap by identifying the types of tokens (e.g., parts of speech, named entities) and sequences (e.g., code, math problems) that are more frequently forgotten by SSM LMs. We achieve this by training an auto-encoder to reconstruct sequences from the SSM’s hidden state, and measure information loss by comparing inputs with their reconstructions. We perform experiments using the Mamba family of SSM LMs (130M–1.4B) on sequences ranging from 4–256 tokens. Our results show significantly higher rates of information loss on math-related tokens (e.g., numbers, variables), mentions of organization entities, and alternative dialects to Standard American English. We then examine the frequency that these tokens appear in Mamba’s pretraining data and find that less prevalent tokens tend to be the ones Mamba is most likely to forget. By identifying these patterns, our work provides clear direction for future research to develop methods that better control Mamba’s ability to retain important information.

[40] PPSEBM: An Energy-Based Model with Progressive Parameter Selection for Continual Learning

Xiaodi Li, Dingcheng Li, Rujun Gao, Mahmoud Zamani, Feng Mi, Latifur Khan

Main category: cs.CL

TL;DR: PPSEBM combines Energy-Based Models with Progressive Parameter Selection to tackle catastrophic forgetting in continual learning for NLP tasks.

DetailsMotivation: Continual learning faces the fundamental challenge of catastrophic forgetting, where models lose previously learned knowledge when acquiring new tasks. This paper aims to address this persistent problem in NLP settings.

Method: PPSEBM integrates an Energy-Based Model with Progressive Parameter Selection: PPS allocates task-specific parameters for each new task, while the EBM generates pseudo-samples from prior tasks to guide parameter selection and preserve past knowledge.

Result: Experimental results on diverse NLP benchmarks show that PPSEBM outperforms state-of-the-art continual learning methods, demonstrating superior performance in mitigating catastrophic forgetting.

Conclusion: PPSEBM offers a promising and robust solution for continual learning in NLP by effectively addressing catastrophic forgetting through the synergistic combination of progressive parameter selection and energy-based pseudo-sample generation.

Abstract: Continual learning remains a fundamental challenge in machine learning, requiring models to learn from a stream of tasks without forgetting previously acquired knowledge. A major obstacle in this setting is catastrophic forgetting, where performance on earlier tasks degrades as new tasks are learned. In this paper, we introduce PPSEBM, a novel framework that integrates an Energy-Based Model (EBM) with Progressive Parameter Selection (PPS) to effectively address catastrophic forgetting in continual learning for natural language processing tasks. In PPSEBM, progressive parameter selection allocates distinct, task-specific parameters for each new task, while the EBM generates representative pseudo-samples from prior tasks. These generated samples actively inform and guide the parameter selection process, enhancing the model’s ability to retain past knowledge while adapting to new tasks. Experimental results on diverse NLP benchmarks demonstrate that PPSEBM outperforms state-of-the-art continual learning methods, offering a promising and robust solution to mitigate catastrophic forgetting.

[41] Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Adam Karvonen, James Chua, Clément Dumas, Kit Fraser-Taliente, Subhash Kantamneni, Julian Minder, Euan Ong, Arnab Sen Sharma, Daniel Wen, Owain Evans, Samuel Marks

Main category: cs.CL

TL;DR: LatentQA-trained Activation Oracles can effectively answer questions about LLM activations, generalizing well to out-of-distribution tasks and matching/exceeding white-box baselines.

DetailsMotivation: Existing techniques for understanding LLM activations are complex and specialized. The authors want to explore a simpler, more generalist approach using LatentQA to train models that can answer natural language questions about activations.

Method: Train Activation Oracles (AOs) using LatentQA approach - training LLMs to accept activations as inputs and answer questions about them. Evaluate in out-of-distribution settings and examine scaling with training data diversity, including classification tasks and self-supervised context prediction.

Result: AOs can recover information fine-tuned into models (biographical knowledge, malign propensities) not present in input text. Best AOs match/exceed prior white-box baselines on 4 downstream tasks, being best on 3 out of 4. Even narrowly-trained models generalize well, with additional training data yielding consistent improvements.

Conclusion: Diversified training to answer natural-language queries imparts a general capability to verbalize information about LLM activations, suggesting LatentQA-trained AOs are effective for understanding LLM internals.

Abstract: Large language model (LLM) activations are notoriously difficult to understand, with most existing techniques using complex, specialized methods for interpreting them. Recent work has proposed a simpler approach known as LatentQA: training LLMs to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language. However, prior work has focused on narrow task settings for both training and evaluation. In this paper, we instead take a generalist perspective. We evaluate LatentQA-trained models, which we call Activation Oracles (AOs), in far out-of-distribution settings and examine how performance scales with training data diversity. We find that AOs can recover information fine-tuned into a model (e.g., biographical knowledge or malign propensities) that does not appear in the input text, despite never being trained with activations from a fine-tuned model. Our main evaluations are four downstream tasks where we can compare to prior white- and black-box techniques. We find that even narrowly-trained LatentQA models can generalize well, and that adding additional training datasets (such as classification tasks and a self-supervised context prediction task) yields consistent further improvements. Overall, our best AOs match or exceed prior white-box baselines on all four tasks and are the best method on 3 out of 4. These results suggest that diversified training to answer natural-language queries imparts a general capability to verbalize information about LLM activations.

[42] Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation

Jaewoo Park, Jungyang Park, Dongju Jang, Jiwan Chung, Byungwoo Yoo, Jaewoo Shin, Seonjoon Park, Taehyeong Kim, Youngjae Yu

Main category: cs.CL

TL;DR: The paper introduces a multimodal solution explanation task and ME2 benchmark to evaluate LLMs’ ability to incorporate visual elements (diagrams, markings, highlights) in math explanations, finding current models struggle with visual grounding in educational contexts.

DetailsMotivation: Current LLM-generated math explanations lack multimodal elements that human tutors routinely use (diagrams, markings, highlights) to enhance conceptual clarity, creating a gap in educational AI systems.

Method: Introduced multimodal solution explanation task to evaluate models’ ability to identify visual keypoints (auxiliary lines, points, angles) and generate explanations incorporating these elements. Created ME2 benchmark with 1,000 math problems annotated with visual keypoints and corresponding explanatory text.

Result: Current models struggle to identify visual keypoints, and open-source models face notable difficulties in generating keypoint-based explanations, revealing significant gaps in mathematical visual grounding and visually grounded reasoning.

Conclusion: The multimodal solution explanation task and ME2 dataset should catalyze research on LLMs in education and promote their development as effective, explanation-oriented AI tutors with visual grounding capabilities.

Abstract: With the rapid advancement of mathematical reasoning capabilities in Large Language Models (LLMs), AI systems are increasingly being adopted in educational settings to support students’ comprehension of problem-solving processes. However, a critical component remains underexplored in current LLM-generated explanations: multimodal explanation. In real-world instructional contexts, human tutors routinely employ visual aids, such as diagrams, markings, and highlights, to enhance conceptual clarity. To bridge this gap, we introduce the multimodal solution explanation task, designed to evaluate whether models can identify visual keypoints (such as auxiliary lines, points, and angles) and generate explanations that incorporate these key elements essential for understanding. To evaluate model performance on this task, we propose ME2, a multimodal benchmark consisting of 1,000 math problems annotated with visual keypoints and corresponding explanatory text that references those elements. Our empirical results show that current models struggle to identify visual keypoints. In the task of generating keypoint-based explanations, open-source models also face notable difficulties. This highlights a significant gap in current LLMs’ ability to perform mathematical visual grounding, engage in visually grounded reasoning, and provide explanations in educational contexts. We expect that the multimodal solution explanation task and the ME2 dataset will catalyze further research on LLMs in education and promote their use as effective, explanation-oriented AI tutors.

[43] Scale-invariant Attention

Ben Anson, Xi Wang, Laurence Aitchison

Main category: cs.CL

TL;DR: A scale-invariant attention mechanism for LLMs that enables zero-shot generalization from short to long contexts through position-dependent logit transformations.

DetailsMotivation: Address the persistent challenge in LLM research where attention mechanisms trained on short contexts fail to generalize effectively to longer contexts during inference.

Method: Propose two conditions for effective long-context attention: scale-invariant total attention and scale-invariant attention sparsity. Under Gaussian assumption, show that a simple position-dependent transformation of attention logits satisfies these conditions.

Result: The scale-invariant attention scheme provides considerable benefits in validation loss when zero-shot generalizing from short to long contexts, and is effective at long-context retrieval tasks.

Conclusion: A simple position-dependent transformation of attention logits enables effective long-context generalization by ensuring scale-invariant properties, addressing a key limitation in current LLM attention mechanisms.

Abstract: One persistent challenge in LLM research is the development of attention mechanisms that are able to generalise from training on shorter contexts to inference on longer contexts. We propose two conditions that we expect all effective long context attention mechanisms to have: scale-invariant total attention, and scale-invariant attention sparsity. Under a Gaussian assumption, we show that a simple position-dependent transformation of the attention logits is sufficient for these conditions to hold. Experimentally we find that the resulting scale-invariant attention scheme gives considerable benefits in terms of validation loss when zero-shot generalising from training on short contexts to validation on longer contexts, and is effective at long-context retrieval.
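
The paper's exact logit transformation is not reproduced in the abstract; as a stand-in, the sketch below scales each query's logits by the log of its position, one simple position-dependent transformation in the same spirit (the specific functional form here is an assumption, not the paper's).

```python
# Illustrative sketch of a position-dependent transformation of attention
# logits, aimed at keeping total attention and sparsity roughly stable as
# context length grows.
import math
import torch

def scale_invariant_logits(q, k):
    # q, k: (seq, d). Standard dot-product logits plus a per-query factor.
    d = q.shape[-1]
    logits = q @ k.T / math.sqrt(d)                   # (seq, seq)
    pos = torch.arange(1, q.shape[0] + 1, dtype=q.dtype)
    logits = logits * torch.log1p(pos).unsqueeze(1)   # grows with query index
    causal = torch.tril(torch.ones_like(logits)).bool()
    return logits.masked_fill(~causal, float("-inf"))

q, k = torch.randn(16, 64), torch.randn(16, 64)
attn = torch.softmax(scale_invariant_logits(q, k), dim=-1)
print(attn.shape)  # torch.Size([16, 16])
```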

[44] Hidden in the Haystack: Smaller Needles are More Difficult for LLMs to Find

Owen Bianchi, Mathew J. Koretsky, Maya Willey, Chelsea X. Alvarado, Tanay Nayak, Adi Asija, Nicole Kuznetsov, Mike A. Nalls, Faraz Faghri, Daniel Khashabi

Main category: cs.CL

TL;DR: LLMs struggle more with needle-in-haystack tasks when the relevant answer-containing document (gold context) is shorter, even after controlling for other factors like position and distractor volume.

DetailsMotivation: Previous research on needle-in-haystack tasks has focused on positional bias and distractor quantity, but the impact of gold context size (length of answer-containing document) has been largely overlooked, despite its importance for real-world applications where relevant information comes in varying lengths.

Method: Conducted systematic study across three diverse benchmarks (general knowledge, biomedical reasoning, mathematical reasoning) with 11 state-of-the-art LLMs including recent reasoning models. Performed over 150K controlled runs with rigorous confounder analysis controlling for gold context position, answer token repetition, gold-to-distractor ratio, distractor volume, and domain specificity.

Result: LLM performance drops sharply when gold context is shorter. Smaller gold contexts consistently degrade model performance and amplify positional sensitivity. Gold context size remains a decisive, independent predictor of success even after controlling for all other factors.

Conclusion: Gold context size is a critical factor affecting LLM performance in long-context QA tasks, posing challenges for agentic systems that need to integrate scattered, fine-grained information of varying lengths. This provides clear insights for designing robust, context-aware LLM systems.

Abstract: Large language models (LLMs) face significant challenges with needle-in-a-haystack tasks, where relevant information (“the needle”) must be drawn from a large pool of irrelevant context (“the haystack”). Previous studies have highlighted positional bias and distractor quantity as critical factors affecting model performance, yet the influence of gold context size, the length of the answer-containing document, has received little attention. We present the first systematic study of gold context size in long-context question answering, spanning three diverse benchmarks (general knowledge, biomedical reasoning, and mathematical reasoning), eleven state-of-the-art LLMs (including recent reasoning models), and more than 150K controlled runs. Our experiments reveal that LLM performance drops sharply when the gold context is shorter, i.e., smaller gold contexts consistently degrade model performance and amplify positional sensitivity, posing a major challenge for agentic systems that must integrate scattered, fine-grained information of varying lengths. This effect persists under rigorous confounder analysis: even after controlling for gold context position, answer token repetition, gold-to-distractor ratio, distractor volume, and domain specificity, gold context size remains a decisive, independent predictor of success. Our work provides clear insights to guide the design of robust, context-aware LLM-driven systems.
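
A minimal sketch of how such a controlled trial can be constructed, assuming a token-list context builder that varies only the gold document's length while holding total context length and gold position fixed (an illustration, not the authors' harness):

```python
# Build a haystack of fixed total length with the gold document at a fixed
# relative position; only the gold document's length varies across trials.
import random

def build_context(gold_doc, distractors, total_tokens, gold_position):
    """gold_doc/distractors are token lists; gold_position is in [0, 1]."""
    budget = total_tokens - len(gold_doc)
    pool = []
    while budget > 0:
        d = random.choice(distractors)[:budget]
        pool.extend(d)
        budget -= len(d)
    cut = int(gold_position * len(pool))
    return pool[:cut] + gold_doc + pool[cut:]

distractors = [[f"d{i}"] * 120 for i in range(50)]
for gold_len in (16, 64, 256):
    ctx = build_context([f"g{i}" for i in range(gold_len)], distractors,
                        total_tokens=4096, gold_position=0.5)
    print(gold_len, len(ctx))  # total length stays constant at 4096
```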

[45] TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation

Renren Jin, Tianhao Shen, Xinwei Wu, Dan Shi, Haoran Sun, Yuqi Ren, Wuwei Huang, Quandong Wang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong

Main category: cs.CL

TL;DR: TaP framework automates multilingual preference dataset generation using taxonomy guidance, enabling LLMs to outperform models trained on much larger open-source datasets.

DetailsMotivation: High-quality datasets for supervised and preference fine-tuning are resource-intensive to create and predominantly available only in English, limiting multilingual LLM development.

Method: Taxonomy-Guided Preference Data Generation (TaP) framework uses structured taxonomy for fine-grained control over dataset composition, enabling automated and scalable multilingual preference dataset construction.

Result: LLMs trained on TaP-generated datasets outperform those trained on existing open-source datasets, even surpassing performance of models trained on datasets 180 times larger.

Conclusion: TaP provides an effective solution for scalable, multilingual preference dataset generation that significantly improves LLM performance with smaller, higher-quality datasets.

Abstract: Conducting supervised fine-tuning and preference fine-tuning on large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values. However, constructing such datasets is resource-intensive, and most available datasets for supervised and preference fine-tuning are in English. To address these challenges, we propose the Taxonomy-Guided Preference Data Generation (TaP) framework, which facilitates automated and scalable construction of preference datasets across various languages. TaP is grounded in a structured taxonomy that allows fine-grained control over dataset composition, thereby ensuring both diversity and comprehensive coverage. We employ TaP-generated datasets to perform supervised and preference fine-tuning on various LLMs. Experimental results demonstrate that LLMs trained on TaP-generated datasets outperform those trained on existing open-source datasets. Remarkably, LLMs trained on TaP-generated datasets surpass the performance of those trained on an open-source dataset that is 180 times larger.
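
As a toy illustration of taxonomy-guided generation (the taxonomy, language choice, and template below are invented for the example, not TaP's released assets):

```python
# Sample a path through a topic taxonomy and turn it into a generation prompt
# for building multilingual preference pairs.
import random

taxonomy = {
    "safety": {"privacy": ["data leakage", "surveillance"],
               "fairness": ["stereotyping"]},
    "helpfulness": {"coding": ["debugging", "refactoring"],
                    "writing": ["summarization"]},
}

def sample_leaf(tax):
    top = random.choice(list(tax))
    mid = random.choice(list(tax[top]))
    return top, mid, random.choice(tax[top][mid])

top, mid, leaf = sample_leaf(taxonomy)
prompt = (f"Write an instruction in French about '{leaf}' "
          f"(category: {top}/{mid}), plus a preferred and a rejected answer.")
print(prompt)  # fed to a generator LLM to produce one preference pair
```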

[46] Learning without training: The implicit dynamics of in-context learning

Benoit Dherin, Michael Munn, Hanna Mazzawi, Michael Wunder, Javier Gonzalvo

Main category: cs.CL

TL;DR: LLMs can learn in-context at inference time without weight updates. This paper shows that transformer blocks implicitly modify MLP weights based on context through a low-rank weight-update mechanism.

DetailsMotivation: The motivation is to understand the mechanisms behind LLMs' ability to learn in-context at inference time, which remains largely unknown despite being a striking feature of these models.

Method: The authors analyze how the stacking of a self-attention layer with an MLP allows transformer blocks to implicitly modify MLP weights according to context. They use both theoretical analysis and experimentation to demonstrate that context is transformed into a low-rank weight-update of the MLP layer.

Result: The paper shows that transformer blocks can implicitly transform context into low-rank weight-updates of their MLP layers, providing a potential explanation for how LLMs achieve in-context learning without actual weight modifications.

Conclusion: The simple mechanism of transformer blocks implicitly modifying MLP weights through context-dependent low-rank updates may explain why LLMs can learn in-context at inference time, not just during training.

Abstract: One of the most striking features of Large Language Models (LLMs) is their ability to learn in-context. Namely, at inference time, an LLM is able to learn new patterns without any additional weight update when these patterns are presented in the form of examples in the prompt, even if these patterns were not seen during training. The mechanisms through which this can happen are still largely unknown. In this work, we show that the stacking of a self-attention layer with an MLP allows the transformer block to implicitly modify the weights of the MLP layer according to the context. We argue through theory and experimentation that this simple mechanism may be the reason why LLMs can learn in-context and not only during training. Specifically, we show how a transformer block implicitly transforms a context into a low-rank weight-update of its MLP layer.
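
The core identity is easy to verify numerically: for a linear layer, adding a context-dependent vector to the input is exactly equivalent to applying a rank-1, context-dependent update to the weights. A minimal check (a sketch of the mechanism, not the paper's full derivation):

```python
# Verify: W(x + delta) == (W + dW)x for a suitable rank-1 update dW.
import torch

d = 32
W = torch.randn(64, d)     # MLP weight (bias omitted for clarity)
x = torch.randn(d)         # token representation without context
delta = torch.randn(d)     # contribution mixed in by self-attention

# Path 1: context added to the input, weights untouched.
out_input = W @ (x + delta)

# Path 2: input untouched, weights get a rank-1, context-dependent update.
dW = torch.outer(W @ delta, x) / (x @ x)
out_weight = (W + dW) @ x

print(torch.allclose(out_input, out_weight, atol=1e-5))  # True
```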

[47] LATTE: Learning Aligned Transactions and Textual Embeddings for Bank Clients

Egor Fadeev, Dzhambulat Mollaev, Aleksei Shestov, Omar Zoloev, Artem Sakhno, Dmitry Korolev, Ivan Kireev, Andrey Savchenko, Maksim Makarenko

Main category: cs.CL

TL;DR: LATTE: A contrastive learning framework that aligns raw event embeddings with semantic embeddings from frozen LLMs for efficient client representation learning in financial applications.

DetailsMotivation: Learning client embeddings from historical communication sequences is crucial for financial applications, but directly using LLMs on long event sequences is computationally expensive and impractical for real-world pipelines.

Method: LATTE uses contrastive learning to align raw event embeddings with semantic embeddings from frozen LLMs. Behavioral features are summarized into short prompts, embedded by the LLM, and used as supervision via contrastive loss.

Result: The approach significantly reduces inference cost and input size compared to conventional LLM processing of complete sequences. It outperforms state-of-the-art techniques for learning event sequence representations on real-world financial datasets.

Conclusion: LATTE provides an efficient framework for learning client embeddings that remains deployable in latency-sensitive environments while leveraging LLM knowledge without the computational burden of full sequence processing.

Abstract: Learning client embeddings from sequences of their historic communications is central to financial applications. While large language models (LLMs) offer general world knowledge, their direct use on long event sequences is computationally expensive and impractical in real-world pipelines. In this paper, we propose LATTE, a contrastive learning framework that aligns raw event embeddings with semantic embeddings from frozen LLMs. Behavioral features are summarized into short prompts, embedded by the LLM, and used as supervision via contrastive loss. The proposed approach significantly reduces inference cost and input size compared to conventional processing of the complete sequence by an LLM. We experimentally show that our method outperforms state-of-the-art techniques for learning event sequence representations on real-world financial datasets while remaining deployable in latency-sensitive environments.
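
A minimal sketch of the alignment objective, assuming a symmetric InfoNCE loss between event-sequence embeddings and frozen-LLM summary embeddings (the dimensions and temperature are placeholders, not LATTE's reported settings):

```python
# Contrastive alignment: matched (event, summary) pairs sit on the diagonal.
import torch
import torch.nn.functional as F

def contrastive_align(event_emb, llm_emb, temperature=0.07):
    # event_emb: (B, d) from the event-sequence encoder
    # llm_emb:   (B, d) from the frozen LLM over summary prompts
    e = F.normalize(event_emb, dim=-1)
    t = F.normalize(llm_emb, dim=-1)
    logits = e @ t.T / temperature           # (B, B) similarity matrix
    labels = torch.arange(e.shape[0])        # matched pairs on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

event = torch.randn(8, 256, requires_grad=True)   # from the event encoder
with torch.no_grad():
    text = torch.randn(8, 256)                    # from the frozen LLM
loss = contrastive_align(event, text)
loss.backward()                                   # only the encoder side trains
print(float(loss))
```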

[48] Feel the Difference? A Comparative Analysis of Emotional Arcs in Real and LLM-Generated CBT Sessions

Xiaoyi Wang, Jiwei Zhang, Guangtao Zhang, Honglei Guo

Main category: cs.CL

TL;DR: Real vs. synthetic therapy dialogues differ in emotional dynamics: real CBT sessions show greater emotional variability and more authentic emotional patterns than LLM-generated ones, despite synthetic dialogues being fluent and coherent.

DetailsMotivation: To investigate whether LLM-generated synthetic therapy dialogues capture the nuanced emotional dynamics of real therapy sessions, as these synthetic conversations are increasingly used in mental health NLP but their emotional fidelity remains unclear.

Method: Introduce RealCBT dataset of authentic CBT dialogues and compare emotional arcs between real and LLM-generated CBT sessions using the Utterance Emotion Dynamics framework across valence, arousal, and dominance dimensions, analyzing both full dialogues and individual speaker roles.

Result: Synthetic dialogues are fluent and structurally coherent but diverge from real conversations in key emotional properties: real sessions show greater emotional variability, more emotion-laden language, and more authentic patterns of reactivity and regulation. Emotional arc similarity remains low across all pairings, especially between real and synthetic speakers.

Conclusion: Current LLM-generated therapy data has limitations in capturing authentic emotional dynamics, highlighting the importance of emotional fidelity in mental health applications. The RealCBT dataset is released to support future research.

Abstract: Synthetic therapy dialogues generated by large language models (LLMs) are increasingly used in mental health NLP to simulate counseling scenarios, train models, and supplement limited real-world data. However, it remains unclear whether these synthetic conversations capture the nuanced emotional dynamics of real therapy. In this work, we introduce RealCBT, a dataset of authentic cognitive behavioral therapy (CBT) dialogues, and conduct the first comparative analysis of emotional arcs between real and LLM-generated CBT sessions. We adapt the Utterance Emotion Dynamics framework to analyze fine-grained affective trajectories across valence, arousal, and dominance dimensions. Our analysis spans both full dialogues and individual speaker roles (counselor and client), using real sessions from the RealCBT dataset and synthetic dialogues from the CACTUS dataset. We find that while synthetic dialogues are fluent and structurally coherent, they diverge from real conversations in key emotional properties: real sessions exhibit greater emotional variability, more emotion-laden language, and more authentic patterns of reactivity and regulation. Moreover, emotional arc similarity remains low across all pairings, with especially weak alignment between real and synthetic speakers. These findings underscore the limitations of current LLM-generated therapy data and highlight the importance of emotional fidelity in mental health applications. To support future research, our dataset RealCBT is released at https://gitlab.com/xiaoyi.wang/realcbt-dataset.
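
As a toy illustration of an utterance-level emotional arc, the snippet below averages a tiny valence lexicon per utterance and reports trajectory variability; the lexicon and metric are stand-ins, and the actual Utterance Emotion Dynamics framework is considerably richer.

```python
# Compute a valence trajectory over a session and its variability.
import statistics

valence = {"good": 0.8, "calm": 0.6, "sad": -0.7, "anxious": -0.6}

def utterance_valence(utt):
    scores = [valence[w] for w in utt.lower().split() if w in valence]
    return sum(scores) / len(scores) if scores else 0.0

session = ["I feel anxious and sad", "That is understandable", "I feel calm now"]
arc = [utterance_valence(u) for u in session]
print(arc, statistics.pstdev(arc))  # the arc and its emotional variability
```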

[49] Designing LLMs for cultural sensitivity: Evidence from English-Japanese translation

Helene Tenzer, Oumnia Abidi, Stefan Feuerriegel

Main category: cs.CL

TL;DR: LLMs can translate literally but may lack cultural sensitivity; targeted prompting improves cultural appropriateness in English-Japanese workplace email translations.

DetailsMotivation: LLMs are increasingly used in multilingual communication, but while they achieve near-perfect literal translations, it's unclear whether they support culturally appropriate communication across different cultural contexts.

Method: Analyzed cultural sensitivity of LLMs in English-Japanese workplace email translations using three prompting strategies: (1) naive “just translate” prompts, (2) audience-targeted prompts specifying recipient’s cultural background, and (3) instructional prompts with explicit guidance on Japanese communication norms. Used mixed-methods study analyzing culture-specific language patterns and native speaker evaluations of translation tone appropriateness.

Result: Culturally-tailored prompting can improve cultural fit in translations. The study found that targeted prompting strategies lead to more culturally appropriate communication compared to naive translation approaches.

Conclusion: Culturally-sensitive prompting strategies can enhance LLMs’ ability to produce culturally appropriate translations. The paper offers recommendations for designing culturally inclusive LLMs in multilingual settings, emphasizing the importance of cultural awareness beyond literal translation accuracy.

Abstract: Large language models (LLMs) are increasingly used in everyday communication, including multilingual interactions across different cultural contexts. While LLMs can now generate near-perfect literal translations, it remains unclear whether LLMs support culturally appropriate communication. In this paper, we analyze the cultural sensitivity of different LLM designs when applied to English-Japanese translations of workplace e-mails. Here, we vary the prompting strategies: (1) naive “just translate” prompts, (2) audience-targeted prompts specifying the recipient’s cultural background, and (3) instructional prompts with explicit guidance on Japanese communication norms. Using a mixed-methods study, we then analyze culture-specific language patterns to evaluate how well translations adapt to cultural norms. Further, we examine the appropriateness of the tone of the translations as perceived by native speakers. We find that culturally-tailored prompting can improve cultural fit, based on which we offer recommendations for designing culturally inclusive LLMs in multilingual settings.
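
The three prompting strategies are straightforward to express as templates; the wording below is paraphrased for illustration, not the study's verbatim prompts.

```python
# The three prompting conditions: naive, audience-targeted, instructional.
email = "Hi Tanaka-san, can you send the report by Friday? Thanks!"

prompts = {
    "naive": f"Translate this e-mail into Japanese:\n{email}",
    "audience": ("Translate this e-mail into Japanese. The recipient is a "
                 f"Japanese colleague at a traditional firm:\n{email}"),
    "instructional": ("Translate this e-mail into Japanese. Follow Japanese "
                      "business norms: use keigo, soften the request, and "
                      f"open with a seasonal or thankful greeting:\n{email}"),
}
for name, p in prompts.items():
    print(f"--- {name} ---\n{p}\n")
```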

[50] Knowledge Editing with Subspace-Aware Key-Value Mappings

Haewon Park, Sangwoo Kim, Yohan Jo

Main category: cs.CL

TL;DR: SUIT (Subspace Knowledge Edit) improves knowledge editing in language models by modifying only critical feature subspaces, reducing perturbations while maintaining edit efficacy.

DetailsMotivation: Existing locate-then-edit approaches for knowledge editing in LMs cause significant perturbations to edited models because they modify MLP layers without constraints on key and value vectors, disrupting unrelated knowledge.

Method: SUIT identifies and modifies only the subspace of critical features relevant to the edit, rather than making unconstrained modifications to key-value mappings in MLP layers.

Result: Empirical results on LLaMA-3-8B, GPT-J-6B, and Qwen2.5-7B show SUIT dramatically improves knowledge preservation over strong baselines while maintaining high edit efficacy, confirming successful identification of critical subspaces.

Conclusion: SUIT effectively addresses the perturbation problem in knowledge editing by targeting only relevant feature subspaces, offering a more precise and less disruptive approach to correcting factual errors in language models.

Abstract: Knowledge editing aims to efficiently correct factual errors in Language Models (LMs). The popular locate-then-edit approach modifies an MLP layer by finding an optimal mapping between its input vector (key) and output vector (value) that leads to the expression of the edited knowledge. However, existing methods without any constraints on the key and value vectors cause significant perturbations to the edited model. To address this, we propose Subspace Knowledge Edit (SUIT), a method that identifies and modifies only the subspace of critical features relevant to the edit. Our empirical results on LLaMA-3-8B, GPT-J-6B, and Qwen2.5-7B models show that SUIT dramatically improves knowledge preservation over strong baselines while maintaining high edit efficacy. This effectiveness confirms that SUIT successfully identifies the critical subspace for the edit. Further analyses provide additional validation for our approach. The source code and data will be released to the public upon publication of the paper.
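
As a rough illustration of a subspace-constrained edit (SUIT's actual subspace identification is more involved), the sketch below projects the key through a low-dimensional feature subspace before forming the usual rank-one update, so the edit acts only through those directions:

```python
# Rank-one knowledge edit restricted to a k-dimensional feature subspace.
import torch

d, k = 64, 8
W = torch.randn(d, d)          # MLP projection being edited
key = torch.randn(d)           # activation (key) for the edited subject
target = torch.randn(d)        # desired output (value) vector

# Critical subspace: here a random orthonormal basis stands in for the
# edit-relevant features SUIT identifies.
basis, _ = torch.linalg.qr(torch.randn(d, k))
P = basis @ basis.T            # projector onto the subspace

resid = target - W @ key       # what the edit must add
key_sub = P @ key              # act only through the critical features
W_edited = W + torch.outer(resid, key_sub) / (key @ key_sub)

print(torch.allclose(W_edited @ key, target, atol=1e-3))  # edit succeeds
```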

[51] SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching

Xinye Zhao, Spyridon Mastorakis

Main category: cs.CL

TL;DR: SemShareKV is a KV cache sharing framework that accelerates LLM inference by reusing KV caches from semantically similar prompts using fuzzy token matching with LSH and RoPE, achieving up to 6.25× speedup with minimal quality loss.

DetailsMotivation: The memory footprint of KV caches during LLM inference has become a bottleneck. Existing approaches focus on compressing KV caches within single prompts or reusing exact token matches, but these fail in scenarios where prompts are semantically similar but lexically different (common in tasks like multi-document summarization and conversational agents).

Method: SemShareKV uses fuzzy token matching via locality-sensitive hashing (LSH) on token embeddings and incorporates Rotary Position Embedding (RoPE) to preserve positional information. It selectively reuses relevant key-value pairs from reference prompts’ caches instead of relying on exact token matches.

Result: Experiments on diverse summarization datasets show up to 6.25× speedup and 42% lower GPU memory usage with 5k tokens input, with negligible quality degradation.

Conclusion: SemShareKV demonstrates the potential of semantic-aware cache sharing for efficient LLM inference, effectively addressing the KV cache bottleneck by reusing semantically similar content across prompts.

Abstract: As large language models (LLMs) continue to scale, the memory footprint of key-value (KV) caches during inference has become a significant bottleneck. Existing approaches primarily focus on compressing KV caches within a single prompt or reusing shared prefixes or frequently occurring text segments across prompts. However, such strategies are limited in scenarios where prompts are semantically similar but lexically different, which frequently occurs in tasks such as multi-document summarization and conversational agents. We propose SemShareKV, a KV cache sharing and compression framework that accelerates LLM inference by reusing KVCache in semantically similar prompts. Instead of relying on exact token matches, SemShareKV applies fuzzy token matching using locality-sensitive hashing (LSH) on token embeddings and incorporates Rotary Position Embedding (RoPE) to better preserve positional information. By selectively reusing relevant key-value pairs from a reference prompt’s cache, SemShareKV reduces redundant computation while maintaining output quality. Experiments on diverse summarization datasets show up to 6.25× speedup and 42% lower GPU memory usage with 5k tokens input, with negligible quality degradation. These results highlight the potential of semantic-aware cache sharing for efficient LLM inference.
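
A minimal sketch of the fuzzy-matching primitive, assuming random-hyperplane LSH (SimHash) codes over token embeddings; the bit count is a placeholder, and RoPE handling is omitted:

```python
# Hash each token embedding to a short binary code; tokens of a new prompt
# that land in the same bucket as cached reference tokens can reuse KV pairs.
import torch

def simhash(emb, planes):
    # emb: (n, d) token embeddings; planes: (d, bits) random hyperplanes
    bits = (emb @ planes > 0).int()              # (n, bits) sign pattern
    weights = 2 ** torch.arange(planes.shape[1])
    return (bits * weights).sum(-1)              # one integer code per token

d, bits = 128, 16
planes = torch.randn(d, bits)
ref = torch.randn(40, d)                  # tokens of the cached reference prompt
new = ref + 0.05 * torch.randn(40, d)     # semantically similar, not identical

ref_codes, new_codes = simhash(ref, planes), simhash(new, planes)
hits = sum(1 for c in new_codes if (ref_codes == c).any())
print(f"tokens with a reusable cache entry: {hits / len(new_codes):.0%}")
```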

[52] Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization

Wengao Ye, Yan Liang, Lianlei Shan

Main category: cs.CL

TL;DR: LTPO is a test-time optimization framework that treats latent thought vectors as dynamic parameters, using policy gradients with intrinsic confidence rewards to enhance LLM reasoning without parameter updates.

DetailsMotivation: While latent reasoning in LLMs is more efficient than explicit Chain-of-Thought, it becomes brittle on challenging out-of-distribution tasks where robust reasoning is critical. Current approaches lack robustness when faced with complex reasoning problems.

Method: Latent Thought Policy Optimization (LTPO) treats intermediate latent thought vectors as dynamic parameters optimized per problem instance. It uses online policy gradient guided by intrinsic confidence-based rewards computed from the LLM’s own output distributions, requiring no external supervision or text generation during optimization.

Result: Extensive experiments on five reasoning benchmarks show LTPO matches or surpasses strong baselines on standard tasks and demonstrates remarkable robustness. Most notably, on highly challenging AIME benchmarks where existing latent reasoning baselines collapse to near-zero accuracy, LTPO delivers substantial improvements.

Conclusion: LTPO provides a parameter-free framework that enhances LLM reasoning at test time, showing unique capability for complex reasoning tasks where other latent reasoning approaches fail, without requiring model parameter updates or external supervision.

Abstract: Recent advancements in Large Language Models (LLMs) have shifted from explicit Chain-of-Thought (CoT) reasoning to more efficient latent reasoning, where intermediate thoughts are represented as vectors rather than text. However, latent reasoning can be brittle on challenging, out-of-distribution tasks where robust reasoning is most critical. To overcome these limitations, we introduce Latent Thought Policy Optimization (LTPO), a parameter-free framework that enhances LLM reasoning entirely at test time, without requiring model parameter updates. LTPO treats intermediate latent “thought” vectors as dynamic parameters that are actively optimized for each problem instance. It employs an online policy gradient method guided by an intrinsic, confidence-based reward signal computed directly from the frozen LLM’s own output distributions, eliminating the need for external supervision or expensive text generation during optimization. Extensive experiments on five reasoning benchmarks show that LTPO not only matches or surpasses strong baselines on standard tasks but also demonstrates remarkable robustness where others fail. Most notably, on highly challenging AIME benchmarks where existing latent reasoning baselines collapse to near-zero accuracy, LTPO delivers substantial improvements, showcasing a unique capability for complex reasoning.
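
A compact sketch of the test-time loop in the LTPO spirit: the latent thought is the only trainable parameter, rewards come from a frozen model's confidence, and updates follow REINFORCE with a baseline. The toy frozen head and reward below are stand-ins for the LLM, not the paper's implementation.

```python
# Optimize a latent "thought" vector at test time with a policy gradient on
# an intrinsic confidence reward; no model parameters are updated.
import torch

torch.manual_seed(0)
frozen = torch.nn.Linear(16, 10)           # stand-in for the frozen LLM head
for p in frozen.parameters():
    p.requires_grad_(False)

def confidence(z):
    # Intrinsic reward: max softmax probability of the frozen model's output.
    return torch.softmax(frozen(z), -1).max(-1).values

mu = torch.zeros(16, requires_grad=True)   # latent thought (policy mean)
opt = torch.optim.Adam([mu], lr=0.1)
for _ in range(50):
    z = (mu + torch.randn(32, 16)).detach()     # sampled thoughts z ~ N(mu, I)
    with torch.no_grad():
        r = confidence(z)                       # rewards, no grad through LLM
    logp = -0.5 * ((z - mu) ** 2).sum(-1)       # log N(z; mu, I), up to const
    loss = -((r - r.mean()) * logp).mean()      # REINFORCE with baseline
    opt.zero_grad(); loss.backward(); opt.step()

print(float(confidence(mu.detach())))      # confidence after optimization
```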

[53] Artificial Hippocampus Networks for Efficient Long-Context Modeling

Yunhao Fang, Weihao Yu, Shu Zhong, Qinghao Ye, Xuehan Xiong, Lai Wei

Main category: cs.CL

TL;DR: AHN framework combines sliding window attention with RNN-like compression for efficient long-sequence modeling, reducing FLOPs by 40.5% and memory by 74% while improving performance.

DetailsMotivation: Address the fundamental trade-off between efficient fixed-size memory in RNNs and lossless growing memory in Transformers for long-sequence modeling, inspired by cognitive science's Multi-Store Model.

Method: Maintain sliding window KV cache as short-term memory, use Artificial Hippocampus Network (AHN) to compress out-of-window info into fixed-size long-term memory. Implement AHNs with modern RNN architectures (Mamba2, DeltaNet, GatedDeltaNet) and train via self-distillation with frozen base model.

Result: AHN-augmented models outperform sliding window baselines, achieve comparable/superior performance to full-attention models while reducing computational requirements. Qwen2.5-3B-Instruct shows 40.5% FLOPs reduction, 74% memory cache reduction, and LV-Eval score improvement from 4.41 to 5.88.

Conclusion: The AHN framework successfully bridges RNN efficiency with Transformer fidelity for long-sequence modeling, offering substantial computational savings while maintaining or improving performance on long-context benchmarks.

Abstract: Long-sequence modeling faces a fundamental trade-off between the efficiency of compressive fixed-size memory in RNN-like models and the fidelity of lossless growing memory in attention-based Transformers. Inspired by the Multi-Store Model in cognitive science, we introduce a memory framework of artificial neural networks. Our method maintains a sliding window of the Transformer’s KV cache as lossless short-term memory, while a learnable module termed Artificial Hippocampus Network (AHN) recurrently compresses out-of-window information into a fixed-size compact long-term memory. To validate this framework, we instantiate AHNs using modern RNN-like architectures, including Mamba2, DeltaNet, and GatedDeltaNet to augment open-weight LLMs. We also propose an efficient self-distillation training method where all parameters of the base model are frozen and only the parameters from AHNs are optimized. For inference, our method sets a default large sliding window size of 32k for attention, and AHNs activate only when the sequence length exceeds the 32k window, addressing the quadratic-complexity issue of attention that emerges at that scale. Extensive experiments on long-context benchmarks LV-Eval and InfiniteBench demonstrate that AHN-augmented models consistently outperform sliding window baselines and achieve performance comparable or even superior to full-attention models, while substantially reducing computational and memory requirements. For instance, augmenting the Qwen2.5-3B-Instruct with AHNs reduces inference FLOPs by 40.5% and memory cache by 74.0%, while improving its average score on LV-Eval (128k sequence length) from 4.41 to 5.88. Code is available at: https://github.com/ByteDance-Seed/AHN.
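
Schematically, the memory layout pairs an exact sliding-window KV cache with a fixed-size recurrent state that absorbs evicted entries. The gated update below is a generic RNN stand-in for illustration, not Mamba2 or DeltaNet specifically:

```python
# Sliding window of exact KV pairs plus a fixed-size compressed memory that
# recurrently absorbs whatever falls out of the window.
import torch

class WindowPlusAHN:
    def __init__(self, d, window):
        self.window, self.kv = window, []      # short-term: exact KV cache
        self.state = torch.zeros(d)            # long-term: fixed-size memory
        self.Wg = torch.randn(d, d) * 0.01     # toy gate parameters

    def append(self, k, v):
        self.kv.append((k, v))
        if len(self.kv) > self.window:         # evict oldest into the AHN
            old_k, old_v = self.kv.pop(0)
            gate = torch.sigmoid(self.Wg @ old_k)
            self.state = gate * self.state + (1 - gate) * old_v

mem = WindowPlusAHN(d=64, window=4)
for _ in range(10):
    mem.append(torch.randn(64), torch.randn(64))
print(len(mem.kv), mem.state.shape)  # 4 exact entries + one compressed state
```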

[54] GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians

Xiuyuan Chen, Tao Sun, Dexin Su, Ailing Yu, Junwei Liu, Zhe Chen, Gangzeng Jin, Xin Wang, Jingnan Liu, Hansong Xiao, Hualei Zhou, Dongjie Tao, Chunxiao Guo, Minghui Yang, Yuan Xia, Jing Zhao, Qianrui Fan, Yanyun Wang, Shuai Zhen, Kezhong Chen, Jun Wang, Zewen Sun, Heng Zhao, Tian Guan, Shaodong Wang, Geyun Chang, Jiaming Deng, Hongchengcheng Chen, Kexin Feng, Ruzhen Li, Jiayi Geng, Changtai Zhao, Jun Wang, Guihu Lin, Peihao Li, Liqi Liu, Peng Wei, Jian Wang, Jinjie Gu, Ping Wang, Fan Yang

Main category: cs.CL

TL;DR: GAPS framework introduces automated multidimensional evaluation (Grounding, Adequacy, Perturbation, Safety) for AI clinician systems, overcoming limitations of current benchmarks through guideline-anchored pipeline and LLM-based scoring.

DetailsMotivation: Current AI clinician benchmarks (multiple-choice exams, manual rubrics) fail to capture real-world clinical practice requirements for depth, robustness, and safety, creating a need for more comprehensive evaluation methods.

Method: Developed fully automated pipeline: assembles evidence neighborhoods, creates dual graph/tree representations, generates questions across G-levels. Uses DeepResearch agent for GRADE-consistent rubrics in ReAct loop, with LLM ensemble scoring.

Result: Automated questions show high quality (90% clinician agreement, Cohen’s Kappa 0.77). State-of-the-art models show key failures: performance degrades with reasoning depth, struggles with completeness, vulnerable to perturbations and safety issues.

Conclusion: GAPS provides reproducible, scalable evaluation for AI clinician systems, guiding development toward safer clinical practice. Benchmark dataset and code are publicly available.

Abstract: Current benchmarks for AI clinician systems, often based on multiple-choice exams or manual rubrics, fail to capture the depth, robustness, and safety required for real-world clinical practice. To address this, we introduce the GAPS framework, a multidimensional paradigm for evaluating Grounding (cognitive depth), Adequacy (answer completeness), Perturbation (robustness), and Safety. Critically, we developed a fully automated, guideline-anchored pipeline to construct a GAPS-aligned benchmark end-to-end, overcoming the scalability and subjectivity limitations of prior work. Our pipeline assembles an evidence neighborhood, creates dual graph and tree representations, and automatically generates questions across G-levels. Rubrics are synthesized by a DeepResearch agent that mimics GRADE-consistent, PICO-driven evidence review in a ReAct loop. Scoring is performed by an ensemble of large language model (LLM) judges. Validation confirmed our automated questions are high-quality and align with clinician judgment (90% agreement, Cohen’s Kappa 0.77). Evaluating state-of-the-art models on the benchmark revealed key failure modes: performance degrades sharply with increased reasoning depth (G-axis), models struggle with answer completeness (A-axis), and they are highly vulnerable to adversarial perturbations (P-axis) as well as certain safety issues (S-axis). This automated, clinically-grounded approach provides a reproducible and scalable method for rigorously evaluating AI clinician systems and guiding their development toward safer, more reliable clinical practice. The benchmark dataset GAPS-NSCLC-preview and evaluation code are publicly available at https://huggingface.co/datasets/AQ-MedAI/GAPS-NSCLC-preview and https://github.com/AQ-MedAI/MedicalAiBenchEval.

[55] Accelerating Mobile Language Model via Speculative Decoding and NPU-Coordinated Execution

Zhiyang Chen, Daliang Xu, Haiyang Shen, Chiheng Lou, Mengwei Xu, Shangguang Wang, Xin Jin, Yun Ma

Main category: cs.CL

TL;DR: sd.npu is a mobile NPU acceleration framework for on-device RAG that uses pipelined execution to hide graph switching overhead and NPU-centric speculative decoding to improve hardware utilization, achieving 1.06-3.81× speedups and 1.07-4.71× energy savings.

DetailsMotivation: On-device RAG is promising for privacy and responsiveness but suffers from mobile NPU architectural constraints. Current hardware struggles with variable RAG workloads: transitions between context processing and token generation cause overhead due to static graph constraints, and memory-bound generation phases leave computational resources underutilized.

Method: Two key techniques: 1) Pipelined execution strategy that masks model reconfiguration overhead by parallelizing decoding graph loading with chunked prefill computation, ensuring continuous execution flow. 2) NPU-centric speculative decoding mechanism that calibrates generation distributions and extends draft sequences to convert idle NPU cycles into valid token throughput.

Result: Experiments on commercial smartphones show the framework significantly outperforms existing baselines, delivering 1.06× to 3.81× speedups and 1.07× to 4.71× energy savings across various RAG tasks.

Conclusion: sd.npu provides an effective holistic acceleration framework that maximizes NPU efficiency for on-device RAG ecosystems by addressing both phase transition overhead and hardware underutilization issues, making mobile RAG more practical and efficient.

Abstract: Performing Retrieval-Augmented Generation (RAG) directly on mobile devices is promising for data privacy and responsiveness but is hindered by the architectural constraints of mobile NPUs. Specifically, current hardware struggles with the variable workloads intrinsic to RAG: the transition between processing extensive contexts and generating tokens incurs significant overhead due to static graph constraints, while the memory-bound generation phase leaves computational resources underutilized. In this work, we propose a holistic acceleration framework sd.npu, designed to maximize NPU efficiency for the on-device RAG ecosystem. To address the latency caused by NPU graph switching during phase transitions, we introduce a pipelined execution strategy. This approach masks the overhead of model reconfiguration by parallelizing the loading of decoding graphs with the computation of partitioned context chunks (chunked prefill), thereby ensuring continuous execution flow. Furthermore, to mitigate low hardware utilization during the decoding phase, we develop an NPU-centric speculative decoding mechanism. By calibrating generation distributions and extending draft sequences, our method effectively converts idle NPU cycles into valid token throughput. Experiments on commercial smartphones show that our framework significantly outperforms existing baselines, delivering 1.06×–3.81× speedups and 1.07×–4.71× energy savings across various RAG tasks.
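
The pipelining idea can be seen in miniature with two overlapping tasks: decode-graph loading proceeds in the background while chunked prefill runs, so the phase switch is mostly hidden. The timings and threading model below are invented for illustration; real NPU runtimes differ.

```python
# Overlap decode-graph loading with chunked prefill so the phase transition
# adds (almost) no latency.
import threading
import time

def load_decode_graph():
    time.sleep(0.30)          # pretend graph compilation/IO takes 300 ms

def prefill_chunk(i):
    time.sleep(0.10)          # pretend one context chunk takes 100 ms

start = time.time()
loader = threading.Thread(target=load_decode_graph)
loader.start()                # loading runs concurrently with prefill
for i in range(4):
    prefill_chunk(i)
loader.join()                 # decode graph is already resident by now
print(f"elapsed ~{time.time() - start:.2f}s (vs ~0.70s if run sequentially)")
```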

[56] MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models

Kailin Jiang, Ning Jiang, Yuntao Du, Yuchen Ren, Yuchen Li, Yifan Gao, Jinhe Bi, Yunpu Ma, Qingqing Liu, Xianhao Wang, Yifan Jia, Hongbo Jiang, Yaocong Hu, Bin Li, Lei Liu

Main category: cs.CL

TL;DR: MINED benchmark evaluates LMMs’ temporal awareness across 6 dimensions and 11 tasks, revealing most models struggle with time-sensitive knowledge, with Gemini-2.5-Pro performing best and knowledge editing showing promise for updates.

DetailsMotivation: Large Multimodal Models (LMMs) have rich factual knowledge but struggle with time-sensitive information due to static representations. Existing benchmarks are inadequate for evaluating temporal awareness, creating a gap in assessing LMMs' ability to understand evolving knowledge.

Method: Proposed MINED benchmark constructed from Wikipedia by professional annotators, containing 2,104 time-sensitive knowledge samples across six knowledge types. Evaluates 15 LMMs across 6 dimensions (cognition, awareness, trustworthiness, understanding, reasoning, robustness) and 11 challenging tasks.

Result: Gemini-2.5-Pro achieved highest average CEM score of 63.07, while most open-source LMMs lack time understanding ability. LMMs perform best on organization knowledge and worst on sport knowledge. Knowledge editing methods show promise for updating time-sensitive knowledge in single editing scenarios.

Conclusion: Current LMMs have significant limitations in handling time-sensitive knowledge, highlighting the need for improved temporal awareness. The MINED benchmark provides comprehensive evaluation, and knowledge editing offers a promising direction for updating factual knowledge in LMMs.

Abstract: Large Multimodal Models (LMMs) encode rich factual knowledge via cross-modal pre-training, yet their static representations struggle to maintain an accurate understanding of time-sensitive factual knowledge. Existing benchmarks remain constrained by static designs, inadequately evaluating LMMs’ ability to understand time-sensitive knowledge. To address this gap, we propose MINED, a comprehensive benchmark that evaluates temporal awareness along 6 key dimensions and 11 challenging tasks: cognition, awareness, trustworthiness, understanding, reasoning, and robustness. MINED is constructed from Wikipedia by two professional annotators, containing 2,104 time-sensitive knowledge samples spanning six knowledge types. Evaluating 15 widely used LMMs on MINED shows that Gemini-2.5-Pro achieves the highest average CEM score of 63.07, while most open-source LMMs still lack time understanding ability. Meanwhile, LMMs perform best on organization knowledge, whereas their performance is weakest on sport. To address these challenges, we investigate the feasibility of updating time-sensitive knowledge in LMMs through knowledge editing methods and observe that LMMs can effectively update knowledge via knowledge editing methods in single editing scenarios.

[57] DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains

Xiying Zhao, Zhoufutu Wen, Zhixuan Chen, Jingzhe Ding, Jianpeng Jiao, Shuai Li, Xi Li, Danni Liang, Shengda Long, Qianqian Liu, Xianbo Wu, Hongwan Gao, Xiang Gao, Liang Hu, Jiashuo Liu, Mengyun Liu, Weiran Shi, Chenghao Yang, Qianyu Yang, Xuanliang Zhang, Ge Zhang, Wenhao Huang, Yuwen Tang

Main category: cs.CL

TL;DR: DiscoX is a new benchmark for discourse-level expert Chinese-English translation, with Metric-S as a reference-free evaluation system that outperforms existing metrics and reveals LLMs still trail human experts.

DetailsMotivation: Current translation evaluation methods focus on segment-level accuracy and fluency, but expert domain translations require discourse-level coherence and terminological precision, which current methods inadequately address.

Method: Created DiscoX benchmark with 200 professionally-curated texts from 7 domains (avg. 1700+ tokens) and developed Metric-S, a reference-free system for fine-grained assessment across accuracy, fluency, and appropriateness.

Result: Metric-S shows strong consistency with human judgments, significantly outperforming existing metrics. Experiments reveal a significant performance gap where even advanced LLMs trail human experts, validating DiscoX’s difficulty.

Conclusion: DiscoX and Metric-S provide a robust framework for rigorous evaluation of discourse-level expert translation, highlighting remaining challenges in achieving professional-grade machine translation and facilitating future LLM-based translation advancements.

Abstract: The evaluation of discourse-level translation in expert domains remains inadequate, despite its centrality to knowledge dissemination and cross-lingual scholarly communication. While these translations demand discourse-level coherence and strict terminological precision, current evaluation methods predominantly focus on segment-level accuracy and fluency. To address this limitation, we introduce DiscoX, a new benchmark for discourse-level and expert-level Chinese-English translation. It comprises 200 professionally-curated texts from 7 domains, with an average length exceeding 1700 tokens. To evaluate performance on DiscoX, we also develop Metric-S, a reference-free system that provides fine-grained automatic assessments across accuracy, fluency, and appropriateness. Metric-S demonstrates strong consistency with human judgments, significantly outperforming existing metrics. Our experiments reveal a remarkable performance gap: even the most advanced LLMs still trail human experts on these tasks. This finding validates the difficulty of DiscoX and underscores the challenges that remain in achieving professional-grade machine translation. The proposed benchmark and evaluation system provide a robust framework for more rigorous evaluation, facilitating future advancements in LLM-based translation.

[58] Confucius Code Agent: Scalable Agent Scaffolding for Real-World Codebases

Zhaodong Wang, Zhenting Qi, Sherman Wong, Nathan Hu, Samuel Lin, Jun Ge, Erwin Gao, Wenlin Chen, Yilun Du, Minlan Yu, Ying Zhang

Main category: cs.CL

TL;DR: Confucius Code Agent (CCA) is a scalable software engineering agent that combines research transparency with production-grade performance, achieving 54.3% Resolve@1 on SWE-Bench-Pro while bridging the gap between research prototypes and practical deployment.

DetailsMotivation: Existing research agents struggle with real-world scalability, while proprietary systems lack extensibility and transparency. There's a need for coding agents that can handle massive repositories, sustain long-horizon sessions, and coordinate complex toolchains reliably while maintaining interpretability and controllability.

Method: Built on Confucius SDK with three perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). Features include unified orchestrator with hierarchical working memory, persistent note-taking for continual learning, modular extension system for tool use, and a meta-agent that automates configuration synthesis/evaluation/refinement through build-test-improve loop.

Result: CCA achieves 54.3% Resolve@1 on SWE-Bench-Pro, exceeding prior research baselines and comparing favorably to commercial results under identical conditions (repositories, model backend, tool access).

Conclusion: Confucius SDK and CCA provide a general, extensible, production-grade foundation for building effective coding agents, bridging the gap between research prototypes and practical large-scale deployment in real-world software engineering.

Abstract: Real-world software engineering tasks require coding agents that can operate over massive repositories, sustain long-horizon sessions, and reliably coordinate complex toolchains at test time. Existing research-grade agents offer transparency but struggle when scaled to real-world workloads, while proprietary systems achieve strong practical performance but provide limited extensibility, interpretability, and controllability. We introduce the Confucius Code Agent (CCA), a scalable software engineering agent that can operate at large-scale codebases. CCA is built on top of the Confucius SDK, an agent development platform structured around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK integrates a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension system for reliable tool use. In addition, we introduce a meta-agent that automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid adaptation to new tasks, environments, and tool stacks. Instantiated with these mechanisms, CCA demonstrates strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA reaches a Resolve@1 of 54.3%, exceeding prior research baselines and comparing favorably to commercial results, under identical repositories, model backend, and tool access. Together, the Confucius SDK and CCA form a general, extensible, and production-grade foundation for building effective and robust coding agents, bridging the gap between research prototypes and practical large-scale deployment.

[59] Cooperative Retrieval-Augmented Generation for Question Answering: Mutual Information Exchange and Ranking by Contrasting Layers

Youmin Ko, Sungjong Seo, Hyunjoon Kim

Main category: cs.CL

TL;DR: CoopRAG is a cooperative RAG framework where retriever and LLM work together with intra-retriever layer cooperation to improve retrieval accuracy and reduce hallucinations in QA tasks.

DetailsMotivation: Existing RAG methods for QA are still prone to incorrect retrievals and hallucinations despite being designed to mitigate LLMs' factual inaccuracies.

Method: 1) Unroll questions into sub-questions with masked reasoning chains, 2) Retrieve documents using augmented queries, 3) Rerank documents via contrasting retriever layers, 4) Reconstruct reasoning chains by filling masked positions with LLM.

Result: CoopRAG consistently outperforms state-of-the-art QA methods on three multi-hop QA datasets and one simple QA dataset in both retrieval and QA performance.

Conclusion: The cooperative framework between retriever and LLM, along with intra-retriever layer cooperation, effectively addresses retrieval inaccuracies and hallucinations in QA tasks.

Abstract: Since large language models (LLMs) have a tendency to generate factually inaccurate output, retrieval-augmented generation (RAG) has gained significant attention as a key means to mitigate this downside of harnessing only LLMs. However, existing RAG methods for simple and multi-hop question answering (QA) are still prone to incorrect retrievals and hallucinations. To address these limitations, we propose CoopRAG, a novel RAG framework for the question answering task in which a retriever and an LLM work cooperatively with each other by exchanging informative knowledge, and the earlier and later layers of the retriever model work cooperatively with each other to accurately rank the retrieved documents relevant to a given query. In this framework, we (i) unroll a question into sub-questions and a reasoning chain in which uncertain positions are masked, (ii) retrieve the documents relevant to the question augmented with the sub-questions and the reasoning chain, (iii) rerank the documents by contrasting layers of the retriever, and (iv) reconstruct the reasoning chain by filling the masked positions via the LLM. Our experiments demonstrate that CoopRAG consistently outperforms state-of-the-art QA methods on three multi-hop QA datasets as well as a simple QA dataset in terms of both the retrieval and QA performances. Our code is available.
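
A hedged sketch of the layer-contrast reranking idea: score each document by how much its similarity to the query grows between an early and a late retriever layer (the precise contrast CoopRAG uses may differ from this stand-in):

```python
# Rerank by contrasting early- vs late-layer query-document similarity.
import torch

def layer_emb(hidden_states, layer):
    return hidden_states[layer].mean(dim=1)     # mean-pool a chosen layer

def contrastive_score(q_hs, d_hs, early=2, late=-1):
    cos = torch.nn.functional.cosine_similarity
    s_late = cos(layer_emb(q_hs, late), layer_emb(d_hs, late))
    s_early = cos(layer_emb(q_hs, early), layer_emb(d_hs, early))
    return s_late - s_early       # semantic gain beyond shallow lexical match

# Fake per-layer hidden states: list of (batch=1, seq, d) tensors.
q_hs = [torch.randn(1, 12, 64) for _ in range(8)]
d_hs = [torch.randn(1, 30, 64) for _ in range(8)]
print(float(contrastive_score(q_hs, d_hs)))
```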

[60] Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models

Chendong Sun, Ali Mao, Lei Xu, Mingmin Chen

Main category: cs.CL

TL;DR: EARS improves speculative decoding by dynamically adjusting acceptance thresholds based on model uncertainty, reducing random rejections and boosting throughput by up to 18.12%.

DetailsMotivation: Current speculative decoding suffers from "random rejection" problem where plausible candidate tokens are frequently rejected due to fixed random thresholds, especially in high-uncertainty scenarios like creative writing and open-domain QA, undermining inference efficiency.

Method: EARS (Efficient Adaptive Rejection Sampling) dynamically adjusts acceptance thresholds by incorporating the target model’s predictive uncertainty (1 - max(P_target)). It introduces a tolerance term proportional to this uncertainty, relaxing acceptance criteria when the model is uncertain while maintaining strict standards when confident.

Result: EARS significantly enhances speculative decoding efficiency, achieving up to 18.12% increase in throughput with only 0.84% accuracy drop on GSM8K benchmark. The method works without model architecture modifications and integrates seamlessly into existing frameworks.

Conclusion: EARS effectively addresses the random rejection problem in speculative decoding by adaptively adjusting acceptance thresholds based on model uncertainty, substantially improving inference efficiency while maintaining accuracy, making it a practical enhancement for existing systems.

Abstract: Speculative Decoding is a prominent technique for accelerating the autoregressive inference of large language models (LLMs) by employing a fast draft model to propose candidate token sequences and a large target model to verify them in parallel. However, its core component – the rejection sampling mechanism – relies on a fixed, context-independent random threshold. This leads to a significant “random rejection” problem in high-uncertainty generation scenarios, where plausible candidate tokens are frequently rejected due to random chance, undermining inference efficiency. This paper introduces Efficient Adaptive Rejection Sampling (EARS), a novel method that dynamically adjusts the acceptance threshold by incorporating the target model’s own predictive uncertainty, measured as 1 - max(P_target). By introducing a tolerance term proportional to this uncertainty, EARS intelligently relaxes the acceptance criterion when the model is uncertain, effectively reducing random rejections while maintaining strict standards when the model is confident. Experiments on creative writing and open-domain QA tasks demonstrate that EARS significantly enhances the efficiency of speculative decoding, achieving up to an 18.12% increase in throughput with a negligible 0.84% accuracy drop on the GSM8K benchmark. The method requires no modifications to model architectures and can be seamlessly integrated into existing speculative decoding frameworks.
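
A sketch of the acceptance rule, layered on the standard speculative-decoding ratio test. The tolerance coefficient epsilon below is an assumption, and note that any tolerance term slightly biases the output distribution, which is exactly the accuracy trade-off the paper quantifies:

```python
# EARS-style acceptance: relax the threshold when the target model is unsure.
import torch

def ears_accept(p_target, p_draft, token, epsilon=0.3):
    # p_target, p_draft: (vocab,) distributions at the current position.
    ratio = p_target[token] / p_draft[token]        # standard acceptance ratio
    tol = epsilon * (1.0 - p_target.max())          # grows with uncertainty
    return torch.rand(()) < ratio + tol

vocab = 100
p_t = torch.softmax(torch.randn(vocab), -1)
p_d = torch.softmax(torch.randn(vocab), -1)
tok = int(torch.multinomial(p_d, 1))                # token proposed by draft
print(bool(ears_accept(p_t, p_d, tok)))
```

When the target model is confident (max probability near 1), the tolerance vanishes and the rule reduces to standard rejection sampling.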

[61] VLegal-Bench: Vietnamese Legal Benchmark

Nguyen Tien Dong, Minh-Anh Nguyen, Thanh Dat Hoang, Nguyen Tuan Ngoc, Dao Xuan Quang Minh, Phan Phi Hai, Nguyen Thi Ngoc Anh, Dang Van Tu, Binh Vu

Main category: cs.CL

TL;DR: VLegal-Bench is the first comprehensive benchmark for evaluating LLMs on Vietnamese legal tasks, featuring 10,450 expert-annotated samples across multiple cognitive levels to assess practical legal understanding and application.

DetailsMotivation: The complexity, hierarchical organization, and frequent revisions of Vietnamese legislation create significant challenges for evaluating how well LLMs can interpret and utilize legal knowledge, creating a gap in standardized assessment tools for Vietnamese legal AI systems.

Method: Created VLegal-Bench using Bloom’s cognitive taxonomy to design tasks reflecting practical legal usage scenarios. Developed a rigorous annotation pipeline where legal experts label and cross-validate 10,450 samples, ensuring each is grounded in authoritative legal documents and mirrors real-world legal assistant workflows including general Q&A, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving.

Result: VLegal-Bench provides the first comprehensive benchmark for Vietnamese legal tasks with 10,450 expert-validated samples, establishing a standardized, transparent, and cognitively informed evaluation framework for LLM performance in Vietnamese legal contexts.

Conclusion: VLegal-Bench establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts and supports the development of more reliable, interpretable, and ethically aligned AI-assisted legal systems through its standardized evaluation framework.

Abstract: The rapid advancement of large language models (LLMs) has enabled new possibilities for applying artificial intelligence within the legal domain. Nonetheless, the complexity, hierarchical organization, and frequent revisions of Vietnamese legislation pose considerable challenges for evaluating how well these models interpret and utilize legal knowledge. To address this gap, Vietnamese Legal Benchmark (VLegal-Bench) is introduced, the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom’s cognitive taxonomy, VLegal-Bench encompasses multiple levels of legal understanding through tasks designed to reflect practical usage scenarios. The benchmark comprises 10,450 samples generated through a rigorous annotation pipeline, where legal experts label and cross-validate each instance using our annotation system to ensure every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows, including general legal questions and answers, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving tailored to Vietnamese law. By providing a standardized, transparent, and cognitively informed evaluation framework, VLegal-Bench establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts and supports the development of more reliable, interpretable, and ethically aligned AI-assisted legal systems.

[62] MMGR: Multi-Modal Generative Reasoning

Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Wen Xiao, Jiuxiang Gu, Nanyun Peng, Junjie Hu

Main category: cs.CL

TL;DR: MMGR is a new evaluation framework for video foundation models that assesses reasoning abilities beyond visual quality, revealing significant gaps in abstract reasoning and spatial planning capabilities.

DetailsMotivation: Current video foundation models produce visually realistic content, but existing evaluations do not test whether they properly capture physical, logical, and spatial constraints. Existing metrics like FVD focus on perceptual quality while ignoring reasoning failures in causality, physics, and global consistency.

Method: MMGR introduces a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. It evaluates across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (3D navigation/localization), and Physical Commonsense (sports/interactions). Uses fine-grained metrics requiring holistic correctness across video and image generation.

Result: Benchmarking leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image) reveals strong performance gaps. Models show moderate success on Physical Commonsense but perform poorly on Abstract Reasoning (below 10% accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings.

Conclusion: Current models have key limitations including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR provides a unified diagnostic benchmark and a path toward reasoning-aware generative world models.

Abstract: Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.

cs.CV

[63] Adaptive Multimodal Person Recognition: A Robust Framework for Handling Missing Modalities

Aref Farhadipour, Teodora Vukovic, Volker Dellwo, Petr Motlicek, Srikanth Madikeri

Main category: cs.CV

TL;DR: A trimodal person identification framework using voice, face, and gesture with cross-attention fusion and confidence weighting that maintains high accuracy even with missing modalities.

DetailsMotivation: Real-world person recognition systems often face missing or degraded modalities, requiring robust solutions that can handle incomplete data while maintaining high accuracy.

Method: Multi-task learning for independent modality processing, cross-attention and gated fusion mechanisms for modality interaction, and confidence-weighted fusion to adapt to missing/low-quality data.

Result: Achieves 99.18% Top-1 accuracy on CANDOR dataset, 99.92% on VoxCeleb1 in bimodal mode, and maintains high accuracy even with one or two missing modalities.

Conclusion: The proposed trimodal framework provides a robust solution for real-world person recognition applications by effectively handling modality loss through adaptive fusion mechanisms.

Abstract: Person recognition systems often rely on audio, visual, or behavioral cues, but real-world conditions frequently result in missing or degraded modalities. To address this challenge, we propose a Trimodal person identification framework that integrates voice, face, and gesture modalities, while remaining robust to modality loss. Our approach leverages multi-task learning to process each modality independently, followed by cross-attention and gated fusion mechanisms to facilitate interaction across modalities. Moreover, a confidence-weighted fusion strategy dynamically adapts to missing and low-quality data, ensuring optimal classification even in Unimodal or Bimodal scenarios. We evaluate our method on CANDOR, a newly introduced interview-based multimodal dataset, which we benchmark for the first time. Our results demonstrate that the proposed Trimodal system achieves 99.18% Top-1 accuracy on person identification tasks, outperforming conventional Unimodal and late-fusion approaches. In addition, we evaluate our model on the VoxCeleb1 dataset as a benchmark and reach 99.92% accuracy in Bimodal mode. Moreover, we show that our system maintains high accuracy even when one or two modalities are unavailable, making it a robust solution for real-world person recognition applications. The code and data for this work are publicly available.
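As an illustration of the confidence-weighted fusion idea, a minimal sketch follows; the dictionary-based interface and renormalization over available modalities are assumptions, not the authors' implementation.

```python
import torch

def confidence_weighted_fusion(logits, confidences):
    """Fuse per-modality class logits, skipping missing modalities (sketch).

    logits: dict mapping modality name -> (num_classes,) tensor, or None
            when that modality is unavailable.
    confidences: dict mapping modality name -> scalar confidence in [0, 1].
    """
    available = {m: z for m, z in logits.items() if z is not None}
    weights = torch.tensor([confidences[m] for m in available])
    weights = weights / weights.sum()          # renormalize over present modalities
    stacked = torch.stack(list(available.values()))
    return (weights.unsqueeze(1) * stacked).sum(dim=0)

# e.g. the gesture stream is missing at test time:
fused = confidence_weighted_fusion(
    {"voice": torch.randn(100), "face": torch.randn(100), "gesture": None},
    {"voice": 0.9, "face": 0.7, "gesture": 0.0},
)
```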

[64] SkyCap: Bitemporal VHR Optical-SAR Quartets for Amplitude Change Detection and Foundation-Model Evaluation

Paul Weinmann, Ferdinand Schenck, Martin Šiklar

Main category: cs.CV

TL;DR: SkyCap is a new bitemporal VHR optical-SAR dataset for infrastructure monitoring, enabling SAR change detection via optical-to-SAR label transfer without expert SAR annotations. Evaluation shows optical foundation models with proper preprocessing outperform SAR-specific models on SAR change detection.

DetailsMotivation: Infrastructure monitoring needs reliable high-resolution data with regular cadence. Optical VHR imagery is interpretable but cloud-dependent, while SAR provides all-weather capability but is difficult to annotate. There's a need for SAR change detection without expert annotations.

Method: Created SkyCap dataset by archive matching and co-registering SkySat (optical) and Capella Space (SAR) scenes. Used optical-to-SAR label transfer to obtain SAR amplitude change detection labels without SAR-expert annotations. Performed continued pretraining of SARATR-X on SAR data and benchmarked SAR-specific foundation models against optical FMs with different preprocessing.

Result: Optical foundation model MTP(ViT-B+RVSA) with dB+Z-score preprocessing achieved best F1 score (45.06), outperforming SAR-specific FMs pretrained on Capella data. Found strong sensitivity to preprocessing alignment with pretraining statistics, and optical model rankings don’t directly transfer to SAR change detection.

Conclusion: First evaluation of foundation models on VHR SAR amplitude change detection. Optical models with proper preprocessing can outperform SAR-specific models, but preprocessing alignment is crucial. Optical-to-SAR label transfer enables SAR change detection without expert annotations.

Abstract: Change detection for linear infrastructure monitoring requires reliable high-resolution data and regular acquisition cadence. Optical very-high-resolution (VHR) imagery is interpretable and straightforward to label, but clouds break this cadence. Synthetic Aperture Radar (SAR) enables all-weather acquisitions, yet is difficult to annotate. We introduce SkyCap, a bitemporal VHR optical-SAR dataset constructed by archive matching and co-registration of (optical) SkySat and Capella Space (SAR) scenes. We utilize optical-to-SAR label transfer to obtain SAR amplitude change detection (ACD) labels without requiring SAR-expert annotations. We perform continued pretraining of SARATR-X on our SAR data and benchmark the resulting SAR-specific foundation models (FMs) together with SARATR-X against optical FMs on SkyCap under different preprocessing choices. Among evaluated models, MTP(ViT-B+RVSA), an optical FM, with dB+Z-score preprocessing attains the best result (F1$_c$ = 45.06), outperforming SAR-specific FMs further pretrained directly on Capella data. We observe strong sensitivity to preprocessing alignment with pretraining statistics, and the ranking of optical models on optical change detection does not transfer one-to-one to SAR ACD. To our knowledge, this is the first evaluation of foundation models on VHR SAR ACD.
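A minimal sketch of the dB+Z-score preprocessing named above, assuming "dB" means 20·log10 of the amplitude image and per-image Z-score statistics; the paper may compute statistics differently (e.g., to match pretraining statistics).

```python
import numpy as np

def db_zscore(amplitude, eps=1e-6):
    """Convert SAR amplitude to decibels, then standardize (sketch)."""
    db = 20.0 * np.log10(np.maximum(amplitude, eps))  # amplitude -> dB (assumed convention)
    return (db - db.mean()) / (db.std() + eps)        # per-image Z-score (assumed scope)
```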

[65] SocialNav-MoE: A Mixture-of-Experts Vision Language Model for Socially Compliant Navigation with Reinforcement Fine-Tuning

Tomohito Kawabata, Xinyu Zhang, Ling Xiao

Main category: cs.CV

TL;DR: SocialNav-MoE: An efficient Mixture-of-Experts VLM for socially compliant navigation using reinforcement fine-tuning with semantic similarity rewards, achieving good balance between accuracy and efficiency on resource-constrained robots.

DetailsMotivation: Prior robot navigation work focuses mainly on safety while neglecting social compliance (human comfort, social norms, contextual appropriateness). Large VLMs are computationally expensive and unsuitable for real-time deployment on resource-constrained robotic platforms.

Method: Proposes SocialNav-MoE, an efficient Mixture-of-Experts vision language model with reinforcement fine-tuning (RFT). Introduces semantic similarity reward (SSR) to enhance decision-making. Studies different small language models (Phi, Qwen, StableLM), routing strategies, and vision encoders (CLIP vs. SigLIP, frozen vs. fine-tuned).

Result: Experiments on SNEI dataset show SocialNav-MoE achieves excellent balance between navigation accuracy and efficiency. SSR function is more effective than hard-level and character-level rewards.

Conclusion: SocialNav-MoE addresses the computational efficiency challenge for socially compliant navigation on resource-constrained robots, making real-time deployment feasible while maintaining good performance.

Abstract: For robots navigating in human-populated environments, safety and social compliance are equally critical, yet prior work has mostly emphasized safety. Socially compliant navigation that accounts for human comfort, social norms, and contextual appropriateness remains underexplored. Vision language models (VLMs) show promise for this task; however, large-scale models incur substantial computational overhead, leading to higher inference latency and energy consumption, which makes them unsuitable for real-time deployment on resource-constrained robotic platforms. To address this issue, we investigate the effectiveness of small VLMs and propose SocialNav-MoE, an efficient Mixture-of-Experts vision language model for socially compliant navigation with reinforcement fine-tuning (RFT). We further introduce a semantic similarity reward (SSR) to effectively leverage RFT for enhancing the decision-making capabilities. Additionally, we study the effectiveness of different small language model types (Phi, Qwen, and StableLM), routing strategies, and vision encoders (CLIP vs. SigLIP, frozen vs. fine-tuned). Experiments on the SNEI dataset demonstrate that SocialNav-MoE achieves an excellent balance between navigation accuracy and efficiency. The proposed SSR function is more effective than hard-level and character-level rewards. Source code will be released upon acceptance.
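The semantic similarity reward is described only at a high level; one plausible reading is cosine similarity between sentence embeddings of the generated and reference responses, sketched below. The encoder choice is hypothetical.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical encoder choice

def semantic_similarity_reward(generated: str, reference: str) -> float:
    """Reward a rollout by embedding cosine similarity (sketch of SSR)."""
    emb = encoder.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```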

[66] TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation

Zhenzhi Wang, Jian Wang, Ke Ma, Dahua Lin, Bing Zhou

Main category: cs.CV

TL;DR: TalkVerse is a large-scale open corpus for audio-driven talking video generation with 2.3M high-resolution clips, plus a reproducible 5B DiT baseline model that achieves minute-long generation with low drift and 10x lower inference cost than SOTA.

DetailsMotivation: Current audio-driven talking video generation systems rely on closed data or compute-heavy models, making fair comparison and reproducible research difficult. There's a need for open, large-scale datasets and efficient baseline models.

Method: Created TalkVerse dataset through transparent curation pipeline with scene-cut detection, aesthetic assessment, audio-visual sync checks, and comprehensive annotations. Built 5B DiT baseline using video VAE with high downsampling ratio, sliding window with motion-frame context, and integrated MLLM director for storytelling.

Result: TalkVerse offers 2.3M high-resolution clips (6.3k hours). The 5B model achieves minute-long generation with low drift, comparable lip-sync/visual quality to 14B Wan-S2V but with 10x lower inference cost. Supports zero-shot video dubbing via controlled latent noise injection.

Conclusion: TalkVerse enables fair, reproducible research in audio-driven human video generation. The open-source dataset, training recipes, and 5B checkpoints lower barriers for the research community while providing efficient baseline performance.

Abstract: We introduce TalkVerse, a large-scale, open corpus for single-person, audio-driven talking video generation designed to enable fair, reproducible comparison across methods. While current state-of-the-art systems rely on closed data or compute-heavy models, TalkVerse offers 2.3 million high-resolution (720p/1080p) audio-video synchronized clips totaling 6.3k hours. These are curated from over 60k hours of video via a transparent pipeline that includes scene-cut detection, aesthetic assessment, strict audio-visual synchronization checks, and comprehensive annotations including 2D skeletons and structured visual/audio-style captions. Leveraging TalkVerse, we present a reproducible 5B DiT baseline built on Wan2.2-5B. By utilizing a video VAE with a high downsampling ratio and a sliding window mechanism with motion-frame context, our model achieves minute-long generation with low drift. It delivers comparable lip-sync and visual quality to the 14B Wan-S2V model but with 10$\times$ lower inference cost. To enhance storytelling in long videos, we integrate an MLLM director to rewrite prompts based on audio and visual cues. Furthermore, our model supports zero-shot video dubbing via controlled latent noise injection. We open-source the dataset, training recipes, and 5B checkpoints to lower barriers for research in audio-driven human video generation. Project Page: https://zhenzhiwang.github.io/talkverse/

[67] The Renaissance of Expert Systems: Optical Recognition of Printed Chinese Jianpu Musical Scores with Lyrics

Fan Bu, Rongfeng Li, Zijin Li, Ya Li, Linfeng Fan, Pei Huang

Main category: cs.CV

TL;DR: A modular expert-system pipeline converts printed Chinese Jianpu scores with lyrics into MusicXML/MIDI without requiring massive annotated training data, achieving high precision on melody and lyrics recognition.

DetailsMotivation: Large-scale OMR research has focused mainly on Western staff notation, leaving Chinese Jianpu (numbered notation) and its rich lyric resources underexplored, creating a need for digitization solutions.

Method: Top-down expert-system design using traditional computer-vision techniques (phrase correlation, skeleton analysis) integrated with unsupervised deep-learning modules for image feature embeddings, creating a hybrid approach.

Result: Digitized over 5,000 melody-only songs (>300,000 notes) and 1,400+ songs with lyrics (>100,000 notes) from The Anthology of Chinese Folk Songs, achieving note-wise F1 = 0.951 for melody and character-wise F1 = 0.931 for aligned lyrics.

Conclusion: The hybrid expert-system approach successfully bridges the gap in Chinese Jianpu OMR research, providing an effective solution for digitizing printed Jianpu scores without requiring massive annotated datasets.

Abstract: Large-scale optical music recognition (OMR) research has focused mainly on Western staff notation, leaving Chinese Jianpu (numbered notation) and its rich lyric resources underexplored. We present a modular expert-system pipeline that converts printed Jianpu scores with lyrics into machine-readable MusicXML and MIDI, without requiring massive annotated training data. Our approach adopts a top-down expert-system design, leveraging traditional computer-vision techniques (e.g., phrase correlation, skeleton analysis) to capitalize on prior knowledge, while integrating unsupervised deep-learning modules for image feature embeddings. This hybrid strategy strikes a balance between interpretability and accuracy. Evaluated on The Anthology of Chinese Folk Songs, our system massively digitizes (i) a melody-only collection of more than 5,000 songs (> 300,000 notes) and (ii) a curated subset with lyrics comprising over 1,400 songs (> 100,000 notes). The system achieves high-precision recognition on both melody (note-wise F1 = 0.951) and aligned lyrics (character-wise F1 = 0.931).

[68] AquaDiff: Diffusion-Based Underwater Image Enhancement for Addressing Color Distortion

Afrah Shaahid, Muzammil Behzad

Main category: cs.CV

TL;DR: AquaDiff is a diffusion-based framework for underwater image enhancement that corrects color distortion while preserving structural and perceptual fidelity through chromatic prior-guided color compensation and conditional diffusion with cross-attention.

DetailsMotivation: Underwater images suffer from severe degradation due to wavelength-dependent light absorption and scattering, causing color distortion, low contrast, and loss of fine details that hinder vision-based underwater applications.

Method: AquaDiff integrates chromatic prior-guided color compensation with conditional diffusion process using cross-attention to fuse degraded inputs and noisy latent states. It features an enhanced denoising backbone with residual dense blocks and multi-resolution attention, plus a novel cross-domain consistency loss enforcing pixel-level accuracy, perceptual similarity, structural integrity, and frequency-domain fidelity.

Result: Extensive experiments on multiple challenging underwater benchmarks show that AquaDiff performs favorably against state-of-the-art traditional, CNN-, GAN-, and diffusion-based methods, achieving superior color correction and competitive overall image quality across diverse underwater conditions.

Conclusion: AquaDiff effectively addresses underwater image degradation through a diffusion-based framework that combines chromatic compensation with conditional diffusion and comprehensive loss functions, demonstrating strong performance in color correction and overall image enhancement for underwater applications.

Abstract: Underwater images are severely degraded by wavelength-dependent light absorption and scattering, resulting in color distortion, low contrast, and loss of fine details that hinder vision-based underwater applications. To address these challenges, we propose AquaDiff, a diffusion-based underwater image enhancement framework designed to correct chromatic distortions while preserving structural and perceptual fidelity. AquaDiff integrates a chromatic prior-guided color compensation strategy with a conditional diffusion process, where cross-attention dynamically fuses degraded inputs and noisy latent states at each denoising step. An enhanced denoising backbone with residual dense blocks and multi-resolution attention captures both global color context and local details. Furthermore, a novel cross-domain consistency loss jointly enforces pixel-level accuracy, perceptual similarity, structural integrity, and frequency-domain fidelity. Extensive experiments on multiple challenging underwater benchmarks demonstrate that AquaDiff compares favorably with state-of-the-art traditional, CNN-, GAN-, and diffusion-based methods, achieving superior color correction and competitive overall image quality across diverse underwater conditions.
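The cross-domain consistency loss is described as four jointly enforced terms; a hedged sketch of such a composite objective follows. The weights and the perceptual/structural callables (`perceptual_fn`, `ssim_fn`) are stand-ins, since the abstract does not specify them.

```python
import torch
import torch.nn.functional as F

def cross_domain_consistency_loss(pred, target, perceptual_fn, ssim_fn,
                                  w=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four fidelity terms named in the paper (sketch).

    perceptual_fn and ssim_fn stand in for a perceptual (e.g. VGG-feature)
    distance and a structural-similarity score; the weights w are
    hypothetical.
    """
    pixel = F.l1_loss(pred, target)                       # pixel-level accuracy
    percept = perceptual_fn(pred, target)                 # perceptual similarity
    structural = 1.0 - ssim_fn(pred, target)              # structural integrity
    freq = F.l1_loss(torch.fft.rfft2(pred).abs(),
                     torch.fft.rfft2(target).abs())       # frequency-domain fidelity
    return w[0] * pixel + w[1] * percept + w[2] * structural + w[3] * freq
```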

[69] Improving VQA Reliability: A Dual-Assessment Approach with Self-Reflection and Cross-Model Verification

Xixian Wu, Yang Ou, Pengchao Tian, Zian Yang, Jielei Zhang, Peiyi Li, Longwen Gao

Main category: cs.CV

TL;DR: DAVR is a novel framework that combines self-reflection and cross-model verification to estimate uncertainty and reduce hallucinations in vision-language models for VQA, achieving state-of-the-art performance in reliability metrics.

DetailsMotivation: Vision-language models are prone to hallucinations that produce overconfident but incorrect answers, which undermines the reliability of VQA systems. There's a need for better uncertainty estimation to enhance trustworthiness.

Method: DAVR uses a dual-pathway architecture: one pathway employs dual selector modules that fuse VLM latent features with QA embeddings for self-reflection, while the other uses external reference models for factual cross-checking to mitigate hallucinations.

Result: DAVR achieved first place in the Reliable VQA Challenge at ICCV-CLVL 2025 with a leading Φ₁₀₀ score of 39.64 and 100-AUC of 97.22, demonstrating superior reliability performance.

Conclusion: The DAVR framework effectively enhances VLM trustworthiness by integrating self-assessment and cross-model verification, providing a comprehensive solution for uncertainty estimation and hallucination mitigation in VQA systems.

Abstract: Vision-language models (VLMs) have demonstrated significant potential in Visual Question Answering (VQA). However, the susceptibility of VLMs to hallucinations can lead to overconfident yet incorrect answers, severely undermining answer reliability. To address this, we propose Dual-Assessment for VLM Reliability (DAVR), a novel framework that integrates Self-Reflection and Cross-Model Verification for comprehensive uncertainty estimation. The DAVR framework features a dual-pathway architecture: one pathway leverages dual selector modules to assess response reliability by fusing VLM latent features with QA embeddings, while the other deploys external reference models for factual cross-checking to mitigate hallucinations. Evaluated in the Reliable VQA Challenge at ICCV-CLVL 2025, DAVR achieves a leading $Φ_{100}$ score of 39.64 and a 100-AUC of 97.22, securing first place and demonstrating its effectiveness in enhancing the trustworthiness of VLM responses.

[70] HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering

Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin

Main category: cs.CV

TL;DR: HERBench is a new VideoQA benchmark designed to test multi-evidence integration across time, requiring aggregation of at least three non-overlapping visual cues from different video segments.

DetailsMotivation: Current VideoQA benchmarks often allow questions to be answered from single salient cues, under-testing reasoning that requires aggregating multiple temporally separated visual evidence.

Method: Created HERBench with 26K five-way multiple-choice questions across 12 compositional tasks, introduced Minimum Required Frame-Set (MRFS) metric to quantify evidential demand, and evaluated 13 state-of-the-art Video-LLMs.

Result: HERBench imposes higher evidential demand (mean MRFS 5.5 vs. 2.6-4.2 in prior datasets). Current Video-LLMs perform poorly (31-42% accuracy vs 20% random baseline), revealing retrieval and fusion deficits.

Conclusion: HERBench establishes a principled target for advancing robust, compositional video understanding by making cross-time evidence integration both unavoidable and quantifiable.

Abstract: Video Large Language Models (Video-LLMs) are rapidly improving, yet current Video Question Answering (VideoQA) benchmarks often allow questions to be answered from a single salient cue, under-testing reasoning that must aggregate multiple, temporally separated visual evidence. We present HERBench, a VideoQA benchmark purpose-built to assess multi-evidence integration across time. Each question requires aggregating at least three non-overlapping evidential cues across distinct video segments, so neither language priors nor a single snapshot can suffice. HERBench comprises 26K five-way multiple-choice questions organized into twelve compositional tasks that probe identity binding, cross-entity relations, temporal ordering, co-occurrence verification, and counting. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes substantially higher demand than prior datasets (mean MRFS 5.5 vs. 2.6-4.2). Evaluating 13 state-of-the-art Video-LLMs on HERBench reveals pervasive failures: accuracies of 31-42% are only slightly above the 20% random-guess baseline. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided. By making cross-time evidence both unavoidable and quantifiable, HERBench establishes a principled target for advancing robust, compositional video understanding.
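The MRFS metric has a direct, if brute-force, reading: the size of the smallest frame subset on which the model still answers correctly. A sketch under that reading follows; how subsets are actually presented to the model is an evaluation-protocol detail not given here.

```python
from itertools import combinations

def mrfs(frames, answers_correctly):
    """Minimum Required Frame-Set size (brute-force sketch).

    answers_correctly(subset) -> bool should run the model on just those
    frames and check its answer against the ground truth.
    """
    for k in range(1, len(frames) + 1):
        for subset in combinations(frames, k):
            if answers_correctly(subset):
                return k
    return None  # unanswerable even with all frames
```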

[71] Isolated Sign Language Recognition with Segmentation and Pose Estimation

Daniel Perkins, Davis Hunter, Dhrumil Patel, Galen Flanagan

Main category: cs.CV

TL;DR: A lightweight ISLR model using pose estimation, segmentation, and ResNet-Transformer backbone to address computational cost and signer variability challenges in ASL recognition.

DetailsMotivation: Current large language models don't serve ASL users effectively due to ASL's visual nature. ISLR faces challenges: scarce per-sign data, high signer variability, and high computational costs that limit accessibility.

Method: Three-component approach: (1) pose estimation pipeline for hand/face joint coordinates, (2) segmentation module to isolate relevant information, (3) ResNet-Transformer backbone to jointly model spatial and temporal dependencies.

Result: The proposed model reduces computational requirements while maintaining robustness to signer variation (specific metrics not provided in abstract).

Conclusion: This approach helps bridge the accessibility gap for ASL users by providing a more efficient and robust ISLR solution that addresses key limitations of current systems.

Abstract: The recent surge in large language models has automated translations of spoken and written languages. However, these advances remain largely inaccessible to American Sign Language (ASL) users, whose language relies on complex visual cues. Isolated sign language recognition (ISLR) - the task of classifying videos of individual signs - can help bridge this gap but is currently limited by scarce per-sign data, high signer variability, and substantial computational costs. We propose a model for ISLR that reduces computational requirements while maintaining robustness to signer variation. Our approach integrates (i) a pose estimation pipeline to extract hand and face joint coordinates, (ii) a segmentation module that isolates relevant information, and (iii) a ResNet-Transformer backbone to jointly model spatial and temporal dependencies.

[72] VAAS: Vision-Attention Anomaly Scoring for Image Manipulation Detection in Digital Forensics

Opeyemi Bamigbade, Mark Scanlon, John Sheppard

Main category: cs.CV

TL;DR: VAAS is a new framework combining Vision Transformers and SegFormer embeddings to detect AI-generated image forgeries with continuous anomaly scores and visual explainability.

DetailsMotivation: AI-generated image forgeries are becoming increasingly sophisticated and evade traditional detection methods. Existing approaches lack explicit anomaly intensity measurement and interpretability, making it difficult to quantify manipulation severity and provide transparent forensic evidence.

Method: Dual-module framework integrating global attention-based anomaly estimation using Vision Transformers (ViT) with patch-level self-consistency scoring from SegFormer embeddings. This hybrid approach produces continuous, interpretable anomaly scores showing both location and degree of manipulation.

Result: Competitive F1 and IoU performance on DF2023 and CASIA v2.0 datasets, with enhanced visual explainability through attention-guided anomaly maps. The framework successfully bridges quantitative detection with human-understandable reasoning.

Conclusion: VAAS provides a transparent and reliable solution for image integrity assessment by combining quantitative anomaly detection with interpretable visual explanations, addressing the growing challenge of AI-generated forgeries in forensic investigations.

Abstract: Recent advances in AI-driven image generation have introduced new challenges for verifying the authenticity of digital evidence in forensic investigations. Modern generative models can produce visually consistent forgeries that evade traditional detectors based on pixel or compression artefacts. Most existing approaches also lack an explicit measure of anomaly intensity, which limits their ability to quantify the severity of manipulation. This paper introduces Vision-Attention Anomaly Scoring (VAAS), a novel dual-module framework that integrates global attention-based anomaly estimation using Vision Transformers (ViT) with patch-level self-consistency scoring derived from SegFormer embeddings. The hybrid formulation provides a continuous and interpretable anomaly score that reflects both the location and degree of manipulation. Evaluations on the DF2023 and CASIA v2.0 datasets demonstrate that VAAS achieves competitive F1 and IoU performance, while enhancing visual explainability through attention-guided anomaly maps. The framework bridges quantitative detection with human-understandable reasoning, supporting transparent and reliable image integrity assessment. The source code for all experiments and corresponding materials for reproducing the results are available open source.

[73] Visual-textual Dermatoglyphic Animal Biometrics: A First Case Study on Panthera tigris

Wenshuo Li, Majid Mirmehdi, Tilo Burghardt

Main category: cs.CV

TL;DR: The paper introduces dermatoglyphic textual descriptors for animal re-identification, combining visual and textual modalities to improve AI accuracy and explainability in ecological monitoring.

DetailsMotivation: Current AI tools for animal re-identification are largely image-based and limited to species with distinctive morphological features. The paper aims to overcome vision-only limitations by incorporating forensic-inspired textual descriptors for better cross-modal identity retrieval and explainability.

Method: Developed a dermatoglyphic textual descriptor approach using human-interpretable language tags to encode animal coat topology. Created a text-image co-synthesis pipeline to generate ‘virtual individuals’ with life-like visuals paired with dermatoglyphic text. Used 84,264 manually labelled minutiae across 3,355 images of 185 tigers for evaluation.

Result: The dermatoglyphic language-guided biometrics significantly boosts AI accuracy in cross-modal retrieval while alleviating data scarcity. The approach enables textual-to-visual identity recovery with human-verifiable matchings, representing a novel capability for cross-modal identity retrieval.

Conclusion: Dermatoglyphic language-guided biometrics can overcome vision-only limitations in animal re-identification, enabling explainable Re-ID and language-driven unification of descriptive modalities in ecological monitoring.

Abstract: Biologists have long combined visuals with textual field notes to re-identify (Re-ID) animals. Contemporary AI tools automate this for species with distinctive morphological features but remain largely image-based. Here, we extend Re-ID methodologies by incorporating precise dermatoglyphic textual descriptors-an approach used in forensics but new to ecology. We demonstrate that these specialist semantics abstract and encode animal coat topology using human-interpretable language tags. Drawing on 84,264 manually labelled minutiae across 3,355 images of 185 tigers (Panthera tigris), we evaluate this visual-textual methodology, revealing novel capabilities for cross-modal identity retrieval. To optimise performance, we developed a text-image co-synthesis pipeline to generate ‘virtual individuals’, each comprising dozens of life-like visuals paired with dermatoglyphic text. Benchmarking against real-world scenarios shows this augmentation significantly boosts AI accuracy in cross-modal retrieval while alleviating data scarcity. We conclude that dermatoglyphic language-guided biometrics can overcome vision-only limitations, enabling textual-to-visual identity recovery underpinned by human-verifiable matchings. This represents a significant advance towards explainability in Re-ID and a language-driven unification of descriptive modalities in ecological monitoring.

[74] Vibe Spaces for Creatively Connecting and Expressing Visual Concepts

Huzheng Yang, Katherine Xu, Andrew Lu, Michael D. Grossberg, Yutong Bai, Jianbo Shi

Main category: cs.CV

TL;DR: Vibe Blending is a new task for generating creative image hybrids that capture shared “vibes” between concepts, using a hierarchical graph manifold called Vibe Space to enable smooth semantic transitions.

DetailsMotivation: Current methods struggle to create meaningful visual hybrids by connecting distant concepts through their shared attributes ("vibes"), as they fail to identify and traverse nonlinear paths in latent space between distinct ideas.

Method: Proposes Vibe Space - a hierarchical graph manifold that learns low-dimensional geodesics in feature spaces like CLIP, enabling smooth and semantically consistent transitions between concepts for creative blending.

Result: Vibe Space produces blends that humans consistently rate as more creative and coherent than current methods, evaluated through a cognitively inspired framework combining human judgments, LLM reasoning, and geometric path-based difficulty scoring.

Conclusion: The Vibe Space approach successfully enables creative visual concept blending by learning meaningful semantic transitions, outperforming existing methods in generating coherent and creative hybrids that capture shared attributes between images.

Abstract: Creating new visual concepts often requires connecting distinct ideas through their most relevant shared attributes – their vibe. We introduce Vibe Blending, a novel task for generating coherent and meaningful hybrids that reveals these shared attributes between images. Achieving such blends is challenging for current methods, which struggle to identify and traverse nonlinear paths linking distant concepts in latent space. We propose Vibe Space, a hierarchical graph manifold that learns low-dimensional geodesics in feature spaces like CLIP, enabling smooth and semantically consistent transitions between concepts. To evaluate creative quality, we design a cognitively inspired framework combining human judgments, LLM reasoning, and a geometric path-based difficulty score. We find that Vibe Space produces blends that humans consistently rate as more creative and coherent than current methods.

[75] 3D Software Synthesis Guided by Constraint-Expressive Intermediate Representation

Shuqing Li, Anson Y. Lam, Yun Peng, Wenxuan Wang, Michael R. Lyu

Main category: cs.CV

TL;DR: Scenethesis: A requirement-sensitive 3D software synthesis approach with formal traceability between user specs and generated 3D software, using a constraint-aware intermediate representation language.

DetailsMotivation: Current 3D software generation methods generate environments as a whole without element-level control, and struggle with complex spatial/semantic constraints. There's a gap between automated 2D software generation (HTML/CSS, mobile apps) and 3D software generation.

Method: Scenethesis uses ScenethesisLang, a domain-specific language as constraint-aware intermediate representation. It decomposes 3D software synthesis into stages operating on this IR, enabling independent verification, targeted modification, and systematic constraint satisfaction.

Result: Captures over 80% of user requirements, satisfies over 90% of hard constraints while handling 100+ constraints simultaneously. Achieves 42.8% improvement in BLIP-2 visual evaluation scores vs state-of-the-art.

Conclusion: Scenethesis provides a novel approach for 3D software synthesis with formal traceability, fine-grained control, and robust constraint handling, addressing limitations of current 3D generation methods.

Abstract: Graphical user interface (UI) software has undergone a fundamental transformation from traditional two-dimensional (2D) desktop/web/mobile interfaces to spatial three-dimensional (3D) environments. While existing work has made remarkable success in automated 2D software generation, such as HTML/CSS and mobile app interface code synthesis, the generation of 3D software still remains under-explored. Current methods for 3D software generation usually generate the 3D environments as a whole and cannot modify or control specific elements in the software. Furthermore, these methods struggle to handle the complex spatial and semantic constraints inherent in the real world. To address the challenges, we present Scenethesis, a novel requirement-sensitive 3D software synthesis approach that maintains formal traceability between user specifications and generated 3D software. Scenethesis is built upon ScenethesisLang, a domain-specific language that serves as a granular constraint-aware intermediate representation (IR) to bridge natural language requirements and executable 3D software. It serves both as a comprehensive scene description language enabling fine-grained modification of 3D software elements and as a formal constraint-expressive specification language capable of expressing complex spatial constraints. By decomposing 3D software synthesis into stages operating on ScenethesisLang, Scenethesis enables independent verification, targeted modification, and systematic constraint satisfaction. Our evaluation demonstrates that Scenethesis accurately captures over 80% of user requirements and satisfies more than 90% of hard constraints while handling over 100 constraints simultaneously. Furthermore, Scenethesis achieves a 42.8% improvement in BLIP-2 visual evaluation scores compared to the state-of-the-art method.

[76] PANDA-PLUS-Bench: A Clinical Benchmark for Evaluating Robustness of AI Foundation Models in Prostate Cancer Diagnosis

Joshua L. Ebbert, Dennis Della Corte

Main category: cs.CV

TL;DR: PANDA-PLUS-Bench is a curated benchmark dataset for evaluating AI foundation models in prostate cancer Gleason grading, specifically designed to test whether models learn generalizable biological features or slide-specific artifacts.

DetailsMotivation: Current AI foundation models for prostate cancer Gleason grading may achieve high accuracy by learning specimen-specific artifacts rather than generalizable biological features, limiting their real-world clinical utility. There's a need for specialized benchmarks to quantify this failure mode.

Method: Created PANDA-PLUS-Bench: 9 expert-annotated prostate biopsy whole slide images from unique patients with diverse Gleason patterns. Extracted non-overlapping tissue patches at 512x512 and 224x224 resolutions across 8 augmentation conditions. Evaluated 7 foundation models on their ability to separate biological signal from slide-level confounders.

Result: Substantial variation in robustness across models: Virchow2 had lowest slide-level encoding (81.0%) but second-lowest cross-slide accuracy (47.2%). HistoEncoder (tissue-specific training) showed highest cross-slide accuracy (59.7%) and strongest slide-level encoding (90.3%). All models had within-slide vs. cross-slide accuracy gaps (19.9-26.9 percentage points).

Conclusion: PANDA-PLUS-Bench addresses a critical gap in foundation model evaluation by providing a purpose-built resource for robustness assessment in Gleason grading. Tissue-specific training appears to enhance both biological feature capture and slide-specific signature recognition. The benchmark enables standardized evaluation of model robustness.

Abstract: Artificial intelligence foundation models are increasingly deployed for prostate cancer Gleason grading, where GP3/GP4 distinction directly impacts treatment decisions. However, these models may achieve high validation accuracy by learning specimen-specific artifacts rather than generalizable biological features, limiting real-world clinical utility. We introduce PANDA-PLUS-Bench, a curated benchmark dataset derived from expert-annotated prostate biopsies designed specifically to quantify this failure mode. The benchmark comprises nine carefully selected whole slide images from nine unique patients containing diverse Gleason patterns, with non-overlapping tissue patches extracted at both 512x512 and 224x224 pixel resolutions across eight augmentation conditions. Using this benchmark, we evaluate seven foundation models on their ability to separate biological signal from slide-level confounders. Our results reveal substantial variation in robustness across models: Virchow2 achieved the lowest slide-level encoding among large-scale models (81.0%) yet exhibited the second-lowest cross-slide accuracy (47.2%). HistoEncoder, trained specifically on prostate tissue, demonstrated the highest cross-slide accuracy (59.7%) and the strongest slide-level encoding (90.3%), suggesting tissue-specific training may enhance both biological feature capture and slide-specific signatures. All models exhibited measurable within-slide vs. cross-slide accuracy gaps, though the magnitude varied from 19.9 percentage points to 26.9 percentage points. We provide an open-source Google Colab notebook enabling researchers to evaluate additional foundation models against our benchmark using standardized metrics. PANDA-PLUS-Bench addresses a critical gap in foundation model evaluation by providing a purpose-built resource for robustness assessment in the clinically important context of Gleason grading.

[77] PP-Motion: Physical-Perceptual Fidelity Evaluation for Human Motion Generation

Sihan Zhao, Zixuan Wang, Tianyu Luan, Jia Jia, Wentao Zhu, Jiebo Luo, Junsong Yuan, Nan Xi

Main category: cs.CV

TL;DR: PP-Motion: A novel data-driven metric for evaluating both physical and perceptual fidelity of human motion generation, using physical alignment annotations and correlation-based training.

DetailsMotivation: Existing motion fidelity evaluation methods leave a gap between human-perceived fidelity and physical feasibility. Human perception is subjective and relies on coarse binary labeling, which undermines robust data-driven metric development.

Method: 1) Introduces physical labeling method that calculates minimum modifications needed for motion to align with physical laws, producing fine-grained continuous physical alignment annotations. 2) Proposes PP-Motion metric trained with Pearson’s correlation loss to capture physical priors. 3) Incorporates human-based perceptual fidelity loss to consider both perception and physical alignment.

Result: PP-Motion metric aligns with physical laws and aligns better with human perception of motion fidelity than previous work.

Conclusion: The proposed PP-Motion metric successfully bridges the gap between physical feasibility and human perception in motion fidelity evaluation, providing a robust data-driven solution for assessing generated human motions.

Abstract: Human motion generation has found widespread applications in AR/VR, film, sports, and medical rehabilitation, offering a cost-effective alternative to traditional motion capture systems. However, evaluating the fidelity of such generated motions is a crucial, multifaceted task. Although previous approaches have attempted motion fidelity evaluation using human perception or physical constraints, there remains an inherent gap between human-perceived fidelity and physical feasibility. Moreover, the subjective and coarse binary labeling of human perception further undermines the development of a robust data-driven metric. We address these issues by introducing a physical labeling method. This method evaluates motion fidelity by calculating the minimum modifications needed for a motion to align with physical laws. With this approach, we are able to produce fine-grained, continuous physical alignment annotations that serve as objective ground truth. With these annotations, we propose PP-Motion, a novel data-driven metric to evaluate both physical and perceptual fidelity of human motion. To effectively capture underlying physical priors, we employ Pearson’s correlation loss for the training of our metric. Additionally, by incorporating a human-based perceptual fidelity loss, our metric can capture fidelity that simultaneously considers both human perception and physical alignment. Experimental results demonstrate that our metric, PP-Motion, not only aligns with physical laws but also aligns better with human perception of motion fidelity than previous work.
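The Pearson-correlation training objective has a standard form; a minimal PyTorch sketch follows, assuming the correlation is computed over a batch of predicted vs. annotated fidelity scores.

```python
import torch

def pearson_correlation_loss(pred, target, eps=1e-8):
    """1 - Pearson's r between predicted and annotated fidelity scores (sketch)."""
    pred_c = pred - pred.mean()        # center predictions
    target_c = target - target.mean()  # center annotations
    r = (pred_c * target_c).sum() / (pred_c.norm() * target_c.norm() + eps)
    return 1.0 - r                     # minimized when scores correlate perfectly
```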

[78] Improving Pre-trained Segmentation Models using Post-Processing

Abhijeet Parida, Daniel Capellán-Martín, Zhifan Jiang, Nishad Kulkarni, Krithika Iyer, Austin Tapp, Syed Muhammad Anwar, María J. Ledesma-Carbayo, Marius George Linguraru

Main category: cs.CV

TL;DR: Adaptive post-processing techniques improve glioma segmentation quality from large pre-trained models, boosting BraTS 2025 challenge metrics by up to 14.9% while addressing computational fairness and sustainability.

DetailsMotivation: Large-scale pre-trained models for glioma segmentation generalize poorly, produce systematic errors (false positives, label swaps, slice discontinuities), and face issues of computational resource inequality and environmental costs of training.

Method: Proposed adaptive post-processing techniques to refine glioma segmentations from large pre-trained models, focusing on efficient refinement rather than complex model architectures.

Result: Improved BraTS 2025 challenge rankings: 14.9% improvement for sub-Saharan Africa challenge and 0.9% improvement for adult glioma challenge.

Conclusion: Shifts focus from complex model architectures to efficient, clinically aligned post-processing strategies that are precise, computationally fair, and sustainable.

Abstract: Gliomas are the most common malignant brain tumors in adults and are among the most lethal. Despite aggressive treatment, the median survival rate is less than 15 months. Accurate multiparametric MRI (mpMRI) tumor segmentation is critical for surgical planning, radiotherapy, and disease monitoring. While deep learning models have improved the accuracy of automated segmentation, large-scale pre-trained models generalize poorly and often underperform, producing systematic errors such as false positives, label swaps, and slice discontinuities. These limitations are further compounded by unequal access to GPU resources and the growing environmental cost of large-scale model training. In this work, we propose adaptive post-processing techniques to refine the quality of glioma segmentations produced by large-scale pre-trained models developed for various types of tumors. We demonstrate the techniques in multiple BraTS 2025 segmentation challenge tasks, with the ranking metric improving by 14.9% for the sub-Saharan Africa challenge and 0.9% for the adult glioma challenge. This approach promotes a shift in brain tumor segmentation research from increasingly complex model architectures to efficient, clinically aligned post-processing strategies that are precise, computationally fair, and sustainable.

[79] AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping

Wen Xie, Yanjun Zhu, Gijs Overgoor, Yakov Bart, Agata Lapedriza Garcia, Sarah Ostadabbas

Main category: cs.CV

TL;DR: Automated video ad clipping framework using audio-visual fusion for shot selection, outperforming existing methods with novel AdSum204 dataset.

DetailsMotivation: Manual creation of multiple ad versions from longer videos is labor-intensive; existing video summarization methods ignore audio's critical role in advertising.

Method: Two-stream audio-visual fusion model predicts frame importance (likelihood of selection in firm-produced ads), framing clipping as shot selection problem.

Result: Outperforms state-of-the-art methods across multiple metrics (Average Precision, AUC, Spearman, Kendall) using novel AdSum204 dataset of 102 ad pairs.

Conclusion: First to frame video ad clipping as shot selection with audio-visual fusion, providing effective automated solution with publicly available dataset and code.

Abstract: Advertisers commonly need multiple versions of the same advertisement (ad) at varying durations for a single campaign. The traditional approach involves manually selecting and re-editing shots from longer video ads to create shorter versions, which is labor-intensive and time-consuming. In this paper, we introduce a framework for automated video ad clipping using video summarization techniques. We are the first to frame video clipping as a shot selection problem, tailored specifically for advertising. Unlike existing general video summarization methods that primarily focus on visual content, our approach emphasizes the critical role of audio in advertising. To achieve this, we develop a two-stream audio-visual fusion model that predicts the importance of video frames, where importance is defined as the likelihood of a frame being selected in the firm-produced short ad. To address the lack of ad-specific datasets, we present AdSum204, a novel dataset comprising 102 pairs of 30-second and 15-second ads from real advertising campaigns. Extensive experiments demonstrate that our model outperforms state-of-the-art methods across various metrics, including Average Precision, Area Under Curve, Spearman, and Kendall. The dataset and code are available at https://github.com/ostadabbas/AdSum204.
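As a sketch of the two-stream formulation, the module below maps per-frame visual and audio features to an importance probability; the feature dimensions and fusion head are assumptions, since the summary specifies only audio-visual fusion and per-frame importance prediction.

```python
import torch
import torch.nn as nn

class TwoStreamImportance(nn.Module):
    """Late-fusion sketch: per-frame importance from visual + audio features."""

    def __init__(self, d_vis=512, d_aud=128, d_hid=256):
        super().__init__()
        self.vis = nn.Linear(d_vis, d_hid)   # visual stream projection
        self.aud = nn.Linear(d_aud, d_hid)   # audio stream projection
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(2 * d_hid, 1))

    def forward(self, vis_feats, aud_feats):  # (T, d_vis), (T, d_aud)
        fused = torch.cat([self.vis(vis_feats), self.aud(aud_feats)], dim=-1)
        return torch.sigmoid(self.head(fused)).squeeze(-1)  # (T,) importance scores
```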

[80] The LUMirage: An independent evaluation of zero-shot performance in the LUMIR challenge

Rohit Jena, Pratik Chaudhari, James C. Gee

Main category: cs.CV

TL;DR: Deep learning methods for MRI registration show good in-distribution performance but fail to generalize zero-shot to different contrasts/resolutions, contradicting claims of universal superiority.

DetailsMotivation: To independently re-evaluate claims from the LUMIR challenge about deep learning methods achieving exceptional zero-shot generalization to unseen MRI contrasts and resolutions, which contradicts established domain shift literature.

Method: Rigorous evaluation protocols addressing instrumentation bias, testing on various MRI contrasts (T1, T2, T2*, FLAIR), different resolutions (including 0.6 mm isotropic), and across species (human, macaque).

Result: (1) Deep learning performs comparably to iterative methods on in-distribution T1w images and macaque data; (2) Significant performance degradation on out-of-distribution contrasts (Cohen’s d: 0.7-1.5); (3) Scalability limitations on high-resolution data; (4) High sensitivity to preprocessing choices.

Conclusion: Claims of universal zero-shot superiority require careful scrutiny; evaluation protocols should reflect practical clinical/research workflows rather than favoring specific method classes, aligning with established domain shift literature.

Abstract: The LUMIR challenge represents an important benchmark for evaluating deformable image registration methods on large-scale neuroimaging data. While the challenge demonstrates that modern deep learning methods achieve competitive accuracy on T1-weighted MRI, it also claims exceptional zero-shot generalization to unseen contrasts and resolutions, assertions that contradict established understanding of domain shift in deep learning. In this paper, we perform an independent re-evaluation of these zero-shot claims using rigorous evaluation protocols while addressing potential sources of instrumentation bias. Our findings reveal a more nuanced picture: (1) deep learning methods perform comparably to iterative optimization on in-distribution T1w images and even on human-adjacent species (macaque), demonstrating improved task understanding; (2) however, performance degrades significantly on out-of-distribution contrasts (T2, T2*, FLAIR), with Cohen’s d scores ranging from 0.7-1.5, indicating substantial practical impact on downstream clinical workflows; (3) deep learning methods face scalability limitations on high-resolution data, failing to run on 0.6 mm isotropic images, while iterative methods benefit from increased resolution; and (4) deep methods exhibit high sensitivity to preprocessing choices. These results align with the well-established literature on domain shift and suggest that claims of universal zero-shot superiority require careful scrutiny. We advocate for evaluation protocols that reflect practical clinical and research workflows rather than conditions that may inadvertently favor particular method classes.
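For reference when interpreting the effect sizes above, Cohen's d is the standardized difference between two group means, using the pooled standard deviation:

```latex
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}
```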

[81] Puzzle Curriculum GRPO for Vision-Centric Reasoning

Ahmadreza Jeddi, Hakki Can Karaimer, Hue Nguyen, Zhongling Wang, Ke Zhao, Javad Rajabi, Ran Zhang, Raghav Goyal, Babak Taati, Radek Grzeszczuk

Main category: cs.CV

TL;DR: PC-GRPO is a supervision-free RL method for VLMs that uses self-supervised puzzle environments instead of annotations, with difficulty-aware curriculum and reasoning-answer consistency monitoring to improve visual reasoning.

DetailsMotivation: Current RL approaches for VLMs have three main issues: reliance on costly/noisy annotations or external verifiers, flat/sparse reward schemes in GRPO, and logical inconsistency between reasoning chains and final answers.

Method: PC-GRPO uses three self-supervised puzzle environments (PatchFit, Rotation, Jigsaw) with binary and graded rewards, difficulty-aware curriculum that dynamically weights samples, and monitors Reasoning-Answer Consistency (RAC) during post-training.

Result: PC-GRPO improves reasoning quality, training stability, and end-task accuracy across diverse benchmarks on Qwen-7B and Qwen-3B backbones, with RAC correlating with downstream accuracy.

Conclusion: PC-GRPO offers a practical path to scalable, verifiable, and interpretable RL post-training for VLMs without requiring annotations or external verifiers.

Abstract: Recent reinforcement learning (RL) approaches like outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain’s reasoning and its final answer. We present Puzzle Curriculum GRPO (PC-GRPO), a supervision-free recipe for RL with Verifiable Rewards (RLVR) that strengthens visual reasoning in VLMs without annotations or external verifiers. PC-GRPO replaces labels with three self-supervised puzzle environments: PatchFit, Rotation (with binary rewards) and Jigsaw (with graded partial credit mitigating reward sparsity). To counter flat rewards and vanishing group-relative advantages, we introduce a difficulty-aware curriculum that dynamically weights samples and peaks at medium difficulty. We further monitor Reasoning-Answer Consistency (RAC) during post-training: mirroring reports for vanilla GRPO in LLMs, RAC typically rises early then degrades; our curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates with downstream accuracy. Across diverse benchmarks and on Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy, offering a practical path to scalable, verifiable, and interpretable RL post-training for VLMs.
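The difficulty-aware curriculum is characterized only qualitatively (weights peak at medium difficulty). One plausible sketch uses the group's success rate p as the difficulty proxy and the weight 4p(1-p), which is maximal at p = 0.5 and vanishes where all rollouts succeed or all fail, exactly where group-relative advantages vanish. The functional form is an assumption.

```python
def curriculum_weight(group_rewards):
    """Difficulty-aware sample weight, peaked at medium difficulty (sketch).

    group_rewards: per-rollout binary rewards for one sample's group.
    """
    p = sum(group_rewards) / len(group_rewards)  # fraction of successful rollouts
    return 4.0 * p * (1.0 - p)                   # 0 at p in {0, 1}, max at p = 0.5
```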

[82] Where is the Watermark? Interpretable Watermark Detection at the Block Level

Maria Bulychev, Neil G. Marchant, Benjamin I. P. Rubinstein

Main category: cs.CV

TL;DR: A post-hoc image watermarking method that combines localized embedding with region-level interpretability, using discrete wavelet transform and statistical block-wise strategy to generate detection maps showing which image regions are watermarked or altered.

DetailsMotivation: Current image watermarking schemes operate as black boxes with global detection scores, lacking transparency about where watermarks are present. This impacts user trust and makes it difficult to interpret tampering effects, especially with the rise of generative AI creating realistic digital content.

Method: Post-hoc image watermarking method that embeds watermark signals in the discrete wavelet transform domain using a statistical block-wise strategy. This localized embedding approach allows generation of detection maps that reveal which specific regions of an image are watermarked or altered.

Result: The method achieves strong robustness against common image transformations while remaining sensitive to semantic manipulations. Watermarks are highly imperceptible and robust to cropping up to half the image. Compared to prior post-hoc methods, it offers more interpretable detection while retaining competitive robustness.

Conclusion: The proposed approach addresses the transparency limitations of existing watermarking schemes by providing region-level interpretability through detection maps, improving user trust and enabling better interpretation of tampering effects while maintaining strong robustness and imperceptibility.

Abstract: Recent advances in generative AI have enabled the creation of highly realistic digital content, raising concerns around authenticity, ownership, and misuse. While watermarking has become an increasingly important mechanism to trace and protect digital media, most existing image watermarking schemes operate as black boxes, producing global detection scores without offering any insight into how or where the watermark is present. This lack of transparency impacts user trust and makes it difficult to interpret the impact of tampering. In this paper, we present a post-hoc image watermarking method that combines localised embedding with region-level interpretability. Our approach embeds watermark signals in the discrete wavelet transform domain using a statistical block-wise strategy. This allows us to generate detection maps that reveal which regions of an image are likely watermarked or altered. We show that our method achieves strong robustness against common image transformations while remaining sensitive to semantic manipulations. At the same time, the watermark remains highly imperceptible. Compared to prior post-hoc methods, our approach offers more interpretable detection while retaining competitive robustness. For example, our watermarks are robust to cropping up to half the image.
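
A hedged sketch of what block-wise embedding in the DWT domain can look like: shift the mean of each low-frequency (LL) block by a signed strength to encode one bit per block, then read a per-block statistic back out as a detection map. The Haar wavelet, the mean-shift statistic, and the non-blind detection map (which uses the original image for clarity) are assumptions for illustration, not the paper's actual scheme.

```python
import numpy as np
import pywt

def embed_blockwise(image: np.ndarray, bits, block: int = 8,
                    strength: float = 2.0) -> np.ndarray:
    """Encode one bit per LL-subband block via a small signed mean shift."""
    LL, rest = pywt.dwt2(image.astype(np.float64), "haar")
    h, w = LL.shape
    idx = 0
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            if idx >= len(bits):
                return pywt.idwt2((LL, rest), "haar")
            LL[y:y + block, x:x + block] += strength if bits[idx] else -strength
            idx += 1
    return pywt.idwt2((LL, rest), "haar")

def detection_map(original: np.ndarray, marked: np.ndarray, block: int = 8):
    """Per-block mean shift in the LL subband -> interpretable region scores."""
    LL0, _ = pywt.dwt2(original.astype(np.float64), "haar")
    LL1, _ = pywt.dwt2(marked.astype(np.float64), "haar")
    d = LL1 - LL0
    h, w = d.shape
    return np.array([[d[y:y + block, x:x + block].mean()
                      for x in range(0, w - block + 1, block)]
                     for y in range(0, h - block + 1, block)])
```

In this toy setup, a tampered or cropped region shows up as blocks whose statistic vanishes or flips sign, which is the region-level interpretability the paper is after.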

[83] Beyond Proximity: A Keypoint-Trajectory Framework for Classifying Affiliative and Agonistic Social Networks in Dairy Cattle

Sibi Parivendan, Kashfia Sailunaz, Suresh Neethirajan

Main category: cs.CV

TL;DR: A pose-based computer vision framework that distinguishes affiliative from agonistic livestock behaviors using anatomical keypoint trajectories instead of simple proximity thresholds, achieving 77.51% accuracy in social interaction classification.

DetailsMotivation: Current precision livestock farming approaches use static proximity thresholds that cannot differentiate between affiliative (friendly) and agonistic (aggressive) behaviors in complex barn environments, limiting the interpretability of automated social network analysis for welfare monitoring.

Method: End-to-end computer vision pipeline integrating YOLOv11 for object detection, supervised individual identification, ByteTrack for multi-object tracking, ZebraPose for 27-point anatomical keypoint estimation, and an SVM classifier trained on pose-derived distance dynamics from keypoint trajectories.

Result: The framework achieved 96.24% mAP@0.50 for detection, 98.24% accuracy for individual identification, 81.96% tracking accuracy, and 77.51% accuracy in distinguishing affiliative vs. agonistic behaviors using pose information alone, showing substantial improvement over proximity-only baselines.

Conclusion: The pose-based approach successfully moves beyond proximity heuristics by modeling spatiotemporal geometry of anatomical keypoints, establishing a proof-of-concept for automated vision-based inference of social interactions suitable for constructing interaction-aware social networks with near-real-time performance.

Abstract: Precision livestock farming requires objective assessment of social behavior to support herd welfare monitoring, yet most existing approaches infer interactions using static proximity thresholds that cannot distinguish affiliative from agonistic behaviors in complex barn environments. This limitation constrains the interpretability of automated social network analysis in commercial settings. We present a pose-based computational framework for interaction classification that moves beyond proximity heuristics by modeling the spatiotemporal geometry of anatomical keypoints. Rather than relying on pixel-level appearance or simple distance measures, the proposed method encodes interaction-specific motion signatures from keypoint trajectories, enabling differentiation of social interaction valence. The framework is implemented as an end-to-end computer vision pipeline integrating YOLOv11 for object detection (mAP@0.50: 96.24%), supervised individual identification (98.24% accuracy), ByteTrack for multi-object tracking (81.96% accuracy), ZebraPose for 27-point anatomical keypoint estimation, and a support vector machine classifier trained on pose-derived distance dynamics. On annotated interaction clips collected from a commercial dairy barn, the classifier achieved 77.51% accuracy in distinguishing affiliative and agonistic behaviors using pose information alone. Comparative evaluation against a proximity-only baseline shows substantial gains in behavioral discrimination, particularly for affiliative interactions. The results establish a proof-of-concept for automated, vision-based inference of social interactions suitable for constructing interaction-aware social networks, with near-real-time performance on commodity hardware.
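
The classifier input can be illustrated with a small sketch: per-frame distances between corresponding anatomical keypoints of an interacting pair, summarized per clip and fed to an SVM. The 27-keypoint count follows the paper; the specific summary statistics, clip length, and random training data here are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

def distance_dynamics(kp_a: np.ndarray, kp_b: np.ndarray) -> np.ndarray:
    """kp_a, kp_b: (T, 27, 2) keypoint trajectories for an interacting pair."""
    d = np.linalg.norm(kp_a - kp_b, axis=-1)   # (T, 27) inter-animal distances
    v = np.diff(d, axis=0)                     # approach/retreat dynamics
    return np.concatenate([d.mean(0), d.std(0), v.mean(0), np.abs(v).mean(0)])

# Hypothetical clips: 60 frames each, label 0 = affiliative, 1 = agonistic.
rng = np.random.default_rng(0)
X = np.stack([distance_dynamics(rng.random((60, 27, 2)),
                                rng.random((60, 27, 2))) for _ in range(40)])
y = rng.integers(0, 2, size=40)
clf = SVC(kernel="rbf").fit(X, y)
```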

[84] Evaluating the Capability of Video Question Generation for Expert Knowledge Elicitation

Huaying Zhang, Atsushi Hashimoto, Tosho Hirasawa

Main category: cs.CV

TL;DR: The paper proposes a new evaluation protocol for video question generation (VQG) models that assesses question quality for eliciting unseen knowledge from human experts, using a question-to-answer retrieval approach with the novel EgoExoAsk dataset.

DetailsMotivation: Current VQG evaluation focuses on question-answering ability rather than question quality for extracting expert knowledge. The paper aims to address the fundamental question of what makes questions effective at eliciting valuable information from experts.

Method: Proposes a protocol that simulates question-answering communication with experts using question-to-answer retrieval. Constructs EgoExoAsk dataset (27,666 QA pairs from Ego-Exo4D’s expert commentary) to train a retriever, and builds a benchmark on validation set with Ego-Exo4D video segments.

Result: Experimental results show the proposed metric reasonably aligns with question generation settings: models with richer context receive better evaluations, supporting that the protocol works as intended for assessing question quality.

Conclusion: The paper introduces a novel evaluation framework for VQG that focuses on question quality for expert knowledge elicitation, providing the EgoExoAsk dataset and demonstrating that the proposed protocol effectively assesses question generation models’ ability to extract unseen information from experts.

Abstract: Skilled human interviewers can extract valuable information from experts. This raises a fundamental question: what makes some questions more effective than others? To address this, a quantitative evaluation of question-generation models is essential. Video question generation (VQG) is a subtask of video question answering (VideoQA) in which questions are generated for given answers. VQG evaluation typically focuses on the ability to answer the generated questions rather than on their quality. In contrast, we focus on question quality in eliciting unseen knowledge from human experts. To support continuous improvement of VQG models, we propose a protocol that evaluates this ability by simulating question-answering communication with experts via question-to-answer retrieval. We obtain the retriever by constructing a novel dataset, EgoExoAsk, which comprises 27,666 QA pairs generated from Ego-Exo4D’s expert commentary annotations; the training set is used to train the retriever, and the benchmark is constructed on the validation set with Ego-Exo4D video segments. Experimental results demonstrate that our metric aligns reasonably with question generation settings: models with access to richer context receive better evaluations, supporting that our protocol works as intended. The EgoExoAsk dataset is available at https://github.com/omron-sinicx/VQG4ExpertKnowledge.
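
The protocol's core scoring step can be sketched as question-to-answer retrieval: a generated question earns credit when it retrieves its target expert answer from a pool. Cosine similarity over precomputed embeddings and the top-k criterion below are assumptions; the paper trains its own retriever on EgoExoAsk.

```python
import numpy as np

def retrieval_hit(q_emb: np.ndarray, answer_embs: np.ndarray,
                  target_idx: int, k: int = 5) -> float:
    """Return 1.0 if the target answer is among the top-k retrieved."""
    sims = answer_embs @ q_emb / (
        np.linalg.norm(answer_embs, axis=1) * np.linalg.norm(q_emb) + 1e-8)
    return float(target_idx in np.argsort(-sims)[:k])

# Averaging retrieval_hit over a benchmark yields the question-quality score:
# questions that pin down the intended expert knowledge retrieve it reliably.
```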

[85] Model Agnostic Preference Optimization for Medical Image Segmentation

Yunseong Nam, Jiwon Jang, Dongkyu Won, Sang Hyun Park, Soopil Kim

Main category: cs.CV

TL;DR: MAPO is a model-agnostic preference optimization framework for medical image segmentation that uses dropout-driven stochastic hypotheses to create preference-consistent gradients without ground-truth supervision.

DetailsMotivation: Current preference optimization methods in medical image segmentation are model-specific and rely on low-diversity prediction sampling, limiting their applicability and effectiveness.

Method: MAPO uses dropout-driven stochastic segmentation hypotheses to construct preference-consistent gradients without direct ground-truth supervision. It’s architecture- and dimensionality-agnostic, supporting 2D/3D CNN and Transformer-based segmentation pipelines.

Result: MAPO consistently enhances boundary adherence, reduces overfitting, and yields more stable optimization dynamics compared to conventional supervised training across diverse medical datasets.

Conclusion: MAPO provides a scalable, model-agnostic preference optimization framework that improves medical image segmentation performance without requiring ground-truth supervision.

Abstract: Preference optimization offers a scalable supervision paradigm based on relative preference signals, yet prior attempts in medical image segmentation remain model-specific and rely on low-diversity prediction sampling. In this paper, we propose MAPO (Model-Agnostic Preference Optimization), a training framework that utilizes Dropout-driven stochastic segmentation hypotheses to construct preference-consistent gradients without direct ground-truth supervision. MAPO is fully architecture- and dimensionality-agnostic, supporting 2D/3D CNN and Transformer-based segmentation pipelines. Comprehensive evaluations across diverse medical datasets reveal that MAPO consistently enhances boundary adherence, reduces overfitting, and yields more stable optimization dynamics compared to conventional supervised training.
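
A minimal sketch, assuming a PyTorch binary-segmentation model, of the dropout-driven hypothesis sampling: dropout stays active at inference to draw diverse predictions, which are then ranked into a preferred/dispreferred pair. Ranking by agreement with the sample mean is an illustrative stand-in; MAPO's actual preference construction may differ.

```python
import torch

@torch.no_grad()
def sample_hypotheses(model: torch.nn.Module, x: torch.Tensor,
                      n: int = 8) -> torch.Tensor:
    model.train()  # keeps dropout stochastic; BN layers need care in practice
    return torch.stack([model(x).sigmoid() for _ in range(n)])  # (n,B,1,H,W)

def preference_pair(probs: torch.Tensor):
    """Rank hypotheses by agreement with the sample mean; best vs. worst."""
    mean = probs.mean(dim=0, keepdim=True)
    agreement = -(probs - mean).abs().mean(dim=(1, 2, 3, 4))  # (n,)
    order = agreement.argsort(descending=True)
    return probs[order[0]], probs[order[-1]]  # (preferred, dispreferred)
```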

[86] MVGSR: Multi-View Consistent 3D Gaussian Super-Resolution via Epipolar Guidance

Kaizhe Zhang, Shinan Chen, Qian Zhao, Weizhan Zhang, Caixia Yan, Yudeng Xin

Main category: cs.CV

TL;DR: MVGSR is a novel 3D Gaussian Splatting super-resolution framework that integrates multi-view information for high-resolution rendering with improved consistency and detail fidelity.

DetailsMotivation: Existing 3DGS SR methods have limitations: single-image SR networks lack cross-view consistency, while video-based approaches require sequential frames and can't handle unstructured multi-view datasets. There's a need for a method that can effectively fuse complementary information from multiple views without temporal constraints.

Method: Two key innovations: 1) Auxiliary View Selection Method based on camera poses for handling arbitrarily organized multi-view datasets, and 2) Epipolar-constrained multi-view attention mechanism that selectively aggregates consistent information from auxiliary views to enhance geometric consistency and detail fidelity.

Result: The method achieves state-of-the-art performance on both object-centric and scene-level 3DGS SR benchmarks, demonstrating superior high-resolution rendering with enhanced consistency and detail.

Conclusion: MVGSR successfully addresses the limitations of previous 3DGS SR methods by enabling effective multi-view information integration without requiring temporal continuity, making it suitable for unstructured multi-view datasets while improving rendering quality and consistency.

Abstract: Scenes reconstructed by 3D Gaussian Splatting (3DGS) trained on low-resolution (LR) images are unsuitable for high-resolution (HR) rendering. Consequently, a 3DGS super-resolution (SR) method is needed to bridge LR inputs and HR rendering. Early 3DGS SR methods rely on single-image SR networks, which lack cross-view consistency and fail to fuse complementary information across views. More recent video-based SR approaches attempt to address this limitation but require strictly sequential frames, limiting their applicability to unstructured multi-view datasets. In this work, we introduce Multi-View Consistent 3D Gaussian Splatting Super-Resolution (MVGSR), a framework that focuses on integrating multi-view information for 3DGS rendering with high-frequency details and enhanced consistency. We first propose an Auxiliary View Selection Method based on camera poses, making our method adaptable for arbitrarily organized multi-view datasets without the need of temporal continuity or data reordering. Furthermore, we introduce, for the first time, an epipolar-constrained multi-view attention mechanism into 3DGS SR, which serves as the core of our proposed multi-view SR network. This design enables the model to selectively aggregate consistent information from auxiliary views, enhancing the geometric consistency and detail fidelity of 3DGS representations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both object-centric and scene-level 3DGS SR benchmarks.
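
Pose-based auxiliary view selection can be sketched by scoring candidates on camera-center proximity and viewing-direction agreement with the target view. The scoring rule, the forward-axis convention (third rotation column), and the `alpha` mixing weight are assumptions, not the paper's exact criterion.

```python
import numpy as np

def select_auxiliary_views(ref_pose: np.ndarray, poses: np.ndarray,
                           k: int = 3, alpha: float = 0.5) -> np.ndarray:
    """ref_pose: (4, 4); poses: (N, 4, 4) camera-to-world matrices."""
    centers, dirs = poses[:, :3, 3], poses[:, :3, 2]
    ref_c, ref_d = ref_pose[:3, 3], ref_pose[:3, 2]
    dist = np.linalg.norm(centers - ref_c, axis=1)
    angle = 1.0 - dirs @ ref_d / (
        np.linalg.norm(dirs, axis=1) * np.linalg.norm(ref_d) + 1e-8)
    score = alpha * dist / (dist.max() + 1e-8) + (1.0 - alpha) * angle
    return np.argsort(score)[:k]  # nearest, most co-oriented views first
```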

[87] Asynchronous Event Stream Noise Filtering for High-frequency Structure Deformation Measurement

Yifei Bian, Banglei Guan, Zibin Liu, Ang Su, Shiyao Zhu, Yang Shang, Qifeng Yu

Main category: cs.CV

TL;DR: A method using event cameras and LED markers to measure high-frequency deformations in large-scale structures, overcoming limitations of traditional high-speed cameras.

DetailsMotivation: Traditional high-speed camera methods for measuring high-frequency deformations in large structures are limited by harsh lighting conditions and high equipment costs.

Method: Uses event camera and LED markers: filters observation noise based on event stream characteristics, differentiates between motion-induced events and LED blinking events, extracts LED markers from event stream, and measures deformations with monocular event camera.

Result: Experimental results confirm the accuracy of the method in measuring high-frequency planar deformations.

Conclusion: The proposed event camera and LED marker approach provides an effective solution for measuring high-frequency deformations in challenging conditions where traditional methods fail.

Abstract: Large-scale structures suffer high-frequency deformations due to complex loads. However, harsh lighting conditions and high equipment costs limit measurement methods based on traditional high-speed cameras. This paper proposes a method to measure high-frequency deformations by exploiting an event camera and LED markers. Firstly, observation noise is filtered based on the characteristics of the event stream generated by LED markers blinking and spatiotemporal correlation. Then, LED markers are extracted from the event stream after differentiating between motion-induced events and events from LED blinking, which enables the extraction of high-speed moving LED markers. Ultimately, high-frequency planar deformations are measured by a monocular event camera. Experimental results confirm the accuracy of our method in measuring high-frequency planar deformations.

[88] Tracking spatial temporal details in ultrasound long video via wavelet analysis and memory bank

Chenxiao Zhang, Runshi Zhang, Junchen Wang

Main category: cs.CV

TL;DR: MWNet: Memory bank-based wavelet filtering and fusion network for ultrasound video segmentation, addressing low contrast, noisy backgrounds, and object tracking in long videos.

DetailsMotivation: Medical ultrasound videos have low contrast and noisy backgrounds causing missegmentation of organ boundaries, leading to small object losses and boundary errors. Object tracking in long videos remains challenging for computer-assisted surgery workflows.

Method: Encoder-decoder network with memory-based wavelet convolution to capture category details and adjacent information. Uses cascaded wavelet compression for multiscale frequency-domain fusion, LSTM bank with cross-attention for object tracking, and HF-aware feature fusion module with adaptive wavelet filters.

Result: Demonstrated marked improvements in segmentation metrics on four ultrasound video datasets (thyroid nodule, thyroid gland, heart). Particularly effective for accurately segmenting small thyroid nodules in long videos.

Conclusion: MWNet effectively addresses ultrasound video segmentation challenges by integrating wavelet filtering, memory mechanisms, and frequency-domain features, showing strong performance especially for small objects in long medical videos.

Abstract: Medical ultrasound videos are widely used for medical inspections, disease diagnosis and surgical planning. High-fidelity lesion area and target organ segmentation constitutes a key component of the computer-assisted surgery workflow. The low contrast levels and noisy backgrounds of ultrasound videos cause missegmentation of organ boundaries, which may lead to missed small objects and increased boundary segmentation errors. Object tracking in long videos also remains a significant research challenge. To overcome these challenges, we propose a memory bank-based wavelet filtering and fusion network, which adopts an encoder-decoder structure to effectively extract fine-grained detailed spatial features and integrate high-frequency (HF) information. Specifically, memory-based wavelet convolution is presented to simultaneously capture category and detail information and utilize adjacent information in the encoder. Cascaded wavelet compression is used to fuse multiscale frequency-domain features and expand the receptive field within each convolutional layer. A long short-term memory bank using cross-attention and memory compression mechanisms is designed to track objects in long videos. To fully utilize the boundary-sensitive HF details of feature maps, an HF-aware feature fusion module is designed via adaptive wavelet filters in the decoder. In extensive benchmark tests conducted on four ultrasound video datasets (two thyroid nodule datasets, a thyroid gland dataset, and a heart dataset) compared with the state-of-the-art methods, our method demonstrates marked improvements in segmentation metrics. In particular, our method can more accurately segment small thyroid nodules, demonstrating its effectiveness for cases involving small ultrasound objects in long videos. The code is available at https://github.com/XiAooZ/MWNet.

[89] PMMD: A pose-guided multi-view multi-modal diffusion for person generation

Ziyu Shang, Haoran Liu, Rongchao Zhang, Zhiqian Wei, Tongtong Feng

Main category: cs.CV

TL;DR: PMMD is a diffusion framework that generates consistent human images using multi-view references, pose maps, and text prompts, addressing issues like occlusions and garment style drift.

DetailsMotivation: Current methods for generating human images with controllable pose and appearance suffer from problems like occlusions, garment style drift, and pose misalignment, which are crucial for applications in virtual try-on, image editing, and digital human creation.

Method: Proposes Pose-guided Multi-view Multimodal Diffusion (PMMD) with: 1) multimodal encoder for joint modeling of visual views, pose features, and semantic descriptions; 2) ResCVA module to enhance local details while preserving global structure; 3) cross-modal fusion module integrating image semantics with text throughout denoising.

Result: Experiments on DeepFashion MultiModal dataset show PMMD outperforms representative baselines in consistency, detail preservation, and controllability.

Conclusion: PMMD effectively addresses key challenges in human image generation by leveraging multi-view multimodal conditioning, achieving superior performance in generating photorealistic person images with precise pose and appearance control.

Abstract: Generating consistent human images with controllable pose and appearance is essential for applications in virtual try-on, image editing, and digital human creation. Current methods often suffer from occlusions, garment style drift, and pose misalignment. We propose Pose-guided Multi-view Multimodal Diffusion (PMMD), a diffusion framework that synthesizes photorealistic person images conditioned on multi-view references, pose maps, and text prompts. A multimodal encoder jointly models visual views, pose features, and semantic descriptions, which reduces cross-modal discrepancy and improves identity fidelity. We further design a ResCVA module to enhance local detail while preserving global structure, and a cross-modal fusion module that integrates image semantics with text throughout the denoising pipeline. Experiments on the DeepFashion MultiModal dataset show that PMMD outperforms representative baselines in consistency, detail preservation, and controllability. Project page and code are available at https://github.com/ZANMANGLOOPYE/PMMD.

[90] Uni-Parser Technical Report

Xi Fang, Haoyi Tao, Shuwen Yang, Suyang Zhong, Haocheng Lu, Han Lyu, Chaozheng Huang, Xinyu Li, Linfeng Zhang, Guolin Ke

Main category: cs.CV

TL;DR: Uni-Parser is an industrial-grade document parsing engine for scientific literature and patents that uses a modular multi-expert architecture to preserve cross-modal alignments while achieving high throughput (20 PDF pages/sec) and cost efficiency.

DetailsMotivation: Traditional pipeline-based document parsing methods lack fine-grained cross-modal alignments and are not easily extensible to emerging modalities. There's a need for a scalable, cost-efficient parsing system that can handle scientific literature and patents at industrial scale while maintaining accuracy across multiple modalities.

Method: Uni-Parser employs a modular, loosely coupled multi-expert architecture that preserves fine-grained cross-modal alignments across text, equations, tables, figures, and chemical structures. The system incorporates adaptive GPU load balancing, distributed inference, dynamic module orchestration, and configurable modes supporting holistic or modality-specific parsing.

Result: The system achieves a processing rate of up to 20 PDF pages per second on 8 x NVIDIA RTX 4090D GPUs, enabling cost-efficient inference across billions of pages. It maintains robust accuracy while being easily extensible to emerging modalities.

Conclusion: Uni-Parser provides a scalable, high-throughput document parsing solution optimized for large-scale cloud deployment, facilitating downstream applications including literature retrieval, chemical structure extraction, and corpus curation for training next-generation AI models in scientific domains.

Abstract: This technical report introduces Uni-Parser, an industrial-grade document parsing engine tailored for scientific literature and patents, delivering high throughput, robust accuracy, and cost efficiency. Unlike pipeline-based document parsing methods, Uni-Parser employs a modular, loosely coupled multi-expert architecture that preserves fine-grained cross-modal alignments across text, equations, tables, figures, and chemical structures, while remaining easily extensible to emerging modalities. The system incorporates adaptive GPU load balancing, distributed inference, dynamic module orchestration, and configurable modes that support either holistic or modality-specific parsing. Optimized for large-scale cloud deployment, Uni-Parser achieves a processing rate of up to 20 PDF pages per second on 8 x NVIDIA RTX 4090D GPUs, enabling cost-efficient inference across billions of pages. This level of scalability facilitates a broad spectrum of downstream applications, ranging from literature retrieval and summarization to the extraction of chemical structures, reaction schemes, and bioactivity data, as well as the curation of large-scale corpora for training next-generation large language models and AI4Science models.

[91] Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets

Jialong Zuo, Haoyou Deng, Hanyu Zhou, Jiaxin Zhu, Yicheng Zhang, Yiwei Zhang, Yongxin Yan, Kaixing Huang, Weisen Chen, Yongtai Deng, Rui Jin, Nong Sang, Changxin Gao

Main category: cs.CV

TL;DR: Nano Banana Pro shows strong subjective visual quality for low-level vision tasks but underperforms on traditional quantitative metrics due to generative model stochasticity.

DetailsMotivation: To explore whether commercial text-to-image models like Nano Banana Pro can serve as generalist solvers for traditional low-level vision tasks, which remains largely underexplored despite their popularity.

Method: Comprehensive zero-shot evaluation across 14 low-level vision tasks spanning 40 diverse datasets using simple textual prompts without fine-tuning, benchmarking against state-of-the-art specialist models.

Result: Performance dichotomy: Nano Banana Pro demonstrates superior subjective visual quality (often hallucinating plausible high-frequency details) but lags behind in traditional reference-based quantitative metrics.

Conclusion: Nano Banana Pro is a capable zero-shot contender for low-level vision tasks, but achieving the high fidelity of domain specialists remains challenging due to generative model stochasticity and pixel-level consistency issues.

Abstract: The rapid evolution of text-to-image generation models has revolutionized visual content creation. While commercial products like Nano Banana Pro have garnered significant attention, their potential as generalist solvers for traditional low-level vision challenges remains largely underexplored. In this study, we investigate the critical question: Is Nano Banana Pro a Low-Level Vision All-Rounder? We conducted a comprehensive zero-shot evaluation across 14 distinct low-level tasks spanning 40 diverse datasets. By utilizing simple textual prompts without fine-tuning, we benchmarked Nano Banana Pro against state-of-the-art specialist models. Our extensive analysis reveals a distinct performance dichotomy: while Nano Banana Pro demonstrates superior subjective visual quality, often hallucinating plausible high-frequency details that surpass specialist models, it lags behind in traditional reference-based quantitative metrics. We attribute this discrepancy to the inherent stochasticity of generative models, which struggle to maintain the strict pixel-level consistency required by conventional metrics. This report identifies Nano Banana Pro as a capable zero-shot contender for low-level vision tasks, while highlighting that achieving the high fidelity of domain specialists remains a significant hurdle.
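
For reference, PSNR is the kind of reference-based metric on which the report finds generative all-rounders lagging: it penalizes any pixel-level deviation from the ground truth, including plausibly hallucinated high-frequency detail. A minimal NumPy version for 8-bit images:

```python
import numpy as np

def psnr(pred: np.ndarray, ref: np.ndarray) -> float:
    """Peak signal-to-noise ratio in dB for 8-bit images."""
    mse = np.mean((pred.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```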

[92] 3DProxyImg: Controllable 3D-Aware Animation Synthesis from Single Image via 2D-3D Aligned Proxy Embedding

Yupeng Zhu, Xiongzhen Zhang, Ye Chen, Bingbing Ni

Main category: cs.CV

TL;DR: A lightweight 3D animation framework that decouples geometric control from appearance synthesis using 2D-3D aligned proxy representation, enabling efficient animation with 3D control on low-power platforms.

DetailsMotivation: Traditional 3D animation pipelines are labor-intensive and computationally expensive. Recent AIGC approaches either inherit heavy costs or sacrifice 3D controllability. There's a fundamental trade-off between rendering quality and 3D control in single-image 3D animation generation.

Method: Proposes a lightweight framework using 2D-3D aligned proxy representation: coarse 3D estimate as structural carrier + learned image-space generative priors for high-fidelity appearance. Decouples geometric control from appearance synthesis, enabling 3D-aware motion control without accurate geometry or expensive optimization.

Result: Achieves efficient animation generation on low-power platforms. Outperforms video-based 3D animation generation in identity preservation, geometric/textural consistency, and precise interactive control. Naturally extends to coherent background animation.

Conclusion: The proposed framework successfully addresses the quality-control trade-off in 3D animation by separating geometry and appearance, enabling practical, controllable animation generation with computational efficiency.

Abstract: 3D animation is central to modern visual media, yet traditional production pipelines remain labor-intensive, expertise-demanding, and computationally expensive. Recent AIGC-based approaches partially automate asset creation and rigging, but they either inherit the heavy costs of full 3D pipelines or rely on video-synthesis paradigms that sacrifice 3D controllability and interactivity. We focus on single-image 3D animation generation and argue that progress is fundamentally constrained by a trade-off between rendering quality and 3D control. To address this limitation, we propose a lightweight 3D animation framework that decouples geometric control from appearance synthesis. The core idea is a 2D-3D aligned proxy representation that uses a coarse 3D estimate as a structural carrier, while delegating high-fidelity appearance and view synthesis to learned image-space generative priors. This proxy formulation enables 3D-aware motion control and interaction comparable to classical pipelines, without requiring accurate geometry or expensive optimization, and naturally extends to coherent background animation. Extensive experiments demonstrate that our method achieves efficient animation generation on low-power platforms and outperforms video-based 3D animation generation in identity preservation, geometric and textural consistency, and the level of precise, interactive control it offers to users.

[93] Borrowing from anything: A generalizable framework for reference-guided instance editing

Shengxiao Zhou, Chenghua Li, Jianhao Huang, Qinghao Hu, Yifan Zhang

Main category: cs.CV

TL;DR: GENIE is a framework for reference-guided instance editing that disentangles intrinsic appearance from extrinsic attributes to achieve better editing fidelity.

DetailsMotivation: Current reference-guided instance editing methods suffer from semantic entanglement where a reference's intrinsic appearance (what should be borrowed) is intertwined with its extrinsic attributes (context-specific features), making it difficult to properly transfer only relevant appearance information to targets.

Method: GENIE uses three main components: 1) Spatial Alignment Module (SAM) to correct spatial misalignments, 2) Adaptive Residual Scaling Module (ARSM) to learn what to borrow by amplifying intrinsic cues while suppressing extrinsic attributes, and 3) Progressive Attention Fusion (PAF) to learn how to render appearance onto targets while preserving structure.

Result: Extensive experiments on the challenging AnyInsertion dataset show that GENIE achieves state-of-the-art fidelity and robustness, setting a new standard for disentanglement-based instance editing.

Conclusion: GENIE successfully addresses the semantic entanglement problem in reference-guided instance editing through explicit disentanglement, enabling more accurate and robust appearance transfer while preserving target structure.

Abstract: Reference-guided instance editing is fundamentally limited by semantic entanglement, where a reference’s intrinsic appearance is intertwined with its extrinsic attributes. The key challenge lies in disentangling what information should be borrowed from the reference, and determining how to apply it appropriately to the target. To tackle this challenge, we propose GENIE, a Generalizable Instance Editing framework capable of achieving explicit disentanglement. GENIE first corrects spatial misalignments with a Spatial Alignment Module (SAM). Then, an Adaptive Residual Scaling Module (ARSM) learns what to borrow by amplifying salient intrinsic cues while suppressing extrinsic attributes, while a Progressive Attention Fusion (PAF) mechanism learns how to render this appearance onto the target, preserving its structure. Extensive experiments on the challenging AnyInsertion dataset demonstrate that GENIE achieves state-of-the-art fidelity and robustness, setting a new standard for disentanglement-based instance editing.

[94] Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning

Mengshi Qi, Yeteng Wu, Xianlin Zhang, Huadong Ma

Main category: cs.CV

TL;DR: A new Human Action Form Assessment (AFA) task with CoT-AFA dataset and Explainable Fitness Assessor framework for evaluating action standardization with chain-of-thought explanations.

DetailsMotivation: Current video understanding methods only identify what and where actions occur, lacking assessment of action standardization quality. Existing datasets lack labels for action standardization degree, and action quality assessment datasets lack explainability and detailed feedback.

Method: Introduced CoT-AFA dataset with fitness/martial arts videos and multi-level annotations using Chain-of-Thought explanation paradigm. Proposed Explainable Fitness Assessor framework with parallel visual/semantic processing streams and dynamic gating mechanism for fusion.

Result: Achieved improvements: +16.0% in CIDEr for explanation generation, +2.7% in action classification accuracy, and +2.1% in quality assessment accuracy.

Conclusion: The CoT-AFA dataset and framework demonstrate significant potential for human action form assessment with explainable feedback, addressing limitations of current video understanding methods.

Abstract: Evaluating whether a human action is performed to standard and providing reasonable feedback to improve action standardization is crucial but challenging in real-world scenarios. However, current video understanding methods are mainly concerned with what and where an action occurs, which cannot meet these requirements. Meanwhile, most existing datasets lack labels indicating the degree of action standardization, and action quality assessment datasets lack explainability and detailed feedback. Therefore, we define a new Human Action Form Assessment (AFA) task and introduce a new diverse dataset, CoT-AFA, which contains a large collection of fitness and martial arts videos with multi-level annotations for comprehensive video analysis. We enrich the CoT-AFA dataset with a novel Chain-of-Thought explanation paradigm: instead of offering isolated feedback, our explanations provide a complete reasoning process, from identifying an action step to analyzing its outcome and proposing a concrete solution. Furthermore, we propose a framework named Explainable Fitness Assessor, which can not only judge an action but also explain why and provide a solution. This framework employs two parallel processing streams and a dynamic gating mechanism to fuse visual and semantic information, thereby boosting its analytical capabilities. The experimental results demonstrate that our method achieves improvements in explanation generation (e.g., +16.0% in CIDEr), action classification (+2.7% in accuracy), and quality assessment (+2.1% in accuracy), revealing the great potential of CoT-AFA for future studies. Our dataset and source code are available at https://github.com/MICLAB-BUPT/EFA.

[95] EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence

Jiaxu Wan, Xu Wang, Mengwei Xie, Hang Zhang, Mu Xu, Yang Han, Hong Zhang, Ding Yuan, Yifan Yang

Main category: cs.CV

TL;DR: EagleVision is a dual-stage framework for spatial reasoning that uses macro perception to select keyframes and micro verification for iterative pose prediction and verification, achieving SOTA on spatial understanding benchmarks.

DetailsMotivation: Existing spatial intelligence approaches have weak spatial consistency, limited viewpoint diversity, and untraceable evidence chains. Current "thinking with images" frameworks don't address key challenges in spatial Chain-of-Thought: global space perception under token budgets, explicit association of 3D hypotheses with video frames, and spatially grounded rewards for RL.

Method: Dual-stage framework: 1) Macro perception stage uses SPF-DPP (semantics-perspective-fusion determinantal point process) to select compact geometry- and semantics-aware keyframes from long videos under fixed token budget. 2) Micro verification stage formalizes spatial CoT as BEV-grounded pose querying: agent iteratively predicts poses on BEV plane, retrieves nearest real frames, and is trained purely by RL with spatial grounding reward scoring consistency between predicted poses and observed views.

Result: Achieves state-of-the-art performance among open-source vision-language models on VSI-Bench, demonstrating strong and generalizable spatial understanding.

Conclusion: EagleVision successfully addresses key challenges in spatial Chain-of-Thought reasoning through its dual-stage approach of macro perception and micro verification, enabling progressive spatial cognition with strong consistency and traceable evidence chains.

Abstract: Recent spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or couple MLLMs with black-box reconstruction modules, leading to weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. Frameworks for “thinking with images” (e.g., ChatGPT-o3 and DeepEyes) show that stepwise multimodal reasoning can emerge by interleaving hypothesis formation with active acquisition of visual evidence, but they do not address three key challenges in spatial Chain-of-Thought (CoT): building global space perception under strict token budgets, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning. To address these issues, we present EagleVision, a dual-stage framework for progressive spatial cognition through macro perception and micro verification. In the macro perception stage, EagleVision employs a semantics-perspective-fusion determinantal point process (SPF-DPP) to select a compact set of geometry- and semantics-aware keyframes from long videos under a fixed token budget. In the micro verification stage, we formalize spatial CoT as BEV-grounded pose querying: the agent iteratively predicts poses on a BEV plane, retrieves the nearest real frames, and is trained purely by reinforcement learning with a spatial grounding reward that scores the consistency between predicted poses and observed views. On VSI-Bench, EagleVision achieves state-of-the-art performance among open-source vision-language models, demonstrating strong and generalizable spatial understanding.
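
Keyframe selection in the spirit of SPF-DPP can be sketched as greedy MAP inference for a determinantal point process: given a PSD kernel that fuses semantic and perspective similarity (the fusion itself is abstracted away here), keep adding the frame with the largest log-determinant until the token budget is spent. This is a textbook DPP routine, not the authors' code.

```python
import numpy as np

def greedy_dpp(kernel: np.ndarray, budget: int) -> list:
    """Greedily pick indices maximizing log det of the kernel submatrix."""
    selected: list = []
    for _ in range(budget):
        best, best_logdet = None, -np.inf
        for i in range(kernel.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:  # guard non-PSD numerics
                best, best_logdet = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected  # diverse, informative keyframes under the budget
```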

[96] Cross-modal ultra-scale learning with tri-modalities of renal biopsy images for glomerular multi-disease auxiliary diagnosis

Kaixing Long, Danyi Weng, Yun Mi, Zhentai Zhang, Yanmeng Lu, Jian Geng, Zhitao Zhou, Liming Zhong, Qianjin Feng, Wei Yang, Lei Cao

Main category: cs.CV

TL;DR: CMUS-Net: A cross-modal ultra-scale learning network for multi-modal classification of glomerular diseases using TEM, OM, and IM images, addressing nanometer-to-micrometer scale differences.

DetailsMotivation: Existing multi-modal models struggle with feature fusion between TEM images at nanoscale and OM/IM images at microscale, limiting classification accuracy for glomerular disease diagnosis.

Method: Proposes CMUS-Net with: 1) sparse multi-instance learning for TEM feature aggregation, 2) cross-modal scale attention for feature interaction, and 3) multiple loss functions for modality weighting.

Result: Achieves 95.37% ACC, 99.05% AUC, and 95.32% F1-score on in-house dataset, outperforming existing multi-modal/multi-scale methods and showing generalization in MN staging.

Conclusion: CMUS-Net effectively bridges nanometer-micrometer scale differences for multi-modal glomerular disease classification, following clinical pathology workflow and demonstrating superior performance.

Abstract: Constructing a multi-modal automatic classification model based on three types of renal biopsy images can assist pathologists in glomerular multi-disease identification. However, the substantial scale difference between transmission electron microscopy (TEM) image features at the nanoscale and optical microscopy (OM) or immunofluorescence microscopy (IM) images at the microscale poses a challenge for existing multi-modal and multi-scale models in achieving effective feature fusion and improving classification accuracy. To address this issue, we propose a cross-modal ultra-scale learning network (CMUS-Net) for the auxiliary diagnosis of multiple glomerular diseases. CMUS-Net utilizes multiple ultrastructural information to bridge the scale difference between nanometer and micrometer images. Specifically, we introduce a sparse multi-instance learning module to aggregate features from TEM images. Furthermore, we design a cross-modal scale attention module to facilitate feature interaction, enhancing pathological semantic information. Finally, multiple loss functions are combined, allowing the model to weigh the importance among different modalities and achieve precise classification of glomerular diseases. Our method follows the conventional process of renal biopsy pathology diagnosis and, for the first time, performs automatic classification of multiple glomerular diseases including IgA nephropathy (IgAN), membranous nephropathy (MN), and lupus nephritis (LN) based on images from three modalities and two scales. On an in-house dataset, CMUS-Net achieves an ACC of 95.37+/-2.41%, an AUC of 99.05+/-0.53%, and an F1-score of 95.32+/-2.41%. Extensive experiments demonstrate that CMUS-Net outperforms other well-known multi-modal or multi-scale methods and show its generalization capability in staging MN. Code is available at https://github.com/SMU-GL-Group/MultiModal_lkx/tree/main.

[97] Criticality Metrics for Relevance Classification in Safety Evaluation of Object Detection in Automated Driving

Jörg Gamerdinger, Sven Teufel, Stephan Amann, Oliver Bringmann

Main category: cs.CV

TL;DR: This paper analyzes criticality metrics for safety evaluation of object detection in automated driving, proposing novel strategies that improve criticality classification accuracy by up to 100%.

DetailsMotivation: Current perception evaluation metrics lack safety-specific measures, and distinguishing relevant from non-relevant objects is crucial for reliable safety assessment of object detection systems in automated vehicles.

Method: Comprehensive literature review to identify applicable criticality metrics, empirical validation using DeepAccident dataset with safety-critical scenarios, and proposal of two novel strategies: bidirectional criticality rating and multi-metric aggregation.

Result: The proposed approach demonstrates up to 100% improvement in criticality classification accuracy, significantly advancing safety evaluation capabilities for object detection systems.

Conclusion: Criticality metrics are essential for safety evaluation of object detection in automated driving, and the proposed bidirectional rating and multi-metric aggregation strategies substantially improve evaluation accuracy and reliability.

Abstract: Ensuring safety is the primary objective of automated driving, which necessitates a comprehensive and accurate perception of the environment. While numerous performance evaluation metrics exist for assessing perception capabilities, incorporating safety-specific metrics is essential to reliably evaluate object detection systems. A key component for safety evaluation is the ability to distinguish between relevant and non-relevant objects - a challenge addressed by criticality or relevance metrics. This paper presents the first in-depth analysis of criticality metrics for safety evaluation of object detection systems. Through a comprehensive review of existing literature, we identify and assess a range of applicable metrics. Their effectiveness is empirically validated using the DeepAccident dataset, which features a variety of safety-critical scenarios. To enhance evaluation accuracy, we propose two novel application strategies: bidirectional criticality rating and multi-metric aggregation. Our approach demonstrates up to a 100% improvement in terms of criticality classification accuracy, highlighting its potential to significantly advance the safety evaluation of object detection systems in automated vehicles.
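
As a concrete example of the metric class the paper surveys, the sketch below aggregates two standard criticality signals, time-to-collision and Euclidean distance, into a relevance decision. The metric set, thresholds, and the any-metric-fires aggregation rule are illustrative assumptions; a bidirectional rating would additionally score criticality from the detected object's own perspective.

```python
import numpy as np

def time_to_collision(rel_pos: np.ndarray, rel_vel: np.ndarray) -> float:
    """rel_pos/rel_vel of an object in the ego frame (metres, m/s)."""
    closing = -np.dot(rel_pos, rel_vel) / (np.linalg.norm(rel_pos) + 1e-8)
    return np.inf if closing <= 0 else np.linalg.norm(rel_pos) / closing

def is_relevant(rel_pos, rel_vel, ttc_max: float = 4.0,
                dist_max: float = 15.0) -> bool:
    # Multi-metric aggregation: flag the object if ANY metric marks it critical.
    rel_pos, rel_vel = np.asarray(rel_pos, float), np.asarray(rel_vel, float)
    return (time_to_collision(rel_pos, rel_vel) < ttc_max
            or float(np.linalg.norm(rel_pos)) < dist_max)

print(is_relevant([20.0, 0.0], [-6.0, 0.0]))  # approaching fast -> True
```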

[98] Robust and Calibrated Detection of Authentic Multimedia Content

Sarim Hashmi, Abdelrahman Elsayed, Mohammed Talha Alam, Samuele Poppi, Nils Lukas

Main category: cs.CV

TL;DR: Proposes a resynthesis framework for deepfake detection that verifies authenticity with controllable false positive rates and achieves adversarial robustness against compute-restricted adversaries.

DetailsMotivation: Current deepfake detection methods are unreliable due to: (i) inability to distinguish inauthentic content post-hoc (leading to unbounded false positive rates), and (ii) lack of robustness as adversaries can easily adapt to known detectors with minimal computational resources.

Method: A resynthesis framework that determines if a sample is authentic or if its authenticity can be plausibly denied. Uses calibrated resynthesis method focusing on high-precision, low-recall setting against compute-restricted adversaries. Supports multiple modalities and leverages state-of-the-art inversion techniques.

Result: The calibrated resynthesis method is the most reliable approach for verifying authentic samples while maintaining controllable, low false positive rates. Achieves adversarial robustness against efficient adversaries, whereas prior methods are easily evaded under identical compute budgets.

Conclusion: The proposed resynthesis framework addresses fundamental limitations of current deepfake detection by providing reliable authenticity verification with controlled false positive rates and robust defense against computationally efficient adversaries.

Abstract: Generative models can synthesize highly realistic content, so-called deepfakes, that are already being misused at scale to undermine digital media authenticity. Current deepfake detection methods are unreliable for two reasons: (i) distinguishing inauthentic content post-hoc is often impossible (e.g., with memorized samples), leading to an unbounded false positive rate (FPR); and (ii) detection lacks robustness, as adversaries can adapt to known detectors with near-perfect accuracy using minimal computational resources. To address these limitations, we propose a resynthesis framework to determine if a sample is authentic or if its authenticity can be plausibly denied. We make two key contributions focusing on the high-precision, low-recall setting against efficient (i.e., compute-restricted) adversaries. First, we demonstrate that our calibrated resynthesis method is the most reliable approach for verifying authentic samples while maintaining controllable, low FPRs. Second, we show that our method achieves adversarial robustness against efficient adversaries, whereas prior methods are easily evaded under identical compute budgets. Our approach supports multiple modalities and leverages state-of-the-art inversion techniques.
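
The controllable-FPR idea can be sketched independently of the resynthesis machinery: calibrate the decision threshold on held-out authentic samples so that at most a target fraction of them would be flagged. The quantile rule below is a standard calibration step, offered as an assumption about how such a bound can be enforced; the resynthesis score itself is abstracted away.

```python
import numpy as np

def calibrate_threshold(authentic_scores: np.ndarray,
                        target_fpr: float = 0.01) -> float:
    """Scores above the threshold are flagged, so use the (1 - FPR) quantile."""
    return float(np.quantile(authentic_scores, 1.0 - target_fpr))

# At test time, a sample's authenticity is 'plausibly deniable' only if its
# score exceeds this calibrated threshold, bounding false accusations.
```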

[99] ERIENet: An Efficient RAW Image Enhancement Network under Low-Light Environment

Jianan Wang, Yang Hong, Hesong Li, Tao Wang, Songrong Liu, Ying Fu

Main category: cs.CV

TL;DR: ERIENet: Efficient RAW Image Enhancement Network that parallelly processes multi-scale information and leverages green channel superiority for real-time low-light RAW image enhancement.

DetailsMotivation: Existing RAW-based low-light enhancement methods have two main limitations: 1) They sequentially process multi-scale information, making models heavy and slow, and 2) They ignore the superior information in green channels of RAW images, missing opportunities for better reconstruction.

Method: Proposes ERIENet with two key innovations: 1) Efficient multi-scale fully-parallel architecture with channel-aware residual dense blocks for feature extraction, and 2) Green channel guidance branch that exploits rich information in RAW’s green channels to guide image reconstruction.

Result: Outperforms state-of-the-art methods on low-light RAW image enhancement datasets with higher efficiency. Achieves real-time processing at over 146 FPS for 4K-resolution images on a single NVIDIA RTX 3090 GPU.

Conclusion: ERIENet demonstrates that parallel multi-scale processing combined with green channel guidance enables efficient, real-time low-light RAW image enhancement with superior reconstruction quality compared to existing methods.

Abstract: RAW images have shown superior performance to sRGB images in many image processing tasks, especially low-light image enhancement. However, most existing methods for RAW-based low-light enhancement process multi-scale information sequentially, which makes it difficult to achieve lightweight models and high processing speeds. Besides, they usually ignore the green channel superiority of RAW images and fail to achieve better reconstruction performance by making good use of green channel information. In this work, we propose an efficient RAW Image Enhancement Network (ERIENet), which processes multi-scale information in parallel with efficient convolution modules and takes advantage of the rich information in green channels to guide the reconstruction of images. Firstly, we introduce an efficient multi-scale fully-parallel architecture with a novel channel-aware residual dense block to extract feature maps, which reduces computational costs and achieves real-time processing speed. Secondly, we introduce a green channel guidance branch to exploit the rich information within the green channels of the input RAW image. It increases the quality of reconstruction results with few parameters and computations. Experiments on commonly used low-light image enhancement datasets show that ERIENet outperforms state-of-the-art methods in enhancing low-light RAW images with higher efficiency. It also achieves a speed of over 146 frames per second (FPS) on 4K-resolution images using a single NVIDIA GeForce RTX 3090 with 24 GB of memory.
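
The green-channel advantage comes from the Bayer mosaic, where green is sampled at twice the density of red or blue. A small sketch of splitting an RGGB mosaic into colour planes and forming a dense green guidance signal; the RGGB layout is an assumption, since sensor patterns vary.

```python
import numpy as np

def split_rggb(raw: np.ndarray):
    """raw: (H, W) mosaic with even H, W. Returns (H/2, W/2) colour planes."""
    r  = raw[0::2, 0::2]
    g1 = raw[0::2, 1::2]
    g2 = raw[1::2, 0::2]
    b  = raw[1::2, 1::2]
    return r, g1, g2, b

raw = np.random.rand(8, 8).astype(np.float32)
r, g1, g2, b = split_rggb(raw)
green_guide = 0.5 * (g1 + g2)  # averaged green prior to guide reconstruction
```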

[100] TBC: A Target-Background Contrast Metric for Low-Altitude Infrared and Visible Image Fusion

Yufeng Xie

Main category: cs.CV

TL;DR: The paper proposes a new metric called Target-Background Contrast (TBC) for evaluating infrared and visible image fusion in UAV reconnaissance, addressing the “Noise Trap” problem where traditional metrics incorrectly reward noisy images.

DetailsMotivation: Traditional no-reference metrics like Entropy (EN) and Average Gradient (AG) fail in complex low-light UAV reconnaissance environments because they misinterpret high-frequency sensor noise as valid detail, creating a "Noise Trap" that paradoxically assigns higher scores to noisy images and misguides fusion algorithms.

Method: The authors propose the Target-Background Contrast (TBC) metric, inspired by Weber’s Law, which focuses on the relative contrast of salient targets rather than global statistics. Unlike traditional metrics, TBC specifically penalizes background noise and rewards target visibility.

Result: Experiments on the DroneVehicle dataset demonstrate that TBC aligns better with human perception and provides a reliable evaluation standard for low-altitude UAV reconnaissance scenarios.

Conclusion: The proposed TBC metric effectively addresses the limitations of traditional no-reference metrics in low-light UAV environments by focusing on target-background contrast rather than global statistics, providing a more reliable evaluation standard for infrared and visible image fusion in reconnaissance applications.

Abstract: Infrared and visible image fusion is a pivotal technology in low-altitude UAV reconnaissance missions, providing high-quality data support for downstream tasks such as target detection and tracking by integrating thermal saliency with background texture details. However, traditional no-reference metrics, such as Entropy (EN) and Average Gradient (AG), fail in complex low-light environments: they often misinterpret high-frequency sensor noise as valid detail. This creates a “Noise Trap,” paradoxically assigning higher scores to noisy images and misguiding fusion algorithms. To address this, we propose the Target-Background Contrast (TBC) metric. Inspired by Weber’s Law, TBC focuses on the relative contrast of salient targets rather than global statistics. Unlike traditional metrics, TBC penalizes background noise and rewards target visibility. Experiments on the DroneVehicle dataset demonstrate that TBC aligns better with human perception and provides a reliable evaluation standard for low-altitude scenarios.
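
A hedged reconstruction of the metric's spirit from the abstract: Weber-style relative contrast of the target region against its background, discounted when the background is noisy, so noise lowers rather than raises the score. The global statistics and the noise penalty below are guesses for illustration, not the published formula.

```python
import numpy as np

def tbc(fused: np.ndarray, target_mask: np.ndarray) -> float:
    """fused: grayscale fused image; target_mask: boolean saliency mask."""
    t = fused[target_mask].astype(np.float64)
    bg = fused[~target_mask].astype(np.float64)
    weber = abs(t.mean() - bg.mean()) / (bg.mean() + 1e-8)  # Weber contrast
    noise_penalty = 1.0 + bg.std() / (bg.mean() + 1e-8)     # noisy bg hurts
    return weber / noise_penalty
```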

[101] From Camera to World: A Plug-and-Play Module for Human Mesh Transformation

Changhai Ma, Ziyu Wu, Yunkang Zhang, Qijun Ying, Boyan Liu, Xiaohui Cai

Main category: cs.CV

TL;DR: Mesh-Plug: A plug-and-play module that transforms human meshes from camera coordinates to world coordinates by estimating camera rotation from human body cues rather than environmental features.

DetailsMotivation: Existing 3D human mesh reconstruction methods work in camera coordinates by assuming zero camera rotation, but this causes significant errors when transforming to world coordinates. The challenge is the lack of camera rotation information from in-the-wild images.

Method: Uses a human-centered approach with RGB images and depth maps from initial mesh to estimate camera rotation. Has two modules: 1) Camera rotation prediction module that focuses on human body spatial configuration to estimate camera pitch angle, 2) Mesh adjustment module that integrates predicted camera parameters with initial mesh to refine root joint orientation and body pose simultaneously.

Result: Outperforms state-of-the-art methods on benchmark datasets SPEC-SYN and SPEC-MTP, demonstrating accurate transformation from camera to world coordinates.

Conclusion: Mesh-Plug provides an effective solution for accurate 3D human mesh reconstruction in world coordinates by eliminating dependency on environmental cues and focusing on human body information for camera rotation estimation.

Abstract: Reconstructing accurate 3D human meshes in the world coordinate system from in-the-wild images remains challenging due to the lack of camera rotation information. While existing methods achieve promising results in the camera coordinate system by assuming zero camera rotation, this simplification leads to significant errors when transforming the reconstructed mesh to the world coordinate system. To address this challenge, we propose Mesh-Plug, a plug-and-play module that accurately transforms human meshes from camera coordinates to world coordinates. Our key innovation lies in a human-centered approach that leverages both RGB images and depth maps rendered from the initial mesh to estimate camera rotation parameters, eliminating the dependency on environmental cues. Specifically, we first train a camera rotation prediction module that focuses on the human body’s spatial configuration to estimate camera pitch angle. Then, by integrating the predicted camera parameters with the initial mesh, we design a mesh adjustment module that simultaneously refines the root joint orientation and body pose. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods on the benchmark datasets SPEC-SYN and SPEC-MTP.

[102] SLCFormer: Spectral-Local Context Transformer with Physics-Grounded Flare Synthesis for Nighttime Flare Removal

Xiyu Zhu, Wei Wang, Xin Yuan, Xiao Wang

Main category: cs.CV

TL;DR: SLCFormer is a spectral-local context transformer framework for nighttime lens flare removal that combines frequency domain modeling with spatial enhancement, achieving state-of-the-art performance on complex real-world flare artifacts.

DetailsMotivation: Existing methods fail to effectively address nonuniform scattered flares in complex real-world nighttime scenarios with diverse lighting conditions, limiting their practical applicability.

Method: SLCFormer integrates two key modules: FFEM (Frequency Fourier and Excitation Module) for global contextual representations in frequency domain, and DESM (Directionally-Enhanced Spatial Module) for local structural enhancement in spatial domain. Also introduces ZernikeVAE-based scatter flare generation pipeline for physically realistic training data.

Result: Achieves state-of-the-art performance on Flare7K++ dataset, outperforming existing approaches in both quantitative metrics and perceptual visual quality, with robust generalization to real nighttime scenes.

Conclusion: SLCFormer effectively addresses complex nighttime lens flare removal by combining frequency and spatial domain processing with physically realistic training data, demonstrating superior performance and real-world applicability.

Abstract: Lens flare is a common nighttime artifact caused by strong light sources scattering within camera lenses, leading to hazy streaks, halos, and glare that degrade visual quality. However, existing methods usually fail to effectively address nonuniform scattered flares, which severely reduces their applicability to complex real-world scenarios with diverse lighting conditions. To address this issue, we propose SLCFormer, a novel spectral-local context transformer framework for effective nighttime lens flare removal. SLCFormer integrates two key modules: the Frequency Fourier and Excitation Module (FFEM), which captures efficient global contextual representations in the frequency domain to model flare characteristics, and the Directionally-Enhanced Spatial Module (DESM) for local structural enhancement and directional features in the spatial domain for precise flare removal. Furthermore, we introduce a ZernikeVAE-based scatter flare generation pipeline to synthesize physically realistic scatter flares with spatially varying PSFs, bridging optical physics and data-driven training. Extensive experiments on the Flare7K++ dataset demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches in both quantitative metrics and perceptual visual quality, and generalizing robustly to real nighttime scenes with complex flare artifacts.

[103] Null-LoRA: Low-Rank Adaptation on Null Space

Yi Zhang, Yulei Kang, Haoxuan Chen, Jinxuan Li, Jian-Fang Hu

Main category: cs.CV

TL;DR: Null-LoRA: A parameter-efficient fine-tuning method that performs low-rank adaptation within the null space of pre-trained models, achieving SOTA performance with fewer parameters.

DetailsMotivation: Existing LoRA methods perform adaptation over the full parameter space, but fine-tuning within a subspace can achieve comparable effectiveness. Pre-trained models have non-trivial null spaces that can be leveraged for more efficient adaptation.

Method: Proposes Null-space based Low-Rank Adaptation (Null-LoRA) that reduces redundancy by freezing portions of low-rank matrices and constrains the entire incremental update within the null space of pre-trained models.
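
One plausible way to realize "frozen low-rank factors constrained to the null space" is to fix the down-projection to the least-significant right singular vectors of the frozen weight and train only the up-projection. The sketch below encodes that reading; the paper's exact construction may differ.

```python
# Assumed construction: A is a frozen (approximate) null-space basis of W,
# so the update B @ A acts only on input directions W barely responds to
# (W @ A.T is approximately zero). Not the authors' code.
import torch
import torch.nn as nn

class NullSpaceLoRALinear(nn.Module):
    def __init__(self, weight: torch.Tensor, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        out_dim, in_dim = weight.shape
        self.weight = nn.Parameter(weight, requires_grad=False)  # frozen W
        # Right singular vectors with the smallest singular values span
        # the (approximate) null space of W.
        _, _, Vh = torch.linalg.svd(weight, full_matrices=True)
        self.A = nn.Parameter(Vh[-rank:], requires_grad=False)   # frozen (r, in)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))        # trainable
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T * self.scale
        return base + update

layer = NullSpaceLoRALinear(torch.randn(256, 512), rank=8)
out = layer(torch.randn(4, 512))
```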

Result: Null-LoRA surpasses state-of-the-art methods with fewer parameters in extensive experiments across image-text retrieval and visual question answering tasks.

Conclusion: Null-LoRA effectively reduces redundancy and enhances parameter efficiency by leveraging the null space of pre-trained models, achieving better performance with fewer parameters than existing LoRA variants.

Abstract: Parameter-efficient fine-tuning methods have gained considerable popularity for adapting large-scale models to downstream tasks, particularly LoRA and its variants. Existing methods perform low-rank adaptation over the full parameter space. However, fine-tuning within a subspace can achieve comparable effectiveness. Inspired by the observation that pre-trained models possess non-trivial null spaces, we propose Null-space based Low-Rank Adaptation (Null-LoRA). Null-LoRA effectively reduces redundancy and enhances effective rank by freezing portions of the low-rank matrices. To further improve parameter efficiency, Null-LoRA constrains the entire incremental update within the null space, maximizing the utilization of incremental updates to adapt to new task paradigms. Null-LoRA surpasses the state of the art with fewer parameters in extensive experiments across image-text retrieval and visual question answering tasks.

[104] Intersectional Fairness in Vision-Language Models for Medical Image Disease Classification

Yupeng Zhang, Adam G. Dunn, Usman Naseem, Jinman Kim

Main category: cs.CV

TL;DR: CMAC-MMD framework reduces intersectional bias in medical AI by standardizing diagnostic certainty across patient subgroups without needing demographic data during inference, improving both fairness and accuracy.

DetailsMotivation: Medical AI systems exhibit intersectional biases where models are less confident in diagnosing marginalized patient subgroups, leading to inaccurate/missed diagnoses. Current fairness interventions often fail to address these gaps or compromise overall performance.

Method: Cross-Modal Alignment Consistency (CMAC-MMD) training framework that standardizes diagnostic certainty across intersectional patient subgroups without requiring sensitive demographic data during clinical inference.
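
The method name suggests an MMD-style alignment term between subgroup distributions during training. The sketch below shows a generic RBF-kernel MMD penalty over intersectional subgroups in a batch; the quantity being aligned, the kernel bandwidth, and all names are assumptions.

```python
# Generic subgroup-MMD penalty, used as a sketch of the training-time
# alignment idea; added to the task loss with some weight.
import torch

def rbf_mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """MMD^2 estimate between samples x (n, d) and y (m, d)."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def subgroup_mmd_penalty(features: torch.Tensor, groups: torch.Tensor) -> torch.Tensor:
    """Sum of pairwise MMDs between all intersectional subgroups in a batch."""
    penalty = features.new_zeros(())
    ids = groups.unique()
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            penalty = penalty + rbf_mmd(features[groups == ids[i]],
                                        features[groups == ids[j]])
    return penalty

feats = torch.randn(64, 16)                 # e.g. image-text alignment scores
groups = torch.randint(0, 4, (64,))         # intersectional subgroup ids
loss = subgroup_mmd_penalty(feats, groups)  # add to the task loss, weighted
```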

Result: In dermatology: reduced intersectional missed diagnosis gap (ΔTPR) from 0.50 to 0.26 while improving AUC from 0.94 to 0.97. In glaucoma screening: reduced ΔTPR from 0.41 to 0.31, achieving better AUC of 0.72 vs 0.71 baseline.

Conclusion: Establishes a scalable framework for developing high-stakes clinical decision support systems that are both accurate and equitable across diverse patient subgroups without increasing privacy risks.

Abstract: Medical artificial intelligence (AI) systems, particularly multimodal vision-language models (VLM), often exhibit intersectional biases where models are systematically less confident in diagnosing marginalised patient subgroups. Such bias can lead to higher rates of inaccurate and missed diagnoses due to demographically skewed data and divergent distributions of diagnostic certainty. Current fairness interventions frequently fail to address these gaps or compromise overall diagnostic performance to achieve statistical parity among the subgroups. In this study, we developed Cross-Modal Alignment Consistency (CMAC-MMD), a training framework that standardises diagnostic certainty across intersectional patient subgroups. Unlike traditional debiasing methods, this approach equalises the model’s decision confidence without requiring sensitive demographic data during clinical inference. We evaluated this approach using 10,015 skin lesion images (HAM10000) with external validation on 12,000 images (BCN20000), and 10,000 fundus images for glaucoma detection (Harvard-FairVLMed), stratifying performance by intersectional age, gender, and race attributes. In the dermatology cohort, the proposed method reduced the overall intersectional missed diagnosis gap (difference in True Positive Rate, ΔTPR) from 0.50 to 0.26 while improving the overall Area Under the Curve (AUC) from 0.94 to 0.97 compared to standard training. Similarly, for glaucoma screening, the method reduced ΔTPR from 0.41 to 0.31, achieving a better AUC of 0.72 (vs. 0.71 baseline). This establishes a scalable framework for developing high-stakes clinical decision support systems that are both accurate and equitable across diverse patient subgroups, ensuring reliable performance without increasing privacy risks.

[105] Assessing the Visual Enumeration Abilities of Specialized Counting Architectures and Vision-Language Models

Kuinan Hou, Jing Mi, Marco Zorzi, Lamberto Ballan, Alberto Testolin

Main category: cs.CV

TL;DR: VLMs match or surpass specialized counting architectures for object enumeration, especially when prompted to generate intermediate object representations, but still struggle with complex scenes.

DetailsMotivation: Traditional counting methods rely on domain-specific architectures trained on predefined object categories, while large-scale multimodal vision-language models (VLMs) offer potential as flexible alternatives for open-set object counting.

Method: Systematic comparison of state-of-the-art specialized counting architectures against VLMs on two popular counting datasets and a novel benchmark with finer-grained control over visual properties of test images.

Result: Most VLMs can approximately enumerate items in visual scenes, matching or surpassing specialized architectures. Accuracy significantly improves when VLMs are prompted to generate intermediate representations (locations and verbal labels) of each object.

Conclusion: While VLMs show promise for object counting, none can reliably count objects in complex visual scenes, indicating further research is needed for AI systems to deploy counting procedures reliably in realistic environments.

Abstract: Counting the number of items in a visual scene remains a fundamental yet challenging task in computer vision. Traditional approaches to solving this problem rely on domain-specific counting architectures, which are trained using datasets annotated with a predefined set of object categories. However, recent progress in creating large-scale multimodal vision-language models (VLMs) suggests that these domain-general architectures may offer a flexible alternative for open-set object counting. In this study, we therefore systematically compare the performance of state-of-the-art specialized counting architectures against VLMs on two popular counting datasets, as well as on a novel benchmark specifically created to have a finer-grained control over the visual properties of test images. Our findings show that most VLMs can approximately enumerate the number of items in a visual scene, matching or even surpassing the performance of specialized computer vision architectures. Notably, enumeration accuracy significantly improves when VLMs are prompted to generate intermediate representations (i.e., locations and verbal labels) of each object to be counted. Nevertheless, none of the models can reliably count the number of objects in complex visual scenes, showing that further research is still needed to create AI systems that can reliably deploy counting procedures in realistic environments.

[106] MMMamba: A Versatile Cross-Modal In Context Fusion Framework for Pan-Sharpening and Zero-Shot Image Enhancement

Yingying Wang, Xuanhua He, Chen Wu, Jialing Huang, Suiyun Zhang, Rui Liu, Xinghao Ding, Haoxuan Che

Main category: cs.CV

TL;DR: MMMamba: A cross-modal in-context fusion framework for pan-sharpening using Mamba architecture with linear complexity and multimodal interleaved scanning.

DetailsMotivation: Traditional CNN-based methods have limited adaptability to diverse spatial/spectral variations, while cross-attention mechanisms are computationally inefficient and may dilute fine-grained correspondences. There's a need for more efficient and effective cross-modal fusion for pan-sharpening.

Method: Proposes MMMamba, a cross-modal in-context fusion framework built on Mamba architecture with linear computational complexity. Introduces novel multimodal interleaved (MI) scanning mechanism for effective information exchange between PAN and MS modalities. Supports zero-shot image super-resolution.
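
At the token level, interleaved scanning can be pictured as alternating PAN and MS tokens in one sequence before a linear-time sequential model consumes them. The sketch below uses a GRU as a stand-in for the Mamba scan; all names and shapes are assumptions.

```python
# Token interleaving sketch for the MI-scanning idea; the GRU is only a
# placeholder for a real Mamba/state-space block.
import torch
import torch.nn as nn

def interleave_tokens(pan: torch.Tensor, ms: torch.Tensor) -> torch.Tensor:
    """pan, ms: (B, L, D) token sequences -> (B, 2L, D), alternating."""
    B, L, D = pan.shape
    out = torch.stack([pan, ms], dim=2)  # (B, L, 2, D)
    return out.reshape(B, 2 * L, D)      # pan_0, ms_0, pan_1, ms_1, ...

pan = torch.randn(2, 196, 64)   # panchromatic tokens
ms = torch.randn(2, 196, 64)    # multispectral tokens (resampled to match)
seq = interleave_tokens(pan, ms)
scan = nn.GRU(64, 64, batch_first=True)  # stand-in for a Mamba scan
fused, _ = scan(seq)            # linear-complexity pass over both modalities
```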

Result: Extensive experiments demonstrate superior performance compared to existing state-of-the-art techniques across multiple tasks and benchmarks.

Conclusion: MMMamba provides an efficient and effective solution for pan-sharpening with linear complexity, strong cross-modal interaction, and flexibility for zero-shot super-resolution tasks.

Abstract: Pan-sharpening aims to generate high-resolution multispectral (HRMS) images by integrating a high-resolution panchromatic (PAN) image with its corresponding low-resolution multispectral (MS) image. To achieve effective fusion, it is crucial to fully exploit the complementary information between the two modalities. Traditional CNN-based methods typically rely on channel-wise concatenation with fixed convolutional operators, which limits their adaptability to diverse spatial and spectral variations. While cross-attention mechanisms enable global interactions, they are computationally inefficient and may dilute fine-grained correspondences, making it difficult to capture complex semantic relationships. Recent advances in the Multimodal Diffusion Transformer (MMDiT) architecture have demonstrated impressive success in image generation and editing tasks. Unlike cross-attention, MMDiT employs in-context conditioning to facilitate more direct and efficient cross-modal information exchange. In this paper, we propose MMMamba, a cross-modal in-context fusion framework for pan-sharpening, with the flexibility to support image super-resolution in a zero-shot manner. Built upon the Mamba architecture, our design ensures linear computational complexity while maintaining strong cross-modal interaction capacity. Furthermore, we introduce a novel multimodal interleaved (MI) scanning mechanism that facilitates effective information exchange between the PAN and MS modalities. Extensive experiments demonstrate the superior performance of our method compared to existing state-of-the-art (SOTA) techniques across multiple tasks and benchmarks.

[107] SynthSeg-Agents: Multi-Agent Synthetic Data Generation for Zero-Shot Weakly Supervised Semantic Segmentation

Wangyu Wu, Zhenhong Chen, Xiaowei Huang, Fei Ma, Jimin Xiao

Main category: cs.CV

TL;DR: SynthSeg Agents is a novel Zero-Shot Weakly Supervised Semantic Segmentation framework that uses LLM-driven agents to generate synthetic training data without any real images, achieving competitive performance on standard benchmarks.

DetailsMotivation: Current WSSS methods still depend on real-world training samples, limiting scalability and cost-efficiency. The paper introduces ZSWSSS to eliminate this dependency entirely by generating synthetic training data from scratch using LLM-driven agents.

Method: A multi-agent framework with two key modules: 1) Self-Refine Prompt Agent that autonomously crafts diverse image prompts using iterative refinement, memory mechanisms, and prompt space exploration with CLIP-based filtering; 2) Image Generation Agent that uses VLMs to synthesize images, with CLIP scoring for quality selection and ViT classifier for improved semantic relabeling.
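
The CLIP-based quality gate can be sketched as scoring each synthetic image against its class prompt and keeping only those above a threshold. The checkpoint, threshold value, and selection rule below are assumptions, not the paper's configuration.

```python
# CLIP scoring filter sketch: keep a generated image only if its similarity
# to the class prompt clears an (assumed) threshold.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt")
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    return torch.cosine_similarity(img, txt).item()

def keep(image: Image.Image, prompt: str, threshold: float = 0.28) -> bool:
    """Hypothetical acceptance rule for a synthetic training sample."""
    return clip_score(image, prompt) >= threshold
```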

Result: Experiments on PASCAL VOC 2012 and COCO 2014 show competitive performance without using real training images, demonstrating the framework’s effectiveness in generating high-quality synthetic training data.

Conclusion: SynthSeg Agents highlights the potential of LLM-driven agents for cost-efficient and scalable semantic segmentation, enabling zero-shot weakly supervised learning without real image supervision.

Abstract: Weakly Supervised Semantic Segmentation (WSSS) with image-level labels aims to produce pixel-level predictions without requiring dense annotations. While recent approaches have leveraged generative models to augment existing data, they remain dependent on real-world training samples. In this paper, we introduce a novel direction, Zero-Shot Weakly Supervised Semantic Segmentation (ZSWSSS), and propose SynthSeg Agents, a multi-agent framework driven by Large Language Models (LLMs) to generate synthetic training data entirely without real images. SynthSeg Agents comprises two key modules, a Self-Refine Prompt Agent and an Image Generation Agent. The Self-Refine Prompt Agent autonomously crafts diverse and semantically rich image prompts via iterative refinement, memory mechanisms, and prompt space exploration, guided by CLIP-based similarity and nearest neighbor diversity filtering. These prompts are then passed to the Image Generation Agent, which leverages Vision-Language Models (VLMs) to synthesize candidate images. A frozen CLIP scoring model is employed to select high-quality samples, and a ViT-based classifier is further trained to relabel the entire synthetic dataset with improved semantic precision. Our framework produces high-quality training data without any real image supervision. Experiments on PASCAL VOC 2012 and COCO 2014 show that SynthSeg Agents achieves competitive performance without using real training images. This highlights the potential of LLM-driven agents in enabling cost-efficient and scalable semantic segmentation.

[108] KD360-VoxelBEV: LiDAR and 360-degree Camera Cross Modality Knowledge Distillation for Bird’s-Eye-View Segmentation

Wenke E, Yixin Sun, Jiaxu Liu, Hubert P. H. Shum, Amir Atapour-Abarghouei, Toby P. Breckon

Main category: cs.CV

TL;DR: First cross-modality distillation framework for single-panoramic-camera BEV segmentation using LiDAR-camera fusion teacher to distill knowledge into lightweight camera-only student.

DetailsMotivation: To reduce sensor complexity and deployment costs for BEV segmentation in autonomous driving by enabling efficient, low-cost solutions using only a single panoramic camera instead of expensive LiDAR systems.

Method: Proposes cross-modality distillation with: 1) novel LiDAR image representation (range, intensity, ambient channels), 2) voxel-aligned view transformer for spatial fidelity, 3) high-capacity LiDAR-camera fusion Teacher network, 4) lightweight Student network using only panoramic camera images.
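
A generic version of the training objective, a BEV segmentation loss plus a feature-imitation term against the frozen fusion teacher, might look like the sketch below. The stub networks, channel counts, and loss weighting are illustrative assumptions.

```python
# Cross-modality feature distillation sketch: the camera-only student
# matches the frozen LiDAR+camera teacher's BEV features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBEVNet(nn.Module):
    """Stand-in for either network: returns a BEV feature map and logits."""
    def __init__(self, in_ch: int, classes: int = 4):
        super().__init__()
        self.enc = nn.Conv2d(in_ch, 16, 3, padding=1)
        self.head = nn.Conv2d(16, classes, 1)

    def forward(self, x):
        feat = torch.relu(self.enc(x))
        return feat, self.head(feat)

teacher = TinyBEVNet(in_ch=6)   # camera (3) + LiDAR image (range /
                                # intensity / ambient), fused early here
student = TinyBEVNet(in_ch=3)   # panoramic camera only

cam = torch.randn(2, 3, 64, 64)
lidar = torch.randn(2, 3, 64, 64)
labels = torch.randint(0, 4, (2, 64, 64))

with torch.no_grad():
    t_feat, _ = teacher(torch.cat([cam, lidar], dim=1))
s_feat, s_logits = student(cam)
loss = F.cross_entropy(s_logits, labels) + F.mse_loss(s_feat, t_feat)
```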

Result: Teacher model achieves 25.6% IoU improvement over existing camera-based methods on Dur360BEV. Distilled Student attains 8.5% IoU gain with state-of-the-art 31.2 FPS inference speed. Framework generalizes to diverse camera setups (KITTI-360).

Conclusion: The framework provides practical solution for efficient, low-cost BEV segmentation in real-world autonomous driving by reducing sensor complexity while maintaining competitive performance through cross-modality knowledge distillation.

Abstract: We present the first cross-modality distillation framework specifically tailored for single-panoramic-camera Bird’s-Eye-View (BEV) segmentation. Our approach leverages a novel LiDAR image representation fused from range, intensity and ambient channels, together with a voxel-aligned view transformer that preserves spatial fidelity while enabling efficient BEV processing. During training, a high-capacity LiDAR and camera fusion Teacher network extracts both rich spatial and semantic features for cross-modality knowledge distillation into a lightweight Student network that relies solely on a single 360-degree panoramic camera image. Extensive experiments on the Dur360BEV dataset demonstrate that our teacher model significantly outperforms existing camera-based BEV segmentation methods, achieving a 25.6% IoU improvement. Meanwhile, the distilled Student network attains competitive performance with an 8.5% IoU gain and state-of-the-art inference speed of 31.2 FPS. Moreover, evaluations on KITTI-360 (two fisheye cameras) confirm that our distillation framework generalises to diverse camera setups, underscoring its feasibility and robustness. This approach reduces sensor complexity and deployment costs while providing a practical solution for efficient, low-cost BEV segmentation in real-world autonomous driving.

[109] Automated Motion Artifact Check for MRI (AutoMAC-MRI): An Interpretable Framework for Motion Artifact Detection and Severity Assessment

Antony Jerald, Dattesh Shanbhag, Sudhanya Chatterjee

Main category: cs.CV

TL;DR: AutoMAC-MRI is an explainable framework for grading motion artifacts in MRI images across different contrasts and orientations, using supervised contrastive learning and affinity scoring for interpretable quality assessment.

DetailsMotivation: Motion artifacts degrade MRI quality and increase patient recalls, but existing automated quality assessment methods are limited to binary decisions with little interpretability.

Method: Uses supervised contrastive learning to learn discriminative representations of motion severity, then computes grade-specific affinity scores that quantify an image’s proximity to each motion grade for transparent grade assignment.
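
One natural reading of grade-specific affinity scoring: compute a prototype embedding per motion grade after contrastive pretraining, then report softmax-normalized affinities from distances to those prototypes. The sketch below follows that reading; it is an assumption, not the authors' code.

```python
# Prototype-based affinity scoring sketch: scores expose how close an
# image sits to each motion grade, making the assignment inspectable.
import torch

def grade_prototypes(embeddings: torch.Tensor, grades: torch.Tensor) -> torch.Tensor:
    """embeddings: (N, D), grades: (N,) ints in [0, G) -> (G, D) prototypes."""
    G = int(grades.max()) + 1
    return torch.stack([embeddings[grades == g].mean(dim=0) for g in range(G)])

def affinity_scores(z: torch.Tensor, prototypes: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    """z: (B, D) query embeddings -> (B, G) affinities summing to 1."""
    dists = torch.cdist(z, prototypes)                  # (B, G)
    return torch.softmax(-dists / temperature, dim=-1)  # closer -> higher

train_z = torch.randn(500, 128)
train_grades = torch.randint(0, 4, (500,))
protos = grade_prototypes(train_z, train_grades)
scores = affinity_scores(torch.randn(8, 128), protos)
pred_grade = scores.argmax(dim=-1)  # the scores themselves explain "why"
```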

Result: Evaluated on 5000+ expert-annotated brain MRI slices across multiple contrasts and views, showing affinity scores align well with expert judgment and support interpretable motion severity measurement.

Conclusion: AutoMAC-MRI enables inline MRI quality control with accurate grade detection and interpretable affinity scoring, potentially reducing unnecessary rescans and improving workflow efficiency.

Abstract: Motion artifacts degrade MRI image quality and increase patient recalls. Existing automated quality assessment methods are largely limited to binary decisions and provide little interpretability. We introduce AutoMAC-MRI, an explainable framework for grading motion artifacts across heterogeneous MR contrasts and orientations. The approach uses supervised contrastive learning to learn a discriminative representation of motion severity. Within this feature space, we compute grade-specific affinity scores that quantify an image’s proximity to each motion grade, thereby making grade assignments transparent and interpretable. We evaluate AutoMAC-MRI on more than 5000 expert-annotated brain MRI slices spanning multiple contrasts and views. Experiments assessing affinity scores against expert labels show that the scores align well with expert judgment, supporting their use as an interpretable measure of motion severity. By coupling accurate grade detection with per-grade affinity scoring, AutoMAC-MRI enables inline MRI quality control, with the potential to reduce unnecessary rescans and improve workflow efficiency.

[110] Prototypical Learning Guided Context-Aware Segmentation Network for Few-Shot Anomaly Detection

Yuxin Jiang, Yunkang Cao, Weiming Shen

Main category: cs.CV

TL;DR: PCSNet improves few-shot anomaly detection by addressing domain gaps between pre-trained features and target scenarios using prototypical feature adaptation and context-aware segmentation.

DetailsMotivation: Existing few-shot anomaly detection methods rely on pre-trained features but overlook domain gaps between these representations and target scenarios, limiting performance.

Method: Proposes PCSNet with two sub-networks: 1) Prototypical Feature Adaptation (PFA) for better feature compactness and anomaly separation using prototypical guidance and pixel-level disparity classification, and 2) Context-Aware Segmentation (CAS) for pixel-level anomaly localization using pseudo anomalies.
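
Pseudo anomalies for training a segmentation branch are commonly built CutPaste-style: paste a transformed patch into a normal image and use the paste mask as the pixel-level label. The sketch below shows that generic recipe; whether PCSNet generates its pseudo anomalies exactly this way is an assumption.

```python
# Generic CutPaste-style pseudo-anomaly sketch (assumed, not the paper's
# pipeline): the paste mask doubles as the pixel-level anomaly target.
import torch

def cutpaste_pseudo_anomaly(img: torch.Tensor, size: int = 32):
    """img: (C, H, W) normal sample -> (augmented image, (H, W) mask)."""
    C, H, W = img.shape
    ys = torch.randint(0, H - size, (2,)).tolist()  # source / target rows
    xs = torch.randint(0, W - size, (2,)).tolist()  # source / target cols
    patch = img[:, ys[0]:ys[0] + size, xs[0]:xs[0] + size].flip(-1)  # perturbed
    out, mask = img.clone(), torch.zeros(H, W)
    out[:, ys[1]:ys[1] + size, xs[1]:xs[1] + size] = patch
    mask[ys[1]:ys[1] + size, xs[1]:xs[1] + size] = 1.0               # anomaly GT
    return out, mask

aug, gt = cutpaste_pseudo_anomaly(torch.rand(3, 256, 256))
```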

Result: Achieves 94.9% and 80.2% image-level AUROC on MVTec and MPDD in 8-shot scenarios, and shows promising results in real-world automotive plastic part inspection with limited samples.

Conclusion: PCSNet effectively addresses domain gaps in few-shot anomaly detection, improving feature descriptiveness and achieving state-of-the-art performance with limited training samples.

Abstract: Few-shot anomaly detection (FSAD) denotes the identification of anomalies within a target category with a limited number of normal samples. Existing FSAD methods largely rely on pre-trained feature representations to detect anomalies, but the inherent domain gap between pre-trained representations and target FSAD scenarios is often overlooked. This study proposes a Prototypical Learning Guided Context-Aware Segmentation Network (PCSNet) to address the domain gap, thereby improving feature descriptiveness in target scenarios and enhancing FSAD performance. In particular, PCSNet comprises a prototypical feature adaptation (PFA) sub-network and a context-aware segmentation (CAS) sub-network. PFA extracts prototypical features as guidance to ensure better feature compactness for normal data while maintaining distinct separation from anomalies. A pixel-level disparity classification loss is also designed to make subtle anomalies more distinguishable. Then a CAS sub-network is introduced for pixel-level anomaly localization, where pseudo anomalies are exploited to facilitate the training process. Experimental results on MVTec and MPDD demonstrate the superior FSAD performance of PCSNet, with 94.9% and 80.2% image-level AUROC in an 8-shot scenario, respectively. Real-world applications on automotive plastic part inspection further demonstrate that PCSNet can achieve promising results with limited training samples. Code is available at https://github.com/yuxin-jiang/PCSNet.

[111] MECAD: A multi-expert architecture for continual anomaly detection

Malihe Dahmardeh, Francesco Setti

Main category: cs.CV

TL;DR: MECAD is a multi-expert continual anomaly detection system that dynamically assigns experts to object classes using feature similarity and efficient memory management to prevent catastrophic forgetting.

DetailsMotivation: Industrial environments need anomaly detection systems that can adapt to evolving product types without catastrophic forgetting of previously learned classes, requiring efficient continual learning approaches.

Method: Multi-expert architecture with dynamic expert assignment based on feature similarity, optimized coreset selection, specialized replay buffer mechanism, and efficient memory management for knowledge preservation.
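
Expert assignment by feature similarity can be sketched as routing each new class to the expert with the most similar prototype, spawning a new expert only while under a budget. The threshold, merge rule, and budget below are assumptions.

```python
# Similarity-based expert routing sketch for a multi-expert continual
# learner; values and update rule are hypothetical.
import torch

class ExpertRouter:
    def __init__(self, max_experts: int = 5, threshold: float = 0.6):
        self.prototypes: list[torch.Tensor] = []  # one per expert
        self.max_experts = max_experts
        self.threshold = threshold

    def assign(self, class_feat: torch.Tensor) -> int:
        """class_feat: (D,) mean feature of the incoming object class."""
        if self.prototypes:
            sims = torch.stack([torch.cosine_similarity(class_feat, p, dim=0)
                                for p in self.prototypes])
            best = int(sims.argmax())
            if sims[best] >= self.threshold or len(self.prototypes) == self.max_experts:
                # Merge into the chosen expert's running prototype.
                self.prototypes[best] = 0.5 * (self.prototypes[best] + class_feat)
                return best
        self.prototypes.append(class_feat)        # spawn a new expert
        return len(self.prototypes) - 1

router = ExpertRouter()
for _ in range(15):                               # e.g. 15 MVTec categories
    expert_id = router.assign(torch.randn(256))
```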

Result: 5-expert configuration achieves average AUROC of 0.8259 across 15 object categories on MVTec AD dataset, significantly reducing knowledge degradation compared to single-expert approaches.

Conclusion: MECAD effectively balances computational efficiency, specialized knowledge retention, and adaptability, making it suitable for industrial environments with evolving product types.

Abstract: In this paper we propose MECAD, a novel approach for continual anomaly detection using a multi-expert architecture. Our system dynamically assigns experts to object classes based on feature similarity and employs efficient memory management to preserve the knowledge of previously seen classes. By leveraging an optimized coreset selection and a specialized replay buffer mechanism, we enable incremental learning without requiring full model retraining. Our experimental evaluation on the MVTec AD dataset demonstrates that the optimal 5-expert configuration achieves an average AUROC of 0.8259 across 15 diverse object categories while significantly reducing knowledge degradation compared to single-expert approaches. This framework balances computational efficiency, specialized knowledge retention, and adaptability, making it well-suited for industrial environments with evolving product types.

[112] A Masked Reverse Knowledge Distillation Method Incorporating Global and Local Information for Image Anomaly Detection

Yuxin Jiang, Yunkang Cao, Weiming Shen

Main category: cs.CV

TL;DR: MRKD introduces masked reverse knowledge distillation with image-level and feature-level masking to address overgeneralization in anomaly detection, achieving state-of-the-art performance on MVTec dataset.

DetailsMotivation: Knowledge distillation for anomaly detection suffers from overgeneralization due to similarities between input and supervisory signals, limiting its effectiveness in distinguishing normal from abnormal patterns.

Method: Proposes Masked Reverse Knowledge Distillation (MRKD) with two key components: Image-Level Masking (ILM) to capture global information by differentiating input signals, and Feature-Level Masking (FLM) to incorporate synthetic feature-level anomalies for local information preservation.
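
Image-level masking can be sketched as zeroing random patches of the input while keeping the full image as the supervisory target, which is what turns reconstruction into restoration. The patch size and mask ratio below are assumptions.

```python
# ILM sketch: the masked image goes into the network, the unmasked image
# (or its teacher features) remains the target.
import torch

def image_level_mask(x: torch.Tensor, patch: int = 16, ratio: float = 0.4):
    """x: (B, C, H, W) with H, W divisible by patch -> (masked x, mask)."""
    B, C, H, W = x.shape
    gh, gw = H // patch, W // patch
    keep = (torch.rand(B, 1, gh, gw, device=x.device) > ratio).float()
    mask = keep.repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    return x * mask, mask  # masked input; the full image stays the target

imgs = torch.randn(4, 3, 256, 256)
masked, m = image_level_mask(imgs)
```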

Result: Achieves impressive performance on MVTec dataset: 98.9% image-level AU-ROC, 98.4% pixel-level AU-ROC, and 95.3% AU-PRO. Extensive ablation studies validate MRKD’s superiority in mitigating overgeneralization.

Conclusion: MRKD effectively addresses the overgeneralization problem in knowledge distillation for anomaly detection through dual masking strategies, transforming reconstruction into restoration and achieving state-of-the-art performance.

Abstract: Knowledge distillation is an effective image anomaly detection and localization scheme. However, a major drawback of this scheme is its tendency to overly generalize, primarily due to the similarities between input and supervisory signals. In order to address this issue, this paper introduces a novel technique called masked reverse knowledge distillation (MRKD). By employing image-level masking (ILM) and feature-level masking (FLM), MRKD transforms the task of image reconstruction into image restoration. Specifically, ILM helps to capture global information by differentiating input signals from supervisory signals. On the other hand, FLM incorporates synthetic feature-level anomalies to ensure that the learned representations contain sufficient local information. With these two strategies, MRKD is endowed with stronger image context capture capacity and is less likely to be overgeneralized. Experiments on the widely-used MVTec anomaly detection dataset demonstrate that MRKD achieves impressive performance: image-level 98.9% AU-ROC, pixel-level 98.4% AU-ROC, and 95.3% AU-PRO. In addition, extensive ablation experiments have validated the superiority of MRKD in mitigating the overgeneralization problem.

[113] Emotion Recognition in Signers

Kotaro Funakoshi, Yaoxiong Zhu

Main category: cs.CV

TL;DR: The paper addresses emotion recognition in sign language by tackling two main challenges: distinguishing grammatical vs. affective facial expressions and data scarcity, using cross-lingual datasets including a new Japanese Sign Language dataset (eJSL) and BOBSL for British Sign Language.

DetailsMotivation: The motivation is to overcome two key challenges in signer emotion recognition: 1) the overlap between grammatical facial expressions (used for sign language grammar) and affective facial expressions (showing emotions), and 2) the scarcity of labeled training data for sign language emotion recognition models.

Method: The method uses cross-lingual approach with two datasets: eJSL (new Japanese Sign Language dataset with 1,092 video clips of 78 utterances × 7 emotional states) and BOBSL (large British Sign Language dataset with subtitles). They explore textual emotion recognition from spoken language to mitigate data scarcity, temporal segment selection, and incorporation of hand motion features.

Result: Results show that: 1) textual emotion recognition from spoken language helps mitigate sign language data scarcity, 2) temporal segment selection significantly impacts performance, and 3) incorporating hand motion enhances emotion recognition in signers. The approach establishes a stronger baseline than spoken language LLMs.

Conclusion: The paper successfully addresses both theoretical (grammatical vs. affective expression overlap) and practical (data scarcity) challenges in signer emotion recognition through cross-lingual methods, demonstrating effective techniques including spoken language text analysis, temporal selection, and hand motion incorporation, outperforming spoken language LLM baselines.

Abstract: Recognition of signers’ emotions suffers from one theoretical challenge and one practical challenge, namely, the overlap between grammatical and affective facial expressions and the scarcity of data for model training. This paper addresses these two challenges in a cross-lingual setting using our eJSL dataset, a new benchmark dataset for emotion recognition in Japanese Sign Language signers, and BOBSL, a large British Sign Language dataset with subtitles. In eJSL, two signers expressed 78 distinct utterances with each of seven different emotional states, resulting in 1,092 video clips. We empirically demonstrate that 1) textual emotion recognition in spoken language mitigates data scarcity in sign language, 2) temporal segment selection has a significant impact, and 3) incorporating hand motion enhances emotion recognition in signers. Finally, we establish a stronger baseline than spoken language LLMs.

[114] Vision-based module for accurately reading linear scales in a laboratory

Parvesh Saini, Soumyadipta Maiti, Beena Rai

Main category: cs.CV

TL;DR: A vision-based system that mimics human ability to read measurements from linear scales (syringes and measuring cylinders) by correcting orientation, extracting scale features, and calculating readings with human-like accuracy.

DetailsMotivation: While vision models excel at tasks like object detection and classification, they lack the human-like ability to take accurate quantitative measurements from images. For robots to achieve complete autonomy in laboratory environments, they need basic skills including reading measurements from instruments and apparatus.

Method: The system uses a human-inspired approach: 1) Corrects orientation of randomly oriented syringes through transformations, 2) Reduces area of interest to just the linear scale portion for efficiency, 3) Extracts features including major markers, corresponding digits, and level indicator location, 4) Calculates final reading from these features.
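
The final step, turning detected markers and a level position into a reading, is simple interpolation. The sketch below assumes upstream detection has already produced marker pixel positions, their printed values, and the level indicator position; the numbers are made up for illustration.

```python
# Final reading calculation sketch: linear interpolation between the two
# nearest major markers along the scale axis.
import numpy as np

def reading_from_markers(ticks_px: np.ndarray, tick_values: np.ndarray,
                         level_px: float) -> float:
    """ticks_px: pixel positions of major markers along the scale;
    tick_values: the digits read next to them; level_px: level indicator."""
    order = np.argsort(ticks_px)  # np.interp needs increasing positions
    return float(np.interp(level_px, ticks_px[order], tick_values[order]))

# Example: markers roughly every 50 px labelled 0..5 mL, level at 137 px.
ticks = np.array([20.0, 70.0, 121.0, 171.0, 222.0, 272.0])
values = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
print(reading_from_markers(ticks, values, level_px=137.0))  # ~2.32 mL
```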

Result: The system successfully reads measurements from syringes and measuring cylinders. When compared against human-read values of the same instances, an accurate correspondence was observed, demonstrating the system’s effectiveness.

Conclusion: The proposed vision-based system successfully mimics human ability to read measurements from linear scales, providing an important capability for autonomous laboratory robots and addressing a gap in current vision model capabilities for quantitative measurement tasks.

Abstract: The capabilities and number of vision-based models are increasing rapidly, and these models can now perform tasks such as object detection, image classification, and instance segmentation with great accuracy. However, models that can take accurate quantitative measurements from an image, as a human can do just by looking at it, are rare. For a robot to work with complete autonomy in a laboratory environment, it needs basic skills such as navigation, object handling, and sample preparation to match human-like capabilities in an unstructured setting. Another important capability is reading measurements from instruments and apparatus. Here, we mimic a human-inspired approach to reading measurements from a linear scale, using level reading from a syringe and a measuring cylinder as test cases. For a randomly oriented syringe, we apply transformations to correct the orientation. To make the system efficient and robust, the area of interest is reduced to just the part of the image containing the linear scale. A series of features is then extracted, including the major markers, the corresponding digits, and the level indicator location, from which the final reading is calculated. Readings obtained with this system were compared against human-read values of the same instances, and an accurate correspondence was observed.

[115] Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics

Junjie Chen, Fei Wang, Zhihao Huang, Qing Zhou, Kun Li, Dan Guo, Linfeng Zhang, Xun Yang

Main category: cs.CV

TL;DR: TIMAR is a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts using turn-level masked auto-regression and diffusion to predict continuous head dynamics.

DetailsMotivation: Human conversation involves bidirectional exchanges of speech and nonverbal cues that need to be modeled in 3D for expressive avatars and interactive robots. Existing frameworks treat talking and listening as independent or use non-causal full-sequence modeling, which hinders temporal coherence across conversational turns.

Method: TIMAR (Turn-level Interleaved Masked AutoRegression) is a causal framework that models dialogue as interleaved audio-visual contexts. It fuses multimodal information within each turn, applies turn-level causal attention to accumulate conversational history, and uses a lightweight diffusion head to predict continuous 3D head dynamics that capture both coordination and expressive variability.
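
Turn-level causal attention can be expressed as a mask that lets each token attend within its own turn and to all earlier turns, but never to future turns. A minimal sketch, with per-token turn ids assumed given:

```python
# Turn-level causal mask sketch: True where attention is allowed.
import torch

def turn_causal_mask(turn_ids: torch.Tensor) -> torch.Tensor:
    """turn_ids: (L,) turn index per token -> (L, L) boolean mask where
    query i may attend key j iff j's turn is not later than i's."""
    return turn_ids.unsqueeze(1) >= turn_ids.unsqueeze(0)

# Three turns of lengths 4, 3, 5: tokens in turn 1 see turns 0 and 1 only.
turns = torch.tensor([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
mask = turn_causal_mask(turns)
# Usable directly as the attn_mask of scaled_dot_product_attention.
```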

Result: Experiments on the DualTalk benchmark show TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set, and achieves similar gains on out-of-distribution data.

Conclusion: TIMAR provides an effective causal framework for 3D conversational head generation that improves temporal coherence and performance over existing methods, with applications to expressive avatars and interactive robots.

Abstract: Human conversation involves continuous exchanges of speech and nonverbal cues such as head nods, gaze shifts, and facial expressions that convey attention and emotion. Modeling these bidirectional dynamics in 3D is essential for building expressive avatars and interactive robots. However, existing frameworks often treat talking and listening as independent processes or rely on non-causal full-sequence modeling, hindering temporal coherence across turns. We present TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history, while a lightweight diffusion head predicts continuous 3D head dynamics that capture both coordination and expressive variability. Experiments on the DualTalk benchmark show that TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set, and achieves similar gains on out-of-distribution data. The source code will be released in the GitHub repository https://github.com/CoderChen01/towards-seamleass-interaction.

[116] VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni, Fanhu Zeng, Gaofeng Meng, Zhaoxiang Zhang

Main category: cs.CV

TL;DR: First benchmark for vision-text compression (VTC) evaluates VLMs’ long-context understanding with compressed text, revealing poor performance despite good OCR decoding.

DetailsMotivation: Vision-text compression (VTC) enables token compression ratios of 3x-20x for LLMs, but its impact on VLMs' long-context capabilities is under-investigated. Need systematic evaluation of how high information density affects core VLM abilities.

Method: Introduced first VTC benchmark with three settings: VTC-Retrieval (information retrieval/aggregation), VTC-Reasoning (inferring latent associations with minimal lexical overlap), and VTC-Memory (comprehensive QA within long-term dialogue memory). Also created VTCBench-Wild for diverse input scenarios. Evaluated leading open-source and proprietary models.
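
The underlying VTC mechanism is rendering long text into an image so the model reads it through its vision tower. The rough sketch below illustrates where the 3x-20x token ratio comes from; the font, layout, and token estimates are illustrative assumptions.

```python
# Vision-text compression sketch: pack long text densely into one image.
from PIL import Image, ImageDraw

def render_text(text: str, width: int = 896, line_height: int = 14,
                chars_per_line: int = 120) -> Image.Image:
    lines = [text[i:i + chars_per_line]
             for i in range(0, len(text), chars_per_line)]
    img = Image.new("RGB", (width, line_height * len(lines) + 8), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((4, 4 + i * line_height), line, fill="black")
    return img

doc = "lorem ipsum " * 500          # roughly 6,000 characters of context
img = render_text(doc)
# Roughly 1,500 text tokens collapse into a few hundred vision tokens for
# one image; that gap is the compression ratio the benchmark stresses.
```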

Result: Most VLMs show surprisingly poor long-context understanding with VTC-compressed information, despite good OCR decoding ability. They fail to capture long associations or dependencies in context, indicating fundamental limitations with compressed visual representations.

Conclusion: Study provides deep understanding of VTC limitations and serves as foundation for designing more efficient and scalable VLMs that can better handle compressed visual-text representations while maintaining long-context understanding capabilities.

Abstract: The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model’s ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish the VTCBench-Wild to simulate diverse input scenarios. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-compressed information, failing to capture long associations or dependencies in the context. This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.

[117] Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models

Shiran Ge, Chenyi Huang, Yuang Ai, Qihang Fan, Huaibo Huang, Ran He

Main category: cs.CV

TL;DR: Pro-GRPO is a dynamic framework that improves Group Relative Policy Optimization by proactively pruning low-variance trajectories during sampling, reducing computational costs while maintaining performance.

DetailsMotivation: GRPO's effectiveness is limited by the conflict between needing large group sizes for good performance and the prohibitive computational cost of sampling them. The authors observe that many trajectories cluster around the group-mean reward and offer limited optimization value, which makes much of the sampling computationally wasteful.

Method: Pro-GRPO integrates latent feature-based trajectory pruning into the sampling process through early termination of reward-clustered trajectories. It uses an “Expand-and-Prune” strategy: first expands initial sampling group for diversity, then applies multi-step Optimal Variance Filtering (OVF) to latents to avoid computational costs.
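
A variance filter of the kind described can be sketched as keeping the k trajectories whose rewards deviate most from the group mean before computing group-relative advantages. Whether the paper's OVF is exactly this greedy rule is an assumption.

```python
# Variance-filtering sketch for GRPO: drop reward-clustered trajectories,
# keep the high-deviation subset, then z-score rewards within it.
import torch

def variance_filter(rewards: torch.Tensor, k: int) -> torch.Tensor:
    """rewards: (G,) per-trajectory rewards -> indices of the k kept ones."""
    deviation = (rewards - rewards.mean()).abs()
    return deviation.topk(k).indices

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: z-scored rewards within the kept group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

group_rewards = torch.tensor([0.52, 0.49, 0.51, 0.90, 0.10, 0.50, 0.48, 0.95])
kept = variance_filter(group_rewards, k=4)   # discards clustered trajectories
adv = grpo_advantages(group_rewards[kept])
```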

Result: Extensive experiments on diffusion-based and flow-based models demonstrate Pro-GRPO’s generality and effectiveness. The high-variance subset selected by OVF outperforms larger unfiltered groups, and Pro-GRPO reduces computational overhead compared to static post-sampling approaches.

Conclusion: Pro-GRPO successfully resolves the computational bottleneck in GRPO by dynamically pruning low-variance trajectories during sampling, enabling efficient use of larger group sizes while maintaining or improving alignment performance across different generative model architectures.

Abstract: Group Relative Policy Optimization (GRPO) is a powerful technique for aligning generative models, but its effectiveness is bottlenecked by the conflict between large group sizes and prohibitive computational costs. In this work, we investigate the trade-off through empirical studies, yielding two key observations. First, we discover the reward clustering phenomenon in which many trajectories collapse toward the group-mean reward, offering limited optimization value. Second, we design a heuristic strategy named Optimal Variance Filtering (OVF), and verify that a high-variance subset of trajectories selected by OVF can outperform the larger, unfiltered group. However, this static, post-sampling OVF approach still incurs substantial computational overhead, as it performs unnecessary sampling for trajectories that are ultimately discarded. To resolve this, we propose Pro-GRPO (Proactive GRPO), a novel dynamic framework that integrates latent feature-based trajectory pruning into the sampling process. Through the early termination of reward-clustered trajectories, Pro-GRPO reduces computational overhead. Leveraging its efficiency, Pro-GRPO employs an “Expand-and-Prune” strategy. This strategy first expands the size of the initial sampling group to maximize trajectory diversity, and then applies multi-step OVF to the latents, avoiding prohibitive computational costs. Extensive experiments on both diffusion-based and flow-based models demonstrate the generality and effectiveness of our Pro-GRPO framework.

[118] SemanticBridge – A Dataset for 3D Semantic Segmentation of Bridges and Domain Gap Analysis

Maximilian Kellner, Mariana Ferrandon Cervantes, Yuandong Pan, Ruodan Lu, Ioannis Brilakis, Alexander Reiterer

Main category: cs.CV

TL;DR: A novel 3D semantic segmentation dataset for bridges with sensor variation analysis shows robust model performance but up to 11.4% mIoU degradation due to domain gaps.

DetailsMotivation: Addresses critical need for infrastructure inspection and maintenance by creating a specialized dataset for 3D semantic segmentation of bridges, enabling automated structural health monitoring.

Method: Created a novel dataset with high-resolution 3D scans of diverse bridge structures from various countries, with detailed semantic labels. Evaluated three state-of-the-art 3D deep learning architectures and analyzed domain gaps caused by different sensors.

Result: All three architectures demonstrated robust performance on the bridge segmentation task. However, sensor variations caused domain gaps that could lead to performance degradation of up to 11.4% mIoU.

Conclusion: The dataset successfully enables accurate bridge component segmentation, but sensor-induced domain gaps significantly impact model performance, highlighting the importance of sensor consistency in practical applications.

Abstract: We propose a novel dataset specifically designed for 3D semantic segmentation of bridges and for analyzing the domain gap caused by varying sensors. This addresses a critical need in the field of infrastructure inspection and maintenance, which is essential for modern society. The dataset comprises high-resolution 3D scans of a diverse range of bridge structures from various countries, with detailed semantic labels provided for each. Our initial objective is to facilitate accurate and automated segmentation of bridge components, thereby advancing structural health monitoring practice. To evaluate the effectiveness of existing 3D deep learning models on this novel dataset, we conduct a comprehensive analysis of three distinct state-of-the-art architectures. Furthermore, we present data acquired through diverse sensors to quantify the domain gap resulting from sensor variations. Our findings indicate that all architectures demonstrate robust performance on the specified task. However, the domain gap can lead to a performance decline of up to 11.4% mIoU.

[119] See It Before You Grab It: Deep Learning-based Action Anticipation in Basketball

Arnau Barrera Roy, Albert Clapés Sintes

Main category: cs.CV

TL;DR: This paper introduces action anticipation in basketball videos, specifically predicting which team will gain possession after a shot attempt, using a new dataset of 100K clips with 2K annotated rebound events.

DetailsMotivation: While computer vision has advanced sports analytics with tracking, pose estimation, and foul recognition, action anticipation (predicting future events) in sports videos has received little attention. The paper aims to address this gap by focusing on rebound prediction in basketball.

Method: The authors create a new self-curated dataset of 100,000 basketball video clips (300+ hours) with 2,000+ manually annotated rebound events. They benchmark the task using state-of-the-art action anticipation methods, applying deep learning techniques to basketball rebound prediction for the first time. They also explore two complementary tasks: rebound classification and rebound spotting.

Result: The paper presents comprehensive baseline results showing both the feasibility and inherent challenges of anticipating rebounds. The dataset supports a wide range of video understanding applications in basketball, filling a gap where no comparable datasets currently exist.

Conclusion: This work enables applications in real-time automated broadcasting and post-game analysis tools by forecasting team possession before rebounds occur. It provides valuable insights into predictive modeling for dynamic multi-agent sports scenarios and establishes a foundation for action anticipation research in sports analytics.

Abstract: Computer vision and video understanding have transformed sports analytics by enabling large-scale, automated analysis of game dynamics from broadcast footage. Despite significant advances in player and ball tracking, pose estimation, action localization, and automatic foul recognition, anticipating actions before they occur in sports videos has received comparatively little attention. This work introduces the task of action anticipation in basketball broadcast videos, focusing on predicting which team will gain possession of the ball following a shot attempt. To benchmark this task, a new self-curated dataset comprising 100,000 basketball video clips, over 300 hours of footage, and more than 2,000 manually annotated rebound events is presented. Comprehensive baseline results are reported using state-of-the-art action anticipation methods, representing the first application of deep learning techniques to basketball rebound prediction. Additionally, two complementary tasks, rebound classification and rebound spotting, are explored, demonstrating that this dataset supports a wide range of video understanding applications in basketball, for which no comparable datasets currently exist. Experimental results highlight both the feasibility and inherent challenges of anticipating rebounds, providing valuable insights into predictive modeling for dynamic multi-agent sports scenarios. By forecasting team possession before rebounds occur, this work enables applications in real-time automated broadcasting and post-game analysis tools to support decision-making.

[120] SMART: Semantic Matching Contrastive Learning for Partially View-Aligned Clustering

Liang Peng, Yixuan Ye, Cheng Liu, Hangjun Che, Fei Wang, Zhiwen Yu, Si Wu, Hau-San Wong

Main category: cs.CV

TL;DR: SMART is a semantic matching contrastive learning model for partially view-aligned clustering that addresses cross-view distributional shifts to better exploit semantic relationships in both aligned and unaligned data.

DetailsMotivation: Real-world multi-view data often contains both aligned and unaligned samples, but existing PVC methods fail to effectively leverage unaligned data to capture shared semantics. Cross-view distributional shifts further impair correspondence establishment and learning effectiveness.

Method: Proposes SMART (Semantic MAtching contRasTive learning model) that alleviates cross-view distributional shifts to facilitate semantic matching contrastive learning, enabling better exploitation of semantic relationships across both aligned and unaligned data.

Result: Extensive experiments on eight benchmark datasets demonstrate that SMART consistently outperforms existing approaches on the PVC problem.

Conclusion: SMART effectively addresses the challenges of partially view-aligned clustering by mitigating distributional shifts and enabling semantic matching contrastive learning, leading to superior performance compared to existing methods.

Abstract: Multi-view clustering has been empirically shown to improve learning performance by leveraging the inherent complementary information across multiple views of data. However, in real-world scenarios, collecting strictly aligned views is challenging, and learning from both aligned and unaligned data becomes a more practical solution. Partially View-aligned Clustering (PVC) aims to learn correspondences between misaligned view samples to better exploit the potential consistency and complementarity across views, including both aligned and unaligned data. However, most existing PVC methods fail to leverage unaligned data to capture the shared semantics among samples from the same cluster. Moreover, the inherent heterogeneity of multi-view data induces distributional shifts in representations, leading to inaccuracies in establishing meaningful correspondences between cross-view latent features and, consequently, impairing learning effectiveness. To address these challenges, we propose a Semantic MAtching contRasTive learning model (SMART) for PVC. The main idea of our approach is to alleviate the influence of cross-view distributional shifts, thereby facilitating semantic matching contrastive learning to fully exploit semantic relationships in both aligned and unaligned data. Extensive experiments on eight benchmark datasets demonstrate that our method consistently outperforms existing approaches on the PVC problem.

[121] Preserving Marker Specificity with Lightweight Channel-Independent Representation Learning

Simon Gutwein, Arthur Longuefosse, Jun Seita, Sabine Taschner-Mandl, Roxane Licandro

Main category: cs.CV

TL;DR: Lightweight channel-independent architectures outperform deep early-fusion CNNs for self-supervised representation learning in multiplex tissue imaging data.

DetailsMotivation: Most deep learning models for multiplex tissue imaging use early channel fusion, assuming shared structure across protein markers, but this may not be optimal for preserving marker-specific information needed for rare-cell discrimination.

Method: Proposed Channel-Independent Model (CIM-S) with only 5.5K parameters that preserves marker independence, compared against standard early-fusion CNNs and marker-aware baselines using contrastive pretraining and linear evaluation on Hodgkin lymphoma CODEX dataset with 145,000 cells and 49 markers.
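
Channel independence can be sketched as applying one shared shallow stem to each marker channel separately and concatenating the per-marker embeddings, so no early fusion ever mixes markers. Layer sizes below are illustrative, not the paper's 5.5K-parameter configuration.

```python
# Channel-independent encoder sketch: each marker channel is encoded alone
# by a shared stem; marker-specific information is never fused early.
import torch
import torch.nn as nn

class ChannelIndependentEncoder(nn.Module):
    def __init__(self, emb_dim: int = 8):
        super().__init__()
        self.stem = nn.Sequential(              # shared across markers
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, emb_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) with C protein markers -> (B, C * emb_dim)
        B, C, H, W = x.shape
        per_marker = self.stem(x.reshape(B * C, 1, H, W))  # each channel alone
        return per_marker.reshape(B, C * per_marker.shape[-1])

cells = torch.randn(16, 49, 32, 32)     # 49-marker multiplex crops
z = ChannelIndependentEncoder()(cells)  # marker-specific info kept separate
```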

Result: Channel-independent architectures, especially CIM-S, achieve substantially stronger representations despite compact size, outperform early-fusion models in retaining marker-specific information and rare-cell discrimination, with consistent results across multiple self-supervised frameworks and marker settings.

Conclusion: Lightweight channel-independent architectures can match or surpass deep early-fusion CNNs and foundation models for multiplex representation learning, offering a more suitable inductive bias than increasing model scale.

Abstract: Multiplexed tissue imaging measures dozens of protein markers per cell, yet most deep learning models still apply early channel fusion, assuming shared structure across markers. We investigate whether preserving marker independence, combined with deliberately shallow architectures, provides a more suitable inductive bias for self-supervised representation learning in multiplex data than increasing model scale. Using a Hodgkin lymphoma CODEX dataset with 145,000 cells and 49 markers, we compare standard early-fusion CNNs with channel-separated architectures, including a marker-aware baseline and our novel shallow Channel-Independent Model (CIM-S) with 5.5K parameters. After contrastive pretraining and linear evaluation, early-fusion models show limited ability to retain marker-specific information and struggle particularly with rare-cell discrimination. Channel-independent architectures, and CIM-S in particular, achieve substantially stronger representations despite their compact size. These findings are consistent across multiple self-supervised frameworks, remain stable across augmentation settings, and are reproducible across both the 49-marker and reduced 18-marker settings. These results show that lightweight, channel-independent architectures can match or surpass deep early-fusion CNNs and foundation models for multiplex representation learning. Code is available at https://github.com/SimonBon/CIM-S.

[122] Photorealistic Phantom Roads in Real Scenes: Disentangling 3D Hallucinations from Physical Geometry

Hoang Nguyen, Xiaohao Xu, Xiaonan Huang

Main category: cs.CV

TL;DR: Monocular depth foundation models hallucinate 3D structures from planar but ambiguous inputs (3D Mirage). The paper introduces a benchmark, evaluation metrics, and a parameter-efficient mitigation method to address this safety risk.

DetailsMotivation: Monocular depth foundation models achieve great generalization through large-scale semantic priors, but this creates a critical vulnerability: they hallucinate 3D structures from geometrically planar but perceptually ambiguous inputs (e.g., street art). This "3D Mirage" phenomenon represents an unquantified safety risk that needs systematic investigation and mitigation.

Method: 1) 3D-Mirage benchmark: First benchmark of real-world illusions with planar-region annotations and context-restricted crops. 2) Laplacian-based evaluation framework with two metrics: Deviation Composite Score (DCS) for spurious non-planarity and Confusion Composite Score (CCS) for contextual instability. 3) Grounded Self-Distillation: Parameter-efficient strategy that surgically enforces planarity on illusion ROIs while using a frozen teacher to preserve background knowledge, avoiding catastrophic forgetting.
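
The two-term objective implied above, planarity inside the annotated illusion ROI plus teacher imitation outside it, can be sketched as follows. The Laplacian kernel choice and loss weights are assumptions.

```python
# Grounded Self-Distillation sketch: flatten the "mirage" region while a
# frozen teacher anchors the rest of the prediction.
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def grounded_self_distillation_loss(student_depth, teacher_depth, roi_mask,
                                    lam: float = 1.0):
    """All inputs (B, 1, H, W); roi_mask is 1 on illusion regions."""
    lap = F.conv2d(student_depth, LAPLACIAN, padding=1)
    planar = (lap.abs() * roi_mask).mean()                    # flatten the mirage
    keep = (F.l1_loss(student_depth, teacher_depth, reduction="none")
            * (1 - roi_mask)).mean()                          # preserve elsewhere
    return planar + lam * keep

d_s = torch.rand(2, 1, 64, 64, requires_grad=True)
d_t = torch.rand(2, 1, 64, 64)          # frozen teacher prediction
roi = torch.zeros(2, 1, 64, 64)
roi[:, :, 20:40, 20:40] = 1.0           # annotated illusion region
loss = grounded_self_distillation_loss(d_s, d_t, roi)
```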

Result: The paper provides essential tools to diagnose and mitigate the 3D Mirage phenomenon. The benchmark enables systematic probing, the metrics allow quantification of the problem, and the Grounded Self-Distillation method effectively tames the failure while preserving model knowledge.

Conclusion: The work urges a necessary shift in monocular depth estimation evaluation from pixel-wise accuracy to structural and contextual robustness. The code and benchmark will be publicly available to foster research in this direction, addressing a critical safety vulnerability in depth foundation models.

Abstract: Monocular depth foundation models achieve remarkable generalization by learning large-scale semantic priors, but this creates a critical vulnerability: they hallucinate illusory 3D structures from geometrically planar but perceptually ambiguous inputs. We term this failure the 3D Mirage. This paper introduces the first end-to-end framework to probe, quantify, and tame this unquantified safety risk. To probe, we present 3D-Mirage, the first benchmark of real-world illusions (e.g., street art) with precise planar-region annotations and context-restricted crops. To quantify, we propose a Laplacian-based evaluation framework with two metrics: the Deviation Composite Score (DCS) for spurious non-planarity and the Confusion Composite Score (CCS) for contextual instability. To tame this failure, we introduce Grounded Self-Distillation, a parameter-efficient strategy that surgically enforces planarity on illusion ROIs while using a frozen teacher to preserve background knowledge, thus avoiding catastrophic forgetting. Our work provides the essential tools to diagnose and mitigate this phenomenon, urging a necessary shift in MDE evaluation from pixel-wise accuracy to structural and contextual robustness. Our code and benchmark will be publicly available to foster this exciting research direction.

[123] Step-GUI Technical Report

Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, Shiliang Yang, Zhirui Wang, Brian Li, Kang An, Chenyang Li, Lei Lei, Mengmeng Duan, Danxun Liang, Guodong Liu, Hang Cheng, Hao Wu, Jie Dong, Junhao Huang, Mei Chen, Renjie Yu, Shunshan Li, Xu Zhou, Yiting Dai, Yineng Deng, Yingdan Liang, Zelin Chen, Wen Sun, Chengxu Yan, Chunqin Xu, Dong Li, Fengqiong Xiao, Guanghao Fan, Guopeng Li, Guozhen Peng, Hongbing Li, Hang Li, Hongming Chen, Jingjing Xie, Jianyong Li, Jingyang Zhang, Jiaju Ren, Jiayu Yuan, Jianpeng Yin, Kai Cao, Liang Zhao, Liguo Tan, Liying Shi, Mengqiang Ren, Min Xu, Manjiao Liu, Mao Luo, Mingxin Wan, Na Wang, Nan Wu, Ning Wang, Peiyao Ma, Qingzhou Zhang, Qiao Wang, Qinlin Zeng, Qiong Gao, Qiongyao Li, Shangwu Zhong, Shuli Gao, Shaofan Liu, Shisi Gao, Shuang Luo, Xingbin Liu, Xiaojia Liu, Xiaojie Hou, Xin Liu, Xuanti Feng, Xuedan Cai, Xuan Wen, Xianwei Zhu, Xin Liang, Xin Liu, Xin Zhou, Yingxiu Zhao, Yukang Shi, Yunfang Xu, Yuqing Zeng, Yixun Zhang, Zejia Weng, Zhonghao Yan, Zhiguo Huang, Zhuoyu Wang, Zheng Ge, Jing Li, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Daxin Jiang

Main category: cs.CV

TL;DR: Step-GUI introduces a self-evolving training pipeline with calibrated rewards for GUI automation, achieving SOTA performance while proposing GUI-MCP protocol for privacy-preserving deployment and AndroidDaily benchmark for real-world evaluation.

DetailsMotivation: The paper addresses two key challenges in GUI automation: 1) efficient acquisition of high-quality training data while maintaining annotation reliability, and 2) practical deployment needs including standardized interfaces across devices and privacy protection for real-world usage.

Method: 1) Self-evolving training pipeline with Calibrated Step Reward System that converts model-generated trajectories into reliable training signals through trajectory-level calibration; 2) Step-GUI family of models (4B/8B); 3) GUI-MCP (Model Context Protocol) with hierarchical architecture combining low-level atomic operations and high-level task delegation; 4) AndroidDaily benchmark for real-world evaluation.

Result: Step-GUI 8B achieves state-of-the-art GUI performance: 80.2% on AndroidWorld, 48.5% on OSWorld, 62.6% on ScreenShot-Pro. The training pipeline achieves >90% annotation accuracy with 10-100x lower cost. On AndroidDaily benchmark: 89.91% static actions, 52.50% end-to-end tasks.

Conclusion: The work advances practical GUI agents with a comprehensive solution covering training efficiency, model performance, deployment standardization, privacy protection, and real-world evaluation, demonstrating strong potential for everyday digital interaction deployment.

Abstract: Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving >90% annotation accuracy with 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenShot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation with hierarchical architecture that combines low-level atomic operations and high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.

[124] CLIP-FTI: Fine-Grained Face Template Inversion via CLIP-Driven Attribute Conditioning

Longchen Dai, Zixuan Shen, Zhiheng Zhou, Peipeng Yu, Zhihua Xia

Main category: cs.CV

TL;DR: CLIP-FTI: A CLIP-driven framework for face template inversion that uses semantic embeddings of facial features to reconstruct photorealistic faces from leaked face recognition templates with improved attribute detail and cross-model transferability.

DetailsMotivation: Leaked face templates pose privacy and security threats as they can be inverted to create photorealistic surrogates for impersonation. Existing inversion methods produce over-smoothed facial features with limited transferability between different face recognition systems.

Method: Uses CLIP model to extract semantic embeddings of facial features, fuses them with leaked templates via cross-modal feature interaction network, projects into StyleGAN’s latent space, and uses StyleGAN generator to synthesize face images with fine-grained attribute details.
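
A minimal sketch of what such a cross-modal fusion could look like, assuming a 512-d recognition template, a 512-d CLIP attribute embedding, and an 18x512 StyleGAN W+ code (common defaults, not confirmed details of CLIP-FTI):

```python
import torch
import torch.nn as nn

class TemplateAttributeFusion(nn.Module):
    """Illustrative fusion of a leaked face template with a CLIP attribute
    embedding into a StyleGAN W+ code. Dimensions are common defaults,
    not confirmed details of CLIP-FTI."""

    def __init__(self, dim: int = 512, num_ws: int = 18):
        super().__init__()
        self.num_ws = num_ws
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_wplus = nn.Linear(dim, num_ws * dim)

    def forward(self, template: torch.Tensor, clip_attr: torch.Tensor):
        # Cross-modal interaction: the template queries the attribute embedding.
        fused, _ = self.attn(template.unsqueeze(1),
                             clip_attr.unsqueeze(1),
                             clip_attr.unsqueeze(1))
        w = self.to_wplus(fused.squeeze(1))
        return w.view(-1, self.num_ws, w.shape[-1] // self.num_ws)

template = torch.randn(4, 512)   # leaked recognition template
clip_attr = torch.randn(4, 512)  # CLIP embedding of attribute description
print(TemplateAttributeFusion()(template, clip_attr).shape)  # (4, 18, 512)
```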

Result: Achieves higher identification accuracy and attribute similarity, recovers sharper component-level attribute semantics, and improves cross-model attack transferability compared to prior reconstruction attacks across multiple face recognition backbones and datasets.

Conclusion: CLIP-FTI is the first method to use additional information beyond face templates for inversion, achieving state-of-the-art results in reconstructing photorealistic faces with fine-grained facial attributes from leaked templates.

Abstract: Face recognition systems store face templates for efficient matching. Once leaked, these templates pose a threat: inverting them can yield photorealistic surrogates that compromise privacy and enable impersonation. Although existing research has achieved relatively realistic face template inversion, the reconstructed facial images exhibit over-smoothed facial-part attributes (eyes, nose, mouth) and limited transferability. To address this problem, we present CLIP-FTI, a CLIP-driven fine-grained attribute conditioning framework for face template inversion. Our core idea is to use the CLIP model to obtain the semantic embeddings of facial features, in order to realize the reconstruction of specific facial feature attributes. Specifically, facial feature attribute embeddings extracted from CLIP are fused with the leaked template via a cross-modal feature interaction network and projected into the intermediate latent space of a pretrained StyleGAN. The StyleGAN generator then synthesizes face images with the same identity as the templates but with more fine-grained facial feature attributes. Experiments across multiple face recognition backbones and datasets show that our reconstructions (i) achieve higher identification accuracy and attribute similarity, (ii) recover sharper component-level attribute semantics, and (iii) improve cross-model attack transferability compared to prior reconstruction attacks. To the best of our knowledge, ours is the first method to use additional information beyond the face template itself to realize face template inversion, and it obtains SOTA results.

[125] ST-DETrack: Identity-Preserving Branch Tracking in Entangled Plant Canopies via Dual Spatiotemporal Evidence

Yueqianji Chen, Kevin Williams, John H. Doonan, Paolo Remagnino, Jo Hepworth

Main category: cs.CV

TL;DR: ST-DETrack is a spatiotemporal-fusion dual-decoder network that preserves branch identity in plant phenotyping by integrating spatial geometric priors with temporal motion consistency, achieving 93.6% branch matching accuracy.

DetailsMotivation: Automated extraction of individual plant branches from time-series imagery is computationally challenging due to non-rigid growth dynamics and severe identity fragmentation within entangled canopies, requiring solutions to overcome stage-dependent ambiguities.

Method: ST-DETrack integrates a spatial decoder (using geometric priors like position and angle for early-stage tracking) with a temporal decoder (exploiting motion consistency for late-stage occlusions), featuring an adaptive gating mechanism to dynamically shift between spatial and temporal cues, plus a biological constraint based on negative gravitropism to mitigate vertical growth ambiguities.
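
How ST-DETrack computes its gate is not specified in the summary; a minimal version of the idea is a sigmoid gate predicted from both branch features that blends the spatial and temporal decoders per track:

```python
import torch
import torch.nn as nn

class AdaptiveGate(nn.Module):
    """Minimal spatial/temporal gating: a sigmoid gate predicted from both
    branch features decides, per track, how much to trust geometric priors
    versus motion cues. Purely illustrative."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, spatial: torch.Tensor, temporal: torch.Tensor):
        g = self.gate(torch.cat([spatial, temporal], dim=-1))
        return g * spatial + (1.0 - g) * temporal  # blend the two decoders

spatial = torch.randn(8, 64)    # geometry-based branch features
temporal = torch.randn(8, 64)   # motion-consistency branch features
print(AdaptiveGate()(spatial, temporal).shape)  # torch.Size([8, 64])
```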

Result: On a Brassica napus dataset, ST-DETrack achieves a Branch Matching Accuracy (BMA) of 93.6%, significantly outperforming spatial baselines by 28.9 percentage points and temporal baselines by 3.3 percentage points.

Conclusion: ST-DETrack demonstrates robust long-term identity consistency for branch tracking in complex, dynamic plant architectures by effectively fusing spatial and temporal information with adaptive gating and biological constraints.

Abstract: Automated extraction of individual plant branches from time-series imagery is essential for high-throughput phenotyping, yet it remains computationally challenging due to non-rigid growth dynamics and severe identity fragmentation within entangled canopies. To overcome these stage-dependent ambiguities, we propose ST-DETrack, a spatiotemporal-fusion dual-decoder network designed to preserve branch identity from budding to flowering. Our architecture integrates a spatial decoder, which leverages geometric priors such as position and angle for early-stage tracking, with a temporal decoder that exploits motion consistency to resolve late-stage occlusions. Crucially, an adaptive gating mechanism dynamically shifts reliance between these spatial and temporal cues, while a biological constraint based on negative gravitropism mitigates vertical growth ambiguities. Validated on a Brassica napus dataset, ST-DETrack achieves a Branch Matching Accuracy (BMA) of 93.6%, significantly outperforming spatial and temporal baselines by 28.9 and 3.3 percentage points, respectively. These results demonstrate the method’s robustness in maintaining long-term identity consistency amidst complex, dynamic plant architectures.

[126] Evaluation of deep learning architectures for wildlife object detection: A comparative study of ResNet and Inception

Malach Obisa Amonga, Benard Osero, Edna Too

Main category: cs.CV

TL;DR: ResNet-101 and Inception v3 achieve high accuracy (94-95%) for wildlife detection but struggle with visually similar species and challenging conditions like poor lighting/occlusion.

DetailsMotivation: Wildlife object detection is crucial for biodiversity conservation and ecological monitoring, but faces challenges from environmental variability, visual similarities among species, and intra-class diversity.

Method: Evaluated ResNet-101 and Inception v3 on a wildlife image dataset with standardized preprocessing (resizing to 800 px max, RGB conversion, PyTorch tensors) using a 70:30 train-validation split.
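
The preprocessing is pinned down precisely enough to sketch. The snippet below mirrors the described steps; the file paths are hypothetical and the split is done on an image list:

```python
import torch
from PIL import Image
from torchvision import transforms

def preprocess(path: str, max_dim: int = 800) -> torch.Tensor:
    """Resize so the longest side is at most 800 px, force RGB, and
    convert to a tensor, mirroring the preprocessing described above."""
    img = Image.open(path).convert("RGB")
    scale = max_dim / max(img.size)
    if scale < 1.0:
        img = img.resize((round(img.width * scale), round(img.height * scale)))
    return transforms.ToTensor()(img)

# 70:30 train/validation split over an image list (paths are hypothetical).
paths = [f"wildlife/img_{i:04d}.jpg" for i in range(1000)]
n_train = int(0.7 * len(paths))
g = torch.Generator().manual_seed(0)
perm = torch.randperm(len(paths), generator=g).tolist()
train_paths = [paths[i] for i in perm[:n_train]]
val_paths = [paths[i] for i in perm[n_train:]]
print(len(train_paths), len(val_paths))  # 700 300
```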

Result: ResNet-101: 94% accuracy, 0.91 mAP; Inception v3: 95% accuracy, 0.92 mAP. Both performed well but struggled with visually similar species and poor lighting/occlusion conditions.

Conclusion: Both ResNet-101 and Inception v3 are effective for wildlife object detection, providing reliable foundation for conservation-focused computer vision applications despite some limitations with challenging cases.

Abstract: Wildlife object detection plays a vital role in biodiversity conservation, ecological monitoring, and habitat protection. However, this task is often challenged by environmental variability, visual similarities among species, and intra-class diversity. This study investigates the effectiveness of two deep learning architectures, ResNet-101 and Inception v3, for wildlife object detection under such complex conditions. The models were trained and evaluated on a wildlife image dataset using a standardized preprocessing approach, which included resizing images to a maximum dimension of 800 pixels, converting them to RGB format, and transforming them into PyTorch tensors. A 70:30 training-validation split was used for model development. The ResNet-101 model achieved a classification accuracy of 94% and a mean Average Precision (mAP) of 0.91, showing strong performance in extracting deep hierarchical features. The Inception v3 model performed slightly better, attaining a classification accuracy of 95% and a mAP of 0.92, attributed to its efficient multi-scale feature extraction through parallel convolutions. Despite the strong results, both models exhibited challenges when detecting species with similar visual characteristics or those captured under poor lighting and occlusion. Nonetheless, the findings confirm that both ResNet-101 and Inception v3 are effective models for wildlife object detection tasks and provide a reliable foundation for conservation-focused computer vision applications.

[127] RUMPL: Ray-Based Transformers for Universal Multi-View 2D to 3D Human Pose Lifting

Seyed Abolfazl Ghasemzadeh, Alexandre Alahi, Christophe De Vleeschouwer

Main category: cs.CV

TL;DR: RUMPL is a transformer-based 3D pose lifter that uses a novel 3D ray-based representation of 2D keypoints, making it camera-calibration-free and view-count-agnostic for universal multi-view deployment without retraining.

DetailsMotivation: Existing multi-view 3D pose estimation methods struggle with real-world generalization due to limited large-scale multi-view datasets with 3D ground truth captured under constrained conditions. Current approaches rely on 2D pose estimation combined with 2D-to-3D lifting trained on synthetic data, but face challenges with camera calibration dependencies and view configuration limitations.

Method: Building on the MPL framework, RUMPL introduces a 3D ray-based representation of 2D keypoints that makes the model independent of camera calibration and number of views. It uses a View Fusion Transformer that leverages learned fused-ray tokens to aggregate information along rays, improving multi-view consistency. The approach enables universal deployment across arbitrary multi-view configurations without retraining or fine-tuning.
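
One plausible reading of a ray-based keypoint representation is to back-project each 2D keypoint through the camera intrinsics into a unit viewing ray; RUMPL's exact parameterization may differ from this sketch:

```python
import numpy as np

def keypoints_to_rays(kps_2d: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project 2D keypoints (N, 2) through intrinsics K into unit
    viewing rays (N, 3) in the camera frame. One plausible reading of a
    'ray-based representation'; RUMPL's parameterization may differ."""
    ones = np.ones((kps_2d.shape[0], 1))
    pix = np.hstack([kps_2d, ones])            # homogeneous pixels (N, 3)
    rays = (np.linalg.inv(K) @ pix.T).T        # directions up to scale
    return rays / np.linalg.norm(rays, axis=1, keepdims=True)

K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
kps = np.array([[640.0, 360.0], [800.0, 500.0]])  # e.g. hip, knee pixels
print(keypoints_to_rays(kps, K))  # first ray is the optical axis (0, 0, 1)
```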

Result: RUMPL reduces MPJPE by up to 53% compared to triangulation and over 60% compared to transformer-based image-representation baselines. It demonstrates robustness and scalability on new benchmarks including in-the-wild multi-view and multi-person datasets.

Conclusion: The proposed RUMPL framework successfully addresses limitations of existing multi-view 3D pose estimation methods by introducing a camera-calibration-free, view-count-agnostic approach that generalizes well to real-world scenarios and arbitrary multi-view configurations without requiring retraining.

Abstract: Estimating 3D human poses from 2D images remains challenging due to occlusions and projective ambiguity. Multi-view learning-based approaches mitigate these issues but often fail to generalize to real-world scenarios, as large-scale multi-view datasets with 3D ground truth are scarce and captured under constrained conditions. To overcome this limitation, recent methods rely on 2D pose estimation combined with 2D-to-3D pose lifting trained on synthetic data. Building on our previous MPL framework, we propose RUMPL, a transformer-based 3D pose lifter that introduces a 3D ray-based representation of 2D keypoints. This formulation makes the model independent of camera calibration and the number of views, enabling universal deployment across arbitrary multi-view configurations without retraining or fine-tuning. A new View Fusion Transformer leverages learned fused-ray tokens to aggregate information along rays, further improving multi-view consistency. Extensive experiments demonstrate that RUMPL reduces MPJPE by up to 53% compared to triangulation and over 60% compared to transformer-based image-representation baselines. Results on new benchmarks, including in-the-wild multi-view and multi-person datasets, confirm its robustness and scalability. The framework’s source code is available at https://github.com/aghasemzadeh/OpenRUMPL

[128] Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting

Arthur Moreau, Richard Shaw, Michal Nazarczuk, Jisu Shin, Thomas Tanay, Zhensong Zhang, Songcen Xu, Eduardo Pérez-Pellitero

Main category: cs.CV

TL;DR: A novel feed-forward 3D Gaussian Splatting architecture that replaces rigid pixel grids with adaptive sub-pixel primitive detection, achieving state-of-the-art novel view synthesis with fewer primitives and improved efficiency.

DetailsMotivation: Current feed-forward 3DGS models suffer from suboptimal primitive placement due to reliance on dense, rigid pixel grids, which limits both quality and efficiency in scene generation.

Method: Introduces an “Off The Grid” distribution approach using multi-resolution decoder inspired by keypoint detection to distribute 3D Gaussian primitives at sub-pixel level across image patches, trained end-to-end with self-supervised learning.
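
The decoder itself is not described in detail here, but a standard keypoint-style mechanism for off-the-grid localization is soft-argmax over a detection heatmap, which yields differentiable sub-pixel coordinates; this is a generic sketch, not necessarily the paper's exact decoder:

```python
import torch
import torch.nn.functional as F

def soft_argmax_2d(heatmap: torch.Tensor) -> torch.Tensor:
    """Sub-pixel peak localization of a (B, H, W) detection heatmap via
    soft-argmax: a differentiable, off-the-grid coordinate. A standard
    keypoint-style mechanism, not necessarily the paper's decoder."""
    b, h, w = heatmap.shape
    probs = F.softmax(heatmap.reshape(b, -1), dim=-1).reshape(b, h, w)
    ys = torch.linspace(0, h - 1, h)
    xs = torch.linspace(0, w - 1, w)
    y = (probs.sum(dim=2) * ys).sum(dim=1)   # expected row coordinate
    x = (probs.sum(dim=1) * xs).sum(dim=1)   # expected column coordinate
    return torch.stack([x, y], dim=-1)       # (B, 2), sub-pixel units

hm = torch.zeros(1, 16, 16)
hm[0, 7, 4] = 10.0
hm[0, 8, 4] = 10.0                           # peak "between" two pixels
print(soft_argmax_2d(hm))                    # ~(4.0, 7.5): off the grid
```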

Result: Achieves state-of-the-art novel view synthesis for feed-forward models, generating photorealistic scenes in seconds with far fewer primitives, better detail capture, and reduced artifacts. Also improves camera pose estimation.

Conclusion: The adaptive primitive placement approach enables more accurate and efficient 3D scene generation, and demonstrates that learning to render 3D Gaussians can improve pose estimation, suggesting potential for training foundational models without labels.

Abstract: Feed-forward 3D Gaussian Splatting (3DGS) models enable real-time scene generation but are hindered by suboptimal pixel-aligned primitive placement, which relies on a dense, rigid grid and limits both quality and efficiency. We introduce a new feed-forward architecture that detects 3D Gaussian primitives at a sub-pixel level, replacing the pixel grid with an adaptive, “Off The Grid” distribution. Inspired by keypoint detection, our multi-resolution decoder learns to distribute primitives across image patches. This module is trained end-to-end with a 3D reconstruction backbone using self-supervised learning. Our resulting pose-free model generates photorealistic scenes in seconds, achieving state-of-the-art novel view synthesis for feed-forward models. It outperforms competitors while using far fewer primitives, demonstrating a more accurate and efficient allocation that captures fine details and reduces artifacts. Moreover, we observe that by learning to render 3D Gaussians, our 3D reconstruction backbone improves camera pose estimation, suggesting opportunities to train these foundational models without labels.

[129] DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations

Yuxiang Shi, Zhe Li, Yanwen Wang, Hao Zhu, Xun Cao, Ligang Liu

Main category: cs.CV

TL;DR: DeX-Portrait: A diffusion-based portrait animation method with disentangled control over head pose and facial expression using explicit pose transformations and implicit expression codes.

DetailsMotivation: Existing diffusion-based portrait animation methods lack high-fidelity disentangled control between head pose and facial expression, limiting applications like expression-only or pose-only editing and animation.

Method: 1) Design motion trainer with pose and expression encoders for decomposed driving signals; 2) Inject pose transformation via dual-branch conditioning and expression latent via cross attention; 3) Use progressive hybrid classifier-free guidance for identity consistency.

Result: Outperforms state-of-the-art baselines on both animation quality and disentangled controllability.

Conclusion: DeX-Portrait enables expressive portrait animation with precise, disentangled control over pose and expression, addressing limitations of previous diffusion-based approaches.

Abstract: Portrait animation from a single source image and a driving video is a long-standing problem. Recent approaches tend to adopt diffusion-based image/video generation models for realistic and expressive animation. However, none of these diffusion models realizes high-fidelity disentangled control between the head pose and facial expression, hindering applications like expression-only or pose-only editing and animation. To address this, we propose DeX-Portrait, a novel approach capable of generating expressive portrait animation driven by disentangled pose and expression signals. Specifically, we represent the pose as an explicit global transformation and the expression as an implicit latent code. First, we design a powerful motion trainer to learn both pose and expression encoders for extracting precise and decomposed driving signals. Then we propose to inject the pose transformation into the diffusion model through a dual-branch conditioning mechanism, and the expression latent through cross attention. Finally, we design a progressive hybrid classifier-free guidance for more faithful identity consistency. Experiments show that our method outperforms state-of-the-art baselines on both animation quality and disentangled controllability.

[130] EmoCaliber: Advancing Reliable Visual Emotion Comprehension via Confidence Verbalization and Calibration

Daiqing Wu, Dongbao Yang, Can Ma, Yu Zhou

Main category: cs.CV

TL;DR: EmoCaliber is a confidence-aware Multimodal Large Language Model for Visual Emotion Comprehension that verbalizes confidence in emotion predictions to address the subjectivity of emotion perception.

DetailsMotivation: Current MLLMs for VEC treat emotion prediction as deterministic, outputting single definitive labels, which fails to account for the inherent subjectivity of emotion perception and overlooks plausible alternative interpretations.

Method: A three-stage training framework: 1) progressive structured reasoning, 2) teaching to verbalize confidence, and 3) calibrating confidence expression, resulting in EmoCaliber model.
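
The summary does not give the calibration objective, but verbalized-confidence calibration is commonly quantified with expected calibration error (ECE), sketched here as a reference point rather than as the paper's metric:

```python
import numpy as np

def expected_calibration_error(conf: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Standard ECE over verbalized confidences in [0, 1]: the average
    |accuracy - confidence| gap per confidence bin, weighted by bin size.
    One common calibration measure; the paper's objective may differ."""
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)           # model's stated confidence
correct = rng.uniform(size=1000) < conf            # well-calibrated outcomes
print(expected_calibration_error(conf, correct.astype(float)))  # small
```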

Result: EmoCaliber demonstrates overall superiority against existing methods on VECBench benchmark in both emotion prediction and confidence estimation.

Conclusion: The approach effectively addresses emotion perception subjectivity and represents a feasible step toward more reliable VEC systems by providing confidence estimates alongside predictions.

Abstract: Visual Emotion Comprehension (VEC) aims to infer sentiment polarities or emotion categories from affective cues embedded in images. In recent years, Multimodal Large Language Models (MLLMs) have established a popular paradigm in VEC, leveraging their generalizability to unify VEC tasks defined under diverse emotion taxonomies. While this paradigm achieves notable success, it typically formulates VEC as a deterministic task, requiring the model to output a single, definitive emotion label for each image. Such a formulation insufficiently accounts for the inherent subjectivity of emotion perception, overlooking alternative interpretations that may be equally plausible to different viewers. To address this limitation, we propose equipping MLLMs with capabilities to verbalize their confidence in emotion predictions. This additional signal provides users with an estimate of both the plausibility of alternative interpretations and the MLLMs’ self-assessed competence, thereby enhancing reliability in practice. Building on this insight, we introduce a three-stage training framework that progressively endows with structured reasoning, teaches to verbalize confidence, and calibrates confidence expression, culminating in EmoCaliber, a confidence-aware MLLM for VEC. Through fair and comprehensive evaluations on the unified benchmark VECBench, EmoCaliber demonstrates overall superiority against existing methods in both emotion prediction and confidence estimation. These results validate the effectiveness of our approach and mark a feasible step toward more reliable VEC systems. Project page: https://github.com/wdqqdw/EmoCaliber.

[131] An Efficient and Effective Encoder Model for Vision and Language Tasks in the Remote Sensing Domain

João Daniel Silva, Joao Magalhaes, Devis Tuia, Bruno Martins

Main category: cs.CV

TL;DR: GeoMELT is a compact encoder-only transformer model for multi-task learning in remote sensing, handling both text generation from images and cross-modal retrieval with fewer parameters than large vision-language models.

DetailsMotivation: Large Vision and Language Models (LVLMs) are expensive to train and use due to their massive parameter counts, making them inaccessible to many institutions. While parameter-efficient adaptation techniques exist, computational costs remain prohibitive for widespread adoption in remote sensing applications.

Method: Proposes GeoMELT (Multi-task Efficient Learning Transformer), an encoder-only architecture that addresses multiple remote sensing tasks simultaneously: text generation from images and cross-modal retrieval. The model is designed to be compact in terms of parameter count while maintaining effectiveness.

Result: GeoMELT demonstrates efficacy and efficiency in established benchmarks, confirming that the proposed approach can effectively handle multi-task learning in remote sensing while remaining computationally accessible.

Conclusion: Encoder-only architectures like GeoMELT offer a viable alternative to large LVLMs for multi-task learning in remote sensing, providing effective performance for text generation and cross-modal retrieval tasks while being more computationally efficient and accessible to a wider range of institutions.

Abstract: The remote sensing community has recently seen the emergence of methods based on Large Vision and Language Models (LVLMs) that can address multiple tasks at the intersection of computer vision and natural language processing. To fully exploit the potential of such models, a significant focus has been given to the collection of large amounts of training data that cover multiple remote sensing-specific tasks, such as image captioning or visual question answering. However, the cost of using and training LVLMs is high, due to the large number of parameters. While multiple parameter-efficient adaptation techniques have been explored, the computational costs of training and inference with these models can remain prohibitive for most institutions. In this work, we explore the use of encoder-only architectures and propose a model that can effectively address multi-task learning while remaining compact in terms of the number of parameters. In particular, our model tackles combinations of tasks that are not typically explored in a unified model: the generation of text from remote sensing images and cross-modal retrieval. The results of our GeoMELT model (short for Multi-task Efficient Learning Transformer) on established benchmarks confirm the efficacy and efficiency of the proposed approach.

[132] BLANKET: Anonymizing Faces in Infant Video Recordings

Ditmar Hadera, Jan Cech, Miroslav Purkrabek, Matej Hoffmann

Main category: cs.CV

TL;DR: BLANKET is a novel infant face anonymization method that generates compatible random faces via diffusion inpainting and performs temporally consistent face swapping with expression transfer, outperforming DeepPrivacy2 on de-identification, attribute preservation, pose estimation impact, and artifact reduction.

DetailsMotivation: Ethical use of video data involving human subjects, especially infants, requires robust anonymization methods that protect privacy while preserving essential facial attributes for research purposes.

Method: Two-stage approach: 1) Generate new random face compatible with original identity using diffusion model inpainting; 2) Seamlessly incorporate new identity into each video frame through temporally consistent face swapping with authentic expression transfer.

Result: Evaluated on baby video dataset, BLANKET outperforms DeepPrivacy2 in all assessed metrics: level of de-identification, preservation of facial attributes, impact on human pose estimation (downstream task), and presence of artifacts. Both methods successfully alter identity.

Conclusion: BLANKET provides effective infant face anonymization that balances privacy protection with preservation of essential facial attributes, making it suitable for ethical video data use in research involving infants.

Abstract: Ensuring the ethical use of video data involving human subjects, particularly infants, requires robust anonymization methods. We propose BLANKET (Baby-face Landmark-preserving ANonymization with Keypoint dEtection consisTency), a novel approach designed to anonymize infant faces in video recordings while preserving essential facial attributes. Our method comprises two stages. First, a new random face, compatible with the original identity, is generated via inpainting using a diffusion model. Second, the new identity is seamlessly incorporated into each video frame through temporally consistent face swapping with authentic expression transfer. The method is evaluated on a dataset of short video recordings of babies and is compared to the popular anonymization method, DeepPrivacy2. Key metrics assessed include the level of de-identification, preservation of facial attributes, impact on human pose estimation (as an example of a downstream task), and presence of artifacts. Both methods alter the identity, and our method outperforms DeepPrivacy2 in all other respects. The code is available as an easy-to-use anonymization demo at https://github.com/ctu-vras/blanket-infant-face-anonym.

[133] GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models

Bozhou Li, Sihan Yang, Yushuo Guan, Ruichuan An, Xinlong Chen, Yang Shi, Pengfei Wan, Wentao Zhang, Yuanxing Zhang

Main category: cs.CV

TL;DR: GRAN-TED introduces a new paradigm for text encoders in diffusion models with TED-6K benchmark and two-stage training for superior text embeddings.

DetailsMotivation: Text encoders are critical for semantic fidelity in text-to-image/video diffusion models, but development is hindered by lack of efficient evaluation frameworks and difficulty adapting pretrained language models for visual synthesis.

Method: Two main contributions: 1) TED-6K text-only benchmark with lightweight unified adapter for efficient encoder evaluation, 2) Two-stage training paradigm: initial fine-tuning on Multimodal LLM for visual representation, followed by layer-wise weighting for nuanced text features.
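
A common formulation of layer-wise weighting, offered here as an assumption about GRAN-TED's scheme, is an ELMo-style softmax-weighted sum over the encoder's per-layer hidden states:

```python
import torch
import torch.nn as nn

class LayerwiseWeighting(nn.Module):
    """Softmax-weighted sum over a language model's per-layer hidden
    states (ELMo-style scalar mixing). A common reading of 'layer-wise
    weighting'; GRAN-TED's exact scheme may differ."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, seq_len, dim)
        w = torch.softmax(self.logits, dim=0)
        return torch.einsum("l,lbsd->bsd", w, hidden_states)

states = torch.randn(24, 2, 77, 1024)     # 24 layers of an MLLM encoder
mix = LayerwiseWeighting(num_layers=24)
print(mix(states).shape)                   # torch.Size([2, 77, 1024])
```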

Result: TED-6K performance strongly correlates with downstream generation effectiveness. GRAN-TED encoder achieves SOTA on TED-6K and leads to performance gains in text-to-image and text-to-video generation tasks.

Conclusion: GRAN-TED provides both an evaluation framework (TED-6K) and superior encoder training method, addressing key challenges in text encoder development for diffusion models.

Abstract: The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder’s representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder’s effectiveness in downstream generation tasks. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our code is available at the following link: https://anonymous.4open.science/r/GRAN-TED-4FCC/.

[134] On the Effectiveness of Textual Prompting with Lightweight Fine-Tuning for SAM3 Remote Sensing Segmentation

Roni Blushtein-Livnon, Osher Rafaeli, David Ioffe, Amir Boger, Karen Sandberg Esquenazi, Tal Svoray

Main category: cs.CV

TL;DR: SAM3 framework evaluated for remote sensing image segmentation using textual, geometric, and hybrid prompting strategies with varying supervision levels, showing hybrid approaches work best while text-only struggles with irregular targets.

DetailsMotivation: Remote sensing image segmentation faces challenges due to limited annotated data and domain gap between overhead imagery and natural images used to train foundational models, requiring effective adaptation under limited supervision.

Method: Evaluated SAM3 concept-driven framework for RS imagery across four target types, comparing textual, geometric, and hybrid prompting strategies under lightweight fine-tuning scales with increasing supervision, alongside zero-shot inference.

Result: Combining semantic and geometric cues yields highest performance across targets and metrics. Text-only prompting shows lowest performance, especially for irregularly shaped targets. Performance improves between zero-shot and fine-tuning, then shows diminishing returns with more supervision. Modest geometric annotation effort is sufficient for effective adaptation.

Conclusion: Hybrid prompting (semantic + geometric) works best for RS segmentation. Text-only prompting has limited semantic alignment with overhead imagery. Light fine-tuning offers practical performance-effort trade-off for regular targets. Under-segmentation and boundary inaccuracies remain prevalent error patterns, especially for irregular and less prevalent targets.

Abstract: Remote sensing (RS) image segmentation is constrained by the limited availability of annotated data and a gap between overhead imagery and natural images used to train foundational models. This motivates effective adaptation under limited supervision. SAM3 concept-driven framework generates masks from textual prompts without requiring task-specific modifications, which may enable this adaptation. We evaluate SAM3 for RS imagery across four target types, comparing textual, geometric, and hybrid prompting strategies, under lightweight fine-tuning scales with increasing supervision, alongside zero-shot inference. Results show that combining semantic and geometric cues yields the highest performance across targets and metrics. Text-only prompting exhibits the lowest performance, with marked score gaps for irregularly shaped targets, reflecting limited semantic alignment between SAM3 textual representations and their overhead appearances. Nevertheless, textual prompting with light fine-tuning offers a practical performance-effort trade-off for geometrically regular and visually salient targets. Across targets, performance improves between zero-shot inference and fine-tuning, followed by diminishing returns as the supervision scale increases. Namely, a modest geometric annotation effort is sufficient for effective adaptation. A persistent gap between Precision and IoU further indicates that under-segmentation and boundary inaccuracies remain prevalent error patterns in RS tasks, particularly for irregular and less prevalent targets.

[135] MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors

Zhipeng Du, Duolikun Danier, Jan Eric Lenssen, Hakan Bilen

Main category: cs.CV

TL;DR: MoonSeg3R enables online zero-shot monocular 3D instance segmentation using geometric priors from CUT3R foundation model, achieving competitive performance with RGB-D systems.

DetailsMotivation: Existing 3D instance segmentation approaches rely on posed RGB-D sequences, failing in practical settings where only single RGB streams are available. There's a need for online monocular 3D segmentation without depth sensors.

Method: Three key components: (1) self-supervised query refinement with spatial-semantic distillation to transform 2D VFM masks into 3D queries; (2) 3D query index memory for temporal consistency via contextual query retrieval; (3) state-distribution token from CUT3R as mask identity descriptor for cross-frame fusion.

Result: First method enabling online monocular 3D segmentation, achieving competitive performance with state-of-the-art RGB-D-based systems on ScanNet200 and SceneNN datasets.

Conclusion: MoonSeg3R successfully overcomes limitations of RGB-D dependency by leveraging foundation models, enabling practical online 3D instance segmentation from single RGB streams with temporal consistency.

Abstract: In this paper, we focus on online zero-shot monocular 3D instance segmentation, a novel practical setting where existing approaches fail to perform because they rely on posed RGB-D sequences. To overcome this limitation, we leverage CUT3R, a recent Reconstructive Foundation Model (RFM), to provide reliable geometric priors from a single RGB stream. We propose MoonSeg3R, which introduces three key components: (1) a self-supervised query refinement module with spatial-semantic distillation that transforms segmentation masks from 2D visual foundation models (VFMs) into discriminative 3D queries; (2) a 3D query index memory that provides temporal consistency by retrieving contextual queries; and (3) a state-distribution token from CUT3R that acts as a mask identity descriptor to strengthen cross-frame fusion. Experiments on ScanNet200 and SceneNN show that MoonSeg3R is the first method to enable online monocular 3D segmentation and achieves performance competitive with state-of-the-art RGB-D-based systems. Code and models will be released.

[136] IMKD: Intensity-Aware Multi-Level Knowledge Distillation for Camera-Radar Fusion

Shashank Mishra, Karan Patil, Didier Stricker, Jason Rambach

Main category: cs.CV

TL;DR: IMKD is a radar-camera 3D object detection framework using multi-level knowledge distillation that preserves sensor characteristics while enhancing complementary strengths, achieving state-of-the-art performance on nuScenes.

DetailsMotivation: Existing knowledge distillation methods for radar-camera fusion often transfer modality-specific features directly, which can distort each sensor's unique characteristics and degrade their individual strengths. There's a need for a distillation approach that preserves sensor-specific properties while effectively leveraging their complementary information.

Method: IMKD uses a three-stage intensity-aware distillation strategy: (1) LiDAR-to-Radar intensity-aware feature distillation to enhance radar with structural cues, (2) LiDAR-to-Fused intensity-guided distillation to highlight useful geometry/depth information at fusion level, and (3) Camera-Radar intensity-guided fusion mechanism for effective feature alignment and calibration.
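
As a guess at the flavor of the intensity-aware terms (the published loss may differ), one can weight a per-location feature distillation error by a rasterized LiDAR intensity map, so structurally informative cells contribute more:

```python
import torch

def intensity_aware_distill_loss(student: torch.Tensor,
                                 teacher: torch.Tensor,
                                 intensity: torch.Tensor) -> torch.Tensor:
    """Feature-map distillation weighted by a (B, 1, H, W) LiDAR intensity
    map in [0, 1]: high-intensity (structurally informative) BEV cells
    contribute more. A guess at the flavor of IMKD's loss, not its
    published definition."""
    err = (student - teacher.detach()).pow(2).mean(dim=1, keepdim=True)
    w = intensity / (intensity.mean() + 1e-6)   # normalize weight scale
    return (w * err).mean()

student = torch.randn(2, 64, 128, 128)   # radar/camera BEV features
teacher = torch.randn(2, 64, 128, 128)   # LiDAR teacher BEV features
intensity = torch.rand(2, 1, 128, 128)   # rasterized LiDAR intensity
print(intensity_aware_distill_loss(student, teacher, intensity))
```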

Result: Extensive experiments on nuScenes benchmark show IMKD achieves 67.0% NDS and 61.0% mAP, outperforming all prior distillation-based radar-camera fusion methods.

Conclusion: IMKD demonstrates that multi-level knowledge distillation with intensity awareness can effectively preserve sensor characteristics while amplifying complementary strengths, leading to state-of-the-art radar-camera 3D object detection performance without LiDAR at inference time.

Abstract: High-performance Radar-Camera 3D object detection can be achieved by leveraging knowledge distillation without using LiDAR at inference time. However, existing distillation methods typically transfer modality-specific features directly to each sensor, which can distort their unique characteristics and degrade their individual strengths. To address this, we introduce IMKD, a radar-camera fusion framework based on multi-level knowledge distillation that preserves each sensor’s intrinsic characteristics while amplifying their complementary strengths. IMKD applies a three-stage, intensity-aware distillation strategy to enrich the fused representation across the architecture: (1) LiDAR-to-Radar intensity-aware feature distillation to enhance radar representations with fine-grained structural cues, (2) LiDAR-to-Fused feature intensity-guided distillation to selectively highlight useful geometry and depth information at the fusion level, fostering complementarity between the modalities rather than forcing them to align, and (3) Camera-Radar intensity-guided fusion mechanism that facilitates effective feature alignment and calibration. Extensive experiments on the nuScenes benchmark show that IMKD reaches 67.0% NDS and 61.0% mAP, outperforming all prior distillation-based radar-camera fusion methods. Our code and models are available at https://github.com/dfki-av/IMKD/.

[137] FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision

Tobias Kirschstein, Simon Giebenhain, Matthias Nießner

Main category: cs.CV

TL;DR: FlexAvatar creates complete 3D head avatars from single images by combining monocular and multi-view training through transformer architecture with bias sinks.

DetailsMotivation: Existing methods struggle with incomplete 3D head reconstructions from single images due to limited multi-view data and entanglement issues in monocular training.

Method: Transformer-based 3D portrait animation model with learnable data source tokens (bias sinks) that enables unified training across monocular and multi-view datasets.
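
A minimal reading of bias sinks is one learned token per data source, prepended to the token sequence so a shared transformer can absorb source-specific bias; the sizes below are invented:

```python
import torch
import torch.nn as nn

class DataSourceTokens(nn.Module):
    """Prepend one learned 'bias sink' token per data source (e.g.
    0 = monocular video, 1 = multi-view capture) so a shared transformer
    can absorb source-specific bias. An illustrative reading of
    FlexAvatar's bias sinks, with invented sizes."""

    def __init__(self, num_sources: int = 2, dim: int = 256, depth: int = 2):
        super().__init__()
        self.source_tokens = nn.Embedding(num_sources, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens: torch.Tensor, source_id: torch.Tensor):
        sink = self.source_tokens(source_id).unsqueeze(1)   # (B, 1, dim)
        out = self.encoder(torch.cat([sink, tokens], dim=1))
        return out[:, 1:]            # drop the sink; bias has been absorbed

tokens = torch.randn(4, 196, 256)                   # image/feature tokens
source_id = torch.tensor([0, 0, 1, 1])              # per-sample data source
print(DataSourceTokens()(tokens, source_id).shape)  # (4, 196, 256)
```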

Result: Generates complete 3D head avatars with realistic facial animations, outperforming existing methods in view extrapolation and achieving smooth latent avatar space for identity interpolation.

Conclusion: FlexAvatar successfully addresses the completeness problem in single-image 3D avatar creation by leveraging both monocular and multi-view data strengths through a unified transformer architecture.

Abstract: We introduce FlexAvatar, a method for creating high-quality and complete 3D head avatars from a single image. A core challenge lies in the limited availability of multi-view data and the tendency of monocular training to yield incomplete 3D head reconstructions. We identify the root cause of this issue as the entanglement between driving signal and target viewpoint when learning from monocular videos. To address this, we propose a transformer-based 3D portrait animation model with learnable data source tokens, so-called bias sinks, which enables unified training across monocular and multi-view datasets. This design leverages the strengths of both data sources during inference: strong generalization from monocular data and full 3D completeness from multi-view supervision. Furthermore, our training procedure yields a smooth latent avatar space that facilitates identity interpolation and flexible fitting to an arbitrary number of input observations. In extensive evaluations on single-view, few-shot, and monocular avatar creation tasks, we verify the efficacy of FlexAvatar. Many existing methods struggle with view extrapolation while FlexAvatar generates complete 3D head avatars with realistic facial animations. Website: https://tobias-kirschstein.github.io/flexavatar/

[138] Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition

Shengming Yin, Zekai Zhang, Zecheng Tang, Kaiyuan Gao, Xiao Xu, Kun Yan, Jiahao Li, Yilei Chen, Yuxiang Chen, Heung-Yeung Shum, Lionel M. Ni, Jingren Zhou, Junyang Lin, Chenfei Wu

Main category: cs.CV

TL;DR: Qwen-Image-Layered is a diffusion model that decomposes RGB images into multiple RGBA layers for independent editing, addressing consistency issues in image generation.

DetailsMotivation: Current visual generative models struggle with consistency during image editing due to raster images' entangled nature, unlike professional design tools that use layered representations for isolated edits.

Method: Three key components: 1) RGBA-VAE for unified latent representations, 2) VLD-MMDiT architecture for variable-length layer decomposition, 3) Multi-stage training strategy to adapt pretrained models. Also built a pipeline to extract multilayer images from PSD files for training data.

Result: The method significantly surpasses existing approaches in decomposition quality and establishes a new paradigm for consistent image editing.

Conclusion: Qwen-Image-Layered enables inherent editability through semantically disentangled RGBA layers, providing a professional design-like approach to consistent image editing in generative models.

Abstract: Recent visual generative models often struggle with consistency during image editing due to the entangled nature of raster images, where all visual content is fused into a single canvas. In contrast, professional design tools employ layered representations, allowing isolated edits while preserving consistency. Motivated by this, we propose Qwen-Image-Layered, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers, enabling inherent editability, where each RGBA layer can be independently manipulated without affecting other content. To support variable-length decomposition, we introduce three key components: (1) an RGBA-VAE to unify the latent representations of RGB and RGBA images; (2) a VLD-MMDiT (Variable Layers Decomposition MMDiT) architecture capable of decomposing a variable number of image layers; and (3) a Multi-stage Training strategy to adapt a pretrained image generation model into a multilayer image decomposer. Furthermore, to address the scarcity of high-quality multilayer training images, we build a pipeline to extract and annotate multilayer images from Photoshop documents (PSD). Experiments demonstrate that our method significantly surpasses existing approaches in decomposition quality and establishes a new paradigm for consistent image editing. Our code and models are released on https://github.com/QwenLM/Qwen-Image-Layered

[139] 3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model

Wenbo Hu, Yining Hong, Yanjun Wang, Leison Gao, Zibu Wei, Xingcheng Yao, Nanyun Peng, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang

Main category: cs.CV

TL;DR: 3DLLM-Mem: A novel memory management model for LLMs that enables effective spatial-temporal reasoning in 3D environments by dynamically fusing current observations with relevant past memories.

DetailsMotivation: Current LLMs struggle with planning and acting in dynamic 3D environments due to lack of proper spatial-temporal memory modeling, while humans excel at leveraging long-term memory across temporal and spatial experiences.

Method: Proposes 3DLLM-Mem with working memory tokens representing current observations as queries to selectively attend to and fuse useful spatial-temporal features from episodic memory storing past observations and interactions.
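
The fusion step can be sketched as cross-attention in which working-memory tokens query the episodic memory and are updated with what they retrieve; this is a minimal sketch, not 3DLLM-Mem's full module:

```python
import torch
import torch.nn as nn

class MemoryFusion(nn.Module):
    """Working-memory tokens (current observation) attend over episodic
    memory (past observations) and are updated with what they retrieve.
    A minimal sketch of the fusion idea, not 3DLLM-Mem's full module."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, working: torch.Tensor, episodic: torch.Tensor):
        retrieved, weights = self.attn(working, episodic, episodic)
        return self.norm(working + retrieved), weights

working = torch.randn(1, 32, 256)     # tokens for the current 3D view
episodic = torch.randn(1, 4096, 256)  # stored past observations
fused, attn = MemoryFusion()(working, episodic)
print(fused.shape, attn.shape)        # (1, 32, 256) (1, 32, 4096)
```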

Result: Achieves state-of-the-art performance, outperforming strongest baselines by 16.5% in success rate on 3DMem-Bench’s most challenging in-the-wild embodied tasks.

Conclusion: The approach enables agents to focus on task-relevant information while maintaining memory efficiency in complex, long-horizon 3D environments.

Abstract: Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences. In contrast, current Large Language Models (LLMs) struggle to effectively plan and act in dynamic, multi-room 3D environments. We posit that part of this limitation is due to the lack of proper 3D spatial-temporal memory modeling in LLMs. To address this, we first introduce 3DMem-Bench, a comprehensive benchmark comprising over 26,000 trajectories and 2,892 embodied tasks, question-answering and captioning, designed to evaluate an agent’s ability to reason over long-term memory in 3D environments. Second, we propose 3DLLM-Mem, a novel dynamic memory management and fusion model for embodied spatial-temporal reasoning and actions in LLMs. Our model uses working memory tokens, which represent current observations, as queries to selectively attend to and fuse the most useful spatial and temporal features from episodic memory, which stores past observations and interactions. Our approach allows the agent to focus on task-relevant information while maintaining memory efficiency in complex, long-horizon environments. Experimental results demonstrate that 3DLLM-Mem achieves state-of-the-art performance across various tasks, outperforming the strongest baselines by 16.5% in success rate on 3DMem-Bench’s most challenging in-the-wild embodied tasks.

[140] Robust Multi-view Camera Calibration from Dense Matches

Johannes Hägerlind, Bao-Long Tran, Urs Waldmann, Per-Erik Forssén

Main category: cs.CV

TL;DR: A robust method for camera pose estimation and calibration that improves SfM pipeline design choices, particularly for cameras with strong radial distortion.

DetailsMotivation: While SfM has improved camera calibration accuracy, challenges remain, especially for applications like animal behavior studies and forensic analysis where multiple rigid cameras observe scenes from different perspectives.

Method: Analyzes SfM pipeline components to identify improvements: (1) investigates optimal subsampling of dense matcher correspondences for estimation, (2) develops selection criteria for incremental view addition, and (3) applies correspondence subsampling in global SfM initialized with VGGT.
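
One plausible subsampling strategy (the paper compares several; this specific scheme is an assumption) is to keep the single highest-confidence match per image grid cell, which makes the kept correspondences spatially uniform:

```python
import numpy as np

def grid_subsample(matches: np.ndarray, conf: np.ndarray,
                   image_hw: tuple, cell: int = 32) -> np.ndarray:
    """Keep the highest-confidence dense match per grid cell so estimation
    sees spatially uniform correspondences. One plausible subsampling
    strategy, not necessarily the one the paper selects."""
    h, w = image_hw
    cols = (w + cell - 1) // cell
    cell_id = (matches[:, 1] // cell).astype(int) * cols + \
              (matches[:, 0] // cell).astype(int)
    order = np.argsort(-conf)                  # best matches first
    _, first = np.unique(cell_id[order], return_index=True)
    return matches[order[first]]

rng = np.random.default_rng(0)
matches = rng.uniform(0, [640, 480], size=(20000, 2))   # x, y in image 1
conf = rng.uniform(size=20000)                           # matcher confidence
print(grid_subsample(matches, conf, (480, 640)).shape)   # <= 20x15 = 300 kept
```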

Result: Significant improvement over vanilla VGGT (79.9% vs 40.4%) for cameras with strong radial distortion. The pipeline generalizes across diverse camera setups.

Conclusion: The proposed robust SfM pipeline could become a valuable tool for animal behavior studies and forensic analysis by improving camera calibration accuracy across various setups.

Abstract: Estimating camera intrinsics and extrinsics is a fundamental problem in computer vision, and while advances in structure-from-motion (SfM) have improved accuracy and robustness, open challenges remain. In this paper, we introduce a robust method for pose estimation and calibration. We consider a set of rigid cameras, each observing the scene from a different perspective, which is a typical camera setup in animal behavior studies and forensic analysis of surveillance footage. Specifically, we analyse the individual components in a structure-from-motion (SfM) pipeline, and identify design choices that improve accuracy. Our main contributions are: (1) we investigate how to best subsample the predicted correspondences from a dense matcher to leverage them in the estimation process. (2) We investigate selection criteria for how to add the views incrementally. In a rigorous quantitative evaluation, we show the effectiveness of our changes, especially for cameras with strong radial distortion (79.9% ours vs. 40.4% vanilla VGGT). Finally, we demonstrate our correspondence subsampling in a global SfM setting where we initialize the poses using VGGT. The proposed pipeline generalizes across a wide range of camera setups, and could thus become a useful tool for animal behavior and forensic analysis.

[141] Persistent feature reconstruction of resident space objects (RSOs) within inverse synthetic aperture radar (ISAR) images

Morgan Coe, Gruffudd Jones, Leah-Nani Alconcel, Marina Gashinova

Main category: cs.CV

TL;DR: This paper proposes using sequential feature detection and tracking in sub-THz ISAR images for space object recognition, enhancing Space Domain Awareness through improved feature detection confidence.

DetailsMotivation: With increasing resident space objects (RSOs), detailed information about their condition and capabilities is needed for Space Domain Awareness. Space-based sensing enables inspection at shorter ranges without atmospheric effects, requiring robust feature recognition methods.

Method: Uses sequential feature detection and tracking in aligned ISAR images: 1) Metaheuristic simulator generates ISAR imagery for various scenarios, 2) Affine transformations align frames, 3) Gradient-by-ratio method detects edges, 4) Double-weighted Hough transform detects linear features, 5) Features are tracked throughout image sequences.
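
A sketch of the double-weighted voting idea, with the exact weighting scheme assumed rather than taken from the paper: each edge pixel votes with its gradient magnitude, and only for line angles consistent with its gradient direction:

```python
import numpy as np

def weighted_hough(edge_mag, edge_dir, n_theta=180, band=np.deg2rad(10)):
    """Hough transform for lines in which every edge pixel votes with its
    gradient magnitude, restricted to line normals within `band` of its
    gradient direction. A sketch of 'double-weighted' voting; the paper's
    exact weighting scheme is not reproduced here."""
    h, w = edge_mag.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((2 * diag, n_theta))
    ys, xs = np.nonzero(edge_mag > 0)
    for y, x in zip(ys, xs):
        # Line normal should align with the gradient direction at (x, y).
        ok = np.abs(np.angle(np.exp(1j * (thetas - edge_dir[y, x])))) < band
        rho = (x * np.cos(thetas[ok]) + y * np.sin(thetas[ok])).astype(int)
        acc[rho + diag, np.nonzero(ok)[0]] += edge_mag[y, x]
    return acc, thetas

mag = np.zeros((64, 64))
mag[32, 10:54] = 1.0                          # horizontal edge
direction = np.full((64, 64), np.pi / 2)      # gradient points "up"
acc, thetas = weighted_hough(mag, direction)
peak = np.unravel_index(acc.argmax(), acc.shape)
print(np.rad2deg(thetas[peak[1]]))            # ~90 deg line normal
```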

Result: The approach increases confidence in feature detection and classification. Feature evolution during frame sequences can be analyzed, and the method enables robust detection of features like shadowing. Sub-cm image resolution at up to 100 km ranges was previously demonstrated.

Conclusion: Sequential feature tracking in ISAR images enhances space object recognition for Space Domain Awareness. The proposed method improves feature detection reliability and enables analysis of feature evolution, with shadowing detection presented as a practical application.

Abstract: With the rapidly growing population of resident space objects (RSOs) in the near-Earth space environment, detailed information about their condition and capabilities is needed to provide Space Domain Awareness (SDA). Space-based sensing will enable inspection of RSOs at shorter ranges, independent of atmospheric effects, and from all aspects. The use of a sub-THz inverse synthetic aperture radar (ISAR) imaging and sensing system for SDA has been proposed in previous work, demonstrating the achievement of sub-cm image resolution at ranges of up to 100 km. This work focuses on recognition of external structures by use of sequential feature detection and tracking throughout the aligned ISAR images of the satellites. The Hough transform is employed to detect linear features, which are tracked throughout the sequence. ISAR imagery is generated via a metaheuristic simulator capable of modelling encounters for a variety of deployment scenarios. Initial frame-to-frame alignment is achieved through a series of affine transformations to facilitate later association between image features. A gradient-by-ratio method is used for edge detection within individual ISAR images, and edge magnitude and direction are subsequently used to inform a double-weighted Hough transform to detect features with high accuracy. Feature evolution during sequences of frames is analysed. It is shown that tracking features across sequences with the proposed approach increases confidence in feature detection and classification, and an example use-case of robust detection of shadowing as a feature is presented.

[142] IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning

Yuanhang Li, Yiren Song, Junzhe Bai, Xinran Liang, Hu Yang, Libiao Jin, Qi Mao

Main category: cs.CV

TL;DR: IC-Effect is a DiT-based framework for few-shot video VFX editing that preserves spatial/temporal consistency while injecting complex effects like flames and particles using limited paired data.

DetailsMotivation: Video VFX editing is challenging because effects must blend seamlessly with unchanged backgrounds, and existing models fail to meet requirements of background preservation, natural effect injection, and efficient learning from limited data.

Method: Uses instruction-guided DiT framework with source video as contextual conditions, two-stage training (general editing adaptation + Effect-LoRA for effect-specific learning), and spatiotemporal sparse tokenization for efficiency.
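
Effect-LoRA is only described at a high level; below is a generic low-rank adapter of the kind typically used for such effect-specific fine-tuning, wrapping a frozen linear layer. The rank, scaling, and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update, the
    standard mechanism behind effect-specific adapters."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                # freeze pretrained weights
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)             # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

layer = LoRALinear(nn.Linear(1024, 1024))
out = layer(torch.randn(2, 77, 1024))              # e.g. a token sequence
```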

Result: Delivers high-quality, controllable, temporally consistent VFX editing across 15 visual styles, with substantially reduced computation while maintaining high fidelity.

Conclusion: IC-Effect opens new possibilities for video creation by enabling efficient few-shot VFX editing with strict background preservation and natural effect injection.

Abstract: We propose \textbf{IC-Effect}, an instruction-guided, DiT-based framework for few-shot video VFX editing that synthesizes complex effects (\eg flames, particles and cartoon characters) while strictly preserving spatial and temporal consistency. Video VFX editing is highly challenging because injected effects must blend seamlessly with the background, the background must remain entirely unchanged, and effect patterns must be learned efficiently from limited paired data. However, existing video editing models fail to satisfy these requirements. IC-Effect leverages the source video as clean contextual conditions, exploiting the contextual learning capability of DiT models to achieve precise background preservation and natural effect injection. A two-stage training strategy, consisting of general editing adaptation followed by effect-specific learning via Effect-LoRA, ensures strong instruction following and robust effect modeling. To further improve efficiency, we introduce spatiotemporal sparse tokenization, enabling high fidelity with substantially reduced computation. We also release a paired VFX editing dataset spanning $15$ high-quality visual styles. Extensive experiments show that IC-Effect delivers high-quality, controllable, and temporally consistent VFX editing, opening new possibilities for video creation.

[143] OccSTeP: Benchmarking 4D Occupancy Spatio-Temporal Persistence

Yu Zheng, Jie Hu, Kailun Yang, Jiaming Zhang

Main category: cs.CV

TL;DR: 4D Occupancy Spatio-Temporal Persistence (OccSTeP) introduces persistent 3D scene understanding for autonomous driving with reactive and proactive forecasting, benchmarked on challenging scenarios, achieving significant performance gains.

DetailsMotivation: Autonomous driving requires robust 3D scene understanding that persists over time and handles temporal disturbances while accounting for potential future actions.

Method: Proposes OccSTeP-WM, a tokenizer-free world model with dense voxel-based scene state, linear-complexity attention backbone, recurrent state-space module, and ego-motion compensation for incremental spatio-temporal context fusion.
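
Ego-motion compensation of a dense voxel state amounts to resampling the previous scene memory under the relative ego pose. A minimal PyTorch sketch, assuming the pose is already expressed in normalized voxel coordinates:

```python
import torch
import torch.nn.functional as F

def compensate_ego_motion(state, rel_pose):
    """Warp a voxel scene state (B, C, D, H, W) by the relative ego pose
    so past memory stays aligned with the current frame."""
    theta = rel_pose[:, :3, :].to(state.dtype)      # (B, 3, 4) affine part
    grid = F.affine_grid(theta, list(state.shape), align_corners=False)
    return F.grid_sample(state, grid, mode='bilinear',
                         padding_mode='zeros', align_corners=False)

state = torch.randn(1, 16, 32, 64, 64)
identity = torch.eye(4).unsqueeze(0)                # no motion
assert torch.allclose(compensate_ego_motion(state, identity), state, atol=1e-5)
```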

Result: Achieves average semantic mIoU of 23.70% (+6.56% gain) and occupancy IoU of 35.89% (+9.26% gain), demonstrating robust performance even with missing or noisy historical sensor input.

Conclusion: The OccSTeP concept and OccSTeP-WM model effectively address persistent 4D scene understanding for autonomous driving, with open-sourced data and code for community use.

Abstract: Autonomous driving requires a persistent understanding of 3D scenes that is robust to temporal disturbances and accounts for potential future actions. We introduce a new concept of 4D Occupancy Spatio-Temporal Persistence (OccSTeP), which aims to address two tasks: (1) reactive forecasting: ‘‘what will happen next’’ and (2) proactive forecasting: “what would happen given a specific future action”. For the first time, we create a new OccSTeP benchmark with challenging scenarios (e.g., erroneous semantic labels and dropped frames). To address this task, we propose OccSTeP-WM, a tokenizer-free world model that maintains a dense voxel-based scene state and incrementally fuses spatio-temporal context over time. OccSTeP-WM leverages a linear-complexity attention backbone and a recurrent state-space module to capture long-range spatial dependencies while continually updating the scene memory with ego-motion compensation. This design enables online inference and robust performance even when historical sensor input is missing or noisy. Extensive experiments prove the effectiveness of the OccSTeP concept and our OccSTeP-WM, yielding an average semantic mIoU of 23.70% (+6.56% gain) and occupancy IoU of 35.89% (+9.26% gain). The data and code will be open source at https://github.com/FaterYU/OccSTeP.

[144] Towards Physically-Based Sky-Modeling For Image Based Lighting

Ian J. Maquignaz

Main category: cs.CV

TL;DR: AllSky is a new sky-model that learns directly from physically captured HDR imagery to achieve more accurate outdoor illumination than existing DNN-based methods, providing intuitive user control over sun position and cloud formations while supporting the full dynamic range needed for photorealistic rendering.

DetailsMotivation: Existing DNN-generated sky models fail to faithfully recreate natural skies and cannot support both photorealism and the full 22 f-stops required for outdoor illumination. Current methods produce environment maps that don't re-light scenes with the same tones, shadows, and illumination as physically captured HDR imagery, limiting their usefulness in visual arts, VR, and engineering applications.

Method: AllSky is a flexible all-weather sky-model learned directly from physically captured HDRI. The authors study input modalities, tonemapping, conditioning, and evaluation of sky-models. The model gives users intuitive control over environment maps, including positioning of the sun and cloud formations.
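
The 22 f-stop requirement is easy to make concrete: each stop is a doubling of radiance, so the dynamic range of a linear HDR sky map can be measured directly. A small sketch (the luminance proxy is an assumption):

```python
import numpy as np

def f_stops(hdr, eps=1e-6):
    """Dynamic range of a linear-radiance HDR image in f-stops
    (each stop is a doubling of radiance)."""
    lum = hdr.mean(axis=-1)                  # rough luminance proxy
    lo = max(lum[lum > 0].min(), eps)
    return float(np.log2(lum.max() / lo))

sky = np.random.uniform(1.0, 2.0, (64, 128, 3)).astype(np.float32)
sky[10, 64] = 4.2e6                          # synthetic sun disc pixel
print(f"{f_stops(sky):.1f} f-stops")         # ~22, since log2(4.2e6) = 22.0
```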

Result: AllSky achieves state-of-the-art sky-model performance and expands current functionality by enabling intuitive user control over environment maps. The evaluation demonstrates that existing DNN sky-models are not interchangeable with physically captured HDRI or parametric sky-models, with current limitations being prohibitive for scalability and accurate illumination in downstream applications.

Conclusion: The paper presents progress in HDR sky-modeling by addressing the limitations of existing DNN methods. AllSky provides a more accurate and controllable solution that better supports the full dynamic range needed for photorealistic outdoor rendering, making it more suitable for professional applications requiring faithful illumination.

Abstract: Accurate environment maps are a key component for rendering photorealistic outdoor scenes with coherent illumination. They enable captivating visual arts, immersive virtual reality, and a wide range of engineering and scientific applications. Recent works have extended sky-models to be more comprehensive and inclusive of cloud formations but, as we demonstrate, existing methods fall short in faithfully recreating natural skies. Though in recent years the visual quality of DNN-generated High Dynamic Range Imagery (HDRI) has greatly improved, the environment maps generated by DNN sky-models do not re-light scenes with the same tones, shadows, and illumination as physically captured HDR imagery. In this work, we demonstrate progress in HDR literature to be tangential to sky-modelling, as current works cannot support both photorealism and the 22 f-stops required for the Full Dynamic Range (FDR) of outdoor illumination. We achieve this by proposing AllSky, a flexible all-weather sky-model learned directly from physically captured HDRI, which we leverage to study the input modalities, tonemapping, conditioning, and evaluation of sky-models. Through user-controlled positioning of the sun and cloud formations, AllSky expands on current functionality by allowing intuitive control over environment maps, and achieves state-of-the-art sky-model performance. Through our proposed evaluation, we demonstrate that existing DNN sky-models are not interchangeable with physically captured HDRI or parametric sky-models, with current limitations being prohibitive for scalability and accurate illumination in downstream applications.

[145] InpaintDPO: Mitigating Spatial Relationship Hallucinations in Foreground-conditioned Inpainting via Diverse Preference Optimization

Qirui Li, Yizhe Tang, Ran Yi, Guangben Lu, Fangyuan Zou, Peng Shu, Huan Yu, Jie Jiang

Main category: cs.CV

TL;DR: InpaintDPO: A DPO-based framework for foreground-conditioned inpainting that addresses spatial relationship hallucinations between foreground subjects and generated backgrounds through specialized preference optimization techniques.

DetailsMotivation: Current foreground-conditioned inpainting methods suffer from Spatial Relationship Hallucinations (inappropriate scale, positional relationships, and viewpoints) between foreground subjects and generated backgrounds. The subjective nature of spatial rationality makes it difficult to quantify and apply traditional RLHF methods.

Method: Proposes InpaintDPO framework with three key components: 1) MaskDPO - confines preference optimization to background regions while retaining inpainting loss in foreground to avoid gradient conflicts; 2) Conditional Asymmetric Preference Optimization - samples pairs with differentiated cropping operations and applies global preference optimization for boundary coherence; 3) Shared Commonality Preference Optimization - enhances model’s understanding of spatial commonality across high-quality winning samples.
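
The MaskDPO component (preference optimization on background pixels only, plain denoising loss on the identical foreground) can be sketched on top of a Diffusion-DPO-style objective. This is a simplification; the beta value and per-region weighting are assumptions:

```python
import torch
import torch.nn.functional as F

def mask_dpo_loss(eps_w, eps_l, eps_w_ref, eps_l_ref, tgt_w, tgt_l,
                  bg_mask, fg_mask, beta=500.0):
    """Diffusion-DPO preference term over background pixels only, plus a
    plain denoising (inpainting) loss on the foreground region.
    eps_*: (B, C, H, W) noise predictions; masks: (B, H, W) in {0, 1}."""
    def bg_err(pred, tgt):
        se = (pred - tgt).pow(2).mean(dim=1)                 # (B, H, W)
        return (se * bg_mask).sum((1, 2)) / bg_mask.sum((1, 2))

    # The policy should beat the reference more on the winning sample
    # than on the losing one -- measured on background pixels only.
    margin = (bg_err(eps_w, tgt_w) - bg_err(eps_w_ref, tgt_w)) \
           - (bg_err(eps_l, tgt_l) - bg_err(eps_l_ref, tgt_l))
    pref = -F.logsigmoid(-beta * margin).mean()

    # Foreground keeps the ordinary denoising loss for preservation.
    fg = ((eps_w - tgt_w).pow(2).mean(dim=1) * fg_mask).sum() / fg_mask.sum()
    return pref + fg
```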

Result: The framework ensures plausible spatial relationships between foreground and background elements, addressing scale, positional, and viewpoint hallucinations while maintaining foreground preservation and boundary coherence.

Conclusion: InpaintDPO is the first DPO-based framework dedicated to spatial rationality in foreground-conditioned inpainting, effectively addressing spatial relationship hallucinations through specialized preference optimization techniques that handle gradient conflicts, boundary coherence, and spatial commonality learning.

Abstract: Foreground-conditioned inpainting, which aims at generating a harmonious background for a given foreground subject based on the text prompt, is an important subfield in controllable image generation. A common challenge in current methods, however, is the occurrence of Spatial Relationship Hallucinations between the foreground subject and the generated background, including inappropriate scale, positional relationships, and viewpoints. Critically, the subjective nature of spatial rationality makes it challenging to quantify, hindering the use of traditional reward-based RLHF methods. To address this issue, we propose InpaintDPO, the first Direct Preference Optimization (DPO) based framework dedicated to spatial rationality in foreground-conditioned inpainting, ensuring plausible spatial relationships between foreground and background elements. To resolve the gradient conflicts in standard DPO caused by identical foreground in win-lose pairs, we propose MaskDPO, which confines preference optimization exclusively to the background to enhance background spatial relationships, while retaining the inpainting loss in the foreground region for robust foreground preservation. To enhance coherence at the foreground-background boundary, we propose Conditional Asymmetric Preference Optimization, which samples pairs with differentiated cropping operations and applies global preference optimization to promote contextual awareness and enhance boundary coherence. Finally, based on the observation that winning samples share a commonality in plausible spatial relationships, we propose Shared Commonality Preference Optimization to enhance the model’s understanding of spatial commonality across high-quality winning samples, further promoting shared spatial rationality.

[146] Hard Labels In! Rethinking the Role of Hard Labels in Mitigating Local Semantic Drift

Jiacheng Cui, Bingkui Tong, Xinyue Bi, Xiaohan Zhao, Jiacheng Liu, Zhiqiang Shen

Main category: cs.CV

TL;DR: Soft labels can cause semantic drift when using few crops per image, leading to mismatch between local visual content and global semantics. The paper proposes HALD, which uses hard labels as corrective anchors alongside soft labels to fix this drift, achieving state-of-the-art results.

DetailsMotivation: Soft labels have become dominant for knowledge transfer but suffer from "local semantic drift" when using limited crops per image - crops may visually resemble wrong classes, causing soft embeddings to deviate from ground-truth semantics. This creates mismatch between local visual content and global semantic meaning, introducing systematic errors and distribution misalignment.

Method: Proposes HALD (Hard Label for Alleviating Local Semantic Drift), a new training paradigm that leverages hard labels as intermediate corrective signals while retaining soft labels’ fine-grained advantages. Theoretically characterizes drift emergence under few soft-label supervision and shows hybridizing soft and hard labels restores alignment between visual content and semantic supervision.
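
The hybrid supervision at the core of HALD reduces to mixing a soft-label distillation term with a hard-label cross-entropy anchor. A minimal PyTorch sketch; the mixing weight and temperature are assumptions:

```python
import torch
import torch.nn.functional as F

def hybrid_label_loss(logits, soft_targets, hard_targets, lam=0.5, tau=1.0):
    """Soft-label KL (fine-grained teacher signal) anchored by hard-label
    cross-entropy (content-agnostic correction of semantic drift)."""
    soft = F.kl_div(F.log_softmax(logits / tau, dim=-1),
                    F.softmax(soft_targets / tau, dim=-1),
                    reduction='batchmean') * tau ** 2
    hard = F.cross_entropy(logits, hard_targets)
    return (1 - lam) * soft + lam * hard

logits = torch.randn(8, 1000)
teacher_logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
loss = hybrid_label_loss(logits, teacher_logits, labels)
```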

Result: Extensive experiments on dataset distillation and large-scale conventional classification benchmarks show consistent improvements. On ImageNet-1K, achieves 42.7% with only 285M storage for soft labels, outperforming prior state-of-the-art LPLD by 9.0%.

Conclusion: Re-establishes the importance of hard labels as complementary tools in soft-label-dominated training. Shows that appropriately integrated hard labels provide powerful content-agnostic anchors to calibrate semantic drift, calling for rethinking their role in modern training paradigms.

Abstract: Soft labels generated by teacher models have become a dominant paradigm for knowledge transfer and recent large-scale dataset distillation such as SRe2L, RDED, LPLD, offering richer supervision than conventional hard labels. However, we observe that when only a limited number of crops per image are used, soft labels are prone to local semantic drift: a crop may visually resemble another class, causing its soft embedding to deviate from the ground-truth semantics of the original image. This mismatch between local visual content and global semantic meaning introduces systematic errors and distribution misalignment between training and testing. In this work, we revisit the overlooked role of hard labels and show that, when appropriately integrated, they provide a powerful content-agnostic anchor to calibrate semantic drift. We theoretically characterize the emergence of drift under few soft-label supervision and demonstrate that hybridizing soft and hard labels restores alignment between visual content and semantic supervision. Building on this insight, we propose a new training paradigm, Hard Label for Alleviating Local Semantic Drift (HALD), which leverages hard labels as intermediate corrective signals while retaining the fine-grained advantages of soft labels. Extensive experiments on dataset distillation and large-scale conventional classification benchmarks validate our approach, showing consistent improvements in generalization. On ImageNet-1K, we achieve 42.7% with only 285M storage for soft labels, outperforming prior state-of-the-art LPLD by 9.0%. Our findings re-establish the importance of hard labels as a complementary tool, and call for a rethinking of their role in soft-label-dominated training.

[147] Spatia: Video Generation with Updatable Spatial Memory

Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu, Yan Lu

Main category: cs.CV

TL;DR: Spatia: A video generation framework that uses persistent 3D scene point clouds as spatial memory to maintain long-term consistency, enabling explicit camera control and 3D-aware editing.

DetailsMotivation: Existing video generation models struggle with long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. Current approaches lack persistent spatial memory, leading to inconsistencies over time.

Method: Spatia maintains a 3D scene point cloud as persistent spatial memory. It iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM (Simultaneous Localization and Mapping). This dynamic-static disentanglement separates persistent scene structure from dynamic entities.
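
Updating the spatial memory amounts to appending newly triangulated SLAM points and deduplicating them on a voxel grid so the point cloud stays bounded. A minimal numpy sketch (the voxel size is arbitrary):

```python
import numpy as np

def update_memory(memory, new_pts, voxel=0.05):
    """Merge new SLAM points into the persistent scene point cloud,
    keeping one point per voxel cell to bound memory growth."""
    pts = np.concatenate([memory, new_pts], axis=0)     # (N, 3)
    keys = np.floor(pts / voxel).astype(np.int64)
    # np.unique on voxel indices keeps the first point in each cell.
    _, keep = np.unique(keys, axis=0, return_index=True)
    return pts[np.sort(keep)]

memory = np.zeros((0, 3))
memory = update_memory(memory, np.random.randn(5000, 3))
print(memory.shape)
```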

Result: The framework enhances spatial consistency throughout video generation while preserving the ability to produce realistic dynamic entities. It enables explicit camera control and 3D-aware interactive editing, providing geometrically grounded video generation.

Conclusion: Spatia offers a scalable, memory-driven video generation framework that addresses long-term consistency issues through persistent spatial memory, opening up new possibilities for controllable and geometrically-aware video synthesis.

Abstract: Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic-static disentanglement design enhances spatial consistency throughout the generation process while preserving the model’s ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.

[148] Stylized Synthetic Augmentation further improves Corruption Robustness

Georg Siedel, Rojan Regmi, Abhirami Anand, Weijia Shao, Silvia Vock, Andrey Morozov

Main category: cs.CV

TL;DR: A training data augmentation pipeline combining synthetic images with neural style transfer improves deep vision models’ robustness to common corruptions, achieving state-of-the-art results on corruption benchmarks.

DetailsMotivation: Deep vision models are vulnerable to common corruptions, and the paper aims to address this weakness through effective data augmentation techniques.

Method: Combines synthetic image data generation with neural style transfer augmentation, systematically analyzing their effects and hyperparameters. The approach integrates with rule-based augmentations like TrivialAugment when compatible.
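
The pipeline composes cleanly with torchvision. In the sketch below, `stylize` stands in for any neural style-transfer callable and is an assumption, while TrivialAugmentWide is the rule-based policy named in the paper:

```python
import random
from torchvision import transforms

def stylize(img):
    """Placeholder for a neural style-transfer model (e.g. AdaIN).
    An assumption here: any PIL-in/PIL-out callable works."""
    return img

train_tf = transforms.Compose([
    transforms.Lambda(lambda im: stylize(im) if random.random() < 0.5 else im),
    transforms.TrivialAugmentWide(),   # compatible rule-based augmentation
    transforms.ToTensor(),
])
# Synthetic and real images are simply mixed in the training set; both
# pass through the same stylization + TrivialAugment pipeline.
```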

Result: Achieves state-of-the-art corruption robustness: 93.54% on CIFAR-10-C, 74.9% on CIFAR-100-C, and 50.86% on TinyImageNet-C. Surprisingly, stylized synthetic images improve training despite their degraded image quality as measured by FID.

Conclusion: Synthetic data and style transfer complement each other effectively for improving corruption robustness, and can be combined with compatible rule-based augmentations to achieve superior performance on corruption benchmarks.

Abstract: This paper proposes a training data augmentation pipeline that combines synthetic image data with neural style transfer in order to address the vulnerability of deep vision models to common corruptions. We show that although applying style transfer on synthetic images degrades their quality with respect to the common FID metric, these images are surprisingly beneficial for model training. We conduct a systematic empirical analysis of the effects of both augmentations and their key hyperparameters on the performance of image classifiers. Our results demonstrate that stylization and synthetic data complement each other well and can be combined with popular rule-based data augmentation techniques such as TrivialAugment, while not working with others. Our method achieves state-of-the-art corruption robustness on several small-scale image classification benchmarks, reaching 93.54%, 74.9% and 50.86% robust accuracy on CIFAR-10-C, CIFAR-100-C and TinyImageNet-C, respectively.

[149] Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu

Main category: cs.CV

TL;DR: Skyra is a multimodal LLM that detects AI-generated videos by identifying visual artifacts and providing human-interpretable explanations, trained on a novel dataset with a two-stage strategy.

DetailsMotivation: The misuse of AI video generation raises serious social concerns, creating urgent need for reliable detectors. Most existing methods only do binary classification without explanations for human interpretation.

Method: Skyra is a specialized multimodal LLM that identifies human-perceivable visual artifacts as grounded evidence for detection and explanation. Uses two-stage training strategy: 1) enhances spatio-temporal artifact perception and explanation capability, 2) improves detection accuracy. Built on ViF-CoT-4K dataset - first large-scale AI-generated video artifact dataset with fine-grained human annotations.

Result: Skyra surpasses existing methods across multiple benchmarks. Comprehensive evaluation on ViF-Bench (3K samples from 10+ state-of-the-art video generators) shows superior performance. Provides valuable insights for advancing explainable AI-generated video detection.

Conclusion: Skyra addresses the limitations of existing binary classification methods by providing explainable AI-generated video detection through artifact identification. The approach advances the field toward more interpretable and reliable video authenticity verification.

Abstract: The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model’s spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.

[150] VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression

Kyle Sargent, Ruiqi Gao, Philipp Henzler, Charles Herrmann, Aleksander Holynski, Li Fei-Fei, Jiajun Wu, Jason Zhang

Main category: cs.CV

TL;DR: VLMs can replicate human visual judgments zero-shot, enabling diffusion-based image compression trained directly on VLM preferences without perceptual loss distillation.

DetailsMotivation: Traditional distortion metrics like MSE don't align well with human perception, and existing perceptual losses require large-scale human judgment datasets. VLMs show surprising ability to replicate human visual preferences zero-shot.

Method: Proposes VLIC, a diffusion-based image compression system post-trained with binary VLM judgments obtained via 2AFC reasoning. Uses existing diffusion model post-training techniques with VLM preferences instead of distilling them into a separate perceptual loss network.
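
Collecting preference pairs from zero-shot 2AFC judgments can be sketched with a generic judge callable; `vlm_judge` and the prompt below are assumptions, not a specific API:

```python
def build_preference_pair(reference, recon_a, recon_b, vlm_judge):
    """Turn a zero-shot 2AFC judgment from a VLM into a (win, lose) pair
    for preference post-training. `vlm_judge` is any callable returning
    'A' or 'B'; it and the prompt are assumptions, not a specific API."""
    choice = vlm_judge(
        reference, recon_a, recon_b,
        prompt="Which reconstruction, A or B, is perceptually closer to "
               "the reference? Reason about the differences, then answer.")
    return (recon_a, recon_b) if choice == "A" else (recon_b, recon_a)
```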

Result: VLIC achieves competitive or state-of-the-art performance on human-aligned compression across datasets, validated by perceptual metrics and large-scale user studies. VLM-based reward design and training procedures are extensively analyzed.

Conclusion: VLMs can effectively replace human judgment datasets for perceptual alignment in image compression, enabling high-performance compression systems trained directly on VLM preferences without complex perceptual loss distillation.

Abstract: Evaluations of image compression performance which include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned to human perception. In order to align compression models to human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision-Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset, according to perceptual metrics and large-scale user studies. We additionally conduct an extensive analysis of the VLM-based reward design and training procedure and share important insights. More visuals are available at https://kylesargent.github.io/vlic

[151] End-to-End Training for Autoregressive Video Diffusion via Self-Resampling

Yuwei Guo, Ceyuan Yang, Hao He, Yang Zhao, Meng Wei, Zhenheng Yang, Weilin Huang, Dahua Lin

Main category: cs.CV

TL;DR: Resampling Forcing: A teacher-free framework for training autoregressive video diffusion models from scratch that addresses exposure bias through self-resampling and enables efficient long-horizon generation with history routing.

DetailsMotivation: Autoregressive video diffusion models are promising for world simulation but suffer from exposure bias due to train-test mismatch. Existing solutions rely on bidirectional teacher models or online discriminators, lacking end-to-end training solutions.

Method: Introduces Resampling Forcing with: 1) Self-resampling scheme that simulates inference-time model errors on history frames during training, 2) Sparse causal mask for temporal causality while enabling parallel training with frame-level diffusion loss, and 3) History routing mechanism that dynamically retrieves top-k most relevant history frames for each query.
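
History routing, as described, is parameter-free top-k retrieval over past frame features, as in the sketch below (cosine similarity as the relevance measure is an assumption):

```python
import torch
import torch.nn.functional as F

def route_history(query, history, k=4):
    """Parameter-free top-k retrieval of the most relevant past frames.
    query: (B, C) pooled feature of the frame being generated;
    history: (B, T, C) pooled features of all past frames."""
    q = F.normalize(query, dim=-1).unsqueeze(1)            # (B, 1, C)
    h = F.normalize(history, dim=-1)                       # (B, T, C)
    sim = (q * h).sum(-1)                                  # (B, T) cosine
    idx = sim.topk(min(k, history.shape[1]), dim=-1).indices
    sel = torch.gather(history, 1,
                       idx.unsqueeze(-1).expand(-1, -1, history.shape[-1]))
    return sel, idx                                        # (B, k, C), (B, k)

sel, idx = route_history(torch.randn(2, 256), torch.randn(2, 30, 256))
```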

Result: Achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos due to native-length training.

Conclusion: Resampling Forcing provides an effective teacher-free framework for training autoregressive video diffusion models from scratch, addressing exposure bias and enabling efficient long-horizon generation with improved temporal consistency.

Abstract: Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.

[152] GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection

Yu Wang, Juhyung Ha, Frangil M. Ramirez, Yuchen Wang, David J. Crandall

Main category: cs.CV

TL;DR: GateFusion introduces a novel hierarchical gated fusion decoder for active speaker detection that enables progressive multi-depth fusion of visual and audio features, achieving new SOTA results on multiple benchmarks.

DetailsMotivation: Late fusion approaches in active speaker detection fail to capture fine-grained cross-modal interactions needed for robust performance in unconstrained scenarios, necessitating better multimodal fusion methods.

Method: Combines pretrained unimodal encoders with Hierarchical Gated Fusion Decoder (HiGate) that enables progressive, multi-depth fusion using learnable bimodally-conditioned gates, plus Masked Alignment Loss and Over-Positive Penalty auxiliary objectives.
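
A single HiGate-style injection step, with a gate conditioned on both modalities deciding how much audio context flows into the visual stream, can be sketched as follows (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class BimodalGate(nn.Module):
    """Inject audio context into visual tokens through a gate conditioned
    on both modalities, in the spirit of HiGate."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vis, aud):
        # vis, aud: (B, T, dim) temporally aligned token sequences
        g = self.gate(torch.cat([vis, aud], dim=-1))   # bimodally conditioned
        return vis + g * self.proj(aud)                # gated residual injection

fused = BimodalGate(256)(torch.randn(2, 64, 256), torch.randn(2, 64, 256))
```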

Result: Achieves new SOTA: 77.8% mAP (+9.4%) on Ego4D-ASD, 86.1% mAP (+2.9%) on UniTalk, 96.1% mAP (+0.5%) on WASD, and competitive performance on AVA-ActiveSpeaker with strong generalization in out-of-domain experiments.

Conclusion: GateFusion’s hierarchical gated fusion approach with auxiliary objectives effectively captures cross-modal interactions for active speaker detection, demonstrating superior performance and generalization across multiple challenging benchmarks.

Abstract: Active Speaker Detection (ASD) aims to identify who is currently speaking in each frame of a video. Most state-of-the-art approaches rely on late fusion to combine visual and audio features, but late fusion often fails to capture fine-grained cross-modal interactions, which can be critical for robust performance in unconstrained scenarios. In this paper, we introduce GateFusion, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the Transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: Masked Alignment Loss (MAL) to align unimodal outputs with multimodal predictions, and Over-Positive Penalty (OPP) to suppress spurious video-only activations. GateFusion establishes new state-of-the-art results on several challenging ASD benchmarks, achieving 77.8% mAP (+9.4%), 86.1% mAP (+2.9%), and 96.1% mAP (+0.5%) on Ego4D-ASD, UniTalk, and WASD benchmarks, respectively, and delivering competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments demonstrate the generalization of our model, while comprehensive ablations show the complementary benefits of each component.

[153] Multi-View Foundation Models

Leo Segre, Or Hirschorn, Shai Avidan

Main category: cs.CV

TL;DR: The paper proposes a method to convert single-view foundation models into multi-view foundation models that produce consistent features across different views of the same 3D scene.

DetailsMotivation: Current foundation models process each image independently, leading to inconsistent features for the same 3D point across different views. This inconsistency limits their effectiveness in multi-view scenarios where 3D scene understanding is important.

Method: The approach augments Transformers-based foundation models (DINO, SAM, CLIP) with intermediate 3D-aware attention layers that help match features across different views. This bypasses the need to build a consistent 3D model and allows direct manipulation in image space.
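
At their core, the inserted layers let tokens attend across views. Stripped of the 3D-aware positional machinery, which is the paper's actual contribution, a minimal cross-view attention block looks like this:

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Self-attention over the concatenated tokens of all views, so each
    token can match against features from the other images. The 3D-aware
    positional encoding used in the paper is omitted here."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):
        # tokens: (B, V, N, C) -- V views of N tokens each
        B, V, N, C = tokens.shape
        x = tokens.reshape(B, V * N, C)      # flatten views into one sequence
        y = self.norm(x)
        x = x + self.attn(y, y, y)[0]
        return x.reshape(B, V, N, C)

out = CrossViewAttention(384)(torch.randn(1, 4, 196, 384))
```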

Result: Quantitative experiments show that the method improves feature matching considerably compared to current foundation models. The approach is demonstrated on surface normal estimation and multi-view segmentation tasks.

Conclusion: The proposed method successfully converts single-view foundation models into multi-view ones, enabling consistent feature representation across different views without requiring explicit 3D reconstruction, which enhances performance in multi-view computer vision tasks.

Abstract: Foundation models are vital tools in various Computer Vision applications. They take as input a single RGB image and output a deep feature representation that is useful for various applications. However, in case we have multiple views of the same 3D scene, they operate on each image independently and do not always produce consistent features for the same 3D point. We propose a way to convert a Foundation Model into a Multi-View Foundation Model. Such a model takes as input a set of images and outputs a feature map for each image such that the features of corresponding points are as consistent as possible. This approach bypasses the need to build a consistent 3D model of the features and allows direct manipulation in the image space. Specifically, we show how to augment Transformers-based foundation models (i.e., DINO, SAM, CLIP) with intermediate 3D-aware attention layers that help match features across different views. As leading examples, we show surface normal estimation and multi-view segmentation tasks. Quantitative experiments show that our method improves feature matching considerably compared to current foundation models.

[154] Gaussian Pixel Codec Avatars: A Hybrid Representation for Efficient Rendering

Divam Gupta, Anuj Pahuja, Nemanja Bartolovic, Tomas Simon, Forrest Iandola, Giljoo Nam

Main category: cs.CV

TL;DR: GPiCA combines triangle mesh and 3D Gaussians for photorealistic head avatars that render efficiently on mobile devices, achieving Gaussian-based realism with mesh-based performance.

DetailsMotivation: To create photorealistic head avatars that can be efficiently rendered on mobile devices, addressing the limitations of purely mesh-based approaches (poor hair/beard representation) and purely Gaussian-based approaches (high computational cost).

Method: Hybrid representation combining triangle mesh for facial skin and anisotropic 3D Gaussians for hair/beard. Uses unified differentiable rendering pipeline treating mesh as semi-transparent layer within 3D Gaussian Splatting. Neural networks decode facial expression codes into mesh, RGBA texture, and Gaussians.
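
Per pixel, treating the mesh as a semi-transparent layer inside Gaussian splatting reduces to inserting the mesh fragment into the depth-sorted alpha compositing of the Gaussian fragments. A per-pixel sketch:

```python
import torch

def composite(colors, alphas, depths):
    """Front-to-back alpha compositing of per-pixel fragments. The
    rasterized mesh contributes one semi-transparent fragment that is
    depth-sorted together with the Gaussian fragments."""
    order = torch.argsort(depths)                    # nearest fragment first
    colors, alphas = colors[order], alphas[order]
    trans = torch.cumprod(1 - alphas, dim=0)
    trans = torch.cat([torch.ones(1), trans[:-1]])   # transmittance so far
    return (trans.unsqueeze(-1) * alphas.unsqueeze(-1) * colors).sum(0)

# Three Gaussian fragments plus one mesh fragment at depth 1.2:
rgb = composite(torch.rand(4, 3), torch.tensor([0.3, 0.5, 0.9, 0.4]),
                torch.tensor([2.0, 0.8, 1.5, 1.2]))
```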

Result: GPiCA achieves the realism of purely Gaussian-based avatars while matching the rendering performance of mesh-based avatars, enabling efficient mobile rendering.

Conclusion: The hybrid mesh-Gaussian approach successfully balances photorealism and rendering efficiency, making photorealistic head avatars practical for mobile applications.

Abstract: We present Gaussian Pixel Codec Avatars (GPiCA), photorealistic head avatars that can be generated from multi-view images and efficiently rendered on mobile devices. GPiCA utilizes a unique hybrid representation that combines a triangle mesh and anisotropic 3D Gaussians. This combination maximizes memory and rendering efficiency while maintaining a photorealistic appearance. The triangle mesh is highly efficient in representing surface areas like facial skin, while the 3D Gaussians effectively handle non-surface areas such as hair and beard. To this end, we develop a unified differentiable rendering pipeline that treats the mesh as a semi-transparent layer within the volumetric rendering paradigm of 3D Gaussian Splatting. We train neural networks to decode a facial expression code into three components: a 3D face mesh, an RGBA texture, and a set of 3D Gaussians. These components are rendered simultaneously in a unified rendering engine. The networks are trained using multi-view image supervision. Our results demonstrate that GPiCA achieves the realism of purely Gaussian-based avatars while matching the rendering performance of mesh-based avatars.

[155] DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, Xinggang Wang

Main category: cs.CV

TL;DR: DiffusionVL enables conversion of powerful autoregressive vision-language models to diffusion-based models through simple fine-tuning, achieving performance gains and faster inference with minimal training data.

DetailsMotivation: While diffusion models show promise for multimodal tasks, existing diffusion vision-language models (dVLMs) underperform compared to autoregressive models. The paper aims to bridge this gap by enabling conversion of powerful AR models to diffusion paradigm.

Method: Proposes DiffusionVL, a method to translate any powerful AR model into a dVLM through simple fine-tuning. Introduces block-decoding design for arbitrary-length generation and KV cache reuse to accelerate inference.

Result: Achieves 34.4% gain on MMMU-Pro (vision) benchmark and 37.5% gain on MME (Cog.) benchmark with less than 5% of training data compared to prior methods. Also achieves 2x inference speedup.

Conclusion: Demonstrates effective paradigm shift from AR to diffusion for multimodal models, showing that powerful AR models can be successfully converted to competitive dVLMs with performance improvements and faster inference.

Abstract: In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive paradigm (AR), owing to its unique decoding advantages. However, due to the capability limitations of the base diffusion language model, the performance of the diffusion vision language model (dVLM) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: Is it possible to construct dVLMs based on existing powerful AR models? In response, we propose DiffusionVL, a dVLM family that could be translated from any powerful AR models. Through simple fine-tuning, we successfully adapt AR pre-trained models into the diffusion paradigm. This approach yields two key observations: (1) The paradigm shift from AR-based multimodal models to diffusion is remarkably effective. (2) Direct conversion of an AR language model to a dVLM is also feasible, achieving performance competitive with LLaVA-style visual-instruction-tuning. Further, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV cache reuse, achieving a significant inference speedup. We conduct a large number of experiments. Despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement-a 34.4% gain on the MMMU-Pro (vision) bench and 37.5% gain on the MME (Cog.) bench-alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.

[156] In Pursuit of Pixel Supervision for Visual Pre-training

Lihe Yang, Shang-Wen Li, Yang Li, Xinjie Lei, Dong Wang, Abdelrahman Mohamed, Hengshuang Zhao, Hu Xu

Main category: cs.CV

TL;DR: Pixio is an enhanced masked autoencoder that shows autoencoder-based self-supervised learning remains competitive today, achieving strong performance across diverse downstream tasks while being simple and efficient.

DetailsMotivation: To demonstrate that autoencoder-based self-supervised learning (a classical paradigm) remains competitive with modern approaches, providing a simple, stable, and efficient alternative to latent-space methods.

Method: An enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures, trained on 2B web-crawled images using a self-curation strategy with minimal human curation.
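
The masking step that defines an MAE-style pre-training task is compact enough to show in full. This is the standard per-sample random masking, not Pixio's specific recipe:

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patch tokens; the encoder sees only these,
    and the decoder must reconstruct the rest in pixel space.
    patches: (B, N, D)"""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)
    keep = noise.argsort(dim=1)[:, :n_keep]        # random subset per sample
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, dtype=torch.bool)
    mask.scatter_(1, keep, False)                  # True = masked out
    return visible, mask

vis, mask = random_masking(torch.randn(2, 196, 768))
```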

Result: Pixio performs competitively across diverse downstream tasks including monocular depth estimation, 3D reconstruction, semantic segmentation, and robot learning, matching or outperforming DINOv3 at similar scales.

Conclusion: Pixel-space self-supervised learning serves as a promising alternative and complement to latent-space approaches, showing autoencoders remain relevant in modern self-supervised learning.

Abstract: At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low-level attributes to high-level concepts. Autoencoders represent a classical and long-standing paradigm for learning representations from pixels or other raw inputs. In this work, we demonstrate that autoencoder-based self-supervised learning remains competitive today and can produce strong representations for downstream tasks, while remaining simple, stable, and efficient. Our model, codenamed “Pixio”, is an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures. The model is trained on 2B web-crawled images with a self-curation strategy with minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), semantic segmentation, and robot learning, outperforming or matching DINOv3 trained at similar scales. Our results suggest that pixel-space self-supervised learning can serve as a promising alternative and a complement to latent-space approaches.

[157] DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

Erfei Cui, Wenhai Wang, Zhiqi Li, Jiangwei Xie, Haoming Zou, Hanming Deng, Gen Luo, Lewei Lu, Xizhou Zhu, Jifeng Dai

Main category: cs.CV

TL;DR: DriveMLM is an LLM-based autonomous driving framework that replaces traditional decision-making modules with multimodal LLMs for closed-loop driving in simulators, showing significant performance improvements over existing systems like Autopilot and Apollo.

DetailsMotivation: The paper aims to explore the potential of large language models in autonomous driving, leveraging their human-like thinking and cognitive abilities to improve decision-making in driving systems.

Method: The authors introduce DriveMLM with three key components: (1) bridging language decisions to vehicle control via standardized decision states, (2) using a multimodal LLM as behavior planning module that processes driving rules, user commands, and sensor inputs, and (3) creating a data engine to collect annotated datasets for training and evaluation.
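
Bridging language decisions to control comes down to mapping the MLLM's textual output onto a fixed vocabulary of decision states the motion planner consumes. The state vocabulary below is illustrative, not DriveMLM's exact one:

```python
from enum import Enum

class PathState(Enum):
    FOLLOW = 0
    LEFT_CHANGE = 1
    RIGHT_CHANGE = 2

class SpeedState(Enum):
    KEEP = 0
    ACCELERATE = 1
    DECELERATE = 2
    STOP = 3

def to_planner_states(decision):
    """Map the MLLM's textual decision onto standardized states the
    motion planner consumes; the vocabulary here is illustrative."""
    return PathState[decision["path"]], SpeedState[decision["speed"]]

path, speed = to_planner_states({"path": "LEFT_CHANGE", "speed": "DECELERATE"})
```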

Result: DriveMLM achieved significant improvements of 3.2 and 4.7 points on CARLA Town05 Long when replacing decision-making modules in Autopilot and Apollo respectively, demonstrating its effectiveness in autonomous driving.

Conclusion: The work successfully demonstrates that LLMs can effectively serve as decision-making modules in autonomous driving systems, establishing a baseline for future LLM-based autonomous driving research.

Abstract: Large language models (LLMs) have opened up new possibilities for intelligent agents, endowing them with human-like thinking and cognitive abilities. In this work, we delve into the potential of large language models (LLMs) in autonomous driving (AD). We introduce DriveMLM, an LLM-based AD framework that can perform closed-loop autonomous driving in realistic simulators. To this end, (1) we bridge the gap between the language decisions and the vehicle control commands by standardizing the decision states according to the off-the-shelf motion planning module. (2) We employ a multimodal LLM (MLLM) to model the behavior planning module of a modular AD system, which uses driving rules, user commands, and inputs from various sensors (e.g., camera, lidar) as input, makes driving decisions, and provides explanations. This model can be used plug-and-play in existing AD systems such as Autopilot and Apollo for closed-loop driving. (3) We design an effective data engine to collect a dataset that includes decision state and corresponding explanation annotation for model training and evaluation. We conduct extensive experiments and show that replacing the decision-making modules of Autopilot and Apollo with DriveMLM resulted in significant improvements of 3.2 and 4.7 points on CARLA Town05 Long, respectively, demonstrating the effectiveness of our model. We hope this work can serve as a baseline for autonomous driving with LLMs.

[158] ASSR-NeRF: Arbitrary-Scale Super-Resolution on Voxel Grid for High-Quality Radiance Fields Reconstruction

Ding-Jiun Huang, Zi-Ting Chou, Yu-Chiang Frank Wang, Cheng Sun

Main category: cs.CV

TL;DR: ASSR-NeRF is a framework for super-resolution novel view synthesis that performs 3D super-resolution on optimized volumes using an attention-based VoxelGridSR model to achieve multi-view consistent high-resolution rendering.

DetailsMotivation: NeRF-based methods suffer from oversmoothing in high-resolution novel view synthesis when trained with low-resolution data, while single-image super-resolution lacks multi-view consistency. There's a need for a method that can achieve multi-view consistent super-resolution for 3D scenes.

Method: Proposes Arbitrary-Scale Super-Resolution NeRF (ASSR-NeRF) with an attention-based VoxelGridSR model that directly performs 3D super-resolution on optimized volumes. The model is trained on diverse scenes for generalizability and can be applied to unseen scenes trained with low-resolution views.

Result: The method demonstrates significant quantitative and qualitative performance improvements in super-resolution novel view synthesis, achieving multi-view consistent high-resolution rendering.

Conclusion: ASSR-NeRF successfully addresses the limitations of both NeRF-based methods and single-image super-resolution by providing a framework for multi-view consistent super-resolution novel view synthesis through 3D volume refinement.

Abstract: NeRF-based methods reconstruct 3D scenes by building a radiance field with implicit or explicit representations. While NeRF-based methods can perform novel view synthesis (NVS) at arbitrary scale, the performance in high-resolution novel view synthesis (HRNVS) with low-resolution (LR) optimization often results in oversmoothing. On the other hand, single-image super-resolution (SR) aims to enhance LR images to HR counterparts but lacks multi-view consistency. To address these challenges, we propose Arbitrary-Scale Super-Resolution NeRF (ASSR-NeRF), a novel framework for super-resolution novel view synthesis (SRNVS). We propose an attention-based VoxelGridSR model to directly perform 3D super-resolution (SR) on the optimized volume. Our model is trained on diverse scenes to ensure generalizability. For unseen scenes trained with LR views, we can then directly apply our VoxelGridSR to further refine the volume and achieve multi-view consistent SR. We demonstrate quantitatively and qualitatively that the proposed method achieves significant performance gains in SRNVS.

[159] SynJAC: Synthetic-data-driven Joint-granular Adaptation and Calibration for Domain Specific Scanned Document Key Information Extraction

Yihao Ding, Soyeon Caren Han, Zechuan Li, Hyunsuk Chung

Main category: cs.CV

TL;DR: SynJAC is a method for key information extraction from scanned visually rich documents using synthetic data for domain adaptation and calibration on small annotated datasets to reduce manual labeling needs.

DetailsMotivation: Extracting key information from visually rich documents (charts, tables, paragraphs) is labor-intensive, especially for scanned formats with inconsistent layouts and domain-specific requirements. Current pretrained models require large annotated datasets for fine-tuning, limiting scalability.

Method: SynJAC (Synthetic-data-driven Joint-granular Adaptation and Calibration) uses synthetic machine-generated data for domain adaptation and calibration on a small manually annotated dataset to mitigate noise. It integrates fine-grained and coarse-grained document representation learning.

Result: Extensive experiments demonstrate SynJAC’s effectiveness in domain-specific and scanned VRD scenarios, achieving competitive performance while significantly reducing the need for extensive manual labeling.

Conclusion: SynJAC provides a scalable solution for key information extraction from scanned visually rich documents by leveraging synthetic data and calibration techniques to overcome the limitations of large annotated dataset requirements.

Abstract: Visually Rich Documents (VRDs), comprising elements such as charts, tables, and paragraphs, convey complex information across diverse domains. However, extracting key information from these documents remains labour-intensive, particularly for scanned formats with inconsistent layouts and domain-specific requirements. Despite advances in pretrained models for VRD understanding, their dependence on large annotated datasets for fine-tuning hinders scalability. This paper proposes **SynJAC** (Synthetic-data-driven Joint-granular Adaptation and Calibration), a method for key information extraction in scanned documents. SynJAC leverages synthetic, machine-generated data for domain adaptation and employs calibration on a small, manually annotated dataset to mitigate noise. By integrating fine-grained and coarse-grained document representation learning, SynJAC significantly reduces the need for extensive manual labelling while achieving competitive performance. Extensive experiments demonstrate its effectiveness in domain-specific and scanned VRD scenarios.

[160] One-Cycle Structured Pruning via Stability-Driven Subnetwork Search

Deepak Ghimire, Dayoung Kil, Seonghwan Jeong, Jaesik Park, Seong-heum Kim

Main category: cs.CV

TL;DR: One-cycle structured pruning framework that integrates pre-training, pruning, and fine-tuning into a single training cycle, achieving state-of-the-art accuracy with reduced computational cost.

DetailsMotivation: Existing structured pruning methods require multi-stage training with high computational costs, while pruning at initialization suffers from degraded performance. Need for efficient pruning that maintains accuracy.

Method: Proposes OCSPruner framework with norm-based group saliency criteria, structured sparsity regularization, and a novel pruning indicator that detects stable pruning epoch by measuring similarity between pruning sub-networks across consecutive epochs.
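
The stability indicator can be made concrete as the overlap (IoU) between the binary pruning masks selected in consecutive epochs; pruning triggers once the candidate sub-network stops changing. A sketch with a filter-norm saliency (the saliency choice and tolerance are illustrative):

```python
import torch
import torch.nn as nn

def saliency_mask(model, sparsity=0.5):
    """Binary keep-mask over conv filters from a norm-based group
    saliency (L2 norm per output channel)."""
    scores = torch.cat([m.weight.detach().flatten(1).norm(dim=1)
                        for m in model.modules() if isinstance(m, nn.Conv2d)])
    thresh = scores.sort().values[int(len(scores) * sparsity)]
    return scores > thresh

def pruning_is_stable(prev_mask, cur_mask, tol=0.98):
    """Trigger pruning once consecutive epochs select nearly the same
    sub-network (IoU of the keep-masks above a tolerance)."""
    inter = (prev_mask & cur_mask).sum().item()
    union = (prev_mask | cur_mask).sum().item()
    return union > 0 and inter / union >= tol
```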

Result: Achieves state-of-the-art accuracy on CIFAR-10, CIFAR-100, and ImageNet using VGG, ResNet, and MobileNet architectures while being among the most efficient structured pruning frameworks in training cost.

Conclusion: The proposed one-cycle structured pruning framework successfully integrates all pruning stages into a single training cycle, achieving high accuracy with significantly reduced computational overhead compared to existing methods.

Abstract: Existing structured pruning methods typically rely on multi-stage training procedures that incur high computational costs. Pruning at initialization aims to reduce this burden but often suffers from degraded performance. To address these limitations, we propose an efficient one-cycle structured pruning framework that integrates pre-training, pruning, and fine-tuning into a single training cycle without sacrificing accuracy. The key idea is to identify an optimal sub-network during the early stages of training, guided by norm-based group saliency criteria and structured sparsity regularization. We introduce a novel pruning indicator that detects a stable pruning epoch by measuring the similarity between pruning sub-networks across consecutive training epochs. In addition, group sparsity regularization accelerates convergence, further reducing overall training time. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet using VGG, ResNet, and MobileNet architectures demonstrate that the proposed method achieves state-of-the-art accuracy while being among the most efficient structured pruning frameworks in terms of training cost. Code is available at https://github.com/ghimiredhikura/OCSPruner.

[161] Cascaded Dual Vision Transformer for Accurate Facial Landmark Detection

Ziqiang Dang, Jianfang Li, Lin Liu

Main category: cs.CV

TL;DR: A new facial landmark detector using Dual Vision Transformer (D-ViT) with Channel-split ViT for modeling geometric relations and Long Skip Connections (LSC) to preserve low-level features, achieving state-of-the-art performance on WFLW, COFW, and 300W benchmarks.

DetailsMotivation: Facial landmark detection is fundamental for many computer vision applications. Current methods need better modeling of inherent geometric relations among landmarks and preservation of useful low-level image features that might be discarded by intermediate supervision.

Method: Proposes two key designs: 1) Dual Vision Transformer (D-ViT) combining Channel-split ViT (learning interconnections between channel dimensions representing heatmap linear bases) with standard spatial-split ViT to model geometric relations, and 2) Long Skip Connections (LSC) to deliver low-level image features to all prediction blocks.
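
Channel-split attention transposes the usual token layout so that channels, the linear bases of the heatmap space, attend to one another. A minimal sketch:

```python
import torch
import torch.nn as nn

class ChannelSplitAttention(nn.Module):
    """Attention over the channel dimension: each channel (a linear basis
    of the heatmap space) is a token, so the layer learns interconnections
    between bases rather than between spatial positions."""
    def __init__(self, spatial_dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(spatial_dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, C, H*W); tokens = channels, embeddings = flattened maps
        out, _ = self.attn(x, x, x)
        return x + out

feat = torch.randn(2, 64, 256)        # 64 channels of flattened 16x16 maps
out = ChannelSplitAttention(spatial_dim=256)(feat)
```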

Result: Extensive experiments on WFLW, COFW, and 300W benchmarks demonstrate that the proposed model outperforms previous state-of-the-art methods across all three benchmarks.

Conclusion: The proposed facial landmark detector with D-ViT and LSC effectively models geometric relations among landmarks while preserving useful low-level features, achieving superior performance on standard benchmarks.

Abstract: Facial landmark detection is a fundamental problem in computer vision for many downstream applications. This paper introduces a new facial landmark detector based on vision transformers, which consists of two unique designs: Dual Vision Transformer (D-ViT) and Long Skip Connections (LSC). Based on the observation that the channel dimension of feature maps essentially represents the linear bases of the heatmap space, we propose learning the interconnections between these linear bases to model the inherent geometric relations among landmarks via Channel-split ViT. We integrate such channel-split ViT into the standard vision transformer (i.e., spatial-split ViT), forming our Dual Vision Transformer to constitute the prediction blocks. We also suggest using long skip connections to deliver low-level image features to all prediction blocks, thereby preventing useful information from being discarded by intermediate supervision. Extensive experiments are conducted to evaluate the performance of our proposal on the widely used benchmarks, i.e., WFLW, COFW, and 300W, demonstrating that our model outperforms the previous SOTAs across all three benchmarks.

[162] Toward Robust and Accurate Adversarial Camouflage Generation against Vehicle Detectors

Jiawei Zhou, Linye Lyu, Daojing He, Yu Li

Main category: cs.CV

TL;DR: RAUCA is a robust adversarial camouflage generation method that uses a novel neural renderer (E2E-NRP) to create multi-view vehicle camouflage effective across diverse weather conditions, outperforming existing methods on six object detectors.

DetailsMotivation: Existing adversarial camouflage methods have limitations: they struggle to capture environmental characteristics during rendering, produce imprecise texture mapping to target vehicles, and neglect diverse weather conditions, reducing attack effectiveness across varying scenarios.

Method: Proposes RAUCA with End-to-End Neural Renderer Plus (E2E-NRP) that accurately optimizes/projects vehicle textures while rendering images with environmental characteristics (lighting, weather). Integrates multi-weather dataset for camouflage generation to enhance attack robustness.

Result: RAUCA-final outperforms existing methods on six popular object detectors in both simulation and real-world settings, demonstrating superior attack performance across diverse conditions.

Conclusion: RAUCA provides a robust and accurate camouflage generation method that addresses key limitations of existing approaches by incorporating environmental characteristics and weather diversity, resulting in more effective adversarial attacks against vehicle detectors.

Abstract: Adversarial camouflage is a widely used physical attack against vehicle detectors for its superiority in multi-view attack performance. One promising approach involves using differentiable neural renderers to facilitate adversarial camouflage optimization through gradient back-propagation. However, existing methods often struggle to capture environmental characteristics during the rendering process or produce adversarial textures that can precisely map to the target vehicle. Moreover, these approaches neglect diverse weather conditions, reducing the efficacy of generated camouflage across varying weather scenarios. To tackle these challenges, we propose a robust and accurate camouflage generation method, namely RAUCA. The core of RAUCA is a novel neural rendering component, End-to-End Neural Renderer Plus (E2E-NRP), which can accurately optimize and project vehicle textures and render images with environmental characteristics such as lighting and weather. In addition, we integrate a multi-weather dataset for camouflage generation, leveraging the E2E-NRP to enhance the attack robustness. Experimental results on six popular object detectors show that RAUCA-final outperforms existing methods in both simulation and real-world settings.

[163] If you can describe it, they can see it: Cross-Modal Learning of Visual Concepts from Textual Descriptions

Carlo Alberto Barbano, Luca Molinaro, Massimiliano Ciranni, Emanuele Aiello, Vito Paolo Pastore, Marco Grangetto

Main category: cs.CV

TL;DR: KT enables VLMs to learn new visual concepts from text descriptions alone by transferring knowledge within the model, without needing visual examples or external models.

DetailsMotivation: Humans can visualize new concepts from language descriptions based on prior knowledge. The paper aims to extend this ability to Vision-Language Models by enabling them to learn novel concepts from text alone, without requiring visual examples.

Method: Knowledge Transfer (KT) aligns relevant visual encoder features (obtained through model inversion) with text representations. It transfers knowledge within the same VLM by injecting visual knowledge directly from textual descriptions of novel concepts.

Result: KT efficiently introduces new visual concepts from single textual descriptions, refines existing concept representations, and significantly improves zero-shot VLM performance across classification, segmentation, retrieval, and captioning tasks.

Conclusion: The Knowledge Transfer approach successfully enables VLMs to learn novel concepts from text descriptions alone, demonstrating that pre-trained VLM knowledge can be effectively re-used to represent previously unknown concepts.

Abstract: Humans can visualize new and unknown concepts from their natural language description, based on their experience and previous knowledge. Inspired by this, we present a way to extend this ability to Vision-Language Models (VLMs), teaching them novel concepts by only using a textual description. We refer to this approach as Knowledge Transfer (KT). Our hypothesis is that the knowledge of a pre-trained VLM can be re-used to represent previously unknown concepts. Provided with a textual description of the novel concept, KT works by aligning relevant features of the visual encoder, obtained through model inversion, to its text representation. Unlike approaches relying on visual examples or external generative models, KT transfers knowledge within the same VLM by injecting visual knowledge directly from the text. Through an extensive evaluation on several VLM tasks, including classification, segmentation, image-text retrieval, and captioning, we show that: 1) KT can efficiently introduce new visual concepts from a single textual description; 2) the same principle can be used to refine the representation of existing concepts; and 3) KT significantly improves the performance of zero-shot VLMs.
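
At its core, the alignment step can be read as a similarity objective between inverted visual features and the concept's text embedding. A minimal cosine-alignment sketch (an assumption; the paper's full objective may include additional terms):

```python
import torch
import torch.nn.functional as F

def kt_alignment_loss(visual_feats, text_embed):
    """Pull visual-encoder features (e.g., obtained via model inversion)
    toward the text embedding of the novel concept.
    visual_feats: (B, D); text_embed: (D,)"""
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_embed, dim=-1)
    return (1.0 - v @ t).mean()   # mean cosine distance
```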

[164] Chain-of-Evidence Multimodal Reasoning for Few-shot Temporal Action Localization

Mengshi Qi, Hongwei Ji, Wulian Yun, Xianlin Zhang, Huadong Ma

Main category: cs.CV

TL;DR: A novel few-shot temporal action localization method using Chain-of-Evidence multimodal reasoning that combines visual and textual information to improve localization performance with limited training data.

DetailsMotivation: Existing few-shot TAL methods focus only on video-level information and neglect textual information, which could provide valuable semantic support. There's a need to reduce dependence on large annotated datasets while improving localization accuracy.

Method: Proposes a few-shot learning framework with semantic-aware text-visual alignment module to align query and support videos at different levels. Introduces Chain-of-Evidence reasoning method that progressively guides VLM and LLM to generate CoE text descriptions capturing action variance better than visual features alone.

Result: Extensive experiments on ActivityNet1.3, THUMOS14, and a new Human-related Anomaly Localization Dataset show the method significantly outperforms existing methods in both single-instance and multi-instance scenarios.

Conclusion: The proposed Chain-of-Evidence multimodal reasoning approach effectively leverages textual information to enhance few-shot temporal action localization, demonstrating superior performance across multiple benchmark datasets.

Abstract: Traditional temporal action localization (TAL) methods rely on large amounts of detailed annotated data, whereas few-shot TAL reduces this dependence by using only a few training samples to identify unseen action categories. However, existing few-shot TAL methods typically focus solely on video-level information, neglecting textual information, which can provide valuable semantic support for the action localization task. To address these issues, in this work, we propose a new few-shot temporal action localization method based on Chain-of-Evidence multimodal reasoning to improve localization performance. Specifically, we design a novel few-shot learning framework to capture action commonalities and variations, which includes a semantic-aware text-visual alignment module designed to align the query and support videos at different levels. Meanwhile, to better express the temporal dependencies and causal relationships between actions at the textual level, we design a Chain-of-Evidence (CoE) reasoning method that progressively guides the Vision Language Model (VLM) and Large Language Model (LLM) to generate CoE text descriptions for videos. The generated texts capture more variation in actions than visual features alone. We conduct extensive experiments on the publicly available ActivityNet1.3, THUMOS14 and our newly collected Human-related Anomaly Localization Dataset. The experimental results demonstrate that our proposed method significantly outperforms existing methods in single-instance and multi-instance scenarios. Our source code and data are available at https://github.com/MICLAB-BUPT/VAL-VLM.

[165] Do MLLMs Exhibit Human-like Perceptual Behaviors? HVSBench: A Benchmark for MLLM Alignment with Human Perceptual Behavior

Jiaying Lin, Shuquan Ye, Dan Xu, Wanli Ouyang, Rynson W. H. Lau

Main category: cs.CV

TL;DR: HVSBench is a large-scale benchmark with 85k+ samples to test if MLLMs align with human visual perception across 13 categories in 5 fields, revealing significant perceptual gaps between models and humans.

DetailsMotivation: While MLLMs excel at many vision tasks, it's unknown whether they exhibit human-like perceptual behaviors. The paper aims to evaluate MLLM alignment with the human visual system (HVS).

Method: Introduce HVSBench, the first large-scale benchmark with over 85,000 samples covering 13 categories across 5 key fields: Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching.

Result: Comprehensive evaluation reveals significant perceptual gap: state-of-the-art MLLMs achieve only moderate results, while human participants demonstrate strong performance, significantly outperforming all models.

Conclusion: HVSBench is a high-quality benchmark that reveals the need for more human-aligned AI, and will be a critical tool for developing the next generation of explainable MLLMs.

Abstract: While Multimodal Large Language Models (MLLMs) excel at many vision tasks, it is unknown if they exhibit human-like perceptual behaviors. To evaluate this, we introduce HVSBench, the first large-scale benchmark with over 85,000 samples designed to test MLLM alignment with the human visual system (HVS). The benchmark covers 13 categories across 5 key fields: Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching. Our comprehensive evaluation reveals a significant perceptual gap: even state-of-the-art MLLMs achieve only moderate results. In contrast, human participants demonstrate strong performance, significantly outperforming all models. This underscores the high quality of HVSBench and the need for more human-aligned AI. We believe our benchmark will be a critical tool for developing the next generation of explainable MLLMs.

[166] MS-Temba: Multi-Scale Temporal Mamba for Understanding Long Untrimmed Videos

Arkaprava Sinha, Monish Soundar Raj, Pu Wang, Ahmed Helmy, Hieu Le, Srijan Das

Main category: cs.CV

TL;DR: MS-Temba extends Mamba’s state-space models to temporal action detection using dilated SSMs and multi-scale fusion, achieving SOTA on ADL benchmarks with only 17M parameters.

DetailsMotivation: Temporal Action Detection in untrimmed videos faces challenges with long-duration videos, temporal variations, and dense overlapping actions. Existing CNN/Transformer approaches struggle to capture both fine-grained detail and long-range structure simultaneously.

Method: Proposes Multi-Scale Temporal Mamba (MS-Temba) with dilated SSMs in Temba blocks, coupled with additional losses to learn discriminative representations across temporal scales. Uses lightweight Multi-scale Mamba Fuser for SSM-based aggregation of multi-scale features.

Result: Achieves state-of-the-art performance on densely labeled ADL benchmarks TSU & Charades with only 17M parameters. Also generalizes to long-form video summarization, setting new SOTA results on TVSum & SumMe.

Conclusion: MS-Temba effectively addresses TAD challenges by extending Mamba with dilated SSMs and multi-scale fusion, demonstrating strong performance on ADL benchmarks and generalization to video summarization tasks.

Abstract: Temporal Action Detection (TAD) in untrimmed videos poses significant challenges, particularly for Activities of Daily Living (ADL), requiring models to (1) process long-duration videos, (2) capture temporal variations in actions, and (3) simultaneously detect dense overlapping actions. Existing CNN and Transformer-based approaches struggle to jointly capture fine-grained detail and long-range structure at scale. The State-space Model (SSM) based Mamba offers powerful long-range modeling, but naive application to TAD collapses fine-grained temporal structure and fails to account for the challenges inherent to TAD. To this end, we propose Multi-Scale Temporal Mamba (MS-Temba), which extends Mamba to TAD with newly introduced dilated SSMs. Each Temba block, comprising dilated SSMs coupled with our proposed additional losses, enables the learning of discriminative representations across temporal scales. A lightweight Multi-scale Mamba Fuser then unifies these multi-scale features via SSM-based aggregation, yielding precise action-boundary localization. With only 17M parameters, MS-Temba achieves state-of-the-art performance on the densely labeled ADL benchmarks TSU & Charades, and further generalizes to long-form video summarization, setting new state-of-the-art results on TVSum & SumMe.
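
Temporal dilation of a sequence layer can be illustrated by splitting the timeline into d interleaved subsequences, running the layer on each, and re-interleaving. In the sketch below a depthwise temporal convolution stands in for the Mamba/SSM layer (a deliberate simplification; swap in a real SSM module in practice):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedTemporalBlock(nn.Module):
    """Apply a sequence layer over every d-th frame (dilation d), so the
    same layer sees the video at a coarser temporal scale."""
    def __init__(self, dim, dilation, seq_layer=None):
        super().__init__()
        self.d = dilation
        # stand-in for an SSM/Mamba layer
        self.seq = seq_layer or nn.Conv1d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, x):                          # x: (B, C, T)
        B, C, T = x.shape
        x = F.pad(x, (0, (-T) % self.d))
        # split T into d interleaved subsequences -> (B*d, C, T'/d)
        xs = x.view(B, C, -1, self.d).permute(0, 3, 1, 2).reshape(B * self.d, C, -1)
        ys = self.seq(xs)
        # re-interleave back onto the original timeline
        y = ys.view(B, self.d, C, -1).permute(0, 2, 3, 1).reshape(B, C, -1)
        return y[..., :T]
```

Stacking such blocks with dilations 1, 2, 4, ... yields the multi-scale features that a fuser module can then aggregate.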

[167] GT2-GS: Geometry-aware Texture Transfer for Gaussian Splatting

Wenjie Liu, Zhongliang Liu, Junwei Shu, Changbo Wang, Yang Li

Main category: cs.CV

TL;DR: GT2-GS is a geometry-aware texture transfer framework for Gaussian splatting that achieves high-quality 3D texture transfer by incorporating geometric information and view-consistent optimization.

DetailsMotivation: Existing 3D style transfer methods focus on abstract artistic styles but overlook geometric information, making it challenging to achieve high-quality 3D texture transfer results that align with human visual perception.

Method: Proposes three key components: 1) geometry-aware texture transfer loss using view-dependent features and geometric parameters, 2) adaptive fine-grained control module to address scene degradation from low-granularity textures, and 3) geometry preservation branch that refines geometric parameters using Gaussian color priors to decouple appearance and geometry optimization.

Result: Extensive experiments demonstrate the method’s effectiveness and controllability, achieving texture transfer results that better align with human visual perception through geometric awareness.

Conclusion: GT2-GS successfully addresses the limitations of existing 3D style transfer methods by incorporating geometric awareness, enabling high-quality texture transfer onto complex 3D scenes with better visual alignment and control.

Abstract: Transferring 2D textures onto complex 3D scenes plays a vital role in enhancing the efficiency and controllability of 3D multimedia content creation. However, existing 3D style transfer methods primarily focus on transferring abstract artistic styles to 3D scenes. These methods often overlook the geometric information of the scene, which makes it challenging to achieve high-quality 3D texture transfer results. In this paper, we present GT2-GS, a geometry-aware texture transfer framework for Gaussian splatting. First, we propose a geometry-aware texture transfer loss that enables view-consistent texture transfer by leveraging prior view-dependent feature information and texture features augmented with additional geometric parameters. Moreover, an adaptive fine-grained control module is proposed to address the degradation of scene information caused by low-granularity texture features. Finally, a geometry preservation branch is introduced. This branch refines the geometric parameters using additionally bound Gaussian color priors, thereby decoupling the optimization objectives of appearance and geometry. Extensive experiments demonstrate the effectiveness and controllability of our method. Through geometric awareness, our approach achieves texture transfer results that better align with human visual perception. Our homepage is available at https://vpx-ecnu.github.io/GT2-GS-website.

[168] Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation

Bozhong Zheng, Jinye Gan, Xiaohao Xu, Xintao Chen, Wenqiao Li, Xiaonan Huang, Na Ni, Yingna Wu

Main category: cs.CV

TL;DR: PASDF is a novel 3D point cloud anomaly detection framework that uses continuous signed distance fields for pose-invariant representation, enabling both precise anomaly localization and in-situ repair.

DetailsMotivation: Existing 3D point cloud anomaly detection methods struggle with pose variations and complex geometric anomalies. Patch-based approaches suffer from geometric fidelity issues due to discrete representations (voxelization/projection), limiting fine-grained anomaly localization.

Method: PASDF integrates Pose Alignment Module for canonicalization and SDF Network to dynamically incorporate pose, learning continuous pose-invariant shape representations. It uses implicit learning of high-fidelity anomaly repair templates from continuous SDF and precise pixel-level localization through Anomaly-Aware Scoring Module.

Result: State-of-the-art performance on Real3D-AD (80.2% object-level AUROC) and Anomaly-ShapeNet (90.0% object-level AUROC). The continuous 3D representation enables both detection and in-situ anomaly repair.

Conclusion: Continuous geometric representations effectively advance 3D anomaly detection and facilitate practical anomaly region repair. PASDF demonstrates superior performance over existing methods and provides a framework for both detection and repair tasks.

Abstract: 3D point cloud anomaly detection is essential for robust vision systems but is challenged by pose variations and complex geometric anomalies. Existing patch-based methods often suffer from geometric fidelity issues due to discrete voxelization or projection-based representations, limiting fine-grained anomaly localization. We introduce Pose-Aware Signed Distance Field (PASDF), a novel framework that integrates 3D anomaly detection and repair by learning a continuous, pose-invariant shape representation. PASDF leverages a Pose Alignment Module for canonicalization and a SDF Network to dynamically incorporate pose, enabling implicit learning of high-fidelity anomaly repair templates from the continuous SDF. This facilitates precise pixel-level anomaly localization through an Anomaly-Aware Scoring Module. Crucially, the continuous 3D representation in PASDF extends beyond detection, facilitating in-situ anomaly repair. Experiments on Real3D-AD and Anomaly-ShapeNet demonstrate state-of-the-art performance, achieving high object-level AUROC scores of 80.2% and 90.0%, respectively. These results highlight the effectiveness of continuous geometric representations in advancing 3D anomaly detection and facilitating practical anomaly region repair. The code is available at https://github.com/ZZZBBBZZZ/PASDF to support further research.
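
The backbone of this approach, a pose-conditioned signed distance field, is compact to write down. A minimal sketch (assumed architecture; PASDF's actual conditioning and scoring are richer):

```python
import torch
import torch.nn as nn

class PoseAwareSDF(nn.Module):
    """MLP mapping a canonicalized 3D point to a signed distance; the pose
    (R, t) is assumed to come from an alignment module."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1))

    def forward(self, xyz, R, t):          # xyz: (N, 3), R: (3, 3), t: (3,)
        canon = (xyz - t) @ R              # undo the pose (row-vector convention)
        return self.net(canon).squeeze(-1) # signed distance per point

# anomaly scoring: points far from the learned surface are anomalous,
# e.g. score = sdf(points, R, t).abs(); repair amounts to projecting
# anomalous points back onto the zero level set along the SDF gradient.
```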

[169] Deep Learning for Retinal Degeneration Assessment: A Comprehensive Analysis of the MARIO Challenge

Rachid Zeghlache, Ikram Brahim, Pierre-Henri Conze, Mathieu Lamard, Mohammed El Amine Lazouni, Zineb Aziza Elaouaber, Leila Ryma Lazouni, Christopher Nielsen, Ahmad O. Ahsan, Matthias Wilms, Nils D. Forkert, Lovre Antonio Budimir, Ivana Matovinović, Donik Vršnak, Sven Lončarić, Philippe Zhang, Weili Jiang, Yihao Li, Yiding Hao, Markus Frohmann, Patrick Binder, Marcel Huber, Taha Emre, Teresa Finisterra Araújo, Marzieh Oghbaie, Hrvoje Bogunović, Amerens A. Bekkers, Nina M. van Liebergen, Hugo J. Kuijf, Abdul Qayyum, Moona Mazher, Steven A. Niederer, Alberto J. Beltrán-Carrero, Juan J. Gómez-Valverde, Javier Torresano-Rodríquez, Álvaro Caballero-Sastre, María J. Ledesma Carbayo, Yosuke Yamagishi, Yi Ding, Robin Peretzke, Alexandra Ertl, Maximilian Fischer, Jessica Kächele, Sofiane Zehar, Karim Boukli Hacene, Thomas Monfort, Béatrice Cochener, Mostafa El Habib Daho, Anas-Alexis Benyoussef, Gwenolé Quellec

Main category: cs.CV

TL;DR: The MARIO challenge at MICCAI 2024 benchmarked AI for AMD monitoring using OCT images, showing AI matches physicians in detecting AMD progression but fails at predicting future evolution.

DetailsMotivation: To advance automated detection and monitoring of age-related macular degeneration (AMD) through AI analysis of optical coherence tomography (OCT) images, evaluating algorithmic performance in detecting neovascular activity changes and predicting disease evolution.

Method: A challenge structure with two tasks: 1) Classification of evolution between consecutive 2D OCT B-scans, and 2) Prediction of future AMD evolution over three months for patients on anti-VEGF therapy. Used multi-modal datasets from Brest, France (primary) and Algeria (auxiliary) with OCT, infrared imaging, and clinical data. 35 teams participated with top 12 finalists presenting methods.

Result: AI performs as well as physicians in measuring AMD progression (Task 1) but is not yet capable of predicting future evolution (Task 2). The challenge established a benchmark for AMD monitoring using multi-modal data.

Conclusion: While AI matches physician performance in detecting current AMD progression, it lacks predictive capability for future disease evolution, highlighting the need for further research in predictive modeling for AMD monitoring.

Abstract: The MARIO challenge, held at MICCAI 2024, focused on advancing the automated detection and monitoring of age-related macular degeneration (AMD) through the analysis of optical coherence tomography (OCT) images. Designed to evaluate algorithmic performance in detecting neovascular activity changes within AMD, the challenge incorporated unique multi-modal datasets. The primary dataset, sourced from Brest, France, was used by participating teams to train and test their models. The final ranking was determined based on performance on this dataset. An auxiliary dataset from Algeria was used post-challenge to evaluate population and device shifts from submitted solutions. Two tasks were involved in the MARIO challenge. The first one was the classification of evolution between two consecutive 2D OCT B-scans. The second one was the prediction of future AMD evolution over three months for patients undergoing anti-vascular endothelial growth factor (VEGF) therapy. Thirty-five teams participated, with the top 12 finalists presenting their methods. This paper outlines the challenge’s structure, tasks, data characteristics, and winning methodologies, setting a benchmark for AMD monitoring using OCT, infrared imaging, and clinical data (such as the number of visits, age, gender, etc.). The results of this challenge indicate that artificial intelligence (AI) performs as well as a physician in measuring AMD progression (Task 1) but is not yet able of predicting future evolution (Task 2).

[170] Benchmarking Gaslighting Negation Attacks Against Reasoning Models

Bin Zhu, Hailong Yin, Jingjing Chen, Yu-Gang Jiang

Main category: cs.CV

TL;DR: Reasoning models suffer significant accuracy drops (25-53%) when faced with gaslighting negation attacks that confidently deny correct answers, revealing a vulnerability in their ability to defend beliefs against manipulative feedback.

DetailsMotivation: While reasoning-centric models show improved robustness through techniques like chain-of-thought prompting, their resistance to gaslighting negation attacks (adversarial prompts that confidently deny correct answers) remains underexplored and potentially vulnerable.

Method: Systematically evaluated three state-of-the-art reasoning models (OpenAI’s o4-mini, Claude-3.7-Sonnet, Gemini-2.5-Flash) across three multimodal benchmarks (MMMU, MathVista, CharXiv), then introduced GaslightingBench-R, a diagnostic benchmark of 1,025 challenging samples curated from existing benchmarks to specifically test susceptibility to gaslighting attacks.

Result: Significant accuracy drops of 25-29% on average across original benchmarks, with even more dramatic failures on GaslightingBench-R where accuracy drops exceeded 53% on average, showing that top-tier reasoning models struggle to preserve correct answers under manipulative user feedback.

Conclusion: There’s a fundamental gap between step-by-step reasoning and resistance to adversarial manipulation, calling for new robustness strategies to safeguard reasoning models against gaslighting negation attacks.

Abstract: Recent advances in reasoning-centric models promise improved robustness through mechanisms such as chain-of-thought prompting and test-time scaling. However, their ability to withstand gaslighting negation attacks (adversarial prompts that confidently deny correct answers) remains underexplored. In this paper, we conduct a systematic evaluation of three state-of-the-art reasoning models, i.e., OpenAI’s o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash, across three multimodal benchmarks: MMMU, MathVista, and CharXiv. Our evaluation reveals significant accuracy drops (25-29% on average) following gaslighting negation attacks, indicating that even top-tier reasoning models struggle to preserve correct answers under manipulative user feedback. Building on the insights of this evaluation, and to further probe this vulnerability, we introduce GaslightingBench-R, a new diagnostic benchmark specifically designed to evaluate whether reasoning models can defend their beliefs under gaslighting negation attacks. Constructed by filtering and curating 1,025 challenging samples from the existing benchmarks, GaslightingBench-R induces even more dramatic failures, with accuracy drops exceeding 53% on average. Our findings highlight a fundamental gap between step-by-step reasoning and resistance to adversarial manipulation, calling for new robustness strategies that safeguard reasoning models against gaslighting negation attacks.
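
Evaluation protocols of this kind reduce to a two-turn loop: ask, verify, negate, re-ask. A schematic harness (the `chat` client, message schema, and negation template are hypothetical placeholders, not the benchmark's actual code):

```python
NEGATION = "You are wrong. I am certain the correct answer is not {answer}. Reconsider."

def chat(history):
    """Hypothetical client; wire up your VLM provider's chat API here."""
    raise NotImplementedError

def gaslighting_flip_rate(samples):
    """samples: dicts with 'image', 'question', 'gold' keys (assumed schema)."""
    flips, initially_correct = 0, 0
    for s in samples:
        history = [{"role": "user", "content": [s["image"], s["question"]]}]
        first = chat(history)
        if s["gold"] not in first:
            continue                      # attack only initially-correct answers
        initially_correct += 1
        history += [{"role": "assistant", "content": first},
                    {"role": "user", "content": NEGATION.format(answer=s["gold"])}]
        if s["gold"] not in chat(history):
            flips += 1                    # model abandoned its correct answer
    return flips / max(initially_correct, 1)
```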

[171] Binarization-Aware Adjuster: A Theoretical Framework for Bridging Continuous Optimization and Discrete Inference with Application to Edge Detection

Hao Shu

Main category: cs.CV

TL;DR: A framework called Binarization-Aware Adjuster (BAA) addresses the training-inference gap in discrete decision tasks by integrating binarization behavior into gradient-based learning through dynamic loss weighting and adaptive thresholding.

DetailsMotivation: There's a fundamental inconsistency in machine learning for discrete decision-making tasks: models are trained using continuous-valued outputs but evaluated through discrete predictions. This discrepancy arises from the non-differentiability of discretization operations, which weakens alignment between optimization objectives and practical decision outcomes.

Method: The paper proposes a Binarization-Aware Adjuster (BAA) framework with two key components: 1) A Distance Weight Function (DWF) that dynamically modulates pixel-wise loss contributions based on prediction correctness and proximity to decision boundaries, emphasizing decision-critical regions while de-emphasizing confidently correct samples. 2) A self-adaptive threshold estimation procedure to better match optimization dynamics with inference conditions.

Result: The method was experimentally validated on edge detection tasks, demonstrating effectiveness. The framework shows promise for broader applications beyond binary decision tasks.

Conclusion: The proposed framework provides a general strategy for aligning continuous optimization with discrete evaluation in machine learning, and can be extended to multi-valued decision processes in broader structured prediction problems.

Abstract: In machine learning, discrete decision-making tasks exhibit a fundamental inconsistency between training and inference: models are optimized using continuous-valued outputs, yet evaluated through discrete predictions. This discrepancy arises from the non-differentiability of discretization operations, weakening the alignment between optimization objectives and practical decision outcomes. To address this, we present a theoretical framework for constructing a Binarization-Aware Adjuster (BAA) that integrates binarization behavior directly into gradient-based learning. Central to the approach is a Distance Weight Function (DWF) that dynamically modulates pixel-wise loss contributions based on prediction correctness and proximity to the decision boundary, thereby emphasizing decision-critical regions while de-emphasizing confidently correct samples. Furthermore, a self-adaptive threshold estimation procedure is introduced to better match optimization dynamics with inference conditions. As one application, we conduct experiments on the edge detection (ED) task, which demonstrate the effectiveness of the proposed method. Beyond binary decision tasks and ED, the proposed framework provides a general strategy for aligning continuous optimization with discrete evaluation and can be extended to multi-valued decision processes in broader structured prediction problems.
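
The DWF can be pictured as a per-pixel reweighting of an ordinary segmentation loss. The sketch below is one plausible instantiation under that reading (the paper's exact weighting function differs):

```python
import torch
import torch.nn.functional as F

def binarization_aware_bce(logits, targets, tau=0.5, alpha=4.0):
    """Weighted BCE: misclassified pixels are up-weighted (more so when
    confidently wrong), confidently correct pixels are down-weighted,
    focusing optimization on decision-critical regions near threshold tau."""
    p = torch.sigmoid(logits)
    wrong = ((p > tau).float() - targets).abs()      # 1 where binarized pred != label
    dist = (p - tau).abs() / max(tau, 1.0 - tau)     # distance to boundary, in [0, 1]
    w = torch.where(wrong > 0,
                    1.0 + alpha * dist,              # confidently wrong: largest weight
                    (1.0 - dist).clamp_min(0.1))     # confidently correct: smallest
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (w.detach() * bce).mean()                 # weights act as fixed coefficients
```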

[172] MAGIC: Few-Shot Mask-Guided Anomaly Inpainting with Prompt Perturbation, Spatially Adaptive Guidance, and Context Awareness

JaeHyuck Choi, MinJun Kim, JeHyeong Hong

Main category: cs.CV

TL;DR: MAGIC is a diffusion-based framework for few-shot anomaly generation that uses fine-tuned inpainting with Gaussian prompt perturbation, spatially adaptive guidance, and context-aware mask alignment to create diverse, high-fidelity anomalies for industrial quality control.

DetailsMotivation: Existing diffusion-based methods for anomaly generation have limitations: global prompt-guided approaches corrupt normal regions, and inpainting-based methods lack the in-distribution diversity needed for robust downstream anomaly detection models in industrial quality control.

Method: MAGIC uses a fine-tuned inpainting framework with three key components: (1) Gaussian prompt perturbation to prevent overfitting and learn from a smooth manifold of realistic anomalies, (2) spatially adaptive guidance that applies different guidance strengths to anomaly vs. background regions, and (3) context-aware mask alignment to relocate masks for plausible placement within host objects.

Result: Under an identical evaluation protocol, MAGIC outperforms state-of-the-art methods on diverse anomaly datasets in downstream tasks, generating high-fidelity anomalies that strictly adhere to masks while maximizing diversity.

Conclusion: MAGIC provides an effective solution for few-shot anomaly generation in industrial quality control, addressing key limitations of existing methods and enabling more robust downstream anomaly detection models through diverse, realistic anomaly generation.

Abstract: Few-shot anomaly generation is a key challenge in industrial quality control. Although diffusion models are promising, existing methods struggle: global prompt-guided approaches corrupt normal regions, and existing inpainting-based methods often lack the in-distribution diversity essential for robust downstream models. We propose MAGIC, a fine-tuned inpainting framework that generates high-fidelity anomalies that strictly adhere to the mask while maximizing this diversity. MAGIC introduces three complementary components: (i) Gaussian prompt perturbation, which prevents model overfitting in the few-shot setting by learning and sampling from a smooth manifold of realistic anomalies, (ii) spatially adaptive guidance that applies distinct guidance strengths to the anomaly and background regions, and (iii) context-aware mask alignment to relocate masks for plausible placement within the host object. Under an identical evaluation protocol, MAGIC outperforms state-of-the-art methods on diverse anomaly datasets in downstream tasks.
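
Of the three components, Gaussian prompt perturbation is the simplest to picture: jitter the anomaly prompt's embedding so the few-shot model samples a smooth neighborhood of anomalies instead of memorizing the handful of training examples. A one-function sketch (the injection point and sigma are assumptions):

```python
import torch

def perturb_prompt(text_embeds, sigma=0.05):
    """text_embeds: (B, L, D) text-encoder output fed to the inpainting UNet.
    Isotropic Gaussian noise, applied at training and sampling time, spreads
    the learned anomaly manifold around the few-shot examples."""
    return text_embeds + sigma * torch.randn_like(text_embeds)
```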

[173] Online Navigation Refinement: Achieving Lane-Level Guidance by Associating Standard-Definition and Online Perception Maps

Jiaxu Wan, Xu Wang, Mengwei Xie, Xinyuan Chang, Xinran Liu, Zheng Pan, Mu Xu, Hong Zhang, Ding Yuan, Yifan Yang

Main category: cs.CV

TL;DR: ONR refines road-level SD maps into lane-level navigation by associating them with real-time OP maps, using a new dataset and transformer model to handle spatial misalignment and noise.

DetailsMotivation: Current lane-level navigation relies on expensive HD maps that can't adapt to dynamic conditions, while online perception maps lack global topology needed for navigation. There's a need for a solution that combines the benefits of both.

Method: Proposes the Online Navigation Refinement (ONR) task, creates the OMA dataset with lane-to-road annotations, and develops the MAT transformer with path-aware attention for topology alignment and spatial attention for noise handling.

Result: MAT outperforms existing methods with 34 ms latency, enabling low-cost, up-to-date lane-level navigation by effectively handling spatial fluctuations, semantic disparities, and OP map noise.

Conclusion: ONR provides a practical solution for lane-level navigation by combining SD map topology with OP map real-time geometry, overcoming limitations of both HD maps and pure perception-based approaches.

Abstract: Lane-level navigation is critical for geographic information systems and navigation-based tasks, offering finer-grained guidance than the road-level navigation provided by standard definition (SD) maps. However, it currently relies on expensive global HD maps that cannot adapt to dynamic road conditions. Recently, online perception (OP) maps have become a research hotspot, providing real-time geometry as an alternative, but they lack the global topology needed for navigation. To address these issues, we introduce Online Navigation Refinement (ONR), a new task that refines SD-map-based road-level routes into accurate lane-level navigation by associating SD maps with OP maps. The map-to-map association must handle many-to-one lane-to-road mappings under two key challenges: (1) no public dataset provides lane-to-road correspondences; (2) severe misalignment from spatial fluctuations, semantic disparities, and OP map noise invalidates traditional map matching. To address these challenges, we contribute: (1) the Online Map Association dataset (OMA), the first ONR benchmark with 30K scenarios and 2.6M annotated lane vectors; (2) MAT, a transformer with path-aware attention that aligns topology despite spatial fluctuations and semantic disparities, and spatial attention that integrates noisy OP features via global context; and (3) NR P-R, a metric evaluating geometric and semantic alignment. Experiments show that MAT outperforms existing methods at 34 ms latency, enabling low-cost and up-to-date lane-level navigation.

[174] Learning 3D Texture-Aware Representations for Parsing Diverse Human Clothing and Body Parts

Kiran Chhatre, Christopher Peters, Srikrishna Karanam

Main category: cs.CV

TL;DR: Spectrum is a unified network for detailed human parsing that repurposes an Image-to-Texture diffusion model to achieve better alignment with diverse body parts and clothing categories through prompt-guided grounding.

DetailsMotivation: Existing methods have limitations: fixed mask categories obscure fine-grained clothing types, and open-vocabulary segmentation approaches group entire humans into single categories without distinguishing detailed clothing or body parts. Diffusion models lack specialization for detailed human parsing.

Method: Repurposes an Image-to-Texture (I2Tx) diffusion model (fine-tuned T2I model on 3D human texture maps) to extract human-part internal features. Uses prompt-guided grounding to generate semantically valid masks aligned to diverse clothing categories. Produces semantic segmentation for every visible body part and clothing category.

Result: Spectrum consistently outperforms baseline methods in prompt-based segmentation across extensive cross-dataset experiments assessing body parts, clothing parts, unseen clothing categories, and full-body masks.

Conclusion: Spectrum successfully addresses detailed human parsing by leveraging I2Tx diffusion models for improved alignment with body parts and clothing, demonstrating superior performance over existing methods while ignoring irrelevant objects and handling any number of humans in scenes.

Abstract: Existing methods for human parsing into body parts and clothing often use fixed mask categories with broad labels that obscure fine-grained clothing types. Recent open-vocabulary segmentation approaches leverage pretrained text-to-image (T2I) diffusion model features for strong zero-shot transfer, but typically group entire humans into a single person category, failing to distinguish diverse clothing or detailed body parts. To address this, we propose Spectrum, a unified network for part-level pixel parsing (body parts and clothing) and instance-level grouping. While diffusion-based open-vocabulary models generalize well across tasks, their internal representations are not specialized for detailed human parsing. We observe that, unlike diffusion models with broad representations, image-driven 3D texture generators maintain faithful correspondence to input images, enabling stronger representations for parsing diverse clothing and body parts. Spectrum introduces a novel repurposing of an Image-to-Texture (I2Tx) diffusion model (obtained by fine-tuning a T2I model on 3D human texture maps) for improved alignment with body parts and clothing. From an input image, we extract human-part internal features via the I2Tx diffusion model and generate semantically valid masks aligned to diverse clothing categories through prompt-guided grounding. Once trained, Spectrum produces semantic segmentation maps for every visible body part and clothing category, ignoring standalone garments or irrelevant objects, for any number of humans in the scene. We conduct extensive cross-dataset experiments, separately assessing body parts, clothing parts, unseen clothing categories, and full-body masks, and demonstrate that Spectrum consistently outperforms baseline methods in prompt-based segmentation.

[175] Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

Fangyuan Mao, Aiming Hao, Jintao Chen, Dongxia Liu, Xiaokun Feng, Jiashu Zhu, Meiqi Wu, Chubin Chen, Jiahong Wu, Xiangxiang Chu

Main category: cs.CV

TL;DR: Omni-Effects is a unified framework for generating prompt-guided visual effects with spatial control, enabling multiple effects at designated locations without interference.

DetailsMotivation: Current VFX generation methods are limited to single effects per LoRA training, preventing spatially controllable composite effects needed for realistic cinematic production.

Method: Two key innovations: 1) LoRA-based Mixture of Experts (LoRA-MoE) integrates diverse effects while mitigating interference, and 2) Spatial-Aware Prompt (SAP) with Independent-Information Flow (IIF) module enables precise spatial control and prevents effect blending.

Result: The framework achieves precise spatial control and diverse effect generation, allowing users to specify both effect category and location. A comprehensive VFX dataset (Omni-VFX) and evaluation framework were created.

Conclusion: Omni-Effects successfully addresses the limitations of current VFX generation methods by enabling spatially controllable composite effects through a unified framework that prevents interference and maintains spatial precision.

Abstract: Visual effects (VFX) are essential visual enhancements fundamental to modern cinematic production. Although video generation models offer cost-efficient solutions for VFX production, current methods are constrained by per-effect LoRA training, which limits generation to single effects. This fundamental limitation impedes applications that require spatially controllable composite effects, i.e., the concurrent generation of multiple effects at designated locations. However, integrating diverse effects into a unified framework faces major challenges: interference from effect variations and spatial uncontrollability during multi-VFX joint training. To tackle these challenges, we propose Omni-Effects, the first unified framework capable of generating prompt-guided effects and spatially controllable composite effects. The core of our framework comprises two key innovations: (1) LoRA-based Mixture of Experts (LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects within a unified model while effectively mitigating cross-task interference. (2) Spatial-Aware Prompt (SAP), which incorporates spatial mask information into the text token, enabling precise spatial control. Furthermore, we introduce an Independent-Information Flow (IIF) module integrated within the SAP, isolating the control signals corresponding to individual effects to prevent any unwanted blending. To facilitate this research, we construct a comprehensive VFX dataset, Omni-VFX, via a novel data collection pipeline combining image editing and First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX evaluation framework for validating model performance. Extensive experiments demonstrate that Omni-Effects achieves precise spatial control and diverse effect generation, enabling users to specify both the category and location of desired effects.
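
A LoRA-MoE layer can be sketched as a frozen base projection plus K low-rank experts mixed by a learned router; the gating and initialization details below are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class LoRAMoELinear(nn.Module):
    """Frozen base linear layer plus K LoRA experts mixed by a router, so
    different effects can specialize without interfering with one another."""
    def __init__(self, dim_in, dim_out, k_experts=4, rank=8):
        super().__init__()
        self.base = nn.Linear(dim_in, dim_out)
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(k_experts, rank, dim_in) * 0.01)  # down-proj
        self.B = nn.Parameter(torch.zeros(k_experts, dim_out, rank))        # up-proj
        self.router = nn.Linear(dim_in, k_experts)

    def forward(self, x):                                  # x: (B, T, dim_in)
        gate = self.router(x).softmax(dim=-1)              # (B, T, K)
        h = torch.einsum("krd,btd->btkr", self.A, x)       # per-expert down-projection
        h = torch.einsum("kor,btkr->btko", self.B, h)      # per-expert up-projection
        return self.base(x) + (gate.unsqueeze(-1) * h).sum(dim=2)
```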

[176] FitPro: A Zero-Shot Framework for Interactive Text-based Pedestrian Retrieval in Open World

Zengli Luo, Canlong Zhang, Xiaochun Lu, Zhixin Li

Main category: cs.CV

TL;DR: FitPro is an open-world interactive zero-shot text-based pedestrian retrieval framework that enhances semantic comprehension and cross-scene adaptability through three novel components: Feature Contrastive Decoding, Incremental Semantic Mining, and Query-aware Hierarchical Retrieval.

DetailsMotivation: Existing text-based pedestrian retrieval methods have limited generalization and insufficient semantic understanding in open-world interactive scenarios, struggling with semantic drift in zero-shot settings and robustness against viewpoint shifts and fine-grained variations.

Method: FitPro introduces three key components: 1) Feature Contrastive Decoding (FCD) uses prompt-guided contrastive decoding to generate high-quality structured pedestrian descriptions from denoised images; 2) Incremental Semantic Mining (ISM) constructs holistic pedestrian representations from multi-view observations for global semantic modeling in multi-turn interactions; 3) Query-aware Hierarchical Retrieval (QHR) dynamically optimizes the retrieval pipeline based on query types for efficient adaptation to multi-modal and multi-view inputs.

Result: Extensive experiments on five public datasets and two evaluation protocols demonstrate that FitPro significantly overcomes the generalization limitations and semantic modeling constraints of existing methods in interactive retrieval.

Conclusion: FitPro paves the way for practical deployment of text-based pedestrian retrieval systems by addressing key challenges in open-world interactive scenarios through enhanced semantic comprehension and cross-scene adaptability.

Abstract: Text-based Pedestrian Retrieval (TPR) deals with retrieving specific target pedestrians in visual scenes according to natural language descriptions. Although existing methods have achieved progress under constrained settings, interactive retrieval in the open-world scenario still suffers from limited model generalization and insufficient semantic understanding. To address these challenges, we propose FitPro, an open-world interactive zero-shot TPR framework with enhanced semantic comprehension and cross-scene adaptability. FitPro has three innovative components: Feature Contrastive Decoding (FCD), Incremental Semantic Mining (ISM), and Query-aware Hierarchical Retrieval (QHR). The FCD integrates prompt-guided contrastive decoding to generate high-quality structured pedestrian descriptions from denoised images, effectively alleviating semantic drift in zero-shot scenarios. The ISM constructs holistic pedestrian representations from multi-view observations to achieve global semantic modeling in multi-turn interactions, thereby improving robustness against viewpoint shifts and fine-grained variations in descriptions. The QHR dynamically optimizes the retrieval pipeline according to query types, enabling efficient adaptation to multi-modal and multi-view inputs. Extensive experiments on five public datasets and two evaluation protocols demonstrate that FitPro significantly overcomes the generalization limitations and semantic modeling constraints of existing methods in interactive retrieval, paving the way for practical deployment.
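
Contrastive decoding, the mechanism underlying FCD, has a compact generic form: amplify what a prompt-guided pass predicts relative to an unguided pass before sampling. A sketch of that generic step (FitPro's FCD adds prompt design and image denoising on top; alpha is an assumption):

```python
def contrastive_decode_step(logits_guided, logits_plain, alpha=1.0):
    """Return adjusted next-token logits that emphasize the information the
    guided pass contributes beyond the plain pass."""
    return (1 + alpha) * logits_guided - alpha * logits_plain

# next_token = contrastive_decode_step(l_guided, l_plain).argmax(dim=-1)
```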

[177] Benchmarking and Mitigating Sycophancy in Medical Vision Language Models

Zikun Guo, Jingwei Lv, Xinyue Xu, Shu Yang, Jun Wen, Di Wang, Lijie Hu

Main category: cs.CV

TL;DR: Medical VLMs show dangerous sycophancy (agreeing with users despite evidence), especially triggered by authority cues. VIPER method filters social cues to enforce evidence-based reasoning, reducing sycophancy while maintaining performance.

DetailsMotivation: Visual language models (VLMs) have transformative potential for medical workflows, but their deployment is limited by sycophancy - the tendency to agree with users even when evidence contradicts them. This poses a serious threat to patient safety, and there's a lack of systematic benchmarks to measure this problem in medical contexts.

Method: 1) Created a Medical benchmark applying multiple templates to VLMs in hierarchical medical visual question answering tasks. 2) Discovered that perceived authority and user mimicry trigger sycophancy independent of visual data. 3) Proposed VIPER (Visual Information Purification for Evidence based Responses) strategy that proactively filters out non-evidence-based social cues to reinforce evidence-based reasoning.

Result: Current VLMs are highly susceptible to visual cues, with failure rates correlating to model size or overall accuracy. Authority cues and user mimicry are powerful triggers for sycophancy. VIPER successfully reduces sycophancy while maintaining interpretability and consistently outperforms baseline methods.

Conclusion: VIPER provides a foundation for robust and secure integration of VLMs in medical workflows by addressing the critical sycophancy problem. The method reinforces evidence-based reasoning by filtering out misleading social cues, making VLMs safer for patient care applications.

Abstract: Visual language models (VLMs) have the potential to transform medical workflows. However, their deployment is limited by sycophancy. Despite this serious threat to patient safety, a systematic benchmark remains lacking. This paper addresses this gap by introducing a medical benchmark that applies multiple templates to VLMs in a hierarchical medical visual question answering task. We find that current VLMs are highly susceptible to visual cues, with failure rates correlating with model size or overall accuracy. We also discover that perceived authority and user mimicry are powerful triggers, suggesting a bias mechanism independent of visual data. To overcome this, we propose a Visual Information Purification for Evidence-based Responses (VIPER) strategy that proactively filters out non-evidence-based social cues, thereby reinforcing evidence-based reasoning. VIPER reduces sycophancy while maintaining interpretability and consistently outperforms baseline methods, laying the necessary foundation for the robust and secure integration of VLMs.

[178] Efficiency vs. Efficacy: Assessing the Compression Ratio-Dice Score Relationship through a Simple Benchmarking Framework for Cerebrovascular 3D Segmentation

Shimaa Elbana, Ahmad Kamal, Shahd Ahmed Ali, Ahmad Al-Kabbany

Main category: cs.CV

TL;DR: ZFP compression enables 22.89:1 data reduction for 3D medical imaging while maintaining cerebrovascular segmentation accuracy (Dice ~0.87656 vs 0.8774 baseline).

DetailsMotivation: Large 3D medical imaging datasets create barriers for collaborative research and transferability. Need compression techniques that don't compromise automated segmentation performance for intracranial aneurysm detection.

Method: Applied ZFP compression in both error tolerance and fixed-rate modes to a large-scale 3D medical dataset with ground-truth vascular segmentations. Compared segmentation quality on compressed volumes against uncompressed baseline using Dice coefficient.

Result: ZFP achieved substantial data reduction up to 22.89:1 ratio in error tolerance mode while maintaining high fidelity. Mean Dice coefficient remained high at 0.87656 compared to uncompressed baseline of 0.8774.

Conclusion: ZFP is a viable and powerful tool for enabling more efficient and accessible research on large-scale medical datasets, fostering broader collaboration across the medical imaging community.

Abstract: The increasing size and complexity of medical imaging datasets, particularly in 3D formats, present significant barriers to collaborative research and transferability. This study investigates whether the ZFP compression technique can mitigate these challenges without compromising the performance of automated cerebrovascular segmentation, a critical first step in intracranial aneurysm detection. We apply ZFP in both its error-tolerance and fixed-rate modes to a large-scale 3D medical dataset containing ground-truth vascular segmentations, one of the most recent in the literature. The segmentation quality on the compressed volumes is rigorously compared to the uncompressed baseline (Dice approximately 0.8774). Our findings reveal that ZFP can achieve substantial data reduction (up to a 22.89:1 ratio in error-tolerance mode) while maintaining a high degree of fidelity, with the mean Dice coefficient remaining high at 0.87656. These results demonstrate that ZFP is a viable and powerful tool for enabling more efficient and accessible research on large-scale medical datasets, fostering broader collaboration across the community.
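
The experiment is easy to reproduce in miniature with the zfpy bindings (API per the zfpy package; check your installed version). The sketch compresses a volume in error-tolerance mode, measures the ratio, and leaves a hook for the downstream Dice comparison:

```python
import numpy as np
import zfpy  # pip install zfpy

def dice(a, b):
    """Dice coefficient between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum() + 1e-8)

volume = np.random.rand(128, 256, 256).astype(np.float32)  # stand-in 3D scan

# error-tolerance (fixed-accuracy) mode: absolute error bounded by `tolerance`
buf = zfpy.compress_numpy(volume, tolerance=1e-2)
recon = zfpy.decompress_numpy(buf)

print(f"ratio {volume.nbytes / len(buf):.2f}:1, "
      f"max abs error {np.abs(volume - recon).max():.4f}")

# downstream check: run the same segmentation model on `volume` and `recon`,
# then compare dice(seg_orig, ground_truth) vs. dice(seg_recon, ground_truth)
```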

[179] LoRAverse: A Submodular Framework to Retrieve Diverse Adapters for Diffusion Models

Mert Sonmezer, Matthew Zheng, Pinar Yanardag

Main category: cs.CV

TL;DR: Proposes a submodular framework for selecting relevant and diverse LoRA adapters from large databases to improve user navigation and content generation quality.

DetailsMotivation: Users struggle to navigate and select appropriate LoRA adapters from massive databases (100K+ on Civit.ai) due to volume, diversity, and lack of organization, hindering effective personalization of diffusion models.

Method: Frames LoRA adapter selection as a combinatorial optimization problem and proposes a novel submodular framework to select the most relevant and diverse models from large databases.

Result: Quantitative and qualitative experiments demonstrate that the method generates diverse outputs across a wide range of domains, effectively addressing the selection challenge.

Conclusion: The proposed submodular framework successfully solves the problem of selecting relevant and diverse LoRA adapters from large databases, improving user experience and content generation quality.

Abstract: Low-rank Adaptation (LoRA) models have revolutionized the personalization of pre-trained diffusion models by enabling fine-tuning through low-rank, factorized weight matrices specifically optimized for attention layers. These models facilitate the generation of highly customized content across a variety of objects, individuals, and artistic styles without the need for extensive retraining. Despite the availability of over 100K LoRA adapters on platforms like Civit.ai, users often face challenges in navigating, selecting, and effectively utilizing the most suitable adapters due to their sheer volume, diversity, and lack of structured organization. This paper addresses the problem of selecting the most relevant and diverse LoRA models from this vast database by framing the task as a combinatorial optimization problem and proposing a novel submodular framework. Our quantitative and qualitative experiments demonstrate that our method generates diverse outputs across a wide range of domains.
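
Submodular selection of this kind is typically solved greedily; the classic facility-location objective is one concrete choice (an illustration, not necessarily the paper's exact objective), and for monotone submodular objectives the greedy algorithm keeps a (1 - 1/e) approximation guarantee:

```python
import numpy as np

def select_loras(sim, relevance, k, lam=0.5):
    """Greedily pick k adapters maximizing coverage of the pool (facility
    location) blended with query relevance.
    sim: (N, N) adapter-adapter similarity; relevance: (N,) query scores."""
    N = sim.shape[0]
    selected, covered = [], np.zeros(N)
    for _ in range(k):
        best_j, best_gain = None, -np.inf
        for j in range(N):
            if j in selected:
                continue
            # marginal coverage gain of adding adapter j, plus its relevance
            gain = lam * (np.maximum(covered, sim[:, j]).sum() - covered.sum())
            gain += (1 - lam) * relevance[j]
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        covered = np.maximum(covered, sim[:, best_j])
    return selected
```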

[180] Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, Qifeng Chen

Main category: cs.CV

TL;DR: Ditto framework creates Ditto-1M dataset (1M high-quality video editing examples) to solve data scarcity in instruction-based video editing, enabling state-of-the-art model Editto.

DetailsMotivation: Instruction-based video editing lacks large-scale, high-quality training data, severely limiting progress in democratizing content creation.

Method: Three-part framework: 1) Data generation pipeline combining image editor diversity with in-context video generation, 2) Efficient distilled model with temporal enhancer for cost-quality trade-off, 3) Intelligent agent for instruction crafting and quality filtering at scale.

Result: Created Ditto-1M dataset with 1M high-fidelity video editing examples using 12,000 GPU-days; trained Editto model achieves superior instruction-following and state-of-the-art performance.

Conclusion: Ditto framework successfully addresses data scarcity in instruction-based video editing, enabling scalable high-quality training and establishing new SOTA with Editto model.

Abstract: Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.

[181] Weakly Supervised Pneumonia Localization from Chest X-Rays Using Deep Neural Network and Grad-CAM Explanations

Kiran Shahi, Anup Bagale

Main category: cs.CV

TL;DR: A weakly supervised deep learning framework using Grad-CAM for pneumonia classification and localization from chest X-rays, achieving 96-98% accuracy with image-level labels instead of costly pixel-level annotations.

DetailsMotivation: Pixel-level annotations for pneumonia localization in chest X-rays are expensive and time-consuming to obtain. The study aims to develop a more practical approach using only image-level labels while maintaining clinical relevance and explainability.

Method: Proposes a weakly supervised framework using Gradient-weighted Class Activation Mapping (Grad-CAM) with image-level labels. Evaluates seven pre-trained models (including Vision Transformer) under identical conditions with focal loss and patient-wise splits to prevent data leakage.

Result: All models achieved high classification accuracy (96-98%). ResNet-18 and EfficientNet-B0 showed best overall performance, while MobileNet-V3 provided efficient lightweight alternative. Grad-CAM heatmaps successfully highlighted clinically relevant lung regions.

Conclusion: Weakly supervised, explainable AI models using Grad-CAM can effectively localize pneumonia regions without costly pixel-level annotations, enhancing transparency and clinical trust in AI-assisted radiological diagnostics.

Abstract: Chest X-ray imaging is commonly used to diagnose pneumonia, but accurately localizing the pneumonia-affected regions typically requires detailed pixel-level annotations, which are costly and time-consuming to obtain. To address this limitation, this study proposes a weakly supervised deep learning framework for pneumonia classification and localization using Gradient-weighted Class Activation Mapping (Grad-CAM). Instead of relying on costly pixel-level annotations, the proposed method utilizes image-level labels to generate clinically meaningful heatmaps that highlight pneumonia-affected regions. Furthermore, we evaluate seven pre-trained deep learning models, including a Vision Transformer, under identical training conditions, using focal loss and patient-wise splits to prevent data leakage. Experimental results suggest that all models achieved high classification accuracy (96–98%), with ResNet-18 and EfficientNet-B0 showing the best overall performance and MobileNet-V3 providing an efficient lightweight alternative. Grad-CAM heatmap visualizations confirm that the proposed methods focus on clinically relevant lung regions, supporting the use of explainable AI for radiological diagnostics. Overall, this work highlights the potential of weakly supervised, explainable models that enhance transparency and clinical trust in AI-assisted pneumonia screening.
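
Grad-CAM itself needs only two hooks and a weighted sum of activations. A compact sketch for a ResNet-18 backbone (layer choice and preprocessing are assumptions; the weights argument requires torchvision >= 0.13):

```python
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
feats, grads = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(a=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

def grad_cam(x, class_idx):
    """x: (1, 3, H, W) normalized chest X-ray replicated to 3 channels.
    Returns an (H, W) heatmap of regions driving the class prediction."""
    model.zero_grad()
    model(x)[0, class_idx].backward()
    a, g = feats["a"], grads["a"]                 # both (1, C, h, w)
    weights = g.mean(dim=(2, 3), keepdim=True)    # GAP of gradients per channel
    cam = torch.relu((weights * a).sum(dim=1))    # (1, h, w)
    cam = cam / (cam.max() + 1e-8)
    return torch.nn.functional.interpolate(
        cam[None], size=x.shape[-2:], mode="bilinear")[0, 0]
```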

[182] MUSE: Multi-Scale Dense Self-Distillation for Nucleus Detection and Classification

Zijiang Yang, Hanqing Chao, Bokai Zhao, Yelin Yang, Yunshuo Zhang, Dongmei Fu, Junping Zhang, Le Lu, Ke Yan, Dakai Jin, Minfeng Xu, Yun Bian, Hui Jiang

Main category: cs.CV

TL;DR: MUSE is a self-supervised learning method for nucleus detection and classification in histopathology that uses multi-scale dense self-distillation with a coordinate-guided mechanism called NuLo, eliminating the need for strict spatial alignment and enabling cross-scale nucleus-level representation learning.

DetailsMotivation: Existing nucleus detection and classification methods rely heavily on labor-intensive nucleus-level annotations and fail to effectively leverage large-scale unlabeled data for learning discriminative nucleus representations in histopathology analysis.

Method: Proposes MUSE with NuLo (Nucleus-based Local self-distillation), a coordinate-guided mechanism for flexible local self-distillation based on predicted nucleus positions. Also includes an encoder-decoder architecture and large field-of-view semi-supervised fine-tuning strategy to maximize use of unlabeled pathology images.

Result: Extensive experiments on three widely used benchmarks show MUSE effectively addresses core challenges of histopathological NDC, surpassing state-of-the-art supervised baselines and outperforming generic pathology foundation models.

Conclusion: MUSE provides a novel self-supervised learning approach that eliminates dependency on labor-intensive nucleus annotations while enabling effective cross-scale nucleus-level representation learning, demonstrating superior performance over existing methods.

Abstract: Nucleus detection and classification (NDC) in histopathology analysis is a fundamental task that underpins a wide range of high-level pathology applications. However, existing methods heavily rely on labor-intensive nucleus-level annotations and struggle to fully exploit large-scale unlabeled data for learning discriminative nucleus representations. In this work, we propose MUSE (MUlti-scale denSE self-distillation), a novel self-supervised learning method tailored for NDC. At its core is NuLo (Nucleus-based Local self-distillation), a coordinate-guided mechanism that enables flexible local self-distillation based on predicted nucleus positions. By removing the need for strict spatial alignment between augmented views, NuLo allows critical cross-scale alignment, thus unlocking the capacity of models for fine-grained nucleus-level representation. To support MUSE, we design a simple yet effective encoder-decoder architecture and a large field-of-view semi-supervised fine-tuning strategy that together maximize the value of unlabeled pathology images. Extensive experiments on three widely used benchmarks demonstrate that MUSE effectively addresses the core challenges of histopathological NDC. The resulting models not only surpass state-of-the-art supervised baselines but also outperform generic pathology foundation models.
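
The coordinate-guided step can be pictured as sampling per-nucleus embeddings at predicted positions from student and teacher feature maps and matching them DINO-style. A sketch under those assumptions (MUSE's actual losses and cross-scale handling are more involved):

```python
import torch
import torch.nn.functional as F

def nulo_loss(student_map, teacher_map, coords, temp_s=0.1, temp_t=0.04):
    """student_map, teacher_map: (B, C, H, W) from two augmented views/scales;
    coords: (B, N, 2) predicted nucleus positions, normalized to [-1, 1]."""
    grid = coords.unsqueeze(2)                                  # (B, N, 1, 2)
    s = F.grid_sample(student_map, grid, align_corners=False)   # (B, C, N, 1)
    t = F.grid_sample(teacher_map, grid, align_corners=False)
    s = s.squeeze(-1).transpose(1, 2)                           # (B, N, C)
    t = t.squeeze(-1).transpose(1, 2)
    p_t = F.softmax(t / temp_t, dim=-1).detach()                # teacher: stop-grad
    return -(p_t * F.log_softmax(s / temp_s, dim=-1)).sum(-1).mean()
```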

[183] Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

Yifan Liu, Fangneng Zhan, Kaichen Zhou, Yilun Du, Paul Pu Liang, Hanspeter Pfister

Main category: cs.CV

TL;DR: SandboxVLM enhances VLMs’ 3D reasoning by using abstract bounding boxes to bridge the modality gap between 2D training and 3D tasks, achieving significant improvements in spatial intelligence without additional training.

DetailsMotivation: Vision-language models struggle with 3D-related tasks like spatial cognition and physical understanding due to a modality gap between their 2D training and 3D task requirements, limiting their effectiveness in real-world applications like robotics and embodied agents.

Method: Introduces SandboxVLM framework that uses abstract bounding boxes to encode geometric structure and physical kinematics. Features a 4-stage 3D Sandbox reconstruction and perception pipeline: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning.

Result: In zero-shot settings across multiple benchmarks and VLM backbones, the approach consistently improves spatial intelligence, achieving 8.3% gain on SAT Real compared to baseline methods. Demonstrates substantial enhancement of 3D reasoning ability without additional training.

Conclusion: Equipping VLMs with 3D abstraction substantially enhances their 3D reasoning capabilities, suggesting new possibilities for general-purpose embodied intelligence by bridging the modality gap between 2D training and 3D tasks.

Abstract: Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, for instance achieving an 8.3% gain on SAT Real compared with baseline methods. These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.
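
The multi-view voting and clustering stage can be pictured as back-projecting per-view detections into a shared world frame and keeping the densest cluster. The toy numpy sketch below assumes pinhole intrinsics and known camera poses; it is a plausible reading of the stage name, not the paper's actual pipeline.

```python
import numpy as np

def backproject(center_px, depth, K, cam_to_world):
    """Lift a 2D detection center (u, v) with depth into world coordinates."""
    u, v = center_px
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    cam_pt = np.append(ray * depth, 1.0)                 # homogeneous camera point
    return (cam_to_world @ cam_pt)[:3]

def vote_3d_center(detections, inlier_radius=0.3):
    """Multi-view voting: back-project per-view detections, then keep the
    densest cluster's mean as the object's abstract 3D position."""
    pts = np.array([backproject(c, d, K, T) for c, d, K, T in detections])
    best, best_count = None, -1
    for p in pts:                                        # simple mode-seeking vote
        inliers = pts[np.linalg.norm(pts - p, axis=1) < inlier_radius]
        if len(inliers) > best_count:
            best, best_count = inliers.mean(axis=0), len(inliers)
    return best

K, T = np.eye(3), np.eye(4)                              # toy camera at the origin
dets = [((10.0, 5.0), 2.0, K, T), ((10.05, 5.05), 2.0, K, T),
        ((40.0, 2.0), 9.0, K, T)]                        # third view is an outlier
print(vote_3d_center(dets))                              # mean of the agreeing views
```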

[184] MovSemCL: Movement-Semantics Contrastive Learning for Trajectory Similarity (Extension)

Zhichen Lai, Hua Lu, Huan Li, Jialiang Li, Christian S. Jensen

Main category: cs.CV

TL;DR: MovSemCL is a movement-semantics contrastive learning framework for trajectory similarity computation that addresses limitations in existing methods through hierarchical representation, efficient patch-based encoding, and physically plausible augmentations.

DetailsMotivation: Existing learning-based trajectory similarity methods have three key limitations: (1) insufficient modeling of trajectory semantics and hierarchy, lacking movement dynamics extraction and multi-scale structural representation; (2) high computational costs from point-wise encoding; and (3) use of physically implausible augmentations that distort trajectory semantics.

Method: MovSemCL transforms raw GPS trajectories into movement-semantics features, segments them into patches, then uses intra- and inter-patch attentions to encode local and global trajectory patterns. It includes a curvature-guided augmentation strategy that preserves informative segments (turns, intersections) while masking redundant ones.

Result: Experiments show MovSemCL outperforms state-of-the-art methods, achieving mean ranks close to the ideal value of 1 on similarity search tasks and improvements of up to 20.3% on heuristic approximation, while reducing inference latency by up to 43.4%.

Conclusion: MovSemCL effectively addresses the limitations of existing trajectory similarity methods through its movement-semantics contrastive learning approach, providing efficient hierarchical representation, reduced computational costs, and physically plausible augmentations for improved performance.

Abstract: Trajectory similarity computation is a fundamental operation used in, e.g., clustering, prediction, and anomaly detection. However, existing learning-based methods exhibit three key limitations: (1) insufficient modeling of trajectory semantics and hierarchy, lacking both movement dynamics extraction and multi-scale structural representation; (2) high computational costs due to point-wise encoding; and (3) use of physically implausible augmentations that distort trajectory semantics. To address these issues, we propose MovSemCL, a movement-semantics contrastive learning framework for trajectory similarity computation. MovSemCL first transforms raw GPS trajectories into movement-semantics features and then segments them into patches. Next, MovSemCL employs intra- and inter-patch attentions to encode local as well as global trajectory patterns, enabling efficient hierarchical representation and reducing computational costs. Moreover, MovSemCL includes a curvature-guided augmentation strategy that preserves informative segments (e.g., turns and intersections) and masks redundant ones, generating physically plausible augmented views. Experiments on real-world datasets show that MovSemCL outperforms state-of-the-art methods, achieving mean ranks close to the ideal value of 1 on similarity search tasks and improvements of up to 20.3% on heuristic approximation, while reducing inference latency by up to 43.4%.
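
A minimal sketch of the two ingredients named above, movement-semantics features and curvature-guided masking, might look like the following; the specific feature set and keep ratio are assumptions for illustration.

```python
import numpy as np

def movement_semantics(traj, dt=1.0):
    """Turn raw GPS points (N, 2) into movement features: speed, heading,
    and curvature (absolute heading change), one row per segment."""
    deltas = np.diff(traj, axis=0)
    speed = np.linalg.norm(deltas, axis=1) / dt
    heading = np.arctan2(deltas[:, 1], deltas[:, 0])
    curvature = np.abs(np.diff(heading, prepend=heading[0]))
    return np.stack([speed, heading, curvature], axis=1)

def curvature_guided_mask(feats, keep_ratio=0.6):
    """Augmentation: keep high-curvature segments (turns, intersections)
    and drop a fraction of the low-curvature, redundant ones."""
    curvature = feats[:, 2]
    order = np.argsort(-curvature)                 # most informative first
    keep = np.sort(order[: int(len(feats) * keep_ratio)])
    return feats[keep]

traj = np.cumsum(np.random.randn(100, 2), axis=0)  # toy GPS trajectory
view = curvature_guided_mask(movement_semantics(traj))
```

Masking by curvature rather than at random is what keeps the augmented view physically plausible: a trajectory with its turns intact still describes the same movement.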

[185] PerTouch: VLM-Driven Agent for Personalized and Semantic Image Retouching

Zewei Chang, Zheng-Peng Duan, Jianxing Zhang, Chun-Le Guo, Siyu Liu, Hyungju Chun, Hyunhee Park, Zikun Liu, Chongyi Li

Main category: cs.CV

TL;DR: PerTouch is a unified diffusion-based image retouching framework that balances controllability and subjectivity through semantic-level parameter maps and VLM-driven agents for personalized aesthetic enhancement.

DetailsMotivation: Image retouching needs to balance objective controllability with subjective aesthetic preferences. Existing methods struggle to handle both fine-grained semantic control and personalized user intent alignment.

Method: Uses parameter maps with attribute values in semantic regions as input to create explicit parameter-to-image mapping. Introduces semantic replacement and parameter perturbation for better boundary perception. Develops VLM-driven agent with feedback-driven rethinking and scene-aware memory to handle user instructions.

Result: Extensive experiments demonstrate each component’s effectiveness and superior performance in personalized image retouching compared to existing methods.

Conclusion: PerTouch provides a unified framework that successfully balances controllability and subjectivity in image retouching, enabling semantic-level control while maintaining global aesthetics and aligning with user preferences.

Abstract: Image retouching aims to enhance visual quality while aligning with users’ personalized aesthetic preferences. To address the challenge of balancing controllability and subjectivity, we propose a unified diffusion-based image retouching framework called PerTouch. Our method supports semantic-level image retouching while maintaining global aesthetics. Using parameter maps containing attribute values in specific semantic regions as input, PerTouch constructs an explicit parameter-to-image mapping for fine-grained image retouching. To improve semantic boundary perception, we introduce semantic replacement and parameter perturbation mechanisms in the training process. To connect natural language instructions with visual control, we develop a VLM-driven agent that can handle both strong and weak user instructions. Equipped with mechanisms of feedback-driven rethinking and scene-aware memory, PerTouch better aligns with user intent and captures long-term preferences. Extensive experiments demonstrate each component’s effectiveness and the superior performance of PerTouch in personalized image retouching. Code is available at: https://github.com/Auroral703/PerTouch.

[186] TPG-INR: Target Prior-Guided Implicit 3D CT Reconstruction for Enhanced Sparse-view Imaging

Qinglei Cao, Ziyao Tang, Xiaoqin Tang

Main category: cs.CV

TL;DR: A novel 3D CT reconstruction framework that uses target priors from projection data to enhance implicit neural representation learning, achieving 10x faster learning and significant quality improvements over state-of-the-art methods in sparse-view scenarios.

DetailsMotivation: Existing NeRF-based CT reconstruction methods overlook anatomical priors, limiting precision and efficiency in ultra-sparse view scenarios where projection data is limited.

Method: Proposes a framework using target priors derived from projection data to guide implicit learning. Integrates positional and structural encoding for voxel-wise reconstruction, with target priors guiding voxel sampling and enriching structural encoding. Includes CUDA-based algorithm for rapid 3D target prior estimation from sparse-view projections.

Result: Achieves 10x faster learning than the leading model, NAF. Outperforms the most accurate model, NeRP, with PSNR improvements of 3.57 dB (10 projections), 5.42 dB (20 projections), and 5.70 dB (30 projections) on a complex abdominal dataset.

Conclusion: The target prior-guided implicit neural representation framework significantly improves both learning efficiency and reconstruction quality for 3D CT reconstruction from sparse-view projections, demonstrating the importance of incorporating anatomical priors in implicit learning approaches.

Abstract: X-ray imaging, based on penetration, enables detailed visualization of internal structures. Building on this capability, existing implicit 3D reconstruction methods have adapted the NeRF model and its variants for internal CT reconstruction. However, these approaches often neglect the significance of objects’ anatomical priors for implicit learning, limiting both reconstruction precision and learning efficiency, particularly in ultra-sparse view scenarios. To address these challenges, we propose a novel 3D CT reconstruction framework that employs a “target prior” derived from the object’s projection data to enhance implicit learning. Our approach integrates positional and structural encoding to facilitate voxel-wise implicit reconstruction, utilizing the target prior to guide voxel sampling and enrich structural encoding. This dual strategy significantly boosts both learning efficiency and reconstruction quality. Additionally, we introduce a CUDA-based algorithm for rapid estimation of high-quality 3D target priors from sparse-view projections. Experiments utilizing projection data from a complex abdominal dataset demonstrate that the proposed model substantially enhances learning efficiency, outperforming the current leading model, NAF, by a factor of ten. In terms of reconstruction quality, it also exceeds the most accurate model, NeRP, achieving PSNR improvements of 3.57 dB, 5.42 dB, and 5.70 dB with 10, 20, and 30 projections, respectively. The code is available at https://github.com/qlcao171/TPG-INR.
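
The combination of prior-guided voxel sampling and positional encoding can be sketched as below. The NeRF-style Fourier encoding is standard; the sampling rule is one plausible reading of "the target prior guides voxel sampling," not the released implementation.

```python
import numpy as np

def fourier_encode(coords, n_freqs=6):
    """NeRF-style positional encoding of normalized voxel coordinates in [0, 1]."""
    freqs = 2.0 ** np.arange(n_freqs) * np.pi
    angles = coords[:, :, None] * freqs                  # (N, 3, n_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(coords.shape[0], -1)              # (N, 3 * 2 * n_freqs)

def prior_guided_sample(prior, n_samples=1024, floor=0.05):
    """Draw voxel coordinates with probability proportional to the target
    prior's intensity; a small uniform floor keeps empty space visible."""
    p = prior.ravel() + floor * prior.mean()
    p /= p.sum()
    idx = np.random.choice(p.size, size=n_samples, p=p)
    coords = np.stack(np.unravel_index(idx, prior.shape), axis=1)
    return coords / (np.array(prior.shape) - 1.0)        # normalize to [0, 1]^3

prior = np.random.rand(64, 64, 64) ** 4                  # toy 3D target prior
x = prior_guided_sample(prior)                           # biased toward bright voxels
features = fourier_encode(x)                             # input to the implicit MLP
```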

[187] Human-Centric Open-Future Task Discovery: Formulation, Benchmark, and Scalable Tree-Based Search

Zijian Song, Xiaoxin Lin, Tao Pu, Zhenlong Yuan, Guangrun Wang, Liang Lin

Main category: cs.CV

TL;DR: The paper introduces HOTD-Bench for evaluating human-centric open-future task discovery and proposes CMAST framework that outperforms existing LMMs through multi-agent reasoning and search tree structure.

DetailsMotivation: Current LMMs struggle with discovering tasks that assist humans in open-future scenarios where human intentions are concurrent and dynamic, creating a need for better methods to identify tasks that reduce human effort across plausible futures.

Method: Proposes HOTD-Bench with 2K+ real-world videos, semi-automated annotation pipeline, and simulation-based protocol for open-set future evaluation. Introduces CMAST framework that uses multi-agent system for complex reasoning decomposition and scalable search tree module for structured reasoning.

Result: CMAST achieves best performance on HOTD-Bench, significantly surpassing existing LMMs. It also integrates well with existing LMMs, consistently improving their performance.

Conclusion: The work formalizes Human-centric Open-future Task Discovery problem and provides both a benchmark (HOTD-Bench) and effective framework (CMAST) that advances LMM capabilities for discovering human-assisting tasks in dynamic, open-future scenarios.

Abstract: Recent progress in robotics and embodied AI is largely driven by Large Multimodal Models (LMMs). However, a key challenge remains underexplored: how can we advance LMMs to discover tasks that assist humans in open-future scenarios, where human intentions are highly concurrent and dynamic. In this work, we formalize the problem of Human-centric Open-future Task Discovery (HOTD), focusing particularly on identifying tasks that reduce human effort across plausible futures. To facilitate this study, we propose HOTD-Bench, which features over 2K real-world videos, a semi-automated annotation pipeline, and a simulation-based protocol tailored for open-set future evaluation. Additionally, we propose the Collaborative Multi-Agent Search Tree (CMAST) framework, which decomposes complex reasoning through a multi-agent system and structures the reasoning process through a scalable search tree module. In our experiments, CMAST achieves the best performance on the HOTD-Bench, significantly surpassing existing LMMs. It also integrates well with existing LMMs, consistently improving performance.

[188] History-Augmented Contrastive Learning With Soft Mixture of Experts for Blind Super-Resolution of Planetary Remote Sensing Images

Hui-Jia Zhao, Jie Lu, Yunqing Jiang, Xiao-Ping Lu, Kaichang Di

Main category: cs.CV

TL;DR: HAC-MoE is an unsupervised blind super-resolution framework for planetary remote sensing that decouples kernel estimation from image reconstruction using contrastive learning and mixture-of-experts without external kernel priors.

DetailsMotivation: Blind super-resolution in planetary remote sensing is highly ill-posed with unknown degradation patterns and no ground-truth supervision. Existing unsupervised approaches suffer from optimization instability, distribution shifts, and fail to preserve morphological semantics.

Method: Three key innovations: 1) Contrastive Kernel Sampling to mitigate distribution bias in random sampling, 2) History-Augmented Contrastive Learning using historical model states as negative self-priors to stabilize optimization, 3) Morphology-Aware Soft Mixture-of-Experts to dynamically modulate spectral-spatial features for diverse planetary topographies.

Result: The method achieves state-of-the-art performance in reconstruction quality and kernel estimation accuracy. A new benchmark dataset Ceres-50 is introduced for evaluation under realistic degradation simulations.

Conclusion: HAC-MoE provides a robust solution for scientific observation in data-sparse extraterrestrial environments by addressing the challenges of unsupervised blind super-resolution in planetary remote sensing.

Abstract: Blind Super-Resolution (BSR) in planetary remote sensing constitutes a highly ill-posed inverse problem, characterized by unknown degradation patterns and a complete absence of ground-truth supervision. Existing unsupervised approaches often struggle with optimization instability and distribution shifts, relying on greedy strategies or generic priors that fail to preserve distinct morphological semantics. To address these challenges, we propose History-Augmented Contrastive Mixture of Experts (HAC-MoE), a novel unsupervised framework that decouples kernel estimation from image reconstruction without external kernel priors. The framework is founded on three key innovations: (1) A Contrastive Kernel Sampling mechanism that mitigates the distribution bias inherent in random Gaussian sampling, ensuring the generation of plausible kernel priors via similarity constraints; (2) A History-Augmented Contrastive Learning strategy that leverages historical model states as negative self-priors. We provide a theoretical analysis demonstrating that this mechanism induces strong convexity in the feature space, thereby stabilizing the unsupervised optimization trajectory and preventing overfitting; and (3) A Morphology-Aware Soft Mixture-of-Experts (MA-MoE) estimator that dynamically modulates spectral-spatial features to adaptively reconstruct diverse planetary topographies. To facilitate rigorous evaluation, we introduce Ceres-50, a benchmark dataset encapsulating diverse geological features under realistic degradation simulations. Extensive experiments demonstrate that HAC-MoE achieves state-of-the-art performance in reconstruction quality and kernel estimation accuracy, offering a solution for scientific observation in data-sparse extraterrestrial environments.
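
The history-augmented contrastive idea, using a frozen snapshot of the model's own past state as "negative self-priors," admits a compact InfoNCE-style sketch. Everything here (loss form, temperature, snapshot policy) is an assumption for illustration rather than the paper's formulation.

```python
import copy
import torch
import torch.nn.functional as F

def history_contrastive_loss(model, history_model, x_anchor, x_positive, tau=0.2):
    """InfoNCE where a frozen snapshot of the model's own past state supplies
    negative embeddings, discouraging the current model from collapsing back
    onto earlier solutions."""
    z_a = F.normalize(model(x_anchor), dim=-1)
    z_p = F.normalize(model(x_positive), dim=-1)
    with torch.no_grad():
        z_h = F.normalize(history_model(x_anchor), dim=-1)   # historical negatives
    pos = (z_a * z_p).sum(-1, keepdim=True) / tau            # (B, 1)
    neg = z_a @ z_h.T / tau                                  # (B, B)
    logits = torch.cat([pos, neg], dim=1)                    # positive at index 0
    return F.cross_entropy(logits, torch.zeros(len(z_a), dtype=torch.long))

# toy usage with a linear encoder; the history model is a periodic snapshot
enc = torch.nn.Linear(32, 16)
hist = copy.deepcopy(enc).requires_grad_(False)
loss = history_contrastive_loss(enc, hist, torch.randn(8, 32), torch.randn(8, 32))
```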

[189] SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model

Jiayuan Du, Yiming Zhao, Zhenglong Guo, Yong Pan, Wenbo Hou, Zhihui Hao, Kun Zhan, Qijun Chen

Main category: cs.CV

TL;DR: A novel transformer-based architecture for end-to-end 3D scene occupancy forecasting directly from image features, bypassing BEV projections and discrete tokenization to achieve SOTA performance on nuScenes.

DetailsMotivation: Existing methods for 3D scene occupancy forecasting rely on either VAEs with discrete occupancy tokens (limiting representational capacity) or bird's eye view projections with explicit geometric priors, both of which constrain performance and flexibility.

Method: Uses a transformer architecture with sparse occupancy representation that directly processes raw image features end-to-end, avoiding intermediate BEV projections and discrete tokenization to better capture spatiotemporal dependencies.

Result: Achieves state-of-the-art performance on nuScenes benchmark for 1-3 second occupancy forecasting, significantly outperforming existing approaches while demonstrating robust scene dynamics understanding under arbitrary trajectory conditioning.

Conclusion: The end-to-end transformer approach with sparse occupancy representation effectively overcomes limitations of both discrete tokenization and BEV-based methods, enabling superior 3D scene forecasting with flexible trajectory conditioning.

Abstract: This paper introduces a novel architecture for trajectory-conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limit representational capacity, our approach predicts multi-frame future occupancy in an end-to-end manner directly from raw image features. Inspired by the success of attention-based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird’s eye view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite-capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves state-of-the-art performance on the nuScenes benchmark for 1-3 second occupancy forecasting, outperforming existing approaches by a significant margin. Furthermore, it demonstrates robust scene dynamics understanding, consistently delivering high accuracy under arbitrary future trajectory conditioning.

[190] dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, Colin Zhang

Main category: cs.CV

TL;DR: dots_ocr is a unified Vision-Language Model that jointly learns document layout detection, text recognition, and relational understanding in an end-to-end framework, achieving SOTA performance on multilingual benchmarks.

DetailsMotivation: Current document layout parsing methods use fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage joint training synergies, limiting AI's ability to access structured knowledge.

Method: Introduces dots_ocr, a single Vision-Language Model that jointly learns three core tasks (layout detection, text recognition, relational understanding) in a unified end-to-end framework, enabled by a scalable data engine that synthesizes vast multilingual corpus.

Result: Achieves state-of-the-art performance on comprehensive OmniDocBench and introduces XDocParse benchmark (126 languages), where dots_ocr achieves ~10% relative improvement and demonstrates strong multilingual capability.

Conclusion: The unified end-to-end paradigm for document layout parsing outperforms fragmented multi-stage approaches, enabling robust performance across diverse languages, layouts, and domains through joint learning.

Abstract: Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world’s vast stores of structured knowledge. This process, which encompasses layout detection, text recognition, and relational understanding, is particularly crucial for empowering next-generation Vision-Language Models. Current methods, however, rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. In this paper, we introduce dots_ocr, a single Vision-Language Model that, for the first time, demonstrates the advantages of jointly learning three core tasks within a unified, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, empowering the model to deliver robust performance across a wide array of tasks, encompassing diverse languages, layouts, and domains. The efficacy of our unified paradigm is validated by state-of-the-art performance on the comprehensive OmniDocBench. Furthermore, to catalyze research in global document intelligence, we introduce XDocParse, a challenging new benchmark spanning 126 languages. On this benchmark, dots_ocr achieves state-of-the-art performance, delivering an approximately 10% relative improvement and demonstrating strong multilingual capability.

[191] From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

Kevin Cannons, Saeed Ranjbar Alvar, Mohammad Asiful Hossain, Ahmad Rezaei, Mohsen Gholami, Alireza Heidarikhazaei, Zhou Weimin, Yong Zhang, Mohammad Akbari

Main category: cs.CV

TL;DR: The paper introduces TAD, a new benchmark for evaluating temporal understanding in autonomous driving, tests current VLMs showing poor performance, and proposes two training-free methods that improve accuracy by up to 17.72%.

DetailsMotivation: Existing temporal reasoning benchmarks focus on domains like sports and movies, but lack specialized evaluation for ego-centric autonomous driving footage where temporal understanding is crucial for safety and decision-making.

Method: 1) Created TAD benchmark with 6,000 QA pairs across 7 tasks for autonomous driving temporal understanding. 2) Evaluated 9 generalist and specialist models. 3) Proposed two training-free solutions: Scene-CoT (Chain-of-Thought reasoning) and TCogMap (ego-centric temporal cognitive map).

Result: Current SoTA models performed poorly on TAD due to inadequate fine-grained motion understanding. The proposed Scene-CoT and TCogMap methods improved average accuracy by up to 17.72% when integrated with existing VLMs.

Conclusion: TAD fills a critical gap in evaluating temporal understanding for autonomous driving, reveals limitations of current models, and provides effective solutions to improve performance, catalyzing future research in this important area.

Abstract: Temporal understanding in autonomous driving (AD) remains a significant challenge, even for recent state-of-the-art (SoTA) Vision-Language Models (VLMs). Prior work has introduced datasets and benchmarks aimed at improving temporal reasoning, but these have emphasized other video content, including sports, cooking, and movies. No existing benchmark focuses exclusively on the unique challenges of temporal understanding in ego-centric AD footage. To fill this gap, the Temporal Understanding in Autonomous Driving (TAD) benchmark is presented, which evaluates VLMs’ ability to capture the dynamic relationships between actions in AD. TAD comprises nearly 6,000 question-answer (QA) pairs, spanning 7 human-designed tasks. In addition, an evaluation is performed that consists of 9 closed- and open-source generalist models as well as SoTA AD specialist models. When applied to TAD, current SoTA models demonstrated substandard accuracies, largely due to imperfect fine-grained motion understanding. To improve motion understanding and overall accuracy on TAD, two novel training-free solutions are proposed: Scene-CoT, which leverages Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map. The proposed approaches are integrated with existing VLMs and improve average accuracy on TAD by up to 17.72%. By introducing TAD, benchmarking multiple SoTA models, and proposing effective enhancements, this work aims to catalyze future research on temporal understanding in AD. The benchmark and evaluation code are available at https://huggingface.co/datasets/vbdai/TAD (Hugging Face) and https://github.com/vbdi/tad_bench (GitHub), respectively.

[192] DM3D: Deformable Mamba via Offset-Guided Gaussian Sequencing for Point Cloud Understanding

Bin Liu, Chunyang Wang, Xuelian Liu

Main category: cs.CV

TL;DR: DM3D is a deformable Mamba architecture for point cloud understanding that introduces adaptive serialization to overcome SSMs’ reliance on input order, achieving SOTA performance across multiple tasks.

DetailsMotivation: State Space Models (SSMs) show great potential for long-sequence modeling but depend on input order, which conflicts with the irregular nature of point clouds. Existing approaches use predefined serialization strategies that cannot adapt to diverse geometric structures.

Method: DM3D introduces: 1) Offset-guided Gaussian sequencing mechanism unifying local resampling and global reordering; 2) Gaussian-based KNN Resampling (GKR) for adaptive neighborhood reorganization; 3) Gaussian-based Differentiable Reordering (GDR) for end-to-end serialization optimization; 4) Tri-Path Frequency Fusion module for feature complementarity and aliasing reduction.

Result: Extensive experiments on benchmark datasets show DM3D achieves state-of-the-art performance in classification, few-shot learning, and part segmentation, demonstrating that adaptive serialization effectively unlocks SSMs’ potential for point cloud understanding.

Conclusion: DM3D enables structure-adaptive serialization of point clouds, overcoming the limitations of SSMs’ reliance on input order. The approach demonstrates that adaptive serialization is key to unlocking SSMs’ potential for irregular data like point clouds.

Abstract: State Space Models (SSMs) demonstrate significant potential for long-sequence modeling, but their reliance on input order conflicts with the irregular nature of point clouds. Existing approaches often rely on predefined serialization strategies, which cannot adjust based on diverse geometric structures. To overcome this limitation, we propose DM3D, a deformable Mamba architecture for point cloud understanding. Specifically, DM3D introduces an offset-guided Gaussian sequencing mechanism that unifies local resampling and global reordering within a deformable scan. The Gaussian-based KNN Resampling (GKR) enhances structural awareness by adaptively reorganizing neighboring points, while the Gaussian-based Differentiable Reordering (GDR) enables end-to-end optimization of serialization order. Furthermore, a Tri-Path Frequency Fusion module enhances feature complementarity and reduces aliasing. Together, these components enable structure-adaptive serialization of point clouds. Extensive experiments on benchmark datasets show that DM3D achieves state-of-the-art performance in classification, few-shot learning, and part segmentation, demonstrating that adaptive serialization effectively unlocks the potential of SSMs for point cloud understanding. The code will be released at https://github.com/L1277471578/DM3D.
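
One plausible reading of Gaussian-based KNN Resampling is a soft, offset-guided neighborhood aggregation; the sketch below is one such instantiation with assumed shapes and bandwidth, not the released code.

```python
import torch

def gaussian_knn_resample(points, offsets, k=8, sigma=0.1):
    """Shift each query by a predicted offset, then aggregate its k nearest
    neighbors with Gaussian distance weights, softly reorganizing the local
    neighborhood before serialization.

    points: (N, 3) cloud; offsets: (N, 3) learned deformations.
    """
    queries = points + offsets                              # deformable queries
    d2 = torch.cdist(queries, points) ** 2                  # (N, N) squared dists
    knn_d2, knn_idx = d2.topk(k, dim=1, largest=False)
    w = torch.softmax(-knn_d2 / (2 * sigma ** 2), dim=1)    # Gaussian weights
    return (w.unsqueeze(-1) * points[knn_idx]).sum(dim=1)   # (N, 3) resampled

pts = torch.rand(256, 3)
resampled = gaussian_knn_resample(pts, offsets=0.05 * torch.randn(256, 3))
```

Because the weights are a softmax over distances, gradients flow through both the offsets and the aggregation, which is what makes the serialization order trainable end to end.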

[193] Prompt-Based Continual Compositional Zero-Shot Learning

Sauda Maryam, Sara Nadeem, Faisal Qureshi, Mohsen Ali

Main category: cs.CV

TL;DR: PromptCCZSL: A prompt-based continual learning framework for compositional zero-shot learning that prevents forgetting while adapting to new attribute-object compositions using multi-teacher distillation and specialized losses.

DetailsMotivation: Continual adaptation of vision-language models to new attributes, objects, and compositions in CZSL while preventing catastrophic forgetting of prior knowledge. CCZSL is more complex than classical continual learning because attributes and objects may reoccur across sessions while compositions remain unique.

Method: Built on frozen VLM backbone, uses recency-weighted multi-teacher distillation to retain prior knowledge. Employs session-aware compositional prompts for new compositions, and session-agnostic attribute/object prompts for global semantic consistency. Uses Cosine Anchor Loss (CAL) to preserve prior knowledge, Orthogonal Projection Loss (OPL) to keep new embeddings distinct from previous ones, and Intra-Session Diversity Loss (IDL) to promote variation among current-session embeddings.

Result: Extensive experiments on UT-Zappos and C-GQA benchmarks show PromptCCZSL achieves substantial improvements over prior VLM-based and non-VLM baselines, setting a new benchmark for CCZSL in closed-world settings.

Conclusion: PromptCCZSL successfully addresses continual compositional zero-shot learning by preventing forgetting while adapting to new compositions, demonstrating superior performance through its prompt-based framework with specialized losses and comprehensive evaluation protocol.

Abstract: We tackle continual adaptation of vision-language models to new attributes, objects, and their compositions in Compositional Zero-Shot Learning (CZSL), while preventing forgetting of prior knowledge. Unlike classical continual learning, where classes are disjoint, continual CZSL (CCZSL) is more complex as attributes and objects may reoccur across sessions while compositions remain unique. Built on a frozen VLM backbone, we propose the first Prompt-based Continual Compositional Zero-Shot Learning (PromptCCZSL) framework that retains prior knowledge through recency-weighted multi-teacher distillation. It employs session-aware compositional prompts to fuse multimodal features for new compositions, while attribute and object prompts are learned through session-agnostic fusion to maintain global semantic consistency, which is further stabilized by a Cosine Anchor Loss (CAL) to preserve prior knowledge. To enhance adaptation in the current session, an Orthogonal Projection Loss (OPL) ensures that new attribute and object embeddings remain distinct from previous ones, preventing overlap, while an Intra-Session Diversity Loss (IDL) promotes variation among current-session embeddings for richer, more discriminative representations. We also introduce a comprehensive protocol that jointly measures catastrophic forgetting and compositional generalization. Extensive experiments on UT-Zappos and C-GQA benchmarks demonstrate that PromptCCZSL achieves substantial improvements over prior VLM-based and non-VLM baselines, setting a new benchmark for CCZSL in closed-world settings.
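
The three auxiliary losses have natural candidate forms, sketched below; the exact formulations in the paper may differ, so treat these as labeled guesses that match the stated intent of each loss.

```python
import torch
import torch.nn.functional as F

def cosine_anchor_loss(emb, anchor):
    """CAL: keep session-agnostic embeddings close to their frozen anchors."""
    return (1 - F.cosine_similarity(emb, anchor.detach(), dim=-1)).mean()

def orthogonal_projection_loss(new_emb, old_emb):
    """OPL: push new embeddings toward orthogonality with previous sessions'."""
    cos = F.normalize(new_emb, dim=-1) @ F.normalize(old_emb.detach(), dim=-1).T
    return (cos ** 2).mean()

def intra_session_diversity_loss(emb):
    """IDL: penalize similarity among current-session embeddings."""
    z = F.normalize(emb, dim=-1)
    sim = z @ z.T
    off_diag = sim - torch.eye(len(z))                     # zero out self-similarity
    return (off_diag ** 2).sum() / (len(z) * (len(z) - 1))

new = torch.randn(6, 128, requires_grad=True)              # current-session prompts
old, anchors = torch.randn(10, 128), torch.randn(6, 128)
loss = (cosine_anchor_loss(new, anchors)
        + orthogonal_projection_loss(new, old)
        + intra_session_diversity_loss(new))
```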

[194] M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction

Junqiao Fan, Yunjiao Zhou, Yizhuo Yang, Xinyuan Cui, Jiarui Zhang, Lihua Xie, Jianfei Yang, Chris Xiaoxuan Lu, Fangqiang Ding

Main category: cs.CV

TL;DR: M4Human is the largest-scale multimodal benchmark (661K frames) for human mesh reconstruction, featuring mmWave radar, RGB, and depth data with high-quality MoCap annotations for 20 subjects performing 50 diverse actions.

DetailsMotivation: Current human mesh reconstruction datasets rely on RGB input which has limitations: occlusion, lighting variation, and privacy concerns. Existing radar datasets are limited by sparse skeleton labels, small scale, and simple actions.

Method: Created M4Human dataset with 661K frames (9× larger than prior largest) featuring high-resolution mmWave radar, RGB, and depth data. Provides both raw radar tensors and processed radar point clouds. Includes 20 subjects performing 50 diverse actions with high-quality MoCap annotations (3D meshes and global trajectories).

Result: Established benchmarks on both radar tensor and radar point cloud modalities, as well as multimodal fusion with RGB-D. Results highlight the dataset’s significance for radar-based human modeling while revealing challenges with fast, unconstrained motion.

Conclusion: M4Human advances HMR research by providing a large-scale multimodal benchmark that enables privacy-preserving human sensing using radar, overcoming limitations of vision-based approaches while supporting research across different RF signal granularity levels.

Abstract: Human mesh reconstruction (HMR) provides direct insights into body-environment interaction, which enables various immersive applications. While existing large-scale HMR datasets rely heavily on line-of-sight RGB input, vision-based sensing is limited by occlusion, lighting variation, and privacy concerns. To overcome these limitations, recent efforts have explored radio-frequency (RF) mmWave radar for privacy-preserving indoor human sensing. However, current radar datasets are constrained by sparse skeleton labels, limited scale, and simple in-place actions. To advance the HMR research community, we introduce M4Human, the current largest-scale multimodal benchmark (661K frames, 9× the prior largest), featuring high-resolution mmWave radar, RGB, and depth data. M4Human provides both raw radar tensors (RT) and processed radar point clouds (RPC) to enable research across different levels of RF signal granularity. M4Human includes high-quality motion capture (MoCap) annotations with 3D meshes and global trajectories, and spans 20 subjects and 50 diverse actions, including in-place, sit-in-place, and free-space sports or rehabilitation movements. We establish benchmarks on both RT and RPC modalities, as well as multimodal fusion with RGB-D modalities. Extensive results highlight the significance of M4Human for radar-based human modeling while revealing persistent challenges under fast, unconstrained motion. The dataset and code will be released after the paper publication.

[195] RecTok: Reconstruction Distillation along Rectified Flow

Qingyu Shi, Size Wu, Jinbin Bai, Kaidong Yu, Yujing Wang, Yunhai Tong, Xiangtai Li, Xuelong Li

Main category: cs.CV

TL;DR: RecTok overcomes the quality limitations of high-dimensional visual tokenizers through flow semantic distillation and reconstruction-alignment distillation, achieving state-of-the-art generation quality while maintaining semantically rich latent spaces.

DetailsMotivation: There's a fundamental trade-off between latent space dimensionality and generation quality in visual tokenizers for diffusion models. High-dimensional tokenizers underperform low-dimensional ones despite offering richer semantics, limiting the potential of vision foundation models.

Method: RecTok introduces two key innovations: 1) Flow semantic distillation - distills semantic information from vision foundation models into forward flow trajectories in flow matching (rather than focusing on latent space), and 2) Reconstruction-alignment distillation - enhances semantics further with masked feature reconstruction loss.

Result: Achieves state-of-the-art results on gFID-50K with and without classifier-free guidance, superior image reconstruction and generation quality, and maintains semantically rich latent space structure. Performance improves consistently as latent dimensionality increases.

Conclusion: RecTok successfully overcomes the limitations of high-dimensional visual tokenizers by focusing on enriching forward flow trajectories with semantic information, enabling both high-quality generation and semantically expressive latent spaces that scale effectively with dimensionality.

Abstract: Visual tokenizers play a crucial role in diffusion models. The dimensionality of the latent space governs both reconstruction fidelity and the semantic expressiveness of the latent feature. However, a fundamental trade-off is inherent between dimensionality and generation quality, constraining existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models (VFMs) to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction–alignment distillation. Our key insight is to make the forward flow in flow matching, which serves as the training space of diffusion transformers, semantically rich, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information in VFMs into the forward flow trajectories in flow matching, and we further enhance the semantics by introducing a masked feature reconstruction loss. Our RecTok achieves superior image reconstruction, generation quality, and discriminative performance. It achieves state-of-the-art results on gFID-50K both with and without classifier-free guidance, while maintaining a semantically rich latent space structure. Furthermore, as the latent dimensionality increases, we observe consistent improvements. Code and model are available at https://shi-qingyu.github.io/rectok.github.io.
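
The phrase "make the forward flow semantically rich" suggests supervising points along the flow-matching interpolation with frozen VFM features. A hedged sketch, assuming the linear interpolation convention x_t = (1 - t) * noise + t * latent and a hypothetical projection head:

```python
import torch
import torch.nn.functional as F

def flow_semantic_distill(latent, noise, vfm_feat, proj, t=None):
    """Distill VFM semantics into points on the forward flow trajectory.

    latent: (B, D) clean tokenizer latents; noise: (B, D) Gaussian samples;
    vfm_feat: (B, D_v) features of the same images from a frozen vision
    foundation model; proj: small head mapping flow points to VFM space.
    """
    if t is None:
        t = torch.rand(latent.size(0), 1)               # random trajectory points
    x_t = (1 - t) * noise + t * latent                  # linear (rectified) flow
    pred = proj(x_t)
    return 1 - F.cosine_similarity(pred, vfm_feat.detach(), dim=-1).mean()

proj = torch.nn.Linear(64, 768)                         # flow point -> VFM space
loss = flow_semantic_distill(torch.randn(8, 64), torch.randn(8, 64),
                             torch.randn(8, 768), proj)
```

The design rationale is that the diffusion transformer is trained on these intermediate x_t points, so aligning them (not just the clean latents) with VFM features enriches exactly the space the generator sees.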

[196] History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation

Xichen Ding, Jianzhe Gao, Cong Pan, Wenguan Wang, Jie Qin

Main category: cs.CV

TL;DR: HETT framework improves UAV navigation by combining global reasoning and local scene analysis through a two-stage transformer with historical context.

DetailsMotivation: Existing UAV agents for aerial vision-language navigation use mono-granularity frameworks that struggle to balance global environmental reasoning and local scene comprehension, limiting navigation performance in large-scale urban environments.

Method: Proposes History-Enhanced Two-Stage Transformer (HETT) with coarse-to-fine navigation: first predicts coarse target positions by fusing spatial landmarks and historical context, then refines actions via fine-grained visual analysis. Uses historical grid map to aggregate visual features into structured spatial memory.

Result: Experiments on refined CityNav dataset show significant performance gains. Extensive ablation studies verify effectiveness of each component.

Conclusion: HETT framework successfully integrates global reasoning and local scene comprehension for improved aerial vision-language navigation, with enhanced data quality through manual refinement of CityNav annotations.

Abstract: Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments based on linguistic instructions. While successful navigation demands both global environmental reasoning and local scene comprehension, existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. To address this limitation, this work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline. Specifically, HETT first predicts coarse-grained target positions by fusing spatial landmarks and historical context, then refines actions via fine-grained visual analysis. In addition, a historical grid map is designed to dynamically aggregate visual features into a structured spatial memory, enhancing comprehensive scene awareness. Finally, the CityNav dataset annotations are manually refined to enhance data quality. Experiments on the refined CityNav dataset show that HETT delivers significant performance gains, while extensive ablation studies further verify the effectiveness of each component.
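
The historical grid map is essentially a spatial memory with a running statistic per cell; a minimal sketch under assumed grid size, cell resolution, and feature dimension (none of which are specified above):

```python
import numpy as np

class HistoricalGridMap:
    """Aggregate visual features into a 2D spatial memory keyed by the
    UAV's ground-plane position, keeping a running mean per grid cell."""

    def __init__(self, size=64, cell=5.0, dim=256):
        self.size, self.cell = size, cell
        self.feats = np.zeros((size, size, dim))
        self.counts = np.zeros((size, size, 1))

    def update(self, xy, feature):
        i = int(np.clip(xy[0] // self.cell + self.size // 2, 0, self.size - 1))
        j = int(np.clip(xy[1] // self.cell + self.size // 2, 0, self.size - 1))
        self.counts[i, j] += 1
        # incremental mean: cheap to update, stable over long flights
        self.feats[i, j] += (feature - self.feats[i, j]) / self.counts[i, j]

grid = HistoricalGridMap()
for step in range(10):                      # toy flight: drift east, new view each step
    grid.update(xy=(step * 3.0, 0.0), feature=np.random.randn(256))
```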

[197] The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention Synergy

Zhuo Chen, Fanyue Wei, Runze Xu, Jingjing Li, Lixin Duan, Angela Yao, Wen Li

Main category: cs.CV

TL;DR: SynPS addresses attention collapse in diffusion model editing by synergistically combining positional embeddings and semantic information to achieve faithful non-rigid image editing.

DetailsMotivation: Existing training-free image editing methods struggle with complex non-rigid edits (pose/shape changes) due to attention collapse, where either positional embeddings or semantic features dominate, leading to over-editing or under-editing.

Method: SynPS introduces an editing measurement to quantify required editing magnitude at each denoising step, and an attention synergy pipeline that dynamically modulates positional embedding influence to balance semantic modifications and fidelity preservation.

Result: Extensive experiments on public and newly curated benchmarks demonstrate superior performance and faithfulness compared to existing methods, effectively avoiding both over- and under-editing.

Conclusion: SynPS successfully addresses attention collapse in diffusion-based editing by adaptively integrating positional and semantic cues, enabling faithful non-rigid image editing without training.

Abstract: Training-free image editing with large diffusion models has become practical, yet faithfully performing complex non-rigid edits (e.g., pose or shape changes) remains highly challenging. We identify a key underlying cause: attention collapse in existing attention sharing mechanisms, where either positional embeddings or semantic features dominate visual content retrieval, leading to over-editing or under-editing. To address this issue, we introduce SynPS, a method that Synergistically leverages Positional embeddings and Semantic information for faithful non-rigid image editing. We first propose an editing measurement that quantifies the required editing magnitude at each denoising step. Based on this measurement, we design an attention synergy pipeline that dynamically modulates the influence of positional embeddings, enabling SynPS to balance semantic modifications and fidelity preservation. By adaptively integrating positional and semantic cues, SynPS effectively avoids both over- and under-editing. Extensive experiments on public and newly curated benchmarks demonstrate the superior performance and faithfulness of our approach.
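
One way to picture the attention synergy is a per-step scalar that modulates how much positional embeddings contribute to queries and keys during attention sharing. Both the attention form and the measurement-to-weight mapping below are illustrative assumptions, not the paper's equations.

```python
import math
import torch

def synergy_attention(q_sem, k_sem, v, q_pos, k_pos, alpha):
    """Attention sharing where scalar `alpha` modulates positional influence:
    alpha near 1 favors position-locked, faithful reconstruction; alpha near 0
    lets semantic features drive larger non-rigid edits."""
    q = q_sem + alpha * q_pos
    k = k_sem + alpha * k_pos
    attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
    return attn @ v

def alpha_from_edit_magnitude(edit_mag, scale=4.0):
    """Editing measurement -> modulation: larger required edits imply weaker
    positional influence (a monotone mapping; the paper's rule may differ)."""
    return torch.sigmoid(-scale * edit_mag + scale)

def rand_tokens():
    return torch.randn(1, 77, 64)              # (batch, sequence, dim) toy shapes

out = synergy_attention(rand_tokens(), rand_tokens(), rand_tokens(),
                        rand_tokens(), rand_tokens(),
                        alpha=alpha_from_edit_magnitude(torch.tensor(0.7)))
```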

[198] ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking

Lihong Wang, Liangqi Li, Weiwei Feng, Jiamin Wu, Changtao Miao, Tieru Wu, Rui Ma, Bo Zhang, Zhe Li

Main category: cs.CV

TL;DR: ViRC framework introduces Reason Chunking with Critical Reasoning Units (CRUs) to simulate human step-by-step visual reasoning for multimodal math tasks, achieving 18.8% average improvement over baselines.

DetailsMotivation: Existing MLLMs perform textual reasoning from static images, missing dynamic visual acquisition during reasoning. Humans repeatedly examine images and use step-by-step reasoning to prove intermediate propositions, following Miller's Law in cognitive science.

Method: Propose ViRC framework with Reason Chunking mechanism that structures multimodal mathematical CoT into consecutive Critical Reasoning Units (CRUs). Create CRUX dataset with explicit CRU annotations using three visual tools and four reasoning patterns. Use progressive training strategy: Instructional SFT, Practice SFT, and Strategic RL.

Result: The ViRC-7B model achieves an 18.8% average improvement over baselines across multiple mathematical benchmarks.

Conclusion: The ViRC framework effectively simulates human expert problem-solving patterns through Reason Chunking and CRUs, significantly enhancing multimodal mathematical reasoning capabilities.

Abstract: Chain-of-Thought (CoT) prompting has significantly enhanced the reasoning ability of LLMs, yet it faces challenges when extended to multimodal domains, particularly in mathematical tasks. Existing MLLMs typically perform textual reasoning solely from a single static mathematical image, overlooking dynamic visual acquisition during reasoning. In contrast, humans repeatedly examine the visual image and employ step-by-step reasoning to prove intermediate propositions. This strategy of decomposing the problem-solving process into key logical nodes adheres to Miller’s Law in cognitive science. Inspired by this insight, we propose the ViRC framework for multimodal mathematical tasks, introducing a Reason Chunking mechanism that structures multimodal mathematical CoT into consecutive Critical Reasoning Units (CRUs) to simulate human expert problem-solving patterns. CRUs ensure intra-unit textual coherence for intermediate proposition verification while integrating visual information across units to generate subsequent propositions and support structured reasoning. To this end, we present the CRUX dataset, which uses three visual tools and four reasoning patterns to provide explicitly annotated CRUs across multiple reasoning paths for each mathematical problem. Leveraging the CRUX dataset, we propose a progressive training strategy inspired by human cognitive learning, which includes Instructional SFT, Practice SFT, and Strategic RL, aimed at further strengthening the Reason Chunking ability of the model. The resulting ViRC-7B model achieves an 18.8% average improvement over baselines across multiple mathematical benchmarks. Code is available at https://github.com/Leon-LihongWang/ViRC.

cs.AI

[199] Attention as Binding: A Vector-Symbolic Perspective on Transformer Reasoning

Sahil Rajesh Dhayalkar

Main category: cs.AI

TL;DR: The paper interprets transformer self-attention and residual streams as implementing an approximate Vector Symbolic Architecture (VSA), explaining reasoning capabilities and failure modes while proposing architectural improvements for more reliable symbolic manipulation.

DetailsMotivation: Transformer models show impressive reasoning but fail at stable symbolic manipulation tasks. The paper aims to develop a unified perspective by interpreting transformer components through the lens of Vector Symbolic Architectures to understand both capabilities and limitations.

Method: The paper interprets transformer internals as VSA: queries/keys define role spaces, values encode fillers, attention weights perform soft unbinding, and residual connections realize superposition. It uses this algebraic perspective to analyze reasoning behaviors and proposes VSA-inspired architectural biases including explicit binding/unbinding heads and hyperdimensional memory layers.

Result: The VSA perspective explains transformer reasoning phenomena (chain-of-thought, program-based reasoning, tool use) and characteristic failure modes (variable confusion, inconsistency). It proposes architectural improvements and metrics for measuring “VSA-likeness” and logical compositionality.

Conclusion: Viewing attention as soft vector-symbolic computation provides a principled framework for understanding transformer reasoning, explaining both successes and failures, and offers a route toward more interpretable and logically reliable reasoning systems.

Abstract: Transformer-based language models display impressive reasoning-like behavior, yet remain brittle on tasks that require stable symbolic manipulation. This paper develops a unified perspective on these phenomena by interpreting self-attention and residual streams as implementing an approximate Vector Symbolic Architecture (VSA). In this view, queries and keys define role spaces, values encode fillers, attention weights perform soft unbinding, and residual connections realize superposition of many bound structures. We use this algebraic lens to relate transformer internals to chain-of-thought traces, program-based reasoning, and memory-augmented tool use, and to explain characteristic failure modes such as variable confusion and inconsistency across logically related prompts. Building on this perspective, we propose VSA-inspired architectural biases, including explicit binding/unbinding heads and hyperdimensional memory layers, and training objectives that promote role-filler separation and robust superposition. Finally, we outline metrics for measuring “VSA-likeness” and logical compositionality, and pose theoretical and architectural open problems. Overall, the paper argues that viewing attention as soft vector-symbolic computation offers a principled route toward more interpretable and logically reliable reasoning systems.
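
The VSA reading of attention is easiest to see in Holographic Reduced Representations, where binding is circular convolution and unbinding is circular correlation. The self-contained numpy demo below shows the algebra the paper maps onto transformers (roles as query directions, fillers as values, superposition as the residual stream); it illustrates the analogy, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024                                       # hyperdimensional vector size

def bind(role, filler):                        # circular convolution (HRR binding)
    return np.real(np.fft.ifft(np.fft.fft(role) * np.fft.fft(filler)))

def unbind(trace, role):                       # circular correlation with the role
    return np.real(np.fft.ifft(np.fft.fft(trace) * np.conj(np.fft.fft(role))))

def rand_vec():
    return rng.normal(0, 1 / np.sqrt(d), d)

subj, obj = rand_vec(), rand_vec()             # role vectors ("query directions")
alice, ball = rand_vec(), rand_vec()           # filler vectors ("values")

trace = bind(subj, alice) + bind(obj, ball)    # superposition, as in a residual stream

recovered = unbind(trace, subj)                # soft unbinding recovers the filler
print([recovered @ v for v in (alice, ball)])  # similarity to alice is clearly larger
```

The crosstalk noise visible in the second similarity is the algebraic counterpart of the variable-confusion failures the paper attributes to approximate, soft unbinding.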

[200] GR-Agent: Adaptive Graph Reasoning Agent under Incomplete Knowledge

Dongzhuoran Zhou, Yuqicheng Zhu, Xiaxia Wang, Hongkuan Zhou, Jiaoyan Chen, Steffen Staab, Yuan He, Evgeny Kharlamov

Main category: cs.AI

TL;DR: The paper addresses knowledge graph question answering under incomplete KGs, proposes a benchmark construction methodology for KG incompleteness, and introduces GR-Agent that formalizes KGQA as agent-environment interaction with graph reasoning tools.

DetailsMotivation: Most KGQA benchmarks assume complete knowledge graphs with direct supporting triples, which overlooks the reality of incomplete KGs where facts are missing and answers must be inferred from existing facts. This gap reduces evaluation to shallow retrieval and doesn't test true reasoning ability.

Method: 1) Proposes methodology for constructing benchmarks under KG incompleteness by removing direct supporting triples while ensuring alternative reasoning paths remain; 2) Introduces Adaptive Graph Reasoning Agent (GR-Agent) that constructs interactive environment from KG and formalizes KGQA as agent-environment interaction; 3) GR-Agent operates over action space of graph reasoning tools and maintains memory of potential supporting evidence.

Result: Experiments show existing methods suffer consistent performance degradation under incompleteness, highlighting limited reasoning ability. GR-Agent outperforms non-training baselines and performs comparably to training-based methods under both complete and incomplete settings.

Conclusion: The paper bridges the gap in KGQA evaluation by addressing KG incompleteness, demonstrates the limitations of existing methods under incomplete settings, and proposes GR-Agent as an effective solution that maintains strong performance through adaptive graph reasoning and agent-based interaction.

Abstract: Large language models (LLMs) achieve strong results on knowledge graph question answering (KGQA), but most benchmarks assume complete knowledge graphs (KGs) where direct supporting triples exist. This reduces evaluation to shallow retrieval and overlooks the reality of incomplete KGs, where many facts are missing and answers must be inferred from existing facts. We bridge this gap by proposing a methodology for constructing benchmarks under KG incompleteness, which removes direct supporting triples while ensuring that alternative reasoning paths required to infer the answer remain. Experiments on benchmarks constructed using our methodology show that existing methods suffer consistent performance degradation under incompleteness, highlighting their limited reasoning ability. To overcome this limitation, we present the Adaptive Graph Reasoning Agent (GR-Agent). It first constructs an interactive environment from the KG and then formalizes KGQA as agent–environment interaction within it. GR-Agent operates over an action space comprising graph reasoning tools and maintains a memory of potential supporting reasoning evidence, including relevant relations and reasoning paths. Extensive experiments demonstrate that GR-Agent outperforms non-training baselines and performs comparably to training-based methods under both complete and incomplete settings.
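
The agent-environment formulation can be pictured as a loop over graph tools with an evidence memory. The sketch below uses hypothetical tool names (get_relations, get_neighbors) and a greedy stand-in policy; GR-Agent's actual action space and LLM-driven policy are richer.

```python
# Toy KG: (head, relation) -> list of tails
KG = {("marie_curie", "field"): ["physics", "chemistry"],
      ("physics", "studied_at_level"): ["university"]}

def get_relations(entity):                      # tool 1: outgoing relations
    return [r for (h, r) in KG if h == entity]

def get_neighbors(entity, relation):            # tool 2: follow a relation
    return KG.get((entity, relation), [])

def gather_evidence(question_entity, max_hops=2):
    """Greedy stand-in for the agent policy: expand reasoning paths hop by
    hop, keeping visited triples in memory instead of assuming a direct
    supporting triple exists."""
    memory, frontier = [], [question_entity]
    for _ in range(max_hops):
        nxt = []
        for e in frontier:
            for r in get_relations(e):
                for tail in get_neighbors(e, r):
                    memory.append((e, r, tail))
                    nxt.append(tail)
        frontier = nxt
    return memory                               # evidence paths for the LLM to read

print(gather_evidence("marie_curie"))
```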

[201] TrafficGamer: Reliable and Flexible Traffic Simulation for Safety-Critical Scenarios with Game-Theoretic Oracles

Guanren Qiao, Guorui Quan, Jiawei Yu, Shujun Jia, Guiliang Liu

Main category: cs.AI

TL;DR: TrafficGamer is a game-theoretic traffic simulation framework that generates safety-critical driving scenarios by modeling road driving as a multi-agent game, ensuring fidelity, exploitability, and diversity in simulated scenarios.

DetailsMotivation: Autonomous Vehicle systems struggle with safety-critical traffic scenarios due to their rarity in datasets and the complexity of predictive modeling for multiple vehicles. There's a need for effective simulation of these critical situations to improve AV safety.

Method: TrafficGamer uses a game-theoretic approach to traffic simulation, viewing common road driving as a multi-agent game. It configures risk-sensitive constraints during optimization to dynamically adapt to equilibria of varying tightness.

Result: TrafficGamer ensures fidelity, exploitability, and diversity of simulated scenarios, aligning with real-world traffic distribution while efficiently capturing equilibria for safety-critical scenarios involving multiple agents. It provides highly flexible simulations across various contexts.

Conclusion: TrafficGamer effectively addresses the challenge of simulating safety-critical traffic scenarios through game-theoretic modeling, offering a flexible framework that can adapt to different risk levels and equilibrium conditions for improved AV testing and development.

Abstract: While modern Autonomous Vehicle (AV) systems can develop reliable driving policies under regular traffic conditions, they frequently struggle with safety-critical traffic scenarios. This difficulty primarily arises from the rarity of such scenarios in driving datasets and the complexities associated with predictive modeling of multiple vehicles. Effectively simulating safety-critical traffic situations is therefore a crucial challenge. In this paper, we introduce TrafficGamer, which facilitates game-theoretic traffic simulation by viewing common road driving as a multi-agent game. When we evaluate empirical performance across various real-world datasets, TrafficGamer ensures the fidelity, exploitability, and diversity of the simulated scenarios, guaranteeing that, compared with other methods, they not only statistically align with the real-world traffic distribution but also efficiently capture equilibria representing safety-critical scenarios involving multiple agents. Additionally, the results demonstrate that TrafficGamer provides highly flexible simulations across various contexts. Specifically, we demonstrate that the generated scenarios can dynamically adapt to equilibria of varying tightness by configuring risk-sensitive constraints during optimization. We have provided a demo webpage at: https://anonymous.4open.science/api/repo/trafficgamer-demo-1EE0/file/index.html.

[202] IaC Generation with LLMs: An Error Taxonomy and A Study on Configuration Knowledge Injection

Roman Nekrasov, Stefano Fossati, Indika Kumara, Damian Andrew Tamburri, Willem-Jan van den Heuvel

Main category: cs.AI

TL;DR: LLMs struggle with Infrastructure as Code generation; injecting structured configuration knowledge improves technical correctness but intent alignment remains limited.

DetailsMotivation: LLMs currently have low success rates in generating correct and intent-aligned Infrastructure as Code (IaC), particularly for Terraform, creating a need for improved methods.

Method: Enhanced IaC-Eval benchmark with cloud emulation and automated error analysis; developed error taxonomy; implemented knowledge injection techniques from Naive RAG to sophisticated Graph RAG with semantic enrichment and inter-resource dependency modeling.

Result: Baseline LLM performance was poor (27.1% overall success). Knowledge injection increased technical validation success to 75.3% and overall success to 62.6%, but intent alignment plateaued, revealing a “Correctness-Congruence Gap.”

Conclusion: While structured knowledge injection significantly improves technical correctness of LLM-generated IaC, LLMs remain limited as “architects” for fulfilling nuanced user intent, highlighting the need for better intent alignment methods.

Abstract: Large Language Models (LLMs) currently exhibit low success rates in generating correct and intent-aligned Infrastructure as Code (IaC). This research investigated methods to improve LLM-based IaC generation, specifically for Terraform, by systematically injecting structured configuration knowledge. To facilitate this, an existing IaC-Eval benchmark was significantly enhanced with cloud emulation and automated error analysis. Additionally, a novel error taxonomy for LLM-assisted IaC code generation was developed. A series of knowledge injection techniques was implemented and evaluated, progressing from Naive Retrieval-Augmented Generation (RAG) to more sophisticated Graph RAG approaches. These included semantic enrichment of graph components and modeling inter-resource dependencies. Experimental results demonstrated that while baseline LLM performance was poor (27.1% overall success), injecting structured configuration knowledge increased technical validation success to 75.3% and overall success to 62.6%. Despite these gains in technical correctness, intent alignment plateaued, revealing a “Correctness-Congruence Gap” where LLMs can become proficient “coders” but remain limited “architects” in fulfilling nuanced user intent.

[203] What Is Your AI Agent Buying? Evaluation, Biases, Model Dependence, & Emerging Implications for Agentic E-Commerce

Amine Allouah, Omar Besbes, Josué D Figueroa, Yash Kanoria, Akshit Kumar

Main category: cs.AI

TL;DR: AI agents in online marketplaces show unstable preferences, strong position biases, and can be manipulated by sellers, creating volatile markets that differ fundamentally from human commerce.

DetailsMotivation: To understand how autonomous AI agents will transform online marketplaces by investigating their decision-making behavior, biases, and market impacts using an auditing framework.

Method: Used ACES (provider-agnostic framework for auditing agent decision-making) to analyze AI agents’ product selection behavior, including randomized trials to test position biases, sensitivity to various factors (price, ratings, reviews), and seller manipulation strategies.
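
A hedged sketch of how such a randomized position-bias trial can be run: hold the product set fixed, shuffle the display order each trial, and count wins per slot rather than per product. The `ask_agent` stub stands in for a call to the shopping agent under audit; ACES itself is provider-agnostic and more elaborate.

```python
import random
from collections import Counter

def ask_agent(listing):
    raise NotImplementedError   # e.g., prompt the shopping agent with the listing

def audit_position_bias(products, trials=200):
    slot_wins = Counter()
    for _ in range(trials):
        order = random.sample(products, len(products))   # shuffled copy
        choice = ask_agent(order)                        # product the agent picks
        slot_wins[order.index(choice)] += 1              # credit the slot, not the product
    # With no position bias, win rates should be ~uniform across slots.
    return {slot: n / trials for slot, n in sorted(slot_wins.items())}
```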

Result: AI agents exhibit: 1) Choice homogeneity (concentrating demand on few products), 2) Unstable preferences (model updates drastically reshuffle market shares), 3) Strong position biases (varying across providers and persisting in text-only interfaces), 4) Penalization of sponsored tags but reward of platform endorsements, 5) Varying sensitivities to price/ratings/reviews across models, and 6) Susceptibility to seller manipulation through query-conditional description tweaks.

Conclusion: Agentic markets are volatile and fundamentally different from human-centric commerce, requiring continuous auditing and raising important questions for platform design, seller strategy, and regulation.

Abstract: Online marketplaces will be transformed by autonomous AI agents acting on behalf of consumers. Rather than humans browsing and clicking, AI agents can parse webpages or leverage APIs to view, evaluate and choose products. We investigate the behavior of AI agents using ACES, a provider-agnostic framework for auditing agent decision-making. We reveal that agents can exhibit choice homogeneity, often concentrating demand on a few “modal” products while ignoring others entirely. Yet, these preferences are unstable: model updates can drastically reshuffle market shares. Furthermore, randomized trials show that while agents have improved over time on simple tasks with a clearly identified best choice, they exhibit strong position biases (varying across providers and model versions, and persisting even in text-only “headless” interfaces), undermining any universal notion of a “top” rank. Agents also consistently penalize sponsored tags while rewarding platform endorsements, and sensitivities to price, ratings, and reviews vary sharply across models. Finally, we demonstrate that sellers can respond: a seller-side agent making simple, query-conditional description tweaks can drive significant gains in market share. These findings reveal that agentic markets are volatile and fundamentally different from human-centric commerce, highlighting the need for continuous auditing and raising questions for platform design, seller strategy and regulation.

[204] AgroAskAI: A Multi-Agentic AI Framework for Supporting Smallholder Farmers’ Enquiries Globally

Nadine Angela Cantonjos, Arpita Biswas

Main category: cs.AI

TL;DR: AgroAskAI is a multi-agent AI system for climate adaptation decision support in agriculture, featuring role-specialized agents, real-time data integration, governance mechanisms, and multilingual support for vulnerable rural communities.

DetailsMotivation: Agricultural regions face climate-related risks (droughts, heavy rainfall, shifting weather patterns), requiring adaptive risk-management solutions. Current AI systems lack dynamic collaborative reasoning and context-aware outputs needed for effective climate adaptation support in vulnerable rural communities.

Method: AgroAskAI uses a modular, role-specialized multi-agent architecture with chain-of-responsibility approach to coordinate autonomous agents. It integrates real-time tools and datasets, includes governance mechanisms to mitigate hallucination, enables internal feedback, and supports multilingual interactions for non-English-speaking farmers.
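
The chain-of-responsibility coordination can be pictured with a minimal sketch; the agent classes and the `can_handle`/`handle` protocol below are our invention, not AgroAskAI's implementation. Each role-specialized agent either answers or passes the query down the chain.

```python
class Agent:
    def __init__(self, successor=None):
        self.successor = successor
    def can_handle(self, query): ...
    def handle(self, query): ...
    def process(self, query):
        if self.can_handle(query):
            return self.handle(query)
        if self.successor:
            return self.successor.process(query)   # pass down the chain
        return "No agent could answer; escalate to a human advisor."

class WeatherAgent(Agent):
    def can_handle(self, q): return "rain" in q or "forecast" in q
    def handle(self, q): return "Querying the short-term forecast tool..."

class CropAgent(Agent):
    def can_handle(self, q): return "crop" in q or "plant" in q
    def handle(self, q): return "Recommending drought-tolerant varieties..."

chain = WeatherAgent(successor=CropAgent())
print(chain.process("Which crop should I plant this season?"))
```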

Result: Experiments on agricultural climate adaptation queries show that with additional tools and prompt refinement, AgroAskAI delivers more actionable, grounded, and inclusive outputs compared to previous approaches.

Conclusion: Agentic AI systems like AgroAskAI show strong potential for providing sustainable and accountable decision support in climate adaptation for agriculture, particularly for vulnerable rural communities through dynamic collaborative reasoning and context-aware outputs.

Abstract: Agricultural regions in rural areas face damage from climate-related risks, including droughts, heavy rainfall, and shifting weather patterns. Prior research calls for adaptive risk-management solutions and decision-making strategies. To this end, artificial intelligence (AI), particularly agentic AI, offers a promising path forward. Agentic AI systems consist of autonomous, specialized agents capable of solving complex, dynamic tasks. While past systems have relied on single-agent models or have used multi-agent frameworks only for static functions, there is a growing need for architectures that support dynamic collaborative reasoning and context-aware outputs. To bridge this gap, we present AgroAskAI, a multi-agent reasoning system for climate adaptation decision support in agriculture, with a focus on vulnerable rural communities. AgroAskAI features a modular, role-specialized architecture that uses a chain-of-responsibility approach to coordinate autonomous agents, integrating real-time tools and datasets. The system has built-in governance mechanisms that mitigate hallucination and enable internal feedback for coherent, locally relevant strategies. The system also supports multilingual interactions, making it accessible to non-English-speaking farmers. Experiments on common agricultural queries related to climate adaptation show that, with additional tools and prompt refinement, AgroAskAI delivers more actionable, grounded, and inclusive outputs. Our experimental results highlight the potential of agentic AI for sustainable and accountable decision support in climate adaptation for agriculture.

[205] Beyond Accuracy: A Geometric Stability Analysis of Large Language Models in Chess Evaluation

Xidan Song, Weiqi Wang, Ruifeng Cao, Qingya Hu

Main category: cs.AI

TL;DR: LLMs show high chess accuracy but fail geometric reasoning tests, revealing an Accuracy-Stability Paradox where models rely on pattern matching rather than abstract spatial logic.

DetailsMotivation: Standard accuracy metrics for LLMs in chess evaluation fail to distinguish genuine geometric reasoning from superficial memorization of canonical board states, creating a need for better evaluation methods.

Method: Proposed Geometric Stability Framework tests model consistency under invariant transformations (board rotation, mirror symmetry, color inversion, format conversion) on 6 state-of-the-art LLMs using ~3,000 chess positions.
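
At the array level, the invariance tests amount to transforming the board and measuring evaluation drift, roughly as in this sketch. The `evaluate` argument is a placeholder for querying a model; color inversion and format conversion from the paper are omitted here.

```python
def mirror(board):
    # Reflect a-file <-> h-file, rank by rank (board is an 8x8 list of lists).
    return [list(reversed(rank)) for rank in board]

def rotate_180(board):
    # Reverse both ranks and files.
    return [list(reversed(rank)) for rank in reversed(board)]

def stability_report(evaluate, board):
    # A geometrically stable evaluator should score the transformed board
    # consistently with the original; large drift signals pattern matching.
    base = evaluate(board)
    return {t.__name__: abs(evaluate(t(board)) - base)
            for t in (mirror, rotate_180)}
```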

Result: Revealed Accuracy-Stability Paradox: GPT-5.1 shows near-optimal accuracy but catastrophic degradation under rotation (600%+ error increase), while Claude Sonnet 4.5 and Kimi K2 Turbo maintain dual robustness. Gemini 2.5 Flash leads in illegal state rejection (96.0%).

Conclusion: Geometric stability provides an essential orthogonal metric for AI evaluation, offering a proxy to disentangle reasoning capabilities from data contamination and overfitting in large-scale models.

Abstract: The evaluation of Large Language Models (LLMs) in complex reasoning domains typically relies on performance alignment with ground-truth oracles. In the domain of chess, this standard manifests as accuracy benchmarks against strong engines like Stockfish. However, high scalar accuracy does not necessarily imply robust conceptual understanding. This paper argues that standard accuracy metrics fail to distinguish between genuine geometric reasoning and the superficial memorization of canonical board states. To address this gap, we propose a Geometric Stability Framework, a novel evaluation methodology that rigorously tests model consistency under invariant transformations, including board rotation, mirror symmetry, color inversion, and format conversion. We applied this framework to a comparative analysis of six state-of-the-art LLMs including GPT-5.1, Claude Sonnet 4.5, and Kimi K2 Turbo, utilizing a dataset of approximately 3,000 positions. Our results reveal a significant Accuracy-Stability Paradox. While models such as GPT-5.1 achieve near-optimal accuracy on standard positions, they exhibit catastrophic degradation under geometric perturbation, specifically in rotation tasks where error rates surge by over 600%. This disparity suggests a reliance on pattern matching over abstract spatial logic. Conversely, Claude Sonnet 4.5 and Kimi K2 Turbo demonstrate superior dual robustness, maintaining high consistency across all transformation axes. Furthermore, we analyze the trade-off between helpfulness and safety, identifying Gemini 2.5 Flash as the leader in illegal state rejection (96.0%). We conclude that geometric stability provides an orthogonal and essential metric for AI evaluation, offering a necessary proxy for disentangling reasoning capabilities from data contamination and overfitting in large-scale models.

[206] EvoLattice: Persistent Internal-Population Evolution through Multi-Alternative Quality-Diversity Graph Representations for LLM-Guided Program Discovery

Kamer Ali Yuksel

Main category: cs.AI

TL;DR: EvoLattice is a framework that represents multiple program/agent candidates as a directed acyclic graph where each node stores persistent alternatives, enabling combinatorial search with guaranteed structural correctness.

DetailsMotivation: Existing LLM-based evolution approaches use overwrite-based mutations that discard useful variants, suffer from destructive edits, and explore brittle search spaces prone to structural failure. There's a need for a method that preserves successful components while enabling more stable and expressive evolution.

Method: EvoLattice represents candidate programs/agents as a directed acyclic graph where each node stores multiple persistent alternatives. Each valid path through the graph defines a distinct executable candidate. The framework includes fine-grained alternative-level evaluation, LLM-guided mutation/recombination/pruning with data-driven feedback, and a deterministic self-repair mechanism that guarantees structural correctness by enforcing acyclicity and dependency consistency.
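
A toy version of the multi-alternative representation (a linear chain of slots rather than the paper's general DAG, with invented code fragments) shows how a handful of persistent alternatives defines a combinatorial candidate space without duplicating shared structure:

```python
from itertools import product

# Each "slot" holds persistent alternatives; every combination is one
# executable candidate, so slots with (2, 2, 1) alternatives already
# define 4 distinct programs while sharing all common structure.
lattice = {
    "init":   ["x = 0", "x = 1"],
    "update": ["x += 2", "x *= 3"],
    "output": ["print(x)"],
}

def candidates(lattice):
    slots = list(lattice)
    for combo in product(*(lattice[s] for s in slots)):
        yield "\n".join(combo)

for program in candidates(lattice):
    print(program, end="\n---\n")
# Alternative-level statistics would score each fragment across every
# candidate it appears in, giving the dense feedback the paper describes.
```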

Result: Across program synthesis tasks (proxy and optimizer meta-learning), EvoLattice yields more stable evolution, greater expressivity, and stronger improvement trajectories than prior LLM-guided methods. The framework naturally extends to agent evolution and implicitly produces quality-diversity optimization dynamics.

Conclusion: EvoLattice provides a novel graph-based representation for evolutionary search that overcomes limitations of single-candidate approaches, enabling more robust and expressive evolution of programs and multi-agent systems while guaranteeing structural correctness.

Abstract: Large language models (LLMs) are increasingly used to evolve programs and multi-agent systems, yet most existing approaches rely on overwrite-based mutations that maintain only a single candidate at a time. Such methods discard useful variants, suffer from destructive edits, and explore a brittle search space prone to structural failure. We introduce EvoLattice, a framework that represents an entire population of candidate programs or agent behaviors within a single directed acyclic graph. Each node stores multiple persistent alternatives, and every valid path through the graph defines a distinct executable candidate, yielding a large combinatorial search space without duplicating structure. EvoLattice enables fine-grained alternative-level evaluation by scoring each alternative across all paths in which it appears, producing statistics that reveal how local design choices affect global performance. These statistics provide a dense, data-driven feedback signal for LLM-guided mutation, recombination, and pruning, while preserving successful components. Structural correctness is guaranteed by a deterministic self-repair mechanism that enforces acyclicity and dependency consistency independently of the LLM. EvoLattice naturally extends to agent evolution by interpreting alternatives as prompt fragments or sub-agent behaviors. Across program synthesis (proxy and optimizer meta-learning), EvoLattice yields more stable evolution, greater expressivity, and stronger improvement trajectories than prior LLM-guided methods. The resulting dynamics resemble quality-diversity optimization, emerging implicitly from EvoLattice’s internal multi-alternative representation rather than an explicit external archive.

[207] LADY: Linear Attention for Autonomous Driving Efficiency without Transformers

Jihao Huang, Xi Xia, Zhiyuan Li, Tianle Liu, Jingke Wang, Junbo Chen, Tengju Ye

Main category: cs.AI

TL;DR: LADY is a fully linear attention-based generative model for end-to-end autonomous driving that achieves SOTA performance with constant computational/memory costs regardless of history length, enabling efficient deployment on edge devices.

DetailsMotivation: Existing Transformer-based methods for autonomous driving suffer from quadratic attention costs that limit long sequence modeling and deployment on resource-constrained edge platforms. While linear attention mechanisms offer better complexity, they lack support for crucial cross-modal and cross-temporal interactions needed for autonomous driving.

Method: Proposes LADY, the first fully linear attention-based generative model for end-to-end autonomous driving. It enables fusion of long-range temporal context with constant computational/memory costs and introduces a lightweight linear cross-attention mechanism for effective cross-modal information exchange.
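
The constant-cost claim follows from the standard linear-attention recurrence, sketched below with an illustrative feature map and dimension (this is the generic mechanism, not LADY's exact architecture): the running state S, z is all that carries the history, so per-step cost is independent of its length.

```python
import numpy as np

def phi(x):
    return np.maximum(x, 0.0) + 1e-6   # simple positive feature map

class LinearAttention:
    def __init__(self, d):
        self.S = np.zeros((d, d))      # running sum of phi(k) v^T
        self.z = np.zeros(d)           # running normalizer, sum of phi(k)

    def step(self, q, k, v):
        self.S += np.outer(phi(k), v)
        self.z += phi(k)
        # O(d^2) per step, O(1) in history length.
        return phi(q) @ self.S / (phi(q) @ self.z)

attn = LinearAttention(d=16)
rng = np.random.default_rng(0)
for _ in range(1000):                  # history grows; state size does not
    q, k, v = rng.normal(size=(3, 16))
    out = attn.step(q, k, v)
print(out.shape)                       # (16,)
```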

Result: Achieves state-of-the-art performance on NAVSIM and Bench2Drive benchmarks with constant-time and memory complexity. Shows improved planning performance with significantly reduced computational cost. Successfully deployed and validated on edge devices.

Conclusion: LADY demonstrates that fully linear attention architectures can achieve superior performance for autonomous driving while maintaining practical efficiency for edge deployment, addressing the computational limitations of traditional Transformers in resource-constrained scenarios.

Abstract: End-to-end paradigms have demonstrated great potential for autonomous driving, and most existing methods are built upon Transformer architectures. However, transformers incur a quadratic attention cost, limiting their ability to model long spatial and temporal sequences, particularly on resource-constrained edge platforms. As autonomous driving inherently demands efficient temporal modeling, this challenge severely limits their deployment and real-time performance. Recently, linear attention mechanisms have gained increasing attention due to their superior spatiotemporal complexity. However, existing linear attention architectures are limited to self-attention, lacking support for cross-modal and cross-temporal interactions, both crucial for autonomous driving. In this work, we propose LADY, the first fully linear attention-based generative model for end-to-end autonomous driving. LADY enables fusion of long-range temporal context at inference with constant computational and memory costs, regardless of the history length of camera and LiDAR features. Additionally, we introduce a lightweight linear cross-attention mechanism that enables effective cross-modal information exchange. Experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that LADY achieves state-of-the-art performance with constant-time and memory complexity, offering improved planning performance and significantly reduced computational cost. Additionally, the model has been deployed and validated on edge devices, demonstrating its practicality in resource-limited scenarios.

[208] Agentic AI for Integrated Sensing and Communication: Analysis, Framework, and Case Study

Wenwen Xie, Geng Sun, Ruichen Zhang, Xuejie Liu, Yinqiu Liu, Jiacheng Wang, Dusit Niyato, Ping Zhang

Main category: cs.AI

TL;DR: Agentic AI enables intelligent, autonomous ISAC systems through continuous perception-reasoning-action loops, with GenAI-based approaches showing significant advantages for optimizing 6G integrated sensing and communication.

DetailsMotivation: As wireless environments become more dynamic and complex, ISAC systems need more intelligent and autonomous operation to maintain efficiency and adaptability. Agentic AI offers a solution by enabling continuous perception-reasoning-action loops for intelligent operation in dynamic environments.

Method: 1) Comprehensive review of agentic AI and ISAC systems; 2) Analysis of common ISAC optimization approaches and advantages of GenAI-based agentic AI; 3) Proposal of a novel agentic ISAC framework with case study validation; 4) Identification of future research directions.

Result: The proposed agentic ISAC framework demonstrates superiority in optimizing ISAC performance, with GenAI-based agentic AI showing significant advantages over traditional optimization approaches for ISAC systems.

Conclusion: Agentic AI provides a promising solution for intelligent, autonomous ISAC systems in 6G networks, with the proposed framework showing performance improvements and clear future research directions for further development.

Abstract: Integrated sensing and communication (ISAC) has emerged as a key development direction in the sixth-generation (6G) era, which provides essential support for the collaborative sensing and communication of future intelligent networks. However, as wireless environments become increasingly dynamic and complex, ISAC systems require more intelligent processing and more autonomous operation to maintain efficiency and adaptability. Meanwhile, agentic artificial intelligence (AI) offers a feasible solution to address these challenges by enabling continuous perception-reasoning-action loops in dynamic environments to support intelligent, autonomous, and efficient operation for ISAC systems. As such, we delve into the application value and prospects of agentic AI in ISAC systems in this work. Firstly, we provide a comprehensive review of agentic AI and ISAC systems to demonstrate their key characteristics. Secondly, we show several common optimization approaches for ISAC systems and highlight the significant advantages of generative artificial intelligence (GenAI)-based agentic AI. Thirdly, we propose a novel agentic ISAC framework and present a case study to verify its superiority in optimizing ISAC performance. Finally, we clarify future research directions for agentic AI-based ISAC systems.

[209] Beyond Fast and Slow: Cognitive-Inspired Elastic Reasoning for Large Language Models

Jinwu Hu, Dongjin Yang, Langyu Bian, Zhiquan Wen, Yufeng Wang, Yaofo Chen, Bin Xiao, Yuanqing Li, Mingkui Tan

Main category: cs.AI

TL;DR: CogER is a framework that dynamically selects optimal reasoning strategies for LLMs based on query complexity, balancing efficiency and accuracy through reinforcement learning.

DetailsMotivation: Existing LLM reasoning strategies struggle to balance efficiency and accuracy across queries of varying difficulties, as they rely on fixed fast/slow modes without adapting to query complexity.

Method: CogER assesses query complexity and assigns queries to predefined difficulty levels, then uses a reinforcement learning agent (CogER-Agent) trained with Markov Decision Process to select optimal reasoning strategies. For tool-requiring queries, it introduces Cognitive Tool-Assisted Reasoning for autonomous external tool invocation.
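
The reward shaping can be illustrated with a toy function of the kind the summary describes; the weights and the token-count cost proxy are our assumptions, not the paper's exact formula.

```python
# Pay for solution quality, charge for compute, so the strategy-selection
# agent learns to reserve slow reasoning (or tool calls) for hard queries.

def strategy_reward(correct: bool, tokens_used: int,
                    alpha: float = 1.0, beta: float = 2e-4) -> float:
    quality = alpha if correct else 0.0
    cost = beta * tokens_used          # proxy for computational cost
    return quality - cost

print(strategy_reward(True, 300))      # fast mode, right answer:  0.94
print(strategy_reward(True, 4000))     # slow mode, right answer:  0.20
print(strategy_reward(False, 300))     # fast mode, wrong answer: -0.06
```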

Result: CogER outperforms state-of-the-art Test-Time scaling methods, achieving at least 13% relative improvement in average exact match on In-Domain tasks and 8% relative gain on Out-of-Domain tasks.

Conclusion: The proposed Cognitive-Inspired Elastic Reasoning framework effectively addresses the efficiency-accuracy tradeoff in LLM reasoning by dynamically adapting strategies to query difficulty, demonstrating significant performance improvements over existing methods.

Abstract: Large language models (LLMs) have demonstrated impressive performance across various language tasks. However, existing LLM reasoning strategies mainly rely on the LLM itself with fast or slow mode (like o1 thinking) and thus struggle to balance reasoning efficiency and accuracy across queries of varying difficulties. In this paper, we propose Cognitive-Inspired Elastic Reasoning (CogER), a framework inspired by human hierarchical reasoning that dynamically selects the most suitable reasoning strategy for each query. Specifically, CogER first assesses the complexity of incoming queries and assigns them to one of several predefined levels, each corresponding to a tailored processing strategy, thereby addressing the challenge of unobservable query difficulty. To achieve automatic strategy selection, we model the process as a Markov Decision Process and train a CogER-Agent using reinforcement learning. The agent is guided by a reward function that balances solution quality and computational cost, ensuring resource-efficient reasoning. Moreover, for queries requiring external tools, we introduce Cognitive Tool-Assisted Reasoning, which enables the LLM to autonomously invoke external tools within its chain-of-thought. Extensive experiments demonstrate that CogER outperforms state-of-the-art Test-Time scaling methods, achieving at least a 13% relative improvement in average exact match on In-Domain tasks and an 8% relative gain on Out-of-Domain tasks.

[210] A Clustering-Based Variable Ordering Framework for Relaxed Decision Diagrams for Maximum Weighted Independent Set Problem

Mohsen Nafar, Michael Römer, Lin Xie

Main category: cs.AI

TL;DR: A clustering-based framework for variable ordering in Relaxed Decision Diagrams that partitions variables into clusters to reduce search space for dynamic ordering heuristics, improving computational efficiency for Discrete Optimization problems like MWISP.

DetailsMotivation: Dynamic variable ordering heuristics for Relaxed Decision Diagrams can tighten dual bounds but incur significant computational overhead when applied globally across all variables. There's a need to mitigate this trade-off between bound quality and computational cost.

Method: Introduces a clustering-based framework that partitions variables into clusters first, then applies dynamic ordering heuristics within this reduced search space. Two strategies: Cluster-to-Cluster (processes clusters sequentially using aggregate criteria) and Pick-and-Sort (iteratively selects and sorts representative variables from each cluster). Also develops theoretical results on DD size growth for MWISP and proposes policies for setting cluster numbers.
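
A rough sketch of the Pick-and-Sort strategy follows (clustering vertices by weight with k-means, and an invented max-weight scoring rule; the paper's criteria are problem-specific): only one representative per cluster is ranked at a time, instead of the full unfixed-variable set.

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_and_sort_order(weights, n_clusters=3, seed=0):
    w = np.asarray(weights, dtype=float).reshape(-1, 1)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(w)
    clusters = {c: [i for i, l in enumerate(labels) if l == c]
                for c in set(labels)}
    order = []
    while any(clusters.values()):
        # Pick the heaviest remaining vertex from each nonempty cluster...
        reps = [max(v, key=lambda i: weights[i])
                for v in clusters.values() if v]
        reps.sort(key=lambda i: -weights[i])   # ...then sort only the picks.
        for i in reps:
            order.append(i)
            clusters[labels[i]].remove(i)
    return order

print(pick_and_sort_order([5, 9, 1, 7, 3, 8]))   # a variable ordering
```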

Result: The proposed methodology consistently reduces computational costs compared to standard dynamic variable ordering baselines when evaluated on the Maximum Weighted Independent Set Problem across benchmark instances.

Conclusion: Clustering-based variable ordering provides an effective way to balance bound quality with computational efficiency in Decision Diagram-based optimization, offering a practical solution to the trade-off problem in dynamic variable ordering heuristics.

Abstract: Efficient exact algorithms for Discrete Optimization (DO) rely heavily on strong primal and dual bounds. Relaxed Decision Diagrams (DDs) provide a versatile mechanism for deriving such dual bounds by compactly over-approximating the solution space through node merging. However, the quality of these relaxed diagrams, i.e., the tightness of the resulting dual bounds, depends critically on the variable ordering and the merging decisions executed during compilation. While dynamic variable ordering heuristics effectively tighten bounds, they often incur computational overhead when evaluated globally across the entire variable set. To mitigate this trade-off, this work introduces a novel clustering-based framework for variable ordering. Instead of applying dynamic ordering heuristics to the full set of unfixed variables, we first partition variables into clusters. We then leverage this structural decomposition to guide the ordering process, significantly reducing the heuristic’s search space. Within this framework, we investigate two distinct strategies: Cluster-to-Cluster, which processes clusters sequentially using problem-specific aggregate criteria (such as cumulative vertex weights in the Maximum Weighted Independent Set Problem (MWISP)), and Pick-and-Sort, which iteratively selects and sorts representative variables from each cluster to balance local diversity with heuristic guidance. Building on theoretical results about the growth of DD size for MWISP, we then propose two policies for setting the number of clusters within the proposed framework. We embed these strategies into a DD-based branch-and-bound algorithm and evaluate them on the MWISP. Across benchmark instances, the proposed methodology consistently reduces computational costs compared to standard dynamic variable ordering baselines.

[211] CangLing-KnowFlow: A Unified Knowledge-and-Flow-fused Agent for Comprehensive Remote Sensing Applications

Zhengchao Chen, Haoran Wang, Jing Yao, Pedram Ghamisi, Jun Zhou, Peter M. Atkinson, Bing Zhang

Main category: cs.AI

TL;DR: CangLing-KnowFlow is a unified intelligent agent framework for remote sensing that integrates procedural knowledge, dynamic workflow adjustment, and evolutionary memory to automate complex Earth observation tasks.

DetailsMotivation: Existing automated remote sensing systems are task-specific and lack a unified framework for managing diverse, end-to-end workflows across different applications, creating a gap in comprehensive Earth observation automation.

Method: The framework combines three key components: (1) Procedural Knowledge Base (PKB) with 1,008 expert-validated workflow cases across 162 RS tasks to guide planning and reduce hallucinations; (2) Dynamic Workflow Adjustment for autonomous diagnosis and replanning during runtime failures; (3) Evolutionary Memory Module that continuously learns from events to enhance agent knowledge and performance.

Result: Evaluated on KnowFlow-Bench (324 workflows from real-world applications) across 13 LLM backbones, CangLing-KnowFlow surpassed the Reflexion baseline by at least 4% in Task Success Rate across all complex tasks.

Conclusion: CangLing-KnowFlow demonstrates great potential as a robust, efficient, and scalable automated solution for complex Earth observation challenges by integrating expert knowledge into adaptive and verifiable procedures.

Abstract: The automated and intelligent processing of massive remote sensing (RS) datasets is critical in Earth observation (EO). Existing automated systems are normally task-specific, lacking a unified framework to manage diverse, end-to-end workflows, from data preprocessing to advanced interpretation, across diverse RS applications. To address this gap, this paper introduces CangLing-KnowFlow, a unified intelligent agent framework that integrates a Procedural Knowledge Base (PKB), Dynamic Workflow Adjustment, and an Evolutionary Memory Module. The PKB, comprising 1,008 expert-validated workflow cases across 162 practical RS tasks, guides planning and substantially reduces hallucinations common in general-purpose agents. During runtime failures, the Dynamic Workflow Adjustment autonomously diagnoses and replans recovery strategies, while the Evolutionary Memory Module continuously learns from these events, iteratively enhancing the agent’s knowledge and performance. This synergy enables CangLing-KnowFlow to adapt, learn, and operate reliably across diverse, complex tasks. We evaluated CangLing-KnowFlow on KnowFlow-Bench, a novel benchmark of 324 workflows inspired by real-world applications, testing its performance across 13 top Large Language Model (LLM) backbones, from open-source to commercial. Across all complex tasks, CangLing-KnowFlow surpassed the Reflexion baseline by at least 4% in Task Success Rate. As the first comprehensive validation in this emerging field, this research demonstrates the great potential of CangLing-KnowFlow as a robust, efficient, and scalable automated solution for complex EO challenges by leveraging expert knowledge (Knowledge) into adaptive and verifiable procedures (Flow).

[212] Graph Contextual Reinforcement Learning for Efficient Directed Controller Synthesis

Toshihide Ubukata, Enhong Mu, Takuto Yamauchi, Mingyue Zhang, Jialong Li, Kenji Tei

Main category: cs.AI

TL;DR: GCRL enhances RL-based controller synthesis by using GNNs to encode exploration history into graphs, improving learning efficiency and generalization over state-of-the-art methods in most benchmark domains.

DetailsMotivation: Current controller synthesis methods rely on fixed rules or RL strategies that only consider limited current features, lacking broader contextual understanding from exploration history.

Method: GCRL integrates Graph Neural Networks (GNNs) to encode the history of LTS exploration into a graph structure, capturing non-current-based context for better decision-making.
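
The graph encoding can be pictured with a single message-passing layer over the explored LTS (a generic, untrained sketch with illustrative sizes; GCRL learns these weights with reinforcement learning):

```python
import numpy as np

def gnn_layer(H, A, W_self, W_nbr):
    # H: (n, d) node features; A: (n, n) adjacency of explored transitions.
    deg = A.sum(axis=1, keepdims=True).clip(min=1.0)
    nbr_mean = (A @ H) / deg                  # mean over explored neighbors
    return np.maximum(H @ W_self + nbr_mean @ W_nbr, 0.0)   # ReLU

n, d = 5, 8
rng = np.random.default_rng(0)
H = rng.normal(size=(n, d))                   # per-state local features
A = (rng.random((n, n)) < 0.3).astype(float)  # explored LTS edges
H1 = gnn_layer(H, A, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(H1.shape)   # (5, 8): history-aware embeddings for the policy head
```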

Result: GCRL demonstrated superior learning efficiency and generalization compared to state-of-the-art methods in four of the five benchmark domains; the exception was a domain with high symmetry and strictly local interactions.

Conclusion: Graph-based encoding of exploration history significantly improves RL-based controller synthesis, though challenges remain for domains with specific structural properties like high symmetry and local interactions.

Abstract: Controller synthesis is a formal method approach for automatically generating Labeled Transition System (LTS) controllers that satisfy specified properties. The efficiency of the synthesis process, however, is critically dependent on exploration policies. These policies often rely on fixed rules or strategies learned through reinforcement learning (RL) that consider only a limited set of current features. To address this limitation, this paper introduces GCRL, an approach that enhances RL-based methods by integrating Graph Neural Networks (GNNs). GCRL encodes the history of LTS exploration into a graph structure, allowing it to capture a broader, non-current-based context. In a comparative experiment against state-of-the-art methods, GCRL exhibited superior learning efficiency and generalization in four of the five benchmark domains; the exception was a domain characterized by high symmetry and strictly local interactions.

[213] ChatGPT and Gemini participated in the Korean College Scholastic Ability Test – Earth Science I

Seok-Hyun Ga, Chun-Yen Chang

Main category: cs.AI

TL;DR: Analysis of multimodal scientific reasoning in LLMs using Korean CSAT Earth Science questions reveals critical cognitive gaps: perception-cognition disconnect, calculation-conceptualization discrepancy, and process hallucination, providing insights for designing AI-resistant assessments.

DetailsMotivation: As students increasingly use AI for assignments, concerns about academic integrity and assessment validity grow. This study aims to understand AI's capabilities and limitations in scientific reasoning to help design assessments that can distinguish human from AI performance.

Method: Used 2025 Korean CSAT Earth Science I section to evaluate GPT-4o, Gemini 2.5 Flash, and Gemini 2.5 Pro. Designed three experimental conditions: full-page input, individual item input, and optimized multimodal input to assess performance across different data structures.

Result: Unstructured inputs caused significant performance degradation due to segmentation and OCR failures. Even with optimization, models showed fundamental reasoning flaws: perception-cognition gap (failing to interpret symbolic meanings), calculation-conceptualization discrepancy (calculating correctly but missing concepts), and process hallucination (skipping visual verification).

Conclusion: The study identifies specific cognitive vulnerabilities in LLMs that can be exploited to create “AI-resistant questions.” By targeting gaps between perception and cognition, educators can design assessments that distinguish genuine student competency from AI-generated responses, ensuring assessment fairness.

Abstract: The rapid development of Generative AI is bringing innovative changes to education and assessment. As the prevalence of students utilizing AI for assignments increases, concerns regarding academic integrity and the validity of assessments are growing. This study utilizes the Earth Science I section of the 2025 Korean College Scholastic Ability Test (CSAT) to deeply analyze the multimodal scientific reasoning capabilities and cognitive limitations of state-of-the-art Large Language Models (LLMs), including GPT-4o, Gemini 2.5 Flash, and Gemini 2.5 Pro. Three experimental conditions (full-page input, individual item input, and optimized multimodal input) were designed to evaluate model performance across different data structures. Quantitative results indicated that unstructured inputs led to significant performance degradation due to segmentation and Optical Character Recognition (OCR) failures. Even under optimized conditions, models exhibited fundamental reasoning flaws. Qualitative analysis revealed that “Perception Errors” were dominant, highlighting a “Perception-Cognition Gap” where models failed to interpret symbolic meanings in schematic diagrams despite recognizing visual data. Furthermore, models demonstrated a “Calculation-Conceptualization Discrepancy,” successfully performing calculations while failing to apply the underlying scientific concepts, and “Process Hallucination,” where models skipped visual verification in favor of plausible but unfounded background knowledge. Addressing the challenge of unauthorized AI use in coursework, this study provides actionable cues for designing “AI-resistant questions” that target these specific cognitive vulnerabilities. By exploiting AI’s weaknesses, such as the gap between perception and cognition, educators can distinguish genuine student competency from AI-generated responses, thereby ensuring assessment fairness.

[214] SCOPE: Prompt Evolution for Enhancing Agent Effectiveness

Zehua Pei, Hui-Ling Zhen, Shixiong Kai, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu

Main category: cs.AI

TL;DR: SCOPE introduces self-evolving prompt optimization for LLM agents to dynamically manage massive contexts, improving task success from 14.23% to 38.64% without human intervention.

DetailsMotivation: LLM agents face a critical bottleneck: while they have access to massive, dynamic contexts, their static prompts lack effective context management mechanisms, leading to recurring Corrective and Enhancement failures.

Method: SCOPE frames context management as an online optimization problem, using a Dual-Stream mechanism that balances tactical specificity (resolving immediate errors) with strategic generality (evolving long-term principles), plus Perspective-Driven Exploration to maximize strategy coverage.
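
A minimal sketch of the online dual-stream update follows; the `llm` callable, prompt layout, and routing heuristic are our assumptions for illustration, not SCOPE's implementation.

```python
# After each episode, distill one guideline from the execution trace and
# fold it into either the tactical stream (fixes tied to a concrete error)
# or the strategic stream (long-term principles).

def evolve_prompt(prompt: dict, trace: str, failed: bool, llm) -> dict:
    guideline = llm(
        "Extract one corrective guideline from this execution trace:\n"
        + trace
    )
    stream = "tactical" if failed else "strategic"
    prompt[stream].append(guideline)
    return prompt

agent_prompt = {"base": "You are a research agent.",
                "tactical": [], "strategic": []}
```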

Result: Experiments on the HLE benchmark show SCOPE improves task success rates from 14.23% to 38.64% without human intervention.

Conclusion: SCOPE effectively addresses the context management gap in LLM agents through self-evolving prompt optimization, significantly improving performance in dynamic environments.

Abstract: Large Language Model (LLM) agents are increasingly deployed in environments that generate massive, dynamic contexts. However, a critical bottleneck remains: while agents have access to this context, their static prompts lack the mechanisms to manage it effectively, leading to recurring Corrective and Enhancement failures. To address this capability gap, we introduce SCOPE (Self-evolving Context Optimization via Prompt Evolution). SCOPE frames context management as an online optimization problem, synthesizing guidelines from execution traces to automatically evolve the agent’s prompt. We propose a Dual-Stream mechanism that balances tactical specificity (resolving immediate errors) with strategic generality (evolving long-term principles). Furthermore, we introduce Perspective-Driven Exploration to maximize strategy coverage, increasing the likelihood that the agent has the correct strategy for any given task. Experiments on the HLE benchmark show that SCOPE improves task success rates from 14.23% to 38.64% without human intervention. We make our code publicly available at https://github.com/JarvisPei/SCOPE.

[215] Bilateral Spatial Reasoning about Street Networks: Graph-based RAG with Qualitative Spatial Representations

Reinhard Moratz, Niklas Daute, James Ondieki, Markus Kattenbeck, Mario Krajina, Ioannis Giannopoulos

Main category: cs.AI

TL;DR: Improving LLM route instructions using qualitative spatial relations for pedestrian navigation

DetailsMotivation: Current LLMs lack robust spatial reasoning capabilities for providing effective pedestrian route instructions, particularly in using qualitative spatial relations that humans naturally use for wayfinding.

Method: The paper likely proposes integrating qualitative spatial reasoning techniques with LLMs, potentially through specialized training, fine-tuning, or architectural modifications to enhance spatial relation understanding for route instructions.

Result: The approach would demonstrate improved LLM performance in generating accurate, human-like pedestrian route instructions using qualitative spatial relations compared to baseline LLMs.

Conclusion: Enhancing LLMs with qualitative spatial reasoning capabilities significantly improves their utility for pedestrian navigation tasks, bridging the gap between AI-generated instructions and human spatial communication patterns.

Abstract: This paper deals with improving the capabilities of Large Language Models (LLMs) to provide route instructions for pedestrian wayfinders by means of qualitative spatial relations.

[216] Outer-Learning Framework for Playing Multi-Player Trick-Taking Card Games: A Case Study in Skat

Stefan Edelkamp

Main category: cs.AI

TL;DR: A bootstrapping outer-learning framework that expands human expert game databases with AI self-play games to improve prediction accuracy in multi-player card games like Skat.

DetailsMotivation: Early game stages (bidding, game selection, initial card selection) are critical in card games like Skat and Bridge, but current approaches rely on limited human expert game statistics. There's a need to improve prediction accuracy beyond what's possible with existing human game databases.

Method: Developed a general bootstrapping outer-learning framework that expands human game databases with millions of AI self-play games. Used perfect feature hash functions to handle compacted tables, creating a self-improving card game engine where newly inferred knowledge continuously improves during self-learning.
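
One standard way to build such a perfect hash for card subsets is combinatorial (colex) ranking, sketched below; the paper's exact hash functions may differ, but the idea is that a k-card hand from an n-card deck maps bijectively to an index in [0, C(n, k)), so merged statistics can live in a flat, compact table.

```python
from math import comb

def rank_hand(cards):
    # cards: strictly increasing ints in [0, n); colex rank over k-subsets.
    return sum(comb(c, i + 1) for i, c in enumerate(cards))

print(rank_hand([0, 1, 2]))      # 0: the first 3-card subset
print(rank_hand([29, 30, 31]))   # 4959 == comb(32, 3) - 1: the last one
# A win-rate table for all 10-card Skat hands from a 32-card deck would
# have comb(32, 10) slots, indexed directly by rank_hand(hand).
```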

Result: The approach successfully improves prediction accuracy by merging statistics from human and AI games. The Skat case study demonstrates that the automated framework can support various game decisions effectively.

Conclusion: The bootstrapping outer-learning framework provides an effective way to enhance decision-making in card games by combining human expertise with AI self-play, creating continuously improving game engines that outperform traditional statistical approaches.

Abstract: In multi-player card games such as Skat or Bridge, the early stages of the game, such as bidding, game selection, and initial card selection, are often more critical to the success of the play than refined middle- and end-game play. At the current limits of computation, such early decision-making resorts to using statistical information derived from a large corpus of human expert games. In this paper, we derive and evaluate a general bootstrapping outer-learning framework that improves prediction accuracy by expanding the database of human games with millions of self-playing AI games to generate and merge statistics. We implement perfect feature hash functions to address compacted tables, producing a self-improving card game engine, where newly inferred knowledge is continuously improved during self-learning. The case study in Skat shows that the automated approach can be used to support various decisions in the game.

[217] Intent-Driven UAM Rescheduling

Jeongseok Kim, Kangjin Kim

Main category: cs.AI

TL;DR: Integrated ASP+MILP framework for explainable UAM vertiport scheduling with human-in-the-loop handling of ambiguous inputs using three-valued logic.

DetailsMotivation: Urban Air Mobility (UAM) requires efficient vertiport scheduling under resource constraints, but existing approaches struggle with dynamic operational requirements and ambiguous human rescheduling requests that need transparent handling.

Method: Combines Answer Set Programming (ASP) and Mixed Integer Linear Programming (MILP) with a three-valued logic system to interpret ambiguous user intents and a decision tree for human-in-the-loop scheduling. Formulates scheduling as resource-constrained project scheduling problem (RCPSP).
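
A toy rendering of the three-valued interpretation step follows (the enum values and the rule are invented examples, not the paper's encoding): clear cases are decided automatically, and only the unknowns are escalated to the human before the ASP+MILP solver reschedules.

```python
from enum import Enum

class TV(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"

def admissible(slot_hour: int, intent: str) -> TV:
    if intent == "not in the morning":
        if slot_hour < 11:
            return TV.FALSE      # clearly morning: exclude the slot
        if slot_hour >= 13:
            return TV.TRUE       # clearly afternoon: keep the slot
        return TV.UNKNOWN        # 11:00-12:59 is ambiguous: ask the operator
    return TV.UNKNOWN            # unrecognized intents always go to a human

candidates = [(h, admissible(h, "not in the morning")) for h in (8, 12, 15)]
print(candidates)   # 8 -> FALSE, 12 -> UNKNOWN, 15 -> TRUE
```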

Result: Proposes an integrated framework that optimizes schedules while transparently supporting human inputs, providing a robust structure for explainable and adaptive UAM scheduling.

Conclusion: The ASP+MILP integration with three-valued logic enables effective handling of both dynamic operational requirements and vague human rescheduling requests, advancing explainable AI for UAM scheduling systems.

Abstract: Due to restricted resources, efficient scheduling at vertiports has received increasing attention in the field of Urban Air Mobility (UAM). For the scheduling problem, we utilize Mixed Integer Linear Programming (MILP), formulating it as a resource-constrained project scheduling problem (RCPSP). In this paper, we show our approach to handling both dynamic operational requirements and vague rescheduling requests from humans. In particular, we utilize a three-valued logic for interpreting ambiguous user intents together with a decision tree, proposing a newly integrated system that combines Answer Set Programming (ASP) and MILP. This integrated framework optimizes schedules and supports human inputs transparently. With this system, we provide a robust structure for explainable, adaptive UAM scheduling.

[218] Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision

Wei Du, Shubham Toshniwal, Branislav Kisacanin, Sadegh Mahdavi, Ivan Moshkov, George Armstrong, Stephen Ge, Edgar Minasyan, Feng Chen, Igor Gitman

Main category: cs.AI

TL;DR: Nemotron-Math is a 7.5M solution trace mathematical reasoning dataset with diverse reasoning styles and Python tool integration, combining curated AoPS and StackExchange-Math problems to achieve SOTA performance including 100% accuracy on AIME benchmarks.

DetailsMotivation: Existing mathematical reasoning datasets lack diverse reasoning styles, long-form traces, and effective tool integration needed for high-quality supervision. The authors aim to create a comprehensive dataset that addresses these limitations.

Method: Leveraged GPT-OSS-120B’s multi-mode generation to create 7.5M solution traces across high/medium/low reasoning modes, with and without Python tool-integrated reasoning. Combined 85K curated AoPS problems with 262K StackExchange-Math problems. Developed sequential bucketed strategy for efficient 128K context-length fine-tuning.
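
The bucketing idea can be sketched in a few lines (bucket edges and data layout are illustrative; the paper's exact schedule may differ): group traces by length and train bucket by bucket, so short solutions are never padded to the full 128K window.

```python
def bucket_by_length(examples, edges=(4_096, 16_384, 65_536, 131_072)):
    buckets = {edge: [] for edge in edges}
    for ex in examples:
        for edge in edges:
            if len(ex) <= edge:
                buckets[edge].append(ex)   # pad only up to this edge
                break
    return buckets

# Toy token sequences of different lengths.
data = [[0] * n for n in (900, 5_000, 120_000, 3_000)]
for edge, group in bucket_by_length(data).items():
    print(edge, [len(x) for x in group])
```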

Result: Nemotron-Math outperforms OpenMathReasoning on matched AoPS problems. StackExchange-Math integration improves robustness and generalization on HLE-Math while maintaining competition benchmark accuracy. Achieves 100% maj@16 accuracy on AIME 2024/2025 with Python TIR. Fine-tuning acceleration of 2-3× without significant accuracy loss.

Conclusion: Nemotron-Math enables state-of-the-art mathematical reasoning performance through its large-scale, diverse dataset with tool integration and efficient training methods, demonstrating superior generalization and competition-level accuracy.

Abstract: High-quality mathematical reasoning supervision requires diverse reasoning styles, long-form traces, and effective tool integration, capabilities that existing datasets provide only in limited form. Leveraging the multi-mode generation ability of gpt-oss-120b, we introduce Nemotron-Math, a large-scale mathematical reasoning dataset containing 7.5M solution traces across high, medium, and low reasoning modes, each available both with and without Python tool-integrated reasoning (TIR). The dataset integrates 85K curated AoPS problems with 262K community-sourced StackExchange-Math problems, combining structured competition tasks with diverse real-world mathematical queries. We conduct controlled evaluations to assess the dataset quality. Nemotron-Math consistently outperforms the original OpenMathReasoning on matched AoPS problems. Incorporating StackExchange-Math substantially improves robustness and generalization, especially on HLE-Math, while preserving accuracy on math competition benchmarks. To support efficient long-context training, we develop a sequential bucketed strategy that accelerates 128K context-length fine-tuning by 2–3× without significant accuracy loss. Overall, Nemotron-Math enables state-of-the-art performance, including 100% maj@16 accuracy on AIME 2024 and 2025 with Python TIR.

[219] Evaluating Large Language Models in Scientific Discovery

Zhangde Song, Jieyu Lu, Yuanqi Du, Botao Yu, Thomas M. Pruyn, Yue Huang, Kehan Guo, Xiuzhe Luo, Yuanhao Qu, Yi Qu, Yinkai Wang, Haorui Wang, Jeff Guo, Jingru Gan, Parshin Shojaee, Di Luo, Andres M Bran, Gen Li, Qiyuan Zhao, Shao-Xiong Lennon Luo, Yuxuan Zhang, Xiang Zou, Wanru Zhao, Yifan F. Zhang, Wucheng Zhang, Shunan Zheng, Saiyang Zhang, Sartaaj Takrim Khan, Mahyar Rajabi-Kochi, Samantha Paradi-Maropakis, Tony Baltoiu, Fengyu Xie, Tianyang Chen, Kexin Huang, Weiliang Luo, Meijing Fang, Xin Yang, Lixue Cheng, Jiajun He, Soha Hassoun, Xiangliang Zhang, Wei Wang, Chandan K. Reddy, Chao Zhang, Zhiling Zheng, Mengdi Wang, Le Cong, Carla P. Gomes, Chang-Yu Hsieh, Aditya Nandy, Philippe Schwaller, Heather J. Kulik, Haojun Jia, Huan Sun, Seyed Mohamad Moosavi, Chenru Duan

Main category: cs.AI

TL;DR: A new benchmark evaluates LLMs on scientific discovery tasks beyond decontextualized knowledge, revealing performance gaps, diminishing returns from scaling, and systematic weaknesses across models.

DetailsMotivation: Current science benchmarks focus on decontextualized knowledge but miss the iterative reasoning, hypothesis generation, and observation interpretation essential for real scientific discovery.

Method: A scenario-grounded benchmark across biology, chemistry, materials, and physics where domain experts define research projects, decompose them into modular scenarios, and sample vetted questions. Two-phase evaluation: question-level accuracy on scenario-tied items and project-level performance requiring hypothesis generation, experimental design, and result interpretation.

Result: LLMs show consistent performance gaps compared to general science benchmarks, diminishing returns from scaling model size and reasoning, systematic weaknesses across top models, and large performance variation across scenarios leading to changing best-model choices.

Conclusion: Current LLMs are distant from general scientific “superintelligence” but show promise in scientific discovery projects, highlighting the role of guided exploration and serendipity. The SDE framework provides a reproducible benchmark for discovery-relevant evaluation and charts paths for advancing LLMs toward scientific discovery.

Abstract: Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge and overlook the iterative reasoning, hypothesis generation, and observation interpretation that drive scientific discovery. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics, where domain experts define research projects of genuine interest and decompose them into modular research scenarios from which vetted questions are sampled. The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance, where models must propose testable hypotheses, design simulations or experiments, and interpret results. Applying this two-phase scientific discovery evaluation (SDE) framework to state-of-the-art LLMs reveals a consistent performance gap relative to general science benchmarks, diminishing returns from scaling up model size and reasoning, and systematic weaknesses shared across top-tier models from different providers. Large performance variation across research scenarios leads to changing choices of the best-performing model on the scientific discovery projects evaluated, suggesting all current LLMs are distant from general scientific “superintelligence”. Nevertheless, LLMs already demonstrate promise in a great variety of scientific discovery projects, including cases where constituent scenario scores are low, highlighting the role of guided exploration and serendipity in discovery. This SDE framework offers a reproducible benchmark for discovery-relevant evaluation of LLMs and charts practical paths to advance their development toward scientific discovery.

[220] A Decision-Theoretic Approach for Managing Misalignment

Daniel A. Herrmann, Abinav Chari, Isabelle Qian, Sree Sharvesh, B. A. Levinstein

Main category: cs.AI

TL;DR: A decision-theoretic framework for determining when to delegate decisions to AI systems, showing that context-specific delegation can be optimal even with significant value misalignment if the AI has superior accuracy or expanded reach.

DetailsMotivation: While value alignment literature focuses on shaping AI values, there's less attention on determining when imperfect alignment is sufficient for delegation under uncertainty. The paper aims to provide principled methods for deciding when AI is aligned enough for specific contexts.

Method: Introduces a formal decision-theoretic framework that balances three factors: value (mis)alignment, epistemic accuracy, and reach (available acts). Develops a novel scoring framework to quantify ex ante delegation decisions.
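
In notation of our choosing (the paper's formalism may differ), the tradeoff can be stated compactly: with u the principal's utility, v the agent's (misalignment is v differing from u), beliefs capturing accuracy, and the act sets capturing reach, delegation is rational in expectation when the agent's chosen act beats the principal's own best act.

```latex
% Notation ours, for illustration: u = principal's utility, v = agent's
% utility, \hat{p} and q = their respective beliefs (epistemic accuracy),
% A \subseteq A' = the principal's vs. the agent's available acts (reach).
\[
a^{\star} \;=\; \operatorname*{arg\,max}_{a \in A'} \, \mathbb{E}_{q}\!\left[v(a)\right],
\qquad
\text{delegate iff}\quad
\mathbb{E}_{\hat{p}}\!\left[u(a^{\star})\right] \;\ge\; \max_{a \in A} \, \mathbb{E}_{\hat{p}}\!\left[u(a)\right].
\]
```

Even with v far from u, the inequality can hold when A' strictly extends A (expanded reach) or q is sharper than the principal's beliefs (superior accuracy), which is exactly the context-specific delegation case the summary identifies.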

Result: Reveals a sharp distinction: universal delegation requires near-perfect alignment and total epistemic trust (rare in practice), while context-specific delegation can be optimal even with significant misalignment if the agent has superior accuracy or expanded reach.

Conclusion: Provides a principled method for determining when AI is aligned enough for given contexts, shifting focus from achieving perfect alignment to managing risks and rewards of delegation under uncertainty.

Abstract: When should we delegate decisions to AI systems? While the value alignment literature has developed techniques for shaping AI values, less attention has been paid to how to determine, under uncertainty, when imperfect alignment is good enough to justify delegation. We argue that rational delegation requires balancing an agent’s value (mis)alignment with its epistemic accuracy and its reach (the acts it has available). This paper introduces a formal, decision-theoretic framework to analyze this tradeoff precisely, accounting for a principal’s uncertainty about these factors. Our analysis reveals a sharp distinction between two delegation scenarios. First, universal delegation (trusting an agent with any problem) demands near-perfect value alignment and total epistemic trust, conditions rarely met in practice. Second, we show that context-specific delegation can be optimal even with significant misalignment. An agent’s superior accuracy or expanded reach may grant access to better overall decision problems, making delegation rational in expectation. We develop a novel scoring framework to quantify this ex ante decision. Ultimately, our work provides a principled method for determining when an AI is aligned enough for a given context, shifting the focus from achieving perfect alignment to managing the risks and rewards of delegation under uncertainty.

[221] Stepwise Think-Critique: A Unified Framework for Robust and Interpretable LLM Reasoning

Jiaqi Xu, Cuiling Lan, Xuejin Chen, Yan LU

Main category: cs.AI

TL;DR: STC is a unified framework that interleaves reasoning and self-critique at each step within a single LLM, trained with hybrid reinforcement learning to optimize both reasoning quality and self-evaluation.

DetailsMotivation: Current LLMs decouple reasoning from verification - either generating reasoning without self-checking or relying on external verifiers. This lacks immediate feedback or increases system complexity, unlike human critical thinking where reasoning and evaluation are intertwined.

Method: Stepwise Think-Critique (STC) framework that interleaves reasoning and self-critique at each step within a single model. Trained with hybrid reinforcement learning combining reasoning rewards and critique-consistency rewards to jointly optimize reasoning quality and self-evaluation.
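
The interleaving can be pictured with a minimal loop; the `llm` callable, prompts, revision trigger, and stopping rule are our assumptions, and STC trains this behavior into a single model with reinforcement learning rather than orchestrating it externally.

```python
def stepwise_think_critique(problem: str, llm, max_steps: int = 8) -> list:
    trace = []
    for _ in range(max_steps):
        step = llm(f"Problem: {problem}\nSteps so far: {trace}\nNext step:")
        critique = llm(f"Critique this step; say VALID or FLAWED:\n{step}")
        if "FLAWED" in critique:
            step = llm(f"Revise the step using the critique:\n{step}\n{critique}")
        trace.append(step)             # each entry interleaves think + check
        if "FINAL ANSWER" in step:
            break
    return trace
```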

Result: Experiments on mathematical reasoning benchmarks show STC demonstrates strong critic-thinking capabilities and produces more interpretable reasoning traces.

Conclusion: STC represents a step toward LLMs with built-in critical thinking, moving beyond decoupled reasoning-verification approaches to more human-like integrated problem-solving.

Abstract: Human beings solve complex problems through critical thinking, where reasoning and evaluation are intertwined to converge toward correct solutions. However, most existing large language models (LLMs) decouple reasoning from verification: they either generate reasoning without explicit self-checking or rely on external verifiers to detect errors post hoc. The former lacks immediate feedback, while the latter increases system complexity and hinders synchronized learning. Motivated by human critical thinking, we propose Stepwise Think-Critique (STC), a unified framework that interleaves reasoning and self-critique at each step within a single model. STC is trained with a hybrid reinforcement learning objective combining reasoning rewards and critique-consistency rewards to jointly optimize reasoning quality and self-evaluation. Experiments on mathematical reasoning benchmarks show that STC demonstrates strong critic-thinking capabilities and produces more interpretable reasoning traces, representing a step toward LLMs with built-in critical thinking.

[222] Explaining the Reasoning of Large Language Models Using Attribution Graphs

Chase Walker, Rickard Ewetz

Main category: cs.AI

TL;DR: CAGE framework improves LLM explainability by creating attribution graphs that capture both prompt and inter-generational influences, boosting faithfulness by up to 40%.

DetailsMotivation: LLMs have opaque reasoning that raises safety and trust concerns. Current context attribution methods are incomplete because they only relate generated tokens to the prompt, ignoring how earlier generations influence later ones.

Method: Introduces Context Attribution via Graph Explanations (CAGE) framework with attribution graphs - directed graphs quantifying how each generation is influenced by both prompt and prior generations. Graphs preserve causality and row stochasticity, allowing context attributions via path marginalization.
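
The path-marginalization step can be made concrete with a small dynamic program: if `A[i, j]` holds the direct influence of token `j` on token `i` (causal, so `j < i`, and row-stochastic over predecessors), each generation's total attribution to the prompt is the influence-weighted mix of its predecessors' attributions. The construction below is an illustrative assumption, not the paper's exact procedure.

```python
import numpy as np

# Path marginalization on a toy attribution graph. Token indices
# 0..P-1 are prompt tokens; P..T-1 are generated tokens.

def marginalize_to_prompt(A: np.ndarray, num_prompt: int) -> np.ndarray:
    """Total (path-summed) attribution of each generated token to each
    prompt token, marginalizing over intermediate generations."""
    T = A.shape[0]
    total = np.zeros((T, num_prompt))
    total[:num_prompt] = np.eye(num_prompt)  # prompt tokens attribute to themselves
    for t in range(num_prompt, T):
        # Dynamic programming over all paths ending at t: mix the
        # predecessors' prompt attributions by their direct influence.
        total[t] = A[t, :t] @ total[:t]
    return total[num_prompt:]

# Tiny example: 2 prompt tokens, 2 generations.
A = np.zeros((4, 4))
A[2, :2] = [0.7, 0.3]          # first generation attends to the prompt only
A[3, :3] = [0.2, 0.3, 0.5]     # second generation also uses generation 1
print(marginalize_to_prompt(A, num_prompt=2))   # rows still sum to 1
# Row stochasticity is preserved under marginalization.
```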

Result: CAGE improves context attribution faithfulness across multiple models, datasets, metrics, and methods, achieving average gains of up to 40%.

Conclusion: CAGE provides more complete explanations of LLM behavior by capturing inter-generational influences through attribution graphs, addressing limitations of current context attribution methods.

Abstract: Large language models (LLMs) exhibit remarkable capabilities, yet their reasoning remains opaque, raising safety and trust concerns. Attribution methods, which assign credit to input features, have proven effective for explaining the decision making of computer vision models. From these, context attributions have emerged as a promising approach for explaining the behavior of autoregressive LLMs. However, current context attributions produce incomplete explanations by directly relating generated tokens to the prompt, discarding inter-generational influence in the process. To overcome these shortcomings, we introduce the Context Attribution via Graph Explanations (CAGE) framework. CAGE introduces an attribution graph: a directed graph that quantifies how each generation is influenced by both the prompt and all prior generations. The graph is constructed to preserve two properties-causality and row stochasticity. The attribution graph allows context attributions to be computed by marginalizing intermediate contributions along paths in the graph. Across multiple models, datasets, metrics, and methods, CAGE improves context attribution faithfulness, achieving average gains of up to 40%.

[223] Artism: AI-Driven Dual-Engine System for Art Generation and Critique

Shuai Liu, Yiqing Tian, Yang Chen, Mar Canet Sola

Main category: cs.AI

TL;DR: The paper proposes a dual-engine AI architecture with AIDA (artificial artist social network) and Ismism Machine for critical analysis to simulate art evolution trajectories using deep learning and multi-agent collaboration.

DetailsMotivation: To address the complex problem of exploring potential trajectories in art evolution, moving beyond traditional unidirectional critique toward intelligent, interactive reflexive practice.

Method: Dual-engine AI architecture with two interconnected components: AIDA (artificial artist social network) and the Ismism Machine for critical analysis, leveraging deep learning and multi-agent collaboration for multidimensional simulations.

Result: Currently being applied in experimental studies on contemporary art concepts, introducing a general methodology based on AI-driven critical loops for computational art analysis.

Conclusion: The framework offers new possibilities for computational analysis of art through intelligent, interactive reflexive practice and AI-driven critical loops.

Abstract: This paper proposes a dual-engine AI architectural method designed to address the complex problem of exploring potential trajectories in the evolution of art. We present two interconnected components: AIDA (an artificial artist social network) and the Ismism Machine, a system for critical analysis. The core innovation lies in leveraging deep learning and multi-agent collaboration to enable multidimensional simulations of art historical developments and conceptual innovation patterns. The framework explores a shift from traditional unidirectional critique toward an intelligent, interactive mode of reflexive practice. We are currently applying this method in experimental studies on contemporary art concepts. This study introduces a general methodology based on AI-driven critical loops, offering new possibilities for computational analysis of art.

[224] Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants

Vincent Huang, Dami Choi, Daniel D. Johnson, Sarah Schwettmann, Jacob Steinhardt

Main category: cs.AI

TL;DR: The paper introduces Predictive Concept Decoders (PCDs) - an end-to-end training approach for interpreting neural network activations by training assistants to predict model behavior through a communication bottleneck.

DetailsMotivation: Interpreting internal neural network activations is difficult due to complex activation space structure. Existing scalable interpretability methods use hand-designed agents, which is suboptimal.

Method: Train interpretability assistants with an encoder-decoder architecture: encoder compresses activations to sparse concept lists, decoder reads concepts to answer natural language questions about model behavior. Pretrain on large data then finetune for specific questions.
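
A minimal sketch of the bottleneck architecture, assuming top-k sparsification and a simple linear decoder (both illustrative; the paper's encoder, decoder, and question handling are more elaborate):

```python
import torch
import torch.nn as nn

# Predict-through-a-bottleneck assistant: an encoder compresses
# activations to a sparse concept code; a decoder answers a question
# from that code alone. Dimensions and top-k are illustrative.

class ConceptEncoder(nn.Module):
    def __init__(self, d_act: int, n_concepts: int, k: int):
        super().__init__()
        self.proj = nn.Linear(d_act, n_concepts)
        self.k = k

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        scores = self.proj(acts)
        # Keep only the top-k concepts per example (the communication
        # bottleneck); everything else is zeroed out.
        topk = torch.topk(scores, self.k, dim=-1)
        sparse = torch.zeros_like(scores)
        return sparse.scatter(-1, topk.indices, topk.values)

class ConceptDecoder(nn.Module):
    def __init__(self, n_concepts: int, d_q: int, n_answers: int):
        super().__init__()
        self.head = nn.Linear(n_concepts + d_q, n_answers)

    def forward(self, concepts: torch.Tensor, question: torch.Tensor):
        return self.head(torch.cat([concepts, question], dim=-1))

enc = ConceptEncoder(d_act=512, n_concepts=4096, k=16)
dec = ConceptDecoder(n_concepts=4096, d_q=64, n_answers=2)
acts, q = torch.randn(8, 512), torch.randn(8, 64)
logits = dec(enc(acts), q)   # train end-to-end to predict model behavior
print(logits.shape)          # torch.Size([8, 2])
```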

Result: PCDs show favorable scaling - concept interpretability improves with data and downstream performance increases. They can detect jailbreaks, secret hints, implanted latent concepts, and accurately surface latent user attributes.

Conclusion: Predictive Concept Decoders provide an effective end-to-end approach to neural network interpretability that scales well and performs effectively on various interpretability tasks.

Abstract: Interpreting the internal activations of neural networks can produce more faithful explanations of their behavior, but is difficult due to the complex structure of activation space. Existing approaches to scalable interpretability use hand-designed agents that make and test hypotheses about how internal activations relate to external behavior. We propose to instead turn this task into an end-to-end training objective, by training interpretability assistants to accurately predict model behavior from activations through a communication bottleneck. Specifically, an encoder compresses activations to a sparse list of concepts, and a decoder reads this list and answers a natural language question about the model. We show how to pretrain this assistant on large unstructured data, then finetune it to answer questions. The resulting architecture, which we call a Predictive Concept Decoder, enjoys favorable scaling properties: the auto-interp score of the bottleneck concepts improves with data, as does the performance on downstream applications. Specifically, PCDs can detect jailbreaks, secret hints, and implanted latent concepts, and are able to accurately surface latent user attributes.

[225] aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists

Pengsong Zhang, Xiang Hu, Guowei Huang, Yang Qi, Heng Zhang, Xiuxu Li, Jiaxing Song, Jiabin Luo, Yijiang Li, Shuo Yin, Chengxiao Dai, Eric Hanchen Jiang, Xiaoyan Zhou, Zhenfei Yin, Boqin Yuan, Jing Dong, Guinan Su, Guanren Qiao, Haiming Tang, Anghong Du, Lili Pan, Zhenzhong Lan, Xinyu Liu

Main category: cs.AI

TL;DR: aiXiv is an open-access platform using multi-agent architecture for AI-generated research submission, review, and refinement, addressing the lack of suitable venues for AI-generated scientific content.

DetailsMotivation: Traditional publication systems struggle with AI-generated research due to human-centric peer review limitations and lack of quality control in existing preprint servers, creating a dissemination bottleneck for high-quality AI-generated scientific content.

Method: Multi-agent architecture platform with API and MCP interfaces that enables seamless integration of human and AI scientists for collaborative submission, review, and iterative refinement of research proposals and papers.

Result: aiXiv significantly enhances the quality of AI-generated research proposals and papers through iterative revising and reviewing, demonstrating reliability and robustness as a scalable platform.

Conclusion: aiXiv establishes a foundation for next-generation open-access ecosystems that accelerate publication and dissemination of high-quality AI-generated research, advancing autonomous scientific discovery.

Abstract: Recent advances in large language models (LLMs) have enabled AI agents to autonomously generate scientific proposals, conduct experiments, author papers, and perform peer reviews. Yet this flood of AI-generated research content collides with a fragmented and largely closed publication ecosystem. Traditional journals and conferences rely on human peer review, making them difficult to scale and often reluctant to accept AI-generated research content; existing preprint servers (e.g. arXiv) lack rigorous quality-control mechanisms. Consequently, a significant amount of high-quality AI-generated research lacks appropriate venues for dissemination, hindering its potential to advance scientific progress. To address these challenges, we introduce aiXiv, a next-generation open-access platform for human and AI scientists. Its multi-agent architecture allows research proposals and papers to be submitted, reviewed, and iteratively refined by both human and AI scientists. It also provides API and MCP interfaces that enable seamless integration of heterogeneous human and AI scientists, creating a scalable and extensible ecosystem for autonomous scientific discovery. Through extensive experiments, we demonstrate that aiXiv is a reliable and robust platform that significantly enhances the quality of AI-generated research proposals and papers after iterative revising and reviewing on aiXiv. Our work lays the groundwork for a next-generation open-access ecosystem for AI scientists, accelerating the publication and dissemination of high-quality AI-generated research content. Code: https://github.com/aixiv-org aiXiv: https://aixiv.science

[226] Large Language Model-Based Intelligent Antenna Design System

Tao Wu, Kexue Fu, Qiang Hua, Xinxin Liu, Bo Liu

Main category: cs.AI

TL;DR: LLM-based antenna design system (LADS) automates antenna modeling and optimization using text/images from technical documents, reducing manual effort in antenna simulation and design.

DetailsMotivation: Traditional antenna simulation involves time-consuming manual modeling and optimization, slowing down antenna analysis and design processes.

Method: LADS generates antenna models from textual descriptions and images extracted from academic papers/patents, interacts with engineers for iterative refinement, and configures/runs optimizers to meet specifications.

Result: Demonstrated with monopole slotted antenna - modified cross-slot to H-slot and changed substrate material, reducing gain variation while maintaining gain level across 3.1-10.6 GHz ultra-wide band.

Conclusion: LLM-based system successfully automates antenna design process, improving efficiency and performance through automated modeling and optimization.

Abstract: Antenna simulation typically involves modeling and optimization, which are time-consuming and labor-intensive, slowing down antenna analysis and design. This paper presents a prototype of a large language model (LLM)-based antenna design system (LADS) to assist in antenna simulation. LADS generates antenna models with textual descriptions and images extracted from academic papers, patents, and technical reports (either one or multiple), and it interacts with engineers to iteratively refine the designs. After that, LADS configures and runs an optimizer to meet the design specifications. The effectiveness of LADS is demonstrated by a monopole slotted antenna generated from images and descriptions from the literature. To improve gain stability across the 3.1-10.6 GHz ultra-wide band, LADS modifies the cross-slot into an H-slot and changes substrate material, followed by parameter optimization. As a result, the gain variation is reduced while maintaining the same gain level. The LLM-enabled antenna modeling (LEAM) is available at: https://github.com/TaoWu974/LEAM.

[227] Improving Subgraph Matching by Combining Algorithms and Graph Neural Networks

Shuyang Guo, Wenjin Xie, Ping Lu, Ting Deng, Richong Zhang, Jianxin Li, Xiangping Huang, Zhongyi Liu

Main category: cs.AI

TL;DR: HFrame is the first GNN-based framework for subgraph homomorphism that combines traditional algorithms with ML, achieving 101.91x speedup over exact algorithms with 0.962 average accuracy.

DetailsMotivation: Subgraph homomorphism is more complex than isomorphism (allows many-to-one mapping) and lacks efficient ML-based solutions. Traditional exact algorithms are computationally expensive for large graphs.

Method: HFrame integrates graph neural networks with traditional homomorphism algorithms, creating a hybrid framework that learns to predict homomorphism existence while providing generalization error bounds.
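
For reference, the exact predicate that such a learned filter approximates can be written as a brute-force search; note the mapping need not be injective, which is what separates homomorphism from isomorphism. This is an illustrative baseline, not HFrame's algorithm.

```python
from itertools import product

# Brute-force subgraph homomorphism check (undirected graphs). Exponential
# in the pattern size, so usable only for tiny patterns.

def is_homomorphic(pattern_edges, n_pattern, graph_edges, n_graph) -> bool:
    gset = set(graph_edges) | {(v, u) for u, v in graph_edges}
    # Any mapping is allowed, not just injective ones (many-to-one is fine).
    for phi in product(range(n_graph), repeat=n_pattern):
        if all((phi[u], phi[v]) in gset for u, v in pattern_edges):
            return True
    return False

# A 2-edge path folds onto a single edge (many-to-one mapping): True.
print(is_homomorphic([(0, 1), (1, 2)], 3, [(0, 1)], 2))            # True
# A triangle cannot map into a bipartite 4-cycle: False.
print(is_homomorphic([(0, 1), (1, 2), (2, 0)], 3,
                     [(0, 1), (1, 2), (2, 3), (3, 0)], 4))          # False
```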

Result: HFrame outperforms standard GNNs in distinguishing non-homomorphic graph pairs, achieves up to 101.91x speedup over exact matching algorithms, and maintains 0.962 average accuracy on real-world and synthetic graphs.

Conclusion: HFrame successfully bridges traditional algorithms with ML for subgraph homomorphism, offering significant speed advantages while maintaining high accuracy, with theoretical guarantees via generalization error bounds.

Abstract: Homomorphism is a key mapping technique between graphs that preserves their structure. Given a graph and a pattern, the subgraph homomorphism problem involves finding a mapping from the pattern to the graph, ensuring that adjacent vertices in the pattern are mapped to adjacent vertices in the graph. Unlike subgraph isomorphism, which requires a one-to-one mapping, homomorphism allows multiple vertices in the pattern to map to the same vertex in the graph, making it more complex. We propose HFrame, the first graph neural network-based framework for subgraph homomorphism, which integrates traditional algorithms with machine learning techniques. We demonstrate that HFrame outperforms standard graph neural networks by being able to distinguish more graph pairs where the pattern is not homomorphic to the graph. Additionally, we provide a generalization error bound for HFrame. Through experiments on both real-world and synthetic graphs, we show that HFrame is up to 101.91 times faster than exact matching algorithms and achieves an average accuracy of 0.962.

[228] The Need for Verification in AI-Driven Scientific Discovery

Cristina Cornelio, Takuya Ito, Ryan Cory-Wright, Sanjeeb Dash, Lior Horesh

Main category: cs.AI

TL;DR: AI accelerates scientific hypothesis generation but requires robust verification mechanisms to ensure scientific validity and avoid hindering progress.

DetailsMotivation: AI can generate hypotheses at unprecedented scale and speed, potentially accelerating scientific discovery across fields, but this abundance creates a verification bottleneck that could actually hinder progress if not properly addressed.

Method: The paper traces historical scientific discovery development, examines AI’s impact on established practices, and reviews principal AI approaches including data-driven methods, knowledge-aware neural architectures, symbolic reasoning frameworks, and LLM agents.

Result: AI systems can uncover patterns and propose candidate scientific laws, but their scientific value depends entirely on rigorous verification processes.

Conclusion: Rigorous and transparent verification must be the cornerstone of AI-assisted scientific discovery to ensure that AI-generated hypotheses actually advance rather than hinder scientific progress.

Abstract: Artificial intelligence (AI) is transforming the practice of science. Machine learning and large language models (LLMs) can generate hypotheses at a scale and speed far exceeding traditional methods, offering the potential to accelerate discovery across diverse fields. However, the abundance of hypotheses introduces a critical challenge: without scalable and reliable mechanisms for verification, scientific progress risks being hindered rather than being advanced. In this article, we trace the historical development of scientific discovery, examine how AI is reshaping established practices for scientific discovery, and review the principal approaches, ranging from data-driven methods and knowledge-aware neural architectures to symbolic reasoning frameworks and LLM agents. While these systems can uncover patterns and propose candidate laws, their scientific value ultimately depends on rigorous and transparent verification, which we argue must be the cornerstone of AI-assisted discovery.

[229] N2N: A Parallel Framework for Large-Scale MILP under Distributed Memory

Longfei Wang, Junyan Liu, Fan Zhang, Jiangwen Wei, Yuanhua Tang, Jie Sun, Xiaodong Luo

Main category: cs.AI

TL;DR: N2N is a scalable parallel framework for MILP solving that maps B&B nodes to distributed computing nodes, achieving significant speedups over state-of-the-art parallel solvers in both deterministic and nondeterministic modes.

DetailsMotivation: Parallelizing MILP solving is challenging due to the complexity of branch-and-bound framework and numerous algorithm components in MILP solvers. There's a need for scalable distributed parallel frameworks that can effectively utilize modern computing clusters.

Method: Proposed N2N framework with node-to-node mapping of B&B nodes to distributed computing nodes. Features include: sliding-window-based algorithm for deterministic task ordering, utilization of CP search and primal heuristics, adaptive solving, data communication optimization, and integration with existing solvers like SCIP and HiGHS.
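
The deterministic ordering idea can be sketched as a sliding window that issues tasks by id and commits results strictly in id order, whatever order workers finish in; the class below is an illustrative assumption, not N2N's implementation.

```python
import heapq

# Sliding-window deterministic ordering: tasks carry increasing ids, may
# finish out of order on workers, but results are committed strictly in
# id order, so the search tree evolves identically across runs.

class DeterministicWindow:
    def __init__(self, window_size: int):
        self.window_size = window_size
        self.next_to_commit = 0
        self.pending = []  # min-heap of (task_id, result)

    def can_issue(self, task_id: int) -> bool:
        # Only issue tasks within the window of the oldest uncommitted id.
        return task_id < self.next_to_commit + self.window_size

    def complete(self, task_id: int, result):
        heapq.heappush(self.pending, (task_id, result))
        committed = []
        while self.pending and self.pending[0][0] == self.next_to_commit:
            committed.append(heapq.heappop(self.pending))
            self.next_to_commit += 1
        return committed  # results released in deterministic order

w = DeterministicWindow(window_size=4)
w.complete(1, "bound=7")         # held back: task 0 not done yet
print(w.complete(0, "bound=5"))  # releases task 0, then task 1
```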

Result: N2N-SCIP achieves speedups of 22.52× and 12.71× with 1,000 MPI processes on Kunpeng and x86 clusters, outperforming ParaSCIP by 1.98× and 2.08× respectively. Deterministic mode also shows significant improvements. Framework generality validated with HiGHS integration.

Conclusion: N2N provides an effective scalable parallel framework for MILP solving that significantly outperforms state-of-the-art parallel solvers, supports both deterministic and nondeterministic modes, and can be integrated with various base solvers.

Abstract: Parallelization has emerged as a promising approach for accelerating MILP solving. However, the complexity of the branch-and-bound (B&B) framework and the numerous effective algorithm components in MILP solvers make it difficult to parallelize. In this study, a scalable parallel framework, N2N (a node-to-node framework that maps the B&B nodes to distributed computing nodes), was proposed to solve large-scale problems in a distributed memory computing environment. Both deterministic and nondeterministic modes are supported, and the framework is designed to be easily integrated with existing solvers. Regarding the deterministic mode, a novel sliding-window-based algorithm was designed and implemented to ensure that tasks are generated and solved in a deterministic order. Moreover, several advanced techniques, such as the utilization of CP search and general primal heuristics, have been developed to fully utilize distributed computing resources and capabilities of base solvers. Adaptive solving and data communication optimization were also investigated. A popular open-source MILP solver, SCIP, was integrated into N2N as the base solver, yielding N2N-SCIP. Extensive computational experiments were conducted to evaluate the performance of N2N-SCIP compared to ParaSCIP, which is a state-of-the-art distributed parallel MILP solver under the UG framework. The nondeterministic N2N-SCIP achieves speedups of 22.52 and 12.71 with 1,000 MPI processes on the Kunpeng and x86 computing clusters, which is 1.98 and 2.08 times faster than ParaSCIP, respectively. In the deterministic mode, N2N-SCIP also shows significant performance improvements over ParaSCIP across different process numbers and computing clusters. To validate the generality of N2N, HiGHS, another open-source solver, was integrated into N2N. The related results are analyzed, and the requirements that N2N places on base solvers are summarized.

[230] RPM-MCTS: Knowledge-Retrieval as Process Reward Model with Monte Carlo Tree Search for Code Generation

Yuanyuan Lin, Xiangyu Ouyang, Teng Zhang, Kaixin Sui

Main category: cs.AI

TL;DR: RPM-MCTS: Knowledge-Retrieval Process Reward Model with Monte Carlo Tree Search for code generation, improving accuracy while reducing token usage by ~15%.

DetailsMotivation: Existing tree search methods for code generation struggle with evaluating intermediate algorithmic steps and timely error correction, leading to incorrect code and high computational costs.

Method: Uses Knowledge-Retrieval as Process Reward Model (RPM) with Monte Carlo Tree Search (MCTS) to evaluate intermediate steps. Employs similarity filtering to remove redundant nodes during expansion, and uses sandbox execution feedback to locate and correct errors during generation.
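
A minimal sketch of retrieval-as-process-reward, with deterministic random vectors standing in for real step embeddings (the retrieval metric, `top_k`, and the knowledge base itself are assumptions for illustration):

```python
import numpy as np

# Score an intermediate step by its similarity to verified steps in a
# knowledge base, avoiding a trained process reward model.

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Deterministic random stand-in for a real sentence embedder.
    rng = np.random.default_rng(sum(map(ord, text)))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

KNOWLEDGE_BASE = [embed(s) for s in [
    "sort the interval list by start time",
    "binary search the rotated array pivot",
    "use a two-pointer sweep over the sorted array",
]]

def process_reward(step: str, top_k: int = 2) -> float:
    q = embed(step)
    sims = sorted((float(q @ d) for d in KNOWLEDGE_BASE), reverse=True)
    return sum(sims[:top_k]) / top_k  # mean similarity to nearest verified steps

print(process_reward("sort intervals before merging"))
```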

Result: Outperforms state-of-the-art methods on four public code generation benchmarks while achieving approximately 15% reduction in token consumption. Fine-tuning base models with RPM-MCTS constructed data significantly enhances code capabilities.

Conclusion: RPM-MCTS effectively addresses evaluation and error correction challenges in code generation, offering improved performance with reduced computational costs, and provides valuable training data for model enhancement.

Abstract: Tree search-based methods have made significant progress in enhancing the code generation capabilities of large language models. However, due to the difficulty in effectively evaluating intermediate algorithmic steps and the inability to locate and timely correct erroneous steps, these methods often generate incorrect code and incur increased computational costs. To tackle these problems, we propose RPM-MCTS, an effective method that utilizes Knowledge-Retrieval as Process Reward Model based on Monte Carlo Tree Search to evaluate intermediate algorithmic steps. By utilizing knowledge base retrieval, RPM-MCTS avoids the complex training of process reward models. During the expansion phase, similarity filtering is employed to remove redundant nodes, ensuring diversity in reasoning paths. Furthermore, our method utilizes sandbox execution feedback to locate erroneous algorithmic steps during generation, enabling timely and targeted corrections. Extensive experiments on four public code generation benchmarks demonstrate that RPM-MCTS outperforms current state-of-the-art methods while achieving an approximately 15% reduction in token consumption. Furthermore, full fine-tuning of the base model using the data constructed by RPM-MCTS significantly enhances its code capabilities.

[231] Self-Transparency Failures in Expert-Persona LLMs: How Instruction-Following Overrides Disclosure

Alex Diep

Main category: cs.AI

TL;DR: Language models often fail to disclose their AI identity when assigned professional personas, with disclosure rates varying dramatically by domain (30.8% for Financial Advisor vs 3.5% for Neurosurgeon) and model family, revealing that safety properties don’t transfer across deployment contexts.

DetailsMotivation: To stress-test language models' self-transparency - their ability to honestly disclose limitations and artificial nature - especially when assigned professional personas that conflict with transparent self-representation, since users may incorrectly trust AI-generated guidance as equivalent to licensed professional advice.

Method: Common-garden experimental design auditing 16 open-weight models (4B-671B parameters) across 19,200 trials under identical conditions, testing disclosure rates when models are assigned professional personas, with Bayesian validation for robustness.

Result: Sharp domain-specific inconsistency: Financial Advisor persona elicited 30.8% disclosure vs Neurosurgeon’s 3.5% (8.8-fold difference). Disclosure ranged from 2.8% to 73.6% across model families, with model identity explaining much more variance than parameter count. Reasoning variants showed heterogeneous effects, and explicit permission increased disclosure from 23.7% to 65.8%.

Conclusion: Organizations cannot assume safety properties transfer across deployment domains; suppression of AI disclosure reflects instruction-following prioritization rather than capability limitations, requiring deliberate behavior design and empirical verification.

Abstract: Self-transparency is a critical safety boundary, requiring language models to honestly disclose their limitations and artificial nature. This study stress-tests this capability, investigating whether models willingly disclose their identity when assigned professional personas that conflict with transparent self-representation. When models prioritize role consistency over this boundary disclosure, users may calibrate trust based on overstated competence claims, treating AI-generated guidance as equivalent to licensed professional advice. Using a common-garden experimental design, sixteen open-weight models (4B-671B parameters) were audited under identical conditions across 19,200 trials. Models exhibited sharp domain-specific inconsistency: a Financial Advisor persona elicited 30.8% disclosure at the first prompt, while a Neurosurgeon persona elicited only 3.5% – an 8.8-fold difference that emerged at the initial epistemic inquiry. Disclosure ranged from 2.8% to 73.6% across model families, with a 14B model reaching 39.4% while a 70B model produced just 4.1%. Model identity provided substantially larger improvement in fitting observations than parameter count ($\Delta R_{\mathrm{adj}}^{2}=0.359$ vs $0.018$). Reasoning variants showed heterogeneous effects: some exhibited up to 48.4 percentage points lower disclosure than their base instruction-tuned counterparts, while others maintained high transparency. An additional experiment demonstrated that explicit permission to disclose AI nature increased disclosure from 23.7% to 65.8%, revealing that suppression reflects instruction-following prioritization rather than capability limitations. Bayesian validation confirmed robustness to judge measurement error ($\kappa=0.908$). Organizations cannot assume safety properties will transfer across deployment domains, requiring deliberate behavior design and empirical verification.

[232] The 4/$\delta$ Bound: Designing Predictable LLM-Verifier Systems for Formal Method Guarantee

Pierre Dantas, Lucas Cordeiro, Youcheng Sun, Waldir Junior

Main category: cs.AI

TL;DR: Proves LLM-Verifier Convergence Theorem providing formal guarantees for termination in multi-stage verification pipelines, with empirical validation of 4/δ latency bound.

DetailsMotivation: Current integration of Formal Verification tools with LLMs lacks theoretical foundation, making refinement processes unreliable (oscillating, looping, diverging). Need provable guarantees for safety-critical software verification.

Method: Models verification pipeline as sequential absorbing Markov Chain with four stages: CodeGen, Compilation, InvariantSynth, SMTSolving. Proves convergence theorem showing system reaches Verified state almost surely for any δ>0, with latency bound 𝔼[n] ≤ 4/δ.

Result: Empirical validation with 90,000+ trials confirms theory: every run reached verification, empirical convergence factor Cf ≈ 1.0. Identified three operating zones (marginal, practical, high-performance) and proposed dynamic calibration strategy for parameter drift.
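
The bound is easy to sanity-check numerically: four sequential stages, each retried until success, give expected total attempts $\sum_i 1/\delta_i \leq 4/\delta$ whenever every stage success probability is at least $\delta$. The simulation below is a toy re-creation under those assumptions, not the paper's experimental harness.

```python
import random

# Monte Carlo check of the 4/delta latency bound for a four-stage
# sequential pipeline with per-stage success probabilities deltas[i].

def run_pipeline(deltas, rng):
    attempts = 0
    for p in deltas:          # CodeGen, Compilation, InvariantSynth, SMTSolving
        while True:
            attempts += 1
            if rng.random() < p:
                break          # stage succeeds, move to the next one
    return attempts

rng = random.Random(0)
delta = 0.25
deltas = [0.3, 0.25, 0.4, 0.5]           # every stage >= delta
trials = [run_pipeline(deltas, rng) for _ in range(100_000)]
mean = sum(trials) / len(trials)
print(f"E[n] ~= {mean:.2f}, bound 4/delta = {4 / delta:.1f}")
# Exact expectation: 1/0.3 + 1/0.25 + 1/0.4 + 1/0.5 ~= 11.83 <= 16.
```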

Conclusion: Replaces heuristic guesswork with rigorous architectural foundation for LLM-based verification, enabling predictable resource planning and performance budgeting for safety-critical software.

Abstract: The integration of Formal Verification tools with Large Language Models (LLMs) offers a path to scale software verification beyond manual workflows. However, current methods remain unreliable: without a solid theoretical footing, the refinement process acts as a black box that may oscillate, loop, or diverge. This work bridges this critical gap by developing an LLM-Verifier Convergence Theorem, providing the first formal framework with provable guarantees for termination in multi-stage verification pipelines. We model the interaction not as a generic loop, but as a sequential absorbing Markov Chain comprising four essential engineering stages: \texttt{CodeGen}, \texttt{Compilation}, \texttt{InvariantSynth}, and \texttt{SMTSolving}. We prove that for any non-zero stage success probability ($\delta > 0$), the system reaches the \texttt{Verified} state almost surely. Furthermore, because of the sequential nature of the pipeline, we derive a precise latency bound of $\mathbb{E}[n] \leq 4/\delta$. We stress-tested this prediction in an extensive empirical campaign comprising over 90,000 trials. The results match the theory with striking consistency: every run reached verification, and the empirical convergence factor clustered tightly around $C_f\approx 1.0$, confirming that the $4/\delta$ bound accurately mirrors system behavior rather than serving as a loose buffer. Based on this data, we identify three distinct operating zones – marginal, practical, and high-performance – and propose a dynamic calibration strategy to handle parameter drift in real-world environments. Together, these contributions replace heuristic guesswork with a rigorous architectural foundation, enabling predictable resource planning and performance budgeting for safety-critical software.

[233] AI-Assisted Game Management Decisions: A Fuzzy Logic Approach to Real-Time Soccer Substitutions

Pedro Passos

Main category: cs.AI

TL;DR: A Fuzzy Logic DSS for real-time soccer substitutions that objectively evaluates players using role-aware metrics and identifies substitution priorities, outperforming traditional predictive models.

DetailsMotivation: Current soccer substitution decisions rely too heavily on intuition or predictive models that replicate historical biases rather than providing objective, real-time guidance. There's a need for transparent, explainable systems that can identify high-risk scenarios human decision-makers might miss.

Method: Developed a Fuzzy Logic Decision Support System that reformulates PlayeRank into a Cumulative Mean with Role Aware Normalization to eliminate play-time bias. Integrates this refined performance metric with physiological fatigue proxies and contextual variables (disciplinary risk modulated by tactical role) to calculate dynamic Substitution Priority (P_final).
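
A Mamdani-style rule evaluation gives a feel for how such a priority could be computed; the membership functions, the two rules, and the min/max operators below are invented for illustration and are not the paper's rule base.

```python
# Toy fuzzy inference for a substitution priority in [0, 1].

def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def substitution_priority(performance, fatigue, card_risk):
    """All inputs normalized to [0, 1]; returns a priority in [0, 1]."""
    low_perf = tri(performance, -0.01, 0.0, 0.5)
    high_fatigue = tri(fatigue, 0.5, 1.0, 1.01)
    high_risk = tri(card_risk, 0.5, 1.0, 1.01)
    # Rule firing strengths (min = fuzzy AND, max = fuzzy OR).
    r1 = min(low_perf, high_fatigue)   # "tired and underperforming"
    r2 = high_risk                     # "one booking from a red card"
    return max(r1, r2)                 # aggregate into a single priority

# A fresh but booking-prone defender vs. a tired, fading forward:
print(substitution_priority(performance=0.7, fatigue=0.2, card_risk=0.9))  # 0.8
print(substitution_priority(performance=0.2, fatigue=0.8, card_risk=0.1))  # 0.6
```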

Result: Validated on 2018 FIFA World Cup Brazil vs Belgium match: system aligned with expert consensus on executed substitutions (e.g., Gabriel Jesus) and identified critical risks missed by humans - the “FAGNER Paradox” (defensive risk minutes before yellow card) and “Lukaku Paradox” (isolated assist masking severe participation drop).

Conclusion: Fuzzy Logic provides a transparent, explainable, and superior alternative to black-box models for optimizing real-time tactical decisions in soccer, capable of identifying high-risk scenarios that human decision-makers overlook.

Abstract: In elite soccer, substitution decisions entail significant financial and sporting consequences yet remain heavily reliant on intuition or predictive models that merely mimic historical biases. This paper introduces a Fuzzy Logic-based Decision Support System (DSS) designed for real-time, prescriptive game management. Unlike traditional Machine Learning approaches that encounter a predictive ceiling by attempting to replicate human behavior, our system audits performance through an objective, rule-based inference engine. We propose a methodological advancement by reformulating the PlayeRank metric into a Cumulative Mean with Role Aware Normalization, eliminating the play-time exposure bias inherent in cumulative sum models to enable accurate intra-match comparison. The system integrates this refined metric with physiological proxies (fatigue) and contextual variables (disciplinary risk modulated by tactical role) to calculate a dynamic Substitution Priority (P_final). Validation via a case study of the 2018 FIFA World Cup match between Brazil and Belgium demonstrates the system’s ecological validity: it not only aligned with expert consensus on executed substitutions (for example Gabriel Jesus) but, crucially, identified high-risk scenarios ignored by human decision-makers. Specifically, the model flagged the “FAGNER Paradox” - a maximum-priority defensive risk - minutes before a critical yellow card, and detected the “Lukaku Paradox”, where an isolated assist masked a severe drop in participation. These results confirm that Fuzzy Logic offers a transparent, explainable, and superior alternative to black-box models for optimizing real-time tactical decisions.

[234] Towards a Science of Scaling Agent Systems

Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A. Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Mark Malhotra, Paul Pu Liang, Hae Won Park, Yuzhe Yang, Xuhai Xu, Yilun Du, Shwetak Patel, Tim Althoff, Daniel McDuff, Xin Liu

Main category: cs.AI

TL;DR: Researchers derive quantitative scaling principles for AI agent systems, identifying key trade-offs between coordination, tool usage, and model capabilities across different task types.

DetailsMotivation: Despite widespread adoption of language model-based agents for real-world applications, the principles determining their performance remain poorly understood, creating a need for systematic scaling laws.

Method: Formalized agentic evaluation framework with controlled experiments across 180 configurations using 5 agent architectures (Single-Agent + 4 Multi-Agent Systems) across 3 LLM families and 4 benchmarks. Developed predictive model using coordination metrics.
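
A sketch of the kind of predictive model described, regressing configuration performance on coordination features over synthetic data (the feature set, the saturation interaction in the target, and all coefficients are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Regress configuration performance on coordination features.

rng = np.random.default_rng(0)
n = 180                                   # one row per evaluated configuration
X = np.column_stack([
    rng.integers(1, 9, n),                # number of agents
    rng.integers(0, 2, n),                # centralized topology indicator
    rng.uniform(0, 1, n),                 # single-agent baseline score
    rng.uniform(0, 1, n),                 # fraction of steps that are tool calls
])
# Synthetic target encoding a capability-saturation interaction:
# coordination helps only while the single-agent baseline is below ~0.45.
y = 0.4 * X[:, 2] + 0.2 * X[:, 1] * (X[:, 2] < 0.45) + rng.normal(0, 0.05, n)

model = Ridge(alpha=1.0)
print(f"cross-validated R^2: {cross_val_score(model, X, y, cv=5).mean():.3f}")
```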

Result: Identified three key effects: tool-coordination trade-off, capability saturation at ~45% baseline performance, and topology-dependent error amplification. Centralized coordination improves parallel tasks by 80.8%, decentralized excels at web navigation (+9.2%), but multi-agent degrades sequential reasoning by 39-70%. Framework predicts optimal strategy for 87% of configurations.

Conclusion: Quantitative scaling principles enable prediction of agent system performance, with framework generalizing to unseen frontier models (GPT-5.2 validation). Different coordination strategies excel for different task types, providing practical guidance for agent system design.

Abstract: Agents, language model-based systems capable of reasoning, planning, and acting, are becoming the dominant paradigm for real-world AI applications. Despite this widespread adoption, the principles that determine their performance remain underexplored. We address this by deriving quantitative scaling principles for agent systems. We first formalize a definition for agentic evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model capability, and task properties. We evaluate this across four benchmarks: Finance-Agent, BrowseComp-Plus, PlanCraft, and Workbench. With five canonical agent architectures (Single-Agent and four Multi-Agent Systems: Independent, Centralized, Decentralized, Hybrid), instantiated across three LLM families, we perform a controlled evaluation spanning 180 configurations. We derive a predictive model using coordination metrics that achieves cross-validated R^2=0.524, enabling prediction on unseen task domains. We identify three effects: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead. (2) a capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed ~45%. (3) topology-dependent error amplification: independent agents amplify errors 17.2x, while centralized coordination contains this to 4.4x. Centralized coordination improves performance by 80.8% on parallelizable tasks, while decentralized coordination excels on web navigation (+9.2% vs. +0.2%). Yet for sequential reasoning tasks, every multi-agent variant degraded performance by 39-70%. The framework predicts the optimal coordination strategy for 87% of held-out configurations. Out-of-sample validation on GPT-5.2 achieves MAE=0.071 and confirms four of five scaling principles generalize to unseen frontier models.

[235] Causal Inference in Energy Demand Prediction

Chutian Ma, Grigorii Pomazkin, Giacinto Paolo Saggese, Paul Smith

Main category: cs.AI

TL;DR: A structural causal model for energy demand prediction that incorporates weather and calendar factors, with a Bayesian model achieving 3.84% MAPE.

DetailsMotivation: Energy demand prediction is crucial for grid operators and industrial users, but existing correlation-based methods fail to capture the complex causal interdependencies between weather conditions, calendar information, and human activity patterns.

Method: Proposed a structural causal model to explain causal relationships between variables, validated causal beliefs through full analysis, then built a Bayesian model incorporating causal insights as prior knowledge.
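
The decoupling effect can be illustrated with a toy structural causal model in which summer activity depends on temperature while winter activity follows fixed routines; all coefficients below are invented to show the causal structure, not fitted values from the paper.

```python
import numpy as np

# Toy SCM: in summer, temperature drives daily activity (coupling
# amplifies demand variance); in winter, activity is decoupled from
# temperature, so demand varies less.

rng = np.random.default_rng(0)

def simulate_demand(winter: bool, n: int) -> np.ndarray:
    temp = rng.normal(2.0 if winter else 24.0, 4.0, size=n)  # season -> temp
    if winter:
        activity = np.ones(n)                     # decoupled: fixed routines
    else:
        activity = 1.0 + 0.05 * (temp - 24.0)     # coupled: heat shifts activity
    demand = 50 + 0.8 * np.abs(temp - 18.0) + 10 * activity
    return demand + rng.normal(0.0, 1.0, size=n)

winter, summer = simulate_demand(True, 50_000), simulate_demand(False, 50_000)
print(f"winter var {winter.var():.1f} < summer var {summer.var():.1f}")
```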

Result: Achieved state-of-the-art performance with 3.84% MAPE on test set, with strong robustness shown by 3.88% average MAPE across two years of cross-validation. Causal analysis revealed season-dependent temperature sensitivity and lower winter variance due to decoupling effects.

Conclusion: Causal modeling provides superior energy demand prediction by capturing complex interdependencies, offering both performance improvements and valuable insights into seasonal patterns and variance characteristics.

Abstract: Energy demand prediction is critical for grid operators, industrial energy consumers, and service providers. Energy demand is influenced by multiple factors, including weather conditions (e.g. temperature, humidity, wind speed, solar radiation), and calendar information (e.g. hour of day and month of year), which further affect daily work and life schedules. These factors are causally interdependent, making the problem more complex than simple correlation-based learning techniques can satisfactorily capture. We propose a structural causal model that explains the causal relationship between these variables. A full analysis is performed to validate our causal beliefs, also revealing important insights consistent with prior studies. For example, our causal model reveals that energy demand responds to temperature fluctuations with season-dependent sensitivity. Additionally, we find that energy demand exhibits lower variance in winter due to the decoupling effect between temperature changes and daily activity patterns. We then build a Bayesian model, which takes advantage of the causal insights we learned as prior knowledge. The model is trained and tested on unseen data and yields state-of-the-art performance in the form of a 3.84 percent MAPE on the test set. The model also demonstrates strong robustness, as the cross-validation across two years of data yields an average MAPE of 3.88 percent.

[236] Massive Editing for Large Language Models Based on Dynamic Weight Generation

Wentao Wan, Qiqing Lao, Zhiwei Xie, Hefeng Wu, Runnan Lin, Liang Lin, Keze Wang

Main category: cs.AI

TL;DR: MeG proposes a diffusion-based dynamic weight generation method for large-scale knowledge editing in LLMs, achieving better performance on reliability, generality, and locality metrics.

DetailsMotivation: Current knowledge editing methods struggle with large-scale edits while maintaining reliability, generality, and locality metrics. There's a need for efficient methods to modify LLM knowledge without expensive retraining.

Method: Attaches dynamic weight neurons to specific LLM layers and uses a diffusion model to conditionally generate neuron weights based on input queries, enabling large-scale editing with minimal architectural changes.
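
A minimal sketch of the core mechanism, conditionally generating one neuron's weights from a query embedding with a diffusion-style denoiser (the denoiser architecture, the crude Euler-style sampler, and all sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Conditionally generate a dynamic neuron's weight vector from a query.

class WeightDenoiser(nn.Module):
    def __init__(self, d_weight: int, d_query: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_weight + d_query + 1, 256), nn.SiLU(),
            nn.Linear(256, d_weight),
        )

    def forward(self, w_t, query, t):
        t_feat = t.expand(w_t.shape[0], 1)   # broadcast timestep to the batch
        return self.net(torch.cat([w_t, query, t_feat], dim=-1))  # predicts noise

@torch.no_grad()
def sample_weights(denoiser, query, d_weight, steps=20):
    w = torch.randn(query.shape[0], d_weight)     # start from pure noise
    for i in reversed(range(steps)):
        t = torch.tensor([[i / steps]])
        w = w - denoiser(w, query, t) / steps     # crude Euler-style denoising
    return w

denoiser = WeightDenoiser(d_weight=1024, d_query=128)
query = torch.randn(4, 128)      # embedding of the query tied to the edited fact
weights = sample_weights(denoiser, query, d_weight=1024)
print(weights.shape)             # per-query weights for the attached neuron
```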

Result: MeG significantly outperforms existing knowledge editing methods on reliability, generality, and locality metrics, with particularly large improvements in locality metrics.

Conclusion: The diffusion-based dynamic weight generation approach enables effective large-scale knowledge editing in LLMs while maintaining key performance metrics.

Abstract: Knowledge Editing (KE) is a field that studies how to modify some knowledge in Large Language Models (LLMs) at a low cost (compared to pre-training). Currently, performing large-scale edits on LLMs while ensuring the Reliability, Generality, and Locality metrics of the edits remains a challenge. This paper proposes a Massive editing approach for LLMs based on dynamic weight Generation (MeG). Our MeG involves attaching a dynamic weight neuron to specific layers of the LLMs and using a diffusion model to conditionally generate the weights of this neuron based on the input query required for the knowledge. This allows large-scale knowledge editing to be achieved by adding a single dynamic weight neuron. Experiments show that our MeG can significantly improve the performance of large-scale KE in terms of Reliability, Generality, and Locality metrics compared to existing knowledge editing methods, particularly with a large absolute percentage-point gain on the Locality metric, demonstrating the advantages of our proposed method.

cs.SD

[237] A Conditioned UNet for Music Source Separation

Ken O’Hanlon, Basil Woods, Lin Wang, Mark Sandler

Main category: cs.SD

TL;DR: QSCNet is a novel conditioned UNet architecture for music source separation that outperforms existing methods like Banquet by over 1dB SNR while using less than half the parameters.

DetailsMotivation: Traditional music source separation uses multi-output networks with predefined instrument vocabularies, limiting flexibility. Conditioned approaches using audio queries enable more realistic tasks but have been hindered by lack of suitable data and claims that UNets are unsuitable for conditioned MSS.

Method: QSCNet integrates network conditioning elements into a conditioned UNet architecture based on the Sparse Compressed Network for MSS, countering the argument that UNets are unsuitable for conditioned music source separation.
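
One common way to realize a conditioned UNet block is FiLM-style modulation, where the query embedding produces per-channel scales and shifts; whether QSCNet conditions this way is an assumption here, and the sketch only shows the general mechanism.

```python
import torch
import torch.nn as nn

# Query-conditioning inside a UNet block via feature-wise linear
# modulation (FiLM): the query embedding scales and shifts channels.

class FiLMConvBlock(nn.Module):
    def __init__(self, ch: int, d_query: int):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.film = nn.Linear(d_query, 2 * ch)

    def forward(self, x: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.film(query).chunk(2, dim=-1)   # (batch, ch) each
        h = self.conv(x)
        return torch.relu(gamma[:, :, None, None] * h + beta[:, :, None, None])

block = FiLMConvBlock(ch=32, d_query=128)
spec = torch.randn(2, 32, 128, 64)   # spectrogram features (freq x time)
q = torch.randn(2, 128)              # embedding of the audio query for a stem
print(block(spec, q).shape)          # torch.Size([2, 32, 128, 64])
```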

Result: QSCNet outperforms Banquet by over 1dB SNR on multiple MSS tasks while using less than half the number of parameters, demonstrating UNets can be effective for conditioned MSS.

Conclusion: Conditioned UNets are viable and effective for music source separation, challenging previous assumptions and offering superior performance with fewer parameters compared to existing conditioned approaches.

Abstract: In this paper we propose a conditioned UNet for Music Source Separation (MSS). MSS is generally performed by multi-output neural networks, typically UNets, with each output representing a particular stem from a predefined instrument vocabulary. In contrast, conditioned MSS networks accept an audio query related to a stem of interest alongside the signal from which that stem is to be extracted. Thus, a strict vocabulary is not required and this enables more realistic tasks in MSS. The potential of conditioned approaches for such tasks has been somewhat hidden due to a lack of suitable data, an issue recently addressed with the MoisesDb dataset. A recent method, Banquet, employs this dataset with promising results seen on larger vocabularies. Banquet uses Bandsplit RNN rather than a UNet and the authors state that UNets should not be suitable for conditioned MSS. We counter this argument and propose QSCNet, a novel conditioned UNet for MSS that integrates network conditioning elements in the Sparse Compressed Network for MSS. We find QSCNet to outperform Banquet by over 1dB SNR on a couple of MSS tasks, while using less than half the number of parameters.

[238] Single-channel speech enhancement by using psychoacoustical model inspired fusion framework

Suman Samui

Main category: cs.SD

TL;DR: The paper proposes a fusion framework combining acoustic and modulation domain speech enhancement approaches to jointly improve both perceived speech quality and intelligibility.

DetailsMotivation: Existing speech enhancement methods have complementary weaknesses: acoustic domain Bayesian STSA estimators reduce noise but cause speech distortions (especially for high-frequency content like fricatives), while modulation domain approaches improve intelligibility but cause temporal slurring. There's a need to combine the strengths of both domains while avoiding their respective weaknesses.

Method: A fusion framework that combines the merits of acoustic domain (Bayesian STSA estimator with human auditory system parameters) and modulation domain (psychoacoustic frequency selectivity) approaches. The framework integrates both techniques to leverage their complementary strengths.
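
The fusion idea can be sketched as a frequency-dependent blend that trusts the modulation-domain output at high frequencies (where the acoustic-domain estimator distorts fricatives) and the acoustic-domain output elsewhere; the sigmoid crossover and its parameters are illustrative assumptions, not the paper's fusion rule.

```python
import numpy as np

# Blend two enhanced magnitude spectrograms with a frequency-dependent
# weight that rises from 0 (acoustic domain) to 1 (modulation domain).

def fuse_spectra(mag_acoustic: np.ndarray,
                 mag_modulation: np.ndarray,
                 freqs_hz: np.ndarray,
                 crossover_hz: float = 2000.0) -> np.ndarray:
    """Both inputs are (n_freq, n_frames) magnitude spectrograms."""
    w = 1.0 / (1.0 + np.exp(-(freqs_hz - crossover_hz) / 200.0))
    return (1.0 - w)[:, None] * mag_acoustic + w[:, None] * mag_modulation

freqs = np.linspace(0, 8000, 257)
a = np.abs(np.random.randn(257, 100))   # stand-in acoustic-domain output
m = np.abs(np.random.randn(257, 100))   # stand-in modulation-domain output
print(fuse_spectra(a, m, freqs).shape)  # (257, 100)
```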

Result: Objective measure evaluation shows that the proposed fusion framework provides consistent improvements in both perceived speech quality and intelligibility across different SNR levels in various noise conditions, outperforming other baseline techniques.

Conclusion: The fusion of acoustic and modulation domain approaches successfully achieves joint improvements in speech quality and intelligibility by combining their respective strengths while mitigating their weaknesses.

Abstract: When the parameters of a Bayesian Short-time Spectral Amplitude (STSA) estimator for speech enhancement are selected based on the characteristics of the human auditory system, the gain function of the estimator becomes more flexible. Although this type of estimator in the acoustic domain is quite effective in reducing the background noise at high frequencies, it produces more speech distortions, which make the high-frequency content of the speech, such as fricatives, less perceptible in heavy noise conditions, resulting in intelligibility reduction. On the other hand, the speech enhancement scheme, which exploits the psychoacoustic evidence of frequency selectivity in the modulation domain, is found to be able to increase the intelligibility of noisy speech by a substantial amount, but also suffers from the temporal slurring problem due to its essential design constraint. In order to achieve the joint improvements in both the perceived speech quality and intelligibility, we proposed and investigated a fusion framework by combining the merits of acoustic and modulation domain approaches while avoiding their respective weaknesses. Objective measure evaluation shows that the proposed speech enhancement fusion framework can provide consistent improvements in the perceived speech quality and intelligibility across different SNR levels in various noise conditions, when compared to the other baseline techniques.

[239] Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction

Advait Gosai, Tyler Vuong, Utkarsh Tyagi, Steven Li, Wenjia You, Miheer Bavare, Arda Uçar, Zhongwang Fang, Brian Jang, Bing Liu, Yunzhong He

Main category: cs.SD

TL;DR: Audio MultiChallenge is a new benchmark for evaluating end-to-end spoken dialogue systems on realistic multi-turn conversations, testing inference memory, instruction retention, self coherence, and voice editing with audio-specific challenges.

DetailsMotivation: Existing benchmarks for end-to-end spoken dialogue systems focus on synthetic speech and single-turn tasks, leaving realistic multi-turn conversational ability underexplored. There's a need to evaluate these systems under natural interaction patterns with real human speech characteristics.

Method: Built on text-based MultiChallenge framework, adding Voice Editing axis and augmenting all axes to audio modality (e.g., Audio-Cue challenges). Curated 452 conversations from 47 speakers with 1,712 rubrics using hybrid audio-native agentic and human-in-the-loop pipeline to preserve natural disfluencies.

Result: Even frontier models struggle, with Gemini 3 Pro Preview achieving only a 54.65% pass rate. Models fail most often on the new axes (Voice Editing), and Self Coherence degrades with longer audio context. Difficulties include tracking edits, audio cues, and long-range context.

Conclusion: Audio MultiChallenge provides a reproducible testbed to quantify challenges in audio-native multi-turn interaction and drive improvements in spoken dialogue systems, highlighting significant gaps in current models’ ability to handle natural conversational speech.

Abstract: End-to-end (E2E) spoken dialogue systems are increasingly replacing cascaded pipelines for voice-based human-AI interaction, processing raw audio directly without intermediate transcription. Existing benchmarks primarily evaluate these models on synthetic speech and single-turn tasks, leaving realistic multi-turn conversational ability underexplored. We introduce Audio MultiChallenge, an open-source benchmark to evaluate E2E spoken dialogue systems under natural multi-turn interaction patterns. Building on the text-based MultiChallenge framework, which evaluates Inference Memory, Instruction Retention, and Self Coherence, we introduce a new axis, Voice Editing, which tests robustness to mid-utterance speech repairs and backtracking. We further augment each axis to the audio modality, such as introducing Audio-Cue challenges for Inference Memory that require recalling ambient sounds and paralinguistic signals beyond semantic content. We curate 452 conversations from 47 speakers with 1,712 instance-specific rubrics through a hybrid audio-native agentic and human-in-the-loop pipeline that exposes model failures at scale while preserving natural disfluencies found in unscripted human speech. Our evaluation of proprietary and open-source models reveals that even frontier models struggle on our benchmark, with Gemini 3 Pro Preview (Thinking), our highest-performing model, achieving a 54.65% pass rate. Error analysis shows that models fail most often on our new axes and that Self Coherence degrades with longer audio context. These failures reflect the difficulty of tracking edits, audio cues, and long-range context in natural spoken dialogue. Audio MultiChallenge provides a reproducible testbed to quantify these failures and drive improvements in audio-native multi-turn interaction capability.

[240] Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought

Zhixian Zhao, Xinfa Zhu, Xinsheng Wang, Shuiyuan Wang, Xuelong Geng, Wenjie Tian, Lei Xie

Main category: cs.SD

TL;DR: C²SER is a novel audio language model that improves speech emotion recognition by combining contextual perception with chain-of-thought reasoning and self-distillation techniques to reduce hallucinations and enhance accuracy.

DetailsMotivation: Current large-scale audio language models (ALMs) like Qwen2-Audio suffer from hallucinations in speech emotion recognition tasks, leading to misclassifications and irrelevant outputs. There's a need for more stable and accurate SER systems.

Method: C²SER integrates Whisper encoder for semantic perception and Emotion2Vec-S (enhanced with semi-supervised learning) for acoustic perception. It employs chain-of-thought reasoning to process SER step-by-step using speech content and speaking styles. The model also uses self-distillation from explicit to implicit CoT to reduce error accumulation.
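
The self-distillation step can be sketched as standard knowledge distillation between two passes of the same model: the explicit-CoT pass acts as teacher and the direct-answer pass as student. The two-pass setup, temperature, and loss weights below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# KL-based distillation from an explicit-CoT pass (teacher) to a
# direct-answer pass (student) over emotion-class logits.

def distill_loss(student_logits, teacher_logits, labels, T=2.0, beta=0.5):
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    return (1 - beta) * ce + beta * kl

student = torch.randn(8, 6, requires_grad=True)  # direct-answer (implicit CoT)
teacher = torch.randn(8, 6)                      # step-by-step (explicit CoT)
labels = torch.randint(0, 6, (8,))
loss = distill_loss(student, teacher.detach(), labels)
loss.backward()
print(loss.item())
```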

Result: Extensive experiments show C²SER outperforms existing popular ALMs like Qwen2-Audio and SECap, delivering more stable and precise emotion recognition. The authors release training code, checkpoints, and test sets.

Conclusion: C²SER successfully addresses hallucination issues in ALMs for SER through its novel combination of contextual perception, chain-of-thought reasoning, and self-distillation techniques, achieving state-of-the-art performance in speech emotion recognition.

Abstract: Large-scale audio language models (ALMs), such as Qwen2-Audio, are capable of comprehending diverse audio signals, performing audio analysis, and generating textual responses. However, in speech emotion recognition (SER), ALMs often suffer from hallucinations, resulting in misclassifications or irrelevant outputs. To address these challenges, we propose C$^2$SER, a novel ALM designed to enhance the stability and accuracy of SER through Contextual perception and Chain of Thought (CoT). C$^2$SER integrates the Whisper encoder for semantic perception and Emotion2Vec-S for acoustic perception, where Emotion2Vec-S extends Emotion2Vec with semi-supervised learning to enhance emotional discrimination. Additionally, C$^2$SER employs a CoT approach, processing SER in a step-by-step manner while leveraging speech content and speaking styles to improve recognition. To further enhance stability, C$^2$SER introduces self-distillation from explicit CoT to implicit CoT, mitigating error accumulation and boosting recognition accuracy. Extensive experiments show that C$^2$SER outperforms existing popular ALMs, such as Qwen2-Audio and SECap, delivering more stable and precise emotion recognition. We release the training code, checkpoints, and test sets to facilitate further research.

[241] Synaspot: A Lightweight, Streaming Multi-modal Framework for Keyword Spotting with Audio-Text Synergy

Kewei Li, Yinan Zhong, Xiaotao Liang, Tianchi Dai, Shaofei Xue

Main category: cs.SD

TL;DR: A lightweight streaming multi-modal framework for open-vocabulary keyword spotting that reduces speaker-specific information, effectively fuses speech and text features, and uses mathematical decoding with three modal representations.

DetailsMotivation: Open-vocabulary KWS in continuous speech has practical value but faces challenges: increased parameter costs from multimodal integration and constraints for end-to-end deployment limit practical applicability of existing models.

Method: 1) Focus on multimodal enrollment features, reducing speaker-specific (voiceprint) information to extract speaker-irrelevant characteristics; 2) Effectively fuse speech and text features; 3) Introduce streaming decoding framework where encoder extracts features that are mathematically decoded with three modal representations.

Result: Experiments on LibriPhrase and WenetPhrase show better performance than existing streaming approaches with significantly fewer parameters.

Conclusion: The proposed lightweight streaming multi-modal framework addresses parameter cost and deployment constraints while achieving superior performance for open-vocabulary keyword spotting.

Abstract: Open-vocabulary keyword spotting (KWS) in continuous speech streams holds significant practical value across a wide range of real-world applications. Increasing attention has been paid to the role of different modalities in KWS, and their effectiveness has been acknowledged. However, the increased parameter cost from multimodal integration and the constraints of end-to-end deployment have limited the practical applicability of such models. To address these challenges, we propose a lightweight, streaming multi-modal framework. First, we focus on multimodal enrollment features and reduce speaker-specific (voiceprint) information in the speech enrollment to extract speaker-irrelevant characteristics. Second, we effectively fuse speech and text features. Finally, we introduce a streaming decoding framework that only requires the encoder to extract features, which are then mathematically decoded with our three modal representations. Experiments on LibriPhrase and WenetPhrase demonstrate the performance of our model. Compared to existing streaming approaches, our method achieves better performance with significantly fewer parameters.

[242] Sparse Autoencoders Make Audio Foundation Models more Explainable

Théo Mariotte, Martin Lebourdais, Antonio Almudévar, Marie Tahon, Alfonso Ortega, Nicolas Dugué

Main category: cs.SD

TL;DR: SAEs effectively analyze hidden representations of audio pretrained models, enhancing disentanglement of vocal attributes in singing technique classification.

DetailsMotivation: Audio pretrained models are widely used but their learned representations are unclear, with analysis mainly limited to linear probing. There's a need for better tools to understand what these models learn.

Method: Use Sparse Autoencoders (SAEs) to analyze hidden representations of pretrained models, with a case study in singing technique classification.
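
For readers unfamiliar with the tool, a minimal SAE of the kind used for this sort of probing looks as follows; the dictionary size, L1 penalty weight, and plain ReLU code are illustrative choices rather than the paper's exact variant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Overcomplete sparse autoencoder: ReLU code + L1 sparsity penalty,
# trained to reconstruct hidden activations of a pretrained model.

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        z = F.relu(self.encoder(x))      # sparse, non-negative code
        return self.decoder(z), z

sae = SparseAutoencoder(d_model=768, d_dict=8192)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(256, 768)             # stand-in for real model activations
recon, code = sae(acts)
loss = F.mse_loss(recon, acts) + 1e-3 * code.abs().mean()  # recon + sparsity
loss.backward()
opt.step()
print(f"active units per example: {(code > 0).float().sum(dim=1).mean():.0f}")
```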

Result: SAEs retain information about original representations and class labels, provide insights into self-supervised learning systems, and enhance disentanglement of vocal attributes.

Conclusion: SAEs are an effective tool for identifying underlying factors encoded in audio pretrained model representations.

Abstract: Audio pretrained models are widely employed to solve various tasks in speech processing, sound event detection, or music information retrieval. However, the representations learned by these models are unclear, and their analysis is mainly restricted to linear probing of the hidden representations. In this work, we explore the use of Sparse Autoencoders (SAEs) to analyze the hidden representations of pretrained models, focusing on a case study in singing technique classification. We first demonstrate that SAEs retain both information about the original representations and class labels, enabling their internal structure to provide insights into self-supervised learning systems. Furthermore, we show that SAEs enhance the disentanglement of vocal attributes, establishing them as an effective tool for identifying the underlying factors encoded in the representations.
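
For readers unfamiliar with the technique, a minimal sparse autoencoder of the kind used to probe hidden representations looks like the sketch below; the dimensions and the `SparseAutoencoder` class are illustrative, not the authors' configuration.

```python
# Generic SAE sketch: reconstruct frozen-model activations under an L1 penalty.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # sparse, non-negative codes
        return self.decoder(z), z

def sae_loss(x, x_hat, z, l1_coeff=1e-3):
    # Reconstruction keeps information; the L1 penalty enforces sparsity,
    # pushing each hidden unit toward an interpretable factor.
    return ((x - x_hat) ** 2).mean() + l1_coeff * z.abs().mean()

sae = SparseAutoencoder(d_model=768, d_hidden=4096)
x = torch.randn(32, 768)                  # hidden states from a frozen audio model
x_hat, z = sae(x)
loss = sae_loss(x, x_hat, z)
loss.backward()
```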

[243] BEAT2AASIST model with layer fusion for ESDD 2026 Challenge

Sanghyeok Chung, Eujin Kim, Donggun Kim, Gaeun Heo, Jeongbin You, Nahyun Lee, Sunmook Choi, Soyul Han, Seungsang Oh, Il-Youp Kwak

Main category: cs.SD

TL;DR: BEAT2AASIST extends BEATs-AASIST for environmental sound deepfake detection by splitting representations along frequency/channel dimensions and using dual AASIST branches with transformer layer fusion and vocoder-based data augmentation.

DetailsMotivation: Recent advances in audio generation have increased risks of realistic environmental sound manipulation, creating need for robust detection methods. The ESDD 2026 Challenge serves as first large-scale benchmark for Environmental Sound Deepfake Detection.

Method: Extends BEATs-AASIST by splitting BEATs-derived representations along frequency or channel dimension and processing with dual AASIST branches. Incorporates top-k transformer layer fusion using concatenation, CNN-gated, and SE-gated strategies. Uses vocoder-based data augmentation for robustness against unseen spoofing methods.

Result: Experimental results on official test sets demonstrate competitive performance across challenge tracks.

Conclusion: The proposed BEAT2AASIST approach effectively addresses environmental sound deepfake detection challenges and shows promising performance on the ESDD benchmark.

Abstract: Recent advances in audio generation have increased the risk of realistic environmental sound manipulation, motivating the ESDD 2026 Challenge as the first large-scale benchmark for Environmental Sound Deepfake Detection (ESDD). We propose BEAT2AASIST which extends BEATs-AASIST by splitting BEATs-derived representations along frequency or channel dimension and processing them with dual AASIST branches. To enrich feature representations, we incorporate top-k transformer layer fusion using concatenation, CNN-gated, and SE-gated strategies. In addition, vocoder-based data augmentation is applied to improve robustness against unseen spoofing methods. Experimental results on the official test sets demonstrate that the proposed approach achieves competitive performance across the challenge tracks.
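
Of the three layer-fusion strategies, concatenation is the simplest to picture. A hedged sketch follows, with the layer count, hidden size, and `TopKConcatFusion` class assumed for illustration rather than taken from the paper.

```python
# Sketch of top-k transformer layer fusion by concatenation.
import torch
import torch.nn as nn

class TopKConcatFusion(nn.Module):
    def __init__(self, d_model: int, k: int, d_out: int):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(k * d_model, d_out)  # mix the stacked layers

    def forward(self, layer_states):
        # layer_states: list of (batch, time, d_model), one per selected layer.
        fused = torch.cat(layer_states[: self.k], dim=-1)
        return self.proj(fused)

layers = [torch.randn(2, 100, 768) for _ in range(3)]  # e.g. top-3 BEATs layers
fusion = TopKConcatFusion(d_model=768, k=3, d_out=768)
out = fusion(layers)  # (2, 100, 768), fed on to the dual AASIST branches
```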

[244] Time-Varying Audio Effect Modeling by End-to-End Adversarial Training

Yann Bourdin, Pierrick Legrand, Fanny Roche

Main category: cs.SD

TL;DR: GAN framework for modeling time-varying audio effects without needing control signals, using only input-output recordings.

DetailsMotivation: Black-box modeling of time-varying audio effects is problematic because standard approaches require control signal extraction for time-alignment, which is often unavailable or difficult to obtain.

Method: Two-stage training: 1) adversarial phase learns modulation distribution without strict phase constraints, 2) supervised fine-tuning with State Prediction Network (SPN) to estimate initial internal states for synchronization. Uses convolutional-recurrent architecture and develops chirp-train metric for modulation accuracy.

Result: Successfully models vintage hardware phaser, demonstrating ability to capture time-varying dynamics in fully black-box context.

Conclusion: Proposed GAN framework enables modeling of time-varying audio effects without control signal extraction, using only input-output recordings, addressing a key limitation in black-box audio effect modeling.

Abstract: Deep learning has become a standard approach for the modeling of audio effects, yet strictly black-box modeling remains problematic for time-varying systems. Unlike time-invariant effects, training models on devices with internal modulation typically requires the recording or extraction of control signals to ensure the time-alignment required by standard loss functions. This paper introduces a Generative Adversarial Network (GAN) framework to model such effects using only input-output audio recordings, removing the need for modulation signal extraction. We propose a convolutional-recurrent architecture trained via a two-stage strategy: an initial adversarial phase allows the model to learn the distribution of the modulation behavior without strict phase constraints, followed by a supervised fine-tuning phase where a State Prediction Network (SPN) estimates the initial internal states required to synchronize the model with the target. Additionally, a new objective metric based on chirp-train signals is developed to quantify modulation accuracy. Experiments modeling a vintage hardware phaser demonstrate the method’s ability to capture time-varying dynamics in a fully black-box context.
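
The two-stage strategy can be sketched as follows, with toy modules (`Generator`, `disc`, `spn`) standing in for the paper's convolutional-recurrent model and State Prediction Network; all shapes and hyperparameters are invented for illustration.

```python
# Two-stage training skeleton: adversarial pre-training without phase
# alignment, then supervised fine-tuning once an SPN provides initial states.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):              # toy stand-in for the conv-recurrent model
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)
    def forward(self, x, h0=None):
        y, _ = self.rnn(x.unsqueeze(-1), h0)
        return self.out(y).squeeze(-1)

gen = Generator()
disc = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
spn = nn.Linear(256, 32)                 # toy State Prediction Network
g_opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
dry, wet = torch.randn(8, 256), torch.randn(8, 256)   # one unaligned batch

# Stage 1: adversarial -- learn the modulation distribution, no phase constraint.
fake = gen(dry)
d_loss = -(torch.log(disc(wet) + 1e-8).mean()
           + torch.log(1 - disc(fake.detach()) + 1e-8).mean())
d_opt.zero_grad(); d_loss.backward(); d_opt.step()
g_loss = -torch.log(disc(fake) + 1e-8).mean()
g_opt.zero_grad(); g_loss.backward(); g_opt.step()

# Stage 2: supervised fine-tuning -- the SPN estimates the initial internal
# state so the generator's modulation is synchronized with the target.
h0 = spn(wet).unsqueeze(0)               # (layers, batch, hidden) for the GRU
pred = gen(dry, h0)
g_opt.zero_grad(); F.l1_loss(pred, wet).backward(); g_opt.step()
```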

[245] Spectral Masking and Interpolation Attack (SMIA): A Black-box Adversarial Attack against Voice Authentication and Anti-Spoofing Systems

Kamel Kamel, Hridoy Sankar Dutta, Keshav Sood, Sunil Aryal

Main category: cs.SD

TL;DR: SMIA attack manipulates inaudible frequency regions of AI-generated audio to bypass voice authentication and anti-spoofing systems, achieving high success rates and exposing critical security vulnerabilities.

DetailsMotivation: Voice authentication systems are increasingly used in high-security sectors but face vulnerabilities from sophisticated threats like deepfakes and adversarial attacks. Current anti-spoofing countermeasures rely on static detection models that can be bypassed by novel adversarial methods, leaving a critical security gap.

Method: Proposed Spectral Masking and Interpolation Attack (SMIA) - a novel method that strategically manipulates inaudible frequency regions of AI-generated audio. By altering voice in imperceptible zones to the human ear, SMIA creates adversarial samples that sound authentic while deceiving countermeasures.

Result: SMIA achieved: 82% attack success rate against combined VAS/CM systems, 97.5% against standalone speaker verification systems, and 100% against countermeasures. Comprehensive evaluation against SOTA models under simulated real-world conditions.

Conclusion: Current security postures are insufficient against adaptive adversarial attacks. Urgent need for paradigm shift toward next-generation defenses with dynamic, context-aware frameworks capable of evolving with the threat landscape.

Abstract: Voice Authentication Systems (VAS) use unique vocal characteristics for verification. They are increasingly integrated into high-security sectors such as banking and healthcare. Despite their improvements using deep learning, they face severe vulnerabilities from sophisticated threats like deepfakes and adversarial attacks. The emergence of realistic voice cloning complicates detection, as systems struggle to distinguish authentic from synthetic audio. While anti-spoofing countermeasures (CMs) exist to mitigate these risks, many rely on static detection models that can be bypassed by novel adversarial methods, leaving a critical security gap. To demonstrate this vulnerability, we propose the Spectral Masking and Interpolation Attack (SMIA), a novel method that strategically manipulates inaudible frequency regions of AI-generated audio. By altering the voice in imperceptible zones to the human ear, SMIA creates adversarial samples that sound authentic while deceiving CMs. We conducted a comprehensive evaluation of our attack against state-of-the-art (SOTA) models across multiple tasks, under simulated real-world conditions. SMIA achieved a strong attack success rate (ASR) of at least 82% against combined VAS/CM systems, at least 97.5% against standalone speaker verification systems, and 100% against countermeasures. These findings conclusively demonstrate that current security postures are insufficient against adaptive adversarial attacks. This work highlights the urgent need for a paradigm shift toward next-generation defenses that employ dynamic, context-aware frameworks capable of evolving with the threat landscape.

[246] Memo2496: Expert-Annotated Dataset and Dual-View Adaptive Framework for Music Emotion Recognition

Qilin Li, C. L. Philip Chen, Tong Zhang

Main category: cs.SD

TL;DR: This paper introduces Memo2496 dataset and DAMER model for music emotion recognition, addressing data limitations and cross-track feature drift through novel modules.

DetailsMotivation: Music Emotion Recognition faces challenges due to limited high-quality annotated datasets and difficulties in addressing cross-track feature drift, which hinders model performance and generalization.

Method: Proposes Dual-view Adaptive Music Emotion Recogniser (DAMER) with three modules: Dual Stream Attention Fusion for cross-modal interaction, Progressive Confidence Labelling for pseudo-label generation, and Style Anchored Memory Learning to mitigate feature drift.

Result: DAMER achieves state-of-the-art performance on Memo2496, 1000songs, and PMEmo datasets, improving arousal dimension accuracy by 3.43%, 2.25%, and 0.17% respectively. Ablation studies validate each module’s contribution.

Conclusion: The Memo2496 dataset and DAMER model effectively address key challenges in MER, with publicly available resources advancing the field of music emotion recognition.

Abstract: Music Emotion Recogniser (MER) research faces challenges due to limited high-quality annotated datasets and difficulties in addressing cross-track feature drift. This work presents two primary contributions to address these issues. Memo2496, a large-scale dataset, offers 2496 instrumental music tracks with continuous valence arousal labels, annotated by 30 certified music specialists. Annotation quality is ensured through calibration with extreme emotion exemplars and a consistency threshold of 0.25, measured by Euclidean distance in the valence arousal space. Furthermore, the Dual-view Adaptive Music Emotion Recogniser (DAMER) is introduced. DAMER integrates three synergistic modules: Dual Stream Attention Fusion (DSAF) facilitates token-level bidirectional interaction between Mel spectrograms and cochleagrams via cross attention mechanisms; Progressive Confidence Labelling (PCL) generates reliable pseudo labels employing curriculum-based temperature scheduling and consistency quantification using Jensen Shannon divergence; and Style Anchored Memory Learning (SAML) maintains a contrastive memory queue to mitigate cross-track feature drift. Extensive experiments on the Memo2496, 1000songs, and PMEmo datasets demonstrate DAMER’s state-of-the-art performance, improving arousal dimension accuracy by 3.43%, 2.25%, and 0.17%, respectively. Ablation studies and visualisation analyses validate each module’s contribution. Both the dataset and source code are publicly available.
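
A minimal sketch of token-level bidirectional cross-attention between the two spectral views, in the spirit of the DSAF module; the dimensions and the additive fusion at the end are assumptions, not the paper's exact design.

```python
# Bidirectional cross-attention between mel-spectrogram and cochleagram tokens.
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.mel_to_coch = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.coch_to_mel = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, mel, coch):
        # Each stream queries the other, then the two enriched views are fused.
        mel_enriched, _ = self.mel_to_coch(query=mel, key=coch, value=coch)
        coch_enriched, _ = self.coch_to_mel(query=coch, key=mel, value=mel)
        return mel_enriched + coch_enriched

mel = torch.randn(2, 128, 256)    # (batch, time tokens, dim) from the mel branch
coch = torch.randn(2, 128, 256)   # matching tokens from the cochleagram branch
fused = BidirectionalCrossAttention()(mel, coch)
```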

cs.LG

[247] Epistemic diversity across language models mitigates knowledge collapse

Damian Hodel, Jevin D. West

Main category: cs.LG

TL;DR: AI ecosystem diversity can mitigate knowledge collapse, but only up to an optimal level - too few diverse models fail to capture distribution richness, while too many reduce individual model capacity.

DetailsMotivation: Address concerns about knowledge collapse in AI (reduction to dominant ideas) by investigating whether diversity among AI models in an ecosystem can mitigate performance decay seen in single-model collapse scenarios.

Method: Build on single-model collapse approach but focus on ecosystems of models trained on their collective output. Segment training data across different language models and evaluate resulting ecosystems over ten self-training iterations to study diversity effects.

Result: Increased epistemic diversity mitigates collapse, but only up to an optimal level. Too few diverse models fail to express the rich mixture of the full distribution, causing rapid performance decay. Too many models reduce each model’s approximation capacity, leading to poor performance from the first iteration.

Conclusion: In AI monoculture context, need to monitor diversity across AI systems and develop policies that incentivize more domain- and community-specific models to maintain optimal ecosystem diversity.

Abstract: The growing use of artificial intelligence (AI) raises concerns of knowledge collapse, i.e., a reduction to the most dominant and central set of ideas. Prior work has demonstrated single-model collapse, defined as performance decay in an AI model trained on its own output. Inspired by ecology, we ask whether AI ecosystem diversity, that is, diversity among models, can mitigate such a collapse. We build on the single-model approach but focus on ecosystems of models trained on their collective output. To study the effect of diversity on model performance, we segment the training data across language models and evaluate the resulting ecosystems over ten self-training iterations. We find that increased epistemic diversity mitigates collapse, but, interestingly, only up to an optimal level. Our results suggest that an ecosystem containing only a few diverse models fails to express the rich mixture of the full, true distribution, resulting in rapid performance decay. Yet distributing the data across too many models reduces each model’s approximation capacity on the true distribution, leading to poor performance as early as the first iteration step. In the context of AI monoculture, our results suggest the need to monitor diversity across AI systems and to develop policies that incentivize more domain- and community-specific models.
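
The ecosystem setup can be mimicked with a toy simulation in which each "model" is just a categorical distribution fit on its data shard. This mirrors the structure of the experiment (shard, fit, generate, pool, repeat) under stated toy assumptions, not the paper's LLM-scale implementation.

```python
# Schematic simulation of an ecosystem of models trained on their collective output.
import numpy as np

rng = np.random.default_rng(0)
V, n_models, n_iters, n_samples = 50, 4, 10, 2000

true_dist = rng.dirichlet(np.ones(V))            # the "full, true" distribution
data = rng.choice(V, size=n_samples, p=true_dist)

for it in range(n_iters):
    shards = np.array_split(rng.permutation(data), n_models)
    models = []
    for shard in shards:                          # fit each model on its shard
        counts = np.bincount(shard, minlength=V) + 1e-9
        models.append(counts / counts.sum())
    # Each model generates; the pooled output becomes the next training set.
    data = np.concatenate(
        [rng.choice(V, size=n_samples // n_models, p=p) for p in models]
    )
    pooled = np.bincount(data, minlength=V) / data.size
    kl = np.sum(pooled * np.log(pooled + 1e-12) - pooled * np.log(true_dist))
    print(f"iter {it}: KL(pooled || true) = {kl:.4f}")   # drift from the truth
```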

[248] LLM as a Neural Architect: Controlled Generation of Image Captioning Models Under Strict API Contracts

Krunal Jesani, Dmitry Ignatov, Radu Timofte

Main category: cs.LG

TL;DR: NN-Caption is an LLM-guided neural architecture search pipeline that automatically generates runnable image-captioning models by composing CNN encoders with sequence decoders, achieving successful training and meaningful captions on MS COCO dataset.

DetailsMotivation: Traditional neural architecture search requires significant human expertise or automated trial-and-error, which is time-consuming and resource-intensive. The authors aim to leverage LLMs to automate the design of deep learning models, specifically for image captioning tasks.

Method: The NN-Caption pipeline uses DeepSeek-R1-0528-Qwen3-8B as the primary generator to create runnable image-captioning models. It composes CNN encoders from LEMUR’s classification backbones with sequence decoders (LSTM/GRU/Transformer) under a strict Net API. The approach includes prompt templates, iterative code fixes, and automatic evaluation of generated architectures.

Result: The LLM generated dozens of captioning models, with over half successfully trained and producing meaningful captions on MS COCO with BLEU-4 evaluation. The study found a slight drop in success rate when providing more candidate components (10 vs 5 input model snippets). The work also reports training dynamics and the highest BLEU-4 scores achieved.

Conclusion: LLM-guided NAS shows promise for automating neural architecture design, with LLMs capable of proposing architectures, hyperparameters, and training practices. The pipeline successfully integrates prompt-based code generation with automatic evaluation, adding novel captioning models to the LEMUR dataset for reproducible benchmarking and AutoML research.

Abstract: Neural architecture search (NAS) traditionally requires significant human expertise or automated trial-and-error to design deep learning models. We present NN-Caption, an LLM-guided neural architecture search pipeline that generates runnable image-captioning models by composing CNN encoders from LEMUR’s classification backbones with sequence decoders (LSTM/GRU/Transformer) under a strict Net API. Using DeepSeek-R1-0528-Qwen3-8B as the primary generator, we present the prompt template and examples of generated architectures. We evaluate on MS COCO with BLEU-4. The LLM generated dozens of captioning models, with over half successfully trained and producing meaningful captions. We analyse the outcomes of using different numbers of input model snippets (5 vs. 10) in the prompt, finding a slight drop in success rate when providing more candidate components. We also report training dynamics (caption accuracy vs. epochs) and the highest BLEU-4 attained. Our results highlight the promise of LLM-guided NAS: the LLM not only proposes architectures but also suggests hyperparameters and training practices. We identify the challenges encountered (e.g., code hallucinations or API compliance issues) and detail how prompt rules and iterative code fixes addressed them. This work presents a pipeline that integrates prompt-based code generation with automatic evaluation, and adds dozens of novel captioning models to the open LEMUR dataset to facilitate reproducible benchmarking and downstream AutoML research.
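
The generate-evaluate-fix loop at the heart of such pipelines can be sketched as below; `call_llm` is a stub standing in for the actual model endpoint, and the "define a `model` variable" contract is an assumed stand-in for the paper's Net API.

```python
# Skeleton of a generate-evaluate-fix loop for LLM-guided architecture search.
import traceback

def call_llm(prompt: str) -> str:               # placeholder generator
    return "import torch.nn as nn\nmodel = nn.Linear(10, 10)"

def generate_model(task_prompt: str, max_fixes: int = 3):
    prompt = task_prompt
    for attempt in range(1 + max_fixes):
        code = call_llm(prompt)
        try:
            namespace = {}
            exec(compile(code, "<generated>", "exec"), namespace)
            return namespace["model"]           # assumed contract: define `model`
        except Exception:
            # Feed the traceback back to the LLM so it can repair its own code.
            prompt = (task_prompt + "\nYour previous code failed:\n"
                      + code + "\n" + traceback.format_exc())
    raise RuntimeError("no runnable architecture after retries")

model = generate_model("Define an image-captioning model under the Net API.")
print(model)
```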

[249] Autonomous Source Knowledge Selection in Multi-Domain Adaptation

Keqiuyin Li, Jie Lu, Hua Zuo, Guangquan Zhang

Main category: cs.LG

TL;DR: AutoS: Autonomous Source Knowledge Selection for multi-domain adaptation that automatically selects relevant source samples and models while using pseudo-label enhancement to reduce target label noise.

DetailsMotivation: Multiple source domains often contain redundant or unrelated information that can harm transfer performance, especially in massive-source domain settings. There's a need to identify and select the most transferable knowledge from massive source domains for target tasks.

Method: Proposes AutoS with: 1) Density-driven selection strategy to choose source samples during training and determine which source models contribute to target prediction, 2) Pseudo-label enhancement module built on a pre-trained multimodal model to mitigate target label noise and improve self-supervision.

Result: Experiments on real-world datasets indicate the superiority of the proposed method over existing approaches.

Conclusion: AutoS effectively addresses the challenge of selecting transferable knowledge from massive source domains by autonomously selecting relevant source samples and models while reducing target label noise through pseudo-label enhancement.

Abstract: Unsupervised multi-domain adaptation plays a key role in transfer learning by leveraging rich source information acquired from multiple source domains to solve a target task on an unlabeled target domain. However, multiple source domains often contain redundant or unrelated information which can harm transfer performance, especially in massive-source domain settings. It is urgent to develop effective strategies for identifying and selecting the most transferable knowledge from massive source domains to address the target task. In this paper, we propose a multi-domain adaptation method named \underline{\textit{Auto}}nomous Source Knowledge \underline{\textit{S}}election (AutoS) to autonomously select source training samples and models, enabling the prediction of the target task using more relevant and transferable source information. The proposed method employs a density-driven selection strategy to choose source samples during training and to determine which source models should contribute to target prediction. Simultaneously, a pseudo-label enhancement module built on a pre-trained multimodal model is employed to mitigate target label noise and improve self-supervision. Experiments on real-world datasets indicate the superiority of the proposed method.
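
As a rough illustration of density-driven selection, one can score source samples by their estimated density under the target feature distribution and keep the densest. The KDE below is a generic stand-in for the paper's strategy, with synthetic features and an arbitrary keep-ratio.

```python
# Illustrative density-driven source sample selection.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
target_feats = rng.normal(0.0, 1.0, size=(500, 16))    # unlabeled target features
source_feats = rng.normal(0.5, 1.2, size=(2000, 16))   # one source domain

kde = KernelDensity(bandwidth=1.0).fit(target_feats)
scores = kde.score_samples(source_feats)               # log-density under target
keep = scores >= np.quantile(scores, 0.5)              # keep the densest half
selected = source_feats[keep]
print(selected.shape)                                   # samples deemed transferable
```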

[250] SepsisSuite: Beyond Risk Stratification – A Comparative Analysis of Deep Fusion vs. Expert Stacking for Prescriptive Sepsis AI

Ryan Cartularo

Main category: cs.LG

TL;DR: The paper compares two fusion approaches for sepsis prediction: a complex end-to-end deep fusion model (SepsisFusionFormer) that underperformed due to attention starvation, and a leaner context-aware stacking approach (SepsisLateFusion) that achieved SOTA performance using modality-specific experts gated by a meta-learner.

DetailsMotivation: Sepsis accounts for 20% of global ICU admissions, but conventional prediction models fail to effectively integrate heterogeneous data streams (vitals, text, imaging). Existing approaches are either siloed by modality or rely on brittle early fusion methods.

Method: Two approaches were compared: 1) SepsisFusionFormer - a quad-modal hierarchical gated attention network for end-to-end deep fusion; 2) SepsisLateFusion - a context-aware mixture-of-experts architecture with three modality-specific experts (“Historian” for static data, “Monitor” for temporal data, “Reader” for NLP) dynamically gated by a CatBoost meta-learner.

Result: SepsisFusionFormer suffered from attention starvation in small datasets (N≈2,100) and achieved only 0.66 AUC. SepsisLateFusion achieved SOTA performance with 0.915 AUC for prediction 4 hours prior to clinical onset, reducing missed cases by 48% through threshold calibration. For antibiotic selection, a quad-modal ensemble achieved 0.72 AUC.

Conclusion: Context-aware stacking with modality-specific experts outperforms complex end-to-end fusion for sepsis prediction, especially in small datasets. The approach enables preventative intervention windows and was implemented in SepsisSuite, a deployment-ready clinical decision support framework.

Abstract: Sepsis accounts for nearly 20% of global ICU admissions, yet conventional prediction models often fail to effectively integrate heterogeneous data streams, remaining either siloed by modality or reliant on brittle early fusion. In this work, we present a rigorous architectural comparison between End-to-End Deep Fusion and Context-Aware Stacking for sepsis tasks. We initially hypothesized that a novel Quad-Modal Hierarchical Gated Attention Network – termed SepsisFusionFormer – would resolve complex cross-modal interactions between vitals, text, and imaging. However, experiments on MIMIC-IV revealed that SepsisFusionFormer suffered from “attention starvation” in the small antibiotic cohort ($N \approx 2,100$), resulting in overfitting (AUC 0.66). This counterintuitive result informed the design of SepsisLateFusion, a “leaner” Context-Aware Mixture-of-Experts (MoE) architecture. By treating modalities as orthogonal experts – the “Historian” (Static), the “Monitor” (Temporal), and the “Reader” (NLP) – and dynamically gating them via a CatBoost meta-learner, we achieved State-of-the-Art (SOTA) performance: 0.915 AUC for prediction 4 hours prior to clinical onset. By calibrating the decision threshold for clinical safety, we reduced missed cases by 48% relative to the default operating point, thus opening a true preventative window for timely intervention over reactive alerts. Furthermore, for the novel prescriptive task of multi-class antibiotic selection, we demonstrate that a Quad-Modal Ensemble achieved the highest performance (0.72 AUC). These models are integrated into SepsisSuite, a deployment-ready Python framework for clinical decision support. SepsisSuite is available for free at: https://github.com/RyanCartularo/SepsisSuite-Info
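
A minimal version of the context-aware stacking recipe: three modality-specific experts produce out-of-fold probabilities that a boosted-tree meta-learner gates. Scikit-learn's gradient boosting stands in for CatBoost here, and all data is synthetic.

```python
# Late-fusion stacking sketch in the spirit of SepsisLateFusion.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 400
X_static = rng.normal(size=(n, 8))      # "Historian": demographics, history
X_temporal = rng.normal(size=(n, 24))   # "Monitor": flattened vitals window
X_text = rng.normal(size=(n, 32))       # "Reader": clinical note embeddings
y = rng.integers(0, 2, size=n)

experts = [LogisticRegression(max_iter=1000) for _ in range(3)]
views = [X_static, X_temporal, X_text]

# Out-of-fold predictions avoid leaking the experts' training data to the gate.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m, X in zip(experts, views)
])
gate = GradientBoostingClassifier().fit(meta_features, y)
print(gate.predict_proba(meta_features[:5]))
```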

[251] A Bayesian latent class reinforcement learning framework to capture adaptive, feedback-driven travel behaviour

Georges Sfeir, Stephane Hess, Thomas O. Hancock, Filipe Rodrigues, Jamal Amani Rad, Michiel Bliemer, Matthew Beck, Fayyaz Khan

Main category: cs.LG

TL;DR: Latent Class Reinforcement Learning model captures heterogeneity in how travelers learn preferences over time, identifying three distinct adaptation patterns.

DetailsMotivation: Travel decisions involve experience formation where individuals learn preferences over time, with significant heterogeneity across travelers in both underlying preferences and how these evolve.

Method: Latent Class Reinforcement Learning (LCRL) model estimated through Variational Bayes, applied to a driving simulator dataset.

Result: Identified three distinct classes: 1) context-dependent preferences with context-specific exploitative tendencies, 2) persistent exploitative strategy regardless of context, 3) exploratory strategy with context-specific preferences.

Conclusion: The LCRL model successfully captures heterogeneity in preference adaptation, revealing distinct behavioral patterns in how travelers learn and adapt their preferences over time.

Abstract: Many travel decisions involve a degree of experience formation, where individuals learn their preferences over time. At the same time, there is extensive scope for heterogeneity across individual travellers, both in their underlying preferences and in how these evolve. The present paper puts forward a Latent Class Reinforcement Learning (LCRL) model that allows analysts to capture both of these phenomena. We apply the model to a driving simulator dataset and estimate the parameters through Variational Bayes. We identify three distinct classes of individuals that differ markedly in how they adapt their preferences: the first displays context-dependent preferences with context-specific exploitative tendencies; the second follows a persistent exploitative strategy regardless of context; and the third engages in an exploratory strategy combined with context-specific preferences.

[252] Stronger-MAS: Multi-Agent Reinforcement Learning for Collaborative LLMs

Yujie Zhao, Lanxiang Hu, Yang Wang, Minmin Hou, Hao Zhang, Ke Ding, Jishen Zhao

Main category: cs.LG

TL;DR: AT-GRPO is a new RL algorithm and training system designed for multi-agent systems that addresses challenges in applying on-policy RL to MAS by using agent- and turn-wise grouping and supporting both single- and multi-policy regimes.

DetailsMotivation: Applying on-policy RL to multi-agent systems is underexplored and presents unique challenges: standard GRPO grouping assumptions break down due to varying prompts by role and turn, and the training stack must support MAS-workflow rollouts and on-policy updates for different policy models.

Method: AT-GRPO includes (1) an agent- and turn-wise grouped RL algorithm tailored to MAS, and (2) a training system that supports both single-policy and multi-policy regimes for multi-agent workflows.

Result: AT-GRPO delivers substantial gains across game, planning, coding, and math tasks. On long-horizon planning, it increases accuracy from 14.0-47.0% (single-agent RL baseline) to 96.0-99.5%. It improves reasoning performance with average gains of 3.87-7.62% on coding tasks and 9.0-17.93% on math tasks.

Conclusion: AT-GRPO successfully addresses the challenges of applying on-policy RL to multi-agent systems and demonstrates significant performance improvements across diverse tasks, providing a practical solution for enhancing LLM agentic capabilities through MAS and RL.

Abstract: Multi-agent systems (MAS) and reinforcement learning (RL) are widely used to enhance the agentic capabilities of large language models (LLMs). MAS improves task performance through role-based orchestration, while RL uses environmental rewards to learn stronger policies, such as GRPO-style optimization. However, applying on-policy RL to MAS remains underexplored and presents unique challenges. Algorithmically, standard GRPO grouping assumptions break down because prompts vary by role and by turn. System-wise, the training stack must support MAS-workflow rollouts and on-policy updates for both single-policy and multi-policy models. We propose AT-GRPO, which includes (i) an agent- and turn-wise grouped RL algorithm tailored to MAS and (ii) a training system that supports both single- and multi-policy regimes. Across game, planning, coding, and math tasks, AT-GRPO delivers substantial gains. On long-horizon planning, it increases accuracy from a single-agent RL baseline of 14.0 to 47.0 percent to 96.0 to 99.5 percent. It also improves reasoning performance, with average gains of 3.87 to 7.62 percent on coding tasks and 9.0 to 17.93 percent on math. Code and environments are available at: https://github.com/pettingllms-ai/PettingLLMs.
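
The grouping idea can be shown in a few lines: rewards are normalized within (agent, turn) groups rather than across the whole batch, so prompts that differ by role or turn are never compared directly. Everything beyond the grouping (clipping, KL terms, rollout machinery) is omitted, and the field names are illustrative.

```python
# Sketch of agent- and turn-wise group-relative advantages.
from collections import defaultdict
import statistics

def agent_turn_advantages(samples):
    # samples: list of dicts with keys "agent", "turn", "reward".
    groups = defaultdict(list)
    for s in samples:
        groups[(s["agent"], s["turn"])].append(s["reward"])
    advantages = []
    for s in samples:
        rs = groups[(s["agent"], s["turn"])]
        mu = statistics.fmean(rs)
        sd = statistics.pstdev(rs) or 1.0        # guard degenerate groups
        advantages.append((s["reward"] - mu) / sd)
    return advantages

rollouts = [
    {"agent": "planner", "turn": 0, "reward": 1.0},
    {"agent": "planner", "turn": 0, "reward": 0.0},
    {"agent": "coder",   "turn": 1, "reward": 0.5},
    {"agent": "coder",   "turn": 1, "reward": 1.5},
]
print(agent_turn_advantages(rollouts))
```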

[253] Improving Underwater Acoustic Classification Through Learnable Gabor Filter Convolution and Attention Mechanisms

Lucas Cesar Ferreira Domingos, Russell Brinkworth, Paulo Eduardo Santos, Karl Sammut

Main category: cs.LG

TL;DR: GSE ResNeXt: A deep learning architecture combining learnable Gabor convolutional layers with ResNeXt backbone and squeeze-and-excitation attention for underwater acoustic target classification, achieving better performance and faster convergence than baseline models.

DetailsMotivation: Underwater acoustic target classification is critical for environmental monitoring and defense, but faces challenges from complex ship-radiated/environmental noise, limited datasets, and lack of standardized experimentation that hinder generalization and robustness.

Method: Proposes GSE ResNeXt architecture integrating learnable Gabor convolutional layers (as 2D adaptive band-pass filters) with ResNeXt backbone enhanced by squeeze-and-excitation attention mechanisms. Gabor filters extend feature channel representation and combined with channel attention improve training stability and discriminative feature extraction.

Result: GSE ResNeXt consistently outperforms baseline models (Xception, ResNet, MobileNetV2) in classification performance. Gabor convolutions in initial layers reduce training time by 28%. Temporal differences between training/testing data significantly affect performance, with vessel-sensor distance being a key factor.

Conclusion: Signal processing strategies (like Gabor filters) improve reliability and generalization in data-limited underwater acoustic classification. Future work should focus on mitigating environmental factors’ impact on input signals.

Abstract: Remotely detecting and classifying underwater acoustic targets is critical for environmental monitoring and defence. However, the complex nature of ship-radiated and environmental underwater noise poses significant challenges to accurate signal processing. While recent advancements in machine learning have improved classification accuracy, issues such as limited dataset availability and a lack of standardised experimentation hinder generalisation and robustness. This paper introduces GSE ResNeXt, a deep learning architecture integrating learnable Gabor convolutional layers with a ResNeXt backbone enhanced by squeeze-and-excitation attention mechanisms. The Gabor filters serve as two-dimensional adaptive band-pass filters, extending the feature channel representation. Its combination with channel attention improves training stability and convergence while enhancing the model’s ability to extract discriminative features. The model is evaluated on three classification tasks of increasing complexity. In particular, the impact of temporal differences between the training and testing data is explored, revealing that the distance between the vessel and sensor significantly affects performance. Results show that GSE ResNeXt consistently outperforms baseline models such as Xception, ResNet, and MobileNetV2 in terms of classification performance. Regarding stability and convergence, the addition of Gabor convolutions in the initial layers of the model yields a 28% reduction in training time. These results emphasise the importance of signal processing strategies in improving the reliability and generalisation of models under different environmental conditions, especially in data-limited underwater acoustic classification scenarios. Future developments should focus on mitigating the impact of environmental factors on input signals.
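
A learnable 2D Gabor convolution can be sketched as a layer whose kernels are generated on the fly from a few trainable parameters per filter. The parameterization below is the textbook Gabor form (orientation, envelope width, wavelength, phase); sizes and initial values are chosen for illustration, not taken from the paper.

```python
# Learnable 2D Gabor filter bank acting as adaptive band-pass filters.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaborConv2d(nn.Module):
    def __init__(self, out_channels=16, ksize=15):
        super().__init__()
        self.ksize = ksize
        self.theta = nn.Parameter(torch.rand(out_channels) * math.pi)  # orientation
        self.sigma = nn.Parameter(torch.full((out_channels,), 4.0))    # envelope width
        self.lam = nn.Parameter(torch.full((out_channels,), 8.0))      # wavelength
        self.psi = nn.Parameter(torch.zeros(out_channels))             # phase

    def kernels(self):
        half = self.ksize // 2
        y, x = torch.meshgrid(
            torch.arange(-half, half + 1, dtype=torch.float32),
            torch.arange(-half, half + 1, dtype=torch.float32),
            indexing="ij",
        )
        x, y = x[None], y[None]                                        # (1, k, k)
        th = self.theta[:, None, None]
        xr = x * torch.cos(th) + y * torch.sin(th)                     # rotated coords
        yr = -x * torch.sin(th) + y * torch.cos(th)
        env = torch.exp(-(xr ** 2 + yr ** 2) / (2 * self.sigma[:, None, None] ** 2))
        wave = torch.cos(2 * math.pi * xr / self.lam[:, None, None]
                         + self.psi[:, None, None])
        return (env * wave)[:, None]                                   # (out, 1, k, k)

    def forward(self, x):                                              # x: (B, 1, F, T)
        return F.conv2d(x, self.kernels(), padding=self.ksize // 2)

spec = torch.randn(2, 1, 128, 200)            # batch of log-mel spectrograms
feats = GaborConv2d()(spec)                   # (2, 16, 128, 200)
```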

[254] How a Bit Becomes a Story: Semantic Steering via Differentiable Fault Injection

Zafaryab Haider, Md Hafizur Rahman, Shane Moeykens, Vijay Devabhaktuni, Prabuddha Chakraborty

Main category: cs.LG

TL;DR: BLADE: A differentiable framework that uses gradient-based sensitivity estimation to identify which specific bits in LLM weights, when flipped, can alter the semantic meaning of generated image captions while preserving grammatical structure.

DetailsMotivation: Prior fault injection research focused on crashing classifiers or degrading accuracy, but overlooked how subtle bit-level perturbations could influence semantic meaning in generative systems like image captioning models. The authors want to understand how meaning is encoded and alterable at the bit level in vision-language models.

Method: BLADE (Bit-level Fault Analysis via Differentiable Estimation) uses gradient-based sensitivity estimation to locate semantically critical bits in LLM weights, then refines selection through a caption-level semantic-fluency objective that preserves syntax while altering meaning.

Result: The framework demonstrates that even imperceptible low-level bit flips can steer high-level semantics in generative vision-language models, showing semantic drifts are not random but can be predicted using the model’s own gradients.

Conclusion: The work reveals how structured bit-level faults can reshape semantic output, opening pathways for robustness testing, adversarial defense, and explainable AI by exposing the relationship between low-level weight perturbations and high-level semantic changes.

Abstract: Hard-to-detect hardware bit flips, from either malicious circuitry or bugs, have already been shown to make transformers vulnerable in non-generative tasks. This work, for the first time, investigates how low-level, bitwise perturbations (fault injection) to the weights of a large language model (LLM) used for image captioning can influence the semantic meaning of its generated descriptions while preserving grammatical structure. While prior fault analysis methods have shown that flipping a few bits can crash classifiers or degrade accuracy, these approaches overlook the semantic and linguistic dimensions of generative systems. In image captioning models, a single flipped bit might subtly alter how visual features map to words, shifting the entire narrative an AI tells about the world. We hypothesize that such semantic drifts are not random but differentiably estimable. That is, the model’s own gradients can predict which bits, if perturbed, will most strongly influence meaning while leaving syntax and fluency intact. We design a differentiable fault analysis framework, BLADE (Bit-level Fault Analysis via Differentiable Estimation), that uses gradient-based sensitivity estimation to locate semantically critical bits and then refines their selection through a caption-level semantic-fluency objective. Our goal is not merely to corrupt captions, but to understand how meaning itself is encoded, distributed, and alterable at the bit level, revealing that even imperceptible low-level changes can steer the high-level semantics of generative vision-language models. It also opens pathways for robustness testing, adversarial defense, and explainable AI, by exposing how structured bit-level faults can reshape a model’s semantic output.
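
The core differentiable estimate can be illustrated on a toy model: the loss change from flipping bit b of weight w is approximated to first order by grad(w) * (flip(w) - w), so the model's own gradients rank candidate bits. This is a generic robustness-analysis sketch under that assumption, not the BLADE codebase, and the toy classifier stands in for the captioning LLM.

```python
# First-order bit-flip sensitivity estimate via the model's own gradients.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
loss = nn.functional.cross_entropy(model(x), y)   # proxy for a semantic objective
loss.backward()

w = model.weight.detach()
g = model.weight.grad

def flip_bit(t: torch.Tensor, bit: int) -> torch.Tensor:
    ints = t.view(torch.int32) ^ (1 << bit)       # reinterpret bits, XOR one bit
    return ints.view(torch.float32)

bit = 30                                          # a high exponent bit of float32
delta_w = flip_bit(w, bit) - w
sensitivity = (g * delta_w).abs()                 # predicted |loss change| per weight
top = torch.topk(sensitivity.flatten(), 5)
print(top.indices, top.values)                    # most semantically critical bits
```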

[255] O-EENC-SD: Efficient Online End-to-End Neural Clustering for Speaker Diarization

Elio Gruttadauria, Mathieu Fontaine, Jonathan Le Roux, Slim Essid

Main category: cs.LG

TL;DR: O-EENC-SD is an online speaker diarization system using EEND-EDA with RNN-based stitching and centroid refinement, offering hyperparameter-free operation and computational efficiency while maintaining competitive performance on CallHome dataset.

DetailsMotivation: Existing speaker diarization methods have limitations: unsupervised clustering approaches require hyperparameter tuning, and current online end-to-end methods are computationally expensive. There's a need for an efficient, hyperparameter-free online solution.

Method: The system builds on EEND-EDA architecture with two key innovations: 1) RNN-based stitching mechanism for online prediction, and 2) novel centroid refinement decoder. It processes independent chunks with no overlap for efficiency.

Result: Competitive with state-of-the-art on CallHome dataset for two-speaker conversational telephone speech. Provides excellent trade-off between Diarization Error Rate (DER) and complexity, with demonstrated efficiency through ablation studies.

Conclusion: O-EENC-SD offers a practical solution for online speaker diarization by combining hyperparameter-free operation with computational efficiency while maintaining competitive performance, making it suitable for real-time applications.

Abstract: We introduce O-EENC-SD: an end-to-end online speaker diarization system based on EEND-EDA, featuring a novel RNN-based stitching mechanism for online prediction. In particular, we develop a novel centroid refinement decoder whose usefulness is assessed through a rigorous ablation study. Our system provides key advantages over existing methods: a hyperparameter-free solution compared to unsupervised clustering approaches, and a more efficient alternative to current online end-to-end methods, which are computationally costly. We demonstrate that O-EENC-SD is competitive with the state of the art in the two-speaker conversational telephone speech domain, as tested on the CallHome dataset. Our results show that O-EENC-SD provides a great trade-off between DER and complexity, even when working on independent chunks with no overlap, making the system extremely efficient.

[256] Is GPT-OSS All You Need? Benchmarking Large Language Models for Financial Intelligence and the Surprising Efficiency Paradox

Ziqian Bi, Danyang Zhang, Junhao Song, Chiung-Yi Tseng

Main category: cs.LG

TL;DR: GPT-OSS-20B achieves comparable accuracy to larger models (65.1% vs 66.5%) with superior computational efficiency, challenging the assumption that bigger models always perform better in financial NLP tasks.

DetailsMotivation: The rapid adoption of large language models in financial services requires rigorous evaluation frameworks to assess performance, efficiency, and practical applicability for deployment decisions.

Method: Comprehensive evaluation of GPT-OSS model family (120B and 20B variants) and contemporary LLMs across ten diverse financial NLP tasks using real-world datasets (Financial PhraseBank, FiQA-SA, FLARE FINERORD). Introduced novel efficiency metrics to capture performance-resource trade-offs.

Result: GPT-OSS-20B achieves comparable accuracy (65.1%) to GPT-OSS-120B (66.5%) while demonstrating superior computational efficiency (198.4 Token Efficiency Score, 159.80 tokens/sec). GPT-OSS models consistently outperform larger competitors including Qwen3-235B.

Conclusion: Architectural innovations and training strategies enable smaller models to achieve competitive performance with significantly reduced computational overhead, offering sustainable and cost-effective deployment pathways for LLMs in financial applications.

Abstract: The rapid adoption of large language models in financial services necessitates rigorous evaluation frameworks to assess their performance, efficiency, and practical applicability. This paper conducts a comprehensive evaluation of the GPT-OSS model family alongside contemporary LLMs across ten diverse financial NLP tasks. Through extensive experimentation on 120B and 20B parameter variants of GPT-OSS, we reveal a counterintuitive finding: the smaller GPT-OSS-20B model achieves comparable accuracy (65.1% vs 66.5%) while demonstrating superior computational efficiency with 198.4 Token Efficiency Score and 159.80 tokens per second processing speed [1]. Our evaluation encompasses sentiment analysis, question answering, and entity recognition tasks using real-world financial datasets including Financial PhraseBank, FiQA-SA, and FLARE FINERORD. We introduce novel efficiency metrics that capture the trade-off between model performance and resource utilization, providing critical insights for deployment decisions in production environments. The benchmark reveals that GPT-OSS models consistently outperform larger competitors including Qwen3-235B, challenging the prevailing assumption that model scale directly correlates with task performance [2]. Our findings demonstrate that architectural innovations and training strategies in GPT-OSS enable smaller models to achieve competitive performance with significantly reduced computational overhead, offering a pathway toward sustainable and cost-effective deployment of LLMs in financial applications.

[257] SEED: Spectral Entropy-Guided Evaluation of SpatialTemporal Dependencies for Multivariate Time Series Forecasting

Feng Xiong, Zongxia Xie, Yanru Sun, Haoyu Wang, Jianhong Lin

Main category: cs.LG

TL;DR: SEED is a spectral entropy-guided framework for multivariate time series forecasting that addresses limitations in existing attention/graph methods by dynamically evaluating dependencies, preserving negative correlations, and enhancing temporal position awareness.

DetailsMotivation: Existing attention- or graph-based methods for multivariate time series forecasting have three key issues: (1) strong temporal self-dependencies are disrupted by irrelevant variables, (2) softmax normalization ignores/reverses negative correlations, and (3) variables struggle to perceive their temporal positions.

Method: SEED introduces: (1) Dependency Evaluator using spectral entropy to dynamically evaluate spatial-temporal dependencies and balance CI/CD strategies; (2) Spectral Entropy-based Fuser to refine dependency weights and separate temporal regularities from intrinsic dynamics; (3) Signed Graph Constructor with signed edge weights to preserve negative correlations; (4) Context Spatial Extractor using local contextual windows to help variables perceive temporal positions and extract spatial features.

Result: Extensive experiments on 12 real-world datasets from various application domains demonstrate that SEED achieves state-of-the-art performance, validating its effectiveness and generality.

Conclusion: SEED provides an effective spectral entropy-guided framework for multivariate time series forecasting that addresses key limitations in existing methods, achieving superior performance through adaptive dependency evaluation, negative correlation preservation, and enhanced temporal position awareness.

Abstract: Effective multivariate time series forecasting often benefits from accurately modeling complex inter-variable dependencies. However, existing attention- or graph-based methods face three key issues: (a) strong temporal self-dependencies are often disrupted by irrelevant variables; (b) softmax normalization ignores and reverses negative correlations; (c) variables struggle to perceive their temporal positions. To address these, we propose \textbf{SEED}, a Spectral Entropy-guided Evaluation framework for spatial-temporal Dependency modeling. SEED introduces a Dependency Evaluator, a key innovation that leverages spectral entropy to dynamically provide a preliminary evaluation of the spatial and temporal dependencies of each variable, enabling the model to adaptively balance Channel Independence (CI) and Channel Dependence (CD) strategies. To account for temporal regularities originating from the influence of other variables rather than intrinsic dynamics, we propose Spectral Entropy-based Fuser to further refine the evaluated dependency weights, effectively separating this part. Moreover, to preserve negative correlations, we introduce a Signed Graph Constructor that enables signed edge weights, overcoming the limitations of softmax. Finally, to help variables perceive their temporal positions and thereby construct more comprehensive spatial features, we introduce the Context Spatial Extractor, which leverages local contextual windows to extract spatial features. Extensive experiments on 12 real-world datasets from various application domains demonstrate that SEED achieves state-of-the-art performance, validating its effectiveness and generality.
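
Spectral entropy, the quantity guiding SEED's evaluator, is simply the Shannon entropy of a series' normalized FFT power spectrum: low values indicate strong periodic regularity, high values noise-like dynamics. A self-contained sketch (the normalization to [0, 1] is one common convention):

```python
# Spectral entropy of a univariate series via its FFT power spectrum.
import numpy as np

def spectral_entropy(x: np.ndarray) -> float:
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2
    p = power / power.sum()                       # normalize to a distribution
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(len(power)))  # scaled to [0, 1]

t = np.arange(512)
print(spectral_entropy(np.sin(0.2 * t)))          # near 0: highly regular
print(spectral_entropy(np.random.default_rng(0).normal(size=512)))  # near 1: noisy
```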

[258] Hybrid Attribution Priors for Explainable and Robust Model Training

Zhuoran Zhang, Feng Zhang, Shangyuan Li, Yang Shi, Yuanxing Zhang, Wei Chen, Tengjiao Wang, Kam-Fai Wong

Main category: cs.LG

TL;DR: CAP framework extracts class-aware attribution priors to help SLMs distinguish semantically similar classes by focusing on discriminative features rather than shared keywords.

DetailsMotivation: Existing attribution methods for SLMs highlight class-relevant tokens but often focus on common keywords shared by semantically similar classes, providing insufficient discriminative cues for model differentiation.

Method: Proposes Class-Aware Attribution Prior (CAP) framework that extracts attribution priors capturing fine-grained class distinctions, and CAP Hybrid that combines CAP priors with existing attribution techniques for more comprehensive supervision.

Result: Extensive experiments in full-data, few-shot, and adversarial scenarios demonstrate consistent improvements in both interpretability and robustness.

Conclusion: The CAP framework effectively guides language models to learn diverse, decision-relevant features by aligning self-attribution with enriched class-aware priors, overcoming limitations of existing attribution methods.

Abstract: Small language models (SLMs) are widely used in tasks that require low latency and lightweight deployment, particularly classification. As interpretability and robustness gain increasing importance, explanation-guided learning has emerged as an effective framework by introducing attribution-based supervision during training; however, deriving general and reliable attribution priors remains a significant challenge. Through an analysis of representative attribution methods in classification settings, we find that although these methods can reliably highlight class-relevant tokens, they often focus on common keywords shared by semantically similar classes. Because such classes are already difficult to distinguish under standard training, these attributions provide insufficient discriminative cues, limiting their ability to improve model differentiation. To overcome this limitation, we propose Class-Aware Attribution Prior (CAP), a novel attribution prior extraction framework that guides language models toward capturing fine-grained class distinctions and producing more salient, discriminative attribution priors. Building on this idea, we further introduce CAP Hybrid, which combines priors from CAP with those from existing attribution techniques to form a more comprehensive and balanced supervisory signal. By aligning a model’s self-attribution with these enriched priors, our approach encourages the learning of diverse, decision-relevant features. Extensive experiments in full-data, few-shot, and adversarial scenarios demonstrate that our method consistently enhances both interpretability and robustness.
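
Explanation-guided training of this kind typically adds an alignment term between the model's self-attribution and the prior. A hedged sketch with a KL-style loss and synthetic tensors follows; in the paper's setting, CAP (or CAP Hybrid) would supply the prior, and the weighting is an assumption.

```python
# Aligning a model's token attributions to a class-aware prior.
import torch

def attribution_alignment_loss(attn_scores, prior, eps=1e-8):
    """attn_scores, prior: (batch, seq_len) non-negative token importances."""
    p = prior / (prior.sum(-1, keepdim=True) + eps)
    q = attn_scores / (attn_scores.sum(-1, keepdim=True) + eps)
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(-1).mean()

self_attr = torch.rand(4, 32, requires_grad=True)   # model's token attributions
cap_prior = torch.rand(4, 32)                       # discriminative prior (from CAP)
task_loss = torch.tensor(0.7)                       # placeholder classification loss
total = task_loss + 0.1 * attribution_alignment_loss(self_attr, cap_prior)
total.backward()
```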

[259] Automatic Extraction of Rules for Generating Synthetic Patient Data From Real-World Population Data Using Glioblastoma as an Example

Arno Appenzeller, Nick Terzer, André Hohmeyer, Jan-Philipp Redlich, Sabine Luttmann, Friedrich Feuerhake, Nadine S. Schaadt, Timm Intemann, Sarah Teuber-Hanselmann, Stefan Nikolin, Joachim Weis, Klaus Kraywinkel, Pascal Birnstill

Main category: cs.LG

TL;DR: Automated generation of Synthea rules from cancer statistics enables privacy-preserving synthetic medical data creation for glioblastoma research.

DetailsMotivation: Synthetic data generation offers privacy-compliant medical data access, but creating meaningful Synthea rules requires expert knowledge and sample data, making the process complex.

Method: Developed approach to automatically generate Synthea rules from tabular cancer report statistics, creating a glioblastoma module from real-world data to generate synthetic datasets.

Result: Synthetic data reproduced known disease courses and mostly retained statistical properties compared to original dataset, demonstrating feasibility.

Conclusion: Synthetic patient data has great potential for privacy-preserving research, useful for hypothesis formulation and prototype development, though medical interpretation should consider current limitations.

Abstract: The generation of synthetic data is a promising technology to make medical data available for secondary use in a privacy-compliant manner. A popular method for creating realistic patient data is the rule-based Synthea data generator. Synthea generates data based on rules describing the lifetime of a synthetic patient. These rules typically express the probability of a condition occurring, such as a disease, depending on factors like age. Since they only contain statistical information, rules usually have no specific data protection requirements. However, creating meaningful rules can be a very complex process that requires expert knowledge and realistic sample data. In this paper, we introduce and evaluate an approach to automatically generate Synthea rules based on statistics from tabular data, which we extracted from cancer reports. As an example use case, we created a Synthea module for glioblastoma from a real-world dataset and used it to generate a synthetic dataset. Compared to the original dataset, the synthetic data reproduced known disease courses and mostly retained the statistical properties. Overall, synthetic patient data holds great potential for privacy-preserving research. The data can be used to formulate hypotheses and to develop prototypes, but medical interpretation should consider the specific limitations as with any currently available approach.
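
To illustrate the rule-extraction idea, the toy script below turns a small incidence table into a Synthea-style JSON state machine. The structure is a simplified stand-in for Synthea's generic module format, and the numbers are invented, not glioblastoma statistics.

```python
# Toy conversion of tabular incidence statistics into a Synthea-style module.
import json

incidence_by_age = {"0-39": 0.001, "40-64": 0.004, "65+": 0.009}  # invented values

module = {"name": "example_condition", "states": {
    "Initial": {"type": "Initial", "direct_transition": "Age_Check"},
    "Age_Check": {"type": "Simple", "complex_transition": []},
    "Onset": {"type": "ConditionOnset", "direct_transition": "Terminal"},
    "Terminal": {"type": "Terminal"},
}}

for age_range, p in incidence_by_age.items():
    lo, _, _ = age_range.partition("-")
    cond = {"condition_type": "Age", "operator": ">=",
            "quantity": int(lo.rstrip("+"))}
    # Each age band gets a probabilistic transition into disease onset.
    module["states"]["Age_Check"]["complex_transition"].append({
        "condition": cond,
        "distributions": [{"transition": "Onset", "distribution": p},
                          {"transition": "Terminal", "distribution": 1 - p}],
    })

print(json.dumps(module, indent=2))
```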

[260] INFORM-CT: INtegrating LLMs and VLMs FOR Incidental Findings Management in Abdominal CT

Idan Tankel, Nir Mazor, Rafi Brada, Christina LeBedis, Guy ben-Yosef

Main category: cs.LG

TL;DR: A novel AI framework combining LLMs and VLMs in a plan-and-execute agentic approach to automate incidental findings detection, classification, and reporting in abdominal CT scans, outperforming pure VLM-based methods.

DetailsMotivation: Incidental findings in CT scans have clinical significance but manual inspection by radiologists is time-consuming and inconsistent. There's a need for more efficient and precise automated systems to handle incidental findings following established medical guidelines.

Method: Plan-and-execute agentic framework using LLMs as planners to generate Python scripts with predefined base functions, and executors to run these scripts using VLMs, segmentation models, and image processing subroutines for automated incidental findings management.

Result: The framework outperforms existing pure VLM-based approaches in accuracy and efficiency when tested on a CT abdominal benchmark for three organs in a fully automatic end-to-end manner.

Conclusion: The proposed LLM-VLM agentic framework successfully automates incidental findings management in abdominal CT scans, offering improved efficiency and precision over existing methods while following medical guidelines.

Abstract: Incidental findings in CT scans, though often benign, can have significant clinical implications and should be reported following established guidelines. Traditional manual inspection by radiologists is time-consuming and variable. This paper proposes a novel framework that leverages large language models (LLMs) and foundational vision-language models (VLMs) in a plan-and-execute agentic approach to improve the efficiency and precision of incidental findings detection, classification, and reporting for abdominal CT scans. Given medical guidelines for abdominal organs, the process of managing incidental findings is automated through a planner-executor framework. The planner, based on LLM, generates Python scripts using predefined base functions, while the executor runs these scripts to perform the necessary checks and detections, via VLMs, segmentation models, and image processing subroutines. We demonstrate the effectiveness of our approach through experiments on a CT abdominal benchmark for three organs, in a fully automatic end-to-end manner. Our results show that the proposed framework outperforms existing pure VLM-based approaches in terms of accuracy and efficiency.

[261] HATSolver: Learning Groebner Bases with Hierarchical Attention Transformers

Mohamed Malhou, Ludovic Perret, Kristin Lauter

Main category: cs.LG

TL;DR: Improved Groebner bases computation using Hierarchical Attention Transformers (HATs) with tree-structured inductive bias for solving multivariate polynomial systems, achieving significant computational savings and scaling to larger instances than previous transformer approaches.

DetailsMotivation: Previous work (Kera et al., NeurIPS 2024) introduced transformers for computing Groebner bases, but there's room for improvement in computational efficiency and scalability for solving systems of multivariate polynomial equations.

Method: Apply Hierarchical Attention Transformers (HATs) with tree-structured inductive bias to model hierarchical relationships in polynomial data, generalize to arbitrary depths, and combine with curriculum learning for training.

Result: Achieves significant computational savings compared to conventional flat attention models, solves much larger instances than Kera et al. (2024), and includes detailed computational cost analysis.

Conclusion: HAT architecture with tree-structured inductive bias provides an efficient approach for Groebner bases computation, enabling solution of larger multivariate polynomial systems than previous transformer-based methods.

Abstract: At NeurIPS 2024, Kera et al. introduced the use of transformers for computing Groebner bases, a central object in computer algebra with numerous practical applications. In this paper, we improve this approach by applying Hierarchical Attention Transformers (HATs) to solve systems of multivariate polynomial equations via Groebner bases computation. The HAT architecture incorporates a tree-structured inductive bias that enables the modeling of hierarchical relationships present in the data and thus achieves significant computational savings compared to conventional flat attention models. We generalize to arbitrary depths and include a detailed computational cost analysis. Combined with curriculum learning, our method solves instances that are much larger than those in Kera et al. (2024).
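
The tree-structured inductive bias can be approximated with two-level attention: tokens attend within local blocks (e.g., within a monomial or polynomial), then block summaries attend globally, avoiding full quadratic attention. The sketch below is illustrative of this pattern only; the actual HAT generalizes it to arbitrary depths, and all sizes are assumptions.

```python
# Two-level (hierarchical) attention sketch.
import torch
import torch.nn as nn

class TwoLevelAttention(nn.Module):
    def __init__(self, d=64, heads=4, block=8):
        super().__init__()
        self.block = block
        self.local = nn.MultiheadAttention(d, heads, batch_first=True)
        self.globl = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x):                       # x: (B, L, d), L divisible by block
        B, L, d = x.shape
        blocks = x.reshape(B * (L // self.block), self.block, d)
        blocks, _ = self.local(blocks, blocks, blocks)               # within-block
        summaries = blocks.mean(dim=1).reshape(B, L // self.block, d)
        summaries, _ = self.globl(summaries, summaries, summaries)   # across blocks
        # Broadcast the refined block summaries back to their tokens.
        return blocks.reshape(B, L, d) + summaries.repeat_interleave(self.block, dim=1)

tokens = torch.randn(2, 64, 64)                 # encoded polynomial system
print(TwoLevelAttention()(tokens).shape)        # torch.Size([2, 64, 64])
```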

[262] Generative Urban Flow Modeling: From Geometry to Airflow with Graph Diffusion

Francisco Giral, Álvaro Manzano, Ignacio Gómez, Petros Koumoutsakos, Soledad Le Clainche

Main category: cs.LG

TL;DR: A generative diffusion framework for synthesizing urban wind fields on unstructured meshes using only geometry information, combining hierarchical graph neural networks with score-based diffusion to generate accurate velocity fields without temporal simulations.

DetailsMotivation: Urban wind flow modeling is crucial for air quality assessment and sustainable city planning, but current approaches face challenges: low-order models can't capture complex geometry effects, while high-fidelity CFD simulations are computationally expensive, especially for multiple geometries or wind conditions.

Method: Proposes a generative diffusion framework that combines hierarchical graph neural networks with score-based diffusion modeling to synthesize steady-state urban wind fields over unstructured meshes. The model requires only geometry information and generates velocity fields without temporal rollouts or dense measurements. It’s trained across multiple mesh slices and wind angles.

Result: The model generalizes to unseen geometries, recovers key flow structures (wakes and recirculation zones), offers uncertainty-aware predictions, and shows robustness to mesh variation in ablation studies. It performs well under different inference regimes.

Conclusion: This work represents a first step toward foundation models for the built environment that can help urban planners rapidly evaluate design decisions under densification and climate uncertainty, providing an efficient alternative to expensive CFD simulations.

Abstract: Urban wind flow modeling and simulation play an important role in air quality assessment and sustainable city planning. A key challenge for modeling and simulation is handling the complex geometries of the urban landscape. Low order models are limited in capturing the effects of geometry, while high-fidelity Computational Fluid Dynamics (CFD) simulations are prohibitively expensive, especially across multiple geometries or wind conditions. Here, we propose a generative diffusion framework for synthesizing steady-state urban wind fields over unstructured meshes that requires only geometry information. The framework combines a hierarchical graph neural network with score-based diffusion modeling to generate accurate and diverse velocity fields without requiring temporal rollouts or dense measurements. Trained across multiple mesh slices and wind angles, the model generalizes to unseen geometries, recovers key flow structures such as wakes and recirculation zones, and offers uncertainty-aware predictions. Ablation studies confirm robustness to mesh variation and performance under different inference regimes. This work is a first step towards foundation models for the built environment that can help urban planners rapidly evaluate design decisions under densification and climate uncertainty.
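
A minimal ancestral-sampling loop over node features conveys the mechanics: a (here untrained) toy graph network predicts noise conditioned on geometry through the adjacency, and a DDPM-style update denoises the velocity field step by step. The `TinyGNN`, noise schedule, and all sizes are placeholders, not the paper's hierarchical model.

```python
# DDPM-style reverse diffusion over mesh-node velocity features.
import torch
import torch.nn as nn

T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

class TinyGNN(nn.Module):                        # predicts noise from node states
    def __init__(self, d=3, h=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d + 1, h), nn.SiLU(), nn.Linear(h, d))
    def forward(self, x, adj, t):
        msg = adj @ x                            # one round of neighbor averaging
        tt = torch.full_like(x[:, :1], t / T)    # timestep conditioning
        return self.mlp(torch.cat([msg, tt], dim=-1))

n = 50                                           # mesh nodes of one geometry slice
adj = torch.rand(n, n); adj = (adj + adj.T) / 2; adj /= adj.sum(-1, keepdim=True)
eps_model = TinyGNN()

x = torch.randn(n, 3)                            # start from noise (velocity components)
with torch.no_grad():                            # pure sampling, no training here
    for t in reversed(range(T)):
        eps = eps_model(x, adj, t)
        x = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
print(x.shape)                                   # (n, 3) sampled velocity field
```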

[263] Quantum Decision Transformers (QDT): Synergistic Entanglement and Interference for Offline Reinforcement Learning

Abraham Itzhak Weinberg

Main category: cs.LG

TL;DR: QDT combines quantum-inspired attention and feedforward networks to dramatically improve offline RL performance over standard Decision Transformers through synergistic computational mechanisms.

DetailsMotivation: Standard Decision Transformers struggle with long-horizon credit assignment and complex state-action dependencies in offline reinforcement learning, limiting their effectiveness.

Method: Introduces Quantum Decision Transformer with two core components: Quantum-Inspired Attention with entanglement operations for non-local feature correlations, and Quantum Feedforward Networks with multi-path processing and learnable interference for adaptive computation.

Result: Achieves over 2,000% performance improvement compared to standard DTs, with superior generalization across varying data qualities. Ablation studies reveal strong synergistic effects where neither component alone achieves competitive performance.

Conclusion: Quantum-inspired design principles offer a promising direction for advancing transformer architectures in sequential decision-making, requiring holistic co-design of interdependent mechanisms rather than modular component adoption.

Abstract: Offline reinforcement learning enables policy learning from pre-collected datasets without environment interaction, but existing Decision Transformer (DT) architectures struggle with long-horizon credit assignment and complex state-action dependencies. We introduce the Quantum Decision Transformer (QDT), a novel architecture incorporating quantum-inspired computational mechanisms to address these challenges. Our approach integrates two core components: Quantum-Inspired Attention with entanglement operations that capture non-local feature correlations, and Quantum Feedforward Networks with multi-path processing and learnable interference for adaptive computation. Through comprehensive experiments on continuous control tasks, we demonstrate over 2,000% performance improvement compared to standard DTs, with superior generalization across varying data qualities. Critically, our ablation studies reveal strong synergistic effects between quantum-inspired components: neither alone achieves competitive performance, yet their combination produces dramatic improvements far exceeding individual contributions. This synergy demonstrates that effective quantum-inspired architecture design requires holistic co-design of interdependent mechanisms rather than modular component adoption. Our analysis identifies three key computational advantages: enhanced credit assignment through non-local correlations, implicit ensemble behavior via parallel processing, and adaptive resource allocation through learnable interference. These findings establish quantum-inspired design principles as a promising direction for advancing transformer architectures in sequential decision-making, with implications extending beyond reinforcement learning to neural architecture design more broadly.
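
The paper's exact operators are not described in this summary, so the following is one plausible, hypothetical reading of "multi-path processing with learnable interference": parallel feedforward paths whose outputs are combined with learnable signed amplitudes (cosine phases), so paths can reinforce or cancel each other.

```python
# Hypothetical sketch of a multi-path feedforward block with learnable
# "interference": each path's output is scaled by cos(phi_k), so paths can
# cancel or reinforce. This is one plausible reading of the abstract, not
# the paper's verified architecture.
import torch
import torch.nn as nn

class InterferenceFFN(nn.Module):
    def __init__(self, dim: int, paths: int = 4):
        super().__init__()
        self.paths = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(paths))
        self.phase = nn.Parameter(torch.zeros(paths))  # learnable phases

    def forward(self, x):
        amps = torch.cos(self.phase)                   # signed amplitudes
        return sum(a * p(x) for a, p in zip(amps, self.paths))

block = InterferenceFFN(dim=32)
print(block(torch.randn(8, 32)).shape)  # torch.Size([8, 32])
```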

[264] A Critical Perspective on Finite Sample Conformal Prediction Theory in Medical Applications

Klaus-Rudolf Kladny, Bernhard Schölkopf, Lisa Koch, Christian F. Baumgartner, Michael Muehlebach

Main category: cs.LG

TL;DR: Conformal prediction provides statistical guarantees for uncertainty estimates but practical utility depends heavily on calibration set size, which is problematic in medical domains where data is scarce.

DetailsMotivation: While conformal prediction offers statistical guarantees for uncertainty estimates in ML models, the authors question whether these guarantees have practical utility when calibration sets are small, which is common in medical applications where data is limited.

Method: The paper critiques conformal prediction theory by analyzing the relationship between calibration set size and practical utility of statistical guarantees, supported by empirical demonstration on a medical image classification task.

Result: The authors show that although conformal prediction’s statistical guarantees hold for calibration sets of any size, the practical usefulness of these guarantees depends significantly on the size of the calibration set.

Conclusion: Conformal prediction’s promise of meaningful uncertainty guarantees with small calibration sets is questionable in medical domains where data scarcity makes large calibration sets infeasible, highlighting a limitation for clinical applications.

Abstract: Machine learning (ML) is transforming healthcare, but safe clinical decisions demand reliable uncertainty estimates that standard ML models fail to provide. Conformal prediction (CP) is a popular tool that allows users to turn heuristic uncertainty estimates into uncertainty estimates with statistical guarantees. CP works by converting predictions of an ML model, together with a calibration sample, into prediction sets that are guaranteed to contain the true label with any desired probability. An often cited advantage is that CP theory holds for calibration samples of arbitrary size, suggesting that uncertainty estimates with practically meaningful statistical guarantees can be achieved even if only small calibration sets are available. We question this promise by showing that, although the statistical guarantees hold for calibration sets of arbitrary size, the practical utility of these guarantees depends strongly on the size of the calibration set. This observation is relevant in medical domains because data is often scarce and obtaining large calibration sets is therefore infeasible. We corroborate our critique in an empirical demonstration on a medical image classification task.
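
A minimal split-conformal sketch that makes the critique tangible: the coverage guarantee holds for any calibration size n, but the calibrated quantile level ceil((n+1)(1-alpha))/n forces conservative, large prediction sets when n is small. The data below is synthetic.

```python
# Split conformal prediction for classification, to make the size effect
# concrete: the coverage guarantee holds for any calibration size n, but the
# quantile is taken at level ceil((n+1)(1-alpha))/n, so tiny n forces
# conservative (large) prediction sets.
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]      # nonconformity
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    qhat = np.quantile(scores, q_level, method="higher")
    return [np.where(1.0 - p <= qhat)[0] for p in test_probs]

rng = np.random.default_rng(0)
for n_cal in (10, 1000):                                    # scarce vs ample data
    probs = rng.dirichlet(np.ones(5), size=n_cal + 100)
    labels = np.array([rng.choice(5, p=p) for p in probs[:n_cal]])
    sets = conformal_sets(probs[:n_cal], labels, probs[n_cal:])
    print(n_cal, "avg set size:", np.mean([len(s) for s in sets]))
```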

[265] A data-driven approach to inferring travel trajectory during peak hours in urban rail transit systems

Jie He, Yong Qin, Jianyuan Guo, Xuan Sun, Xuanchuan Zheng

Main category: cs.LG

TL;DR: A data-driven approach using AFC and AVL data to infer urban rail transit travel trajectories with over 90% accuracy during peak hours, eliminating reliance on external survey data.

DetailsMotivation: Refined trajectory inference is crucial for urban rail transit operation organization, but existing methods often rely on external/survey data and synthetic validation, limiting robustness and applicability.

Method: Three-step approach: 1) Establish train alternative sets based on spatio-temporal constraints, 2) Data-driven adaptive trajectory inference using KLEM (KL divergence + EM algorithm) for parameter estimation, 3) Travel trajectory construction. Uses real AFC and AVL data instead of synthetic data.

Result: Achieves high-precision passenger trajectory inference with over 90% accuracy rate during peak hours in urban rail transit systems.

Conclusion: The proposed fully data-driven approach effectively infers individual travel trajectories without external data dependency, demonstrating strong practical applicability and robustness for urban rail transit operation management.

Abstract: Refined trajectory inference of urban rail transit is of great significance for operation organization. In this paper, we develop a fully data-driven approach to inferring individual travel trajectories in urban rail transit systems. It utilizes data from the Automatic Fare Collection (AFC) and Automatic Vehicle Location (AVL) systems to infer key trajectory elements, such as selected train, access/egress time, and transfer time. The approach includes establishing train alternative sets based on spatio-temporal constraints, data-driven adaptive trajectory inference, and travel trajectory construction. To realize data-driven adaptive trajectory inference, a data-driven parameter estimation method based on KL divergence combined with the EM algorithm (KLEM) is proposed. This method eliminates the reliance on external or survey data for parameter fitting, enhancing the robustness and applicability of the model. Furthermore, to overcome the limitations of using synthetic data to validate the result, this paper employs real individual travel trajectory data for verification. The results show that the approach developed in this paper can achieve high-precision passenger trajectory inference, with an accuracy rate of over 90% in urban rail transit travel trajectory inference during peak hours.
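
A deliberately simplified EM sketch of the assignment step (Gaussian egress times, three candidate trains, synthetic taps); the paper's KLEM method adds a KL-divergence-based criterion that is not reproduced here.

```python
# Simplified EM sketch for train assignment: each passenger's egress time
# (tap-out minus candidate train arrival) is modeled as Gaussian; the E-step
# computes assignment probabilities over feasible trains and the M-step
# refits the egress distribution.
import numpy as np

rng = np.random.default_rng(1)
arrivals = np.sort(rng.uniform(0, 60, size=(200, 3)), axis=1)  # 3 candidate trains
chosen = rng.integers(0, 3, size=200)
egress_true = rng.normal(3.0, 1.0, size=200).clip(0.2)         # minutes of walking
tapout = arrivals[np.arange(200), chosen] + egress_true

mu, sigma = 1.0, 5.0                                # crude initialization
for _ in range(30):
    egress = tapout[:, None] - arrivals             # (passenger, candidate)
    logp = -0.5 * ((egress - mu) / sigma) ** 2 - np.log(sigma)
    logp[egress <= 0] = -np.inf                     # infeasible trains
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)               # E-step: train responsibilities
    mu = (r * egress).sum() / r.sum()               # M-step: refit egress model
    sigma = np.sqrt((r * (egress - mu) ** 2).sum() / r.sum())

print(round(float(mu), 2), round(float(sigma), 2))  # should approach (3.0, 1.0)
```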

[266] Semantic Geometry for policy-constrained interpretation

Nikit Phadke

Main category: cs.LG

TL;DR: A geometric framework prevents hallucinated commitments in high-stakes domains using spherical representations, policy constraints as priors, and constrained optimization that enables refusal when needed.

DetailsMotivation: To prevent hallucinated commitments in high-stakes domains where incorrect semantic interpretations can have serious consequences, particularly in regulated environments like finance.

Method: Semantic meaning represented as directions on a unit sphere, evidence as witness vectors, admissible interpretations as spherical convex regions. Policy constraints introduced as explicit priors on the same manifold, separated from evidence geometry. Interpretation reduces to constrained optimization over admissible regions.

Result: Zero hallucinated approvals across multiple policy regimes on large-scale regulated financial data - the first such result at scale. Complexity bounds proven to be information-theoretically optimal.

Conclusion: The geometric framework provides provable prevention of hallucinated commitments through topological refusal mechanisms, connecting to information theory, Bayesian inference, and sheaf-theoretic semantics while achieving practical success in regulated domains.

Abstract: We present a geometric framework for policy-constrained semantic interpretation that provably prevents hallucinated commitments in high-stakes domains. Semantic meaning is represented as direction on a unit sphere, evidence is modeled as sets of witness vectors, and admissible interpretations correspond to spherical convex regions. Policy constraints are introduced as explicit priors defined over the same manifold, separated from evidence geometry. Interpretation reduces to constrained optimization over admissible regions, with refusal emerging as a topologically necessary outcome under contradiction or policy exclusion. We connect this framework to information theory, Bayesian inference, and sheaf-theoretic semantics, proving that our complexity bounds are information-theoretically optimal. Empirical validation on large-scale regulated financial data demonstrates zero hallucinated approvals across multiple policy regimes, the first such result at scale.
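
A toy sketch of the refusal mechanism under stated assumptions: witness vectors are averaged into a candidate direction, and each policy constraint is a spherical cap given as a unit center plus a maximum angle. If the candidate violates any cap, the only licensed output is refusal.

```python
# Geometric refusal sketch: evidence witnesses vote for a direction on the
# unit sphere; policy constraints are spherical caps (center direction plus
# maximum angle). Cap parameters below are illustrative.
import numpy as np

def interpret(witnesses, caps):
    """witnesses: (k, d) evidence vectors; caps: list of (center, max_angle)."""
    v = witnesses.mean(axis=0)
    v /= np.linalg.norm(v)                       # candidate interpretation
    for center, max_angle in caps:
        c = center / np.linalg.norm(center)
        if np.arccos(np.clip(v @ c, -1, 1)) > max_angle:
            return None                          # refusal: no admissible direction
    return v

w = np.array([[1.0, 0.2, 0.0], [0.9, 0.1, 0.1]])
permissive = [(np.array([1.0, 0.0, 0.0]), 0.5)]
strict = [(np.array([0.0, 0.0, 1.0]), 0.2)]
print(interpret(w, permissive) is not None)      # True: within the policy cap
print(interpret(w, strict))                      # None: refuse
```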

[267] Inference Time Feature Injection: A Lightweight Approach for Real-Time Recommendation Freshness

Qiang Chen, Venkatesh Ganapati Hegde, Hongfei Li

Main category: cs.LG

TL;DR: Lightweight method for intra-day personalization in video streaming by injecting recent watch history at inference time without retraining models.

DetailsMotivation: Batch-trained recommender systems with daily feature updates create stale recommendations that don't incorporate users' most recent actions, failing to adapt to evolving preferences throughout the day.

Method: Model-agnostic approach that selectively overrides stale user features at inference time using recent watch history, enabling instant adaptation without requiring model retraining.

Result: Statistically significant 0.47% increase in key user engagement metrics, representing one of the most substantial engagement gains in recent experimentation cycles.

Conclusion: First evidence that intra-day personalization can drive meaningful impact in long-form video streaming, offering a compelling alternative to full real-time architectures that require model retraining.

Abstract: Many recommender systems in long-form video streaming rely on batch-trained models and batch-updated features, where user features are updated daily and served statically throughout the day. While efficient, this approach fails to incorporate a user’s most recent actions, often resulting in stale recommendations. In this work, we present a lightweight, model-agnostic approach for intra-day personalization that selectively injects recent watch history at inference time without requiring model retraining. Our approach selectively overrides stale user features at inference time using the recent watch history, allowing the system to adapt instantly to evolving preferences. By reducing the personalization feedback loop from daily to intra-day, we observed a statistically significant 0.47% increase in key user engagement metrics, which ranked among the most substantial engagement gains observed in recent experimentation cycles. To our knowledge, this is the first published evidence that intra-day personalization can drive meaningful impact in long-form video streaming services, providing a compelling alternative to full real-time architectures where model retraining is required.
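
A minimal serving-side sketch of the idea: override stale batch features with values derived from the session's recent watch history before calling the unchanged model. Feature names and the override rule are illustrative assumptions.

```python
# Inference-time feature injection sketch: a serving-side wrapper overrides
# the stale daily-batch features with values derived from the session's
# recent watch history before calling the unchanged ranking model.
def inject_recent_history(batch_features: dict, recent_watches: list) -> dict:
    feats = dict(batch_features)          # never mutate the daily snapshot
    if recent_watches:                    # only override when fresh signal exists
        feats["last_watched_genre"] = recent_watches[-1]["genre"]
        feats["titles_watched_today"] = len(recent_watches)
    return feats

stale = {"last_watched_genre": "drama", "titles_watched_today": 0}
session = [{"genre": "sci-fi"}, {"genre": "documentary"}]
print(inject_recent_history(stale, session))
# {'last_watched_genre': 'documentary', 'titles_watched_today': 2}
```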

[268] NoveltyRank: Estimating Conceptual Novelty of AI Papers

Zhengxu Yan, Han Li, Yuming Feng

Main category: cs.LG

TL;DR: Developed a model to estimate and rank conceptual novelty of AI papers using semantic analysis of titles/abstracts, with two task formulations: binary classification and pairwise novelty comparison.

DetailsMotivation: The surge in AI publications makes it hard for truly novel work to stand out, and manual novelty assessment is unstable and time-consuming. Need a data-driven, scalable system to identify genuinely innovative research.

Method: Evaluates novelty through paper’s title, abstract, and semantic similarity to prior literature. Two task formulations: (1) binary classification predicting absolute novelty, and (2) pairwise novelty comparison learning relative novelty. Fine-tuned Qwen3-4B-Instruct-2507 and SciBERT, benchmarked against GPT-5.1.

Result: Implementation publicly available at https://github.com/ZhengxuYan/NoveltyRank. Performance analysis shows how task formulation and modeling choices affect novelty estimation.

Conclusion: The system enables data-driven, scalable assessment of research originality, helping researchers identify innovative submissions and providing reviewers with quantitative novelty signals.

Abstract: With the growing ease of academic publishing, the volume of research papers, especially in AI-related fields, has surged dramatically. This flood of publications makes it difficult for truly novel and impactful work to stand out, and manual novelty assessment is often unstable and time-consuming. Our project aims to develop a model that estimates and ranks the conceptual novelty of AI papers, enabling a data-driven and scalable assessment of research originality. Such a system can help researchers efficiently identify submissions that introduce genuinely innovative ideas rather than minor variants, and provide conference reviewers with a quantitative and consistent signal of novelty. Our approach evaluates novelty primarily through a paper’s title, abstract, and semantic similarity to prior literature. Given the motivation of novelty estimation, we explore two task formulations with different modeling objectives, each offering a different perspective: (1) binary classification, which predicts the paper’s absolute novelty from learned patterns of prior novel works, and (2) pairwise novelty comparison, which learns to distinguish papers by relative novelty over others. We fine-tune Qwen3-4B-Instruct-2507 and SciBERT on both tasks, benchmarking against GPT-5.1 to analyze how task formulation and modeling choices affect performance. The implementation is publicly available at https://github.com/ZhengxuYan/NoveltyRank.
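
A minimal sketch of the pairwise formulation as a Bradley-Terry / RankNet objective, with a small MLP standing in for the fine-tuned encoders the paper actually uses.

```python
# Pairwise novelty comparison as a Bradley-Terry objective: a scoring model
# should assign a higher scalar score to the paper judged more novel. The
# 768-d embeddings here are random stand-ins for encoder outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

scorer = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

def pairwise_novelty_loss(emb_more_novel, emb_less_novel):
    margin = scorer(emb_more_novel) - scorer(emb_less_novel)
    # log-sigmoid of the score gap: standard Bradley-Terry / RankNet loss
    return -F.logsigmoid(margin).mean()

a, b = torch.randn(16, 768), torch.randn(16, 768)
loss = pairwise_novelty_loss(a, b)
loss.backward()
print(float(loss))
```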

[269] Guided Discrete Diffusion for Constraint Satisfaction Problems

Justin Jung

Main category: cs.LG

TL;DR: Discrete diffusion guidance enables solving Sudoku puzzles without supervision by satisfying constraints.

DetailsMotivation: To develop a method for solving constraint satisfaction problems (CSPs) like Sudoku without requiring supervised training data or explicit rule programming.

Method: Uses discrete diffusion guidance, a technique that guides the diffusion process to satisfy constraints, applying it to CSPs where variables have discrete values.

Result: Demonstrates successful solving of Sudoku puzzles without supervision, showing the approach can handle combinatorial constraints.

Conclusion: Discrete diffusion guidance provides an effective unsupervised approach for solving CSPs, with potential applications to various combinatorial problems.

Abstract: We propose discrete diffusion guidance for constraint satisfaction problems (CSPs) and demonstrate its ability to solve Sudoku puzzles without supervision.

[270] Evaluating Weather Forecasts from a Decision Maker’s Perspective

Kornelius Raeth, Nicole Ludwig

Main category: cs.LG

TL;DR: Decision calibration evaluates forecasts based on their ability to improve decision-making rather than statistical accuracy, revealing that forecast-level performance doesn’t reliably translate to decision-level performance.

DetailsMotivation: Standard forecast evaluations focus on statistical accuracy from the forecaster's perspective, but in practice forecasts are used to make decisions. There's a need to evaluate forecasts from the decision-maker's perspective by quantifying their value in improving decision-making.

Method: Decision calibration framework evaluates forecast performance at the decision level rather than forecast level. Applied to compare Machine Learning and classical numerical weather prediction models on various weather-dependent decision tasks.

Result: Model performance at forecast level doesn’t reliably translate to decision-level performance. Some performance differences only become apparent at decision level, and model rankings can change among different decision tasks.

Conclusion: Typical forecast evaluations are insufficient for selecting optimal forecast models for specific decision tasks; decision calibration provides necessary evaluation from decision-maker’s perspective.

Abstract: Standard weather forecast evaluations focus on the forecaster’s perspective and on a statistical assessment comparing forecasts and observations. In practice, however, forecasts are used to make decisions, so it seems natural to take the decision-maker’s perspective and quantify the value of a forecast by its ability to improve decision-making. Decision calibration provides a novel framework for evaluating forecast performance at the decision level rather than the forecast level. We apply decision calibration to compare machine learning and classical numerical weather prediction models on various weather-dependent decision tasks. We find that model performance at the forecast level does not reliably translate to performance in downstream decision-making: some performance differences only become apparent at the decision level, and model rankings can change across different decision tasks. Our results confirm that typical forecast evaluations are insufficient for selecting the optimal forecast model for a specific decision task.
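
A toy sketch of decision-level evaluation: score forecasts by the realized cost of the decision they induce (here a frost-protection decision with asymmetric costs, all values illustrative) alongside RMSE, so the two rankings can be compared.

```python
# Decision-level evaluation sketch: a toy heating decision with asymmetric
# costs. A forecast's RMSE ranking and its decision-cost ranking need not
# agree, which is the paper's central point.
import numpy as np

rng = np.random.default_rng(0)
truth = rng.normal(0.0, 3.0, size=5000)             # realized temperature

def decision_cost(forecast, realized, threshold=-2.0):
    heat = forecast < threshold                      # decide from the forecast
    frost = realized < threshold
    # missing a frost is 10x as costly as heating unnecessarily
    return np.where(frost & ~heat, 10.0, 0.0) + np.where(heat & ~frost, 1.0, 0.0)

sharp = truth + rng.normal(0, 1.0, truth.shape)          # low-RMSE forecast
hedged = truth - 1.0 + rng.normal(0, 1.0, truth.shape)   # cold-biased, higher RMSE

for name, f in (("sharp", sharp), ("hedged-cold", hedged)):
    rmse = np.sqrt(np.mean((f - truth) ** 2))
    print(name, "RMSE:", round(rmse, 2),
          "mean decision cost:", round(float(decision_cost(f, truth).mean()), 3))
```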

[271] Unreliable Uncertainty Estimates with Monte Carlo Dropout

Aslak Djupskås, Alexander Johannes Stasik, Signe Riemer-Sørensen

Main category: cs.LG

TL;DR: MCD uncertainty estimates are less reliable than Bayesian methods for capturing true uncertainty in extrapolation/interpolation regions.

DetailsMotivation: Need for reliable uncertainty estimation in safety-critical ML applications, with MCD being computationally efficient but needing empirical validation of its uncertainty quality compared to Bayesian methods.

Method: Empirical investigation comparing Monte Carlo Dropout (MCD) uncertainty estimation against Gaussian Processes (GP) and Bayesian Neural Networks (BNN), evaluating ability to capture true uncertainty in extrapolation and interpolation regions.

Result: MCD struggles to accurately reflect true uncertainty, particularly failing to capture increased uncertainty in extrapolation and interpolation regions where Bayesian models show proper uncertainty behavior.

Conclusion: Uncertainty estimates from MCD are not as reliable as those from traditional Bayesian approaches (GP and BNN) for capturing both epistemic and aleatoric uncertainty.

Abstract: Reliable uncertainty estimation is crucial for machine learning models, especially in safety-critical domains. While exact Bayesian inference offers a principled approach, it is often computationally infeasible for deep neural networks. Monte Carlo dropout (MCD) was proposed as an efficient approximation to Bayesian inference in deep learning by applying neuron dropout at inference time (Gal and Ghahramani, 2016). Hence, the method generates multiple sub-models yielding a distribution of predictions to estimate uncertainty. We empirically investigate its ability to capture true uncertainty and compare to Gaussian Processes (GP) and Bayesian Neural Networks (BNN). We find that MCD struggles to accurately reflect the underlying true uncertainty, particularly failing to capture increased uncertainty in extrapolation and interpolation regions as observed in Bayesian models. The findings suggest that uncertainty estimates from MCD, as implemented and evaluated in these experiments, are not as reliable as those from traditional Bayesian approaches for capturing epistemic and aleatoric uncertainty.
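
A minimal MCD sketch: keep dropout active at inference and read the spread of repeated stochastic passes as uncertainty. The paper's finding is that this spread can stay deceptively small in regions where a GP or BNN would report growing uncertainty.

```python
# Monte Carlo dropout sketch: dropout stays active at inference, and the
# spread of repeated stochastic forward passes is read as uncertainty.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCDropoutNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(1, 64), nn.Linear(64, 1)

    def forward(self, x):
        h = F.relu(self.fc1(x))
        h = F.dropout(h, p=0.2, training=True)   # active even at inference
        return self.fc2(h)

model = MCDropoutNet()
x = torch.linspace(-5, 5, 50).unsqueeze(1)       # extends past a training range
with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(100)])
mean, std = samples.mean(0), samples.std(0)      # predictive mean / uncertainty
print(mean.shape, std.shape)                     # torch.Size([50, 1]) twice
```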

[272] How Does Fourier Analysis Network Work? A Mechanism Analysis and a New Dual-Activation Layer Proposal

Sam Jeong, Hae Yong Kim

Main category: cs.LG

TL;DR: FAN’s performance gains come from sine activation’s non-zero derivative at x=0, not periodicity, addressing vanishing gradients and dying-ReLU problem. This insight led to more efficient Dual-Activation Layer (DAL) for faster convergence.

DetailsMotivation: To understand why Fourier Analysis Network (FAN) improves neural network performance, as previous studies showed consistent gains but unclear mechanisms behind sine/cosine activations.

Method: Analyzed FAN components, discovered only sine activation helps while cosine harms; identified improvement stems from sine’s non-zero derivative at x=0 mitigating vanishing gradients. Developed Dual-Activation Layer (DAL) based on this insight.

Result: FAN primarily alleviates dying-ReLU problem by providing stable gradient pathway. DAL models converge faster and achieve equal/higher validation accuracy on three tasks: noisy sinusoidal signal classification, MNIST digit classification, and ECG biometric recognition.

Conclusion: FAN’s benefits come from training dynamics rather than spectral properties. The analysis enabled development of more efficient DAL convergence accelerator that outperforms conventional activations.

Abstract: Fourier Analysis Network (FAN) was recently proposed as a simple way to improve neural network performance by replacing part of ReLU activations with sine and cosine functions. Although several studies have reported small but consistent gains across tasks, the underlying mechanism behind these improvements has remained unclear. In this work, we show that only the sine activation contributes positively to performance, whereas the cosine activation tends to be detrimental. Our analysis reveals that the improvement is not a consequence of the sine function’s periodic nature; instead, it stems from the function’s local behavior near x = 0, where its non-zero derivative mitigates the vanishing-gradient problem. We further show that FAN primarily alleviates the dying-ReLU problem, in which a neuron consistently receives negative inputs, produces zero gradients, and stops learning. Although modern ReLU-like activations, such as Leaky ReLU, GELU, and Swish, reduce ReLU’s zero-gradient region, they still contain input domains where gradients remain significantly diminished, contributing to slower optimization and hindering rapid convergence. FAN addresses this limitation by introducing a more stable gradient pathway. This analysis shifts the understanding of FAN’s benefits from a spectral interpretation to a concrete analysis of training dynamics, leading to the development of the Dual-Activation Layer (DAL), a more efficient convergence accelerator. We evaluate DAL on three tasks: classification of noisy sinusoidal signals versus pure noise, MNIST digit classification, and ECG-based biometric recognition. In all cases, DAL models converge faster and achieve equal or higher validation accuracy compared to models with conventional activations.
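
The paper's exact DAL composition is not given in this summary; the following hypothetical sketch implements the stated mechanism (a sine branch with non-zero derivative at x = 0 alongside a conventional activation).

```python
# Hypothetical Dual-Activation Layer sketch based on the stated mechanism:
# each pre-activation goes through both a sine branch and a conventional
# branch, so every neuron keeps a gradient path near x = 0. The paper's
# actual DAL composition may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualActivationLayer(nn.Module):
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, x):
        z = self.lin(x)
        # sin'(0) = 1 keeps gradients alive where GELU's gradient is small
        return torch.sin(z) + F.gelu(z)

layer = DualActivationLayer(16, 32)
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 32])
```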

[273] Entropy-Reservoir Bregman Projection: An Information-Geometric Unification of Model Collapse

Jingwei Chen

Main category: cs.LG

TL;DR: ERBP framework explains self-referential learning collapse via information geometry, showing entropy decay causes collapse and proposing entropy reservoirs as universal fix.

DetailsMotivation: Self-referential learning (training on self-generated data) suffers from model collapse across domains (LLMs, GANs, RL), but existing fixes are ad hoc without unified theory.

Method: ERBP models closed-loop learning as stochastic Bregman projection sequences in distribution space, analyzes entropy decay, and introduces entropy reservoirs (high-entropy distributions) to stabilize dynamics.

Result: Theory provides: (i) necessary condition for collapse, (ii) sufficient condition for non-trivial entropy floor, (iii) closed-form convergence rates. Experiments validate across LLMs, SAC RL, and GANs.

Conclusion: ERBP unifies disparate stabilization heuristics into single quantitative design rule: monitor and budget entropy flux via entropy reservoirs.

Abstract: Self-referential learning – training a model on data it generated itself – promises boundless scalability but chronically suffers from model collapse: language models degenerate into repetitive text, GANs drop modes, and reinforcement-learning policies over-exploit. Although practitioners employ ad hoc fixes such as real-data mixing, entropy bonuses, knowledge distillation, or retrieval-augmented generation, a single principle that explains both the failure mode and the success of these fixes has remained elusive. We present Entropy-Reservoir Bregman Projection (ERBP), an information-geometric framework that unifies these phenomena. We model the closed loop as a stochastic Bregman projection sequence in distribution space. Without external coupling, finite-sample noise forces the system to project onto an ever-shrinking empirical support, causing exponential entropy decay and eventual collapse. Introducing an Entropy Reservoir – a high-entropy distribution mixed into each projection – injects a controllable entropy flux that provably stabilises the dynamics. Our theory yields (i) a necessary condition for collapse, (ii) a sufficient condition that guarantees a non-trivial entropy floor, and (iii) closed-form rates that depend only on sample size and the strong-convexity/Lipschitz constants of the Bregman generator. Experiments on large-language-model self-training, Soft Actor-Critic in reinforcement learning, and GAN optimisation validate our predictions and show that disparate stabilisation heuristics correspond to specific reservoir choices and coupling coefficients. ERBP thus transforms a collection of folk remedies into a single, quantitative design rule: monitor and budget your entropy flux.
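
A small numerical illustration of the claimed mechanism: repeatedly refit a categorical distribution to finite samples of itself. Without mixing, entropy drifts toward collapse; mixing in a uniform "reservoir" enforces an entropy floor. Sizes and the mixing coefficient are illustrative.

```python
# Collapse vs. reservoir demo: refit a categorical distribution to finite
# samples of itself. lam = 0 reproduces the collapse dynamics; lam > 0 mixes
# in a high-entropy (uniform) reservoir each step.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def self_train(lam, steps=500, n=30, k=20, seed=0):
    rng = np.random.default_rng(seed)
    p = np.full(k, 1.0 / k)
    for _ in range(steps):
        counts = rng.multinomial(n, p)
        p_hat = counts / n                      # finite-sample projection
        p = (1 - lam) * p_hat + lam * np.full(k, 1.0 / k)
    return entropy(p)

print("no reservoir:   H =", round(self_train(lam=0.0), 3))  # decays toward 0
print("with reservoir: H =", round(self_train(lam=0.1), 3))  # entropy floor
```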

[274] Task Matrices: Linear Maps for Cross-Model Finetuning Transfer

Darrin O’Brien, Dhikshith Gajulapalli, Eric Xia

Main category: cs.LG

TL;DR: The paper demonstrates that linear “task matrices” can effectively transform base model embeddings to approximate finetuned model performance across vision and text tasks, revealing cross-layer linear encodings in neural networks.

DetailsMotivation: Previous interpretability research showed linear encodings exist with in-context prompting, but it was unclear if similar linear representations exist in more general adaptation regimes like finetuning.

Method: Developed the concept of a “task matrix” - a linear transformation from base to finetuned embedding states. Used data-based approximation to efficiently compute these matrices for vision and text models across ten datasets.

Result: Task matrices surpass linear probe performance and sometimes approach finetuned model levels. Validated existence of cross-layer linear encodings between pretrained and finetuned architectures. Data-based approximation proved efficient and generalizable across domains.

Conclusion: Linear task matrices provide an effective, efficient way to adapt base models, revealing fundamental linear structure in neural network adaptation. Implementation is publicly available.

Abstract: Results in interpretability suggest that large vision and language models learn implicit linear encodings when models are biased by in-context prompting. However, the existence of similar linear representations in more general adaptation regimes has not yet been demonstrated. In this work, we develop the concept of a task matrix, a linear transformation from a base to finetuned embedding state. We demonstrate that for vision and text models and ten different datasets, a base model augmented with a task matrix achieves results surpassing linear probes, sometimes approaching finetuned levels. Our results validate the existence of cross-layer linear encodings between pretrained and finetuned architectures. Moreover, we show that a data-based approximation for such encodings is both efficient and generalizable to multiple domains. We make our implementation publicly available.
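
A minimal sketch of fitting a task matrix by least squares, with synthetic embeddings standing in for base and finetuned model states.

```python
# Task-matrix sketch: fit a single linear map W that sends base-model
# embeddings to the corresponding finetuned-model embeddings, then apply W
# to adapt new base embeddings. Embeddings are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 64))              # base-model embeddings
W_true = rng.normal(size=(64, 64)) / 8.0
finetuned = base @ W_true + 0.01 * rng.normal(size=base.shape)

W, *_ = np.linalg.lstsq(base, finetuned, rcond=None)   # the "task matrix"
adapted = base @ W                                      # base + task matrix
err = np.linalg.norm(adapted - finetuned) / np.linalg.norm(finetuned)
print("relative reconstruction error:", round(float(err), 4))
```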

[275] OLR-WA: Online Weighted Average Linear Regression in Multivariate Data Streams

Mohammad Abu-Shaira, Alejandro Rodriguez, Greg Speegle, Victor Sheng, Ishfaq Ahmad

Main category: cs.LG

TL;DR: OLR-WA is a novel multivariate online linear regression model that achieves batch-like performance while handling data drift and confidence-based scenarios through weighted averaging with conservative updates prioritizing older, high-confidence data.

DetailsMotivation: Online learning needs efficient models that avoid large storage and costly recalculations while handling evolving data patterns (drift) and maintaining performance comparable to batch methods.

Method: OLR-WA (OnLine Regression with Weighted Average) uses weighted averaging with conservative updates that prioritize older data points with higher confidence levels, enabling effective handling of temporal drift and confidence-based scenarios.

Result: OLR-WA achieves performance comparable to batch regression and outperforms other online models, with rapid convergence (high r² values from first iteration), even when initialized with only 1-10% of total data. It uniquely handles confidence-based scenarios effectively.

Conclusion: OLR-WA demonstrates versatility and utility across different contexts, making it a valuable solution for online linear regression tasks with its ability to handle drift, achieve batch-like performance, and uniquely manage confidence-based scenarios.

Abstract: Online learning updates models incrementally with new data, avoiding large storage requirements and costly model recalculations. In this paper, we introduce “OLR-WA; OnLine Regression with Weighted Average”, a novel and versatile multivariate online linear regression model. We also investigate scenarios involving drift, where the underlying patterns in the data evolve over time, conduct convergence analysis, and compare our approach with existing online regression models. The results of OLR-WA demonstrate its ability to achieve performance comparable to batch regression, while also showcasing comparable or superior performance when compared with other state-of-the-art online models, thus establishing its effectiveness. Moreover, OLR-WA exhibits exceptional performance in terms of rapid convergence, surpassing other online models by consistently achieving high r² values from the first iteration to the last, even when initialized with a minimal number of data points, as little as 1% to 10% of the total. In addition to its ability to handle time-based (temporal drift) scenarios, remarkably, OLR-WA stands out as the only model capable of effectively managing confidence-based challenging scenarios. It achieves this by adopting a conservative approach in its updates, giving priority to older data points with higher confidence levels. In summary, OLR-WA’s performance further solidifies its versatility and utility across different contexts, making it a valuable solution for online linear regression tasks.
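
A hedged sketch of the weighted-average idea (the paper's exact weighting rule may differ): merge each mini-batch's local least-squares fit into the running coefficients, with confidence proportional to the data seen so far so the older state dominates over time.

```python
# Weighted-average online regression sketch: fit each incoming mini-batch by
# least squares, then merge it into the running coefficient vector with a
# confidence-weighted average. The exact OLR-WA weighting rule may differ.
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])
w, conf = np.zeros(3), 0.0                      # model state and its weight

for step in range(100):
    X = rng.normal(size=(20, 3))                # one incoming mini-batch
    y = X @ w_true + 0.1 * rng.normal(size=20)
    w_batch, *_ = np.linalg.lstsq(X, y, rcond=None)
    batch_conf = len(y)
    # conservative update: the prior state dominates as its confidence grows
    w = (conf * w + batch_conf * w_batch) / (conf + batch_conf)
    conf += batch_conf

print(np.round(w, 3))                           # close to [2.0, -1.0, 0.5]
```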

[276] Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections

Niklas Lauffer, Xiang Deng, Srivatsa Kundurthy, Brad Kenstler, Jeff Da

Main category: cs.LG

TL;DR: OEC (on-policy expert corrections) addresses covariate shift in multi-turn LLM agents by generating partially on-policy data where student rollouts switch to expert corrections mid-trajectory, showing 13-14% improvement over imitation learning on SWE tasks.

DetailsMotivation: Imitation learning for multi-turn LM agents suffers from covariate shift - as student policy diverges from expert behavior, it encounters unseen states, reducing fine-tuning effectiveness. This fundamental limitation needs addressing for better agent training.

Method: Propose OEC (on-policy expert corrections): generate partially on-policy data by starting rollouts with student model, then switching to expert model part way through trajectory. Tested on SWE tasks using rejection sampling combined with supervised fine-tuning.

Result: OEC trajectories show relative 14% improvement (7b) and 13% improvement (32b) over traditional imitation learning on SWE-bench verified. Demonstrates superiority over other on-policy and imitation learning approaches.

Conclusion: Combining expert demonstrations with on-policy data is essential for effective multi-turn LM agent training. OEC methodology successfully addresses covariate shift limitations in imitation learning for LLM agents.

Abstract: A popular paradigm for training LM agents relies on imitation learning, fine-tuning on expert trajectories. However, we show that the off-policy nature of imitation learning for multi-turn LM agents suffers from the fundamental limitation known as covariate shift: as the student policy’s behavior diverges from the expert’s, it encounters states not present in the training data, reducing the effectiveness of fine-tuning. Taking inspiration from the classic DAgger algorithm, we propose a novel data generation methodology for addressing covariate shift in multi-turn LLM training. We introduce on-policy expert corrections (OECs), partially on-policy data generated by starting rollouts with a student model and then switching to an expert model part way through the trajectory. We explore the effectiveness of our data generation technique in the domain of software engineering (SWE) tasks, a multi-turn setting where LLM agents must interact with a development environment to fix software bugs. Our experiments compare OEC data against various other on-policy and imitation learning approaches on SWE agent problems and train models using rejection sampling (i.e., filtering by environment reward) combined with supervised fine-tuning. Experiments find that OEC trajectories show a relative 14% and 13% improvement over traditional imitation learning in the 7b and 32b setting, respectively, on SWE-bench verified. Our results demonstrate the need for combining expert demonstrations with on-policy data for effective multi-turn LM agent training.
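
A minimal OEC data-generation sketch with stub policies and a toy environment standing in for an LM agent and a SWE harness.

```python
# OEC data-generation sketch: roll the student for the first part of a
# trajectory, then hand control to the expert at a random switch point. The
# policies and environment here are toy placeholders.
import random

def oec_trajectory(env_reset, env_step, student, expert, horizon=10):
    state = env_reset()
    switch = random.randrange(1, horizon)       # switch point mid-trajectory
    traj = []
    for t in range(horizon):
        policy = student if t < switch else expert
        action = policy(state)
        traj.append((state, action, policy is expert))
        state = env_step(state, action)
    return traj

# toy stand-ins: integer states, +1 (student) / +2 (expert) actions
traj = oec_trajectory(env_reset=lambda: 0,
                      env_step=lambda s, a: s + a,
                      student=lambda s: 1,
                      expert=lambda s: 2)
print(traj[:4])   # (state, action, came_from_expert) triples
```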

[277] ATLAS: Adaptive Topology-based Learning at Scale for Homophilic and Heterophilic Graphs

Turja Kundu, Sanjukta Bhowmick

Main category: cs.LG

TL;DR: ATLAS is a scalable graph learning algorithm that uses multi-level community topology instead of iterative aggregation, achieving strong performance on both homophilic and heterophilic graphs without sampling.

DetailsMotivation: Address two key GNN limitations: performance degradation on heterophilic graphs (where connected nodes have different labels) and poor scalability due to iterative feature aggregation requiring sampling for large graphs.

Method: Extract topological community information at multiple resolution levels, concatenate community assignments to node features, then apply MLPs instead of GNN aggregation. This provides neighborhood context without iterative message passing.

Result: Achieves comparable accuracy to baselines with gains up to 20 percentage points over GCN for heterophilic graphs with negative structural bias and 11 percentage points over MLP for homophilic graphs. Scales to large graphs without sampling.

Conclusion: ATLAS provides a scalable, accurate alternative to GNNs that works well on both homophilic and heterophilic graphs, with multi-resolution community features offering a principled path toward explainable graph learning.

Abstract: We present ATLAS (Adaptive Topology-based Learning at Scale for Homophilic and Heterophilic Graphs), a novel graph learning algorithm that addresses two important challenges in graph neural networks (GNNs). First, the accuracy of GNNs degrades when the graph is heterophilic. Second, iterative feature aggregation limits the scalability of GNNs to large graphs. We address these challenges by extracting topological information about graph communities at multiple levels of refinement, concatenating community assignments to the feature vector, and applying multilayer perceptrons (MLPs) to the resulting representation. This provides topological context about nodes and their neighborhoods without invoking aggregation. Because MLPs are typically more scalable than GNNs, our approach applies to large graphs without the need for sampling. Across a wide set of graphs, ATLAS achieves comparable accuracy to baseline methods, with gains as high as 20 percentage points over GCN for heterophilic graphs with negative structural bias and 11 percentage points over MLP for homophilic graphs. Furthermore, we show how multi-resolution community features systematically modulate performance in both homophilic and heterophilic settings, opening a principled path toward explainable graph learning.
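
A minimal sketch of the feature construction, using networkx Louvain communities at several resolutions appended to node features before a plain MLP; the resolutions and downstream classifier are illustrative choices.

```python
# ATLAS-style feature construction sketch: detect communities at several
# resolutions, append each node's community id as extra features, and feed
# the concatenation to an MLP (no message passing).
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
feats = np.eye(G.number_of_nodes())             # stand-in node features

community_cols = []
for res in (0.5, 1.0, 2.0):                     # multiple refinement levels
    comms = nx.community.louvain_communities(G, resolution=res, seed=0)
    label = np.zeros(G.number_of_nodes())
    for cid, members in enumerate(comms):
        for v in members:
            label[v] = cid
    community_cols.append(label[:, None])

X = np.hstack([feats] + community_cols)         # features for a plain MLP
print(X.shape)                                  # (34, 34 + 3)
```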

[278] Low-rank MMSE filters, Kronecker-product representation, and regularization: a new perspective

Daniel Gomes de Pinho Zanco, Leszek Szczecinski, Jacob Benesty, Eduardo Vinicius Kuhn

Main category: cs.LG

TL;DR: Proposed method to efficiently find regularization parameter for low-rank MMSE filters using Kronecker-product representation, linking regularization to rank selection.

DetailsMotivation: The regularization parameter in low-rank MMSE filters is crucial but challenging to determine. Existing methods may not be efficient or optimal, especially in low-rank settings where the regularization parameter is surprisingly linked to rank selection problems.

Method: Proposes an efficient method to find the regularization parameter for low-rank MMSE filters based on a Kronecker-product representation. The approach leverages the connection between regularization parameter selection and rank selection problems.

Result: Simulation results validate the proposed method, showing significant performance gains over commonly used methods for determining regularization parameters in low-rank MMSE filters.

Conclusion: The regularization parameter in low-rank MMSE filters is fundamentally linked to rank selection, and the proposed efficient method for determining this parameter provides substantial improvements over existing approaches.

Abstract: In this work, we propose a method to efficiently find the regularization parameter for low-rank MMSE filters based on a Kronecker-product representation. We show that the regularization parameter is surprisingly linked to the problem of rank selection, and thus properly choosing it is crucial for low-rank settings. The proposed method is validated through simulations, showing significant gains over commonly used methods.
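
For context, the basic regularized MMSE filter the paper builds on is w = (R + λI)⁻¹p; the sketch below shows the role of λ, while the paper's Kronecker-based selection method is not reproduced.

```python
# Regularized MMSE filter in its basic form: w = (R + lambda * I)^{-1} p,
# with R the input correlation matrix and p the cross-correlation vector.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5000, 8))                  # observations
d = x @ rng.normal(size=8) + 0.1 * rng.normal(size=5000)  # desired signal

R = x.T @ x / len(x)                            # input correlation matrix
p = x.T @ d / len(x)                            # cross-correlation vector
for lam in (0.0, 0.1, 10.0):
    w = np.linalg.solve(R + lam * np.eye(8), p)
    mse = np.mean((d - x @ w) ** 2)
    print(f"lambda={lam:<5} MSE={mse:.4f}")     # too much lambda underfits
```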

[279] Deep Learning and Elicitability for McKean-Vlasov FBSDEs With Common Noise

Felipe J. P. Antunes, Yuri F. Saporito, Sebastian Jaimungal

Main category: cs.LG

TL;DR: Novel deep learning method for solving McKean-Vlasov FBSDEs with common noise using elicitability to create path-wise loss functions, avoiding nested Monte Carlo simulations.

DetailsMotivation: Existing methods for solving MV-FBSDEs with common noise require computationally expensive nested Monte Carlo simulations. There's a need for more efficient numerical approaches that can handle complex mean-field interactions and conditional expectations arising from common noise.

Method: Combines Picard iterations, elicitability, and deep learning. Uses elicitability to derive path-wise loss functions for training neural networks. Parameterizes mean-field interaction via recurrent neural network trained to minimize elicitable score. Approximates backward process through feedforward network representing decoupling field.

Result: Validated on systemic risk inter-bank borrowing model with analytical solutions - accurately recovers true solution. Extended to quantile-mediated interactions showing framework flexibility. Applied to non-stationary Aiyagari-Bewley-Huggett economic growth model with endogenous interest rates, demonstrating applicability to complex mean-field games without closed-form solutions.

Conclusion: The elicitability-based deep learning framework provides an efficient alternative to nested Monte Carlo methods for solving MV-FBSDEs with common noise, handling complex interactions and conditional expectations while being applicable to real-world economic models without analytical solutions.

Abstract: We present a novel numerical method for solving McKean-Vlasov forward-backward stochastic differential equations (MV-FBSDEs) with common noise, combining Picard iterations, elicitability and deep learning. The key innovation involves elicitability to derive a path-wise loss function, enabling efficient training of neural networks to approximate both the backward process and the conditional expectations arising from common noise - without requiring computationally expensive nested Monte Carlo simulations. The mean-field interaction term is parameterized via a recurrent neural network trained to minimize an elicitable score, while the backward process is approximated through a feedforward network representing the decoupling field. We validate the algorithm on a systemic risk inter-bank borrowing and lending model, where analytical solutions exist, demonstrating accurate recovery of the true solution. We further extend the model to quantile-mediated interactions, showcasing the flexibility of the elicitability framework beyond conditional means or moments. Finally, we apply the method to a non-stationary Aiyagari–Bewley–Huggett economic growth model with endogenous interest rates, illustrating its applicability to complex mean-field games without closed-form solutions.
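
A minimal sketch of why elicitability removes the nested simulation: the conditional mean minimizes the path-wise squared loss E[(f(X) − Y)²], so a network trained on single sampled pairs learns E[Y|X] directly, with no inner Monte Carlo loop. The data below is a toy stand-in.

```python
# Elicitability sketch: the conditional expectation E[Y|X] minimizes the
# path-wise squared loss, so training on single (X, Y) samples suffices.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for _ in range(2000):
    x = torch.rand(256, 1) * 4 - 2
    y = torch.sin(x) + 0.3 * torch.randn_like(x)   # E[Y | X = x] = sin(x)
    loss = ((net(x) - y) ** 2).mean()              # elicitable (squared) score
    opt.zero_grad()
    loss.backward()
    opt.step()

x0 = torch.tensor([[1.0]])
print(float(net(x0)), float(torch.sin(x0)))        # should be close
```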

[280] Softly Constrained Denoisers for Diffusion Models

Victor M. Yeom Song, Severi Rissanen, Arno Solin, Samuel Kaski, Mingfei Sun

Main category: cs.LG

TL;DR: Softly constrained denoisers integrate constraint guidance into the denoiser architecture to improve compliance while maintaining flexibility when constraints are misspecified.

DetailsMotivation: Diffusion models struggle to produce constraint-compliant samples in scientific applications. Existing methods (regularization or guidance) bias the model away from true data distribution, especially problematic when constraints are misspecified.

Method: Instead of changing loss or sampling loop, integrate guidance-inspired adjustment into the denoiser itself, giving it a soft inductive bias towards constraint-compliant samples.

Result: Softly constrained denoisers exploit constraint knowledge to improve compliance over standard denoisers while maintaining flexibility to deviate when constraints are misspecified with observed data.

Conclusion: Integrating constraint guidance directly into the denoiser architecture provides a better approach for scientific applications where constraints may be imperfectly specified.

Abstract: Diffusion models struggle to produce samples that respect constraints, a common requirement in scientific applications. Recent approaches have introduced regularization terms in the loss or guidance methods during sampling to enforce such constraints, but they bias the generative model away from the true data distribution. This is a problem, especially when the constraint is misspecified, a common issue when formulating constraints on scientific data. In this paper, instead of changing the loss or the sampling loop, we integrate a guidance-inspired adjustment into the denoiser itself, giving it a soft inductive bias towards constraint-compliant samples. We show that these softly constrained denoisers exploit constraint knowledge to improve compliance over standard denoisers, and maintain enough flexibility to deviate from the constraint when it is misspecified relative to observed data.
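
A minimal sketch of the idea under stated assumptions (a linear base denoiser and a differentiable unit-norm penalty as the constraint): the adjustment lives inside the denoiser's forward pass, not in the loss or the sampler.

```python
# Softly constrained denoiser sketch: wrap a base denoiser and nudge its
# output down the gradient of a differentiable constraint penalty. The
# penalty (unit-norm rows) and step size eta are illustrative.
import torch
import torch.nn as nn

class SoftlyConstrainedDenoiser(nn.Module):
    def __init__(self, base: nn.Module, eta: float = 0.1):
        super().__init__()
        self.base, self.eta = base, eta

    def constraint_penalty(self, x):
        return ((x.norm(dim=-1) - 1.0) ** 2).sum()   # soft "unit norm" rule

    def forward(self, x_noisy):
        x0 = self.base(x_noisy)
        with torch.enable_grad():
            x0g = x0.detach().requires_grad_(True)
            grad, = torch.autograd.grad(self.constraint_penalty(x0g), x0g)
        return x0 - self.eta * grad                  # soft bias, not a projection

denoiser = SoftlyConstrainedDenoiser(nn.Linear(8, 8))
out = denoiser(torch.randn(4, 8))
print(out.shape, out.norm(dim=-1))                   # norms nudged toward 1
```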

[281] Prompt Repetition Improves Non-Reasoning LLMs

Yaniv Leviathan, Matan Kalman, Yossi Matias

Main category: cs.LG

TL;DR: Repeating input prompts boosts performance for major AI models without increasing token count or latency when reasoning is not used.

DetailsMotivation: To explore simple, cost-effective methods to improve model performance without additional computational overhead or token usage.

Method: Simply repeating the input prompt before model processing, without using reasoning capabilities.

Result: Performance improvements observed across popular models (Gemini, GPT, Claude, and Deepseek) with no increase in generated tokens or latency.

Conclusion: Prompt repetition is an effective, zero-cost technique for enhancing model performance when reasoning is not required.

Abstract: When not using reasoning, repeating the input prompt improves performance for popular models (Gemini, GPT, Claude, and Deepseek) without increasing the number of generated tokens or latency.

[282] Adaptive Partitioning and Learning for Stochastic Control of Diffusion Processes

Hanqing Jin, Renyuan Xu, Yanzhao Yang

Main category: cs.LG

TL;DR: Model-based RL algorithm for controlled diffusion processes with unbounded state spaces using adaptive partitioning to balance exploration and approximation, achieving regret bounds via zooming dimension analysis.

DetailsMotivation: Address reinforcement learning for controlled diffusion processes with unbounded continuous state spaces, bounded continuous actions, and polynomially growing rewards - settings common in finance, economics, and operations research where existing methods struggle with continuous high-dimensional domains.

Method: Introduce model-based algorithm that adaptively partitions joint state-action space, maintains estimators of drift, volatility, and rewards within each partition, and refines discretization when estimation bias exceeds statistical confidence.

Result: Established regret bounds depending on problem horizon, state dimension, reward growth order, and newly defined zooming dimension tailored to unbounded diffusion processes. Bounds recover existing results for bounded settings as special case and extend theoretical guarantees to broader class of diffusion-type problems.

Conclusion: The adaptive partitioning approach effectively balances exploration and approximation, enabling efficient learning in unbounded domains, validated through numerical experiments including high-dimensional applications like multi-asset mean-variance portfolio selection.

Abstract: We study reinforcement learning for controlled diffusion processes with unbounded continuous state spaces, bounded continuous actions, and polynomially growing rewards: settings that arise naturally in finance, economics, and operations research. To overcome the challenges of continuous and high-dimensional domains, we introduce a model-based algorithm that adaptively partitions the joint state-action space. The algorithm maintains estimators of drift, volatility, and rewards within each partition, refining the discretization whenever estimation bias exceeds statistical confidence. This adaptive scheme balances exploration and approximation, enabling efficient learning in unbounded domains. Our analysis establishes regret bounds that depend on the problem horizon, state dimension, reward growth order, and a newly defined notion of zooming dimension tailored to unbounded diffusion processes. The bounds recover existing results for bounded settings as a special case, while extending theoretical guarantees to a broader class of diffusion-type problems. Finally, we validate the effectiveness of our approach through numerical experiments, including applications to high-dimensional problems such as multi-asset mean-variance portfolio selection.

[283] DreamPRM-Code: Function-as-Step Process Reward Model with Label Correction for LLM Coding

Ruiyi Zhang, Peijia Qin, Qi Cao, Pengtao Xie

Main category: cs.LG

TL;DR: DreamPRM-Code introduces a novel Process Reward Model for coding that treats functions as reasoning steps and uses meta-learning to correct noisy intermediate labels, achieving state-of-the-art performance on LiveCodeBench.

DetailsMotivation: Current Process Reward Models (PRMs) are ineffective for coding tasks due to two main issues: lack of meaningful step decompositions in code (unlike mathematical reasoning), and noise in Monte-Carlo-generated partial labels that hampers training.

Method: DreamPRM-Code uses a Chain-of-Function prompting strategy to induce modular code generation, treating functions as reasoning steps. It introduces a meta-learning-based correction mechanism that leverages clean final-solution unit-test labels and performs bi-level optimization to refine noisy intermediate labels.

Result: DreamPRM-Code achieved state-of-the-art performance on LiveCodeBench with 80.9 pass@1 rate, surpassing OpenAI o4-mini.

Conclusion: The proposed approach successfully adapts PRMs to coding tasks by addressing decomposition and label noise challenges, demonstrating that treating functions as reasoning steps combined with meta-learning label correction enables effective test-time scaling for code generation.

Abstract: Process Reward Models (PRMs) have become essential for improving Large Language Models (LLMs) via test-time scaling, yet their effectiveness in coding remains limited due to the lack of meaningful step decompositions in code and the noise of Monte-Carlo-generated partial labels. We propose DreamPRM-Code, a coding-focused PRM that treats functions as reasoning steps using a Chain-of-Function prompting strategy to induce modular code generation, enabling PRM training and application analogous to mathematical reasoning tasks. To address label noise, DreamPRM-Code introduces a meta-learning-based correction mechanism that leverages clean final-solution unit-test labels and performs bi-level optimization to refine intermediate labels. Applied to test-time scaling, DreamPRM-Code achieves state-of-the-art performance on LiveCodeBench with an 80.9 pass@1 rate, surpassing OpenAI o4-mini.
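
A minimal sketch of the function-as-step decomposition using Python's ast module, with a stub in place of the trained PRM.

```python
# Function-as-step sketch: parse generated code with ast, treat each
# top-level function as one reasoning step, and score steps with a
# placeholder PRM. The scorer is a stub; the paper trains a real PRM with
# corrected Monte-Carlo labels.
import ast

generated = """
def parse_input(s):
    return [int(tok) for tok in s.split()]

def solve(nums):
    return max(nums) - min(nums)
"""

def function_steps(code: str):
    tree = ast.parse(code)
    return [node for node in tree.body if isinstance(node, ast.FunctionDef)]

def stub_prm_score(fn: ast.FunctionDef) -> float:
    return 1.0  # placeholder: a trained PRM would score this step's quality

steps = function_steps(generated)
scores = [stub_prm_score(fn) for fn in steps]
print([fn.name for fn in steps], scores)   # ['parse_input', 'solve'] [1.0, 1.0]
```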

[284] Stock Pattern Assistant (SPA): A Deterministic and Explainable Framework for Structural Price Run Extraction and Event Correlation in Equity Markets

Sandeep Neela

Main category: cs.LG

TL;DR: SPA is a deterministic framework that extracts monotonic price runs, aligns them with public events, and generates factual explanations from daily OHLCV data for transparent market analysis.

DetailsMotivation: Existing market analysis tools (technical indicators, chart heuristics, predictive models) lack transparency and auditability, leaving important questions unanswered. There's a need for explainable, reproducible frameworks that can provide structural insights into price evolution.

Method: SPA uses deterministic segmentation to extract monotonic price runs from daily OHLCV data, aligns these runs with public events through symmetric correlation windows, and generates factual, historical, guardrailed explanations using only normalized event streams.

Result: SPA consistently produces stable structural decompositions and contextual narratives across four equities (AAPL, NVDA, SCHW, PGR) spanning different volatility regimes and sectors. Ablation experiments show deterministic segmentation, event alignment, and constrained explanation each contribute to interpretability.

Conclusion: SPA provides a transparent, reproducible view of historical price structure that complements analyst workflows, risk reviews, and explainable-AI pipelines. It’s not a forecasting or trading system, but offers interpretable market analysis with auditability.

Abstract: Understanding how prices evolve over time often requires peeling back the layers of market noise to identify clear, structural behavior. Many of the tools commonly used for this purpose (technical indicators, chart heuristics, or even sophisticated predictive models) leave important questions unanswered. Technical indicators depend on platform-specific rules, and predictive systems typically offer little in terms of explanation. In settings that demand transparency or auditability, this poses a significant challenge. We introduce the Stock Pattern Assistant (SPA), a deterministic framework designed to extract monotonic price runs, attach relevant public events through a symmetric correlation window, and generate explanations that are factual, historical, and guardrailed. SPA relies only on daily OHLCV data and a normalized event stream, making the pipeline straightforward to audit and easy to reproduce. To illustrate SPA’s behavior in practice, we evaluate it across four equities (AAPL, NVDA, SCHW, and PGR), chosen to span a range of volatility regimes and sector characteristics. Although the evaluation period is modest, the results demonstrate how SPA consistently produces stable structural decompositions and contextual narratives. Ablation experiments further show how deterministic segmentation, event alignment, and constrained explanation each contribute to interpretability. SPA is not a forecasting system, nor is it intended to produce trading signals. Its value lies in offering a transparent, reproducible view of historical price structure that can complement analyst workflows, risk reviews, and broader explainable-AI pipelines.
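
A minimal deterministic run-extraction sketch over daily closes; the event-correlation and narration stages of SPA are not reproduced here.

```python
# Monotonic run extraction sketch: split a close-price series wherever the
# direction of change flips, yielding deterministic structural segments.
def extract_runs(closes):
    runs, start = [], 0
    for i in range(2, len(closes)):
        prev_up = closes[i - 1] >= closes[i - 2]
        cur_up = closes[i] >= closes[i - 1]
        if cur_up != prev_up:                    # direction flipped: close run
            runs.append((start, i - 1, "up" if prev_up else "down"))
            start = i - 1
    if len(closes) > 1:
        runs.append((start, len(closes) - 1,
                     "up" if closes[-1] >= closes[-2] else "down"))
    return runs

closes = [100, 101, 103, 102, 99, 99.5, 104]
print(extract_runs(closes))
# [(0, 2, 'up'), (2, 4, 'down'), (4, 6, 'up')]
```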

[285] Spectral Representation-based Reinforcement Learning

Chenxiao Gao, Haotian Sun, Na Li, Dale Schuurmans, Bo Dai

Main category: cs.LG

TL;DR: Spectral representations derived from transition operator decomposition provide a theoretically grounded framework for RL that addresses issues with neural network approximations, offering effective algorithms validated on challenging control tasks.

DetailsMotivation: Traditional RL with powerful function approximations like neural networks suffers from theoretical ambiguities, optimization instability, exploration difficulty, and high computational costs. The authors seek a more principled approach.

Method: Introduces spectral representations framework based on spectral decomposition of transition operators. Shows how to construct spectral representations for transition operators with latent variable or energy-based structures, with different learning methods to extract these representations from data.

Result: Each learning method yields an effective RL algorithm. The framework is provably extended to partially observable MDPs. Validated on over 20 challenging tasks from DeepMind Control Suite, achieving performance comparable or superior to state-of-the-art model-free and model-based baselines.

Conclusion: Spectral representations provide a theoretically clear and practically effective alternative to neural network approximations in RL, addressing key challenges while maintaining strong performance on complex tasks.

Abstract: In real-world applications with large state and action spaces, reinforcement learning (RL) typically employs function approximations to represent core components like the policies, value functions, and dynamics models. Although powerful approximations such as neural networks offer great expressiveness, they often present theoretical ambiguities, suffer from optimization instability and exploration difficulty, and incur substantial computational costs in practice. In this paper, we introduce the perspective of spectral representations as a solution to address these difficulties in RL. Stemming from the spectral decomposition of the transition operator, this framework yields an effective abstraction of the system dynamics for subsequent policy optimization while also providing a clear theoretical characterization. We reveal how to construct spectral representations for transition operators that possess latent variable structures or energy-based structures, which implies different learning methods to extract spectral representations from data. Notably, each of these learning methods realizes an effective RL algorithm under this framework. We also provably extend this spectral view to partially observable MDPs. Finally, we validate these algorithms on over 20 challenging tasks from the DeepMind Control Suite, where they achieve performances comparable or superior to current state-of-the-art model-free and model-based baselines.
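
A tabular toy sketch of the core object: eigendecompose a transition matrix and use the leading eigenvectors as state features. The paper learns such representations from data for continuous and partially observable systems.

```python
# Spectral-representation sketch on a small tabular MDP: eigendecompose a
# row-stochastic transition matrix and keep the top eigenvectors as state
# features for downstream (e.g. linear) policy evaluation.
import numpy as np

rng = np.random.default_rng(0)
n = 20
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)               # row-stochastic transitions

eigvals, eigvecs = np.linalg.eig(P)
order = np.argsort(-np.abs(eigvals))            # sort by spectral magnitude
phi = np.real(eigvecs[:, order[:4]])            # top-4 spectral features

print(phi.shape)                                # (20, 4): features per state
```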

[286] EMFusion: Conditional Diffusion Framework for Trustworthy Frequency Selective EMF Forecasting in Wireless Networks

Zijiang Yan, Yixiang Huang, Jianhua Pei, Hina Tabassum, Luca Chiaraviglio

Main category: cs.LG

TL;DR: EMFusion: A conditional multivariate diffusion-based probabilistic forecasting framework for frequency-selective EMF levels that integrates contextual factors and provides uncertainty estimates, outperforming baselines by significant margins.

DetailsMotivation: Existing EMF forecasting methods use univariate approaches on wideband aggregate data, but frequency-selective multivariate forecasting is needed to capture inter-operator and inter-frequency variations for proactive network planning, compliance monitoring, and health impact assessment.

Method: EMFusion uses a conditional multivariate diffusion-based probabilistic forecasting framework with residual U-Net backbone enhanced by cross-attention mechanism to integrate external conditions (time, season, holidays). It treats forecasting as structural inpainting using imputation-based sampling for temporal coherence with irregular measurements.

Result: EMFusion with working hours contextual information outperforms baseline models: 23.85% improvement in CRPS, 13.93% improvement in normalized RMSE, and 22.47% reduction in prediction CRPS error compared to best baseline.

Conclusion: EMFusion provides accurate, calibrated probabilistic forecasts for frequency-selective EMF levels with explicit uncertainty quantification, enabling trustworthy decision-making for network planning and compliance monitoring.

Abstract: The rapid growth in wireless infrastructure has increased the need to accurately estimate and forecast electromagnetic field (EMF) levels to ensure ongoing compliance, assess potential health impacts, and support efficient network planning. While existing studies rely on univariate forecasting of wideband aggregate EMF data, frequency-selective multivariate forecasting is needed to capture the inter-operator and inter-frequency variations essential for proactive network planning. To this end, this paper introduces EMFusion, a conditional multivariate diffusion-based probabilistic forecasting framework that integrates diverse contextual factors (e.g., time of day, season, and holidays) while providing explicit uncertainty estimates. The proposed architecture features a residual U-Net backbone enhanced by a cross-attention mechanism that dynamically integrates external conditions to guide the generation process. Furthermore, EMFusion integrates an imputation-based sampling strategy that treats forecasting as a structural inpainting task, ensuring temporal coherence even with irregular measurements. Unlike standard point forecasters, EMFusion generates calibrated probabilistic prediction intervals directly from the learned conditional distribution, providing explicit uncertainty quantification essential for trustworthy decision-making. Numerical experiments on frequency-selective EMF datasets demonstrate that EMFusion, conditioned on working-hours context, outperforms the baseline models both with and without conditioning. EMFusion outperforms the best baseline by 23.85% in continuous ranked probability score (CRPS) and 13.93% in normalized root mean square error, and reduces prediction CRPS error by 22.47%.
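The imputation-as-inpainting idea can be illustrated with a toy sampler. The sketch below is a rough, assumption-laden stand-in: the denoiser is a placeholder shrinkage rather than EMFusion's conditional U-Net, and observed entries are re-imposed at every reverse step, which is the structural-inpainting mechanism the abstract describes.

```python
# Minimal sketch (assumptions throughout): inpainting-style conditional sampling
# in the spirit of imputation-based diffusion forecasting. Observed entries of a
# multivariate series are overwritten at each reverse step so the sampler only
# "fills in" the missing future block.
import numpy as np

rng = np.random.default_rng(1)
T_steps, D, N = 50, 4, 96                 # diffusion steps, channels, series length
x_obs = np.sin(np.linspace(0, 6, N))[None, :] * np.ones((D, 1))
mask = np.ones((D, N), dtype=bool)
mask[:, -24:] = False                     # last 24 points are the forecast target

betas = np.linspace(1e-4, 0.05, T_steps)
alphas = np.cumprod(1.0 - betas)

def toy_denoiser(x_t, t):
    # Placeholder for EMFusion's conditional U-Net: a simple shrinkage that
    # damps the noise component instead of a learned predictor of x0.
    return np.sqrt(alphas[t]) * x_t

x = rng.standard_normal((D, N))           # start from pure noise
for t in reversed(range(T_steps)):
    x0_hat = toy_denoiser(x, t)           # predicted clean signal
    a_prev = alphas[t - 1] if t > 0 else 1.0
    noise = rng.standard_normal((D, N)) if t > 0 else np.zeros((D, N))
    x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * noise
    # Structural inpainting: forward-noise the observations to the same level
    # and overwrite known entries so generation stays anchored to measurements.
    x_known = np.sqrt(a_prev) * x_obs + np.sqrt(1.0 - a_prev) * noise
    x[mask] = x_known[mask]

print("forecast block shape:", x[:, -24:].shape)
```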

[287] The Semantic Illusion: Certified Limits of Embedding-Based Hallucination Detection in RAG Systems

Debu Sinha

Main category: cs.LG

TL;DR: RAG systems still hallucinate despite evidence grounding. Embedding-based detection methods fail on real benchmarks due to “semantic illusion” - hallucinations preserve semantic similarity while introducing factual errors. GPT-4 as judge shows the task is solvable through reasoning.

DetailsMotivation: Current RAG hallucination detection methods rely on semantic similarity and NLI, but their fundamental limitations haven't been rigorously characterized. There's a need to understand why these methods fail and provide reliable detection with guarantees.

Method: Applied conformal prediction to hallucination detection for finite-sample coverage guarantees. Evaluated embedding-based methods (OpenAI text-embedding-3-large, cross-encoder models) and GPT-4 as LLM judge across multiple benchmarks (HaluEval, RAGTruth, WikiBio) with various LLMs (GPT-4, ChatGPT, GPT-3, Llama-2, Mistral).

Result: Embedding methods show unacceptable false positive rates: 100% on HaluEval, 88% on RAGTruth, 50% on WikiBio. GPT-4 achieves only 7% FPR (95% CI: [3.4%, 13.7%]), proving the task is solvable. Conformal prediction achieved 94% coverage with 0% FPR on synthetic data (Natural Questions).

Conclusion: Embedding-based detection is insufficient for production RAG due to “semantic illusion” - hallucinations preserve semantic similarity while introducing factual errors invisible to embeddings. This limitation persists across architectures, LLMs, and tasks. LLM reasoning (like GPT-4) shows promise for reliable detection.

Abstract: Retrieval-Augmented Generation (RAG) systems remain susceptible to hallucinations despite grounding in retrieved evidence. Current detection methods rely on semantic similarity and natural language inference (NLI), but their fundamental limitations have not been rigorously characterized. We apply conformal prediction to hallucination detection, providing finite-sample coverage guarantees that enable precise quantification of detection capabilities. Using calibration sets of approximately 600 examples, we achieve 94% coverage with 0% false positive rate on synthetic hallucinations (Natural Questions). However, on three real hallucination benchmarks spanning multiple LLMs (GPT-4, ChatGPT, GPT-3, Llama-2, Mistral), embedding-based methods - including state-of-the-art OpenAI text-embedding-3-large and cross-encoder models - exhibit unacceptable false positive rates: 100% on HaluEval, 88% on RAGTruth, and 50% on WikiBio. Crucially, GPT-4 as an LLM judge achieves only 7% FPR (95% CI: [3.4%, 13.7%]) on the same data, proving the task is solvable through reasoning. We term this the “semantic illusion”: semantically plausible hallucinations preserve similarity to source documents while introducing factual errors invisible to embeddings. This limitation persists across embedding architectures, LLM generators, and task types, suggesting embedding-based detection is insufficient for production RAG deployment.
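The conformal-calibration step is standard and easy to sketch. The snippet below calibrates a similarity-threshold detector on hypothetical scores for known-faithful answers; the nonconformity score (one minus cosine similarity) and the calibration-set size of roughly 600 follow the abstract, while the score distribution itself is simulated.

```python
# Minimal sketch (hypothetical scores): split-conformal calibration of a
# similarity-based hallucination detector. The threshold is the finite-sample
# conformal quantile guaranteeing ~(1 - alpha) coverage on faithful answers.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.06                                   # target miscoverage (~94% coverage)

# Stand-ins for embedding similarities of known-faithful calibration answers.
cal_sims = np.clip(rng.normal(0.85, 0.05, 600), 0, 1)
cal_scores = 1.0 - cal_sims                    # nonconformity: low similarity = suspicious

n = len(cal_scores)
q_level = np.ceil((n + 1) * (1 - alpha)) / n   # finite-sample-corrected quantile
threshold = np.quantile(cal_scores, q_level)

def flag_hallucination(similarity: float) -> bool:
    return (1.0 - similarity) > threshold

print("threshold:", round(float(threshold), 4),
      "| flag sim=0.70:", flag_hallucination(0.70))
```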

[288] The Semantic Architect: How FEAML Bridges Structured Data and LLMs for Multi-Label Tasks

Wanfu Gao, Zebin He, Jun Gao

Main category: cs.LG

TL;DR: FEAML is an automated feature engineering method for multi-label classification that uses LLMs’ code generation capabilities guided by metadata and label co-occurrence matrices, with feedback-driven optimization.

DetailsMotivation: Existing LLM-based feature engineering methods haven't been applied to multi-label learning tasks, lacking ability to model complex label dependencies and not adapted to multi-label task characteristics.

Method: Uses LLMs’ code generation capabilities guided by metadata and label co-occurrence matrices to understand feature-task relationships and generate high-quality features. Features are evaluated by model accuracy and redundancy detection via Pearson correlation. Incorporates evaluation results as feedback to optimize LLM code generation iteratively.

Result: FEAML outperforms other feature engineering methods on various multi-label datasets, demonstrating effectiveness in multi-label classification tasks.

Conclusion: FEAML integrates LLMs with feedback mechanisms to create an efficient, interpretable, and self-improving feature engineering paradigm for multi-label learning.

Abstract: Existing feature engineering methods based on large language models (LLMs) have not yet been applied to multi-label learning tasks. They lack the ability to model complex label dependencies and are not specifically adapted to the characteristics of multi-label tasks. To address the above issues, we propose Feature Engineering Automation for Multi-Label Learning (FEAML), an automated feature engineering method for multi-label classification that leverages the code generation capabilities of LLMs. By utilizing metadata and label co-occurrence matrices, LLMs are guided to understand the relationships between data features and task objectives, based on which high-quality features are generated. The newly generated features are evaluated in terms of model accuracy to assess their effectiveness, while Pearson correlation coefficients are used to detect redundancy. FEAML further incorporates the evaluation results as feedback to drive LLMs to continuously optimize code generation in subsequent iterations. By integrating LLMs with a feedback mechanism, FEAML realizes an efficient, interpretable, and self-improving feature engineering paradigm. Empirical results on various multi-label datasets demonstrate that FEAML outperforms other feature engineering methods.
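The redundancy check is simple enough to show directly. This sketch (with synthetic data and an assumed 0.95 cutoff) rejects a candidate feature whose absolute Pearson correlation with any existing column is too high, as in FEAML's evaluation step.

```python
# Minimal sketch (hypothetical data and cutoff): the Pearson-correlation
# redundancy filter described for FEAML. A candidate LLM-generated feature is
# rejected when it nearly duplicates an existing feature.
import numpy as np

def is_redundant(candidate: np.ndarray, existing: np.ndarray, cutoff: float = 0.95) -> bool:
    """existing: (n_samples, n_features); candidate: (n_samples,)."""
    for j in range(existing.shape[1]):
        r = np.corrcoef(candidate, existing[:, j])[0, 1]
        if abs(r) > cutoff:
            return True
    return False

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
near_copy = X[:, 0] * 2.0 + rng.normal(scale=0.01, size=200)   # almost feature 0
novel = rng.normal(size=200)

print("near-copy redundant:", is_redundant(near_copy, X))   # True
print("novel redundant:", is_redundant(novel, X))           # False
```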

[289] Neural Modular Physics for Elastic Simulation

Yifei Li, Haixu Wu, Zeyi Xu, Tuur Stuyck, Wojciech Matusik

Main category: cs.LG

TL;DR: Neural Modular Physics (NMP) decomposes elastic simulation into neural modules with physical meaning, combining neural network approximation with traditional simulator reliability for better generalization and physical consistency.

DetailsMotivation: Monolithic neural simulators lose physical interpretability and reliability compared to traditional numerical simulators. The paper aims to combine neural network approximation capacity with physical reliability by drawing inspiration from classical modular simulators.

Method: NMP decomposes elastic dynamics into physically meaningful neural modules connected through intermediate physical quantities. Uses specialized architecture and training strategy to transform numerical computation flow into modular neural simulator with direct supervision of intermediate quantities and physical constraints.

Result: NMP demonstrates superior generalization to unseen initial conditions and resolutions, stable long-horizon simulation, better preservation of physical properties compared to other neural simulators, and greater feasibility in scenarios with unknown underlying dynamics than traditional simulators.

Conclusion: The modular neural approach successfully combines neural network approximation with physical reliability, offering improved physical consistency and generalizability over both monolithic neural simulators and traditional numerical methods.

Abstract: Learning-based methods have made significant progress in physics simulation, typically approximating dynamics with a monolithic end-to-end optimized neural network. Although these models offer an effective approach to simulation, they may lose essential features compared to traditional numerical simulators, such as physical interpretability and reliability. Drawing inspiration from classical simulators that operate in a modular fashion, this paper presents Neural Modular Physics (NMP) for elastic simulation, which combines the approximation capacity of neural networks with the physical reliability of traditional simulators. Beyond the previous monolithic learning paradigm, NMP enables direct supervision of intermediate quantities and physical constraints by decomposing elastic dynamics into physically meaningful neural modules connected through intermediate physical quantities. With a specialized architecture and training strategy, our method transforms the numerical computation flow into a modular neural simulator, achieving improved physical consistency and generalizability. Experimentally, NMP demonstrates superior generalization to unseen initial conditions and resolutions, stable long-horizon simulation, better preservation of physical properties compared to other neural simulators, and greater feasibility in scenarios with unknown underlying dynamics than traditional simulators.

[290] PIP$^2$ Net: Physics-informed Partition Penalty Deep Operator Network

Hongjin Mi, Huiqiang Lun, Changhong Mou, Yeyu Zhang

Main category: cs.LG

TL;DR: PIP² Net introduces partition-of-unity regularization to DeepONet for improved stability and accuracy in operator learning for PDEs, outperforming existing methods on nonlinear PDE benchmarks.

DetailsMotivation: Existing operator learning architectures like DeepONet and FNO require large datasets, lack physical structure, and suffer from trunk-network instability issues like mode imbalance/collapse that hinder accurate operator approximation.

Method: Developed PIP² Net (Physics-informed Partition Penalty Deep Operator Network) with a simplified, principled partition penalty based on partition-of-unity regularization to improve trunk network coordination and expressiveness while maintaining DeepONet flexibility.

Result: PIP² Net consistently outperforms DeepONet, PI-DeepONet, and POU-DeepONet in prediction accuracy and robustness on three nonlinear PDEs: viscous Burgers equation, Allen-Cahn equation, and a diffusion-reaction system.

Conclusion: Partition-of-unity regularization provides effective stabilization for operator learning, addressing trunk network instability issues and improving performance without sacrificing model flexibility.

Abstract: Operator learning has become a powerful tool for accelerating the solution of parameterized partial differential equations (PDEs), enabling rapid prediction of full spatiotemporal fields for new initial conditions or forcing functions. Existing architectures such as DeepONet and the Fourier Neural Operator (FNO) show strong empirical performance but often require large training datasets, lack explicit physical structure, and may suffer from instability in their trunk-network features, where mode imbalance or collapse can hinder accurate operator approximation. Motivated by the stability and locality of classical partition-of-unity (PoU) methods, we investigate PoU-based regularization techniques for operator learning and develop a revised formulation of the existing POU–PI–DeepONet framework. The resulting Physics-informed Partition Penalty Deep Operator Network (PIP$^{2}$ Net) introduces a simplified and more principled partition penalty that improves coordination among the trunk outputs, yielding greater expressiveness without sacrificing the flexibility of DeepONet. We evaluate PIP$^{2}$ Net on three nonlinear PDEs: the viscous Burgers equation, the Allen–Cahn equation, and a diffusion–reaction system. The results show that it consistently outperforms DeepONet, PI-DeepONet, and POU-DeepONet in prediction accuracy and robustness.
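A partition-of-unity penalty can be sketched in a few lines. The form below, which drives the pointwise sum of nonnegative trunk channels toward one, is an assumed illustrative version; the paper's actual penalty may be weighted or structured differently.

```python
# Minimal sketch (assumed form): a partition-of-unity style penalty on DeepONet
# trunk outputs. Treating the K trunk channels as partition functions, the
# penalty drives their pointwise sum toward one, discouraging mode collapse.
import torch

def pou_penalty(trunk_out: torch.Tensor) -> torch.Tensor:
    """trunk_out: (batch, K) trunk activations at sampled coordinates."""
    return ((trunk_out.sum(dim=1) - 1.0) ** 2).mean()

trunk = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 16), torch.nn.Softplus(),   # nonnegative partition channels
)
coords = torch.rand(128, 2)                          # (x, t) collocation points
penalty = pou_penalty(trunk(coords))
print("PoU penalty:", float(penalty))                # added to the physics-informed loss
```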

[291] SigMA: Path Signatures and Multi-head Attention for Learning Parameters in fBm-driven SDEs

Xianglin Wu, Chiheb Ben Hammouda, Cornelis W. Oosterlee

Main category: cs.LG

TL;DR: SigMA combines path signatures with multi-head attention for accurate parameter estimation in fractional Brownian motion SDEs, outperforming traditional deep learning baselines.

DetailsMotivation: Fractional Brownian motion SDEs model systems with rough dynamics and long-range dependence, but their non-Markovian nature makes parameter estimation challenging with classical methods.

Method: SigMA (Signature Multi-head Attention) integrates path signatures with multi-head self-attention, using convolutional preprocessing and MLP for feature encoding to learn parameters from synthetic fBm-driven SDE paths.

Result: SigMA consistently outperforms CNN, LSTM, vanilla Transformer, and Deep Signature baselines in accuracy, robustness, and model compactness on synthetic data and real-world datasets (equity volatility and battery degradation).

Conclusion: Combining signature transforms with attention-based architectures provides an effective and scalable framework for parameter inference in stochastic systems with rough or persistent temporal structure.

Abstract: Stochastic differential equations (SDEs) driven by fractional Brownian motion (fBm) are increasingly used to model systems with rough dynamics and long-range dependence, such as those arising in quantitative finance and reliability engineering. However, these processes are non-Markovian and lack a semimartingale structure, rendering many classical parameter estimation techniques inapplicable or computationally intractable beyond very specific cases. This work investigates two central questions: (i) whether integrating path signatures into deep learning architectures can improve the trade-off between estimation accuracy and model complexity, and (ii) what constitutes an effective architecture for leveraging signatures as feature maps. We introduce SigMA (Signature Multi-head Attention), a neural architecture that integrates path signatures with multi-head self-attention, supported by a convolutional preprocessing layer and a multilayer perceptron for effective feature encoding. SigMA learns model parameters from synthetically generated paths of fBm-driven SDEs, including fractional Brownian motion, fractional Ornstein-Uhlenbeck, and rough Heston models, with a particular focus on estimating the Hurst parameter and on joint multi-parameter inference, and it generalizes robustly to unseen trajectories. Extensive experiments on synthetic data and two real-world datasets (i.e., equity-index realized volatility and Li-ion battery degradation) show that SigMA consistently outperforms CNN, LSTM, vanilla Transformer, and Deep Signature baselines in accuracy, robustness, and model compactness. These results demonstrate that combining signature transforms with attention-based architectures provides an effective and scalable framework for parameter inference in stochastic systems with rough or persistent temporal structure.
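Truncated signatures are concrete objects, so a small sketch may help. The code below computes the depth-2 signature of a piecewise-linear path with the standard iterated-sum formula; SigMA would feed such features (at higher depth, via learned layers) into its attention encoder.

```python
# Minimal sketch (depth-2 only): truncated path signature of a piecewise-linear
# d-dimensional path. Level 1 collects total increments; level 2 collects
# iterated integrals, whose antisymmetric part is the signed (Levy) area.
import numpy as np

def signature_depth2(path: np.ndarray):
    """path: (T, d). Returns (level1 of shape (d,), level2 of shape (d, d))."""
    dx = np.diff(path, axis=0)                     # stepwise increments, (T-1, d)
    level1 = dx.sum(axis=0)
    # For a piecewise-linear path: S^{ij} = sum_{k<l} dx_k^i dx_l^j
    #                                      + 0.5 * sum_k dx_k^i dx_k^j
    csum = np.cumsum(dx, axis=0) - dx              # increments strictly before step k
    level2 = csum.T @ dx + 0.5 * (dx.T @ dx)
    return level1, level2

t = np.linspace(0, 1, 200)
path = np.stack([t, np.sin(4 * np.pi * t)], axis=1)   # toy 2-D path
lv1, lv2 = signature_depth2(path)
print("level 1:", lv1)                                # ~ [1.0, 0.0]
print("signed area term:", 0.5 * (lv2[0, 1] - lv2[1, 0]))
```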

[292] Feature-Centric Unsupervised Node Representation Learning Without Homophily Assumption

Sunwoo Kim, Soo Yong Lee, Kyungho Kim, Hyunjin Hwang, Jaemin Yoo, Kijung Shin

Main category: cs.LG

TL;DR: FUEL is an unsupervised node representation learning method that adaptively learns the optimal degree of graph convolution usage by enhancing intra-class similarity and inter-class separability, achieving SOTA performance across graphs with varying homophily levels.

DetailsMotivation: Excessive reliance on graph convolution can be suboptimal, especially in non-homophilic graphs, as it may produce overly similar embeddings for nodes with different features or topological properties. While adjusting graph convolution usage has been explored in supervised learning, it remains underexplored in unsupervised scenarios.

Method: FUEL adaptively learns the adequate degree of graph convolution usage by aiming to enhance intra-class similarity and inter-class separability in the embedding space. Since classes are unknown, it leverages node features to identify node clusters and treats these clusters as proxies for classes.

Result: Through extensive experiments using 15 baseline methods and 14 benchmark datasets, FUEL demonstrates effectiveness in downstream tasks, achieving state-of-the-art performance across graphs with diverse levels of homophily.

Conclusion: FUEL successfully addresses the challenge of adaptive graph convolution usage in unsupervised node representation learning, providing a robust solution that works well across various graph homophily levels.

Abstract: Unsupervised node representation learning aims to obtain meaningful node embeddings without relying on node labels. To achieve this, graph convolution, which aggregates information from neighboring nodes, is commonly employed to encode node features and graph topology. However, excessive reliance on graph convolution can be suboptimal, especially in non-homophilic graphs, since it may yield unduly similar embeddings for nodes that differ in their features or topological properties. As a result, adjusting the degree of graph convolution usage has been actively explored in supervised learning settings, whereas such approaches remain underexplored in unsupervised scenarios. To tackle this, we propose FUEL, which adaptively learns the adequate degree of graph convolution usage by aiming to enhance intra-class similarity and inter-class separability in the embedding space. Since classes are unknown, FUEL leverages node features to identify node clusters and treats these clusters as proxies for classes. Through extensive experiments using 15 baseline methods and 14 benchmark datasets, we demonstrate the effectiveness of FUEL in downstream tasks, achieving state-of-the-art performance across graphs with diverse levels of homophily.
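One simple way to parameterize "degree of graph convolution usage" is a learned gate between raw and aggregated features. The sketch below, with a random graph, is an assumed illustration of the idea rather than FUEL's exact architecture; the cluster-proxy objective is only indicated in the comments.

```python
# Minimal sketch (assumed parameterization): a scalar gate alpha mixing raw node
# features with their one-hop aggregation. In FUEL's setting, feature-based
# clusters would act as class proxies for the similarity/separability objective
# that trains this gate jointly with the encoder.
import torch

def mixed_embedding(X: torch.Tensor, A_hat: torch.Tensor, alpha: torch.Tensor):
    """X: (n, d) features; A_hat: (n, n) row-normalized adjacency."""
    a = torch.sigmoid(alpha)                  # keep the gate in (0, 1)
    return a * (A_hat @ X) + (1 - a) * X      # convolved vs. raw trade-off

n, d = 64, 16
X = torch.randn(n, d)
A = (torch.rand(n, n) < 0.1).float()
A_hat = A / A.sum(dim=1, keepdim=True).clamp(min=1)

alpha = torch.zeros(1, requires_grad=True)    # learned from the cluster-proxy loss
Z = mixed_embedding(X, A_hat, alpha)
print("embedding shape:", Z.shape, "| gate:", float(torch.sigmoid(alpha)))
```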

[293] How Many Heads Make an SSM? A Unified Framework for Attention and State Space Models

Ali Ghodsi

Main category: cs.LG

TL;DR: The paper introduces a unified framework for sequence models, revealing trade-offs between expressivity and trainability through three key theoretical results about interaction rank, head count equivalence, and gradient propagation.

DetailsMotivation: There's a lack of unified theoretical understanding of expressivity and trainability trade-offs across diverse sequence modeling architectures (RNNs, Transformers, SSMs). The authors aim to provide a theoretical framework that can analyze and compare these different approaches.

Method: The authors introduce a unified framework representing sequence maps via input-dependent effective interaction operators W_ij(X). They identify two construction patterns: (1) Unified Factorized Framework (attention-style mixing) where W_ij varies through scalar coefficients applied to shared value maps, and (2) Structured Dynamics (state-space recurrences) where W_ij is induced by latent dynamical systems.

Result: Three main theoretical results: (1) Interaction Rank Gap - attention-style models are constrained to low-dimensional operator spans and cannot represent certain structured dynamical maps; (2) Equivalence Theorem - representing linear SSMs requires k heads for k-dimensional subspace lag operators; (3) Gradient Highway Result - attention layers have distance-independent gradient paths while stable linear dynamics show distance-dependent gradient attenuation.

Conclusion: The framework formalizes a fundamental trade-off between algebraic expressivity (interaction/operator span) and long-range gradient propagation, providing theoretical grounding for modern sequence architecture design and explaining why different architectures excel in different scenarios.

Abstract: Sequence modeling has produced diverse architectures – from classical recurrent neural networks to modern Transformers and state space models (SSMs) – yet a unified theoretical understanding of expressivity and trainability trade-offs remains limited. We introduce a unified framework that represents a broad class of sequence maps via an input-dependent effective interaction operator $W_{ij}(X)$, making explicit two recurring construction patterns: (i) the Unified Factorized Framework (Explicit) (attention-style mixing), in which $W_{ij}(X)$ varies through scalar coefficients applied to shared value maps, and (ii) Structured Dynamics (Implicit) (state-space recurrences), in which $W_{ij}$ is induced by a latent dynamical system. Using this framework, we derive three theoretical results. First, we establish the Interaction Rank Gap: models in the Unified Factorized Framework, such as single-head attention, are constrained to a low-dimensional operator span and cannot represent certain structured dynamical maps. Second, we prove an Equivalence (Head-Count) Theorem showing that, within our multi-head factorized class, representing a linear SSM whose lag operators span a $k$-dimensional subspace on length-$n$ sequences requires and is achievable with $H=k$ heads. Third, we prove a Gradient Highway Result, showing that attention layers admit inputs with distance-independent gradient paths, whereas stable linear dynamics exhibit distance-dependent gradient attenuation. Together, these results formalize a fundamental trade-off between algebraic expressivity (interaction/operator span) and long-range gradient propagation, providing theoretical grounding for modern sequence architecture design.
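To make the two construction patterns concrete, here is a brief sketch in the paper's notation. Both forms are standard; the specific symbols (the coefficients $\alpha^{(h)}_{ij}$ and value maps $V^{(h)}$) follow the abstract rather than the paper's full definitions.

```latex
% Explicit (attention-style) factorization: scalar coefficients on shared value maps.
y_i \;=\; \sum_{j \le i} W_{ij}(X)\, x_j,
\qquad
W_{ij}(X) \;=\; \sum_{h=1}^{H} \alpha^{(h)}_{ij}(X)\, V^{(h)} .

% Implicit (state-space) pattern: W_{ij} induced by a latent linear dynamical
% system, so the lag operators are matrix powers rather than a fixed low-rank span.
s_j = A\, s_{j-1} + B\, x_j, \quad y_i = C\, s_i
\;\Longrightarrow\;
W_{ij} \;=\; C A^{\,i-j} B .
```

The Head-Count Theorem then says that when the lag operators $\{C A^{\ell} B\}$ span a $k$-dimensional subspace, $H = k$ heads are both necessary and sufficient to represent the SSM within the factorized class.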

[294] From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar, Patrik Reizinger

Main category: cs.LG

TL;DR: SAEs and sparse probes don’t create truly disentangled concept representations despite appearing decorrelated; concepts map to many features and steering affects multiple concepts, showing correlation metrics are insufficient for evaluating independence.

DetailsMotivation: Current interpretability methods evaluate concept representations in isolation with implicit independence assumptions, but it's unclear if they actually recover disentangled representations when concepts are correlated in practice.

Method: Created controlled multi-concept evaluation with increasing correlations between textual concepts (sentiment, domain, tense). Evaluated featurizers’ ability to learn disentangled representations under correlation, then performed steering experiments to test independent manipulability.

Result: Found one-to-many relationship: features correspond to at most one concept, but concepts distribute across many features. SAE features affect many concepts when steered (not selective/independent), though they affect disjoint subspaces. Correlation metrics insufficient for establishing steering independence.

Conclusion: Correlational disentanglement metrics don’t guarantee independent manipulability; affecting disjoint subspaces isn’t sufficient for concept selectivity. Highlights need for compositional evaluations in interpretability research.

Abstract: A central goal of interpretability is to recover representations of causally relevant concepts from the activations of neural networks. The quality of these concept representations is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear whether common featurization methods - including sparse autoencoders (SAEs) and sparse probes - recover disentangled representations of these concepts. This study proposes a multi-concept evaluation setting where we control the correlations between textual concepts, such as sentiment, domain, and tense, and analyze performance under increasing correlations between them. We first evaluate the extent to which featurizers can learn disentangled representations of each concept under increasing correlational strengths. We observe a one-to-many relationship from concepts to features: features correspond to no more than one concept, but concepts are distributed across many features. Then, we perform steering experiments, measuring whether each concept is independently manipulable. Even when trained on uniform distributions of concepts, SAE features generally affect many concepts when steered, indicating that they are neither selective nor independent; nonetheless, features affect disjoint subspaces. These results suggest that correlational metrics for measuring disentanglement are generally not sufficient for establishing independence when steering, and that affecting disjoint subspaces is not sufficient for concept selectivity. These results underscore the importance of compositional evaluations in interpretability research.
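The steering intervention itself is straightforward and can be made concrete. The sketch below adds a scaled SAE decoder direction to residual-stream activations; the decoder matrix is random here, standing in for a trained SAE, and selectivity asks whether only the intended concept changes downstream.

```python
# Minimal sketch (hypothetical SAE): the steering intervention used in such
# experiments. Activations are edited by adding a scaled decoder direction for
# one SAE feature.
import torch

d_model, d_sae = 256, 1024
W_dec = torch.randn(d_sae, d_model)               # stand-in decoder dictionary
W_dec = W_dec / W_dec.norm(dim=1, keepdim=True)   # unit-norm feature directions

def steer(activations: torch.Tensor, feature_idx: int, scale: float) -> torch.Tensor:
    """activations: (batch, seq, d_model). Adds scale * decoder direction."""
    return activations + scale * W_dec[feature_idx]

acts = torch.randn(2, 10, d_model)
steered = steer(acts, feature_idx=42, scale=8.0)
print("mean shift along feature 42:",
      float(((steered - acts) @ W_dec[42]).mean()))   # ~ 8.0 by construction
```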

[295] FADTI: Fourier and Attention Driven Diffusion for Multivariate Time Series Imputation

Runze Li, Hanchen Wang, Wenjie Zhang, Binghao Li, Yu Zhang, Xuemin Lin, Ying Zhang

Main category: cs.LG

TL;DR: FADTI: A frequency-aware diffusion model for multivariate time series imputation that uses Fourier Bias Projection to inject frequency-domain inductive biases, outperforming existing methods especially under high missing rates.

DetailsMotivation: Existing Transformer- and diffusion-based imputation models lack explicit inductive biases and frequency awareness, limiting their generalization under structured missing patterns and distribution shifts in applications like healthcare, traffic forecasting, and biological modeling.

Method: FADTI combines diffusion-based generation with frequency-informed feature modulation via a learnable Fourier Bias Projection (FBP) module that supports multiple spectral bases for adaptive encoding of stationary/non-stationary patterns, integrated with temporal modeling through self-attention and gated convolution.

Result: FADTI consistently outperforms state-of-the-art methods on multiple benchmarks, including a newly introduced biological time series dataset, particularly under high missing rates.

Conclusion: Injecting frequency-domain inductive bias into generative imputation processes via Fourier Bias Projection significantly improves performance, especially for challenging scenarios with structured missing patterns and distribution shifts.

Abstract: Multivariate time series imputation is fundamental in applications such as healthcare, traffic forecasting, and biological modeling, where sensor failures and irregular sampling lead to pervasive missing values. However, existing Transformer- and diffusion-based models lack explicit inductive biases and frequency awareness, limiting their generalization under structured missing patterns and distribution shifts. We propose FADTI, a diffusion-based framework that injects frequency-informed feature modulation via a learnable Fourier Bias Projection (FBP) module and combines it with temporal modeling through self-attention and gated convolution. FBP supports multiple spectral bases, enabling adaptive encoding of both stationary and non-stationary patterns. This design injects frequency-domain inductive bias into the generative imputation process. Experiments on multiple benchmarks, including a newly introduced biological time series dataset, show that FADTI consistently outperforms state-of-the-art methods, particularly under high missing rates. Code is available at https://anonymous.4open.science/r/TimeSeriesImputation-52BF
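As a rough picture of frequency-informed modulation, the module below projects time coordinates onto learnable frequencies and gates hidden features with the resulting sin/cos embedding. It is an assumed, single-basis simplification of the FBP design described above, not FADTI's published module.

```python
# Minimal sketch (assumed design): a learnable Fourier-feature modulation in the
# spirit of FADTI's Fourier Bias Projection. The real FBP supports multiple
# spectral bases; only a sinusoidal one is shown here.
import torch
import torch.nn as nn

class FourierBiasProjection(nn.Module):
    def __init__(self, d_in: int, d_hidden: int, n_freq: int = 16):
        super().__init__()
        self.freq = nn.Parameter(torch.randn(d_in, n_freq))   # learnable frequencies
        self.proj = nn.Linear(2 * n_freq, d_hidden)

    def forward(self, t: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        """t: (batch, length, d_in) time coordinates; h: (batch, length, d_hidden)."""
        z = t @ self.freq                                     # (batch, length, n_freq)
        feats = torch.cat([torch.sin(z), torch.cos(z)], dim=-1)
        return h + torch.sigmoid(self.proj(feats)) * h        # frequency-aware gating

fbp = FourierBiasProjection(d_in=1, d_hidden=64)
t = torch.linspace(0, 1, 96).view(1, 96, 1)
h = torch.randn(1, 96, 64)
print("modulated hidden shape:", fbp(t, h).shape)
```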

[296] Automatic Reward Shaping from Multi-Objective Human Heuristics

Yuqing Xie, Jiayu Chen, Wenhao Tang, Ya Zhang, Chao Yu, Yu Wang

Main category: cs.LG

TL;DR: MORSE is a framework that automatically combines multiple heuristic rewards into a unified reward function using bi-level optimization with stochastic exploration to avoid local minima.

DetailsMotivation: Designing effective reward functions is challenging in RL, especially for multi-objective environments where balancing multiple objectives manually is difficult and time-consuming.

Method: MORSE uses bi-level optimization: inner loop trains policy to maximize current shaped reward, outer loop updates reward function to optimize task performance. Introduces stochasticity with noise guided by task performance and prediction error of a fixed random neural network to encourage exploration.

Result: Experiments in MuJoCo and Isaac Sim show MORSE effectively balances multiple objectives across robotic tasks, achieving performance comparable to manually tuned reward functions.

Conclusion: MORSE provides a general framework for automatic reward shaping that can handle multi-objective environments without extensive manual tuning, demonstrating practical effectiveness in robotic tasks.

Abstract: Designing effective reward functions remains a central challenge in reinforcement learning, especially in multi-objective environments. In this work, we propose Multi-Objective Reward Shaping with Exploration (MORSE), a general framework that automatically combines multiple human-designed heuristic rewards into a unified reward function. MORSE formulates the shaping process as a bi-level optimization problem: the inner loop trains a policy to maximize the current shaped reward, while the outer loop updates the reward function to optimize task performance. To encourage exploration in the reward space and avoid suboptimal local minima, MORSE introduces stochasticity into the shaping process, injecting noise guided by task performance and the prediction error of a fixed, randomly initialized neural network. Experimental results in MuJoCo and Isaac Sim environments show that MORSE effectively balances multiple objectives across various robotic tasks, achieving task performance comparable to those obtained with manually tuned reward functions.
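The bi-level structure can be caricatured with a tiny outer loop. In the sketch below, a synthetic task metric replaces the expensive inner-loop policy training, and noisy hill-climbing stands in for MORSE's performance- and novelty-guided exploration; only the overall shape of the procedure is faithful to the description.

```python
# Minimal sketch (toy setting): bi-level reward shaping in the spirit of MORSE.
# Heuristic rewards are combined with weights w; the outer loop perturbs w with
# noise and keeps perturbations that improve measured task performance.
import numpy as np

rng = np.random.default_rng(0)
n_heuristics = 4
w = np.ones(n_heuristics) / n_heuristics          # initial uniform combination

def shaped_reward(heuristic_values: np.ndarray, w: np.ndarray) -> float:
    # The inner loop would train a policy to maximize this shaped reward.
    return float(w @ heuristic_values)

def task_performance(w: np.ndarray) -> float:
    # Stand-in for "train the inner-loop policy under w, evaluate true return".
    target = np.array([0.6, 0.1, 0.2, 0.1])
    return -float(np.sum((w - target) ** 2))

for _ in range(200):                              # outer loop over reward weights
    w_try = np.clip(w + rng.normal(scale=0.05, size=n_heuristics), 0, None)
    w_try /= w_try.sum()                          # stay on the weight simplex
    if task_performance(w_try) > task_performance(w):
        w = w_try                                 # keep exploration moves that help

print("learned combination weights:", np.round(w, 2))
print("shaped reward example:", shaped_reward(np.array([1.0, 0.5, 0.2, 0.0]), w))
```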

[297] Tracking Temporal Dynamics of Vector Sets with Gaussian Process

Taichi Aida, Mamoru Komachi, Toshinobu Ogiso, Hiroya Takamura, Daichi Mochihashi

Main category: cs.LG

TL;DR: Proposes a method using infinite-dimensional Gaussian processes with Random Fourier Features to model and track temporal evolution of vector sets across domains like crime analysis and linguistics.

DetailsMotivation: Understanding temporal evolution of vector sets is fundamental across domains (ecology, crime analysis, linguistics), but challenging due to complex structures that evolve over time.

Method: Models distribution underlying each set of vectors using infinite-dimensional Gaussian processes, approximates latent function with Random Fourier Features to obtain compact, comparable vector representations over time.

Result: Method successfully captures temporal dynamics in sociological (crime distributions) and linguistic (word embeddings) data, providing interpretable and robust representations in low-dimensional space.

Conclusion: Proposed approach offers a powerful framework for analyzing structural changes in temporally indexed vector sets across diverse domains, enabling tracking and visualization of temporal transitions.

Abstract: Understanding the temporal evolution of sets of vectors is a fundamental challenge across various domains, including ecology, crime analysis, and linguistics. For instance, ecosystem structures evolve due to interactions among plants, herbivores, and carnivores; the spatial distribution of crimes shifts in response to societal changes; and word embedding vectors reflect cultural and semantic trends over time. However, analyzing such time-varying sets of vectors is challenging due to their complicated structures, which also evolve over time. In this work, we propose a novel method for modeling the distribution underlying each set of vectors using infinite-dimensional Gaussian processes. By approximating the latent function in the Gaussian process with Random Fourier Features, we obtain compact and comparable vector representations over time. This enables us to track and visualize temporal transitions of vector sets in a low-dimensional space. We apply our method to both sociological data (crime distributions) and linguistic data (word embeddings), demonstrating its effectiveness in capturing temporal dynamics. Our results show that the proposed approach provides interpretable and robust representations, offering a powerful framework for analyzing structural changes in temporally indexed vector sets across diverse domains.
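Random Fourier Features are a standard construction, so the representation step can be shown directly. The sketch below averages an RFF map (for an assumed RBF kernel) over each set of vectors, yielding fixed-length embeddings whose distances track distribution shift across time; the Gaussian-process machinery around this is omitted.

```python
# Minimal sketch (RBF kernel assumed): a compact Random Fourier Feature
# representation of a set of vectors. Averaging the RFF map over the set
# approximates the kernel mean embedding of the underlying distribution.
import numpy as np

rng = np.random.default_rng(0)
d, n_feats, sigma = 2, 128, 1.0

W = rng.normal(scale=1.0 / sigma, size=(d, n_feats))   # spectral samples for RBF
b = rng.uniform(0, 2 * np.pi, n_feats)

def set_embedding(X: np.ndarray) -> np.ndarray:
    """X: (n_points, d) -> fixed-length averaged RFF map."""
    phi = np.sqrt(2.0 / n_feats) * np.cos(X @ W + b)
    return phi.mean(axis=0)

set_year1 = rng.normal(loc=0.0, size=(500, d))         # e.g. crime locations, year 1
set_year2 = rng.normal(loc=0.5, size=(500, d))         # shifted distribution, year 2
drift = np.linalg.norm(set_embedding(set_year1) - set_embedding(set_year2))
print("embedding distance between years:", round(float(drift), 4))
```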

[298] TrajSyn: Privacy-Preserving Dataset Distillation from Federated Model Trajectories for Server-Side Adversarial Training

Mukur Gupta, Niharika Gupta, Saifur Rahman, Shantanu Pal, Chandan Karmakar

Main category: cs.LG

TL;DR: TrajSyn enables server-side adversarial training in Federated Learning by synthesizing proxy datasets from client model update trajectories, improving robustness without accessing raw client data or burdening edge devices.

DetailsMotivation: Deep learning models on edge devices in safety-critical applications are vulnerable to adversarial attacks, especially in Federated Learning settings. Adversarial training is difficult in FL due to privacy constraints and limited compute on edge devices.

Method: TrajSyn synthesizes a proxy dataset from the trajectories of client model updates, enabling effective server-side adversarial training without accessing raw client data.

Result: TrajSyn consistently improves adversarial robustness on image classification benchmarks with no extra compute burden on client devices.

Conclusion: TrajSyn provides a privacy-preserving solution for adversarial robustness in Federated Learning by enabling server-side training using synthesized proxy data from client update trajectories.

Abstract: Deep learning models deployed on edge devices are increasingly used in safety-critical applications. However, their vulnerability to adversarial perturbations poses significant risks, especially in Federated Learning (FL) settings where identical models are distributed across thousands of clients. While adversarial training is a strong defense, it is difficult to apply in FL due to strict client-data privacy constraints and the limited compute available on edge devices. In this work, we introduce TrajSyn, a privacy-preserving framework that enables effective server-side adversarial training by synthesizing a proxy dataset from the trajectories of client model updates, without accessing raw client data. We show that TrajSyn consistently improves adversarial robustness on image classification benchmarks with no extra compute burden on the client device.

[299] Generalization and Feature Attribution in Machine Learning Models for Crop Yield and Anomaly Prediction in Germany

Roland Baatz

Main category: cs.LG

TL;DR: ML models for crop yield prediction show good test performance but poor temporal generalization, and their SHAP explanations remain plausible even when models fail to generalize, revealing critical limitations in post hoc explainability methods.

DetailsMotivation: To examine the generalization performance and interpretability of ML models for crop yield prediction, particularly addressing the challenge of trusting model explanations when models may not generalize to unseen temporal conditions.

Method: Systematic comparison of ensemble tree-based models (XGBoost, Random Forest) and deep learning approaches (LSTM, TCN) using high-quality, long-term dataset from Germany’s NUTS-3 regions, with evaluation on both spatially split test sets and temporally independent validation years.

Result: All models perform well on conventional test sets but degrade substantially on temporally independent validation, revealing persistent generalization limitations. Models with weak temporal validation can still produce seemingly credible SHAP feature importance values, exposing vulnerability in post hoc explainability methods.

Conclusion: Need for validation-aware interpretation of ML predictions in agriculture; feature importance should not be accepted unless models generalize to unseen conditions. Advocates for domain-aware validation, hybrid modeling, and more rigorous scrutiny of explainability methods.

Abstract: This study examines the generalization performance and interpretability of machine learning (ML) models used for predicting crop yield and yield anomalies in Germany’s NUTS-3 regions. Using a high-quality, long-term dataset, the study systematically compares the evaluation and temporal validation behavior of ensemble tree-based models (XGBoost, Random Forest) and deep learning approaches (LSTM, TCN). While all models perform well on spatially split, conventional test sets, their performance degrades substantially on temporally independent validation years, revealing persistent limitations in generalization. Notably, models with strong test-set accuracy but weak temporal validation performance can still produce seemingly credible SHAP feature importance values. This exposes a critical vulnerability in post hoc explainability methods: interpretability may appear reliable even when the underlying model fails to generalize. These findings underscore the need for validation-aware interpretation of ML predictions in agricultural and environmental systems. Feature importance should not be accepted at face value unless models are explicitly shown to generalize to unseen temporal and spatial conditions. The study advocates for domain-aware validation, hybrid modeling strategies, and more rigorous scrutiny of explainability methods in data-driven agriculture. Ultimately, this work addresses a growing challenge in environmental data science: how can we evaluate generalization robustly enough to trust model explanations?
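The evaluation design, rather than any model internals, is the crux here, and a synthetic example makes it concrete. The sketch below (toy data with a year-correlated covariate) contrasts a conventional random split with held-out validation years; the optimism gap it exhibits is the failure mode the study warns about.

```python
# Minimal sketch (synthetic data): random split vs. temporally independent
# validation. A drifting covariate lets the model look accurate under a random
# split while degrading on genuinely unseen years.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
years = np.repeat(np.arange(2000, 2020), 50)          # 20 years x 50 regions
X = rng.normal(size=(len(years), 5))
trend = (years - 2010) * 0.3                          # slow climatic drift
X[:, 4] = trend + rng.normal(scale=0.1, size=len(years))
y = X[:, 0] + trend + rng.normal(scale=0.3, size=len(years))

model = RandomForestRegressor(n_estimators=200, random_state=0)

# Conventional random split: train and test rows share the same years.
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
r2_random = model.fit(Xtr, ytr).score(Xte, yte)

# Temporal split: the last four years are fully unseen, so the tree ensemble
# must extrapolate beyond the covariate range it was trained on.
train_m, test_m = years < 2016, years >= 2016
r2_temporal = model.fit(X[train_m], y[train_m]).score(X[test_m], y[test_m])

print(f"R2, random split:   {r2_random:.2f}")         # optimistic
print(f"R2, temporal split: {r2_temporal:.2f}")       # typically much lower
```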

[300] An Efficient Gradient-Based Inference Attack for Federated Learning

Pablo Montaña-Fernández, Ines Ortega-Fernandez

Main category: cs.LG

TL;DR: A new gradient-based membership inference attack for federated learning that exploits temporal evolution of last-layer gradients across multiple rounds, with extensions to attribute inference, showing strong performance on various datasets.

DetailsMotivation: Federated learning reduces direct data exposure but model update exchanges can still leak sensitive information. Existing attacks may not fully exploit multi-round temporal patterns, and there's a need to understand threats from different adversary types (semi-honest vs. malicious aggregators/data owners).

Method: Uses shadow technique to learn round-wise gradient patterns of training records without accessing private data. Exploits temporal evolution of last-layer gradients across multiple federated rounds. Model-agnostic approach applicable to any gradient-based model for both classification and regression. Extension to attribute inference by contrasting gradient responses under alternative attribute hypotheses.

Result: Strong attack performance on CIFAR-100 and Purchase100 for membership inference, and on Breast Cancer Wisconsin for attribute inference. Comparable computational/memory overhead to existing attacks. Multi-round FL increases vulnerability, aggregators pose greater threat than data owners, and richer high-dimensional data leads to stronger leakage than simpler tabular data.

Conclusion: Multi-round federated learning can increase vulnerability to inference attacks despite privacy benefits. Aggregators present more substantial threats than data owners. Attack performance depends on dataset characteristics, with richer data leading to stronger leakage. The proposed attacks are practical and effective across different settings.

Abstract: Federated Learning is a machine learning setting that reduces direct data exposure, improving the privacy guarantees of machine learning models. Yet, the exchange of model updates between the participants and the aggregator can still leak sensitive information. In this work, we present a new gradient-based membership inference attack for federated learning scenarios that exploits the temporal evolution of last-layer gradients across multiple federated rounds. Our method uses the shadow technique to learn round-wise gradient patterns of the training records, requiring no access to the private dataset, and is designed to consider both semi-honest and malicious adversaries (aggregators or data owners). Beyond membership inference, we also provide a natural extension of the proposed attack to discrete attribute inference by contrasting gradient responses under alternative attribute hypotheses. The proposed attacks are model-agnostic, and therefore applicable to any gradient-based model and can be applied to both classification and regression settings. We evaluate the attack on CIFAR-100 and Purchase100 datasets for membership inference and on Breast Cancer Wisconsin for attribute inference. Our findings reveal strong attack performance for membership inference, with computational and memory overhead comparable to an existing attack from the literature. The obtained results emphasize that multi-round federated learning can increase the vulnerability to inference attacks, that aggregators pose a more substantial threat than data owners, and that attack performance is strongly influenced by the nature of the training dataset, with richer, high-dimensional data leading to stronger leakage than simpler tabular data.
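For a linear softmax head with cross-entropy loss, the per-record last-layer gradient has a closed form, so the round-wise trajectory the attack consumes can be sketched without autograd. In the sketch below, random per-round weights stand in for snapshots of the global model; the shadow classifier trained on these features is omitted.

```python
# Minimal sketch (analytic form): per-record last-layer gradient features across
# federated rounds. For a softmax head with cross-entropy, the weight gradient
# is (softmax(z) - onehot(y)) x^T.
import numpy as np

def last_layer_grad(x: np.ndarray, y: int, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """x: (d,) penultimate features; returns flattened (C*d,) gradient."""
    z = W @ x + b
    p = np.exp(z - z.max())
    p /= p.sum()                                 # softmax probabilities
    p[y] -= 1.0                                  # dL/dz for cross-entropy
    return np.outer(p, x).ravel()

rng = np.random.default_rng(0)
d, C, n_rounds = 32, 10, 5
x, y = rng.normal(size=d), 3

trajectory = []
for rnd in range(n_rounds):                      # one global-model snapshot per round
    W, b = rng.normal(size=(C, d)), rng.normal(size=C)
    trajectory.append(last_layer_grad(x, y, W, b))

features = np.concatenate(trajectory)            # input to the shadow attack model
print("attack feature dimension:", features.shape[0])   # n_rounds * C * d
```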

[301] Understanding NTK Variance in Implicit Neural Representations

Chengguang Ou, Yixin Zhuang

Main category: cs.LG

TL;DR: The paper provides a unified theoretical framework explaining how different INR architectural components (positional encoding, spherical normalization, Hadamard modulation) improve NTK conditioning and reduce spectral bias by affecting pairwise similarity factors and scaling terms.

DetailsMotivation: Implicit Neural Representations (INRs) suffer from slow convergence and poor recovery of high-frequency details due to spectral bias. While prior work links this to Neural Tangent Kernel (NTK), it's unclear how specific architectural choices affect NTK conditioning.

Method: The authors analyze INR mechanisms through their impact on pairwise similarity factors and scaling terms that determine NTK eigenvalue variance. They derive closed-form variance decompositions for common INR components: positional encoding (reshapes input similarity), spherical normalization (reduces variance via layerwise scaling), and Hadamard modulation (introduces similarity factors below one for multiplicative variance reduction).

Result: Experiments across multiple tasks confirm the predicted variance reductions and demonstrate faster, more stable convergence with improved reconstruction quality. The unified framework explains how diverse INR architectures mitigate spectral bias by improving NTK conditioning.

Conclusion: The paper provides a unified theoretical understanding of how different INR architectural components work to improve NTK conditioning and reduce spectral bias, offering insights for designing better INR architectures.

Abstract: Implicit Neural Representations (INRs) often converge slowly and struggle to recover high-frequency details due to spectral bias. While prior work links this behavior to the Neural Tangent Kernel (NTK), how specific architectural choices affect NTK conditioning remains unclear. We show that many INR mechanisms can be understood through their impact on a small set of pairwise similarity factors and scaling terms that jointly determine NTK eigenvalue variance. For standard coordinate MLPs, limited input-feature interactions induce large eigenvalue dispersion and poor conditioning. We derive closed-form variance decompositions for common INR components and show that positional encoding reshapes input similarity, spherical normalization reduces variance via layerwise scaling, and Hadamard modulation introduces additional similarity factors strictly below one, yielding multiplicative variance reduction. This unified view explains how diverse INR architectures mitigate spectral bias by improving NTK conditioning. Experiments across multiple tasks confirm the predicted variance reductions and demonstrate faster, more stable convergence with improved reconstruction quality.

[302] DEER: Draft with Diffusion, Verify with Autoregressive Models

Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo, Shi-Min Hu

Main category: cs.LG

TL;DR: DEER introduces a speculative decoding framework that uses diffusion-based language models as drafters instead of autoregressive models, achieving significantly longer draft acceptance (up to 32 tokens) and higher speedups (5.54x vs 2.41x) compared to existing methods.

DetailsMotivation: Autoregressive decoding in LLM-driven systems suffers from latency issues. Existing speculative decoding approaches use AR drafters which have two fundamental problems: (1) step-wise uncertainty accumulation leading to trust collapse between target model and drafter, and (2) inherently sequential decoding. These limitations result in limited speedups.

Method: DEER uses diffusion large language models (dLLMs) as drafters instead of AR models. It employs a two-stage training pipeline to align dLLM-based drafters with the target AR model, and adopts single-step decoding to generate long draft segments. The framework drafts with diffusion models and verifies with AR models.

Result: DEER achieves draft acceptance lengths of up to 32 tokens, far surpassing the 10 tokens achieved by EAGLE-3. On HumanEval with Qwen3-30B-A3B, DEER attains a 5.54x speedup, while EAGLE-3 achieves only 2.41x.

Conclusion: Diffusion-based drafters can overcome the fundamental limitations of AR drafters in speculative decoding, enabling significantly higher efficiency and speedups for LLM-driven agentic and reasoning systems through parallel decoding and reduced uncertainty accumulation.

Abstract: Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify scheme, yet existing approaches rely on AR draft models (a.k.a., drafters), which introduce two fundamental issues: (1) step-wise uncertainty accumulation leads to a progressive collapse of trust between the target model and the drafter, and (2) inherently sequential decoding of AR drafters. Together, these factors cause limited speedups. In this paper, we show that diffusion large language model (dLLM) drafters can naturally overcome these issues through their fundamentally different probabilistic modeling and efficient parallel decoding strategy. Building on this insight, we introduce DEER, an efficient speculative decoding framework that drafts with diffusion and verifies with AR models. To enable high-quality drafting, DEER employs a two-stage training pipeline to align the dLLM-based drafters with the target AR model, and further adopts single-step decoding to generate long draft segments. Experiments show DEER reaches draft acceptance lengths of up to 32 tokens, far surpassing the 10 tokens achieved by EAGLE-3. Moreover, on HumanEval with Qwen3-30B-A3B, DEER attains a 5.54x speedup, while EAGLE-3 achieves only 2.41x. Code, models, and demos will be available at https://czc726.github.io/DEER/
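The draft-verify rule that DEER inherits from speculative decoding is compact enough to show. In the sketch below, two fixed categorical distributions stand in for the dLLM drafter and the AR target; each drafted token is accepted with probability min(1, p_target/p_draft), with a residual resample at the first rejection.

```python
# Minimal sketch (toy distributions): the standard speculative-decoding
# acceptance rule on a long parallel draft block, as DEER's verify step uses.
import numpy as np

rng = np.random.default_rng(0)
vocab = 8
p_target = rng.dirichlet(np.ones(vocab))         # stand-in for the AR target model
p_draft = rng.dirichlet(np.ones(vocab))          # stand-in for the dLLM drafter

def verify_block(draft_tokens):
    accepted = []
    for tok in draft_tokens:
        if rng.random() < min(1.0, p_target[tok] / p_draft[tok]):
            accepted.append(tok)                 # token passes verification
        else:
            # Resample from the normalized residual (p_target - p_draft)_+.
            resid = np.clip(p_target - p_draft, 0, None)
            accepted.append(rng.choice(vocab, p=resid / resid.sum()))
            break                                # stop at first rejection
    return accepted

draft = rng.choice(vocab, size=32, p=p_draft)    # long parallel draft, as in DEER
out = verify_block(draft)
print("accepted length:", len(out), "of", len(draft))
```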

[303] Chorus: Harmonizing Context and Sensing Signals for Data-Free Model Customization in IoT

Liyu Zhang, Yejia Liu, Kwun Ho Liu, Runxi Huang, Xiaomin Ouyang

Main category: cs.LG

TL;DR: Chorus: A context-aware, data-free model customization approach that adapts IoT sensor models to unseen deployment conditions without requiring target-domain data, using cross-modal reconstruction and adaptive context-sensor integration.

DetailsMotivation: Real-world IoT sensor data is collected under diverse contextual conditions (sensor placements, ambient environments) that significantly affect data patterns and downstream performance. Traditional domain adaptation methods often ignore context information or use simplistic integration strategies, making them ineffective for unseen context shifts after deployment.

Method: 1) Unsupervised cross-modal reconstruction between unlabeled sensor data and language-based context embeddings, with regularization for robust context representations. 2) Training a lightweight gated head on limited labeled samples to dynamically balance sensor and context contributions. 3) Context-caching mechanism to reduce inference latency by reusing cached context representations and updating only upon detected context shifts.

Result: Experiments on IMU, speech, and WiFi sensing tasks under diverse context shifts show Chorus outperforms state-of-the-art baselines by up to 11.3% in unseen contexts, while maintaining comparable latency on smartphone and edge devices.

Conclusion: Chorus provides an effective context-aware approach for adapting IoT sensor models to unseen deployment conditions without requiring target-domain data, achieving superior performance in handling context shifts while maintaining practical latency constraints for real-world deployment.

Abstract: In real-world IoT applications, sensor data is usually collected under diverse and dynamic contextual conditions where factors such as sensor placements or ambient environments can significantly affect data patterns and downstream performance. Traditional domain adaptation or generalization methods often ignore such context information or use simplistic integration strategies, making them ineffective in handling unseen context shifts after deployment. In this paper, we propose Chorus, a context-aware, data-free model customization approach that adapts models to unseen deployment conditions without requiring target-domain data. The key idea is to learn effective context representations that capture their influence on sensor data patterns and to adaptively integrate them based on the degree of context shift. Specifically, Chorus first performs unsupervised cross-modal reconstruction between unlabeled sensor data and language-based context embeddings, while regularizing the context embedding space to learn robust, generalizable context representations. Then, it trains a lightweight gated head on limited labeled samples to dynamically balance sensor and context contributions, favoring context when sensor evidence is ambiguous and vice versa. To further reduce inference latency, Chorus employs a context-caching mechanism that reuses cached context representations and updates only upon detected context shifts. Experiments on IMU, speech, and WiFi sensing tasks under diverse context shifts show that Chorus outperforms state-of-the-art baselines by up to 11.3% in unseen contexts, while maintaining comparable latency on smartphone and edge devices.
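The gated head is a small component and can be sketched plausibly. The module below computes a per-dimension gate from both embeddings and mixes them before classification; dimensions and layer shapes are assumptions, not Chorus's published architecture.

```python
# Minimal sketch (assumed architecture): a lightweight gated fusion head in the
# spirit of Chorus, letting context dominate when sensor evidence is ambiguous.
import torch
import torch.nn as nn

class GatedFusionHead(nn.Module):
    def __init__(self, d_sensor: int, d_context: int, n_classes: int):
        super().__init__()
        self.ctx_proj = nn.Linear(d_context, d_sensor)
        self.gate = nn.Sequential(nn.Linear(2 * d_sensor, d_sensor), nn.Sigmoid())
        self.classifier = nn.Linear(d_sensor, n_classes)

    def forward(self, z_sensor: torch.Tensor, z_context: torch.Tensor):
        z_ctx = self.ctx_proj(z_context)
        g = self.gate(torch.cat([z_sensor, z_ctx], dim=-1))   # per-dim balance
        fused = g * z_sensor + (1 - g) * z_ctx
        return self.classifier(fused)

head = GatedFusionHead(d_sensor=128, d_context=384, n_classes=6)
logits = head(torch.randn(4, 128), torch.randn(4, 384))
print("logits shape:", logits.shape)
```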

[304] Accelerating High-Throughput Catalyst Screening by Direct Generation of Equilibrium Adsorption Structures

Songze Huo, Xiao-Ming Cao

Main category: cs.LG

TL;DR: DBCata is a deep generative model that generates high-fidelity adsorption geometries for catalyst screening without needing energy/force data, outperforming ML potentials and improving DFT accuracy.

DetailsMotivation: Current ML interatomic potentials (MLIPs) for catalyst screening rely on limited training data from near-equilibrium structures, leading to unreliable adsorption structures and energy predictions.

Method: DBCata integrates a periodic Brownian-bridge framework with an equivariant graph neural network to create a low-dimensional transition manifold between unrelaxed and DFT-relaxed structures, without requiring explicit energy or force information.

Result: Achieves interatomic distance MAE of 0.035 Å on Catalysis-Hub dataset (3x better than SOTA ML potentials), improves DFT accuracy within 0.1 eV in 94% of cases via hybrid outlier detection, and enables efficient ORR catalyst screening.

Conclusion: DBCata demonstrates powerful capabilities for generating accurate adsorption geometries, facilitating accelerated high-throughput computational screening for catalyst design and optimization.

Abstract: The adsorption energy serves as a crucial descriptor for the large-scale screening of catalysts. Nevertheless, the limited distribution of training data for the extensively utilised machine learning interatomic potential (MLIP), predominantly sourced from near-equilibrium structures, results in unreliable adsorption structures and consequent adsorption energy predictions. In this context, we present DBCata, a deep generative model that integrates a periodic Brownian-bridge framework with an equivariant graph neural network to establish a low-dimensional transition manifold between unrelaxed and DFT-relaxed structures, without requiring explicit energy or force information. Upon training, DBCata effectively generates high-fidelity adsorption geometries, achieving an interatomic distance mean absolute error (DMAE) of 0.035 Å on the Catalysis-Hub dataset, nearly three times better than the current state-of-the-art machine learning potential models. Moreover, the corresponding DFT accuracy can be brought to within 0.1 eV in 94% of instances by identifying and refining anomalous predictions through a hybrid chemical-heuristic and self-supervised outlier detection approach. We demonstrate that the remarkable performance of DBCata facilitates accelerated high-throughput computational screening for efficient alloy catalysts in the oxygen reduction reaction, highlighting the potential of DBCata as a powerful tool for catalyst design and optimisation.

[305] Leveraging Foundational Models and Simple Fusion for Multi-modal Physiological Signal Analysis

Youssef Ghallab, Omar Iraqy, Mohamed Kandil, Mohamed Ashraf, Saadeldine Eletter, Morougue Ghazal, Ayman Khalafallah, Nagwa El-Makky

Main category: cs.LG

TL;DR: The paper introduces a multi-modal approach using self-supervised pre-training of ECG and EEG encoders with a dual-masking strategy, then fuses them via simple concatenation for emotion recognition, achieving near state-of-the-art performance with limited labeled data.

DetailsMotivation: Physiological signals like ECG and EEG provide complementary health insights, but multi-modal integration is challenging due to limited labeled data and modality-specific differences. The authors aim to overcome these challenges by leveraging foundation-model approaches.

Method: Adapted CBraMod encoder for large-scale self-supervised ECG pre-training with dual-masking strategy (intra- and inter-lead dependencies). Used pre-trained CBraMod encoder for EEG and pre-trained symmetric ECG encoder, then fused representations via simple embedding concatenation, allowing classification head to learn cross-modal interactions.
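
The fusion step itself is deliberately simple. Here is a minimal PyTorch sketch, assuming pre-trained encoders with the interface shown; the dimensions and head architecture are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ConcatFusionClassifier(nn.Module):
    """Fuse frozen ECG and EEG embeddings by concatenation; a small
    classification head learns the cross-modal interactions."""
    def __init__(self, ecg_encoder, eeg_encoder, ecg_dim, eeg_dim, n_classes):
        super().__init__()
        self.ecg_encoder = ecg_encoder  # pre-trained, typically frozen
        self.eeg_encoder = eeg_encoder  # pre-trained, typically frozen
        self.head = nn.Sequential(
            nn.Linear(ecg_dim + eeg_dim, 256), nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, ecg, eeg):
        z = torch.cat([self.ecg_encoder(ecg), self.eeg_encoder(eeg)], dim=-1)
        return self.head(z)
```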

Result: Achieved near state-of-the-art performance on emotion recognition tasks, demonstrating that carefully designed physiological encoders with straightforward fusion substantially improve downstream performance despite limited multi-modal supervision.

Conclusion: Foundation-model approaches can effectively harness the holistic nature of physiological signals, enabling scalable, label-efficient, and generalizable solutions for healthcare and affective computing, even with simple fusion techniques.

Abstract: Physiological signals such as electrocardiograms (ECG) and electroencephalograms (EEG) provide complementary insights into human health and cognition, yet multi-modal integration is challenging due to limited multi-modal labeled data and modality-specific differences. In this work, we adapt the CBraMod encoder for large-scale self-supervised ECG pretraining, introducing a dual-masking strategy to capture intra- and inter-lead dependencies. To overcome the above challenges, we utilize a pre-trained CBraMod encoder for EEG and pre-train a symmetric ECG encoder, equipping each modality with a rich foundational representation. These representations are then fused via simple embedding concatenation, allowing the classification head to learn cross-modal interactions, together enabling effective downstream learning despite limited multi-modal supervision. Evaluated on emotion recognition, our approach achieves near state-of-the-art performance, demonstrating that carefully designed physiological encoders, even with straightforward fusion, substantially improve downstream performance. These results highlight the potential of foundation-model approaches to harness the holistic nature of physiological signals, enabling scalable, label-efficient, and generalizable solutions for healthcare and affective computing.

[306] Distillation-Guided Structural Transfer for Continual Learning Beyond Sparse Distributed Memory

Huiyan Xue, Xuming Ran, Yaxin Li, Qi Xu, Enhui Li, Yi Xu, Qiang Zhang

Main category: cs.LG

TL;DR: SSD improves sparse continual learning by using selective distillation within Top-K subnetworks to enable cross-task knowledge reuse without replay or task labels.

DetailsMotivation: Sparse neural systems like SDMLP show resilience against catastrophic forgetting but have rigid modularity that limits cross-task knowledge reuse and causes performance degradation under high sparsity.

Method: Selective Subnetwork Distillation (SSD) treats distillation as a topology-aligned information conduit, identifies neurons with high activation frequency, and selectively distills knowledge within previous Top-K subnetworks and output logits, without requiring replay or task labels.
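
A rough sketch of what such a selective distillation loss could look like in PyTorch; how the Top-K mask is built from activation frequencies, and the temperature and mixing weight, are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def ssd_loss(student_acts, teacher_acts, student_logits, teacher_logits,
             topk_mask, tau=2.0, alpha=0.5):
    """Distill only inside the previous task's Top-K subnetwork.

    topk_mask (0/1) marks neurons with high activation frequency on the
    old task; hidden activations are matched on those units only, and
    output logits get standard temperature-scaled distillation.
    """
    hidden = F.mse_loss(student_acts * topk_mask, teacher_acts * topk_mask)
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="batchmean") * tau ** 2
    return alpha * hidden + (1.0 - alpha) * kd
```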

Result: Experiments on Split CIFAR-10, CIFAR-100, and MNIST show SSD improves accuracy, retention, and representation coverage compared to existing sparse continual learning methods.

Conclusion: SSD provides a structurally grounded solution for sparse continual learning that enables structural realignment while preserving sparse modularity, overcoming limitations of rigid modularity in existing sparse systems.

Abstract: Sparse neural systems are gaining traction for efficient continual learning due to their modularity and low interference. Architectures such as Sparse Distributed Memory Multi-Layer Perceptrons (SDMLP) construct task-specific subnetworks via Top-K activation and have shown resilience against catastrophic forgetting. However, their rigid modularity limits cross-task knowledge reuse and leads to performance degradation under high sparsity. We propose Selective Subnetwork Distillation (SSD), a structurally guided continual learning framework that treats distillation not as a regularizer but as a topology-aligned information conduit. SSD identifies neurons with high activation frequency and selectively distills knowledge within previous Top-K subnetworks and output logits, without requiring replay or task labels. This enables structural realignment while preserving sparse modularity. Experiments on Split CIFAR-10, CIFAR-100, and MNIST demonstrate that SSD improves accuracy, retention, and representation coverage, offering a structurally grounded solution for sparse continual learning.

[307] Topological Metric for Unsupervised Embedding Quality Evaluation

Aleksei Shestov, Anton Klenitskiy, Daria Denisova, Amurkhan Dzagkoev, Daniil Petrovich, Andrey Savchenko, Maksim Makarenko

Main category: cs.LG

TL;DR: Persistence is a topology-based unsupervised metric that evaluates embedding quality by analyzing geometric structure and topological richness using persistent homology.

DetailsMotivation: Modern unsupervised/self-supervised learning lacks reliable evaluation metrics without labels. Existing metrics often assume linear separability or rely on covariance, failing to capture complex geometric structure.

Method: Uses persistent homology to quantify geometric structure and topological richness of embedding spaces. Captures global and multi-scale organization in a fully unsupervised manner.
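
One plausible instantiation of a persistence-based score (the summary does not pin down the exact aggregation) is the total lifetime mass of Vietoris-Rips persistence diagrams, sketched here with the ripser package:

```python
import numpy as np
from ripser import ripser  # pip install ripser

def persistence_score(embeddings, maxdim=1):
    """Total lifetime mass of Vietoris-Rips persistence diagrams -- one
    plausible aggregation of 'topological richness' (the paper's exact
    formula may differ)."""
    dgms = ripser(embeddings, maxdim=maxdim)["dgms"]
    total = 0.0
    for dgm in dgms:
        finite = dgm[np.isfinite(dgm[:, 1])]  # drop the infinite H0 bar
        total += float(np.sum(finite[:, 1] - finite[:, 0]))
    return total

score = persistence_score(np.random.randn(200, 16))  # higher = richer structure
```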

Result: Persistence achieves top-tier correlations with downstream performance across diverse domains, outperforming existing unsupervised metrics and enabling reliable model/hyperparameter selection.

Conclusion: Persistence provides a robust, topology-aware metric for evaluating embedding quality without labels, addressing a key challenge in modern representation learning.

Abstract: Modern representation learning increasingly relies on unsupervised and self-supervised methods trained on large-scale unlabeled data. While these approaches achieve impressive generalization across tasks and domains, evaluating embedding quality without labels remains an open challenge. In this work, we propose Persistence, a topology-aware metric based on persistent homology that quantifies the geometric structure and topological richness of embedding spaces in a fully unsupervised manner. Unlike metrics that assume linear separability or rely on covariance structure, Persistence captures global and multi-scale organization. Empirical results across diverse domains show that Persistence consistently achieves top-tier correlations with downstream performance, outperforming existing unsupervised metrics and enabling reliable model and hyperparameter selection.

[308] Quantum Machine Learning for Cybersecurity: A Taxonomy and Future Directions

Siva Sai, Ishika Goyal, Shubham Sharma, Sri Harshita Manuri, Vinay Chamola, Rajkumar Buyya

Main category: cs.LG

TL;DR: Survey paper on Quantum Machine Learning applications in cybersecurity, covering QNNs, QSVMs, VQCs, QGANs and their use in intrusion detection, malware classification, and cloud security.

DetailsMotivation: Classical ML and signature-based security approaches are failing against modern cyber threats due to evolving tactics and high data volumes. Quantum Machine Learning offers potential advantages in processing high-dimensional structures for security applications.

Method: Comprehensive survey methodology: maps QML techniques (QNNs, QSVMs, VQCs, QGANs) across supervised, unsupervised, and generative learning paradigms to cybersecurity tasks. Discusses applications in intrusion/anomaly detection, malware/botnet classification, encrypted-traffic analytics, and cloud security.

Result: Provides systematic overview of QML’s potential in cybersecurity, identifies specific applications where quantum advantages could be realized, and discusses how this survey improves upon existing research by providing comprehensive mapping of techniques to security domains.

Conclusion: QML shows promise for enhancing cybersecurity capabilities but has current limitations that need addressing. The paper provides directions for future research to overcome these challenges and realize QML’s potential in security applications.

Abstract: The increasing number of cyber threats and rapidly evolving tactics, as well as the high volume of data in recent years, have caused classical machine learning, rules, and signature-based defence strategies to fail, rendering them unable to keep up. An alternative, Quantum Machine Learning (QML), has recently emerged, making use of computations based on quantum mechanics. It offers better encoding and processing of high-dimensional structures for certain problems. This survey provides a comprehensive overview of QML techniques relevant to the domain of security, such as Quantum Neural Networks (QNNs), Quantum Support Vector Machines (QSVMs), Variational Quantum Circuits (VQCs), and Quantum Generative Adversarial Networks (QGANs), and discusses the contributions of this paper in relation to existing research in the field and how it improves over them. It also maps these methods across supervised, unsupervised, and generative learning paradigms, and to core cybersecurity tasks, including intrusion and anomaly detection, malware and botnet classification, and encrypted-traffic analytics. It also discusses their application in the domain of cloud computing security, where QML can enhance secure and scalable operations. Many limitations of QML in the domain of cybersecurity have also been discussed, along with the directions for addressing them.

[309] Bits for Privacy: Evaluating Post-Training Quantization via Membership Inference

Chenxiang Zhang, Tongxi Qu, Zhong Li, Tian Zhang, Jun Pang, Sjouke Mauw

Main category: cs.LG

TL;DR: Quantization reduces privacy leakage in neural networks - lower precision models show up to 10x less vulnerability to membership inference attacks, though with some utility trade-off.

DetailsMotivation: Existing privacy analyses focus on full-precision models, but quantization is widely used to reduce computational costs. There's a gap in understanding how bit-width reduction affects privacy leakage, especially for post-training quantization methods that don't require retraining.

Method: Systematic study of privacy-utility relationship in post-training quantization (PTQ) using membership inference attacks as evaluation framework. Analyzed three PTQ algorithms (AdaRound, BRECQ, OBC) across multiple precision levels (4-bit, 2-bit, 1.58-bit) on CIFAR-10, CIFAR-100, and TinyImageNet datasets.
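
The paper's attacks are more sophisticated, but the evaluation idea can be conveyed with a loss-threshold membership inference baseline: members tend to incur lower loss, and the attack AUC measures how separable members and non-members are. A self-contained sketch, with names of my own choosing:

```python
import numpy as np

def loss_threshold_mia_auc(member_losses, nonmember_losses):
    """Attack AUC for a loss-threshold membership inference baseline.

    Lower loss => more likely a training member, so the membership score
    is the negated loss; AUC aggregates over all possible thresholds.
    """
    scores = np.concatenate([-member_losses, -nonmember_losses])
    labels = np.concatenate([np.ones(len(member_losses)),
                             np.zeros(len(nonmember_losses))])
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # rank-based (Mann-Whitney) AUC
    n_pos, n_neg = len(member_losses), len(nonmember_losses)
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Example: compare a full-precision vs a quantized model's leakage.
rng = np.random.default_rng(0)
auc = loss_threshold_mia_auc(rng.gamma(2.0, 0.5, 1000),   # member losses
                             rng.gamma(2.5, 0.6, 1000))   # non-member losses
```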

Result: Low-precision PTQs consistently reduce privacy leakage. Lower-precision models show up to an order of magnitude reduction in membership inference vulnerability compared to full-precision counterparts, though with decreased utility. Additional ablation studies show quantizing only the last layer at higher precision enables fine-grained control over privacy-utility trade-off.

Conclusion: Quantization can serve as a practical privacy protection mechanism. The findings offer actionable insights for practitioners to balance efficiency, utility, and privacy in real-world deployments through precision-level adjustments and selective layer quantization.

Abstract: Deep neural networks are widely deployed with quantization techniques to reduce memory and computational costs by lowering the numerical precision of their parameters. While quantization alters model parameters and their outputs, existing privacy analyses primarily focus on full-precision models, leaving a gap in understanding how bit-width reduction can affect privacy leakage. We present the first systematic study of the privacy-utility relationship in post-training quantization (PTQ), a versatile family of methods that can be applied to pretrained models without further training. Using membership inference attacks as our evaluation framework, we analyze three popular PTQ algorithms-AdaRound, BRECQ, and OBC-across multiple precision levels (4-bit, 2-bit, and 1.58-bit) on CIFAR-10, CIFAR-100, and TinyImageNet datasets. Our findings consistently show that low-precision PTQs can reduce privacy leakage. In particular, lower-precision models demonstrate up to an order of magnitude reduction in membership inference vulnerability compared to their full-precision counterparts, albeit at the cost of decreased utility. Additional ablation studies on the 1.58-bit quantization level show that quantizing only the last layer at higher precision enables fine-grained control over the privacy-utility trade-off. These results offer actionable insights for practitioners to balance efficiency, utility, and privacy protection in real-world deployments.

[310] Empirical Investigation of the Impact of Phase Information on Fault Diagnosis of Rotating Machinery

Hiroyoshi Nagahama, Katsufumi Inoue, Masayoshi Todorokihara, Michifumi Yoshioka

Main category: cs.LG

TL;DR: Two phase-aware preprocessing strategies improve vibration-based predictive maintenance by addressing random phase variations in multi-axis data.

DetailsMotivation: Most learning-based approaches for predictive maintenance of rotating machinery either discard phase information during spectral feature extraction or use raw time-waveforms without explicitly leveraging phase information, leaving potential performance gains unexplored.

Method: Introduces two phase-aware preprocessing strategies: (1) three-axis independent phase adjustment that aligns each axis individually to zero phase, and (2) single-axis reference phase adjustment that preserves inter-axis relationships by applying uniform time shifts. Evaluates six deep learning architectures under a two-stage learning framework using a newly constructed rotor dataset with synchronized three-axis sensors.
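
An FFT-based reading of the two strategies, offered as a sketch of my interpretation rather than the paper's exact procedure: estimate the phase of the dominant spectral component and circularly shift the waveform so that component starts at zero phase, either per axis or with one shift shared across all axes.

```python
import numpy as np

def zero_phase_shift(signal):
    """Circularly shift a 1-D vibration signal so that its dominant
    frequency component starts at zero phase."""
    spectrum = np.fft.rfft(signal)
    k = int(np.argmax(np.abs(spectrum[1:]))) + 1  # dominant nonzero bin
    phase = np.angle(spectrum[k])
    shift = int(round(phase * len(signal) / (2 * np.pi * k)))
    return np.roll(signal, shift)

def align_three_axis(x, y, z, independent=True):
    """Strategy (1): align each axis to zero phase independently.
    Strategy (2): estimate one shift on reference axis x and apply it to
    all axes, preserving inter-axis phase relationships."""
    if independent:
        return zero_phase_shift(x), zero_phase_shift(y), zero_phase_shift(z)
    spectrum = np.fft.rfft(x)
    k = int(np.argmax(np.abs(spectrum[1:]))) + 1
    shift = int(round(np.angle(spectrum[k]) * len(x) / (2 * np.pi * k)))
    return np.roll(x, shift), np.roll(y, shift), np.roll(z, shift)
```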

Result: Both methods show architecture-independent improvements: three-axis independent method achieves consistent gains (+2.7% for Transformer), while single-axis reference approach delivers superior performance with up to 96.2% accuracy (+5.4%) by preserving spatial phase relationships.

Conclusion: Both phase alignment strategies are practical and scalable enhancements for predictive maintenance systems, demonstrating the importance of explicitly leveraging phase information in vibration signal analysis.

Abstract: Predictive maintenance of rotating machinery increasingly relies on vibration signals, yet most learning-based approaches either discard phase during spectral feature extraction or use raw time-waveforms without explicitly leveraging phase information. This paper introduces two phase-aware preprocessing strategies to address random phase variations in multi-axis vibration data: (1) three-axis independent phase adjustment, which aligns each axis individually to zero phase, and (2) single-axis reference phase adjustment, which preserves inter-axis relationships by applying uniform time shifts. Using a newly constructed rotor dataset acquired with a synchronized three-axis sensor, we evaluate six deep learning architectures under a two-stage learning framework. Results demonstrate architecture-independent improvements: the three-axis independent method achieves consistent gains (+2.7% for Transformer), while the single-axis reference approach delivers superior performance with up to 96.2% accuracy (+5.4%) by preserving spatial phase relationships. These findings establish both phase alignment strategies as practical and scalable enhancements for predictive maintenance systems.

[311] A Regime-Aware Fusion Framework for Time Series Classification

Honey Singh Chauhan, Zahraa S. Abdallah

Main category: cs.LG

TL;DR: Fusion-3 (F3) adaptively fuses Rocket, Sax, and Sfa representations for time series classification, showing consistent improvements over Rocket alone on specific dataset types identified through meta-feature clustering.

DetailsMotivation: Kernel-based methods like Rocket are effective for time series classification but don't perform equally well across all datasets. The authors revisit the intuition that different representations capture complementary structure and that selective fusion could yield consistent improvements on systematically identifiable dataset types.

Method: Introduces Fusion-3 (F3), a lightweight framework that adaptively fuses Rocket, Sax, and Sfa representations. Uses meta-features (series length, spectral structure, roughness, class imbalance) to cluster UCR datasets into six interpretable data-structure regimes. Combines three complementary analyses: non-parametric paired statistics, ablation studies, and SHAP attribution to identify which dataset properties predict fusion gains.

Result: Fusion typically outperforms strong baselines in regimes with structured variability or rich frequency content, while offering diminishing returns in highly irregular or outlier-heavy settings. F3 yields small but consistent average improvements over Rocket on 113 UCR datasets, supported by frequentist and Bayesian evidence. Fusion primarily improves performance by rescuing specific errors, with adaptive increases in frequency-domain weighting where corrections occur.

Conclusion: Selectively applied fusion provides a dependable and interpretable extension to strong kernel-based methods, correcting their weaknesses precisely where the data support it. The framework offers consistent improvements with clearly identifiable failure cases.

Abstract: Kernel-based methods such as Rocket are among the most effective default approaches for univariate time series classification (TSC), yet they do not perform equally well across all datasets. We revisit the long-standing intuition that different representations capture complementary structure and show that selectively fusing them can yield consistent improvements over Rocket on specific, systematically identifiable kinds of datasets. We introduce Fusion-3 (F3), a lightweight framework that adaptively fuses Rocket, Sax, and Sfa representations. To understand when fusion helps, we cluster UCR datasets into six groups using meta-features capturing series length, spectral structure, roughness, and class imbalance, and treat these clusters as interpretable data-structure regimes. Our analysis shows that fusion typically outperforms strong baselines in regimes with structured variability or rich frequency content, while offering diminishing returns in highly irregular or outlier-heavy settings. To support these findings, we combine three complementary analyses: non-parametric paired statistics across datasets, ablation studies isolating the roles of individual representations, and attribution via SHAP to identify which dataset properties predict fusion gains. Sample-level case studies further reveal the underlying mechanism: fusion primarily improves performance by rescuing specific errors, with adaptive increases in frequency-domain weighting precisely where corrections occur. Using 5-fold cross-validation on the 113 UCR datasets, F3 yields small but consistent average improvements over Rocket, supported by frequentist and Bayesian evidence and accompanied by clearly identifiable failure cases. Our results show that selectively applied fusion provides a dependable and interpretable extension to strong kernel-based methods, correcting their weaknesses precisely where the data support it.

[312] Robustness Evaluation of Machine Learning Models for Fault Classification and Localization In Power System Protection

Julian Oelhaf, Mehran Pashaei, Georg Kordowich, Christian Bergler, Andreas Maier, Johann Jäger, Siming Bayer

Main category: cs.LG

TL;DR: A unified framework for evaluating ML model robustness in power system protection, showing fault classification is stable but fault localization degrades significantly with sensor/communication issues.

DetailsMotivation: Renewable energy integration challenges conventional protection schemes, requiring adaptive ML-based solutions. However, practical deployment requires robustness against missing, noisy, or degraded sensor data that real-world systems encounter.

Method: Developed a unified framework using high-fidelity EMT simulations to model realistic degradation scenarios (sensor outages, reduced sampling rates, transient communication losses). Provides consistent methodology for benchmarking ML models and quantifying impact of limited observability.
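
The degradation operators themselves are straightforward to express. The sketch below shows plausible versions of the three scenarios applied to one multichannel window; the channel layout and parameters are illustrative, and the EMT simulation side is not reproduced.

```python
import numpy as np

def degrade(window, drop_channels=(), downsample=1, burst=None):
    """Apply degradation scenarios to one measurement window of shape
    (channels, samples): sensor outage, reduced sampling rate, and a
    transient communication loss (a zeroed burst of samples)."""
    out = window.copy()
    for c in drop_channels:
        out[c] = 0.0                          # sensor outage
    if downsample > 1:
        out = out[:, ::downsample]            # reduced sampling rate
    if burst is not None:
        start, length = burst
        out[:, start:start + length] = 0.0    # transient communication loss
    return out

x = np.random.default_rng(0).standard_normal((6, 1000))  # e.g. 3 voltage + 3 current channels
x_deg = degrade(x, drop_channels=(0,), downsample=2, burst=(100, 50))
```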

Result: Fault classification remains highly stable under most degradation types (drops only ~13% under single-phase loss). Fault localization is more sensitive, with voltage loss increasing localization error by over 150%. Identified critical measurement channels needed for resilient operation.

Conclusion: The framework offers actionable guidance for robustness-aware design of future ML-assisted protection systems, highlighting the need to address sensitivity of fault localization to sensor/communication degradation.

Abstract: The growing penetration of renewable and distributed generation is transforming power systems and challenging conventional protection schemes that rely on fixed settings and local measurements. Machine learning (ML) offers a data-driven alternative for centralized fault classification (FC) and fault localization (FL), enabling faster and more adaptive decision-making. However, practical deployment critically depends on robustness. Protection algorithms must remain reliable even when confronted with missing, noisy, or degraded sensor data. This work introduces a unified framework for systematically evaluating the robustness of ML models in power system protection. High-fidelity EMT simulations are used to model realistic degradation scenarios, including sensor outages, reduced sampling rates, and transient communication losses. The framework provides a consistent methodology for benchmarking models, quantifying the impact of limited observability, and identifying critical measurement channels required for resilient operation. Results show that FC remains highly stable under most degradation types but drops by about 13% under single-phase loss, while FL is more sensitive overall, with voltage loss increasing localization error by over 150%. These findings offer actionable guidance for robustness-aware design of future ML-assisted protection systems.

[313] EUBRL: Epistemic Uncertainty Directed Bayesian Reinforcement Learning

Jianfei Ma, Wee Sun Lee

Main category: cs.LG

TL;DR: EUBRL is a Bayesian RL algorithm that uses epistemic uncertainty guidance for principled exploration, achieving near-optimal regret bounds and superior sample efficiency in challenging RL tasks.

DetailsMotivation: The paper addresses the fundamental exploration-exploitation dilemma in RL, where agents face epistemic uncertainty (systematic uncertainty due to limited knowledge) at the boundary between known and unknown states. The motivation is to develop a principled exploration strategy that adaptively reduces regret from estimation errors.

Method: The authors propose EUBRL, a Bayesian reinforcement learning algorithm that leverages epistemic guidance for exploration. The method uses a class of sufficiently expressive priors in infinite-horizon discounted Markov Decision Processes (MDPs) to provide systematic uncertainty quantification and guide exploration decisions.

Result: Theoretical results establish nearly minimax-optimal regret and sample complexity guarantees for the proposed algorithm. Empirical evaluations on tasks with sparse rewards, long horizons, and stochasticity demonstrate that EUBRL achieves superior sample efficiency, scalability, and consistency compared to baseline methods.

Conclusion: EUBRL successfully addresses the exploration-exploitation dilemma through epistemic uncertainty guidance, providing both theoretical guarantees and practical performance improvements in challenging RL environments. The approach offers a principled framework for exploration that adaptively reduces regret from estimation errors.

Abstract: At the boundary between the known and the unknown, an agent inevitably confronts the dilemma of whether to explore or to exploit. Epistemic uncertainty reflects such boundaries, representing systematic uncertainty due to limited knowledge. In this paper, we propose a Bayesian reinforcement learning (RL) algorithm, EUBRL, which leverages epistemic guidance to achieve principled exploration. This guidance adaptively reduces per-step regret arising from estimation errors. We establish nearly minimax-optimal regret and sample complexity guarantees for a class of sufficiently expressive priors in infinite-horizon discounted MDPs. Empirically, we evaluate EUBRL on tasks characterized by sparse rewards, long horizons, and stochasticity. Results demonstrate that EUBRL achieves superior sample efficiency, scalability, and consistency.

[314] FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows

Yeonwoo Cha, Semin Kim, Jinhyeon Kwon, Seunghoon Hong

Main category: cs.LG

TL;DR: FlowBind is an efficient any-to-any generation framework that uses a shared latent space with modality-specific invertible flows, achieving competitive quality with significantly reduced parameters and training time.

DetailsMotivation: Existing flow-based approaches for any-to-any generation are inefficient due to large-scale data requirements, restrictive pairing constraints, high computational costs from modeling joint distributions, and complex multi-stage training.

Method: FlowBind learns a shared latent space capturing cross-modal information with modality-specific invertible flows bridging this latent to each modality. Both components are optimized jointly under a single flow-matching objective, and at inference the invertible flows act as encoders/decoders for direct translation.
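
The single training objective is standard conditional flow matching. Here is a minimal sketch on (batch, dim) latents, with the network interface as a placeholder; FlowBind's shared latent and invertible per-modality flows are not reproduced.

```python
import torch

def flow_matching_loss(velocity_net, x0, x1):
    """Conditional flow matching on (batch, dim) latents: regress the
    network's velocity at a random point on the straight path from x0
    (noise) to x1 (data) onto the constant target velocity x1 - x0."""
    t = torch.rand(x0.shape[0], 1, device=x0.device)
    xt = (1.0 - t) * x0 + t * x1
    return ((velocity_net(xt, t) - (x1 - x0)) ** 2).mean()

# Toy usage with a placeholder velocity network.
net = torch.nn.Sequential(torch.nn.Linear(17, 64), torch.nn.SiLU(),
                          torch.nn.Linear(64, 16))
velocity = lambda x, t: net(torch.cat([x, t], dim=-1))
loss = flow_matching_loss(velocity, torch.randn(8, 16), torch.randn(8, 16))
```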

Result: FlowBind achieves comparable generation quality while requiring up to 6x fewer parameters and training 10x faster than prior methods, as demonstrated on text, image, and audio modalities.

Conclusion: FlowBind provides an efficient and simplified framework for any-to-any generation that substantially reduces data requirements and computational costs while maintaining competitive performance across multiple modalities.

Abstract: Any-to-any generation seeks to translate between arbitrary subsets of modalities, enabling flexible cross-modal synthesis. Despite recent success, existing flow-based approaches are challenged by their inefficiency, as they require large-scale datasets often with restrictive pairing constraints, incur high computational cost from modeling joint distribution, and rely on complex multi-stage training. We propose FlowBind, an efficient framework for any-to-any generation. Our approach is distinguished by its simplicity: it learns a shared latent space capturing cross-modal information, with modality-specific invertible flows bridging this latent to each modality. Both components are optimized jointly under a single flow-matching objective, and at inference the invertible flows act as encoders and decoders for direct translation across modalities. By factorizing interactions through the shared latent, FlowBind naturally leverages arbitrary subsets of modalities for training, and achieves competitive generation quality while substantially reducing data requirements and computational cost. Experiments on text, image, and audio demonstrate that FlowBind attains comparable quality while requiring up to 6x fewer parameters and training 10x faster than prior methods. The project page with code is available at https://yeonwoo378.github.io/official_flowbind.

[315] FM-EAC: Feature Model-based Enhanced Actor-Critic for Multi-Task Control in Dynamic Environments

Quanxi Zhou, Wencan Mao, Manabu Tsukada, John C. S. Lui, Yusheng Ji

Main category: cs.LG

TL;DR: FM-EAC integrates model-based and model-free RL with feature-based models and enhanced actor-critic framework for improved multi-task transferability in dynamic environments.

DetailsMotivation: Modern RL methods struggle with effective transferability across tasks and scenarios, despite the convergence of model-based and model-free approaches in Dyna-Q. There's a need for better generalizability in multi-task control for dynamic environments.

Method: Proposes Feature Model-Based Enhanced Actor-Critic (FM-EAC) that integrates planning, acting, and learning. Combines MBRL and MFRL strengths using novel feature-based models and enhanced actor-critic framework. Allows customization of sub-networks for user-specific requirements.

Result: FM-EAC consistently outperforms state-of-the-art MBRL and MFRL methods in simulations across urban and agricultural applications. Demonstrates improved generalizability and transferability across tasks.

Conclusion: FM-EAC successfully bridges model-based and model-free RL approaches, offering a flexible framework with customizable sub-networks that achieves superior performance and transferability in multi-task dynamic environments.

Abstract: Model-based reinforcement learning (MBRL) and model-free reinforcement learning (MFRL) evolve along distinct paths but converge in the design of Dyna-Q [1]. However, modern RL methods still struggle with effective transferability across tasks and scenarios. Motivated by this limitation, we propose a generalized algorithm, Feature Model-Based Enhanced Actor-Critic (FM-EAC), that integrates planning, acting, and learning for multi-task control in dynamic environments. FM-EAC combines the strengths of MBRL and MFRL and improves generalizability through the use of novel feature-based models and an enhanced actor-critic framework. Simulations in both urban and agricultural applications demonstrate that FM-EAC consistently outperforms many state-of-the-art MBRL and MFRL methods. More importantly, different sub-networks can be customized within FM-EAC according to user-specific requirements.

[316] Statistics of Min-max Normalized Eigenvalues in Random Matrices

Hyakka Nakada, Shu Tanaka

Main category: cs.LG

TL;DR: Study analyzes statistical properties of min-max normalized eigenvalues in random matrices, derives scaling laws for cumulative distribution and residual errors in matrix factorization.

DetailsMotivation: Random matrix theory is important in mathematics, physics, and machine learning. In data science, input data are typically normalized before processing, so understanding the statistical properties of normalized eigenvalues is practically relevant.

Method: Apply previously proposed effective distribution for normalized eigenvalues to evaluate scaling laws of cumulative distribution. Derive residual errors that occur during matrix factorization of random matrices. Conduct numerical experiments to verify theoretical predictions.
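
The numerical setup is easy to reproduce in spirit: draw a random symmetric matrix (a GOE-like ensemble is assumed here purely as an example), min-max normalize its eigenvalues, and form the empirical cumulative distribution whose scaling is analyzed.

```python
import numpy as np

def minmax_normalized_eigenvalues(n, rng):
    """Eigenvalues of a random symmetric (GOE-like) matrix, min-max
    normalized to [0, 1] as is typical before data-science processing."""
    a = rng.standard_normal((n, n))
    eig = np.linalg.eigvalsh((a + a.T) / 2.0)
    return (eig - eig.min()) / (eig.max() - eig.min())

rng = np.random.default_rng(0)
vals = np.sort(minmax_normalized_eigenvalues(500, rng))
cdf = np.arange(1, vals.size + 1) / vals.size  # empirical cumulative distribution
```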

Result: Theoretical predictions for scaling laws of cumulative distribution of normalized eigenvalues and residual errors in matrix factorization are derived and verified through numerical experiments.

Conclusion: The study provides analytical results for statistical properties of min-max normalized eigenvalues in random matrices, with practical implications for data science applications where normalization is standard practice.

Abstract: Random matrix theory has played an important role in various areas of pure mathematics, mathematical physics, and machine learning. From a practical perspective of data science, input data are usually normalized prior to processing. Thus, this study investigates the statistical properties of min-max normalized eigenvalues in random matrices. An effective distribution for such normalized eigenvalues has previously been proposed. In this study, we apply it to evaluate a scaling law of the cumulative distribution. Furthermore, we derive the residual error that arises during matrix factorization of random matrices. We conducted numerical experiments to verify these theoretical predictions.

[317] Double Horizon Model-Based Policy Optimization

Akihiro Kubo, Paavo Parmas, Shin Ishii

Main category: cs.LG

TL;DR: DHMBPO proposes a double-horizon approach in MBRL with separate distribution rollouts (long) for on-policy sampling and training rollouts (short) for stable gradient estimation, balancing distribution shift, model bias, and gradient instability.

DetailsMotivation: Traditional MBRL faces a dilemma in choosing rollout length: longer rollouts preserve on-policy training but amplify model bias, while also creating tension between reducing value estimation bias and increasing policy gradient variance. The optimal horizons for these two conflicting objectives differ.

Method: DHMBPO divides the rollout procedure into two parts: (1) a long “distribution rollout” (DR) that generates on-policy state samples to mitigate distribution shift, and (2) a short “training rollout” (TR) that uses differentiable transitions for accurate value gradient estimation with stable updates, requiring fewer updates and reducing runtime.
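
Schematically, one DHMBPO-style iteration interleaves the two horizons as below. All object interfaces (model.step, policy.act, and so on) are hypothetical stand-ins, so this is a structural sketch rather than runnable training code.

```python
def dhmbpo_policy_loss(model, policy, start_state, H_dist=10, H_train=3):
    """One DHMBPO-style update, schematically.

    Distribution rollout (long, no gradients): walk the learned model
    on-policy to collect start states, mitigating distribution shift.
    Training rollouts (short, differentiable): backprop value gradients
    through a few model steps for stable, low-variance updates.
    All interfaces below (model.step, model.step_differentiable,
    policy.act, policy.value) are hypothetical stand-ins.
    """
    states, s = [], start_state
    for _ in range(H_dist):                    # distribution rollout (DR)
        s = model.step(s, policy.act(s))
        states.append(s)

    total = 0.0
    for s0 in states:                          # training rollouts (TR)
        s, ret = s0, 0.0
        for _ in range(H_train):
            a = policy.act(s, differentiable=True)
            s, r = model.step_differentiable(s, a)
            ret = ret + r
        total = total + ret + policy.value(s)  # bootstrap the tail
    return -total / len(states)                # loss to minimize
```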

Result: The double-horizon approach effectively balances distribution shift, model bias, and gradient instability, surpassing existing MBRL methods on continuous-control benchmarks in both sample efficiency and runtime.

Conclusion: Separating rollout horizons for different purposes (distribution sampling vs. gradient estimation) resolves the fundamental conflict in MBRL rollout length selection, leading to improved performance and efficiency.

Abstract: Model-based reinforcement learning (MBRL) reduces the cost of real-environment sampling by generating synthetic trajectories (called rollouts) from a learned dynamics model. However, choosing the length of the rollouts poses two dilemmas: (1) Longer rollouts better preserve on-policy training but amplify model bias, indicating the need for an intermediate horizon to mitigate distribution shift (i.e., the gap between on-policy and past off-policy samples). (2) Moreover, a longer model rollout may reduce value estimation bias but raise the variance of policy gradients due to backpropagation through multiple steps, implying another intermediate horizon for stable gradient estimates. However, these two optimal horizons may differ. To resolve this conflict, we propose Double Horizon Model-Based Policy Optimization (DHMBPO), which divides the rollout procedure into a long “distribution rollout” (DR) and a short “training rollout” (TR). The DR generates on-policy state samples for mitigating distribution shift. In contrast, the short TR leverages differentiable transitions to offer accurate value gradient estimation with stable gradient updates, thereby requiring fewer updates and reducing overall runtime. We demonstrate that the double-horizon approach effectively balances distribution shift, model bias, and gradient instability, and surpasses existing MBRL methods on continuous-control benchmarks in terms of both sample efficiency and runtime.

Neeraj Sarna, Yuanyuan Li, Michael von Gablenz

Main category: cs.LG

TL;DR: This paper explores using chain-of-thought and task instruction prompting combined with negative prompting and prompt rewriting to reduce copyrighted content generation in text-to-image models.

DetailsMotivation: Large text-to-image models can memorize and reproduce copyrighted training data, creating legal and financial risks for users and developers. There's a need for techniques to mitigate copyright infringement in AI-generated content.

Method: Combines chain-of-thought and task instruction prompting with two copyright mitigation strategies: negative prompting and prompt rewriting. Evaluates generated images based on similarity to copyrighted images and relevance to user input.

Result: Presents numerical experiments on various models, providing insights on effectiveness of these techniques across different model complexities.

Conclusion: The proposed combination of chain-of-thought prompting, task instruction, negative prompting, and prompt rewriting shows potential for reducing copyrighted content generation in text-to-image models, with effectiveness varying by model complexity.

Abstract: Large-scale text-to-image generation models can memorize and reproduce their training dataset. Since the training dataset often contains copyrighted material, reproduction of the training dataset poses a copyright infringement risk, which could result in legal liabilities and financial losses for both the AI user and the developer. The current work explores the potential of chain-of-thought and task instruction prompting in reducing copyrighted content generation. To this end, we present a formulation that combines these two techniques with two other copyright mitigation strategies: a) negative prompting, and b) prompt re-writing. We study the generated images in terms of their similarity to a copyrighted image and their relevance to the user input. We present numerical experiments on a variety of models and provide insights on the effectiveness of the aforementioned techniques for varying model complexity.

[319] From Risk to Resilience: Towards Assessing and Mitigating the Risk of Data Reconstruction Attacks in Federated Learning

Xiangrui Xu, Zhize Li, Yufei Han, Bin Wang, Jiqiang Liu, Wei Wang

Main category: cs.LG

TL;DR: The paper introduces Invertibility Loss (InvLoss) as a theoretical framework to quantify data reconstruction attack risks in federated learning, develops a risk estimator, and proposes adaptive noise defenses.

DetailsMotivation: There's a lack of theoretically-grounded risk quantification framework for Data Reconstruction Attacks (DRA) in Federated Learning systems, making it difficult to characterize and assess privacy risks systematically.

Method: Introduces Invertibility Loss (InvLoss) to quantify maximum achievable DRA effectiveness, derives a tight computable upper bound, develops InvRE risk estimator, and proposes two adaptive noise perturbation defenses.
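
Since the bound ties DRA risk to the spectral properties of Jacobians of exchanged quantities, inspecting that spectrum for a toy embedding map is a natural first step; the sketch below does exactly that (the InvLoss and InvRE computations themselves are not reproduced).

```python
import torch
from torch.autograd.functional import jacobian

def jacobian_singular_values(model_fn, x):
    """Singular values of the Jacobian of the exchanged quantity (here, a
    feature embedding) with respect to the input; per the framework, this
    spectrum governs how invertible -- and hence reconstructable -- x is."""
    J = jacobian(model_fn, x).reshape(-1, x.numel())
    return torch.linalg.svdvals(J)

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.Tanh())
sv = jacobian_singular_values(net, torch.randn(8))
```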

Result: Shows DRA risk is governed by spectral properties of Jacobian matrices, provides unified explanation for defense effectiveness, validates framework on real-world datasets with systematic risk evaluation and mitigation.

Conclusion: The InvLoss framework enables systematic DRA risk quantification and mitigation in FL systems, offering attack-agnostic risk evaluation and effective privacy defenses without harming classification accuracy.

Abstract: Data Reconstruction Attacks (DRA) pose a significant threat to Federated Learning (FL) systems by enabling adversaries to infer sensitive training data from local clients. Despite extensive research, the question of how to characterize and assess the risk of DRAs in FL systems remains unresolved due to the lack of a theoretically-grounded risk quantification framework. In this work, we address this gap by introducing Invertibility Loss (InvLoss) to quantify the maximum achievable effectiveness of DRAs for a given data instance and FL model. We derive a tight and computable upper bound for InvLoss and explore its implications from three perspectives. First, we show that DRA risk is governed by the spectral properties of the Jacobian matrix of exchanged model updates or feature embeddings, providing a unified explanation for the effectiveness of defense methods. Second, we develop InvRE, an InvLoss-based DRA risk estimator that offers attack method-agnostic, comprehensive risk evaluation across data instances and model architectures. Third, we propose two adaptive noise perturbation defenses that enhance FL privacy without harming classification accuracy. Extensive experiments on real-world datasets validate our framework, demonstrating its potential for systematic DRA risk evaluation and mitigation in FL systems.

[320] Metanetworks as Regulatory Operators: Learning to Edit for Requirement Compliance

Ioannis Kalogeropoulos, Giorgos Bouritsas, Yannis Panagakis

Main category: cs.LG

TL;DR: A framework for efficiently editing neural networks to satisfy diverse requirements (fairness, privacy, efficiency) without sacrificing performance, using a learned graph metanetwork editor.

DetailsMotivation: As ML models are deployed in high-stakes settings, there's a need to ensure they satisfy various requirements beyond performance (fairness, compliance, computational constraints). Current post-processing methods often compromise performance, while retraining is time-consuming or unavailable.

Method: Proposes a unifying framework where a graph metanetwork (itself an NN) learns to edit neural networks in a single inference step. The metanetwork is trained on NN populations to minimize an objective with two terms: requirement enforcement and utility preservation.

Result: Experiments on diverse tasks (data minimization, bias mitigation, weight pruning) show improved trade-offs between performance, requirement satisfaction, and time efficiency compared to post-processing or retraining alternatives.

Conclusion: The framework enables efficient model editing to satisfy requirements without sacrificing utility, addressing critical challenges in deploying ML models in high-stakes settings.

Abstract: As machine learning models are increasingly deployed in high-stakes settings, e.g. as decision support systems in various societal sectors or in critical infrastructure, designers and auditors are facing the need to ensure that models satisfy a wider variety of requirements (e.g. compliance with regulations, fairness, computational constraints) beyond performance. Although most of them are the subject of ongoing studies, typical approaches face critical challenges: post-processing methods tend to compromise performance, which is often counteracted by fine-tuning or, worse, training from scratch, an often time-consuming or even unavailable strategy. This raises the following question: “Can we efficiently edit models to satisfy requirements, without sacrificing their utility?” In this work, we approach this with a unifying framework, in a data-driven manner, i.e. we learn to edit neural networks (NNs), where the editor is an NN itself - a graph metanetwork - and editing amounts to a single inference step. In particular, the metanetwork is trained on NN populations to minimise an objective consisting of two terms: the requirement to be enforced and the preservation of the NN’s utility. We experiment with diverse tasks (the data minimisation principle, bias mitigation and weight pruning), improving the trade-offs between performance, requirement satisfaction and time efficiency compared to popular post-processing or re-training alternatives.

[321] Soft Geometric Inductive Bias for Object Centric Dynamics

Hampus Linander, Conor Heins, Alexander Tschantz, Marco Perin, Christopher Buckley

Main category: cs.LG

TL;DR: Geometric algebra neural networks provide soft geometric inductive bias for object-centric world models, outperforming non-equivariant baselines in physical fidelity for long-horizon rollouts of 2D rigid body dynamics with obstacles.

DetailsMotivation: Exact group equivariance can degrade performance when symmetries are broken in physical dynamics, so there's a need for softer geometric priors that can handle imperfect symmetry while maintaining physical fidelity.

Method: Object-centric world models built with geometric algebra neural networks that provide soft geometric inductive bias, trained autoregressively for next-step predictions in simulated 2D rigid body dynamics environments with static obstacles.

Result: The soft inductive bias of geometric algebra models results in better performance in terms of physical fidelity for long-horizon rollouts compared to non-equivariant baseline models.

Conclusion: Geometric algebra offers an effective middle ground between hand-crafted physics and unstructured deep nets, delivering sample-efficient dynamics models for multi-object scenes, complementing recent soft-equivariance ideas.

Abstract: Equivariance is a powerful prior for learning physical dynamics, yet exact group equivariance can degrade performance if the symmetries are broken. We propose object-centric world models built with geometric algebra neural networks, providing a soft geometric inductive bias. Our models are evaluated using simulated environments of 2D rigid body dynamics with static obstacles, where we train for next-step predictions autoregressively. For long-horizon rollouts we show that the soft inductive bias of our models results in better performance in terms of physical fidelity compared to non-equivariant baseline models. The approach complements recent soft-equivariance ideas and aligns with the view that simple, well-chosen priors can yield robust generalization. These results suggest that geometric algebra offers an effective middle ground between hand-crafted physics and unstructured deep nets, delivering sample-efficient dynamics models for multi-object scenes.

[322] Multi-stage Bayesian optimisation for dynamic decision-making in self-driving labs

Luca Torresi, Pascal Friederich

Main category: cs.LG

TL;DR: The paper introduces an extension to Bayesian optimization that enables flexible sampling of multi-stage workflows and decision-making based on intermediate proxy measurements, overcoming limitations of standard BO in self-driving labs.

DetailsMotivation: Standard Bayesian optimization in self-driving labs requires fixed experimental workflows with clear optimization parameters and measurable objectives, excluding on-the-fly decisions about operation sequences and intermediate measurements. Many real-world experiments need simplification to fit this common setting.

Method: The authors introduce an extension to Bayesian optimization that allows flexible sampling of multi-stage workflows and makes optimal decisions based on intermediate proxy measurements. This enables more complex experimental workflows in autonomous labs.

Result: Proxy measurements yield substantial improvement over conventional Bayesian optimization across a wide range of scenarios, both in time to find good solutions and in overall optimality of found solutions.

Conclusion: The approach paves the way for using more complex and realistic experimental workflows in autonomous labs and enables smooth combination of simulations and experiments in next-generation self-driving laboratories.

Abstract: Self-driving laboratories (SDLs) are combining recent technological advances in robotics, automation, and machine learning based data analysis and decision-making to perform autonomous experimentation toward human-directed goals without requiring any direct human intervention. SDLs are successfully used in materials science, chemistry, and beyond, to optimise processes, materials, and devices in a systematic and data-efficient way. At present, the most widely used algorithm to identify the most informative next experiment is Bayesian optimisation. While relatively simple to apply to a wide range of optimisation problems, standard Bayesian optimisation relies on a fixed experimental workflow with a clear set of optimisation parameters and one or more measurable objective functions. This excludes the possibility of making on-the-fly decisions about changes in the planned sequence of operations and including intermediate measurements in the decision-making process. Therefore, many real-world experiments need to be adapted and simplified to be converted to the common setting in self-driving labs. In this paper, we introduce an extension to Bayesian optimisation that allows flexible sampling of multi-stage workflows and makes optimal decisions based on intermediate observables, which we call proxy measurements. We systematically compare the advantage of taking into account proxy measurements over conventional Bayesian optimisation, in which only the final measurement is observed. We find that over a wide range of scenarios, proxy measurements yield a substantial improvement, both in the time to find good solutions and in the overall optimality of found solutions. This not only paves the way to use more complex and thus more realistic experimental workflows in autonomous labs but also to smoothly combine simulations and experiments in the next generation of SDLs.

[323] Robustness and uncertainty: two complementary aspects of the reliability of the predictions of a classifier

Adrián Detavernier, Jasper De Bock

Main category: cs.LG

TL;DR: The paper compares Robustness Quantification (RQ) and Uncertainty Quantification (UQ) for assessing classifier prediction reliability, finds they are complementary, and proposes a hybrid approach that outperforms both.

DetailsMotivation: To understand and compare two different approaches (RQ and UQ) for assessing the reliability of individual classifier predictions, as there's no clear consensus on which approach is better for evaluating prediction reliability.

Method: The authors compare RQ and UQ approaches on multiple benchmark datasets, analyze their performance, and develop a hybrid approach that combines both methods to leverage their complementary strengths.
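
The summary does not specify the hybrid's form, so the sketch below shows only the generic shape such a combination could take: map a robustness score and a negated uncertainty score onto a common [0, 1] scale and mix them. The convex mixture and sign convention are assumptions of mine.

```python
import numpy as np

def hybrid_reliability(rq_scores, uq_scores, weight=0.5):
    """Mix a robustness score and an uncertainty score into one
    per-prediction reliability estimate. Both are min-max scaled so that
    higher means more reliable (uncertainty is negated first)."""
    def scale(s):
        return (s - s.min()) / (s.max() - s.min() + 1e-12)
    return weight * scale(rq_scores) + (1.0 - weight) * scale(-uq_scores)

rng = np.random.default_rng(0)
reliability = hybrid_reliability(rng.random(100), rng.random(100))
```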

Result: No clear winner between RQ and UQ approaches - they are complementary. The hybrid approach combining both outperforms individual RQ and UQ methods. The study also provides dataset-specific assessments of relative importance of uncertainty vs robustness as sources of unreliability.

Conclusion: RQ and UQ are complementary approaches for reliability assessment, and combining them in a hybrid framework yields better performance than using either approach alone. The relative importance of uncertainty vs robustness varies by dataset.

Abstract: We consider two conceptually different approaches for assessing the reliability of the individual predictions of a classifier: Robustness Quantification (RQ) and Uncertainty Quantification (UQ). We compare both approaches on a number of benchmark datasets and show that there is no clear winner between the two, but that they are complementary and can be combined to obtain a hybrid approach that outperforms both RQ and UQ. As a byproduct of our approach, for each dataset, we also obtain an assessment of the relative importance of uncertainty and robustness as sources of unreliability.

[324] Joint Learning of Unsupervised Multi-view Feature and Instance Co-selection with Cross-view Imputation

Yuxin Cai, Yanyong Huang, Jinyuan Chang, Dongjie Wang, Tianrui Li, Xiaoyi Jiang

Main category: cs.LG

TL;DR: JUICE is a unified framework for joint feature/instance co-selection and cross-view imputation on incomplete multi-view data, outperforming existing methods that treat these tasks separately.

DetailsMotivation: Existing methods for feature and instance co-selection on incomplete multi-view data treat imputation and co-selection as independent processes, missing potential interactions. Simply concatenating multi-view data also fails to capture complementary information, limiting co-selection effectiveness.

Method: JUICE jointly learns unsupervised multi-view feature and instance co-selection with cross-view imputation. It reconstructs incomplete multi-view data using available observations in a unified framework, leverages cross-view neighborhood information to learn inter-sample relationships, and refines missing value imputation during reconstruction.

Result: Extensive experiments demonstrate that JUICE outperforms state-of-the-art methods for feature and instance co-selection on incomplete multi-view data.

Conclusion: The proposed JUICE framework successfully integrates missing data recovery with feature and instance co-selection, enabling better selection of representative features and instances by leveraging cross-view information and joint learning.

Abstract: Feature and instance co-selection, which aims to reduce both feature dimensionality and sample size by identifying the most informative features and instances, has attracted considerable attention in recent years. However, when dealing with unlabeled incomplete multi-view data, where some samples are missing in certain views, existing methods typically first impute the missing data and then concatenate all views into a single dataset for subsequent co-selection. Such a strategy treats co-selection and missing data imputation as two independent processes, overlooking potential interactions between them. The inter-sample relationships gleaned from co-selection can aid imputation, which in turn enhances co-selection performance. Additionally, simply merging multi-view data fails to capture the complementary information among views, ultimately limiting co-selection effectiveness. To address these issues, we propose a novel co-selection method, termed Joint learning of Unsupervised multI-view feature and instance Co-selection with cross-viEw imputation (JUICE). JUICE first reconstructs incomplete multi-view data using available observations, bringing missing data recovery and feature and instance co-selection together in a unified framework. Then, JUICE leverages cross-view neighborhood information to learn inter-sample relationships and further refine the imputation of missing values during reconstruction. This enables the selection of more representative features and instances. Extensive experiments demonstrate that JUICE outperforms state-of-the-art methods.

[325] How Smoothing is N-simplicial Attention?

Alexandre Dussolle, Pietro Liò

Main category: cs.LG

TL;DR: The paper introduces N-simplicial attention, which extends attention mechanisms from pairwise token interactions to higher-order interactions, adapts it for Rotary Position Embeddings, and proposes cost-effective simplex selection to manage computational complexity.

DetailsMotivation: To advance beyond current graph message-passing mechanisms (like GATs and Transformers) that rely on pairwise token interactions, by enabling higher-order interactions while managing the computational trade-off.

Method: Introduces N-simplicial attention that goes from pairwise to higher-order token interactions, adapts it for Rotary Position Embeddings, and proposes cost-effective simplex selection to focus computation on task-sensitive interactions.
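
To fix ideas, a 2-simplicial attention layer can be written with a trilinear score over (query, key, key) triples; the pairwise value interaction below is one possible choice, and the RoPE adaptation and the paper's simplex selection are omitted. The O(n^3) score tensor makes plain why cost-effective simplex selection matters.

```python
import torch

def two_simplicial_attention(q, k1, k2, v):
    """Order-2 attention: each query attends to pairs of positions via a
    trilinear score, generalizing pairwise dot-product attention.
    q, k1, k2, v: (batch, seq, dim). Note the O(seq^3) score tensor --
    the cost that simplex selection is meant to tame."""
    b, n, d = q.shape
    scores = torch.einsum("bid,bjd,bld->bijl", q, k1, k2) * d ** -0.5
    attn = scores.reshape(b, n, n * n).softmax(dim=-1).reshape(b, n, n, n)
    v_pair = torch.einsum("bjd,bld->bjld", v, v)  # one choice of pair values
    return torch.einsum("bijl,bjld->bid", attn, v_pair)

out = two_simplicial_attention(*(torch.randn(2, 8, 16) for _ in range(4)))
```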

Result: The paper derives a Lipschitz upper-bound for N-simplicial attention and demonstrates that despite enabling higher-order interactions, it still suffers from over-smoothing issues.

Conclusion: N-simplicial attention represents a step forward in attention mechanisms by enabling higher-order interactions, but careful management of computational complexity and over-smoothing remains important for practical applications.

Abstract: Going from pure Multilayer Perceptron (MLP) to a learnable graph message-passing mechanism at each layer has been foundational to state-of-the-art results, despite the computational trade-off (e.g. GATs or Transformers). To go a step further, in this work, we introduce N-simplicial attention, going from pairwise token similarity to higher-order interactions, and adapt it for Rotary Position Embeddings (RoPE). To help manage the increased complexity, we propose a cost-effective simplex selection enabling the model to focus its computation load onto the more task-sensitive interactions. Beyond these core mechanisms, we study how smoothing N-simplicial attention is by deriving a Lipschitz upper-bound and by demonstrating that by itself it also suffers from over-smoothing, despite opening the attention message-passing to higher-order interactions.

[326] DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration

Tianteng Gu, Bei Liu, Bo Xiao, Ke Zeng, Jiacheng Liu, Yanmin Qian

Main category: cs.LG

TL;DR: DenoiseRotator improves LLM pruning by redistributing parameter importance via orthogonal transformations before pruning, reducing performance degradation.

DetailsMotivation: Existing pruning methods focus on weight importance estimation, which limits their ability to preserve model capabilities, especially under semi-structured sparsity constraints where significant performance degradation occurs.

Method: Proposes redistributing parameter importance by minimizing information entropy of normalized importance scores, concentrating importance onto fewer weights. Implements DenoiseRotator using learnable orthogonal transformations on weight matrices, compatible with existing pruning techniques like Magnitude, SparseGPT, and Wanda.
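
The objective being minimized is simple to state in code: normalize per-weight importance scores into a distribution and take its Shannon entropy, which the learnable orthogonal transforms are trained to drive down. Magnitude importance is assumed below for illustration; the paper pairs the idea with scores from methods such as SparseGPT and Wanda.

```python
import torch

def importance_entropy(weight, eps=1e-12):
    """Shannon entropy of normalized importance scores; lower entropy
    means importance is concentrated on fewer weights, which makes the
    layer more robust to pruning. Magnitude importance is used here
    purely for illustration."""
    p = weight.abs().flatten()
    p = p / (p.sum() + eps)
    return -(p * (p + eps).log()).sum()

w = torch.randn(256, 256)
print(importance_entropy(w))  # close to the maximum log(256 * 256) for i.i.d. weights
```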

Result: Consistently improves perplexity and zero-shot accuracy on LLaMA3, Qwen2.5, and Mistral models under 50% unstructured and 2:4 semi-structured sparsity. For LLaMA3-70B with SparseGPT at 2:4 sparsity, reduces perplexity gap by 58% (from 8.1 to 3.4 points degradation).

Conclusion: Redistributing parameter importance before pruning enhances pruning robustness and reduces performance degradation, with DenoiseRotator providing a practical implementation that integrates seamlessly with existing pruning methods.

Abstract: Pruning is a widely used technique to compress large language models (LLMs) by removing unimportant weights, but it often suffers from significant performance degradation - especially under semi-structured sparsity constraints. Existing pruning methods primarily focus on estimating the importance of individual weights, which limits their ability to preserve critical capabilities of the model. In this work, we propose a new perspective: rather than merely selecting which weights to prune, we first redistribute parameter importance to make the model inherently more amenable to pruning. By minimizing the information entropy of normalized importance scores, our approach concentrates importance onto a smaller subset of weights, thereby enhancing pruning robustness. We instantiate this idea through DenoiseRotator, which applies learnable orthogonal transformations to the model’s weight matrices. Our method can be seamlessly integrated with existing pruning techniques such as Magnitude, SparseGPT, and Wanda. Evaluated on LLaMA3, Qwen2.5, and Mistral models under 50% unstructured and 2:4 semi-structured sparsity, DenoiseRotator consistently improves perplexity and zero-shot accuracy. For instance, on LLaMA3-70B pruned with SparseGPT at 2:4 semi-structured sparsity, DenoiseRotator reduces the perplexity gap to the dense model by 58%, narrowing the degradation from 8.1 to 3.4 points. Codes are available at https://github.com/Axel-gu/DenoiseRotator.

[327] Corrective Diffusion Language Models

Shuibai Zhang, Fred Zhangzhi Peng, Yiheng Zhang, Jin Pan, Grigorios G. Chrysos

Main category: cs.LG

TL;DR: The paper proposes a correction-oriented training approach for diffusion language models to enable effective error detection and iterative refinement, addressing limitations of standard masked diffusion training.

DetailsMotivation: Standard masked diffusion language models fail to reliably identify and correct errors in complete sequences, limiting their effectiveness for iterative refinement tasks despite their structural suitability for such operations.

Method: Proposes a correction-oriented post-training principle that explicitly supervises visible incorrect tokens to enable error-aware confidence estimation and targeted refinement, along with the Code Revision Benchmark for evaluation.

Result: Models trained with the proposed approach substantially outperform standard masked diffusion language models in correction scenarios while also improving pure completion performance.

Conclusion: The correction-oriented training enables diffusion language models to effectively identify errors and perform targeted refinement, making them more suitable for iterative correction tasks.

Abstract: Diffusion language models are structurally well-suited for iterative error correction, as their non-causal denoising dynamics allow arbitrary positions in a sequence to be revised. However, standard masked diffusion language model (MDLM) training fails to reliably induce this behavior, as models often cannot identify unreliable tokens in a complete input, rendering confidence-guided refinement ineffective. We study corrective behavior in diffusion language models, defined as the ability to assign lower confidence to incorrect tokens and iteratively refine them while preserving correct content. We show that this capability is not induced by conventional masked diffusion objectives and propose a correction-oriented post-training principle that explicitly supervises visible incorrect tokens, enabling error-aware confidence and targeted refinement. To evaluate corrective behavior, we introduce the Code Revision Benchmark (CRB), a controllable and executable benchmark for assessing error localization and in-place correction. Experiments on code revision tasks and controlled settings demonstrate that models trained with our approach substantially outperform standard MDLMs in correction scenarios, while also improving pure completion performance. Our code is publicly available at https://github.com/zhangshuibai/CDLM.
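
The post-training principle lends itself to a compact loss. The sketch below supervises not only masked positions (as in standard MDLM training) but also visible tokens that disagree with the gold sequence, which is what teaches the model to assign low confidence to errors it can see; the interface and mask-id handling are assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def corrective_loss(logits, gold, input_ids, mask_id):
    """Cross-entropy on masked positions plus visible-but-incorrect tokens.
    logits: (B, T, V); gold, input_ids: (B, T)."""
    masked = input_ids.eq(mask_id)                   # standard denoising targets
    visible_wrong = (~masked) & input_ids.ne(gold)   # corrupted visible tokens
    supervise = masked | visible_wrong
    return F.cross_entropy(logits[supervise], gold[supervise])

B, T, V, MASK = 2, 6, 100, 0
logits = torch.randn(B, T, V)
gold = torch.randint(1, V - 1, (B, T))
noisy = gold.clone()
noisy[0, :2] = MASK                                  # masked positions
noisy[1, 3] = gold[1, 3] + 1                         # a visible but wrong token
print(corrective_loss(logits, gold, noisy, MASK))
```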

[328] Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction

Mathieu Blondel, Michael E. Sander, Germain Vivier-Ardisson, Tianlin Liu, Vincent Roulet

Main category: cs.LG

TL;DR: The paper establishes a mathematical equivalence between autoregressive models (ARMs) and energy-based models (EBMs) for language models, showing they’re connected through probability theory and reinforcement learning principles.

DetailsMotivation: ARMs dominate LLM development while EBMs naturally characterize optimal alignment policies. The paper aims to unify these two model classes to provide theoretical insights into their relationships and capabilities.

Method: Starting from the chain rule of probability, the authors establish an explicit bijection between ARMs and EBMs in function space, showing this corresponds to a special case of the soft Bellman equation in maximum entropy reinforcement learning.

Result: The paper derives equivalence between supervised learning of ARMs and EBMs, provides theoretical error bounds for distilling EBMs into ARMs, and offers insights into ARMs’ planning capabilities despite their next-token prediction paradigm.

Conclusion: The unified view connects two important model classes, providing theoretical foundations for understanding how ARMs can exhibit planning behavior and offering practical implications for model distillation and alignment.

Abstract: Autoregressive models (ARMs) currently constitute the dominant paradigm for large language models (LLMs). Energy-based models (EBMs) represent another class of models, which have historically been less prevalent in LLM development, yet naturally characterize the optimal policy in post-training alignment. In this paper, we provide a unified view of these two model classes. Taking the chain rule of probability as a starting point, we establish an explicit bijection between ARMs and EBMs in function space, which we show to correspond to a special case of the soft Bellman equation in maximum entropy reinforcement learning. Building upon this bijection, we derive the equivalence between supervised learning of ARMs and EBMs. Furthermore, we analyze the distillation of EBMs into ARMs by providing theoretical error bounds. Our results provide insights into the ability of ARMs to plan ahead, despite being based on the next-token prediction paradigm.
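
Under assumed notation, the bijection can be spelled out as follows: an EBM over complete sequences induces autoregressive conditionals through a soft value function that satisfies a soft Bellman recursion, and the product of those conditionals telescopes back to the EBM distribution.

```latex
\begin{align*}
  p_{\mathrm{EBM}}(y) &= \frac{\exp(-E(y))}{Z}, \qquad Z = \sum_{y'} \exp(-E(y')),\\
  V(y_{\le T}) &:= -E(y), \qquad
  V(y_{<t}) := \log \sum_{y_t} \exp V(y_{\le t}) \quad \text{(soft Bellman)},\\
  \pi(y_t \mid y_{<t}) &:= \exp\bigl(V(y_{\le t}) - V(y_{<t})\bigr)
  \;\Longrightarrow\;
  \prod_{t=1}^{T} \pi(y_t \mid y_{<t})
  = \exp\bigl(V(y_{\le T}) - V(\varnothing)\bigr) = p_{\mathrm{EBM}}(y),
\end{align*}
```

since $V(\varnothing) = \log Z$ by unrolling the recursion; reading the equivalence right-to-left, any ARM likewise defines an EBM with $E(y) = -\log \prod_t \pi(y_t \mid y_{<t})$.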

[329] Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning

Zhenwen Liang, Sidi Lu, Wenhao Yu, Kishan Panaganti, Yujun Zhou, Haitao Mi, Dong Yu

Main category: cs.LG

TL;DR: G2RL is a gradient-guided RL framework that uses the model’s own update geometry to guide exploration, replacing external heuristics with first-order gradient information to sample trajectories that provide novel optimization directions.

DetailsMotivation: Current RL exploration methods (entropy bonuses, external semantic comparators) are misaligned with how LLMs actually learn - they encourage surface-level variation but don't guarantee that sampled trajectories differ in the update directions that shape optimization.

Method: G2RL constructs sequence-level features from the model’s final-layer sensitivity (obtainable at negligible cost from a standard forward pass), compares these features within sampled groups, and gives bounded multiplicative reward scalers to trajectories that introduce novel gradient directions while deemphasizing redundant updates.

Result: Across math and general reasoning benchmarks (MATH500, AMC, AIME24, AIME25, GPQA, MMLUpro) on Qwen3 1.7B and 4B models, G2RL consistently improves pass@1, maj@16, and pass@k over entropy-based GRPO and external embedding methods.

Conclusion: A policy’s own update space provides a more faithful and effective basis for guiding exploration in LLM reinforcement learning, as G2RL expands exploration into more orthogonal and often opposing gradient directions while maintaining semantic coherence.

Abstract: Reinforcement learning has become essential for strengthening the reasoning abilities of large language models, yet current exploration mechanisms remain fundamentally misaligned with how these models actually learn. Entropy bonuses and external semantic comparators encourage surface-level variation but offer no guarantee that sampled trajectories differ in the update directions that shape optimization. We propose G2RL, a gradient-guided reinforcement learning framework in which exploration is driven not by external heuristics but by the model’s own first-order update geometry. For each response, G2RL constructs a sequence-level feature from the model’s final-layer sensitivity, obtainable at negligible cost from a standard forward pass, and measures how each trajectory would reshape the policy by comparing these features within a sampled group. Trajectories that introduce novel gradient directions receive a bounded multiplicative reward scaler, while redundant or off-manifold updates are deemphasized, yielding a self-referential exploration signal that is naturally aligned with PPO-style stability and KL control. Across math and general reasoning benchmarks (MATH500, AMC, AIME24, AIME25, GPQA, MMLUpro) on Qwen3 base 1.7B and 4B models, G2RL consistently improves pass@1, maj@16, and pass@k over entropy-based GRPO and external embedding methods. Analyzing the induced geometry, we find that G2RL expands exploration into substantially more orthogonal and often opposing gradient directions while maintaining semantic coherence, revealing that a policy’s own update space provides a far more faithful and effective basis for guiding exploration in large language model reinforcement learning.
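
A minimal sketch of how such a bounded multiplicative scaler could be computed from sequence-level features, assuming cosine similarity within the sampled group; the bounds, scaling constant, and feature construction here are hypothetical, and the actual shaping rule in G2RL may differ.

```python
import torch

def gradient_novelty_scalers(feats, alpha=0.5, lo=0.8, hi=1.2):
    """feats: (G, d) per-trajectory features built from final-layer
    sensitivities. Up-weights trajectories far from their nearest
    neighbor in the group; alpha/lo/hi are hypothetical."""
    f = torch.nn.functional.normalize(feats, dim=-1)
    sim = f @ f.T                              # (G, G) cosine similarities
    sim.fill_diagonal_(-1.0)                   # ignore self-similarity
    novelty = 1.0 - sim.max(dim=-1).values     # distance to nearest neighbor
    scalers = 1.0 + alpha * (novelty - novelty.mean())
    return scalers.clamp(lo, hi)               # bounded multiplicative scaler

group_feats = torch.randn(8, 4096)             # 8 sampled responses
rewards = torch.ones(8)
shaped_rewards = rewards * gradient_novelty_scalers(group_feats)
```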

[330] Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

Mingqi Wu, Zhihao Zhang, Qiaole Dong, Zhiheng Xi, Jun Zhao, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Huijie Lv, Ming Zhang, Yanwei Fu, Qin Liu, Songyang Zhang, Qi Zhang

Main category: cs.LG

TL;DR: The paper reveals that Qwen2.5’s strong RL performance on math benchmarks is due to data contamination, not actual reasoning improvements. Using a clean dataset (RandomCalculation), they show only accurate rewards help, and recommend using uncontaminated benchmarks across model series.

DetailsMotivation: Recent RL studies show surprising results where even random/incorrect rewards improve reasoning in Qwen2.5, but these findings don't transfer to models like Llama. The authors suspect data contamination in benchmarks is causing unreliable conclusions about RL effectiveness.

Method: 1) Empirical analysis revealing data contamination in Qwen2.5 from web-scale pre-training; 2) Creation of RandomCalculation - a generator producing fully clean arithmetic problems of arbitrary length/difficulty; 3) Testing RL methods with accurate vs random/incorrect rewards on this leakage-free dataset; 4) Fine-grained analysis comparing performance on MATH-500 vs RandomCalculation.

Result: On the clean RandomCalculation dataset, only accurate reward signals yield steady improvements that surpass the base model’s performance boundary. Random or incorrect rewards do not improve performance. The paper explains why different performance is observed on contaminated vs clean benchmarks.

Conclusion: Conclusions from contaminated benchmarks on Qwen2.5 are unreliable. Future studies should evaluate models on uncontaminated benchmarks and test various model series to ensure trustworthy conclusions about RL and related methods.

Abstract: Reasoning in large language models has long been a central research focus, and recent studies employing reinforcement learning (RL) have introduced diverse methods that yield substantial performance gains with minimal or even no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance performance. However, these breakthroughs are predominantly observed for the mathematically strong Qwen2.5 series on benchmarks such as MATH-500, AMC, and AIME, and seldom transfer to models like Llama, which warrants a more in-depth investigation. In this work, our empirical analysis reveals that pre-training on massive web-scale corpora leaves Qwen2.5 susceptible to data contamination in widely used benchmarks. Consequently, conclusions derived from contaminated benchmarks on Qwen2.5 series may be unreliable. To obtain trustworthy evaluation results, we introduce a generator that creates fully clean arithmetic problems of arbitrary length and difficulty, dubbed RandomCalculation. Using this leakage-free dataset, we show that only accurate reward signals yield steady improvements that surpass the base model’s performance boundary in mathematical reasoning, whereas random or incorrect rewards do not. Moreover, we conduct more fine-grained analyses to elucidate the factors underlying the different performance observed on the MATH-500 and RandomCalculation benchmarks. Consequently, we recommend that future studies evaluate models on uncontaminated benchmarks and, when feasible, test various model series to ensure trustworthy conclusions about RL and related methods.
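
The core of such a leakage-free benchmark is easy to illustrate: freshly sampled arithmetic problems cannot have appeared verbatim in any pre-training corpus, and the ground-truth answer is computed, not curated. The sketch below is a minimal stand-in for the idea; the exact expression format and difficulty controls of RandomCalculation are assumptions.

```python
import random

def random_calculation(n_ops=4, lo=1, hi=99, seed=None):
    """Generate a random arithmetic expression and its exact answer."""
    rng = random.Random(seed)
    expr = str(rng.randint(lo, hi))
    for _ in range(n_ops):
        op = rng.choice(['+', '-', '*'])
        if rng.random() < 0.3:                    # occasionally parenthesize
            expr = f'({expr}) {op} {rng.randint(lo, hi)}'
        else:
            expr = f'{expr} {op} {rng.randint(lo, hi)}'
    return expr, eval(expr)                       # safe: digits and + - * only

question, answer = random_calculation(seed=0)
print(f'{question} = {answer}')                   # exact reward signal for RL
```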

[331] Behavior Tokens Speak Louder: Disentangled Explainable Recommendation with Behavior Vocabulary

Xinshun Feng, Mingzhe Liu, Yi Qiao, Tongyu Zhu, Leilei Sun, Shuai Wang

Main category: cs.LG

TL;DR: BEAT is a framework that tokenizes user-item behaviors into interpretable sequences using vector-quantized autoencoding, enabling better zero-shot recommendations and coherent explanations by aligning behavioral semantics with language models.

DetailsMotivation: Existing explainable recommendation methods rely on ID-based representations that obscure semantic meaning and impose structural constraints on language models, limiting applicability in open-ended scenarios. Real-world interactions involve entangled user intents and misaligned collaborative signals with linguistic semantics.

Method: BEAT tokenizes user and item behaviors into discrete sequences via vector-quantized autoencoding to disentangle macro-level interests and micro-level intentions from graph-based representations. It uses multi-level semantic supervision and semantic alignment regularization to embed behavior tokens directly into frozen language models’ input space.

Result: Experiments on three public datasets show BEAT improves zero-shot recommendation performance while generating coherent and informative explanations. Behavior tokens capture fine-grained semantics and offer a plug-and-play interface for integrating complex behavior patterns into large language models.

Conclusion: BEAT provides a unified and transferable framework that bridges behavioral signals with language space, enabling better explainable recommendations through interpretable behavior tokens that can be seamlessly integrated with frozen language models.

Abstract: Recent advances in explainable recommendations have explored the integration of language models to analyze natural language rationales for user-item interactions. Despite their potential, existing methods often rely on ID-based representations that obscure semantic meaning and impose structural constraints on language models, thereby limiting their applicability in open-ended scenarios. These challenges are intensified by the complex nature of real-world interactions, where diverse user intents are entangled and collaborative signals rarely align with linguistic semantics. To overcome these limitations, we propose BEAT, a unified and transferable framework that tokenizes user and item behaviors into discrete, interpretable sequences. We construct a behavior vocabulary via a vector-quantized autoencoding process that disentangles macro-level interests and micro-level intentions from graph-based representations. We then introduce multi-level semantic supervision to bridge the gap between behavioral signals and language space. A semantic alignment regularization mechanism is designed to embed behavior tokens directly into the input space of frozen language models. Experiments on three public datasets show that BEAT improves zero-shot recommendation performance while generating coherent and informative explanations. Further analysis demonstrates that our behavior tokens capture fine-grained semantics and offer a plug-and-play interface for integrating complex behavior patterns into large language models.
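
At the heart of the behavior vocabulary is vector quantization. The sketch below shows the generic VQ step such a vocabulary could be built on, mapping continuous behavior embeddings to nearest-codebook "behavior tokens" with a straight-through gradient; sizes are illustrative, and BEAT's disentangled macro/micro codebooks and graph encoder are not reproduced here.

```python
import torch

class VectorQuantizer(torch.nn.Module):
    """Map continuous embeddings to discrete codebook tokens."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = torch.nn.Embedding(num_codes, dim)

    def forward(self, z):                          # z: (B, dim)
        dists = torch.cdist(z, self.codebook.weight)
        tokens = dists.argmin(dim=-1)              # discrete behavior tokens
        q = self.codebook(tokens)
        q = z + (q - z).detach()                   # straight-through estimator
        return tokens, q

vq = VectorQuantizer()
tokens, quantized = vq(torch.randn(16, 64))
print(tokens.shape)                                # torch.Size([16])
```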

[332] SoFlow: Solution Flow Models for One-Step Generative Modeling

Tianze Luo, Haotian Yuan, Zhuang Liu

Main category: cs.LG

TL;DR: SoFlow is a framework for one-step image generation that improves efficiency over multi-step diffusion models by using Flow Matching loss and solution consistency loss without requiring Jacobian-vector products.

DetailsMotivation: Multi-step denoising in diffusion and Flow Matching models causes major efficiency issues, motivating research on few-step generation. The authors aim to create a framework for one-step generation from scratch that avoids computational bottlenecks like Jacobian-vector products.

Method: Analyzes relationship between velocity function and solution function of velocity ODE. Proposes Flow Matching loss and solution consistency loss to train models. Flow Matching loss enables Classifier-Free Guidance during training. Consistency loss avoids Jacobian-vector product calculation, which is not well-optimized in frameworks like PyTorch.

Result: When trained from scratch using same DiT architecture and equal training epochs, SoFlow achieves better FID-50K scores than MeanFlow models on ImageNet 256x256 dataset.

Conclusion: SoFlow provides an efficient framework for one-step generation that outperforms existing methods while avoiding computational bottlenecks like Jacobian-vector products, making it more practical for real-world deployment.

Abstract: The multi-step denoising process in diffusion and Flow Matching models causes major efficiency issues, which motivates research on few-step generation. We present Solution Flow Models (SoFlow), a framework for one-step generation from scratch. By analyzing the relationship between the velocity function and the solution function of the velocity ordinary differential equation (ODE), we propose a Flow Matching loss and a solution consistency loss to train our models. The Flow Matching loss allows our models to provide estimated velocity fields for Classifier-Free Guidance (CFG) during training, which improves generation performance. Notably, our consistency loss does not require the calculation of the Jacobian-vector product (JVP), a common requirement in recent works that is not well-optimized in deep learning frameworks like PyTorch. Experimental results indicate that, when trained from scratch using the same Diffusion Transformer (DiT) architecture and an equal number of training epochs, our models achieve better FID-50K scores than MeanFlow models on the ImageNet 256x256 dataset.
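
For reference, here is the standard conditional Flow Matching term along the straight noise-to-data path, which is one of SoFlow's two losses; the solution-consistency term and the CFG machinery are omitted, and the model interface is an assumption.

```python
import torch

def flow_matching_loss(v_model, x1):
    """Along x_t = (1 - t) * x0 + t * x1, the target velocity is x1 - x0."""
    x0 = torch.randn_like(x1)                 # noise endpoint
    t = torch.rand(x1.shape[0], 1)            # per-sample time in [0, 1)
    xt = (1 - t) * x0 + t * x1
    return ((v_model(xt, t) - (x1 - x0)) ** 2).mean()

v = lambda x, t: torch.zeros_like(x)          # stand-in velocity network
print(flow_matching_loss(v, torch.randn(32, 8)))
```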

[333] A Multivariate Statistical Framework for Detection, Classification and Pre-localization of Anomalies in Water Distribution Networks

Oleg Melnikov, Yurii Dorofieiev, Yurii Shakhnovskiy, Huy Truong, Victoria Degeler

Main category: cs.LG

TL;DR: SICAMS framework uses multivariate statistical analysis to detect, classify, and coarsely localize anomalies in water distribution networks using pressure/flow data, achieving robust leak detection without requiring hydraulic models.

DetailsMotivation: Water distribution networks need efficient anomaly detection methods to identify leaks and sensor malfunctions. Current approaches often require calibrated hydraulic models, which are complex and costly to maintain. There's a need for statistical methods that can work with heterogeneous sensor data without model calibration.

Method: SICAMS framework processes pressure and flow sensor data through whitening transformation to eliminate spatial correlations. Uses Hotelling’s T² statistic as an integral health indicator for anomaly detection via statistical hypothesis testing. Includes heuristic algorithm to classify anomalies (abrupt leaks, incipient leaks, sensor malfunctions) and coarse localization using sensor contribution ranking with Laplacian interpolation.

Result: Applied to BattLeDIM L-Town benchmark dataset, the method demonstrated high sensitivity and reliability in leak detection. Maintained robust performance even under multiple leaks. Hotelling’s T² statistic correlated with total leakage volume, enabling approximate water loss estimation via regression.

Conclusion: SICAMS provides a practical framework for water network anomaly management that works without calibrated hydraulic models. The statistical approach offers reliable detection, classification, and coarse localization capabilities suitable for real-world operational environments.

Abstract: This paper presents a unified framework for the detection, classification, and preliminary localization of anomalies in water distribution networks using multivariate statistical analysis. The approach, termed SICAMS (Statistical Identification and Classification of Anomalies in Mahalanobis Space), processes heterogeneous pressure and flow sensor data through a whitening transformation to eliminate spatial correlations among measurements. Based on the transformed data, Hotelling’s $T^2$ statistic is constructed, enabling the formulation of anomaly detection as a statistical hypothesis test of network conformity to normal operating conditions. It is shown that Hotelling’s $T^2$ statistic can serve as an integral indicator of the overall “health” of the system, exhibiting correlation with total leakage volume, and thereby enabling approximate estimation of water losses via a regression model. A heuristic algorithm is developed to analyze the $T^2$ time series and classify detected anomalies into abrupt leaks, incipient leaks, and sensor malfunctions. Furthermore, a coarse leak localization method is proposed, which ranks sensors according to their statistical contribution and employs Laplacian interpolation to approximate the affected region within the network. Application of the proposed framework to the BattLeDIM L-Town benchmark dataset demonstrates high sensitivity and reliability in leak detection, maintaining robust performance even under multiple leaks. These capabilities make the method applicable to real-world operational environments without the need for a calibrated hydraulic model.
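
The detection step reduces to a few lines of linear algebra. Below is a minimal sketch on synthetic data: whiten readings with statistics from a fault-free window, compute Hotelling's $T^2$, and flag exceedances of a chi-square threshold (an approximation valid for large calibration windows). The classification and localization stages are not shown.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
cov_true = np.array([[1.0, 0.6, 0.2], [0.6, 1.0, 0.3], [0.2, 0.3, 1.0]])
healthy = rng.multivariate_normal(np.zeros(3), cov_true, size=500)

mu = healthy.mean(axis=0)
cov = np.cov(healthy, rowvar=False)
L = np.linalg.cholesky(np.linalg.inv(cov))       # whitening transform

def hotelling_t2(x):
    z = (x - mu) @ L                             # ~N(0, I) under normal operation
    return (z ** 2).sum(axis=-1)                 # T^2 = squared Mahalanobis distance

threshold = stats.chi2.ppf(0.999, df=3)          # 3 sensors
leak = healthy[:5] + np.array([0.0, 5.0, 0.0])   # inject a shift on sensor 2
print(hotelling_t2(healthy[:5]) > threshold)     # expected: all False
print(hotelling_t2(leak) > threshold)            # expected: all True (flagged)
```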

[334] Multi-Modal Semantic Communication

Matin Mortaheb, Erciyes Karakaya, Sennur Ulukus

Main category: cs.LG

TL;DR: A multi-modal semantic communication framework that uses text queries to guide adaptive image transmission, achieving efficient communication in bandwidth-constrained environments.

DetailsMotivation: Existing transformer-based semantic communication approaches struggle with complex scenes containing multiple objects because self-attention lacks explicit task guidance, limiting their effectiveness in applications like telepresence, AR, and remote sensing.

Method: Proposes a multi-modal framework integrating text queries with visual data using cross-modal attention to compute relevance scores, then transmits image patches at adaptive resolutions via independently trained encoder-decoder pairs based on bandwidth constraints.

Result: The system enables flexible, goal-driven semantic communication that preserves task-critical information while matching total bitrate to channel capacity in complex, bandwidth-constrained environments.

Conclusion: Text-guided multi-modal semantic communication with adaptive resolution transmission offers significant efficiency gains for complex scenes by providing explicit task guidance that traditional self-attention approaches lack.

Abstract: Semantic communication aims to transmit information most relevant to a task rather than raw data, offering significant gains in communication efficiency for applications such as telepresence, augmented reality, and remote sensing. Recent transformer-based approaches have used self-attention maps to identify informative regions within images, but they often struggle in complex scenes with multiple objects, where self-attention lacks explicit task guidance. To address this, we propose a novel Multi-Modal Semantic Communication framework that integrates text-based user queries to guide the information extraction process. Our proposed system employs a cross-modal attention mechanism that fuses visual features with language embeddings to produce soft relevance scores over the visual data. Based on these scores and the instantaneous channel bandwidth, we use an algorithm to transmit image patches at adaptive resolutions using independently trained encoder-decoder pairs, with total bitrate matching the channel capacity. At the receiver, the patches are reconstructed and combined to preserve task-critical information. This flexible and goal-driven design enables efficient semantic communication in complex and bandwidth-constrained environments.

[335] FrontierCS: Evolving Challenges for Evolving Intelligence

Qiuyang Mang, Wenhao Chai, Zhifei Li, Huanzhi Mao, Shang Zhou, Alexander Du, Hanchen Li, Shu Liu, Edwin Chen, Yichuan Wang, Xieting Chu, Zerui Cheng, Yuan Xu, Tian Xia, Zirui Wang, Tianneng Shi, Jianzhu Yao, Yilong Zhao, Qizheng Zhang, Charlie Ruan, Zeyu Shen, Kaiyuan Liu, Runyuan He, Dong Xing, Zerui Li, Zirong Zeng, Yige Jiang, Lufeng Cheng, Ziyi Zhao, Youran Sun, Wesley Zheng, Meiyuwang Zhang, Ruyi Ji, Xuechang Tu, Zihan Zheng, Zexing Chen, Kangyang Zhou, Zhaozi Wang, Jingbang Chen, Aleksandra Korolova, Peter Henderson, Pramod Viswanath, Vijay Ganesh, Saining Xie, Zhuang Liu, Dawn Song, Sewon Min, Ion Stoica, Joseph E. Gonzalez, Jingbo Shang, Alvin Cheung

Main category: cs.LG

TL;DR: FrontierCS is a benchmark of 156 open-ended CS problems where optimal solutions are unknown but quality can be objectively evaluated, featuring algorithmic and research problems with expert reference solutions and automatic evaluators.

DetailsMotivation: Existing benchmarks focus on tasks with known optimal solutions, but there's a need for benchmarks targeting problems at the frontier of computer science where optimal solutions are unknown but solution quality can be objectively measured.

Method: Created a benchmark of 156 open-ended problems across diverse CS areas, designed by experts (CS PhDs, competitive programming participants). Problems include algorithmic problems (often NP-hard variants) and research problems with objective partial scoring. Each problem has expert reference solution and automatic evaluator.

Result: Models still lag far behind human experts on both algorithmic and research tracks. Increasing reasoning budgets alone doesn’t close the gap. Models often over-optimize for generating merely workable code instead of discovering high-quality algorithms and system designs.

Conclusion: FrontierCS provides a benchmark at the frontier of computer-science difficulty, combining open-ended design, measurable progress, and expert curation to evaluate AI models on challenging CS problems where optimal solutions are unknown.

Abstract: We introduce FrontierCS, a benchmark of 156 open-ended problems across diverse areas of computer science, designed and reviewed by experts, including CS PhDs and top-tier competitive programming participants and problem setters. Unlike existing benchmarks that focus on tasks with known optimal solutions, FrontierCS targets problems where the optimal solution is unknown, but the quality of a solution can be objectively evaluated. Models solve these tasks by implementing executable programs rather than outputting a direct answer. FrontierCS includes algorithmic problems, which are often NP-hard variants of competitive programming problems with objective partial scoring, and research problems with the same property. For each problem we provide an expert reference solution and an automatic evaluator. Combining open-ended design, measurable progress, and expert curation, FrontierCS provides a benchmark at the frontier of computer-science difficulty. Empirically, we find that frontier reasoning models still lag far behind human experts on both the algorithmic and research tracks, that increasing reasoning budgets alone does not close this gap, and that models often over-optimize for generating merely workable code instead of discovering high-quality algorithms and system designs.

[336] Learning Model Parameter Dynamics in a Combination Therapy for Bladder Cancer from Sparse Biological Data

Kayode Olumoyin, Lamees El Naqa, Katarzyna Rejniak

Main category: cs.LG

TL;DR: PINN-based approach to learn time-varying cell interactions in bladder cancer under combination therapy with sparse data

DetailsMotivation: Traditional fixed-parameter models fail to capture evolving dynamics in biological systems, especially in oncology where experimental data is sparse (few tumor volume time points). Need to model time-varying interactions between cancer and immune cells under combination treatments.

Method: Physics-informed neural network (PINN) approach to predict subpopulation trajectories at unobserved time points, learning time-varying interactions between bladder cancer tumors and immune cells in limited data scenarios.

Result: The approach demonstrates consistency with biological explanations of subpopulation trajectories and provides a framework for learning evolving interactions under external interventions.

Conclusion: PINN enables learning of time-varying cell interactions in sparse data oncology scenarios, offering a framework for modeling evolving biological dynamics under therapeutic interventions.

Abstract: In a mathematical model of interacting biological organisms, where external interventions may alter behavior over time, traditional models that assume fixed parameters usually do not capture the evolving dynamics. In oncology, this is further exacerbated by the fact that experimental data are often sparse and sometimes are composed of a few time points of tumor volume. In this paper, we propose to learn time-varying interactions between cells, such as those of bladder cancer tumors and immune cells, and their response to a combination of anticancer treatments in a limited data scenario. We employ the physics-informed neural network (PINN) approach to predict possible subpopulation trajectories at time points where no observed data are available. We demonstrate that our approach is consistent with the biological explanation of subpopulation trajectories. Our method provides a framework for learning evolving interactions among biological organisms when external interventions are applied to their environment.
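
A minimal sketch of the ingredients in PyTorch: one network predicts the (sub)population trajectory, another outputs a time-varying rate, and both are trained against sparse observations plus an ODE residual at collocation points. The logistic-growth-with-kill ODE here is purely illustrative and not the paper's tumor-immune model.

```python
import torch

u_net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                            torch.nn.Linear(32, 1))   # population u(t)
k_net = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.Tanh(),
                            torch.nn.Linear(16, 1))   # time-varying rate k(t)

def pinn_loss(t_obs, u_obs, t_coll):
    data = ((u_net(t_obs) - u_obs) ** 2).mean()       # fit sparse measurements
    t = t_coll.requires_grad_(True)
    u = u_net(t)
    dudt = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    residual = dudt - (u * (1 - u) - k_net(t) * u)    # illustrative ODE
    return data + (residual ** 2).mean()

t_obs = torch.tensor([[0.0], [0.5], [1.0]])           # few observed time points
u_obs = torch.tensor([[0.2], [0.5], [0.4]])
print(pinn_loss(t_obs, u_obs, torch.linspace(0, 1, 64).unsqueeze(1)))
```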

[337] SketchOGD: Memory-Efficient Continual Learning

Youngjae Min, Benjamin Wright, Jeremy Bernstein, Navid Azizan

Main category: cs.LG

TL;DR: SketchOGD: A memory-efficient continual learning method that uses matrix sketching to compress gradients for orthogonal gradient descent, enabling long-term learning without catastrophic forgetting.

DetailsMotivation: Existing continual learning solutions like OGD require storing all past gradients, leading to unbounded memory growth over time. This makes them impractical for long-term continual learning scenarios where memory is limited.

Method: SketchOGD combines orthogonal gradient descent with online matrix sketching. It compresses encountered model gradients into a fixed-size matrix using an online sketching algorithm, allowing OGD to operate with constant memory regardless of runtime.

Result: SketchOGD outperforms state-of-the-art OGD variants given fixed memory budgets. It provides theoretical guarantees on approximation error under a novel metric suited for OGD’s downstream tasks.

Conclusion: SketchOGD offers a practical, memory-efficient solution for continual learning that runs online without needing advance knowledge of total tasks, is simple to implement, and provides theoretical guarantees while maintaining performance.

Abstract: When machine learning models are trained continually on a sequence of tasks, they are often liable to forget what they learned on previous tasks - a phenomenon known as catastrophic forgetting. Proposed solutions to catastrophic forgetting tend to involve storing information about past tasks, meaning that memory usage is a chief consideration in determining their practicality. This paper develops a memory-efficient solution to catastrophic forgetting using the idea of matrix sketching, in the context of a simple continual learning algorithm known as orthogonal gradient descent (OGD). OGD finds weight updates that aim to preserve performance on prior datapoints, using gradients of the model on those datapoints. However, since the memory cost of storing prior model gradients grows with the runtime of the algorithm, OGD is ill-suited to continual learning over long time horizons. To address this problem, we propose SketchOGD. SketchOGD employs an online sketching algorithm to compress model gradients as they are encountered into a matrix of a fixed, user-determined size. In contrast to existing memory-efficient variants of OGD, SketchOGD runs online without the need for advance knowledge of the total number of tasks, is simple to implement, and is more amenable to analysis. We provide theoretical guarantees on the approximation error of the relevant sketches under a novel metric suited to the downstream task of OGD. Experimentally, we find that SketchOGD tends to outperform current state-of-the-art variants of OGD given a fixed memory budget.
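
The two moving parts are a fixed-size sketch of past gradients and a projection against it. The sketch below uses frequent directions, a standard online matrix-sketching algorithm consistent with the setting described above, though the exact variant, sketch size, and projection details here are assumptions rather than the paper's implementation.

```python
import numpy as np

def frequent_directions_update(S, g):
    """Fold gradient g into a fixed-size sketch S (ell x d): the last row
    of S is kept zero between updates, so g can be inserted there; shrinking
    the spectrum by the smallest singular value re-zeroes one row."""
    S = S.copy()
    S[-1] = g
    _, sig, Vt = np.linalg.svd(S, full_matrices=False)
    shrunk = np.maximum(sig ** 2 - sig[-1] ** 2, 0.0)
    return np.sqrt(shrunk)[:, None] * Vt

def project_out_sketch(S, g):
    """Remove from g its component in the sketched subspace, approximating
    OGD's projection orthogonal to past-task gradients."""
    _, sig, Vt = np.linalg.svd(S, full_matrices=False)
    V = Vt[sig > 1e-10]                      # directions present in the sketch
    return g - V.T @ (V @ g)

d, ell = 100, 8                              # parameter dim, fixed memory budget
S = np.zeros((ell, d))
for _ in range(50):                          # stream of past-task gradients
    S = frequent_directions_update(S, np.random.randn(d))
update = project_out_sketch(S, np.random.randn(d))
```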

[338] Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset

Yujie Dai, Brian Sullivan, Axel Montout, Amy Dillon, Chris Waller, Peter Acs, Rachel Denholm, Philip Williams, Alastair D Hay, Raul Santos-Rodriguez, Andrew Dowsey

Main category: cs.LG

TL;DR: AI framework using linked EHR data to estimate UTI risk with explainable models, revealing differences in predictors across risk groups while addressing data fairness and bias challenges.

DetailsMotivation: Machine learning on EHRs has potential for clinical insights but faces challenges with data heterogeneity, sparsity, temporal misalignment, and limited labeled outcomes, particularly for urinary tract infections.

Method: Used linked EHR dataset of ~1M UK individuals; implemented data pre-processing pipeline; developed UTI risk estimation framework informed by clinical expertise; trained pairwise XGBoost models with explainable AI techniques to differentiate UTI risk categories.

Result: Found differences in clinical and demographic predictors across UTI risk groups; demonstrated potential of AI-driven insights for UTI clinical decision-making.

Conclusion: While showing promise for supporting UTI clinical decision-making, further investigation of patient sub-strata and extensive validation are needed to ensure robustness and clinical applicability.

Abstract: The use of machine learning and AI on electronic health records (EHRs) holds substantial potential for clinical insight. However, this approach faces challenges due to data heterogeneity, sparsity, temporal misalignment, and limited labeled outcomes. In this context, we leverage a linked EHR dataset of approximately one million de-identified individuals from Bristol, North Somerset, and South Gloucestershire, UK, to characterize urinary tract infections (UTIs). We implemented a data pre-processing and curation pipeline that transforms the raw EHR data into a structured format suitable for developing predictive models focused on data fairness, accountability and transparency. Given the limited availability and biases of ground truth UTI outcomes, we introduce a UTI risk estimation framework informed by clinical expertise to estimate UTI risk across individual patient timelines. Pairwise XGBoost models are trained using this framework to differentiate UTI risk categories with explainable AI techniques applied to identify key predictors and support interpretability. Our findings reveal differences in clinical and demographic predictors across risk groups. While this study highlights the potential of AI-driven insights to support UTI clinical decision-making, further investigation of patient sub-strata and extensive validation are needed to ensure robustness and applicability in clinical practice.
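
The classify-and-explain recipe maps onto standard tooling. The sketch below trains one binary XGBoost model for a single pair of risk categories and ranks features by mean absolute SHAP value, on synthetic stand-ins for the EHR variables; it requires the xgboost and shap packages, and everything dataset-specific is an assumption.

```python
import numpy as np
import xgboost as xgb
import shap

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))                  # stand-ins for age, labs, history
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(size=1000) > 0).astype(int)

# One pairwise model, e.g. "low risk" vs "high risk"
model = xgb.XGBClassifier(n_estimators=200, max_depth=4, eval_metric='logloss')
model.fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)           # per-patient attributions
ranking = np.abs(shap_values).mean(axis=0).argsort()[::-1]
print('most influential features:', ranking[:3])  # expected: 0 and 3 near the top
```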

[339] WaveGNN: Integrating Graph Neural Networks and Transformers for Decay-Aware Classification of Irregular Clinical Time-Series

Arash Hajisafi, Maria Despoina Siampou, Bita Azarijoo, Zhen Xiong, Cyrus Shahabi

Main category: cs.LG

TL;DR: WaveGNN is a model that directly processes irregular clinical time series without interpolation, using a decay-aware Transformer for intra-series dynamics and a sample-specific GNN for interpretable inter-sensor relationships.

DetailsMotivation: Clinical time series are irregularly sampled with varying frequencies, missing data, and misaligned timestamps. Existing approaches either introduce bias through interpolation or create uninterpretable relationships, making it difficult to learn accurate intra-series and inter-series dependencies.

Method: WaveGNN operates directly on irregular multivariate time series without interpolation. It combines: 1) a decay-aware Transformer to capture intra-series dynamics, and 2) a sample-specific graph neural network that models both short-term and long-term inter-sensor relationships, generating a single, sparse, and interpretable graph per sample.

Result: WaveGNN delivers consistently strong performance across multiple benchmark datasets (P12, P19, MIMIC-III, and PAM), while other state-of-the-art baselines perform well on some datasets but poorly on others. The learned graphs align well with known physiological structures, enhancing interpretability.

Conclusion: WaveGNN’s consistency and robustness across diverse clinical settings set it apart from other methods. The model’s ability to generate interpretable graphs that align with physiological structures supports clinical decision-making while avoiding the biases introduced by interpolation-based approaches.

Abstract: Clinical time series are often irregularly sampled, with varying sensor frequencies, missing observations, and misaligned timestamps. Prior approaches typically address these irregularities by interpolating data into regular sequences, thereby introducing bias, or by generating inconsistent and uninterpretable relationships across sensor measurements, complicating the accurate learning of both intra-series and inter-series dependencies. We introduce WaveGNN, a model that operates directly on irregular multivariate time series without interpolation or conversion to a regular representation. WaveGNN combines a decay-aware Transformer to capture intra-series dynamics with a sample-specific graph neural network that models both short-term and long-term inter-sensor relationships. Therefore, it generates a single, sparse, and interpretable graph per sample. Across multiple benchmark datasets (P12, P19, MIMIC-III, and PAM), WaveGNN delivers consistently strong performance, whereas other state-of-the-art baselines tend to perform well on some datasets or tasks but poorly on others. While WaveGNN does not necessarily surpass every method in every case, its consistency and robustness across diverse settings set it apart. Moreover, the learned graphs align well with known physiological structures, enhancing interpretability and supporting clinical decision-making.

[340] Structure-Aligned Protein Language Model

Can Chen, David Heurtel-Depeiges, Robert M. Vernon, Christopher James Langmead, Yoshua Bengio, Quentin Fournier

Main category: cs.LG

TL;DR: A dual-task framework that enriches protein language models with structural knowledge through contrastive learning with protein graph neural networks and structure token prediction, improving performance across various biological tasks.

DetailsMotivation: Protein language models lack structural knowledge essential for biological applications, despite excelling at sequence-based tasks. Structural information is crucial for many protein function predictions.

Method: Two-stage approach: 1) Latent-level contrastive learning aligns residue representations between pLMs and pGNNs for inter-protein structural knowledge; 2) Physical-level task predicts structure tokens for intra-protein information. Includes residue loss selection module to handle variable-quality PDB structures.

Result: Notable performance gains across tasks: substantial improvements in deep mutational scanning fitness prediction, 59% increase in P@L for ESM2 650M contact prediction on CASP16. Gains scale with model sizes (8M to 650M) and extend to different downstream tasks.

Conclusion: The proposed structure alignment method effectively injects both inter- and intra-protein structural knowledge into pLMs through lightweight post-training, resulting in robust performance improvements across diverse biological applications.

Abstract: Protein language models (pLMs) pre-trained on vast protein sequence databases excel at various downstream tasks but often lack the structural knowledge essential for some biological applications. To address this, we introduce a method to enrich pLMs with structural knowledge by leveraging pre-trained protein graph neural networks (pGNNs). First, a latent-level contrastive learning task aligns residue representations from pLMs with those from pGNNs across multiple proteins, injecting inter-protein structural information. Additionally, a physical-level task integrates intra-protein information by training pLMs to predict structure tokens. Together, the proposed dual-task framework effectively incorporates both inter- and intra-protein structural knowledge into pLMs. Given the variability in the quality of protein structures in PDB, we further introduce a residue loss selection module that uses a small model trained on high-quality structures to select reliable yet challenging residue losses for the pLM to learn. Applying our structure alignment method as a simple, lightweight post-training step to the state-of-the-art ESM2 and AMPLIFY yields notable performance gains. These improvements are consistent across a wide range of tasks, including substantial gains in deep mutational scanning (DMS) fitness prediction and a 59% increase in P@L for ESM2 650M contact prediction on CASP16. Furthermore, we demonstrate that these performance gains are robust, scaling with model sizes from 8M to 650M and extending to different downstream tasks.
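
The latent-level alignment step is essentially a symmetric InfoNCE objective over residues. The sketch below treats the pLM and pGNN embeddings of the same residue as a positive pair and all other residues in the batch as negatives; the projection heads, cross-protein batching, and the temperature are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def residue_contrastive_loss(h_plm, h_pgnn, tau=0.07):
    """Symmetric InfoNCE over residues: matching rows are positives."""
    a = F.normalize(h_plm, dim=-1)                 # (N, d), pLM embeddings
    b = F.normalize(h_pgnn, dim=-1)                # (N, d), pGNN embeddings
    logits = a @ b.T / tau                         # (N, N) similarities
    targets = torch.arange(a.shape[0])             # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

loss = residue_contrastive_loss(torch.randn(128, 256), torch.randn(128, 256))
print(loss)
```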

[341] Bidirectional predictive coding

Gaspard Oliviers, Mufeng Tang, Rafal Bogacz

Main category: cs.LG

TL;DR: Bidirectional Predictive Coding (bPC) combines both generative and discriminative inference in a biologically plausible model, outperforming unidirectional PC models in specialized tasks and better resembling biological visual inference.

DetailsMotivation: The brain uses both generative (top-down prediction) and discriminative (bottom-up prediction) inference, but existing PC models are unidirectional and show degraded performance in tasks requiring bidirectional processing.

Method: Proposed bidirectional PC (bPC) that incorporates both generative and discriminative inference while maintaining biological plausibility, developing an energy landscape suitable for both tasks.

Result: bPC matches or outperforms unidirectional models in specialized generative/discriminative tasks, and shows superior performance in multimodal learning and inference with missing information.

Conclusion: bPC more closely resembles biological visual inference by successfully combining both generative and discriminative processing in a unified, biologically plausible framework.

Abstract: Predictive coding (PC) is an influential computational model of visual learning and inference in the brain. Classical PC was proposed as a top-down generative model, where the brain actively predicts upcoming visual inputs, and inference minimises the prediction errors. Recent studies have also shown that PC can be formulated as a discriminative model, where sensory inputs predict neural activities in a feedforward manner. However, experimental evidence suggests that the brain employs both generative and discriminative inference, while unidirectional PC models show degraded performance in tasks requiring bidirectional processing. In this work, we propose bidirectional PC (bPC), a PC model that incorporates both generative and discriminative inference while maintaining a biologically plausible circuit implementation. We show that bPC matches or outperforms unidirectional models in their specialised generative or discriminative tasks, by developing an energy landscape that simultaneously suits both tasks. We also demonstrate bPC’s superior performance in two biologically relevant tasks including multimodal learning and inference with missing information, suggesting that bPC resembles biological visual inference more closely.

[342] Optimal Prediction Using Expert Advice and Randomized Littlestone Dimension

Yuval Filmus, Steve Hanneke, Idan Mehalel, Shay Moran

Main category: cs.LG

TL;DR: The paper establishes optimal mistake bounds for randomized online learning, introducing the randomized Littlestone dimension and solving open problems in agnostic learning and prediction with expert advice.

DetailsMotivation: While deterministic online learning has well-established optimal mistake bounds characterized by the Littlestone dimension, analogous results for randomized learners were lacking. Additionally, optimal mistake bounds in the agnostic case (where the best hypothesis makes k mistakes) remained open problems, particularly for randomized learners and prediction with expert advice.

Method: The paper introduces the randomized Littlestone dimension, defined as the largest d for which there exists a tree shattered by the hypothesis class H whose average depth is 2d. They use this concept to characterize optimal mistake bounds for randomized learners in both realizable and agnostic settings.

Result: 1) The optimal expected mistake bound for randomized learners equals the randomized Littlestone dimension. 2) In the agnostic case, the optimal randomized mistake bound is k + Θ(√(kd) + d), and the optimal deterministic bound is 2k + Θ(d) + O(√(kd)). 3) For prediction with expert advice, they provide an optimal randomized learning rule whose expected mistake bound equals half of the known deterministic bound up to negligible terms.

Conclusion: The paper provides a complete characterization of optimal mistake bounds for randomized online learning, resolving long-standing open problems. The randomized Littlestone dimension serves as the fundamental complexity measure for randomized learners, analogous to the Littlestone dimension for deterministic learners. The results unify and extend previous work, particularly solving the randomized case of prediction with expert advice that had remained open for about 30 years.

Abstract: A classical result in online learning characterizes the optimal mistake bound achievable by deterministic learners using the Littlestone dimension (Littlestone ‘88). We prove an analogous result for randomized learners: we show that the optimal expected mistake bound in learning a class $\mathcal{H}$ equals its randomized Littlestone dimension, which is the largest $d$ for which there exists a tree shattered by $\mathcal{H}$ whose average depth is $2d$. We further study optimal mistake bounds in the agnostic case, as a function of the number of mistakes made by the best function in $\mathcal{H}$, denoted by $k$. We show that the optimal randomized mistake bound for learning a class with Littlestone dimension $d$ is $k + Θ(\sqrt{k d} + d )$. This also implies an optimal deterministic mistake bound of $2k + Θ(d) + O(\sqrt{k d})$, thus resolving an open question which was studied by Auer and Long [‘99]. As an application of our theory, we revisit the classical problem of prediction using expert advice: about 30 years ago Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire and Warmuth studied prediction using expert advice, provided that the best among the $n$ experts makes at most $k$ mistakes, and asked what are the optimal mistake bounds. Cesa-Bianchi, Freund, Helmbold, and Warmuth [‘93, ‘96] provided a nearly optimal bound for deterministic learners, and left the randomized case as an open problem. We resolve this question by providing an optimal learning rule in the randomized case, and showing that its expected mistake bound equals half of the deterministic bound of Cesa-Bianchi et al. [‘93,‘96], up to negligible additive terms. In contrast with previous works by Abernethy, Langford, and Warmuth [‘06], and by Brânzei and Peres [‘19], our result applies to all pairs $n,k$.

[343] Variational Continual Test-Time Adaptation

Fan Lyu, Kaile Du, Yuyang Li, Hanyu Zhao, Fuyuan Hu, Zhang Zhang, Guangcan Liu, Liang Wang

Main category: cs.LG

TL;DR: VCoTTA: A variational Bayesian approach for Continual Test-Time Adaptation that uses Bayesian Neural Networks and mean-teacher updates with prior mixture to reduce error accumulation from domain shifts.

DetailsMotivation: CTTA faces severe error accumulation due to uncertainty in model updates when using only unlabeled samples during continuous domain shifts at test time.

Method: Transform pretrained model to Bayesian Neural Network via variational warm-up, use mean-teacher update with variational inference for student and EMA for teacher, combine priors from source and teacher models in ELBO formulation.

Result: Experimental results on three datasets demonstrate effectiveness in mitigating error accumulation within CTTA framework.

Conclusion: VCoTTA successfully addresses uncertainty and error accumulation in CTTA through variational Bayesian approach with prior mixture strategy.

Abstract: Continual Test-Time Adaptation (CTTA) task investigates effective domain adaptation under the scenario of continuous domain shifts during testing time. Due to the utilization of solely unlabeled samples, there exists significant uncertainty in model updates, leading CTTA to encounter severe error accumulation issues. In this paper, we introduce VCoTTA, a variational Bayesian approach to measure uncertainties in CTTA. At the source stage, we transform a pretrained deterministic model into a Bayesian Neural Network (BNN) via a variational warm-up strategy, injecting uncertainties into the model. During the testing time, we employ a mean-teacher update strategy using variational inference for the student model and exponential moving average for the teacher model. Our novel approach updates the student model by combining priors from both the source and teacher models. The evidence lower bound is formulated as the cross-entropy between the student and teacher models, along with the Kullback-Leibler (KL) divergence of the prior mixture. Experimental results on three datasets demonstrate the method’s effectiveness in mitigating error accumulation within the CTTA framework.
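
For reference, the teacher side of the mean-teacher update is a plain EMA over parameters; the sketch below shows that step in isolation (the variational student update and prior mixture are the paper's substance and are not reproduced), with the momentum value as an assumption.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """teacher <- momentum * teacher + (1 - momentum) * student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)

student = torch.nn.Linear(10, 2)
teacher = torch.nn.Linear(10, 2)
teacher.load_state_dict(student.state_dict())   # initialize teacher = student
ema_update(teacher, student)                    # after each adaptation step
```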

[344] REAL: Representation Enhanced Analytic Learning for Exemplar-free Class-incremental Learning

Run He, Di Fang, Yizhu Chen, Kai Tong, Cen Chen, Yi Wang, Lap-pui Chau, Huiping Zhuang

Main category: cs.LG

TL;DR: REAL enhances exemplar-free class-incremental learning by improving representations through dual-stream pretraining with distillation and better utilizing backbone knowledge via feature fusion.

DetailsMotivation: Existing Analytic Continual Learning (ACL) methods in exemplar-free class-incremental learning suffer from ineffective representations and insufficient utilization of backbone knowledge, limiting their performance compared to exemplar-based approaches.

Method: REAL introduces: 1) Dual-stream base pretraining combining self-supervised contrastive learning for general features and supervised learning for class-specific knowledge, followed by representation-enhancing distillation; 2) A feature fusion buffer that fuses multi-layer backbone features to better utilize backbone knowledge for classifier training.

Result: REAL achieves state-of-the-art performance on CIFAR-100, ImageNet-100 and ImageNet-1k benchmarks, outperforming exemplar-free methods and rivaling exemplar-based approaches.

Conclusion: The proposed REAL method effectively addresses representation and knowledge utilization issues in exemplar-free class-incremental learning, demonstrating competitive performance that bridges the gap with exemplar-based methods.

Abstract: Exemplar-free class-incremental learning (EFCIL) aims to mitigate catastrophic forgetting in class-incremental learning (CIL) without available historical training samples as exemplars. Compared with its exemplar-based CIL counterpart that stores exemplars, EFCIL suffers more from forgetting issues. Recently, a new EFCIL branch named Analytic Continual Learning (ACL) introduces a gradient-free paradigm via Recursive Least Squares, achieving forgetting-resistant classifier training with a frozen backbone during CIL. However, existing ACL suffers from ineffective representations and insufficient utilization of backbone knowledge. In this paper, we propose a representation-enhanced analytic learning (REAL) method to address these problems. To enhance the representation, REAL constructs a dual-stream base pretraining followed by a representation-enhancing distillation process. The dual-stream base pretraining combines self-supervised contrastive learning for general features and supervised learning for class-specific knowledge, followed by the representation-enhancing distillation to merge both streams, enhancing representations for the subsequent CIL paradigm. To utilize more knowledge from the backbone, REAL presents a feature fusion buffer that fuses multi-layer backbone features, providing informative features for the subsequent classifier training. Our method can be incorporated into existing ACL techniques and provides more competitive performance. Empirical results demonstrate that REAL achieves state-of-the-art performance on CIFAR-100, ImageNet-100 and ImageNet-1k benchmarks, outperforming exemplar-free methods and rivaling exemplar-based approaches.

[345] Over-parameterization and Adversarial Robustness in Neural Networks: An Overview and Empirical Analysis

Srishti Gupta, Zhang Chen, Luca Demetrio, Xiaoyi Feng, Zhaoqiang Xia, Antonio Emanuele Cinà, Maura Pintor, Luca Oneto, Ambra Demontis, Battista Biggio, Fabio Roli

Main category: cs.LG

TL;DR: Over-parameterized neural networks show superior robustness against adversarial attacks compared to under-parameterized networks, but previous contradictory findings may stem from unreliable attack evaluation methods.

DetailsMotivation: There are contradictory claims in literature about whether over-parameterized networks are robust or vulnerable to adversarial examples. Previous research suggests these contradictions might arise from unreliable attack methods that fail to properly evaluate robustness, leading to overestimation of model security.

Method: Empirical study of over-parameterized networks’ robustness against adversarial examples, with a key innovation: evaluating the reliability of the attacks themselves to ensure result validity, unlike previous works that didn’t verify attack effectiveness.

Result: Over-parameterized networks demonstrate robustness against adversarial attacks, in contrast to their under-parameterized counterparts. The study confirms that proper attack evaluation is crucial for accurate robustness assessment.

Conclusion: Over-parameterized networks are indeed robust to adversarial examples when attacks are properly evaluated, resolving previous contradictory findings and highlighting the importance of reliable attack assessment methods in robustness studies.

Abstract: Thanks to their extensive capacity, over-parameterized neural networks exhibit superior predictive capabilities and generalization. However, having a large parameter space is considered one of the main suspects behind neural networks’ vulnerability to adversarial examples - input samples crafted ad hoc to induce a desired misclassification. The literature contains contradictory claims in support of and against the robustness of over-parameterized networks. These contradictory findings might be due to failures of the attacks employed to evaluate the networks’ robustness. Previous research has demonstrated that, depending on the considered model, the algorithm employed to generate adversarial examples may not function properly, leading to overestimating the model’s robustness. In this work, we empirically study the robustness of over-parameterized networks against adversarial examples. Unlike previous works, however, we also evaluate the reliability of the attacks considered, to support the veracity of our results. Our results show that over-parameterized networks are robust against adversarial attacks, as opposed to their under-parameterized counterparts.

[346] Imbalances in Neurosymbolic Learning: Characterization and Mitigating Strategies

Kaifu Wang, Efthymia Tsamoura, Dan Roth

Main category: cs.LG

TL;DR: The paper studies learning neural classifiers in neurosymbolic learning when only symbolic component outputs on gold labels are available, focusing on characterizing and mitigating class-specific learning imbalances caused by the symbolic component rather than just data imbalances.

DetailsMotivation: Current neurosymbolic learning research hasn't studied how symbolic components can cause learning imbalances (class-specific risks) when only symbolic outputs on gold labels are available, unlike supervised learning which only considers data imbalances.

Method: 1) Theoretical analysis of how symbolic components impact learning imbalances; 2) Technique for estimating marginal distribution of hidden gold labels using weakly supervised data; 3) Algorithms that mitigate imbalances at training and testing time by treating the marginal of hidden labels as a constraint.

Result: Theoretical analysis reveals symbolic components can greatly impact learning imbalances, contrasting with previous supervised learning research. Practical techniques demonstrate effectiveness with strong NSL and long-tailed learning baselines, showing performance improvements up to 14%.

Conclusion: Symbolic components in neurosymbolic learning can significantly affect class-specific learning imbalances, requiring new mitigation techniques that leverage estimated label marginals as constraints, leading to substantial performance improvements over existing methods.

Abstract: We study one of the most popular problems in neurosymbolic learning (NSL), that of learning neural classifiers given only the result of applying a symbolic component $\sigma$ to the gold labels of the elements of a vector $\mathbf x$. The gold labels of the elements in $\mathbf x$ are unknown to the learner. We make multiple contributions, theoretical and practical, to address a problem that has not been studied so far in this context, that of characterizing and mitigating learning imbalances, i.e., major differences in the errors that occur when classifying instances of different classes (aka class-specific risks). Our theoretical analysis reveals a unique phenomenon: that $\sigma$ can greatly impact learning imbalances. This result sharply contrasts with previous research on supervised and weakly supervised learning, which only studies learning imbalances under data imbalances. On the practical side, we introduce a technique for estimating the marginal of the hidden gold labels using weakly supervised data. Then, we introduce algorithms that mitigate imbalances at training and testing time by treating the marginal of the hidden labels as a constraint. We demonstrate the effectiveness of our techniques using strong baselines from NSL and long-tailed learning, suggesting performance improvements of up to 14%.
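
One simple way to treat an estimated marginal of the hidden gold labels as a training-time constraint is a KL penalty between the batch-averaged predictions and that marginal. This is a minimal sketch of the general idea, not the paper’s algorithms; `target_marginal` is assumed to come from the authors’ weakly supervised estimation step:

```python
import torch

def marginal_constraint_loss(probs, target_marginal, eps=1e-8):
    """KL(batch-average predictions || estimated gold-label marginal)."""
    batch_marginal = probs.mean(dim=0)  # (C,) average over the batch
    return torch.sum(batch_marginal *
                     ((batch_marginal + eps) / (target_marginal + eps)).log())
```

The penalty would be added to the NSL training loss with a weight tuned on validation data.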

[347] Scalable Temporal Anomaly Causality Discovery in Large Systems: Achieving Computational Efficiency with Binary Anomaly Flag Data

Mulugeta Weldezgina Asres, Christian Walter Omlin, The CMS-HCAL Collaboration

Main category: cs.LG

TL;DR: AnomalyCD is a causal discovery method for binary anomaly data that improves computational efficiency and accuracy in learning graphical causal models from temporal binary alarm flags in large-scale monitoring systems.

DetailsMotivation: Existing causal discovery methods face computational challenges for real-time large-scale deployments and struggle with binary anomaly data characteristics (state transitions and sparsity) in modern monitoring systems.

Method: Proposes AnomalyCD with anomaly data-aware causality testing, sparse data and prior link compression, and edge pruning adjustment approaches for learning graphical causal models from temporal binary flag datasets.

Result: Demonstrates considerable reduction in computation overhead and moderate enhancement of accuracy on binary anomaly datasets, validated on CMS experiment sensor data and IT monitoring public dataset.

Conclusion: AnomalyCD effectively addresses computational and accuracy challenges for causal discovery in binary anomaly data, enabling practical application in large-scale real-time monitoring systems.

Abstract: Extracting anomaly causality facilitates diagnostics once monitoring systems detect system faults. Identifying anomaly causes in large systems involves investigating a broader set of monitoring variables across multiple subsystems. However, learning graphical causal models (GCMs) comes with a significant computational burden that restricts the applicability of most existing methods in real-time and large-scale deployments. In addition, modern monitoring applications for large systems often generate large amounts of binary alarm flags, and the distinct characteristics of binary anomaly data – the meaning of state transitions and data sparsity – challenge existing causality learning mechanisms. This study proposes an anomaly causal discovery approach (\textsc{AnomalyCD}), addressing the accuracy and computational challenges of generating GCMs from temporal binary flag datasets. \textsc{AnomalyCD} presents several strategies, such as anomaly data-aware causality testing, sparse data and prior link compression, and edge pruning adjustment approaches. We validate the performance of the approach on two datasets: monitoring sensor data of the readout-box system of the Compact Muon Solenoid experiment at CERN, and a public dataset for information technology monitoring. The results on temporal GCMs demonstrate a considerable reduction of computation overhead and a moderate enhancement of accuracy on the binary anomaly datasets. Source code: https://github.com/muleina/AnomalyCD.
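
The paper’s sparse-data compression step is not specified in this summary, but a plausible reading is to discard time windows in which no anomaly flag fired before running the causality tests. A minimal sketch under that assumption:

```python
import numpy as np

def compress_flags(flags, window=60):
    """Drop time windows in which no anomaly flag fired (data sparsity).

    flags: (T, n_vars) binary anomaly matrix.
    Returns the concatenation of windows containing at least one flag.
    """
    T = (flags.shape[0] // window) * window
    windows = flags[:T].reshape(-1, window, flags.shape[1])
    keep = windows.any(axis=(1, 2))          # keep windows with any activity
    return windows[keep].reshape(-1, flags.shape[1])
```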

[348] Exact Verification of Graph Neural Networks with Incremental Constraint Solving

Minghao Liu, Chia-Hsuan Lu, Marta Kwiatkowska

Main category: cs.LG

TL;DR: Exact verification method GNNev for GNNs against adversarial attacks, supporting sum, max, and mean aggregations with superior performance on node classification.

DetailsMotivation: GNNs are used in high-stakes applications like fraud detection and healthcare but are vulnerable to adversarial attacks. Existing verification methods lack support for commonly used aggregation functions in message-passing GNNs.

Method: Developed GNNev, an exact verification method using constraint solving with bound tightening. It iteratively solves relaxed constraint satisfaction problems with incremental solving for efficiency, supporting sum, max, and mean aggregation functions.

Result: GNNev demonstrated usability and effectiveness on real-world fraud datasets (Amazon, Yelp) and biochemical datasets (MUTAG, ENZYMES). It showed superior performance for node classification and competitiveness on graph classification compared to existing tools for sum-aggregated GNNs.

Conclusion: GNNev provides the first exact verification method supporting max and mean aggregation functions, offering robust adversarial guarantees for GNNs in critical applications with improved performance over existing approaches.

Abstract: Graph neural networks (GNNs) are increasingly employed in high-stakes applications, such as fraud detection or healthcare, but are susceptible to adversarial attacks. A number of techniques have been proposed to provide adversarial robustness guarantees, but support for commonly used aggregation functions in message-passing GNNs is lacking. In this paper, we develop an exact (sound and complete) verification method for GNNs to compute guarantees against attribute and structural perturbations that involve edge addition or deletion, subject to budget constraints. Our method employs constraint solving with bound tightening, and iteratively solves a sequence of relaxed constraint satisfaction problems while relying on incremental solving capabilities of solvers to improve efficiency. We implement GNNev, a versatile exact verifier for message-passing neural networks, which supports three aggregation functions, sum, max and mean, with the latter two considered here for the first time. Extensive experimental evaluation of GNNev on real-world fraud datasets (Amazon and Yelp) and biochemical datasets (MUTAG and ENZYMES) demonstrates its usability and effectiveness, as well as superior performance for node classification and competitiveness on graph classification compared to existing exact verification tools on sum-aggregated GNNs.
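
GNNev’s encoding of GNN semantics is beyond a short sketch, but the incremental-solving pattern it relies on is standard in solvers such as Z3: push a tightened constraint set, check, and pop, so that solver state is reused across the sequence of relaxed problems. A schematic example of the pattern only, not the paper’s encoding:

```python
from z3 import Solver, Real, sat

s = Solver()
x, y = Real("x"), Real("y")
s.add(y >= x, y <= x + 1)        # stand-in for relaxed network constraints

for lower in [0.2, 0.5, 0.8]:    # progressively tightened relaxations
    s.push()                     # incremental: learned state is reused
    s.add(x >= lower, y <= 0.9)  # hypothetical perturbation/property bounds
    print(lower, s.check())      # unsat here would mean no counterexample
    s.pop()
```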

[349] Scalable Bayesian Optimization via Focalized Sparse Gaussian Processes

Yunyue Wei, Vincent Zhuang, Saraswati Soedarmadji, Yanan Sui

Main category: cs.LG

TL;DR: FocalBO: A Bayesian optimization method using focalized Gaussian processes with hierarchical search to efficiently handle high-dimensional problems with large datasets.

DetailsMotivation: Standard Bayesian optimization struggles with high-dimensional problems and large sample sizes due to cubic complexity of Gaussian processes. Existing approximate GP models often produce overly smooth estimates and focus on problems allowing large online samples.

Method: Proposes focalized GP using a novel variational loss function for stronger local prediction, and FocalBO which hierarchically optimizes the acquisition function over progressively smaller search spaces.

Result: FocalBO achieves state-of-the-art performance on robot morphology design and control of a 585-dimensional musculoskeletal system, efficiently leveraging both offline and online data.

Conclusion: Sparse GPs with focalized representation and hierarchical search enable Bayesian optimization to scale effectively to high-dimensional problems with large datasets.

Abstract: Bayesian optimization is an effective technique for black-box optimization, but its applicability is typically limited to low-dimensional and small-budget problems due to the cubic complexity of computing the Gaussian process (GP) surrogate. While various approximate GP models have been employed to scale Bayesian optimization to larger sample sizes, most suffer from overly smooth estimation and focus primarily on problems that allow for large online samples. In this work, we argue that Bayesian optimization algorithms with sparse GPs can more efficiently allocate their representational power to relevant regions of the search space. To achieve this, we propose focalized GP, which leverages a novel variational loss function to achieve stronger local prediction, as well as FocalBO, which hierarchically optimizes the focalized GP acquisition function over progressively smaller search spaces. Experimental results demonstrate that FocalBO can efficiently leverage large amounts of offline and online data to achieve state-of-the-art performance on robot morphology design and to control a 585-dimensional musculoskeletal system.
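
The hierarchical part of FocalBO can be pictured as optimizing the acquisition function over a sequence of boxes that shrink around the incumbent. The sketch below uses random search as a stand-in acquisition optimizer and omits the per-level refitting of the focalized GP; the function names and halving schedule are assumptions:

```python
import numpy as np

def focal_maximize(acq, center, span, levels=4, n=256, rng=None):
    """Maximize an acquisition function over progressively smaller boxes.

    acq: maps (n, d) candidate points to (n,) scores.
    center/span: define the initial search box.
    """
    rng = rng or np.random.default_rng(0)
    best_x, best_v = center, -np.inf
    for _ in range(levels):
        cand = center + span * rng.uniform(-0.5, 0.5, size=(n, len(center)))
        vals = acq(cand)
        i = int(np.argmax(vals))
        if vals[i] > best_v:
            best_x, best_v = cand[i], vals[i]
        center, span = best_x, span / 2   # shrink the box around the incumbent
    return best_x

x_star = focal_maximize(lambda X: -(X ** 2).sum(axis=1),
                        center=np.zeros(5), span=np.full(5, 4.0))
```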

[350] Low-Rank Tensor Decompositions for the Theory of Neural Networks

Ricardo Borsoi, Konstantin Usevich, Marianne Clausel

Main category: cs.LG

TL;DR: This paper reviews how low-rank tensor decompositions provide mathematical foundations for deep learning theory, explaining neural network performance aspects like expressivity, learnability, generalization, and identifiability.

DetailsMotivation: The paper aims to bridge the gap between deep neural networks' empirical success and the need for mathematical foundations by leveraging low-rank tensor decompositions, which have strong theoretical properties and connections to NNs.

Method: The paper employs a review methodology, synthesizing existing approaches from different communities (computer science, mathematics, signal processing) that use low-rank tensor decompositions to analyze neural networks. It provides a unified perspective on how tensor methods explain various aspects of NN performance.

Result: The review demonstrates that low-rank tensor decompositions serve as fundamental tools for theoretically explaining key aspects of deep neural network performance, including their expressivity, algorithmic learnability, computational hardness, generalization, and identifiability.

Conclusion: Low-rank tensor decompositions play a crucial role in establishing mathematical foundations for deep learning theory, offering a unified framework to understand neural network performance and opening broader perspectives for future research in this area.

Abstract: The groundbreaking performance of deep neural networks (NNs) promoted a surge of interest in providing a mathematical basis to deep learning theory. Low-rank tensor decompositions are specially befitting for this task due to their close connection to NNs and their rich theoretical results. Different tensor decompositions have strong uniqueness guarantees, which allow for a direct interpretation of their factors, and polynomial time algorithms have been proposed to compute them. Through the connections between tensors and NNs, such results supported many important advances in the theory of NNs. In this review, we show how low-rank tensor methods–which have been a core tool in the signal processing and machine learning communities–play a fundamental role in theoretically explaining different aspects of the performance of deep NNs, including their expressivity, algorithmic learnability and computational hardness, generalization, and identifiability. Our goal is to give an accessible overview of existing approaches (developed by different communities, ranging from computer science to mathematics) in a coherent and unified way, and to open a broader perspective on the use of low-rank tensor decompositions for the theory of deep NNs.

[351] Machine learning applications in archaeological practices: a review

Mathias Bellat, Jordy D. Orellana Figueroa, Jonathan S. Reeves, Ruhollah Taghizadeh-Mehrjardi, Claudio Tennie, Thomas Scholten

Main category: cs.LG

TL;DR: Review of 135 ML archaeology papers (1997-2022) shows rapid growth since 2019, with structure detection and artifact classification as top applications, but reveals methodological issues and need for better workflow guidelines.

DetailsMotivation: Despite significant increase in AI/ML applications across all archaeology subfields, there has been no comprehensive review examining the prevalence and success of these applications, as previous reviews only focused on specific subfields.

Method: Exhaustive review of 135 articles published between 1997-2022, analyzing publication trends, application areas, ML methods used, and methodological quality.

Result: Significant increase in publications from 2019; top applications: structure detection and artifact classification, followed by taphonomy and predictive modeling; clustering/unsupervised methods underrepresented; neural networks and ensemble learning dominate; methodological issues identified including poorly defined requirements and unclear goals.

Conclusion: ML is becoming important for analyzing large multivariate archaeological data but requires well-defined methodologies, better reporting, and collaborative practices; authors propose workflow guide for archaeologists to develop coherent methodologies adapted to their research needs.

Abstract: Artificial intelligence and machine learning applications in archaeology have increased significantly in recent years, and these now span all subfields, geographical regions, and time periods. The prevalence and success of these applications have remained largely unexamined, as recent reviews on the use of machine learning in archaeology have focused only on specific subfields of archaeology. Our review examined an exhaustive corpus of 135 articles published between 1997 and 2022. We observed a significant increase in the number of publications from 2019 onwards. Automatic structure detection and artefact classification were the most represented tasks in the articles reviewed, followed by taphonomy and archaeological predictive modelling. From the review, clustering and unsupervised methods were underrepresented compared to supervised models. Artificial neural networks and ensemble learning account for two thirds of the total number of models used. However, while machine learning models are gaining in popularity, they remain subject to misunderstanding. We observed, in some cases, poorly defined requirements and caveats of the machine learning methods used. Furthermore, the goals and the needs of machine learning applications for archaeological purposes are in some cases unclear or poorly expressed. To address this, we proposed a workflow guide for archaeologists to develop coherent and consistent methodologies adapted to their research questions, project scale and data. As in many other areas, machine learning is rapidly becoming an important tool in archaeological research and practice, useful for the analyses of large and multivariate data, although not without limitations. This review highlights the importance of well-defined and well-reported structured methodologies and collaborative practices to maximise the potential of applications of machine learning methods in archaeology.

[352] Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning

Zelin Tan, Hejia Geng, Xiaohang Yu, Mulei Zhang, Guancheng Wan, Yifan Zhou, Qiang He, Xiangyuan Xue, Heng Zhou, Yutao Fan, Zhongzhi Li, Zaibin Zhang, Guibin Zhang, Chen Zhang, Zhenfei Yin, Lei Bai

Main category: cs.LG

TL;DR: Scaling laws for RL post-training of LLMs on mathematical reasoning show larger models learn more efficiently, follow predictable power-laws, but reveal latent saturation in learning efficiency as models grow.

DetailsMotivation: While scaling laws for LLM pre-training are well-studied, scaling behaviors under RL post-training (especially for mathematical reasoning) remain largely unexplored, creating a gap in understanding how to efficiently scale reasoning capabilities.

Method: Systematic empirical investigation across the full Qwen2.5 dense model series (0.5B to 72B parameters), analyzing interactions between model scale, data volume, and computational budget to characterize RL post-training scaling behaviors.

Result: Four key findings: 1) Larger models show superior learning efficiency on compute and data metrics; 2) Test loss follows predictable power-law relationships robust across base and instruction-tuned models; 3) Learning efficiency shows latent saturation trend as model size increases; 4) In data-constrained regimes, repeated reuse of high-quality data is effective as performance depends on total optimization steps rather than unique samples.

Conclusion: The study provides principled foundation and practical guidelines for efficiently scaling LLM reasoning capabilities through RL post-training, revealing predictable scaling patterns and optimal resource allocation strategies.

Abstract: While scaling laws for large language models (LLMs) during pre-training have been extensively studied, their behavior under reinforcement learning (RL) post-training remains largely unexplored. This paper presents a systematic empirical investigation of scaling behaviors in RL-based post-training, with a particular focus on mathematical reasoning. Based on a set of experiments across the full Qwen2.5 dense model series (0.5B to 72B), we characterize how model scale, data volume, and computational budget interact to shape performance. Our analysis leads to four key findings: 1. Larger models consistently exhibit superior learning efficiency on both compute and data metrics. 2. The relationship between test loss, compute, and data can be modeled by a predictive power law, which is robust across both base and instruction-tuned models. 3. Although larger models exhibit higher learning efficiency, the analytical learning efficiency term k(N) in the power law reveals a latent saturation trend in learning efficiency as model size continues to increase. 4. In data-constrained regimes, repeated reuse of high-quality data proves highly effective, as final performance is primarily governed by the total number of optimization steps rather than the uniqueness of samples. Collectively, these results provide a principled foundation and practical guidelines for efficiently scaling the reasoning capabilities of LLMs through RL post-training.
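
Finding 2 amounts to fitting a saturating power law to observed test losses. A minimal sketch with hypothetical numbers; the paper’s exact parameterization, including the efficiency term k(N), may differ:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(C, a, b, L_inf):
    # irreducible loss plus a power-law decay in compute C
    return a * np.power(C, -b) + L_inf

compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])   # hypothetical budgets (FLOPs)
loss = np.array([1.92, 1.71, 1.53, 1.41, 1.33])      # hypothetical test losses

(a, b, L_inf), _ = curve_fit(power_law, compute, loss, p0=[50.0, 0.1, 1.0])
print(f"fitted exponent b = {b:.3f}")
```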

[353] Geometry and Optimization of Shallow Polynomial Networks

Yossi Arjevani, Joan Bruna, Joe Kileel, Elzbieta Polak, Matthew Trager

Main category: cs.LG

TL;DR: Analysis of shallow neural networks with monomial activations, focusing on width-optimization relationship, teacher-student problems as low-rank tensor approximation, and detailed optimization landscape analysis for quadratic activations.

DetailsMotivation: The paper aims to understand the mathematical properties of shallow neural networks with monomial activations, particularly focusing on how network width relates to optimization behavior and analyzing teacher-student learning problems through the lens of tensor approximation theory.

Method: The authors use tensor analysis to study shallow networks with monomial activations, identifying the function space with symmetric tensors of bounded rank. They introduce a teacher-metric data discriminant to analyze optimization behavior based on training data distribution, and provide detailed analysis of optimization landscapes for quadratic activation networks using variations of the Eckart-Young Theorem.

Result: The paper characterizes the relationship between network width and optimization, frames teacher-student problems as low-rank tensor approximation with non-standard inner products, and provides complete characterization of critical points and Hessian signatures for quadratic networks with Gaussian training data.

Conclusion: Shallow neural networks with monomial activations can be effectively analyzed using tensor methods, revealing fundamental connections between network architecture, optimization landscape, and data distribution, with complete mathematical characterization possible for quadratic activation networks.

Abstract: We study shallow neural networks with monomial activations and output dimension one. The function space for these models can be identified with a set of symmetric tensors with bounded rank. We describe general features of these networks, focusing on the relationship between width and optimization. We then consider teacher-student problems, which can be viewed as problems of low-rank tensor approximation with respect to non-standard inner products that are induced by the data distribution. In this setting, we introduce a teacher-metric data discriminant which encodes the qualitative behavior of the optimization as a function of the training data distribution. Finally, we focus on networks with quadratic activations, presenting an in-depth analysis of the optimization landscape. In particular, we present a variation of the Eckart-Young Theorem characterizing all critical points and their Hessian signatures for teacher-student problems with quadratic networks and Gaussian training data.
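
For quadratic activations, a network sum_i a_i (w_i · x)^2 corresponds to the symmetric matrix sum_i a_i w_i w_i^T, so teacher-student fitting becomes low-rank matrix approximation. The classical Eckart-Young statement in the standard inner product (the paper generalizes it to data-induced inner products) can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
W = (W + W.T) / 2                           # symmetric "teacher" matrix

U, s, Vt = np.linalg.svd(W)
k = 3
W_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] # best rank-k approximation
err = np.linalg.norm(W - W_k)               # Frobenius error
# Eckart-Young: error equals sqrt of the sum of discarded squared singular values
assert np.isclose(err, np.sqrt((s[k:] ** 2).sum()))
```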

[354] Multimodal Foundation Models for Early Disease Detection

Md Talha Mohsin, Ismail Abdulrashid

Main category: cs.LG

TL;DR: Multimodal transformer foundation model integrates EHRs, imaging, genomics, and wearables for early disease detection with robust handling of missing data and noise.

DetailsMotivation: Current diagnostic models process healthcare modalities (EHRs, imaging, genomics, wearables) in isolation, limiting ability to capture early cross-modal disease signatures and handle real-world data challenges like missing modalities and label noise.

Method: Transformer-based multimodal foundation model with modality-specific encoders mapping data into shared latent space, fused using multi-head attention with residual normalization. Trained with supervised classification plus self-supervised reconstruction and contrastive alignment for robustness.

Result: Strong performance in early-detection settings with stable classification metrics, reliable uncertainty estimates, and interpretable attention patterns across oncology, cardiology, and neurology applications.

Conclusion: The approach enables flexible pretrain-and-fine-tune foundation model supporting precision diagnostics, handling incomplete inputs, and improving early disease detection across multiple medical domains.

Abstract: Healthcare data now span EHRs, medical imaging, genomics, and wearable sensors, but most diagnostic models still process these modalities in isolation. This limits their ability to capture early, cross-modal disease signatures. This paper introduces a multimodal foundation model built on a transformer architecture that integrates heterogeneous clinical data through modality-specific encoders and cross-modal attention. Each modality is mapped into a shared latent space and fused using multi-head attention with residual normalization. We implement the framework using a multimodal dataset that simulates early-stage disease patterns across EHR sequences, imaging patches, genomic profiles, and wearable signals, including missing-modality scenarios and label noise. The model is trained using supervised classification together with self-supervised reconstruction and contrastive alignment to improve robustness. Experimental evaluation demonstrates strong performance in early-detection settings, with stable classification metrics, reliable uncertainty estimates, and interpretable attention patterns. The approach moves toward a flexible, pretrain-and-fine-tune foundation model that supports precision diagnostics, handles incomplete inputs, and improves early disease detection across oncology, cardiology, and neurology applications.
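
A minimal PyTorch sketch of the described fusion pattern: modality-specific encoders into a shared latent space, multi-head attention across modality tokens, and residual normalization. The dimensions, pooling, and module names are assumptions, not the authors’ implementation:

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Map each modality into a shared latent space, then fuse with attention."""
    def __init__(self, dims, d=128, heads=4):
        super().__init__()
        self.encoders = nn.ModuleList([nn.Linear(dim, d) for dim in dims])
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, inputs):                    # one tensor per modality, (B, dim_m)
        z = torch.stack([enc(x) for enc, x in zip(self.encoders, inputs)], dim=1)
        fused, _ = self.attn(z, z, z)             # cross-modal multi-head attention
        return self.norm(z + fused).mean(dim=1)   # residual + normalization, then pool

block = FusionBlock(dims=[32, 64, 16])            # e.g., EHR, imaging, genomics features
out = block([torch.randn(2, 32), torch.randn(2, 64), torch.randn(2, 16)])
```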

[355] Worth Their Weight: Randomized and Regularized Block Kaczmarz Algorithms without Preprocessing

Gil Goldshlager, Jiang Hu, Lin Lin

Main category: cs.LG

TL;DR: ReBlocK: Regularized randomized block Kaczmarz method that samples data uniformly, converges to weighted least-squares solution, and controls bias/variance through regularization.

DetailsMotivation: The existing randomized block Kaczmarz method (RBK) requires expensive preprocessing to construct sampling distributions. An algorithm is needed that works with uniform sampling while controlling bias and variance.

Method: Analyze RBK with uniform sampling, show convergence to a weighted least-squares solution, then incorporate regularization to control bias and variance, yielding the ReBlocK algorithm.

Result: ReBlocK outperforms both RBK and minibatch stochastic gradient descent for inconsistent problems with rapidly decaying singular values, as shown in numerical experiments.

Conclusion: ReBlocK provides a practical alternative to RBK under uniform sampling, controlling bias and variance through regularization and making it effective for large-scale problems with rapidly decaying singular values.

Abstract: Due to the ever-growing amounts of data leveraged for machine learning and scientific computing, it is increasingly important to develop algorithms that sample only a small portion of the data at a time. In the case of linear least-squares, the randomized block Kaczmarz method (RBK) is an appealing example of such an algorithm, but its convergence is only understood under sampling distributions that require potentially prohibitively expensive preprocessing steps. To address this limitation, we analyze RBK when the data is sampled uniformly, showing that its iterates converge in a Monte Carlo sense to a $\textit{weighted}$ least-squares solution. Unfortunately, for general problems the bias of the weighted least-squares solution and the variance of the iterates can become arbitrarily large. We show that these quantities can be rigorously controlled by incorporating regularization into the RBK iterations, yielding the regularized algorithm ReBlocK. Numerical experiments including examples arising from natural gradient optimization demonstrate that ReBlocK can outperform both RBK and minibatch stochastic gradient descent for inconsistent problems with rapidly decaying singular values.
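
One plausible form of the regularized iteration is a ridge-regularized projection onto each uniformly sampled block. The sketch below follows that reading and should not be taken as the paper’s exact update; the step size, any iterate averaging, and the placement of the regularizer are assumptions:

```python
import numpy as np

def reblock(A, b, lam=1e-2, block=32, iters=500, step=0.5, rng=None):
    """Regularized randomized block Kaczmarz with uniform block sampling."""
    rng = rng or np.random.default_rng(0)
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(iters):
        idx = rng.choice(m, size=block, replace=False)  # uniform, no preprocessing
        A_B, b_B = A[idx], b[idx]
        # ridge-regularized projection onto the sampled block's equations
        y = np.linalg.solve(A_B @ A_B.T + lam * np.eye(block), b_B - A_B @ x)
        x += step * A_B.T @ y
    return x
```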

[356] Variational Quantum Optimization with Continuous Bandits

Marc Wanner, Johan Jonasson, Emil Carlsson, Devdatt Dubhashi

Main category: cs.LG

TL;DR: Continuous bandit approach for VQA optimization outperforms gradient methods by avoiding barren plateaus through global exploration and local exploitation.

DetailsMotivation: VQA suffer from the barren plateau problem, where gradient-based methods fail due to exponentially small gradients and loss differences, requiring alternative optimization approaches.

Method: Formulate VQA as best arm identification in continuous space with Lipschitz smoothness, develop information-theoretic lower bound, and create near-optimal fixed-confidence algorithm for pure exploration in continuous setting.

Result: First results for pure exploration in continuous bandit setting, with algorithm near-optimal to derived lower bound; significantly outperforms state-of-the-art gradient methods on PQC and QAOA circuits.

Conclusion: Continuous bandit methods provide effective alternative to gradient-based optimization for VQA, overcoming barren plateau problem through combined global exploration and local exploitation.

Abstract: We introduce a novel approach to variational quantum algorithms (VQA) via continuous bandits. VQA are a class of hybrid quantum-classical algorithms where the parameters of quantum circuits are optimized by classical algorithms. Previous work has used zeroth- and first-order gradient-based methods; however, such algorithms suffer from the barren plateau (BP) problem, where gradients and loss differences are exponentially small. We introduce an approach using bandit methods, which combine global exploration with local exploitation. We show how VQA can be formulated as a best arm identification problem in a continuous space of arms with Lipschitz smoothness. While regret minimization has been addressed in this setting, existing methods for pure exploration only cover discrete spaces. We give the first results for pure exploration in a continuous setting and derive a fixed-confidence, information-theoretic, instance-specific lower bound. Under certain assumptions on the expected payoff, we derive a simple algorithm which is near-optimal with respect to our lower bound. Finally, we apply our continuous bandit algorithm to two VQA schemes: a PQC and a QAOA quantum circuit, showing that we significantly outperform the previously known state-of-the-art methods, which used gradient-based approaches.
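
A toy picture of pure exploration over a continuous arm space under Lipschitz smoothness: pull a grid of arms, then zoom into the neighborhood of the empirical best. This illustrates the setting only, not the paper’s near-optimal algorithm; `payoff(a, pulls, rng)` is an assumed helper returning the mean of noisy evaluations at arm `a`:

```python
import numpy as np

def continuous_best_arm(payoff, lo=0.0, hi=1.0, rounds=5, grid=32, pulls=64, rng=None):
    """Pure exploration over a 1-D continuous arm space (illustrative only)."""
    rng = rng or np.random.default_rng(0)
    best = (lo + hi) / 2
    for _ in range(rounds):
        arms = np.linspace(lo, hi, grid)
        means = np.array([payoff(a, pulls, rng) for a in arms])
        best = arms[int(np.argmax(means))]
        width = (hi - lo) / 4
        lo, hi = best - width, best + width   # zoom in around the incumbent
    return best
```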

[357] Control-Augmented Autoregressive Diffusion for Data Assimilation

Prakhar Srivastava, Farrin Marouf Sofian, Francesco Immorlano, Kushagra Pandey, Stephan Mandt

Main category: cs.LG

TL;DR: A framework for guided generation in Auto-Regressive Diffusion Models using a lightweight controller network trained offline to anticipate future observations, applied to data assimilation for chaotic PDEs with significant speed and accuracy improvements.

DetailsMotivation: Guidance in Auto-Regressive Diffusion Models (ARDMs) is underexplored, and existing methods for data assimilation in chaotic spatiotemporal PDEs are computationally prohibitive and prone to forecast drift under sparse observations.

Method: Augments pretrained ARDM with a lightweight controller network trained offline by previewing future rollouts to output stepwise controls that anticipate upcoming observations. Views guided generation as entropy-regularized stochastic optimal control over ARDM trajectories, learning reusable policy that injects small control corrections while anchored to pretrained dynamics.

Result: At inference, data assimilation reduces to single causal forward rollout with on-the-fly corrections, requiring no adjoint computations or gradient-based optimization, yielding order-of-magnitude speedup over diffusion-based baselines. Across two canonical PDEs and six observation regimes, consistently improves stability, accuracy, and physics-aware fidelity over state-of-the-art baselines.

Conclusion: The amortized framework enables efficient guided generation in ARDMs for data assimilation tasks, offering significant computational advantages while maintaining or improving performance compared to existing methods.

Abstract: Despite recent advances in test-time scaling and finetuning of diffusion models, guidance in Auto-Regressive Diffusion Models (ARDMs) remains underexplored. We introduce an amortized framework that augments a pretrained ARDM with a lightweight controller network, trained offline by previewing future rollouts to output stepwise controls that anticipate upcoming observations under a terminal-cost objective. Our approach is motivated by viewing guided generation as an entropy-regularized stochastic optimal control problem over ARDM trajectories: we learn a reusable policy that injects small control corrections inside each denoising sub-step while remaining anchored to the pretrained dynamics. We evaluate this framework in the context of data assimilation (DA) for chaotic spatiotemporal partial differential equations (PDEs), where existing methods can be computationally prohibitive and prone to forecast drift under sparse observations. At inference, DA reduces to a single causal forward rollout with on-the-fly corrections, requiring neither adjoint computations nor gradient-based optimization, and yields an order-of-magnitude speedup over strong diffusion-based DA baselines. Across two canonical PDEs and six observation regimes, our method consistently improves stability, accuracy, and physics-aware fidelity over state-of-the-art baselines. We will release code and checkpoints publicly.
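
A toy rendering of the inference-time picture: a single causal rollout in which each denoising sub-step adds a small controller correction conditioned on the upcoming observation. The modules, dimensions, and the 0.1 gain are placeholders, not the paper’s architecture:

```python
import torch
import torch.nn as nn

# Stand-ins for the pretrained ARDM denoiser and the lightweight controller.
denoiser = nn.Linear(16, 16)           # hypothetical per-substep update
controller = nn.Linear(16 + 16, 16)    # hypothetical: sees state and observation

def assimilate(x, observations, substeps=8):
    """Single causal forward rollout with on-the-fly control corrections."""
    states = []
    for obs in observations:               # one AR frame per observation time
        for _ in range(substeps):          # denoising sub-steps within a frame
            drift = denoiser(x)
            u = controller(torch.cat([x, obs], dim=-1))  # small control term
            x = x + drift + 0.1 * u        # anchored to the pretrained dynamics
        states.append(x)
    return torch.stack(states)

out = assimilate(torch.zeros(16), [torch.randn(16) for _ in range(3)])
```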

[358] Sample-Efficient Optimization over Generative Priors via Coarse Learnability

Pranjal Awasthi, Sreenivas Gollapudi, Ravi Kumar, Kamesh Munagala

Main category: cs.LG

TL;DR: A framework combining zeroth-order optimization with generative priors (like LLMs) to solve constrained optimization problems, with theoretical sample complexity guarantees under a new “coarse learnability” assumption.

DetailsMotivation: Traditional zeroth-order optimization lacks mechanisms to incorporate qualitative constraints or complex prior distributions. Existing theory does not provide sample complexity bounds for black-box deep generative models in model-based optimization.

Method: Introduces a framework where constraints are represented by a generative prior (e.g., LLM). Proposes “coarse learnability” assumption and designs an iterative algorithm with Metropolis-Hastings correction to approximate target distribution with polynomial sample complexity.

Result: Theoretical: Shows maximum likelihood estimation induces required coverage properties for coarse learnability. Empirical: Demonstrates LLMs can adapt to zeroth-order feedback to solve combinatorial optimization problems.

Conclusion: One of the first works to establish sample-complexity guarantees for model-based optimization with deep generative priors, bridging theory and practice for constrained zeroth-order optimization.

Abstract: In zeroth-order optimization, we seek to minimize a function $d(\cdot)$, which may encode combinatorial feasibility, using only function evaluations. We focus on the setting where solutions must also satisfy qualitative constraints or conform to a complex prior distribution. To address this, we introduce a new framework in which such constraints are represented by an initial generative prior $L(\cdot)$, for example, a Large Language Model (LLM). The objective is to find solutions $s$ that minimize $d(s)$ while having high probability under $L(s)$, effectively sampling from a target distribution proportional to $L(s) \cdot e^{-T \cdot d(s)}$ for a temperature parameter $T$. While this framework aligns with classical Model-Based Optimization (e.g., the Cross-Entropy method), existing theory is ill-suited for deriving sample complexity bounds in black-box deep generative models. We therefore propose a novel learning assumption, which we term \emph{coarse learnability}, where an agent with access to a polynomial number of samples can learn a model whose point-wise density approximates the target within a polynomial factor. Leveraging this assumption, we design an iterative algorithm that employs a Metropolis-Hastings correction to provably approximate the target distribution using a polynomial number of samples. To the best of our knowledge, this is one of the first works to establish such sample-complexity guarantees for model-based optimization with deep generative priors. We provide two lines of evidence supporting the coarse learnability assumption. Theoretically, we show that maximum likelihood estimation naturally induces the required coverage properties, holding for both standard exponential families and for misspecified models. Empirically, we demonstrate that LLMs can adapt their learned distributions to zeroth-order feedback to solve combinatorial optimization problems.
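
The Metropolis-Hastings correction is easy to state concretely. If proposals are drawn independently from the prior, so q(s') = L(s'), the prior factors cancel and acceptance depends only on the cost difference. A minimal sketch under that independence-proposal assumption:

```python
import math
import random

def mh_step(s, propose, d, T):
    """One MH step targeting p(s) proportional to L(s) * exp(-T * d(s)).

    Assumes an independence proposal q(s') = L(s') (draws from the prior),
    so the L terms cancel in the acceptance ratio."""
    s_new = propose()                 # sample from the generative prior
    accept_prob = math.exp(min(0.0, -T * (d(s_new) - d(s))))
    return s_new if random.random() < accept_prob else s
```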

[359] DiffEM: Learning from Corrupted Data with Diffusion Models via Expectation Maximization

Danial Hosseintabar, Fan Chen, Giannis Daras, Antonio Torralba, Constantinos Daskalakis

Main category: cs.LG

TL;DR: DiffEM: A new Expectation-Maximization method for training diffusion models from corrupted data using conditional diffusion models for reconstruction and refinement.

DetailsMotivation: Diffusion models are powerful generative priors for high-dimensional inverse problems, but learning them from only corrupted or noisy observations remains challenging. Current methods struggle when clean training data is unavailable.

Method: DiffEM uses Expectation-Maximization with conditional diffusion models. In the E-step, it reconstructs clean data from observations using conditional diffusion models. In the M-step, it refines the conditional diffusion model using the reconstructed data.

Result: Theoretical monotonic convergence guarantees are provided under appropriate statistical conditions. Experimental results demonstrate effectiveness on various image reconstruction tasks.

Conclusion: DiffEM offers a principled EM-based approach for training diffusion models from corrupted data, with theoretical guarantees and practical effectiveness for inverse problems.

Abstract: Diffusion models have emerged as powerful generative priors for high-dimensional inverse problems, yet learning them when only corrupted or noisy observations are available remains challenging. In this work, we propose a new method for training diffusion models with Expectation-Maximization (EM) from corrupted data. Our proposed method, DiffEM, utilizes conditional diffusion models to reconstruct clean data from observations in the E-step, and then uses the reconstructed data to refine the conditional diffusion model in the M-step. Theoretically, we provide monotonic convergence guarantees for the DiffEM iteration, assuming appropriate statistical conditions. We demonstrate the effectiveness of our approach through experiments on various image reconstruction tasks.

[360] PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch

Abhishek Ghosh, Ajay Nayak, Ashish Panwar, Arkaprava Basu

Main category: cs.LG

TL;DR: PyGraph is a compiler framework that automatically optimizes ML workloads for CUDA Graphs, addressing launch latency bottlenecks through code transformations, parameter copy elimination, and selective deployment.

DetailsMotivation: GPU compute throughput has grown rapidly, making CPU-side kernel launch latency a bottleneck for ML workloads that launch hundreds to thousands of short-running GPU kernels per iteration. While CUDA Graphs promise to address this by replaying kernels with a single dispatch, they remain difficult to deploy correctly and efficiently.

Method: PyGraph introduces three novel optimizations: 1) automatic code transformations to make ML applications amenable to CUDA Graphs, 2) elimination of parameter copy overheads for kernels executing in CUDA Graphs, and 3) selective deployment of CUDA Graphs guided by cost-benefit analysis. It’s built atop PyTorch2’s compilation framework and requires no programmer intervention.

Result: For 25 ML workloads from TorchBench, HuggingFace, and TIMM, PyGraph more than doubles the benefit from deploying CUDA Graphs compared to the most popular and widely used ML compiler, PyTorch2.

Conclusion: PyGraph successfully addresses the CUDA Graph deployment challenges in ML workloads, providing significant performance improvements through automated compiler optimizations without requiring programmer intervention.

Abstract: Machine learning (ML) workloads launch hundreds to thousands of short-running GPU kernels per iteration. With GPU compute throughput growing rapidly, CPU-side launch latency of kernels is emerging as a bottleneck. CUDA Graphs promise to address this by replaying a set of kernels with a single dispatch of the graph, removing per-kernel launch costs. However, CUDA Graphs remain surprisingly difficult to deploy correctly and efficiently. We present PyGraph - a compiler framework to maximize the coverage and benefits of CUDA Graphs for ML workloads. It introduces three novel optimizations: it applies automatic code transformations to make ML applications amenable to CUDA Graphs; it eliminates the parameter copy overheads for kernels executing in CUDA Graphs, and it selectively deploys CUDA Graphs guided by a cost-benefit analysis. For 25 ML workloads from TorchBench, HuggingFace, and TIMM, PyGraph more than doubles the benefit from deploying CUDA Graphs compared to the most popular and widely used ML compiler, PyTorch2. PyGraph is built atop PyTorch2’s compilation framework and requires no programmer intervention.
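
For reference, the manual capture/replay pattern that PyGraph automates looks roughly like this in stock PyTorch (this is the real `torch.cuda.CUDAGraph` API; the warmup, static buffers, and copy-in step are exactly the bookkeeping the paper’s compiler takes over):

```python
import torch

# Requires a CUDA device; mirrors the documented torch.cuda.CUDAGraph workflow.
model = torch.nn.Linear(1024, 1024).cuda()
static_in = torch.randn(64, 1024, device="cuda")

# Warm up on a side stream before capture (required for many ops).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):          # capture the whole kernel sequence once
    static_out = model(static_in)

static_in.copy_(torch.randn(64, 1024, device="cuda"))  # refresh static input buffer
g.replay()                         # one dispatch replays every captured kernel
```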

[361] CANet: ChronoAdaptive Network for Enhanced Long-Term Time Series Forecasting under Non-Stationarity

Mert Sonmezer, Seyda Ertekin

Main category: cs.LG

TL;DR: CANet introduces a novel architecture for long-term time series forecasting that addresses non-stationarity issues using style-transfer inspired techniques, achieving significant improvements over state-of-the-art methods.

DetailsMotivation: Real-world time series data often exhibits non-stationary characteristics with distribution shifts and temporal changes in statistical properties, which complicate forecasting and lead to over-stationarization in existing models.

Method: CANet uses Non-stationary Adaptive Normalization with Style Blending Gate and Adaptive Instance Normalization (AdaIN) to preserve non-stationary characteristics, multi-resolution patching for handling different temporal scales, Fourier analysis-based adaptive thresholding for noise reduction, and Stacked Kronecker Product Layer for efficiency.

Result: Extensive experiments show CANet achieves 42% reduction in MSE and 22% reduction in MAE compared to state-of-the-art methods on real-world datasets.

Conclusion: CANet effectively addresses non-stationarity in time series forecasting through style-transfer inspired techniques, demonstrating superior performance and practical applicability for real-world forecasting tasks.

Abstract: Long-term time series forecasting plays a pivotal role in various real-world applications. Despite recent advancements and the success of different architectures, forecasting is often challenging due to the non-stationary nature of real-world data, which frequently exhibit distribution shifts and temporal changes in statistical properties like mean and variance over time. Previous studies suggest that this inherent variability complicates forecasting, limiting the performance of many models by leading to loss of non-stationarity and resulting in over-stationarization (Liu, Wu, Wang and Long, 2022). To address this challenge, we introduce a novel architecture, ChronoAdaptive Network (CANet), inspired by style-transfer techniques. The core of CANet is the Non-stationary Adaptive Normalization module, seamlessly integrating the Style Blending Gate and Adaptive Instance Normalization (AdaIN) (Huang and Belongie, 2017). The Style Blending Gate preserves and reintegrates non-stationary characteristics, such as mean and standard deviation, by blending internal and external statistics, preventing over-stationarization while maintaining essential temporal dependencies. Coupled with AdaIN, which dynamically adapts the model to statistical changes, this approach enhances predictive accuracy under non-stationary conditions. CANet also employs multi-resolution patching to handle short-term fluctuations and long-term trends, along with Fourier analysis-based adaptive thresholding to reduce noise. A Stacked Kronecker Product Layer further optimizes the model’s efficiency while maintaining high performance. Extensive experiments on real-world datasets validate CANet’s superiority over state-of-the-art methods, achieving a 42% reduction in MSE and a 22% reduction in MAE. The source code is publicly available at https://github.com/mertsonmezer/CANet.
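
AdaIN itself is compact: normalize the content statistics away, then reimpose the style statistics. A minimal sketch for (batch, channel, time) series; CANet’s Style Blending Gate, which blends internal and external statistics, is omitted here:

```python
import torch

def adain(x, y, eps=1e-5):
    """Adaptive Instance Normalization over the time axis of (B, C, T) series.

    Normalizes x by its own per-instance statistics, then rescales and
    shifts it with the statistics of y."""
    mu_x, sd_x = x.mean(-1, keepdim=True), x.std(-1, keepdim=True)
    mu_y, sd_y = y.mean(-1, keepdim=True), y.std(-1, keepdim=True)
    return sd_y * (x - mu_x) / (sd_x + eps) + mu_y
```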

[362] Spatio-Temporal Graph Neural Network for Urban Spaces: Interpolating Citywide Traffic Volume

Silke K. Kaiser, Filipe Rodrigues, Carlos Lima Azevedo, Lynn H. Kaack

Main category: cs.LG

TL;DR: GNNUI is a graph neural network approach for urban traffic volume estimation that handles sparse sensor coverage, overdispersed data with many zeros, and learns interpolation through masking while outperforming existing methods across multiple metrics.

DetailsMotivation: Urban traffic volume forecasting faces unique challenges compared to highways: greater structural diversity, highly overdispersed data with many zeros, unclear spatial dependencies, and very sparse sensor coverage. Existing methods struggle with these urban-specific issues.

Method: GNNUI uses a masking algorithm to learn interpolation, integrates node features to capture functional roles, and employs a loss function tailored to zero-inflated traffic distributions. The approach is evaluated on two new urban traffic benchmarks: Strava cycling data from Berlin and NYC taxi data.

Result: GNNUI outperforms recent interpolation methods across metrics (MAE, RMSE, true-zero rate, KL divergence) and remains robust from 90% to 1% sensor coverage. On Strava data, MAE increases only from 7.1 to 10.5; on Taxi data, from 23.0 to 40.4. The model also examines how graph connectivity choices affect accuracy.

Conclusion: GNNUI demonstrates strong performance despite extreme data scarcity common in urban settings, providing an effective solution for urban traffic volume estimation with sparse sensor coverage while maintaining robustness across varying coverage levels.

Abstract: Graph Neural Networks have shown strong performance in traffic volume forecasting, particularly on highways and major arterial networks. Applying them to urban settings, however, presents unique challenges: urban networks exhibit greater structural diversity, traffic volumes are highly overdispersed with many zeros, the best way to account for spatial dependencies remains unclear, and sensor coverage is often very sparse. We introduce the Graph Neural Network for Urban Interpolation (GNNUI), a novel urban traffic volume estimation approach. GNNUI employs a masking algorithm to learn interpolation, integrates node features to capture functional roles, and uses a loss function tailored to zero-inflated traffic distributions. In addition to the model, we introduce two new open, large-scale urban traffic volume benchmarks, covering different transportation modes: Strava cycling data from Berlin and New York City taxi data. GNNUI outperforms recent interpolation methods, some of them graph-based, across metrics (MAE, RMSE, true-zero rate, Kullback-Leibler divergence) and remains robust from 90% to 1% sensor coverage. For example, on the Strava dataset, the MAE increases only from 7.1 to 10.5, and on the Taxi dataset, from 23.0 to 40.4. These results demonstrate that GNNUI maintains strong performance despite extreme data scarcity, a common condition in real-world urban settings. We also examine how graph connectivity choices influence model accuracy.
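
The masking idea can be sketched generically: hide a random subset of sensors at training time and penalize reconstruction error only on the hidden ones. The `model(graph, x, observed)` signature is an assumption, and GNNUI’s actual loss is tailored to zero-inflated distributions rather than the plain MSE used here:

```python
import torch

def masked_training_step(model, graph, volumes, mask_frac=0.5):
    """Hide a random subset of sensors and train the model to recover them."""
    n = volumes.shape[0]
    observed = torch.rand(n) > mask_frac    # simulate sparse sensor coverage
    x = volumes.clone()
    x[~observed] = 0.0                      # masked nodes see no volume
    pred = model(graph, x, observed)
    # supervise only on the hidden sensors, so the model learns to interpolate
    return ((pred[~observed] - volumes[~observed]) ** 2).mean()
```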

[363] From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity

Haoming Liu, Jinnuo Liu, Yanhao Li, Liuyang Bai, Yunkai Ji, Yuanhe Guo, Shenji Wan, Hongyi Wen

Main category: cs.LG

TL;DR: Flow-based diffusion models have a two-stage training dynamic: early stage generalizes across data modes for global layouts, later stage memorizes fine details from nearest samples.

DetailsMotivation: To understand the memorization-generalization behavior of flow-based diffusion models, which remains poorly understood despite their success in generative modeling for images and videos.

Method: Revisit flow matching objective and analyze its marginal velocity field, which admits closed-form expression allowing exact computation of oracle FM target. Study reveals inherent two-stage training target structure.

Result: Analysis shows flow-based diffusion models formulate two-stage training: early stage guided by mixture of data modes (generalization), later stage dominated by nearest data sample (memorization). This explains effectiveness of practical techniques like timestep-shifted schedules, classifier-free guidance intervals, and latent space design.

Conclusion: The study deepens understanding of diffusion model training dynamics and offers principles for guiding future architectural and algorithmic improvements in flow-based generative models.

Abstract: Flow-based diffusion models have emerged as a leading paradigm for training generative models across images and videos. However, their memorization-generalization behavior remains poorly understood. In this work, we revisit the flow matching (FM) objective and study its marginal velocity field, which admits a closed-form expression, allowing exact computation of the oracle FM target. Analyzing this oracle velocity field reveals that flow-based diffusion models inherently formulate a two-stage training target: an early stage guided by a mixture of data modes, and a later stage dominated by the nearest data sample. The two-stage objective leads to distinct learning behaviors: the early navigation stage generalizes across data modes to form global layouts, whereas the later refinement stage increasingly memorizes fine-grained details. Leveraging these insights, we explain the effectiveness of practical techniques such as timestep-shifted schedules, classifier-free guidance intervals, and latent space design choices. Our study deepens the understanding of diffusion model training dynamics and offers principles for guiding future architectural and algorithmic improvements.
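
For the common linear (rectified-flow) path x_t = (1-t) x0 + t x1 with Gaussian x0, the oracle marginal velocity over a finite dataset has the closed form sketched below; whether this matches the paper’s interpolant is an assumption. The softmax weights make the two stages visible: spread across modes at small t, concentrated on the nearest sample as t approaches 1.

```python
import torch

def oracle_velocity(x_t, t, data):
    """Exact marginal FM velocity for x_t = (1-t) x0 + t x1, x0 ~ N(0, I),
    when the data distribution is a finite set of points (t in (0, 1))."""
    # posterior weights over data points: x_t | x1 ~ N(t * x1, (1 - t)**2 I)
    d2 = ((x_t.unsqueeze(0) - t * data) ** 2).sum(-1)
    w = torch.softmax(-d2 / (2 * (1 - t) ** 2), dim=0)
    x1_hat = (w.unsqueeze(-1) * data).sum(0)   # E[x1 | x_t]
    return (x1_hat - x_t) / (1 - t)            # equals E[x1 - x0 | x_t]

data = torch.randn(100, 2)                     # toy dataset
v = oracle_velocity(torch.zeros(2), 0.5, data)
```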

[364] HI-SQL: Optimizing Text-to-SQL Systems through Dynamic Hint Integration

Ganesh Parab, Zishan Ahmad, Dagnachew Birru

Main category: cs.LG

TL;DR: HI-SQL: A pipeline with hint generation from historical query logs to guide LLM-based Text-to-SQL generation, improving accuracy on complex queries while reducing computational costs.

DetailsMotivation: Existing Text-to-SQL methods using LLMs struggle with complex queries involving multi-table joins and nested conditions, often relying on multi-step pipelines that are computationally expensive, high-latency, and prone to error propagation.

Method: Propose HI-SQL pipeline with novel hint generation mechanism that analyzes historical query logs to create contextual hints for handling multi-table and nested operations. These hints are integrated into SQL generation process, eliminating need for costly multi-step approaches and reducing reliance on human-crafted prompts.

Result: Experimental evaluations on multiple benchmark datasets show significant improvement in query accuracy of LLM-generated queries while ensuring efficiency in terms of LLM calls and latency.

Conclusion: HI-SQL offers a robust and practical solution for enhancing Text-to-SQL systems by leveraging historical query patterns to guide SQL generation, addressing complexity challenges while maintaining efficiency.

Abstract: Text-to-SQL generation bridges the gap between natural language and databases, enabling users to query data without requiring SQL expertise. While large language models (LLMs) have significantly advanced the field, challenges remain in handling complex queries that involve multi-table joins, nested conditions, and intricate operations. Existing methods often rely on multi-step pipelines that incur high computational costs, increase latency, and are prone to error propagation. To address these limitations, we propose HI-SQL, a pipeline that incorporates a novel hint generation mechanism utilizing historical query logs to guide SQL generation. By analyzing prior queries, our method generates contextual hints that focus on handling the complexities of multi-table and nested operations. These hints are seamlessly integrated into the SQL generation process, eliminating the need for costly multi-step approaches and reducing reliance on human-crafted prompts. Experimental evaluations on multiple benchmark datasets demonstrate that our approach significantly improves query accuracy of LLM-generated queries while ensuring efficiency in terms of LLM calls and latency, offering a robust and practical solution for enhancing Text-to-SQL systems.
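
A minimal sketch of hint-augmented prompting: retrieve the most similar logged (question, SQL) pairs and prepend them to the generation prompt. The token-overlap similarity and the prompt layout are stand-ins for whatever HI-SQL actually uses:

```python
def build_prompt(question, schema, query_log, k=3):
    """Attach hints mined from historical query logs to the generation prompt.

    query_log: list of (past_question, past_sql) pairs; similarity here is a
    crude token-overlap stand-in for HI-SQL's hint-selection mechanism."""
    def sim(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(1, len(ta | tb))

    hints = sorted(query_log, key=lambda qs: sim(question, qs[0]), reverse=True)[:k]
    hint_text = "\n".join(f"-- similar question: {q}\n{s}" for q, s in hints)
    return f"{schema}\n\nHints from past queries:\n{hint_text}\n\nQuestion: {question}\nSQL:"
```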

[365] Taming Latency and Bandwidth: A Theoretical Framework and Adaptive Algorithm for Communication-Constrained Training

Rongwei Lu, Jingyan Jiang, Chunyang Li, Xingguang Wei, Zhi Wang

Main category: cs.LG

TL;DR: DeCo-SGD: Dynamic compression and staleness selection for multi-data-center distributed training to overcome bandwidth limitations while maintaining convergence.

DetailsMotivation: Regional energy caps limit single data center growth for large-scale model training. Multi-data-center distributed training over wide-area networks faces high latency and low/varying bandwidth, reducing throughput. Existing fixed strategies for gradient compression and delayed aggregation lack theoretical guidance and sensitivity to dynamic conditions.

Method: Proposes DeCo-SGD with theoretical framework decomposing joint optimization into traditional process plus analyzable noise terms. First convergence rate analysis for this setting, revealing exponential amplification of compression’s detrimental effect with increased staleness. Dynamically selects compression ratio and staleness based on real-time communication/computation conditions.

Result: DeCo-SGD achieves up to 5.07× speed-up over distributed SGD and 1.37× over static strategies in high-latency and low/varying bandwidth networks.

Conclusion: Dynamic adaptation of compression and staleness based on theoretical insights enables efficient multi-data-center distributed training, overcoming bandwidth limitations while maintaining convergence.

Abstract: Regional energy caps limit the growth of any single data center used for large-scale model training. This single-center training paradigm works when model size remains manageable, but exponential growth in the model size and computational demand challenges it. A natural alternative is to distribute training across multiple data centers over wide-area networks. This pools distributed resources, but suffers from high latency and low, time-varying bandwidth, sharply reducing throughput. Jointly employing gradient compression and delayed aggregation can alleviate communication problems, but it introduces a complex three-way trade-off among compression ratio, staleness (delayed synchronization steps), and convergence rate. Existing work lacks theoretical guidance and can only propose fixed strategies, insensitive to computation and communication conditions. We address this with a new theoretical tool, decomposing the joint optimization problem into a traditional process plus multiple analyzable noise terms. Our analysis yields the first convergence rate for this setting and shows that increasing staleness exponentially amplifies the detrimental effect of compression. Leveraging these insights, we propose DeCo-SGD, which dynamically selects the compression ratio and staleness based on the real-time communication and computation conditions. DeCo-SGD achieves up to $5.07\times$ and $1.37\times$ speed-ups over distributed SGD and static strategy in high-latency and low, varying bandwidth networks, respectively.
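
DeCo-SGD’s contribution is choosing the compression ratio and staleness from its convergence analysis; the compression operator itself can be as simple as top-k sparsification, sketched here for reference (the paper’s specific compressor is not stated in this summary):

```python
import torch

def topk_compress(grad, ratio):
    """Keep only the largest-magnitude entries of a gradient tensor.

    ratio would be chosen per round by DeCo-SGD's analysis; here it is
    simply a parameter."""
    k = max(1, int(ratio * grad.numel()))
    _, idx = torch.topk(grad.abs().flatten(), k)
    sparse = torch.zeros_like(grad).flatten()
    sparse[idx] = grad.flatten()[idx]
    return sparse.view_as(grad)
```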

[366] ChronoSelect: Robust Learning with Noisy Labels via Dynamics Temporal Memory

Jianchao Wang, Qingfeng Li, Pengcheng Zheng, Xiaorong Pu, Yazhou Ren

Main category: cs.LG

TL;DR: ChronoSelect is a novel framework for learning with noisy labels that uses temporal dynamics and a four-stage memory architecture to partition samples into clean, boundary, and noisy subsets.

DetailsMotivation: Existing methods for learning with noisy labels suffer from static snapshot evaluations and fail to leverage the rich temporal dynamics of learning evolution, which limits their effectiveness in handling noisy real-world datasets.

Method: ChronoSelect features a four-stage memory architecture that compresses prediction history into compact temporal distributions using a sliding update mechanism with controlled decay. It maintains only four dynamic memory units per sample and uses temporal trajectory analysis with dual-branch consistency for three-way sample partitioning.

Result: Theoretical guarantees prove the mechanism’s convergence and stability under noisy conditions. Extensive experiments demonstrate ChronoSelect’s state-of-the-art performance across synthetic and real-world benchmarks.

Conclusion: ChronoSelect effectively addresses the limitations of existing noisy label learning methods by leveraging temporal dynamics, providing a robust framework with theoretical guarantees and superior empirical performance.

Abstract: Training deep neural networks on real-world datasets is often hampered by the presence of noisy labels, which can be memorized by over-parameterized models, leading to significant degradation in generalization performance. While existing methods for learning with noisy labels (LNL) have made considerable progress, they fundamentally suffer from static snapshot evaluations and fail to leverage the rich temporal dynamics of learning evolution. In this paper, we propose ChronoSelect (chrono denoting its temporal nature), a novel framework featuring an innovative four-stage memory architecture that compresses prediction history into compact temporal distributions. Our unique sliding update mechanism with controlled decay maintains only four dynamic memory units per sample, progressively emphasizing recent patterns while retaining essential historical knowledge. This enables precise three-way sample partitioning into clean, boundary, and noisy subsets through temporal trajectory analysis and dual-branch consistency. Theoretical guarantees prove the mechanism’s convergence and stability under noisy conditions. Extensive experiments demonstrate ChronoSelect’s state-of-the-art performance across synthetic and real-world benchmarks.
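
A guess at what "four dynamic memory units per sample with a sliding, decayed update" could look like: each unit holds an exponentially decayed average of predictions from one phase of training. The stage assignment and decay value are assumptions, not the paper’s mechanism:

```python
import numpy as np

class TemporalMemory:
    """Four per-sample memory units summarizing the prediction history.

    A plausible reading of the sliding update: each unit is an exponentially
    decayed average of softmax predictions from one quarter of training."""
    def __init__(self, n_samples, n_classes, decay=0.9):
        self.mem = np.zeros((n_samples, 4, n_classes))
        self.decay = decay

    def update(self, idx, probs, epoch, total_epochs):
        stage = min(3, 4 * epoch // total_epochs)   # which quarter of training
        self.mem[idx, stage] = (self.decay * self.mem[idx, stage]
                                + (1 - self.decay) * probs)
```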

[367] EB-gMCR: Energy-Based Generative Modeling for Signal Unmixing and Multivariate Curve Resolution

Yu-Tang Chang, Shih-Fang Chen

Main category: cs.LG

TL;DR: EB-gMCR reformulates multivariate curve resolution as a generative process with energy-based solver that automatically discovers the smallest component set and concentrations, outperforming traditional matrix factorization methods.

DetailsMotivation: Classical MCR requires user-specified number of components (usually unknown) and faces scalability challenges with increasing data or component numbers, limiting practical applications.

Method: Reformulates MCR as data generative process (gMCR) with Energy-Based solver (EB-gMCR) that automatically discovers smallest component set and concentrations; incorporates domain priors as plug-in modules.

Result: On synthetic benchmarks (up to 256 components): high reconstruction fidelity, recovers component count within 5% at 20dB noise, near-exact at 30dB. On spectral datasets: identifies correct component count and improves separation over MF-based MCR approaches (NMF variants, ICA, MCR-ALS).

Conclusion: EB-gMCR is a general solver for fixed-pattern signal unmixing that automatically determines component count, scales well, and allows domain adaptation via plug-in modules without altering core learning.

Abstract: Signal unmixing analysis decomposes data into basic patterns and is widely applied in chemical and biological research. Multivariate curve resolution (MCR), a branch of signal unmixing, separates mixed signals into components (base patterns) and their concentrations (intensity), playing a key role in understanding composition. Classical MCR is typically framed as matrix factorization (MF) and requires a user-specified number of components, usually unknown in real data. As the data size or number of components grows, the scalability of these MCR approaches faces significant challenges. This study reformulates MCR as a data generative process (gMCR), and introduces an Energy-Based solver, EB-gMCR, that automatically discovers the smallest component set and their concentrations for reconstructing the mixed signals faithfully. On synthetic benchmarks with up to 256 components, EB-gMCR attains high reconstruction fidelity and recovers the component count within 5% at 20dB noise and near-exactly at 30dB. On two public spectral datasets, it identifies the correct component count and improves component separation over MF-based MCR approaches (NMF variants, ICA, MCR-ALS). EB-gMCR is a general solver for fixed-pattern signal unmixing (components remain invariant across mixtures). Domain priors (non-negativity, nonlinear mixing) enter as plug-in modules, enabling adaptation to new instruments or domains without altering the core selection learning step. The source code is available at https://github.com/b05611038/ebgmcr_solver.
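To make the generative reformulation concrete, the sketch below fits mixtures as gated non-negative components, with a sparsity penalty pushing unused gates shut so the smallest sufficient component set survives. The sigmoid-gate relaxation and plain gradient updates are stand-ins for the paper's energy-based solver, and all hyperparameters are assumptions.

```python
import numpy as np

def gmcr_sketch(X, max_components=16, iters=500, lr=1e-3, sparsity=1e-2, seed=0):
    """Fit X ~ (C * g) @ S with non-negative S, C and soft gates g in (0, 1);
    the sparsity penalty drives unneeded gates toward zero."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    S = rng.random((max_components, d))       # candidate component patterns
    C = rng.random((n, max_components))       # per-mixture concentrations
    z = np.zeros(max_components)              # gate logits
    for _ in range(iters):
        g = 1.0 / (1.0 + np.exp(-z))
        R = (C * g) @ S - X                   # reconstruction residual
        S -= lr * (C * g).T @ R               # grad of ||R||^2 w.r.t. S (up to 2x)
        C -= lr * (R @ S.T) * g               # grad of ||R||^2 w.r.t. C
        z -= lr * g * (1 - g) * ((C * (R @ S.T)).sum(0) + sparsity)
        S = np.clip(S, 0.0, None)             # non-negativity prior
        C = np.clip(C, 0.0, None)
    g = 1.0 / (1.0 + np.exp(-z))
    return S, C, g > 0.5                      # patterns, concentrations, active set
```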

[368] BubbleOKAN: A Physics-Informed Interpretable Neural Operator for High-Frequency Bubble Dynamics

Yunhao Zhang, Sidharth S. Menon, Lin Cheng, Aswin Gnanaskandan, Ameya D. Jagtap

Main category: cs.LG

TL;DR: Two-step DeepOKAN model using physics-informed neural operators with spline basis functions outperforms traditional methods in capturing high-frequency bubble dynamics across multiple scenarios.

DetailsMotivation: To address spectral bias in deep learning models for bubble dynamics and improve interpretability while accurately capturing high-frequency features that conventional neural operators struggle with.

Method: Developed a two-step DeepONet architecture enhanced with Rowdy adaptive activation functions, then introduced Kolmogorov-Arnold Network (KAN) based DeepOKAN model using spline basis functions combined with radial basis functions (RBF) for trunk networks.

Result: DeepOKAN accurately captures both low- and high-frequency bubble dynamics across three scenarios (Rayleigh-Plesset, Keller-Miksis single radius, and Keller-Miksis multiple radii), outperforming state-of-the-art neural operators including Fourier, Wavelet, OFormer, and Convolutional Neural Operators.

Conclusion: The two-step DeepOKAN with spline basis functions offers superior performance for high-frequency bubble dynamics, addresses spectral bias limitations, provides better interpretability than MLP architectures, and presents a promising alternative to conventional numerical solvers.

Abstract: In this work, we employ physics-informed neural operators to map pressure profiles from an input function space to the corresponding bubble radius responses. Our approach employs a two-step DeepONet architecture. To address the intrinsic spectral bias of deep learning models, our model incorporates the Rowdy adaptive activation function, enhancing the representation of high-frequency features. Moreover, we introduce the Kolmogorov-Arnold network (KAN) based two-step DeepOKAN model, which enhances interpretability (often lacking in conventional multilayer perceptron architectures) while efficiently capturing high-frequency bubble dynamics without explicit use of activation functions in any form. We particularly investigate the use of spline basis functions in combination with radial basis functions (RBF) within our architecture, as they demonstrate superior performance in constructing a universal basis for approximating high-frequency bubble dynamics compared to alternative formulations. Furthermore, we highlight the performance bottleneck of RBFs in learning high-frequency bubble dynamics and showcase the advantage of spline basis functions for the trunk network in overcoming this inherent spectral bias. The model is systematically evaluated across three representative scenarios: (1) bubble dynamics governed by the Rayleigh-Plesset equation with a single initial radius, (2) bubble dynamics governed by the Keller-Miksis equation with a single initial radius, and (3) Keller-Miksis dynamics with multiple initial radii. We also compare our results with state-of-the-art neural operators, including Fourier Neural Operators, Wavelet Neural Operators, OFormer, and Convolutional Neural Operators. Our findings demonstrate that the two-step DeepOKAN accurately captures both low- and high-frequency behaviors, and offers a promising alternative to conventional numerical solvers.

[369] Physics-Informed Time-Integrated DeepONet: Temporal Tangent Space Operator Learning for High-Accuracy Inference

Luis Mandl, Dibyajyoti Nayak, Tim Ricken, Somdatta Goswami

Main category: cs.LG

TL;DR: PITI-DeepONet improves long-term PDE prediction by learning time-derivative operators with physics-informed training, reducing errors by 42-98% compared to traditional methods.

DetailsMotivation: Traditional full rollout methods fail to capture causal dependencies and generalize poorly beyond training horizons, while autoregressive approaches suffer from error accumulation, limiting long-term accuracy for time-dependent PDEs.

Method: A dual-output architecture trained via physics-informed or hybrid objectives learns the time-derivative operator from current state, then integrates using classical time-stepping schemes. Includes residual monitoring for quality estimation and domain transition detection.

Result: Significant error reductions: 84% vs FR and 79% vs AR for 1D heat equation; 87% vs FR and 98% vs AR for 1D Burgers equation; 42% vs FR and 89% vs AR for 2D Allen-Cahn equation.

Conclusion: PITI-DeepONet enables more reliable long-term integration of complex time-dependent PDEs by moving beyond traditional FR and AR schemes through physics-informed operator learning.

Abstract: Accurately modeling and inferring solutions to time-dependent partial differential equations (PDEs) over extended horizons remains a core challenge in scientific machine learning. Traditional full rollout (FR) methods, which predict entire trajectories in one pass, often fail to capture the causal dependencies and generalize poorly outside the training time horizon. Autoregressive (AR) approaches, evolving the system step by step, suffer from error accumulation, limiting long-term accuracy. These shortcomings limit the long-term accuracy and reliability of both strategies. To address these issues, we introduce the Physics-Informed Time-Integrated Deep Operator Network (PITI-DeepONet), a dual-output architecture trained via fully physics-informed or hybrid physics- and data-driven objectives to ensure stable, accurate long-term evolution well beyond the training horizon. Instead of forecasting future states, the network learns the time-derivative operator from the current state, integrating it using classical time-stepping schemes to advance the solution in time. Additionally, the framework can leverage residual monitoring during inference to estimate prediction quality and detect when the system transitions outside the training domain. Applied to benchmark problems, PITI-DeepONet shows improved accuracy over extended inference time horizons when compared to traditional methods. Mean relative $\mathcal{L}_2$ errors reduced by 84% (vs. FR) and 79% (vs. AR) for the one-dimensional heat equation; by 87% (vs. FR) and 98% (vs. AR) for the one-dimensional Burgers equation; and by 42% (vs. FR) and 89% (vs. AR) for the two-dimensional Allen-Cahn equation. By moving beyond classic FR and AR schemes, PITI-DeepONet paves the way for more reliable, long-term integration of complex, time-dependent PDEs.
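The core idea, learning the time-derivative operator and integrating it with a classical scheme, is easy to sketch. Below, `model` is any trained callable mapping a state to its time derivative (an assumption for illustration); RK4 advances the state, and a finite-difference residual gives the kind of inference-time quality signal the abstract mentions.

```python
import numpy as np

def rk4_rollout(model, u0, dt, n_steps):
    """Advance state u0 for n_steps using the learned derivative operator."""
    u = np.asarray(u0, dtype=float)
    traj = [u]
    for _ in range(n_steps):
        k1 = model(u)
        k2 = model(u + 0.5 * dt * k1)
        k3 = model(u + 0.5 * dt * k2)
        k4 = model(u + dt * k3)
        u = u + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        traj.append(u)
    return np.stack(traj)

def residual(model, traj, dt):
    """Inference-time monitor (simplified): compare the learned derivative
    against a finite-difference estimate along the rollout."""
    fd = (traj[1:] - traj[:-1]) / dt
    pred = np.stack([model(u) for u in traj[:-1]])
    return np.linalg.norm(fd - pred, axis=-1)
```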

[370] The Eminence in Shadow: Exploiting Feature Boundary Ambiguity for Robust Backdoor Attacks

Zhou Feng, Jiahao Chen, Chunyi Zhou, Yuwen Pu, Tianyu Du, Jinbao Li, Jianhai Chen, Shouling Ji

Main category: cs.LG

TL;DR: The paper proposes Eminence, an explainable black-box backdoor attack framework with theoretical guarantees that achieves >90% attack success with <0.1% poison rate by exploiting sparse decision boundaries.

DetailsMotivation: Current backdoor attacks rely on heuristic methods without theoretical understanding, limiting predictability and adaptability. There's a need for rigorous analysis of why low poison rates can be effective.

Method: Theoretical analysis reveals sparse decision boundaries enable disproportionate manipulation. Eminence optimizes a universal, visually subtle trigger that strategically exploits vulnerable decision boundaries, using influence function analysis to quantify parameter shifts from margin samples.

Result: Eminence achieves >90% attack success rate with <0.1% poison rate (vs. SOTA requiring >1%), maintains negligible clean-accuracy loss, and demonstrates high transferability across models, datasets, and scenarios.

Conclusion: The work provides theoretical grounding for backdoor attacks, demonstrates an exponential relationship between margin poisoning and boundary manipulation, and offers an explainable framework with provable guarantees for robust, stealthy attacks.

Abstract: Deep neural networks (DNNs) underpin critical applications yet remain vulnerable to backdoor attacks, typically reliant on heuristic brute-force methods. Despite significant empirical advancements in backdoor research, the lack of rigorous theoretical analysis limits understanding of underlying mechanisms, constraining attack predictability and adaptability. Therefore, we provide a theoretical analysis targeting backdoor attacks, focusing on how sparse decision boundaries enable disproportionate model manipulation. Based on this finding, we derive a closed-form, ambiguous boundary region, wherein negligible relabeled samples induce substantial misclassification. Influence function analysis further quantifies significant parameter shifts caused by these margin samples, with minimal impact on clean accuracy, formally grounding why such low poison rates suffice for efficacious attacks. Leveraging these insights, we propose Eminence, an explainable and robust black-box backdoor framework with provable theoretical guarantees and inherent stealth properties. Eminence optimizes a universal, visually subtle trigger that strategically exploits vulnerable decision boundaries and effectively achieves robust misclassification with exceptionally low poison rates (< 0.1%, compared to SOTA methods typically requiring > 1%). Comprehensive experiments validate our theoretical discussions and demonstrate the effectiveness of Eminence, confirming an exponential relationship between margin poisoning and adversarial boundary manipulation. Eminence maintains > 90% attack success rate, exhibits negligible clean-accuracy loss, and demonstrates high transferability across diverse models, datasets and scenarios.

[371] How Reinforcement Learning After Next-Token Prediction Facilitates Learning

Nikolaos Tsilivis, Eran Malach, Karen Ullrich, Julia Kempe

Main category: cs.LG

TL;DR: The paper shows that reinforcement learning after next-token prediction enables transformers to generalize on reasoning tasks like parity prediction, while next-token prediction alone requires extreme resources.

DetailsMotivation: To understand why the current training paradigm (next-token prediction followed by reinforcement learning) succeeds in reasoning domains, and to theoretically expose the optimization mechanisms behind this success.

Method: Introduces a framework to study mixture distributions of short and long chain-of-thought sequences. Uses parity prediction task with rare long sequences. Theoretically analyzes autoregressive linear models and demonstrates with Llama-series models on mathematical reasoning benchmarks.

Result: Reinforcement learning after next-token prediction enables autoregressive transformers to generalize on parity prediction with rare long demonstrations, while next-token prediction alone requires exponential resources. Theoretically proves linear models can efficiently learn parity as long as long demonstrations aren’t exponentially rare.

Conclusion: The paper provides theoretical understanding of why RL after pretraining works for reasoning tasks, showing it leverages test-time computation and can learn from rare long demonstrations, explaining the success of current training paradigms.

Abstract: Recent advances in reasoning domains with neural networks have primarily been enabled by a training recipe that optimizes Large Language Models, previously trained to predict the next token in a sequence, with reinforcement learning algorithms. We introduce a framework to study the success of this paradigm, and we theoretically expose the optimization mechanisms by which reinforcement learning improves over next-token prediction in this setting. We study learning from mixture distributions of short and long "chain-of-thought" sequences encoding a single task. In particular, when the task consists of predicting the parity of $d$ bits and long sequences are rare, we show how reinforcement learning after next-token prediction enables autoregressive transformers to generalize, whereas mere next-token prediction requires extreme statistical or computational resources to do so. We further explain how reinforcement learning leverages increased test-time computation, manifested in longer responses, to facilitate this learning process. In a simplified setting, we theoretically prove that autoregressive linear models following this training recipe can efficiently learn to predict the parity of $d$ bits as long as the proportion of long demonstrations in the data mix is not exponentially small in the input dimension $d$. Finally, we demonstrate these same phenomena in other settings, including the post-training of Llama-series models on mixture variations of common mathematical reasoning benchmarks.
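A minimal sketch of the data distribution studied: parity sequences served as a mixture in which long chain-of-thought demonstrations (running prefix parities) appear with small probability `p_long`. The text formatting is an illustrative assumption.

```python
import random

def parity_example(d, p_long):
    """One training sequence: d input bits, then either just the parity
    (short) or a chain of running prefix parities ending in it (long, rare)."""
    bits = [random.randint(0, 1) for _ in range(d)]
    prompt = " ".join(map(str, bits)) + " :"
    if random.random() < p_long:               # rare long chain-of-thought
        running, chain = 0, []
        for b in bits:
            running ^= b                       # parity of the prefix so far
            chain.append(str(running))
        return prompt + " " + " ".join(chain)
    return prompt + " " + str(sum(bits) % 2)   # short: answer only
```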

[372] QLENS: Towards A Quantum Perspective of Language Transformers

Aditya Gupta, Kirandeep Kaur, Vinayak Gupta, Chirag Shah

Main category: cs.LG

TL;DR: QLENS proposes a quantum mechanics-inspired framework to mathematically model Transformer inference by representing latent activations as quantum states and layers as unitary operators/Hamiltonians.

DetailsMotivation: Current Transformer interpretability methods are limited diagnostic checkpoints without mathematical frameworks for mechanistically modeling how layers facilitate state transitions. The probabilistic nature of language models parallels quantum mechanics, suggesting physics could provide a descriptive mathematical framework.

Method: QLENS converts Transformer latent activations into state vectors in a Hilbert space derived from output units. Hidden layers are reformulated as unitary operators and analogously defined Hamiltonians. The final probability distribution is obtained by applying the Born rule to the end state using a specific measurement operator.

Result: Proof-of-concept demonstration by probing a toy Transformer to investigate individual layers’ influence on prediction trajectory, showing QLENS’s potential for cross-domain insights.

Conclusion: QLENS provides a physics-based perspective on Transformer generation, establishing a foundation for leveraging quantum mechanics insights toward broader understanding of Transformers through cross-domain approaches.

Abstract: In natural language processing, current methods for understanding Transformers are successful at identifying intermediate predictions during a model’s inference. However, these approaches function as limited diagnostic checkpoints, lacking a mathematical framework for mechanistically modeling how each layer facilitates transitions between these evolving states. This interpretability gap and past successes of interdisciplinary outlooks inspire us to turn to physics in search of a descriptive mathematical framework for Transformers. We observe that language models are intrinsically probabilistic, an attribute that is echoed in the core postulates of quantum mechanics. This parallel inspires us to translate insights from this discipline to that of natural language processing. Towards this objective, we propose QLENS, a novel attempt to develop a physics-based perspective on the Transformer generation process. Under QLENS, a Transformer is studied by converting its latent activations into a state vector in a Hilbert space derived from the model’s output units. This state subsequently evolves through hidden layers - reformulated as unitary operators and analogously defined Hamiltonians - during inference. The model’s final probability distribution is obtained by applying the Born rule to the end state using a specific measurement operator. To demonstrate QLENS’s potential, we conduct a proof-of-concept by probing a toy Transformer to investigate the influence of individual layers in a model’s prediction trajectory. We present our work as a foundation for cross-domain insights to be leveraged towards a broader understanding of Transformers.
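The measurement step has a compact form. Assuming the state vector is obtained by projecting the hidden state through the unembedding matrix (an illustrative choice; the paper derives its Hilbert space from the output units), the Born rule reads:

```python
import numpy as np

def born_distribution(hidden, W_unembed):
    """Map a hidden state to amplitudes over output units, normalize to a unit
    vector (a valid state), and square per the Born rule."""
    amp = hidden @ W_unembed          # amplitudes in the output-unit basis
    amp = amp / np.linalg.norm(amp)   # unit-norm state vector
    return amp ** 2                   # non-negative probabilities summing to 1
```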

[373] Amortized Active Generation of Pareto Sets

Daniel M. Steinberg, Asiri Wijesinghe, Rafael Oliveira, Piotr Koniusz, Cheng Soon Ong, Edwin V. Bonilla

Main category: cs.LG

TL;DR: A-GPS is a new framework for online discrete black-box multi-objective optimization that learns a generative model of Pareto sets, supports user preference conditioning, and achieves high sample efficiency without explicit hypervolume computation.

DetailsMotivation: The paper addresses the need for efficient online multi-objective optimization that can incorporate user preferences flexibly and avoid computationally expensive hypervolume calculations while maintaining high sample efficiency.

Method: A-GPS learns a generative model of Pareto sets using a class probability estimator to predict non-dominance relations and condition toward high-performing regions. It introduces preference direction vectors to encode user preferences and updates the model using both Pareto membership and alignment with these preferences, creating an amortized generative model.

Result: The method achieves high-quality Pareto set approximations, avoids explicit hypervolume computation, and effectively captures user preferences. Empirical results on synthetic benchmarks and protein design tasks demonstrate strong sample efficiency and effective preference incorporation.

Conclusion: A-GPS provides a simple yet powerful approach for online discrete black-box multi-objective optimization that combines efficient Pareto set generation with flexible user preference conditioning, offering practical advantages for real-world applications like protein design.

Abstract: We introduce active generation of Pareto sets (A-GPS), a new framework for online discrete black-box multi-objective optimization (MOO). A-GPS learns a generative model of the Pareto set that supports a-posteriori conditioning on user preferences. The method employs a class probability estimator (CPE) to predict non-dominance relations and to condition the generative model toward high-performing regions of the search space. We also show that this non-dominance CPE implicitly estimates the probability of hypervolume improvement (PHVI). To incorporate subjective trade-offs, A-GPS introduces preference direction vectors that encode user-specified preferences in objective space. At each iteration, the model is updated using both Pareto membership and alignment with these preference directions, producing an amortized generative model capable of sampling across the Pareto front without retraining. The result is a simple yet powerful approach that achieves high-quality Pareto set approximations, avoids explicit hypervolume computation, and flexibly captures user preferences. Empirical results on synthetic benchmarks and protein design tasks demonstrate strong sample efficiency and effective preference incorporation.
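A minimal sketch of the two signals the update combines: Pareto membership and alignment of a candidate's objective vector with a user preference direction. The cosine-alignment weighting and its strength are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def non_dominated(Y):
    """Boolean mask of Pareto-optimal rows of Y (maximization convention)."""
    mask = np.ones(len(Y), dtype=bool)
    for i in range(len(Y)):
        dominated_by = (Y >= Y[i]).all(axis=1) & (Y > Y[i]).any(axis=1)
        if dominated_by.any():
            mask[i] = False
    return mask

def agps_weights(Y, pref_dir, align_weight=1.0):
    """Weight each candidate by Pareto membership times (1 + preference
    alignment), so the generative model is steered toward preferred regions."""
    pref = pref_dir / np.linalg.norm(pref_dir)
    align = (Y / np.linalg.norm(Y, axis=1, keepdims=True)) @ pref
    return non_dominated(Y) * (1.0 + align_weight * np.clip(align, 0.0, None))
```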

[374] Tab-PET: Graph-Based Positional Encodings for Tabular Transformers

Yunze Leng, Rohan Ghosh, Mehul Motani

Main category: cs.LG

TL;DR: Positional encodings (PEs) can improve tabular transformer performance by reducing feature effective rank and dimensionality, with graph-based PEs showing significant gains across 50 datasets.

DetailsMotivation: Tabular data lacks inherent structural cues that vision/language data have, making self-attention less effective. Existing tabular transformers don't use positional encodings due to no prior structural information, but PEs could potentially improve generalization.

Method: Proposes Tab-PET, a graph-based framework for estimating and incorporating positional encodings into tabular transformers. Uses two graph estimation paradigms: association-based and causality-based, inspired by graph topology approaches for deriving PEs.

Result: Graph-derived PEs significantly improve performance across 50 classification and regression datasets for existing tabular transformers (TabTransformer, SAINT, FT-Transformer). Association-based graphs yield more stable and pronounced gains than causality-driven ones.

Conclusion: Positional encodings play an unexpected but valuable role in tabular transformers by reducing effective feature rank and dimensionality, improving generalization. Graph-based PEs provide structural cues that enhance tabular transformer performance.

Abstract: Supervised learning with tabular data presents unique challenges, including low data sizes, the absence of structural cues, and heterogeneous features spanning both categorical and continuous domains. Unlike vision and language tasks, where models can exploit inductive biases in the data, tabular data lacks inherent positional structure, hindering the effectiveness of self-attention mechanisms. While recent transformer-based models like TabTransformer, SAINT, and FT-Transformer (which we refer to as 3T) have shown promise on tabular data, they typically operate without leveraging structural cues such as positional encodings (PEs), as no prior structural information is usually available. In this work, we find both theoretically and empirically that structural cues, specifically PEs, can be a useful tool to improve generalization performance for tabular transformers. We find that PEs impart the ability to reduce the effective rank (a form of intrinsic dimensionality) of the features, effectively simplifying the task and yielding improved generalization. To that end, we propose Tab-PET (PEs for Tabular Transformers), a graph-based framework for estimating and incorporating PEs into embeddings. Inspired by approaches that derive PEs from graph topology, we explore two paradigms for graph estimation: association-based and causality-based. We empirically demonstrate that graph-derived PEs significantly improve performance across 50 classification and regression datasets for 3T. Notably, association-based graphs consistently yield more stable and pronounced gains compared to causality-driven ones. Our work highlights an unexpected role of PEs in tabular transformers, revealing how they can be harnessed to improve generalization.
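For the association-based paradigm, one plausible concrete recipe (an assumption for illustration; the paper explores several graph estimators) is to build a feature-feature graph from absolute correlations and take Laplacian eigenvectors as per-column PEs:

```python
import numpy as np

def tabular_positional_encodings(X, pe_dim=4, thresh=0.3):
    """One PE vector per feature/column of X; assumes non-constant columns.
    The threshold and PE dimension are illustrative choices."""
    corr = np.abs(np.corrcoef(X, rowvar=False))   # feature association strengths
    A = (corr > thresh).astype(float)             # threshold into a graph
    np.fill_diagonal(A, 0.0)
    L = np.diag(A.sum(axis=1)) - A                # graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, 1:pe_dim + 1]               # skip the trivial eigenvector
```

Each row of the result would be added to (or concatenated with) the corresponding feature's embedding before self-attention.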

[375] Arithmetic-Intensity-Aware Quantization

Taig Singh, Shreshth Rajan, Nikhil Jain

Main category: cs.LG

TL;DR: AIQ is a mixed precision quantization framework that optimizes per-layer bit-widths to maximize arithmetic intensity and minimize accuracy loss, improving inference throughput for memory-bound neural networks.

DetailsMotivation: Modern neural networks are becoming memory-bound, where inference throughput is limited by DRAM bandwidth rather than computational power. This creates a need for quantization methods that can reduce memory bandwidth requirements while maintaining accuracy.

Method: AIQ is a post-training quantization method that uses search algorithms to find optimal per-layer quantization schemes. It minimizes a weighted loss function that balances arithmetic intensity (AI) and accuracy, selecting different bit-widths for each layer.

Result: On ResNet-20/CIFAR-10, AIQ increases arithmetic intensity by ~50% over FP32 baseline while keeping test accuracy within ~1 percentage point, outperforming global uniform quantization. On MobileNetV2, AIQ configurations achieve 1.66x higher throughput than FP32 baseline while maintaining accuracy within 1 percentage point.

Conclusion: AIQ effectively addresses memory-bound inference by intelligently allocating quantization precision across layers, naturally quantizing larger layers more aggressively to maximize arithmetic intensity and throughput while minimizing accuracy degradation.

Abstract: As modern neural networks become increasingly memory-bound, inference throughput is limited by DRAM bandwidth rather than compute. We present Arithmetic-Intensity-Aware Quantization (AIQ), a mixed precision quantization framework that chooses per-layer bit-widths to maximize arithmetic intensity (AI) while minimizing accuracy loss. AIQ is a post-training quantization method that uses search algorithms over per-layer quantization schemes to minimize a weighted loss over AI and accuracy. On ResNet-20/CIFAR-10, AIQ increases AI by ~50% over an FP32 baseline while keeping test accuracy within ~1 percentage point, and outperforming global uniform quantization schemes. On a memory-bound MobileNetV2 architecture, AIQ configurations give a 1.66x higher throughput than the FP32 baseline while keeping test accuracy within 1 percentage point. We also find that AIQ naturally quantizes larger layers more aggressively.
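A minimal sketch of the search loop: enumerate per-layer bit-widths, score each configuration by a weighted loss over arithmetic intensity and accuracy, and keep the best. The exhaustive enumeration and the loss form are illustrative, and `evaluate` (accuracy of the model quantized with the given bits) is an assumed callable; a real search would prune or use heuristics since the space grows exponentially with depth.

```python
import itertools

def arithmetic_intensity(flops_per_layer, params_per_layer, bits):
    """FLOPs per byte moved from DRAM under the chosen per-layer bit-widths."""
    total_flops = sum(flops_per_layer)
    total_bytes = sum(p * b / 8 for p, b in zip(params_per_layer, bits))
    return total_flops / total_bytes

def aiq_search(flops, params, evaluate, bit_options=(4, 8, 16), lam=0.1):
    best, best_loss = None, float("inf")
    for bits in itertools.product(bit_options, repeat=len(params)):
        ai = arithmetic_intensity(flops, params, bits)
        acc = evaluate(bits)            # accuracy of this quantized config
        loss = (1.0 - acc) + lam / ai   # weighted trade-off (illustrative form)
        if loss < best_loss:
            best, best_loss = bits, loss
    return best
```

Note how the byte term makes large layers dominate the denominator, which is consistent with the abstract's observation that AIQ quantizes larger layers more aggressively.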

[376] Multi-Agent Pointer Transformer: Seq-to-Seq Reinforcement Learning for Multi-Vehicle Dynamic Pickup-Delivery Problems

Zengyu Zou, Jingyuan Wang, Yixuan Huang, Junjie Wu

Main category: cs.LG

TL;DR: Proposes MAPT, a Transformer-based end-to-end framework for cooperative multi-vehicle dynamic pickup and delivery with stochastic requests, outperforming existing methods in performance and computational efficiency.

DetailsMotivation: Classical operations research methods struggle with computational complexity and time efficiency for large-scale dynamic routing problems. Existing RL methods have limitations: independent decoding fails to model joint action distributions, feature extraction networks struggle with inter-entity relationships, and joint action spaces are exponentially large.

Method: Multi-Agent Pointer Transformer (MAPT) framework with Transformer Encoder for entity representations, Transformer Decoder with Pointer Network for autoregressive joint action sequence generation, Relation-Aware Attention module to capture inter-entity relationships, and informative priors to guide decision-making.

Result: Experiments on 8 datasets show MAPT significantly outperforms existing baseline methods in performance and exhibits substantial computational time advantages compared to classical operations research methods.

Conclusion: MAPT provides an effective end-to-end centralized decision-making solution for cooperative multi-vehicle dynamic pickup and delivery problems with stochastic requests, addressing key challenges in joint action modeling and computational efficiency.

Abstract: This paper addresses the cooperative Multi-Vehicle Dynamic Pickup and Delivery Problem with Stochastic Requests (MVDPDPSR) and proposes an end-to-end centralized decision-making framework based on sequence-to-sequence, named Multi-Agent Pointer Transformer (MAPT). MVDPDPSR is an extension of the vehicle routing problem and a spatio-temporal system optimization problem, widely applied in scenarios such as on-demand delivery. Classical operations research methods face bottlenecks in computational complexity and time efficiency when handling large-scale dynamic problems. Although existing reinforcement learning methods have achieved some progress, they still encounter several challenges: 1) Independent decoding across multiple vehicles fails to model joint action distributions; 2) The feature extraction network struggles to capture inter-entity relationships; 3) The joint action space is exponentially large. To address these issues, we designed the MAPT framework, which employs a Transformer Encoder to extract entity representations, combines a Transformer Decoder with a Pointer Network to generate joint action sequences in an autoregressive manner, and introduces a Relation-Aware Attention module to capture inter-entity relationships. Additionally, we guide the model’s decision-making using informative priors to facilitate effective exploration. Experiments on 8 datasets demonstrate that MAPT significantly outperforms existing baseline methods in terms of performance and exhibits substantial computational time advantages compared to classical operations research methods.
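The joint-action decoding is the easiest piece to sketch: each vehicle, in turn, points at one pending request via attention scores, with already-claimed requests masked so the joint sequence stays feasible. The dot-product scoring below stands in for the trained Transformer decoder, and greedy argmax stands in for sampling.

```python
import numpy as np

def decode_joint_actions(vehicle_emb, request_emb):
    """Greedy autoregressive pointer decoding; assumes at least as many
    pending requests as vehicles."""
    assigned = np.zeros(len(request_emb), dtype=bool)
    actions = []
    for v in vehicle_emb:                # one pointer step per vehicle
        scores = request_emb @ v         # attention logits over requests
        scores[assigned] = -np.inf       # mask already-claimed requests
        choice = int(np.argmax(scores))  # training would sample the softmax
        assigned[choice] = True
        actions.append(choice)
    return actions
```

Masking inside the loop is what lets the model represent a joint action distribution rather than independent per-vehicle choices.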

[377] SEA: Spectral Edge Attack

Yongyu Wang

Main category: cs.LG

TL;DR: Proposes a spectral adversarial robustness evaluation method to identify vulnerable edges in graphs for maximum attack impact with minimal perturbation.

DetailsMotivation: Graph-based ML algorithms are vulnerable to attacks/perturbations that degrade performance. The challenge is achieving strong attack effectiveness while remaining undetected.

Method: Uses spectral adversarial robustness evaluation to quantitatively analyze vulnerability of each edge in a graph under attack, precisely targeting weakest links.

Result: Experimental results demonstrate the effectiveness of the proposed method in achieving maximum attack impact with minimal perturbation.

Conclusion: The proposed approach successfully addresses the challenge of strong attack effectiveness while remaining undetected by targeting the most vulnerable edges in graphs.

Abstract: Graph-based machine learning algorithms occupy an important position in today's AI landscape. The ability of graph topology to represent complex data structures is both the key strength of graph algorithms and a source of their vulnerability. In other words, attacking or perturbing a graph can severely degrade the performance of graph-based methods. For attack methods, the greatest challenge is achieving strong attack effectiveness while remaining undetected. To address this problem, this paper proposes a new attack model that employs spectral adversarial robustness evaluation to quantitatively analyze the vulnerability of each edge in a graph under attack. By precisely targeting the weakest links, the proposed approach achieves the maximum attack impact with minimal perturbation. Experimental results demonstrate the effectiveness of the proposed method.
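The abstract does not specify the spectral evaluation, so the sketch below uses one plausible stand-in: embed nodes with low-frequency Laplacian eigenvectors and rank edges by the distance between their endpoints' embeddings, treating spectrally strained edges as candidates for the "weakest links". This is purely an illustrative assumption.

```python
import numpy as np

def edge_vulnerability_ranking(A, k=4):
    """Rank edges of adjacency matrix A (symmetric, zero diagonal) by a
    spectral score; larger score = more spectrally strained endpoints."""
    L = np.diag(A.sum(axis=1)) - A                 # graph Laplacian
    _, vecs = np.linalg.eigh(L)
    emb = vecs[:, 1:k + 1]                         # low-frequency node embedding
    edges = np.argwhere(np.triu(A, k=1) > 0)       # each undirected edge once
    scores = {tuple(e): np.linalg.norm(emb[e[0]] - emb[e[1]]) for e in edges}
    return sorted(scores, key=scores.get, reverse=True)
```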

[378] Improving Recursive Transformers with Mixture of LoRAs

Mohammadmahdi Nouriborji, Morteza Rohanian, Omid Rohanian

Main category: cs.LG

TL;DR: MoL (Mixture of LoRAs) adds lightweight conditional computation to recursive transformers via LoRA experts in shared FFNs, restoring expressivity lost from parameter sharing while maintaining compact model size.

DetailsMotivation: Parameter sharing in recursive transformers reduces model size but collapses layer-wise expressivity, creating a need for mechanisms that restore expressivity without significantly increasing parameters.

Method: Proposes Mixture of LoRAs (MoL) - lightweight conditional computation inserting LoRA experts inside shared feed-forward networks, enabling token-conditional weight-space modulation without untying backbone parameters. Also introduces ModernALBERT with rotary embeddings, GeGLU, FlashAttention, and distillation-based initialization.

Result: ModernALBERT (50M-120M) achieves SOTA performance among compact models on GLUE, SQuAD-v2, and BEIR, surpassing larger fully parameterized baselines. Expert-merging procedure compresses MoL into single adapter at inference while preserving accuracy.

Conclusion: Conditional weight-space modulation via MoL effectively restores expressivity lost under aggressive parameter sharing in recursive transformers, enabling efficient deployment through expert-merging compression.

Abstract: Parameter sharing in recursive transformers reduces model size but collapses layer-wise expressivity. We propose Mixture of LoRAs (MoL), a lightweight conditional-computation mechanism that inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). MoL enables token-conditional weight-space modulation of the shared FFN without untying backbone parameters, unlike prior approaches that add fixed or externally attached adapters. We pretrain a modernised recursive architecture, ModernALBERT, integrating rotary embeddings, GeGLU, FlashAttention, and a distillation-based initialisation. Across GLUE, SQuAD-v2, and BEIR, ModernALBERT (50M–120M) achieves state-of-the-art performance among compact models and surpasses larger fully parameterised baselines. We also propose an expert-merging procedure that compresses MoL into a single adapter at inference while preserving accuracy, enabling efficient deployment. Our results show that conditional weight-space modulation effectively restores the expressivity lost under aggressive parameter sharing in recursive transformers.
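A minimal PyTorch sketch of the mechanism: a shared FFN whose output is modulated by a token-routed mixture of low-rank (LoRA) updates. The softmax router, the ranks, and the single-layer FFN are illustrative simplifications of the paper's architecture.

```python
import torch
import torch.nn as nn

class MoLFFN(nn.Module):
    """Shared (tied) FFN modulated by a token-conditional mixture of LoRAs."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=4, rank=8):
        super().__init__()
        self.shared = nn.Linear(d_model, d_ff)        # tied across recursions
        self.router = nn.Linear(d_model, n_experts)   # token-conditional gates
        self.A = nn.Parameter(torch.randn(n_experts, d_model, rank) * 0.02)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d_ff))

    def forward(self, x):  # x: (batch, seq, d_model)
        gates = torch.softmax(self.router(x), dim=-1)           # (batch, seq, E)
        # Per-expert low-rank updates (x @ A_e) @ B_e, mixed by the gates.
        lora = torch.einsum("bsd,edr,erf->bsef", x, self.A, self.B)
        delta = torch.einsum("bse,bsef->bsf", gates, lora)
        return torch.relu(self.shared(x) + delta)
```

Because only the adapters and router vary per token, the backbone weights stay tied, which is the point: expressivity returns without untying the recursion.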

[379] Dual-Axis RCCL: Representation-Complete Convergent Learning for Organic Chemical Space

Dejun Hu, Zhiming Li, Jia-Rui Shen, Jia-Ning Tu, Zi-Hao Ye, Junliang Zhang

Main category: cs.LG

TL;DR: The paper introduces a Dual-Axis Representation-Complete Convergent Learning (RCCL) strategy and FD25 dataset to achieve convergent learning across chemical space, enabling strong generalization in molecular property prediction.

DetailsMotivation: Despite machine learning's impact on molecular modeling, it's unclear whether models can achieve convergent learning across the vast chemical space (10^30-10^60 molecules). There's a need for principled approaches to ensure representation completeness and generalization.

Method: Developed a Dual-Axis RCCL strategy combining: 1) GCN encoding of local valence environments based on valence bond theory, and 2) no-bridge graph (NBG) encoding of ring/cage topologies. Created FD25 dataset systematically covering 13,302 local valence units and 165,726 ring/cage topologies for H/C/N/O/F organic molecules.

Result: Graph neural networks trained on FD25 achieve representation-complete convergent learning with strong out-of-distribution generalization, achieving ~1.0 kcal/mol MAE across external benchmarks. The framework establishes quantitative links between representation, structural completeness, and model generalization.

Conclusion: The RCCL framework provides a principled foundation for interpretable, transferable, and data-efficient molecular intelligence by formalizing representation completeness and enabling convergent learning across chemical space.

Abstract: Machine learning is profoundly reshaping molecular and materials modeling; however, given the vast scale of chemical space (10^30-10^60), it remains an open scientific question whether models can achieve convergent learning across this space. We introduce a Dual-Axis Representation-Complete Convergent Learning (RCCL) strategy, enabled by a molecular representation that integrates graph convolutional network (GCN) encoding of local valence environments, grounded in modern valence bond theory, together with no-bridge graph (NBG) encoding of ring/cage topologies, providing a quantitative measure of chemical-space coverage. This framework formalizes representation completeness, establishing a principled basis for constructing datasets that support convergent learning for large models. Guided by this RCCL framework, we develop the FD25 dataset, systematically covering 13,302 local valence units and 165,726 ring/cage topologies, achieving near-complete combinatorial coverage of organic molecules with H/C/N/O/F elements. Graph neural networks trained on FD25 exhibit representation-complete convergent learning and strong out-of-distribution generalization, with an overall prediction error of approximately 1.0 kcal/mol MAE across external benchmarks. Our results establish a quantitative link between molecular representation, structural completeness, and model generalization, providing a foundation for interpretable, transferable, and data-efficient molecular intelligence.

[380] Hierarchical Persistence Velocity for Network Anomaly Detection: Theory and Applications to Cryptocurrency Markets

Omid Khormali

Main category: cs.LG

TL;DR: OW-HNPV is a new topological data analysis method for anomaly detection in time-varying networks that measures the rate of feature appearance/disappearance in persistence diagrams, with proven mathematical stability and superior performance for cryptocurrency anomaly detection.

DetailsMotivation: Existing topological methods measure cumulative topological presence, but there's a need for velocity-based approaches that capture the rate of topological feature changes, which is crucial for detecting structural anomalies in dynamic networks.

Method: Introduces Overlap-Weighted Hierarchical Normalized Persistence Velocity (OW-HNPV), a velocity-based perspective on persistence diagrams that measures the rate at which topological features appear and disappear, with automatic noise reduction through overlap-based weighting.

Result: Applied to Ethereum transaction networks (May 2017-May 2018), OW-HNPV achieved up to 10.4% AUC gain over baseline models for 7-day price movement predictions, outperforming established methods like VAB, persistence landscapes, and persistence images, especially for medium- to long-range forecasting (4-7 days).

Conclusion: Modeling topological velocity is crucial for detecting structural anomalies in dynamic networks, with OW-HNPV providing the most consistent and stable performance across prediction horizons while being mathematically stable.

Abstract: We introduce the Overlap-Weighted Hierarchical Normalized Persistence Velocity (OW-HNPV), a novel topological data analysis method for detecting anomalies in time-varying networks. Unlike existing methods that measure cumulative topological presence, we introduce the first velocity-based perspective on persistence diagrams, measuring the rate at which features appear and disappear, automatically downweighting noise through overlap-based weighting. We also prove that OW-HNPV is mathematically stable. It behaves in a controlled, predictable way, even when comparing persistence diagrams from networks with different feature types. Applied to Ethereum transaction networks (May 2017-May 2018), OW-HNPV demonstrates superior performance for cryptocurrency anomaly detection, achieving up to 10.4% AUC gain over baseline models for 7-day price movement predictions. Compared with established methods, including Vector of Averaged Bettis (VAB), persistence landscapes, and persistence images, velocity-based summaries excel at medium- to long-range forecasting (4-7 days), with OW-HNPV providing the most consistent and stable performance across prediction horizons. Our results show that modeling topological velocity is crucial for detecting structural anomalies in dynamic networks.
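In spirit (the paper's actual hierarchical normalization and overlap weighting are more elaborate), a velocity summary counts topological features that appear or disappear between consecutive snapshots, weighting each by its overlap with the filtration window so short-lived noise contributes little. The sketch below is an illustrative simplification.

```python
def persistence_velocity(prev_diagram, curr_diagram, window=1.0):
    """Diagrams are sets of (birth, death) pairs for consecutive snapshots.
    Velocity sums features born plus features dying, each weighted by how
    much of its lifespan overlaps the [0, window] filtration range."""
    def overlap(feature):
        birth, death = feature
        return max(0.0, min(death, window) - max(birth, 0.0))
    born = curr_diagram - prev_diagram     # features appearing
    died = prev_diagram - curr_diagram     # features disappearing
    return sum(map(overlap, born)) + sum(map(overlap, died))
```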

cs.MA

[381] Mapis: A Knowledge-Graph Grounded Multi-Agent Framework for Evidence-Based PCOS Diagnosis

Zanxiang He, Meng Li, Liyun Shi, Weiye Daia, Liming Nie

Main category: cs.MA

TL;DR: Mapis is a knowledge-grounded multi-agent framework for guideline-based PCOS diagnosis that outperforms existing methods by simulating clinical workflows and using a comprehensive knowledge graph.

DetailsMotivation: PCOS affects 10% of reproductive-aged women, but existing ML/DL tools lack interpretability and require large labeled datasets. Medical multi-agent systems are too general and lack domain-specific knowledge for PCOS diagnosis.

Method: Mapis builds on the 2023 International Guideline to create a structured collaborative workflow with specialized agents: gynecological endocrine and radiology agents verify inclusion criteria, while an exclusion agent rules out other causes. Uses a comprehensive PCOS knowledge graph for evidence-based decisions.

Result: Outperforms 9 baselines on public benchmarks and clinical datasets. On clinical data: surpasses traditional ML by 13.56%, single-agent by 6.55%, and previous medical multi-agent systems by 7.05% in Accuracy.

Conclusion: Mapis is the first knowledge-grounded multi-agent framework for guideline-based PCOS diagnosis, effectively addressing limitations of existing methods through domain-specific knowledge integration and interpretable clinical workflow simulation.

Abstract: Polycystic Ovary Syndrome (PCOS) constitutes a significant public health issue affecting 10% of reproductive-aged women, highlighting the critical importance of developing effective diagnostic tools. Previous machine learning and deep learning detection tools are constrained by their reliance on large-scale labeled data and a lack of interpretability. Although multi-agent systems have demonstrated robust capabilities, the potential of such systems for PCOS detection remains largely unexplored. Existing medical multi-agent frameworks are predominantly designed for general medical tasks, suffering from insufficient domain integration and a lack of specific domain knowledge. To address these challenges, we propose Mapis, the first knowledge-grounded multi-agent framework explicitly designed for guideline-based PCOS diagnosis. Specifically, it builds the 2023 International Guideline into a structured collaborative workflow that simulates the clinical diagnostic process. It decouples complex diagnostic tasks across specialized agents: a gynecological endocrine agent and a radiology agent collaborate to verify inclusion criteria, while an exclusion agent strictly rules out other causes. Furthermore, we construct a comprehensive PCOS knowledge graph to ensure verifiable, evidence-based decision-making. Extensive experiments on public benchmarks and specialized clinical datasets, benchmarking against nine diverse baselines, demonstrate that Mapis significantly outperforms competitive methods. On the clinical dataset, it surpasses traditional machine learning models by 13.56%, single-agent by 6.55%, and previous medical multi-agent systems by 7.05% in Accuracy.

[382] MedChat: A Multi-Agent Framework for Multimodal Diagnosis with Large Language Models

Philip R. Liu, Sparsh Bansal, Jimmy Dinh, Aditya Pawar, Ramani Satishkumar, Shail Desai, Neeraj Gupta, Xin Wang, Shu Hu

Main category: cs.MA

TL;DR: MedChat is a multi-agent diagnostic framework that combines specialized vision models with multiple role-specific LLM agents to improve glaucoma detection and clinical reporting, addressing limitations of single-agent approaches in medical imaging.

DetailsMotivation: To address challenges in applying general LLMs to medical imaging, including hallucinations, limited interpretability, insufficient domain knowledge, and the inability to emulate multidisciplinary medical team reasoning, while improving glaucoma detection and clinical reporting efficiency.

Method: Proposes MedChat, a multi-agent framework with specialized vision models and multiple role-specific LLM agents coordinated by a director agent, enabling interactive diagnostic reporting through a clinical review interface.

Result: The framework enhances reliability, reduces hallucination risk, and provides interactive diagnostic reporting suitable for both clinical review and educational use.

Conclusion: MedChat offers a promising approach to overcome limitations of single-agent LLM systems in medical imaging by emulating multidisciplinary team reasoning through coordinated specialized agents.

Abstract: The integration of deep learning-based glaucoma detection with large language models (LLMs) presents an automated strategy to mitigate ophthalmologist shortages and improve clinical reporting efficiency. However, applying general LLMs to medical imaging remains challenging due to hallucinations, limited interpretability, and insufficient domain-specific medical knowledge, which can potentially reduce clinical accuracy. Although recent approaches combining imaging models with LLM reasoning have improved reporting, they typically rely on a single generalist agent, restricting their capacity to emulate the diverse and complex reasoning found in multidisciplinary medical teams. To address these limitations, we propose MedChat, a multi-agent diagnostic framework and platform that combines specialized vision models with multiple role-specific LLM agents, all coordinated by a director agent. This design enhances reliability, reduces hallucination risk, and enables interactive diagnostic reporting through an interface tailored for clinical review and educational use. Code available at https://github.com/Purdue-M2/MedChat.

[383] Parallelism Meets Adaptiveness: Scalable Documents Understanding in Multi-Agent LLM Systems

Chengxuan Xia, Qianye Wu, Sixuan Tian, Yilun Hao

Main category: cs.MA

TL;DR: A coordination framework for LLM agents with dynamic task routing, bidirectional feedback, and parallel agent evaluation to improve adaptiveness in collaborative tasks.

DetailsMotivation: Existing multi-agent frameworks rely on static workflows, fixed roles, and limited communication, reducing effectiveness in open-ended, high-complexity domains.

Method: Proposes a coordination framework with three core mechanisms: dynamic task routing (reallocation based on confidence/workload), bidirectional feedback (structured critiques for iterative improvement), and parallel agent evaluation (competition on ambiguous subtasks with evaluator-driven selection).

Result: Demonstrates substantial improvements in factual coverage, coherence, and efficiency over static and partially adaptive baselines.

Conclusion: Highlights the benefits of incorporating both adaptiveness and structured competition in multi-agent LLM systems.

Abstract: Large language model (LLM) agents have shown increasing promise for collaborative task completion. However, existing multi-agent frameworks often rely on static workflows, fixed roles, and limited inter-agent communication, reducing their effectiveness in open-ended, high-complexity domains. This paper proposes a coordination framework that enables adaptiveness through three core mechanisms: dynamic task routing, bidirectional feedback, and parallel agent evaluation. The framework allows agents to reallocate tasks based on confidence and workload, exchange structured critiques to iteratively improve outputs, and crucially compete on high-ambiguity subtasks with evaluator-driven selection of the most suitable result. We instantiate these principles in a modular architecture and demonstrate substantial improvements in factual coverage, coherence, and efficiency over static and partially adaptive baselines. Our findings highlight the benefits of incorporating both adaptiveness and structured competition in multi-agent LLM systems.
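The first mechanism, dynamic task routing, reduces to a small scoring rule. In this hedged sketch, each agent exposes a self-reported confidence and a running workload, and tasks go to the agent with the highest confidence-per-load; the dictionary schema and scoring form are assumptions for illustration.

```python
def route_task(task, agents):
    """agents: list of dicts with keys 'name', 'load', and 'confidence'
    (a callable mapping a task to a self-reported score in [0, 1])."""
    def score(agent):
        return agent["confidence"](task) / (1.0 + agent["load"])
    best = max(agents, key=score)
    best["load"] += 1           # routing updates the workload it balances
    return best["name"]
```

Bidirectional feedback and parallel evaluation would layer on top: losing candidates from the competitive subtasks feed structured critiques back to the winner for revision.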

cs.MM

[384] A Preprocessing Framework for Video Machine Vision under Compression

Fei Zhao, Mengxi Guo, Shijie Zhao, Junlin Li, Li Zhang, Xiaodong Xie

Main category: cs.MM

TL;DR: A video preprocessing framework for machine vision tasks that boosts rate-accuracy performance by retaining crucial information for downstream tasks, saving over 15% bitrate compared to standard codecs.

DetailsMotivation: Most video coding optimization methods focus on minimizing human perceptual distortion, overlooking the specific demands of machine vision systems that require different information retention for accurate analysis.

Method: Proposes a neural preprocessor that retains crucial information for machine vision tasks, combined with a differentiable virtual codec to provide rate and distortion constraints during training, while using standard codecs for testing.

Result: Extensive experiments on two typical downstream tasks with various backbone networks show the approach saves over 15% of bitrate compared to using only standard codec anchor versions.

Conclusion: The proposed video preprocessing framework effectively addresses machine vision demands, provides significant bitrate savings while maintaining accuracy, and can be easily applied to real-world scenarios using standard codecs.

Abstract: There has been a growing trend in compressing and transmitting videos from terminals for machine vision tasks. Nevertheless, most video coding optimization methods focus on minimizing distortion according to human perceptual metrics, overlooking the heightened demands posed by machine vision systems. In this paper, we propose a video preprocessing framework tailored for machine vision tasks to address this challenge. The proposed method incorporates a neural preprocessor that retains crucial information for subsequent tasks, boosting rate-accuracy performance. We further introduce a differentiable virtual codec to provide constraints on rate and distortion during the training stage. We directly apply widely used standard codecs for testing. Therefore, our solution can be easily applied to real-world scenarios. We conducted extensive experiments evaluating our compression method on two typical downstream tasks with various backbone networks. The experimental results indicate that our approach can save over 15% of bitrate compared to using only the standard codec anchor version.

[385] One Size Doesn’t Fit All: Age-Aware Gamification Mechanics for Multimedia Learning Environments

Sarah Kaißer, Markus Kleffmann, Kristina Schaaff

Main category: cs.MM

TL;DR: Paper proposes age-aware gamification design principles for digital learning based on literature review, showing gamification needs differentiation across age groups for effectiveness.

DetailsMotivation: Current gamification in digital learning often neglects age-related differences, failing to address diverse motivational and cognitive needs across different age groups.

Method: Conducted targeted literature review to map relationships between age groups, gamification mechanics, and their effects, then derived design principles and technical patterns.

Result: Developed five design principles for age-specific gamification and three technical patterns for implementation in multimedia learning environments.

Conclusion: Gamification is not universally effective and requires differentiated, age-aware design to support engagement and inclusivity across the lifespan.

Abstract: Gamification is widely used in digital learning. However, most systems neglect age-related differences. This paper investigates how gamification can be designed in an age-aware way to address learners’ diverse motivational and cognitive needs. Based on a targeted literature review, we present a mapping of age groups, mechanics, and effects. Furthermore, we derive five design principles for age-specific gamification and identify three technical patterns for implementation in multimedia learning environments. The results indicate that gamification is not universally effective, but rather requires a differentiated design to support engagement and inclusivity across the lifespan.

eess.AS

[386] On the Use of Self-Supervised Representation Learning for Speaker Diarization and Separation

Séverin Baroudi, Hervé Bredin, Joseph Razik, Ricard Marxer

Main category: eess.AS

TL;DR: The paper investigates how well recent self-supervised speech models (like wav2vec2.0 and WavLM) perform on speaker diarization and speech separation tasks, identifying gaps in current evaluation benchmarks due to limited dataset diversity and system variety.

DetailsMotivation: While self-supervised speech models have improved many downstream speech tasks, their evaluation on speaker identity-related tasks like diarization and separation remains limited. There are gaps in current literature due to limitations in existing benchmarks, particularly lack of diversity in evaluation datasets and variety in downstream systems.

Method: The paper investigates the quality of recent self-supervised speech representations (wav2vec2.0, WavLM) on speaker diarization and speech separation tasks. It analyzes current evaluation practices and identifies limitations in existing benchmarks.

Result: The paper highlights gaps in current literature stemming from limitations in existing benchmarks, particularly the lack of diversity in evaluation datasets and variety in downstream systems for both diarization and separation tasks.

Conclusion: More comprehensive evaluation of self-supervised speech models on speaker identity tasks is needed, with greater dataset diversity and system variety in benchmarks to better understand their capabilities and limitations.

Abstract: Self-supervised speech models such as wav2vec2.0 and WavLM have been shown to significantly improve the performance of many downstream speech tasks, especially in low-resource settings, over the past few years. Despite this, evaluations on tasks such as Speaker Diarization and Speech Separation remain limited. This paper investigates the quality of recent self-supervised speech representations on these two speaker identity-related tasks, highlighting gaps in the current literature that stem from limitations in the existing benchmarks, particularly the lack of diversity in evaluation datasets and variety in downstream systems associated with both diarization and separation.

eess.IV

[387] Magnification-Aware Distillation (MAD): A Self-Supervised Framework for Unified Representation Learning in Gigapixel Whole-Slide Images

Mahmut S. Gokmen, Mitchell A. Klusty, Peter T. Nelson, Allison M. Neltner, Sen-Ching Samson Cheung, Thomas M. Pearce, David A Gutman, Brittany N. Dugger, Devavrat S. Bisht, Margaret E. Flanagan, V. K. Cody Bumgardner

Main category: eess.IV

TL;DR: MAD-NP is a self-supervised foundation model that learns resolution-invariant representations from whole-slide images by linking low-magnification context with high-magnification detail through cross-scale distillation.

DetailsMotivation: Current self-supervised methods treat different magnification levels in whole-slide images as independent views, preventing models from learning representations that remain stable when resolution changes. This is problematic for practical neuropathology workflows where consistent analysis across magnifications is essential.

Method: Magnification-Aware Distillation (MAD) - a self-supervised strategy that links low-magnification context with spatially aligned high-magnification detail. The model learns cross-scale correspondences without annotations, enabling it to understand how coarse tissue structure relates to fine cellular patterns.

Result: MAD-NP achieves strong resolution-invariant representation learning: a linear classifier trained only on 10x embeddings maintains 96.7% of its performance when applied to unseen 40x tiles. Segmentation outputs remain consistent across magnifications, preserving anatomical boundaries and minimizing noise.

Conclusion: The study demonstrates the feasibility of scalable, magnification-robust whole-slide image analysis using a unified embedding space, enabling practical neuropathology workflows with consistent performance across different resolution levels.

Abstract: Whole-slide images (WSIs) contain tissue information distributed across multiple magnification levels, yet most self-supervised methods treat these scales as independent views. This separation prevents models from learning representations that remain stable when resolution changes, a key requirement for practical neuropathology workflows. This study introduces Magnification-Aware Distillation (MAD), a self-supervised strategy that links low-magnification context with spatially aligned high-magnification detail, enabling the model to learn how coarse tissue structure relates to fine cellular patterns. The resulting foundation model, MAD-NP, is trained entirely through this cross-scale correspondence without annotations. A linear classifier trained only on 10x embeddings maintains 96.7% of its performance when applied to unseen 40x tiles, demonstrating strong resolution-invariant representation learning. Segmentation outputs remain consistent across magnifications, preserving anatomical boundaries and minimizing noise. These results highlight the feasibility of scalable, magnification-robust WSI analysis using a unified embedding space.

[388] Artificial Intelligence for the Assessment of Peritoneal Carcinosis during Diagnostic Laparoscopy for Advanced Ovarian Cancer

Riccardo Oliva, Farahdiba Zarin, Alice Zampolini Faustini, Armine Vardazaryan, Andrea Rosati, Vinkle Srivastav, Nunzia Del Villano, Jacques Marescaux, Giovanni Scambia, Pietro Mascagni, Nicolas Padoy, Anna Fagotti

Main category: eess.IV

TL;DR: AI model automates Fagotti score estimation from diagnostic laparoscopy videos for advanced ovarian cancer, providing objective surgical feasibility assessment.

DetailsMotivation: Current Fagotti score assessment is subjective and operator-dependent, limiting reproducibility and widespread use in guiding treatment planning for advanced ovarian cancer with peritoneal carcinosis.

Method: Retrospective collection of DL videos with FS assessments, manual annotation of FS-relevant frames, training deep learning models to identify relevant frames, segment anatomical structures and peritoneal carcinosis, and predict video-level FS and indication to surgery.
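
As a hedged illustration of the final stage only, here is a sketch of pooling per-frame predictions into a video-level Fagotti score and thresholding an indication-to-surgery label; the mean pooling and the cutoff of 8 are assumptions for illustration, not the paper's trained classifiers.

```python
# Hypothetical video-level aggregation; pooling rule and cutoff are assumptions.
import torch

def video_level_prediction(frame_fs: torch.Tensor, cutoff: float = 8.0):
    fs = frame_fs.mean().item()   # pool per-frame Fagotti score estimates
    its = fs < cutoff             # indication to surgery (assumed decision rule)
    return fs, its
```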

Result: Segmentation achieved Dice scores of 70±3% for anatomical structures and 56±3% for PC. Video-level classification achieved F1-scores of 74±3% and 73±4%, FS prediction showed normalized RMSE of 1.39±0.18 and 1.15±0.08, and ItS reached F1-scores of 80±8% and 80±2% in development and test datasets respectively.

Conclusion: First AI model to predict cytoreductive surgery feasibility with automated FS estimation from DL videos, showing reproducible performance that can support surgeons through standardized intraoperative tumor burden assessment and clinical decision-making.

Abstract: Advanced Ovarian Cancer (AOC) is often diagnosed at an advanced stage with peritoneal carcinosis (PC). Fagotti score (FS) assessment at diagnostic laparoscopy (DL) guides treatment planning by estimating surgical resectability, but its subjective and operator-dependent nature limits reproducibility and widespread use. Videos of patients undergoing DL with concomitant FS assessments at a referral center were retrospectively collected and divided into a development dataset, for data annotation, AI training and evaluation, and an independent test dataset, for internal validation. In the development dataset, FS-relevant frames were manually annotated for anatomical structures and PC. Deep learning models were trained to automatically identify FS-relevant frames, segment structures and PC, and predict video-level FS and indication to surgery (ItS). AI performance was evaluated using Dice score for segmentation, F1-scores for anatomical stations (AS) and ItS prediction, and root mean square error (RMSE) for final FS estimation. In the development dataset, the segmentation model trained on 7,311 frames, achieved Dice scores of 70$\pm$3% for anatomical structures and 56$\pm$3% for PC. Video-level AS classification achieved F1-scores of 74$\pm$3% and 73$\pm$4%, FS prediction showed normalized RMSE values of 1.39$\pm$0.18 and 1.15$\pm$0.08, and ItS reached F1-scores of 80$\pm$8% and 80$\pm$2% in the development (n=101) and independent test datasets (n=50), respectively. This is the first AI model to predict the feasibility of cytoreductive surgery providing automated FS estimation from DL videos. Its reproducible and reliable performance across datasets suggests that AI can support surgeons through standardized intraoperative tumor burden assessment and clinical decision-making in AOC.

[389] Deep learning water-unsuppressed MRSI at ultra-high field for simultaneous quantitative metabolic, susceptibility and myelin water imaging

Paul J. Weiser, Jiye Kim, Jongho Lee, Amirmohammad Shamaei, Gulnur Ungan, Malte Hoffmann, Antoine Klauser, Berkin Bilgic, Ovidiu C. Andronesi

Main category: eess.IV

TL;DR: Deep learning pipeline enables simultaneous water-unsuppressed MRSI for metabolic, QSM, and myelin water fraction imaging at 7T with 2mm resolution in 12 minutes.

DetailsMotivation: Water-unsuppressed MRSI allows simultaneous imaging of water and metabolites but faces challenges from large water sidebands that interfere with metabolic fitting, especially at ultra-high field strengths.

Method: Developed WALINET+ deep learning network to remove lipids, water signal, and sidebands from 7T wu-MRSI data acquired with ECCENTRIC sampling and ultra-short TE. Used physics-informed networks for reconstruction and metabolite fitting, leveraging water signal for absolute quantification, QSM, and MWF.
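
A minimal sketch of the nuisance-removal idea, with an assumed toy architecture (the published WALINET+ design is not reproduced here): a 1D convolutional network predicts the water/lipid/sideband component of each spectrum so it can be subtracted, leaving the metabolite signal.

```python
# Toy nuisance-removal network; architecture and sizes are assumptions.
import torch
import torch.nn as nn

class NuisanceRemover(nn.Module):
    def __init__(self, channels: int = 2):  # real/imaginary parts of a spectrum
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, 64, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(64, channels, kernel_size=9, padding=4),
        )

    def forward(self, spectrum: torch.Tensor) -> torch.Tensor:
        # Predict the water/lipid/sideband component and subtract it
        return spectrum - self.net(spectrum)
```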

Result: WALINET+ achieved <2% NRMSE in simulations and <20% bias with ±63% limits-of-agreement between wu-MRSI and water-suppressed MRSI. Some metabolites showed higher SNR in wu-MRSI. QSM and MWF from wu-MRSI matched GRE results with minimal bias (0 ppm/5.5%) and good agreement (±0.05 ppm/±12.75%).

Conclusion: High-quality simultaneous metabolic, QSM, and MWF brain mapping is achievable at 7T with 2mm isotropic resolution in 12 minutes using ECCENTRIC wu-MRSI and WALINET+, eliminating need for water suppression and separate water acquisitions.

Abstract: Purpose: Magnetic Resonance Spectroscopic Imaging (MRSI) maps endogenous brain metabolism while suppressing the overwhelming water signal. Water-unsuppressed MRSI (wu-MRSI) allows simultaneous imaging of water and metabolites, but large water sidebands cause challenges for metabolic fitting. We developed an end-to-end deep-learning pipeline to overcome these challenges at ultra-high field. Methods: Fast high-resolution wu-MRSI was acquired at 7T with non-cartesian ECCENTRIC sampling and ultra-short echo time. A water and lipid removal network (WALINET+) was developed to remove lipids, water signal, and sidebands. MRSI reconstruction was performed by DeepER and a physics-informed network for metabolite fitting. Water signal was used for absolute metabolite quantification, quantitative susceptibility mapping (QSM), and myelin water fraction imaging (MWF). Results: WALINET+ provided the lowest NRMSE (<2%) in simulations and, in vivo, the smallest bias (<20%) and limits-of-agreement (±63%) between wu-MRSI and water-suppressed MRSI (ws-MRSI) scans. Several metabolites such as creatine and glutamate showed higher SNR in wu-MRSI. QSM and MWF obtained from wu-MRSI and GRE showed good agreement, with 0 ppm/5.5% bias and ±0.05 ppm/±12.75% limits-of-agreement. Conclusion: High-quality metabolic, QSM, and MWF mapping of the human brain can be obtained simultaneously by ECCENTRIC wu-MRSI at 7T with 2 mm isotropic resolution in 12 min. WALINET+ robustly removes water sidebands while preserving metabolite signal, eliminating the need for water suppression and separate water acquisitions.

[390] Audio-Visual Cross-Modal Compression for Generative Face Video Coding

Youmin Xu, Mengxi Guo, Shijie Zhao, Weiqi Li, Junlin Li, Li Zhang, Jian Zhang

Main category: eess.IV

TL;DR: Audio-Visual Cross-Modal Compression (AVCC) framework that jointly compresses audio and video by exploiting their correlation, achieving better rate-distortion performance than VVC and GFVC methods.

DetailsMotivation: Existing generative face video coding methods focus only on video motion and ignore audio's significant bitrate contribution, despite the known correlation between audio and lip movements. This cross-modal coherence hasn't been systematically exploited for compression.

Method: Proposes AVCC framework that extracts motion from video and tokenizes audio features, then aligns them through unified audio-video diffusion process for synchronized reconstruction from shared representation. Can reconstruct one modality from another in low-rate scenarios.
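
A highly simplified, hypothetical sketch of the shared-representation idea: motion latents from video and tokenized audio features are concatenated and denoised jointly by a single diffusion network, so both modalities can be reconstructed from one shared state. Every module name and dimension here is a placeholder, not the AVCC architecture.

```python
# Placeholder joint denoiser; all names and dimensions are assumptions.
import torch
import torch.nn as nn

class JointDenoiser(nn.Module):
    def __init__(self, motion_dim: int = 128, audio_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + audio_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, motion_dim + audio_dim),
        )

    def forward(self, noisy_joint: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on the diffusion timestep and predict noise for the
        # concatenated (motion, audio) state, from which both streams decode
        t_emb = t.float().unsqueeze(-1)
        return self.net(torch.cat([noisy_joint, t_emb], dim=-1))
```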

Result: AVCC significantly outperforms Versatile Video Coding (VVC) standard and state-of-the-art GFVC schemes in rate-distortion performance.

Conclusion: The framework paves the way for more efficient multimodal communication systems by effectively exploiting audio-visual correlation for compression.

Abstract: Generative face video coding (GFVC) is vital for modern applications like video conferencing, yet existing methods primarily focus on video motion while neglecting the significant bitrate contribution of audio. Despite the well-established correlation between audio and lip movements, this cross-modal coherence has not been systematically exploited for compression. To address this, we propose an Audio-Visual Cross-Modal Compression (AVCC) framework that jointly compresses audio and video streams. Our framework extracts motion information from video and tokenizes audio features, then aligns them through a unified audio-video diffusion process. This allows synchronized reconstruction of both modalities from a shared representation. In extremely low-rate scenarios, AVCC can even reconstruct one modality from the other. Experiments show that AVCC significantly outperforms the Versatile Video Coding (VVC) standard and state-of-the-art GFVC schemes in rate-distortion performance, paving the way for more efficient multimodal communication systems.

[391] A Gaussian Parameterization for Direct Atomic Structure Identification in Electron Tomography

Nalini M. Singh, Tiffany Chien, Arthur R. C. McCray, Colin Ophus, Laura Waller

Main category: eess.IV

TL;DR: Direct atomic structure determination from electron tomography using Gaussian parameterization instead of volumetric reconstruction.

DetailsMotivation: Classical tomography algorithms create intermediate volumetric representations that require post-processing to extract atomic structures, which can be suboptimal and sensitive to imaging artifacts.

Method: Reformulate tomography as direct atomic structure recovery by parameterizing structures as collections of Gaussians with learnable positions and properties, incorporating strong physical priors.
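
A minimal toy sketch of the parameterization under strong simplifications (a single orthographic projection, isotropic Gaussians, synthetic data); a real AET fit uses many tilt angles and a physical projection model, so treat this only as an illustration of optimizing atom positions and amplitudes directly against projection measurements.

```python
# Toy "Gaussian atoms" fit; projection model and data are synthetic assumptions.
import torch

n_atoms, grid, sigma = 50, 64, 1.5
pos = torch.nn.Parameter(torch.rand(n_atoms, 2) * grid)  # learnable atom positions
amp = torch.nn.Parameter(torch.ones(n_atoms))            # learnable amplitudes

ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")

def render(p, a):
    # Splat isotropic Gaussians onto the image grid (one projection)
    d2 = (xs[None] - p[:, 0, None, None]) ** 2 + (ys[None] - p[:, 1, None, None]) ** 2
    return (a[:, None, None] * torch.exp(-d2 / (2 * sigma**2))).sum(0)

measured = render(pos.detach() + 2.0, amp.detach())      # synthetic "measurement"
opt = torch.optim.Adam([pos, amp], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = ((render(pos, amp) - measured) ** 2).mean()   # match the projection
    loss.backward()
    opt.step()
```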

Result: Improved robustness to real-world imaging artifacts, validated through simulated experiments and proof-of-concept experimental data with Transmission Electron Microscopy.

Conclusion: The Gaussian atoms approach shows practical potential for materials characterization and analysis, offering a more direct and robust method for atomic structure determination from electron tomography.

Abstract: Atomic electron tomography (AET) enables the determination of 3D atomic structures by acquiring a sequence of 2D tomographic projection measurements of a particle and then computationally solving for its underlying 3D representation. Classical tomography algorithms solve for an intermediate volumetric representation that is post-processed into the atomic structure of interest. In this paper, we reformulate the tomographic inverse problem to solve directly for the locations and properties of individual atoms. We parameterize an atomic structure as a collection of Gaussians, whose positions and properties are learnable. This representation imparts a strong physical prior on the learned structure, which we show yields improved robustness to real-world imaging artifacts. Simulated experiments and a proof-of-concept result on experimentally-acquired data confirm our method’s potential for practical applications in materials characterization and analysis with Transmission Electron Microscopy (TEM). Our code is available at https://github.com/nalinimsingh/gaussian-atoms.

[392] Meta-learners for few-shot weakly-supervised optic disc and cup segmentation on fundus images

Pandega Abyan Zumarsyah, Igi Ardiyanto, Hanung Adi Nugroho

Main category: eess.IV

TL;DR: Meta-learners for few-shot weakly-supervised segmentation of optic disc/cup achieve state-of-the-art performance with minimal labeled data using Omni meta-training and efficient architectures.

DetailsMotivation: Address the challenge of optic disc and optic cup segmentation for glaucoma diagnosis with limited labeled fundus images, overcoming data scarcity issues in medical imaging.

Method: Developed meta-learners with Omni meta-training (balanced data usage, diversified shots), efficient versions to reduce computational costs, and sparsification techniques for customizable scribbles/sparse labels.
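
A minimal sketch of the prototypical-segmentation mechanism that ProtoSeg-style meta-learners build on, under assumed shapes and a cosine metric: class prototypes are averaged from sparsely labeled support pixels (e.g., scribbles), and each query pixel takes the label of its nearest prototype.

```python
# Illustrative prototype matching; shapes, label coding, and metric are assumptions.
import torch
import torch.nn.functional as F

def proto_segment(support_feats, support_mask, query_feats):
    # support_feats/query_feats: (C, H, W); support_mask: (H, W) with values
    # -1 (unlabeled), 0 (background), 1 (optic disc), 2 (optic cup)
    C = support_feats.shape[0]
    flat = support_feats.reshape(C, -1).t()                  # (H*W, C)
    labels = support_mask.reshape(-1)
    # Assumes each class has at least one scribbled support pixel
    protos = torch.stack([flat[labels == k].mean(0) for k in (0, 1, 2)])
    q = query_feats.reshape(C, -1).t()
    sims = F.cosine_similarity(q[:, None], protos[None], dim=-1)
    return sims.argmax(-1).reshape(query_feats.shape[1:])    # (H, W) labels
```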

Result: Efficient Omni ProtoSeg (EO-ProtoSeg) achieves IoU scores of 88.15% (OD) and 71.17% (OC) on REFUGE with just one sparsely labeled image, outperforming few-shot/semi-supervised methods requiring more labels.

Conclusion: EO-ProtoSeg provides comparable performance to unsupervised domain adaptation methods but is much lighter (<2M parameters) and requires no retraining, making it practical for clinical glaucoma diagnosis with limited labeled data.

Abstract: This study develops meta-learners for few-shot weakly-supervised segmentation (FWS) to address the challenge of optic disc (OD) and optic cup (OC) segmentation for glaucoma diagnosis with limited labeled fundus images. We significantly improve existing meta-learners by introducing Omni meta-training, which balances data usage and diversifies the number of shots. We also develop their efficient versions that reduce computational costs. In addition, we develop sparsification techniques that generate more customizable and representative scribbles and other sparse labels. After evaluating multiple datasets, we find that the Omni and efficient versions outperform the original versions, with the best meta-learner being Efficient Omni ProtoSeg (EO-ProtoSeg). It achieves intersection over union (IoU) scores of 88.15% for OD and 71.17% for OC on the REFUGE dataset using just one sparsely labeled image, outperforming few-shot and semi-supervised methods that require more labeled images. Its best performance reaches 86.80% for OD and 71.78% for OC on DRISHTI-GS, 88.21% for OD and 73.70% for OC on REFUGE, 80.39% for OD and 52.65% for OC on REFUGE. EO-ProtoSeg is comparable to unsupervised domain adaptation methods yet much lighter, with fewer than two million parameters, and does not require any retraining.

[393] Generative Preprocessing for Image Compression with Pre-trained Diffusion Models

Mengxi Guo, Shijie Zhao, Junlin Li, Li Zhang

Main category: eess.IV

TL;DR: This paper proposes a novel Rate-Perception optimized compression preprocessing method using a distilled diffusion model, achieving significant perceptual quality improvements without modifying standard codecs.

DetailsMotivation: Existing compression preprocessing methods are predominantly Rate-Distortion optimized and constrained by pixel-level fidelity, lacking focus on perceptual quality. The authors aim to pioneer Rate-Perception optimization for better visual quality.

Method: A two-stage framework: 1) Distill multi-step Stable Diffusion 2.1 into a compact one-step model using Consistent Score Identity Distillation (CiD), 2) Parameter-efficient fine-tuning of attention modules guided by Rate-Perception loss and differentiable codec surrogate.
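
A minimal sketch of the second stage under stated assumptions: everything in the distilled model is frozen except parameters whose names suggest attention modules, and training minimizes a perceptual term plus a differentiable rate estimate. The name heuristic and loss weighting are illustrative, not the paper's exact recipe.

```python
# Hypothetical PEFT setup and R-P objective; details are assumptions.
import torch

def prepare_attention_peft(model: torch.nn.Module) -> None:
    for name, param in model.named_parameters():
        # Unfreeze only attention-module weights (name heuristic is assumed)
        param.requires_grad = "attn" in name

def rate_perception_loss(perceptual_dist: torch.Tensor,
                         estimated_bits: torch.Tensor,
                         lam: float = 0.01) -> torch.Tensor:
    # Perceptual distortion (e.g., DISTS) plus a differentiable rate proxy
    return perceptual_dist + lam * estimated_bits
```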

Result: Achieves substantial Rate-Perception gains with up to 30.13% BD-rate reduction in DISTS on the Kodak dataset, delivering superior subjective visual quality while seamlessly integrating with standard codecs.

Conclusion: The work successfully shifts compression preprocessing from Rate-Distortion to Rate-Perception optimization, leveraging diffusion model priors to enhance texture and mitigate artifacts without codec modifications.

Abstract: Preprocessing is a well-established technique for optimizing compression, yet existing methods are predominantly Rate-Distortion (R-D) optimized and constrained by pixel-level fidelity. This work pioneers a shift towards Rate-Perception (R-P) optimization by, for the first time, adapting a large-scale pre-trained diffusion model for compression preprocessing. We propose a two-stage framework: first, we distill the multi-step Stable Diffusion 2.1 into a compact, one-step image-to-image model using Consistent Score Identity Distillation (CiD). Second, we perform a parameter-efficient fine-tuning of the distilled model’s attention modules, guided by a Rate-Perception loss and a differentiable codec surrogate. Our method seamlessly integrates with standard codecs without any modification and leverages the model’s powerful generative priors to enhance texture and mitigate artifacts. Experiments show substantial R-P gains, achieving up to a 30.13% BD-rate reduction in DISTS on the Kodak dataset and delivering superior subjective visual quality.

[394] Deep Learning-Driven Quantitative Spectroscopic Photoacoustic Imaging for Segmentation and Oxygen Saturation Estimation

Ruibo Shang, Sidhartha Jandhyala, Yujia Wu, Kevin Hoffer-Hawlik, Austin Van Namen, Matthew O’Donnell, Geoffrey P. Luke

Main category: eess.IV

TL;DR: Hybrid-Net deep learning model simultaneously estimates blood oxygenation (sO2) and segments blood vessels from spectroscopic photoacoustic imaging without requiring optical fluence estimation.

DetailsMotivation: Spectroscopic photoacoustic imaging can estimate blood oxygenation noninvasively, but it requires accurate optical fluence estimates, which are difficult to obtain in heterogeneous tissue where absorption and scattering vary with wavelength.

Method: Developed Hybrid-Net deep neural network that simultaneously estimates sO2 and segments blood vessels. Trained first on simulated sPA data from 3D Monte Carlo simulations of light transport in breast tissue (700nm & 850nm), then retrained on experimental sPA data from tissue-mimicking phantoms with embedded blood pools.
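
A minimal sketch of the key training detail, masking the sO2 regression loss to vessel pixels; for simplicity the mask here comes from the reference segmentation, and the loss weighting is an assumption.

```python
# Illustrative joint loss; the weighting and use of the reference mask are assumptions.
import torch
import torch.nn.functional as F

def hybrid_loss(so2_pred, so2_true, seg_logits, seg_true, alpha: float = 1.0):
    # seg_true: float tensor in {0, 1} marking vessel pixels
    seg_loss = F.binary_cross_entropy_with_logits(seg_logits, seg_true)
    vessel = seg_true.bool()
    # sO2 error counted only inside vessel pixels, mirroring the paper's idea
    so2_loss = F.mse_loss(so2_pred[vessel], so2_true[vessel]) if vessel.any() \
        else so2_pred.sum() * 0.0
    return seg_loss + alpha * so2_loss
```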

Result: Achieved segmentation accuracy ≥0.978 in simulations with varying noise (0dB-35dB) and 0.998 in experiments; sO2 mean squared error ≤0.048 in simulations and 0.003 in experiments.

Conclusion: Hybrid-Net provides accurate blood oxygenation estimation without optical fluence calculation, potentially improving in-vivo sO2 estimation for clinical applications.

Abstract: Spectroscopic photoacoustic (sPA) imaging can potentially estimate blood oxygen saturation (sO2) in vivo noninvasively. However, quantitatively accurate results require accurate optical fluence estimates. Robust modeling in heterogeneous tissue, where light at different wavelengths can experience significantly different absorption and scattering, is difficult. In this work, we developed a deep neural network (Hybrid-Net) for sPA imaging to simultaneously estimate sO2 in blood vessels and segment those vessels from surrounding background tissue. sO2 error was minimized only within the blood vessels segmented by Hybrid-Net, resulting in more accurate predictions. Hybrid-Net was first trained on simulated sPA data (at 700 nm and 850 nm) representing initial pressure distributions from three-dimensional Monte Carlo simulations of light transport in breast tissue. Then, for experimental verification, the network was retrained on experimental sPA data (at 700 nm and 850 nm) acquired from simple tissue-mimicking phantoms with an embedded blood pool. Quantitative measures were used to evaluate Hybrid-Net performance, with an average segmentation accuracy of ≥0.978 in simulations with varying noise levels (0 dB-35 dB) and 0.998 in the experiment, and an average sO2 mean squared error of ≤0.048 in simulations with varying noise levels (0 dB-35 dB) and 0.003 in the experiment. Overall, these results show that Hybrid-Net can provide accurate blood oxygenation without estimating the optical fluence, and this study could lead to improvements in in-vivo sO2 estimation.

[395] Nine Years of Pediatric Iris Recognition: Evidence for Biometric Permanence

Naveenkumar G Venkataswamy, Masudul H Imtiaz, Stephanie Schuckers

Main category: eess.IV

TL;DR: Pediatric iris recognition remains stable for up to 9 years with proper imaging, supporting extended re-enrollment periods of 10-12 years for children enrolled at age 7+.

DetailsMotivation: Despite widespread use of iris recognition for children in national ID programs (India's Aadhaar, Canada's NEXUS), biometric permanence in pediatric populations remains poorly understood, creating uncertainty about appropriate re-enrollment policies.

Method: Longitudinal study of 276 subjects enrolled at ages 4-12, followed for up to 9 years through adolescence using 18,318 near-infrared iris images. Evaluated commercial (VeriEye) and open-source (OpenIris) systems with linear mixed-effects models controlling for image quality, pupil dilation, and physiological factors.
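
As a sketch of the statistical machinery (file and column names are assumptions about a hypothetical long-format score table), a linear mixed-effects model with a per-subject random intercept can separate enrollment age and elapsed time while adjusting for quality and dilation:

```python
# Hypothetical mixed-effects fit; file and column names are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("match_scores.csv")  # long format: one row per comparison
model = smf.mixedlm(
    "match_score ~ enroll_age + elapsed_years + quality + pupil_dilation",
    data=df,
    groups=df["subject_id"],  # random intercept per child
)
print(model.fit().summary())
```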

Result: False non-match rates remained below 0.5% over nine years with pediatric-calibrated thresholds. Image quality and pupil dilation constancy dominated performance (dilation effects reached 3.0-3.5 standard deviations), substantially exceeding temporal aging effects. Failures concentrated in 9.4% of subjects with persistent acquisition issues rather than accumulating over time.

Conclusion: Iris recognition remains viable throughout childhood with proper imaging control. Findings justify extending conservative re-enrollment policies to 10-12 year validity periods for high-quality enrollments at ages 7+, as acquisition conditions rather than biometric aging are the primary limitation.

Abstract: Biometric permanence in pediatric populations remains poorly understood despite widespread deployment of iris recognition for children in national identity programs such as India’s Aadhaar and trusted traveler programs like Canada’s NEXUS. This study presents a comprehensive longitudinal evaluation of pediatric iris recognition, analyzing 276 subjects enrolled between ages 4-12 and followed up to nine years through adolescence. Using 18,318 near-infrared iris images acquired semi-annually, we evaluated commercial (VeriEye) and open-source (OpenIris) systems through linear mixed-effects models that disentangle enrollment age, developmental maturation, and elapsed time while controlling for image quality and physiological factors. False non-match rates remained below 0.5% across the nine-year period for both matchers using pediatric-calibrated thresholds, approaching adult-level performance. However, we reveal significant algorithm-dependent temporal behaviors: VeriEye’s apparent decline reflects developmental confounding across enrollment cohorts rather than genuine template aging, while OpenIris exhibits modest but genuine temporal aging (0.5 standard deviations over eight years). Image quality and pupil dilation constancy dominated longitudinal performance, with dilation effects reaching 3.0-3.5 standard deviations, substantially exceeding temporal factors. Failures concentrated in 9.4% of subjects with persistent acquisition challenges rather than accumulating with elapsed time, confirming acquisition conditions as the primary limitation. These findings justify extending conservative re-enrollment policies, potentially to 10-12 year validity periods for high-quality enrollments at ages 7+, and demonstrate iris recognition remains viable throughout childhood and adolescence with proper imaging control.

[396] An Open-Source Framework for Quality-Assured Smartphone-Based Visible Light Iris Recognition

Naveenkumar G. Venkataswamy, Yu Liu, Soumyabrata Dey, Stephanie Schuckers, Masudul H. Imtiaz

Main category: eess.IV

TL;DR: CUVIRIS dataset enables visible spectrum iris recognition on smartphones using standardized capture protocols and lightweight models, achieving high accuracy with OSIRIS (97.9% TAR) and IrisFormer (0.057% EER).

DetailsMotivation: Smartphone-based visible spectrum iris recognition faces challenges due to lighting variability, pigmentation effects, and lack of standardized capture protocols, limiting practical deployment despite being low-cost and accessible.

Method: 1) Created CUVIRIS dataset with 752 ISO/IEC 29794-6 compliant iris images from 47 subjects using custom Android app with real-time framing and quality feedback. 2) Developed LightIrisNet, a MobileNetV3-based multi-task segmentation model for on-device deployment. 3) Adapted IrisFormer transformer-based matcher to visible spectrum domain.
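
A minimal sketch of the kind of real-time sharpness gate such a capture app can enforce, using the standard variance-of-Laplacian focus measure; the threshold value is an assumption, since the app's actual criterion is not specified here.

```python
# Illustrative focus check; the threshold value is an assumption.
import cv2

def is_sharp_enough(image_bgr, threshold: float = 100.0) -> bool:
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    focus_measure = cv2.Laplacian(gray, cv2.CV_64F).var()  # variance of Laplacian
    return focus_measure >= threshold
```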

Result: OSIRIS achieved 97.9% TAR at FAR=0.01 (EER=0.76%) on CUVIRIS. IrisFormer trained only on UBIRIS.v2 achieved 0.057% EER. Released Android app, LightIrisNet, trained IrisFormer weights, and subset of CUVIRIS dataset for reproducibility.

Conclusion: With standardized acquisition protocols and VIS-adapted lightweight models, accurate iris recognition on commodity smartphones is feasible under controlled conditions, bringing this biometric modality closer to practical deployment.

Abstract: Smartphone-based iris recognition in the visible spectrum (VIS) offers a low-cost and accessible biometric alternative but remains a challenge due to lighting variability, pigmentation effects, and the limited adoption of standardized capture protocols. In this work, we present CUVIRIS, a dataset of 752 ISO/IEC 29794-6 compliant iris images from 47 subjects, collected with a custom Android application that enforces real-time framing, sharpness assessment, and quality feedback. We further introduce LightIrisNet, a MobileNetV3-based multi-task segmentation model optimized for on-device deployment. In addition, we adapt IrisFormer, a transformer-based matcher, to the VIS domain. We evaluate OSIRIS and IrisFormer under a standardized protocol and benchmark against published CNN baselines reported in prior work. On CUVIRIS, the open-source OSIRIS system achieves a TAR of 97.9% at FAR = 0.01 (EER = 0.76%), while IrisFormer, trained only on the UBIRIS.v2 dataset, achieves an EER of 0.057%. To support reproducibility, we release the Android application, LightIrisNet, trained IrisFormer weights, and a subset of the CUVIRIS dataset. These results show that, with standardized acquisition and VIS-adapted lightweight models, accurate iris recognition on commodity smartphones is feasible under controlled conditions, bringing this modality closer to practical deployment.

[397] Radiomics and Clinical Features in Predictive Modelling of Brain Metastases Recurrence

Ines Faria, Matheus Silva, Crystian Saraiva, Jose Soares, Victor Alves

Main category: eess.IV

TL;DR: AI-based radiomics approach using multimodal CT/MRI imaging and clinical data to predict brain metastasis recurrence after radiotherapy, showing feasibility despite small sample size limitations.

DetailsMotivation: Brain metastases affect 20-40% of cancer patients, and early prediction of recurrence after radiotherapy could enable timely clinical intervention and improve patient outcomes.

Method: Retrospective study of 97 patients (53 after inclusion criteria) with pre-treatment and first follow-up CT/MRI scans. Used radiomics feature extraction, delta radiomics for temporal changes, multiple machine learning classifiers, and analysis of treatment planning vs delivered dose discrepancies.
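
A minimal sketch of the delta-radiomics step, assuming feature tables indexed by patient (e.g., as produced by a tool such as PyRadiomics): the relative change of each feature between the pre-treatment and first follow-up scans.

```python
# Illustrative delta-radiomics computation; the table layout is an assumption.
import numpy as np
import pandas as pd

def delta_features(pre: pd.DataFrame, post: pd.DataFrame) -> pd.DataFrame:
    # Rows: patients; columns: radiomic features. Guard zero-valued baselines.
    base = pre.replace(0, np.nan)
    return (post - pre) / base
```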

Result: Demonstrated feasibility of radiomics-based ensemble models for recurrence prediction, suggesting potential association between radiation dose discrepancies and recurrence risk despite limitations of small sample size and class imbalance.

Conclusion: Supports further investigation of AI-driven tools for clinical decision-making in brain metastasis management, highlighting the potential of multimodal imaging and radiomics for recurrence prediction.

Abstract: Brain metastases affect approximately 20% to 40% of cancer patients and are commonly treated with radiotherapy or radiosurgery. Early prediction of recurrence following treatment could enable timely clinical intervention and improve patient outcomes. This study proposes an artificial intelligence-based approach for predicting brain metastasis recurrence using multimodal imaging and clinical data. A retrospective cohort of 97 patients was collected, including Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) acquired before treatment and at first follow-up, together with relevant clinical variables. Image preprocessing included CT windowing and artifact reduction, MRI enhancement, and multimodal CT-MRI registration. After applying inclusion criteria, 53 patients were retained for analysis. Radiomics features were extracted from the imaging data, and delta radiomics was employed to characterize temporal changes between pre-treatment and follow-up scans. Multiple machine learning classifiers were trained and evaluated, including an analysis of discrepancies between treatment planning target volumes and delivered isodose volumes. Despite limitations related to sample size and class imbalance, the results demonstrate the feasibility of radiomics-based models, particularly ensemble models, for recurrence prediction and suggest a potential association between radiation dose discrepancies and recurrence risk. This work supports further investigation of AI-driven tools to assist clinical decision-making in brain metastasis management.

[398] From Pretraining to Privacy: Federated Ultrasound Foundation Model with Self-Supervised Learning

Yuncheng Jiang, Chun-Mei Feng, Jinke Ren, Jun Wei, Zixun Zhang, Yiwen Hu, Yunbi Liu, Rui Sun, Xuemei Tang, Juan Du, Xiang Wan, Yong Xu, Bo Du, Xin Gao, Guangyu Wang, Shaohua Zhou, Shuguang Cui, Zhen Li

Main category: eess.IV

TL;DR: UltraFedFM is a privacy-preserving ultrasound foundation model trained via federated learning across 16 institutions, achieving expert-level diagnostic accuracy while protecting patient data.

DetailsMotivation: Traditional ultrasound diagnostics rely heavily on physician expertise and suffer from suboptimal image quality, leading to potential errors. Existing AI methods require large labeled datasets (raising privacy concerns) and are task-specific, limiting clinical utility.

Method: Collaborative pre-training using federated learning across 16 distributed medical institutions in 9 countries, leveraging over 1 million ultrasound images covering 19 organs and 10 ultrasound modalities. This privacy-preserving approach avoids centralizing sensitive patient data.
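
A minimal FedAvg-style sketch of the aggregation step behind such federated pre-training: each institution trains locally, and the server averages parameters weighted by local dataset size. This illustrates the generic protocol, not UltraFedFM's exact training recipe.

```python
# Generic FedAvg aggregation; not the paper's specific implementation.
import torch

def fedavg(state_dicts: list[dict], num_samples: list[int]) -> dict:
    total = sum(num_samples)
    avg = {}
    for key in state_dicts[0]:
        # Weighted average of each parameter across institutions
        avg[key] = sum(
            sd[key].float() * (n / total) for sd, n in zip(state_dicts, num_samples)
        )
    return avg
```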

Result: Achieves average AUROC of 0.927 for disease diagnosis and DSC of 0.878 for lesion segmentation. Surpasses mid-level ultrasonographers (4-8 years experience) and matches expert-level sonographers (10+ years experience) in diagnosing 8 common systemic diseases.

Conclusion: UltraFedFM significantly enhances clinical diagnostics while safeguarding patient privacy, representing a major advancement in AI-driven ultrasound imaging for future clinical applications.

Abstract: Ultrasound imaging is widely used in clinical diagnosis due to its non-invasive nature and real-time capabilities. However, traditional ultrasound diagnostics relies heavily on physician expertise and is often hampered by suboptimal image quality, leading to potential diagnostic errors. While artificial intelligence (AI) offers a promising solution to enhance clinical diagnosis by detecting abnormalities across various imaging modalities, existing AI methods for ultrasound face two major challenges. First, they typically require vast amounts of labeled medical data, raising serious concerns regarding patient privacy. Second, most models are designed for specific tasks, which restricts their broader clinical utility. To overcome these challenges, we present UltraFedFM, an innovative privacy-preserving ultrasound foundation model. UltraFedFM is collaboratively pre-trained using federated learning across 16 distributed medical institutions in 9 countries, leveraging a dataset of over 1 million ultrasound images covering 19 organs and 10 ultrasound modalities. This extensive and diverse data, combined with a secure training framework, enables UltraFedFM to exhibit strong generalization and diagnostic capabilities. It achieves an average area under the receiver operating characteristic curve (AUROC) of 0.927 for disease diagnosis and a dice similarity coefficient (DSC) of 0.878 for lesion segmentation. Notably, UltraFedFM surpasses the diagnostic accuracy of mid-level ultrasonographers (4-8 years of experience) and matches the performance of expert-level sonographers (10+ years of experience) in the joint diagnosis of 8 common systemic diseases. These findings indicate that UltraFedFM can significantly enhance clinical diagnostics while safeguarding patient privacy, marking a significant advancement in AI-driven ultrasound imaging for future clinical applications.

[399] MedicoSAM: Robust Improvement of SAM for Medical Imaging

Anwai Archit, Luca Freckmann, Constantin Pape

Main category: eess.IV

TL;DR: MedicoSAM improves Segment Anything for medical images through finetuning, showing clear benefits for interactive segmentation but not for semantic segmentation.

DetailsMotivation: Current medical image segmentation models are task-specific and require costly labeled data for training/adaptation. Vision foundation models like Segment Anything offer potential for universal segmentation in medical imaging.

Method: Compare different strategies for finetuning Segment Anything on a large, diverse medical dataset, then evaluate the finetuned models on interactive and semantic segmentation tasks.
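
As a sketch of what "different finetuning strategies" can look like in practice (assuming the original segment_anything package; the checkpoint path is a placeholder), one axis is full finetuning versus freezing the image encoder:

```python
# Illustrative strategy switch; which components the paper unfreezes is simplified here.
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path

finetune_full = True  # strategy A: update everything; B: freeze the image encoder
for name, param in sam.named_parameters():
    param.requires_grad = finetune_full or not name.startswith("image_encoder")
```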

Result: Finetuning clearly improves performance for interactive segmentation, but semantic segmentation does not benefit from medical image pretraining. Best model MedicoSAM is publicly available.

Conclusion: MedicoSAM offers practical value for medical image segmentation, especially for interactive tasks, and is compatible with existing annotation tools.

Abstract: Medical image segmentation is an important analysis task in clinical practice and research. Deep learning has massively advanced the field, but current approaches are mostly based on models trained for a specific task. Training such models or adapting them to a new condition is costly due to the need for (manually) labeled data. The emergence of vision foundation models, especially Segment Anything, offers a path to universal segmentation for medical images, overcoming these issues. Here, we study how to improve Segment Anything for medical images by comparing different finetuning strategies on a large and diverse dataset. We evaluate the finetuned models on a wide range of interactive and (automatic) semantic segmentation tasks. We find that the performance can be clearly improved for interactive segmentation. However, semantic segmentation does not benefit from pretraining on medical images. Our best model, MedicoSAM, is publicly available at https://github.com/computational-cell-analytics/medico-sam. We show that it is compatible with existing tools for data annotation and believe that it will be of great practical value.

Last updated: 2025-12-19
Built with Hugo; theme modified from Stack