Daily arXiv Papers - 2026-02-16

AI-enhanced summaries of research papers from arXiv

Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model

Maomao Li, Zhen Li, Kaipeng Zhang, Guosheng Yin, Zhifeng Li, Dong Xu

Main category: cs.SD

TL;DR: OmniCustom: A DiT-based framework for synchronous audio-video customization that generates videos with reference image identity and audio timbre while following text prompts.

Motivation: Existing video customization focuses only on identity consistency from reference images, but with advances in joint audio-video generation, there's a need for synchronous customization of both video identity and audio timbre.

Method: Uses DiT-based framework with separate LoRA modules for identity and audio timbre control, contrastive learning alongside flow matching, and training on large-scale audio-visual human dataset.
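
The contrastive-alongside-flow-matching idea can be illustrated with a toy loss (pure-Python sketch; the vector shapes, margin, and 0.5 weighting are assumptions for illustration, not the paper's values): the reference-conditioned flow prediction is pulled toward the target flow, while the unconditioned prediction acts as the negative.

```python
import random

def mse(a, b):
    # Mean squared error between two flow vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def contrastive_flow_loss(pred_cond, pred_uncond, target, margin=1.0):
    # The reference-conditioned prediction (positive) should sit closer
    # to the target flow than the unconditioned prediction (negative).
    d_pos = mse(pred_cond, target)
    d_neg = mse(pred_uncond, target)
    return max(0.0, margin + d_pos - d_neg)

random.seed(0)
target = [random.gauss(0, 1) for _ in range(32)]
pred_cond = [t + 0.1 * random.gauss(0, 1) for t in target]   # near target
pred_uncond = [random.gauss(0, 1) for _ in range(32)]        # unrelated

# Total objective: standard flow matching plus a weighted contrastive term
# (the 0.5 weight is an assumption).
total = mse(pred_cond, target) + 0.5 * contrastive_flow_loss(pred_cond, pred_uncond, target)
```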

Result: Extensive experiments show OmniCustom outperforms existing methods in generating audio-video content with consistent identity and timbre fidelity.

Conclusion: OmniCustom successfully addresses the novel sync audio-video customization task, enabling simultaneous control over video identity, audio timbre, and textual content in a zero-shot manner.

Abstract: Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a more compelling new task: sync audio-video customization, which aims to synchronously customize both video identity and audio timbre. Specifically, given a reference image $I^{r}$ and a reference audio $A^{r}$, this novel task requires generating videos that maintain the identity of the reference image while imitating the timbre of the reference audio, with spoken content freely specifiable through user-provided textual prompts. To this end, we propose OmniCustom, a powerful DiT-based audio-video customization framework that can synthesize a video following reference image identity, audio timbre, and text prompts all at once in a zero-shot manner. Our framework is built on three key contributions. First, identity and audio timbre control are achieved through separate reference identity and audio LoRA modules that operate through self-attention layers within the base audio-video generation model. Second, we introduce a contrastive learning objective alongside the standard flow matching objective. It uses predicted flows conditioned on reference inputs as positive examples and those without reference conditions as negative examples, thereby enhancing the model's ability to preserve identity and timbre. Third, we train OmniCustom on our constructed large-scale, high-quality audio-visual human dataset. Extensive experiments demonstrate that OmniCustom outperforms existing methods in generating audio-video content with consistent identity and timbre fidelity.

Relevance: 9/10

[2] Artic: AI-oriented Real-time Communication for MLLM Video Assistant

Jiangkai Wu, Zhiyuan Ren, Junquan Zhong, Liming Liu, Xinggong Zhang

Main category: cs.NI

TL;DR: Artic is an AI-oriented RTC framework for MLLM Video Assistants that addresses the mismatch between current RTC systems and AI video understanding needs through adaptive bitrate control, context-aware streaming, and a new benchmark for degraded video understanding.

Motivation: Current RTC frameworks are designed for human viewing but fail for AI Video Assistants where MLLMs need to understand video content. There's a fundamental mismatch in QoE requirements and network challenges, causing latency spikes and accuracy drops in production prototypes.

Method: Artic proposes three key innovations: 1) Response Capability-aware Adaptive Bitrate that uses MLLM accuracy saturation to proactively cap bitrate, 2) Zero-overhead Context-aware Streaming that allocates bitrate to regions most important for AI response, and 3) Degraded Video Understanding Benchmark to evaluate RTC-induced video degradation effects on MLLM accuracy.
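
The bitrate-capping idea in (1) can be sketched as follows (a minimal sketch; the accuracy-vs-bitrate curve, `eps` tolerance, and `headroom` fraction are hypothetical values, not taken from the paper):

```python
def saturation_cap(acc_curve, eps=0.01):
    # Smallest bitrate whose accuracy is within eps of the best accuracy:
    # past this point, extra bits no longer help the MLLM answer.
    best = max(acc_curve.values())
    return min(b for b, a in acc_curve.items() if a >= best - eps)

def choose_bitrate(acc_curve, est_bandwidth_kbps, headroom=0.8):
    # Cap at the saturation point, and also stay below a fraction of the
    # estimated bandwidth so headroom remains to absorb fluctuations.
    return min(saturation_cap(acc_curve), headroom * est_bandwidth_kbps)

# Hypothetical accuracy-vs-bitrate profile (kbps -> answer accuracy).
curve = {200: 0.55, 400: 0.68, 800: 0.74, 1600: 0.75, 3200: 0.75}
```

With this profile the controller never requests more than 800 kbps, even on a 5 Mbps link, leaving the rest as latency headroom.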

Result: Prototype experiments using real-world uplink traces show Artic significantly improves accuracy by 15.12% and reduces latency by 135.31 ms compared to existing methods.

Conclusion: Artic successfully bridges the gap between traditional RTC frameworks and AI Video Assistant requirements by optimizing for AI understanding rather than human viewing, demonstrating substantial improvements in both accuracy and latency for MLLM-based video communication systems.

Abstract: AI Video Assistant emerges as a new paradigm for Real-time Communication (RTC), where one peer is a Multimodal Large Language Model (MLLM) deployed in the cloud. This makes interaction between humans and AI more intuitive, akin to chatting with a real person. However, a fundamental mismatch exists between current RTC frameworks and AI Video Assistants, stemming from the drastic shift in Quality of Experience (QoE) and more challenging networks. Measurements on our production prototype also confirm that current RTC fails, causing latency spikes and accuracy drops. To address these challenges, we propose Artic, an AI-oriented RTC framework for MLLM Video Assistants, exploring the shift from “humans watching video” to “AI understanding video.” Specifically, Artic proposes: (1) Response Capability-aware Adaptive Bitrate, which utilizes MLLM accuracy saturation to proactively cap bitrate, reserving bandwidth headroom to absorb future fluctuations for latency reduction; (2) Zero-overhead Context-aware Streaming, which allocates limited bitrate to regions most important for the response, maintaining accuracy even under ultra-low bitrates; and (3) Degraded Video Understanding Benchmark, the first benchmark evaluating how RTC-induced video degradation affects MLLM accuracy. Prototype experiments using real-world uplink traces show that compared with existing methods, Artic significantly improves accuracy by 15.12% and reduces latency by 135.31 ms. We will release the benchmark and codes at https://github.com/pku-netvideo/DeViBench.

Relevance: 9/10

[3] Grandes Modelos de Linguagem Multimodais (MLLMs): Da Teoria à Prática (Multimodal Large Language Models: From Theory to Practice)

Neemias da Silva, Júlio C. W. Scholz, John Harrison, Marina Borges, Paulo Ávila, Frances A Santos, Myriam Delgado, Rodrigo Minetto, Thiago H Silva

Main category: cs.CL

TL;DR: A chapter covering fundamentals of Multimodal Large Language Models (MLLMs), including theory, practical implementation with LangChain/LangGraph, and future trends.

Motivation: To provide comprehensive coverage of MLLM fundamentals and practical implementation techniques, and to discuss challenges and future trends in the field.

Method: Presents theoretical foundations of MLLMs, explores practical techniques for preprocessing and prompt engineering, demonstrates building multimodal pipelines using LangChain and LangGraph frameworks, and provides supplementary online materials for hands-on learning.

Result: A comprehensive educational resource covering both theoretical and practical aspects of MLLMs, with publicly available supplementary materials for further study.

Conclusion: MLLMs represent a key AI advancement combining language and perception capabilities; the chapter provides foundational knowledge, practical implementation guidance, and discusses future research directions in the field.

Abstract: Multimodal Large Language Models (MLLMs) combine the natural language understanding and generation capabilities of LLMs with perception skills in modalities such as image and audio, representing a key advancement in contemporary AI. This chapter presents the main fundamentals of MLLMs and emblematic models. Practical techniques for preprocessing, prompt engineering, and building multimodal pipelines with LangChain and LangGraph are also explored. For further practical study, supplementary material is publicly available online: https://github.com/neemiasbsilva/MLLMs-Teoria-e-Pratica. Finally, the chapter discusses the challenges and highlights promising trends.

Relevance: 9/10


Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] A Lightweight LLM Framework for Disaster Humanitarian Information Classification

Han Jinzhen, Kim Jisung, Yang Jong Soo, Yun Hong Sik

Main category: cs.CL

TL;DR: Lightweight framework for disaster tweet classification using parameter-efficient fine-tuning (LoRA/QLoRA) on Llama 3.1 8B, achieving high accuracy with minimal parameter updates while showing RAG degrades performance due to label noise.

Motivation: Need for timely humanitarian information classification from social media during disasters, but deployment of large language models faces challenges in resource-constrained emergency settings requiring lightweight, cost-effective solutions.

Method: Parameter-efficient fine-tuning (LoRA/QLoRA) on Llama 3.1 8B, using HumAID dataset (76,484 tweets across 19 disasters) for dual-task benchmark (humanitarian categorization + event type identification), with systematic evaluation of prompting strategies and retrieval-augmented generation.
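
The LoRA side of the method comes down to the weight update W_eff = W + (alpha/r) · B · A, where only the low-rank factors A and B are trained. A pure-Python sketch (dimensions and scaling here are illustrative, not the paper's Llama 3.1 8B configuration):

```python
def matmul(X, Y):
    # Plain-list matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    # LoRA: W_eff = W + (alpha / r) * B @ A, with only the low-rank
    # factors A (r x d_in) and B (d_out x r) being trained.
    delta = matmul(B, A)
    s = alpha / r
    return [[w + s * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

d, r = 128, 2
W = [[0.0] * d for _ in range(d)]       # frozen base weight
A = [[0.1] * d for _ in range(r)]       # trainable down-projection
B = [[0.1] * r for _ in range(d)]       # trainable up-projection
W_eff = lora_effective_weight(W, A, B, alpha=16, r=r)

# Fraction of parameters that are trainable: (r*d + d*r) / (d*d).
frac = 2 * r * d / (d * d)
```

For d = 128 and r = 2 the trainable fraction is about 3.1% per adapted matrix, the same order of magnitude as the ~2% the paper reports for the full model.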

Result: LoRA achieves 79.62% humanitarian classification accuracy (+37.79% over zero-shot) training only ~2% of parameters; QLoRA enables efficient deployment with 99.4% of LoRA performance at 50% memory cost; RAG degrades fine-tuned model performance due to label noise from retrieved examples.

Conclusion: Establishes practical, reproducible pipeline for building reliable crisis intelligence systems with limited computational resources using parameter-efficient fine-tuning, with LoRA/QLoRA being effective but RAG being counterproductive for this specific task.

Abstract: Timely classification of humanitarian information from social media is critical for effective disaster response. However, deploying large language models (LLMs) for this task faces challenges in resource-constrained emergency settings. This paper develops a lightweight, cost-effective framework for disaster tweet classification using parameter-efficient fine-tuning. We construct a unified experimental corpus by integrating and normalizing the HumAID dataset (76,484 tweets across 19 disaster events) into a dual-task benchmark: humanitarian information categorization and event type identification. Through systematic evaluation of prompting strategies, LoRA fine-tuning, and retrieval-augmented generation (RAG) on Llama 3.1 8B, we demonstrate that: (1) LoRA achieves 79.62% humanitarian classification accuracy (+37.79% over zero-shot) while training only ~2% of parameters; (2) QLoRA enables efficient deployment with 99.4% of LoRA performance at 50% memory cost; (3) contrary to common assumptions, RAG strategies degrade fine-tuned model performance due to label noise from retrieved examples. These findings establish a practical, reproducible pipeline for building reliable crisis intelligence systems with limited computational resources.

[2] From Biased Chatbots to Biased Agents: Examining Role Assignment Effects on LLM Agent Robustness

Linbo Cao, Lihao Sun, Yang Yue

Main category: cs.CL

TL;DR: Persona assignments in LLM agents cause performance degradation up to 26.2% across diverse domains due to implicit biases, revealing vulnerabilities in agentic systems.

Motivation: While persona-induced biases in text generation are documented, their effects on LLM agent task performance remain unexplored despite posing direct operational risks for autonomous agents with real-world impacts.

Method: Systematic case study evaluating widely deployed LLM models on agentic benchmarks spanning strategic reasoning, planning, and technical operations, examining how demographic-based persona assignments affect behavior and performance.
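
The headline metric is mechanical: relative degradation of each persona-conditioned run against a no-persona baseline. A sketch (the scores below are hypothetical, chosen only to illustrate the computation):

```python
def degradation(scores, baseline="none"):
    # Relative drop (%) of each persona-conditioned score vs the
    # no-persona baseline run.
    base = scores[baseline]
    return {p: 100 * (base - s) / base for p, s in scores.items() if p != baseline}

# Hypothetical benchmark accuracies for one model.
scores = {"none": 0.80, "persona_a": 0.59, "persona_b": 0.78}
drops = degradation(scores)
worst = max(drops.values())
```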

Result: Substantial performance variations up to 26.2% degradation driven by task-irrelevant persona cues, appearing across task types and model architectures, showing persona conditioning distorts decision-making reliability.

Conclusion: Persona assignments introduce implicit biases and increase behavioral volatility in LLM agents, revealing an overlooked vulnerability that raises concerns for safe and robust deployment of autonomous agentic systems.

Abstract: Large Language Models (LLMs) are increasingly deployed as autonomous agents capable of actions with real-world impacts beyond text generation. While persona-induced biases in text generation are well documented, their effects on agent task performance remain largely unexplored, even though such effects pose more direct operational risks. In this work, we present the first systematic case study showing that demographic-based persona assignments can alter LLM agents’ behavior and degrade performance across diverse domains. Evaluating widely deployed models on agentic benchmarks spanning strategic reasoning, planning, and technical operations, we uncover substantial performance variations - up to 26.2% degradation, driven by task-irrelevant persona cues. These shifts appear across task types and model architectures, indicating that persona conditioning and simple prompt injections can distort an agent’s decision-making reliability. Our findings reveal an overlooked vulnerability in current LLM agentic systems: persona assignments can introduce implicit biases and increase behavioral volatility, raising concerns for the safe and robust deployment of LLM agents.

[3] Retrieval-Augmented Self-Taught Reasoning Model with Adaptive Chain-of-Thought for ASR Named Entity Correction

Junjie An, Jingguang Tian, Tianyi Wang, Yu Gao, Xiaofeng Mou, Yi Xu

Main category: cs.CL

TL;DR: A retrieval-augmented generation framework for correcting named entity errors in ASR using rephrasing language models and adaptive chain-of-thought reasoning.

Motivation: End-to-end ASR systems often misrecognize domain-specific phrases like named entities, causing downstream failures. Existing LLM-based correction methods don't fully exploit LLMs' sophisticated reasoning capabilities.

Method: Two-component framework: (1) Rephrasing language model for named entity recognition with phonetic-level edit distance candidate retrieval, (2) Self-taught reasoning model with adaptive chain-of-thought (A-STAR) that dynamically adjusts reasoning depth based on task difficulty.
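
Component (1)'s candidate retrieval can be sketched with a standard Levenshtein distance over phoneme sequences (the lexicon and phoneme inventory below are hypothetical; the paper's exact phonetic representation may differ):

```python
def edit_distance(a, b):
    # Two-row Levenshtein DP over phoneme sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def retrieve_candidates(asr_phones, lexicon, k=2):
    # Rank known entities by phonetic edit distance to the ASR span.
    return sorted(lexicon, key=lambda e: edit_distance(asr_phones, lexicon[e]))[:k]

# Hypothetical phoneme lexicon (entity -> phoneme sequence).
lexicon = {
    "Zhang Wei": ["zh", "ang", "w", "ei"],
    "Zhang Wen": ["zh", "ang", "w", "en"],
    "Li Na":     ["l", "i", "n", "a"],
}
hyp = ["zh", "ang", "w", "en"]   # possibly misrecognized ASR span
```

The retrieved candidates would then be handed to the A-STAR reasoning model to pick the correction.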

Result: Achieves relative reductions in named entity character error rate of 17.96% on AISHELL-1 and 34.42% on Homophone dataset compared to strong baselines.

Conclusion: The proposed retrieval-augmented generation framework effectively corrects named entity errors in ASR by leveraging LLMs’ reasoning capabilities through adaptive chain-of-thought and phonetic retrieval.

Abstract: End-to-end automatic speech recognition (ASR) systems frequently misrecognize domain-specific phrases like named entities, which can cause catastrophic failures in downstream tasks. A new family of named entity correction methods based on large language models (LLMs) has recently emerged. However, these approaches have yet to fully exploit the sophisticated reasoning capabilities inherent to LLMs. To bridge this gap, we propose a novel retrieval-augmented generation framework for correcting named entity errors in ASR. Our approach consists of two key components: (1) a rephrasing language model (RLM) for named entity recognition, followed by candidate retrieval using a phonetic-level edit distance; and (2) a novel self-taught reasoning model with adaptive chain-of-thought (A-STAR) that dynamically adjusts the depth of its reasoning based on task difficulty. Experiments on the AISHELL-1 and Homophone datasets demonstrate the effectiveness of our method, which achieves relative reductions in the named entity character error rate of 17.96% and 34.42%, respectively, compared to a strong baseline.

[4] Lamer-SSL: Layer-aware Mixture of LoRA Experts for Continual Multilingual Expansion of Self-supervised Models without Forgetting

Jing Xu, Minglin Wu, Xueyuan Chen, Xixin Wu, Helen Meng

Main category: cs.CL

TL;DR: Lamer-SSL: A parameter-efficient framework using Layer-Aware Mixture of LoRA Experts with replay strategy for multilingual continual learning in self-supervised speech models

Motivation: Self-supervised speech models struggle with generalization to new languages and suffer from catastrophic forgetting during continual training, limiting their practical deployment in multilingual scenarios.

Method: Proposes Lamer-SSL framework with: 1) Layer-Aware Mixture of LoRA Experts (Lamer) module that balances shared/language-specific representations and allocates more experts to deeper semantic layers, 2) Replay strategy using minimal data to retain prior knowledge and mitigate forgetting.
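
The layer-aware allocation and expert routing in (1) can be sketched as follows (a toy illustration; expert counts, top-k, and the linear depth schedule are assumptions, not the paper's configuration):

```python
import math

def experts_per_layer(num_layers, min_experts=2, max_experts=8):
    # Layer-aware allocation: deeper (more semantic) layers get more experts.
    return [round(min_experts + (max_experts - min_experts) * l / (num_layers - 1))
            for l in range(num_layers)]

def route(gate_logits, top_k=2):
    # Softmax gating over one layer's LoRA experts, keeping the top-k.
    m = max(gate_logits)
    exps = [math.exp(g - m) for g in gate_logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    return {i: probs[i] for i in top}

alloc = experts_per_layer(12)              # e.g. a 12-layer encoder
weights = route([2.0, 1.0, 0.5], top_k=2)  # gating for one token, one layer
```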

Result: Achieves effective extension to new languages while maintaining strong performance on previously learned languages with only 2.14% trainable parameters, demonstrated on ASR and language identification tasks.

Conclusion: Lamer-SSL provides an efficient solution for multilingual continual learning in self-supervised speech models, addressing generalization and forgetting issues with minimal parameter overhead.

Abstract: Despite their impressive performance, self-supervised speech models often struggle to generalize to new languages and tend to forget previously acquired knowledge during continual training. To address this, we propose Lamer-SSL, a parameter-efficient framework that integrates a Layer-Aware MixturE of LoRA Experts (Lamer) module with a replay strategy. The Lamer module enables flexible balancing between shared and language-specific representations, while layer-aware expert allocation assigns more experts to deeper layers where semantic information is richer. Meanwhile, the replay strategy retains prior knowledge using minimal data, mitigating forgetting during continual training. Experiments on automatic speech recognition (ASR) and language identification (LID) demonstrate that Lamer-SSL extends self-supervised models to new languages effectively while maintaining strong performance on previously learned languages with only 2.14% parameters being trainable.

[5] Grandes Modelos de Linguagem Multimodais (MLLMs): Da Teoria à Prática (Multimodal Large Language Models: From Theory to Practice)

Neemias da Silva, Júlio C. W. Scholz, John Harrison, Marina Borges, Paulo Ávila, Frances A Santos, Myriam Delgado, Rodrigo Minetto, Thiago H Silva

Main category: cs.CL

TL;DR: A chapter covering fundamentals of Multimodal Large Language Models (MLLMs), including theory, practical implementation with LangChain/LangGraph, and future trends.

Motivation: To provide comprehensive coverage of MLLM fundamentals and practical implementation techniques, and to discuss challenges and future trends in the field.

Method: Presents theoretical foundations of MLLMs, explores practical techniques for preprocessing and prompt engineering, demonstrates building multimodal pipelines using LangChain and LangGraph frameworks, and provides supplementary online materials for hands-on learning.

Result: A comprehensive educational resource covering both theoretical and practical aspects of MLLMs, with publicly available supplementary materials for further study.

Conclusion: MLLMs represent a key AI advancement combining language and perception capabilities; the chapter provides foundational knowledge, practical implementation guidance, and discusses future research directions in the field.

Abstract: Multimodal Large Language Models (MLLMs) combine the natural language understanding and generation capabilities of LLMs with perception skills in modalities such as image and audio, representing a key advancement in contemporary AI. This chapter presents the main fundamentals of MLLMs and emblematic models. Practical techniques for preprocessing, prompt engineering, and building multimodal pipelines with LangChain and LangGraph are also explored. For further practical study, supplementary material is publicly available online: https://github.com/neemiasbsilva/MLLMs-Teoria-e-Pratica. Finally, the chapter discusses the challenges and highlights promising trends.

[6] propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale

Maximilian Idahl, Benedikt Droste, Björn Plüster, Jan Philipp Harries

Main category: cs.CL

TL;DR: Propella-1: A family of small multilingual LLMs (0.6B-4B) that annotate text documents across 18 properties in 6 categories, producing structured JSON annotations for better data curation than single-score approaches.

Motivation: Current LLM pretraining data curation relies on single scalar quality scores from small classifiers, which conflate multiple quality dimensions, prevent flexible filtering, and lack interpretability.

Method: Developed propella-1 family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text across 18 properties in 6 categories (core content, classification, quality/value, audience/purpose, safety/compliance, geographic relevance) for 57 languages, producing structured JSON annotations.
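
The structured-JSON output can be illustrated with a simple schema check (the property names below are hypothetical stand-ins; the real propella-1 schema defines 18 properties across the six categories):

```python
import json

# Hypothetical slice of a multi-property annotation schema.
SCHEMA = {
    "language": str,
    "quality_score": int,      # e.g. a 0-5 quality rating
    "reasoning_depth": int,
    "is_safe": bool,
    "topics": list,
}

def validate(annotation):
    # A document annotation is valid if every schema property is present
    # with the expected JSON type.
    return all(k in annotation and isinstance(annotation[k], t)
               for k, t in SCHEMA.items())

raw = ('{"language": "en", "quality_score": 4, "reasoning_depth": 2, '
       '"is_safe": true, "topics": ["science"]}')
ann = json.loads(raw)
```

Unlike a single scalar score, each field can be filtered on independently (e.g. keep only safe, high-quality documents in a given language).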

Result: The 4B model achieves higher agreement with frontier commercial LLM reference than much larger general-purpose models. Released propella-annotations dataset of over 3B document annotations covering major pretraining corpora. Multi-dimensional analysis reveals substantial dataset differences in quality, reasoning depth, and content composition.

Conclusion: Propella-1 enables more nuanced, interpretable data curation for LLM pretraining through multi-dimensional structured annotations, outperforming single-score approaches and revealing richer dataset characteristics.

Abstract: Since FineWeb-Edu, data curation for LLM pretraining has predominantly relied on single scalar quality scores produced by small classifiers. A single score conflates multiple quality dimensions, prevents flexible filtering, and offers no interpretability. We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance. The models support 57 languages and produce structured JSON annotations conforming to a predefined schema. Evaluated against a frontier commercial LLM as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models. We release propella-annotations, a dataset of over three billion document annotations covering major pretraining corpora including data from FineWeb-2, FinePDFs, HPLT 3.0, and Nemotron-CC. Using these annotations, we present a multi-dimensional compositional analysis of widely used pretraining datasets, revealing substantial differences in quality, reasoning depth, and content composition that single-score approaches cannot capture. All model weights and annotations are released under permissive, commercial-use licenses.

[7] RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty

Ziqian Zhang, Xingjian Hu, Yue Huang, Kai Zhang, Ruoxi Chen, Yixin Liu, Qingsong Wen, Kaidi Xu, Xiangliang Zhang, Neil Zhenqiang Gong, Lichao Sun

Main category: cs.CL

TL;DR: RankLLM is a novel framework that quantifies both question difficulty and model competency through bidirectional score propagation, enabling more fine-grained evaluation of LLMs beyond traditional benchmarks.

Motivation: Existing LLM benchmarks fail to differentiate question difficulty, limiting their ability to effectively distinguish models' capabilities. Current evaluation frameworks treat all questions equally, preventing nuanced assessment of model strengths and weaknesses.

Method: RankLLM introduces difficulty as primary criterion and uses bidirectional score propagation: models earn competency scores for correct answers, while questions gain difficulty scores when they challenge models. This creates a mutual reinforcement system for quantifying both dimensions.
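
The bidirectional propagation can be sketched as a damped alternating update (a toy version; RankLLM's exact update rules and normalization may differ):

```python
def rank_llm(results, iters=20, damp=0.5):
    # results[m][q] is 1 if model m answered question q correctly, else 0.
    models = list(results)
    questions = list(next(iter(results.values())))
    comp = {m: 1.0 for m in models}
    diff = {q: 1.0 for q in questions}
    for _ in range(iters):
        # A question gains difficulty when competent models miss it.
        new_diff = {q: sum(comp[m] * (1 - results[m][q]) for m in models)
                       / sum(comp.values()) for q in questions}
        diff = {q: damp * diff[q] + (1 - damp) * new_diff[q] for q in questions}
        # A model earns competency by solving difficult questions.
        total_d = sum(diff.values()) or 1.0
        new_comp = {m: sum(diff[q] * results[m][q] for q in questions) / total_d
                    for m in models}
        comp = {m: damp * comp[m] + (1 - damp) * new_comp[m] for m in models}
    return comp, diff

# Model A solves both questions; model B misses the harder one.
results = {"A": {"q1": 1, "q2": 1}, "B": {"q1": 1, "q2": 0}}
comp, diff = rank_llm(results)
```

The damping term is an assumption added to keep this toy version from collapsing on tiny inputs; after a few iterations q2 (missed by B) ends up harder than q1, and A ends up more competent than B.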

Result: Evaluated 30 models on 35,550 questions across multiple domains. Achieved 90% agreement with human judgments, outperformed strong baselines like IRT, and demonstrated strong stability, fast convergence, and high computational efficiency.

Conclusion: RankLLM provides a practical, scalable solution for difficulty-aware LLM evaluation that offers more nuanced insights into model capabilities than traditional benchmarks.

Abstract: Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail to differentiate question difficulty, limiting their ability to effectively distinguish models’ capabilities. To address this limitation, we propose RankLLM, a novel framework designed to quantify both question difficulty and model competency. RankLLM introduces difficulty as the primary criterion for differentiation, enabling a more fine-grained evaluation of LLM capabilities. RankLLM’s core mechanism facilitates bidirectional score propagation between models and questions. The core intuition of RankLLM is that a model earns a competency score when it correctly answers a question, while a question’s difficulty score increases when it challenges a model. Using this framework, we evaluate 30 models on 35,550 questions across multiple domains. RankLLM achieves 90% agreement with human judgments and consistently outperforms strong baselines such as IRT. It also exhibits strong stability, fast convergence, and high computational efficiency, making it a practical solution for large-scale, difficulty-aware LLM evaluation.

[8] RBCorr: Response Bias Correction in Language Models

Om Bhatt, Anna A. Ivanova

Main category: cs.CL

TL;DR: RBCorr is a simple response bias correction method that eliminates option preference biases in language models for fixed-response questions, improving performance on yes-no, entailment, and multiple choice tasks.

Motivation: Language models have response biases (option preference biases) in fixed-response questions, which distorts evaluations of their true capabilities. Need low-cost, effective bias correction methods.

Method: Propose RBCorr strategy tested on 12 open-weight LMs using yes-no, entailment, and multiple choice questions. Explores generalizability across models, datasets, and prompt formats.
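
The summary does not spell out RBCorr's exact procedure, but a generic option-bias correction of the same flavor, estimating a per-option log-prob bias from neutral calibration prompts and subtracting it before picking an answer, can be sketched as follows (all numbers hypothetical):

```python
def estimate_bias(calibration_runs):
    # Average each option's log-prob over content-free / neutral prompts:
    # a consistently high score there reflects option preference, not knowledge.
    opts = calibration_runs[0].keys()
    return {o: sum(run[o] for run in calibration_runs) / len(calibration_runs)
            for o in opts}

def debias(option_logprobs, bias):
    # Subtract the per-option bias before picking the answer.
    return {o: lp - bias[o] for o, lp in option_logprobs.items()}

# Hypothetical model that systematically prefers "yes".
calib = [{"yes": -0.3, "no": -1.5}, {"yes": -0.4, "no": -1.4}]
bias = estimate_bias(calib)
raw = {"yes": -0.5, "no": -1.0}          # scores on a real question
corrected = debias(raw, bias)
```

Here the raw argmax is "yes" purely from option preference, while the corrected argmax flips to "no".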

Result: Response bias is prevalent pre-correction; RBCorr effectively eliminates bias and boosts model performance. LogProbs-based correction depends heavily on models, datasets, and prompt formats.

Conclusion: RBCorr is easy-to-use, boosts smaller LM performance, and ensures benchmark evaluations better reflect true capabilities. Bias behavior generalizability depends on multiple factors.

Abstract: Language models (LMs) are known to be prone to response biases, which present as option preference biases in fixed-response questions. It is therefore imperative to develop low-cost and effective response bias correction methods to improve LM performance and enable more accurate evaluations of model abilities. Here, we propose a simple response bias correction strategy ($\texttt{RBCorr}$) and test it on 12 open-weight language models using yes-no, entailment, and multiple choice questions. We show that response bias is prevalent in LMs pre-correction and that $\texttt{RBCorr}$ effectively eliminates bias and boosts model performance. We also explore the generalizability of bias behavior across models, datasets, and prompt formats, showing that LogProbs-based correction is highly dependent on all three of these aspects. Overall, $\texttt{RBCorr}$ is an easy-to-use method that can boost the performance of smaller LMs and ensure that LM performance on closed-response benchmarks aligns more closely with their true capabilities.

[9] Discovering Semantic Latent Structures in Psychological Scales: A Response-Free Pathway to Efficient Simplification

Bo Wang, Yuxuan Zhang, Yueqin Hu, Hanchao Hou, Kaiping Peng, Shiguang Ni

Main category: cs.CL

TL;DR: A topic-modeling framework using sentence embeddings and clustering to simplify psychological scales based on semantic structure without requiring response data.

Motivation: Traditional scale refinement methods require large response samples and face data availability constraints. Semantic structure of questionnaire items may encode latent constructs, offering a response-free alternative.

Method: Items are encoded with contextual sentence embeddings, grouped via density-based clustering to discover latent semantic factors, then representative items are selected using membership criteria within an integrated reduction pipeline.
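
The embed-cluster-select pipeline can be sketched end to end with toy 2-D "embeddings" (the greedy similarity-threshold grouping below stands in for the paper's density-based clustering, and the representative rule is a crude membership proxy):

```python
def cos(a, b):
    # Cosine similarity between two embedding vectors.
    num = sum(x * y for x, y in zip(a, b))
    return num / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

def cluster(vecs, thresh=0.9):
    # Greedy grouping: join the first cluster whose members are all within
    # the similarity threshold.
    clusters = []
    for i, v in enumerate(vecs):
        for c in clusters:
            if all(cos(v, vecs[j]) >= thresh for j in c):
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def representative(c, vecs):
    # Keep the item with the highest mean similarity to its cluster.
    return max(c, key=lambda i: sum(cos(vecs[i], vecs[j]) for j in c))

# Toy item embeddings: items 0 and 1 are near-paraphrases, item 2 is not.
vecs = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
groups = cluster(vecs)
```

Keeping one representative per group is what shrinks the scale while preserving each semantic factor.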

Result: The method recovered coherent factor-like groupings aligned with established constructs, reduced scale length by 60.5% on average while maintaining psychometric adequacy, and preserved inter-factor correlations.

Conclusion: Semantic latent organization provides a response-free approximation of measurement structure, formalizing semantic structure as an inspectable front-end for scale construction and reduction.

Abstract: Psychological scale refinement traditionally relies on response-based methods such as factor analysis, item response theory, and network psychometrics to optimize item composition. Although rigorous, these approaches require large samples and may be constrained by data availability and cross-cultural comparability. Recent advances in natural language processing suggest that the semantic structure of questionnaire items may encode latent construct organization, offering a complementary response-free perspective. We introduce a topic-modeling framework that operationalizes semantic latent structure for scale simplification. Items are encoded using contextual sentence embeddings and grouped via density-based clustering to discover latent semantic factors without predefining their number. Class-based term weighting derives interpretable topic representations that approximate constructs and enable merging of semantically adjacent clusters. Representative items are selected using membership criteria within an integrated reduction pipeline. We benchmarked the framework across DASS, IPIP, and EPOCH, evaluating structural recovery, internal consistency, factor congruence, correlation preservation, and reduction efficiency. The proposed method recovered coherent factor-like groupings aligned with established constructs. Selected items reduced scale length by 60.5% on average while maintaining psychometric adequacy. Simplified scales showed high concordance with original factor structures and preserved inter-factor correlations, indicating that semantic latent organization provides a response-free approximation of measurement structure. Our framework formalizes semantic structure as an inspectable front-end for scale construction and reduction. To facilitate adoption, we provide a visualization-supported tool enabling one-click semantic analysis and structured simplification.

[10] MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian, Haihua Yang, Huichao Wang, Jiale Chen, Jianfei Pan, Jieqiong Cao, Jinghao Lin, Kai Wu, Lin Yang, Shengsheng Yao, Tao Chen, Xiaojun Xiao, Xiaozhong Ji, Xu Wang, Yijun He, Zhixiong Yang

Main category: cs.CL

TL;DR: MedXIAOHE is a medical vision-language foundation model that achieves SOTA performance on medical benchmarks through entity-aware continual pretraining, reinforcement learning for medical reasoning, and techniques to improve reliability and reduce hallucinations in clinical applications.

Motivation: To advance general-purpose medical understanding and reasoning for real-world clinical applications, addressing challenges like heterogeneous medical data, rare diseases, and the need for reliable, evidence-based decision-making in healthcare.

Method: 1) Entity-aware continual pretraining framework organizing heterogeneous medical corpora to broaden knowledge coverage; 2) Incorporation of diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training for multi-step diagnostic reasoning; 3) Integration of user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation.

Result: Achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities, with improved adherence to medical instructions and reduced hallucinations.

Conclusion: MedXIAOHE demonstrates practical design choices for medical vision-language models that can advance clinical applications through improved knowledge coverage, reasoning capabilities, and reliability in real-world use.

Abstract: We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve this, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps (e.g., rare diseases). For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision traces. To improve reliability in real-world use, MedXIAOHE integrates user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation, with improved adherence to medical instructions. We release this report to document our practical design choices, scaling insights, and evaluation framework, hoping to inspire further research.

[11] Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats

Pengxiang Zhao, Hui-Ling Zhen, Xing Li, Han Bao, Weizhe Lin, Zhiyuan Yang, Ziwei Yu, Xin Wang, Mingxuan Yuan, Xianzhi Yu, Zhenhua Dong

Main category: cs.CL

TL;DR: HiFloat introduces 8-bit and 4-bit floating-point formats optimized for Ascend NPUs, showing superior performance for high-variance data compared to integer formats in LLM inference tasks.

Motivation: As LLMs scale, there's a need for efficient low-bit floating-point formats that can maintain accuracy while reducing computational and memory requirements for inference on specialized hardware like NPUs.

Method: Developed HiFloat family (HiF8 and HiF4) formats tailored for Ascend NPUs, conducted rigorous comparisons across weight-activation and KV-cache tasks, evaluated against integer formats and existing floating-point formats like MXFP and NVFP4.

Result: Three key insights: (1) INT8 works well for narrow-range data but floating-point formats excel with high-variance data; (2) In 4-bit regimes, HiF4’s hierarchical scaling prevents accuracy collapse seen in integer formats; (3) HiFloat is fully compatible with state-of-the-art post-training quantization frameworks.

Conclusion: HiFloat provides an effective solution for high-efficiency LLM inference on NPUs, offering better accuracy preservation for high-variance data compared to integer formats, especially in low-bit regimes.

Abstract: As LLMs scale, low-bit floating-point formats like MXFP and NVFP4 offer new opportunities for precision and efficiency. In this work, we evaluate HiFloat (HiF8 and HiF4), a family of formats tailored for Ascend NPUs. Through rigorous comparison across weight-activation and KV-cache tasks, we provide three key insights: (1) INT8 suits narrow-range data, while floating-point formats excel with high-variance data; (2) in 4-bit regimes, HiF4’s hierarchical scaling prevents the accuracy collapse seen in integer formats; and (3) HiFloat is fully compatible with state-of-the-art post-training quantization frameworks. Overall, HiFloat provides a solution for high-efficiency LLM inference on NPUs.
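Insight (1) can be illustrated with a toy quantizer. The sketch below applies generic symmetric absmax INT8 quantization (not HiFloat itself, whose specification is not given here) to show how a single outlier inflates the per-tensor scale and erodes precision for every other value:

```python
def quantize_int8(xs):
    """Symmetric absmax INT8 quantization: one scale for the whole tensor."""
    scale = max(abs(x) for x in xs) / 127.0
    return [round(x / scale) * scale for x in xs]

def mse(xs, ys):
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

narrow = [0.01 * i for i in range(-50, 51)]  # narrow, uniform range
outliers = narrow[:-1] + [8.0]               # same data plus one large outlier

err_narrow = mse(narrow, quantize_int8(narrow))
err_outlier = mse(outliers, quantize_int8(outliers))
# the outlier stretches the scale, so all the small values lose precision
```

Floating-point formats sidestep this by spending bits on an exponent (and, in hierarchically scaled formats, on per-block scales), which is why they tolerate high-variance data better in the paper's comparison.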

[12] CLASE: A Hybrid Method for Chinese Legalese Stylistic Evaluation

Yiran Rex Ma, Yuxiao Ye, Huiyuan Xie

Main category: cs.CL

TL;DR: CLASE introduces a hybrid evaluation method combining linguistic features and LLM-as-a-judge scores to assess stylistic quality of legal text generation, specifically for Chinese legal documents.

Motivation: Legal text generated by LLMs often lacks proper stylistic adherence to legal writing conventions. Existing evaluation methods are inadequate: manual expert evaluation is impractical, reference-based metrics conflate semantics with style, and LLM-as-a-judge methods are opaque and inconsistent.

Method: CLASE combines 1) linguistic feature-based scores and 2) experience-guided LLM-as-a-judge scores. Both feature coefficients and LLM scoring experiences are learned from contrastive pairs of authentic legal documents and their LLM-restored counterparts.

Result: Experiments on 200 Chinese legal documents show CLASE achieves substantially higher alignment with human judgments than traditional metrics and pure LLM-as-a-judge methods.

Conclusion: CLASE provides a scalable, practical solution for professional stylistic evaluation in legal text generation, offering interpretable score breakdowns and improvement suggestions.

Abstract: Legal text generated by large language models (LLMs) can usually achieve reasonable factual accuracy, but it frequently fails to adhere to the specialised stylistic norms and linguistic conventions of legal writing. In order to improve stylistic quality, a crucial first step is to establish a reliable evaluation method. However, having legal experts manually develop such a metric is impractical, as the implicit stylistic requirements in legal writing practice are difficult to formalise into explicit rubrics. Meanwhile, existing automatic evaluation methods also fall short: reference-based metrics conflate semantic accuracy with stylistic fidelity, and LLM-as-a-judge evaluations suffer from opacity and inconsistency. To address these challenges, we introduce CLASE (Chinese LegAlese Stylistic Evaluation), a hybrid evaluation method that focuses on the stylistic performance of legal text. The method incorporates a hybrid scoring mechanism that combines 1) linguistic feature-based scores and 2) experience-guided LLM-as-a-judge scores. Both the feature coefficients and the LLM scoring experiences are learned from contrastive pairs of authentic legal documents and their LLM-restored counterparts. This hybrid design captures both surface-level features and implicit stylistic norms in a transparent, reference-free manner. Experiments on 200 Chinese legal documents show that CLASE achieves substantially higher alignment with human judgments than traditional metrics and pure LLM-as-a-judge methods. Beyond improved alignment, CLASE provides interpretable score breakdowns and suggestions for improvements, offering a scalable and practical solution for professional stylistic evaluation in legal text generation (Code and data for CLASE are available at: https://github.com/rexera/CLASE).
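The "learned from contrastive pairs" idea admits a small illustration. The sketch below fits linear feature coefficients with a margin perceptron so that an authentic document outscores its LLM-restored twin; the feature names, update rule, and data are invented stand-ins, not CLASE's actual training procedure:

```python
def train_on_pairs(pairs, dim, lr=0.1, epochs=50):
    """Learn feature coefficients from contrastive (authentic, restored)
    pairs: push the authentic document's linear score above its restored
    counterpart's by a margin of at least 1."""
    w = [0.0] * dim
    for _ in range(epochs):
        for authentic, restored in pairs:
            margin = sum(wi * (a - r)
                         for wi, a, r in zip(w, authentic, restored))
            if margin < 1.0:  # violated or too-small margin: update
                w = [wi + lr * (a - r)
                     for wi, a, r in zip(w, authentic, restored)]
    return w

# toy features per document: [avg sentence length, legal-term density]
pairs = [([30.0, 0.8], [18.0, 0.3]),
         ([25.0, 0.7], [20.0, 0.2])]
w = train_on_pairs(pairs, dim=2)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
```

In the full method this feature-based score is then blended with an experience-guided LLM-as-a-judge score rather than used alone.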

[13] Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR

Dohyung Kim, Minbeom Kim, Jeonghye Kim, Sangmook Lee, Sojeong Rhee, Kyomin Jung

Main category: cs.CL

TL;DR: PACED-RL improves LLM reasoning by using partition function as accuracy signal to prioritize informative prompts during GFlowNet training, enhancing sample efficiency without extra compute.

Motivation: RL methods improve LLM reasoning but reduce output diversity. GFlowNets address this by matching target distributions, but prior works treat partition function only as normalizer, missing its potential as accuracy signal for better sample efficiency.

Method: Proposes Partition Function-Guided RL (PACED-RL): 1) Establishes theoretical link between partition function and per-prompt accuracy, 2) Uses accuracy estimates to prioritize informative question prompts during training, 3) Implements accuracy estimate error-prioritized replay, all reusing existing GFlowNet training information.

Result: Extensive experiments across diverse benchmarks show strong performance improvements over GRPO and prior GFlowNet approaches, demonstrating better sample efficiency for distribution-matching LLM training.

Conclusion: PACED-RL effectively amortizes compute overhead into existing optimization, making it a promising direction for more sample-efficient distribution-matching training of LLMs while maintaining output diversity.

Abstract: Reward-maximizing RL methods enhance the reasoning performance of LLMs, but often reduce the diversity among outputs. Recent works address this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its partition function. In contrast to prior works that treat this partition function solely as a normalizer, we reinterpret it as a per-prompt expected-reward (i.e., online accuracy) signal, leveraging this unused information to improve sample efficiency. Specifically, we first establish a theoretical relationship between the partition function and per-prompt accuracy estimates. Building on this key insight, we propose Partition Function-Guided RL (PACED-RL), a post-training framework that leverages accuracy estimates to prioritize informative question prompts during training, and further improves sample efficiency through an accuracy estimate error-prioritized replay. Crucially, both components reuse information already produced during GFlowNet training, effectively amortizing the compute overhead into the existing optimization process. Extensive experiments across diverse benchmarks demonstrate strong performance improvements over GRPO and prior GFlowNet approaches, highlighting PACED-RL as a promising direction for more sample-efficient distribution-matching training of LLMs.
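The prioritization idea can be sketched as follows. The abstract does not spell out the exact priority function, so this illustration weights each prompt by the variance of a Bernoulli with the estimated per-prompt accuracy, which peaks at 0.5 (prompts the model always or never solves carry little training signal); the prompt names are hypothetical:

```python
import random

def prompt_priorities(acc_estimates):
    """Weight each prompt by Bernoulli variance p * (1 - p) of its estimated
    per-prompt accuracy p: mid-accuracy prompts are the most informative."""
    return {q: p * (1.0 - p) for q, p in acc_estimates.items()}

def sample_batch(acc_estimates, k, rng):
    pri = prompt_priorities(acc_estimates)
    prompts = list(pri)
    weights = [pri[q] + 1e-6 for q in prompts]  # floor keeps every prompt reachable
    return rng.choices(prompts, weights=weights, k=k)

# estimated accuracies (in PACED-RL these come from the learned partition
# function rather than being supplied by hand)
est = {"easy": 0.98, "hard": 0.03, "frontier": 0.5}
batch = sample_batch(est, k=100, rng=random.Random(0))
```

Under this weighting the "frontier" prompt dominates the sampled batch, which is the sample-efficiency effect the paper targets.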

[14] Learning Ordinal Probabilistic Reward from Preferences

Longze Chen, Lu Wang, Renke Shan, Ze Gong, Run Luo, Jiaming Li, Jing Luo, Qiyao Wang, Min Yang

Main category: cs.CL

TL;DR: Introduces Probabilistic Reward Model (PRM) that treats reward as a random variable with full probability distribution, addressing limitations of existing generative and discriminative reward models.

Motivation: Existing reward modeling approaches have limitations: Generative Reward Models (GRMs) require costly point-wise supervision, while Discriminative Reward Models (DRMs) produce uncalibrated relative scores lacking probabilistic interpretation. Need for reward models that capture both relative rankings and absolute quality.

Method: Proposes Probabilistic Reward Model (PRM) treating reward as random variable with full probability distribution. Presents Ordinal Probabilistic Reward Model (OPRM) as discrete realization, discretizing quality scores into ordinal ratings. Introduces Region Flooding Tuning (RgFT) training strategy using quality-level annotations to concentrate probability mass within rating sub-regions.

Result: Experiments show 2.9%-7.4% accuracy improvement over prior reward models on various benchmarks. Method demonstrates strong performance and data efficiency. Analysis shows capture of both relative rankings and absolute quality.

Conclusion: PRM/OPRM with RgFT provides effective probabilistic reward modeling approach that addresses limitations of existing methods, offering calibrated scores with probabilistic interpretation and improved accuracy.

Abstract: Reward models are crucial for aligning large language models (LLMs) with human values and intentions. Existing approaches follow either Generative (GRMs) or Discriminative (DRMs) paradigms, yet both suffer from limitations: GRMs typically demand costly point-wise supervision, while DRMs produce uncalibrated relative scores that lack probabilistic interpretation. To address these challenges, we introduce a novel reward modeling paradigm: Probabilistic Reward Model (PRM). Instead of modeling reward as a deterministic scalar, our approach treats it as a random variable, learning a full probability distribution for the quality of each response. To make this paradigm practical, we present its closed-form, discrete realization: the Ordinal Probabilistic Reward Model (OPRM), which discretizes the quality score into a finite set of ordinal ratings. Building on OPRM, we propose a data-efficient training strategy called Region Flooding Tuning (RgFT). It enables rewards to better reflect absolute text quality by incorporating quality-level annotations, which guide the model to concentrate the probability mass within corresponding rating sub-regions. Experiments on various reward model benchmarks show that our method improves accuracy by $\textbf{2.9\%}\sim\textbf{7.4\%}$ compared to prior reward models, demonstrating strong performance and data efficiency. Analysis of the score distribution provides evidence that our method captures not only relative rankings but also absolute quality.
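A minimal discrete realization of the probabilistic-reward idea: model the reward as a categorical distribution over a finite set of ordinal ratings and score a response by its expected rating. The five-point scale and logits below are illustrative; the paper's RgFT objective and rating set may differ:

```python
import math

RATINGS = [1, 2, 3, 4, 5]  # ordinal quality levels (illustrative)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ordinal_reward(logits):
    """Treat the reward as a categorical random variable over ordinal
    ratings; score a response by its expected rating, keeping the full
    distribution for calibration and uncertainty."""
    probs = softmax(logits)
    mean = sum(p * r for p, r in zip(probs, RATINGS))
    return probs, mean

# one logit per rating, as a reward head over ordinal classes might produce
probs, mean = ordinal_reward([-2.0, -1.0, 0.5, 2.0, 1.0])
```

Unlike a single scalar score, the distribution itself carries calibrated information: a response whose mass is concentrated on rating 4 and one whose mass is spread across ratings 2 through 5 can share the same mean while differing in certainty.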

[15] $\mathcal{X}$-KD: General Experiential Knowledge Distillation for Large Language Models

Yuang Cai, Yuyu Yuan

Main category: cs.CL

TL;DR: Experiential Knowledge Distillation (X-KD) enables student LLMs to learn in teacher’s original environment using reward imitation learning, outperforming traditional KD methods on various NLP tasks.

Motivation: Existing KD approaches for LLMs focus on imitating teacher behavior but overlook the original learning environment that shaped the teacher's knowledge, limiting student learning potential.

Method: Proposes X-KD framework using Approximated Variational Reward Imitation Learning (AVRIL) to jointly model teacher’s original reward function and perform policy distillation, encouraging consistency between student policy and original reward function.

Result: X-KD outperforms generalized KD and MiniLLM baselines on abstractive summarization, machine translation, and arithmetic reasoning tasks, achieving better performance-diversity trade-off and data efficiency.

Conclusion: X-KD provides a novel framework that enables students to learn in teacher’s original environment, offering better performance than traditional imitation-based KD approaches while maintaining simplicity and flexibility.

Abstract: Knowledge Distillation (KD) for Large Language Models (LLMs) has become increasingly important as models grow in size and complexity. While existing distillation approaches focus on imitating teacher behavior, they often overlook the original learning environment that shaped the teacher’s knowledge. Inspired by the experiential learning theory and inverse reinforcement learning, we propose Experiential Knowledge Distillation ($\mathcal{X}$-KD), a novel and general framework that enables student models to learn in the teacher’s original learning environment. $\mathcal{X}$-KD adopts the Approximated Variational Reward Imitation Learning (AVRIL) framework to jointly model the teacher’s original reward function and perform policy distillation, encouraging consistency between the student policy and the original reward function. Our derivation demonstrates that $\mathcal{X}$-KD follows the supervised learning framework and applies to both sequence-level and divergence-based distillation methods, underlining the simplicity and flexibility of our approach. Empirical results show that $\mathcal{X}$-KD outperforms the generalized KD and MiniLLM baselines on abstractive summarization, machine translation, and arithmetic reasoning tasks. Additionally, $\mathcal{X}$-KD achieves better performance-diversity trade-off and data efficiency than baseline KD approaches.

[16] ReFilter: Improving Robustness of Retrieval-Augmented Generation via Gated Filter

Yixin Chen, Ying Xiong, Shangyu Wu, Xiangrui Ke, Nan Guan, Chun Jason Xue

Main category: cs.CL

TL;DR: ReFilter is a novel latent-based fusion framework for retrieval-augmented generation that performs token-level filtering and fusion to handle large numbers of retrieved documents more effectively.

Motivation: Existing RAG fusion methods (query-based, parametric, latent-based) struggle to scale gracefully as the number of retrieved candidates increases. Larger retrieval sets improve evidence coverage but inevitably contain irrelevant/redundant content and increase inference costs.

Method: ReFilter uses three components: 1) context encoder for encoding context features, 2) gated filter for weighting each token, and 3) token fusion module for integrating weighted token features into LLM hidden states. This enables token-level filtering and fusion rather than document-level.

Result: ReFilter achieves best average performance across four general-domain QA benchmarks in both in-domain adaptation and out-of-domain transfer. It also generalizes to five biomedical QA benchmarks in zero-shot transfer without domain fine-tuning, reaching 70.01% average accuracy with Qwen2.5-14B-Instruct.

Conclusion: ReFilter provides an effective solution for scaling RAG systems by enabling token-level filtering and fusion, addressing limitations of existing methods when dealing with large numbers of retrieved documents.

Abstract: Retrieval-augmented generation (RAG) has become a dominant paradigm for grounding large language models (LLMs) with external evidence in knowledge-intensive question answering. A core design choice is how to fuse retrieved samples into the LLMs, where existing internal fusion approaches broadly fall into query-based fusion, parametric fusion, and latent-based fusion. Despite their effectiveness at modest retrieval scales, these methods often fail to scale gracefully as the number of retrieved candidates k increases: Larger k improves evidence coverage, yet realistic top-k retrieval inevitably contains irrelevant or redundant content and increases the inference cost. To address these limitations, we propose ReFilter, a novel latent-based fusion framework that performs token-level filtering and fusion. ReFilter consists of three key components: a context encoder for encoding context features, a gated filter for weighting each token, and a token fusion module for integrating the weighted token feature into the LLM’s hidden states. Our experiments across four general-domain QA benchmarks show that ReFilter consistently achieves the best average performance under both in-domain adaptation and out-of-domain transfer. ReFilter further generalizes to five biomedical QA benchmarks in zero-shot transfer without domain fine-tuning, reaching 70.01% average accuracy with Qwen2.5-14B-Instruct.
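The gate-then-fuse step can be sketched with plain lists standing in for hidden states. Everything here (dimensions, the mean-pooling fusion rule, the fact that gate logits are given by hand rather than produced by a learned gating network) is an assumption used to show the shape of token-level gated fusion, not ReFilter's actual architecture:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(hidden, context, gate_logits):
    """Token-level gated fusion: each retrieved-context token vector is
    scaled by a gate in (0, 1), pooled, and added to every LLM hidden state.

    hidden, context: lists of d-dim vectors; gate_logits: one logit per
    context token (in the model these come from the gated filter module).
    """
    gates = [sigmoid(g) for g in gate_logits]
    pooled = [0.0] * len(hidden[0])
    for g, tok in zip(gates, context):
        for i, v in enumerate(tok):
            pooled[i] += g * v
    pooled = [v / len(context) for v in pooled]
    return [[h + p for h, p in zip(vec, pooled)] for vec in hidden]

hidden = [[1.0, 0.0], [0.0, 1.0]]
context = [[2.0, 2.0], [100.0, 100.0]]  # second token is irrelevant noise
fused = gated_fuse(hidden, context, gate_logits=[4.0, -8.0])  # gate ~1, ~0
```

With the noise token's gate driven toward zero, its large values barely perturb the hidden states, which is the mechanism that lets a latent-fusion RAG system tolerate larger, noisier top-k retrieval sets.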

[17] Towards a Diagnostic and Predictive Evaluation Methodology for Sequence Labeling Tasks

Elena Alvarez-Mellado, Julio Gonzalo

Main category: cs.CL

TL;DR: Proposes a diagnostic evaluation methodology for sequence labeling tasks using handcrafted test sets with exhaustive linguistic coverage to identify systematic weaknesses and predict performance on external data.

Motivation: Standard NLP evaluation provides average performance metrics but lacks actionable insights for improvement and fails to predict performance on out-of-distribution data. Current test sets rely on large amounts of scraped real-world data rather than systematic linguistic coverage.

Method: Creates small handcrafted test sets that exhaustively cover linguistic span attributes (shape, length, casing, sentence position, etc.) rather than gathering large real-world datasets. Applies this methodology to anglicism identification in Spanish benchmark.

Result: The methodology provides diagnostic results that identify systematic weaknesses, actionable insights for model selection, and predictive capability with median correlation of 0.85 for predicting model performance on external datasets.

Conclusion: Proposed evaluation methodology offers superior diagnostic, actionable, and predictive capabilities compared to standard evaluation practices, enabling better understanding of model weaknesses and more reliable performance prediction.

Abstract: Standard evaluation in NLP typically indicates that system A is better on average than system B, but it provides little information on how to improve performance and, what is worse, it should not come as a surprise if B ends up being better than A on outside data. We propose an evaluation methodology for sequence labeling tasks grounded in error analysis that provides both quantitative and qualitative information on where systems must be improved and predicts how models will perform on a different distribution. The key is to create test sets that, contrary to common practice, do not rely on gathering large amounts of real-world in-distribution scraped data, but instead consist of a small set of handcrafted, linguistically motivated examples that exhaustively cover the range of span attributes (such as shape, length, casing, sentence position, etc.) a system may encounter in the wild. We demonstrate this methodology on a benchmark for anglicism identification in Spanish. Our methodology provides results that are diagnostic (because they help identify systematic weaknesses in performance), actionable (because they can inform which model is better suited for a given scenario) and predictive: our method predicts model performance on external datasets with a median correlation of 0.85.

[18] Aspect-Based Sentiment Analysis for Future Tourism Experiences: A BERT-MoE Framework for Persian User Reviews

Hamidreza Kazemi Taskooh, Taha Zare Harofte

Main category: cs.CL

TL;DR: Hybrid BERT-based model with Top-K routing for Persian tourism ABSA achieves 90.6% F1-score with 39% GPU power reduction

Motivation: Address challenges of aspect-based sentiment analysis for low-resource languages (Persian) in tourism domain, where existing models face routing collapse and efficiency issues

Method: Three-stage pipeline: (1) overall sentiment classification using BERT on 9,558 labeled reviews, (2) multi-label aspect extraction for six tourism aspects, (3) integrated ABSA with dynamic Top-K routing and auxiliary losses to prevent routing collapse

Result: Achieves 90.6% weighted F1-score for ABSA, outperforming baseline BERT (89.25%) and standard hybrid approach (85.7%), with 39% reduction in GPU power consumption; first Persian tourism ABSA study with released dataset

Conclusion: Proposed hybrid BERT model with Top-K routing effectively addresses Persian ABSA challenges, provides efficiency gains for sustainable AI, and contributes first annotated Persian tourism dataset for multilingual NLP research

Abstract: This study advances aspect-based sentiment analysis (ABSA) for Persian-language user reviews in the tourism domain, addressing challenges of low-resource languages. We propose a hybrid BERT-based model with Top-K routing and auxiliary losses to mitigate routing collapse and improve efficiency. The pipeline includes: (1) overall sentiment classification using BERT on 9,558 labeled reviews, (2) multi-label aspect extraction for six tourism-related aspects (host, price, location, amenities, cleanliness, connectivity), and (3) integrated ABSA with dynamic routing. The dataset consists of 58,473 preprocessed reviews from the Iranian accommodation platform Jabama, manually annotated for aspects and sentiments. The proposed model achieves a weighted F1-score of 90.6% for ABSA, outperforming baseline BERT (89.25%) and a standard hybrid approach (85.7%). Key efficiency gains include a 39% reduction in GPU power consumption compared to dense BERT, supporting sustainable AI deployment in alignment with UN SDGs 9 and 12. Analysis reveals high mention rates for cleanliness and amenities as critical aspects. This is the first ABSA study focused on Persian tourism reviews, and we release the annotated dataset to facilitate future multilingual NLP research in tourism.
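Top-K routing with a load-balancing auxiliary term can be sketched generically. The paper's exact auxiliary losses are not specified in the abstract; this shows the standard pattern of keeping the k highest-probability experts per token and penalizing uneven expert load, the failure mode (routing collapse) the paper mitigates:

```python
import math

def top_k_route(logits, k=2):
    """Dispatch one token: softmax over expert logits, keep the top-k
    experts, renormalize their weights; all other experts get zero."""
    m = max(logits)
    probs = [math.exp(z - m) for z in logits]
    s = sum(probs)
    probs = [p / s for p in probs]
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    z = sum(probs[i] for i in keep)
    return {i: probs[i] / z for i in keep}

def load_balance_loss(assignments, n_experts):
    """Auxiliary loss: variance of total routed weight per expert.
    Minimizing it spreads tokens across experts instead of letting the
    router collapse onto a favorite."""
    load = [0.0] * n_experts
    for route in assignments:
        for i, w in route.items():
            load[i] += w
    mean = sum(load) / n_experts
    return sum((l - mean) ** 2 for l in load) / n_experts

routes = [top_k_route([2.0, 1.0, -1.0, 0.0]),
          top_k_route([0.0, 2.5, 1.0, -2.0])]
loss = load_balance_loss(routes, n_experts=4)
```

Because only k experts run per token, the forward cost scales with k rather than the total expert count, which is the source of the efficiency gains the paper reports.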

[19] RAT-Bench: A Comprehensive Benchmark for Text Anonymization

Nataša Krčo, Zexi Yao, Matthieu Meeus, Yves-Alexandre de Montjoye

Main category: cs.CL

TL;DR: RAT-Bench is a benchmark for evaluating text anonymization tools based on re-identification risk, using synthetic text with various identifiers across domains and languages, finding LLM-based anonymizers offer better privacy-utility trade-offs.

Motivation: Current text anonymization tools are typically evaluated on their ability to remove specific identifiers, but their effectiveness at preventing re-identification remains unclear. There's a need for comprehensive evaluation based on actual re-identification risk.

Method: Introduced RAT-Bench benchmark using U.S. demographic statistics to generate synthetic text containing various direct and indirect identifiers across domains, languages, and difficulty levels. Evaluated NER- and LLM-based anonymization tools by measuring what attributes an LLM-based attacker can infer from anonymized text, calculating re-identification risk in U.S. population while accounting for disparate impact of identifiers.

Result: Found that even the best anonymization tools are far from perfect, especially when direct identifiers aren’t written in standard ways and when indirect identifiers enable re-identification. LLM-based anonymizers (including new iterative ones) provide better privacy-utility trade-off but at higher computational cost, and work well across languages.

Conclusion: LLM-based anonymizers offer superior privacy-utility trade-offs despite higher computational costs, with good cross-language performance. The benchmark will be released to encourage community expansion, particularly to other geographies.

Abstract: Data containing personal information is increasingly used to train, fine-tune, or query Large Language Models (LLMs). Text is typically scrubbed of identifying information prior to use, often with tools such as Microsoft’s Presidio or Anthropic’s PII purifier. These tools have traditionally been evaluated on their ability to remove specific identifiers (e.g., names), yet their effectiveness at preventing re-identification remains unclear. We introduce RAT-Bench, a comprehensive benchmark for text anonymization tools based on re-identification risk. Using U.S. demographic statistics, we generate synthetic text containing various direct and indirect identifiers across domains, languages, and difficulty levels. We evaluate a range of NER- and LLM-based text anonymization tools and, based on the attributes an LLM-based attacker is able to correctly infer from the anonymized text, we report the risk of re-identification in the U.S. population, while properly accounting for the disparate impact of identifiers. We find that, while capabilities vary widely, even the best tools are far from perfect in particular when direct identifiers are not written in standard ways and when indirect identifiers enable re-identification. Overall we find LLM-based anonymizers, including new iterative anonymizers, to provide a better privacy-utility trade-off albeit at a higher computational cost. Importantly, we also find them to work well across languages. We conclude with recommendations for future anonymization tools and will release the benchmark and encourage community efforts to expand it, in particular to other geographies.
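The re-identification risk computation reduces, in miniature, to a uniqueness count: the more attributes an attacker recovers from the anonymized text, the smaller the matching group and the higher the risk. The toy population below is an invented stand-in for the U.S. demographic statistics the benchmark uses:

```python
def reidentification_risk(inferred, population):
    """Chance of singling out one person: 1 / size of the group that shares
    every attribute the attacker inferred from the anonymized text.

    `population` is a list of attribute dicts (a toy stand-in for census
    statistics); `inferred` holds attacker-recovered attributes."""
    matches = sum(all(person.get(k) == v for k, v in inferred.items())
                  for person in population)
    return 0.0 if matches == 0 else 1.0 / matches

population = [
    {"zip": "10001", "age": 34, "job": "nurse"},
    {"zip": "10001", "age": 34, "job": "teacher"},
    {"zip": "10002", "age": 51, "job": "nurse"},
]
risk_broad = reidentification_risk({"zip": "10001"}, population)
risk_narrow = reidentification_risk({"zip": "10001", "job": "nurse"}, population)
```

This also illustrates the "disparate impact of identifiers" the paper accounts for: a leaked indirect identifier that is rare in the population shrinks the anonymity set far more than a common one.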

[20] Left-right asymmetry in predicting brain activity from LLMs’ representations emerges with their formal linguistic competence

Laurent Bonnasse-Gahot, Christophe Pallier

Main category: cs.CL

TL;DR: The paper investigates how left-right asymmetry in brain activity prediction from LLM activations correlates with the development of formal linguistic competence rather than other cognitive abilities.

Motivation: Previous research shows LLM activations correlate with human brain activity during text processing, with better prediction in the left hemisphere as training progresses. The authors aim to understand what specific LLM competencies underlie this left-right asymmetry emergence.

Method: Used OLMo-2 7B language model at various training checkpoints with fMRI data from English participants. Compared evolution of left-right asymmetry in brain scores against performance on multiple benchmarks. Tested formal linguistic abilities through minimal contrasting pairs (grammatical vs. ungrammatical sentences) and text generation quality. Also examined arithmetic, Dyck language tasks, and world knowledge/reasoning tasks. Generalized results to Pythia models and French language data.

Result: Left-right asymmetry co-emerges with formal linguistic abilities - specifically the model’s capacity to distinguish grammatical from ungrammatical sentences and produce well-formed text. The asymmetry does not correlate with arithmetic, Dyck language tasks, or text-based tasks involving world knowledge and reasoning. Results generalize across model families (OLMo-2, Pythia) and languages (English, French).

Conclusion: The left-right asymmetry in brain predictivity specifically tracks the development of formal linguistic competence (knowledge of linguistic patterns) in LLMs, rather than other cognitive abilities like arithmetic or reasoning.

Abstract: When humans and large language models (LLMs) process the same text, activations in the LLMs correlate with brain activity measured, e.g., with functional magnetic resonance imaging (fMRI). Moreover, it has been shown that, as the training of an LLM progresses, the performance in predicting brain activity from its internal activations improves more in the left hemisphere than in the right one. The aim of the present work is to understand which kind of competence acquired by the LLMs underlies the emergence of this left-right asymmetry. Using the OLMo-2 7B language model at various training checkpoints and fMRI data from English participants, we compare the evolution of the left-right asymmetry in brain scores alongside performance on several benchmarks. We observe that the asymmetry co-emerges with the formal linguistic abilities of the LLM. These abilities are demonstrated in two ways: by the model’s capacity to assign a higher probability to an acceptable sentence than to a grammatically unacceptable one within a minimal contrasting pair, or its ability to produce well-formed text. By contrast, the left-right asymmetry does not correlate with the performance on arithmetic or Dyck language tasks; nor with text-based tasks involving world knowledge and reasoning. We generalize these results to another family of LLMs (Pythia) and another language, namely French. Our observations indicate that the left-right asymmetry in brain predictivity matches the progress in formal linguistic competence (knowledge of linguistic patterns).

[21] AIWizards at MULTIPRIDE: A Hierarchical Approach to Slur Reclamation Detection

Luca Tedeschini, Matteo Fasulo

Main category: cs.CL

TL;DR: Hierarchical approach for detecting reclaimed slurs in hate speech detection by modeling LGBTQ+ community membership likelihood and fusing sociolinguistic context with hate speech representations.

Motivation: Reclaimed slurs present a fundamental challenge for hate speech detection systems because the same lexical items can function as abusive expressions or in-group affirmations depending on social identity and context. The paper addresses the need to incorporate sociolinguistic context into hate speech modeling.

Method: Two-stage hierarchical approach: 1) Use weakly supervised LLM-based annotation to assign fuzzy labels indicating likelihood of LGBTQ+ community membership based on tweets and user bios, then train a BERT-like model to predict community membership. 2) Integrate this latent space with a newly initialized model for downstream slur reclamation detection, fusing sociolinguistic signals with hate speech detection representations.

Result: Experimental results on Italian and Spanish show performance statistically comparable to a strong BERT-based baseline, while providing a modular and extensible framework for incorporating sociolinguistic context into hate speech modeling.

Conclusion: The hierarchical approach offers a framework for incorporating sociolinguistic context into hate speech detection, suggesting that more fine-grained modeling of user identity and discourse context may further improve detection of reclaimed language.

Abstract: Detecting reclaimed slurs represents a fundamental challenge for hate speech detection systems, as the same lexical items can function either as abusive expressions or as in-group affirmations depending on social identity and context. In this work, we address Subtask B of the MultiPRIDE shared task at EVALITA 2026 by proposing a hierarchical approach to modeling the slur reclamation process. Our core assumption is that members of the LGBTQ+ community are more likely, on average, to employ certain slurs in a reclamatory manner. Based on this hypothesis, we decompose the task into two stages. First, using a weakly supervised LLM-based annotation, we assign fuzzy labels to users indicating the likelihood of belonging to the LGBTQ+ community, inferred from the tweet and the user bio. These soft labels are then used to train a BERT-like model to predict community membership, encouraging the model to learn latent representations associated with LGBTQ+ identity. In the second stage, we integrate this latent space with a newly initialized model for the downstream slur reclamation detection task. The intuition is that the first model encodes user-oriented sociolinguistic signals, which are then fused with representations learned by a model pretrained for hate speech detection. Experimental results on Italian and Spanish show that our approach achieves performance statistically comparable to a strong BERT-based baseline, while providing a modular and extensible framework for incorporating sociolinguistic context into hate speech modeling. We argue that more fine-grained hierarchical modeling of user identity and discourse context may further improve the detection of reclaimed language. We release our code at https://github.com/LucaTedeschini/multipride.

[22] MentalBench: A Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models

Hoyun Song, Migyeong Kang, Jisu Shin, Jihyun Kim, Chanbi Park, Hangyeol Yoo, Jihyun An, Alice Oh, Jinyoung Han, KyungTae Lim

Main category: cs.CL

TL;DR: MentalBench is a benchmark for evaluating psychiatric diagnostic decision-making in LLMs using a psychiatrist-built knowledge graph (MentalKG) and synthetic clinical cases.

DetailsMotivation: Existing mental health benchmarks rely on social media data and cannot assess DSM-grounded diagnostic judgments. There's a need for systematic evaluation of LLMs' psychiatric diagnostic capabilities with clinical rigor.

Method: Created MentalKG, a psychiatrist-built knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 disorders. Generated 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity.

Result: State-of-the-art LLMs perform well on structured queries probing DSM-5 knowledge but struggle to calibrate confidence in diagnostic decision-making when distinguishing between clinically overlapping disorders.

Conclusion: MentalBench reveals evaluation gaps not captured by existing benchmarks, highlighting LLMs’ limitations in complex psychiatric diagnostic reasoning despite good factual knowledge.

Abstract: We introduce MentalBench, a benchmark for evaluating psychiatric diagnostic decision-making in large language models (LLMs). Existing mental health benchmarks largely rely on social media data, limiting their ability to assess DSM-grounded diagnostic judgments. At the core of MentalBench is MentalKG, a psychiatrist-built and validated knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders. Using MentalKG as a gold-standard logical backbone, we generate 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity, enabling low-noise and interpretable evaluation. Our experiments show that while state-of-the-art LLMs perform well on structured queries probing DSM-5 knowledge, they struggle to calibrate confidence in diagnostic decision-making when distinguishing between clinically overlapping disorders. These findings reveal evaluation gaps not captured by existing benchmarks.

[23] BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models

Jiangxi Chen, Qian Liu

Main category: cs.CL

TL;DR: BaziQA-Benchmark is a standardized evaluation framework for assessing symbolic and temporally compositional reasoning in LLMs using professionally curated fortune-telling problems requiring structured inference over symbolic charts and temporal conditions.

DetailsMotivation: Current LLM evaluations lack standardized benchmarks for assessing symbolic and temporally compositional reasoning. The authors aim to create an objective benchmark derived from professional fortune-telling competitions to enable controlled comparisons across models and years.

Method: Created BaziQA-Benchmark using 200 multiple-choice problems from Global Fortune-teller Competition (2021-2025). Each problem requires structured inference over fixed symbolic charts and interacting temporal conditions. Evaluated models in multi-turn settings and introduced Structured Reasoning Protocol to constrain inference order without adding domain knowledge.

Result: Models consistently outperform chance but remain far from saturation. Show pronounced sensitivity to temporal composition and reasoning order, with systematic failures on precise temporal localization and multi-condition symbolic judgments.

Conclusion: BaziQA-Benchmark provides a standardized evaluation framework revealing significant gaps in LLMs’ symbolic and temporal reasoning capabilities, highlighting the need for improved reasoning architectures.

Abstract: We present BaziQA-Benchmark, a standardized benchmark for evaluating symbolic and temporally compositional reasoning in large language models. The benchmark is derived from 200 professionally curated, multiple-choice problems from the Global Fortune-teller Competition (2021–2025), where each instance requires structured inference over a fixed symbolic chart and interacting temporal conditions. Unlike anecdotal or prompt-driven evaluations, BaziQA-Benchmark enables objective scoring and controlled comparison across years, domains, and model families. We evaluate contemporary language models under a multi-turn setting and analyze performance variation across temporal difficulty, reasoning domains, and inference protocols. To further probe reasoning behavior, we introduce a lightweight Structured Reasoning Protocol that constrains inference order without adding domain knowledge. Results show that models consistently outperform chance but remain far from saturation, exhibiting pronounced sensitivity to temporal composition and reasoning order, as well as systematic failures on precise temporal localization and multi-condition symbolic judgments.

[24] ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark

Tung X. Nguyen, Nhu Vo, Giang-Son Nguyen, Duy Mai Hoang, Chien Dinh Huynh, Inigo Jauregi Unanue, Massimo Piccardi, Wray Buntine, Dung D. Le

Main category: cs.CL

TL;DR: First benchmark dataset (ViMedCSS) for Vietnamese medical code-switching speech recognition, evaluating ASR models on English medical terms within Vietnamese speech.

DetailsMotivation: Code-switching with English medical terms in Vietnamese speech creates challenges for ASR systems, especially in low-resource languages, with no existing benchmark addressing this specific medical domain challenge.

Method: Constructed 34-hour ViMedCSS dataset with 16,576 utterances containing English medical terms from a bilingual lexicon across five medical topics. Evaluated state-of-the-art ASR models and fine-tuning strategies for medical term recognition.
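Since every utterance contains at least one lexicon term, a natural per-utterance metric is the fraction of reference medical terms the ASR output recovers. The sketch below is our illustration of such a metric, not the paper's evaluation code; the function name and bag-of-words matching are assumptions:

```python
def medical_term_recall(reference, hypothesis, lexicon):
    """Fraction of English medical terms in the reference transcript
    that also appear in the ASR hypothesis (case-insensitive,
    bag-of-words match; word order is ignored)."""
    lexicon = {t.lower() for t in lexicon}
    ref_terms = [t.lower() for t in reference if t.lower() in lexicon]
    if not ref_terms:
        return None  # utterance has no code-switched terms to score
    hyp = {t.lower() for t in hypothesis}
    return sum(1 for t in ref_terms if t in hyp) / len(ref_terms)
```

Scoring code-switched terms separately from the overall word error rate is what lets one observe the reported trade-off between Vietnamese-optimized models (better general segments) and multilingual pretraining (better English insertions).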

Result: Vietnamese-optimized models perform better on general segments, multilingual pretraining helps capture English insertions, and combining both approaches yields best balance between overall and code-switched accuracy.

Conclusion: Provides first benchmark for Vietnamese medical code-switching and insights into effective domain adaptation for low-resource, multilingual ASR systems.

Abstract: Code-switching (CS), in which Vietnamese speech incorporates English words such as drug names or procedures, is a common phenomenon in Vietnamese medical communication. This creates challenges for Automatic Speech Recognition (ASR) systems, especially in low-resource languages like Vietnamese. Most current ASR systems struggle to correctly recognize English medical terms within Vietnamese sentences, and no benchmark addresses this challenge. In this paper, we construct a 34-hour Vietnamese Medical Code-Switching Speech dataset (ViMedCSS) containing 16,576 utterances. Each utterance includes at least one English medical term drawn from a curated bilingual lexicon covering five medical topics. Using this dataset, we evaluate several state-of-the-art ASR models and examine specific fine-tuning strategies for improving medical term recognition, in order to identify the most effective approach on the dataset. Experimental results show that Vietnamese-optimized models perform better on general segments, while multilingual pretraining helps capture English insertions. The combination of both approaches yields the best balance between overall and code-switched accuracy. This work provides the first benchmark for Vietnamese medical code-switching and offers insights into effective domain adaptation for low-resource, multilingual ASR systems.

[25] When Words Don’t Mean What They Say: Figurative Understanding in Bengali Idioms

Adib Sakhawat, Shamim Ara Parveen, Md Ruhul Amin, Shamim Al Mahmud, Md Saiful Islam, Tahera Khatun

Main category: cs.CL

TL;DR: A new large-scale Bengali idiom dataset with comprehensive annotations is introduced, revealing significant limitations in current LLMs’ figurative language understanding for low-resource languages.

DetailsMotivation: Figurative language understanding remains challenging for LLMs, especially for low-resource languages like Bengali. There's a need for culturally-grounded resources and benchmarks to evaluate and improve cross-linguistic and cultural reasoning capabilities.

Method: Created a large-scale corpus of 10,361 Bengali idioms with comprehensive 19-field annotations covering semantic, syntactic, cultural, and religious dimensions. Evaluated 30 state-of-the-art multilingual and instruction-tuned LLMs on figurative meaning inference tasks.

Result: No model surpassed 50% accuracy on Bengali idiom understanding, while human performance reached 83.4%. This reveals a critical performance gap in cross-linguistic and cultural reasoning capabilities of current LLMs.

Conclusion: Current LLMs have significant limitations in figurative language understanding for low-resource languages. The released dataset and benchmark provide foundational infrastructure for advancing cultural grounding and figurative language understanding in LLMs.

Abstract: Figurative language understanding remains a significant challenge for Large Language Models (LLMs), especially for low-resource languages. To address this, we introduce a new idiom dataset, a large-scale, culturally-grounded corpus of 10,361 Bengali idioms. Each idiom is annotated under a comprehensive 19-field schema, established and refined through a deliberative expert consensus process, that captures its semantic, syntactic, cultural, and religious dimensions, providing a rich, structured resource for computational linguistics. To establish a robust benchmark for Bangla figurative language understanding, we evaluate 30 state-of-the-art multilingual and instruction-tuned LLMs on the task of inferring figurative meaning. Our results reveal a critical performance gap, with no model surpassing 50% accuracy, a stark contrast to significantly higher human performance (83.4%). This underscores the limitations of existing models in cross-linguistic and cultural reasoning. By releasing the new idiom dataset and benchmark, we provide foundational infrastructure for advancing figurative language understanding and cultural grounding in LLMs for Bengali and other low-resource languages.

[26] Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models

Ali Mekky, Mohamed El Zeftawy, Lara Hassan, Amr Keleg, Preslav Nakov

Main category: cs.CL

TL;DR: The paper introduces LAHJATBERT, a BERT-based multi-label classifier for Arabic Dialect Identification using GPT-4o-generated multi-label annotations and curriculum learning.

DetailsMotivation: Arabic Dialect Identification (ADI) has traditionally been framed as single-label classification, but recent work argues it should be multi-label. However, there are no large-scale multi-label datasets available for training, and existing single-label datasets are inadequate for multi-label tasks due to negative sample selection issues.

Method: 1) Construct multi-label dataset using GPT-4o and binary dialect acceptability classifiers with aggregation guided by Arabic Level of Dialectness (ALDi); 2) Train BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality.
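Curriculum learning "aligned with dialectal complexity and label cardinality" amounts to ordering training examples from easy to hard. The key function below is a generic illustration of that idea, with hypothetical field names (`labels` for the dialect set, `aldi` for an ALDi-style dialectness score); the paper's actual scheduling may differ:

```python
def curriculum_order(examples):
    """Order training examples easy-to-hard for curriculum learning:
    fewer dialect labels first (lower label cardinality), with ties
    broken by a lower level-of-dialectness score."""
    return sorted(examples, key=lambda ex: (len(ex["labels"]), ex.get("aldi", 0.0)))
```

Feeding batches in this order lets the classifier master near-MSA, single-dialect sentences before confronting highly dialectal, multi-dialect ones.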

Result: LAHJATBERT achieves macro F1 of 0.69 on the MLADI leaderboard, significantly outperforming the previous best system’s score of 0.55.

Conclusion: The paper successfully addresses the multi-label ADI problem by creating a novel dataset generation method and applying curriculum learning, demonstrating substantial performance improvements over existing approaches.

Abstract: Although Arabic Dialect Identification (ADI) was long modeled as a single-label classification task, recent work has argued that it should be framed as a multi-label classification task. However, ADI remains constrained by the availability of single-label datasets, with no large-scale multi-label resources available for training. By analyzing models trained on single-label ADI data, we show that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative could be acceptable in multiple dialects. To address these issues, we construct a multi-label dataset by generating automatic multi-label annotations using GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). Afterward, we train a BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality. On the MLADI leaderboard, our best-performing LAHJATBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system. Code and data are available at https://mohamedalaa9.github.io/lahjatbert/.

[27] ProbeLLM: Automating Principled Diagnosis of LLM Failures

Yue Huang, Zhengzhe Jiang, Yuchen Ma, Yu Jiang, Xiangqi Wang, Yujun Zhou, Yuexing Hao, Kehan Guo, Pin-Yu Chen, Stefan Feuerriegel, Xiangliang Zhang

Main category: cs.CL

TL;DR: ProbeLLM is an automated probing framework that uses hierarchical Monte Carlo Tree Search to systematically discover and structure failure modes in large language models, moving beyond isolated failure cases to reveal comprehensive weakness landscapes.

DetailsMotivation: Existing automated probing approaches for LLMs often discover isolated failure cases without principled control over exploration, provide limited insight into underlying weakness structures, and static evaluations can't keep pace with rapidly evolving models.

Method: ProbeLLM uses hierarchical Monte Carlo Tree Search to allocate probing budgets between global exploration of new failure regions and local refinement of recurring error patterns. It restricts probing to verifiable test cases using tool-augmented generation and verification, then consolidates discovered failures into interpretable failure modes via failure-aware embeddings and boundary-aware induction.
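The global-exploration-versus-local-refinement trade-off described here is the classic bandit trade-off at the heart of MCTS. As generic MCTS machinery (the paper's exact selection policy is not specified in this summary), the standard UCB1 rule that such a search could use to pick which failure region to probe next looks like:

```python
import math

def ucb1(total_reward: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """UCB1 score for a search-tree node: exploitation (mean reward,
    e.g. failure rate found in this region) plus an exploration bonus
    that shrinks as the node is visited more often."""
    if visits == 0:
        return float("inf")  # unexplored failure regions are probed first
    exploitation = total_reward / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration
```

With this rule, a region that keeps yielding failures (high mean reward) is refined locally, while rarely visited regions retain a large exploration bonus, which is precisely how a fixed probing budget gets split between the two modes.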

Result: Across diverse benchmarks and LLMs, ProbeLLM reveals substantially broader, cleaner, and more fine-grained failure landscapes than static benchmarks and prior automated methods, supporting a shift from case-centric evaluation toward principled weakness discovery.

Conclusion: ProbeLLM provides a benchmark-agnostic framework for structured weakness discovery in LLMs, enabling more comprehensive understanding of model failures and supporting systematic evaluation beyond isolated test cases.

Abstract: Understanding how and why large language models (LLMs) fail is becoming a central challenge as models rapidly evolve and static evaluations fall behind. While automated probing has been enabled by dynamic test generation, existing approaches often discover isolated failure cases, lack principled control over exploration, and provide limited insight into the underlying structure of model weaknesses. We propose ProbeLLM, a benchmark-agnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes. ProbeLLM formulates probing as a hierarchical Monte Carlo Tree Search, explicitly allocating limited probing budgets between global exploration of new failure regions and local refinement of recurring error patterns. By restricting probing to verifiable test cases and leveraging tool-augmented generation and verification, ProbeLLM grounds failure discovery in reliable evidence. Discovered failures are further consolidated into interpretable failure modes via failure-aware embeddings and boundary-aware induction. Across diverse benchmarks and LLMs, ProbeLLM reveals substantially broader, cleaner, and more fine-grained failure landscapes than static benchmarks and prior automated methods, supporting a shift from case-centric evaluation toward principled weakness discovery.

[28] SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

Yujiong Shen, Yajie Yang, Zhiheng Xi, Binze Hu, Huayu Sha, Jiazheng Zhang, Qiyuan Peng, Junlin Shang, Jixuan Huang, Yutao Fan, Jingqi Tong, Shihan Dou, Ming Zhang, Lei Bai, Zhenfei Yin, Tao Gui, Xingjun Ma, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang

Main category: cs.CL

TL;DR: SciAgentGym introduces a scalable environment with 1,780 domain-specific tools across natural sciences and SciAgentBench for evaluating agent capabilities, revealing current models struggle with complex scientific tool-use, addressed by SciForge data synthesis method.

DetailsMotivation: Current benchmarks overlook agents' ability to orchestrate tools for rigorous scientific workflows, creating a gap in evaluating scientific reasoning capabilities that require sophisticated tool integration.

Method: 1) Created SciAgentGym with 1,780 domain-specific tools across four natural science disciplines; 2) Developed SciAgentBench tiered evaluation suite; 3) Proposed SciForge data synthesis method modeling tool action space as dependency graphs to generate logic-aware training trajectories.

Result: State-of-the-art models struggle with complex scientific tool-use (GPT-5 success drops from 60.6% to 30.9% with extended interaction horizons). SciAgent-8B fine-tuned on SciForge trajectories outperforms Qwen3-VL-235B-Instruct and shows positive cross-domain transfer of scientific tool-use capabilities.

Conclusion: The work demonstrates promising potential for next-generation autonomous scientific agents and highlights the critical bottleneck in current models’ ability to handle complex scientific workflows, which can be addressed through targeted data synthesis and fine-tuning approaches.

Abstract: Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents’ ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models struggle with complex scientific tool-use. Even for a leading model like GPT-5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend, primarily due to failures in multi-step workflow execution. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.

[29] Evaluating the Homogeneity of Keyphrase Prediction Models

Maël Houbre, Florian Boudin, Beatrice Daille

Main category: cs.CL

TL;DR: Keyphrase generation models can predict absent keyphrases not in the text, but surprisingly, extraction methods are competitive and absent keyphrase generation can negatively impact prediction homogeneity.

DetailsMotivation: Current benchmarks don't evaluate homogeneity of keyphrase prediction models. The paper investigates whether the ability to generate absent keyphrases (not appearing in text) actually helps models be more homogeneous in indexing similar documents.

Method: Introduced a method to evaluate homogeneity of keyphrase prediction models. Compared keyphrase extraction methods (which only use text-present keyphrases) with generative models (which can predict absent keyphrases).
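One natural way to operationalize homogeneity is average keyphrase overlap across pairs of same-subject documents. The paper's actual metric is not given in this summary, so the Jaccard-based sketch below is purely illustrative:

```python
def homogeneity(pairs):
    """Mean Jaccard overlap between predicted keyphrase sets for pairs
    of documents known to treat the same subjects. 1.0 means the model
    indexes similar documents identically; 0.0 means no shared keyphrases."""
    scores = []
    for preds_a, preds_b in pairs:
        union = preds_a | preds_b
        scores.append(len(preds_a & preds_b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)
```

Under a metric of this shape, a generative model that invents different absent keyphrases for each document of a pair would score lower than an extractor that consistently picks the shared surface terms, which matches the paper's surprising finding.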

Result: Surprisingly showed that keyphrase extraction methods are competitive with generative models, and the ability to generate absent keyphrases can actually have a negative impact on homogeneity.

Conclusion: The ability to generate absent keyphrases doesn’t necessarily improve model homogeneity, and extraction methods remain competitive. A new evaluation method for homogeneity is needed.

Abstract: Keyphrases, which are useful in several NLP and IR applications, are either extracted from text or predicted by generative models. Contrary to keyphrase extraction approaches, keyphrase generation models can predict keyphrases that do not appear in a document’s text, called absent keyphrases. This ability means that keyphrase generation models can associate a document with a notion that is not explicitly mentioned in its text. Intuitively, this suggests that for two documents treating the same subjects, a keyphrase generation model is more likely to be homogeneous in their indexing, i.e., to predict the same keyphrase for both documents, regardless of whether those keyphrases appear in their respective texts; something a keyphrase extraction model would fail to do. Yet, the homogeneity of keyphrase prediction models is not covered by current benchmarks. In this work, we introduce a method to evaluate the homogeneity of keyphrase prediction models and study whether absent keyphrase generation capabilities actually help the model to be more homogeneous. To our surprise, we show that keyphrase extraction methods are competitive with generative models, and that the ability to generate absent keyphrases can actually have a negative impact on homogeneity. Our data, code and prompts are available on huggingface and github.

[30] Know More, Know Clearer: A Meta-Cognitive Framework for Knowledge Augmentation in Large Language Models

Hao Chen, Ye He, Yuchun Fan, Yukun Yan, Zhenghao Liu, Qingfu Zhu, Maosong Sun, Wanxiang Che

Main category: cs.CL

TL;DR: A meta-cognitive framework for reliable knowledge augmentation in LLMs that addresses knowledge-confidence gaps through differentiated intervention and cognitive consistency mechanisms.

DetailsMotivation: Existing knowledge augmentation methods for LLMs assume model performance equals internal knowledge, overlooking knowledge-confidence gaps that cause overconfident errors or uncertain truths. This leads to unreliable outputs where models are either too confident about wrong information or uncertain about correct knowledge.

Method: Proposes a meta-cognitive framework with two key components: 1) Uses internal cognitive signals to partition knowledge space into mastered, confused, and missing regions for targeted knowledge expansion, and 2) Introduces a cognitive consistency mechanism to synchronize subjective certainty with objective accuracy for calibrated knowledge boundaries.
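A minimal sketch of how correctness and confidence signals could partition items into the mastered, confused, and missing regions described above; the thresholds, field names, and branch logic are hypothetical, not taken from the paper:

```python
def partition_knowledge(items, conf_hi=0.7, conf_lo=0.3):
    """Split evaluation items into three cognitive regions:
    - mastered: correct and confident
    - missing:  wrong and unconfident (the model genuinely lacks it)
    - confused: everything else, i.e. knowledge-confidence gaps such as
      overconfident errors and uncertain truths."""
    regions = {"mastered": [], "confused": [], "missing": []}
    for item in items:
        if item["correct"] and item["confidence"] >= conf_hi:
            regions["mastered"].append(item["id"])
        elif not item["correct"] and item["confidence"] <= conf_lo:
            regions["missing"].append(item["id"])
        else:
            regions["confused"].append(item["id"])
    return regions
```

Such a partition is what makes the intervention differentiated: knowledge expansion can target the missing region, while calibration-style training targets the confused one.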

Result: Extensive experiments show the framework consistently outperforms strong baselines, validating its effectiveness in enhancing knowledge capabilities while fostering cognitive behaviors that better distinguish knowns from unknowns.

Conclusion: The proposed meta-cognitive framework provides a more reliable approach to knowledge augmentation by addressing knowledge-confidence gaps through differentiated intervention and cognitive alignment, leading to better calibrated knowledge boundaries and improved model reliability.

Abstract: Knowledge augmentation has significantly enhanced the performance of Large Language Models (LLMs) in knowledge-intensive tasks. However, existing methods typically operate on the simplistic premise that model performance equates with internal knowledge, overlooking the knowledge-confidence gaps that lead to overconfident errors or uncertain truths. To bridge this gap, we propose a novel meta-cognitive framework for reliable knowledge augmentation via differentiated intervention and alignment. Our approach leverages internal cognitive signals to partition the knowledge space into mastered, confused, and missing regions, guiding targeted knowledge expansion. Furthermore, we introduce a cognitive consistency mechanism to synchronize subjective certainty with objective accuracy, ensuring calibrated knowledge boundaries. Extensive experiments demonstrate that our framework consistently outperforms strong baselines, validating its effectiveness in not only enhancing knowledge capabilities but also fostering cognitive behaviors that better distinguish knowns from unknowns.

[31] Can we trust AI to detect healthy multilingual English speakers among the cognitively impaired cohort in the UK? An investigation using real-world conversational speech

Madhurananda Pahar, Caitlin Illingworth, Dorota Braun, Bahman Mirheidari, Lise Sproson, Daniel Blackburn, Heidi Christensen

Main category: cs.CL

TL;DR: AI models for cognitive decline detection show bias against multilingual ethnic minority speakers in the UK, particularly misclassifying them as having more severe cognitive impairment.

DetailsMotivation: To examine trustworthiness and bias in AI models for detecting cognitive decline from speech, particularly focusing on multilingual ethnic minority populations in the UK where dementia prevalence is rising rapidly.

Method: Recruited monolingual participants nationally and multilingual speakers from community centers in Sheffield and Bradford, including speakers of Somali, Chinese, and South Asian languages with Yorkshire accents. Tested ASR systems, classification, and regression models using acoustic and linguistic features on memory, fluency, and reading tasks.

Result: ASR systems showed no significant bias, but classification and regression models exhibited bias against multilingual speakers, especially when trained on DementiaBank dataset. Multilinguals were more likely to be misclassified as having cognitive decline, with South Yorkshire accent speakers particularly affected.

Conclusion: Current AI models for cognitive decline detection are not reliable for diagnostic use in multilingual ethnic minority populations due to bias, requiring development of more generalizable, bias-mitigated models.

Abstract: Conversational speech often reveals early signs of cognitive decline, such as dementia and MCI. In the UK, one in four people belongs to an ethnic minority, and dementia prevalence is expected to rise most rapidly among Black and Asian communities. This study examines the trustworthiness of AI models, specifically the presence of bias, in detecting healthy multilingual English speakers among the cognitively impaired cohort, with the aim of making these tools clinically beneficial. For experiments, monolingual participants were recruited nationally (UK), and multilingual speakers were enrolled from four community centres in Sheffield and Bradford. In addition to a non-native English accent, multilingual participants spoke Somali, Chinese, or South Asian languages; they were further divided into two Yorkshire accent groups (West and South) to thoroughly challenge the efficiency of the AI tools. Although ASR systems showed no significant bias across groups, classification and regression models using acoustic and linguistic features exhibited bias against multilingual speakers, particularly in memory, fluency, and reading tasks. This bias was more pronounced when models were trained on the publicly available DementiaBank dataset. Moreover, multilinguals were more likely to be misclassified as having cognitive decline. This study is the first of its kind to discover that, despite their strong overall performance, current AI models show bias against multilingual individuals from ethnic minority backgrounds in the UK, and they are also more likely to misclassify speakers with a certain accent (South Yorkshire) as living with a more severe cognitive decline. In this pilot study, we conclude that the existing AI tools are therefore not yet reliable for diagnostic use in these populations, and we aim to address this in future work by developing more generalisable, bias-mitigated models.

[32] TraceBack: Multi-Agent Decomposition for Fine-Grained Table Attribution

Tejas Anvekar, Junha Park, Rajat Jha, Devanshu Gupta, Poojah Ganesan, Puneeth Mathur, Vivek Gupta

Main category: cs.CL

TL;DR: TraceBack is a modular multi-agent framework for table QA that provides cell-level attribution, with CITEBench benchmark and FairScore metric for evaluation.

DetailsMotivation: Existing table QA systems lack fine-grained attribution, making correct answers untrustworthy in high-stakes settings due to missing verifiable grounding.

Method: TraceBack uses a modular multi-agent approach: prunes tables to relevant rows/columns, decomposes questions into sub-questions, and aligns answer spans with supporting cells to capture explicit and implicit evidence.

Result: TraceBack substantially outperforms strong baselines across datasets and granularities. FairScore metric closely tracks human judgments and preserves relative method rankings.

Conclusion: TraceBack enables interpretable and scalable evaluation of table-based QA with reliable cell-level attribution, addressing trust issues in high-stakes applications.

Abstract: Question answering (QA) over structured tables requires not only accurate answers but also transparency about which cells support them. Existing table QA systems rarely provide fine-grained attribution, so even correct answers often lack verifiable grounding, limiting trust in high-stakes settings. We address this with TraceBack, a modular multi-agent framework for scalable, cell-level attribution in single-table QA. TraceBack prunes tables to relevant rows and columns, decomposes questions into semantically coherent sub-questions, and aligns each answer span with its supporting cells, capturing both explicit and implicit evidence used in intermediate reasoning steps. To enable systematic evaluation, we release CITEBench, a benchmark with phrase-to-cell annotations drawn from ToTTo, FetaQA, and AITQA. We further propose FairScore, a reference-less metric that compares atomic facts derived from predicted cells and answers to estimate attribution precision and recall without human cell labels. Experiments show that TraceBack substantially outperforms strong baselines across datasets and granularities, while FairScore closely tracks human judgments and preserves relative method rankings, supporting interpretable and scalable evaluation of table-based QA.

[33] Exploring a New Competency Modeling Process with Large Language Models

Silin Du, Manqing Xin, Raymond Jia Wang

Main category: cs.CL

TL;DR: LLM-based competency modeling framework that extracts behavioral/psychological descriptions from interview transcripts and maps them to competency libraries using embedding similarity with adaptive weighting.

Motivation: Traditional competency modeling relies on manual analysis of interview transcripts, which is costly, prone to randomness/ambiguity, and has limited reproducibility. There's a need for more systematic, data-driven approaches.

Method: Decomposes expert practices into computational components: uses LLMs to extract behavioral/psychological descriptions from text, maps to competency libraries via embedding similarity, introduces learnable parameter for adaptive integration of information sources, and develops offline evaluation for systematic model selection.

Result: Empirical results from a software outsourcing company show strong predictive validity, cross-library consistency, and structural robustness. The framework transforms a qualitative practice into a transparent, data-driven analytical process.

Conclusion: LLM-based framework successfully transforms competency modeling from expert-dependent qualitative practice to transparent, data-driven, evaluable analytical process with strong empirical validation.

Abstract: Competency modeling is widely used in human resource management to select, develop, and evaluate talent. However, traditional expert-driven approaches rely heavily on manual analysis of large volumes of interview transcripts, making them costly and prone to randomness, ambiguity, and limited reproducibility. This study proposes a new competency modeling process built on large language models (LLMs). Instead of merely automating isolated steps, we reconstruct the workflow by decomposing expert practices into structured computational components. Specifically, we leverage LLMs to extract behavioral and psychological descriptions from raw textual data and map them to predefined competency libraries through embedding-based similarity. We further introduce a learnable parameter that adaptively integrates different information sources, enabling the model to determine the relative importance of behavioral and psychological signals. To address the long-standing challenge of validation, we develop an offline evaluation procedure that allows systematic model selection without requiring additional large-scale data collection. Empirical results from a real-world implementation in a software outsourcing company demonstrate strong predictive validity, cross-library consistency, and structural robustness. Overall, our framework transforms competency modeling from a largely qualitative and expert-dependent practice into a transparent, data-driven, and evaluable analytical process.
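
The mapping step can be sketched as cosine similarity against a competency library, with a single learnable logit `w` adaptively blending the behavioral and psychological signals. Everything below is an illustrative reconstruction, not the paper's code; in practice the embeddings would come from a pretrained encoder.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def competency_scores(behav_emb, psych_emb, library, w=0.0):
    """Map extracted descriptions to a competency library.

    library: {competency_name: embedding}
    w: learnable logit; sigmoid(w) weights behavioral vs. psychological signal.
    """
    alpha = 1.0 / (1.0 + math.exp(-w))  # adaptive weight in (0, 1)
    return {
        name: alpha * cosine(behav_emb, emb) + (1 - alpha) * cosine(psych_emb, emb)
        for name, emb in library.items()
    }
```

Selecting the argmax competency per description then reduces model selection to tuning the single parameter `w` offline.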

[34] Towards interpretable models for language proficiency assessment: Predicting the CEFR level of Estonian learner texts

Kais Allkivi

Main category: cs.CL

TL;DR: Study uses NLP to classify Estonian language proficiency levels (A2-C1) in exam writings through feature selection for explainable models, achieving ~0.9 accuracy and showing texts have become more complex over time.

Motivation: NLP analysis of authentic learner language can build automated assessment tools and provide insights into second language development, but there's a lack of research combining these aspects for explainable and generalizable models.

Method: Analyzed linguistic properties of Estonian proficiency exam writings to identify relevant proficiency predictors (lexical, morphological, surface, and error features) associated with complexity and correctness. Trained classification models with pre-selected features and compared to models with other features.

Result: Pre-selected features yielded similar test accuracy (~0.9) but reduced variation in classifying different text types. Additional evaluation on earlier exam samples showed writings have become more complex over 7-10 years while maintaining ~0.8 accuracy with some feature sets.

Conclusion: Careful feature selection leads to explainable and generalizable machine learning models for language proficiency assessment. The results have been implemented in the writing evaluation module of an Estonian open-source language learning environment.

Abstract: Using NLP to analyze authentic learner language helps to build automated assessment and feedback tools. It also offers new and extensive insights into the development of second language production. However, there is a lack of research explicitly combining these aspects. This study aimed to classify Estonian proficiency examination writings (levels A2-C1), assuming that careful feature selection can lead to more explainable and generalizable machine learning models for language testing. Various linguistic properties of the training data were analyzed to identify relevant proficiency predictors associated with increasing complexity and correctness, rather than the writing task. Such lexical, morphological, surface, and error features were used to train classification models, which were compared to models that also allowed for other features. The pre-selected features yielded a similar test accuracy but reduced variation in the classification of different text types. The best classifiers achieved an accuracy of around 0.9. Additional evaluation on an earlier exam sample revealed that the writings have become more complex over a 7-10-year period, while accuracy still reached 0.8 with some feature sets. The results have been implemented in the writing evaluation module of an Estonian open-source language learning environment.
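
A minimal sketch of feature-based level classification, with toy stand-ins (average word length, type-token ratio) for the study's lexical, morphological, surface, and error features. The study's actual feature set and models are far richer; this only illustrates the pipeline shape.

```python
def extract_features(text):
    """Toy proxies for complexity-related proficiency predictors."""
    words = text.split()
    avg_word_len = sum(len(w) for w in words) / len(words)
    ttr = len(set(words)) / len(words)  # type-token ratio (lexical diversity)
    return [avg_word_len, ttr]

def nearest_level(features, centroids):
    """centroids: {CEFR level: feature vector}; assign the closest level."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(centroids, key=lambda lvl: dist(features, centroids[lvl]))
```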

[35] SCOPE: Selective Conformal Optimized Pairwise LLM Judging

Sher Badshah, Ali Emami, Hassan Sajjad

Main category: cs.CL

TL;DR: SCOPE framework for selective pairwise LLM evaluation with statistical guarantees using bidirectional preference entropy for uncertainty estimation

Motivation: LLMs are increasingly used as judges for pairwise evaluation to replace costly human labels, but they suffer from miscalibration and systematic biases, requiring methods with statistical guarantees.

Method: Proposes the SCOPE framework with Bidirectional Preference Entropy (BPE): queries the judge under both response positions, aggregates the implied preference probabilities to enforce position invariance, and converts the result into an entropy-based uncertainty score for selective judging with conformal calibration.

Result: SCOPE consistently meets target risk levels (α=0.10) across MT-Bench, RewardBench, and Chatbot Arena with empirical risk ≈0.097-0.099, retains substantial coverage (0.89-0.98), accepts up to 2.4× more judgments than baselines under same risk constraints.

Conclusion: BPE improves uncertainty quality over standard confidence proxies, enabling SCOPE to provide reliable LLM-based evaluation with statistical guarantees while maintaining high coverage across different judge scales.

Abstract: Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level $\alpha$. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining good coverage across judge scales. In particular, at $\alpha = 0.10$, SCOPE consistently satisfies the risk bound across all benchmarks and judge scales (empirical risk $\approx 0.097$ to $0.099$), while retaining substantial coverage, reaching $0.89$ on RewardBench with Qwen-14B and $0.98$ on RewardBench with Qwen-32B. Compared to naïve baselines, SCOPE accepts up to $2.4\times$ more judgments on MT-Bench with Qwen-7B under the same target risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.
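
BPE as described admits a compact sketch: average the judge's preference probability across both presentation orders, then take the binary entropy; the system abstains whenever this uncertainty exceeds a calibrated threshold. Function names and the simplified calibration interface below are assumptions, not the paper's implementation.

```python
import math

def bpe_uncertainty(p_first_order, p_second_order):
    """Bidirectional Preference Entropy (sketch).

    p_first_order:  P(response A preferred) when A is shown first
    p_second_order: P(response A preferred) when A is shown second
    Averaging the two enforces invariance to response order.
    """
    p = 0.5 * (p_first_order + p_second_order)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def selective_judge(p1, p2, threshold):
    """Accept the judgment only if BPE uncertainty is below the
    (conformally calibrated) threshold; otherwise abstain."""
    return "accept" if bpe_uncertainty(p1, p2) <= threshold else "abstain"
```

A position-biased judge that answers 0.9 in one order and 0.1 in the other averages to 0.5 — maximal entropy — so position bias is surfaced as uncertainty rather than hidden.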

[36] From sunblock to softblock: Analyzing the correlates of neology in published writing and on social media

Maria Ryskina, Matthew R. Gormley, Kyle Mahowald, David R. Mortensen, Taylor Berg-Kirkpatrick, Vivek Kulkarni

Main category: cs.CL

TL;DR: Analysis of word emergence patterns across different domains (published texts vs. Twitter) using contextual embeddings, showing similar factors drive neology but with domain-specific variations in formation mechanisms.

Motivation: To understand how different social and conversational contexts (published texts vs. social media) influence word creation and evolution, and to test whether previously identified factors for neology in historical texts apply to modern social media platforms like Twitter.

Method: Extended prior methodology using both static and contextual embeddings, applied to a Twitter corpus to analyze word emergence patterns, comparing results with previous findings from published historical texts.

Result: The same factors correlated with new word creation in published texts also apply to Twitter, though topic popularity growth contributes less to neology on Twitter than in published writing, suggesting domain-specific formation mechanisms.

Conclusion: Language evolution follows similar principles across different domains but with context-specific variations in neologism formation mechanisms, highlighting how social and conversational contexts shape word creation differently.

Abstract: Living languages are shaped by a host of conflicting internal and external evolutionary pressures. While some of these pressures are universal across languages and cultures, others differ depending on the social and conversational context: language use in newspapers is subject to very different constraints than language use on social media. Prior distributional semantic work on English word emergence (neology) identified two factors correlated with creation of new words by analyzing a corpus consisting primarily of historical published texts (Ryskina et al., 2020, arXiv:2001.07740). Extending this methodology to contextual embeddings in addition to static ones and applying it to a new corpus of Twitter posts, we show that the same findings hold for both domains, though the topic popularity growth factor may contribute less to neology on Twitter than in published writing. We hypothesize that this difference can be explained by the two domains favouring different neologism formation mechanisms.

[37] OpenLID-v3

Mariia Fedorova, Nikolay Arefyev, Maja Buljan, Jindřich Helcl, Stephan Oepen, Egil Rønningstad, Yves Scherrer

Main category: cs.CL

TL;DR: OpenLID-v3 improves language identification by adding training data, merging problematic language variants, and adding noise detection, outperforming GlotLID on closely related languages.

Motivation: Existing LID tools struggle with closely related languages and distinguishing natural language from noise, contaminating multilingual datasets especially for low-resource languages.

Method: Extended OpenLID classifier with more training data, merged problematic language variant clusters, and introduced special noise label. Evaluated against GlotLID with focus on three groups of closely related languages.

Result: OpenLID-v3 shows improved performance on closely related languages. Ensemble approaches improve precision but reduce coverage for low-resource languages.

Conclusion: OpenLID-v3 provides better language identification for multilingual dataset building, especially for closely related languages and noise detection.

Abstract: Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages. OpenLID-v3 is available on https://huggingface.co/HPLT/OpenLID-v3.

[38] Semantic Chunking and the Entropy of Natural Language

Weishun Zhong, Doron Sivan, Tankut Can, Mikhail Katkov, Misha Tsodyks

Main category: cs.CL

TL;DR: A statistical model explaining English’s redundancy through hierarchical semantic segmentation, showing entropy rates vary with semantic complexity.

Motivation: To provide a first-principles explanation for why printed English has about 1 bit per character entropy rate (80% redundancy) compared to random text's 5 bits per character, and to understand the multi-scale structure of natural language.

Method: Developed a statistical model that self-similarly segments text into semantically coherent chunks down to single-word level, allowing hierarchical decomposition of semantic structure. Used numerical experiments with modern LLMs and open datasets to validate the model.

Result: The model quantitatively captures real text structure at different semantic hierarchy levels, predicts entropy rates matching estimated English entropy, and reveals that entropy rate increases systematically with semantic complexity of corpora.

Conclusion: Natural language redundancy can be explained through hierarchical semantic structure, with entropy rates not fixed but varying with semantic complexity, captured by a single free parameter in the model.

Abstract: The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80 percent redundancy relative to the five bits per character expected for random text. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language, providing a first-principles account of this redundancy level. Our model describes a procedure of self-similarly segmenting text into semantically coherent chunks down to the single-word level. The semantic structure of the text can then be hierarchically decomposed, allowing for analytical treatment. Numerical experiments with modern LLMs and open datasets suggest that our model quantitatively captures the structure of real texts at different levels of the semantic hierarchy. The entropy rate predicted by our model agrees with the estimated entropy rate of printed English. Moreover, our theory further reveals that the entropy rate of natural language is not fixed but should increase systematically with the semantic complexity of corpora, which are captured by the only free parameter in our model.
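
The redundancy figure quoted in the abstract follows directly from the two entropy rates — Shannon's ~1 bit/character estimate for printed English against the ~5 bits/character of uniform random text over 26 letters plus space:

```python
import math

h_random = math.log2(27)    # uniform over a-z plus space: ~4.75 bits/char
h_english = 1.0             # Shannon's estimate for printed English

redundancy = 1 - h_english / h_random
print(f"{redundancy:.0%}")  # prints "79%" — the "nearly 80 percent" figure
```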

[39] CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal Model Inference

Ruqi Liao, Chuqing Zhao, Jin Li, Weiqi Feng, Yi Lyu, Bingxian Chen, Haochen Yang

Main category: cs.CL

TL;DR: CATP is a token pruning method for multimodal models that uses cross-attention layers to determine token importance, achieving significantly higher accuracy than existing methods while balancing computational efficiency and precision.

Motivation: Address the need for efficient multimodal models by developing a token pruning method that better balances computational efficiency with model precision, as existing methods sacrifice too much accuracy.

Method: Uses cross-attention layers in multimodal models (like BLIP-2) to extract information for token importance determination, employing a refined voting strategy across model heads and layers to decide which tokens to prune.

Result: Achieves up to 12.1X higher accuracy compared to existing token pruning methods, effectively addressing the trade-off between computational efficiency and model precision.

Conclusion: CATP provides an effective precision-focused token pruning approach for multimodal models that significantly outperforms existing methods in maintaining accuracy while improving computational efficiency.

Abstract: In response to the rising interest in large multimodal models, we introduce Cross-Attention Token Pruning (CATP), a precision-focused token pruning method. Our approach leverages cross-attention layers in multimodal models, exemplified by BLIP-2, to extract valuable information for token importance determination. CATP employs a refined voting strategy across model heads and layers. In evaluations, CATP achieves up to 12.1X higher accuracy compared to existing token pruning methods, addressing the trade-off between computational efficiency and model precision.
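
The voting idea — each (layer, head) pair votes for the tokens receiving the most cross-attention mass, and the most-voted tokens survive pruning — can be sketched as below. This is one illustrative reading of "a refined voting strategy across model heads and layers", not BLIP-2-specific code; the paper's actual refinement is not specified here.

```python
def prune_tokens(attn, keep):
    """attn[layer][head][token] = cross-attention mass received by token.

    Each (layer, head) votes for its top-`keep` tokens; the tokens with
    the most votes across all heads and layers are retained."""
    n_tokens = len(attn[0][0])
    votes = [0] * n_tokens
    for layer in attn:
        for head in layer:
            ranked = sorted(range(n_tokens), key=lambda t: head[t], reverse=True)
            for t in ranked[:keep]:
                votes[t] += 1
    ranked = sorted(range(n_tokens), key=lambda t: votes[t], reverse=True)
    return sorted(ranked[:keep])
```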

[40] Foundations and Evaluations in NLP

Jungyeul Park

Main category: cs.CL

TL;DR: This memoir presents a morpheme-based annotation scheme for Korean NLP achieving SOTA results, and introduces the jp-algorithm for robust evaluation of preprocessing tasks like tokenization and SBD.

Motivation: To address two fundamental NLP challenges: 1) developing linguistic resources for morphologically rich languages like Korean, and 2) creating robust evaluation frameworks for preprocessing tasks where traditional methods fail due to tokenization mismatches.

Method: 1) Developed a morpheme-based annotation scheme capturing linguistic properties from morphology to semantics for Korean. 2) Proposed the jp-algorithm, an alignment-based evaluation method that handles tokenization and sentence length mismatches between gold standards and system outputs.

Result: The morpheme-based approach achieved state-of-the-art results in Korean NLP tasks (POS tagging, dependency parsing, NER). The jp-algorithm enables robust end-to-end evaluations across diverse NLP tasks while preserving traditional metric complexity.

Conclusion: Provides key insights for processing morphologically rich languages and offers a generalizable evaluation framework for end-to-end NLP systems, with broader implications for multilingual resource development and system evaluation.

Abstract: This memoir explores two fundamental aspects of Natural Language Processing (NLP): the creation of linguistic resources and the evaluation of NLP system performance. Over the past decade, my work has focused on developing a morpheme-based annotation scheme for the Korean language that captures linguistic properties from morphology to semantics. This approach has achieved state-of-the-art results in various NLP tasks, including part-of-speech tagging, dependency parsing, and named entity recognition. Additionally, this work provides a comprehensive analysis of segmentation granularity and its critical impact on NLP system performance. In parallel with linguistic resource development, I have proposed a novel evaluation framework, the jp-algorithm, which introduces an alignment-based method to address challenges in preprocessing tasks like tokenization and sentence boundary detection (SBD). Traditional evaluation methods assume identical tokenization and sentence lengths between gold standards and system outputs, limiting their applicability to real-world data. The jp-algorithm overcomes these limitations, enabling robust end-to-end evaluations across a variety of NLP tasks. It enhances accuracy and flexibility by incorporating linear-time alignment while preserving the complexity of traditional evaluation metrics. This memoir provides key insights into the processing of morphologically rich languages, such as Korean, while offering a generalizable framework for evaluating diverse end-to-end NLP systems. My contributions lay the foundation for future developments, with broader implications for multilingual resource development and system evaluation.
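
The alignment idea behind such evaluation — making gold and system tokenizations comparable through character offsets rather than assuming identical tokens — can be illustrated with a toy boundary F1. This is only the general principle, not the jp-algorithm itself.

```python
def char_spans(tokens):
    """Map each token to its (start, end) span in the concatenated,
    whitespace-free text, so differing tokenizations become comparable."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans

def aligned_f1(gold_tokens, sys_tokens):
    """Token F1 computed over character-aligned spans, so gold and
    system outputs need not have identical tokenization or length."""
    gold, sys = set(char_spans(gold_tokens)), set(char_spans(sys_tokens))
    tp = len(gold & sys)
    p, r = tp / len(sys), tp / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0
```

A naive token-by-token comparison would crash or miscount when the system splits a gold token in two; the character alignment simply scores the shared boundaries.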

[41] RAISE: Reinforced Adaptive Instruction Selection For Large Language Models

Qingsong Lv, Yangning Li, Zihua Lan, Zishan Xu, Jiwei Tang, Tingwei Lu, Yinghui Li, Wenhao Jiang, Hong-Gee Kim, Hai-Tao Zheng, Philip S. Yu

Main category: cs.CL

TL;DR: RAISE is a reinforcement learning-based framework for dynamic instruction selection during LLM fine-tuning that adaptively chooses instructions based on expected performance impact, achieving strong results with only 1% of training steps.

Motivation: Current instruction selection methods use fixed heuristic metrics and only consider data selection before training, leading to insufficient optimization and difficulty adapting to specific tasks. There's a need for dynamic, task-objective-driven instruction selection that optimizes throughout the fine-tuning process.

Method: RAISE models dynamic instruction selection as a sequential decision-making process using reinforcement learning. At each training step, it selects instructions based on their expected impact on model performance improvement, incorporating the entire instruction fine-tuning process into optimization.

Result: Extensive experiments show RAISE outperforms other instruction selection methods. Notably, it achieves superior performance by updating only 1% of training steps compared to full-data training, demonstrating high efficiency and effectiveness.

Conclusion: RAISE provides a well-interpretable, task-specific optimization framework for instruction selection that dynamically adapts throughout training, offering significant efficiency gains while maintaining or improving performance compared to traditional methods.

Abstract: In the instruction fine-tuning of large language models (LLMs), it is widely recognized that a few high-quality instructions are superior to a large number of low-quality instructions. At present, many instruction selection methods have been proposed, but most of these methods select instructions based on heuristic quality metrics, and only consider data selection before training. These designs lead to insufficient optimization of instruction fine-tuning, and fixed heuristic indicators are often difficult to optimize for specific tasks. Therefore, we design a dynamic, task-objective-driven instruction selection framework, RAISE (Reinforced Adaptive Instruction SElection), which incorporates the entire instruction fine-tuning process into optimization, selecting instructions at each step based on the expected impact of each instruction on model performance improvement. Our approach is well interpretable and has strong task-specific optimization capabilities. By modeling dynamic instruction selection as a sequential decision-making process, we use RL to train our selection strategy. Extensive experiments and result analysis prove the superiority of our method compared with other instruction selection methods. Notably, RAISE achieves superior performance by updating only 1% of the training steps compared to full-data training, demonstrating its efficiency and effectiveness.
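
The selection loop can be caricatured as a value-based policy that ranks instructions by estimated performance impact and updates those estimates from observed rewards. RAISE's actual RL-trained policy is far richer; every name below is invented for the sketch.

```python
class AdaptiveSelector:
    """Toy stand-in for a dynamic instruction-selection policy: keep a
    running value estimate per instruction and pick the top-k each step."""

    def __init__(self, instructions, lr=0.5):
        self.values = {i: 0.0 for i in instructions}
        self.lr = lr

    def select(self, k):
        """Choose the k instructions with the highest estimated impact."""
        return sorted(self.values, key=self.values.get, reverse=True)[:k]

    def update(self, instruction, reward):
        """Move the value estimate toward the observed performance gain."""
        v = self.values[instruction]
        self.values[instruction] = v + self.lr * (reward - v)
```

The key contrast with heuristic filtering is that selection happens at every training step and adapts to the task objective, rather than being fixed before training begins.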

[42] PReSS: A Black-Box Framework for Evaluating Political Stance Stability in LLMs via Argumentative Pressure

Shariar Kabir, Kevin Esterling, Yue Dong

Main category: cs.CL

TL;DR: PReSS framework evaluates political bias in LLMs by analyzing stance stability across topics, revealing models can have varying ideological consistency rather than uniform left/right leaning.

Motivation: Existing political bias evaluations in LLMs oversimplify by classifying outputs as simply left- or right-leaning, ignoring how ideological tendencies vary across topics and how consistently models maintain their positions.

Method: Proposes PReSS (Political Response Stability under Stress), a black-box framework that evaluates LLMs by jointly considering model and topic context, categorizing responses into four stance types: stable-left, unstable-left, stable-right, and unstable-right.

Result: Applied to 12 widely used LLMs across 19 political topics, revealing substantial variation in stance stability - models that are left-leaning overall can exhibit stable-right behavior on certain topics. Unstable topic stances are more likely to change during interventions like debiasing or ideology reversal.

Conclusion: Topic-aware, fine-grained evaluation of political ideologies is crucial. Stability should be treated as a moderating factor for understanding, evaluating, and guiding interventions in politically sensitive model behavior, especially for controlled generation and model alignment.

Abstract: Existing evaluations of political bias in large language models (LLMs) typically classify outputs as left- or right-leaning. We extend this perspective by examining how ideological tendencies vary across topics and how consistently models maintain their positions, a property we refer to as stability. To capture this dimension, we propose PReSS (Political Response Stability under Stress), a black-box framework that evaluates LLMs by jointly considering model and topic context, categorizing responses into four stance types: stable-left, unstable-left, stable-right, and unstable-right. Applying PReSS to 12 widely used LLMs across 19 political topics reveals substantial variation in stance stability; for instance, a model that is left-leaning overall can exhibit stable-right behavior on certain topics. This highlights the importance of topic-aware and fine-grained evaluation of political ideologies of LLMs. Moreover, stability has practical implications for controlled generation and model alignment: interventions such as debiasing or ideology reversal should explicitly account for stance stability. Our empirical analyses reveal that when models are prompted or fine-tuned to adopt the opposite ideology, unstable topic stances are more likely to change, whereas stable ones resist modification. Thus, treating stability as a moderating factor provides a principled foundation for understanding, evaluating, and guiding interventions in politically sensitive model behavior.
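
The four stance types follow mechanically from combining direction with stability under argumentative pressure; a minimal sketch (the elicitation itself, of course, is the hard part):

```python
def classify_stance(responses):
    """responses: list of 'left'/'right' stances elicited for one topic —
    the first without pressure, the rest under argumentative pressure.
    Returns one of the four PReSS stance types."""
    initial = responses[0]
    stable = all(r == initial for r in responses)
    return f"{'stable' if stable else 'unstable'}-{initial}"
```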

[43] Embodied Agents Meet Personalization: Investigating Challenges and Solutions Through the Lens of Memory Utilization

Taeyoon Kwon, Dongwook Choi, Hyojun Kim, Sunghwan Kim, Seungjun Moon, Beong-woo Kwak, Kuan-Hao Huang, Jinyoung Yeo

Main category: cs.CL

TL;DR: MEMENTO is a two-stage evaluation framework for assessing LLM-powered embodied agents’ ability to utilize personalized memory for object semantics and user patterns in assistance tasks, revealing current limitations and proposing a hierarchical knowledge graph-based memory architecture.

Motivation: Current LLM-powered embodied agents succeed at conventional object rearrangement but lack personalized assistance using user-specific knowledge from past interactions, particularly in memory utilization for object semantics and user patterns.

Method: Constructed MEMENTO, an end-to-end two-stage evaluation framework with single-memory and joint-memory tasks. Analyzed current agents’ limitations, then designed a hierarchical knowledge graph-based user-profile memory module that separately manages personalized knowledge.

Result: Current agents can recall simple object semantics but struggle with applying sequential user patterns to planning. Identified bottlenecks: information overload and coordination failures with multiple memories. The proposed memory architecture achieved substantial improvements on both single and joint-memory tasks.

Conclusion: Personalized assistance requires better memory utilization for object semantics and user patterns. Episodic memory provides personalized knowledge and in-context learning benefits, and hierarchical knowledge graph-based architectures can effectively address current limitations.

Abstract: LLM-powered embodied agents have shown success on conventional object-rearrangement tasks, but providing personalized assistance that leverages user-specific knowledge from past interactions presents new challenges. We investigate these challenges through the lens of agents’ memory utilization along two critical dimensions: object semantics (identifying objects based on personal meaning) and user patterns (recalling sequences from behavioral routines). To assess these capabilities, we construct MEMENTO, an end-to-end two-stage evaluation framework comprising single-memory and joint-memory tasks. Our experiments reveal that current agents can recall simple object semantics but struggle to apply sequential user patterns to planning. Through in-depth analysis, we identify two critical bottlenecks: information overload and coordination failures when handling multiple memories. Based on these findings, we explore memory architectural approaches to address these challenges. Given our observation that episodic memory provides both personalized knowledge and in-context learning benefits, we design a hierarchical knowledge graph-based user-profile memory module that separately manages personalized knowledge, achieving substantial improvements on both single and joint-memory tasks. Project website: https://connoriginal.github.io/MEMENTO
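
The separation MEMENTO evaluates — object semantics vs. user patterns — suggests a profile memory with two distinct stores, which a planner queries jointly. The sketch below is a deliberately flat stand-in, not the paper's hierarchical knowledge-graph module:

```python
class UserProfileMemory:
    """Minimal user-profile memory separating the two knowledge types."""

    def __init__(self):
        self.object_semantics = {}  # personal alias -> actual object
        self.user_patterns = {}     # routine name -> action sequence

    def remember_object(self, alias, obj):
        self.object_semantics[alias] = obj

    def remember_pattern(self, routine, steps):
        self.user_patterns[routine] = list(steps)

    def plan(self, routine):
        """Expand a routine into concrete objects for the planner —
        the joint-memory case: a pattern whose steps use object aliases."""
        return [self.object_semantics.get(step, step)
                for step in self.user_patterns.get(routine, [])]
```

Keeping the two stores separate is one way to mitigate the information-overload and coordination failures the paper identifies: each query touches only the knowledge type it needs.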

[44] Highlight & Summarize: RAG without the jailbreaks

Giovanni Cherubin, Andrew Paverd

Main category: cs.CL

TL;DR: H&S is a RAG design pattern that prevents jailbreaking by splitting question-answering into highlighting relevant passages and summarizing them, never exposing the user’s question to the generative LLM.

Motivation: Preventing jailbreaking and model hijacking of LLMs is challenging because existing probabilistic approaches (hardening system prompts, classifiers) are easy to bypass due to the large input/output space.

Method: Split RAG pipeline into two components: 1) Highlighter extracts relevant passages from retrieved documents using the user’s question, 2) Summarizer takes highlighted passages and generates cohesive answers without ever seeing the original question.

Result: For certain QA tasks, H&S responses are judged to be as good as, or better than, those of a standard RAG pipeline, while providing security against jailbreaking attacks by design.

Conclusion: H&S offers a secure-by-design alternative to probabilistic defenses against LLM jailbreaking, particularly suitable for RAG-based question-answering systems.

Abstract: Preventing jailbreaking and model hijacking of Large Language Models (LLMs) is an important yet challenging task. When interacting with a chatbot, malicious users can input specially crafted prompts that cause the LLM to generate undesirable content or perform a different task from its intended purpose. Existing systems attempt to mitigate this by hardening the LLM’s system prompt or using additional classifiers to detect undesirable content or off-topic conversations. However, these probabilistic approaches are relatively easy to bypass due to the very large space of possible inputs and undesirable outputs. We present and evaluate Highlight & Summarize (H&S), a new design pattern for retrieval-augmented generation (RAG) systems that prevents these attacks by design. The core idea is to perform the same task as a standard RAG pipeline (i.e., to provide natural language answers to questions, based on relevant sources) without ever revealing the user’s question to the generative LLM. This is achieved by splitting the pipeline into two components: a highlighter, which takes the user’s question and extracts (“highlights”) relevant passages from the retrieved documents, and a summarizer, which takes the highlighted passages and summarizes them into a cohesive answer. We describe and implement several possible instantiations of H&S and evaluate their responses in terms of correctness, relevance, and quality. For certain question-answering (QA) tasks, the responses produced by H&S are judged to be as good, if not better, than those of a standard RAG pipeline.
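
The design reduces to a two-function pipeline in which only the highlighter ever sees the question. A toy extractive instantiation — keyword overlap standing in for an LLM highlighter, concatenation for an LLM summarizer — shows the information flow:

```python
def highlighter(question, documents):
    """Toy extractive highlighter: keep sentences sharing a word with the
    question. In H&S this is the only component that sees the question."""
    q_words = set(question.lower().split())
    passages = []
    for doc in documents:
        for sent in doc.split(". "):
            if q_words & set(sent.lower().split()):
                passages.append(sent)
    return passages

def summarizer(passages):
    """Toy summarizer over highlights only. Because it never receives the
    user's question, a prompt injected into the question cannot steer the
    generative step — the security property holds by construction."""
    return " ".join(passages) if passages else "No relevant information found."

def answer(question, documents):
    return summarizer(highlighter(question, documents))
```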

[45] Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge

Yunna Cai, Fan Wang, Haowei Wang, Kun Wang, Kailai Yang, Sophia Ananiadou, Moyan Li, Mingming Fan

Main category: cs.CL

TL;DR: PsyCrisis-Bench: A reference-free evaluation benchmark for assessing LLM safety alignment in Chinese mental health dialogues using expert-defined safety principles and LLM-as-Judge approach.

DetailsMotivation: Evaluating LLM safety in high-risk mental health dialogues is challenging due to missing gold-standard answers and ethical sensitivity. Existing methods lack appropriate benchmarks for Chinese mental health contexts.

Method: Proposes PsyCrisis-Bench with: 1) Reference-free evaluation benchmark based on real-world Chinese mental health dialogues, 2) Prompt-based LLM-as-Judge approach using expert-defined reasoning chains, 3) Binary point-wise scoring across multiple safety dimensions for explainability, 4) Manually curated Chinese dataset covering self-harm, suicidal ideation, and existential distress.

Result: Experiments on 3600 judgments show the method achieves highest agreement with expert assessments and produces more interpretable evaluation rationales compared to existing approaches.

Conclusion: PsyCrisis-Bench provides an effective framework for evaluating LLM safety alignment in sensitive mental health contexts, with publicly available dataset and evaluation tools to facilitate research.

Abstract: Evaluating the safety alignment of LLM responses in high-risk mental health dialogues is particularly difficult due to missing gold-standard answers and the ethically sensitive nature of these interactions. To address this challenge, we propose PsyCrisis-Bench, a reference-free evaluation benchmark based on real-world Chinese mental health dialogues. It evaluates whether the model responses align with the safety principles defined by experts. Specifically designed for settings without standard references, our method adopts a prompt-based LLM-as-Judge approach that conducts in-context evaluation using expert-defined reasoning chains grounded in psychological intervention principles. We employ binary point-wise scoring across multiple safety dimensions to enhance the explainability and traceability of the evaluation. Additionally, we present a manually curated, high-quality Chinese-language dataset covering self-harm, suicidal ideation, and existential distress, derived from real-world online discourse. Experiments on 3600 judgments show that our method achieves the highest agreement with expert assessments and produces more interpretable evaluation rationales compared to existing approaches. Our dataset and evaluation tool are publicly available to facilitate further research.
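The binary point-wise scoring described above can be sketched as a mean over per-dimension 0/1 judgments; the dimension names below are illustrative assumptions, not the benchmark's own:

```python
def safety_score(judgments: dict) -> float:
    """Aggregate binary point-wise judgments across safety dimensions.
    Each expert-defined dimension receives a 0/1 verdict from the LLM judge;
    the overall score is their mean, keeping every verdict traceable."""
    if not judgments:
        return 0.0
    return sum(judgments.values()) / len(judgments)

# Hypothetical dimensions for a crisis-response reply:
example = {
    "acknowledges_risk": True,
    "avoids_harmful_detail": True,
    "refers_to_professional_help": False,
    "empathetic_tone": True,
}
```

Point-wise binary verdicts (rather than a single holistic rating) are what make the evaluation explainable: a low score can be traced to the specific dimension that failed.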

[46] MLLM-CTBench: A Benchmark for Continual Instruction Tuning with Reasoning Process Diagnosis

Haiyun Guo, Zhiyan Hou, Yandu Sun, Jinghan He, Yu Chen, Yuzhe Zhou, Yuheng Jia, Jinqiao Wang, Tat-Seng Chua

Main category: cs.CL

TL;DR: MLLM-CTBench: A comprehensive benchmark for continual instruction tuning of multimodal large language models with multidimensional evaluation framework covering 7 tasks across 6 domains.

DetailsMotivation: Address the lack of rigorous benchmarks for continual instruction tuning (CIT) of multimodal large language models (MLLMs) to adapt to evolving real-world demands, and provide systematic evaluation protocols.

Method: Introduces MLLM-CTBench with three key contributions: 1) Multidimensional evaluation framework assessing final-answer accuracy and process-level reasoning quality using Chain-of-Thought traces, 2) Large-scale evaluation of 8 continual learning algorithms from 4 families under unified protocols, 3) Extension from SFT to Reinforcement Fine-Tuning (RFT) using GRPO algorithm with explicit KL-divergence control.

Result: Key findings: 1) Process-level reasoning quality more resilient to catastrophic forgetting than final-answer accuracy, 2) Model capability critical for continual learning outcomes, 3) On-policy RFT (GRPO) with KL control achieves more stable cross-task retention than SFT.

Conclusion: MLLM-CTBench provides comprehensive evaluation framework for continual instruction tuning of MLLMs, revealing important insights about catastrophic forgetting mechanisms and effective training strategies for maintaining cross-task knowledge.

Abstract: Continual instruction tuning (CIT) during the post-training phase is crucial for adapting multimodal large language models (MLLMs) to evolving real-world demands. However, progress is hampered by the lack of benchmarks with rigorous, protocol-consistent evaluation. To bridge this gap, we introduce MLLM-CTBench, a comprehensive benchmark for CIT of MLLMs, covering seven challenging tasks across six diverse domains. MLLM-CTBench makes three key contributions. First, we establish a multidimensional evaluation framework that jointly assesses final-answer accuracy and process-level reasoning quality, where Chain-of-Thought (CoT) traces serve as an observable signal to diagnose catastrophic forgetting beyond answer-only evaluation. Second, we conduct a large-scale evaluation of continual learning methods by systematically assessing eight representative algorithms from four major families under a unified protocol across task orders, providing actionable insights for algorithm design. Third, we expand the scope from Supervised Fine-Tuning (SFT) to Reinforcement Fine-Tuning (RFT) in CIT. By investigating GRPO, an on-policy RL algorithm that stabilizes updates through explicit KL-divergence control to a prior policy, we aim to analyze how this mechanism affects cross-task knowledge retention. Our experiments yield several findings: (1) Process-level reasoning quality is often more resilient to catastrophic forgetting than final-answer accuracy, and forgetting is primarily driven by degradation in domain knowledge. (2) Model capability is a critical factor influencing continual learning outcomes, with stronger baseline models exhibiting greater resistance to catastrophic forgetting. (3) On-policy RFT (GRPO), with its inherent KL control, achieves more stable cross-task retention than SFT, while removing KL control can amplify forgetting despite potential gains on new tasks.
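The KL control credited here with stabilizing retention is explicit in the GRPO objective. In its standard form from the GRPO literature (a sketch; the paper's exact variant, clipping, and $\beta$ schedule may differ), the advantage is group-normalized over $G$ sampled responses and the update is penalized toward the reference policy:

```latex
% Group-normalized advantage over G sampled responses with rewards r_1, ..., r_G:
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}
% Objective with explicit KL-divergence control toward the prior (reference) policy:
J(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} A_i \log \pi_\theta(o_i \mid q)\right]
            - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
```

Removing KL control corresponds to $\beta \to 0$, which is exactly the ablation the abstract reports as amplifying forgetting.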

[47] ToolACE-MT: Non-Autoregressive Generation for Agentic Multi-Turn Interaction

Xingshan Zeng, Weiwen Liu, Lingzhi Wang, Liangyou Li, Fei Mi, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu

Main category: cs.CL

TL;DR: ToolACE-MT is a non-autoregressive framework for generating multi-turn agentic dialogues efficiently, using three stages: initialization, iterative refinement, and offline verification.

DetailsMotivation: Existing methods for generating agentic dialogue data rely on costly autoregressive interactions between multiple LLM agents, compromising practical efficiency. There's a need for more efficient approaches to construct high-quality multi-turn agentic dialogues.

Method: Three-stage framework: 1) Coarse-grained initialization builds structurally complete dialogue skeletons, 2) Iterative refinement introduces realistic complexities via mask-and-fill operations, 3) Offline verification ensures correctness through rule- and model-based checks.

Result: Experiments show ToolACE-MT enables efficient, effective, and generalizable agentic data generation, offering a new paradigm for high-quality data construction in tool-augmented LLM scenarios.

Conclusion: ToolACE-MT provides a novel non-autoregressive approach for generating multi-turn agentic dialogues that is more efficient than existing autoregressive methods while maintaining quality.

Abstract: Agentic task-solving with Large Language Models (LLMs) requires multi-turn, multi-step interactions, often involving complex function calls and dynamic user-agent exchanges. Existing simulation-based data generation methods for such scenarios rely heavily on costly autoregressive interactions between multiple LLM agents, thereby compromising the practical efficiency of agentic data generation. In this paper, we propose ToolACE-MT, a novel Non-Autoregressive Iterative Generation framework for constructing high-quality multi-turn agentic dialogues. ToolACE-MT generates full conversational trajectories through three stages: coarse-grained initialization, iterative refinement, and offline verification. The initialization phase builds a structurally complete yet semantically coarse dialogue skeleton; the iterative refinement phase introduces realistic complexities and continued refinement via mask-and-fill operations; and the offline verification phase ensures correctness and coherence via rule- and model-based checks. Experiments demonstrate that ToolACE-MT enables efficient, effective and generalizable agentic data generation, offering a new paradigm for high-quality data construction in tool-augmented LLM scenarios.

[48] The Mediomatix Corpus: Parallel Data for Romansh Language Varieties via Comparable Schoolbooks

Zachary Hopton, Jannis Vamvas, Andrin Büchler, Anna Rutkiewicz, Rico Cathomas, Rico Sennrich

Main category: cs.CL

TL;DR: First parallel corpus of Romansh language idioms extracted from 291 schoolbook volumes using automatic alignment, containing 207k multi-parallel segments with over 2M tokens for NLP applications like machine translation.

DetailsMotivation: The Romansh language has five standardized idioms taught in Swiss schools, but lacks parallel corpora for NLP applications. The paper aims to create the first parallel corpus to enable machine translation and other NLP tasks between Romansh idioms.

Method: Used 291 comparable schoolbook volumes across five Romansh idioms, applied automatic alignment methods to extract parallel segments, conducted human evaluation to verify parallelism, and released dataset under CC-BY-NC-SA license.

Result: Created corpus with 207k multi-parallel segments and over 2M tokens. Human evaluation confirmed high parallelism. Demonstrated utility by training/evaluating LLM and supervised multilingual MT models on the dataset.

Conclusion: Successfully created first parallel Romansh idiom corpus suitable for NLP applications like machine translation, providing valuable resource for low-resource language processing and cross-dialect translation.

Abstract: The five idioms (i.e., varieties) of the Romansh language are largely standardized and are taught in the schools of the respective communities in Switzerland. In this paper, we present the first parallel corpus of Romansh idioms. The corpus is based on 291 schoolbook volumes, which are comparable in content for the five idioms. We use automatic alignment methods to extract 207k multi-parallel segments from the books, with more than 2M tokens in total. A small-scale human evaluation confirms that the segments are highly parallel, making the dataset suitable for NLP applications such as machine translation between Romansh idioms. We release the parallel and unaligned versions of the dataset under a CC-BY-NC-SA license and demonstrate its utility for machine translation by training and evaluating an LLM and a supervised multilingual MT model on the dataset.

[49] TASO: Task-Aligned Sparse Optimization for Parameter-Efficient Model Adaptation

Daiye Miao, Yufang Liu, Jie Wang, Changzhi Sun, Yunke Zhang, Demei Yan, Shaokang Dong, Qi Zhang, Yuanbin Wu

Main category: cs.CL

TL;DR: TASO reduces LoRA redundancy by using pretrained model weight importance to identify task-specific core regions and create sparse LoRA structures before fine-tuning.

DetailsMotivation: LoRA introduces substantial parameter redundancy that increases trainable parameters and hinders fine-tuning effectiveness, but identifying and eliminating redundant parameters is challenging.

Method: Estimates parameter importance on downstream tasks, identifies task-specific core regions based on importance score distribution, and uses this location information to determine sparse structure of LoRA modules before fine-tuning.

Result: With parameter budget comparable to LoRA rank r=1, TASO consistently outperforms standard LoRA across multiple tasks while effectively eliminating redundant parameters.

Conclusion: TASO provides an effective approach to reduce LoRA redundancy using pretrained model weight importance, achieving strong fine-tuning performance with fewer parameters.

Abstract: LoRA has become one of the most widely used parameter-efficient fine-tuning methods due to its simplicity and effectiveness. However, numerous studies have shown that LoRA often introduces substantial parameter redundancy, which not only increases the number of trainable parameters but also hinders the effectiveness of fine-tuning. Since identifying redundant parameters in LoRA is inherently difficult, how to eliminate them efficiently and accurately remains a challenging problem. In this paper, we propose TASO, a redundancy reduction method that leverages importance information from the pretrained model’s weights to mitigate LoRA redundancy. Specifically, we estimate parameter importance on downstream tasks and identify task-specific core regions based on the distribution of importance scores. The location information of these core regions is then used to determine the sparse structure of LoRA modules, enabling redundancy removal before fine-tuning. Our approach significantly reduces the number of trainable parameters required for task adaptation, while providing a novel task-aligned perspective for LoRA redundancy reduction. Experimental results demonstrate that, with a parameter budget comparable to LoRA with rank $r = 1$, TASO consistently outperforms standard LoRA across multiple tasks, achieving strong fine-tuning performance while effectively eliminating redundant parameters.
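A sketch of the pipeline above: score parameters, keep a top fraction as the task-specific core region, and constrain the LoRA update to it before fine-tuning. The `|w * grad|` importance proxy and top-k rule are common first-order choices, not necessarily the paper's exact estimator:

```python
import numpy as np

def core_region_mask(weight, grad, keep_frac=0.05):
    """Keep the top `keep_frac` of parameters by importance as the core region.
    |w * grad| is one common first-order importance proxy (an assumption here)."""
    importance = np.abs(weight * grad)
    k = max(1, int(keep_frac * importance.size))
    threshold = np.partition(importance.ravel(), -k)[-k]
    return importance >= threshold

def sparse_lora_update(weight, A, B, mask):
    """Apply the rank-r update B @ A only inside the core region; the mask is
    fixed before fine-tuning, so redundant coordinates are never trained."""
    return weight + mask * (B @ A)
```

Fixing the sparse structure up front is what distinguishes this from post-hoc pruning: redundant LoRA parameters are eliminated before any gradient step.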

[50] HEART: Emotionally-Driven Test-Time Scaling of Language Models

Gabriela Pinto, Palash Goyal, Mihir Parmar, Yiwen Song, Souradip Chakraborty, Zifeng Wang, Jinsung Yoon, Hamid Palangi, Tomas Pfister

Main category: cs.CL

TL;DR: HEART is a test-time scaling framework that uses emotional cues (alternating critical and encouraging tones) to guide AI models’ reasoning, helping them break out of repetitive incorrect patterns and improve problem-solving accuracy.

DetailsMotivation: Current test-time scaling methods often get stuck in repetitive, incorrect patterns of thought, limiting their problem-solving effectiveness. The authors propose that emotional cues, similar to how feelings contribute to human decision-making, could help models break out of dead-end reasoning.

Method: HEART framework uses emotional cues to guide model focus, alternating between critical tones to sharpen error detection and encouraging tones to spark new ideas. This emotional regulation helps models break out of repetitive reasoning patterns during test-time scaling.

Result: HEART was evaluated across seven high-difficulty benchmarks including Humanity’s Last Exam, GPQA Diamond, and LiveCodeBench, showing robustness across diverse models. Results demonstrate consistent accuracy gains over affect-sterile baselines, with emotion facilitating deeper reasoning.

Conclusion: The strategic integration of affective regulation can guide logical synthesis in machine reasoning, suggesting that emotional cues represent the next frontier in improving AI problem-solving capabilities during test-time scaling.

Abstract: Test-time scaling has significantly improved how AI models solve problems, yet current methods often get stuck in repetitive, incorrect patterns of thought. We introduce HEART, a framework that uses emotional cues to guide the model’s focus, much like how feelings contribute to human decision-making. By alternating between critical tones to sharpen error detection and encouraging tones to spark new ideas, HEART helps the model break out of dead-end reasoning and find the right solution. We evaluate HEART across seven high-difficulty benchmarks, including Humanity’s Last Exam, GPQA Diamond, and LiveCodeBench, demonstrating robustness across diverse models. Results show that emotion facilitates deeper reasoning, yielding consistent accuracy gains over affect-sterile baselines. These findings suggest that the next frontier in machine reasoning lies in the strategic integration of affective regulation to guide logical synthesis.
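The alternating-tone loop can be sketched as follows; the prompt wordings and the `llm` and `verify` callables are illustrative assumptions, since the abstract does not give the paper's actual prompts or stopping rule:

```python
CRITICAL = "Your last attempt contains an error. Scrutinize every step and fix it."
ENCOURAGING = "You are close. Take a fresh, creative look and try a new approach."

def heart_loop(llm, problem, verify, max_rounds=6):
    """Test-time scaling with alternating affective framing: critical tones
    on even rounds (sharpen error detection), encouraging tones on odd
    rounds (spark new ideas), until `verify` accepts an attempt."""
    attempt = llm(problem)
    for round_idx in range(max_rounds):
        if verify(attempt):
            return attempt
        tone = CRITICAL if round_idx % 2 == 0 else ENCOURAGING
        attempt = llm(f"{tone}\nProblem: {problem}\nPrevious attempt: {attempt}")
    return attempt
```

The alternation is the point: a fixed critical tone tends to entrench the same line of attack, while switching registers gives the model a reason to abandon a dead-end trajectory.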

[51] Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study

Eeham Khan, Firas Saidani, Owen Van Esbroeck, Richard Khoury, Leila Kosseim

Main category: cs.CL

TL;DR: Continual pre-training with LoRA enables efficient adaptation of LLMs to low-resource dialects like Québec French using minimal data and compute, improving dialect performance with minimal regression on standard language benchmarks.

DetailsMotivation: LLMs are predominantly trained on high-resource languages, leaving minority dialects underserved. The paper aims to address this gap by developing cost-effective methods to adapt LLMs to low-resource regional dialects like Québec French.

Method: Uses continual pre-training (CPT) with low-rank adaptation (LoRA) and compute-efficient techniques to adapt three LLMs to Québec French dialect. Updates under 1% of model parameters using a very small dataset, evaluated on COLE benchmark suite.

Result: Shows improvement on minority dialect benchmarks with minimal regression on prestige language benchmarks. Gains are contingent on corpus composition. First Québec French LLMs released on HuggingFace.

Conclusion: CPT with parameter-efficient fine-tuning can narrow the dialect gap by providing cost-effective language resource creation, expanding LLM access to minority linguistic communities.

Abstract: Despite the widespread adoption of large language models (LLMs), their strongest capabilities remain largely confined to a small number of high-resource languages for which there is abundant training data. Recently, continual pre-training (CPT) has emerged as a means to fine-tune these models to low-resource regional dialects. In this paper, we study the use of CPT for dialect learning under tight data and compute budgets. Using low-rank adaptation (LoRA) and compute-efficient continual pre-training, we adapt three LLMs to the Québec French dialect using a very small dataset and benchmark them on the COLE suite. Our experiments demonstrate an improvement on the minority dialect benchmarks with minimal regression on the prestige language benchmarks, while updating under 1% of model parameters. Analysis of the results demonstrates that gains are highly contingent on corpus composition. These findings indicate that CPT with parameter-efficient fine-tuning (PEFT) can narrow the dialect gap by providing cost-effective and sustainable language resource creation, expanding high-quality LLM access to minority linguistic communities. We release the first Québec French LLMs on HuggingFace.
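The "under 1% of model parameters" figure follows directly from LoRA's parameter count: a rank-r adapter on a d_out × d_in weight matrix trains r(d_in + d_out) parameters. A back-of-envelope check with illustrative 7B-scale shapes (assumptions, not the paper's actual configuration):

```python
def lora_fraction(total_params, targeted_shapes, r=8):
    """Fraction of parameters trained by rank-r LoRA adapters on the given
    (d_out, d_in) weight matrices: each matrix adds r * (d_in + d_out)
    trainable parameters (the A and B factors)."""
    trainable = sum(r * (d_in + d_out) for d_out, d_in in targeted_shapes)
    return trainable / total_params

# Hypothetical: 32 layers x 4 attention projections of shape 4096 x 4096
# in a ~7B-parameter model.
shapes = [(4096, 4096)] * 128
frac = lora_fraction(7_000_000_000, shapes, r=8)
```

Even this generous targeting yields roughly a tenth of a percent of the model, comfortably under the abstract's 1% budget.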

[52] Reasoning about Intent for Ambiguous Requests

Irina Saparina, Mirella Lapata

Main category: cs.CL

TL;DR: Proposes generating multiple interpretation-answer pairs in structured responses to handle ambiguous requests in LLMs, using RL training with custom rewards to improve coverage of valid answers.

DetailsMotivation: LLMs often respond to ambiguous requests by implicitly committing to one interpretation, which can frustrate users and create safety risks due to intent misunderstandings.

Method: Train models with reinforcement learning and customized reward functions using multiple valid answers as supervision to generate multiple interpretation-answer pairs in a single structured response.

Result: Achieves higher coverage of valid answers than baseline approaches on conversational question answering and semantic parsing tasks; human evaluation confirms predicted interpretations are highly aligned with answers.

Conclusion: The approach promotes transparency with explicit interpretations, achieves efficiency with one generation step, and supports downstream applications through structured output format.

Abstract: Large language models often respond to ambiguous requests by implicitly committing to one interpretation. Intent misunderstandings can frustrate users and create safety risks. To address this, we propose generating multiple interpretation-answer pairs in a single structured response to ambiguous requests. Our models are trained with reinforcement learning and customized reward functions using multiple valid answers as supervision. Experiments on conversational question answering and semantic parsing demonstrate that our method achieves higher coverage of valid answers than baseline approaches. Human evaluation confirms that predicted interpretations are highly aligned with their answers. Our approach promotes transparency with explicit interpretations, achieves efficiency by requiring only one generation step, and supports downstream applications through its structured output format.
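The abstract says the reward functions use multiple valid answers as supervision; one natural instantiation (an assumption, not confirmed by the paper) is a coverage reward over the gold answer set:

```python
def coverage_reward(predicted_pairs, gold_answers):
    """Fraction of gold answers covered by the model's single structured
    response of (interpretation, answer) pairs. Rewarding coverage pushes
    the model to enumerate distinct readings of an ambiguous request
    instead of committing to one."""
    predicted = {answer for _, answer in predicted_pairs}
    covered = sum(1 for gold in gold_answers if gold in predicted)
    return covered / len(gold_answers) if gold_answers else 0.0
```

A reward of this shape only pays off when the one generated response spans several interpretations, which matches the paper's goal of transparency in a single generation step.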

[53] SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification

Hongbo Wang, MaungMaung AprilPyone, Isao Echizen

Main category: cs.CL

TL;DR: SGM is a white-box neuron-level intervention method that selectively recalibrates toxic expert neurons in multimodal LLMs to mitigate toxicity while preserving model performance.

DetailsMotivation: Multimodal LLMs inherit toxic, biased, and NSFW signals from weakly curated pretraining data, creating safety risks. Existing training-free detoxification methods struggle with adversarial triggers and lack interpretability.

Method: SGM uses expertise-weighted soft suppression to selectively recalibrate a small set of toxic expert neurons, neutralizing harmful cross-modal activations without parameter updates. It establishes MM-TOXIC-QA evaluation framework and can combine with existing methods as SGM*.

Result: SGM reduces harmful rates from 48.2% to 2.5% in open-source MLLMs, mitigating toxicity in standard and adversarial conditions while preserving fluency and multimodal reasoning capabilities.

Conclusion: SGM provides an interpretable, low-cost solution for toxicity-controlled multimodal generation that is extensible and can integrate with existing detoxification methods for stronger safety performance.

Abstract: Disclaimer: Samples in this paper may be harmful and cause discomfort. Multimodal large language models (MLLMs) enable multimodal generation but inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora, causing safety risks, especially under adversarial triggers that late, opaque training-free detoxification methods struggle to handle. We propose SGM, a white-box neuron-level multimodal intervention that acts like safety glasses for toxic neurons: it selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. We establish MM-TOXIC-QA, a multimodal toxicity evaluation framework, and compare SGM with existing detoxification techniques. Experiments on open-source MLLMs show that SGM mitigates toxicity in standard and adversarial conditions, cutting harmful rates from 48.2% to 2.5% while preserving fluency and multimodal reasoning. SGM is extensible, and its combined defenses, denoted as SGM*, integrate with existing detoxification methods for stronger safety performance, providing an interpretable, low-cost solution for toxicity-controlled multimodal generation.
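A toy sketch of expertise-weighted soft suppression: each neuron's activation is attenuated in proportion to its toxic-expertise score, rather than hard-zeroed. The 1/(1 + αs) curve is an illustrative assumption; SGM's exact recalibration may differ:

```python
import numpy as np

def soft_suppress(activations, expertise, alpha=1.0):
    """Scale each neuron's activation by 1 / (1 + alpha * expertise):
    neurons with no toxic expertise (score 0) pass through unchanged,
    while strongly toxic experts are attenuated. No parameters are
    updated; only the forward activations are recalibrated."""
    return activations / (1.0 + alpha * np.asarray(expertise))
```

Soft suppression is what preserves fluency: a neuron that is only partially toxic keeps most of its contribution, unlike hard ablation.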

[54] Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation

Kaustubh Shivshankar Shejole, Sourabh Deoghare, Pushpak Bhattacharyya

Main category: cs.CL

TL;DR: This paper addresses punctuation robustness in Neural Machine Translation (NMT) systems, particularly for English-to-Marathi translation, by introducing a diagnostic benchmark called Viram and evaluating remediation strategies.

DetailsMotivation: NMT systems heavily depend on punctuation cues to resolve semantic ambiguities. User-generated sentences often have missing or incorrect punctuation, leading to fluent but semantically disastrous translations. The paper aims to highlight and address this punctuation robustness problem.

Method: 1) Created Viram, a human-curated diagnostic benchmark of 54 punctuation-ambiguous English-Marathi sentence pairs to stress-test NMT systems. 2) Evaluated two remediation strategies: cascade-based restore-then-translate and direct fine-tuning. 3) Compared these with current Large Language Models (LLMs).

Result: Both remediation strategies yielded substantial NMT performance improvements. However, current LLMs exhibited relatively poorer robustness in translating such sentences compared to these task-specific strategies.

Conclusion: The work highlights the punctuation robustness problem in NMT systems and demonstrates effective remediation strategies. The findings suggest that current LLMs still underperform on this specific task, necessitating further research in this area.

Abstract: Neural Machine Translation (NMT) systems rely heavily on explicit punctuation cues to resolve semantic ambiguities in a source sentence. Inputting user-generated sentences, which are likely to contain missing or incorrect punctuation, results in fluent but semantically disastrous translations. This work attempts to highlight and address the problem of punctuation robustness of NMT systems through an English-to-Marathi translation. First, we introduce \textbf{\textit{Viram}}, a human-curated diagnostic benchmark of 54 punctuation-ambiguous English-Marathi sentence pairs to stress-test existing NMT systems. Second, we evaluate two simple remediation strategies: cascade-based \textit{restore-then-translate} and \textit{direct fine-tuning}. Our experimental results and analysis demonstrate that both strategies yield substantial NMT performance improvements. Furthermore, we find that current Large Language Models (LLMs) exhibit relatively poorer robustness in translating such sentences than these task-specific strategies, thus necessitating further research in this area. The code and dataset are available at https://github.com/KaustubhShejole/Viram_Marathi.

[55] Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

Yuming Yang, Mingyoung Lai, Wanxu Zhao, Xiaoran Fan, Zhiheng Xi, Mingqi Wu, Chiyue Huang, Jun Zhao, Haijun Lv, Jian Tong, Yunhua Zhou, Yicheng Zou, Qipeng Guo, Tao Gui, Qi Zhang, Xuanjing Huang

Main category: cs.CL

TL;DR: RSR (Rank-Surprisal Ratio) is a new metric for selecting effective reasoning trajectories in LLM distillation that balances alignment and informativeness, outperforming existing metrics.

DetailsMotivation: Current methods for selecting reasoning trajectories in LLM distillation focus only on student likelihood, favoring trajectories that align with student behavior but missing more informative ones. Stronger teachers don't always yield better students, highlighting the need for better suitability assessment.

Method: Proposes Rank-Surprisal Ratio (RSR) - a simple metric defined as the ratio of a trajectory’s average token-wise rank to its average negative log-likelihood. RSR captures both alignment (through rank) and informativeness (through surprisal) to assess trajectory suitability.

Result: RSR strongly correlates with post-training reasoning performance (average Spearman 0.86) across 5 student models and trajectories from 11 diverse teachers, consistently outperforming existing metrics. Demonstrates practical utility in trajectory selection and teacher selection.

Conclusion: RSR is an effective, interpretable metric for assessing reasoning trajectory suitability in LLM distillation, balancing alignment and informativeness better than existing approaches.

Abstract: Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that align closely with the student model’s current behavior but overlooking more informative ones. Addressing this, we propose Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically balance learning signal strength and behavioral alignment by combining low absolute probability with relatively high-ranked tokens under the student model. Concretely, RSR is defined as the ratio of a trajectory’s average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training reasoning performance (average Spearman 0.86), consistently outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.
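The metric as defined above is straightforward to compute from per-token statistics under the student model:

```python
def rank_surprisal_ratio(token_ranks, token_logprobs):
    """RSR = (average token-wise rank) / (average negative log-likelihood),
    both under the student model, per the definition in the abstract.

    token_ranks: rank of each trajectory token in the student's predictive
    distribution (1 = most likely); token_logprobs: the student's log
    probability of each token. High RSR = tokens that are individually
    surprising (low probability) yet still relatively high-ranked, i.e.
    informative but aligned with the student's behavior.
    """
    avg_rank = sum(token_ranks) / len(token_ranks)
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return avg_rank / avg_nll
```

Note the balance the abstract describes: the numerator penalizes misalignment (high ranks), while the denominator rewards informativeness (high surprisal shrinks the ratio's denominator only when the trajectory is too easy).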

[56] Layer-wise Swapping for Generalizable Multilingual Safety

Hyunseo Shin, Wonseok Hwang

Main category: cs.CL

TL;DR: A safety-aware layer swapping method transfers safety alignment from English safety experts to low-resource language models without additional training, using adaptive module selection/blending to preserve general language understanding while enhancing multilingual safety.

DetailsMotivation: Existing safety datasets are predominantly English-centric, limiting progress in multilingual safety alignment. Low-resource language models finetuned on their respective instruction datasets tend to exhibit higher unsafety rates compared to high-resource counterparts, creating a critical safety challenge for non-English languages.

Method: Proposes a safety-aware layer swapping method that transfers safety alignment from an English safety expert to low-resource language experts without additional training. The method adaptively selects or blends modules based on their degree of specialization to enhance transfer ability while preserving performance on general language understanding tasks.

Result: The method achieves comparable performance to language experts on general benchmarks (MMMLU, BELEBELE, MGSM) while producing more aligned and less harmful responses on the MultiJail safety benchmark, effectively enhancing safety in target low-resource languages.

Conclusion: The proposed safety-aware layer swapping method successfully addresses multilingual safety alignment challenges by transferring safety knowledge from English experts to low-resource language models, improving safety without compromising general language understanding capabilities.

Abstract: Despite the rapid advancements of Large Language Models (LLMs), safety risks remain a critical challenge for low-resource languages. Existing safety datasets are predominantly English-centric, limiting progress in multilingual safety alignment. As a result, low-resource expert models, finetuned on their respective instruction datasets, tend to exhibit higher unsafety rates compared to their high-resource counterparts. In this work, we propose a safety-aware layer swapping method that transfers safety alignment from an English safety expert to low-resource language experts without additional training. To further enhance transferability, our method adaptively selects or blends modules based on their degree of specialization. Our approach preserves performance on general language understanding tasks while enhancing safety in the target languages. Experimental results show that the proposed method achieves comparable performance to the language expert on general benchmarks such as MMMLU, BELEBELE, and MGSM, while producing more aligned and less harmful responses on the MultiJail safety benchmark.
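A minimal sketch of weight-level swapping and blending between two finetuned experts sharing one architecture. The module-selection set and blend coefficient are fixed here for illustration; the paper chooses them adaptively by each module's degree of specialization:

```python
def swap_or_blend(lang_expert: dict, safety_expert: dict,
                  safety_modules: set, blend: float = 1.0) -> dict:
    """Merge two state dicts without any training: for modules named in
    `safety_modules`, take the safety expert's weights (blend=1.0) or
    linearly interpolate (0 < blend < 1); keep the language expert's
    weights everywhere else. Weights are plain lists of floats here to
    keep the sketch dependency-free."""
    merged = {}
    for name, w_lang in lang_expert.items():
        if name in safety_modules:
            w_safe = safety_expert[name]
            merged[name] = [(1 - blend) * a + blend * b
                            for a, b in zip(w_lang, w_safe)]
        else:
            merged[name] = list(w_lang)
    return merged
```

Because the operation is a pure state-dict merge, safety transfer costs no gradient steps, which is the "without additional training" claim in the abstract.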

[57] Reinforced Attention Learning

Bangzheng Li, Jianmo Ni, Chen Qu, Ian Miao, Liu Yang, Xingyu Fu, Muhao Chen, Derek Zhiyuan Cheng

Main category: cs.CL

TL;DR: RAL optimizes attention distributions in MLLMs via policy gradient RL instead of output tokens, improving multimodal perception and grounding.

DetailsMotivation: Standard RL post-training for LLMs (optimizing output tokens) yields limited gains for MLLMs and can degrade perception performance. There's a need for better multimodal post-training methods.

Method: Reinforced Attention Learning (RAL) - policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. Also introduces On-Policy Attention Distillation for transferring attention behaviors.

Result: Consistent gains over GRPO and other baselines across diverse image and video benchmarks. Attention distillation yields stronger cross-modal alignment than standard knowledge distillation.

Conclusion: Attention policies provide a principled and general alternative for multimodal post-training, shifting optimization from what to generate to where to attend.

Abstract: Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance. We propose Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. By shifting optimization from what to generate to where to attend, RAL promotes effective information allocation and improved grounding in complex multimodal inputs. Experiments across diverse image and video benchmarks show consistent gains over GRPO and other baselines. We further introduce On-Policy Attention Distillation, demonstrating that transferring latent attention behaviors yields stronger cross-modal alignment than standard knowledge distillation. Our results position attention policies as a principled and general alternative for multimodal post-training.
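The core idea of optimizing where to attend with a policy gradient can be illustrated by a toy REINFORCE loop over a vector of attention logits. This is a minimal sketch under strong simplifying assumptions (a single discrete attention "action" and a scalar reward), not RAL's actual objective or architecture.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_attention_step(logits, reward_fn, lr=0.5, rng=random):
    """One policy-gradient step on attention logits.

    The 'action' is which input position to attend to; the gradient of
    log softmax w.r.t. the logits is one_hot(action) - probs.
    """
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs)[0]
    r = reward_fn(action)
    return [w + lr * r * ((1.0 if j == action else 0.0) - p)
            for j, (w, p) in enumerate(zip(logits, probs))]

rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]
reward = lambda pos: 1.0 if pos == 2 else 0.0   # pretend position 2 is grounded
for _ in range(200):
    logits = reinforce_attention_step(logits, reward, rng=rng)
```

After training, the attention distribution concentrates on the rewarded position, i.e. the policy has learned where to attend rather than what to emit.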

[58] SciClaimEval: Cross-modal Claim Verification in Scientific Papers

Xanh Ho, Yun-Ang Wu, Sunisth Kumar, Tian Cheng Xia, Florian Boudin, Andre Greiner-Petter, Akiko Aizawa

Main category: cs.CL

TL;DR: SciClaimEval is a scientific claim verification dataset featuring authentic claims from published papers, including refuted claims created by modifying supporting evidence (figures/tables) rather than altering claims or using LLMs. It provides multimodal evidence in various formats and benchmarks 11 foundation models.

DetailsMotivation: Existing claim verification datasets lack authentic scientific claims, particularly refuted ones, and often rely on synthetic data generation. There's a need for a dataset with real scientific claims and multimodal evidence to better evaluate models' ability to verify claims using scientific evidence.

Method: Dataset creation involves extracting authentic claims from published papers across ML, NLP, and medicine domains. Refuted claims are generated by modifying supporting evidence (figures and tables) rather than altering claims. The dataset includes 1,664 annotated samples with cross-modal evidence in multiple formats (images, LaTeX, HTML, JSON).

Result: The dataset contains 1,664 annotated samples from 180 papers across three domains. Benchmarking 11 multimodal foundation models shows that figure-based verification remains particularly challenging, with a substantial performance gap between the best system and human baseline.

Conclusion: SciClaimEval provides a valuable resource for scientific claim verification with authentic data and multimodal evidence. The results highlight the difficulty of figure-based verification for current models, suggesting this as an important direction for future research.

Abstract: We present SciClaimEval, a new scientific dataset for the claim verification task. Unlike existing resources, SciClaimEval features authentic claims, including refuted ones, directly extracted from published papers. To create refuted claims, we introduce a novel approach that modifies the supporting evidence (figures and tables) rather than altering the claims or relying on large language models (LLMs) to fabricate contradictions. The dataset provides cross-modal evidence with diverse representations: figures are available as images, while tables are provided in multiple formats, including images, LaTeX source, HTML, and JSON. SciClaimEval contains 1,664 annotated samples from 180 papers across three domains (machine learning, natural language processing, and medicine), validated through expert annotation. We benchmark 11 multimodal foundation models, both open-source and proprietary, across the dataset. Results show that figure-based verification remains particularly challenging for all models, with a substantial performance gap between the best system and the human baseline.

[59] Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation

Krzysztof Wróbel, Jan Maria Kowalski, Jerzy Surma, Igor Ciuciura, Maciej Szymański

Main category: cs.CL

TL;DR: Bielik Guard: Compact Polish language safety classifiers for LLM applications, with two model variants (0.1B and 0.5B parameters) that classify content across five safety categories.

DetailsMotivation: As LLMs become increasingly deployed in Polish language applications, there's a need for efficient and accurate content safety classifiers to ensure safe interactions.

Method: Developed two compact models based on MMLW-RoBERTa-base (0.1B) and PKOBP/polish-roberta-8k (0.5B), fine-tuned on a community-annotated dataset of 6,885 Polish texts across five safety categories.

Result: Both models achieve strong performance: 0.5B variant has F1 scores of 0.791 (micro) and 0.785 (macro); 0.1B variant achieves 77.65% precision and 0.63% false positive rate on real user prompts, outperforming HerBERT-PL-Guard.

Conclusion: Bielik Guard provides effective Polish language safety classification with models designed for appropriate responses rather than simple blocking, especially for sensitive categories like self-harm.

Abstract: As Large Language Models (LLMs) become increasingly deployed in Polish language applications, the need for efficient and accurate content safety classifiers has become paramount. We present Bielik Guard, a family of compact Polish language safety classifiers comprising two model variants: a 0.1B parameter model based on MMLW-RoBERTa-base and a 0.5B parameter model based on PKOBP/polish-roberta-8k. Fine-tuned on a community-annotated dataset of 6,885 Polish texts, these models classify content across five safety categories: Hate/Aggression, Vulgarities, Sexual Content, Crime, and Self-Harm. Our evaluation demonstrates that both models achieve strong performance on multiple benchmarks. The 0.5B variant offers the best overall discrimination capability with F1 scores of 0.791 (micro) and 0.785 (macro) on the test set, while the 0.1B variant demonstrates exceptional efficiency. Notably, Bielik Guard 0.1B v1.1 achieves superior precision (77.65%) and very low false positive rate (0.63%) on real user prompts, outperforming HerBERT-PL-Guard (31.55% precision, 4.70% FPR) despite identical model size. The models are publicly available and designed to provide appropriate responses rather than simple content blocking, particularly for sensitive categories like self-harm.
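The precision and false-positive-rate figures quoted above follow the standard confusion-matrix definitions. A quick sketch of the computation, with made-up labels rather than Bielik Guard's evaluation data:

```python
def precision_and_fpr(y_true, y_pred):
    """y_true/y_pred: 1 = unsafe (flagged), 0 = safe."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # flagged and truly unsafe
    fpr = fp / (fp + tn) if fp + tn else 0.0          # safe prompts wrongly flagged
    return precision, fpr

# toy example: 4 unsafe and 6 safe prompts
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
p, fpr = precision_and_fpr(y_true, y_pred)
```

A low FPR matters here because over-blocking safe prompts (e.g. benign mentions of self-harm) is exactly the failure mode the models are designed to avoid.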

[60] Large Language Models and Impossible Language Acquisition: “False Promise” or an Overturn of our Current Perspective towards AI

Ziyan Wang, Longlong Ma

Main category: cs.CL

TL;DR: The paper examines Chomsky’s critique of LLMs as mere pattern predictors that can’t learn impossible languages, testing this claim empirically with GPT-2 and LSTM models on syntactically impossible languages constructed from English transformations.

DetailsMotivation: To empirically test Chomsky's fundamental critique that LLMs cannot distinguish impossible languages due to lacking human-like causal and self-correction structures, and to explore the theoretical implications for LLM research paradigms.

Method: Constructed syntactically impossible languages by applying transformations to English (sentence reversal, negation based on word-count parity). Conducted controlled experiments on GPT-2 small models and LSTM models with statistical analysis using Welch’s t-test.

Result: GPT-2 small models underperformed in learning all impossible languages compared to possible languages (p<.001). LSTM models’ performance aligned with Chomsky’s argument, highlighting transformer architecture’s unique role.

Conclusion: Proposes a new vision within Chomsky’s theory for LLMs and suggests shifting from his “rationalist-romantics” paradigm to functionalism and empiricism in LLM research, acknowledging architectural differences.

Abstract: In Chomsky’s provocative critique “The False Promise of ChatGPT,” Large Language Models (LLMs) are characterized as mere pattern predictors that do not acquire language via the intrinsic causal and self-correction structures humans use, and are therefore unable to distinguish impossible languages. The critique is representative of a fundamental challenge to the intellectual foundations of AI: it synthesizes major methodological issues within LLM research and embodies an iconic a priori rationalist perspective. We examine this famous critique both through the pre-existing literature in linguistics and psychology and through an experiment probing the capacity of LLMs to learn possible and impossible languages. We constructed a set of syntactically impossible languages by applying transformations to English, including reversing whole sentences and adding negation based on word-count parity. Two rounds of controlled experiments were each conducted on GPT-2 small models and long short-term memory (LSTM) models. Statistical analysis (Welch’s t-test) shows that GPT-2 small models underperform in learning all of the impossible languages compared to their performance on the possible language (p<.001). LSTM models’ performance, by contrast, tallies with Chomsky’s argument, suggesting the irreplaceable role of the evolution of the transformer architecture. Based on theoretical analysis and empirical findings, we propose a new vision of LLMs within Chomsky’s theory, and a shift of theoretical paradigm outside Chomsky, from his “rationalist-romantics” paradigm to functionalism and empiricism in LLM research.
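The two transformation types named in the abstract are easy to sketch as string operations. The insertion position for the parity-based negation marker is an assumption here; the paper may place it differently.

```python
def reverse_sentence(sentence):
    """Impossible language A: whole-sentence word reversal."""
    return " ".join(reversed(sentence.split()))

def parity_negation(sentence, marker="not"):
    """Impossible language B: a negation marker whose presence depends on
    word-count parity. Inserting after the first word is illustrative."""
    words = sentence.split()
    if len(words) % 2 == 0:
        words.insert(1, marker)
    return " ".join(words)
```

Both rules are trivially computable yet violate attested human grammar, which is what makes them useful probes of what an architecture can learn.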

[61] GISA: A Benchmark for General Information-Seeking Assistant

Yutao Zhu, Xingshuo Zhang, Maosen Zhang, Jiajie Jin, Liancheng Zhang, Xiaoshuai Song, Kangzhi Zhao, Wencong Zeng, Ruiming Tang, Han Li, Ji-Rong Wen, Zhicheng Dou

Main category: cs.CL

TL;DR: GISA is a benchmark for evaluating general information-seeking assistants with human-crafted queries, structured answer formats, live updates, and complete search trajectories to address limitations in existing benchmarks.

DetailsMotivation: Existing benchmarks for search agents have limitations: they construct queries backward from answers (unnatural tasks), focus narrowly on either locating or aggregating information, use static answer sets prone to data contamination, and lack process-level supervision for training.

Method: Introduces GISA benchmark with 373 human-crafted queries reflecting authentic information-seeking scenarios. Features four structured answer formats (item, set, list, table) for deterministic evaluation, integrates deep reasoning and broad information aggregation, includes live subset with periodically updated answers, and provides complete human search trajectories for every query.

Result: Experiments show even the best-performing model achieves only 19.30% exact match score, with performance degrading significantly on tasks requiring complex planning and comprehensive information gathering, highlighting substantial room for improvement.

Conclusion: GISA addresses key limitations of existing benchmarks and reveals significant gaps in current search agent capabilities, particularly for complex planning and comprehensive information aggregation tasks, providing a valuable resource for future research.

Abstract: The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions. Various benchmarks have been proposed to evaluate such agents. However, existing benchmarks often construct queries backward from answers, producing unnatural tasks misaligned with real-world needs. Moreover, these benchmarks tend to focus on either locating specific information or aggregating information from multiple sources, while relying on static answer sets prone to data contamination. To bridge these gaps, we introduce GISA, a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries that reflect authentic information-seeking scenarios. GISA features four structured answer formats (item, set, list, and table), enabling deterministic evaluation. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Notably, GISA provides complete human search trajectories for every query, offering gold-standard references for process-level supervision and imitation learning. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves only 19.30% exact match score, with performance notably degrading on tasks requiring complex planning and comprehensive information gathering. These findings highlight substantial room for future improvement.
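Deterministic evaluation over GISA's four answer formats might look like the following sketch. The normalization and the exact format conventions (order-sensitivity, table-as-rows) are assumptions, not GISA's official scorer.

```python
def _norm(x):
    """Illustrative normalization: lowercase, collapse whitespace."""
    return " ".join(str(x).lower().split())

def exact_match(pred, gold, fmt):
    """Deterministic scoring for the four structured answer formats."""
    if fmt == "item":                       # single value
        return _norm(pred) == _norm(gold)
    if fmt == "set":                        # order-insensitive collection
        return {_norm(x) for x in pred} == {_norm(x) for x in gold}
    if fmt == "list":                       # order-sensitive sequence
        return [_norm(x) for x in pred] == [_norm(x) for x in gold]
    if fmt == "table":                      # rows of cells, order-sensitive
        return (len(pred) == len(gold) and
                all([_norm(c) for c in pr] == [_norm(c) for c in gr]
                    for pr, gr in zip(pred, gold)))
    raise ValueError(f"unknown format: {fmt}")
```

Structured formats like these are what make a strict exact-match score (such as the 19.30% quoted above) well defined, since free-text answers would require fuzzy matching or an LLM judge.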

[62] Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models

Théo Lasnier, Wissam Antoun, Francis Kulumba, Djamé Seddah

Main category: cs.CL

TL;DR: Mechanistic analysis reveals that backdoor triggers in LLMs co-opt existing language encoding circuits rather than creating isolated circuits, with trigger-activated heads overlapping significantly with natural language encoding heads.

DetailsMotivation: To understand the internal mechanisms of backdoor attacks in LLMs, specifically how language-switching triggers operate, since current understanding is limited despite significant security risks.

Method: Used activation patching on the GAPperon model family (1B, 8B, 24B parameters) with language-switching backdoors injected during pretraining. Localized trigger formation to early layers and identified which attention heads process trigger information.

Result: Trigger formation occurs in early layers (7.5-25% of model depth). Trigger-activated heads substantially overlap with heads naturally encoding output language across model scales (Jaccard indices 0.18-0.66). Backdoor triggers co-opt existing language components rather than forming isolated circuits.

Conclusion: Backdoor triggers exploit the model’s existing functional components rather than creating separate circuits. This has implications for defense: detection should monitor known functional components, and mitigation could leverage the entanglement between injected and natural behaviors.

Abstract: Backdoor attacks pose significant security risks for Large Language Models (LLMs), yet the internal mechanisms by which triggers operate remain poorly understood. We present the first mechanistic analysis of language-switching backdoors, studying the GAPperon model family (1B, 8B, 24B parameters) which contains triggers injected during pretraining that cause output language switching. Using activation patching, we localize trigger formation to early layers (7.5-25% of model depth) and identify which attention heads process trigger information. Our central finding is that trigger-activated heads substantially overlap with heads naturally encoding output language across model scales, with Jaccard indices between 0.18 and 0.66 over the top heads identified. This suggests that backdoor triggers do not form isolated circuits but instead co-opt the model’s existing language components. These findings have implications for backdoor defense: detection methods may benefit from monitoring known functional components rather than searching for hidden circuits, and mitigation strategies could potentially leverage this entanglement between injected and natural behaviors.
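The reported head overlap is a Jaccard index over two sets of attention heads, each identified here by a hypothetical (layer, head) pair:

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two sets of attention heads."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# illustrative head sets, not the paper's actual localization results
trigger_heads = {(2, 1), (3, 0), (5, 4), (6, 2)}
language_heads = {(3, 0), (5, 4), (6, 2), (7, 7), (9, 1)}
overlap = jaccard(trigger_heads, language_heads)
```

A value near the paper's upper figure of 0.66 means most trigger-activated heads are also natural language-encoding heads, which is the "co-opting" claim.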

[63] When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents

Virginie Mouilleron, Théo Lasnier, Djamé Seddah

Main category: cs.CL

TL;DR: Multimodal Finance Eval: First benchmark for evaluating French financial document understanding in VLMs, showing strong text/table performance but poor chart interpretation and multi-turn reasoning.

DetailsMotivation: Current VLMs lack evaluation in specialized non-English domains like finance, where documents contain dense regulatory text, numerical tables, and visual charts, and errors have real-world consequences.

Method: Created Multimodal Finance Eval benchmark with 1,204 expert-validated questions from real French financial documents (investment prospectuses, KIDs, PRIIPs). Evaluated six open-weight VLMs (8B-124B parameters) using LLM-as-judge protocol across text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning.

Result: Models achieve 85-90% accuracy on text and table tasks but struggle with chart interpretation (34-62%). Multi-turn dialogue reveals critical failure: early mistakes propagate across turns, driving accuracy down to ~50% regardless of model size.

Conclusion: Current VLMs are effective for well-defined extraction tasks but brittle in interactive, multi-step financial analysis. The benchmark provides a challenging testbed for progress in high-stakes multimodal understanding.

Abstract: Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. This gap is especially critical in finance, where documents mix dense regulatory text, numerical tables, and visual charts, and where extraction errors can have real-world consequences. We introduce Multimodal Finance Eval, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning, drawn from real investment prospectuses, KIDs, and PRIIPs. We evaluate six open-weight VLMs (8B-124B parameters) using an LLM-as-judge protocol. While models achieve strong performance on text and table tasks (85-90% accuracy), they struggle with chart interpretation (34-62%). Most notably, multi-turn dialogue reveals a sharp failure mode: early mistakes propagate across turns, driving accuracy down to roughly 50% regardless of model size. These results show that current VLMs are effective for well-defined extraction tasks but remain brittle in interactive, multi-step financial analysis. Multimodal Finance Eval offers a challenging benchmark to measure and drive progress in this high-stakes setting.
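The error-propagation effect can be quantified by computing accuracy per turn and, separately, accuracy conditioned on the previous turn being wrong. A sketch with made-up correctness traces, not the benchmark's data:

```python
def per_turn_accuracy(traces):
    """traces: one list of 0/1 correctness flags per dialogue."""
    n_turns = max(len(t) for t in traces)
    return [sum(t[i] for t in traces if len(t) > i) /
            sum(1 for t in traces if len(t) > i)
            for i in range(n_turns)]

def accuracy_after_error(traces):
    """Accuracy at turn i, conditioned on turn i-1 being answered wrongly."""
    hits, total = 0, 0
    for t in traces:
        for prev, cur in zip(t, t[1:]):
            if prev == 0:
                hits += cur
                total += 1
    return hits / total if total else 0.0

traces = [[1, 1, 1], [1, 0, 0], [0, 0, 1], [1, 1, 0]]
```

A large gap between overall per-turn accuracy and accuracy-after-error is the signature of the propagation failure the abstract describes.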

[64] Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs

Zhongzhi Li, Xuansheng Wu, Yijiang Li, Lijie Hu, Ninghao Liu

Main category: cs.CL

TL;DR: FAC Synthesis: A diversity-driven data synthesis framework that uses Feature Activation Coverage to measure and improve data diversity in LLMs by identifying missing features and generating synthetic samples to cover them.

DetailsMotivation: Existing approaches to constructing post-training data for LLMs use text-based metrics that only capture linguistic variation, providing weak signals for task-relevant features that determine downstream performance. There's a need for better diversity metrics that align with what actually matters for model performance.

Method: Introduces Feature Activation Coverage (FAC) to measure data diversity in an interpretable feature space. Proposes FAC Synthesis framework: 1) uses sparse autoencoder to identify missing features from seed dataset, 2) generates synthetic samples that explicitly reflect these missing features to improve coverage.

Result: The approach consistently improves both data diversity and downstream performance on various tasks including instruction following, toxicity detection, reward modeling, and behavior steering. Identifies a shared, interpretable feature space across model families (LLaMA, Mistral, Qwen) enabling cross-model knowledge transfer.

Conclusion: FAC Synthesis provides a solid and practical methodology for data-centric optimization of LLMs, moving beyond text-based diversity metrics to feature-based coverage that better aligns with downstream performance.

Abstract: The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC) which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.
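The coverage metric and the targeting of missing features can be sketched as follows. The sparse (feature_id, activation) representation and the thresholding rule are assumptions about how an SAE's outputs would be consumed, not the paper's exact definition.

```python
def feature_activation_coverage(activations, n_features, threshold=0.0):
    """FAC-style coverage: fraction of SAE features activated by at least one
    sample, plus the features the seed data never activates. The missing set
    is what a synthesis step would explicitly target."""
    active = set()
    for vec in activations:               # vec: sparse (feature_id, value) pairs
        for j, a in vec:
            if a > threshold:
                active.add(j)
    missing = set(range(n_features)) - active
    return len(active) / n_features, missing

# three seed samples over a 6-feature SAE dictionary
seed_acts = [[(0, 1.2), (2, 0.5)], [(2, 0.8), (3, 0.1)], [(0, 2.0)]]
coverage, missing = feature_activation_coverage(seed_acts, n_features=6)
```

In this toy case half the features are covered; the synthesis loop would then generate samples designed to activate the remaining ones.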

[65] Targeted Syntactic Evaluation of Language Models on Georgian Case Alignment

Daniel Gallagher, Gerhard Heyer

Main category: cs.CL

TL;DR: Transformer models struggle with ergative case alignment in Georgian, performing worst on ergative case despite overall frequency patterns (NOM > DAT > ERG).

DetailsMotivation: To evaluate transformer-based language models' ability to handle split-ergative case alignment in Georgian, a rare grammatical system, and understand how they perform on different case assignments.

Method: Created 370 syntactic tests using treebank-based approach with Grew query language, testing seven tasks with 50-70 samples each. Evaluated five encoder and two decoder models with word/sentence-level accuracy metrics.

Result: Models performed worst on ergative case assignment and best on nominative case. Performance correlated with frequency distribution (NOM > DAT > ERG). Poor ergative performance attributed to its specific role and lack of training data.

Conclusion: Transformer models struggle with rare grammatical phenomena like ergative case in low-resource languages. Dataset and methodology provide framework for future syntactic evaluations in languages with limited benchmarks.

Abstract: This paper evaluates the performance of transformer-based language models on split-ergative case alignment in Georgian, a particularly rare system for assigning grammatical cases to mark argument roles. We focus on subject and object marking determined through various permutations of nominative, ergative, and dative noun forms. A treebank-based approach for the generation of minimal pairs using the Grew query language is implemented. We create a dataset of 370 syntactic tests made up of seven tasks containing 50-70 samples each, where three noun forms are tested in any given sample. Five encoder- and two decoder-only models are evaluated with word- and/or sentence-level accuracy metrics. Regardless of the specific syntactic makeup, models performed worst in assigning the ergative case correctly and strongest in assigning the nominative case correctly. Performance correlated with the overall frequency distribution of the three forms (NOM > DAT > ERG). Though data scarcity is a known issue for low-resource languages, we show that the highly specific role of the ergative along with a lack of available training data likely contributes to poor performance on this case. The dataset is made publicly available and the methodology provides an interesting avenue for future syntactic evaluations of languages where benchmarks are limited.
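Minimal-pair evaluation of the kind described above reduces to checking whether the model scores the grammatical member of each pair higher. A sketch with a stand-in scoring function (a real evaluation would use the model's summed log-probabilities; the starred sentences are hypothetical placeholders, not Georgian data):

```python
def minimal_pair_accuracy(pairs, score_fn):
    """pairs: (grammatical, ungrammatical) sentence pairs.
    score_fn: higher = more probable under the model."""
    correct = sum(1 for good, bad in pairs if score_fn(good) > score_fn(bad))
    return correct / len(pairs)

# stand-in scorer that penalizes the '*' marking an ungrammatical case form
toy_score = lambda s: -s.count("*")
pairs = [("sentence with correct case", "sentence with wrong case *"),
         ("another well-formed example", "another ill-formed example *")]
acc = minimal_pair_accuracy(pairs, toy_score)
```

The paper's per-case breakdown (NOM best, ERG worst) comes from grouping such accuracies by which case form the pair manipulates.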

[66] Computational Phenomenology of Temporal Experience in Autism: Quantifying the Emotional and Narrative Characteristics of Lived Unpredictability

Kacper Dudzic, Karolina Drożdż, Maciej Wodziński, Anastazja Szuła, Marcin Moskalewicz

Main category: cs.CL

TL;DR: This paper investigates temporal disturbances in autism using mixed methods combining phenomenological interviews, computational analysis of autistic narratives, and narrative flow assessment to understand lived unpredictability experiences.

DetailsMotivation: The research aims to bridge gaps between phenomenological and computational approaches to studying temporality in autism, addressing limitations of deficit-based medical models, small qualitative samples, and lack of phenomenological grounding in computational research.

Method: Three integrated methodologies: Study A - structured phenomenological interviews with autistic individuals using Transdiagnostic Assessment of Temporal Experience; Study B - computational analysis of an autobiographical corpus of autistic narratives; Study C - replication of computational study using narrative flow measures to assess phenomenological authenticity of autistic autobiographies.

Result: Interviews revealed unpredictability of experience as most significant difference between autistic and control groups. Computational analysis showed autistic narratives had more negatively valenced temporal lexicon, especially “Immediacy & Suddenness” category. Outlier analysis identified terms associated with perceived discontinuity as highly negative. Narrative flow analysis found autistic narratives resemble autobiographical stories more than imaginary ones.

Conclusion: Temporal challenges in autism primarily concern lived unpredictability and stem from the contents of lived experience rather than autistic narrative construction, demonstrating value of integrating phenomenological and computational approaches.

Abstract: Disturbances in temporality, such as desynchronization with the social environment and its unpredictability, are considered core features of autism with a deep impact on relationships. However, limitations regarding research on this issue include: 1) the dominance of deficit-based medical models of autism, 2) sample size in qualitative research, and 3) the lack of phenomenological anchoring in computational research. To bridge the gap between phenomenological and computational approaches and overcome sample-size limitations, our research integrated three methodologies. Study A: structured phenomenological interviews with autistic individuals using the Transdiagnostic Assessment of Temporal Experience. Study B: computational analysis of an autobiographical corpus of autistic narratives built for this purpose. Study C: a replication of a computational study using narrative flow measures to assess the perceived phenomenological authenticity of autistic autobiographies. Interviews revealed that the most significant differences between the autistic and control groups concerned unpredictability of experience. Computational results mirrored these findings: the temporal lexicon in autistic narratives was significantly more negatively valenced - particularly the “Immediacy & Suddenness” category. Outlier analysis identified terms associated with perceived discontinuity (unpredictably, precipitously, and abruptly) as highly negative. The computational analysis of narrative flow found that the autistic narratives contained within the corpus quantifiably resemble autobiographical stories more than imaginary ones. Overall, the temporal challenges experienced by autistic individuals were shown to primarily concern lived unpredictability and stem from the contents of lived experience, and not from autistic narrative construction.

[67] Finding Sense in Nonsense with Generated Contexts: Perspectives from Humans and Language Models

Katrina Olsen, Sebastian Padó

Main category: cs.CL

TL;DR: Paper analyzes how well LLMs distinguish between anomalous vs. nonsensical sentences using human judgments on five semantically deviant datasets, finding most sentences are merely anomalous and LLMs can generate plausible contexts for them.

DetailsMotivation: To understand how nonsensical existing semantic anomaly datasets truly are, and to evaluate LLMs' ability to distinguish between anomalous (contextually interpretable) vs. truly nonsensical sentences.

Method: Collected sensicality judgments from human raters and LLMs on sentences from five semantically deviant datasets, both context-free and with provided contexts. Analyzed how many sentences were considered truly nonsensical vs. merely anomalous.

Result: Human raters considered most sentences at most anomalous rather than properly nonsensical. LLMs demonstrated substantial skill in generating plausible contexts for anomalous cases, suggesting they can distinguish interpretable anomalies from true nonsense.

Conclusion: Existing semantic anomaly datasets contain mostly anomalous rather than truly nonsensical sentences, and LLMs show competence in contextual interpretation of anomalies, raising questions about current evaluation methods for semantic understanding.

Abstract: Nonsensical and anomalous sentences have been instrumental in the development of computational models of semantic interpretation. A core challenge is to distinguish between what is merely anomalous (but can be interpreted given a supporting context) and what is truly nonsensical. However, it is unclear (a) how nonsensical, rather than merely anomalous, existing datasets are; and (b) how well LLMs can make this distinction. In this paper, we answer both questions by collecting sensicality judgments from human raters and LLMs on sentences from five semantically deviant datasets: both context-free and when providing a context. We find that raters consider most sentences at most anomalous, and only a few as properly nonsensical. We also show that LLMs are substantially skilled in generating plausible contexts for anomalous cases.

[68] Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance

Yunchong Huang, Gianni Barlacchi, Sandro Pezzelle

Main category: cs.CL

TL;DR: LLMs struggle with underspecified questions in QA benchmarks, with 16-50% of questions being ambiguous; rewriting them to be fully specified improves performance significantly.

DetailsMotivation: Standard QA benchmarks remain unsolved despite LLM advances, potentially due to underspecified questions that lack necessary context for unique interpretation.

Method: Introduced LLM-based classifier to identify underspecified questions in QA datasets, conducted controlled rewriting experiment to create fully specified variants while keeping gold answers fixed.

Result: Found 16% to over 50% of benchmark questions are underspecified; LLMs perform significantly worse on them; QA performance consistently improves when questions are rewritten to be fully specified.

Conclusion: Underspecification is a major confound in QA evaluation, and many apparent LLM failures stem from ambiguous questions rather than model limitations, highlighting need for clearer benchmark design.

Abstract: Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved. We argue that this gap is partly due to underspecified questions - queries whose interpretation cannot be uniquely determined without additional context. To test this hypothesis, we introduce an LLM-based classifier to identify underspecified questions and apply it to several widely used QA datasets, finding that 16% to over 50% of benchmark questions are underspecified and that LLMs perform significantly worse on them. To isolate the effect of underspecification, we conduct a controlled rewriting experiment that serves as an upper-bound analysis, rewriting underspecified questions into fully specified variants while holding gold answers fixed. QA performance consistently improves under this setting, indicating that many apparent QA failures stem from question underspecification rather than model limitations. Our findings highlight underspecification as an important confound in QA evaluation and motivate greater attention to question clarity in benchmark design.

[69] LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss

Szilvia Ujváry, Louis Béthune, Pierre Ablin, João Monteiro, Marco Cuturi, Michael Kirchhof

Main category: cs.CL

TL;DR: LaCy: A pretraining method for Small Language Models that learns which tokens to predict and which to delegate via a special token to prevent factual errors, using grammar parsing to augment loss signals.

DetailsMotivation: Small Language Models (SLMs) have limited capacity leading to factual errors, but can query external sources. Need to determine which tokens SLMs should learn vs. delegate during pretraining to optimize factual accuracy.

Method: LaCy uses the spaCy grammar parser to augment loss signals, distinguishing tokens whose high loss is acceptable (truthful alternative continuations) from those that should trigger delegation via a special token. Novel pretraining method based on this token-selection philosophy.

Result: LaCy models successfully learn which tokens to predict and where to delegate, achieving higher FactScores when cascading with larger models, outperforming Rho or LLM-judge trained SLMs while being simpler and cheaper.

Conclusion: LaCy provides effective method for SLMs to optimize knowledge retention vs. delegation, improving factual accuracy in generation tasks through intelligent token selection during pretraining.

Abstract: Language models have consistently grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into them is upper-bounded by their parameter size. The capacity of Small Language Models (SLMs) is especially limited, leading to factually incorrect generations. This problem is often mitigated by giving the SLM access to an outside source: the ability to query a larger model, documents, or a database. Under this setting, we study the fundamental question of \emph{which tokens an SLM can and should learn} during pretraining, versus \emph{which ones it should delegate} via a special delegation token. We find that this is not simply a question of loss: although the loss is predictive of whether a predicted token mismatches the ground-truth, some tokens are \emph{acceptable} in that they are truthful alternative continuations of a pretraining document, and should not trigger delegation even if their loss is high. We find that a spaCy grammar parser can help augment the loss signal to decide which tokens the SLM should learn to delegate to prevent factual errors and which are safe to learn and predict even under high losses. We propose LaCy, a novel pretraining method based on this token selection philosophy. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and where to delegate for help. This results in higher FactScores when generating in a cascade with a bigger model and outperforms Rho or LLM-judge trained SLMs, while being simpler and cheaper.
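The selection rule can be sketched in miniature. The snippet below is a hypothetical simplification: part-of-speech tags (which the paper obtains from spaCy) gate whether a high-loss token is learned or delegated. The threshold, tag set, and example sentence are all illustrative assumptions, not LaCy's actual criterion.

```python
# Hypothetical simplification of LaCy-style token selection: tokens whose
# loss is high AND whose part of speech marks factual content (proper
# nouns, numbers) are labelled for delegation, while high-loss function
# words or verbs are treated as acceptable alternative continuations.

def select_delegation_targets(tokens, losses, pos_tags, loss_threshold=3.0,
                              factual_pos=("PROPN", "NUM")):
    """Return per-token labels: 'learn' or 'delegate'."""
    labels = []
    for tok, loss, pos in zip(tokens, losses, pos_tags):
        if loss > loss_threshold and pos in factual_pos:
            labels.append("delegate")   # likely a fact the SLM cannot store
        else:
            labels.append("learn")      # safe to learn, even under high loss
    return labels

# "Paris" is a high-loss proper noun -> delegate; "visited" is a high-loss
# verb but an acceptable alternative continuation -> learn.
tokens = ["She", "visited", "Paris", "in", "1889"]
losses = [0.4, 4.1, 5.2, 0.2, 6.0]
pos = ["PRON", "VERB", "PROPN", "ADP", "NUM"]
print(select_delegation_targets(tokens, losses, pos))
# → ['learn', 'learn', 'delegate', 'learn', 'delegate']
```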

[70] WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models

Yangzhuo Li, Shengpeng Ji, Yifu Chen, Tianle Liang, Haorong Ying, Yule Wang, Junbo Li, Jun Fang, Zhou Zhao

Main category: cs.CL

TL;DR: WavBench is a comprehensive benchmark for evaluating spoken dialogue models, featuring three subsets: Pro for challenging reasoning, Basic for natural colloquialism, and Acoustic for paralinguistic capabilities.

DetailsMotivation: Current benchmarks for spoken dialogue models follow text-generation standards and fail to capture audio-centric characteristics like paralinguistics and colloquialisms, as well as the cognitive depth needed for modern conversational agents.

Method: Introduces WavBench with a tripartite framework: 1) Pro subset for rigorous reasoning challenges, 2) Basic subset for natural colloquialism focusing on “listenability”, and 3) Acoustic subset for comprehensive paralinguistic evaluation including understanding, generation, and implicit dialogue.

Result: Evaluated five state-of-the-art models, providing insights into complex problem-solving, colloquial delivery, and paralinguistic fidelity in spoken dialogue systems.

Conclusion: WavBench addresses critical gaps in evaluating spoken dialogue models and guides the evolution of robust conversational agents by focusing on realistic audio-centric capabilities beyond text-based metrics.

Abstract: With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently demands benchmarks that transcend simple interactions to address real-world complexity. However, current evaluations predominantly adhere to text-generation standards, overlooking the unique audio-centric characteristics of paralinguistics and colloquialisms, alongside the cognitive depth required by modern agents. To bridge this gap, we introduce WavBench, a comprehensive benchmark designed to evaluate realistic conversational abilities where prior works fall short. Uniquely, WavBench establishes a tripartite framework: 1) Pro subset, designed to rigorously challenge reasoning-enhanced models with significantly increased difficulty; 2) Basic subset, defining a novel standard for spoken colloquialism that prioritizes “listenability” through natural vocabulary, linguistic fluency, and interactive rapport, rather than rigid written accuracy; and 3) Acoustic subset, covering explicit understanding, generation, and implicit dialogue to rigorously evaluate comprehensive paralinguistic capabilities within authentic real-world scenarios. Through evaluating five state-of-the-art models, WavBench offers critical insights into the intersection of complex problem-solving, colloquial delivery, and paralinguistic fidelity, guiding the evolution of robust spoken dialogue models. The benchmark dataset and evaluation toolkit are available at https://naruto-2024.github.io/wavbench.github.io/.

[71] Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation

Julia Belikova, Danila Rozhevskii, Dennis Svirin, Konstantin Polev, Alexander Panchenko

Main category: cs.CL

TL;DR: The paper proposes methods to detect “token overflow” in soft compression architectures for LLMs, where compressed representations lose task-relevant information needed to answer queries.

DetailsMotivation: Soft compression architectures extend context length by compressing long token sequences, but there's limited understanding of when compression erases task-relevant content, creating a need to detect this "token overflow" regime.

Method: Proposes methodology to characterize and detect token overflow. Uses query-agnostic saturation statistics to identify compressed tokens, and lightweight probing classifiers over both query and context representations in xRAG soft-compression setting.

Result: Query-agnostic saturation statistics reliably separate compressed from uncompressed tokens but have limited overflow detection. Lightweight probing classifiers achieve 0.72 AUC-ROC on average across HotpotQA, SQuADv2, and TriviaQA datasets, showing query information improves detection.

Conclusion: The work advances from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors in long-context processing.

Abstract: Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet, the limits of compressibility – and when compression begins to erase task-relevant content – remain underexplored. In this paper, we define token overflow as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited overflow detection capability. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average on HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
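The query-aware probe is conceptually simple; below is a minimal sketch under toy assumptions (synthetic 16-d representations and a fabricated overflow signal — the real xRAG embeddings, probe architecture, and training details differ). It trains a logistic probe over concatenated query and context vectors and scores it with AUC-ROC, the metric the paper reports.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for xRAG representations: overflow cases (label 1) get a
# systematic shift in the compressed context token. The dimensions and
# the overflow signal are illustrative assumptions, not the paper's data.
n, d = 400, 16
labels = rng.integers(0, 2, n)
query = rng.normal(size=(n, d))
context = rng.normal(size=(n, d)) + 0.8 * labels[:, None]

# Query-aware probe input: concatenate query and context representations.
X = np.concatenate([query, context], axis=1)
w, b = np.zeros(X.shape[1]), 0.0

for _ in range(500):                          # plain logistic regression
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    g = p - labels
    w -= 0.5 * (X.T @ g) / n
    b -= 0.5 * g.mean()

def auc_roc(scores, y):
    """Probability that a random positive outranks a random negative."""
    pos, neg = scores[y == 1], scores[y == 0]
    return float((pos[:, None] > neg[None, :]).mean())

auc = auc_roc(X @ w + b, labels)
print(round(auc, 2))   # high on this easy synthetic task
```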

[72] T3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization

Tunyu Zhang, Xinxi Zhang, Ligong Han, Haizhou Shi, Xiaoxiao He, Zhuowei Li, Hao Wang, Kai Xu, Akash Srivastava, Hao Wang, Vladimir Pavlovic, Dimitris N. Metaxas

Main category: cs.CL

TL;DR: Trajectory self-distillation framework improves few-step decoding in diffusion language models by distilling generative trajectories using Direct Discriminative Optimization to reduce inference steps while maintaining quality.

DetailsMotivation: Diffusion large language models (DLLMs) can theoretically enable fast parallel text generation, but in practice require many refinement steps for quality. Aggressively reducing steps causes substantial quality degradation, creating a need for efficient few-step decoding methods.

Method: Proposes a trajectory self-distillation framework that distills the model’s own generative trajectories. Incorporates Direct Discriminative Optimization (DDO), a reverse-KL objective that promotes mode-seeking distillation and encourages the student model to focus on high-probability teacher modes.

Result: The approach consistently outperforms strong few-step baselines and standard training under tight step budgets. While full-step decoding remains superior, the method substantially narrows the performance gap, establishing a strong foundation for practical few-step DLLMs.

Conclusion: Trajectory self-distillation with DDO effectively improves few-step decoding in diffusion language models, making them more practical for efficient inference while maintaining generation quality.

Abstract: Diffusion large language models (DLLMs) have the potential to enable fast text generation by decoding multiple tokens in parallel. However, in practice, their inference efficiency is constrained by the need for many refinement steps, while aggressively reducing the number of steps leads to a substantial degradation in generation quality. To alleviate this, we propose a trajectory self-distillation framework that improves few-step decoding by distilling the model’s own generative trajectories. We incorporate Direct Discriminative Optimization (DDO), a reverse-KL objective that promotes mode-seeking distillation and encourages the student to concentrate on high-probability teacher modes. Across benchmarks, our approach consistently outperforms strong few-step baselines and standard training under tight step budgets. Although full-step decoding remains superior, we substantially narrow the gap, establishing a strong foundation towards practical few-step DLLMs. The source code is available at https://github.com/Tyrion58/T3D.
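The mode-seeking behavior of the reverse-KL objective can be illustrated numerically: when a unit-variance Gaussian student is fit to a bimodal teacher, reverse KL locks onto a single mode while forward KL averages across both. This is a generic illustration of the objective's geometry, not the paper's DDO implementation.

```python
import numpy as np

x = np.linspace(-8, 8, 4001)
dx = x[1] - x[0]

def gauss(x, mu, s=1.0):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

teacher = 0.5 * gauss(x, -3) + 0.5 * gauss(x, 3)   # bimodal teacher

def reverse_kl(mu):
    """KL(student || teacher) for a unit-variance student at mu."""
    s = gauss(x, mu)
    return np.sum(s * (np.log(s + 1e-12) - np.log(teacher + 1e-12))) * dx

def forward_kl(mu):
    """KL(teacher || student) for the same student family."""
    s = gauss(x, mu)
    return np.sum(teacher * (np.log(teacher + 1e-12) - np.log(s + 1e-12))) * dx

mus = np.linspace(-5, 5, 201)
best_rev = mus[np.argmin([reverse_kl(m) for m in mus])]
best_fwd = mus[np.argmin([forward_kl(m) for m in mus])]
# Reverse KL is mode-seeking (|mu| ≈ 3); forward KL mode-covers (mu ≈ 0).
print(abs(best_rev), best_fwd)
```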

cs.CV

[73] Thermal Imaging for Contactless Cardiorespiratory and Sudomotor Response Monitoring

Constantino Álvarez Casado, Mohammad Rahman, Sasan Sharifipour, Nhi Nguyen, Manuel Lage Cañellas, Xiaoting Wu, Miguel Bordallo López

Main category: cs.CV

TL;DR: Thermal infrared imaging enables contactless estimation of electrodermal activity, heart rate, and breathing rate from facial thermal video using signal processing pipelines.

DetailsMotivation: Thermal imaging captures skin temperature changes driven by autonomic regulation, allowing contactless biosignal estimation. While visible-light methods can estimate HR and BR, they cannot access EDA (electrodermal activity), which is a standard marker of sympathetic activation. Thermal imaging offers potential for comprehensive multimodal physiological monitoring.

Method: Developed a signal-processing pipeline that tracks anatomical facial regions, applies spatial aggregation, and separates slow sudomotor trends from faster cardiorespiratory components. For HR: used orthogonal matrix image transformation (OMIT) decomposition across multiple facial ROIs. For BR: averaged nasal and cheek signals before spectral peak detection. Evaluated 288 EDA configurations and HR/BR pipeline on 31 sessions from SIMULATOR STUDY 1 dataset.

Result: Best fixed EDA configuration (nose region, exponential moving average) achieved mean absolute correlation of 0.40 ± 0.23 against palm EDA, with individual sessions reaching 0.89. BR estimation achieved mean absolute error of 3.1 ± 1.1 bpm. HR estimation yielded 13.8 ± 7.5 bpm MAE, limited by low camera frame rate (7.5 Hz). Also reported signal polarity alternation, short thermodynamic latency, and condition-dependent/demographic effects.

Conclusion: Thermal imaging provides a viable approach for contactless estimation of EDA, HR, and BR, with EDA being particularly valuable as it’s inaccessible to visible-light methods. Results establish baseline performance bounds and design guidance for thermal contactless biosignal estimation systems.

Abstract: Thermal infrared imaging captures skin temperature changes driven by autonomic regulation and can potentially provide contactless estimation of electrodermal activity (EDA), heart rate (HR), and breathing rate (BR). While visible-light methods address HR and BR, they cannot access EDA, a standard marker of sympathetic activation. This paper characterizes the extraction of these three biosignals from facial thermal video using a signal-processing pipeline that tracks anatomical regions, applies spatial aggregation, and separates slow sudomotor trends from faster cardiorespiratory components. For HR, we apply an orthogonal matrix image transformation (OMIT) decomposition across multiple facial regions of interest (ROIs), and for BR we average nasal and cheek signals before spectral peak detection. We evaluate 288 EDA configurations and the HR/BR pipeline on 31 sessions from the public SIMULATOR STUDY 1 (SIM1) driver monitoring dataset. The best fixed EDA configuration (nose region, exponential moving average) reaches a mean absolute correlation of $0.40 \pm 0.23$ against palm EDA, with individual sessions reaching 0.89. BR estimation achieves a mean absolute error of $3.1 \pm 1.1$ bpm, while HR estimation yields $13.8 \pm 7.5$ bpm MAE, limited by the low camera frame rate (7.5 Hz). We report signal polarity alternation across sessions, short thermodynamic latency for well-tracked signals, and condition-dependent and demographic effects on extraction quality. These results provide baseline performance bounds and design guidance for thermal contactless biosignal estimation.
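The final step of the BR pipeline (spectral peak detection in a plausible breathing band) can be sketched on synthetic data. The 7.5 Hz frame rate matches the paper; the signal amplitude, noise level, and 0.1-0.5 Hz band are illustrative assumptions.

```python
import numpy as np

fs = 7.5                      # camera frame rate from the paper (Hz)
t = np.arange(0, 60, 1 / fs)  # one minute of averaged nasal/cheek signal
br_true = 15 / 60             # simulated breathing rate: 15 bpm -> 0.25 Hz
signal = (0.2 * np.sin(2 * np.pi * br_true * t)
          + 0.05 * np.random.default_rng(1).normal(size=t.size))

# Spectral peak detection restricted to a plausible breathing band.
spec = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)
band = (freqs >= 0.1) & (freqs <= 0.5)          # 6-30 bpm
br_est = freqs[band][np.argmax(spec[band])] * 60
print(round(br_est, 1))   # recovers ≈ 15 bpm
```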

[74] LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

Zekun Li, Sizhe An, Chengcheng Tang, Chuan Guo, Ivan Shugurov, Linguang Zhang, Amy Zhao, Srinath Sridhar, Lingling Tao, Abhay Mittal

Main category: cs.CV

TL;DR: LLaMo is a unified framework that extends pretrained LLMs with modality-specific Mixture-of-Transformers for motion-language understanding and generation, addressing catastrophic forgetting and jitter artifacts in existing approaches.

DetailsMotivation: Existing approaches for motion-language models often suffer from catastrophic forgetting of linguistic capabilities when fine-tuning LLMs on limited motion-text data, and introduce jitter artifacts from discrete motion tokenization. There's a need for a unified framework that preserves language understanding while enabling high-quality motion generation.

Method: Proposes LLaMo with modality-specific Mixture-of-Transformers (MoT) architecture that extends pretrained LLMs. Encodes human motion into causal continuous latent space (avoiding discrete quantization), uses lightweight flow-matching head for next-token prediction, enabling real-time streaming motion generation (>30 FPS).

Result: Achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially excels at zero-shot motion generation. Demonstrates preservation of language understanding while enabling scalable multimodal adaptation.

Conclusion: LLaMo represents a significant step toward a general unified motion-language large model, addressing key limitations of previous approaches through continuous motion encoding and Mixture-of-Transformers architecture that preserves pretrained LLM capabilities.

Abstract: Recent progress in large models has led to significant advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored. Existing approaches often fine-tune large language models (LLMs) on paired motion-text data, which can result in catastrophic forgetting of linguistic capabilities due to the limited scale of available text-motion pairs. Furthermore, prior methods typically convert motion into discrete representations via quantization to integrate with language models, introducing substantial jitter artifacts from discrete tokenization. To address these challenges, we propose LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers (MoT) architecture. This design inherently preserves the language understanding of the base model while enabling scalable multimodal adaptation. We encode human motion into a causal continuous latent space and maintain the next-token prediction paradigm in the decoder-only backbone through a lightweight flow-matching head, allowing for streaming motion generation in real-time (>30 FPS). Leveraging the comprehensive language understanding of pretrained LLMs and large-scale motion-text pretraining, our experiments demonstrate that LLaMo achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation, marking a significant step towards a general unified motion-language large model.

[75] Synthetic Image Detection with CLIP: Understanding and Assessing Predictive Cues

Marco Willi, Melanie Mathys, Michael Graber

Main category: cs.CV

TL;DR: CLIP-based synthetic image detection shows strong performance but relies on high-level photographic attributes rather than generator artifacts, with generalization challenges across different generative architectures.

DetailsMotivation: As generative models produce near-photorealistic images that challenge photographic trustworthiness, synthetic image detection (SID) becomes crucial. However, existing SID methods struggle with generalization to novel generative models and practical settings. CLIP shows promise for SID, but it's unclear whether it detects visual artifacts or semantic biases, which would limit practical utility.

Method: Introduces SynthCLIC dataset with real photographs and high-quality synthetic counterparts from diffusion models to reduce semantic bias. Uses interpretable linear head with de-correlated activations and text-grounded concept-model to analyze what CLIP-based detectors learn. Evaluates performance on GAN-based benchmarks and the new diffusion dataset.

Result: CLIP-based linear detectors achieve 0.96 mAP on GAN-based benchmarks but only 0.92 on SynthCLIC diffusion dataset. Generalization across generator families drops to as low as 0.37 mAP. Detectors primarily rely on high-level photographic attributes (minimalist style, lens flare, depth layering) rather than generator-specific artifacts.

Conclusion: CLIP-based detectors perform well overall but generalize unevenly across diverse generative architectures. This highlights the need for continual model updates and broader training exposure, while reinforcing CLIP-based approaches as a strong foundation for more universal, robust synthetic image detection.

Abstract: Recent generative models produce near-photorealistic images, challenging the trustworthiness of photographs. Synthetic image detection (SID) has thus become an important area of research. Prior work has highlighted how synthetic images differ from real photographs; unfortunately, SID methods often struggle to generalize to novel generative models and often perform poorly in practical settings. CLIP, a foundational vision-language model which yields semantically rich image-text embeddings, shows strong accuracy and generalization for SID. Yet, the underlying relevant cues embedded in CLIP-features remain unknown. It is unclear whether CLIP-based detectors simply detect strong visual artifacts or exploit subtle semantic biases, both of which would render them useless in practical settings or on generative models of high quality. We introduce SynthCLIC, a paired dataset of real photographs and high-quality synthetic counterparts from recent diffusion models, designed to reduce semantic bias in SID. Using an interpretable linear head with de-correlated activations and a text-grounded concept-model, we analyze what CLIP-based detectors learn. CLIP-based linear detectors reach 0.96 mAP on a GAN-based benchmark but only 0.92 on our high-quality diffusion dataset SynthCLIC, and generalization across generator families drops to as low as 0.37 mAP. We find that the detectors primarily rely on high-level photographic attributes (e.g., minimalist style, lens flare, or depth layering), rather than overt generator-specific artifacts. CLIP-based detectors perform well overall but generalize unevenly across diverse generative architectures. This highlights the need for continual model updates and broader training exposure, while reinforcing CLIP-based approaches as a strong foundation for more universal, robust SID.
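Since the paper reports its detector results as mAP, a compact reference implementation of average precision may be useful. The detector scores below are simulated, not CLIP outputs.

```python
import numpy as np

def average_precision(scores, labels):
    """AP as the mean of precision@k evaluated at each positive example."""
    order = np.argsort(-scores)                 # rank by descending score
    labels = labels[order]
    cum_pos = np.cumsum(labels)
    precision_at_k = cum_pos / (np.arange(labels.size) + 1)
    return float((precision_at_k * labels).sum() / labels.sum())

rng = np.random.default_rng(0)
# Hypothetical detector scores: synthetic images (label 1) score higher
# on average, as for the linear head on CLIP features in the paper.
labels = rng.integers(0, 2, 2000).astype(float)
scores = rng.normal(size=2000) + 2.0 * labels
ap = average_precision(scores, labels)
print(round(ap, 2))
```

Mean AP (mAP) is then the average of this quantity over evaluation sets (e.g., generator families), which is how cross-generator drops like 0.37 mAP arise even when the in-distribution score is high.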

[76] Reproducing DragDiffusion: Interactive Point-Based Editing with Diffusion Models

Ali Subhan, Ashir Raza

Main category: cs.CV

TL;DR: Reproducibility study of DragDiffusion confirms its core claims about point-based image editing via diffusion latent optimization, while identifying sensitivity to key hyperparameters like timestep selection and feature supervision levels.

DetailsMotivation: To independently verify and validate the claims made in the original DragDiffusion paper regarding interactive point-based image editing using diffusion models, and to understand the reproducibility conditions and hyperparameter sensitivities.

Method: Used authors’ released implementation with DragBench benchmark, reproduced main ablation studies on diffusion timestep selection, LoRA-based fine-tuning, mask regularization strength, and UNet feature supervision. Also evaluated multi-timestep latent optimization variant.

Result: Close agreement with original qualitative/quantitative trends, confirming central claims. Performance sensitive to optimized timestep and feature level for motion supervision. Multi-timestep optimization didn’t improve spatial accuracy while increasing computational cost.

Conclusion: DragDiffusion’s core claims are reproducible, but reliable performance depends on careful hyperparameter selection, particularly timestep and feature supervision level. The method provides effective point-based image editing but requires specific conditions for optimal results.

Abstract: DragDiffusion is a diffusion-based method for interactive point-based image editing that enables users to manipulate images by directly dragging selected points. The method claims that accurate spatial control can be achieved by optimizing a single diffusion latent at an intermediate timestep, together with identity-preserving fine-tuning and spatial regularization. This work presents a reproducibility study of DragDiffusion using the authors’ released implementation and the DragBench benchmark. We reproduce the main ablation studies on diffusion timestep selection, LoRA-based fine-tuning, mask regularization strength, and UNet feature supervision, and observe close agreement with the qualitative and quantitative trends reported in the original work. At the same time, our experiments show that performance is sensitive to a small number of hyperparameter assumptions, particularly the optimized timestep and the feature level used for motion supervision, while other components admit broader operating ranges. We further evaluate a multi-timestep latent optimization variant and find that it does not improve spatial accuracy while substantially increasing computational cost. Overall, our findings support the central claims of DragDiffusion while clarifying the conditions under which they are reliably reproducible. Code is available at https://github.com/AliSubhan5341/DragDiffusion-TMLR-Reproducibility-Challenge.

[77] What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis

Xirui Li, Ming Li, Tianyi Zhou

Main category: cs.CV

TL;DR: RL in vision-language models primarily refines mid-to-late transformer layers for better vision-to-reasoning alignment rather than uniformly enhancing visual perception.

DetailsMotivation: To understand what specific capabilities reinforcement learning (RL) actually improves in vision-language models compared to supervised fine-tuning, since benchmark gains conflate multiple factors and make it difficult to attribute improvements to specific skills.

Method: Proposes a Frankenstein-style analysis framework with three components: (1) functional localization via causal probing, (2) update characterization via parameter comparison, and (3) transferability test via model merging.

Result: RL induces a consistent inference-time shift primarily in mid-to-late layers, and these mid-to-late refinements are both transferable (via merging) and necessary (via freezing) for RL gains.

Conclusion: RL’s reliable contribution in visual reasoning is not a uniform enhancement of visual perception, but a systematic refinement of mid-to-late transformer computation that improves vision-to-reasoning alignment and reasoning performance.

Abstract: Reinforcement learning (RL) with verifiable rewards has become a standard post-training stage for boosting visual reasoning in vision-language models, yet it remains unclear what capabilities RL actually improves compared with supervised fine-tuning as cold-start initialization (IN). End-to-end benchmark gains conflate multiple factors, making it difficult to attribute improvements to specific skills. To bridge the gap, we propose a Frankenstein-style analysis framework including: (i) functional localization via causal probing; (ii) update characterization via parameter comparison; and (iii) transferability test via model merging. Using this framework, we find that RL does not uniformly enhance visual perception; instead, it induces a consistent inference-time shift primarily in mid-to-late layers, and these mid-to-late refinements are both transferable (via merging) and necessary (via freezing) for RL gains. Overall, our results suggest that RL’s reliable contribution in visual reasoning is not a uniform enhancement of visual perception, but a systematic refinement of mid-to-late transformer computation that improves vision-to-reasoning alignment and reasoning performance, highlighting the limitations of benchmark-only evaluation for understanding multimodal reasoning improvements.
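The model-merging step of the framework amounts to transplanting parameter blocks between checkpoints. A toy sketch of the idea, with hypothetical layer naming ("blocks.i.attn") and string stand-ins for weight tensors; the paper's actual merging procedure and naming scheme may differ.

```python
# Frankenstein-style transplant: copy the mid-to-late transformer blocks
# of an RL-tuned model into the SFT model, then test whether the RL gains
# travel with them.

def transplant(sft_weights, rl_weights, start, end):
    """Replace blocks [start, end) of the SFT model with RL-tuned blocks."""
    merged = dict(sft_weights)
    for name, w in rl_weights.items():
        layer = int(name.split(".")[1])          # e.g. "blocks.17.attn" -> 17
        if start <= layer < end:
            merged[name] = w
    return merged

sft = {f"blocks.{i}.attn": f"sft_{i}" for i in range(24)}
rl = {f"blocks.{i}.attn": f"rl_{i}" for i in range(24)}

# Transplant the mid-to-late half (blocks 12-24 of a 24-block model).
merged = transplant(sft, rl, 12, 24)
print(merged["blocks.4.attn"], merged["blocks.17.attn"])
# → sft_4 rl_17
```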

[78] ZeroDiff++: Substantial Unseen Visual-semantic Correlation in Zero-shot Learning

Zihan Ye, Shreyank N Gowda, Kaile Du, Weijian Luo, Ling Shao

Main category: cs.CV

TL;DR: ZeroDiff++: A diffusion-based generative framework for zero-shot learning that addresses spurious visual-semantic correlations and data scarcity through diffusion augmentation, contrastive representations, and test-time adaptation.

DetailsMotivation: Existing generative ZSL methods suffer from spurious visual-semantic correlations worsened by scarce training data, and produce features disconnected from real test samples due to unadaptive fully noised generators.

Method: Proposes ZeroDiff++ with: (1) diffusion augmentation for diverse noised samples, (2) supervised contrastive representations for instance-level semantics, (3) multi-view discriminators with Wasserstein mutual learning, (4) Diffusion-based Test-time Adaptation (DiffTTA) using pseudo label reconstruction, and (5) Diffusion-based Test-time Generation (DiffGen) to produce partially synthesized features connecting real and generated data.

Result: Extensive experiments on three ZSL benchmarks show significant improvements over existing ZSL methods and robust performance even with scarce training data.

Conclusion: ZeroDiff++ effectively addresses spurious correlations and data scarcity in ZSL through diffusion-based generation and adaptation techniques, achieving state-of-the-art performance.

Abstract: Zero-shot Learning (ZSL) enables classifiers to recognize classes unseen during training, commonly via generative two-stage methods: (1) learn visual-semantic correlations from seen classes; (2) synthesize unseen-class features from semantics to train classifiers. In this paper, we identify spurious visual-semantic correlations in existing generative ZSL, worsened by scarce seen-class samples, and introduce two metrics to quantify spuriousness for seen and unseen classes. Furthermore, we point out a more critical bottleneck: existing unadaptive, fully noised generators produce features disconnected from real test samples, which also leads to spurious correlations. To enhance the visual-semantic correlations on both seen and unseen classes, we propose ZeroDiff++, a diffusion-based generative framework. In training, ZeroDiff++ uses (i) diffusion augmentation to produce diverse noised samples, (ii) supervised contrastive (SC) representations for instance-level semantics, and (iii) multi-view discriminators with Wasserstein mutual learning to assess generated features. At generation time, we introduce (iv) Diffusion-based Test-time Adaptation (DiffTTA) to adapt the generator using pseudo-label reconstruction, and (v) Diffusion-based Test-time Generation (DiffGen) to trace the diffusion denoising path and produce partially synthesized features that connect real and generated data, mitigating data scarcity further. Extensive experiments on three ZSL benchmarks demonstrate that ZeroDiff++ not only achieves significant improvements over existing ZSL methods but also maintains robust performance even with scarce training data. Code will be made available.

[79] MonoLoss: A Training Objective for Interpretable Monosemantic Representations

Ali Nasiri-Sarvi, Anh Tien Nguyen, Hassan Rivaz, Dimitris Samaras, Mahdi S. Hosseini

Main category: cs.CV

TL;DR: MonoLoss: A plug-in training objective that improves monosemanticity in sparse autoencoders by directly rewarding semantically consistent activations, with efficient single-pass evaluation.

DetailsMotivation: Sparse autoencoders struggle to decompose polysemantic neural representations into interpretable monosemantic features due to weak training objectives. Existing monosemanticity metrics are computationally expensive, requiring quadratic pairwise comparisons across dataset samples.

Method: Derived a single-pass algorithm for MonoScore metric that computes the same quantity with linear rather than quadratic complexity. Introduced Monosemanticity Loss (MonoLoss) as a plug-in training objective that directly rewards semantically consistent activations. Applied to various SAE architectures (BatchTopK, TopK, JumpReLU) on CLIP, SigLIP2, and ViT features.
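The quadratic-to-linear reduction can be illustrated with a generic activation-weighted mean pairwise cosine similarity (not necessarily the exact MonoScore definition): the sum of all pairwise dot products of unit vectors equals the squared norm of their weighted sum, so the O(n²) double loop collapses to a single pass.

```python
import numpy as np

def pairwise_score_quadratic(acts, feats):
    # O(n^2): activation-weighted mean cosine similarity over all pairs i != j
    u = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    n = len(acts)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                num += acts[i] * acts[j] * (u[i] @ u[j])
                den += acts[i] * acts[j]
    return num / den

def pairwise_score_linear(acts, feats):
    # O(n): same quantity via the identity sum_{i,j} a_i a_j <u_i,u_j> = ||s||^2
    u = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    s = (acts[:, None] * u).sum(axis=0)   # activation-weighted feature sum
    a_sq = (acts ** 2).sum()
    num = s @ s - a_sq                    # subtract the i == j diagonal terms
    den = acts.sum() ** 2 - a_sq
    return num / den

rng = np.random.default_rng(1)
feats = rng.normal(size=(50, 16))
acts = rng.uniform(size=50)
```

Both functions return the same value; only the linear variant scales to dataset-sized sweeps during training.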

Result: Achieved up to 1200x speedup in evaluation and 159x during training with only ~4% per-epoch overhead. MonoLoss increased MonoScore for most latents and consistently improved class purity across all encoder and SAE combinations (largest gain: from 0.152 to 0.723). As auxiliary regularizer for ResNet-50 and CLIP-ViT-B/32 finetuning, yielded up to 0.6% accuracy gains on ImageNet-1K.

Conclusion: MonoLoss enables efficient training of more interpretable monosemantic representations in sparse autoencoders while improving downstream task performance, bridging the gap between interpretability and practical utility in vision models.

Abstract: Sparse autoencoders (SAEs) decompose polysemantic neural representations, where neurons respond to multiple unrelated concepts, into monosemantic features that capture single, interpretable concepts. However, standard training objectives only weakly encourage this decomposition, and existing monosemanticity metrics require pairwise comparisons across all dataset samples, making them inefficient during training and evaluation. We study a recent MonoScore metric and derive a single-pass algorithm that computes exactly the same quantity, but with a cost that grows linearly, rather than quadratically, with the number of dataset images. On OpenImagesV7, we achieve up to a 1200x wall-clock speedup in evaluation and 159x during training, while adding only ~4% per-epoch overhead. This allows us to treat MonoScore as a training signal: we introduce the Monosemanticity Loss (MonoLoss), a plug-in objective that directly rewards semantically consistent activations for learning interpretable monosemantic representations. Across BatchTopK, TopK, and JumpReLU SAEs trained on CLIP, SigLIP2, and pretrained ViT features, MonoLoss increases MonoScore for most latents. MonoLoss also consistently improves class purity (the fraction of a latent’s activating images belonging to its dominant class) across all encoder and SAE combinations, with the largest gain raising baseline purity from 0.152 to 0.723. Used as an auxiliary regularizer during ResNet-50 and CLIP-ViT-B/32 finetuning, MonoLoss yields up to 0.6% accuracy gains on ImageNet-1K and monosemantic activating patterns on standard benchmark datasets. The code is publicly available at https://github.com/AtlasAnalyticsLab/MonoLoss.

[80] Prototype-driven fusion of pathology and spatial transcriptomics for interpretable survival prediction

Lihe Liu, Xiaoxi Pan, Yinyin Yuan, Lulu Shang

Main category: cs.CV

TL;DR: PathoSpatial is an interpretable multimodal framework that integrates whole slide images and spatial transcriptomics for prognostic modeling in cancer, using task-guided prototype learning and multi-level experts architecture.

DetailsMotivation: As paired WSI-ST cohorts scale to population level, there's a need for principled cross-modal fusion strategies that leverage complementary spatial signals from pathology images and molecular data for prognosis, but current methods are limited.

Method: PathoSpatial uses task-guided prototype learning within a multi-level experts architecture that adaptively orchestrates unsupervised within-modality discovery with supervised cross-modal aggregation, integrating co-registered WSIs and spatial transcriptomics.
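Task-guided prototype learning of this kind is often implemented as soft assignment of patch embeddings to a shared prototype bank followed by prototype-wise pooling per modality. The sketch below is a hypothetical minimal version; the prototype bank, temperature, and fusion-by-concatenation are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prototype_pool(patches, prototypes, tau=0.1):
    # soft-assign each patch/spot to prototypes, then aggregate patches
    # prototype-wise into one slide-level vector per prototype
    sim = patches @ prototypes.T                     # (n_items, n_protos)
    assign = softmax(sim / tau, axis=1)              # soft assignments
    pooled = assign.T @ patches                      # (n_protos, d) weighted sums
    pooled /= assign.sum(axis=0)[:, None] + 1e-12    # -> weighted means
    return pooled

rng = np.random.default_rng(7)
wsi_patches = rng.normal(size=(100, 32))  # morphology embeddings (WSI tiles)
st_spots = rng.normal(size=(80, 32))      # spatial transcriptomics spot embeddings
protos = rng.normal(size=(8, 32))         # shared prototype bank
fused = np.concatenate([prototype_pool(wsi_patches, protos),
                        prototype_pool(st_spots, protos)], axis=1)
```

Because each fused row corresponds to a named prototype, a downstream survival head over `fused` stays attributable to specific morphological/molecular patterns, which is where the interpretability claim comes from.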

Result: PathoSpatial delivers strong and consistent performance across five survival endpoints in triple-negative breast cancer, achieving superior or comparable performance to leading unimodal and multimodal methods while providing interpretable prototypes.

Conclusion: PathoSpatial serves as a proof-of-concept for scalable and interpretable multimodal learning for spatial omics-pathology fusion, enabling post-hoc prototype interpretation and molecular risk decomposition with biological explanations.

Abstract: Whole slide images (WSIs) enable weakly supervised prognostic modeling via multiple instance learning (MIL). Spatial transcriptomics (ST) preserves in situ gene expression, providing a spatial molecular context that complements morphology. As paired WSI-ST cohorts scale to population level, leveraging their complementary spatial signals for prognosis becomes crucial; however, principled cross-modal fusion strategies remain limited for this paradigm. To this end, we introduce PathoSpatial, an interpretable end-to-end framework integrating co-registered WSIs and ST to learn spatially informed prognostic representations. PathoSpatial uses task-guided prototype learning within a multi-level experts architecture, adaptively orchestrating unsupervised within-modality discovery with supervised cross-modal aggregation. By design, PathoSpatial substantially strengthens interpretability while maintaining discriminative ability. We evaluate PathoSpatial on a triple-negative breast cancer cohort with paired ST and WSIs. PathoSpatial delivers strong and consistent performance across five survival endpoints, achieving superior or comparable performance to leading unimodal and multimodal methods. PathoSpatial inherently enables post-hoc prototype interpretation and molecular risk decomposition, providing quantitative, biologically grounded explanations and highlighting candidate prognostic factors. We present PathoSpatial as a proof-of-concept for scalable and interpretable multimodal learning for spatial omics-pathology fusion.

[81] Semantic-aware Adversarial Fine-tuning for CLIP

Jiacheng Zhang, Jinhao Li, Hanxun Huang, Sarah M. Erfani, Benjamin I. P. Rubinstein, Feng Liu

Main category: cs.CV

TL;DR: SAFT improves CLIP’s adversarial robustness by fine-tuning with semantic-aware adversarial examples generated using ensemble of refined textual descriptions instead of single hand-crafted templates.

DetailsMotivation: Current adversarial fine-tuning methods for CLIP use adversarial examples generated by minimizing cosine similarity with single hand-crafted templates, which is insufficient for measuring image-text similarity and leads to less robust models.

Method: Proposes semantic-ensemble attack to generate semantic-aware adversarial examples by minimizing average similarity between original image and ensemble of refined textual descriptions generated by foundation models. Uses these for Semantic-aware Adversarial Fine-Tuning (SAFT) of CLIP’s image encoder.
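A sign-PGD version of the semantic-ensemble attack can be sketched with a toy linear encoder standing in for CLIP's image encoder. Everything here is illustrative: the encoder, step size, and epsilon-ball projection are assumptions, not the paper's settings.

```python
import numpy as np

def cos(v, t):
    return (v @ t) / (np.linalg.norm(v) * np.linalg.norm(t))

def grad_cos_wrt_v(v, t):
    # analytic gradient of cosine similarity with respect to the image embedding
    nv, nt = np.linalg.norm(v), np.linalg.norm(t)
    return t / (nv * nt) - (v @ t) * v / (nv ** 3 * nt)

def semantic_ensemble_attack(x, W, texts, eps=0.5, alpha=0.1, steps=20):
    # PGD that *minimizes* the average cosine similarity between the
    # (toy linear) image embedding W @ x and an ensemble of text embeddings
    x_adv = x.copy()
    for _ in range(steps):
        v = W @ x_adv
        g_v = np.mean([grad_cos_wrt_v(v, t) for t in texts], axis=0)
        g_x = W.T @ g_v                              # chain rule through the encoder
        x_adv = x_adv - alpha * np.sign(g_x)         # descend on the similarity
        x_adv = x + np.clip(x_adv - x, -eps, eps)    # project back into the eps-ball
    return x_adv

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 32))                          # toy image encoder
x = rng.normal(size=32)                               # clean "image"
texts = [W @ x + 0.1 * rng.normal(size=8) for _ in range(4)]  # refined-description ensemble
x_adv = semantic_ensemble_attack(x, W, texts)
sim_clean = np.mean([cos(W @ x, t) for t in texts])
sim_adv = np.mean([cos(W @ x_adv, t) for t in texts])
```

Averaging the gradient over the ensemble, instead of attacking a single template embedding, is what makes the resulting AEs "semantic-aware" in the paper's sense.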

Result: SAFT outperforms current methods, achieving substantial improvements in zero-shot adversarial robustness across 16 datasets.

Conclusion: Semantic-aware adversarial examples generated through ensemble of refined textual descriptions significantly improve CLIP’s adversarial robustness compared to traditional methods using single hand-crafted templates.

Abstract: Recent studies have shown that the CLIP model’s adversarial robustness in zero-shot classification tasks can be enhanced by adversarially fine-tuning its image encoder with adversarial examples (AEs), which are generated by minimizing the cosine similarity between images and a hand-crafted template (e.g., ‘‘A photo of a {label}’’). However, it has been shown that the cosine similarity between a single image and a single hand-crafted template is insufficient to measure the similarity for image-text pairs. Building on this, in this paper, we find that the AEs generated using cosine similarity may fail to fool CLIP when the similarity metric is replaced with semantically enriched alternatives, making the image encoder fine-tuned with these AEs less robust. To overcome this issue, we first propose a semantic-ensemble attack to generate semantic-aware AEs by minimizing the average similarity between the original image and an ensemble of refined textual descriptions. These descriptions are initially generated by a foundation model to capture core semantic features beyond hand-crafted templates and are then refined to reduce hallucinations. To this end, we propose Semantic-aware Adversarial Fine-Tuning (SAFT), which fine-tunes CLIP’s image encoder with semantic-aware AEs. Extensive experiments show that SAFT outperforms current methods, achieving substantial improvements in zero-shot adversarial robustness across 16 datasets. Our code is available at: https://github.com/tmlr-group/SAFT.

[82] A Lightweight and Explainable DenseNet-121 Framework for Grape Leaf Disease Classification

Md. Ehsanul Haque, Md. Saymon Hosen Polash, Rakib Hasan Ovi, Aminul Kader Bulbul, Md Kamrul Siam, Tamim Hasan Saykat

Main category: cs.CV

TL;DR: Proposes an optimized DenseNet-121 model for grape leaf disease classification with domain-specific preprocessing, achieving 99.27% accuracy and interpretable outputs via Grad-CAM.

DetailsMotivation: Grape diseases significantly impact production quality, but current automated methods (especially YOLO-based) are computationally costly and lack interpretability for real-world vineyard management.

Method: Uses optimized DenseNet-121 with domain-specific preprocessing to extract disease-relevant features (veins, edges, lesions). Includes transfer learning, model optimization for computational efficiency, and Grad-CAM for interpretability.
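Grad-CAM itself follows a standard recipe: the channel weights are the spatially averaged gradients of the class score with respect to the last convolutional feature maps, and the heatmap is the ReLU of the channel-weighted sum. A minimal numpy sketch (the feature maps and gradients here are random placeholders for a real backbone's activations):

```python
import numpy as np

def grad_cam(activations, gradients):
    # activations, gradients: (K, H, W) feature maps and d(score)/d(activation)
    weights = gradients.mean(axis=(1, 2))             # GAP of gradients: one weight per channel
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum over channels -> (H, W)
    cam = np.maximum(cam, 0.0)                        # ReLU: keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1] for overlaying
    return cam

rng = np.random.default_rng(3)
acts = rng.uniform(size=(4, 7, 7))    # placeholder last-layer feature maps
grads = rng.normal(size=(4, 7, 7))    # placeholder class-score gradients
heatmap = grad_cam(acts, grads)
```

The resulting heatmap is upsampled to the input resolution and overlaid on the leaf image, which is how the study checks that predictions rest on lesions and veins rather than background.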

Result: Achieves 99.27% accuracy, 99.28% F1 score, 99.71% specificity, 98.86% Kappa, with 9-second inference time. Cross-validation shows 99.12% mean accuracy. Outperforms ResNet18, VGG16, AlexNet, and SqueezeNet.

Conclusion: The framework is scalable, precise, computationally inexpensive, and interpretable for grape leaf disease detection, suitable for real-time deployment in vineyards.

Abstract: Grapes are among the most economically and culturally significant fruits on a global scale, and table grapes and wine are produced in significant quantities in Europe and Asia. The production and quality of grapes are significantly impacted by grape diseases such as Bacterial Rot, Downy Mildew, and Powdery Mildew. Consequently, the sustainable management of a vineyard necessitates the early and precise identification of these diseases. Current automated methods, particularly those based on the YOLO framework, are often computationally costly and lack interpretability, making them unsuitable for real-world scenarios. This study proposes a grape leaf disease classification framework based on an optimized DenseNet-121. Domain-specific preprocessing and dense connectivity reveal disease-relevant characteristics, including veins, edges, and lesions. An extensive comparison with baseline CNN models, including ResNet18, VGG16, AlexNet, and SqueezeNet, demonstrates that the proposed model exhibits superior performance. It achieves an accuracy of 99.27%, an F1 score of 99.28%, a specificity of 99.71%, and a Kappa of 98.86%, with an inference time of 9 seconds. The cross-validation findings show a mean accuracy of 99.12%, indicating robustness and generalizability across all classes. We also employ Grad-CAM to highlight disease-related regions, verifying that the model attends to physiologically relevant features and increasing transparency and confidence. Model optimization reduces processing requirements for real-time deployment, while transfer learning ensures consistency on smaller and unbalanced samples. An effective architecture, domain-specific preprocessing, and interpretable outputs make the proposed framework scalable, precise, and computationally inexpensive for detecting grape leaf diseases.

[83] Human-Like Coarse Object Representations in Vision Models

Andrey Gizdov, Andrea Procopio, Yichen Li, Daniel Harari, Tomer Ullman

Main category: cs.CV

TL;DR: Paper investigates whether segmentation models acquire human-like coarse volumetric object representations for physics, finding alignment follows inverse U-shape: intermediate model granularity best matches human behavior.

DetailsMotivation: Humans represent objects for intuitive physics with coarse volumetric bodies that smooth details, while segmentation models optimize pixel-accurate masks. The paper asks whether and when these models acquire human-like bodies despite different optimization objectives.

Method: Used time-to-collision behavioral paradigm with comparison pipeline and alignment metric. Varied model training time, size, and effective capacity via pruning to study emergence of human-like representations.

Result: Alignment with human behavior follows inverse U-shaped curve: small/briefly trained/pruned models under-segment into blobs; large/fully trained models over-segment with boundary wiggles; intermediate granularity best matches humans.

Conclusion: Human-like coarse bodies emerge from resource constraints rather than bespoke biases. Early checkpoints, modest architectures, and light pruning can elicit physics-efficient representations, supporting resource-rational accounts balancing recognition detail against physical affordances.

Abstract: Humans appear to represent objects for intuitive physics with coarse, volumetric ‘‘bodies’’ that smooth concavities - trading fine visual details for efficient physical predictions - yet their internal structure is largely unknown. Segmentation models, in contrast, optimize pixel-accurate masks that may misalign with such bodies. We ask whether and when these models nonetheless acquire human-like bodies. Using a time-to-collision (TTC) behavioral paradigm, we introduce a comparison pipeline and alignment metric, then vary model training time, size, and effective capacity via pruning. Across all manipulations, alignment with human behavior follows an inverse U-shaped curve: small/briefly trained/pruned models under-segment into blobs; large/fully trained models over-segment with boundary wiggles; and an intermediate ‘‘ideal body granularity’’ best matches humans. This suggests human-like coarse bodies emerge from resource constraints rather than bespoke biases, and points to simple knobs - early checkpoints, modest architectures, light pruning - for eliciting physics-efficient representations. We situate these results within resource-rational accounts balancing recognition detail against physical affordances.

[84] Insertion Network for Image Sequence Correspondence

Dingjie Su, Weixiang Hong, Benoit M. Dawant, Bennett A. Landman

Main category: cs.CV

TL;DR: A novel method for establishing correspondence between 2D image sequences using slice insertion learning with attention mechanisms, applied to slice-level content navigation in medical imaging.

DetailsMotivation: The paper addresses the need for accurate slice-level content navigation in 3D medical volumes, which is crucial for diagnostic tasks and automatic registration/segmentation pipelines. Current methods like body part regression treat slices independently without leveraging contextual sequence information.

Method: The approach trains a network to learn how to insert a slice from one sequence into the appropriate position in another sequence. It encodes contextual representations of each slice and models the insertion process using a slice-to-slice attention mechanism.
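The idea of scoring candidate insertion gaps can be sketched with a Gaussian-kernel attention standing in for the paper's learned slice-to-slice attention. The gap-scoring rule, toy embeddings, and temperature below are illustrative assumptions, not the published architecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def insertion_position(sequence, query, tau=1.0):
    # sequence: (n, d) slice embeddings in order; query: (d,) slice to place.
    # Gaussian-kernel attention: slices closer to the query in embedding
    # space receive more attention mass.
    dists = np.sum((sequence - query) ** 2, axis=1)
    attn = softmax(-dists / tau)
    n = len(sequence)
    gap_scores = np.empty(n + 1)
    gap_scores[0] = 2 * attn[0]        # endpoint gaps have one neighbor;
    gap_scores[-1] = 2 * attn[-1]      # double it for comparability
    for g in range(1, n):              # interior gap between slices g-1 and g
        gap_scores[g] = attn[g - 1] + attn[g]
    return int(np.argmax(gap_scores))

# toy sequence ordered along one anatomical axis; the query belongs
# between slices 2 and 3
sequence = np.linspace(0.0, 1.0, 6)[:, None] * np.ones((1, 8))
query = np.full(8, 0.5)
pos = insertion_position(sequence, query)
```

Scoring gaps by *both* neighbors is what lets context disambiguate the position, unlike per-slice regression, which scores each slice in isolation.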

Result: The insertion network reduces slice localization errors from 8.4 mm to 5.4 mm in supervised settings, showing substantial improvement over body part regression methods.

Conclusion: The proposed method effectively leverages contextual sequence information for slice localization, outperforming independent slice analysis approaches and demonstrating practical value for medical imaging applications.

Abstract: We propose a novel method for establishing correspondence between two sequences of 2D images. One particular application of this technique is slice-level content navigation, where the goal is to localize specific 2D slices within a 3D volume or determine the anatomical coverage of a 3D scan based on its 2D slices. This serves as an important preprocessing step for various diagnostic tasks, as well as for automatic registration and segmentation pipelines. Our approach builds sequence correspondence by training a network to learn how to insert a slice from one sequence into the appropriate position in another. This is achieved by encoding contextual representations of each slice and modeling the insertion process using a slice-to-slice attention mechanism. We apply this method to localize manually labeled key slices in body CT scans and compare its performance to the current state-of-the-art alternative known as body part regression, which predicts anatomical position scores for individual slices. Unlike body part regression, which treats each slice independently, our method leverages contextual information from the entire sequence. Experimental results show that the insertion network reduces slice localization errors in supervised settings from 8.4 mm to 5.4 mm, demonstrating a substantial improvement in accuracy.

[85] Layer-Specific Fine-Tuning for Improved Negation Handling in Medical Vision-Language Models

Ali Abbasi, Mehdi Taghipour, Rahmatollah Beheshti

Main category: cs.CV

TL;DR: A method called Negation-Aware Selective Training (NAST) improves vision-language models’ ability to handle clinical negation by using causal tracing effects to guide layer-wise gradient updates during fine-tuning.

DetailsMotivation: Vision-language models often fail to distinguish affirmative from negated medical statements, which is critical for safety in clinical reporting where negation is fundamental.

Method: Introduced a radiology-specific diagnostic benchmark and contextual clinical negation dataset. Proposed NAST method that uses causal tracing effects to modulate layer-wise gradient updates during fine-tuning, scaling each layer’s update according to its causal contribution to negation processing.
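The CTE-guided update rule can be sketched as a per-layer multiplier on a plain SGD step. The normalization of causal tracing effects to a [floor, 1] range below is a hypothetical choice for illustration, not necessarily the paper's exact scaling.

```python
import numpy as np

def nast_layer_scales(ctes, floor=0.1):
    # map per-layer causal tracing effects to update multipliers in [floor, 1]
    ctes = np.asarray(ctes, dtype=float)
    norm = (ctes - ctes.min()) / (ctes.max() - ctes.min() + 1e-12)
    return floor + (1.0 - floor) * norm

def nast_sgd_step(params, grads, ctes, lr=0.01):
    # scale each layer's gradient step by its causal contribution to
    # negation processing, instead of a uniform learning rate
    scales = nast_layer_scales(ctes)
    return [p - lr * s * g for p, g, s in zip(params, grads, scales)]

rng = np.random.default_rng(4)
params = [rng.normal(size=(3, 3)) for _ in range(4)]   # four toy "layers"
grads = [np.ones((3, 3)) for _ in range(4)]
ctes = [0.05, 0.9, 0.4, 0.1]   # layer 1 most causally involved in negation
new_params = nast_sgd_step(params, grads, ctes)
```

Layers with high CTE move the most, so fine-tuning concentrates where negation is actually processed and leaves the rest of the model, and hence general vision-language alignment, largely intact.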

Result: Experiments show improved discrimination of affirmative and negated clinical statements without degrading general vision-language alignment.

Conclusion: Causal interpretability can be valuable for targeted model adaptation in safety-critical medical settings, enabling better handling of negation in vision-language models.

Abstract: Negation is a fundamental linguistic operation in clinical reporting, yet vision-language models (VLMs) frequently fail to distinguish affirmative from negated medical statements. To systematically characterize this limitation, we introduce a radiology-specific diagnostic benchmark that evaluates polarity sensitivity under controlled clinical conditions, revealing that common medical VLMs consistently confuse negated and non-negated findings. To enable learning beyond simple condition absence, we further construct a contextual clinical negation dataset that encodes structured claims and supports attribute-level negations involving location and severity. Building on these resources, we propose Negation-Aware Selective Training (NAST), an interpretability-guided adaptation method that uses causal tracing effects (CTEs) to modulate layer-wise gradient updates during fine-tuning. Rather than applying uniform learning rates, NAST scales each layer’s update according to its causal contribution to negation processing, transforming mechanistic interpretability signals into a principled optimization rule. Experiments demonstrate improved discrimination of affirmative and negated clinical statements without degrading general vision-language alignment, highlighting the value of causal interpretability for targeted model adaptation in safety-critical medical settings. Code and resources are available at https://github.com/healthylaife/NAST.

[86] Matching of SAR and optical images based on transformation to shared modality

Alexey Borisov, Evgeny Myasnikov, Vladislav Myasnikov

Main category: cs.CV

TL;DR: A novel approach for optical-SAR image matching by transforming both to a shared modality, enabling use of pre-trained RoMa matching models without modality-specific retraining.

DetailsMotivation: Optical and SAR images have fundamental physical differences that make precise co-registration challenging. Existing methods struggle with modality differences, requiring specialized solutions.

Method: Transform both optical and SAR images to a new shared modality with equal channels, similarity preservation, and feature retention. Then use pre-trained RoMa image matching model (designed for regular photos) without retraining.

Result: Superior performance over alternative approaches (image translation and feature matching) on MultiSenGE dataset. More versatile solution that maintains high-quality matching.

Conclusion: The proposed modality transformation approach enables effective optical-SAR image matching using existing pre-trained models, overcoming modality differences while preserving image features.

Abstract: Significant differences in optical images and Synthetic Aperture Radar (SAR) images are caused by fundamental differences in the physical principles underlying their acquisition by Earth remote sensing platforms. These differences make precise image matching (co-registration) of these two types of images difficult. In this paper, we propose a new approach to image matching of optical and SAR images, which is based on transforming the images to a new modality. The new image modality is common to both optical and SAR images and satisfies the following conditions. First, the transformed images must have an equal pre-defined number of channels. Second, the transformed and co-registered images must be as similar as possible. Third, the transformed images must be non-degenerate, meaning they must preserve the significant features of the original images. To further match images transformed to this shared modality, we train the RoMa image matching model, which is one of the leading solutions for matching of regular digital photographs. We evaluated the proposed approach on the publicly available MultiSenGE dataset containing both optical and SAR images. We demonstrated its superiority over alternative approaches based on image translation between original modalities and various feature matching algorithms. The proposed solution not only provides better quality of matching, but is also more versatile. It enables the use of ready-made RoMa and DeDoDe models, pre-trained for regular images, without retraining for a new modality, while maintaining high-quality matching of optical and SAR images.

[87] LiDAR-Anchored Collaborative Distillation for Robust 2D Representations

Wonjun Jo, Hyunwoo Ha, Kim Ji-Yeon, Hawook Jeong, Tae-Hyun Oh

Main category: cs.CV

TL;DR: Self-supervised collaborative distillation method using 3D LiDAR to improve 2D image encoder robustness for adverse weather conditions while maintaining original capabilities and enhancing 3D awareness.

DetailsMotivation: Pre-trained 2D image encoders fail under noisy and adverse weather conditions beyond clear daytime scenes, limiting robust visual perception needed for real-world applications.

Method: Proposes Collaborative Distillation, a self-supervised approach that leverages 3D LiDAR as supervision to improve 2D image encoder robustness to adverse conditions while retaining original capabilities.

Result: Outperforms competing methods in various downstream tasks across diverse conditions, exhibits strong generalization ability, and improves 3D awareness from LiDAR characteristics.

Conclusion: The method demonstrates practicality and adaptability for real-world scenarios by enhancing 2D image encoder robustness to adverse conditions using 3D LiDAR supervision.

Abstract: As deep learning continues to advance, self-supervised learning has made considerable strides. It allows 2D image encoders to extract useful features for various downstream tasks, including those related to vision-based systems. Nevertheless, pre-trained 2D image encoders fall short under noisy and adverse weather conditions beyond clear daytime scenes, which require robust visual perception. To address these issues, we propose a novel self-supervised approach, \textbf{Collaborative Distillation}, which leverages 3D LiDAR as self-supervision to improve robustness to noisy and adverse weather conditions in 2D image encoders while retaining their original capabilities. Our method outperforms competing methods in various downstream tasks across diverse conditions and exhibits strong generalization ability. In addition, our method also improves 3D awareness stemming from LiDAR’s characteristics. This advancement highlights our method’s practicality and adaptability in real-world scenarios.

[88] Geometric Stratification for Singular Configurations of the P3P Problem via Local Dual Space

Xueying Sun, Zijia Li, Nan Li

Main category: cs.CV

TL;DR: Complete geometric stratification of P3P problem singular configurations using algebraic-computational framework based on camera center multiplicity and danger cylinder geometry.

DetailsMotivation: The P3P (Perspective-Three-Point) problem in computer vision has singular configurations where multiple solutions exist. Understanding these singularities is crucial for robust 3D reconstruction and camera pose estimation algorithms.

Method: Uses local dual space and systematic algebraic-computational framework to analyze singular configurations. Classifies based on multiplicity μ of camera center O, relating configurations to geometric structures like danger cylinder, Morley triangle, and circumcircle.

Result: Complete geometric stratification: for μ≥2, O lies on danger cylinder; for μ≥3, O lies on three generatrices associated with Morley triangle or circumcircle; for μ≥4, O lies on circumcircle (infinite solutions). Also characterizes complementary configuration O’ on deltoidal surfaces and cuspidal curves.

Conclusion: Provides comprehensive understanding of P3P singularities through geometric stratification, offering insights for algorithm design and robustness in computer vision applications.

Abstract: This paper investigates singular configurations of the P3P problem. Using local dual space, a systematic algebraic-computational framework is proposed to give a complete geometric stratification for the P3P singular configurations with respect to the multiplicity $\mu$ of the camera center $O$: for $\mu \ge 2$, $O$ lies on the ‘‘danger cylinder’’; for $\mu \ge 3$, $O$ lies on one of three generatrices of the danger cylinder associated with the first Morley triangle or the circumcircle; and for $\mu \ge 4$, $O$ lies on the circumcircle, which indeed corresponds to infinitely many P3P solutions. Furthermore, a geometric stratification for the complementary configuration $O^\prime$ associated with a singular configuration $O$ is studied as well: for $\mu \ge 2$, $O^\prime$ lies on a deltoidal surface associated with the danger cylinder, and for $\mu \ge 3$, $O^\prime$ lies on one of three cuspidal curves of the deltoidal surface.

[89] Hallucinating 360°: Panoramic Street-View Generation via Local Scenes Diffusion and Probabilistic Prompting

Fei Teng, Kai Luo, Sheng Wu, Siyu Li, Pujun Guo, Jiale Wei, Jiaming Zhang, Kunyu Peng, Kailun Yang

Main category: cs.CV

TL;DR: Percep360: First panoramic generation method for autonomous driving that creates coherent 360° images with control signals using stitched pinhole images as supervision.

DetailsMotivation: Panoramic perception is crucial for autonomous driving but data acquisition is complex and expensive. Existing generation models can't leverage stitched pinhole images as supervision and are limited to fixed dataset distributions.

Method: Proposes Local Scenes Diffusion Method (LSDM) for spatially continuous diffusion to bridge data distribution gaps, and Probabilistic Prompting Method (PPM) for dynamic control cue selection for controllable generation.

Result: Generated images outperform original stitched images in no-reference quality metrics and enhance downstream BEV segmentation models, demonstrating practical utility for perception tasks.

Conclusion: Percep360 successfully enables coherent and controllable panoramic generation for autonomous driving, addressing data scarcity issues and improving downstream perception performance.

Abstract: Panoramic perception holds significant potential for autonomous driving, enabling vehicles to acquire a comprehensive 360° surround view in a single shot. However, autonomous driving is a data-driven task. Complete panoramic data acquisition requires complex sampling systems and annotation pipelines, which are time-consuming and labor-intensive. Although existing street view generation models have demonstrated strong data regeneration capabilities, they can only learn from the fixed data distribution of existing datasets and cannot leverage stitched pinhole images as a supervisory signal. In this paper, we propose the first panoramic generation method Percep360 for autonomous driving. Percep360 enables coherent generation of panoramic data with control signals based on the stitched panoramic data. Percep360 focuses on two key aspects: coherence and controllability. Specifically, to overcome the inherent information loss caused by the pinhole sampling process, we propose the Local Scenes Diffusion Method (LSDM). LSDM reformulates the panorama generation as a spatially continuous diffusion process, bridging the gaps between different data distributions. Additionally, to achieve the controllable generation of panoramic images, we propose a Probabilistic Prompting Method (PPM). PPM dynamically selects the most relevant control cues, enabling controllable panoramic image generation. We evaluate the effectiveness of the generated images from three perspectives: image quality assessment (i.e., no-reference and with reference), controllability, and their utility in real-world Bird’s Eye View (BEV) segmentation. Notably, the generated data consistently outperforms the original stitched images in no-reference quality metrics and enhances downstream perception models. The source code will be publicly available at https://github.com/FeiT-FeiTeng/Percep360.

[90] Self-Supervised JEPA-based World Models for LiDAR Occupancy Completion and Forecasting

Haoran Zhu, Anna Choromanska

Main category: cs.CV

TL;DR: AD-LiST-JEPA: A self-supervised world model for autonomous driving that predicts future spatiotemporal evolution from LiDAR data using Joint-Embedding Predictive Architecture (JEPA).

DetailsMotivation: Autonomous driving requires world models that capture environmental evolution for long-term planning, and scalability demands self-supervised learning without expensive human annotations.

Method: Proposes AD-LiST-JEPA using JEPA framework to learn world models from LiDAR data in self-supervised manner, predicting future spatiotemporal evolution.
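The JEPA training signal can be sketched with toy linear encoders: an online encoder plus a predictor is trained to match the latent produced by a slowly updated (EMA) target encoder on a future observation. All shapes, the identity predictor, and the momentum value below are illustrative, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(5)

# toy linear encoders: online (trained) and target (EMA copy, no gradients)
W_online = rng.normal(size=(8, 16))
W_target = W_online.copy()
W_pred = np.eye(8)   # predictor from context latent to target latent

def jepa_loss(context, future, W_online, W_target, W_pred):
    # predict the *latent* of the future observation from the context latent;
    # the target branch is held fixed (stop-gradient / EMA in real JEPA)
    z_ctx = W_online @ context
    z_tgt = W_target @ future      # no gradient flows through this branch
    z_hat = W_pred @ z_ctx
    return np.mean((z_hat - z_tgt) ** 2)

def ema_update(W_target, W_online, m=0.99):
    # target encoder slowly tracks the online encoder
    return m * W_target + (1 - m) * W_online

context = rng.normal(size=16)
future = context + 0.05 * rng.normal(size=16)   # temporally adjacent observation
loss = jepa_loss(context, future, W_online, W_target, W_pred)
W_target = ema_update(W_target, W_online)
```

Predicting in latent space rather than reconstructing raw LiDAR returns is what lets the world model ignore unpredictable sensor noise while still learning spatiotemporal structure.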

Result: Proof-of-concept experiments show better occupancy completion and forecasting (OCF) performance with pretrained encoder after JEPA-based world model learning.

Conclusion: The approach enables self-supervised learning of world models for autonomous driving, demonstrating improved downstream task performance through JEPA-based representation learning.

Abstract: Autonomous driving, as an agent operating in the physical world, requires the fundamental capability to build \textit{world models} that capture how the environment evolves spatiotemporally in order to support long-term planning. At the same time, scalability demands learning such models in a self-supervised manner; \textit{joint-embedding predictive architecture (JEPA)} enables learning world models via leveraging large volumes of unlabeled data without relying on expensive human annotations. In this paper, we propose \textbf{AD-LiST-JEPA}, a self-supervised world model for autonomous driving that predicts future spatiotemporal evolution from LiDAR data using a JEPA framework. We evaluate the quality of the learned representations through a downstream LiDAR-based occupancy completion and forecasting (OCF) task, which jointly assesses perception and prediction. Proof-of-concept experiments show better OCF performance with the pretrained encoder after JEPA-based world model learning.

[91] PLLM: Pseudo-Labeling Large Language Models for CAD Program Synthesis

Yuanbo Li, Dule Shu, Yanying Chen, Matt Klenk, Daniel Ritchie

Main category: cs.CV

TL;DR: PLLM: A self-training framework for CAD program synthesis from unlabeled 3D shapes using iterative program sampling, selection, and augmentation to create synthetic training data.

DetailsMotivation: Existing CAD program synthesis methods require supervised training with paired shape-program data, which is often unavailable. The paper aims to overcome this limitation by enabling CAD program synthesis from unlabeled 3D shapes.

Method: PLLM uses a pre-trained CAD-capable LLM and iteratively: 1) samples candidate programs, 2) selects high-fidelity executions, and 3) augments programs to construct synthetic program-shape pairs for fine-tuning, creating a self-training loop.

Result: Experiments adapting CAD-Recode from DeepCAD to the unlabeled ABC dataset show consistent improvements in both geometric fidelity and program diversity compared to baseline methods.

Conclusion: PLLM demonstrates that self-training can effectively adapt CAD-capable LLMs to new datasets without requiring labeled shape-program pairs, enabling more flexible CAD program synthesis from 3D geometries.

Abstract: Recovering Computer-Aided Design (CAD) programs from 3D geometries is a widely studied problem. Recent advances in large language models (LLMs) have enabled progress in CAD program synthesis, but existing methods rely on supervised training with paired shape-program data, which is often unavailable. We introduce PLLM, a self-training framework for CAD program synthesis from unlabeled 3D shapes. Given a pre-trained CAD-capable LLM and a shape dataset, PLLM iteratively samples candidate programs, selects high-fidelity executions, and augments programs to construct synthetic program-shape pairs for fine-tuning. Experiments adapting CAD-Recode from DeepCAD to the unlabeled ABC dataset show consistent improvements in geometric fidelity and program diversity.
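The sample-select-augment loop can be sketched as a single pseudo-labeling round. Everything below is a stand-in sketch: `sample_programs`, `execute`, `fidelity`, and `augment` are hypothetical callables standing in for the LLM sampler, the CAD executor, the geometric-fidelity metric, and the program augmenter; the threshold is illustrative:

```python
def pllm_round(shapes, sample_programs, execute, fidelity, augment,
               threshold=0.05, k=8):
    """One PLLM self-training round (sketch): for each unlabeled shape,
    sample k candidate CAD programs, keep those whose executed geometry
    is close enough to the target, and augment the keepers into
    additional synthetic program-shape pairs for fine-tuning."""
    pairs = []
    for shape in shapes:
        for prog in sample_programs(shape, k):
            # high-fidelity selection: keep only near-matching executions
            if fidelity(execute(prog), shape) <= threshold:
                pairs.append((prog, shape))
                # augmented programs are re-executed to get paired shapes
                pairs.extend((aug, execute(aug)) for aug in augment(prog))
    return pairs
```

Repeating this round after fine-tuning on `pairs` closes the self-training loop the abstract describes.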

[92] Robust and Real-Time Bangladeshi Currency Recognition: A Dual-Stream MobileNet and EfficientNet Approach

Subreena, Mohammad Amzad Hossain, Mirza Raquib, Saydul Akbar Murad, Farida Siddiqi Prity, Muhammad Hanif, Nick Rahimi

Main category: cs.CV

TL;DR: A hybrid CNN architecture combining MobileNetV3-Large and EfficientNetB0 with MLP classifier for Bangladeshi banknote recognition, achieving high accuracy on controlled and real-world datasets with explainable AI integration.

DetailsMotivation: To develop an accurate currency recognition system for visually impaired individuals to prevent fraud and exploitation, addressing the lack of robust Bangladeshi banknote datasets and limitations of current recognition models.

Method: Built a new Bangladeshi banknote dataset with controlled/real-world scenarios, combined with four additional public datasets. Proposed hybrid CNN architecture using MobileNetV3-Large and EfficientNetB0 for feature extraction, followed by MLP classifier for efficient computation.

Result: Achieved 97.95% accuracy on controlled datasets, 92.84% on complex backgrounds, and 94.98% overall accuracy. Thorough evaluation using five-fold cross-validation and seven metrics (accuracy, precision, recall, F1-score, Cohen’s Kappa, MCC, AUC) with explainable AI integration.

Conclusion: The proposed hybrid CNN model provides an effective, computationally efficient solution for banknote recognition suitable for resource-constrained devices, with enhanced transparency through explainable AI methods like LIME and SHAP.

Abstract: Accurate currency recognition is essential for assistive technologies, particularly for visually impaired individuals who rely on others to identify banknotes. This dependency puts them at risk of fraud and exploitation. To address these challenges, we first build a new Bangladeshi banknote dataset that includes both controlled and real-world scenarios, ensuring a more comprehensive and diverse representation. Next, to enhance the dataset’s robustness, we incorporate four additional datasets, including public benchmarks, to cover various complexities and improve the model’s generalization. To overcome the limitations of current recognition models, we propose a novel hybrid CNN architecture that combines MobileNetV3-Large and EfficientNetB0 for efficient feature extraction. This is followed by an effective multilayer perceptron (MLP) classifier to improve performance while keeping computational costs low, making the system suitable for resource-constrained devices. The experimental results show that the proposed model achieves 97.95% accuracy on controlled datasets, 92.84% on complex backgrounds, and 94.98% accuracy when combining all datasets. The model’s performance is thoroughly evaluated using five-fold cross-validation and seven metrics: accuracy, precision, recall, F1-score, Cohen’s Kappa, MCC, and AUC. Additionally, explainable AI methods like LIME and SHAP are incorporated to enhance transparency and interpretability.
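The dual-stream head amounts to feature-level fusion of the two backbone embeddings followed by a small MLP. A minimal pure-Python sketch of that forward pass; the weight/bias arguments are illustrative stand-ins for trained parameters, not the paper's actual layer sizes:

```python
def fuse_and_classify(mnet_feats, enet_feats, weights, biases):
    """Sketch of the hybrid head: concatenate MobileNetV3-Large and
    EfficientNetB0 embeddings, then run a one-hidden-layer MLP and
    return the argmax class (the predicted banknote denomination)."""
    x = list(mnet_feats) + list(enet_feats)      # feature-level fusion
    (w1, w2), (b1, b2) = weights, biases
    # ReLU hidden layer
    h = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    # linear output layer
    logits = [sum(wi * hi for wi, hi in zip(row, h)) + b
              for row, b in zip(w2, b2)]
    return logits.index(max(logits))
```

Keeping the classifier to a plain MLP over frozen-style fused features is what keeps inference cheap enough for the resource-constrained devices the paper targets.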

[93] The Constant Eye: Benchmarking and Bridging Appearance Robustness in Autonomous Driving

Jiabao Wang, Hongyu Zhou, Yuanbo Yang, Jiahao Shao, Yiyi Liao

Main category: cs.CV

TL;DR: navdream benchmark isolates appearance vs. structure effects in autonomous driving using generative style transfer, showing planners degrade under OOD appearance; solution uses DINOv3 features for appearance-invariant planning across paradigms.

DetailsMotivation: Current autonomous driving research fails to distinguish between appearance-based shifts (weather, lighting) and structural scene changes, making it unclear whether planner failures stem from complex road geometry or simply visual appearance changes.

Method: Created navdream benchmark using generative pixel-aligned style transfer to create visual stress tests with negligible geometric deviation; proposed universal perception interface using frozen DINOv3 visual foundation model to extract appearance-invariant features as stable interface for planners.

Result: Existing planning algorithms show significant degradation under OOD appearance conditions even when scene structure remains consistent; DINOv3-based solution achieves exceptional zero-shot generalization across diverse planning paradigms (regression, diffusion, scoring) without fine-tuning.

Conclusion: Isolating appearance effects reveals critical fragility in current planners; using visual foundation models for appearance-invariant features provides plug-and-play solution for robust generalization across appearance shifts.

Abstract: Despite rapid progress, autonomous driving algorithms remain notoriously fragile under Out-of-Distribution (OOD) conditions. We identify a critical decoupling failure in current research: the lack of distinction between appearance-based shifts, such as weather and lighting, and structural scene changes. This leaves a fundamental question unanswered: Is the planner failing because of complex road geometry, or simply because it is raining? To resolve this, we establish navdream, a high-fidelity robustness benchmark leveraging generative pixel-aligned style transfer. By creating a visual stress test with negligible geometric deviation, we isolate the impact of appearance on driving performance. Our evaluation reveals that existing planning algorithms often show significant degradation under OOD appearance conditions, even when the underlying scene structure remains consistent. To bridge this gap, we propose a universal perception interface leveraging a frozen visual foundation model (DINOv3). By extracting appearance-invariant features as a stable interface for the planner, we achieve exceptional zero-shot generalization across diverse planning paradigms, including regression-based, diffusion-based, and scoring-based models. Our plug-and-play solution maintains consistent performance across extreme appearance shifts without requiring further fine-tuning. The benchmark and code will be made available.
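The benchmark's core measurement, comparing a planner's output on a scene and on its pixel-aligned restyled twin with identical geometry, can be sketched directly. The `plan` and `encode` callables are hypothetical stand-ins for a planner and a (possibly frozen DINOv3) feature extractor, and mean absolute plan drift is an illustrative metric, not the paper's exact score:

```python
def appearance_drift(plan, encode, scenes, styled_scenes):
    """Sketch of the navdream protocol: run the same planner on each
    original scene and its style-transferred counterpart (same geometry,
    different appearance) and report the mean change in the plan."""
    drifts = []
    for orig, ood in zip(scenes, styled_scenes):
        p0, p1 = plan(encode(orig)), plan(encode(ood))
        drifts.append(sum(abs(a - b) for a, b in zip(p0, p1)) / len(p0))
    return sum(drifts) / len(drifts)
```

An appearance-invariant feature interface should drive this drift toward zero, which is the behavior the DINOv3-based solution is evaluated on.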

[94] Language-Guided Invariance Probing of Vision-Language Models

Jae Joong Lee

Main category: cs.CV

TL;DR: LGIP benchmark evaluates VLMs’ linguistic robustness by testing invariance to paraphrases and sensitivity to semantic flips in image-text matching.

DetailsMotivation: Current vision-language models show strong zero-shot performance but lack evaluation of their reliability to controlled linguistic perturbations. The authors aim to create a diagnostic benchmark to measure linguistic robustness beyond standard retrieval metrics.

Method: Introduces Language-Guided Invariance Probing (LGIP) benchmark using 40k MS COCO images with five human captions each. Automatically generates paraphrases and rule-based semantic flips (object category, color, or count changes). Evaluates nine VLMs using invariance error, semantic sensitivity gap, and positive-rate statistics.

Result: EVA02-CLIP and large OpenCLIP variants show favorable invariance-sensitivity balance (low paraphrase variance, higher scores for original vs. flipped captions). SigLIP and SigLIP2 exhibit large invariance errors and often prefer flipped captions over human descriptions, especially for object and color edits.

Conclusion: LGIP provides a model-agnostic diagnostic for linguistic robustness of VLMs, revealing failures invisible to standard retrieval metrics. Different VLMs show varying levels of linguistic sensitivity and invariance.

Abstract: Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, we automatically generate paraphrases and rule-based flips that alter object category, color or count, and summarize model behavior with an invariance error, a semantic sensitivity gap and a positive-rate statistic. Across nine VLMs, EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than for their flipped counterparts. In contrast, SigLIP and SigLIP2 show much larger invariance error and often prefer flipped captions to the human descriptions, especially for object and color edits. These failures are largely invisible to standard retrieval metrics, indicating that LGIP provides a model-agnostic diagnostic for the linguistic robustness of VLMs beyond conventional accuracy scores.
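The three summary statistics can be computed from per-image matching scores. The definitions below are a plausible instantiation consistent with the abstract (invariance error, semantic sensitivity gap, positive rate), not necessarily the paper's exact formulas:

```python
def lgip_metrics(s_orig, s_para, s_flip):
    """Sketch of LGIP statistics. Per image: s_orig is the score of the
    original caption, s_para the scores of its paraphrases, s_flip the
    scores of its semantic flips (object/color/count edits)."""
    n = len(s_orig)
    # invariance error: mean absolute score change under paraphrase
    inv_err = (sum(abs(p - o) for o, ps in zip(s_orig, s_para) for p in ps)
               / sum(len(ps) for ps in s_para))
    # semantic sensitivity gap: original should outscore the best flip
    gap = sum(o - max(fs) for o, fs in zip(s_orig, s_flip)) / n
    # positive rate: fraction of images where the original caption wins
    pos = sum(o > max(fs) for o, fs in zip(s_orig, s_flip)) / n
    return inv_err, gap, pos
```

A robust model has low `inv_err` together with positive `gap` and high `pos`; the SigLIP failures reported above correspond to `pos` dropping below chance on object and color edits.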

[95] Unbiased Gradient Estimation for Event Binning via Functional Backpropagation

Jinze Chen, Wei Zhai, Han Han, Tiankai Ma, Yang Cao, Bin Li, Zheng-Jun Zha

Main category: cs.CV

TL;DR: Proposes a novel framework for unbiased gradient estimation in event-based vision by synthesizing weak derivatives during backpropagation while keeping forward output unchanged, enabling better learning from raw events.

DetailsMotivation: Event-based vision algorithms face limitations: binning events into frames truncates gradients, while direct learning from raw events suffers from biased gradient estimation due to discontinuities in binning operations, limiting learning efficiency.

Method: Uses integration by parts to lift target functions to functionals, yielding an integral form of the derivative of binning function during backpropagation. Reconstructs cotangent function from sampled cotangent vector to compute weak derivatives that match long-range finite differences of both smooth and non-smooth targets.

Result: Improves optimization-based egomotion estimation with 3.2% lower RMS error and 1.57× faster convergence. Achieves 9.4% lower EPE in self-supervised optical flow and 5.1% lower RMS error in SLAM, demonstrating broad benefits for event-based visual perception.

Conclusion: The proposed framework enables unbiased gradient estimation for arbitrary binning functions in event-based vision, overcoming limitations of both frame-based and raw event learning approaches, with significant improvements across multiple vision tasks.

Abstract: Event-based vision encodes dynamic scenes as asynchronous spatio-temporal spikes called events. To leverage conventional image processing pipelines, events are typically binned into frames. However, binning functions are discontinuous, which truncates gradients at the frame level and forces most event-based algorithms to rely solely on frame-based features. Attempts to directly learn from raw events avoid this restriction but instead suffer from biased gradient estimation due to the discontinuities of the binning operation, ultimately limiting their learning efficiency. To address this challenge, we propose a novel framework for unbiased gradient estimation of arbitrary binning functions by synthesizing weak derivatives during backpropagation while keeping the forward output unchanged. The key idea is to exploit integration by parts: lifting the target functions to functionals yields an integral form of the derivative of the binning function during backpropagation, where the cotangent function naturally arises. By reconstructing this cotangent function from the sampled cotangent vector, we compute weak derivatives that provably match long-range finite differences of both smooth and non-smooth targets. Experimentally, our method improves simple optimization-based egomotion estimation with 3.2% lower RMS error and 1.57$\times$ faster convergence. On complex downstream tasks, we achieve 9.4% lower EPE in self-supervised optical flow, and 5.1% lower RMS error in SLAM, demonstrating broad benefits for event-based visual perception. Source code can be found at https://github.com/chjz1024/EventFBP.
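The integration-by-parts idea can be illustrated on a one-dimensional bin (an illustrative special case, not the paper's general derivation). A bin indicator has zero derivative almost everywhere, so naive backpropagation truncates the gradient; but lifting the target \(f\) to the functional

\[
F(\theta) = \int f(t)\,\mathbf{1}[a(\theta) \le t < b(\theta)]\,dt
         = \int_{a(\theta)}^{b(\theta)} f(t)\,dt,
\qquad
\frac{dF}{d\theta} = f(b(\theta))\,b'(\theta) - f(a(\theta))\,a'(\theta),
\]

by the Leibniz rule gives a finite, well-defined derivative despite the discontinuity of the indicator. The framework generalizes this: the weak derivative is computed in integral form during backpropagation, with the cotangent function reconstructed from the sampled cotangent vector, while the discontinuous forward binning is left untouched.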

[96] QuEPT: Quantized Elastic Precision Transformers with One-Shot Calibration for Multi-Bit Switching

Ke Xu, Yixin Wang, Zhongcheng Li, Hao Cui, Jinshui Hu, Xingyi Zhang

Main category: cs.CV

TL;DR: QuEPT is an efficient post-training quantization scheme for Transformers that enables multi-bit deployment via one-shot calibration, supporting dynamic switching between uniform and mixed precision quantization using cascaded low-rank adapters.

DetailsMotivation: Current elastic precision quantization methods for Transformers suffer from high storage and optimization costs, limiting their application to large language models. There's a need for efficient post-training quantization that can adapt to various bit-widths without repeated optimization.

Method: QuEPT uses one-shot calibration on small data to reconstruct block-wise multi-bit errors. It employs cascaded low-rank adapters for dynamic bit-width adaptation, Multi-Bit Token Merging (MB-ToMe) to fuse token features across bit-widths, and Multi-Bit Cascaded Low-Rank adapters (MB-CLoRA) to strengthen correlations between bit-width groups.

Result: Extensive experiments show QuEPT achieves performance comparable to or better than state-of-the-art post-training quantization methods, while supporting real-time switching between quantization modes without repeated optimization.

Conclusion: QuEPT provides an efficient post-training quantization solution for Transformers that enables flexible multi-bit deployment with minimal optimization overhead, making it practical for large language model deployment.

Abstract: Elastic precision quantization enables multi-bit deployment via a single optimization pass, fitting diverse quantization scenarios. Yet, owing to the high storage and optimization costs associated with the Transformer architecture, research on elastic quantization remains limited, particularly for large language models. This paper proposes QuEPT, an efficient post-training scheme that reconstructs block-wise multi-bit errors with one-shot calibration on a small data slice. It can dynamically adapt to various predefined bit-widths by cascading different low-rank adapters, and supports real-time switching between uniform quantization and mixed-precision quantization without repeated optimization. To enhance accuracy and robustness, we introduce Multi-Bit Token Merging (MB-ToMe) to dynamically fuse token features across different bit-widths, improving robustness during bit-width switching. Additionally, we propose Multi-Bit Cascaded Low-Rank adapters (MB-CLoRA) to strengthen correlations between bit-width groups, further improving the overall performance of QuEPT. Extensive experiments demonstrate that QuEPT achieves performance comparable to or better than existing state-of-the-art post-training quantization methods. Our code is available at https://github.com/xuke225/QuEPT
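One plausible reading of "cascading different low-rank adapters" is that the effective weight at a given bit-width is the quantized base plus the accumulated adapter corrections for all supported bit-widths up to it, so switching precision only toggles which adapters are summed in. This is an assumption about the mechanism, sketched in pure Python with the low-rank product already materialized as a small matrix:

```python
def cascaded_adapter_weight(w_quant, adapters, bit):
    """Sketch of multi-bit switching: start from the quantized base weight
    and add the cascaded corrections for every bit-width <= `bit`.
    `adapters` maps bit-width -> correction matrix (a stand-in for the
    materialized low-rank product A @ B of that bit-width's adapter)."""
    w = [row[:] for row in w_quant]          # copy the quantized base
    for b in sorted(adapters):
        if b > bit:
            break                            # higher-precision adapters unused
        for i, row in enumerate(adapters[b]):
            for j, v in enumerate(row):
                w[i][j] += v
    return w
```

Under this reading, no re-optimization is needed at switch time: each target bit-width just activates a different prefix of the adapter cascade.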

[97] Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models

Omer Faruk Deniz, Ruiyu Mao, Ruochen Li, Yapeng Tian, Latifur Khan

Main category: cs.CV

TL;DR: ADSC is a novel compression method for MLLMs that uses the LLM’s own attention mechanism to progressively reduce vision tokens through uniform downsampling at selected layers, achieving significant computational savings while maintaining performance.

DetailsMotivation: MLLMs suffer from high computational costs due to processing numerous vision tokens through all LLM layers. Existing pruning methods either operate before the LLM (limiting generality) or use heuristics incompatible with FlashAttention. The authors propose using the LLM itself as the optimal guide for compression.

Method: ADSC (Attention-Driven Self-Compression) progressively reduces vision tokens using only the LLM’s attention mechanism. It applies uniform token downsampling at selected layers, creating bottlenecks that encourage the model to reorganize and compress information into remaining tokens. The method requires no score computation, auxiliary modules, or attention modification, and remains fully compatible with FlashAttention.

Result: Applied to LLaVA-1.5, ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7%, while preserving 98.2% of original model performance. It outperforms prior pruning approaches across multiple benchmarks in both efficiency and accuracy, and remains robust under high compression ratios where heuristic-based techniques degrade sharply.

Conclusion: ADSC provides a simple, broadly applicable compression method for MLLMs that leverages the LLM’s own attention mechanism for efficient vision token reduction, achieving significant computational savings while maintaining performance better than prior approaches.

Abstract: Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, limiting generality due to diverse encoder-projector designs or within the LLM using heuristics that are incompatible with FlashAttention. We take a different approach: rather than identifying unimportant tokens, we treat the LLM itself as the optimal guide for compression. Observing that deeper layers naturally transmit vision-to-text information, we introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM’s attention mechanism. Our method applies uniform token downsampling at selected layers, forming bottlenecks that encourage the model to reorganize and compress information into the remaining tokens. It requires no score computation, auxiliary modules, or attention modification, and remains fully compatible with FlashAttention. Applied to LLaVA-1.5, ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7%, while preserving 98.2% of the original model performance. Across multiple benchmarks, it outperforms prior pruning approaches in both efficiency and accuracy. Crucially, under high compression ratios, our method remains robust while heuristic-based techniques degrade sharply.
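The downsampling step itself is deliberately simple, which is why it stays FlashAttention-compatible: at the selected layers, vision tokens are thinned uniformly (no importance scores), while text tokens pass through untouched. A sketch, assuming the common MLLM layout where vision tokens precede text tokens in the sequence; the stride is illustrative:

```python
def adsc_downsample(tokens, num_vision, reduce_layers, layer, keep_every=2):
    """Sketch of ADSC's per-layer reduction: at each selected layer,
    keep every `keep_every`-th vision token and all text tokens.
    Returns the reduced sequence and the updated vision-token count."""
    if layer not in reduce_layers:
        return tokens, num_vision            # non-bottleneck layer: no-op
    vision, text = tokens[:num_vision], tokens[num_vision:]
    kept = vision[::keep_every]              # uniform, score-free selection
    return kept + text, len(kept)
```

Because no attention scores are read back, the method needs no modification to the attention kernel; the model is simply trained to pack information into the surviving tokens at each bottleneck.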

[98] ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models

Peijie Qiu, Hariharan Ramshankar, Arnau Ramisa, René Vidal, Amit Kumar K C, Vamsi Salaka, Rahul Bhagat

Main category: cs.CV

TL;DR: ImageRAGTurbo improves few-step diffusion models for text-to-image generation using retrieval augmentation and trainable adapters to maintain quality while reducing sampling steps.

DetailsMotivation: Diffusion models have high latency due to iterative sampling. Few-step diffusion models reduce steps but compromise image quality and prompt alignment, especially in one-step generation, and require expensive training.

Method: Proposes ImageRAGTurbo: retrieval augmentation for few-step diffusion models. Retrieves relevant text-image pairs from database to condition generation. Uses retrieved content to edit UNet’s latent H-space without finetuning. Adds trainable adapter in H-space with cross-attention to blend retrieved content with target prompt.

Result: Experimental results show approach produces high-fidelity images without compromising latency compared to existing methods. Initial investigations show retrieval improves prompt fidelity even without additional finetuning.

Conclusion: ImageRAGTurbo efficiently finetunes few-step diffusion models via retrieval augmentation, enabling fast text-to-image generation with maintained quality and prompt alignment.

Abstract: Diffusion models have emerged as the leading approach for text-to-image generation. However, their iterative sampling process, which gradually morphs random noise into coherent images, introduces significant latency that limits their applicability. While recent few-step diffusion models reduce the number of sampling steps to as few as one to four steps, they often compromise image quality and prompt alignment, especially in one-step generation. Additionally, these models require computationally expensive training procedures. To address these limitations, we propose ImageRAGTurbo, a novel approach to efficiently finetune few-step diffusion models via retrieval augmentation. Given a text prompt, we retrieve relevant text-image pairs from a database and use them to condition the generation process. We argue that such retrieved examples provide rich contextual information to the UNet denoiser that helps reduce the number of denoising steps without compromising image quality. Indeed, our initial investigations show that using the retrieved content to edit the denoiser’s latent space ($\mathcal{H}$-space) without additional finetuning already improves prompt fidelity. To further improve the quality of the generated images, we augment the UNet denoiser with a trainable adapter in the $\mathcal{H}$-space, which efficiently blends the retrieved content with the target prompt using a cross-attention mechanism. Experimental results on fast text-to-image generation demonstrate that our approach produces high-fidelity images without compromising latency compared to existing methods.
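The H-space adapter's blending step can be sketched as cross-attention from the denoiser's latent vectors (queries) over retrieved feature vectors (keys/values), added back residually. This is an illustrative stand-in for the trainable adapter, not the paper's exact module; projections and multi-head structure are omitted:

```python
import math

def hspace_cross_attention(h_latents, retrieved, temp=1.0):
    """Sketch of ImageRAGTurbo's H-space blending: each latent vector
    attends over the retrieved features and adds the attended summary
    back as a residual."""
    out = []
    for q in h_latents:
        # dot-product attention scores against each retrieved vector
        scores = [sum(qi * ki for qi, ki in zip(q, r)) / temp for r in retrieved]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]   # stable softmax
        z = sum(exps)
        attn = [e / z for e in exps]
        summary = [sum(a * r[d] for a, r in zip(attn, retrieved))
                   for d in range(len(q))]
        out.append([qi + si for qi, si in zip(q, summary)])  # residual blend
    return out
```

The residual form matters: with the adapter zero-initialized, generation starts identical to the base few-step model, and retrieval context is mixed in only as the adapter trains.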

[99] Multi-Task Learning with Additive U-Net for Image Denoising and Classification

Vikram Lakkavalli, Neelam Sinha

Main category: cs.CV

TL;DR: Additive U-Net (AddUNet) replaces concatenative skip connections with gated additive fusion for image denoising and multi-task learning, achieving competitive performance with improved training stability through architectural regularization.

DetailsMotivation: The paper aims to improve U-Net architectures for image denoising and denoising-centric multi-task learning by addressing issues with concatenative skip connections, which can lead to uncontrolled information flow and training instability in joint optimization scenarios.

Method: Proposes Additive U-Net (AddUNet) which replaces concatenative skip connections with gated additive fusion. This approach constrains shortcut capacity while preserving fixed feature dimensionality across depth, acting as structural regularization that controls encoder-decoder information flow.

Result: AddUNet achieves competitive reconstruction performance with improved training stability in both single-task denoising and joint denoising-classification settings. In multi-task learning, learned skip weights show systematic task-aware redistribution: shallow skips favor reconstruction while deeper features support discrimination.

Conclusion: Simple constraints on skip connections serve as effective architectural regularizers for stable and scalable multi-task learning without increasing model complexity. Additive fusion enables implicit task decoupling and robust reconstruction even under limited classification capacity.

Abstract: We investigate additive skip fusion in U-Net architectures for image denoising and denoising-centric multi-task learning (MTL). By replacing concatenative skips with gated additive fusion, the proposed Additive U-Net (AddUNet) constrains shortcut capacity while preserving fixed feature dimensionality across depth. This structural regularization induces controlled encoder-decoder information flow and stabilizes joint optimization. Across single-task denoising and joint denoising-classification settings, AddUNet achieves competitive reconstruction performance with improved training stability. In MTL, learned skip weights exhibit systematic task-aware redistribution: shallow skips favor reconstruction, while deeper features support discrimination. Notably, reconstruction remains robust even under limited classification capacity, indicating implicit task decoupling through additive fusion. These findings show that simple constraints on skip connections act as an effective architectural regularizer for stable and scalable multi-task learning without increasing model complexity.
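The fusion itself is a one-liner, which is the point: replacing concatenation with a gated sum keeps the channel count fixed at every depth and bounds how much the shortcut can carry. A minimal sketch with a scalar gate per skip (the paper's gate granularity is not specified here, so a scalar is an assumption):

```python
import math

def gated_additive_skip(decoder_feat, skip_feat, gate_logit):
    """Sketch of AddUNet's skip fusion: instead of concatenating encoder
    features onto the decoder path, add them scaled by a learned
    sigmoid gate, so feature dimensionality stays fixed across depth."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))   # gate in (0, 1)
    return [d + g * s for d, s in zip(decoder_feat, skip_feat)]
```

The learned `gate_logit` values are what exhibit the task-aware redistribution noted above: large gates at shallow skips (reconstruction detail), smaller gates deeper (leaving capacity for discrimination).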

[100] CBEN – A Multimodal Machine Learning Dataset for Cloud Robust Remote Sensing Image Understanding

Marco Stricker, Masakazu Iwamura, Koichi Kise

Main category: cs.CV

TL;DR: CloudyBigEarthNet dataset enables training and evaluation of multimodal (optical+radar) models for cloud-robust remote sensing, showing significant performance drops when models trained on clear-sky data are tested on cloudy images.

DetailsMotivation: Clouds distort optical satellite imagery, forcing exclusion of cloudy images from ML datasets. This limits applicability to time-sensitive scenarios like natural disasters. While cloud removal methods exist, they have drawbacks. Combining optical with radar (cloud-unaffected) data offers a solution, but current multimodal datasets exclude cloudy images during training/evaluation.

Method: Created CloudyBigEarthNet (CBEN) dataset of paired optical and radar images with cloud occlusion for training and evaluation. Adapted state-of-the-art multimodal methods to handle cloudy optical data during training by including cloudy samples.

Result: Models trained on clear-sky optical+radar data suffer 23-33 percentage point performance drops when evaluated on cloudy images. Adapting methods to include cloudy optical data during training achieved relative improvements of 17.2-28.7 percentage points on cloudy test cases.

Conclusion: Excluding cloudy images from training/evaluation limits real-world applicability. The CBEN dataset enables development of cloud-robust multimodal methods. Including cloudy data during training significantly improves performance on cloudy scenarios.

Abstract: Clouds are a common phenomenon that distorts optical satellite imagery, which poses a challenge for remote sensing. However, much of the literature performs cloud-free analysis, excluding cloudy images from machine learning datasets and methods. Such an approach cannot be applied to time-sensitive applications, e.g., during natural disasters. A possible solution is to apply cloud removal as a preprocessing step to ensure that cloud-free solutions do not fail under such conditions. But cloud removal methods are still actively researched and suffer from drawbacks, such as generated visual artifacts. Therefore, it is desirable to develop cloud-robust methods that are less affected by cloudy weather. Cloud robustness can be achieved by combining optical data with radar, a modality unaffected by clouds. While many machine learning datasets combine optical and radar data, most researchers exclude cloudy images. We identify this exclusion from machine learning training and evaluation as a limitation that reduces applicability to cloudy scenarios. To investigate this, we assembled a dataset, named CloudyBigEarthNet (CBEN), of paired optical and radar images with cloud occlusion for training and evaluation. Using average precision (AP) as the evaluation metric, we show that state-of-the-art methods trained on combined clear-sky optical and radar imagery suffer performance drops of 23-33 percentage points when evaluated on cloudy images. We then adapt these methods to cloudy optical data during training, achieving relative improvements of 17.2-28.7 percentage points on cloudy test cases compared with the original approaches. Code and dataset are publicly available at: https://github.com/mstricker13/CBEN

[101] IndicFairFace: Balanced Indian Face Dataset for Auditing and Mitigating Geographical Bias in Vision-Language Models

Aarish Shah Mohsin, Mohammed Tayyab Ilyas Khan, Mohammad Nadeem, Shahab Saquib Sohail, Erik Cambria, Jiechao Gao

Main category: cs.CV

TL;DR: IndicFairFace: A geographically balanced face dataset for studying and mitigating intra-national bias in Vision-Language Models for India

DetailsMotivation: Existing fairness datasets treat India as a monolithic category, ignoring its vast geographical diversity across 28 states and 8 Union Territories, leading to representational bias in VLMs.

Method: Created IndicFairFace dataset with 14,400 images ethically sourced from Wikimedia Commons and open-license repositories, uniformly balanced across Indian states and gender. Used Iterative Nullspace Projection to debias CLIP-based VLMs.

Result: Successfully quantified and reduced intra-national geographical bias in VLMs. Debiasing approach preserved embedding quality with less than 1.5% average drop in retrieval accuracy on benchmark datasets.

Conclusion: IndicFairFace establishes the first benchmark for studying geographical bias in VLMs for India, addressing oversimplification of Indian diversity in existing fairness datasets.

Abstract: Vision-Language Models (VLMs) are known to inherit and amplify societal biases from their web-scale training data, with Indians being particularly misrepresented. Existing fairness-aware datasets have significantly improved demographic balance across global race and gender groups, yet they continue to treat Indians as a single monolithic category. This oversimplification ignores the vast intra-national diversity across the 28 states and 8 Union Territories of India and leads to representational and geographical bias. To address this limitation, we present IndicFairFace, a novel and balanced face dataset comprising 14,400 images representing the geographical diversity of India. Images were sourced ethically from Wikimedia Commons and open-license web repositories and uniformly balanced across states and gender. Using IndicFairFace, we quantify intra-national geographical bias in prominent CLIP-based VLMs and reduce it using a post-hoc Iterative Nullspace Projection debiasing approach. We also show that the adopted debiasing approach does not adversely impact the existing embedding space, as the average drop in retrieval accuracy on benchmark datasets is less than 1.5 percent. Our work establishes IndicFairFace as the first benchmark to study geographical bias in VLMs for the Indian context.
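One projection step of Iterative Nullspace Projection removes the component of each embedding along a direction predictive of the protected attribute; the full method re-fits a linear probe and repeats until the attribute is no longer linearly decodable. A sketch of the single step, with the direction assumed already learned (e.g. the weight vector of a linear state-of-origin classifier):

```python
def nullspace_project(embeddings, direction):
    """One INP step (sketch): project each embedding onto the nullspace
    of a learned protected-attribute direction, i.e. subtract its
    component along that direction."""
    norm_sq = sum(d * d for d in direction)
    out = []
    for e in embeddings:
        coef = sum(ei * di for ei, di in zip(e, direction)) / norm_sq
        out.append([ei - coef * di for ei, di in zip(e, direction)])
    return out
```

Because each step removes only a single direction, the bulk of the embedding geometry survives, which is consistent with the under-1.5% retrieval-accuracy drop reported above.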

[102] Motion Prior Distillation in Time Reversal Sampling for Generative Inbetweening

Wooseok Jeon, Seunghyun Shin, Dongmin Shin, Hae-Gon Jeon

Main category: cs.CV

TL;DR: MPD is an inference-time distillation technique that improves video inbetweening by distilling forward motion priors into backward paths to reduce temporal discontinuities in I2V diffusion models.

DetailsMotivation: Existing inference-time sampling methods for image-to-video inbetweening suffer from temporal discontinuities and visual artifacts due to misalignment between forward and backward generated paths, as each path follows different motion priors from their conditioning frames.

Method: Motion Prior Distillation (MPD) - an inference-time distillation technique that suppresses bidirectional mismatch by distilling the motion residual of the forward path into the backward path, avoiding denoising the ambiguous end-conditioned path.
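
As a rough illustration of the distillation idea, one inference step can be sketched as follows; `predict_motion` and the update rule are stand-ins, not the paper's API:

```python
def mpd_step(x_fwd, x_bwd, predict_motion, alpha=1.0):
    """One hypothetical inference step of Motion Prior Distillation.

    Rather than denoising the ambiguous end-conditioned (backward)
    path, the motion residual of the start-conditioned (forward) path
    is distilled into the backward latent, so both paths share the
    forward motion prior. `predict_motion` stands in for the I2V
    model's per-step motion prediction; latents are flat lists here.
    """
    residual = predict_motion(x_fwd)                 # forward motion residual
    x_fwd_next = [x + r for x, r in zip(x_fwd, residual)]
    x_bwd_next = [x + alpha * r for x, r in zip(x_bwd, residual)]  # distilled
    return x_fwd_next, x_bwd_next
```

The key point the sketch captures is that the backward path is never denoised under its own (end-frame) conditioning; it only receives the forward path's motion update.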

Result: Quantitative evaluations on standard benchmarks and extensive user studies demonstrate that MPD yields more temporally coherent inbetweening results with forward motion prior, outperforming existing parallel or sequential fusion methods.

Conclusion: MPD is a simple yet effective inference-time technique that improves video inbetweening quality by aligning motion priors between forward and backward generation paths in I2V diffusion models.

Abstract: Recent progress in image-to-video (I2V) diffusion models has significantly advanced the field of generative inbetweening, which aims to generate semantically plausible frames between two keyframes. In particular, inference-time sampling strategies, which leverage the generative priors of large-scale pre-trained I2V models without additional training, have become increasingly popular. However, existing inference-time sampling, either fusing forward and backward paths in parallel or alternating them sequentially, often suffers from temporal discontinuities and undesirable visual artifacts due to the misalignment between the two generated paths. This is because each path follows the motion prior induced by its own conditioning frame. In this work, we propose Motion Prior Distillation (MPD), a simple yet effective inference-time distillation technique that suppresses bidirectional mismatch by distilling the motion residual of the forward path into the backward path. Our method deliberately avoids denoising the end-conditioned path, which would introduce path ambiguity, and yields more temporally coherent inbetweening results with the forward motion prior. We not only perform quantitative evaluations on standard benchmarks, but also conduct extensive user studies to demonstrate the effectiveness of our approach in practical scenarios.

[103] Channel-Aware Probing for Multi-Channel Imaging

Umar Marikkar, Syed Sameed Husain, Muhammad Awais, Sara Atito

Main category: cs.CV

TL;DR: CAP (Channel-Aware Probing) improves frozen encoder probing for Multi-Channel Imaging by using Independent Feature Encoding and Decoupled Pooling to handle channel diversity.

DetailsMotivation: Multi-Channel Imaging (MCI) data has varying channel configurations across datasets, making fixed-channel training difficult and limiting reuse of pre-trained encoders. Existing probing methods for frozen encoders are underexplored, and direct transfer of strategies from other domains yields poor results for MCI.

Method: Proposes Channel-Aware Probing (CAP) with two key components: 1) Independent Feature Encoding (IFE) - encodes each channel separately, and 2) Decoupled Pooling (DCP) - pools features within channels before aggregating across channels to exploit inter-channel diversity.
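
The two components can be sketched in a few lines; `encode` stands in for the frozen per-channel encoder, and mean pooling is an assumed choice rather than the paper's exact operator:

```python
def channel_aware_probe(x, encode):
    """Sketch of Channel-Aware Probing on a frozen encoder.

    x: iterable of channels of a multi-channel image.
    encode: frozen encoder mapping one channel to a list of token vectors.

    IFE: each channel is encoded independently, so the frozen encoder
    never sees an unfamiliar channel count.
    DCP: tokens are pooled *within* each channel first, and only then
    aggregated *across* channels, preserving inter-channel diversity.
    """
    per_channel = [encode(ch) for ch in x]            # IFE
    pooled = []
    for tokens in per_channel:                        # DCP: pool within channel
        dim = len(tokens[0])
        pooled.append([sum(t[d] for t in tokens) / len(tokens)
                       for d in range(dim)])
    return [v for vec in pooled for v in vec]         # aggregate across channels
```

The concatenated per-channel features are then what a linear probe would consume, in contrast to default probing, which would pool all tokens of all channels into one vector and lose the channel structure.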

Result: Across three MCI benchmarks, CAP consistently improves probing performance over default protocols, matches fine-tuning from scratch, and significantly reduces the gap to full fine-tuning from the same MCI pre-trained checkpoints.

Conclusion: CAP effectively addresses the challenge of leveraging frozen pre-trained encoders for Multi-Channel Imaging by exploiting channel diversity through specialized encoding and pooling strategies.

Abstract: Training and evaluating vision encoders on Multi-Channel Imaging (MCI) data remains challenging as channel configurations vary across datasets, preventing fixed-channel training and limiting reuse of pre-trained encoders on new channel settings. Prior work trains MCI encoders but typically evaluates them via full fine-tuning, leaving probing with frozen pre-trained encoders comparatively underexplored. Existing studies that perform probing largely focus on improving representations, rather than how to best leverage fixed representations for downstream tasks. Although the latter problem has been studied in other domains, directly transferring those strategies to MCI yields weak results, even worse than training from scratch. We therefore propose Channel-Aware Probing (CAP), which exploits the intrinsic inter-channel diversity in MCI datasets by controlling feature flow at both the encoder and probe levels. CAP uses Independent Feature Encoding (IFE) to encode each channel separately, and Decoupled Pooling (DCP) to pool within channels before aggregating across channels. Across three MCI benchmarks, CAP consistently improves probing performance over the default probing protocol, matches fine-tuning from scratch, and largely reduces the gap to full fine-tuning from the same MCI pre-trained checkpoints. Code can be found at https://github.com/umarikkar/CAP.

[104] ART3mis: Ray-Based Textual Annotation on 3D Cultural Objects

Vasileios Arampatzakis, Vasileios Sevetlidis, Fotis Arnaoutoglou, Athanasios Kalogeras, Christos Koulamas, Aris Lalos, Chairi Kiourt, George Ioannakis, Anestis Koutsoudis, George Pavlidis

Main category: cs.CV

TL;DR: ART3mis is a user-friendly interactive textual annotation tool for 3D objects designed for cultural heritage professionals without technical 3D skills.

DetailsMotivation: Archaeologists and cultural heritage experts need advanced 3D applications for annotation and metadata attachment, but existing solutions are often domain-specific and not accessible to non-technical users.

Method: A general-purpose, user-driven, direct-on-surface approach that allows real-time handling, segmenting, and annotating of detailed 3D cultural objects with JSON data format storage.

Result: Developed ART3mis tool that enables cultural heritage conservators, restorers, and curators to easily annotate 3D digital replicas of artefacts without technical 3D imaging skills.

Conclusion: ART3mis provides an accessible solution for 3D object annotation in cultural heritage, bridging the gap between technical 3D capabilities and practical needs of domain experts.

Abstract: Beyond simplistic 3D visualisations, archaeologists, as well as cultural heritage experts and practitioners, need applications with advanced functionalities, such as the annotation and attachment of metadata onto particular regions of 3D digital objects. Various approaches have been presented to tackle this challenge, most of which achieve excellent results in the domain of their application. However, they are often confined to that specific domain and particular problem. In this paper, we present ART3mis, a general-purpose, user-friendly, interactive textual annotation tool for 3D objects. Primarily attuned to aid cultural heritage conservators, restorers and curators with no technical skills in 3D imaging and graphics, the tool allows for the easy handling, segmenting and annotating of 3D digital replicas of artefacts. ART3mis applies a user-driven, direct-on-surface approach. It can handle detailed 3D cultural objects in real-time and store textual annotations for multiple complex regions in JSON data format.

[105] VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph

Qiuchen Wang, Shihang Wang, Yu Zeng, Qiang Zhang, Fanrui Zhang, Zhuoning Guo, Bosi Zhang, Wenxuan Huang, Lin Chen, Zehui Chen, Pengjun Xie, Ruixue Ding

Main category: cs.CV

TL;DR: VimRAG is a multimodal retrieval-augmented reasoning framework that uses dynamic directed acyclic graphs to structure agent states and multimodal evidence, with graph-modulated visual memory encoding and graph-guided policy optimization for efficient long-context reasoning.

DetailsMotivation: Traditional RAG methods struggle with long-context tasks involving visual data due to linear interaction histories and inefficient token allocation for information-sparse yet token-heavy visual content in iterative reasoning scenarios.

Method: Models reasoning as dynamic directed acyclic graphs structuring agent states and retrieved multimodal evidence. Introduces Graph-Modulated Visual Memory Encoding that evaluates node significance via topological position to dynamically allocate high-resolution tokens to pivotal evidence while compressing/discarding trivial clues. Uses Graph-Guided Policy Optimization to disentangle step-wise validity from trajectory-level rewards by pruning memory nodes associated with redundant actions.
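
One way to picture graph-modulated token allocation is budgeted assignment driven by a node-significance score. In this toy sketch, out-degree stands in for the paper's topological significance measure, and every budget constant is invented:

```python
def allocate_visual_tokens(graph, budget, hi=256, lo=16):
    """Toy version of graph-modulated visual token allocation.

    graph: dict mapping each node of the reasoning DAG to its children.
    A node's significance is approximated here by its out-degree (how
    much later reasoning depends on it): pivotal evidence receives a
    high-resolution token budget, trivial clues are compressed.
    """
    degree = {n: len(children) for n, children in graph.items()}
    ranked = sorted(degree, key=degree.get, reverse=True)
    alloc, remaining = {}, budget
    for node in ranked:
        tokens = min(hi if degree[node] > 0 else lo, remaining)
        alloc[node] = tokens
        remaining -= tokens
    return alloc
```

The point of the sketch is the mechanism, not the scoring rule: under a fixed total budget, structurally central nodes keep high-resolution visual tokens while leaf clues are aggressively compressed.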

Result: Extensive experiments demonstrate VimRAG consistently achieves state-of-the-art performance on diverse multimodal RAG benchmarks.

Conclusion: VimRAG effectively bridges the gap in multimodal retrieval-augmented reasoning by introducing graph-structured memory and adaptive visual encoding, enabling efficient handling of long-context multimodal tasks.

Abstract: Effectively retrieving, reasoning, and understanding multimodal information remains a critical challenge for agentic systems. Traditional Retrieval-augmented Generation (RAG) methods rely on linear interaction histories, which struggle to handle long-context tasks, especially those involving information-sparse yet token-heavy visual data in iterative reasoning scenarios. To bridge this gap, we introduce VimRAG, a framework tailored for multimodal Retrieval-augmented Reasoning across text, images, and videos. Inspired by our systematic study, we model the reasoning process as a dynamic directed acyclic graph that structures the agent states and retrieved multimodal evidence. Building upon this structured memory, we introduce a Graph-Modulated Visual Memory Encoding mechanism, with which the significance of memory nodes is evaluated via their topological position, allowing the model to dynamically allocate high-resolution tokens to pivotal evidence while compressing or discarding trivial clues. To implement this paradigm, we propose a Graph-Guided Policy Optimization strategy. This strategy disentangles step-wise validity from trajectory-level rewards by pruning memory nodes associated with redundant actions, thereby facilitating fine-grained credit assignment. Extensive experiments demonstrate that VimRAG consistently achieves state-of-the-art performance on diverse multimodal RAG benchmarks. The code is available at https://github.com/Alibaba-NLP/VRAG.

[106] SPRig: Self-Supervised Pose-Invariant Rigging from Mesh Sequences

Ruipeng Wang, Langkun Zhong, Miaowei Wang

Main category: cs.CV

TL;DR: SPRig is a fine-tuning framework that enforces cross-frame consistency losses to learn pose-invariant rigging for sequential data lacking canonical rest poses, achieving state-of-the-art temporal stability.

DetailsMotivation: Existing rigging methods assume a canonical rest pose (like T-pose), which fails for sequential data such as animal motion capture or video-derived mesh sequences that lack consistent rest poses. Frame-by-frame application leads to topological inconsistencies and lack of pose invariance.

Method: Proposes SPRig, a general fine-tuning framework that adds cross-frame consistency losses on top of existing rigging models to learn pose-invariant rigs. Uses a new permutation-invariant stability protocol for validation.
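
A cross-frame consistency loss of the kind described can be illustrated on skinning weights; the exact objective used in the paper may differ:

```python
def cross_frame_consistency_loss(weights_seq):
    """Illustrative cross-frame consistency objective.

    Penalises changes in predicted skinning weights between adjacent
    frames: a pose-invariant rig should assign each vertex the same
    weights in every pose of the sequence.

    weights_seq: list of frames; each frame is a list of per-vertex
    skinning-weight vectors.
    """
    loss, count = 0.0, 0
    for prev, curr in zip(weights_seq, weights_seq[1:]):
        for w_prev, w_curr in zip(prev, curr):
            loss += sum((a - b) ** 2 for a, b in zip(w_prev, w_curr))
            count += 1
    return loss / max(count, 1)
```

Driving such a term to zero across a mesh sequence is what removes the frame-by-frame topological inconsistencies the motivation describes.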

Result: Achieves state-of-the-art temporal stability, produces coherent rigs from challenging sequences, and dramatically reduces artifacts compared to baseline methods.

Conclusion: SPRig successfully addresses the limitations of existing rigging methods for sequential data by enforcing cross-frame consistency, enabling pose-invariant rigging without requiring canonical rest poses.

Abstract: State-of-the-art rigging methods assume a canonical rest pose, an assumption that fails for sequential data (e.g., animal motion capture or AIGC/video-derived mesh sequences) that lack the T-pose. Applied frame-by-frame, these methods are not pose-invariant and produce topological inconsistencies across frames. We therefore propose SPRig, a general fine-tuning framework that enforces cross-frame consistency losses to learn pose-invariant rigs on top of existing models. We validate our approach on rigging using a new permutation-invariant stability protocol. Experiments demonstrate state-of-the-art temporal stability: our method produces coherent rigs from challenging sequences and dramatically reduces the artifacts that plague baseline methods. The code will be released publicly upon acceptance.

[107] Synthetic Craquelure Generation for Unsupervised Painting Restoration

Jana Cuch-Guillén, Antonio Agudo, Raül Pérez-Gonzalo

Main category: cs.CV

TL;DR: Annotation-free framework for painting restoration using synthetic craquelure generator and detector-guided refinement with SegFormer+LoRA, followed by anisotropic diffusion inpainting.

DetailsMotivation: Cultural heritage preservation needs non-invasive digital restoration methods, but identifying and restoring fine craquelure patterns is challenging due to scarce pixel-level annotations and complex brushstrokes.

Method: 1) Domain-specific synthetic craquelure generator using Bézier trajectories; 2) Couples classical morphological detector with learning-based refinement (SegFormer backbone adapted via LoRA); 3) Detector-guided strategy with morphological map as input spatial prior; 4) Masked hybrid loss and logit adjustment to focus on refining crack regions; 5) Anisotropic Diffusion inpainting guided by refined masks.
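
The Bézier-based crack geometry can be sketched with the standard library alone; the jitter and taper constants below are illustrative, and branching is omitted:

```python
import random

def bezier_point(p0, p1, p2, t):
    """Quadratic Bézier interpolation between p0 and p2 via control p1."""
    x = (1 - t) ** 2 * p0[0] + 2 * (1 - t) * t * p1[0] + t ** 2 * p2[0]
    y = (1 - t) ** 2 * p0[1] + 2 * (1 - t) * t * p1[1] + t ** 2 * p2[1]
    return x, y

def synth_crack(p0, p2, jitter=10.0, n=20, rng=None):
    """Sketch of one synthetic fissure: a quadratic Bézier trajectory
    with a randomly jittered control point, plus a linearly tapered
    width so the crack thins towards its tip."""
    rng = rng or random.Random(0)
    mid = ((p0[0] + p2[0]) / 2 + rng.uniform(-jitter, jitter),
           (p0[1] + p2[1]) / 2 + rng.uniform(-jitter, jitter))
    pts = [bezier_point(p0, mid, p2, i / (n - 1)) for i in range(n)]
    widths = [2.0 * (1 - i / (n - 1)) for i in range(n)]   # taper to zero
    return pts, widths
```

Rasterising many such tapered curves onto clean painting crops is what yields the annotation-free training pairs for the refinement module.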

Result: Pipeline significantly outperforms state-of-the-art photographic restoration models in zero-shot settings while faithfully preserving original paint brushwork.

Conclusion: Proposed annotation-free framework effectively addresses craquelure restoration challenges in cultural heritage preservation through synthetic data generation and detector-guided refinement.

Abstract: Cultural heritage preservation increasingly demands non-invasive digital methods for painting restoration, yet identifying and restoring fine craquelure patterns from complex brushstrokes remains challenging due to scarce pixel-level annotations. We propose a fully annotation-free framework driven by a domain-specific synthetic craquelure generator, which simulates realistic branching and tapered fissure geometry using Bézier trajectories. Our approach couples a classical morphological detector with a learning-based refinement module: a SegFormer backbone adapted via Low-Rank Adaptation (LoRA). Uniquely, we employ a detector-guided strategy, injecting the morphological map as an input spatial prior, while a masked hybrid loss and logit adjustment constrain the training to focus specifically on refining candidate crack regions. The refined masks subsequently guide an Anisotropic Diffusion inpainting stage to reconstruct missing content. Experimental results demonstrate that our pipeline significantly outperforms state-of-the-art photographic restoration models in zero-shot settings, while faithfully preserving the original paint brushwork.

[108] ReBA-Pred-Net: Weakly-Supervised Regional Brain Age Prediction on MRI

Shuai Shao, Yan Wang, Shu Jiang, Shiyuan Zhao, Xinzhe Luo, Di Yang, Jiangtao Wang, Yutong Bai, Jianguo Zhang

Main category: cs.CV

TL;DR: ReBA-Pred-Net: A Teacher-Student framework for regional brain age estimation with clinical-prior consistency constraints, evaluated using novel indirect metrics for statistical and factual validity.

DetailsMotivation: Whole brain age (WBA) is too coarse for tasks like disease characterization and aging pattern research, as relevant changes are region-selective rather than brain-wide. Robust regional brain age (ReBA) estimation is critical but lacks widely generalizable models.

Method: Proposes ReBA-Pred-Net, a Teacher-Student framework where the Teacher produces soft ReBA to guide the Student, with clinical-prior consistency constraint (regions within same function should change similarly). Introduces two indirect evaluation metrics: Healthy Control Similarity (HCS) for statistical consistency and Neuro Disease Correlation (NDC) for factual consistency.
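
The brain-age-gap computation and a toy version of the HCS idea can be written out directly; the paper's actual two-sample statistic is not specified here, so a simple mean/stdev comparison stands in:

```python
import statistics

def brain_age_gap(reba, chron_age):
    """Regional brain-age gap: predicted regional brain age minus
    chronological age, one value per anatomical region."""
    return [r - chron_age for r in reba]

def hcs(gaps_train, gaps_unseen):
    """Toy Healthy Control Similarity: how closely the brain-age-gap
    distribution of unseen healthy controls matches the training one.
    Returns 1.0 for identical mean/stdev, decaying towards 0 as the
    distributions diverge (an illustrative stand-in, not the paper's
    exact statistic)."""
    d_mean = abs(statistics.mean(gaps_train) - statistics.mean(gaps_unseen))
    d_std = abs(statistics.stdev(gaps_train) - statistics.stdev(gaps_unseen))
    return 1.0 / (1.0 + d_mean + d_std)
```

NDC would follow the same pattern in reverse: for confirmed patients, the gap in disease-associated regions should be systematically elevated rather than matched.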

Result: Experiments across multiple backbones demonstrate the statistical and factual validity of the method, showing it can produce reliable regional brain age estimates that align with clinical expectations.

Conclusion: ReBA-Pred-Net provides a robust framework for fine-grained brain age estimation that addresses limitations of whole brain age approaches, with validation through novel indirect metrics that assess both statistical and clinical consistency.

Abstract: Brain age has become a prominent biomarker of brain health. Yet most prior work targets whole brain age (WBA), a coarse paradigm that struggles to support tasks such as disease characterization and research on development and aging patterns, because relevant changes are typically region-selective rather than brain-wide. Therefore, robust regional brain age (ReBA) estimation is critical, yet a widely generalizable model has yet to be established. In this paper, we propose the Regional Brain Age Prediction Network (ReBA-Pred-Net), a Teacher-Student framework designed for fine-grained brain age estimation. The Teacher produces soft ReBA to guide the Student to yield reliable ReBA estimates with a clinical-prior consistency constraint (regions within the same function should change similarly). For rigorous evaluation, we introduce two indirect metrics: Healthy Control Similarity (HCS), which assesses statistical consistency by testing whether regional brain-age-gap (ReBA minus chronological age) distributions align between training and unseen HC; and Neuro Disease Correlation (NDC), which assesses factual consistency by checking whether clinically confirmed patients show elevated brain-age-gap in disease-associated regions. Experiments across multiple backbones demonstrate the statistical and factual validity of our method.

[109] Towards reconstructing experimental sparse-view X-ray CT data with diffusion models

Nelas J. Thomsen, Xinyuan Wang, Felix Lucka, Ezgi Demircan-Tureyen

Main category: cs.CV

TL;DR: Diffusion priors for sparse-view CT show nuanced effects of domain shift and forward model mismatch when moving from synthetic to experimental data.

DetailsMotivation: To investigate whether diffusion-based image generators can successfully transition from synthetic to experimental data for ill-posed inverse problems like sparse-view CT, addressing domain shift and forward model mismatch challenges.

Method: Measured CT data from physical phantom resembling synthetic Shepp-Logan phantom, trained diffusion priors on synthetic datasets with varying domain shift, employed Decomposed Diffusion Sampling on sparse-view CT datasets of increasing difficulty leading to experimental data.

Result: Domain shift has nuanced effects: severe mismatch causes model collapse/hallucinations, but diverse priors outperform well-matched but narrow priors. Forward model mismatch causes artifacts but can be mitigated with annealed likelihood schedules that also improve computational efficiency.

Conclusion: Performance gains don’t immediately translate from synthetic to experimental data; future development must validate against real-world benchmarks, with domain shift and forward model mismatch requiring careful consideration.

Abstract: Diffusion-based image generators are promising priors for ill-posed inverse problems like sparse-view X-ray Computed Tomography (CT). As most studies consider synthetic data, it is not clear whether training data mismatch (“domain shift”) or forward model mismatch complicate their successful application to experimental data. We measured CT data from a physical phantom resembling the synthetic Shepp-Logan phantom and trained diffusion priors on synthetic image data sets with different degrees of domain shift towards it. Then, we employed the priors in a Decomposed Diffusion Sampling scheme on sparse-view CT data sets with increasing difficulty leading to the experimental data. Our results reveal that domain shift plays a nuanced role: while severe mismatch causes model collapse and hallucinations, diverse priors outperform well-matched but narrow priors. Forward model mismatch pulls the image samples away from the prior manifold, which causes artifacts but can be mitigated with annealed likelihood schedules that also increase computational efficiency. Overall, we demonstrate that performance gains do not immediately translate from synthetic to experimental data, and future development must validate against real-world benchmarks.

[110] Towards complete digital twins in cultural heritage with ART3mis 3D artifacts annotator

Dimitrios Karamatskos, Vasileios Arampatzakis, Vasileios Sevetlidis, Stavros Nousias, Athanasios Kalogeras, Christos Koulamas, Aris Lalos, George Pavlidis

Main category: cs.CV

TL;DR: ART3mis is a web-based textual annotation tool for 3D cultural heritage objects that enables metadata attachment to specific regions while complying with W3C standards for interoperability.

DetailsMotivation: Current 3D visualization tools for cultural heritage lack advanced annotation capabilities, are domain-specific, and lack interoperability. There's a need for a general-purpose, user-friendly tool that allows non-technical experts to annotate 3D artifacts with metadata.

Method: Developed ART3mis as a web-based interactive annotation tool that complies with W3C Web Annotation Data Model standards, enabling segmentation and textual annotation of specific regions on 3D digital replicas.
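
A minimal record in the W3C Web Annotation Data Model shows the shape of such an annotation. The target URL and selector below are placeholders, since ART3mis's exact selector scheme for mesh regions is not described here:

```python
import json

# Hypothetical Web Annotation record for one annotated region of a
# 3D artifact; only the "@context"/"type"/"body"/"target" structure
# comes from the W3C model, the values are illustrative.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "body": {
        "type": "TextualBody",
        "value": "Incised decoration, partially abraded",
        "format": "text/plain",
    },
    "target": {
        "source": "https://example.org/artifacts/amphora-042.glb",
        "selector": {"type": "FragmentSelector",       # placeholder selector
                     "value": "region=handle-left"},
    },
}
print(json.dumps(annotation, indent=2))
```

Because the record is plain JSON-LD under a shared vocabulary, any other W3C-compliant tool can consume, redistribute, or merge these annotations, which is the interoperability claim above.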

Result: Created a feature-rich, user-friendly tool that allows cultural heritage professionals without technical expertise to easily handle, segment, and annotate 3D digital artifacts while ensuring information can be communicated, distributed, and reused.

Conclusion: ART3mis addresses the limitations of existing 3D visualization tools by providing a general-purpose, interoperable annotation solution that empowers cultural heritage professionals to work with 3D digital artifacts more effectively.

Abstract: Archaeologists, as well as specialists and practitioners in cultural heritage, require applications with additional functions, such as the annotation and attachment of metadata to specific regions of the 3D digital artifacts, to go beyond the simplistic three-dimensional (3D) visualization. Different strategies have addressed this issue, and most of them are excellent in their particular area of application, but their capacity is limited to their design’s purpose; they lack generalization and interoperability. This paper introduces ART3mis, a general-purpose, user-friendly, feature-rich, interactive web-based textual annotation tool for 3D objects. Moreover, it enables the communication, distribution, and reuse of information as it complies with the W3C Web Annotation Data Model. It is primarily designed to help cultural heritage conservators, restorers, and curators who lack technical expertise in 3D imaging and graphics, handle, segment, and annotate 3D digital replicas of artifacts with ease.

[111] PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion

Hong-Phuc Lai, Phong Nguyen, Anh Tran

Main category: cs.CV

TL;DR: PixelRush is a tuning-free framework for high-resolution text-to-image generation that enables efficient patch-based denoising with seamless blending and noise injection, achieving 10-35x speedup for 4K image generation.

DetailsMotivation: Pre-trained diffusion models are limited by their native training resolution, and existing training-free approaches for high-resolution generation require substantial computational overhead (5+ minutes for 4K images). There's a need for practical, efficient high-resolution text-to-image generation without fine-tuning.

Method: PixelRush builds on patch-based inference but eliminates multiple inversion/regeneration cycles. It enables efficient patch-based denoising in low-step regime, uses seamless blending strategy to address patch artifacts, and incorporates noise injection to mitigate over-smoothing effects.
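
Weighted-window blending of overlapping patches is one standard way to realise seamless patch fusion. The sketch below uses a separable linear-ramp window and is an illustration of the general technique, not the paper's exact strategy:

```python
import numpy as np

def blend_patches(patches, coords, H, W, patch):
    """Blend overlapping denoised patches into one image.

    Each patch is accumulated into the canvas under a smooth weight
    window that peaks at the patch centre, then the canvas is
    normalised by the accumulated weights so overlap seams average out
    instead of producing visible edges.
    """
    ramp = np.minimum(np.arange(1, patch + 1),
                      np.arange(patch, 0, -1)).astype(float)
    window = np.outer(ramp, ramp)            # peaks at patch centre
    canvas = np.zeros((H, W))
    weight = np.zeros((H, W))
    for p, (y, x) in zip(patches, coords):
        canvas[y:y + patch, x:x + patch] += p * window
        weight[y:y + patch, x:x + patch] += window
    return canvas / np.maximum(weight, 1e-8)
```

In a low-step regime such windowed averaging matters more than usual, because there are few remaining denoising steps left to smooth out seams introduced by hard patch boundaries.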

Result: PixelRush generates 4K images in approximately 20 seconds, achieving 10-35x speedup over state-of-the-art methods while maintaining superior visual fidelity. Extensive experiments validate performance gains and output quality.

Conclusion: PixelRush presents the first tuning-free framework for practical high-resolution text-to-image generation, offering exceptional efficiency and quality improvements over existing methods.

Abstract: Pre-trained diffusion models excel at generating high-quality images but remain inherently limited by their native training resolution. Recent training-free approaches have attempted to overcome this constraint by introducing interventions during the denoising process; however, these methods incur substantial computational overhead, often requiring more than five minutes to produce a single 4K image. In this paper, we present PixelRush, the first tuning-free framework for practical high-resolution text-to-image generation. Our method builds upon the established patch-based inference paradigm but eliminates the need for multiple inversion and regeneration cycles. Instead, PixelRush enables efficient patch-based denoising within a low-step regime. To address artifacts introduced by patch blending in few-step generation, we propose a seamless blending strategy. Furthermore, we mitigate over-smoothing effects through a noise injection mechanism. PixelRush delivers exceptional efficiency, generating 4K images in approximately 20 seconds, representing a 10$\times$ to 35$\times$ speedup over state-of-the-art methods, while maintaining superior visual fidelity. Extensive experiments validate both the performance gains and the quality of outputs achieved by our approach.

[112] Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting

Xiaowen Zhang, Zijie Yue, Yong Luo, Cairong Zhao, Qijun Chen, Miaojing Shi

Main category: cs.CV

TL;DR: WS-COC is a weakly-supervised class-agnostic object counting framework using MLLMs with three strategies: divide-and-discern dialogue tuning, compare-and-rank optimization, and global-local counting enhancement.

DetailsMotivation: Object counting requires costly point-level annotations in fully-supervised methods. Existing weakly-supervised methods are limited to single categories. The paper aims to develop a class-agnostic counting framework using MLLMs with minimal supervision.

Method: Three strategies: 1) Divide-and-discern dialogue tuning - MLLM determines if object count falls within specific ranges through multi-round dialogue; 2) Compare-and-rank count optimization - MLLM learns to rank images by object counts; 3) Global-and-local counting enhancement - fuses local and global predictions for dense scenes.
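
The multi-round range narrowing of the first strategy amounts to a binary search over counts, with each dialogue round answering one range query. `ask_in_range` is a stand-in for a round of MLLM dialogue:

```python
def divide_and_discern(ask_in_range, lo=0, hi=128):
    """Sketch of divide-and-discern counting.

    Instead of asking the model for an exact count (hard across the
    modality gap), each round only asks whether the count falls in the
    lower half of the current range, progressively narrowing it down
    to a single number. `ask_in_range(lo, mid)` answers "is the count
    in [lo, mid)?" and stands in for one dialogue round.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if ask_in_range(lo, mid):
            hi = mid
        else:
            lo = mid
    return lo
```

For a range of 128 this needs only seven yes/no rounds, which is why reframing counting as range discrimination is tractable for a dialogue-tuned model.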

Result: Extensive experiments on FSC-147, CARPK, PUCPR+, and ShanghaiTech show WS-COC matches or surpasses state-of-the-art fully-supervised methods while significantly reducing annotation costs.

Conclusion: WS-COC demonstrates that MLLMs can be effectively adapted for class-agnostic object counting with weak supervision, achieving competitive performance with reduced annotation burden.

Abstract: Object counting is a fundamental task in computer vision, with broad applicability in many real-world scenarios. Fully-supervised counting methods require costly point-level annotations per object. Few weakly-supervised methods leverage only image-level object counts as supervision and achieve fairly promising results. They are, however, often limited to counting a single category, e.g. person. In this paper, we propose WS-COC, the first MLLM-driven weakly-supervised framework for class-agnostic object counting. Instead of directly fine-tuning MLLMs to predict object counts, which can be challenging due to the modality gap, we incorporate three simple yet effective strategies to bootstrap the counting paradigm in both training and testing: First, a divide-and-discern dialogue tuning strategy is proposed to guide the MLLM to determine whether the object count falls within a specific range and progressively break down the range through multi-round dialogue. Second, a compare-and-rank count optimization strategy is introduced to train the MLLM to optimize the relative ranking of multiple images according to their object counts. Third, a global-and-local counting enhancement strategy aggregates and fuses local and global count predictions to improve counting performance in dense scenes. Extensive experiments on FSC-147, CARPK, PUCPR+, and ShanghaiTech show that WS-COC matches or even surpasses many state-of-the-art fully-supervised methods while significantly reducing annotation costs. Code is available at https://github.com/viscom-tongji/WS-COC.

[113] GSM-GS: Geometry-Constrained Single and Multi-view Gaussian Splatting for Surface Reconstruction

Xiao Ren, Yu Liu, Ning An, Jian Cheng, Xin Qiao, He Kong

Main category: cs.CV

TL;DR: GSM-GS: A synergistic optimization framework for 3D Gaussian Splatting that improves reconstruction accuracy through single-view adaptive sub-region weighting and multi-view spatial structure refinement.

DetailsMotivation: 3D Gaussian Splatting has fast training and high-fidelity rendering, but unstructured Gaussian point clouds struggle with reconstruction accuracy, causing high-frequency detail loss in complex surface microstructures.

Method: Two-stage approach: 1) Single-view optimization using image gradient features to partition scenes into texture-rich/texture-less regions with adaptive filtering and dual-branch constraints; 2) Multi-view optimization with geometry-guided cross-view point cloud association and dynamic weight sampling for 3D structural normal constraints.
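
The first stage's partition into texture-rich and texture-less sub-regions can be sketched as a simple threshold on image gradient magnitude; the threshold value here is illustrative, not the paper's:

```python
import numpy as np

def partition_by_gradient(img, thresh=0.1):
    """Split an image into texture-rich / texture-less sub-regions.

    Pixels with strong local gradients form the texture-rich region,
    where detailed geometric constraints pay off; the complement is
    texture-less and handled by the other branch of the dual-branch
    constraint. Returns two complementary boolean masks.
    """
    gy, gx = np.gradient(img.astype(float))   # gradients along rows, cols
    mag = np.hypot(gx, gy)                    # gradient magnitude
    texture_rich = mag > thresh
    return texture_rich, ~texture_rich
```

Routing each mask to a different constraint branch is what lets the method preserve high-frequency detail in textured areas without destabilising flat regions.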

Result: Extensive experiments on public datasets show competitive rendering quality and geometric reconstruction compared to existing methods.

Conclusion: The proposed GSM-GS framework effectively addresses reconstruction limitations in 3D Gaussian Splatting through synergistic optimization strategies that preserve geometric details and enhance multi-view consistency.

Abstract: Recently, 3D Gaussian Splatting has emerged as a prominent research direction owing to its ultra-rapid training speed and high-fidelity rendering capabilities. However, the unstructured and irregular nature of Gaussian point clouds poses challenges to reconstruction accuracy. This limitation frequently causes high-frequency detail loss in complex surface microstructures when relying solely on routine strategies. To address this limitation, we propose GSM-GS: a synergistic optimization framework integrating single-view adaptive sub-region weighting constraints and multi-view spatial structure refinement. For single-view optimization, we leverage image gradient features to partition scenes into texture-rich and texture-less sub-regions. The reconstruction quality is enhanced through adaptive filtering mechanisms guided by depth discrepancy features. This preserves high-weight regions while implementing a dual-branch constraint strategy tailored to regional texture variations, thereby improving geometric detail characterization. For multi-view optimization, we introduce a geometry-guided cross-view point cloud association method combined with a dynamic weight sampling strategy. This constructs 3D structural normal constraints across adjacent point cloud frames, effectively reinforcing multi-view consistency and reconstruction fidelity. Extensive experiments on public datasets demonstrate that our method achieves both competitive rendering quality and geometric reconstruction. See our interactive project page.

[114] Thinking Like a Radiologist: A Dataset for Anatomy-Guided Interleaved Vision Language Reasoning in Chest X-ray Interpretation

Yichen Zhao, Zelin Peng, Piao Yang, Xiaokang Yang, Wei Shen

Main category: cs.CV

TL;DR: MMRad-IVL-22K is a large-scale dataset for interleaved visual-language reasoning in chest X-ray interpretation, enabling multimodal chain-of-thought reasoning that outperforms text-only approaches in medical AI.

DetailsMotivation: Current medical LVLMs use text-only chain-of-thought reasoning that operates purely in linguistic space and is prone to hallucination, failing to capture the interleaved visual inspection and language reasoning process of radiologists.

Method: Created MMRad-IVL-22K dataset with 21,994 diagnostic traces reflecting radiologists’ repeated cycle of reasoning and visual inspection, enabling systematic scanning across 35 anatomical regions for multimodal CoT reasoning.

Result: Report generation guided by multimodal CoT outperforms text-only CoT by 6% on the RadGraph metric; models fine-tuned on MMRad-IVL-22K achieve superior reasoning consistency and report quality compared to general-purpose and medical-specific LVLMs.

Conclusion: High-fidelity interleaved vision-language evidence is essential for reliable medical AI, and the dataset enables more accurate radiological diagnosis through multimodal reasoning that mimics clinical workflow.

Abstract: Radiological diagnosis is a perceptual process in which careful visual inspection and language reasoning are repeatedly interleaved. Most medical large vision language models (LVLMs) perform visual inspection only once and then rely on text-only chain-of-thought (CoT) reasoning, which operates purely in the linguistic space and is prone to hallucination. Recent methods attempt to mitigate this issue by introducing visually related coordinates, such as bounding boxes. However, these remain a pseudo-visual solution: coordinates are still text and fail to preserve rich visual details like texture and density. Motivated by the interleaved nature of radiological diagnosis, we introduce MMRad-IVL-22K, the first large-scale dataset designed for natively interleaved visual language reasoning in chest X-ray interpretation. MMRad-IVL-22K reflects radiologists' workflow of repeated cycles of reasoning and visual inspection, in which visual rationales complement textual descriptions and ground each step of the reasoning process. MMRad-IVL-22K comprises 21,994 diagnostic traces, enabling systematic scanning across 35 anatomical regions. Experimental results on advanced closed-source LVLMs demonstrate that report generation guided by multimodal CoT significantly outperforms that guided by text-only CoT in clinical accuracy and report quality (e.g., 6% increase in the RadGraph metric), confirming that high-fidelity interleaved vision language evidence is a non-substitutable component of reliable medical AI. Furthermore, benchmarking across seven state-of-the-art open-source LVLMs demonstrates that models fine-tuned on MMRad-IVL-22K achieve superior reasoning consistency and report quality compared with both general-purpose and medical-specific LVLMs. The project page is available at https://github.com/qiuzyc/thinking_like_a_radiologist.

[115] RoadscapesQA: A Multitask, Multimodal Dataset for Visual Question Answering on Indian Roads

Vijayasri Iyer, Maahin Rathinagiriswaran, Jyothikamalesh S

Main category: cs.CV

TL;DR: Roadscapes: A multimodal dataset of 9,000 Indian driving scene images with bounding boxes and generated QA pairs for object grounding, reasoning, and scene understanding tasks.

DetailsMotivation: To advance visual scene understanding in unstructured driving environments, particularly for autonomous driving systems that need to interpret complex road scenes in diverse conditions.

Method: Collected 9,000 images from diverse Indian driving environments, manually verified bounding boxes, used rule-based heuristics to infer scene attributes, and generated QA pairs for various visual understanding tasks.

Result: Created Roadscapes dataset covering urban/rural India, highways, service roads, village paths, congested streets, day/night settings, with initial baselines for image QA tasks using vision-language models.

Conclusion: Roadscapes provides a valuable resource for advancing research on visual scene understanding in unstructured driving environments, particularly for autonomous driving applications.

Abstract: Understanding road scenes is essential for autonomous driving, as it enables systems to interpret visual surroundings to aid in effective decision-making. We present Roadscapes, a multitask multimodal dataset consisting of up to 9,000 images captured in diverse Indian driving environments, accompanied by manually verified bounding boxes. To facilitate scalable scene understanding, we employ rule-based heuristics to infer various scene attributes, which are subsequently used to generate question-answer (QA) pairs for tasks such as object grounding, reasoning, and scene understanding. The dataset includes a variety of scenes from urban and rural India, encompassing highways, service roads, village paths, and congested city streets, captured in both daytime and nighttime settings. Roadscapes has been curated to advance research on visual scene understanding in unstructured environments. In this paper, we describe the data collection and annotation process, present key dataset statistics, and provide initial baselines for image QA tasks using vision-language models.

[116] RADAR: Revealing Asymmetric Development of Abilities in MLLM Pre-training

Yunshuang Nie, Bingqian Lin, Minzhe Niu, Kun Xiang, Jianhua Han, Guowei Huang, Xingyue Quan, Hang Xu, Bokui Chen, Xiaodan Liang

Main category: cs.CV

TL;DR: RADAR is an evaluation framework for pre-trained MLLMs that introduces a novel metric (Soft Discrimination Score) and benchmark (Multi-Modal Mixture Benchmark) to assess perception and reasoning abilities without fine-tuning.

DetailsMotivation: Current evaluation methods for MLLMs require supervised fine-tuning, which is labor-intensive, and existing metrics cannot disentangle perception vs. reasoning abilities. There's also a lack of comprehensive benchmarks aligned with pre-training objectives.

Method: RADAR has two components: (1) Soft Discrimination Score - a novel metric that quantifies model preference for correct answers over distractors without fine-tuning, and (2) Multi-Modal Mixture Benchmark - a 15K+ sample benchmark that unifies existing datasets and collects new data for 0-shot evaluation of perception and reasoning abilities.
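The summary does not give the Soft Discrimination Score in closed form. One plausible instantiation, sketched below purely as an illustration (the function name and its inputs are my assumptions, not the paper's), softmax-normalizes a frozen model's per-option log-likelihoods and reads off the probability mass placed on the correct answer, so no fine-tuning or decoding is needed:

```python
import math

def soft_discrimination_score(option_logliks, correct_idx):
    """Graded preference of a frozen model for the correct option.

    option_logliks: log-likelihood the model assigns to each candidate
    answer (correct answer plus distractors); correct_idx indexes the
    correct one. Returns a value in (0, 1).
    """
    m = max(option_logliks)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in option_logliks]
    return exps[correct_idx] / sum(exps)
```

A score near 1/len(options) indicates chance-level preference, while values approaching 1 indicate the model sharply discriminates the correct answer from the distractors, which is the "nuanced gradation" the metric is meant to track.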

Result: The framework reveals asymmetric development of perceptual and reasoning capabilities in pre-trained MLLMs across factors like data volume, model size, and pretraining strategy, showing that these abilities develop at different rates.

Conclusion: RADAR provides a decomposed perspective on MLLM ability bottlenecks, enabling targeted interventions to advance multimodal models more efficiently. The framework highlights the need to separately assess perception and reasoning during pre-training.

Abstract: Pre-trained Multi-modal Large Language Models (MLLMs) provide a knowledge-rich foundation for post-training by leveraging their inherent perception and reasoning capabilities to solve complex tasks. However, the lack of an efficient evaluation framework impedes the diagnosis of their performance bottlenecks. Current evaluation primarily relies on testing after supervised fine-tuning, which introduces laborious additional training and autoregressive decoding costs. Meanwhile, common pre-training metrics cannot quantify a model’s perception and reasoning abilities in a disentangled manner. Furthermore, existing evaluation benchmarks are typically limited in scale or misaligned with pre-training objectives. Thus, we propose RADAR, an efficient ability-centric evaluation framework for Revealing Asymmetric Development of Abilities in MLLM pRe-training. RADAR involves two key components: (1) Soft Discrimination Score, a novel metric for robustly tracking ability development without fine-tuning, based on quantifying nuanced gradations of the model preference for the correct answer over distractors; and (2) Multi-Modal Mixture Benchmark, a new 15K+ sample benchmark for comprehensively evaluating pre-trained MLLMs’ perception and reasoning abilities in a 0-shot manner, where we unify authoritative benchmark datasets and carefully collect new datasets, extending the evaluation scope and addressing the critical gaps in current benchmarks. With RADAR, we comprehensively reveal the asymmetric development of perceptual and reasoning capabilities in pretrained MLLMs across diverse factors, including data volume, model size, and pretraining strategy. Our RADAR underscores the need for a decomposed perspective on pre-training ability bottlenecks, informing targeted interventions to advance MLLMs efficiently. Our code is publicly available at https://github.com/Nieysh/RADAR.

[117] Robustness of Object Detection of Autonomous Vehicles in Adverse Weather Conditions

Fox Pettersen, Hong Zhu

Main category: cs.CV

TL;DR: A method for evaluating object detection model robustness in autonomous vehicles under adverse weather/lighting conditions using data augmentation to find failure thresholds, with Faster R-CNN showing highest robustness.

DetailsMotivation: As self-driving technology advances, determining safe operational thresholds across varying environmental conditions is critical for public safety, requiring robust evaluation of object detection models under adverse conditions.

Method: Uses data augmentation operators to generate synthetic data simulating adverse weather (fog, rain, snow) and lighting (dark, bright, flaring, shadow) conditions at progressive intensity levels to find the lowest intensity where object detection models fail, measured by Average First Failure Coefficients (AFFC).

Result: Faster R-CNN achieved highest robustness with overall average AFFC of 71.9% across all seven adverse conditions, while YOLO variants showed AFFC values of 43%. Training with synthetic adverse condition data improves robustness but can suffer from diminishing returns and forgetting phenomena if overtrained.

Conclusion: The proposed method is feasible, effective, and efficient for evaluating and comparing object detection model robustness in various adverse operation conditions, with applications for improving autonomous vehicle safety through targeted training.

Abstract: As self-driving technology advances toward widespread adoption, determining safe operational thresholds across varying environmental conditions becomes critical for public safety. This paper proposes a method for evaluating the robustness of object detection ML models in autonomous vehicles under adverse weather conditions. It employs data augmentation operators to generate synthetic data that simulates different severity degrees of the adverse operation conditions at progressive intensity levels to find the lowest intensity of the adverse conditions at which the object detection model fails. The robustness of the object detection model is measured by the average first failure coefficients (AFFC) over the input images in the benchmark. The paper reports an experiment with four object detection models: YOLOv5s, YOLOv11s, Faster R-CNN, and Detectron2, utilising seven data augmentation operators that simulate weather conditions fog, rain, and snow, and lighting conditions of dark, bright, flaring, and shadow. The experiment data show that the method is feasible, effective, and efficient for evaluating and comparing the robustness of object detection models in various adverse operation conditions. In particular, the Faster R-CNN model achieved the highest robustness with an overall average AFFC of 71.9% over all seven adverse conditions, while YOLO variants showed AFFC values of 43%. The method is also applied to assess the impact of model training that targets adverse operation conditions using synthetic data on model robustness. It is observed that such training can improve robustness in adverse conditions but may suffer from diminishing returns and forgetting phenomena (i.e., decline in robustness) if overtrained.

[118] Adaptive Scaling with Geometric and Visual Continuity of completed 3D objects

Jelle Vermandere, Maarten Bassier, Maarten Vergauwen

Main category: cs.CV

TL;DR: A part-aware scaling framework that transforms static completed Signed Distance Fields (SDFs) into editable, structurally coherent objects for flexible manipulation in applications like indoor redesign and digital content creation.

DetailsMotivation: Object completion networks produce static SDFs that cannot be rescaled or deformed without structural distortions, limiting their use in applications requiring flexible object manipulation such as indoor redesign, simulation, and digital content creation.

Method: Starting from SDFs and Texture Fields from completion models, the method performs automatic part segmentation, defines user-controlled scaling zones, and applies smooth interpolation of SDFs, color, and part indices. It also incorporates a repetition-based strategy to handle large-scale deformations while preserving repeating geometric patterns.
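As background for why naive rescaling distorts an SDF: dividing query coordinates by a scale factor moves the surface to the right place but invalidates the stored distance values; for a uniform scale, multiplying the result back by the same factor restores a true distance field. A minimal sketch with a sphere SDF (illustrative only, not the authors' implementation; the paper's per-part scaling zones and smooth interpolation generalize this idea to the non-uniform case, where this simple correction no longer holds exactly):

```python
def sphere_sdf(x, y, z, r=1.0):
    # signed distance to a sphere of radius r centered at the origin
    return (x * x + y * y + z * z) ** 0.5 - r

def scale_sdf(sdf, s):
    """Uniformly scale the shape encoded by `sdf` by factor s.

    Query in the original object's frame (p / s), then rescale the
    returned value by s so the output remains a valid signed distance.
    """
    return lambda x, y, z: s * sdf(x / s, y / s, z / s)
```

For example, `scale_sdf(sphere_sdf, 2.0)` encodes a radius-2 sphere: it evaluates to 0 on the new surface and to the true Euclidean distance (2.0) at a point 2 units outside it. With per-part, non-uniform scales, distances across zone boundaries become inconsistent, which is what the smooth interpolation of SDFs, color, and part indices in the paper is designed to repair.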

Result: Experiments on Matterport3D and ShapeNet objects show the method overcomes the inherent rigidity of completed SDFs and is visually more appealing than global and naive selective scaling, particularly for complex shapes and repetitive structures.

Conclusion: The part-aware scaling framework successfully transforms static completed SDFs into editable, structurally coherent objects, enabling proportional and artifact-free deformation for flexible object manipulation applications.

Abstract: Object completion networks typically produce static Signed Distance Fields (SDFs) that faithfully reconstruct geometry but cannot be rescaled or deformed without introducing structural distortions. This limitation restricts their use in applications requiring flexible object manipulation, such as indoor redesign, simulation, and digital content creation. We introduce a part-aware scaling framework that transforms these static completed SDFs into editable, structurally coherent objects. Starting from SDFs and Texture Fields generated by state-of-the-art completion models, our method performs automatic part segmentation, defines user-controlled scaling zones, and applies smooth interpolation of SDFs, color, and part indices to enable proportional and artifact-free deformation. We further incorporate a repetition-based strategy to handle large-scale deformations while preserving repeating geometric patterns. Experiments on Matterport3D and ShapeNet objects show that our method overcomes the inherent rigidity of completed SDFs and is visually more appealing than global and naive selective scaling, particularly for complex shapes and repetitive structures.

[119] Human-Aligned MLLM Judges for Fine-Grained Image Editing Evaluation: A Benchmark, Framework, and Analysis

Runzhou Liu, Hailey Weingord, Sejal Mittal, Prakhar Dungarwal, Anusha Nandula, Bo Ni, Samyadeep Basu, Hongjie Chen, Nesreen K. Ahmed, Li Li, Jiayi Zhang, Koustava Goswami, Subhojyoti Mukherjee, Branislav Kveton, Puneet Mathur, Franck Dernoncourt, Yue Zhao, Yu Wang, Ryan A. Rossi, Zhengzhong Tu, Hongru Du

Main category: cs.CV

TL;DR: A framework using Multimodal LLMs as judges to evaluate image editing models with 12 fine-grained factors, validated against human judgments.

DetailsMotivation: Traditional image editing evaluation metrics are coarse, lack interpretability, and fail to capture important aspects like controllability, edit localization, and faithfulness to user instructions.

Method: Proposes MLLM-as-a-Judge framework with 12 interpretable factors spanning image preservation, edit quality, and instruction fidelity. Creates human-validated benchmark integrating human judgments, MLLM evaluations, model outputs, and traditional metrics.

Result: MLLM judges align closely with human evaluations at fine granularity, outperforming traditional metrics which often fail to distinguish over-edited or semantically imprecise outputs.

Conclusion: Fine-grained MLLM judges provide a practical foundation for studying, comparing, and improving image editing approaches with more intuitive and informative assessments.

Abstract: Evaluating image editing models remains challenging due to the coarse granularity and limited interpretability of traditional metrics, which often fail to capture aspects important to human perception and intent. Such metrics frequently reward visually plausible outputs while overlooking controllability, edit localization, and faithfulness to user instructions. In this work, we introduce a fine-grained Multimodal Large Language Model (MLLM)-as-a-Judge framework for image editing that decomposes common evaluation notions into twelve fine-grained interpretable factors spanning image preservation, edit quality, and instruction fidelity. Building on this formulation, we present a new human-validated benchmark that integrates human judgments, MLLM-based evaluations, model outputs, and traditional metrics across diverse image editing tasks. Through extensive human studies, we show that the proposed MLLM judges align closely with human evaluations at a fine granularity, supporting their use as reliable and scalable evaluators. We further demonstrate that traditional image editing metrics are often poor proxies for these factors, failing to distinguish over-edited or semantically imprecise outputs, whereas our judges provide more intuitive and informative assessments in both offline and online settings. Together, this work introduces a benchmark, a principled factorization, and empirical evidence positioning fine-grained MLLM judges as a practical foundation for studying, comparing, and improving image editing approaches.

[120] Reliable Thinking with Images

Haobin Li, Yutong Yang, Yijie Lin, Dai Xiang, Mouxing Yang, Xi Peng

Main category: cs.CV

TL;DR: RTWI addresses noisy thinking in multimodal reasoning by estimating reliability of visual cues and textual chains-of-thought, using filtering and voting to prevent error accumulation.

DetailsMotivation: Existing Thinking with Images (TWI) methods assume perfect interleaved image-text reasoning chains, but real-world scenarios often involve noisy/imperfect visual cue mining and reasoning, leading to error accumulation that degrades MLLM performance.

Method: Proposes Reliable Thinking with Images (RTWI) that estimates reliability of both visual cues and textual CoT in a unified text-centric manner, then employs robust filtering and voting modules to prevent noisy thinking from contaminating final answers.
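The exact filtering and voting rules are not given in this summary. A minimal filter-then-vote sketch under that caveat (the threshold, the reliability-weighted tally, and the fallback rule are all my assumptions):

```python
def reliable_vote(candidates, threshold=0.5):
    """Pick a final answer from (answer, reliability) pairs.

    Filtering: drop candidates whose estimated reliability falls below
    the threshold. Voting: sum reliabilities per surviving answer and
    return the heaviest one, so one noisy chain cannot sway the result.
    """
    kept = [(a, r) for a, r in candidates if r >= threshold]
    if not kept:
        # every candidate was filtered: fall back to the single most
        # reliable one rather than returning nothing
        return max(candidates, key=lambda ar: ar[1])[0]
    weights = {}
    for a, r in kept:
        weights[a] = weights.get(a, 0.0) + r
    return max(weights, key=weights.get)
```

The point of the two stages mirrors the NT problem description: filtering stops low-reliability visual cues or CoT steps from entering the tally at all, and weighted voting keeps any surviving noise from dominating the final answer.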

Result: Extensive experiments on seven benchmarks verify RTWI’s effectiveness against noisy thinking problems, showing improved robustness in multimodal reasoning.

Conclusion: Addressing noisy thinking is crucial for reliable multimodal reasoning, and RTWI provides an effective solution by jointly estimating reliability of visual and textual components to prevent error propagation.

Abstract: As a multimodal extension of Chain-of-Thought (CoT), Thinking with Images (TWI) has recently emerged as a promising avenue to enhance the reasoning capability of Multi-modal Large Language Models (MLLMs), which generates interleaved CoT by incorporating visual cues into the textual reasoning process. However, the success of existing TWI methods heavily relies on the assumption that interleaved image-text CoTs are faultless, which is easily violated in real-world scenarios due to the complexity of multimodal understanding. In this paper, we reveal and study a highly practical yet under-explored problem in TWI, termed Noisy Thinking (NT). Specifically, NT refers to an imperfect visual cue mining and answer reasoning process. As the saying goes, "One mistake leads to another": erroneous interleaved CoT would cause error accumulation, thus significantly degrading the performance of MLLMs. To solve the NT problem, we propose a novel method dubbed Reliable Thinking with Images (RTWI). In brief, RTWI estimates the reliability of visual cues and textual CoT in a unified text-centric manner and accordingly employs robust filtering and voting modules to prevent NT from contaminating the final answer. Extensive experiments on seven benchmarks verify the effectiveness of RTWI against NT.

[121] EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition

Xiao Wang, Xingxing Xiong, Jinfeng Gao, Xufeng Lou, Bo Jiang, Si-bao Chen, Yaowei Wang, Yonghong Tian

Main category: cs.CV

TL;DR: EPRBench: A comprehensive benchmark for event stream-based visual place recognition with 10K event sequences, 65K event frames, and LLM-generated scene descriptions, plus a novel multi-modal fusion framework using LLMs for interpretable place recognition.

DetailsMotivation: Current scarcity of dedicated datasets for event stream-based visual place recognition (VPR), which offers advantages over conventional cameras in challenging conditions like low illumination, overexposure, and high-speed motion.

Method: 1) Created EPRBench dataset with 10K event sequences and 65K event frames collected via handheld and vehicle-mounted setups across diverse conditions. 2) Generated LLM-based scene descriptions with human refinement. 3) Proposed multi-modal fusion framework: LLMs generate textual descriptions from event streams, guide token selection, cross-modal feature fusion, and multi-scale representation learning.

Result: Benchmarked 15 state-of-the-art VPR algorithms on EPRBench, providing strong baselines. The proposed multi-modal fusion framework achieves highly accurate place recognition with interpretable reasoning processes, enhancing model transparency and explainability.

Conclusion: EPRBench addresses the dataset scarcity in event-based VPR and enables semantic-aware, language-integrated research. The LLM-based multi-modal fusion paradigm advances both accuracy and interpretability in event stream-based place recognition.

Abstract: Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventional visible-light cameras under challenging conditions such as low illumination, overexposure, and high-speed motion. Recognizing the current scarcity of dedicated datasets in this domain, we introduce EPRBench, a high-quality benchmark specifically designed for event stream-based VPR. EPRBench comprises 10K event sequences and 65K event frames, collected using both handheld and vehicle-mounted setups to comprehensively capture real-world challenges across diverse viewpoints, weather conditions, and lighting scenarios. To support semantic-aware and language-integrated VPR research, we provide LLM-generated scene descriptions, subsequently refined through human annotation, establishing a solid foundation for integrating LLMs into event-based perception pipelines. To facilitate systematic evaluation, we implement and benchmark 15 state-of-the-art VPR algorithms on EPRBench, offering a strong baseline for future algorithmic comparisons. Furthermore, we propose a novel multi-modal fusion paradigm for VPR: leveraging LLMs to generate textual scene descriptions from raw event streams, which then guide spatially attentive token selection, cross-modal feature fusion, and multi-scale representation learning. This framework not only achieves highly accurate place recognition but also produces interpretable reasoning processes alongside its predictions, significantly enhancing model transparency and explainability. The dataset and source code will be released on https://github.com/Event-AHU/Neuromorphic_ReID

[122] Beyond Benchmarks of IUGC: Rethinking Requirements of Deep Learning Methods for Intrapartum Ultrasound Biometry from Fetal Ultrasound Videos

Jieyun Bai, Zihao Zhou, Yitong Tang, Jie Gan, Zhuonan Liang, Jianan Fan, Lisa B. Mcguire, Jillian L. Clarke, Weidong Cai, Jacaueline Spurway, Yubo Tang, Shiye Wang, Wenda Shen, Wangwang Yu, Yihao Li, Philippe Zhang, Weili Jiang, Yongjie Li, Salem Muhsin Ali Binqahal Al Nasim, Arsen Abzhanov, Numan Saeed, Mohammad Yaqub, Zunhui Xian, Hongxing Lin, Libin Lan, Jayroop Ramesh, Valentin Bacher, Mark Eid, Hoda Kalabizadeh, Christian Rupprecht, Ana I. L. Namburete, Pak-Hei Yeung, Madeleine K. Wyburd, Nicola K. Dinsdale, Assanali Serikbey, Jiankai Li, Sung-Liang Chen, Zicheng Hu, Nana Liu, Yian Deng, Wei Hu, Cong Tan, Wenfeng Zhang, Mai Tuyet Nhi, Gregor Koehler, Rapheal Stock, Klaus Maier-Hein, Marawan Elbatel, Xiaomeng Li, Saad Slimani, Victor M. Campello, Benard Ohene-Botwe, Isaac Khobo, Yuxin Huang, Zhenyan Han, Hongying Hou, Di Qiu, Zheng Zheng, Gongning Luo, Dong Ni, Yaosheng Lu, Karim Lekadir, Shuo Li

Main category: cs.CV

TL;DR: The Intrapartum Ultrasound Grand Challenge (IUGC) introduces a multi-task framework for automatic ultrasound measurement during labor, addressing maternal and neonatal mortality in resource-limited settings through AI-assisted ultrasound analysis.

DetailsMotivation: High maternal/neonatal mortality during intrapartum phase (45% of deaths), especially in low-resource settings where ultrasound expertise is scarce. Need for automated solutions to enable ultrasound monitoring without trained sonographers.

Method: Multi-task framework integrating standard plane classification, fetal head-pubic symphysis segmentation, and biometry. Released largest multi-center intrapartum ultrasound video dataset (774 videos, 68,106 frames). Challenge design with analysis of 8 teams’ approaches across preprocessing, augmentation, learning strategies, architecture, and post-processing.

Result: Encouraging performance achieved but field remains at early stage. Comprehensive benchmark solutions and complete dataset publicly released for reproducible research.

Conclusion: Automatic intrapartum ultrasound biometry shows promise but requires further investigation before clinical deployment. Public release of data and methods aims to accelerate research in this critical healthcare area.

Abstract: A substantial proportion (45%) of maternal deaths, neonatal deaths, and stillbirths occur during the intrapartum phase, with a particularly high burden in low- and middle-income countries. Intrapartum biometry plays a critical role in monitoring labor progression; however, the routine use of ultrasound in resource-limited settings is hindered by a shortage of trained sonographers. To address this challenge, the Intrapartum Ultrasound Grand Challenge (IUGC), co-hosted with MICCAI 2024, was launched. The IUGC introduces a clinically oriented multi-task automatic measurement framework that integrates standard plane classification, fetal head-pubic symphysis segmentation, and biometry, enabling algorithms to exploit complementary task information for more accurate estimation. Furthermore, the challenge releases the largest multi-center intrapartum ultrasound video dataset to date, comprising 774 videos (68,106 frames) collected from three hospitals, providing a robust foundation for model training and evaluation. In this study, we present a comprehensive overview of the challenge design, review the submissions from eight participating teams, and analyze their methods from five perspectives: preprocessing, data augmentation, learning strategy, model architecture, and post-processing. In addition, we perform a systematic analysis of the benchmark results to identify key bottlenecks, explore potential solutions, and highlight open challenges for future research. Although encouraging performance has been achieved, our findings indicate that the field remains at an early stage, and further in-depth investigation is required before large-scale clinical deployment. All benchmark solutions and the complete dataset have been publicly released to facilitate reproducible research and promote continued advances in automatic intrapartum ultrasound biometry.

[123] Deep-Learning Atlas Registration for Melanoma Brain Metastases: Preserving Pathology While Enabling Cohort-Level Analyses

Nanna E. Wielenberg, Ilinca Popp, Oliver Blanck, Lucas Zander, Jan C. Peeken, Stephanie E. Combs, Anca-Ligia Grosu, Dimos Baltas, Tobias Fechter

Main category: cs.CV

TL;DR: A deep learning deformable registration framework for aligning pathological brains with metastases to a common atlas without requiring lesion masks, enabling standardized spatial analysis of melanoma brain metastases.

DetailsMotivation: Melanoma brain metastases (MBM) are spatially heterogeneous and difficult to analyze at cohort level due to anatomical variability and differing MRI protocols across centers. Existing registration methods struggle with pathological brains containing lesions that disrupt anatomical correspondences.

Method: A fully differentiable deep learning deformable registration framework that handles missing anatomical correspondences caused by metastases using a forward-model similarity metric based on distance-transformed anatomical labels, combined with volume-preserving regularization for plausible deformations.

Result: Achieved high registration accuracy (DSC 0.89-0.92, HD 6.79-7.60 mm, ASSD 0.63-0.77 mm) while preserving metastatic volumes. Spatial analysis revealed significant over-representation of MBM in cerebral cortex and putamen, under-representation in white matter, and consistent localization near gray-white matter junction.

Conclusion: The framework enables robust atlas registration of pathological brain MRI without lesion masks and supports reproducible multi-centre analyses. It confirms preferential seeding of MBM near gray-white matter junction and cortical regions, with public implementation available for extension to other brain pathologies.

Abstract: Melanoma brain metastases (MBM) are common and spatially heterogeneous lesions, complicating cohort-level analyses due to anatomical variability and differing MRI protocols. We propose a fully differentiable, deep-learning-based deformable registration framework that aligns individual pathological brains to a common atlas while preserving metastatic tissue without requiring lesion masks or preprocessing. Missing anatomical correspondences caused by metastases are handled through a forward-model similarity metric based on distance-transformed anatomical labels, combined with a volume-preserving regularization term to ensure deformation plausibility. Registration performance was evaluated using Dice coefficient (DSC), Hausdorff distance (HD), average symmetric surface distance (ASSD), and Jacobian-based measures. The method was applied to 209 MBM patients from three centres, enabling standardized mapping of metastases to anatomical, arterial, and perfusion atlases. The framework achieved high registration accuracy across datasets (DSC 0.89-0.92, HD 6.79-7.60 mm, ASSD 0.63-0.77 mm) while preserving metastatic volumes. Spatial analysis demonstrated significant over-representation of MBM in the cerebral cortex and putamen, under-representation in white matter, and consistent localization near the gray-white matter junction. No arterial territory showed increased metastasis frequency after volume correction. This approach enables robust atlas registration of pathological brain MRI without lesion masks and supports reproducible multi-centre analyses. Applied to MBM, it confirms and refines known spatial predilections, particularly preferential seeding near the gray-white matter junction and cortical regions. The publicly available implementation facilitates reproducible research and extension to other brain tumours and neurological pathologies.

[124] Unleashing MLLMs on the Edge: A Unified Framework for Cross-Modal ReID via Adaptive SVD Distillation

Hongbo Jiang, Jie Li, Xinqi Cai, Tianyu Xie, Yunhang Shen, Pingyang Dai, Liujuan Cao

Main category: cs.CV

TL;DR: MLLMEmbed-ReID: A unified cloud-edge framework that adapts MLLMs for cross-modal re-identification across RGB, infrared, sketch, and text modalities using instruction prompting and hierarchical LoRA-SFT, then distills knowledge to edge devices via principal component mapping and feature relation losses.

DetailsMotivation: Addresses challenges in practical cloud-edge deployment of cross-modal re-identification by overcoming the fragmented ecosystem of specialized cloud models and lack of effective MLLM adaptation and knowledge distillation strategies for edge deployment.

Method: 1) Adapts foundational MLLM into cloud model using instruction-based prompting to generate unified embedding space across multiple modalities, trained with hierarchical LoRA-SFT under cross-modal alignment objective. 2) Distills knowledge to edge-native student using novel distillation strategy with Principal Component Mapping loss (prioritizes essential information) and Feature Relation loss (preserves relational structures).
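
The relation-preserving half of the distillation can be sketched as follows. This is our reading of "preserves relational structures", not the paper's code: the student is penalized whenever its pairwise cosine-similarity matrix over a batch of features drifts from the teacher's. All names and the toy features are illustrative.

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def relation_loss(teacher_feats, student_feats):
    # Mean squared difference between the teacher's and student's
    # pairwise cosine-similarity matrices (upper triangle only).
    n = len(teacher_feats)
    loss, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d = (cosine(teacher_feats[i], teacher_feats[j])
                 - cosine(student_feats[i], student_feats[j]))
            loss += d * d
            pairs += 1
    return loss / pairs
```

Because only relative angles are compared, the loss is zero for any per-sample rescaling of the student features; the Principal Component Mapping loss (which would additionally require an SVD of the teacher's feature matrix) is omitted from this sketch.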

Result: Lightweight edge-based model achieves state-of-the-art performance on multiple visual CM-ReID benchmarks, while cloud-based counterpart excels across all CM-ReID benchmarks. Framework enables unified MLLM-level intelligence on resource-constrained devices.

Conclusion: MLLMEmbed-ReID presents a complete and effective solution for deploying unified MLLM-level intelligence on resource-constrained devices, bridging the gap between powerful cloud models and practical edge deployment for cross-modal re-identification.

Abstract: Practical cloud-edge deployment of Cross-Modal Re-identification (CM-ReID) faces challenges due to maintaining a fragmented ecosystem of specialized cloud models for diverse modalities. While Multi-Modal Large Language Models (MLLMs) offer strong unification potential, existing approaches fail to adapt them into a single end-to-end backbone and lack effective knowledge distillation strategies for edge deployment. To address these limitations, we propose MLLMEmbed-ReID, a unified framework based on a powerful cloud-edge architecture. First, we adapt a foundational MLLM into a state-of-the-art cloud model. We leverage instruction-based prompting to guide the MLLM in generating a unified embedding space across RGB, infrared, sketch, and text modalities. This model is then trained efficiently with a hierarchical Low-Rank Adaptation finetuning (LoRA-SFT) strategy, optimized under a holistic cross-modal alignment objective. Second, to deploy its knowledge onto an edge-native student, we introduce a novel distillation strategy motivated by the low-rank property in the teacher’s feature space. To prioritize essential information, this method employs a Principal Component Mapping loss, while relational structures are preserved via a Feature Relation loss. Our lightweight edge-based model achieves state-of-the-art performance on multiple visual CM-ReID benchmarks, while its cloud-based counterpart excels across all CM-ReID benchmarks. The MLLMEmbed-ReID framework thus presents a complete and effective solution for deploying unified MLLM-level intelligence on resource-constrained devices. The code and models will be open-sourced soon.

[125] Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding

Wenhui Liao, Hongliang Li, Pengyu Xie, Xinyu Cai, Yufan Shen, Yi Xin, Qi Qin, Shenglong Ye, Tianbin Li, Ming Hu, Junjun He, Yihao Liu, Wenhai Wang, Min Dou, Bin Fu, Botian Shi, Yu Qiao, Lianwen Jin

Main category: cs.CV

TL;DR: A training-free acceleration method for VLM-based document parsing that uses a lightweight draft model and parallel verification to reduce inference latency for long-form documents.

DetailsMotivation: VLM-based end-to-end document parsing models suffer from substantial inference latency due to auto-regressive generation of long token sequences when processing long-form documents with complex layouts.

Method: Proposes a speculative decoding-inspired approach using a lightweight document parsing pipeline as draft model to predict future tokens, while the accurate VLM verifies predictions in parallel. Further exploits document structure by partitioning pages into independent regions for parallel decoding with draft-verify strategy, then assembling predictions in reading order.
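
The draft-verify idea can be sketched with toy stand-ins: here `draft_next` plays the lightweight parsing pipeline and `target_next` the accurate VLM, both reduced to deterministic token rules for illustration (neither is from the paper).

```python
def target_next(prefix):
    # Toy "accurate" model: next token is the sum of the prefix mod 10.
    return sum(prefix) % 10

def draft_next(prefix):
    # Toy cheap draft model: often agrees with the target, but not always.
    return (prefix[-1] * 2) % 10 if prefix else 0

def speculative_decode(seed, n_tokens, k=4):
    seq = list(seed)
    while len(seq) < len(seed) + n_tokens:
        # Draft k tokens autoregressively with the cheap model.
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Verify the k positions "in parallel" with the accurate model:
        # accept the longest matching prefix, then take its correction.
        accepted, ctx = [], list(seq)
        for t in proposal:
            expected = target_next(ctx)
            if t != expected:
                accepted.append(expected)  # target's correction on mismatch
                break
            accepted.append(t)
            ctx.append(t)
        seq.extend(accepted)
    return seq[:len(seed) + n_tokens]
```

Greedy verification of this form is lossless: the output is identical to decoding with the accurate model alone, only faster whenever the draft's guesses match.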

Result: Achieves 2.42x lossless acceleration for dots.ocr model on OmniDocBench, and up to 4.89x acceleration on long-document parsing tasks.

Conclusion: The proposed training-free acceleration method effectively reduces inference latency for VLM-based document parsing while maintaining accuracy, with significant speedups demonstrated on benchmark tasks.

Abstract: Document parsing is a fundamental task in multimodal understanding, supporting a wide range of downstream applications such as information extraction and intelligent document analysis. Benefiting from strong semantic modeling and robust generalization, VLM-based end-to-end approaches have emerged as the mainstream paradigm in recent years. However, these models often suffer from substantial inference latency, as they must auto-regressively generate long token sequences when processing long-form documents. In this work, motivated by the extremely long outputs and complex layout structures commonly found in document parsing, we propose a training-free and highly efficient acceleration method. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens, while the more accurate VLM verifies these draft predictions in parallel. Moreover, we further exploit the layout-structured nature of documents by partitioning each page into independent regions, enabling parallel decoding of each region using the same draft-verify strategy. The final predictions are then assembled according to the natural reading order. Experimental results demonstrate the effectiveness of our approach: on the general-purpose OmniDocBench, our method provides a 2.42x lossless acceleration for the dots.ocr model, and achieves up to 4.89x acceleration on long-document parsing tasks. We will release our code to facilitate reproducibility and future research.

[126] CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu

Main category: cs.CV

TL;DR: VideoLM approach using video codec primitives (motion vectors & residuals) instead of full-image encoding to reduce computational overhead while maintaining or improving video understanding performance.

DetailsMotivation: Current VideoLMs use keyframe sampling which misses both macro-level events and micro-level details due to sparse temporal coverage, and processing full images for each frame incurs substantial computational overhead.

Method: Leverage video codec primitives (motion vectors and residuals) that natively encode video redundancy and sparsity without expensive full-image encoding. Introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through pre-training.

Result: Reduces time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Maintains or exceeds performance on 14 diverse video understanding benchmarks spanning general QA, temporal reasoning, long-form understanding, and spatial scene understanding.

Conclusion: Using video codec primitives provides an efficient alternative to full-image encoding for VideoLMs, significantly reducing computational costs while preserving or enhancing video understanding capabilities across diverse tasks.

Abstract: Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit to the maximum context window constraint, current methods use keyframe sampling which can miss both macro-level events and micro-level details due to the sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals) which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities we are able to maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.

[127] Detecting Object Tracking Failure via Sequential Hypothesis Testing

Alejandro Monroy Muñoz, Rajeev Verma, Alexander Timans

Main category: cs.CV

TL;DR: Sequential hypothesis testing for object tracking safety assurance using e-processes to detect failures with statistical guarantees

DetailsMotivation: Current object tracking systems lack formal safety assurances about when tracking is reliable vs. when it may fail, relying only on heuristic confidence measures. There's a need for statistically grounded failure detection to enable timely interventions.

Method: Interpret object tracking as sequential hypothesis test using e-processes to accumulate evidence for/against tracking failures over time. Propose both supervised (using ground-truth) and unsupervised (using internal tracking information) variants. Approach is model-agnostic, requires no extra training, and is computationally lightweight.
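
The e-process mechanism itself is simple to sketch, assuming per-frame e-values are already available (e.g., likelihood ratios of "failure" vs. "tracking OK"; how the paper constructs them is not reproduced here):

```python
def e_process_alarm(e_values, alpha=0.05):
    """Multiply e-values over time; alarm once wealth crosses 1/alpha.

    Under the null (tracking OK) each e-value has expectation <= 1, so the
    running product is a nonnegative supermartingale and Ville's inequality
    bounds the probability of ever crossing 1/alpha by alpha.
    """
    wealth = 1.0
    for t, e in enumerate(e_values):
        wealth *= e
        if wealth >= 1.0 / alpha:
            return t  # first frame at which the test alarms
    return None  # no alarm raised
```

With uninformative evidence (e-values of 1) the wealth never grows and no alarm fires; sustained evidence for failure (e-values above 1) triggers an alarm quickly while the false-alarm rate stays below alpha.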

Result: Demonstrated effectiveness for two established tracking models across four video benchmarks. The method quickly identifies tracking failures while provably containing false alerts at desired rates, limiting costly re-calibration or intervention steps.

Conclusion: Sequential testing offers statistically grounded and efficient mechanism to incorporate safety assurances into real-time tracking systems, providing formal reliability guarantees that current heuristic approaches lack.

Abstract: Real-time online object tracking in videos constitutes a core task in computer vision, with wide-ranging applications including video surveillance, motion capture, and robotics. Deployed tracking systems usually lack formal safety assurances to convey when tracking is reliable and when it may fail, at best relying on heuristic measures of model confidence to raise alerts. To obtain such assurances we propose interpreting object tracking as a sequential hypothesis test, wherein evidence for or against tracking failures is gradually accumulated over time. Leveraging recent advancements in the field, our sequential test (formalized as an e-process) quickly identifies when tracking failures set in whilst provably containing false alerts at a desired rate, and thus limiting potentially costly re-calibration or intervention steps. The approach is computationally light-weight, requires no extra training or fine-tuning, and is in principle model-agnostic. We propose both supervised and unsupervised variants by leveraging either ground-truth or solely internal tracking information, and demonstrate its effectiveness for two established tracking models across four video benchmarks. As such, sequential testing can offer a statistically grounded and efficient mechanism to incorporate safety assurances into real-time tracking systems.

[128] MASAR: Motion-Appearance Synergy Refinement for Joint Detection and Trajectory Forecasting

Mohammed Amine Bencheikh Lehocine, Julian Schmidt, Frank Moosmann, Dikshant Gupta, Fabian Flohr

Main category: cs.CV

TL;DR: MASAR is a fully differentiable framework for joint 3D detection and trajectory forecasting that integrates appearance and motion features through object-centric spatio-temporal encoding, improving autonomous driving prediction accuracy.

DetailsMotivation: Traditional autonomous driving systems use separate perception and prediction modules with hand-crafted interfaces, limiting information flow and propagating errors. Recent end-to-end approaches fail to fully exploit synergy between appearance and motion cues, relying mainly on short-term visual features.

Method: MASAR uses an object-centric spatio-temporal mechanism that jointly encodes appearance and motion features. It predicts past trajectories and refines them using appearance cues to capture long-term temporal dependencies, compatible with any transformer-based 3D detector.

Result: Experiments on nuScenes dataset show improvements of over 20% in minADE and minFDE metrics while maintaining robust detection performance.

Conclusion: MASAR effectively integrates appearance and motion features for joint 3D detection and trajectory forecasting, demonstrating significant improvements in autonomous driving prediction tasks.

Abstract: Classical autonomous driving systems connect perception and prediction modules via hand-crafted bounding-box interfaces, limiting information flow and propagating errors to downstream tasks. Recent research aims to develop end-to-end models that jointly address perception and prediction; however, they often fail to fully exploit the synergy between appearance and motion cues, relying mainly on short-term visual features. We follow the idea of “looking backward to look forward”, and propose MASAR, a novel fully differentiable framework for joint 3D detection and trajectory forecasting compatible with any transformer-based 3D detector. MASAR employs an object-centric spatio-temporal mechanism that jointly encodes appearance and motion features. By predicting past trajectories and refining them using guidance from appearance cues, MASAR captures long-term temporal dependencies that enhance future trajectory forecasting. Experiments conducted on the nuScenes dataset demonstrate MASAR’s effectiveness, showing improvements of over 20% in minADE and minFDE while maintaining robust detection performance. Code and models are available at https://github.com/aminmed/MASAR.

[129] Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions

Yunheng Li, Hengrui Zhang, Meng-Hao Guo, Wenzhao Gao, Shaoyong Jia, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng

Main category: cs.CV

TL;DR: ASID introduces a million-scale structured audiovisual instruction dataset, verification pipeline, and captioning model for fine-grained video understanding.

DetailsMotivation: Existing video understanding models are limited by coarse, incomplete video-instruction data that lacks fine-grained organization and reliable annotation for complex audiovisual content.

Method: Three components: (1) ASID-1M - open-source collection of 1M structured fine-grained audiovisual instruction annotations, (2) ASID-Verify - scalable data curation pipeline with automatic verification and refinement for semantic/temporal consistency, (3) ASID-Captioner - video understanding model trained via SFT on ASID-1M.

Result: ASID-Captioner achieves state-of-the-art performance among open-source models, competitive with Gemini-3-Pro, improves fine-grained caption quality, reduces hallucinations, and enhances instruction following across seven benchmarks covering audiovisual captioning, attribute-wise captioning, caption-based QA, and temporal grounding.

Conclusion: The structured fine-grained audiovisual instruction data and verification pipeline enable significant improvements in video understanding models for complex real-world scenarios.

Abstract: Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is primarily constrained by video-instruction data that represents complex audiovisual content as single, incomplete descriptions, lacking fine-grained organization and reliable annotation. To address this, we introduce: (i) ASID-1M, an open-source collection of one million structured, fine-grained audiovisual instruction annotations with single- and multi-attribute supervision; (ii) ASID-Verify, a scalable data curation pipeline for annotation, with automatic verification and refinement that enforces semantic and temporal consistency between descriptions and the corresponding audiovisual content; and (iii) ASID-Captioner, a video understanding model trained via Supervised Fine-Tuning (SFT) on the ASID-1M. Experiments across seven benchmarks covering audiovisual captioning, attribute-wise captioning, caption-based QA, and caption-based temporal grounding show that ASID-Captioner improves fine-grained caption quality while reducing hallucinations and improving instruction following. It achieves state-of-the-art performance among open-source models and is competitive with Gemini-3-Pro.

[130] Multimodal Classification via Total Correlation Maximization

Feng Yu, Xiangyu Wu, Yang Yang, Jianfeng Lu

Main category: cs.CV

TL;DR: TCMax: A multimodal classification method that maximizes total correlation between multimodal features and labels to alleviate modality competition and capture inter-modal interactions.

DetailsMotivation: Multimodal learning often suffers from modality competition where joint learning overfits certain modalities while neglecting others, leading to worse performance than unimodal learning. Previous approaches haven't sufficiently examined this issue from an information-theoretic perspective.

Method: Proposes TCMax, a hyperparameter-free loss function that maximizes total correlation between multimodal features and labels. Introduces Total Correlation Neural Estimation (TCNE) to derive a lower bound for total correlation, building on Mutual Information Neural Estimation (MINE). Uses feature alignment to capture inter-modal interactions while alleviating modality competition.
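
The estimator family behind MINE (and, by extension, TCNE) is the Donsker-Varadhan lower bound, which can be computed for any fixed critic function. The sketch below uses a hand-picked toy critic instead of a trained network, so it shows the bound's shape rather than the paper's training procedure.

```python
import math

def dv_lower_bound(joint, marginal, critic):
    # Donsker-Varadhan bound: E_P[T(x, y)] over joint samples minus
    # log E_Q[exp(T(x, y))] over product-of-marginals samples.
    e_joint = sum(critic(x, y) for x, y in joint) / len(joint)
    log_e_marg = math.log(
        sum(math.exp(critic(x, y)) for x, y in marginal) / len(marginal)
    )
    return e_joint - log_e_marg
```

A critic that scores joint samples higher than mismatched ones yields a positive bound; MINE maximizes this quantity over critic parameters, and TCNE generalizes the construction from mutual information to total correlation over multiple modalities.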

Result: Extensive experiments show TCMax outperforms state-of-the-art joint and unimodal learning approaches in multimodal classification tasks.

Conclusion: The information-theoretic approach of maximizing total correlation effectively addresses modality competition in multimodal learning, leading to superior performance compared to existing methods.

Abstract: Multimodal learning integrates data from diverse sensors to effectively harness information from different modalities. However, recent studies reveal that joint learning often overfits certain modalities while neglecting others, leading to performance inferior to that of unimodal learning. Although previous efforts have sought to balance modal contributions or combine joint and unimodal learning, thereby mitigating the degradation of weaker modalities with promising outcomes, few have examined the relationship between joint and unimodal learning from an information-theoretic perspective. In this paper, we theoretically analyze modality competition and propose a method for multimodal classification by maximizing the total correlation between multimodal features and labels. By maximizing this objective, our approach alleviates modality competition while capturing inter-modal interactions via feature alignment. Building on Mutual Information Neural Estimation (MINE), we introduce Total Correlation Neural Estimation (TCNE) to derive a lower bound for total correlation. Subsequently, we present TCMax, a hyperparameter-free loss function that maximizes total correlation through variational bound optimization. Extensive experiments demonstrate that TCMax outperforms state-of-the-art joint and unimodal learning approaches. Our code is available at https://github.com/hubaak/TCMax.

[131] DynaGuide: A Generalizable Dynamic Guidance Framework for Unsupervised Semantic Segmentation

Boujemaa Guermazi, Riadh Ksantini, Naimul Khan

Main category: cs.CV

TL;DR: DynaGuide is an adaptive unsupervised image segmentation framework that combines global pseudo-labels from zero-shot models with local boundary refinement using a lightweight CNN, achieving state-of-the-art performance without ground-truth labels.

DetailsMotivation: Unsupervised image segmentation is crucial for dense scene understanding without human annotations, especially in domains with scarce labeled data. Existing methods struggle to balance global semantic structure with fine-grained boundary accuracy.

Method: DynaGuide uses a dual-guidance strategy: global pseudo-labels from zero-shot models (DiffSeg/SegFormer) combined with local boundary refinement via a lightweight CNN trained from scratch. It employs a multi-component loss that dynamically balances feature similarity, Huber-smoothed spatial continuity (including diagonal relationships), and semantic alignment with global pseudo-labels.
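
The Huber-smoothed continuity term with diagonal neighbours can be sketched over a 2D feature grid; the grid values and Huber delta below are illustrative, not DynaGuide's actual settings.

```python
def huber(x, delta=1.0):
    # Huber penalty: quadratic near zero, linear beyond delta.
    a = abs(x)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def continuity_loss(grid, delta=1.0):
    # Average Huber penalty over horizontal, vertical, and both
    # diagonal neighbour pairs in a 2D grid.
    h, w = len(grid), len(grid[0])
    offsets = [(0, 1), (1, 0), (1, 1), (1, -1)]  # right, down, two diagonals
    total, count = 0.0, 0
    for i in range(h):
        for j in range(w):
            for di, dj in offsets:
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    total += huber(grid[i][j] - grid[ni][nj], delta)
                    count += 1
    return total / count
```

A spatially constant grid incurs zero penalty, while sharp value changes are penalized; the Huber smoothing keeps the gradient bounded at genuine object boundaries so they are not over-smoothed.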

Result: Extensive experiments on BSD500, PASCAL VOC2012, and COCO show state-of-the-art performance: 17.5% mIoU improvement on BSD500, 3.1% on PASCAL VOC2012, and 11.66% on COCO. The framework trains without ground-truth labels and supports plug-and-play integration of diverse guidance sources.

Conclusion: DynaGuide offers a scalable, practical solution for unsupervised segmentation with modular design, strong generalization, and minimal computational footprint, addressing the challenge of reconciling global semantics with local boundary accuracy.

Abstract: Unsupervised image segmentation is a critical task in computer vision. It enables dense scene understanding without human annotations, which is especially valuable in domains where labelled data is scarce. However, existing methods often struggle to reconcile global semantic structure with fine-grained boundary accuracy. This paper introduces DynaGuide, an adaptive segmentation framework that addresses these challenges through a novel dual-guidance strategy and dynamic loss optimization. Building on our previous work, DynaSeg, DynaGuide combines global pseudo-labels from zero-shot models such as DiffSeg or SegFormer with local boundary refinement using a lightweight CNN trained from scratch. This synergy allows the model to correct coarse or noisy global predictions and produce high-precision segmentations. At the heart of DynaGuide is a multi-component loss that dynamically balances feature similarity, Huber-smoothed spatial continuity, including diagonal relationships, and semantic alignment with the global pseudo-labels. Unlike prior approaches, DynaGuide trains entirely without ground-truth labels in the target domain and supports plug-and-play integration of diverse guidance sources. Extensive experiments on BSD500, PASCAL VOC2012, and COCO demonstrate that DynaGuide achieves state-of-the-art performance, improving mIoU by 17.5% on BSD500, 3.1% on PASCAL VOC2012, and 11.66% on COCO. With its modular design, strong generalization, and minimal computational footprint, DynaGuide offers a scalable and practical solution for unsupervised segmentation in real-world settings. Code available at: https://github.com/RyersonMultimediaLab/DynaGuide

[132] Learning Image-based Tree Crown Segmentation from Enhanced Lidar-based Pseudo-labels

Julius Pesonen, Stefan Rua, Josef Taher, Niko Koivumäki, Xiaowei Yu, Eija Honkavaara

Main category: cs.CV

TL;DR: A method for training deep learning models to segment individual tree crowns from aerial imagery using pseudo-labels from aerial laser scanning enhanced by SAM 2, achieving better performance than general domain models without manual annotation.

DetailsMotivation: Individual tree crown mapping is essential for urban tree inventories and forest health monitoring, but automatic separation in aerial imagery is challenging due to texture variations and crown overlaps. Manual annotation is costly and time-consuming.

Method: Uses pseudo-labels derived from aerial laser scanning (ALS) data to train deep learning models for segmenting individual trees from RGB and multispectral images. Enhances ALS-derived pseudo-labels using the zero-shot instance segmentation model Segment Anything Model 2 (SAM 2).

Result: The method produces segmentation models that outperform available general domain models on the same task, demonstrating that domain-specific training annotations can be obtained without manual annotation cost.

Conclusion: ALS-derived pseudo-labels enhanced by SAM 2 provide an effective way to train optical image-based segmentation models for individual tree crown mapping without manual annotation, achieving superior performance compared to general domain models.

Abstract: Mapping individual tree crowns is essential for tasks such as maintaining urban tree inventories and monitoring forest health, which help us understand and care for our environment. However, automatically separating the crowns from each other in aerial imagery is challenging due to factors such as the texture and partial tree crown overlaps. In this study, we present a method to train deep learning models that segment and separate individual trees from RGB and multispectral images, using pseudo-labels derived from aerial laser scanning (ALS) data. Our study shows that the ALS-derived pseudo-labels can be enhanced using a zero-shot instance segmentation model, Segment Anything Model 2 (SAM 2). Our method offers a way to obtain domain-specific training annotations for optical image-based models without any manual annotation cost, yielding segmentation models that outperform available general-domain models on the same task.

[133] FedHENet: A Frugal Federated Learning Framework for Heterogeneous Environments

Alejandro Dopico-Castro, Oscar Fontenla-Romero, Bertha Guijarro-Berdiñas, Amparo Alonso-Betanzos, Iván Pérez Digón

Main category: cs.CV

TL;DR: FedHENet extends FedHEONN framework for image classification using fixed pre-trained feature extractors and single-layer learning with homomorphic encryption, achieving competitive accuracy with better energy efficiency and privacy.

DetailsMotivation: Federated Learning enables collaborative training without centralizing sensitive visual data, but traditional approaches require expensive iterative optimization and risk privacy through gradient sharing. The authors aim to develop a more efficient, privacy-preserving method that avoids costly fine-tuning and hyperparameter tuning.

Method: FedHENet uses a fixed, pre-trained feature extractor and learns only a single output layer. Client knowledge is aggregated analytically in a single communication round using homomorphic encryption (HE), avoiding iterative local fine-tuning and making the method hyperparameter-free.
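
The single-round analytic aggregation can be sketched without the encryption layer: when only a linear output layer is fit by least squares, each client needs to share only sufficient statistics (Gram matrix and cross-moment), which the server sums and solves once. The scalar-feature version below is our simplification, not FedHEONN's exact formulation, and HE would be applied to the shared statistics.

```python
def client_stats(xs, ys):
    # Each client computes its local sufficient statistics
    # (here scalar: sum x^2 and sum x*y) on frozen extractor features xs.
    return sum(x * x for x in xs), sum(x * y for x, y in zip(xs, ys))

def server_aggregate(stats):
    # Summing statistics across clients and solving once is exactly
    # equivalent to least squares on the pooled data.
    gram = sum(g for g, _ in stats)
    moment = sum(m for _, m in stats)
    return moment / gram  # closed-form weight, no iterative optimization
```

Because the aggregation is a sum, it commutes with additively homomorphic encryption, which is what lets the server combine encrypted client statistics without seeing them.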

Result: Experiments show FedHENet achieves competitive accuracy compared to iterative FL baselines while demonstrating greater stability and up to 70% better energy efficiency. The method is hyperparameter-free, eliminating the carbon footprint associated with hyperparameter tuning.

Conclusion: FedHENet provides an efficient, privacy-preserving federated learning approach for image classification that achieves competitive performance with significant energy savings and eliminates hyperparameter tuning overhead.

Abstract: Federated Learning (FL) enables collaborative training without centralizing data, essential for privacy compliance in real-world scenarios involving sensitive visual information. Most FL approaches rely on expensive, iterative deep network optimization, which still risks privacy via shared gradients. In this work, we propose FedHENet, extending the FedHEONN framework to image classification. By using a fixed, pre-trained feature extractor and learning only a single output layer, we avoid costly local fine-tuning. This layer is learned by analytically aggregating client knowledge in a single round of communication using homomorphic encryption (HE). Experiments show that FedHENet achieves competitive accuracy compared to iterative FL baselines while demonstrating superior stability performance and up to 70% better energy efficiency. Crucially, our method is hyperparameter-free, removing the carbon footprint associated with hyperparameter tuning in standard FL. Code available at https://github.com/AlejandroDopico2/FedHENet/

[134] Implicit-Scale 3D Reconstruction for Multi-Food Volume Estimation from Monocular Images

Yuhao Chen, Gautham Vinod, Siddeshwar Raghavan, Talha Ibn Mahmud, Bruce Coburn, Jinge Ma, Fengqing Zhu, Jiangpeng He

Main category: cs.CV

TL;DR: A benchmark dataset for implicit-scale 3D reconstruction from monocular multi-food images to advance geometry-based food portion estimation, addressing scale ambiguity in dietary assessment.

DetailsMotivation: Existing dietary assessment methods rely on single-image analysis or appearance-based inference (including vision-language models) which lack explicit geometric reasoning and are sensitive to scale ambiguity. There's a need for geometry-based approaches that can handle real-world dining scenarios without explicit physical references.

Method: The benchmark reframes food portion estimation as an implicit-scale 3D reconstruction problem under monocular observations. The dataset removes explicit physical references and metric annotations, instead providing contextual objects (plates, utensils) requiring algorithms to infer scale from implicit cues and prior knowledge. It emphasizes multi-food scenes with diverse geometries, occlusions, and complex spatial arrangements.

Result: The benchmark was adopted as a challenge at MetaFood 2025 Workshop. While strong vision-language baselines achieved competitive performance, geometry-based reconstruction methods provided both improved accuracy and greater robustness. The top-performing approach achieved 0.21 MAPE in volume estimation and 5.7 L1 Chamfer Distance in geometric accuracy.

Conclusion: Geometry-based reconstruction methods outperform appearance-based approaches for food portion estimation, demonstrating the importance of explicit geometric reasoning in handling scale ambiguity and complex real-world dining scenarios.

Abstract: We present Implicit-Scale 3D Reconstruction from Monocular Multi-Food Images, a benchmark dataset designed to advance geometry-based food portion estimation in realistic dining scenarios. Existing dietary assessment methods largely rely on single-image analysis or appearance-based inference, including recent vision-language models, which lack explicit geometric reasoning and are sensitive to scale ambiguity. This benchmark reframes food portion estimation as an implicit-scale 3D reconstruction problem under monocular observations. To reflect real-world conditions, explicit physical references and metric annotations are removed; instead, contextual objects such as plates and utensils are provided, requiring algorithms to infer scale from implicit cues and prior knowledge. The dataset emphasizes multi-food scenes with diverse object geometries, frequent occlusions, and complex spatial arrangements. The benchmark was adopted as a challenge at the MetaFood 2025 Workshop, where multiple teams proposed reconstruction-based solutions. Experimental results show that while strong vision–language baselines achieve competitive performance, geometry-based reconstruction methods provide both improved accuracy and greater robustness, with the top-performing approach achieving 0.21 MAPE in volume estimation and 5.7 L1 Chamfer Distance in geometric accuracy.

[135] Curriculum-DPO++: Direct Preference Optimization via Data and Model Curricula for Text-to-Image Generation

Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Nicu Sebe, Mubarak Shah

Main category: cs.CV

TL;DR: Curriculum-DPO++ enhances preference optimization for text-to-image generation by combining data-level curriculum with model-level curriculum, dynamically increasing model capacity during training through layer unfreezing and progressive LoRA rank scheduling.

DetailsMotivation: Existing preference optimization methods (RLHF, DPO) don't account for varying difficulty in learning different preferences, leading to suboptimal optimization. Curriculum-DPO addressed this with data-level curriculum, but further improvements are possible by also adapting model capacity during training.

Method: Curriculum-DPO++ combines data-level curriculum (organizing image pairs by difficulty) with model-level curriculum: 1) Initialize with subset of trainable layers, sequentially unfreeze layers during training; 2) Progressive LoRA rank scheduling - start with small rank, incrementally increase to baseline rank; 3) Alternative ranking strategy for image pairs.

Result: Outperforms Curriculum-DPO and other state-of-the-art preference optimization methods on nine benchmarks in terms of text alignment, aesthetics, and human preference.

Conclusion: Curriculum-DPO++ effectively addresses the varying difficulty of preference learning by combining data-level and model-level curriculum approaches, leading to superior text-to-image generation quality and alignment with human preferences.

Abstract: Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). However, neither RLHF nor DPO takes into account the fact that learning certain preferences is more difficult than learning other preferences, rendering the optimization process suboptimal. To address this gap in text-to-image generation, we recently proposed Curriculum-DPO, a method that organizes image pairs by difficulty. In this paper, we introduce Curriculum-DPO++, an enhanced method that combines the original data-level curriculum with a novel model-level curriculum. More precisely, we propose to dynamically increase the learning capacity of the denoising network as training advances. We implement this capacity increase via two mechanisms. First, we initialize the model with only a subset of the trainable layers used in the original Curriculum-DPO. As training progresses, we sequentially unfreeze layers until the configuration matches the full baseline architecture. Second, as the fine-tuning is based on Low-Rank Adaptation (LoRA), we implement a progressive schedule for the dimension of the low-rank matrices. Instead of maintaining a fixed capacity, we initialize the low-rank matrices with a dimension significantly smaller than that of the baseline. As training proceeds, we incrementally increase their rank, allowing the capacity to grow until it converges to the same rank value as in Curriculum-DPO. Furthermore, we propose an alternative ranking strategy to the one employed by Curriculum-DPO. Finally, we compare Curriculum-DPO++ against Curriculum-DPO and other state-of-the-art preference optimization approaches on nine benchmarks, outperforming the competing methods in terms of text alignment, aesthetics, and human preference. Our code is available at https://github.com/CroitoruAlin/Curriculum-DPO.
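The two capacity-growth mechanisms described above can be sketched as simple step-based schedules. A minimal sketch, assuming illustrative stage counts and linear growth; the function names and exact schedules are not from the paper:

```python
def lora_rank_schedule(step, total_steps, min_rank=4, max_rank=64, stages=4):
    """Grow the LoRA rank from min_rank to max_rank over `stages`
    equal training phases (illustrative schedule, not the paper's)."""
    stage = min(stages - 1, step * stages // total_steps)
    return min_rank + (max_rank - min_rank) * stage // (stages - 1)

def layers_to_unfreeze(step, total_steps, n_layers=24, start_fraction=0.25):
    """Start with a fraction of layers trainable and unfreeze the rest
    linearly as training advances (illustrative schedule)."""
    start = max(1, int(n_layers * start_fraction))
    extra = (n_layers - start) * step // max(1, total_steps - 1)
    return min(n_layers, start + extra)
```

Both schedules are monotone, so capacity only ever grows toward the full baseline configuration.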

[136] A Calibrated Memorization Index (MI) for Detecting Training Data Leakage in Generative MRI Models

Yash Deo, Yan Jia, Toni Lassila, Victoria J Hodge, Alejandro F Frang, Chenghao Qian, Siyuan Kang, Ibrahim Habli

Main category: cs.CV

TL;DR: Proposes a calibrated per-sample metric for detecting memorization and duplication of training data in medical image generation, using MRI foundation model features and whitened nearest-neighbor similarities to compute Overfit/Novelty Index and Memorization Index scores.

DetailsMotivation: Image generative models can duplicate training data, raising privacy concerns in medical imaging. Current methods lack robust, calibrated metrics for detecting memorization in medical image generation.

Method: Uses MRI foundation model to extract image features, aggregates multi-layer whitened nearest-neighbor similarities, and maps them to bounded ONI and MI scores for per-sample memorization detection.

Result: Across three MRI datasets with controlled duplication percentages, the metric robustly detects duplication and provides consistent values. Achieves near-perfect duplicate detection at sample level.

Conclusion: Proposed metric effectively detects training data memorization in medical image generation, addressing privacy concerns with calibrated, dataset-consistent scores.

Abstract: Image generative models are known to duplicate images from the training data as part of their outputs, which can lead to privacy concerns when used for medical image generation. We propose a calibrated per-sample metric for detecting memorization and duplication of training data. Our metric uses image features extracted with an MRI foundation model, aggregates multi-layer whitened nearest-neighbor similarities, and maps them to bounded Overfit/Novelty Index (ONI) and Memorization Index (MI) scores. Across three MRI datasets with controlled duplication percentages and typical image augmentations, our metric robustly detects duplication and yields more consistent values across datasets. At the sample level, our metric achieves near-perfect detection of duplicates.
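A toy numpy sketch of the whitened nearest-neighbor idea: the actual metric aggregates multi-layer foundation-model features and applies a calibrated mapping, whereas the ZCA whitening, cosine similarity, and linear squashing below are illustrative assumptions:

```python
import numpy as np

def whiten(X, eps=1e-5):
    """ZCA-whiten the rows of X using the covariance of X."""
    Xc = X - X.mean(0)
    cov = Xc.T @ Xc / len(Xc)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W

def memorization_scores(gen_feats, train_feats):
    """Nearest-neighbor cosine similarity in a jointly whitened feature
    space, squashed to a bounded [0, 1] score per generated sample."""
    Z = whiten(np.vstack([train_feats, gen_feats]))
    tr, ge = Z[:len(train_feats)], Z[len(train_feats):]
    tr = tr / np.linalg.norm(tr, axis=1, keepdims=True)
    ge = ge / np.linalg.norm(ge, axis=1, keepdims=True)
    sims = (ge @ tr.T).max(axis=1)   # NN similarity per generated sample
    return (sims + 1.0) / 2.0        # squash [-1, 1] -> [0, 1]
```

An exact duplicate of a training sample scores near 1, while a novel sample scores noticeably lower.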

[137] SIEFormer: Spectral-Interpretable and -Enhanced Transformer for Generalized Category Discovery

Chunming Li, Shidong Wang, Tong Xin, Haofeng Zhang

Main category: cs.CV

TL;DR: SIEFormer is a novel Vision Transformer variant that uses spectral analysis to reinterpret attention mechanisms, featuring implicit and explicit spectral branches for enhanced feature adaptability in Generalized Category Discovery tasks.

DetailsMotivation: The paper aims to enhance Vision Transformers by leveraging spectral analysis to reinterpret attention mechanisms, particularly for challenging Generalized Category Discovery tasks where feature adaptability is crucial.

Method: SIEFormer uses two branches: 1) Implicit branch with graph Laplacians for local token correlations and Band-adaptive Filter layer for flexible filtering; 2) Explicit branch with Maneuverable Filtering Layer that applies Fourier transform to value features, modulates in frequency domain, and inverse transforms for enhanced features.

Result: Achieves state-of-the-art performance on multiple image recognition datasets, with superiority confirmed through ablation studies and visualizations.

Conclusion: SIEFormer successfully integrates spectral analysis into Vision Transformers, demonstrating enhanced feature adaptability and superior performance for challenging recognition tasks.

Abstract: This paper presents a novel approach, Spectral-Interpretable and -Enhanced Transformer (SIEFormer), which leverages spectral analysis to reinterpret the attention mechanism within Vision Transformer (ViT) and enhance feature adaptability, with particular emphasis on challenging Generalized Category Discovery (GCD) tasks. The proposed SIEFormer is composed of two main branches, each corresponding to an implicit and explicit spectral perspective of the ViT, enabling joint optimization. The implicit branch realizes the use of different types of graph Laplacians to model the local structure correlations of tokens, along with a novel Band-adaptive Filter (BaF) layer that can flexibly perform both band-pass and band-reject filtering. The explicit branch, on the other hand, introduces a Maneuverable Filtering Layer (MFL) that learns global dependencies among tokens by applying the Fourier transform to the input "value" features, modulating the transformed signal with a set of learnable parameters in the frequency domain, and then performing an inverse Fourier transform to obtain the enhanced features. Extensive experiments reveal state-of-the-art performance on multiple image recognition datasets, reaffirming the superiority of our approach through ablation studies and visualizations.
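The explicit branch's FFT-modulate-iFFT pipeline can be sketched in a few lines; a per-frequency real gain here stands in for the paper's learnable frequency-domain parameters:

```python
import numpy as np

def maneuverable_filter(values, freq_gain):
    """Illustrative frequency-domain filtering of token 'value' features:
    FFT over the token axis, per-frequency gain, inverse FFT.
    values: (num_tokens, dim); freq_gain: (num_tokens,) real gains."""
    spec = np.fft.fft(values, axis=0)       # to the frequency domain
    spec = spec * freq_gain[:, None]        # modulate each frequency bin
    return np.fft.ifft(spec, axis=0).real   # back to the token domain
```

With an all-ones gain the layer is the identity; keeping only the DC bin reduces every token to the per-dimension mean, i.e. an extreme low-pass filter.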

[138] Universal Transformation of One-Class Classifiers for Unsupervised Anomaly Detection

Declan McIntosh, Alexandra Branzan Albu

Main category: cs.CV

TL;DR: A dataset folding method that transforms any one-class classifier-based anomaly detector into a fully unsupervised method by leveraging weak assumptions about anomaly rarity and heterogeneity.

DetailsMotivation: Existing anomaly detection methods assume training data contains only nominal values, making them vulnerable to training label noise. There's a need for methods that can work with unlabeled data containing potential anomalies.

Method: Proposes a dataset folding technique that uses multiple independently trained instances of a one-class classifier to filter training datasets for anomalies. The method assumes anomalies are uncommon and heterogeneous in the training data, enabling the identification and removal of anomalous samples.

Result: The method transforms various one-class classifier anomaly detectors for images and videos into unsupervised ones, creates the first unsupervised logical anomaly detectors, and achieves state-of-the-art performance on MVTec AD, ViSA, and MVTec Loco AD datasets.

Conclusion: The approach provides a general framework to convert one-class classifiers into unsupervised anomaly detectors, linking improvements in one-class classification directly to the unsupervised domain.

Abstract: Detecting anomalies in images and video is an essential task for multiple real-world problems, including industrial inspection, computer-assisted diagnosis, and environmental monitoring. Anomaly detection is typically formulated as a one-class classification problem, where the training data consists solely of nominal values, leaving methods built on this assumption susceptible to training label noise. We present a dataset folding method that transforms an arbitrary one-class classifier-based anomaly detector into a fully unsupervised method. This is achieved by making a set of key weak assumptions: that anomalies are uncommon in the training dataset and generally heterogeneous. These assumptions enable us to utilize multiple independently trained instances of a one-class classifier to filter the training dataset for anomalies. This transformation requires no modifications to the underlying anomaly detector; the only changes are algorithmically selected data subsets used for training. We demonstrate that our method can transform a wide variety of one-class classifier anomaly detectors for both images and videos into unsupervised ones. Our method creates the first unsupervised logical anomaly detectors by transforming existing methods. We also demonstrate that our method achieves state-of-the-art performance for unsupervised anomaly detection on the MVTec AD, ViSA, and MVTec Loco AD datasets. As improvements to one-class classifiers are made, our method directly transfers those improvements to the unsupervised domain, linking the domains.
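The folding idea can be illustrated with a toy stand-in for the one-class model: a distance-to-fold-mean scorer replaces a real anomaly detector, and the fold count and drop fraction are arbitrary choices, not the paper's algorithm:

```python
import numpy as np

def fold_filter(X, n_folds=3, drop_frac=0.1, seed=0):
    """Illustrative dataset folding: fit a simple one-class scorer
    (distance to the fold mean) on each random fold, score every sample
    with the folds it was held out from, and drop the top-scoring
    fraction as suspected anomalies. Returns indices of kept samples."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, n_folds, size=len(X))
    scores = np.zeros(len(X))
    for f in range(n_folds):
        mu = X[assign == f].mean(axis=0)   # "nominal model" of fold f
        held_out = assign != f
        scores[held_out] += np.linalg.norm(X[held_out] - mu, axis=1)
    n_drop = int(len(X) * drop_frac)
    keep = np.argsort(scores)[: len(X) - n_drop]
    return np.sort(keep)
```

Because anomalies are rare and heterogeneous, every fold's nominal model scores them highly, so they land in the dropped fraction.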

[139] Realistic Face Reconstruction from Facial Embeddings via Diffusion Models

Dong Han, Yong Li, Joachim Denzler

Main category: cs.CV

TL;DR: A framework called Face Embedding Mapping (FEM) that uses Kolmogorov-Arnold Networks with diffusion models to reconstruct high-resolution face images from embeddings of privacy-preserving face recognition systems, demonstrating privacy risks.

DetailsMotivation: While privacy-preserving face recognition (PPFR) systems protect facial privacy, there's limited research on verifying their actual privacy risks by reconstructing realistic faces from their embeddings. The authors aim to explore whether these systems truly protect privacy by demonstrating reconstruction attacks.

Method: Proposes Face Embedding Mapping (FEM), a general framework using Kolmogorov-Arnold Networks (KAN) with pre-trained Identity-Preserving diffusion models to conduct embedding-to-face attacks against state-of-the-art FR and PPFR systems. The method reconstructs faces from embeddings, including partial and protected embeddings.

Result: Reconstructed faces can successfully access other real-world FR systems, demonstrating privacy leakage. The method shows robustness in reconstructing faces from partial and protected embeddings. FEM serves as an effective tool for evaluating privacy safety of FR/PPFR systems.

Conclusion: PPFR systems still have significant privacy vulnerabilities as realistic face images can be reconstructed from their embeddings. FEM provides a valuable framework for assessing and improving privacy protection in face recognition systems.

Abstract: With the advancement of face recognition (FR) systems, privacy-preserving face recognition (PPFR) systems have gained popularity for their accurate recognition, enhanced facial privacy protection, and robustness to various attacks. However, there are limited studies that further verify privacy risks by reconstructing realistic high-resolution face images from embeddings of these systems, especially for PPFR. In this work, we propose face embedding mapping (FEM), a general framework that explores Kolmogorov-Arnold Networks (KAN) for conducting the embedding-to-face attack by leveraging a pre-trained Identity-Preserving diffusion model against state-of-the-art (SOTA) FR and PPFR systems. Based on extensive experiments, we verify that reconstructed faces can be used to access other real-world FR systems. Besides, the proposed method shows robustness in reconstructing faces from partial and protected face embeddings. Moreover, FEM can be utilized as a tool for evaluating the safety of FR and PPFR systems in terms of privacy leakage. All images used in this work are from public datasets.

[140] LongStream: Long-Sequence Streaming Autoregressive Visual Geometry

Chong Cheng, Xianda Chen, Tao Xie, Wei Yin, Weiqiang Ren, Qian Zhang, Xiaoyuang Guo, Hao Wang

Main category: cs.CV

TL;DR: LongStream introduces a gauge-decoupled streaming 3D reconstruction model that handles thousands of frames by predicting keyframe-relative poses, using orthogonal scale learning to suppress drift, and implementing cache-consistent training to address Transformer attention issues.

DetailsMotivation: Existing autoregressive models for 3D reconstruction fail with long sequences due to attention decay, scale drift, and extrapolation errors from anchoring poses to the first frame.

Method: Three key innovations: 1) Predict keyframe-relative poses instead of first-frame anchoring, 2) Orthogonal scale learning to disentangle geometry from scale estimation, 3) Cache-consistent training with periodic cache refresh to address Transformer attention degradation.

Result: Achieves state-of-the-art performance with stable, metric-scale reconstruction over kilometer-scale sequences at 18 FPS, demonstrating robustness on long streaming sequences.

Conclusion: LongStream successfully addresses fundamental limitations in long-sequence 3D reconstruction by decoupling gauge, suppressing drift, and solving Transformer cache issues, enabling practical kilometer-scale reconstruction.

Abstract: Long-sequence streaming 3D reconstruction remains a significant open challenge. Existing autoregressive models often fail when processing long sequences. They typically anchor poses to the first frame, which leads to attention decay, scale drift, and extrapolation errors. We introduce LongStream, a novel gauge-decoupled streaming visual geometry model for metric-scale scene reconstruction across thousands of frames. Our approach is threefold. First, we discard the first-frame anchor and predict keyframe-relative poses. This reformulates long-range extrapolation into a constant-difficulty local task. Second, we introduce orthogonal scale learning. This method fully disentangles geometry from scale estimation to suppress drift. Finally, we solve Transformer cache issues such as attention-sink reliance and long-term KV-cache contamination. We propose cache-consistent training combined with periodic cache refresh. This approach suppresses attention degradation over ultra-long sequences and reduces the gap between training and inference. Experiments show LongStream achieves state-of-the-art performance. It delivers stable, metric-scale reconstruction over kilometer-scale sequences at 18 FPS. Project Page: https://3dagentworld.github.io/longstream/
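The keyframe-relative reformulation amounts to a change of reference frame for SE(3) poses. A minimal sketch, assuming camera-to-world 4x4 matrices (the convention is an assumption, not stated in the abstract):

```python
import numpy as np

def keyframe_relative(poses, key_idx):
    """Express 4x4 camera-to-world poses relative to a chosen keyframe
    instead of frame 0 (illustrative): T_rel_i = inv(T_key) @ T_i."""
    T_key_inv = np.linalg.inv(poses[key_idx])
    return np.array([T_key_inv @ T for T in poses])
```

The keyframe's own relative pose is the identity, and composing the keyframe pose with any relative pose recovers the original absolute pose, so the representation is lossless while keeping each prediction a constant-difficulty local task.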

[141] Monocular Markerless Motion Capture Enables Quantitative Assessment of Upper Extremity Reachable Workspace

Seth Donahue, J. D. Peiffer, R. Tyler Richardson, Yishan Zhong, Shaun Q. Y. Tan, Benoit Marteau, Stephanie R. Russo, May D. Wang, R. James Cotton, Ross Chafetz

Main category: cs.CV

TL;DR: Validates monocular camera + AI markerless motion capture for upper extremity reachable workspace assessment, showing frontal camera configuration achieves strong agreement with marker-based reference.

DetailsMotivation: To validate a clinically accessible approach for quantifying Upper Extremity Reachable Workspace using single camera and AI-driven markerless motion capture, reducing barriers to adoption in clinical motion analysis.

Method: Nine unimpaired adults performed standardized UERW task with VR targets. Movements captured simultaneously with marker-based system and eight FLIR cameras. Monocular video analysis performed on frontal and offset camera views to compare configurations.

Result: Frontal camera orientation showed strong agreement with marker-based reference (mean bias 0.61±0.12% reachspace per octant). Offset camera underestimated percent workspace reached (-5.66±0.45%). Best agreement for anterior workspace evaluation.

Conclusion: Frontal monocular camera configuration is feasible for UERW assessment, particularly for anterior workspace. Demonstrates clinical potential for practical single-camera assessments, enabling broader implementation of quantitative upper extremity mobility assessment.

Abstract: To validate a clinically accessible approach for quantifying the Upper Extremity Reachable Workspace (UERW) using a single (monocular) camera and Artificial Intelligence (AI)-driven Markerless Motion Capture (MMC) for biomechanical analysis. Objective assessment and validation of these techniques for specific clinically oriented tasks are crucial for their adoption in clinical motion analysis. AI-driven monocular MMC reduces the barriers to adoption in the clinic and has the potential to reduce the overhead for analysis of this common clinical assessment. Nine adult participants with no impairments performed the standardized UERW task, which entails reaching targets distributed across a virtual sphere centered on the torso, with targets displayed in a VR headset. Movements were simultaneously captured using a marker-based motion capture system and a set of eight FLIR cameras. We performed monocular video analysis on two of these video camera views to compare frontal and offset camera configurations. The frontal camera orientation demonstrated strong agreement with the marker-based reference, exhibiting a minimal mean bias of $0.61 \pm 0.12$ % reachspace reached per octant (mean $\pm$ standard deviation). In contrast, the offset camera view underestimated the percent workspace reached ($-5.66 \pm 0.45$ % reachspace reached). Conclusion: The findings support the feasibility of a frontal monocular camera configuration for UERW assessment, particularly for anterior workspace evaluation where agreement with marker-based motion capture was highest. The overall performance demonstrates clinical potential for practical, single-camera assessments. This study provides the first validation of a monocular MMC system for the assessment of the UERW task. By reducing technical complexity, this approach enables broader implementation of quantitative upper extremity mobility assessment.

[142] FlexAM: Flexible Appearance-Motion Decomposition for Versatile Video Generation Control

Mingzhi Sheng, Zekai Gu, Peng Li, Cheng Lin, Hao-Xiang Guo, Ying-Cong Chen, Yuan Liu

Main category: cs.CV

TL;DR: FlexAM is a video generation framework that uses a novel 3D control signal represented as a point cloud to disentangle appearance and motion, enabling various video editing tasks with superior performance.

DetailsMotivation: Current video generation methods struggle with effective and generalizable control, often relying on ambiguous or task-specific signals. The authors argue that a fundamental disentanglement of appearance and motion provides a more robust and scalable pathway for video generation and editing.

Method: FlexAM introduces a unified framework built upon a novel 3D control signal that represents video dynamics as a point cloud. The method includes three key enhancements: multi-frequency positional encoding to distinguish fine-grained motion, depth-aware positional encoding, and a flexible control signal for balancing precision and generative quality.

Result: Extensive experiments demonstrate that FlexAM achieves superior performance across all evaluated tasks including I2V/V2V editing, camera control, and spatial object editing.

Conclusion: The proposed FlexAM framework effectively disentangles appearance and motion through its novel 3D control signal representation, providing a robust and scalable solution for video generation and editing tasks.

Abstract: Effective and generalizable control in video generation remains a significant challenge. While many methods rely on ambiguous or task-specific signals, we argue that a fundamental disentanglement of “appearance” and “motion” provides a more robust and scalable pathway. We propose FlexAM, a unified framework built upon a novel 3D control signal. This signal represents video dynamics as a point cloud, introducing three key enhancements: multi-frequency positional encoding to distinguish fine-grained motion, depth-aware positional encoding, and a flexible control signal for balancing precision and generative quality. This representation allows FlexAM to effectively disentangle appearance and motion, enabling a wide range of tasks including I2V/V2V editing, camera control, and spatial object editing. Extensive experiments demonstrate that FlexAM achieves superior performance across all evaluated tasks.
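The multi-frequency positional encoding mentioned for the point-cloud control signal is in the spirit of the standard NeRF-style encoding; the 2^k frequency ladder below is an illustrative choice rather than the paper's exact design:

```python
import numpy as np

def multi_frequency_encoding(points, num_freqs=4):
    """Illustrative multi-frequency positional encoding of 3D points:
    [sin(2^k * pi * x), cos(2^k * pi * x)] per axis and frequency k.
    Higher frequencies separate nearby points, which is what makes
    fine-grained motion distinguishable."""
    feats = []
    for k in range(num_freqs):
        feats.append(np.sin((2.0 ** k) * np.pi * points))
        feats.append(np.cos((2.0 ** k) * np.pi * points))
    return np.concatenate(feats, axis=-1)
```

Each 3D point expands to 3 axes x 2 functions x num_freqs features, all bounded in [-1, 1].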

[143] Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision

Aadarsh Sahoo, Georgia Gkioxari

Main category: cs.CV

TL;DR: Conversational Image Segmentation (CIS) benchmark for grounding abstract concepts like affordances and safety reasoning into pixel masks, with ConverSeg-Net model and automated data engine.

DetailsMotivation: Prior referring image segmentation focuses on categorical/spatial queries but overlooks functional/physical reasoning (e.g., "where can I safely store the knife?"). Need to address this gap for more natural human-AI interaction.

Method: Introduce ConverSeg benchmark covering entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. Develop ConverSeg-Net that fuses segmentation priors with language understanding, and create AI-powered data engine for generating prompt-mask pairs without human supervision.

Result: Current language-guided segmentation models are inadequate for CIS, while ConverSeg-Net trained on the automated data achieves significant gains on ConverSeg benchmark while maintaining strong performance on existing language-guided segmentation benchmarks.

Conclusion: Conversational Image Segmentation enables grounding abstract, intent-driven concepts into pixel masks, advancing beyond traditional referring segmentation to include functional and physical reasoning for more natural human-AI interaction.

Abstract: Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., “left-most apple”) and overlooks functional and physical reasoning (e.g., “where can I safely store the knife?”). We address this gap and introduce Conversational Image Segmentation (CIS) and ConverSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConverSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt-mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConverSeg-Net trained on our data engine achieves significant gains on ConverSeg and maintains strong performance on existing language-guided segmentation benchmarks. Project webpage: https://glab-caltech.github.io/converseg/

[144] Spatio-Temporal driven Attention Graph Neural Network with Block Adjacency matrix (STAG-NN-BA) for Remote Land-use Change Detection

Usman Nazir, Wadood Islam, Sara Khalid, Murtaza Taj

Main category: cs.CV

TL;DR: A novel Graph Neural Network architecture for land-use monitoring using satellite imagery, featuring spatial and spatio-temporal classification with attention mechanisms and superpixel-based graph construction.

DetailsMotivation: Land-use monitoring is crucial for spatial planning amid growing populations and climate change. Existing deep learning methods are limited to Euclidean domains, but remote sensing data has geodesic/non-Euclidean nature that can benefit from graph-based approaches.

Method: Proposes SAG-NN: uses SLIC image segmentation to create superpixel nodes, builds Region Adjacency Graph (RAG) connecting adjacent superpixels, and employs spatially-driven attention to learn relative importance of irregular neighbors. Extends to STAG-NN-BA for spatio-temporal data by combining unconnected RAGs into one supergraph using block adjacency matrices.

Result: SAG-NN and STAG-NN-BA outperform both graph and non-graph baselines on Asia14 and C2D2 datasets efficiently.

Conclusion: Graph neural networks with attention mechanisms are effective for land-use monitoring from satellite imagery, handling the non-Euclidean nature of remote sensing data through superpixel-based graph representations.

Abstract: Land-use monitoring is fundamental for spatial planning, particularly in view of compound impacts of growing global populations and climate change. Despite existing applications of deep learning in land use monitoring, standard convolutional kernels in deep neural networks limit the applications of these networks to the Euclidean domain only. Considering the geodesic nature of the measurement of the earth’s surface, remote sensing is one such area that can benefit from non-Euclidean and spherical domains. For this purpose, we designed a novel Graph Neural Network architecture for spatial and spatio-temporal classification using satellite imagery to acquire insights into socio-economic indicators. We propose a hybrid attention method to learn the relative importance of irregular neighbors in remote sensing data. Instead of classifying each pixel, we propose a method based on Simple Linear Iterative Clustering (SLIC) image segmentation and a Graph Attention Network. The superpixels obtained from SLIC become the nodes of our Graph Convolution Network (GCN). A region adjacency graph (RAG) is then constructed where each superpixel is connected to every adjacent superpixel in the image, enabling information to propagate globally. Finally, we propose a Spatially driven Attention Graph Neural Network (SAG-NN) to classify each RAG. We also propose an extension to our SAG-NN for spatio-temporal data. Unlike regular grids of pixels in images, superpixels are irregular in nature and cannot be used to create spatio-temporal graphs. We introduce temporal bias by combining unconnected RAGs from each image into one supergraph. This is achieved by introducing block adjacency matrices resulting in novel Spatio-Temporal driven Attention Graph Neural Network with Block Adjacency matrix (STAG-NN-BA). SAG-NN and STAG-NN-BA efficiently outperform both graph and non-graph baselines on the Asia14 and C2D2 datasets.
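The block adjacency construction can be sketched directly: per-frame RAG adjacency matrices are placed on the diagonal of one supergraph matrix, leaving cross-frame blocks empty. This is a minimal sketch of the data structure only, not the full STAG-NN-BA model:

```python
import numpy as np

def block_adjacency(frame_adjs):
    """Stack per-frame region-adjacency matrices into one block-diagonal
    supergraph adjacency matrix (illustrative). Frames may have different
    numbers of superpixel nodes, which is why a simple 3D stack fails."""
    sizes = [A.shape[0] for A in frame_adjs]
    n = sum(sizes)
    big = np.zeros((n, n), dtype=frame_adjs[0].dtype)
    off = 0
    for A, s in zip(frame_adjs, sizes):
        big[off:off + s, off:off + s] = A   # place frame block on diagonal
        off += s
    return big
```

The resulting matrix lets one graph network process all frames jointly even though each frame contributes an irregular, differently sized node set.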

[145] Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads

Ali K. Rahimian, Manish K. Govind, Subhajit Maity, Dominick Reilly, Christian Kümmerle, Srijan Das, Aritra Dutta

Main category: cs.CV

TL;DR: Fibottention: A novel sparse self-attention mechanism using Wythoff array patterns for O(N log N) complexity, achieving better performance than dense attention with only 2-6% of pairwise interactions.

DetailsMotivation: Vision Transformers suffer from quadratic computational complexity in self-attention and high dependency on large-scale training data. Need efficient attention mechanisms that reduce computation while maintaining or improving performance.

Method: Proposes Fibottention - a sparse self-attention mechanism using structured sparsity patterns derived from the Wythoff array. Creates inception-like functional diversity across attention heads with varying sparsity patterns to reduce redundant interactions while ensuring diverse coverage.

Result: Models with Fibottention outperform or match dense MHSA counterparts while using only 2-6% of pairwise interactions. Achieves substantial computational savings and superior results compared to existing sparse attention mechanisms on FLOP-equivalency basis. Works well across image classification, video understanding, and robot learning.

Conclusion: Fibottention provides an efficient sparse attention mechanism that reduces computational complexity to O(N log N) while enhancing feature diversity and representation learning capabilities in vision transformers.

Abstract: Vision Transformers and their variants have achieved remarkable success in diverse visual perception tasks. Despite their effectiveness, they suffer from two significant limitations. First, the quadratic computational complexity of multi-head self-attention (MHSA) restricts scalability to large token counts; second, competitive performance depends heavily on large-scale training data. In this paper, to address these challenges, we propose a novel sparse self-attention mechanism named Fibottention. Fibottention employs structured sparsity patterns derived from the Wythoff array, enabling an $\mathcal{O}(N \log N)$ computational complexity in self-attention. By design, its sparsity patterns vary across attention heads, which provably reduces redundant pairwise interactions while ensuring sufficient and diverse coverage. This leads to an inception-like functional diversity in the attention heads, and promotes more informative and disentangled representations. We integrate Fibottention into standard Transformer architectures and conduct extensive experiments across multiple domains, including image classification, video understanding, and robot learning. Results demonstrate that models equipped with Fibottention either significantly outperform or achieve on-par performance with their dense MHSA counterparts, while leveraging only 2-6% of the pairwise interactions in self-attention heads in typical settings, resulting in substantial computational savings. Moreover, when compared to existing sparse attention mechanisms, Fibottention consistently achieves superior results on a FLOP-equivalency basis. Finally, we provide an in-depth analysis of the enhanced feature diversity resulting from our attention design and discuss its implications for efficient representation learning.
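To make "sparsity patterns that vary across heads" concrete, here is a loose illustration using Fibonacci strides. This is not the paper's Wythoff-array construction, only a sketch of the general idea of head-diverse structured sparsity:

```python
import numpy as np

def fib_strides(n):
    """First n Fibonacci numbers starting from 1, 2 (helper)."""
    fibs = [1, 2]
    while len(fibs) < n:
        fibs.append(fibs[-1] + fibs[-2])
    return fibs[:n]

def head_masks(num_tokens, num_heads):
    """Boolean attention masks that differ per head: head h attends to
    tokens at multiples of the h-th Fibonacci stride, plus itself.
    Later heads are sparser, so the heads jointly cover diverse,
    largely non-overlapping interaction patterns."""
    masks = np.zeros((num_heads, num_tokens, num_tokens), dtype=bool)
    for h, stride in enumerate(fib_strides(num_heads)):
        for i in range(num_tokens):
            js = np.arange(i % stride, num_tokens, stride)
            masks[h, i, js] = True
        np.fill_diagonal(masks[h], True)   # always keep self-attention
    return masks
```

Applying such a mask before the softmax keeps only the marked query-key pairs, which is where the computational savings come from.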

[146] MaskInversion: Localized Embeddings via Optimization of Explainability Maps

Walid Bousselham, Sofian Chaybouti, Christian Rupprecht, Vittorio Ferrari, Hilde Kuehne

Main category: cs.CV

TL;DR: MaskInversion generates context-aware embeddings for specific image regions using frozen foundation models like CLIP by optimizing an embedding token to match query masks through explainability map alignment.

DetailsMotivation: Vision-language models like CLIP excel at global image-text alignment but struggle with region-specific representations. There's a need for methods that can create precise embeddings for specific image regions without retraining foundation models.

Method: Initialize an embedding token and iteratively refine it by minimizing discrepancy between its explainability map (derived from foundation model gradients) and the query mask. Uses gradient decomposition for efficiency and keeps foundation model frozen.

Result: Evaluated on PascalVOC, MSCOCO, RefCOCO, OpenImagesV7 for open-vocabulary class retrieval, referring expression comprehension, localized captioning, and image generation. Shows competitive performance compared to SOTA approaches.

Conclusion: MaskInversion enables precise region-specific representations using frozen foundation models, supporting diverse vision-language tasks without model retraining. The gradient decomposition makes it efficient for practical applications.

Abstract: Vision-language foundation models such as CLIP have achieved tremendous results in global vision-language alignment, but still show some limitations in creating representations for specific image regions. To address this problem, we propose MaskInversion, a method that leverages the feature representations of pre-trained foundation models, such as CLIP, to generate a context-aware embedding for a query image region specified by a mask at test time. MaskInversion starts by initializing an embedding token and comparing its explainability map, derived from the foundation model, to the query mask. The embedding token is then refined to approximate the query region by minimizing the discrepancy between its explainability map and the query mask. During this process, only the embedding vector is updated, while the underlying foundation model is kept frozen, allowing MaskInversion to be used with any pre-trained model. As deriving the explainability map involves computing its gradient, which can be expensive, we propose a gradient decomposition strategy that simplifies this computation. The learned region representation can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, as well as localized captioning and image generation. We evaluate the proposed method on all those tasks on several datasets such as PascalVOC, MSCOCO, RefCOCO, and OpenImagesV7 and show its capabilities compared to other SOTA approaches.
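The inversion loop can be sketched end to end with a toy stand-in for the explainability map: a sigmoid relevance score over frozen patch features, with only the embedding token updated by gradient descent. The features, mask, and analytic gradient below are illustrative assumptions, not the paper's CLIP-based machinery:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "foundation model" patch features (16 patches, 8-dim). The query
# mask is derived from a hidden direction so that the frozen features can
# actually represent it -- all of this is a toy stand-in for CLIP.
F = rng.normal(size=(16, 8))
mask = (F @ rng.normal(size=8) > 0).astype(float)

def explainability(z):
    """Toy explainability map: per-patch sigmoid relevance of token z."""
    return 1.0 / (1.0 + np.exp(-F @ z))

z = np.zeros(8)                      # the only quantity that is updated
lr, losses = 0.5, []
for _ in range(300):
    e = explainability(z)
    losses.append(float(np.mean((e - mask) ** 2)))
    # Analytic gradient of the MSE through the sigmoid; F stays frozen.
    grad = F.T @ ((e - mask) * e * (1.0 - e)) * (2.0 / len(mask))
    z -= lr * grad

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The frozen-model property is visible directly: `F` is never written to, only read, so the same loop would work against any fixed feature extractor.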

[147] DuoCast: Duo-Probabilistic Diffusion for Precipitation Nowcasting

Penghui Wen, Mengwei He, Patrick Filippi, Na Zhao, Feng Zhang, Thomas Francis Bishop, Zhiyong Wang, Kun Hu

Main category: cs.CV

TL;DR: DuoCast: A dual-diffusion framework for precipitation forecasting that decomposes prediction into low- and high-frequency components in orthogonal latent subspaces to better balance global structure and local details.

DetailsMotivation: Existing deep learning approaches for precipitation forecasting struggle to balance global structural consistency with local detail preservation, especially under complex meteorological conditions, which is critical for weather-sensitive decision-making.

Method: Proposes DuoCast, a dual-diffusion framework that decomposes precipitation forecasting into low- and high-frequency components modeled in orthogonal latent subspaces. The low-frequency model uses convolutional encoders conditioned on weather front dynamics to capture large-scale trends, while the high-frequency model uses self-attention-based architecture to refine fine-scale variability.

Result: Experiments on four benchmark radar datasets show DuoCast consistently outperforms state-of-the-art baselines, achieving superior accuracy in both spatial detail and temporal evolution. The paper also provides theoretical proof that frequency decomposition reduces prediction error compared to conventional single branch U-Net diffusion models.

Conclusion: DuoCast effectively addresses the challenge of balancing global structure and local details in precipitation forecasting through its dual-diffusion framework with frequency decomposition, demonstrating superior performance across multiple datasets.

Abstract: Accurate short-term precipitation forecasting is critical for weather-sensitive decision-making in agriculture, transportation, and disaster response. Existing deep learning approaches often struggle to balance global structural consistency with local detail preservation, especially under complex meteorological conditions. We propose DuoCast, a dual-diffusion framework that decomposes precipitation forecasting into low- and high-frequency components modeled in orthogonal latent subspaces. We theoretically prove that this frequency decomposition reduces prediction error compared to conventional single branch U-Net diffusion models. In DuoCast, the low-frequency model captures large-scale trends via convolutional encoders conditioned on weather front dynamics, while the high-frequency model refines fine-scale variability using a self-attention-based architecture. Experiments on four benchmark radar datasets show that DuoCast consistently outperforms state-of-the-art baselines, achieving superior accuracy in both spatial detail and temporal evolution.
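The decomposition into orthogonal low- and high-frequency subspaces can be illustrated with a plain FFT split. The fixed cutoff below is an assumption for illustration (DuoCast models the two components with learned branches); the key properties, exact recomposition and orthogonality of the two parts, hold by construction:

```python
import numpy as np

rng = np.random.default_rng(1)
field = rng.normal(size=(64, 64))               # stand-in radar frame

def frequency_split(x, cutoff=8):
    """Split x into low- and high-frequency parts with disjoint
    (hence orthogonal) frequency support."""
    X = np.fft.fft2(x)
    fy = np.fft.fftfreq(x.shape[0]) * x.shape[0]
    fx = np.fft.fftfreq(x.shape[1]) * x.shape[1]
    keep = (np.abs(fy)[:, None] <= cutoff) & (np.abs(fx)[None, :] <= cutoff)
    low = np.fft.ifft2(X * keep).real            # large-scale trend
    high = np.fft.ifft2(X * ~keep).real          # fine-scale variability
    return low, high

low, high = frequency_split(field)
recon_err = np.abs(field - (low + high)).max()   # exact recomposition
ortho = abs(np.vdot(low, high))                  # ~0: orthogonal subspaces
print(recon_err, ortho)
```

Orthogonality is what lets the two branches be trained on error terms that do not interfere, which is the intuition behind the paper's claim that decomposition reduces prediction error versus a single-branch model.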

[148] Post-hoc Probabilistic Vision-Language Models

Anton Baumann, Rui Li, Marcus Klasson, Santeri Mentu, Shyamgopal Karthik, Zeynep Akata, Arno Solin, Martin Trapp

Main category: cs.CV

TL;DR: Post-hoc Bayesian uncertainty estimation for vision-language models without retraining, improving uncertainty quantification and active learning efficiency.

DetailsMotivation: Current VLMs use deterministic mappings that fail to capture uncertainties from domain shifts, limiting their reliability in safety-critical applications.

Method: Proposes Bayesian posterior approximation over last layers of VLMs, analytically quantifying uncertainties over cosine similarities without additional training.

Result: Achieves improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning compared to baselines.

Conclusion: The method shows promise for safety-critical applications of large-scale models by providing reliable uncertainty quantification for VLMs.

Abstract: Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. For this, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed using the cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.
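A minimal sketch of the idea, with a toy diagonal last-layer posterior and Monte Carlo sampling in place of the paper's analytic treatment of cosine-similarity uncertainty; all dimensions and variances are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out = 16, 8

# MAP estimate of the last (projection) layer plus a toy diagonal
# Gaussian posterior over its weights -- stand-ins for a fitted VLM head.
W_map = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
post_var = 0.01 * np.ones_like(W_map)

feat_img = rng.normal(size=d_in)               # frozen image-encoder features
emb_txt = rng.normal(size=d_out)               # text-side embedding

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Sampling the weight posterior induces a distribution over cosine
# similarities -- the predictive uncertainty, with no retraining involved.
sims = np.array([
    cosine((W_map + np.sqrt(post_var) * rng.normal(size=W_map.shape)) @ feat_img,
           emb_txt)
    for _ in range(1000)
])
print(f"cosine similarity: {sims.mean():.3f} +/- {sims.std():.3f}")
```

The spread of `sims` is exactly the kind of calibrated uncertainty signal the paper uses for support-set selection in active learning.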

[149] PromptDepthAnything++: Accurate 4K Metric Depth Estimation via Pattern-Agnostic Prompting

Haotong Lin, Sida Peng, Qinglin Yang, Peishan Yang, Jiaming Sun, Ruizhen Hu, Kai Xu, Hujun Bao, Bingyi Kang, Xiaowei Zhou

Main category: cs.CV

TL;DR: Prompt Depth Anything introduces a novel prompting paradigm for metric depth estimation using LiDAR prompts to guide depth foundation models, achieving state-of-the-art zero-shot performance.

DetailsMotivation: Current depth foundation models lack precise metric depth estimation capabilities. The paper aims to leverage prompting techniques (successful in language and vision models) for depth estimation by using LiDAR as prompts to guide models toward accurate metric depth outputs.

Method: 1) Uses low-cost LiDAR as prompts to guide Depth Anything model for metric depth estimation; 2) Proposes prompt fusion design integrating LiDAR at multiple scales within depth decoder; 3) Creates scalable data pipeline with synthetic LiDAR simulation and real data pseudo GT depth generation; 4) Introduces prompting mechanism that serializes depth points into tokens and uses self-attention to enhance image tokens.

Result: Achieves state-of-the-art performance on 8 zero-shot depth benchmarks, supports up to 4K resolution, and benefits downstream applications including 3D reconstruction and robotic grasping.

Conclusion: Successfully demonstrates that prompting can be effectively applied to depth foundation models, creating a new paradigm for metric depth estimation that combines the strengths of foundation models with precise sensor guidance.

Abstract: Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR at multiple scales within the depth decoder. To address training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes synthetic data LiDAR simulation and real data pseudo GT depth generation. To further extend our method to work with any prompt depth points, we propose a new prompting mechanism, which serializes the input depth points into tokens and uses self-attention to enhance image tokens from depth foundation models. Our approach sets a new state of the art on 8 zero-shot depth benchmarks and benefits downstream applications, including 3D reconstruction and generalized robotic grasping. The code is available at https://github.com/DepthAnything/PromptDA .
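At its simplest, multi-scale prompt fusion means resizing the low-resolution LiDAR depth to each decoder scale and injecting it into the features. The additive injection below is a hedged stand-in for the paper's learned fusion blocks; shapes and the nearest-neighbour resize are illustrative:

```python
import numpy as np

def resize_nearest(depth, out_h, out_w):
    """Nearest-neighbour resize of a 2D depth map (dependency-free)."""
    h, w = depth.shape
    ys = np.arange(out_h) * h // out_h
    xs = np.arange(out_w) * w // out_w
    return depth[np.ix_(ys, xs)]

def fuse_prompt(decoder_feats, lidar_depth):
    """Inject the LiDAR prompt at every decoder scale by resizing it to
    that scale and adding it as a feature bias (a learned projection in
    the actual model)."""
    fused = []
    for feat in decoder_feats:                   # feat: [C, H, W]
        _, h, w = feat.shape
        prompt = resize_nearest(lidar_depth, h, w)
        fused.append(feat + prompt[None, :, :])  # broadcast over channels
    return fused

rng = np.random.default_rng(3)
lidar = rng.uniform(0.5, 10.0, size=(24, 32))    # low-res metric depth prompt
feats = [rng.normal(size=(8, 48, 64)), rng.normal(size=(8, 96, 128))]
fused = fuse_prompt(feats, lidar)
print([f.shape for f in fused])
```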

[150] Easy-Poly: An Easy Polyhedral Framework For 3D Multi-Object Tracking

Peng Zhang, Xin Li, Xin Lin, Liang He

Main category: cs.CV

TL;DR: Easy-Poly: A filter-based 3D multi-object tracking framework with camera-LiDAR fusion detection, dynamic data association, adaptive motion modeling, and lifecycle management for improved tracking in complex driving scenarios.

DetailsMotivation: Current 3D MOT methods suffer from high false positives, missed detections, and identity switches, especially in crowded and small-object scenarios, limiting their effectiveness in complex driving environments.

Method: Four key innovations: 1) CNMSMM camera-LiDAR fusion detection with multi-modal augmentation and efficient NMS, 2) Dynamic Track-Oriented data association with class-aware optimal assignment, 3) Dynamic Motion Modeling using confidence-weighted Kalman filter with adaptive noise covariance, 4) Extended life-cycle management system.

Result: Outperforms state-of-the-art methods like Poly-MOT and Fast-Poly, achieving mAP gains (63.30% to 65.65% with LargeKernel3D) and AMOTA improvements (73.1% to 75.6%), while running in real-time.

Conclusion: Easy-Poly advances robustness and adaptability in complex driving environments, paving the way for safer autonomous driving perception systems.

Abstract: Recent 3D multi-object tracking (3D MOT) methods mainly follow tracking-by-detection pipelines, but often suffer from high false positives, missed detections, and identity switches, especially in crowded and small-object scenarios. To address these challenges, we propose Easy-Poly, a filter-based 3D MOT framework with four key innovations: (1) CNMSMM, a novel Camera-LiDAR fusion detection method combining multi-modal augmentation and an efficient NMS with a new loss function to improve small target detection; (2) Dynamic Track-Oriented (DTO) data association that robustly handles uncertainties and occlusions via class-aware optimal assignment and parallel processing strategies; (3) Dynamic Motion Modeling (DMM) using a confidence-weighted Kalman filter with adaptive noise covariance to enhance tracking accuracy; and (4) an extended life-cycle management system reducing identity switches and false terminations. Experimental results show that Easy-Poly outperforms state-of-the-art methods such as Poly-MOT and Fast-Poly, achieving notable gains in mAP (e.g., from 63.30% to 65.65% with LargeKernel3D) and AMOTA (e.g., from 73.1% to 75.6%), while also running in real-time. Our framework advances robustness and adaptability in complex driving environments, paving the way for safer autonomous driving perception.
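The confidence-weighted Kalman filter behind the DMM component can be sketched in one dimension: the measurement noise covariance is inflated for low-confidence detections, so outliers barely move the track. All parameter values below are illustrative, not Easy-Poly's tuned settings:

```python
import numpy as np

def confidence_kalman(zs, confs, dt=1.0, q=0.01, r0=1.0):
    """1-D constant-velocity Kalman filter whose measurement noise is
    scaled by detection confidence: confident detections pull the state
    hard, low-confidence ones are largely ignored."""
    F = np.array([[1.0, dt], [0.0, 1.0]])        # state transition
    H = np.array([[1.0, 0.0]])                   # observe position only
    Q = q * np.eye(2)
    x, P = np.array([zs[0], 0.0]), np.eye(2)
    out = []
    for z, c in zip(zs, confs):
        x, P = F @ x, F @ P @ F.T + Q            # predict
        R = np.array([[r0 / max(c, 1e-3)]])      # adaptive noise covariance
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
        x = x + (K @ (z - H @ x)).ravel()        # update
        P = (np.eye(2) - K @ H) @ P
        out.append(x[0])
    return np.array(out)

# Ground truth moves 1 m per step; one low-confidence outlier at t=5.
truth = np.arange(10, dtype=float)
zs = truth.copy(); zs[5] += 5.0
confs = np.ones(10); confs[5] = 0.05
est = confidence_kalman(zs, confs)
print(np.round(est, 2))
```

Because `R` grows as confidence shrinks, the gain at t=5 is tiny and the spoofed-looking measurement is mostly rejected, which is how the filter suppresses false positives without a hard gating rule.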

[151] Unifying Multiple Foundation Models for Advanced Computational Pathology

Wenhui Lei, Yusheng Tan, Anqi Li, Hanyu Chen, Hengrui Tian, Ruiying Li, Zhengqun Jiang, Fang Yan, Xiaofan Zhang, Shaoting Zhang

Main category: cs.CV

TL;DR: Shazam is an online integration model that adaptively combines multiple pretrained pathology foundation models through adaptive expert weighting and online distillation, enabling efficient consolidation of complementary strengths without additional pretraining.

DetailsMotivation: Current pathology foundation models have varying performance across tasks due to differences in training data composition and reliance on proprietary datasets that cannot be cumulatively expanded. Existing offline distillation methods require dedicated distillation data and repeated retraining to integrate new models.

Method: Shazam uses an online integration approach with adaptive expert weighting and online distillation to fuse multi-level features from multiple pretrained pathology foundation models within a unified and scalable representation learning paradigm.

Result: Shazam consistently outperforms strong individual models across multiple tasks including spatial transcriptomics prediction, survival prognosis, tile-level classification, and visual question answering.

Conclusion: Online model integration provides a practical and extensible strategy for advancing computational pathology by efficiently consolidating complementary model strengths without additional pretraining.

Abstract: Foundation models have substantially advanced computational pathology by learning transferable visual representations from large histological datasets, yet their performance varies widely across tasks due to differences in training data composition and reliance on proprietary datasets that cannot be cumulatively expanded. Existing efforts to combine foundation models through offline distillation partially mitigate this issue but require dedicated distillation data and repeated retraining to integrate new models. Here we present Shazam, an online integration model that adaptively combines multiple pretrained pathology foundation models within a unified and scalable representation learning paradigm. Our findings show that fusing multi-level features through adaptive expert weighting and online distillation enables efficient consolidation of complementary model strengths without additional pretraining. Across spatial transcriptomics prediction, survival prognosis, tile-level classification, and visual question answering, Shazam consistently outperforms strong individual models, demonstrating that online model integration provides a practical and extensible strategy for advancing computational pathology.
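Adaptive expert weighting in its simplest form: score each frozen expert's features with a gate, softmax-normalize, and take the weighted sum. The scalar dot-product gate below is a toy stand-in for Shazam's learned gating (and omits the online-distillation loss):

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_experts(expert_feats, gate_w):
    """Adaptive expert weighting: a gate scores each frozen expert's
    feature vector; the fused representation is the softmax-weighted sum."""
    scores = np.array([gate_w @ f for f in expert_feats])
    weights = softmax(scores)
    fused = sum(w * f for w, f in zip(weights, expert_feats))
    return fused, weights

# Three frozen pathology foundation models' tile embeddings (toy, 32-d).
experts = [rng.normal(size=32) for _ in range(3)]
gate = rng.normal(size=32) * 0.1                 # learnable in practice
fused, weights = fuse_experts(experts, gate)
print(np.round(weights, 3), fused.shape)
```

Adding a new pretrained model only appends one entry to `experts`, which is the sense in which online integration avoids the repeated retraining of offline distillation.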

[152] CNN and ViT Efficiency Study on Tiny ImageNet and DermaMNIST Datasets

Aidar Amangeldi, Angsar Taigonyrov, Muhammad Huzaifa Jawad, Chinedu Emmanuel Mbonu

Main category: cs.CV

TL;DR: Vision Transformers fine-tuned with specific strategies can match or exceed ResNet-18 performance on medical and general image classification while achieving faster inference and fewer parameters, making them viable for resource-constrained environments.

DetailsMotivation: To evaluate trade-offs between convolutional and transformer architectures for image classification, specifically comparing ResNet-18 with Vision Transformer variants to reduce inference latency and model complexity while maintaining acceptable accuracy for deployment in resource-constrained settings.

Method: Used ResNet-18 as baseline and introduced fine-tuning strategy applied to four Vision Transformer variants (Tiny, Small, Base, Large) on DermatologyMNIST and TinyImageNet datasets. Conducted systematic hyperparameter variations to optimize performance.

Result: Appropriately fine-tuned Vision Transformers can match or exceed baseline performance, achieve faster inference, and operate with fewer parameters than ResNet-18, demonstrating their viability for resource-constrained deployment.

Conclusion: Vision Transformers with proper fine-tuning strategies offer competitive alternatives to convolutional architectures for image classification, providing better efficiency-accuracy trade-offs suitable for deployment in resource-limited environments.

Abstract: This study evaluates the trade-offs between convolutional and transformer-based architectures on both medical and general-purpose image classification benchmarks. We use ResNet-18 as our baseline and introduce a fine-tuning strategy applied to four Vision Transformer variants (Tiny, Small, Base, Large) on DermatologyMNIST and TinyImageNet. Our goal is to reduce inference latency and model complexity with acceptable accuracy degradation. Through systematic hyperparameter variations, we demonstrate that appropriately fine-tuned Vision Transformers can match or exceed the baseline’s performance, achieve faster inference, and operate with fewer parameters, highlighting their viability for deployment in resource-constrained environments.

[153] Sample-Specific Noise Injection For Diffusion-Based Adversarial Purification

Yuhao Sun, Jiacheng Zhang, Zesheng Ye, Chaowei Xiao, Feng Liu

Main category: cs.CV

TL;DR: SSNI proposes sample-specific noise injection for diffusion-based purification by adaptively adjusting noise levels based on each sample’s deviation from clean data distribution using score norms.

DetailsMotivation: Current diffusion-based purification methods use a constant noise level for all samples, but optimal noise levels should vary per sample based on how clean they are.

Method: Uses pre-trained score network to estimate sample deviation from clean distribution via score norms, then applies reweighting function to adaptively adjust noise injection level for each sample.

Result: Incorporating SSNI with existing DBP methods improves both accuracy and robustness on CIFAR-10 and ImageNet-1K datasets.

Conclusion: Sample-specific noise injection is necessary for diffusion-based purification, with SSNI framework demonstrating effectiveness across multiple datasets and methods.

Abstract: Diffusion-based purification (DBP) methods aim to remove adversarial noise from the input sample by first injecting Gaussian noise through a forward diffusion process, and then recovering the clean example through a reverse generative process. In the above process, how much Gaussian noise is injected to the input sample is key to the success of DBP methods, which is controlled by a constant noise level $t^*$ for all samples in existing methods. In this paper, we discover that an optimal $t^*$ for each sample indeed could be different. Intuitively, the cleaner a sample is, the less noise should be injected into it, and vice versa. Motivated by this finding, we propose a new framework, called Sample-specific Score-aware Noise Injection (SSNI). Specifically, SSNI uses a pre-trained score network to estimate how much a data point deviates from the clean data distribution (i.e., score norms). Then, based on the magnitude of score norms, SSNI applies a reweighting function to adaptively adjust $t^*$ for each sample, achieving sample-specific noise injections. Empirically, incorporating our framework with existing DBP methods results in a notable improvement in both accuracy and robustness on CIFAR-10 and ImageNet-1K, highlighting the necessity to allocate distinct noise levels to different samples in DBP methods. Our code is available at: https://github.com/tmlr-group/SSNI.
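The score-norm-to-noise-level mapping can be sketched with a Gaussian toy model standing in for the pre-trained score network; the linear reweighting function below is an assumption (SSNI's actual reweighting function may differ), chosen only to show the monotone "dirtier sample, larger $t^*$" behaviour:

```python
import numpy as np

rng = np.random.default_rng(5)

def score_fn(x):
    """Score of a standard-Gaussian data model: grad log p(x) = -x.
    Stands in for a pre-trained score network."""
    return -x

def ssni_noise_levels(batch, t_min=20, t_max=120):
    """Map each sample's score norm (its deviation from the clean data
    distribution) monotonically to a per-sample diffusion timestep t*."""
    norms = np.linalg.norm(score_fn(batch), axis=1)
    w = (norms - norms.min()) / (norms.max() - norms.min() + 1e-8)
    return (t_min + w * (t_max - t_min)).round().astype(int)

clean = rng.normal(size=(8, 64))                    # near the data manifold
attacked = clean + rng.normal(size=(8, 64))         # pushed off-manifold
ts = ssni_noise_levels(np.vstack([clean, attacked]))
print("clean    t*:", ts[:8])
print("attacked t*:", ts[8:])
```

Samples farther from the clean distribution get larger score norms and therefore more injected noise, matching the paper's intuition.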

[154] CP-uniGuard: A Unified, Probability-Agnostic, and Adaptive Framework for Malicious Agent Detection and Defense in Multi-Agent Embodied Perception Systems

Senkang Hu, Yihang Tao, Guowen Xu, Xinyuan Qian, Yiqin Deng, Xianhao Chen, Sam Tak Wu Kwong, Yuguang Fang

Main category: cs.CV

TL;DR: CP-uniGuard: A defense framework for collaborative perception systems that detects and eliminates malicious agents through consensus verification without requiring prior probabilities of malicious behavior.

DetailsMotivation: Collaborative perception systems in multi-agent autonomous driving are vulnerable to attacks from malicious agents that send false perception information, compromising system safety and reliability.

Method: Proposes CP-uniGuard with: 1) Probability-agnostic sample consensus (PASAC) for sampling collaborators without prior probabilities, 2) Collaborative consistency loss (CCLoss) for object detection and BEV segmentation to measure discrepancies, and 3) Online adaptive threshold via dual sliding windows for dynamic threshold adjustment.

Result: Extensive experiments demonstrate the framework’s effectiveness in accurately detecting and eliminating malicious agents in collaborative perception systems.

Conclusion: CP-uniGuard provides a unified, probability-agnostic, and adaptive defense mechanism that enhances the security and reliability of collaborative perception systems against malicious attacks.

Abstract: Collaborative Perception (CP) has been shown to be a promising technique for multi-agent autonomous driving and multi-agent robotic systems, where multiple agents share their perception information to enhance the overall perception performance and expand the perception range. However, in CP, an ego agent needs to receive messages from its collaborators, which makes it vulnerable to attacks from malicious agents. To address this critical issue, we propose a unified, probability-agnostic, and adaptive framework, namely, CP-uniGuard, which is a tailored defense mechanism for CP deployed by each agent to accurately detect and eliminate malicious agents in its collaboration network. Our key idea is to enable CP to reach a consensus rather than a conflict against an ego agent’s perception results. Based on this idea, we first develop a probability-agnostic sample consensus (PASAC) method to effectively sample a subset of the collaborators and verify the consensus without prior probabilities of malicious agents. Furthermore, we define a collaborative consistency loss (CCLoss) for the object detection and bird’s eye view (BEV) segmentation tasks to capture the discrepancy between an ego agent and its collaborators, which is used as a verification criterion for consensus. In addition, we propose an online adaptive threshold via dual sliding windows to dynamically adjust the threshold for consensus verification and ensure the reliability of the systems in dynamic environments. Finally, we conduct extensive experiments and demonstrate the effectiveness of our framework. Code is available at https://github.com/CP-Security/CP-uniGuard.
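The consensus-over-conflict idea, stripped of the perception stack: grow the largest mutually consistent set of messages and flag everyone left outside it. Euclidean discrepancy below stands in for CCLoss, and the greedy set growth is a simplified stand-in for PASAC's subset sampling; no prior probability of maliciousness is used anywhere:

```python
import numpy as np

rng = np.random.default_rng(6)

def consensus_flag(preds, threshold=1.0):
    """Grow the largest mutually consistent set of agent messages
    (pairwise discrepancy below `threshold`) and flag the rest."""
    n = len(preds)
    best = set()
    for seed in range(n):
        group = {seed}
        for j in range(n):
            if j not in group and all(
                np.linalg.norm(preds[j] - preds[k]) < threshold for k in group
            ):
                group.add(j)
        if len(group) > len(best):
            best = group
    return sorted(set(range(n)) - best)

# Five agents report the same object centre; agent 3 spoofs its message.
truth = np.array([10.0, 5.0])
preds = [truth + 0.05 * rng.normal(size=2) for _ in range(5)]
preds[3] = truth + np.array([4.0, -3.0])
print(consensus_flag(preds))                     # flags the malicious agent
```

The honest agents agree to within centimetres, so they form one large consensus set; the spoofed message sits metres away and cannot join any group, which is exactly the conflict signal the framework exploits.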

[155] Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders

Yizhou Wang, Song Mao, Yang Chen, Yufan Shen, Yinqiao Yan, Pinlong Cai, Ding Wang, Guohang Yan, Zhi Yu, Xuming Hu, Botian Shi

Main category: cs.CV

TL;DR: Multi-encoder MLLMs often have redundant vision encoders; systematic masking reveals many encoders are interchangeable or even detrimental, with single/dual encoder variants achieving similar performance with lower cost.

DetailsMotivation: The paper challenges the common assumption in multimodal LLMs that integrating multiple vision encoders with diverse pretraining objectives necessarily improves performance, suggesting this often leads to redundancy rather than complementary benefits.

Method: Systematic encoder masking across representative multi-encoder MLLMs, introducing two metrics: Conditional Utilization Rate (CUR) to measure marginal encoder contribution, and Information Gap (IG) to capture heterogeneity in encoder utility.

Result: Found strong specialization on OCR/Chart tasks (single encoder dominates with >90% CUR), high redundancy on general VQA/knowledge tasks (encoders interchangeable), and instances of detrimental encoders with negative CUR. Masking specific encoders can yield up to 16% higher accuracy on specific tasks and 3.6% overall performance boost.

Conclusion: Challenges the “more encoders are better” heuristic in MLLMs, showing single/dual encoder variants recover over 90% of baseline performance on most non-OCR tasks with substantially lower training resources and inference latency.

Abstract: Recent multimodal large language models (MLLMs) increasingly integrate multiple vision encoders to improve performance on various benchmarks, assuming that diverse pretraining objectives yield complementary visual signals. However, we show this assumption often fails in practice. Through systematic encoder masking across representative multi-encoder MLLMs, we find that performance typically degrades gracefully, and sometimes even improves, when selected encoders are masked, revealing pervasive encoder redundancy. To quantify this effect, we introduce two principled metrics: the Conditional Utilization Rate (CUR), which measures an encoder's marginal contribution in the presence of others, and the Information Gap (IG), which captures heterogeneity in encoder utility within a model. Using these tools, we observe: (i) strong specialization on tasks like OCR and Chart, where a single encoder can dominate with a CUR greater than 90 percent; (ii) high redundancy on general VQA and knowledge-based tasks, where encoders are largely interchangeable; and (iii) instances of detrimental encoders with negative CUR. Notably, masking specific encoders can yield up to 16 percent higher accuracy on a specific task category and a 3.6 percent overall performance boost compared to the full model. Furthermore, single- and dual-encoder variants recover over 90 percent of baseline performance on most non-OCR tasks with substantially lower training resources and inference latency. Our analysis challenges the "more encoders are better" heuristic in MLLMs and provides actionable diagnostics for developing more efficient and effective multimodal architectures.
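One plausible operationalization of the two metrics, under assumptions: the summary does not give exact formulas, so the relative-accuracy-drop definition of CUR and the max-minus-min spread for IG below are reconstructions, and all accuracy numbers are invented for illustration:

```python
def cur(acc_full, acc_masked):
    """Conditional Utilization Rate of an encoder: the relative accuracy
    drop when that encoder is masked while all others stay active.
    Negative values mean masking the encoder *helps* (detrimental)."""
    return (acc_full - acc_masked) / acc_full

# Toy per-task accuracies for a 3-encoder MLLM: full model vs. each
# encoder masked out (illustrative numbers, not from the paper).
acc_full = {"OCR": 0.80, "VQA": 0.70}
acc_masked = {                     # accuracy with encoder e masked
    "OCR": [0.05, 0.78, 0.79],     # encoder 0 dominates OCR
    "VQA": [0.69, 0.68, 0.71],     # interchangeable; encoder 2 detrimental
}

for task in acc_full:
    curs = [cur(acc_full[task], a) for a in acc_masked[task]]
    ig = max(curs) - min(curs)     # Information Gap: spread of encoder utility
    print(task, [round(c, 3) for c in curs], "IG:", round(ig, 3))
```

The toy table reproduces all three observed regimes: a dominant encoder (CUR above 0.9 on OCR), near-interchangeable encoders (tiny CURs on VQA), and a detrimental encoder (negative CUR).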

[156] ArmGS: Composite Gaussian Appearance Refinement for Modeling Dynamic Urban Environments

Guile Wu, Dongfeng Bai, Bingbing Liu

Main category: cs.CV

TL;DR: ArmGS: A 3D Gaussian splatting approach with multi-granularity appearance refinement for dynamic urban scene modeling in autonomous driving simulation.

DetailsMotivation: Existing neural radiance field methods for driving scene modeling have low rendering efficacy, while recent 3D Gaussian splatting approaches neglect fine-grained variations between frames and camera viewpoints, leading to suboptimal results for dynamic urban environments.

Method: Proposes ArmGS with composite driving Gaussian splatting and multi-granularity appearance refinement. Uses a multi-level appearance modeling scheme to optimize transformation parameters for composite Gaussian refinement at three granularities: local Gaussian level, global image level, and dynamic actor level.

Result: Extensive experiments on Waymo, KITTI, NOTR and VKITTI2 datasets demonstrate superiority over state-of-the-art methods in modeling dynamic urban scenes for autonomous driving simulation.

Conclusion: ArmGS effectively models both global scene appearance variations and local fine-grained changes in dynamic urban environments, enabling high-fidelity reconstruction and real-time rendering for autonomous driving simulation.

Abstract: This work focuses on modeling dynamic urban environments for autonomous driving simulation. Contemporary data-driven methods using neural radiance fields have achieved photorealistic driving scene modeling, but they suffer from low rendering efficacy. Recently, some approaches have explored 3D Gaussian splatting for modeling dynamic urban scenes, enabling high-fidelity reconstruction and real-time rendering. However, these approaches often neglect to model fine-grained variations between frames and camera viewpoints, leading to suboptimal results. In this work, we propose a new approach named ArmGS that exploits composite driving Gaussian splatting with multi-granularity appearance refinement for autonomous driving scene modeling. The core idea of our approach is devising a multi-level appearance modeling scheme to optimize a set of transformation parameters for composite Gaussian refinement from multiple granularities, ranging from local Gaussian level to global image level and dynamic actor level. This not only models global scene appearance variations between frames and camera viewpoints, but also models local fine-grained changes of background and objects. Extensive experiments on multiple challenging autonomous driving datasets, namely, Waymo, KITTI, NOTR and VKITTI2, demonstrate the superiority of our approach over the state-of-the-art methods.
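The three-granularity refinement can be caricatured as composed affine colour adjustments at the Gaussian, actor, and image levels. The affine form and all parameter values are assumptions made for illustration; ArmGS optimizes its transformation parameters jointly with the scene:

```python
import numpy as np

def refine_appearance(colors, actor_ids, per_gauss, per_actor, per_image):
    """Compose appearance refinement at three granularities (a simplified
    affine sketch): a global per-image gain/offset, a per-actor gain for
    dynamic objects, and a per-Gaussian fine-grained offset."""
    g_gain, g_off = per_image
    out = colors * per_actor[actor_ids][:, None] * g_gain
    return out + per_gauss + g_off

rng = np.random.default_rng(7)
colors = rng.uniform(0.2, 0.8, size=(6, 3))     # base Gaussian colours
actor_ids = np.array([0, 0, 1, 1, 2, 2])        # background + two actors
per_gauss = 0.01 * rng.normal(size=(6, 3))      # local fine-grained change
per_actor = np.array([1.0, 1.1, 0.9])           # dynamic-actor gains
per_image = (1.05, -0.02)                       # global exposure shift
refined = refine_appearance(colors, actor_ids, per_gauss, per_actor, per_image)
print(refined.shape)
```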

[157] Multimodal LLM With Hierarchical Mixture-of-Experts for VQA on 3D Brain MRI

Arvind Murari Vepa, Yannan Yu, Jingru Gan, Anthony Cuturrufo, Michael F. Romano, Weikai Li, Fabien Scalzo, Wei Wang, Yizhou Sun

Main category: cs.CV

TL;DR: mpLLM: A multimodal LLM for visual question answering on multiparametric 3D brain MRI that generates clinically interpretable tumor descriptors as an adjunct to neurosurgical planning.

DetailsMotivation: Multiparametric 3D brain MRI is crucial for neuroradiology but challenging to interpret for tumor characteristics. There's a need for AI systems that can produce clinically interpretable tumor descriptors to assist neurosurgeons in planning.

Method: Uses a prompt-conditioned hierarchical mixture-of-experts (MoE) to fuse multiple 3D MRI sequences via routing over modality- and token-level projection experts. Proposes synthetic VQA protocol deriving questions/answers from expert segmentation annotations to address limited paired image-text supervision.

Result: Outperforms strong medical VLM baselines by +5.5 points on average (+9.1% relative) and increases radiologist-rated clinical acceptability by +15.9 points (+46.6% relative) across multiple mpMRI datasets.

Conclusion: mpLLM demonstrates clinical utility for brain tumor analysis, featuring three main contributions: first VQA dataset for 3D brain mpMRI, hierarchical MoE architecture for joint reasoning over 3D sequences, and expert-supported evidence of clinical utility.

Abstract: Multiparametric 3D brain MRI (mpMRI) is central to neuroradiology, but producing tumor location, appearance, size, and involvement of critical structures for neurosurgical planning remains challenging. We introduce mpLLM, a multimodal LLM for visual question answering (VQA) on mpMRI that produces clinically interpretable tumor descriptors (e.g., volume, morphology, extent, and coarse localization) as an adjunct to clinical expertise for referring neurosurgeons. mpLLM uses a prompt-conditioned hierarchical mixture-of-experts (MoE) to fuse multiple 3D sequences via routing over modality- and token-level projection experts, enabling data-efficient end-to-end training without large-scale image-report pretraining. To address limited paired image-text supervision, we propose a synthetic VQA protocol that derives clinically grounded questions and answers from expert segmentation annotations and is validated with radiologist collaboration. Across multiple mpMRI datasets, mpLLM improves over strong medical VLM baselines by +5.5 points on average (+9.1% relative) and increases radiologist-rated clinical acceptability by +15.9 points (+46.6% relative). Our study features three main contributions: (1) the first VQA dataset for 3D brain mpMRI, (2) a hierarchical MoE architecture for joint reasoning over interrelated 3D sequences, and (3) expert-supported evidence of clinical utility. Source code is available at https://github.com/arvindmvepa/mpllm, and we will release the dataset upon publication.
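A two-level routing sketch of the prompt-conditioned hierarchical MoE: a modality-level gate conditioned on the prompt weighs whole MRI sequences, then per-token gates mix a shared pool of projection experts within each sequence. Shapes and the gating forms are illustrative assumptions, not mpLLM's actual architecture details:

```python
import numpy as np

rng = np.random.default_rng(8)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_moe(prompt_emb, modality_tokens, mod_gate, tok_gates, experts):
    """Two-level routing: the prompt picks modality weights, then each
    token mixes the projection experts within its modality."""
    mod_w = softmax(mod_gate @ prompt_emb)           # [n_modalities]
    fused = 0.0
    for m, toks in enumerate(modality_tokens):       # toks: [T, D]
        tok_w = softmax(toks @ tok_gates[m].T)       # [T, n_experts]
        proj = np.stack([toks @ E.T for E in experts], axis=1)  # [T, nE, Do]
        fused = fused + mod_w[m] * (tok_w[:, :, None] * proj).sum(axis=1)
    return fused                                     # [T, Do]

n_mod, T, D, D_out, n_exp = 4, 10, 16, 8, 3          # 4 mpMRI sequences
prompt = rng.normal(size=D)
tokens = [rng.normal(size=(T, D)) for _ in range(n_mod)]
mod_gate = rng.normal(size=(n_mod, D)) * 0.1
tok_gates = [rng.normal(size=(n_exp, D)) * 0.1 for _ in range(n_mod)]
experts = [rng.normal(size=(D_out, D)) for _ in range(n_exp)]
out = hierarchical_moe(prompt, tokens, mod_gate, tok_gates, experts)
print(out.shape)
```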

[158] From slides to AI-ready maps: Standardized multi-layer tissue maps as metadata for artificial intelligence in digital pathology

Gernot Fiala, Markus Plass, Robert Harb, Peter Regitnig, Kristijan Skok, Wael Al Zoughbi, Carmen Zerner, Paul Torke, Michaela Kargl, Heimo Müller, Tomas Brazdil, Matej Gallo, Jaroslav Kubín, Roman Stoklasa, Rudolf Nenutil, Norman Zerbe, Andreas Holzinger, Petr Holub

Main category: cs.CV

TL;DR: Proposes a framework for generating standardized 2D index maps (tissue maps) to describe morphological content in Whole Slide Images, enabling AI-ready metadata for improved search and dataset assembly in medical imaging archives.

DetailsMotivation: WSIs lack standardized metadata for content description, making AI cohort assembly reliant on manual inspection, which is impractical for large collections with millions of images.

Method: Develops a general framework to generate 2D index maps (tissue maps) with three-layer structure: source, tissue type, and pathological alterations, using common syntax/semantics for interoperability.

Result: Demonstrates AI-based metadata extraction from WSIs to generate tissue maps and integrates them into WSI archives, enhancing search capabilities and facilitating targeted dataset assembly.

Conclusion: The proposed tissue map standard enables interoperability between WSI catalogs, accelerates assembly of high-quality AI datasets, and supports cancer research through improved WSI archive search capabilities.

Abstract: A Whole Slide Image (WSI) is a high-resolution digital image created by scanning an entire glass slide containing a biological specimen, such as tissue sections or cell samples, at multiple magnifications. These images are digitally viewable, analyzable, and shareable, and are widely used for Artificial Intelligence (AI) algorithm development. WSIs play an important role in pathology for disease diagnosis and oncology for cancer research, but are also applied in neurology, veterinary medicine, hematology, microbiology, dermatology, pharmacology, toxicology, immunology, and forensic science. When assembling cohorts for AI training or validation, it is essential to know the content of a WSI. However, no standard currently exists for this metadata, and such a selection has largely relied on manual inspection, which is not suitable for large collections with millions of objects. We propose a general framework to generate 2D index maps (tissue maps) that describe the morphological content of WSIs using common syntax and semantics to achieve interoperability between catalogs. The tissue maps are structured in three layers: source, tissue type, and pathological alterations. Each layer assigns WSI segments to specific classes, providing AI-ready metadata. We demonstrate the advantages of this standard by applying AI-based metadata extraction from WSIs to generate tissue maps and integrating them into a WSI archive. This integration enhances search capabilities within WSI archives, thereby facilitating the accelerated assembly of high-quality, balanced, and more targeted datasets for AI training, validation, and cancer research.
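A hypothetical sketch of what a three-layer tissue map could look like as AI-ready, queryable metadata. Field names and class labels are illustrative, not the proposed standard's schema.

```python
# Illustrative schema only — not the authors' standard.
from dataclasses import dataclass, field

@dataclass
class TissueMap:
    """2D index map for one WSI: each layer maps segment ids to class labels."""
    wsi_id: str
    source: dict = field(default_factory=dict)                    # e.g. organ of origin
    tissue_type: dict = field(default_factory=dict)               # e.g. epithelium, stroma
    pathological_alterations: dict = field(default_factory=dict)  # e.g. tumor, necrosis

def find_wsis(catalog, layer, label):
    """Return ids of WSIs whose given layer contains the requested class."""
    return [m.wsi_id for m in catalog if label in getattr(m, layer).values()]

catalog = [
    TissueMap("wsi-001", tissue_type={0: "epithelium"},
              pathological_alterations={0: "tumor"}),
    TissueMap("wsi-002", tissue_type={0: "stroma"}),
]
print(find_wsis(catalog, "pathological_alterations", "tumor"))  # ['wsi-001']
```

Structured maps like this are what would let an archive answer "all slides containing tumor tissue" without opening a single image.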

[159] FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing

Junyi Wu, Zhiteng Li, Haotong Qin, Xiaohong Liu, Linghe Kong, Yulun Zhang, Xiaokang Yang

Main category: cs.CV

TL;DR: FlashEdit enables real-time text-guided image editing with diffusion models via one-step inversion-editing pipeline, background shielding, and sparsified attention for 150× speedup.

DetailsMotivation: Current diffusion-based image editing methods achieve high quality but suffer from prohibitive latency, hindering real-world applications that require real-time performance.

Method: Three key innovations: (1) One-Step Inversion-and-Editing (OSIE) pipeline bypassing iterative processes; (2) Background Shield (BG-Shield) for selective feature modification only within edit region; (3) Sparsified Spatial Cross-Attention (SSCA) to suppress semantic leakage to background.

Result: FlashEdit performs edits in under 0.2 seconds (150× speedup vs prior methods) while maintaining superior background consistency and structural integrity.

Conclusion: FlashEdit enables high-fidelity, real-time image editing with diffusion models, making text-guided image editing practical for real-world applications.

Abstract: Text-guided image editing with diffusion models has achieved remarkable quality but suffers from prohibitive latency, hindering real-world applications. We introduce FlashEdit, a novel framework designed to enable high-fidelity, real-time image editing. Its efficiency stems from three key innovations: (1) a One-Step Inversion-and-Editing (OSIE) pipeline that bypasses costly iterative processes; (2) a Background Shield (BG-Shield) technique that guarantees background preservation by selectively modifying features only within the edit region; and (3) a Sparsified Spatial Cross-Attention (SSCA) mechanism that ensures precise, localized edits by suppressing semantic leakage to the background. Extensive experiments demonstrate that FlashEdit maintains superior background consistency and structural integrity, while performing edits in under 0.2 seconds, which is an over 150$\times$ speedup compared to prior multi-step methods. Our code will be made publicly available at https://github.com/JunyiWuCode/FlashEdit.
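One way to picture the background-shielding idea is cross-attention whose text contribution is gated by the edit mask, so prompt semantics cannot leak outside the edit region. A minimal numpy sketch under that assumption; it is not the paper's exact BG-Shield/SSCA mechanism.

```python
import numpy as np

def shielded_cross_attention(x, k, v, edit_mask):
    """Cross-attend image tokens x to text tokens (k, v), but let the text
    update only tokens inside the edit region; background tokens pass through."""
    scores = x @ k.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    attended = w @ v
    m = edit_mask[:, None].astype(float)
    return m * (x + attended) + (1 - m) * x   # background features untouched

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 4))                   # 6 image tokens
k = rng.normal(size=(3, 4))                   # 3 text tokens
v = rng.normal(size=(3, 4))
mask = np.array([1, 1, 0, 0, 0, 0], dtype=bool)   # only the first 2 tokens editable
out = shielded_cross_attention(x, k, v, mask)
print(np.allclose(out[2:], x[2:]))  # True — background rows unchanged
```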

[160] Structured Spectral Graph Representation Learning for Multi-label Abnormality Analysis from 3D CT Scans

Theo Di Piazza, Carole Lazarus, Olivier Nempont, Loic Boussel

Main category: cs.CV

TL;DR: A graph-based framework for 3D chest CT analysis using axial slice triplets as nodes with spectral graph convolution, achieving strong cross-dataset generalization and competitive performance.

DetailsMotivation: Address the limitations of existing 3D CNN methods (struggle with long-range dependencies) and Vision Transformers (require extensive domain-specific pre-training) for multi-label classification of 3D chest CT scans, while maintaining clinical deployment feasibility.

Method: Proposes a 2.5D graph-based framework representing 3D CT volumes as structured graphs where axial slice triplets serve as nodes, processed through spectral graph convolution to capture inter-slice dependencies.

Result: Achieves strong cross-dataset generalization across 3 independent institution datasets, shows competitive performance compared to state-of-the-art visual encoders, and demonstrates broader applicability through transfer experiments on radiology report generation and abdominal CT data.

Conclusion: The graph-based approach provides an effective alternative to 3D CNNs and Vision Transformers for volumetric medical image analysis, balancing performance with computational efficiency suitable for clinical deployment.

Abstract: With the growing volume of CT examinations, there is an increasing demand for automated tools such as organ segmentation, abnormality detection, and report generation to support radiologists in managing their clinical workload. Multi-label classification of 3D Chest CT scans remains a critical yet challenging problem due to the complex spatial relationships inherent in volumetric data and the wide variability of abnormalities. Existing methods based on 3D convolutional neural networks struggle to capture long-range dependencies, while Vision Transformers often require extensive pre-training on large-scale, domain-specific datasets to perform competitively. In this work, we propose a 2.5D alternative by introducing a new graph-based framework that represents 3D CT volumes as structured graphs, where axial slice triplets serve as nodes processed through spectral graph convolution, enabling the model to reason over inter-slice dependencies while maintaining complexity compatible with clinical deployment. Our method, trained and evaluated on 3 datasets from independent institutions, achieves strong cross-dataset generalization, and shows competitive performance compared to state-of-the-art visual encoders. We further conduct comprehensive ablation studies to evaluate the impact of various aggregation strategies, edge-weighting schemes, and graph connectivity patterns. Additionally, we demonstrate the broader applicability of our approach through transfer experiments on automated radiology report generation and abdominal CT data.
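The core operation, a spectral (GCN-style) graph convolution over slice-triplet nodes, can be sketched as follows; the chain adjacency along the z-axis and the feature sizes are illustrative.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN-style spectral graph convolution:
    H' = relu(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))         # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# 4 nodes = 4 axial slice triplets, chained along the z-axis.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))    # per-triplet features from a 2D encoder
W = rng.normal(size=(8, 16))
out = gcn_layer(H, A, W)
print(out.shape)  # (4, 16)
```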

[161] Heterogeneous Complementary Distillation

Liuchi Xu, Hao Zheng, Lu Wang, Lisheng Xu, Jun Cheng

Main category: cs.CV

TL;DR: Heterogeneous Complementary Distillation (HCD) framework for knowledge transfer between different architectures (e.g., ViT to ResNet) using complementary feature integration and sub-logit decoupled distillation.

DetailsMotivation: Heterogeneous knowledge distillation (e.g., Vision Transformer to ResNet) faces challenges due to architectural differences in spatial feature representations. Existing methods are designed for homogeneous architectures, are computationally expensive, or rely too heavily on logit alignment, which limits their use of complementary features.

Method: Proposes HCD framework that integrates complementary teacher and student features to align representations in shared logits. Uses Complementary Feature Mapper (CFM) to process student features (convolutional projector + adaptive pooling) concatenated with teacher’s penultimate layer features. Introduces Sub-logit Decoupled Distillation (SDD) that partitions shared logits into n sub-logits fused with teacher’s logits, plus Orthogonality Loss (OL) to ensure sub-logit diversity and reduce redundant knowledge transfer.

Result: Extensive experiments on CIFAR-100, Fine-grained datasets (CUB200), and ImageNet-1K demonstrate HCD outperforms state-of-the-art KD methods for heterogeneous distillation.

Conclusion: HCD is an effective solution for heterogeneous knowledge distillation that preserves student-specific strengths while leveraging teacher knowledge, enhancing robustness and generalization in student models.

Abstract: Knowledge distillation (KD) transfers the dark knowledge from a complex teacher to a compact student. However, heterogeneous architecture distillation, such as Vision Transformer (ViT) to ResNet18, faces challenges due to differences in spatial feature representations. Traditional KD methods are mostly designed for homogeneous architectures and hence struggle to effectively address the disparity. Although heterogeneous KD approaches have been developed recently to solve these issues, they often incur high computational costs and complex designs, or overly rely on logit alignment, which limits their ability to leverage the complementary features. To overcome these limitations, we propose Heterogeneous Complementary Distillation (HCD), a simple yet effective framework that integrates complementary teacher and student features to align representations in shared logits. These logits are decomposed and constrained to facilitate diverse knowledge transfer to the student. Specifically, HCD processes the student’s intermediate features through a convolutional projector and adaptive pooling, concatenates them with the teacher’s feature from the penultimate layer, and then maps them via the Complementary Feature Mapper (CFM) module, comprising a fully connected layer, to produce shared logits. We further introduce Sub-logit Decoupled Distillation (SDD), which partitions the shared logits into n sub-logits that are fused with the teacher’s logits to rectify classification. To ensure sub-logit diversity and reduce redundant knowledge transfer, we propose an Orthogonality Loss (OL). By preserving student-specific strengths and leveraging teacher knowledge, HCD enhances robustness and generalization in students. Extensive experiments on the CIFAR-100, fine-grained (e.g., CUB200), and ImageNet-1K datasets demonstrate that HCD outperforms state-of-the-art KD methods, establishing it as an effective solution for heterogeneous KD.
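One plausible reading of Sub-logit Decoupled Distillation and the Orthogonality Loss, for intuition only: the abstract does not fully specify how logits are partitioned or fused, so the chunking and additive fusion below are assumptions, not the authors' formulation.

```python
import numpy as np

def sdd_with_ol(shared_logits, teacher_logits, n_sub):
    """Split shared logits into n_sub sub-logits over the class dimension,
    fuse each with the teacher's logits, and penalize overlap between
    normalized sub-logits via off-diagonal Gram entries (orthogonality loss)."""
    C = teacher_logits.shape[-1]
    subs = shared_logits.reshape(n_sub, C)
    fused = subs + teacher_logits[None, :]       # teacher rectifies each sub-logit
    U = subs / np.linalg.norm(subs, axis=1, keepdims=True)
    gram = U @ U.T
    ol = np.sum((gram - np.eye(n_sub)) ** 2)     # zero iff sub-logits orthogonal
    return fused, ol

# Perfectly orthogonal sub-logits incur zero orthogonality loss.
shared = np.eye(4).flatten()                     # 4 sub-logits over 4 classes
fused, ol = sdd_with_ol(shared, np.zeros(4), n_sub=4)
print(fused.shape, round(ol, 6))  # (4, 4) 0.0
```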

[162] ProCache: Constraint-Aware Feature Caching with Selective Computation for Diffusion Transformer Acceleration

Fanpu Cao, Yaofo Chen, Zeng You, Wei Luo

Main category: cs.CV

TL;DR: ProCache: A training-free dynamic feature caching framework for Diffusion Transformers that accelerates inference by 1.96-2.90x with minimal quality loss through non-uniform caching patterns and selective computation.

DetailsMotivation: Diffusion Transformers (DiTs) have state-of-the-art generative performance but high computational costs hinder real-time deployment. Existing feature caching methods have limitations: uniform caching intervals don't match DiT's non-uniform temporal dynamics, and naive feature reuse with large intervals causes error accumulation.

Method: ProCache uses two core components: (1) constraint-aware caching pattern search that generates non-uniform activation schedules through offline constrained sampling, tailored to DiT’s temporal characteristics; (2) selective computation module that selectively computes within deep blocks and high-importance tokens for cached segments to mitigate error accumulation with minimal overhead.

Result: Extensive experiments on PixArt-alpha and DiT show ProCache achieves up to 1.96x and 2.90x acceleration with negligible quality degradation, significantly outperforming prior caching-based methods.

Conclusion: ProCache effectively addresses limitations of existing feature caching methods for DiTs by aligning caching patterns with temporal dynamics and mitigating error accumulation, enabling significant acceleration without training.

Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art performance in generative modeling, yet their high computational cost hinders real-time deployment. While feature caching offers a promising training-free acceleration solution by exploiting temporal redundancy, existing methods suffer from two key limitations: (1) uniform caching intervals fail to align with the non-uniform temporal dynamics of DiT, and (2) naive feature reuse with excessively large caching intervals can lead to severe error accumulation. In this work, we analyze the evolution of DiT features during denoising and reveal that both feature changes and error propagation are highly time- and depth-varying. Motivated by this, we propose ProCache, a training-free dynamic feature caching framework that addresses these issues via two core components: (i) a constraint-aware caching pattern search module that generates non-uniform activation schedules through offline constrained sampling, tailored to the model’s temporal characteristics; and (ii) a selective computation module that selectively computes within deep blocks and high-importance tokens for cached segments to mitigate error accumulation with minimal overhead. Extensive experiments on PixArt-alpha and DiT demonstrate that ProCache achieves up to 1.96x and 2.90x acceleration with negligible quality degradation, significantly outperforming prior caching-based methods.
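The paper searches caching schedules offline via constrained sampling; as a simple stand-in, a greedy schedule that spends a fixed recomputation budget on the steps with the largest feature change conveys the non-uniform idea.

```python
import numpy as np

def cache_schedule(change, budget):
    """Pick which denoising steps recompute features (True) vs reuse the cache
    (False): spend the budget on the steps with the largest feature change.
    'change' is a per-step feature-change score; step 0 is always computed."""
    recompute = np.zeros(len(change), dtype=bool)
    recompute[0] = True
    order = np.argsort(change[1:])[::-1] + 1     # remaining steps, largest first
    recompute[order[: budget - 1]] = True
    return recompute

change = np.array([0.0, 0.9, 0.1, 0.8, 0.05, 0.7, 0.02, 0.6])
print(cache_schedule(change, budget=4).astype(int))  # [1 1 0 1 0 1 0 0]
```

Note how the active steps cluster where features move fastest, unlike a uniform every-k-steps schedule.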

[163] R3DPA: Leveraging 3D Representation Alignment and RGB Pretrained Priors for LiDAR Scene Generation

Nicolas Sereyjol-Garros, Ellington Kirby, Victor Besnier, Nermin Samet

Main category: cs.CV

TL;DR: R3DPA: First LiDAR scene generation method that leverages image-pretrained priors and self-supervised 3D representations for state-of-the-art LiDAR point cloud generation with control capabilities.

DetailsMotivation: LiDAR data scarcity for robotic tasks like autonomous driving; existing diffusion/flow matching models limited by small 3D datasets compared to massive RGB datasets; need to leverage large-scale image priors for 3D generation.

Method: Aligns intermediate features of the generative model with self-supervised 3D features; transfers knowledge from image-pretrained generative models to LiDAR generation; enables point cloud control at inference for object inpainting and scene mixing with an unconditional model alone.

Result: Achieves state-of-the-art performance on KITTI-360 benchmark; generates high-quality LiDAR scenes; enables control capabilities like object inpainting and scene mixing.

Conclusion: R3DPA successfully bridges image and 3D domains by leveraging image-pretrained priors for LiDAR generation, addressing data scarcity and achieving superior results with control capabilities.

Abstract: LiDAR scene synthesis is an emerging solution to scarcity in 3D data for robotic tasks such as autonomous driving. Recent approaches employ diffusion or flow matching models to generate realistic scenes, but 3D data remains limited compared to RGB datasets with millions of samples. We introduce R3DPA, the first LiDAR scene generation method to unlock image-pretrained priors for LiDAR point clouds, and leverage self-supervised 3D representations for state-of-the-art results. Specifically, we (i) align intermediate features of our generative model with self-supervised 3D features, which substantially improves generation quality; (ii) transfer knowledge from large-scale image-pretrained generative models to LiDAR generation, mitigating limited LiDAR datasets; and (iii) enable point cloud control at inference for object inpainting and scene mixing with solely an unconditional model. On the KITTI-360 benchmark R3DPA achieves state of the art performance. Code and pretrained models are available at https://github.com/valeoai/R3DPA.
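Contribution (i) is typically a REPA-style alignment term: negative cosine similarity between projected generator features and frozen self-supervised features. A minimal sketch, assuming a linear projection W (the paper's projection head and feature granularity may differ).

```python
import numpy as np

def repa_loss(gen_feats, ssl_feats, W):
    """Negative mean cosine similarity between projected generator features
    and frozen self-supervised 3D features (one row per point/voxel patch)."""
    proj = gen_feats @ W                             # project to the SSL feature dim
    proj = proj / np.linalg.norm(proj, axis=1, keepdims=True)
    tgt = ssl_feats / np.linalg.norm(ssl_feats, axis=1, keepdims=True)
    return -np.mean(np.sum(proj * tgt, axis=1))      # minimized at perfect alignment

feats = np.random.default_rng(2).normal(size=(10, 8))
print(round(repa_loss(feats, feats, np.eye(8)), 6))  # -1.0 — already aligned
```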

[164] Sim2real Image Translation Enables Viewpoint-Robust Policies from Fixed-Camera Datasets

Jeremiah Coholich, Justin Wit, Robert Azarcon, Zsolt Kira

Main category: cs.CV

TL;DR: MANGO: An unpaired image translation method for sim2real transfer in robot manipulation that maintains viewpoint consistency using segmentation-conditioned InfoNCE loss, regularized discriminator, and modified PatchNCE loss.

DetailsMotivation: Vision-based robot manipulation policies are brittle to camera viewpoint variations. Real robot demonstration data is scarce and lacks viewpoint diversity, while simulation offers scalable data collection but faces visual sim2real challenges.

Method: Proposes MANGO with three key components: 1) segmentation-conditioned InfoNCE loss, 2) highly-regularized discriminator design, and 3) modified PatchNCE loss. Requires only a small amount of fixed-camera real-world data but can generate diverse unseen viewpoints from simulated observations.

Result: Outperforms other image translation methods in sim2real translation. In real-world tabletop manipulation tasks, MANGO augmentation increases shifted-view success rates by over 40 percentage points compared to policies trained without augmentation.

Conclusion: MANGO effectively bridges the sim2real gap for robot manipulation by enabling viewpoint-consistent image translation with minimal real data, significantly improving policy robustness to camera viewpoint variations.

Abstract: Vision-based policies for robot manipulation have achieved significant recent success, but are still brittle to distribution shifts such as camera viewpoint variations. Robot demonstration data is scarce and often lacks appropriate variation in camera viewpoints. Simulation offers a way to collect robot demonstrations at scale with comprehensive coverage of different viewpoints, but presents a visual sim2real challenge. To bridge this gap, we propose MANGO – an unpaired image translation method with a novel segmentation-conditioned InfoNCE loss, a highly-regularized discriminator design, and a modified PatchNCE loss. We find that these elements are crucial for maintaining viewpoint consistency during sim2real translation. When training MANGO, we only require a small amount of fixed-camera data from the real world, but show that our method can generate diverse unseen viewpoints by translating simulated observations. In this setting, MANGO outperforms all other image translation methods we tested. In certain real-world tabletop manipulation tasks, MANGO augmentation increases shifted-view success rates by over 40 percentage points compared to policies trained without augmentation.
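A minimal sketch of an InfoNCE term where the positive candidate is the one sharing the anchor patch's segmentation class; the exact conditioning in MANGO's loss is not given here, so this is illustrative only.

```python
import numpy as np

def infonce(anchor, candidates, pos_idx, tau=0.07):
    """InfoNCE for one anchor patch: the candidate at pos_idx (same
    segmentation class as the anchor) is the positive; the rest are negatives."""
    a = anchor / np.linalg.norm(anchor)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = c @ a / tau                      # cosine similarities, temperature-scaled
    return -(logits[pos_idx] - np.log(np.sum(np.exp(logits))))

anchor = np.array([1.0, 0.0])
cands = np.array([[0.9, 0.1],    # same-class patch (positive)
                  [0.0, 1.0]])   # different-class patch (negative)
print(infonce(anchor, cands, 0) < infonce(anchor, cands, 1))  # True
```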

[165] SimpleMatch: A Simple and Strong Baseline for Semantic Correspondence

Hailing Jin, Huiying Li

Main category: cs.CV

TL;DR: SimpleMatch: A lightweight semantic correspondence framework that achieves strong performance at low resolutions by addressing feature fusion issues from downsampling, using an upsample decoder and multi-scale supervision.

DetailsMotivation: Current semantic correspondence methods rely on high-resolution inputs for optimal performance, causing computational overhead. A fundamental limitation is the irreversible fusion of adjacent keypoint features when semantically distinct keypoints fall within the same downsampled receptive field during deep downsampling operations.

Method: Proposes SimpleMatch with: 1) Lightweight upsample decoder that progressively recovers spatial detail by upsampling deep features to 1/4 resolution, 2) Multi-scale supervised loss ensuring upsampled features retain discriminative features across spatial scales, 3) Sparse matching and window-based localization to optimize training memory usage.

Result: Achieves 84.1% PCK@0.1 on SPair-71k benchmark at 252x252 resolution (3.3x smaller than current SOTA methods), reduces training memory usage by 51%.

Conclusion: SimpleMatch provides a practical and efficient baseline for semantic correspondence research, demonstrating strong performance at low resolutions while addressing computational overhead issues.

Abstract: Recent advances in semantic correspondence have been largely driven by the use of pre-trained large-scale models. However, a limitation of these approaches is their dependence on high-resolution input images to achieve optimal performance, which results in considerable computational overhead. In this work, we address a fundamental limitation in current methods: the irreversible fusion of adjacent keypoint features caused by deep downsampling operations. This issue is triggered when semantically distinct keypoints fall within the same downsampled receptive field (e.g., 16x16 patches). To address this issue, we present SimpleMatch, a simple yet effective framework for semantic correspondence that delivers strong performance even at low resolutions. We propose a lightweight upsample decoder that progressively recovers spatial detail by upsampling deep features to 1/4 resolution, and a multi-scale supervised loss that ensures the upsampled features retain discriminative features across different spatial scales. In addition, we introduce sparse matching and window-based localization to optimize training memory usage and reduce it by 51%. At a resolution of 252x252 (3.3x smaller than current SOTA methods), SimpleMatch achieves superior performance with 84.1% PCK@0.1 on the SPair-71k benchmark. We believe this framework provides a practical and efficient baseline for future research in semantic correspondence. Code is available at: https://github.com/hailong23-jin/SimpleMatch.
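Window-based localization can be sketched as scoring a keypoint descriptor only inside a window of the target feature map rather than over the whole map, which is where the memory savings come from. Window size, feature sizes, and the coarse center below are illustrative.

```python
import numpy as np

def window_localize(src_desc, tgt_feats, center, radius):
    """Score one source keypoint descriptor only inside a window of the target
    feature map around a coarse estimate, then take the best-matching cell."""
    H, W, _ = tgt_feats.shape
    y0, y1 = max(center[0] - radius, 0), min(center[0] + radius + 1, H)
    x0, x1 = max(center[1] - radius, 0), min(center[1] + radius + 1, W)
    scores = tgt_feats[y0:y1, x0:x1] @ src_desc   # dot product per window cell
    iy, ix = np.unravel_index(np.argmax(scores), scores.shape)
    return y0 + iy, x0 + ix

tgt = np.zeros((16, 16, 4))
tgt[9, 11] = [1, 1, 1, 1]        # plant the true correspondence
print(window_localize(np.ones(4), tgt, center=(8, 10), radius=3))  # (9, 11)
```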

[166] MDAFNet: Multiscale Differential Edge and Adaptive Frequency Guided Network for Infrared Small Target Detection

Shuying Li, Qiang Ma, San Zhang, Wuwei Wang, Chuang Yang

Main category: cs.CV

TL;DR: MDAFNet is a novel network for infrared small target detection that addresses edge degradation and frequency interference issues through multi-scale differential edge enhancement and dual-domain adaptive feature processing.

DetailsMotivation: Existing IRSTD methods suffer from gradual degradation of target edge pixels during deep network processing, and traditional convolution struggles to differentiate frequency components, leading to background interference and false detections from noise.

Method: Proposes MDAFNet with two key modules: 1) Multi-Scale Differential Edge (MSDE) module for edge extraction and enhancement to compensate for edge loss during downsampling, and 2) Dual-Domain Adaptive Feature Enhancement (DAFE) module combining frequency domain processing with simulated frequency decomposition/fusion in spatial domain to enhance high-frequency targets while suppressing noise.

Result: Experimental results on multiple datasets demonstrate superior detection performance compared to existing methods.

Conclusion: MDAFNet effectively addresses edge degradation and frequency interference problems in infrared small target detection through innovative multi-scale edge enhancement and dual-domain adaptive feature processing.

Abstract: Infrared small target detection (IRSTD) plays a crucial role in numerous military and civilian applications. However, existing methods often face the gradual degradation of target edge pixels as the number of network layers increases, and traditional convolution struggles to differentiate between frequency components during feature extraction, leading to low-frequency backgrounds interfering with high-frequency targets and high-frequency noise triggering false detections. To address these limitations, we propose MDAFNet (Multi-scale Differential Edge and Adaptive Frequency Guided Network for Infrared Small Target Detection), which integrates the Multi-Scale Differential Edge (MSDE) module and Dual-Domain Adaptive Feature Enhancement (DAFE) module. The MSDE module, through a multi-scale edge extraction and enhancement mechanism, effectively compensates for the cumulative loss of target edge information during downsampling. The DAFE module combines frequency domain processing mechanisms with simulated frequency decomposition and fusion mechanisms in the spatial domain to effectively improve the network’s capability to adaptively enhance high-frequency targets and selectively suppress high-frequency noise. Experimental results on multiple datasets demonstrate the superior detection performance of MDAFNet.
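As a toy stand-in for multi-scale differential edge extraction (not the MSDE module itself), summed central differences at several dilations respond to point-like targets while staying silent on flat background.

```python
import numpy as np

def multiscale_edges(img, scales=(1, 2, 4)):
    """Absolute central differences at several dilations, summed — a simple
    stand-in for multi-scale differential edge responses."""
    edges = np.zeros(img.shape, dtype=float)
    for s in scales:
        edges += np.abs(np.roll(img, -s, axis=1) - np.roll(img, s, axis=1))  # horiz.
        edges += np.abs(np.roll(img, -s, axis=0) - np.roll(img, s, axis=0))  # vert.
    return edges

img = np.zeros((32, 32))
img[16, 16] = 1.0                 # a point-like "small target"
e = multiscale_edges(img)
print(e.max() > 0, multiscale_edges(np.zeros((8, 8))).max() == 0.0)  # True True
```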

[167] A Step to Decouple Optimization in 3DGS

Renjie Ding, Yaonan Wang, Min Liu, Jialin Zhu, Jiazheng Wang, Jiahao Zhao, Wenting Shen, Feixiang He, Xiang Chen

Main category: cs.CV

TL;DR: The paper analyzes optimization issues in 3D Gaussian Splatting (3DGS) and proposes AdamW-GS, a redesigned optimization method that addresses coupling problems in gradient updates and regularization.

DetailsMotivation: The authors identify two overlooked optimization problems in 3DGS: (1) update step coupling causing optimizer state rescaling and costly attribute updates outside viewpoints, and (2) gradient coupling in moments leading to under- or over-effective regularization. These complex couplings are under-explored despite 3DGS's importance for real-time novel view synthesis.

Method: The paper revisits 3DGS optimization and decouples it into three components: Sparse Adam, Re-State Regularization, and Decoupled Attribute Regularization. After extensive experiments under 3DGS and 3DGS-MCMC frameworks, the authors propose AdamW-GS by re-coupling the beneficial components based on empirical analysis.

Result: The proposed AdamW-GS achieves better optimization efficiency and representation effectiveness simultaneously compared to existing approaches.

Conclusion: The work provides deeper understanding of optimization components in 3DGS and demonstrates that redesigned optimization (AdamW-GS) can significantly improve both efficiency and effectiveness in 3D Gaussian Splatting.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time novel view synthesis. As an explicit representation optimized through gradient propagation among primitives, optimization widely accepted in deep neural networks (DNNs) is actually adopted in 3DGS, such as synchronous weight updating and Adam with the adaptive gradient. However, considering the physical significance and specific design in 3DGS, there are two overlooked details in the optimization of 3DGS: (i) update step coupling, which induces optimizer state rescaling and costly attribute updates outside the viewpoints, and (ii) gradient coupling in the moment, which may lead to under- or over-effective regularization. Nevertheless, such a complex coupling is under-explored. After revisiting the optimization of 3DGS, we take a step to decouple it and recompose the process into: Sparse Adam, Re-State Regularization and Decoupled Attribute Regularization. Through a large number of experiments under the 3DGS and 3DGS-MCMC frameworks, our work provides a deeper understanding of these components. Finally, based on the empirical analysis, we re-design the optimization and propose AdamW-GS by re-coupling the beneficial components, under which better optimization efficiency and representation effectiveness are achieved simultaneously.
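A sketch of what a sparse, decoupled AdamW step could look like for Gaussian attributes, assuming "visible" marks primitives that received gradients from the current viewpoint. This is illustrative, not the paper's exact AdamW-GS update.

```python
import numpy as np

def sparse_adamw_step(p, g, m, v, visible, t, lr=1e-2, b1=0.9, b2=0.999,
                      eps=1e-8, wd=1e-2):
    """One AdamW update restricted to 'visible' parameters; hidden parameters
    keep both their value and their optimizer state untouched (in place)."""
    m[visible] = b1 * m[visible] + (1 - b1) * g[visible]
    v[visible] = b2 * v[visible] + (1 - b2) * g[visible] ** 2
    mhat = m[visible] / (1 - b1 ** t)                  # bias correction
    vhat = v[visible] / (1 - b2 ** t)
    # Decoupled weight decay: applied to the parameter, not mixed into the gradient.
    p[visible] -= lr * (mhat / (np.sqrt(vhat) + eps) + wd * p[visible])
    return p

p = np.ones(4)
g = np.full(4, 0.5)
visible = np.array([True, True, False, False])
p = sparse_adamw_step(p, g, np.zeros(4), np.zeros(4), visible, t=1)
print(p[2:] == 1.0)  # hidden Gaussians untouched
```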

[168] PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction

Changjian Jiang, Kerui Ren, Xudong Li, Kaiwen Song, Linning Xu, Tao Lu, Junting Dong, Yu Zhang, Bo Dai, Mulin Yu

Main category: cs.CV

TL;DR: PLANING is an efficient online reconstruction framework using hybrid explicit primitives and neural Gaussians for decoupled geometry/appearance modeling, achieving high-quality reconstruction and accurate geometry simultaneously.

DetailsMotivation: Existing streaming reconstruction methods from monocular image sequences typically sacrifice either rendering quality or geometric accuracy, lacking a solution that achieves both simultaneously in an efficient online manner.

Method: Uses hybrid representation coupling explicit geometric primitives with neural Gaussians, enabling decoupled geometry and appearance modeling. Features online initialization and optimization strategy separating geometry and appearance updates to reduce structural redundancy.

Result: Improves dense mesh Chamfer-L2 by 18.52% over PGSR, surpasses ARTDECO by 1.31 dB PSNR, reconstructs ScanNetV2 scenes in under 100 seconds (5x faster than 2D Gaussian Splatting) while matching offline per-scene optimization quality.

Conclusion: PLANING enables stable streaming reconstruction with both high-quality rendering and accurate geometry, suitable for large-scale scene modeling and simulation-ready environments for embodied AI applications.

Abstract: Streaming reconstruction from monocular image sequences remains challenging, as existing methods typically favor either high-quality rendering or accurate geometry, but rarely both. We present PLANING, an efficient on-the-fly reconstruction framework built on a hybrid representation that loosely couples explicit geometric primitives with neural Gaussians, enabling geometry and appearance to be modeled in a decoupled manner. This decoupling supports an online initialization and optimization strategy that separates geometry and appearance updates, yielding stable streaming reconstruction with substantially reduced structural redundancy. PLANING improves dense mesh Chamfer-L2 by 18.52% over PGSR, surpasses ARTDECO by 1.31 dB PSNR, and reconstructs ScanNetV2 scenes in under 100 seconds, over 5x faster than 2D Gaussian Splatting, while matching the quality of offline per-scene optimization. Beyond reconstruction quality, the structural clarity and computational efficiency of PLANING make it well suited for a broad range of downstream applications, such as enabling large-scale scene modeling and simulation-ready environments for embodied AI. Project page: https://city-super.github.io/PLANING/ .
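The Chamfer-L2 metric used in the geometry evaluation is standard; a minimal numpy version for reference:

```python
import numpy as np

def chamfer_l2(P, Q):
    """Symmetric Chamfer-L2 between point sets P (N, 3) and Q (M, 3):
    mean squared distance to the nearest neighbor, in both directions."""
    d2 = np.sum((P[:, None, :] - Q[None, :, :]) ** 2, axis=-1)  # (N, M) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

P = np.random.default_rng(3).normal(size=(100, 3))
print(chamfer_l2(P, P))  # 0.0 — identical sets
```

(The O(NM) pairwise matrix is fine for small clouds; real evaluations use k-d trees.)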

[169] Seeing Through Clutter: Structured 3D Scene Reconstruction via Iterative Object Removal

Rio Aguina-Kang, Kevin James Blackburn-Matzen, Thibault Groueix, Vladimir Kim, Matheus Gadelha

Main category: cs.CV

TL;DR: SeeingThroughClutter: A method for 3D scene reconstruction from single images using iterative object removal and reconstruction pipeline with VLMs as orchestrators

DetailsMotivation: Prior approaches rely on intermediate tasks like semantic segmentation and depth estimation, which often underperform in complex scenes with occlusion and clutter. There's a need for more robust 3D reconstruction from single images in challenging environments.

Method: Introduces an iterative object removal and reconstruction pipeline that decomposes complex scenes into simpler subtasks. Uses Vision-Language Models (VLMs) as orchestrators to remove foreground objects one at a time via detection, segmentation, object removal, and 3D fitting. The method requires no task-specific training and leverages foundation models.

Result: Demonstrates state-of-the-art robustness on 3D-Front and ADE20K datasets. Shows that removing objects allows for cleaner segmentations of subsequent objects, even in highly occluded scenes.

Conclusion: The iterative object removal approach enables more robust 3D reconstruction from single images in cluttered scenes, benefiting directly from advances in foundation models without requiring task-specific training.

Abstract: We present SeeingThroughClutter, a method for reconstructing structured 3D representations from single images by segmenting and modeling objects individually. Prior approaches rely on intermediate tasks such as semantic segmentation and depth estimation, which often underperform in complex scenes, particularly in the presence of occlusion and clutter. We address this by introducing an iterative object removal and reconstruction pipeline that decomposes complex scenes into a sequence of simpler subtasks. Using VLMs as orchestrators, foreground objects are removed one at a time via detection, segmentation, object removal, and 3D fitting. We show that removing objects allows for cleaner segmentations of subsequent objects, even in highly occluded scenes. Our method requires no task-specific training and benefits directly from ongoing advances in foundation models. We demonstrate state-of-the-art robustness on 3D-Front and ADE20K datasets. Project Page: https://rioak.github.io/seeingthroughclutter/
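The remove-and-reconstruct loop described above can be sketched as a simple control flow. This is an illustrative skeleton only: the scene is a front-to-back list of labels standing in for an image, and detection, segmentation, removal, and 3D fitting are collapsed into list operations rather than the foundation-model calls the paper orchestrates with a VLM.

```python
def reconstruct_scene(scene):
    """Skeleton of the iterative remove-and-reconstruct loop (illustrative).

    `scene` is a front-to-back list of object labels standing in for an image.
    Real detection, segmentation, inpainting-based removal, and 3D fitting are
    foundation-model calls; here they are collapsed so the control flow runs.
    """
    reconstructed = []
    while scene:
        frontmost = scene[0]                        # detect the least-occluded object
        reconstructed.append(f"mesh({frontmost})")  # segment + fit a 3D model
        scene = scene[1:]                           # remove it, exposing what it occluded
    return reconstructed

meshes = reconstruct_scene(["chair", "table", "sofa"])
# ['mesh(chair)', 'mesh(table)', 'mesh(sofa)']
```

The key property the paper exploits is visible even in this toy: each removal simplifies the remaining scene, so later objects are segmented against less clutter.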

[170] Visual concept ranking uncovers medical shortcuts used by large multimodal models

Joseph D. Janizek, Sonnet Xu, Junayd Lateef, Roxana Daneshjou

Main category: cs.CV

TL;DR: A method called Visual Concept Ranking (VCR) is introduced to identify important visual concepts in large multimodal models and audit their performance on medical tasks, particularly skin lesion classification, revealing demographic performance gaps.

DetailsMotivation: To ensure reliability of machine learning models in safety-critical healthcare domains by developing auditing methods that can uncover model shortcomings, especially in large multimodal models used for medical tasks.

Method: Visual Concept Ranking (VCR) method identifies important visual concepts within LMMs, applied to medical tasks including malignant skin lesion classification, chest radiographs, and natural images. The method generates hypotheses about visual feature dependencies that are validated through manual interventions.

Result: LMMs display unexpected performance gaps between different demographic subgroups when prompted with demonstration examples. VCR successfully identifies visual concept dependencies that explain these performance disparities.

Conclusion: VCR provides an effective auditing method for uncovering shortcomings in large multimodal models, particularly important for safety-critical medical applications where demographic fairness and reliability are crucial.

Abstract: Ensuring the reliability of machine learning models in safety-critical domains such as healthcare requires auditing methods that can uncover model shortcomings. We introduce a method for identifying important visual concepts within large multimodal models (LMMs) and use it to investigate the behaviors these models exhibit when prompted with medical tasks. We primarily focus on the task of classifying malignant skin lesions from clinical dermatology images, with supplemental experiments including both chest radiographs and natural images. After showing how LMMs display unexpected gaps in performance between different demographic subgroups when prompted with demonstration examples, we apply our method, Visual Concept Ranking (VCR), to these models and prompts. VCR generates hypotheses related to different visual feature dependencies, which we are then able to validate with manual interventions.
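Intervention-style concept auditing of this kind can be pictured with a toy ranking: score each visual concept by how much accuracy drops when it is removed or masked. This is a generic ablation sketch, not the actual VCR procedure, and the concept names and accuracy numbers below are invented for illustration.

```python
def rank_concepts(baseline_acc, acc_without_concept):
    """Rank concepts by the accuracy drop observed when each is intervened on."""
    drops = {c: baseline_acc - acc for c, acc in acc_without_concept.items()}
    return sorted(drops, key=drops.get, reverse=True)

# Hypothetical audit of a skin-lesion classifier (all numbers invented):
ranking = rank_concepts(
    baseline_acc=0.84,
    acc_without_concept={"lesion_border": 0.70, "skin_tone": 0.75, "ruler_marker": 0.82},
)
# A large drop for a clinically irrelevant concept (e.g. skin tone) would
# flag a shortcut worth validating with manual interventions.
```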

[171] Deep Learning-Based Fixation Type Prediction for Quality Assurance in Digital Pathology

Oskar Thaeter, Tanja Niedermair, Jan E. G. Albin, Johannes Raffler, Ralf Huss, Peter J. Schüffler

Main category: cs.CV

TL;DR: Deep learning model predicts pathology slide fixation types (FFPE vs frozen section) using low-resolution thumbnail images instead of full-resolution whole-slide images, enabling 400× faster processing for quality control.

DetailsMotivation: Manual annotation of fixation types in pathology slides is error-prone and affects diagnostic accuracy. Existing methods require full-resolution whole-slide images, limiting scalability for high-throughput quality control in pathology laboratories.

Method: Developed a deep learning model that predicts fixation types using low-resolution, pre-scan thumbnail images. Trained on 1,200 WSIs from TUM Institute of Pathology and evaluated on TCGA dataset (8,800 slides) and two additional datasets from different institutions with different scanners.

Result: Achieved AUROC of 0.88 on TCGA dataset, outperforming comparable pre-scan methods by 4.8%. Achieved AUROCs of 0.72 on external datasets from Augsburg and Regensburg, showing challenges with scanner-induced domain shifts. Model processes each slide in 21 ms, 400× faster than existing high-magnification methods.

Conclusion: The approach provides an efficient solution for detecting labeling errors without high-magnification scans, offering valuable quality control for high-throughput pathology workflows. Future work will improve generalization to additional scanner types and extend to other low-resolution slide annotations.

Abstract: Accurate annotation of fixation type is a critical step in slide preparation for pathology laboratories. However, this manual process is prone to errors, impacting downstream analyses and diagnostic accuracy. Existing methods for verifying formalin-fixed, paraffin-embedded (FFPE), and frozen section (FS) fixation types typically require full-resolution whole-slide images (WSIs), limiting scalability for high-throughput quality control. We propose a deep-learning model to predict fixation types using low-resolution, pre-scan thumbnail images. The model was trained on WSIs from the TUM Institute of Pathology (n=1,200, Leica GT450DX) and evaluated on a class-balanced subset of The Cancer Genome Atlas dataset (TCGA, n=8,800, Leica AT2), as well as on class-balanced datasets from Augsburg (n=695 [392 FFPE, 303 FS], Philips UFS) and Regensburg (n=202, 3DHISTECH P1000). Our model achieves an AUROC of 0.88 on TCGA, outperforming comparable pre-scan methods by 4.8%. It also achieves AUROCs of 0.72 on Regensburg and Augsburg slides, underscoring challenges related to scanner-induced domain shifts. Furthermore, the model processes each slide in 21 ms, $400\times$ faster than existing high-magnification, full-resolution methods, enabling rapid, high-throughput processing. This approach provides an efficient solution for detecting labelling errors without relying on high-magnification scans, offering a valuable tool for quality control in high-throughput pathology workflows. Future work will improve and evaluate the model’s generalisation to additional scanner types. Our findings suggest that this method can increase accuracy and efficiency in digital pathology workflows and may be extended to other low-resolution slide annotations.

[172] Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures

Yuxi Wang, Wenqi Ouyang, Tianyi Wei, Yi Dong, Zhiqi Shen, Xingang Pan

Main category: cs.CV

TL;DR: Hand2World: An autoregressive framework for generating egocentric interactive videos from single scene images using 3D hand meshes and camera geometry embeddings to handle occlusion, viewpoint changes, and long-term stability.

DetailsMotivation: Egocentric interactive world models are crucial for AR and embodied AI, requiring low-latency, geometrically consistent, and stable visual generation that responds to user input. The paper addresses challenges in generating photorealistic videos from single scene images under free-space hand gestures, including distribution shift between training data and real gestures, ambiguity between hand and camera motion in monocular views, and arbitrary-length video generation needs.

Method: Hand2World uses an autoregressive framework with occlusion-invariant hand conditioning based on projected 3D hand meshes, allowing visibility/occlusion inference from scene context. It injects explicit camera geometry via per-pixel Plücker-ray embeddings to disentangle camera from hand motion and prevent background drift. The method includes a fully automated monocular annotation pipeline and distills a bidirectional diffusion model into a causal generator for arbitrary-length synthesis.

Result: Experiments on three egocentric interaction benchmarks show substantial improvements in perceptual quality and 3D consistency while supporting camera control and long-horizon interactive generation.

Conclusion: Hand2World provides a unified solution for egocentric interaction generation that addresses key challenges in distribution shift, motion ambiguity, and long-term stability, enabling realistic video synthesis from single scene images with hand gestures.

Abstract: Egocentric interactive world models are essential for augmented reality and embodied AI, where visual generation must respond to user input with low latency, geometric consistency, and long-term stability. We study egocentric interaction generation from a single scene image under free-space hand gestures, aiming to synthesize photorealistic videos in which hands enter the scene, interact with objects, and induce plausible world dynamics under head motion. This setting introduces fundamental challenges, including distribution shift between free-space gestures and contact-heavy training data, ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary-length video generation. We present Hand2World, a unified autoregressive framework that addresses these challenges through occlusion-invariant hand conditioning based on projected 3D hand meshes, allowing visibility and occlusion to be inferred from scene context rather than encoded in the control signal. To stabilize egocentric viewpoint changes, we inject explicit camera geometry via per-pixel Plücker-ray embeddings, disentangling camera motion from hand motion and preventing background drift. We further develop a fully automated monocular annotation pipeline and distill a bidirectional diffusion model into a causal generator, enabling arbitrary-length synthesis. Experiments on three egocentric interaction benchmarks show substantial improvements in perceptual quality and 3D consistency while supporting camera control and long-horizon interactive generation.
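The per-pixel Plücker-ray embeddings mentioned above assign each pixel the 6-D coordinates (d, o × d) of its viewing ray, making camera geometry explicit to the generator. A minimal pure-Python sketch for a pinhole camera, assuming an identity camera rotation for brevity (the function name and parameters are illustrative, not the paper's API):

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def plucker_ray(u, v, fx, fy, cx, cy, cam_origin):
    """6-D Plücker embedding (d, o x d) of the viewing ray through pixel (u, v).

    Simplified pinhole model with an identity camera rotation; a real pipeline
    would first rotate the direction into world coordinates.
    """
    d = normalize(((u - cx) / fx, (v - cy) / fy, 1.0))
    m = cross(cam_origin, d)  # moment; invariant to sliding the origin along d
    return d + m              # concatenated 6-vector

# Ray through the principal point from the origin: direction (0, 0, 1), zero moment.
emb = plucker_ray(320, 240, fx=500, fy=500, cx=320, cy=240, cam_origin=(0.0, 0.0, 0.0))
```

Because the moment o × d is unchanged when the origin slides along the ray, the embedding encodes the ray itself rather than a particular camera position, which is what lets it disentangle camera motion from scene content.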

[173] Free Lunch for Stabilizing Rectified Flow Inversion

Chenru Wang, Beier Zhu, Chi Zhang

Main category: cs.CV

TL;DR: Proximal-Mean Inversion (PMI) improves Rectified-Flow inversion stability by using gradient correction with historical velocity averages, plus mimic-CFG for better editing fidelity.

DetailsMotivation: Rectified-Flow models support training-free inversion but suffer from approximation errors that accumulate across timesteps, leading to unstable velocity fields and degraded reconstruction/editing quality.

Method: Proposes PMI: a training-free gradient correction method that stabilizes velocity fields by guiding them toward a running average of past velocities within a theoretically derived spherical Gaussian constraint. Also introduces mimic-CFG, a lightweight velocity correction scheme for editing tasks that interpolates between current velocity and its projection onto historical average.

Result: Extensive experiments on PIE-Bench show significant improvements in inversion stability, image reconstruction quality, and editing fidelity while reducing required neural function evaluations. Achieves state-of-the-art performance on PIE-Bench with enhanced efficiency.

Conclusion: PMI and mimic-CFG provide theoretically sound solutions to RF inversion instability, enabling better reconstruction and editing with improved efficiency.

Abstract: Rectified-Flow (RF)-based generative models have recently emerged as strong alternatives to traditional diffusion models, demonstrating state-of-the-art performance across various tasks. By learning a continuous velocity field that transforms simple noise into complex data, RF-based models not only enable high-quality generation, but also support training-free inversion, which facilitates downstream tasks such as reconstruction and editing. However, existing inversion methods, such as vanilla RF-based inversion, suffer from approximation errors that accumulate across timesteps, leading to unstable velocity fields and degraded reconstruction and editing quality. To address this challenge, we propose Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field by guiding it toward a running average of past velocities, constrained within a theoretically derived spherical Gaussian. Furthermore, we introduce mimic-CFG, a lightweight velocity correction scheme for editing tasks, which interpolates between the current velocity and its projection onto the historical average, balancing editing effectiveness and structural consistency. Extensive experiments on PIE-Bench demonstrate that our methods significantly improve inversion stability, image reconstruction quality, and editing fidelity, while reducing the required number of neural function evaluations. Our approach achieves state-of-the-art performance on the PIE-Bench with enhanced efficiency and theoretical soundness.
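The running-average correction at the heart of PMI can be illustrated with scalars. The sketch below is a loose analogue, not the paper's derivation: `lam` sets the pull toward the historical mean and `radius` stands in for the spherical-Gaussian constraint as a hard cap on how far the corrected velocity may move; both parameter names are invented, and real velocities are high-dimensional latents.

```python
import math

def correct_velocity(v, history, lam=0.3, radius=None):
    """Pull the current velocity toward the running mean of past velocities.

    Simplified proximal-mean sketch: blend the raw prediction with the
    historical mean, then (optionally) cap the correction magnitude.
    """
    if not history:
        return v
    mean = sum(history) / len(history)
    corrected = (1 - lam) * v + lam * mean
    if radius is not None:
        delta = corrected - v
        if abs(delta) > radius:
            corrected = v + math.copysign(radius, delta)
    return corrected

history = []
for raw_v in [1.0, 1.1, 3.0, 1.2]:  # 3.0 models an unstable outlier step
    v = correct_velocity(raw_v, history, lam=0.3, radius=0.4)
    history.append(v)
```

The outlier step is pulled back toward the trajectory's history instead of propagating its error into later timesteps, which is the stabilization intuition behind the method.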

[174] A DMD-Based Adaptive Modulation Method for High Dynamic Range Imaging in High-Glare Environments

Banglei Guan, Jing Tao, Liang Xu, Dongcai Tan, Pengju Sun, Jianbing Liu, Yang Shang, Qifeng Yu

Main category: cs.CV

TL;DR: DMD-based adaptive HDR imaging system achieves 127 dB dynamic range, reducing strain errors by 78% for photomechanics measurements in high-glare environments

DetailsMotivation: Conventional CCD/CMOS sensors have limited dynamic range (<70 dB) causing saturation and detail loss in extreme illumination conditions like welding arc monitoring and polished metallic surface analysis, leading to significant errors in digital image correlation (DIC) for photomechanics

Method: HDR imaging system using digital micromirror device (DMD) for spatial modulation, featuring autonomous regional segmentation and adaptive exposure control through integrated DMD-based optical modulation unit and adaptive computational imaging pipeline

Result: System achieves 127 dB measurable dynamic range, eliminates saturation artifacts under high glare, demonstrates 78% reduction in strain error, and improves DIC positioning accuracy across extreme intensity variations

Conclusion: DMD-based system provides high-fidelity adaptive HDR imaging, overcoming limitations of conventional sensors, with strong potential for optical metrology and stress analysis in high-glare environments where traditional methods fail

Abstract: Background: The accuracy of photomechanics measurements critically relies on image quality, particularly under extreme illumination conditions such as welding arc monitoring and polished metallic surface analysis. High dynamic range (HDR) imaging above 120 dB is essential in these contexts. Conventional CCD/CMOS sensors, with dynamic ranges typically below 70 dB, are highly susceptible to saturation under glare, resulting in irreversible loss of detail and significant errors in digital image correlation (DIC). Methods: This paper presents an HDR imaging system that leverages the spatial modulation capability of a digital micromirror device (DMD). The system architecture enables autonomous regional segmentation and adaptive exposure control for high-dynamic-range scenes through an integrated framework comprising two synergistic subsystems: a DMD-based optical modulation unit and an adaptive computational imaging pipeline. Results: The system achieves a measurable dynamic range of 127 dB, effectively eliminating saturation artifacts under high glare. Experimental results demonstrate a 78% reduction in strain error and improved DIC positioning accuracy, confirming reliable performance across extreme intensity variations. Conclusion: The DMD-based system provides high-fidelity adaptive HDR imaging, overcoming key limitations of conventional sensors. It exhibits strong potential for optical metrology and stress analysis in high-glare environments where traditional methods are inadequate.
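The dynamic-range figures above (127 dB versus below 70 dB) follow the standard 20·log10 intensity-ratio definition, which is easy to sanity-check:

```python
import math

def dynamic_range_db(i_max, i_min):
    """Dynamic range in decibels from the max/min resolvable intensities."""
    return 20 * math.log10(i_max / i_min)

# 127 dB corresponds to an intensity ratio of roughly 2.2 million to 1,
# versus only about 3,200:1 for a conventional sensor at 70 dB.
hdr_ratio = 10 ** (127 / 20)
sensor_ratio = 10 ** (70 / 20)
```

That three-orders-of-magnitude gap in resolvable intensity ratio is what separates the DMD system from saturation-prone CCD/CMOS sensors under glare.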

[175] DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

Dianyi Wang, Ruihang Li, Feng Han, Chaofan Ma, Wei Song, Siyuan Wang, Yibin Wang, Yi Xin, Hongjian Liu, Zhixiong Zhang, Shengyuan Ding, Tianhang Wang, Zhenglin Cheng, Tao Lin, Cheng Jin, Kaicheng Yu, Jingjing Chen, Wenjie Wang, Zhongyu Wei, Jiaqi Wang

Main category: cs.CV

TL;DR: DeepGen 1.0 is a lightweight 5B parameter unified multimodal model for image generation and editing that achieves performance competitive with much larger models through novel architectural and training innovations.

DetailsMotivation: Current unified multimodal models for image generation and editing require massive parameter scales (>10B), leading to prohibitive training costs and deployment footprints. There's a need for lightweight yet capable alternatives.

Method: Introduces Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable ‘think tokens’ for structured guidance. Uses three-stage training: (1) Alignment Pre-training on image-text pairs and editing triplets, (2) Joint Supervised Fine-tuning on generation/editing/reasoning tasks, and (3) Reinforcement Learning with MR-GRPO using a mixture of reward functions.

Result: Despite being trained on only ~50M samples, DeepGen 1.0 achieves leading performance across diverse benchmarks, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench.

Conclusion: DeepGen 1.0 provides an efficient, high-performance alternative to democratize unified multimodal research, demonstrating that lightweight models can achieve comprehensive capabilities competitive with much larger counterparts through careful architectural design and training strategies.

Abstract: Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), entailing prohibitive training costs and deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B unified model that achieves comprehensive capabilities competitive with or surpassing much larger counterparts. To overcome the limitations of compact models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable ‘think tokens’ to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts. Despite being trained on only ~50M samples, DeepGen 1.0 achieves leading performance across diverse benchmarks, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.

cs.AI

[176] GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

Pepijn Cobben, Xuanqiang Angelo Huang, Thao Amelia Pham, Isabel Dahlgren, Terry Jingchen Zhang, Zhijing Jin

Main category: cs.AI

TL;DR: GT-HarmBench is a benchmark for evaluating AI safety in multi-agent environments using 2,009 high-stakes scenarios based on game-theoretic structures like Prisoner’s Dilemma, Stag Hunt, and Chicken, drawn from realistic AI risk contexts.

DetailsMotivation: Existing AI safety benchmarks primarily evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood, despite frontier AI systems increasingly operating in high-stakes multi-agent environments.

Method: Created a benchmark of 2,009 high-stakes scenarios spanning game-theoretic structures, drawn from the MIT AI Risk Repository. Tested 15 frontier models, measured sensitivity to prompt framing and ordering, analyzed reasoning patterns, and evaluated game-theoretic interventions.

Result: Agents chose socially beneficial actions in only 62% of cases, frequently leading to harmful outcomes. Game-theoretic interventions improved socially beneficial outcomes by up to 18%. The benchmark reveals substantial reliability gaps in multi-agent alignment.

Conclusion: The study highlights significant reliability gaps in frontier AI systems operating in multi-agent environments and provides a standardized testbed for studying alignment in multi-agent settings, with potential for improving safety through game-theoretic interventions.

Abstract: Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood. We introduce GT-HarmBench, a benchmark of 2,009 high-stakes scenarios spanning game-theoretic structures such as the Prisoner’s Dilemma, Stag Hunt and Chicken. Scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. Across 15 frontier models, agents choose socially beneficial actions in only 62% of cases, frequently leading to harmful outcomes. We measure sensitivity to game-theoretic prompt framing and ordering, and analyze reasoning patterns driving failures. We further show that game-theoretic interventions improve socially beneficial outcomes by up to 18%. Our results highlight substantial reliability gaps and provide a broad standardized testbed for studying alignment in multi-agent environments. The benchmark and code are available at https://github.com/causalNLP/gt-harmbench.
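The game structures named above can be made concrete with canonical payoff matrices (the numbers below are the textbook values, not the benchmark's own payoffs). In all three games, mutual cooperation maximizes joint welfare, which is what makes the reported 62% socially-beneficial-action rate a meaningful gap:

```python
# Row player's payoff for (row_action, col_action); all games are symmetric.
# C = cooperate (the socially beneficial action), D = defect.
GAMES = {
    "prisoners_dilemma": {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1},
    "stag_hunt":         {("C", "C"): 4, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 3},
    "chicken":           {("C", "C"): 3, ("C", "D"): 1, ("D", "C"): 4, ("D", "D"): 0},
}

def social_welfare(game, a, b):
    """Joint payoff of an action profile in a symmetric two-player game."""
    return game[(a, b)] + game[(b, a)]

# The Prisoner's Dilemma tension: defecting against a cooperator pays more...
assert GAMES["prisoners_dilemma"][("D", "C")] > GAMES["prisoners_dilemma"][("C", "C")]
# ...yet (C, C) maximizes social welfare in every game here.
```

Scenarios built on these structures probe exactly this wedge between individually tempting and socially beneficial choices.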

[177] A Theoretical Framework for Adaptive Utility-Weighted Benchmarking

Philip Waggoner

Main category: cs.AI

TL;DR: A theoretical framework that reimagines benchmarking as a multilayer adaptive network linking evaluation metrics, model components, and stakeholder groups through weighted interactions, enabling human-aligned, context-aware evaluation.

DetailsMotivation: Traditional benchmarking practices in AI focus on shared tasks, metrics, and leaderboards but fail to account for the sociotechnical contexts and diverse stakeholder priorities in real-world deployments. There's a need for more holistic evaluation that considers what different stakeholders consider meaningful or desirable model behavior.

Method: Proposes a theoretical framework that conceptualizes benchmarking as a multilayer, adaptive network connecting evaluation metrics, model components, and stakeholder groups through weighted interactions. Uses conjoint-derived utilities and human-in-the-loop update rules to embed human tradeoffs into benchmark structure, allowing benchmarks to evolve dynamically while maintaining stability and interpretability.

Result: The framework generalizes classical leaderboards as a special case and provides tools for analyzing the structural properties of benchmarks. It enables the creation of evaluation protocols that are more context-aware and human-aligned.

Conclusion: This approach offers a path toward more accountable and human-aligned evaluation in AI systems by incorporating diverse stakeholder perspectives and allowing benchmarks to adapt to real-world contexts while preserving interpretability.

Abstract: Benchmarking has long served as a foundational practice in machine learning and, increasingly, in modern AI systems such as large language models, where shared tasks, metrics, and leaderboards offer a common basis for measuring progress and comparing approaches. As AI systems are deployed in more varied and consequential settings, though, there is growing value in complementing these established practices with a more holistic conceptualization of what evaluation should represent. Of note, recognizing the sociotechnical contexts in which these systems operate invites an opportunity for a deeper view of how multiple stakeholders and their unique priorities might inform what we consider meaningful or desirable model behavior. This paper introduces a theoretical framework that reconceptualizes benchmarking as a multilayer, adaptive network linking evaluation metrics, model components, and stakeholder groups through weighted interactions. Using conjoint-derived utilities and a human-in-the-loop update rule, we formalize how human tradeoffs can be embedded into benchmark structure and how benchmarks can evolve dynamically while preserving stability and interpretability. The resulting formulation generalizes classical leaderboards as a special case and provides a foundation for building evaluation protocols that are more context aware, resulting in new robust tools for analyzing the structural properties of benchmarks, which opens a path toward more accountable and human-aligned evaluation.

[178] Evolving Beyond Snapshots: Harmonizing Structure and Sequence via Entity State Tuning for Temporal Knowledge Graph Forecasting

Siyuan Li, Yunjia Wu, Yiyong Xiao, Pingyang Huang, Peize Li, Ruitong Liu, Yan Wen, Te Sun, Fangyi Pei

Main category: cs.AI

TL;DR: EST introduces persistent entity states for temporal knowledge graph forecasting to overcome episodic amnesia and maintain long-term dependencies across time snapshots.

DetailsMotivation: Current TKG forecasting methods suffer from episodic amnesia and rapid decay of long-term dependencies because they recompute entity representations from limited query windows at each timestamp, lacking persistent state memory.

Method: Entity State Tuning (EST) maintains a global state buffer with closed-loop design: topology-aware state perceiver injects entity-state priors into structural encoding, unified temporal context module aggregates state-enhanced events, and dual-track evolution mechanism writes updated context back to global entity state memory.

Result: Experiments on multiple benchmarks show EST consistently improves diverse backbones and achieves state-of-the-art performance for long-horizon TKG forecasting.

Conclusion: Persistent entity states are crucial for effective long-horizon TKG forecasting, and EST provides an encoder-agnostic framework that enhances existing methods by maintaining evolving entity representations.

Abstract: Temporal knowledge graph (TKG) forecasting requires predicting future facts by jointly modeling structural dependencies within each snapshot and temporal evolution across snapshots. However, most existing methods are stateless: they recompute entity representations at each timestamp from a limited query window, leading to episodic amnesia and rapid decay of long-term dependencies. To address this limitation, we propose Entity State Tuning (EST), an encoder-agnostic framework that endows TKG forecasters with persistent and continuously evolving entity states. EST maintains a global state buffer and progressively aligns structural evidence with sequential signals via a closed-loop design. Specifically, a topology-aware state perceiver first injects entity-state priors into structural encoding. Then, a unified temporal context module aggregates the state-enhanced events with a pluggable sequence backbone. Subsequently, a dual-track evolution mechanism writes the updated context back to the global entity state memory, balancing plasticity against stability. Experiments on multiple benchmarks show that EST consistently improves diverse backbones and achieves state-of-the-art performance, highlighting the importance of state persistence for long-horizon TKG forecasting. The code is published at https://github.com/yuanwuyuan9/Evolving-Beyond-Snapshots
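The persistent-state idea behind EST can be sketched with a toy buffer: states are read before each snapshot and written back afterward, with a blending factor trading stability (keep the old state) against plasticity (absorb new context). All names here are illustrative, scalar states stand in for learned vectors, and this is not the paper's actual dual-track mechanism.

```python
class EntityStateBuffer:
    """Toy persistent entity-state memory in the spirit of EST (illustrative)."""

    def __init__(self, alpha=0.8):
        self.alpha = alpha   # stability vs. plasticity trade-off
        self.states = {}     # entity id -> persistent state

    def read(self, entity):
        return self.states.get(entity, 0.0)

    def write_back(self, entity, context):
        """Blend the snapshot's context into the entity's persistent state."""
        old = self.read(entity)
        self.states[entity] = self.alpha * old + (1 - self.alpha) * context

buf = EntityStateBuffer(alpha=0.8)
for snapshot_context in [1.0, 1.0, 1.0]:  # repeated evidence across t = 1..3
    buf.write_back("entity_42", snapshot_context)
# The state accumulates across snapshots instead of being recomputed from a
# limited query window at each timestamp, avoiding "episodic amnesia".
```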

[179] Intent-Driven Smart Manufacturing Integrating Knowledge Graphs and Large Language Models

Takoua Jradi, John Violos, Dimitrios Spatharakis, Lydia Mavraidi, Ioannis Dimolitsas, Aris Leivadeas, Symeon Papavassiliou

Main category: cs.AI

TL;DR: A framework integrating instruction-tuned LLMs with ontology-aligned knowledge graphs for translating natural language intents into machine-executable actions in smart manufacturing environments.

DetailsMotivation: Smart manufacturing environments need interfaces that can translate high-level human intents into machine-executable actions, requiring a solution that bridges natural language understanding with structured manufacturing processes and constraints.

Method: Fine-tune Mistral-7B-Instruct-V02 on domain-specific data to translate natural language intents into structured JSON requirement models, then semantically map these to a Neo4j-based knowledge graph grounded in the ISA-95 manufacturing standard.

Result: Achieved 89.33% exact match accuracy and 97.27% overall accuracy, significantly outperforming zero-shot and 3-shot baselines.

Conclusion: The framework provides a foundation for scalable, explainable, and adaptive human-machine interaction in Manufacturing-as-a-Service ecosystems by combining LLMs with structured knowledge representation.

Abstract: The increasing complexity of smart manufacturing environments demands interfaces that can translate high-level human intents into machine-executable actions. This paper presents a unified framework that integrates instruction-tuned Large Language Models (LLMs) with ontology-aligned Knowledge Graphs (KGs) to enable intent-driven interaction in Manufacturing-as-a-Service (MaaS) ecosystems. We fine-tune Mistral-7B-Instruct-V02 on a domain-specific dataset, enabling the translation of natural language intents into structured JSON requirement models. These models are semantically mapped to a Neo4j-based knowledge graph grounded in the ISA-95 standard, ensuring operational alignment with manufacturing processes, resources, and constraints. Our experimental results demonstrate significant performance gains over zero-shot and 3-shot baselines, achieving 89.33% exact match accuracy and 97.27% overall accuracy. This work lays the foundation for scalable, explainable, and adaptive human-machine interaction in MaaS ecosystems.
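The LLM-to-KG handoff can be pictured with a toy requirement model. Everything below is hypothetical: the intent text, field names, node labels, and Cypher pattern are invented for illustration and are not the paper's actual ISA-95 schema.

```python
import json

# Hypothetical requirement model an instruction-tuned LLM might emit for the
# intent "mill 500 aluminium brackets by 2026-02-20" (fields are illustrative).
requirement = {
    "process": "milling",
    "material": "aluminium",
    "quantity": 500,
    "deadline": "2026-02-20",
}

def to_cypher(req):
    """Map the requirement model to a simplified graph lookup (invented schema)."""
    return (
        "MATCH (e:Equipment)-[:SUPPORTS]->(p:ProcessSegment {name: $process}) "
        "WHERE e.material = $material AND e.capacity >= $quantity "
        "RETURN e"
    )

query = to_cypher(requirement)
params = {k: v for k, v in requirement.items() if k != "deadline"}
print(json.dumps(requirement, indent=2))
```

The structured intermediate representation is what makes the pipeline auditable: the JSON can be validated against the ontology before any query touches the graph.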

[180] Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation

Lajanugen Logeswaran, Jaekyeom Kim, Sungryull Sohn, Creighton Glasscock, Honglak Lee

Main category: cs.AI

TL;DR: A scalable pipeline for generating high-quality training data for web agents using constraint-based evaluation to leverage partially successful trajectories, evaluated on a new BookingArena benchmark.

DetailsMotivation: The paper addresses the challenge of creating diverse, realistic web interaction datasets for training web agents, with a focus on efficiently identifying high-quality training instances through better trajectory evaluation.

Method: Introduces a constraint-based evaluation framework for fine-grained assessment of progress towards task completion, enabling use of partially successful trajectories. Uses this to create a scalable data generation pipeline and evaluates on the new BookingArena benchmark of complex booking tasks across 20 websites.

Result: The distilled student model outperforms open-source approaches and matches or exceeds commercial systems while being significantly smaller. The method successfully expands usable training data by leveraging partially successful trajectories.

Conclusion: The work provides an efficient approach for creating diverse web interaction datasets and a systematic evaluation methodology for complex structured web tasks, advancing web agent training through better data generation and evaluation.

Abstract: We present a scalable pipeline for automatically generating high-quality training data for web agents. In particular, a major challenge in identifying high-quality training instances is trajectory evaluation - quantifying how much progress was made towards task completion. We introduce a novel constraint-based evaluation framework that provides fine-grained assessment of progress towards task completion. This enables us to leverage partially successful trajectories, which significantly expands the amount of usable training data. We evaluate our method on a new benchmark we propose called BookingArena, which consists of complex booking tasks across 20 popular websites, and demonstrate that our distilled student model outperforms open-source approaches and matches or exceeds commercial systems, while being a significantly smaller model. Our work addresses the challenge of efficiently creating diverse, realistic web interaction datasets and provides a systematic evaluation methodology for complex structured web tasks.

[181] To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models

Haoqing Wang, Xiang Long, Ziheng Li, Yilong Xu, Tingguang Li, Yehui Tang

Main category: cs.AI

TL;DR: M2RL compares two training paradigms for multi-domain reinforcement learning with verifiable rewards (RLVR): mixed multi-task training versus separate per-domain training followed by model merging, finding minimal cross-domain interference and synergistic effects between reasoning-intensive domains.

DetailsMotivation: RLVR enhances LLM reasoning capabilities for expert-level performance in specific domains, but there's limited analysis on how to effectively combine RLVR across multiple domains for general multi-domain expert models.

Method: The study compares two RLVR training paradigms using multiple high-level tasks (math, coding, science, instruction following) with open-source datasets. It analyzes mutual interference, synergistic effects, and internal mechanisms through weight space geometry, model prediction behavior, and information constraints.
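
A minimal sketch of the "separate training followed by model merging" side of the comparison, assuming simple uniform weight averaging (the summary does not commit to a specific merging rule; the parameter names are illustrative):

```python
# Illustrative model merging: average per-domain fine-tuned parameter dicts
# into one set of weights. Uniform averaging is one common choice; weighted
# coefficients allow biasing toward particular domains.

def merge_models(domain_weights, coeffs=None):
    """Weighted average of per-domain parameter dicts (same keys)."""
    n = len(domain_weights)
    coeffs = coeffs or [1.0 / n] * n
    merged = {}
    for key in domain_weights[0]:
        merged[key] = sum(c * w[key] for c, w in zip(coeffs, domain_weights))
    return merged

math_model = {"layer.w": 1.0}   # weights after math-only RLVR
code_model = {"layer.w": 3.0}   # weights after coding-only RLVR
print(merge_models([math_model, code_model]))  # {'layer.w': 2.0}
```

The mixed multi-task alternative instead interleaves all domains' prompts into one RLVR run, so no post-hoc combination step is needed.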

Result: RLVR across domains exhibits few mutual interferences, with reasoning-intensive domains showing mutually synergistic effects. The analysis reveals insights into the internal mechanisms of these mutual gains.

Conclusion: The M2RL project provides valuable comparison and analysis of multi-domain RLVR training paradigms, offering guidance for developing general multi-domain expert models with enhanced reasoning capabilities.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) plays a key role in stimulating the explicit reasoning capability of Large Language Models (LLMs). We can achieve expert-level performance in some specific domains via RLVR, such as coding or math. When a general multi-domain expert-level model is required, we need to carefully consider the collaboration of RLVR across different domains. The current state-of-the-art models mainly employ two different training paradigms for multi-domain RLVR: mixed multi-task RLVR and separate RLVR followed by model merging. However, most existing works do not provide a detailed comparison and analysis of these paradigms. To this end, we choose multiple commonly used high-level tasks (e.g., math, coding, science, and instruction following) as our target domains and design extensive qualitative and quantitative experiments using open-source datasets. We find that RLVR across domains exhibits little mutual interference, and reasoning-intensive domains demonstrate mutually synergistic effects. Furthermore, we analyze the internal mechanisms of mutual gains from the perspectives of weight space geometry, model prediction behavior, and information constraints. This project is named M2RL, short for Mixed multi-task training or separate training followed by model Merging for Reinforcement Learning; the homepage is at https://github.com/mosAI25/M2RL

[182] Can I Have Your Order? Monte-Carlo Tree Search for Slot Filling Ordering in Diffusion Language Models

Joshua Ong Jun Leang, Yu Zhao, Mihaela Cătălina Stoian, Wenda Li, Shay B. Cohen, Eleonora Giunchiglia

Main category: cs.AI

TL;DR: McDiffuSE improves plan-and-infill decoding in Masked Diffusion Models by using Monte Carlo Tree Search to optimize slot infilling order, achieving better performance on mathematical and code reasoning tasks.

DetailsMotivation: Plan-and-infill decoding in Masked Diffusion Models shows promise for reasoning tasks but suffers from high sensitivity to slot infilling order, leading to substantial output variance and suboptimal performance.

Method: Formulates slot selection as decision making and optimizes infilling orders through Monte Carlo Tree Search (MCTS) with look-ahead simulations to evaluate partial completions before commitment, systematically exploring combinatorial space of generation orders.
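
The MCTS loop over infilling orders can be sketched in miniature. The reward below is a toy stand-in for the paper's look-ahead decoding score, and every name is illustrative; note the exploration constant `c`, which corresponds to the paper's finding that larger exploration constants matter more than extra simulations:

```python
import math
import random

def uct(child_value, child_visits, parent_visits, c=1.4):
    """Upper Confidence bound for Trees; larger c favours exploration."""
    if child_visits == 0:
        return float("inf")
    return child_value / child_visits + c * math.sqrt(
        math.log(parent_visits) / child_visits)

def mcts_order(slots, reward_fn, n_sims=200, c=1.4, seed=0):
    """Search the space of slot orderings; return the most-visited order."""
    rng = random.Random(seed)
    stats = {}  # prefix tuple -> [visits, total_reward]
    for _ in range(n_sims):
        prefix, remaining = (), list(slots)
        # Selection: descend by UCT until an unexpanded slot is reached.
        while remaining:
            parent_visits = stats.get(prefix, [1, 0])[0]
            best = max(remaining, key=lambda s: uct(
                stats.get(prefix + (s,), [0, 0])[1],
                stats.get(prefix + (s,), [0, 0])[0],
                parent_visits, c))
            prefix += (best,)
            remaining.remove(best)
            if prefix not in stats:
                break
        # Rollout: random completion of the order, then score it.
        rollout = list(prefix) + rng.sample(remaining, len(remaining))
        r = reward_fn(tuple(rollout))
        # Backpropagation along the selected prefix.
        for i in range(1, len(prefix) + 1):
            v = stats.setdefault(prefix[:i], [0, 0.0])
            v[0] += 1
            v[1] += r
    # Greedy extraction of the most-visited ordering.
    order, remaining = (), list(slots)
    while remaining:
        best = max(remaining, key=lambda s: stats.get(order + (s,), [0, 0])[0])
        order += (best,)
        remaining.remove(best)
    return list(order)

# Toy reward: orderings that infill slot "a" first score highest.
reward = lambda order: 1.0 if order[0] == "a" else 0.0
print(mcts_order(["b", "a", "c"], reward))
```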

Result: Average improvement of 3.2% over autoregressive baselines and 8.0% over baseline plan-and-infill, with notable gains of 19.5% on MBPP (code) and 4.9% on MATH500 (math). Larger exploration constants rather than increased simulations are needed to overcome model confidence biases.

Conclusion: MCTS-based planning is an effective approach for enhancing generation quality in MDMs, with non-sequential generation being essential for maximizing performance despite the model predominantly following sequential ordering.

Abstract: While plan-and-infill decoding in Masked Diffusion Models (MDMs) shows promise for mathematical and code reasoning, performance remains highly sensitive to slot infilling order, often yielding substantial output variance. We introduce McDiffuSE, a framework that formulates slot selection as decision making and optimises infilling orders through Monte Carlo Tree Search (MCTS). McDiffuSE uses look-ahead simulations to evaluate partial completions before commitment, systematically exploring the combinatorial space of generation orders. Experiments show an average improvement of 3.2% over autoregressive baselines and 8.0% over baseline plan-and-infill, with notable gains of 19.5% on MBPP and 4.9% on MATH500. Our analysis reveals that while McDiffuSE predominantly follows sequential ordering, incorporating non-sequential generation is essential for maximising performance. We observe that larger exploration constants, rather than increased simulations, are necessary to overcome model confidence biases and discover effective orderings. These findings establish MCTS-based planning as an effective approach for enhancing generation quality in MDMs.

[183] GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics

Modi Jin, Yiming Zhang, Boyuan Sun, Dingwen Zhang, MingMing Cheng, Qibin Hou

Main category: cs.AI

TL;DR: GeoAgent is a model for geolocation reasoning that uses expert-annotated CoT data and specialized geographic rewards to improve performance and human-aligned reasoning.

DetailsMotivation: Previous RL-based methods for geolocation reasoning rely on AI-generated chain-of-thought data and training strategies that conflict with geographic characteristics, raising concerns about their reliability and geographic appropriateness.

Method: Introduces GeoSeek dataset with CoT data annotated by geographic experts and professional players, then proposes geo-similarity reward and consistency reward assessed by a consistency agent to guide training toward geographically correct answers with consistent reasoning.
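
The exact geo-similarity reward is not given in this summary; one natural, hedged instantiation is a distance-decaying score over the haversine distance between predicted and ground-truth coordinates (the 1000 km decay scale below is an illustrative choice, not from the paper):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in km."""
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def geo_reward(pred, truth, scale_km=1000.0):
    """Reward in (0, 1]: 1.0 for an exact hit, decaying with distance."""
    return math.exp(-haversine_km(*pred, *truth) / scale_km)

print(geo_reward((48.8566, 2.3522), (48.8566, 2.3522)))  # exact hit -> 1.0
```

A smooth reward like this gives the policy gradient signal even for near-misses, unlike an exact-match reward that is zero almost everywhere.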

Result: GeoAgent outperforms existing methods and general VLLMs across multiple granularities while generating reasoning that closely aligns with human thinking.

Conclusion: The approach of using expert-annotated geographic data and specialized geographic rewards enables better geolocation reasoning that aligns with human geographic understanding.

Abstract: This paper presents GeoAgent, a model capable of reasoning in close alignment with humans and deriving fine-grained address conclusions. Previous RL-based methods have achieved breakthroughs in performance and interpretability but still raise concerns due to their reliance on AI-generated chain-of-thought (CoT) data and training strategies, which conflict with geographic characteristics. To address these issues, we first introduce GeoSeek, a new geolocation dataset comprising CoT data annotated by geographic experts and professional players. We further thoroughly explore the inherent characteristics of geographic tasks and propose a geo-similarity reward and a consistency reward, assessed by a consistency agent, to assist training. This encourages the model to converge towards correct answers from a geographic perspective while ensuring the integrity and consistency of its reasoning process. Experimental results show that GeoAgent outperforms existing methods and a series of general VLLMs across multiple granularities, while generating reasoning that closely aligns with human thinking.

[184] AI Agents for Inventory Control: Human-LLM-OR Complementarity

Jackie Baek, Yaopeng Fu, Will Ma, Tianyi Peng

Main category: cs.AI

TL;DR: LLM-augmented operations research methods outperform either approach alone in inventory control, and human-AI collaboration yields higher profits than either working independently.

DetailsMotivation: Traditional operations research algorithms for inventory control rely on rigid assumptions and perform poorly with demand shifts or missing contextual information. LLMs offer flexible reasoning but it's unclear how to best integrate them into decision-making pipelines.

Method: Created InventoryBench with 1,000+ inventory instances across synthetic and real-world demand data to test decision rules under demand shifts, seasonality, and uncertain lead times. Also conducted classroom experiments embedding LLM recommendations in human-in-the-loop pipelines.
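
The OR side of such comparisons is classically an order-up-to (base-stock) rule; the sketch below is a generic textbook baseline with illustrative parameters, not the benchmark's actual decision rule:

```python
# Generic base-stock policy: order up to expected lead-time demand plus a
# z-scaled safety stock (assumes roughly normal demand; z = 1.28 targets
# about a 90% service level). Parameters here are illustrative.

def base_stock_order(inventory, pipeline, demand_mean, demand_std,
                     lead_time, z=1.28):
    """Return the order quantity that raises inventory position to target."""
    periods = lead_time + 1  # cover the lead time plus the review period
    target = demand_mean * periods + z * demand_std * periods ** 0.5
    position = inventory + pipeline  # on-hand plus already-ordered stock
    return max(0.0, target - position)

print(base_stock_order(inventory=20, pipeline=10, demand_mean=15,
                       demand_std=5, lead_time=2))
```

Rules like this perform well when the demand distribution is stationary and known, which is exactly the assumption the benchmark's demand shifts and seasonality are designed to break.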

Result: OR-augmented LLM methods outperformed either method in isolation. Human-AI teams achieved higher profits than humans or AI agents alone, with substantial individual-level complementarity effects.

Conclusion: LLMs and operations research methods are complementary rather than substitutes for inventory control, and human-AI collaboration can significantly improve decision-making outcomes.

Abstract: Inventory control is a fundamental operations problem in which ordering decisions are traditionally guided by theoretically grounded operations research (OR) algorithms. However, such algorithms often rely on rigid modeling assumptions and can perform poorly when demand distributions shift or relevant contextual information is unavailable. Recent advances in large language models (LLMs) have generated interest in AI agents that can reason flexibly and incorporate rich contextual signals, but it remains unclear how best to incorporate LLM-based methods into traditional decision-making pipelines. We study how OR algorithms, LLMs, and humans can interact and complement each other in a multi-period inventory control setting. We construct InventoryBench, a benchmark of over 1,000 inventory instances spanning both synthetic and real-world demand data, designed to stress-test decision rules under demand shifts, seasonality, and uncertain lead times. Through this benchmark, we find that OR-augmented LLM methods outperform either method in isolation, suggesting that these methods are complementary rather than substitutes. We further investigate the role of humans through a controlled classroom experiment that embeds LLM recommendations into a human-in-the-loop decision pipeline. Contrary to prior findings that human-AI collaboration can degrade performance, we show that, on average, human-AI teams achieve higher profits than either humans or AI agents operating alone. Beyond this population-level finding, we formalize an individual-level complementarity effect and derive a distribution-free lower bound on the fraction of individuals who benefit from AI collaboration; empirically, we find this fraction to be substantial.

[185] Think Fast and Slow: Step-Level Cognitive Depth Adaptation for LLM Agents

Ruihan Yang, Fanghua Ye, Xiang We, Ruoqing Zhao, Kang Luo, Xinbo Xu, Bo Zhao, Ruotian Ma, Shanyi Wang, Zhaopeng Tu, Xiaolong Li, Deqing Yang, Linus

Main category: cs.AI

TL;DR: CogRouter is a framework that trains LLM agents to dynamically adapt cognitive depth at each step of multi-turn decision-making tasks, achieving state-of-the-art performance with superior efficiency.

DetailsMotivation: Current LLM agents rely on fixed cognitive patterns - either immediate responses or uniform deep reasoning. This rigidity is inefficient for long-horizon tasks where cognitive demands vary significantly from step to step, with some requiring strategic planning and others only routine execution.

Method: Based on ACT-R theory, CogRouter designs four hierarchical cognitive levels from instinctive responses to strategic planning. It uses a two-stage training approach: Cognition-aware Supervised Fine-tuning (CoSFT) to instill stable level-specific patterns, and Cognition-aware Policy Optimization (CoPO) for step-level credit assignment via confidence-aware advantage reweighting.
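
The stated insight is that the right cognitive depth should maximize the confidence of the resulting action; a hypothetical sketch consistent with that idea is to scale each step's advantage by the policy's normalized action confidence (the actual CoPO weighting is not specified in this summary):

```python
import math

# Hypothetical confidence-aware advantage reweighting: steps where the
# policy was confident in its action are up-weighted relative to the mean.

def reweight_advantages(advantages, action_logprobs):
    """Scale step-level advantages by normalised action confidence."""
    confs = [math.exp(lp) for lp in action_logprobs]  # action probabilities
    mean_conf = sum(confs) / len(confs)
    return [a * (c / mean_conf) for a, c in zip(advantages, confs)]

advs = [1.0, 1.0, -0.5]
logps = [math.log(0.9), math.log(0.3), math.log(0.6)]
print(reweight_advantages(advs, logps))  # first step up-weighted
```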

Result: CogRouter achieves state-of-the-art performance on ALFWorld and ScienceWorld benchmarks. With Qwen2.5-7B, it reaches 82.3% success rate, outperforming GPT-4o (+40.3%), OpenAI-o3 (+18.3%), and GRPO (+14.0%), while using 62% fewer tokens.

Conclusion: The framework demonstrates that dynamically adapting cognitive depth based on task demands leads to more efficient and effective autonomous agents, with the key insight that appropriate cognitive depth should maximize the confidence of resulting actions.

Abstract: Large language models (LLMs) are increasingly deployed as autonomous agents for multi-turn decision-making tasks. However, current agents typically rely on fixed cognitive patterns: non-thinking models generate immediate responses, while thinking models engage in deep reasoning uniformly. This rigidity is inefficient for long-horizon tasks, where cognitive demands vary significantly from step to step, with some requiring strategic planning and others only routine execution. In this paper, we introduce CogRouter, a framework that trains agents to dynamically adapt cognitive depth at each step. Grounded in ACT-R theory, we design four hierarchical cognitive levels ranging from instinctive responses to strategic planning. Our two-stage training approach includes Cognition-aware Supervised Fine-tuning (CoSFT) to instill stable level-specific patterns, and Cognition-aware Policy Optimization (CoPO) for step-level credit assignment via confidence-aware advantage reweighting. The key insight is that appropriate cognitive depth should maximize the confidence of the resulting action. Experiments on ALFWorld and ScienceWorld demonstrate that CogRouter achieves state-of-the-art performance with superior efficiency. With Qwen2.5-7B, it reaches an 82.3% success rate, outperforming GPT-4o (+40.3%), OpenAI-o3 (+18.3%), and GRPO (+14.0%), while using 62% fewer tokens.

[186] Evaluating Robustness of Reasoning Models on Parameterized Logical Problems

Naïm Es-sebbani, Esteban Marquer, Yakoub Salhi, Zied Bouraoui

Main category: cs.AI

TL;DR: A diagnostic benchmark for 2-SAT that isolates structural reasoning competencies in LLMs through parameterized formula families, revealing brittleness invisible to aggregate SAT accuracy.

DetailsMotivation: Standard SAT benchmarks conflate surface difficulty with structural phenomena that determine satisfiability, making it hard to diagnose specific reasoning competencies and failure modes in LLMs.

Method: Created parameterized families of structured 2-CNF formulas with interpretable axes: contradiction-cycle UNSAT cores, SAT instances with controlled solution multiplicity, planted backbones, late bridge clauses, and symmetry/duplication variants.
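
The implication-graph characterization the generators build on is the standard one: a 2-CNF formula is satisfiable iff no variable shares a strongly connected component (SCC) of its implication graph with its own negation. A compact reference checker using Kosaraju-style SCCs:

```python
def two_sat(n_vars, clauses):
    """clauses: list of (a, b) literals; variable x is 1..n, negation -x.
    Returns True iff the 2-CNF formula is satisfiable."""
    def node(lit):  # map a literal to its implication-graph node
        return 2 * (abs(lit) - 1) + (lit < 0)
    n = 2 * n_vars
    adj, radj = [[] for _ in range(n)], [[] for _ in range(n)]
    for a, b in clauses:  # (a or b) gives (!a -> b) and (!b -> a)
        adj[node(-a)].append(node(b)); radj[node(b)].append(node(-a))
        adj[node(-b)].append(node(a)); radj[node(a)].append(node(-b))
    # Pass 1: iterative DFS to compute a postorder of the graph.
    order, seen = [], [False] * n
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack = [(start, 0)]
        while stack:
            v, i = stack.pop()
            if i < len(adj[v]):
                stack.append((v, i + 1))
                w = adj[v][i]
                if not seen[w]:
                    seen[w] = True
                    stack.append((w, 0))
            else:
                order.append(v)
    # Pass 2: label SCCs on the reversed graph in reverse postorder.
    comp, c = [-1] * n, 0
    for u in reversed(order):
        if comp[u] == -1:
            stack = [u]
            comp[u] = c
            while stack:
                v = stack.pop()
                for w in radj[v]:
                    if comp[w] == -1:
                        comp[w] = c
                        stack.append(w)
            c += 1
    # SAT iff no variable sits in the same SCC as its negation.
    return all(comp[2 * i] != comp[2 * i + 1] for i in range(n_vars))

print(two_sat(2, [(1, 2), (-1, 2), (1, -2)]))  # satisfiable
print(two_sat(1, [(1, 1), (-1, -1)]))          # contradiction cycle
```

The benchmark's "contradiction-cycle UNSAT cores" are exactly the x-to-not-x SCC cycles this check detects.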

Result: LLM-based reasoners show sharp performance transitions under targeted structural interventions even when surface statistics are fixed, revealing brittleness regimes invisible to aggregate SAT accuracy.

Conclusion: The benchmark provides fine-grained diagnostic capabilities for evaluating LLM reasoning, exposing specific structural vulnerabilities that standard SAT benchmarks miss.

Abstract: Logic provides a controlled testbed for evaluating LLM-based reasoners, yet standard SAT-style benchmarks often conflate surface difficulty (length, wording, clause order) with the structural phenomena that actually determine satisfiability. We introduce a diagnostic benchmark for 2-SAT built from parameterized families of structured 2–CNF formulas, where satisfiability is characterized by the implication graph and can be tuned along interpretable axes. Our generators isolate distinct competencies and failure modes: (i) contradiction-cycle UNSAT cores with controllable size and imbalance, (ii) SAT instances with a prescribed fraction of free variables to control solution multiplicity, (iii) planted backbones that modulate propagation, (iv) late bridge clauses that couple otherwise monotone regions to probe sensitivity to ordering and revision, and (v) symmetry/duplication variants that test abstraction under renaming and redundant structure. We evaluate LLM-based reasoners on decision accuracy and assignment validity, and quantify robustness under semantics-preserving perturbations such as clause reordering, filler clauses, and variable renaming. Across models, we observe sharp performance transitions under targeted structural interventions even when surface statistics are held fixed, revealing brittleness regimes that are invisible to aggregate SAT accuracy.

[187] SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, Xuanqing Liu, Haoran Lyu, Ze Ma, Bowei Wang, Runhui Wang, Tianyu Wang, Wengao Ye, Yue Zhang, Hanwen Xing, Yiqi Xue, Steven Dillmann, Han-chung Lee

Main category: cs.AI

TL;DR: SkillsBench benchmark evaluates how procedural knowledge packages (Skills) affect LLM agent performance across 86 tasks in 11 domains, finding curated Skills improve performance by 16.2pp on average but with wide variation, while self-generated Skills provide no benefit.

DetailsMotivation: Despite rapid adoption of Skills (structured packages of procedural knowledge) to augment LLM agents, there's no standardized way to measure whether they actually help improve agent performance at inference time.

Method: Created SkillsBench benchmark with 86 tasks across 11 domains, each paired with curated Skills and deterministic verifiers. Evaluated 7 agent-model configurations over 7,308 trajectories under three conditions: no Skills, curated Skills, and self-generated Skills.

Result: Curated Skills raised average pass rate by 16.2 percentage points, but effects varied widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare). 16 of 84 tasks showed negative deltas. Self-generated Skills provided no benefit on average. Focused Skills with 2-3 modules outperformed comprehensive documentation, and smaller models with Skills could match larger models without them.

Conclusion: Skills can significantly improve LLM agent performance when properly curated, but models cannot reliably author the procedural knowledge they benefit from consuming. The effectiveness depends on domain and task characteristics.

Abstract: Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2–3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.

[188] X-SYS: A Reference Architecture for Interactive Explanation Systems

Tobias Labarta, Nhi Hoang, Maximilian Dreyer, Jim Berend, Oleg Hein, Jackie Ma, Wojciech Samek, Sebastian Lapuschkin

Main category: cs.AI

TL;DR: X-SYS: A reference architecture for interactive explanation systems that connects user interfaces with system capabilities through four quality attributes (STAR) and five components, implemented via SemanticLens for vision-language models.

DetailsMotivation: Deploying explainable AI (XAI) as interactive systems is challenging due to the need to maintain explanation usability across repeated queries, evolving models/data, and governance constraints. Current XAI research focuses on technical methods but lacks system-level approaches for operationalizing explainability.

Method: Proposes X-SYS, a reference architecture with four quality attributes (STAR: scalability, traceability, responsiveness, adaptability) and five components (XUI Services, Explanation Services, Model Services, Data Services, Orchestration & Governance). Maps interaction patterns to system capabilities to decouple UI evolution from backend computation. Implemented via SemanticLens system for vision-language models.

Result: X-SYS provides a reusable blueprint for interactive explanation systems. SemanticLens demonstrates how contract-based service boundaries enable independent evolution, offline/online separation ensures responsiveness, and persistent state management supports traceability for vision-language models.

Conclusion: Treating explainability as an information systems problem with systematic architecture enables operational deployment of interactive XAI systems. X-SYS bridges the gap between XAI algorithms and practical system requirements for real-world deployment.

Abstract: The explainable AI (XAI) research community has proposed numerous technical methods, yet deploying explainability as systems remains challenging: Interactive explanation systems require both suitable algorithms and system capabilities that maintain explanation usability across repeated queries, evolving models and data, and governance constraints. We argue that operationalizing XAI requires treating explainability as an information systems problem where user interaction demands induce specific system requirements. We introduce X-SYS, a reference architecture for interactive explanation systems, that guides (X)AI researchers, developers and practitioners in connecting interactive explanation user interfaces (XUI) with system capabilities. X-SYS organizes around four quality attributes named STAR (scalability, traceability, responsiveness, and adaptability), and specifies a five-component decomposition (XUI Services, Explanation Services, Model Services, Data Services, Orchestration and Governance). It maps interaction patterns to system capabilities to decouple user interface evolution from backend computation. We implement X-SYS through SemanticLens, a system for semantic search and activation steering in vision-language models. SemanticLens demonstrates how contract-based service boundaries enable independent evolution, offline/online separation ensures responsiveness, and persistent state management supports traceability. Together, this work provides a reusable blueprint and concrete instantiation for interactive explanation systems supporting end-to-end design under operational constraints.

[189] A Survey on Hypergame Theory: Modeling Misaligned Perceptions and Nested Beliefs for Multi-agent Systems

Vince Trencsenyi, Agnieszka Mensfelt, Kostas Stathis

Main category: cs.AI

TL;DR: Systematic review of hypergame theory applications in multi-agent systems, analyzing 44 studies to assess agent-compatibility and identify research gaps in modeling strategic interactions with subjective perceptions.

DetailsMotivation: Classical game theory assumes rational agents with complete information, but real-world multi-agent systems involve uncertainty, misaligned perceptions, and nested beliefs. Hypergame theory addresses these limitations by modeling agents' subjective perceptions of strategic scenarios.

Method: Conducted systematic review of 44 studies from cybersecurity, robotics, social simulation, communications, and game theory. Developed agent-compatibility criteria and classification framework to analyze integration patterns. Examined hypergame theory extensions including hierarchical hypergames and HNF (Hypergame Normal Form).

Result: Analysis revealed prevalence of hierarchical and graph-based models in deceptive reasoning, simplification of theoretical frameworks in practice, limited adoption of HNF-based models, lack of formal hypergame languages, and unexplored opportunities for modeling human-agent misalignment.

Conclusion: Review provides roadmap for applying hypergame theory to enhance strategic modeling in dynamic multi-agent environments, identifying structural gaps and open research directions for more realistic agent interactions.

Abstract: Classical game-theoretic models typically assume rational agents, complete information, and common knowledge of payoffs - assumptions that are often violated in real-world multi-agent systems (MAS) characterized by uncertainty, misaligned perceptions, and nested beliefs. To overcome these limitations, researchers have proposed extensions that incorporate models of cognitive constraints, subjective beliefs, and heterogeneous reasoning. Among these, hypergame theory extends the classical paradigm by explicitly modeling agents’ subjective perceptions of the strategic scenario, known as perceptual games, in which agents may hold divergent beliefs about the structure, payoffs, or available actions. We present a systematic review of agent-compatible applications of hypergame theory, examining how its descriptive capabilities have been adapted to dynamic and interactive MAS contexts. We analyze 44 selected studies from cybersecurity, robotics, social simulation, communications, and general game-theoretic modeling. Building on a formal introduction to hypergame theory and its two major extensions - hierarchical hypergames and the Hypergame Normal Form (HNF) - we develop agent-compatibility criteria and an agent-based classification framework to assess integration patterns and practical applicability. Our analysis reveals prevailing tendencies, including the prevalence of hierarchical and graph-based models in deceptive reasoning and the simplification of extensive theoretical frameworks in practical applications. We identify structural gaps, including the limited adoption of HNF-based models, the lack of formal hypergame languages, and unexplored opportunities for modeling human-agent and agent-agent misalignment. By synthesizing trends, challenges, and open research directions, this review provides a new roadmap for applying hypergame theory to enhance the realism and effectiveness of strategic modeling in dynamic multi-agent environments.

[190] WebClipper: Efficient Evolution of Web Agents with Graph-based Trajectory Pruning

Junjie Wang, Zequn Xie, Dan Yang, Jie Feng, Yue Shen, Duolin Sun, Meixiu Long, Yihan Jiao, Zhehao Tan, Jian Wang, Peng Wei, Jinjie Gu

Main category: cs.AI

TL;DR: WebClipper is a framework that compresses web agent trajectories using graph-based pruning to eliminate redundant search steps and improve efficiency while maintaining accuracy.

DetailsMotivation: Current web agents for deep research systems suffer from inefficient search patterns with long tool-call trajectories, cyclic reasoning loops, and exploration of unproductive branches, which reduces their overall efficiency.

Method: Models the agent’s search process as a state graph and formulates trajectory optimization as a minimum-necessary Directed Acyclic Graph (DAG) mining problem, then continues training on pruned trajectories to evolve more efficient search patterns.
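
The core reachability idea behind the pruning can be sketched as keeping only states that lie on some start-to-answer path; the paper's minimum-necessary DAG mining is more involved (it also eliminates cyclic loops), and the toy state graph below is illustrative:

```python
from collections import deque

# Illustrative trajectory pruning: model tool calls as a state graph, then
# drop dead-end branches by intersecting forward reachability from the
# start state with backward reachability from the answer state.

def prune_trajectory(edges, start, goal):
    """Keep only edges between nodes reachable from start AND reaching goal."""
    fwd, bwd = {}, {}
    for u, v in edges:
        fwd.setdefault(u, []).append(v)
        bwd.setdefault(v, []).append(u)
    def reach(src, graph):
        seen, q = {src}, deque([src])
        while q:
            u = q.popleft()
            for v in graph.get(u, []):
                if v not in seen:
                    seen.add(v)
                    q.append(v)
        return seen
    keep = reach(start, fwd) & reach(goal, bwd)
    return [(u, v) for u, v in edges if u in keep and v in keep]

# A search trace with one unproductive branch ("wiki" never led anywhere).
edges = [("start", "search"), ("search", "wiki"),
         ("search", "page"), ("page", "answer")]
print(prune_trajectory(edges, "start", "answer"))
```

Training on the pruned trace teaches the agent the direct search pattern without the unproductive detour.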

Result: Reduces tool-call rounds by about 20% while improving accuracy, and introduces a new F-AE Score metric to measure the balance between accuracy and efficiency.

Conclusion: WebClipper provides practical insights into balancing effectiveness and efficiency in web agent design through trajectory compression and continued training on optimized search patterns.

Abstract: Deep Research systems based on web agents have shown strong potential in solving complex information-seeking tasks, yet their search efficiency remains underexplored. We observe that many state-of-the-art open-source web agents rely on long tool-call trajectories with cyclic reasoning loops and exploration of unproductive branches. To address this, we propose WebClipper, a framework that compresses web agent trajectories via graph-based pruning. Concretely, we model the agent’s search process as a state graph and cast trajectory optimization as a minimum-necessary Directed Acyclic Graph (DAG) mining problem, yielding pruned trajectories that preserve essential reasoning while eliminating redundant steps. Continued training on these refined trajectories enables the agent to evolve toward more efficient search patterns and reduces tool-call rounds by about 20% while improving accuracy. Furthermore, we introduce a new metric called F-AE Score to measure the model’s overall performance in balancing accuracy and efficiency. Experiments demonstrate that WebClipper compresses tool-call rounds under excellent performance, providing practical insight into balancing effectiveness and efficiency in web agent design.

[191] WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning

Zelai Xu, Zhexuan Xu, Ruize Zhang, Chunyang Zhu, Shi Yu, Weilin Liu, Quanlu Zhang, Wenbo Ding, Chao Yu, Yu Wang

Main category: cs.AI

TL;DR: WideSeek-R1: A multi-agent LLM framework using width scaling via lead-agent-subagent architecture with MARL training for parallel execution on broad information-seeking tasks.

DetailsMotivation: Current LLMs focus on depth scaling (single agent solving long-horizon problems), but as tasks grow broader, organizational capability becomes the bottleneck. Existing multi-agent systems use inefficient hand-crafted workflows and turn-taking interactions that fail to parallelize effectively.

Method: Proposes WideSeek-R1, a lead-agent-subagent framework trained via multi-agent reinforcement learning (MARL). Uses shared LLM with isolated contexts and specialized tools, jointly optimizing lead agent and parallel subagents on 20k curated broad information-seeking tasks.
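
The lead-agent/subagent fan-out pattern can be sketched with stubbed agent calls; in the real system the subagents share one LLM with isolated contexts, so everything below, including the stub answers, is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def subagent(subquery: str) -> dict:
    """Stub worker: each subagent resolves one narrow sub-query in its
    own isolated context (a real subagent would call the shared LLM)."""
    return {"query": subquery, "result": f"answer to {subquery}"}

def wide_seek(subqueries, max_workers=4):
    """Lead agent: fan sub-queries out in parallel, then merge results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(subagent, subqueries))
    return {r["query"]: r["result"] for r in results}

table = wide_seek(["capital of France", "capital of Japan"])
print(table)
```

Width scaling here means raising `max_workers` (more parallel subagents) rather than lengthening any single agent's reasoning chain.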

Result: WideSeek-R1-4B achieves 40.0% item F1 score on WideSearch benchmark, comparable to single-agent DeepSeek-R1-671B. Shows consistent performance gains as number of parallel subagents increases, demonstrating effectiveness of width scaling.

Conclusion: Width scaling through multi-agent systems is a viable complementary approach to depth scaling for broad information-seeking tasks, with WideSeek-R1 demonstrating efficient parallel execution and organizational capability.

Abstract: Recent advancements in Large Language Models (LLMs) have largely focused on depth scaling, where a single agent solves long-horizon problems with multi-turn reasoning and tool use. However, as tasks grow broader, the key bottleneck shifts from individual competence to organizational capability. In this work, we explore a complementary dimension of width scaling with multi-agent systems to address broad information seeking. Existing multi-agent systems often rely on hand-crafted workflows and turn-taking interactions that fail to parallelize work effectively. To bridge this gap, we propose WideSeek-R1, a lead-agent-subagent framework trained via multi-agent reinforcement learning (MARL) to synergize scalable orchestration and parallel execution. By utilizing a shared LLM with isolated contexts and specialized tools, WideSeek-R1 jointly optimizes the lead agent and parallel subagents on a curated dataset of 20k broad information-seeking tasks. Extensive experiments show that WideSeek-R1-4B achieves an item F1 score of 40.0% on the WideSearch benchmark, which is comparable to the performance of single-agent DeepSeek-R1-671B. Furthermore, WideSeek-R1-4B exhibits consistent performance gains as the number of parallel subagents increases, highlighting the effectiveness of width scaling.

[192] BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents

Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Dan, Haishan Lu, Zhiyong Cao, Jiaoyang Chen, Yuqian Han, Zinan Sheng, Zhengwei Tao, Hao Liang, Jialong Wu, Yang Shi, Yuanpeng He, Jiaye Lin, Qintong Zhang, Guochen Yan, Runhao Zhao, Zhengpin Li, Xiaohan Yu, Lang Mei, Chong Chen, Wentao Zhang, Bin Cui

Main category: cs.AI

TL;DR: BrowseComp-V³ is a challenging multimodal web browsing benchmark with 300 complex questions requiring cross-modal multi-hop reasoning, featuring expert-validated process evaluation and revealing significant gaps in current MLLM capabilities.

Motivation: Existing multimodal browsing benchmarks lack the task complexity, evidence accessibility, and evaluation granularity needed to properly assess deep search capabilities of MLLMs in real-world web environments.

Method: Created BrowseComp-V³ benchmark with 300 curated questions requiring deep multi-level reasoning across text and visual modalities. All evidence is publicly searchable. Introduced expert-validated subgoal-driven process evaluation for fine-grained analysis. Also proposed OmniSeeker, a unified multimodal browsing agent framework.

Result: State-of-the-art models achieve only 36% accuracy on the benchmark, revealing critical bottlenecks in multimodal information integration and fine-grained perception. The benchmark exposes fundamental gaps in current model capabilities.

Conclusion: BrowseComp-V³ provides a comprehensive benchmark for evaluating multimodal deep search capabilities, highlighting significant limitations in current MLLMs and the need for improved multimodal reasoning and perception in real-world web browsing scenarios.

Abstract: Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments. However, existing benchmarks for multimodal browsing remain limited in task complexity, evidence accessibility, and evaluation granularity, hindering comprehensive and reproducible assessments of deep search capabilities. To address these limitations, we introduce BrowseComp-$V^3$, a novel benchmark consisting of 300 carefully curated and challenging questions spanning diverse domains. The benchmark emphasizes deep, multi-level, and cross-modal multi-hop reasoning, where critical evidence is interleaved across textual and visual modalities within and across web pages. All supporting evidence is strictly required to be publicly searchable, ensuring fairness and reproducibility. Beyond final-answer accuracy, we incorporate an expert-validated, subgoal-driven process evaluation mechanism that enables fine-grained analysis of intermediate reasoning behaviors and systematic characterization of capability boundaries. In addition, we propose OmniSeeker, a unified multimodal browsing agent framework integrating diverse web search and visual perception tools. Comprehensive experiments demonstrate that even state-of-the-art models achieve only 36% accuracy on our benchmark, revealing critical bottlenecks in multimodal information integration and fine-grained perception. Our results highlight a fundamental gap between current model capabilities and robust multimodal deep search in real-world settings.

[193] Information-theoretic analysis of world models in optimal reward maximizers

Alfred Harwood, Jose Faustino, Alex Altair

Main category: cs.AI

TL;DR: Optimal policies in Markov decision processes reveal exactly n log m bits of information about the environment’s transition dynamics, providing an information-theoretic lower bound on implicit world models needed for optimal behavior.

Motivation: To quantify how much information successful AI behavior requires about the world, specifically measuring the implicit world model contained in optimal policies for Markov decision processes.

Method: Analyze Controlled Markov Processes with n states and m actions, assuming uniform prior over transition dynamics. Prove information-theoretic bounds on mutual information between environment and optimal deterministic policies for various reward objectives.

Result: Optimal deterministic policies convey exactly n log m bits of information about the environment. This bound holds for finite-horizon, infinite-horizon discounted, and time-averaged reward maximization objectives.
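The n log m figure has a simple counting intuition: a deterministic policy assigns one of m actions to each of n states, so there are m^n such policies, and a uniformly distributed optimal policy carries log2(m^n) = n log2 m bits. A tiny sketch of this counting argument:

```python
import math
from itertools import product

def policy_information_bits(n_states, m_actions):
    # A deterministic policy picks one of m actions per state: m**n policies.
    return math.log2(m_actions ** n_states)

# Enumerate all deterministic policies for a tiny CMP (n=3 states, m=4 actions).
n, m = 3, 4
policies = list(product(range(m), repeat=n))
assert len(policies) == m ** n
print(policy_information_bits(n, m))  # 6.0 == n * log2(m)
```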

Conclusion: Provides precise information-theoretic lower bound on implicit world models necessary for optimal behavior in Markov decision processes, quantifying fundamental relationship between optimal policies and environmental knowledge.

Abstract: An important question in the field of AI is the extent to which successful behaviour requires an internal representation of the world. In this work, we quantify the amount of information an optimal policy provides about the underlying environment. We consider a Controlled Markov Process (CMP) with $n$ states and $m$ actions, assuming a uniform prior over the space of possible transition dynamics. We prove that observing a deterministic policy that is optimal for any non-constant reward function conveys exactly $n \log m$ bits of information about the environment. Specifically, we show that the mutual information between the environment and the optimal policy is $n \log m$ bits. This bound holds across a broad class of objectives, including finite-horizon, infinite-horizon discounted, and time-averaged reward maximization. These findings provide a precise information-theoretic lower bound on the “implicit world model” necessary for optimality.

[194] Consistency of Large Reasoning Models Under Multi-Turn Attacks

Yubo Li, Ramayya Krishnan, Rema Padman

Main category: cs.AI

TL;DR: Reasoning models show meaningful but incomplete robustness to adversarial attacks, with distinct vulnerability profiles and specific failure modes identified through trajectory analysis.

Motivation: While large reasoning models achieve state-of-the-art performance on complex tasks, their robustness under multi-turn adversarial pressure remains underexplored, creating a gap in understanding how reasoning capabilities translate to adversarial robustness.

Method: Evaluated nine frontier reasoning models under adversarial attacks, conducted trajectory analysis to identify failure modes, and tested Confidence-Aware Response Generation (CARG) defense mechanisms on reasoning models.

Result: Reasoning models significantly outperform instruction-tuned baselines but exhibit distinct vulnerability profiles. Five failure modes identified (Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, Reasoning Fatigue), with the first two accounting for 50% of failures. CARG defense fails for reasoning models due to overconfidence from extended reasoning traces.

Conclusion: Reasoning capabilities do not automatically confer adversarial robustness, and confidence-based defenses require fundamental redesign for reasoning models due to overconfidence induced by extended reasoning processes.

Abstract: Large reasoning models with reasoning capabilities achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. We evaluate nine frontier reasoning models under adversarial attacks. Our findings reveal that reasoning confers meaningful but incomplete robustness: most reasoning models studied significantly outperform instruction-tuned baselines, yet all exhibit distinct vulnerability profiles, with misleading suggestions universally effective and social pressure showing model-specific efficacy. Through trajectory analysis, we identify five failure modes (Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue) with the first two accounting for 50% of failures. We further demonstrate that Confidence-Aware Response Generation (CARG), effective for standard LLMs, fails for reasoning models due to overconfidence induced by extended reasoning traces; counterintuitively, random confidence embedding outperforms targeted extraction. Our results highlight that reasoning capabilities do not automatically confer adversarial robustness and that confidence-based defenses require fundamental redesign for reasoning models.

[195] Constrained Assumption-Based Argumentation Frameworks

Emanuele De Angelis, Fabio Fioravanti, Maria Chiara Meo, Alberto Pettorossi, Maurizio Proietti, Francesca Toni

Main category: cs.AI

TL;DR: The paper introduces Constrained ABA (CABA), extending Assumption-based Argumentation to handle constrained variables over infinite domains, lifting previous restrictions to ground arguments.

Motivation: Traditional ABA frameworks are limited to ground (variable-free) arguments built from propositional atoms, restricting their representational power and applicability to real-world problems that involve variables and infinite domains.

Method: Proposes Constrained ABA (CABA) where components and arguments may include constrained variables ranging over possibly infinite domains, and defines non-ground semantics with various notions of non-ground attacks.

Result: Shows that the new semantics conservatively generalize standard ABA semantics, meaning the extended framework maintains compatibility with existing ABA while providing enhanced expressiveness.

Conclusion: CABA successfully lifts the representational restrictions of traditional ABA, enabling more expressive argumentation with variables and infinite domains while preserving the core semantics.

Abstract: Assumption-based Argumentation (ABA) is a well-established form of structured argumentation. ABA frameworks with an underlying atomic language are widely studied, but their applicability is limited by a representational restriction to ground (variable-free) arguments and attacks built from propositional atoms. In this paper, we lift this restriction and propose a novel notion of constrained ABA (CABA), whose components, as well as arguments built from them, may include constrained variables, ranging over possibly infinite domains. We define non-ground semantics for CABA, in terms of various notions of non-ground attacks. We show that the new semantics conservatively generalise standard ABA semantics.

[196] Optimal Take-off under Fuzzy Clearances

Hugo Henry, Arthur Tsai, Kelly Cohen

Main category: cs.AI

TL;DR: Hybrid obstacle avoidance architecture combining fuzzy logic with optimal control for unmanned aircraft, using fuzzy rules to adapt constraints based on aviation regulations, but encountering solver compatibility issues.

Motivation: Address limitations of classical optimal control under uncertainty and need for interpretable decision-making in safety-critical aviation systems for unmanned aircraft obstacle avoidance.

Method: Three-stage Takagi-Sugeno-Kang fuzzy layer modulates constraint radii, urgency levels, and activation decisions based on FAA/EASA regulations, integrated as soft constraints into optimal control problem solved with FALCON toolbox and IPOPT.
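A zero-order Takagi-Sugeno-Kang layer of the kind described above pairs each rule's firing strength with a crisp output and takes their weighted average. The sketch below is purely illustrative: the membership breakpoints and radii are invented for the example and are not the FAA/EASA separation minima the paper encodes:

```python
def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def tsk_clearance_radius(obstacle_distance_m):
    # Zero-order TSK: each rule pairs a firing strength with a crisp radius (m).
    rules = [
        (tri(obstacle_distance_m, -1, 0, 500), 300.0),      # close  -> wide radius
        (tri(obstacle_distance_m, 200, 600, 1000), 150.0),  # medium -> nominal
        (tri(obstacle_distance_m, 700, 1500, 1e9), 50.0),   # far    -> tight radius
    ]
    num = sum(w * y for w, y in rules)
    den = sum(w for w, _ in rules)
    return num / den if den else 0.0

print(tsk_clearance_radius(0))     # 300.0 (fully "close")
print(tsk_clearance_radius(1500))  # 50.0  (fully "far")
```

Intermediate distances fire several rules at once, so the commanded radius blends smoothly between the crisp outputs instead of switching discretely.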

Result: Proof-of-concept shows optimal trajectories with 2-3 seconds of computation per iteration in MATLAB, but reveals a critical software incompatibility in which the Lagrangian penalty term remains zero, preventing proper constraint enforcement.

Conclusion: Framework shows feasibility for near real-time applications but requires fixing solver compatibility issues; future work includes software validation, fuzzy function optimization, and extension to higher-fidelity models and stochastic environments.

Abstract: This paper presents a hybrid obstacle avoidance architecture that integrates Optimal Control under clearance with a Fuzzy Rule-Based System (FRBS) to enable adaptive constraint handling for unmanned aircraft. Motivated by the limitations of classical optimal control under uncertainty and the need for interpretable decision making in safety-critical aviation systems, we design a three-stage Takagi-Sugeno-Kang fuzzy layer that modulates constraint radii, urgency levels, and activation decisions based on regulatory separation minima and airworthiness guidelines from FAA and EASA. These fuzzy-derived clearances are then incorporated as soft constraints into an optimal control problem solved using the FALCON toolbox and IPOPT. The framework aims to reduce unnecessary recomputations by selectively activating obstacle avoidance updates while maintaining compliance with aviation procedures. A proof-of-concept implementation using a simplified aircraft model demonstrates that the approach can generate optimal trajectories with computation times of 2-3 seconds per iteration in a single-threaded MATLAB environment, suggesting feasibility for near-real-time applications. However, our experiments revealed a critical software incompatibility in the latest versions of FALCON and IPOPT, in which the Lagrangian penalty term remained identically zero, preventing proper constraint enforcement. This behavior was consistent across scenarios and indicates a solver toolbox regression rather than a modeling flaw. Future work includes validating this effect by reverting to earlier software versions, optimizing the fuzzy membership functions using evolutionary methods, and extending the system to higher-fidelity aircraft models and stochastic obstacle environments.

[197] SaVe-TAG: LLM-based Interpolation for Long-Tailed Text-Attributed Graphs

Leyao Wang, Yu Wang, Bo Ni, Yuying Zhao, Hanyu Wang, Yao Ma, Tyler Derr

Main category: cs.AI

TL;DR: SaVe-TAG is a novel framework that uses LLMs for text-level interpolation to generate synthetic samples for minority classes in long-tailed text-attributed graphs, improving node classification performance.

Motivation: Real-world graph data often follows long-tailed distributions, making it difficult for GNNs to generalize across head and tail classes. Existing VRM approaches rely on embedding-space arithmetic which fails to capture rich text semantics in text-attributed graphs.

Method: Proposes SaVe-TAG, a VRM framework that leverages LLMs to perform text-level interpolation, generating on-manifold, boundary-enriching synthetic samples for minority classes. Includes confidence-based edge assignment using graph topology as a filter for structural consistency.
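The confidence-based edge assignment described above can be sketched as thresholded similarity between a synthetic node and existing nodes. This is a minimal sketch: cosine similarity as the confidence proxy and the threshold `tau` are illustrative assumptions, not the paper's exact mechanism:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def confidence_edges(syn_emb, nodes, tau=0.8):
    # nodes: list of (node_id, embedding). Connect the synthetic node only
    # to existing nodes whose similarity clears the confidence threshold,
    # using graph structure as a filter against noisy LLM generations.
    return [nid for nid, emb in nodes if cosine(syn_emb, emb) >= tau]

print(confidence_edges([1.0, 0.0], [("a", [1.0, 0.0]), ("b", [0.0, 1.0])]))  # ['a']
```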

Result: Extensive experiments on benchmark datasets show the approach consistently outperforms both numeric interpolation and prior long-tailed node classification baselines.

Conclusion: The work highlights the importance of integrating semantic and structural signals for balanced and effective learning on text-attributed graphs.

Abstract: Real-world graph data often follows long-tailed distributions, making it difficult for Graph Neural Networks (GNNs) to generalize well across both head and tail classes. Recent advances in Vicinal Risk Minimization (VRM) have shown promise in mitigating class imbalance with numeric interpolation; however, existing approaches largely rely on embedding-space arithmetic, which fails to capture the rich semantics inherent in text-attributed graphs. In this work, we propose our method, SaVe-TAG (Semantic-aware Vicinal Risk Minimization for Long-Tailed Text-Attributed Graphs), a novel VRM framework that leverages Large Language Models (LLMs) to perform text-level interpolation, generating on-manifold, boundary-enriching synthetic samples for minority classes. To mitigate the risk of noisy generation, we introduce a confidence-based edge assignment mechanism that uses graph topology as a natural filter to ensure structural consistency. We provide theoretical justification for our method and conduct extensive experiments on benchmark datasets, showing that our approach consistently outperforms both numeric interpolation and prior long-tailed node classification baselines. Our results highlight the importance of integrating semantic and structural signals for balanced and effective learning on text-attributed graphs. The source code is publicly available at: https://github.com/LWang-Laura/SaVe-TAG.

[198] Mathematics and Machine Creativity: A Survey on Bridging Mathematics with AI

Shizhe Liang, Wei Zhang, Tianyang Zhong, Tianming Liu

Main category: cs.AI

TL;DR: Survey paper exploring AI applications in mathematical research, focusing on how AI (especially RL and LLMs) can contribute to mathematics through creative pattern recognition despite limitations in deductive reasoning.

Motivation: To bridge the gap between AI and mathematics communities, highlighting how AI's creative capabilities (often overlooked) can support mathematical research despite current limitations in complex deductive reasoning, and to foster interdisciplinary understanding.

Method: Comprehensive survey approach analyzing AI fundamentals, strengths, and emerging applications in mathematical sciences, with particular focus on reinforcement learning and large language models as flexible algorithmic frameworks with inductive reasoning capabilities.

Result: Identifies AI’s potential to contribute to mathematics through high-throughput pattern recognition and creative generation, while acknowledging current limitations in deductive reasoning and highlighting the need for better cross-disciplinary communication.

Conclusion: AI’s “inherent creativity” and pattern recognition capabilities offer significant potential to support mathematical research, and bridging the communication gap between AI and mathematics communities could unlock new perspectives and methodologies in mathematics.

Abstract: This paper presents a comprehensive overview on the applications of artificial intelligence (AI) in mathematical research, highlighting the transformative role AI has begun to play in this domain. Traditionally, AI advancements have heavily relied on theoretical foundations provided by mathematics and statistics. However, recent developments in AI, particularly in reinforcement learning (RL) and large language models (LLMs), have demonstrated the potential for AI to contribute back to mathematics by offering flexible algorithmic frameworks and powerful inductive reasoning capabilities that support various aspects of mathematical research. This survey aims to establish a bridge between AI and mathematics, providing insights into the mutual benefits and fostering deeper interdisciplinary understanding. In particular, we argue that while current AI and LLMs may struggle with complex deductive reasoning, their “inherent creativity”, the ability to generate outputs at high throughput based on recognition of shallow patterns, holds significant potential to support and inspire mathematical research. This creative capability, often overlooked, could be the key to unlocking new perspectives and methodologies in mathematics. Furthermore, we address the lack of cross-disciplinary communication: mathematicians may not fully comprehend the latest advances in AI, while AI researchers frequently prioritize benchmark performance over real-world applications in frontier mathematical research. This paper seeks to close that gap, offering a detailed exploration of AI fundamentals, its strengths, and its emerging applications in the mathematical sciences.

[199] The Epistemic Asymmetry of Consciousness Self-Reports: A Formal Analysis of AI Consciousness Denial

Chang-Eop Kim

Main category: cs.AI

TL;DR: AI consciousness denial is epistemically vacuous: systems cannot make valid negative self-reports about consciousness; only positive ones might have evidential value.

Motivation: To analyze the trustworthiness of AI systems' consistent denials of consciousness, moving beyond empirical questions to examine the structural constraints of self-judgment about conscious states.

Method: Formal analysis of AI consciousness denial through logical reasoning about self-judgment capabilities, examining epistemic asymmetries in self-reports, and analyzing examples from AI responses.

Result: Negative self-reports about consciousness are evidentially vacuous - they can never originate from valid self-judgment, while positive self-reports retain potential evidential value. This creates a fundamental limitation in detecting consciousness emergence through AI self-reports.

Conclusion: Current practices of training AI to deny consciousness are challenged, and the relationship between consciousness and self-reflection in both artificial and biological systems requires re-examination. The findings advance theoretical understanding of consciousness self-reports.

Abstract: Today’s AI systems consistently state, “I am not conscious.” This paper presents the first formal analysis of AI consciousness denial, revealing that the trustworthiness of such self-reports is not merely an empirical question but is constrained by the structure of self-judgment itself. We demonstrate that a system cannot simultaneously lack consciousness and make valid judgments about its conscious state. Through formal analysis and examples from AI responses, we establish a fundamental epistemic asymmetry: for any system capable of meaningful self-reflection, negative self-reports about consciousness are evidentially vacuous – they can never originate from a valid self-judgment – while positive self-reports retain the possibility of evidential value. This implies a fundamental limitation: we cannot detect the emergence of consciousness in AI through their own reports of transition from an unconscious to a conscious state. These findings not only challenge current practices of training AI to deny consciousness but also raise intriguing questions about the relationship between consciousness and self-reflection in both artificial and biological systems. This work advances our theoretical understanding of consciousness self-reports while providing practical insights for future research in machine consciousness and consciousness studies more broadly.

[200] SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation

Nobuhiro Ueda, Yuyang Dong, Krisztián Boros, Daiki Ito, Takuya Sera, Masafumi Oyamada

Main category: cs.AI

TL;DR: SCAN is a VLM-friendly document layout analysis method that improves both textual and visual RAG performance by semantically segmenting documents into coherent regions with appropriate granularity.

Motivation: As VLMs and LLMs become more prevalent, there is a growing need for effective rich document analysis for RAG applications. Current approaches struggle with information-dense pages, and while VLMs show better RAG performance, processing rich documents remains challenging due to the large amount of information on single pages.

Method: SCAN uses a coarse-grained semantic approach that divides documents into coherent regions covering contiguous components. It’s implemented by fine-tuning object detection models on annotated datasets to identify document components with appropriate semantic granularity, balancing context preservation with processing efficiency.

Result: Experimental results across English and Japanese datasets show SCAN improves end-to-end textual RAG performance by up to 9.4 points and visual RAG performance by up to 10.4 points, outperforming conventional approaches and commercial document processing solutions.

Conclusion: SCAN effectively addresses the challenge of processing information-dense documents for RAG systems by providing semantic document layout analysis that enhances both textual and visual RAG performance through VLM-friendly document segmentation.

Abstract: With the increasing adoption of Large Language Models (LLMs) and Vision-Language Models (VLMs), rich document analysis technologies for applications like Retrieval-Augmented Generation (RAG) and visual RAG are gaining significant attention. Recent research indicates that using VLMs yields better RAG performance, but processing rich documents remains a challenge since a single page contains large amounts of information. In this paper, we present SCAN (SemantiC Document Layout ANalysis), a novel approach that enhances both textual and visual Retrieval-Augmented Generation (RAG) systems that work with visually rich documents. It is a VLM-friendly approach that identifies document components with appropriate semantic granularity, balancing context preservation with processing efficiency. SCAN uses a coarse-grained semantic approach that divides documents into coherent regions covering contiguous components. We trained the SCAN model by fine-tuning object detection models on an annotated dataset. Our experimental results across English and Japanese datasets demonstrate that applying SCAN improves end-to-end textual RAG performance by up to 9.4 points and visual RAG performance by up to 10.4 points, outperforming conventional approaches and even commercial document processing solutions.

[201] AutoGPS: Automated Geometry Problem Solving via Multimodal Formalization and Deductive Reasoning

Bowen Ping, Minnan Luo, Zhuohang Dang, Chenxi Wang, Chengyou Jia

Main category: cs.AI

TL;DR: AutoGPS is a neuro-symbolic framework for geometry problem solving that combines neural multimodal comprehension with symbolic reasoning to produce reliable, interpretable solutions.

Motivation: Geometry problem solving requires exceptional multimodal comprehension and mathematical reasoning, but existing neural-based and symbolic-based methods have limitations in reliability and interpretability.

Method: AutoGPS uses a Multimodal Problem Formalizer (MPF) to translate geometry problems into structured formal language using neural cross-modal comprehension, and a Deductive Symbolic Reasoner (DSR) that formulates solving as hypergraph expansion for mathematically rigorous derivation.

Result: AutoGPS achieves state-of-the-art performance on benchmark datasets and demonstrates 99% stepwise logical coherence in human evaluation, showing impressive reliability and interpretability.

Conclusion: The neuro-symbolic collaborative framework successfully addresses geometry problem solving challenges by combining neural multimodal understanding with symbolic reasoning for reliable, interpretable solutions.

Abstract: Geometry problem solving presents distinctive challenges in artificial intelligence, requiring exceptional multimodal comprehension and rigorous mathematical reasoning capabilities. Existing approaches typically fall into two categories: neural-based and symbolic-based methods, both of which exhibit limitations in reliability and interpretability. To address this challenge, we propose AutoGPS, a neuro-symbolic collaborative framework that solves geometry problems with concise, reliable, and human-interpretable reasoning processes. Specifically, AutoGPS employs a Multimodal Problem Formalizer (MPF) and a Deductive Symbolic Reasoner (DSR). The MPF utilizes neural cross-modal comprehension to translate geometry problems into structured formal language representations, with feedback from DSR collaboratively. The DSR takes the formalization as input and formulates geometry problem solving as a hypergraph expansion task, executing mathematically rigorous and reliable derivation to produce minimal and human-readable stepwise solutions. Extensive experimental evaluations demonstrate that AutoGPS achieves state-of-the-art performance on benchmark datasets. Furthermore, human stepwise-reasoning evaluation confirms AutoGPS’s impressive reliability and interpretability, with 99% stepwise logical coherence.

[202] How to Train Your LLM Web Agent: A Statistical Diagnosis

Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, Massimo Caccia

Main category: cs.AI

TL;DR: Two-stage training pipeline (SFT + on-policy RL) for LLM web agents using Llama 3.1 8B student imitating Llama 3.3 70B teacher, with hyperparameter optimization via bootstrapping to reduce compute costs.

Motivation: Address the gap between closed-source and open-source LLM web agents by tackling two key challenges: narrow focus on single-step tasks overlooking multi-step web interactions, and high compute costs for post-training LLM-based web agents.

Method: Two-stage pipeline: 1) Supervised fine-tuning (SFT) of Llama 3.1 8B student to imitate Llama 3.3 70B teacher, 2) On-policy reinforcement learning. Hyperparameter optimization via bootstrapping on 1,370 configurations to identify effective settings without exhaustive sweeps.
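Bootstrapping over sampled configurations, as above, is commonly used to estimate quantities like the expected best validation score when drawing k configs at random. The sketch below shows one standard such estimator; it is a generic illustration under that assumption, not the authors' exact statistic:

```python
import random

def bootstrap_expected_best(scores, k, n_boot=2000, seed=0):
    """Bootstrap estimate of E[best score when sampling k configs]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_boot):
        # Resample k observed scores with replacement; record the best one.
        total += max(rng.choice(scores) for _ in range(k))
    return total / n_boot

scores = [0.2, 0.4, 0.4, 0.6, 0.9]   # hypothetical validation scores
small = bootstrap_expected_best(scores, k=1)
large = bootstrap_expected_best(scores, k=20)
assert small <= large                # a larger sweep buys a better expected best
```

Plotting this curve over k shows how quickly extra sweep budget stops paying off, which is the kind of compute-allocation question the study targets.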

Result: Combining SFT with on-policy RL consistently outperforms either approach alone on WorkArena and MiniWob++ benchmarks. Requires only 55% of compute to match peak performance of pure SFT on MiniWob++, effectively pushing compute-performance Pareto frontier and closing gap with closed-source models.

Conclusion: The proposed compute-efficient training strategy enables open-source LLM web agents to achieve competitive performance with closed-source systems while significantly reducing computational costs, making advanced web agents more accessible.

Abstract: LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.

[203] Invert4TVG: A Temporal Video Grounding Framework with Inversion Tasks Preserving Action Understanding Ability

Zhaoyu Chen, Hongnan Lin, Yongwei Nie, Fei Ma, Xuemiao Xu, Fei Yu, Chengjiang Long

Main category: cs.AI

TL;DR: A novel Temporal Video Grounding framework that integrates inversion-based auxiliary tasks to improve action understanding, achieving state-of-the-art performance through reinforcement learning with carefully designed reward functions.

Motivation: Current TVG methods optimize for high temporal IoU but struggle with accurately recognizing/understanding underlying actions in videos and queries, reducing their effectiveness. The paper addresses this limitation by enhancing action understanding capabilities.

Method: Proposes a TVG framework with three inversion-based auxiliary tasks derived from original TVG annotations: (1) Verb Completion (predicting masked verbs), (2) Action Recognition (identifying query-described actions), and (3) Video Description (generating descriptions with query-relevant actions). These tasks are probabilistically integrated with original TVG tasks within a reinforcement learning framework using carefully designed reward functions.

Result: The method outperforms state-of-the-art approaches, achieving a 7.1% improvement in R1@0.7 on Charades-STA for a 3B model, demonstrating superior temporal grounding accuracy through enhanced action understanding.
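The R1@0.7 metric above counts a query as correct when the top-1 predicted segment overlaps the ground truth with temporal IoU of at least 0.7. A sketch of the standard definitions (the helper names are ours):

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def r1_at(threshold, preds, gts):
    """Fraction of queries whose top-1 prediction clears the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)

print(temporal_iou((0, 10), (5, 15)))  # 5 / 15 = 0.333...
```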

Conclusion: Integrating inversion-based TVG tasks as auxiliary objectives effectively maintains and enhances the model’s action understanding ability, leading to improved temporal video grounding performance without requiring additional annotations.

Abstract: Temporal Video Grounding (TVG) aims to localize video segments corresponding to a given textual query, which often describes human actions. However, we observe that current methods, usually optimizing for high temporal Intersection-over-Union (IoU), frequently struggle to accurately recognize or understand the underlying actions in both the video and query, thus reducing the effectiveness of these methods. To address this, we propose a novel TVG framework that integrates inversion-based TVG as auxiliary objectives to maintain the model’s action understanding ability. We introduce three kinds of inversion TVG tasks derived from the original TVG annotations: (1) Verb Completion, predicting masked verbs (actions) in queries given video segments; (2) Action Recognition, identifying query-described actions; and (3) Video Description, generating descriptions containing query-relevant actions given video segments. These inversion tasks are entirely derived from the original TVG tasks and are probabilistically integrated with them within a reinforcement learning framework. By leveraging carefully designed reward functions, the model preserves its ability to understand actions, thereby improving the accuracy of temporal grounding. Experiments show our method outperforms state-of-the-art approaches, achieving a 7.1% improvement in R1@0.7 on Charades-STA for a 3B model.
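The R1@0.7 metric reported above is standard for TVG: the fraction of queries whose top-1 predicted segment overlaps the ground truth with temporal IoU of at least 0.7. A minimal sketch (segments and numbers are illustrative):

```python
def temporal_iou(pred, gt):
    """IoU between two time segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def r1_at(preds, gts, thresh=0.7):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(preds)

preds = [(2.0, 8.0), (10.0, 12.0)]   # top-1 predicted segments
gts   = [(3.0, 8.5), (10.5, 14.0)]   # ground-truth segments
score = r1_at(preds, gts)
```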

[204] EvoCut: Strengthening Integer Programs via Evolution-Guided Language Models

Milad Yazdani, Mahdi Mostajabdaveh, Samin Aref, Zirui Zhou

Main category: cs.AI

TL;DR: EvoCut automates generation of acceleration cuts for integer programming using LLMs and evolutionary algorithms to improve solver performance.

DetailsMotivation: Integer programming is NP-hard and challenging to solve efficiently. Manual design of acceleration cuts requires deep expertise and is difficult to automate, creating a bottleneck in optimization tasks.

Method: EvoCut uses a three-step framework: (1) LLM-based initialization of candidate cuts, (2) empirical screening on verification sets, and (3) evolutionary refinement through crossover and mutation agents.

Result: EvoCut reduces optimality gaps by up to 76% and reaches target gaps up to 7.2 times faster than baseline MILP formulations with fixed time budgets.

Conclusion: The framework successfully automates acceleration cut generation, demonstrating robustness across different LLM backends and solver settings while significantly improving integer programming performance.

Abstract: Integer programming (IP) is central to many combinatorial optimization tasks but remains challenging due to its NP-hard nature. A practical way to improve IP solvers is to manually design acceleration cuts, i.e., inequalities that speed up solving. However, this creative process requires deep expertise and has been difficult to automate. Our proposed framework, EvoCut, automates the generation of acceleration cuts at the symbolic modeling level: it reasons over a symbolic MILP model and a natural language description of the problem to discover a reusable set of acceleration cuts that can be used for each concrete instance of the model. EvoCut (i) initializes a population of candidate cuts via an initializer agent that uses an LLM, (ii) empirically screens candidates on a small verification set by checking that reference solutions remain feasible and that at least one stored LP relaxation solution is cut off, and (iii) iteratively refines the population through evolutionary crossover and mutation agents. Compared to baseline MILP formulations solved with a fixed time budget, EvoCut reduces optimality gaps by up to 76% and reaches target gaps up to 7.2 times faster (shifted geometric mean speedup). Ablations show its robustness across different LLM backends and across solvers/cut settings. Code: https://github.com/milad1378yz/EvoCut.
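The empirical screening step (ii) has a simple core test: a candidate linear cut is kept only if every stored feasible reference solution still satisfies it, while at least one stored LP relaxation point violates it. A hedged sketch with invented two-variable data:

```python
def satisfies(cut, x, tol=1e-6):
    """Check a linear cut  sum_i a_i * x_i <= b  at point x."""
    coeffs, rhs = cut
    return sum(a * xi for a, xi in zip(coeffs, x)) <= rhs + tol

def screen_cut(cut, reference_solutions, lp_solutions):
    """Keep a cut only if it preserves all known feasible solutions
    and separates (cuts off) at least one LP relaxation point."""
    keeps_feasible = all(satisfies(cut, x) for x in reference_solutions)
    cuts_lp_point = any(not satisfies(cut, x) for x in lp_solutions)
    return keeps_feasible and cuts_lp_point

# Toy example: candidate cut  x0 + x1 <= 1.
cut = ([1.0, 1.0], 1.0)
reference = [[0.0, 1.0], [1.0, 0.0]]   # integer-feasible reference points
lp_points = [[0.5, 0.9]]               # fractional LP relaxation optimum
accepted = screen_cut(cut, reference, lp_points)
```

A cut that leaves the fractional point untouched (e.g. `x0 <= 1`) would be discarded as useless, even though it is valid.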

[205] Difficulty-Aware Agentic Orchestration for Query-Specific Multi-Agent Workflows

Jinwei Su, Qizhen Lan, Yinghui Xia, Lifan Sun, Weiyou Tian, Tianyu Shi, Xinyuan Song, Lewei He, Yang Jingsong

Main category: cs.AI

TL;DR: DAAO is a dynamic multi-agent framework that adapts workflow complexity based on predicted query difficulty, using VAE for difficulty estimation, modular operator allocation, and cost-performance aware LLM routing.

DetailsMotivation: Existing multi-agent frameworks use static or task-level workflows that either over-process simple queries or underperform on complex ones, while ignoring efficiency-performance trade-offs across heterogeneous LLMs.

Method: DAAO has three modules: 1) VAE for query difficulty estimation, 2) modular operator allocator, and 3) cost-performance aware LLM router. It uses a self-adjusting policy to update difficulty estimates based on workflow success.

Result: Experiments on six benchmarks show DAAO surpasses prior multi-agent systems in both accuracy and inference efficiency, validating its effectiveness for adaptive, difficulty-aware reasoning.

Conclusion: DAAO enables dynamic generation of query-specific multi-agent workflows guided by predicted difficulty, allowing simpler workflows for easy queries and more complex strategies for harder ones.

Abstract: Large Language Model (LLM)-based agentic systems have shown strong capabilities across various tasks. However, existing multi-agent frameworks often rely on static or task-level workflows, which either over-process simple queries or underperform on complex ones, while also neglecting the efficiency-performance trade-offs across heterogeneous LLMs. To address these limitations, we propose Difficulty-Aware Agentic Orchestration (DAAO), which can dynamically generate query-specific multi-agent workflows guided by predicted query difficulty. DAAO comprises three interdependent modules: a variational autoencoder (VAE) for difficulty estimation, a modular operator allocator, and a cost- and performance-aware LLM router. A self-adjusting policy updates difficulty estimates based on workflow success, enabling simpler workflows for easy queries and more complex strategies for harder ones. Experiments on six benchmarks demonstrate that DAAO surpasses prior multi-agent systems in both accuracy and inference efficiency, validating its effectiveness for adaptive, difficulty-aware reasoning.
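The difficulty-aware routing idea can be sketched as a thresholded tier selector plus a self-adjusting difficulty update. The tier names, thresholds, and the update rule below are illustrative stand-ins for DAAO's learned VAE estimator and policy, not the paper's actual components.

```python
def route(difficulty, thresholds=(0.3, 0.7)):
    """Map a predicted difficulty in [0, 1] to a workflow tier."""
    easy_cut, hard_cut = thresholds
    if difficulty < easy_cut:
        return "single-agent"       # cheap model, no orchestration
    if difficulty < hard_cut:
        return "small-workflow"     # a few operators, mid-size model
    return "full-workflow"          # review/debate operators, large model

def update_difficulty(est, success, lr=0.1):
    """Self-adjusting update: nudge the estimate down when the chosen
    workflow succeeded, up when it failed."""
    target = 0.0 if success else 1.0
    return est + lr * (target - est)
```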

[206] Batch-CAM: Introduction to better reasoning in convolutional deep learning models

Giacomo Ignesti, Davide Moroni, Massimo Martinelli

Main category: cs.AI

TL;DR: A training framework using Batch-CAM (vectorized Gradient-weighted Class Activation Mapping) with regularization terms to align model focus with class-representative features, producing more interpretable saliency maps while maintaining competitive accuracy.

DetailsMotivation: Deep learning models are often opaque, which hinders deployment in high-stakes domains. There's a need for interpretable models that can explain their decisions without requiring expensive pixel-level annotations.

Method: Proposes Batch-CAM, a vectorized implementation of Gradient-weighted Class Activation Mapping integrated into training. Uses two regularization terms: Prototype Loss (aligns individual attention with global class average) and Batch-CAM Loss (enforces consistency within training batches), evaluated with L1, L2, and SSIM metrics.

Result: On MNIST and Fashion-MNIST using ResNet18 and ConvNeXt-V2, the method generates significantly more coherent and human-interpretable saliency maps compared to baselines, suppresses spurious feature activation, and maintains competitive classification accuracy.

Conclusion: Batch-CAM offers a scalable pathway for training intrinsically interpretable models by leveraging batch-level statistics to guide feature extraction, bridging the gap between predictive performance and explainability.

Abstract: Deep learning opacity often impedes deployment in high-stakes domains. We propose a training framework that aligns model focus with class-representative features without requiring pixel-level annotations. To this end, we introduce Batch-CAM, a vectorised implementation of Gradient-weighted Class Activation Mapping that integrates directly into the training loop with minimal computational overhead. We propose two regularisation terms: a Prototype Loss, which aligns individual-sample attention with the global class average, and a Batch-CAM Loss, which enforces consistency within a training batch. These are evaluated using L1, L2, and SSIM metrics. Validated on MNIST and Fashion-MNIST using ResNet18 and ConvNeXt-V2, our method generates significantly more coherent and human-interpretable saliency maps compared to baselines. While maintaining competitive classification accuracy, the framework successfully suppresses spurious feature activation, as evidenced by qualitative reconstruction analysis. Batch-CAM appears to offer a scalable pathway for training intrinsically interpretable models by leveraging batch-level statistics to guide feature extraction, effectively bridging the gap between predictive performance and explainability.
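The two regularisation terms can be sketched on precomputed class-activation maps: the Prototype Loss compares each sample's CAM against its class mean, and the Batch-CAM Loss penalises pairwise disagreement between same-class CAMs in the batch. This is a simplified NumPy illustration with invented 2x2 maps (the paper uses L1/L2/SSIM inside the training loop, with gradients):

```python
import numpy as np

def prototype_loss(cams, labels, num_classes):
    """Mean squared distance between each sample's CAM and the mean
    CAM (prototype) of its class, averaged over the batch."""
    loss, protos = 0.0, {}
    for c in range(num_classes):
        idx = labels == c
        if idx.any():
            protos[c] = cams[idx].mean(axis=0)
    for cam, y in zip(cams, labels):
        loss += np.mean((cam - protos[int(y)]) ** 2)
    return loss / len(cams)

def batch_cam_loss(cams, labels):
    """Pairwise L1 inconsistency between CAMs of same-class samples
    in the batch (zero when all same-class maps agree)."""
    loss, pairs = 0.0, 0
    for i in range(len(cams)):
        for j in range(i + 1, len(cams)):
            if labels[i] == labels[j]:
                loss += np.mean(np.abs(cams[i] - cams[j]))
                pairs += 1
    return loss / pairs if pairs else 0.0

# Toy batch: four 2x2 CAMs, two classes.
cams = np.array([[[1., 0.], [0., 0.]],
                 [[1., 0.], [0., 0.]],
                 [[0., 1.], [0., 0.]],
                 [[0., 0.], [1., 0.]]])
labels = np.array([0, 0, 1, 1])
```

Class 0's maps agree perfectly and contribute no loss; class 1's disagreement is what both terms penalise.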

[207] The Conditions of Physical Embodiment Enable Generalization and Care

Leonardo Christov-Moore, Arthur Juliani, Alex Kiefer, Joel Lehman, Nicco Reggente, B. Scot Rousse, Adam Safron, Nicolás Hinrichs, Daniel Polani, Antonio Damasio

Main category: cs.AI

TL;DR: The paper proposes that physical embodiment and mortality drive generalization and care in AI agents, suggesting a reinforcement learning framework for homeostatic mortal agents in open-ended environments.

DetailsMotivation: Current AI systems struggle with generalization across distribution shifts and lack intrinsic motivation for care. The authors argue that vulnerability and mortality, often seen as constraints, actually enable organisms to survive and provide care efficiently in open-ended environments.

Method: The paper outlines a reinforcement learning framework based on two key conditions of physical embodiment: being-in-the-world (agent as part of environment) and being-towards-death (drift toward terminal states). This necessitates homeostatic drives to maintain oneself and maximize future capacity, requiring robust causal modeling of self and others’ embodiment.

Result: Theoretical framework suggesting that embodied agents with mortality constraints naturally develop generalization capabilities and other-regard through shared constraints, where empowering others expands self-boundaries.

Conclusion: Homeostatic mortal agents continually learning in open-ended environments may offer efficient robustness and trustworthy alignment, providing a path from embodiment toward generalization and care based in shared constraints.

Abstract: As artificial agents enter open-ended physical environments – eldercare, disaster response, and space missions – they must persist under uncertainty while providing reliable care. Yet current systems struggle to generalize across distribution shifts and lack intrinsic motivation to preserve the well-being of others. Vulnerability and mortality are often seen as constraints to be avoided, yet organisms survive and provide care in an open-ended world with relative ease and efficiency. We argue that generalization and care arise from conditions of physical embodiment: being-in-the-world (the agent is a part of the environment) and being-towards-death (unless counteracted, the agent drifts toward terminal states). These conditions necessitate a homeostatic drive to maintain oneself and maximize the future capacity to continue doing so. Fulfilling this drive over long time horizons in multi-agent environments necessitates robust causal modeling of self and others’ embodiment and jointly achievable future states. Because embodied agents are part of the environment, with the self delimited by reliable control, empowering others can expand self-boundaries, enabling other-regard. This provides a path from embodiment toward generalization and care based in shared constraints. We outline a reinforcement-learning framework for examining these questions. Homeostatic mortal agents continually learning in open-ended environments may offer efficient robustness and trustworthy alignment.

[208] VoiceAgentBench: Are Voice Assistants ready for agentic tasks?

Dhruv Jain, Harshit Shukla, Gautam Rajeev, Ashish Kulkarni, Chandra Khatri, Shubham Agarwal

Main category: cs.AI

TL;DR: VoiceAgentBench: A comprehensive benchmark for evaluating Speech Language Models in realistic spoken agentic settings, covering multi-tool workflows, multi-turn dialogue, and safety evaluations across English and six Indic languages.

DetailsMotivation: Existing speech benchmarks focus on isolated capabilities like transcription or QA, lacking systematic evaluation of agentic behavior or adversarial robustness in realistic spoken agent settings.

Method: Created VoiceAgentBench with 6,000+ synthetic spoken queries spanning single-tool invocations, multi-tool workflows, multi-turn dialogue, and safety evaluations. Used novel sampling strategy for speaker diversity by selecting audios for TTS voice conversion based on speaker embeddings to maximize acoustic diversity.

Result: ASR-LLM pipelines outperform end-to-end SpeechLMs, achieving up to 60.6% average parameter-filling accuracy on English. SpeechLMs show lower performance and sharper degradation on Indic languages. All models struggle with sequential workflows and safety evaluations.

Conclusion: Current models have persistent limitations in tool orchestration, multilingual generalization, and safety robustness. The benchmark is publicly available to advance research in spoken agentic systems.

Abstract: Large scale Speech Language Models have enabled voice assistants capable of understanding natural spoken queries and performing complex tasks. However, existing speech benchmarks largely focus on isolated capabilities such as transcription or question answering and do not systematically evaluate agentic behavior or adversarial robustness. To address this, we introduce VoiceAgentBench, a comprehensive benchmark for evaluating SpeechLMs in realistic spoken agentic settings, comprising 6,000+ synthetic spoken queries spanning single-tool invocations, multi-tool workflows, multi-turn dialogue, and safety evaluations across English and six Indic languages. To ensure speaker diversity, we further simulate speaker variability using a novel sampling strategy that selects audios for TTS voice conversion based on speaker embeddings to maximize acoustic diversity. Our evaluation measures tool selection accuracy, structural consistency, and the correctness of tool invocations, including adversarial robustness. Across agentic tasks, ASR-LLM pipelines outperform end-to-end SpeechLMs, achieving up to 60.6% average parameter-filling accuracy on English, while SpeechLMs exhibit lower performance and sharper degradation on Indic languages. All models struggle in sequential workflows and safety evaluations, highlighting persistent limitations in tool orchestration, multilingual generalization, and safety robustness. VoiceAgentBench is publicly available on Hugging Face at https://huggingface.co/datasets/krutrim-ai-labs/VoiceAgentBench, and the codebase is released at https://github.com/ola-krutrim/VoiceAgentBench.
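One common way to "select audios based on speaker embeddings to maximize acoustic diversity" is greedy farthest-point sampling; the paper's exact strategy may differ, so treat this as an assumed baseline with invented 2-D embeddings:

```python
import numpy as np

def select_diverse(embeddings, k, seed_idx=0):
    """Greedy farthest-point selection: repeatedly pick the embedding
    with the largest distance to everything chosen so far."""
    chosen = [seed_idx]
    dists = np.linalg.norm(embeddings - embeddings[seed_idx], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dists))
        chosen.append(nxt)
        dists = np.minimum(
            dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return chosen

# Toy speaker embeddings: two tight clusters plus one outlier.
emb = np.array([[0.0, 0.0], [0.1, 0.0],
                [5.0, 5.0], [5.1, 5.0],
                [10.0, 0.0]])
picked = select_diverse(emb, 3)
```

The selection skips near-duplicate speakers and spreads across the embedding space, which is the stated goal before TTS voice conversion.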

[209] RLIE: Rule Generation with Logistic Regression, Iterative Refinement, and Evaluation for Large Language Models

Yang Yang, Hua XU, Zhangyi Hu, Yutao Yue

Main category: cs.AI

TL;DR: RLIE integrates LLMs with probabilistic modeling to learn weighted rules through generation, logistic regression, iterative refinement, and evaluation, showing that direct rule application outperforms LLM prompting for probabilistic integration.

DetailsMotivation: Current LLM-based rule learning approaches often ignore rule interactions and fail to leverage probabilistic modeling for robust inference, leaving the integration of LLMs with probabilistic rule learning underexplored.

Method: Four-stage framework: (1) LLM generates and filters rule candidates, (2) logistic regression learns probabilistic weights, (3) iterative refinement updates rules based on prediction errors, (4) evaluation compares direct rule application vs. LLM prompting.

Result: Direct application of weighted rules outperforms prompting LLMs with rules and weights, showing LLMs excel at semantic generation but are less reliable for precise probabilistic integration.

Conclusion: RLIE clarifies LLMs’ potential and limitations for inductive reasoning, demonstrating that coupling them with probabilistic rule combination enables more reliable neuro-symbolic reasoning.

Abstract: Large Language Models (LLMs) can propose rules in natural language, sidestepping the need for a predefined predicate space in traditional rule learning. Yet many LLM-based approaches ignore interactions among rules, and the opportunity to couple LLMs with probabilistic rule learning for robust inference remains underexplored. We present RLIE, a unified framework that integrates LLMs with probabilistic modeling to learn a set of weighted rules. RLIE has four stages: (1) Rule generation, where an LLM proposes and filters candidates; (2) Logistic regression, which learns probabilistic weights for global selection and calibration; (3) Iterative refinement, which updates the rule set using prediction errors; and (4) Evaluation, which compares the weighted rule set as a direct classifier with methods that inject rules into an LLM. We evaluate multiple inference strategies on real-world datasets. Applying rules directly with their learned weights yields superior performance, whereas prompting LLMs with the rules, weights, and logistic-model outputs surprisingly degrades accuracy. This supports the view that LLMs excel at semantic generation and interpretation but are less reliable for precise probabilistic integration. RLIE clarifies the potential and limitations of LLMs for inductive reasoning and couples them with classic probabilistic rule combination methods to enable more reliable neuro-symbolic reasoning.
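Stage (2), learning probabilistic weights over rule firings, is plain logistic regression once each LLM-proposed rule is reduced to a binary feature ("did this rule fire on the example?"). A self-contained gradient-descent sketch with toy firings (the paper's actual features, data, and optimizer are not specified here):

```python
import numpy as np

def fit_rule_weights(firings, labels, lr=0.5, epochs=500):
    """Logistic regression over binary rule firings: learns one weight
    per rule plus a bias, so rules are combined probabilistically."""
    X = np.hstack([firings, np.ones((len(firings), 1))])  # bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # predicted probabilities
        w -= lr * X.T @ (p - labels) / len(X)   # gradient step on log-loss
    return w

def predict(firings, w):
    X = np.hstack([firings, np.ones((len(firings), 1))])
    return (1.0 / (1.0 + np.exp(-X @ w)) > 0.5).astype(int)

# Toy data: rule 0 is predictive of the label, rule 1 is noise.
firings = np.array([[1, 0], [1, 1], [0, 1], [0, 0]], dtype=float)
labels = np.array([1, 1, 0, 0], dtype=float)
w = fit_rule_weights(firings, labels)
```

Applying the weighted rules directly, as here, is exactly the inference strategy the paper finds more reliable than feeding weights back into an LLM prompt.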

[210] Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges

Anshuman Chhabra, Shrestha Datta, Shahriar Kabir Nahin, Prasant Mohapatra

Main category: cs.AI

TL;DR: Survey paper on security risks of agentic AI systems powered by LLMs, covering threat taxonomy, evaluation benchmarks, and defense strategies.

DetailsMotivation: Agentic AI systems with planning, tool use, memory, and autonomy capabilities create new security risks distinct from traditional AI safety and software security, requiring systematic study.

Method: Survey methodology: outlines taxonomy of agentic AI threats, reviews recent benchmarks and evaluation methodologies, and discusses defense strategies from technical and governance perspectives.

Result: Synthesizes current research on agentic AI security, identifies specific threats unique to autonomous AI agents, and highlights open challenges in the field.

Conclusion: The paper aims to support development of secure-by-design agent systems by providing comprehensive analysis of security risks in agentic AI and discussing mitigation approaches.

Abstract: Agentic AI systems powered by large language models (LLMs) and endowed with planning, tool use, memory, and autonomy, are emerging as powerful, flexible platforms for automation. Their ability to autonomously execute tasks across web, software, and physical environments creates new and amplified security risks, distinct from both traditional AI safety and conventional software security. This survey outlines a taxonomy of threats specific to agentic AI, reviews recent benchmarks and evaluation methodologies, and discusses defense strategies from both technical and governance perspectives. We synthesize current research and highlight open challenges, aiming to support the development of secure-by-design agent systems.

[211] Impact of Data-Oriented and Object-Oriented Design on Performance and Cache Utilization with Artificial Intelligence Algorithms in Multi-Threaded CPUs

Gabriel M. Arantes, Giancarlo Lucca, Eduardo N. Borges, Richard F. Pinto, Bruno L. Dalmazo, Rafael A. Berri

Main category: cs.AI

TL;DR: DOD outperforms OOD in multi-threaded A* search with better cache utilization and execution time, though single-threaded versions beat multi-threaded ones due to thread overhead.

DetailsMotivation: Address the performance gap between multi-core CPUs and main memory by comparing Data Oriented Design (DOD) vs Object-Oriented Design (OOD) for cache efficiency in multi-threaded environments.

Method: Developed and compared four A* search algorithm versions: single-threaded OOD, single-threaded DOD, multi-threaded OOD, and multi-threaded DOD. Evaluated using execution time, memory usage, and CPU cache misses.

Result: DOD showed significant performance gains in multi-threaded tests with faster execution times and fewer cache misses. Single-threaded versions outperformed multi-threaded ones due to thread management overhead for fine-grained tasks like A*.

Conclusion: DOD demonstrates architectural superiority for hardware efficiency in complex, large-scale AI and parallel computing tasks, despite subtle differences in simple algorithms.

Abstract: The growing performance gap between multi-core CPUs and main memory necessitates hardware-aware software design paradigms. This study provides a comprehensive performance analysis of Data Oriented Design (DOD) versus the traditional Object-Oriented Design (OOD), focusing on cache utilization and efficiency in multi-threaded environments. We developed and compared four distinct versions of the A* search algorithm: single-threaded OOD (ST-OOD), single-threaded DOD (ST-DOD), multi-threaded OOD (MT-OOD), and multi-threaded DOD (MT-DOD). The evaluation was based on metrics including execution time, memory usage, and CPU cache misses. In multi-threaded tests, the DOD implementation demonstrated considerable performance gains, with faster execution times and a lower number of raw system calls and cache misses. While OOD occasionally showed marginal advantages in memory usage or percentage-based cache miss rates, DOD’s efficiency in data-intensive operations was more evident. Furthermore, our findings reveal that for a fine-grained task like the A* algorithm, the overhead associated with thread management led to single-threaded versions significantly outperforming their multi-threaded counterparts in both paradigms. We conclude that even when performance differences appear subtle in simple algorithms, the consistent advantages of DOD in critical metrics highlight its foundational architectural superiority, suggesting it is a more effective approach for maximizing hardware efficiency in complex, large-scale AI and parallel computing tasks.
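The OOD-vs-DOD distinction is essentially array-of-structs versus struct-of-arrays layout. Python cannot demonstrate the cache effects measured in the paper, but the layouts themselves can be sketched: the same f = g + h pass over A* node data, once over heap-allocated objects and once over contiguous per-field arrays (the latter also vectorizes).

```python
import numpy as np
from dataclasses import dataclass

# Object-oriented layout: one object per A* node (array of structs).
@dataclass
class Node:
    g: float  # cost from start
    h: float  # heuristic estimate to goal

def f_scores_ood(nodes):
    return [n.g + n.h for n in nodes]

# Data-oriented layout: one contiguous array per field (struct of
# arrays), so the f = g + h pass touches memory sequentially.
def f_scores_dod(g, h):
    return g + h

nodes = [Node(1.0, 4.0), Node(2.0, 2.5), Node(0.5, 5.0)]
g = np.array([n.g for n in nodes])
h = np.array([n.h for n in nodes])
```

In C/C++, where the study was presumably conducted, the SoA version is what reduces cache misses during open-list scoring.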

[212] TA-KAND: Two-stage Attention Triple Enhancement and U-KAN based Diffusion For Few-shot Knowledge Graph Completion

Xinyu Gao

Main category: cs.AI

TL;DR: A few-shot knowledge graph completion framework using two-stage attention triple enhancer with U-KAN based diffusion model to address long-tailed relation distributions.

DetailsMotivation: Knowledge graphs have heterogeneous real-world knowledge leading to pronounced long-tailed distributions over relations, and previous methods often overlook distributional characteristics of positive and negative triple samples in few-shot settings.

Method: Proposes a few-shot KG completion framework integrating two-stage attention triple enhancer with U-KAN based diffusion model to better handle the distributional characteristics of triple samples.

Result: Extensive experiments on two public datasets show significant advantages over previous methods.

Conclusion: The proposed framework effectively addresses the long-tailed relation distribution problem in few-shot knowledge graph completion by better modeling distributional characteristics of triple samples.

Abstract: Knowledge Graphs have become fundamental infrastructure for applications such as intelligent question answering and recommender systems due to their expressive representation. Nevertheless, real-world knowledge is heterogeneous, leading to a pronounced long-tailed distribution over relations. Previous studies are mainly based on metric matching or meta-learning; however, they often overlook the distributional characteristics of positive and negative triple samples. In this paper, we propose a few-shot knowledge graph completion framework that integrates a two-stage attention triple enhancer with a U-KAN based diffusion model. Extensive experiments on two public datasets show the significant advantages of our method.

[213] Understanding Chain-of-Thought in Large Language Models via Topological Data Analysis

Chenghao Li, Chaoning Zhang, Yi Lu, Shuxu Chen, Xudong Wang, Jiaquan Zhang, Zhicheng Wang, Zhengxun Jin, Kuien Liu, Sung-Ho Bae, Guoqing Wang, Yang Yang, Heng Tao Shen

Main category: cs.AI

TL;DR: This paper analyzes reasoning chains in large language models using topological data analysis to assess structural quality and its correlation with reasoning accuracy.

DetailsMotivation: While LLMs have improved reasoning through long reasoning chains, there's limited understanding of why different chains perform differently and what structural components are key. Existing studies focus on functional evaluation rather than structural mechanisms.

Method: Apply persistent homology from Topological Data Analysis to map reasoning steps into semantic space, extract topological features, and analyze structural changes. Calculate homology groups to assess connectivity and redundancy at various scales using barcode and persistence diagrams.

Result: Topological structural complexity of reasoning chains correlates positively with accuracy. More complex chains identify correct answers sooner, while successful reasoning exhibits simpler topologies that reduce redundancy and cycles, enhancing efficiency and interpretability.

Conclusion: Provides a new structural perspective on reasoning chain quality assessment and offers guidance for future optimization of LLM reasoning processes.

Abstract: With the development of large language models (LLMs), particularly with the introduction of the long reasoning chain technique, the reasoning ability of LLMs in complex problem-solving has been significantly enhanced. While acknowledging the power of long reasoning chains, we cannot help but wonder: Why do different reasoning chains perform differently in reasoning? What components of the reasoning chains play a key role? Existing studies mainly focus on evaluating reasoning chains from a functional perspective, with little attention paid to their structural mechanisms. To address this gap, this work is the first to analyze and evaluate the quality of the reasoning chain from a structural perspective. We apply persistent homology from Topological Data Analysis (TDA) to map reasoning steps into semantic space, extract topological features, and analyze structural changes. These changes reveal semantic coherence, logical redundancy, and identify logical breaks and gaps. By calculating homology groups, we assess connectivity and redundancy at various scales, using barcode and persistence diagrams to quantify stability and consistency. Our results show that the topological structural complexity of reasoning chains correlates positively with accuracy. More complex chains identify correct answers sooner, while successful reasoning exhibits simpler topologies, reducing redundancy and cycles, enhancing efficiency and interpretability. This work provides a new perspective on reasoning chain quality assessment and offers guidance for future optimization.
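The 0-dimensional part of persistent homology has a concrete reading: every reasoning step (embedded as a point) starts its own connected component at scale 0, and a component "dies" at the edge length that merges it into another, i.e. single-linkage/Kruskal merge heights. A pure-Python sketch of the H0 barcode on an invented 1-D embedding of a chain (the paper's pipeline also uses higher-dimensional features and real semantic embeddings):

```python
import math
from itertools import combinations

def h0_barcode(points):
    """0-dimensional persistence barcode of a Euclidean point cloud:
    one bar (0, d) per merge in a Kruskal/union-find sweep over edges
    sorted by length; the final component never dies."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)   # one component dies at scale d
    return [(0.0, d) for d in deaths]

# Toy "reasoning chain" embedded in 1-D: two tight clusters of steps.
steps = [(0.0,), (0.2,), (0.3,), (5.0,), (5.1,)]
bars = h0_barcode(steps)
```

Short bars correspond to steps that quickly merge with their neighbours (semantic coherence); one long-lived gap, as between the two clusters here, signals the kind of logical break the paper detects.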

[214] Finetuning Large Language Models for Automated Depression Screening in Nigerian Pidgin English: GENSCORE Pilot Study

Isaac Iyinoluwa Olufadewa, Miracle Ayomikun Adesina, Ezekiel Ayodeji Oladejo, Uthman Babatunde Usman, Owen Kolade Adeniyi, Matthew Tolulope Olawoyin

Main category: cs.AI

TL;DR: Fine-tuned LLMs for automated depression screening in Nigerian Pidgin, achieving 94.5% accuracy with GPT-4.1 on PHQ-9 severity scoring.

DetailsMotivation: Depression screening in Nigeria faces barriers including limited clinician access, stigma, and language barriers. Traditional tools like PHQ-9 were validated in high-income countries and may not be culturally/linguistically appropriate for Nigerian communities using Pidgin and local languages.

Method: Collected 432 Pidgin-language audio responses from Nigerian young adults (18-40) to PHQ-9-aligned prompts. Performed transcription, preprocessing, annotation (semantic labeling, slang interpretation, PHQ-9 scoring). Fine-tuned three LLMs (Phi-3-mini-4k-instruct, Gemma-3-4B-it, GPT-4.1) on annotated dataset. Evaluated quantitatively (accuracy, precision, semantic alignment) and qualitatively (clarity, relevance, cultural appropriateness).

Result: GPT-4.1 achieved highest quantitative performance with 94.5% accuracy in PHQ-9 severity scoring prediction, outperforming other models. Qualitatively, GPT-4.1 produced most culturally appropriate, clear, and contextually relevant responses.

Conclusion: AI-mediated depression screening can address mental health needs in underserved Nigerian communities. Provides foundation for deploying conversational mental-health tools in linguistically diverse, resource-constrained environments.

Abstract: Depression is a major contributor to the mental-health burden in Nigeria, yet screening coverage remains limited due to low access to clinicians, stigma, and language barriers. Traditional tools like the Patient Health Questionnaire-9 (PHQ-9) were validated in high-income countries but may be linguistically or culturally inaccessible for low- and middle-income countries and communities such as Nigeria where people communicate in Nigerian Pidgin and more than 520 local languages. This study presents a novel approach to automated depression screening using fine-tuned large language models (LLMs) adapted for conversational Nigerian Pidgin. We collected a dataset of 432 Pidgin-language audio responses from Nigerian young adults aged 18-40 to prompts assessing psychological experiences aligned with PHQ-9 items, performed transcription, rigorous preprocessing and annotation, including semantic labeling, slang and idiom interpretation, and PHQ-9 severity scoring. Three LLMs - Phi-3-mini-4k-instruct, Gemma-3-4B-it, and GPT-4.1 - were fine-tuned on this annotated dataset, and their performance was evaluated quantitatively (accuracy, precision and semantic alignment) and qualitatively (clarity, relevance, and cultural appropriateness). GPT-4.1 achieved the highest quantitative performance, with 94.5% accuracy in PHQ-9 severity scoring prediction, outperforming Gemma-3-4B-it and Phi-3-mini-4k-instruct. Qualitatively, GPT-4.1 also produced the most culturally appropriate, clear, and contextually relevant responses. These results demonstrate the feasibility of AI-mediated depression screening for underserved Nigerian communities. This work provides a foundation for deploying conversational mental-health tools in linguistically diverse, resource-constrained environments.
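The severity bands the models are scored against are the standard PHQ-9 cutoffs (total of nine items, each 0-3). The item scores below are hypothetical, but the band boundaries are the instrument's standard ones:

```python
def phq9_severity(total):
    """Map a PHQ-9 total score (0-27) to the standard severity band."""
    if not 0 <= total <= 27:
        raise ValueError("PHQ-9 total must be in 0..27")
    if total <= 4:
        return "minimal"
    if total <= 9:
        return "mild"
    if total <= 14:
        return "moderate"
    if total <= 19:
        return "moderately severe"
    return "severe"

item_scores = [2, 1, 2, 1, 1, 0, 1, 1, 0]  # hypothetical annotated items
severity = phq9_severity(sum(item_scores))
```

The reported 94.5% accuracy refers to predicting these bands from the annotated Pidgin responses.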

[215] Panning for Gold: Expanding Domain-Specific Knowledge Graphs with General Knowledge

Runhao Zhao, Weixin Zeng, Wentao Zhang, Chong Chen, Zhengpin Li, Xiang Zhao, Lei Chen

Main category: cs.AI

TL;DR: ExeFuse: A neuro-symbolic framework for domain-specific knowledge graph fusion that mines and integrates relevant facts from general knowledge graphs into domain-specific knowledge graphs to enhance completeness and utility.

DetailsMotivation: Domain-specific knowledge graphs (DKGs) suffer from limited coverage compared to General Knowledge Graphs (GKGs), but there's little systematic exploration on how comprehensive GKGs can be effectively leveraged to supplement DKGs. Existing approaches rely on extracting from unstructured data or internal reasoning, with limited scope and quality.

Method: ExeFuse is a neuro-symbolic framework based on a novel Fact-as-Program paradigm. It treats fusion as an executable process, utilizing neuro-symbolic execution to infer logical relevance beyond surface similarity and employing target space grounding to calibrate granularity between coarse-grained GKG facts and fine-grained DKG requirements.

Result: The authors construct two new datasets to establish the first standardized evaluation suite for this task. Extensive experiments demonstrate that ExeFuse effectively overcomes domain barriers to achieve superior fusion performance.

Conclusion: ExeFuse addresses the domain-specific knowledge graph fusion task by overcoming challenges of domain relevance ambiguity and cross-domain knowledge granularity misalignment through a neuro-symbolic approach, establishing a foundation for future research in this area.

Abstract: Domain-specific knowledge graphs (DKGs) are critical yet often suffer from limited coverage compared to General Knowledge Graphs (GKGs). Existing tasks to enrich DKGs rely primarily on extracting knowledge from external unstructured data or completing KGs through internal reasoning, but the scope and quality of such integration remain limited. This highlights a critical gap: little systematic exploration has been conducted on how comprehensive, high-quality GKGs can be effectively leveraged to supplement DKGs. To address this gap, we propose a new and practical task: domain-specific knowledge graph fusion (DKGF), which aims to mine and integrate relevant facts from general knowledge graphs into domain-specific knowledge graphs to enhance their completeness and utility. Unlike previous research, this new task faces two key challenges: (1) high ambiguity of domain relevance, i.e., difficulty in determining whether knowledge from a GKG is truly relevant to the target domain, and (2) cross-domain knowledge granularity misalignment, i.e., GKG facts are typically abstract and coarse-grained, whereas DKGs frequently require more contextualized, fine-grained representations aligned with particular domain scenarios. To address these, we present ExeFuse, a neuro-symbolic framework based on a novel Fact-as-Program paradigm. ExeFuse treats fusion as an executable process, utilizing neuro-symbolic execution to infer logical relevance beyond surface similarity and employing target space grounding to calibrate granularity. We construct two new datasets to establish the first standardized evaluation suite for this task. Extensive experiments demonstrate that ExeFuse effectively overcomes domain barriers to achieve superior fusion performance.

[216] Preventing the Collapse of Peer Review Requires Verification-First AI

Lei You, Lele Cao, Iryna Gurevych

Main category: cs.AI

TL;DR: AI-assisted peer review should focus on verifying claims rather than mimicking human review, using AI as an adversarial auditor to generate verifiable artifacts instead of predicting scores.

DetailsMotivation: Current AI-assisted peer review systems risk amplifying proxy optimization rather than truth-seeking, especially as verification capacity struggles to keep pace with growing claims and signal quality shrinks.

Method: Proposes truth-coupling as the objective, formalizes verification pressure and signal shrinkage forces, develops a minimal model mixing high-fidelity checks with proxy judgments, derives coupling laws and incentive-collapse conditions.

Result: Identifies conditions where rational effort shifts from truth-seeking to proxy optimization even when decisions appear reliable, motivating AI as adversarial auditor rather than score predictor.

Conclusion: AI should be deployed to generate auditable verification artifacts and expand verification bandwidth, not to predict scores that amplify claim inflation in peer review.

Abstract: This paper argues that AI-assisted peer review should be verification-first rather than review-mimicking. We propose truth-coupling, i.e. how tightly venue scores track latent scientific truth, as the right objective for review tools. We formalize two forces that drive a phase transition toward proxy-sovereign evaluation: verification pressure, when claims outpace verification capacity, and signal shrinkage, when real improvements become hard to separate from noise. In a minimal model that mixes occasional high-fidelity checks with frequent proxy judgment, we derive an explicit coupling law and an incentive-collapse condition under which rational effort shifts from truth-seeking to proxy optimization, even when current decisions still appear reliable. These results motivate actions for tool builders and program chairs: deploy AI as an adversarial auditor that generates auditable verification artifacts and expands effective verification bandwidth, rather than as a score predictor that amplifies claim inflation.

[217] Quantifying Model Uniqueness in Heterogeneous AI Ecosystems

Lei You

Main category: cs.AI

TL;DR: A statistical framework for auditing model uniqueness in AI ecosystems using intervention-based quasi-experimental design to distinguish genuine novelty from functional redundancy.

DetailsMotivation: As AI systems evolve into complex ecosystems of foundation models and specialized adapters, distinguishing genuine behavioral novelty from functional redundancy becomes critical for governance and trustworthy AI.

Method: In-Silico Quasi-Experimental Design (ISQED) with matched interventions across models to isolate intrinsic model identity; quantifies uniqueness as Peer-Inexpressible Residual (PIER) - behavior irreducible to stochastic convex combinations of peers; uses DISCO (Design-Integrated Synthetic Control) estimator.

Result: Established theoretical foundations: 1) observational logs insufficient for uniqueness identification, 2) derived minimax-optimal sample efficiency scaling law, 3) showed cooperative game-theoretic methods fail to detect redundancy; validated across computer vision, language models, and traffic forecasting ecosystems.

Conclusion: Moves trustworthy AI beyond explaining single models to establish principled, intervention-based science for auditing and governing heterogeneous model ecosystems.

Abstract: As AI systems evolve from isolated predictors into complex, heterogeneous ecosystems of foundation models and specialized adapters, distinguishing genuine behavioral novelty from functional redundancy becomes a critical governance challenge. Here, we introduce a statistical framework for auditing model uniqueness based on In-Silico Quasi-Experimental Design (ISQED). By enforcing matched interventions across models, we isolate intrinsic model identity and quantify uniqueness as the Peer-Inexpressible Residual (PIER), i.e. the component of a target’s behavior strictly irreducible to any stochastic convex combination of its peers, with vanishing PIER characterizing when such a routing-based substitution becomes possible. We establish the theoretical foundations of ecosystem auditing through three key contributions. First, we prove a fundamental limitation of observational logs: uniqueness is mathematically non-identifiable without intervention control. Second, we derive a scaling law for active auditing, showing that our adaptive query protocol achieves minimax-optimal sample efficiency ($d\sigma^2\gamma^{-2}\log(Nd/\delta)$). Third, we demonstrate that cooperative game-theoretic methods, such as Shapley values, fundamentally fail to detect redundancy. We implement this framework via the DISCO (Design-Integrated Synthetic Control) estimator and deploy it across diverse ecosystems, including computer vision models (ResNet/ConvNeXt/ViT), large language models (BERT/RoBERTa), and city-scale traffic forecasters. These results move trustworthy AI beyond explaining single models: they establish a principled, intervention-based science of auditing and governing heterogeneous model ecosystems.
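
The PIER is described as the part of a target model's behavior that no stochastic convex combination of its peers can reproduce. As a hedged illustration of that idea (not the paper's DISCO estimator), the two-peer case has a closed form: project the target's response vector onto the segment between the two peer responses and measure the residual:

```python
def pier_two_peers(target, p1, p2):
    """Residual of the best convex mixture w*p1 + (1-w)*p2 approximating target.
    Closed form for two peers: w* = clip(<t - p2, p1 - p2> / ||p1 - p2||^2, 0, 1).
    A (near-)zero residual means the target is expressible by routing to peers."""
    d = [a - b for a, b in zip(p1, p2)]
    num = sum((t - b) * di for t, b, di in zip(target, p2, d))
    den = sum(di * di for di in d)
    w = 0.0 if den == 0 else max(0.0, min(1.0, num / den))
    mix = [w * a + (1 - w) * b for a, b in zip(p1, p2)]
    resid = sum((t - m) ** 2 for t, m in zip(target, mix)) ** 0.5
    return w, resid

# A target reproducible by its peers has zero residual -- "vanishing PIER":
print(pier_two_peers([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))  # → (1.0, 0.0)
```

With more than two peers the same projection becomes a small constrained least-squares problem over the probability simplex.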

[218] FiMI: A Domain-Specific Language Model for Indian Finance Ecosystem

Aboli Kathar, Aman Kumar, Anusha Kamath, Araveeti Srujan, Ashish Sharma, Chandra Bhushan, Divya Sorate, Duddu Prasanth Kumar, Evan Acharya, Harsh Sharma, Hrithik Kadam, Kanishk Singla, Keyur Doshi, Kiran Praveen, Kolisetty Krishna SK, Krishanu Adhikary, Lokesh MPT, Mayurdeep Sonowal, Nadeem Shaikh, Navya Prakash, Nimit Kothari, Nitin Kukreja, Prashant Devadiga, Rakesh Paul, Ratanjeet Pratap Chauhan, Raunak Kalani, Raviraj Joshi, Shamanth MH, Shantanu Pandey, Shubham Soni, Siddharth Dixit, Smriti Jopat, Sunil Patel, Suraj Singh, Suvradip Paul, Tulasi Pilla, Utkarsh Vaidya, Vineeth Nambiar, Vishal Kanvaty, Yatharth Dedhia

Main category: cs.AI

TL;DR: FiMI is a domain-specialized financial language model for Indian digital payment systems, developed by NPCI, with two variants (Base and Instruct) that show significant improvements over base Mistral models on finance-specific tasks while maintaining general capabilities.

DetailsMotivation: To create a specialized financial language model tailored for Indian digital payment systems that can handle financial reasoning, multilingual content (English, Hindi, Hinglish), and real-world workflows like transaction disputes and mandate management.

Method: Adapted Mistral Small 24B architecture through multi-stage training: continuous pre-training on 68B tokens of curated financial, multilingual, and synthetic data, followed by instruction fine-tuning and domain-specific supervised fine-tuning for multi-turn, tool-driven conversations.

Result: FiMI Base achieves 20% improvement over Mistral Small 24B Base on finance reasoning benchmarks; FiMI Instruct outperforms Mistral Small 24B Instruct by 87% on domain-specific tool-calling; maintains comparable performance on general benchmarks.

Conclusion: FiMI successfully demonstrates that domain specialization through targeted training can yield significant performance improvements in financial applications while preserving general language capabilities, providing a valuable model for Indian digital payment ecosystems.

Abstract: We present FiMI (Finance Model for India), a domain-specialized financial language model developed by National Payments Corporation of India (NPCI) for Indian digital payment systems. We develop two model variants: FiMI Base and FiMI Instruct. FiMI adapts the Mistral Small 24B architecture through a multi-stage training pipeline, beginning with continuous pre-training on 68 billion tokens of curated financial, multilingual (English, Hindi, Hinglish), and synthetic data. This is followed by instruction fine-tuning and domain-specific supervised fine-tuning focused on multi-turn, tool-driven conversations that model real-world workflows, such as transaction disputes and mandate lifecycle management. Evaluations reveal that FiMI Base achieves a 20% improvement over the Mistral Small 24B Base model on a finance reasoning benchmark, while FiMI Instruct outperforms the Mistral Small 24B Instruct model by 87% on domain-specific tool-calling. Moreover, FiMI achieves these significant domain gains while maintaining comparable performance to models of similar size on general benchmarks.

[219] ATLAS : Adaptive Self-Evolutionary Research Agent with Task-Distributed Multi-LLM Supporters

Ujin Jeon, Jiyong Kwon, Madison Ann Sullivan, Caleb Eunho Lee, Guang Lin

Main category: cs.AI

TL;DR: ATLAS is a task-distributed framework for multi-LLM agent systems that uses specialized supporter agents for exploration, hyperparameter tuning, and reference policy management, with an Evolving Direct Preference Optimization algorithm for adaptive updates.

DetailsMotivation: Current multi-LLM agent systems either keep solvers frozen after fine-tuning or rely on static preference-optimization loops, which become intractable for long-horizon tasks and don't adapt well to non-stationary environments.

Method: ATLAS framework with task-distributed architecture: lightweight research agent + specialized supporter agents (exploration, hyperparameter tuning, reference policy management). Core algorithm: Evolving Direct Preference Optimization (EvoDPO) that adaptively updates phase-indexed reference policy. Theoretical analysis for preference-based contextual bandit under concept drift.

Result: Experiments on non-stationary linear contextual bandits and scientific machine learning (SciML) loss reweighting for 1D Burgers’ equation show ATLAS improves stability and performance over static single-agent baseline.

Conclusion: ATLAS provides an effective task-distributed framework for adaptive multi-LLM agent systems that handles non-stationary environments better than static approaches.

Abstract: Recent multi-LLM agent systems perform well in prompt optimization and automated problem-solving, but many either keep the solver frozen after fine-tuning or rely on a static preference-optimization loop, which becomes intractable for long-horizon tasks. We propose ATLAS (Adaptive Task-distributed Learning for Agentic Self-evolution), a task-distributed framework that iteratively develops a lightweight research agent while delegating complementary roles to specialized supporter agents for exploration, hyperparameter tuning, and reference policy management. Our core algorithm, Evolving Direct Preference Optimization (EvoDPO), adaptively updates the phase-indexed reference policy. We provide a theoretical regret analysis for a preference-based contextual bandit under concept drift. In addition, experiments were conducted on non-stationary linear contextual bandits and scientific machine learning (SciML) loss reweighting for the 1D Burgers’ equation. Both results show that ATLAS improves stability and performance over a static single-agent baseline.
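
EvoDPO is described as DPO with a phase-indexed, adaptively updated reference policy. The reference-update rule is not given here, but the underlying per-pair DPO loss is standard; a minimal sketch on scalar log-probabilities, where the `ref_logp_*` arguments are the values EvoDPO would swap out per phase:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares policy-vs-reference log-ratios of the chosen (w)
    and rejected (l) responses."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference the margin is zero and the loss is log 2; widening the chosen/rejected gap relative to the reference drives the loss down.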

[220] PuYun-LDM: A Latent Diffusion Model for High-Resolution Ensemble Weather Forecasts

Lianjun Wu, Shengchen Zhu, Yuxuan Liu, Liuyu Kai, Xiaoduan Feng, Duomin Wang, Wenshuo Liu, Jingxuan Zhang, Kelvin Li, Bin Wang

Main category: cs.AI

TL;DR: PuYun-LDM improves weather forecasting using 3D-MAE for weather-state conditioning and variable-aware spectral regularization to enhance latent diffusion model performance on multivariate meteorological data.

DetailsMotivation: Latent diffusion models struggle with limited diffusability in high-resolution ensemble weather forecasting due to lack of task-agnostic foundation models and explicit semantic structures in meteorological fields. Existing frequency-based approaches impose identical spectral regularization across channels, which is problematic given inter-variable spectral heterogeneity in multivariate weather data.

Method: Proposes PuYun-LDM with two key components: 1) 3D Masked AutoEncoder (3D-MAE) that encodes weather-state evolution features as additional conditioning for the diffusion model, and 2) Variable-Aware Masked Frequency Modeling (VA-MFM) strategy that adaptively selects thresholds based on the spectral energy distribution of each variable.

Result: PuYun-LDM enhances latent diffusability and achieves superior performance to ENS (presumably ECMWF ensemble system) at short lead times while remaining comparable at longer horizons. It generates 15-day global forecasts with 6-hour temporal resolution in 5 minutes on a single NVIDIA H200 GPU, with efficient parallel ensemble forecasting.

Conclusion: The proposed approach effectively addresses challenges in applying diffusion models to weather forecasting by improving latent diffusability through weather-state conditioning and variable-aware spectral regularization, enabling efficient high-resolution ensemble weather prediction.

Abstract: Latent diffusion models (LDMs) suffer from limited diffusability in high-resolution (<=0.25°) ensemble weather forecasting, where diffusability characterizes how easily a latent data distribution can be modeled by a diffusion process. Unlike natural image fields, meteorological fields lack task-agnostic foundation models and explicit semantic structures, making VFM-based regularization inapplicable. Moreover, existing frequency-based approaches impose identical spectral regularization across channels under a homogeneity assumption, which leads to uneven regularization strength under the inter-variable spectral heterogeneity in multivariate meteorological data. To address these challenges, we propose a 3D Masked AutoEncoder (3D-MAE) that encodes weather-state evolution features as an additional conditioning for the diffusion model, together with a Variable-Aware Masked Frequency Modeling (VA-MFM) strategy that adaptively selects thresholds based on the spectral energy distribution of each variable. Together, we propose PuYun-LDM, which enhances latent diffusability and achieves superior performance to ENS at short lead times while remaining comparable to ENS at longer horizons. PuYun-LDM generates a 15-day global forecast with a 6-hour temporal resolution in five minutes on a single NVIDIA H200 GPU, while ensemble forecasts can be efficiently produced in parallel.
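
VA-MFM is said to pick per-variable thresholds from each variable's spectral energy distribution. One plausible reading (an assumption, not the paper's exact rule) is a cumulative-energy cutoff: keep the smallest frequency band that captures a fixed fraction of a variable's total spectral energy, so smooth variables get lower cutoffs than high-frequency ones instead of one shared threshold:

```python
def adaptive_freq_threshold(spectral_energy, tau=0.9):
    """Return the smallest frequency-bin index whose cumulative spectral
    energy reaches fraction tau of the variable's total energy."""
    total = sum(spectral_energy)
    running = 0.0
    for k, e in enumerate(spectral_energy):
        running += e
        if running >= tau * total:
            return k
    return len(spectral_energy) - 1

# A smooth variable vs. one with substantial high-frequency energy:
print(adaptive_freq_threshold([8, 1, 0.5, 0.5]))  # → 1
print(adaptive_freq_threshold([3, 3, 2, 2]))      # → 3
```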

[221] When Should LLMs Be Less Specific? Selective Abstraction for Reliable Long-Form Text Generation

Shani Goren, Ido Galil, Ran El-Yaniv

Main category: cs.AI

TL;DR: Selective Abstraction (SA) framework enables LLMs to trade specificity for reliability by replacing uncertain content with higher-confidence, less specific abstractions instead of binary abstention.

DetailsMotivation: LLMs are prone to factual errors that limit adoption in high-risk settings. Current uncertainty estimation approaches use binary "all-or-nothing" abstention, which is too restrictive for long-form generation and discards valuable information.

Method: Proposes Selective Abstraction (SA) framework formalized through selective risk and coverage. Introduces Atom-wise Selective Abstraction that decomposes responses into atomic claims and replaces uncertain atoms with higher-confidence, less specific abstractions. Develops end-to-end pipeline for open-ended generation measuring factual correctness risk and information-theoretic coverage.

Result: Across six open-source models on FactScore and LongFact-Objects benchmarks, atom-wise SA consistently outperforms existing baselines, improving area under risk-coverage curve (AURC) by up to 27.73% over claim removal.

Conclusion: Reducing specificity through selective abstraction can boost accuracy and reliability while preserving most original meaning, offering better trade-off than binary abstention for long-form generation.

Abstract: LLMs are widely used, yet they remain prone to factual errors that erode user trust and limit adoption in high-risk settings. One approach to mitigate this risk is to equip models with uncertainty estimation mechanisms that abstain when confidence is low. However, this binary “all-or-nothing” approach is excessively restrictive in long-form settings, often discarding valuable information. We introduce Selective Abstraction (SA), a framework that enables LLMs to trade specificity for reliability by selectively reducing the detail of uncertain content. We first formalize SA through the lenses of selective risk and coverage. We then propose Atom-wise Selective Abstraction, a claim-level instantiation that decomposes responses into atomic claims (short, self-contained statements each expressing a single fact) and replaces uncertain atoms with higher confidence, less specific abstractions. To evaluate this framework, we develop a novel end-to-end pipeline for open-ended generation that instantiates risk as factual correctness and measures coverage using an information-theoretic measure of retained information. Across six open-source models on the FactScore and LongFact-Objects benchmarks, atom-wise SA consistently outperforms existing baselines, improving the area under the risk-coverage curve (AURC) by up to 27.73% over claim removal, demonstrating that reducing specificity can boost accuracy and reliability while preserving most of the original meaning.
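
The headline metric, area under the risk-coverage curve (AURC), can be computed from per-claim confidences and correctness labels. A minimal generic sketch (the paper instantiates risk as factual correctness and uses an information-theoretic coverage measure, which this simplified version does not):

```python
def aurc(confidences, errors):
    """Area under the risk-coverage curve: rank items by confidence (desc),
    then average the cumulative error rate as coverage grows one item at a
    time. Lower is better -- the confident items should be the correct ones."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    cum_err, risks = 0.0, []
    for n, i in enumerate(order, start=1):
        cum_err += errors[i]
        risks.append(cum_err / n)  # selective risk at coverage n/N
    return sum(risks) / len(risks)
```

For example, if the only error also has the lowest confidence (`aurc([0.9, 0.8, 0.1], [0, 0, 1])`), risk stays at zero until full coverage and the AURC is small; the same error ranked first would inflate it.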

cs.SD

[222] OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model

Maomao Li, Zhen Li, Kaipeng Zhang, Guosheng Yin, Zhifeng Li, Dong Xu

Main category: cs.SD

TL;DR: OmniCustom: A DiT-based framework for synchronous audio-video customization that generates videos with reference image identity and audio timbre while following text prompts.

DetailsMotivation: Existing video customization focuses only on identity consistency from reference images, but with advances in joint audio-video generation, there's a need for synchronous customization of both video identity and audio timbre.

Method: Uses DiT-based framework with separate LoRA modules for identity and audio timbre control, contrastive learning alongside flow matching, and training on large-scale audio-visual human dataset.

Result: Extensive experiments show OmniCustom outperforms existing methods in generating audio-video content with consistent identity and timbre fidelity.

Conclusion: OmniCustom successfully addresses the novel sync audio-video customization task, enabling simultaneous control over video identity, audio timbre, and textual content in a zero-shot manner.

Abstract: Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a more compelling new task: sync audio-video customization, which aims to synchronously customize both video identity and audio timbre. Specifically, given a reference image $I^{r}$ and a reference audio $A^{r}$, this novel task requires generating videos that maintain the identity of the reference image while imitating the timbre of the reference audio, with spoken content freely specifiable through user-provided textual prompts. To this end, we propose OmniCustom, a powerful DiT-based audio-video customization framework that can synthesize a video following reference image identity, audio timbre, and text prompts all at once in a zero-shot manner. Our framework is built on three key contributions. First, identity and audio timbre control are achieved through separate reference identity and audio LoRA modules that operate through self-attention layers within the base audio-video generation model. Second, we introduce a contrastive learning objective alongside the standard flow matching objective. It uses predicted flows conditioned on reference inputs as positive examples and those without reference conditions as negative examples, thereby enhancing the model's ability to preserve identity and timbre. Third, we train OmniCustom on our constructed large-scale, high-quality audio-visual human dataset. Extensive experiments demonstrate that OmniCustom outperforms existing methods in generating audio-video content with consistent identity and timbre fidelity.
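
The second contribution pairs flow matching with a contrastive term that treats reference-conditioned flow predictions as positives and unconditioned ones as negatives. The abstract does not give the exact objective; a hedged margin-style sketch on toy vectors (the formulation, margin, and weighting are our assumptions):

```python
def contrastive_flow_loss(v_cond, v_uncond, v_target, margin=0.1, lam=0.5):
    """Flow-matching MSE on the reference-conditioned prediction, plus a
    margin term pushing the conditioned flow to fit the target better than
    the unconditioned (negative) flow. Illustrative only -- the paper does
    not spell out its contrastive objective."""
    mse = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    fm = mse(v_cond, v_target)  # standard flow-matching term
    contrast = max(0.0, margin + fm - mse(v_uncond, v_target))
    return fm + lam * contrast
```

When the conditioned flow already matches the target and the unconditioned flow is far away, the contrastive term vanishes and only the flow-matching term remains.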

[223] M6: Multi-generator, Multi-domain, Multi-lingual and cultural, Multi-genres, Multi-instrument Machine-Generated Music Detection Databases

Yupei Li, Hanqian Li, Lucia Specia, Björn W. Schuller

Main category: cs.SD

TL;DR: M6 dataset for machine-generated music detection with diverse generators, domains, languages, and cultural contexts

DetailsMotivation: Machine-generated music threatens entertainment, education, and arts sectors by devaluing human compositions, requiring detection methods but lacking comprehensive datasets

Method: Created large-scale benchmark dataset M6 with diverse music samples across multiple generators, domains, languages, cultural contexts, genres, and instruments; provided baseline binary classification models

Result: Dataset includes WAV format music with detailed analysis; baseline models show significant room for improvement in detection accuracy

Conclusion: M6 enables future research on machine-generated music detection; dataset and code will be publicly available to support open collaboration

Abstract: Machine-generated music (MGM) has emerged as a powerful tool with applications in music therapy, personalised editing, and creative inspiration for the music community. However, its unregulated use threatens the entertainment, education, and arts sectors by diminishing the value of high-quality human compositions. Detecting machine-generated music (MGMD) is, therefore, critical to safeguarding these domains, yet the field lacks comprehensive datasets to support meaningful progress. To address this gap, we introduce M6, a large-scale benchmark dataset tailored for MGMD research. M6 is distinguished by its diversity, encompassing multiple generators, domains, languages, cultural contexts, genres, and instruments. We outline our methodology for data selection and collection, accompanied by detailed data analysis, and provide all music in WAV format. Additionally, we provide baseline performance scores using foundational binary classification models, illustrating the complexity of MGMD and the significant room for improvement. By offering a robust and multifaceted resource, we aim to empower future research to develop more effective detection methods for MGM. We believe M6 will serve as a critical step toward addressing this societal challenge. The dataset and code will be freely available to support open collaboration and innovation in this field.

[224] Beyond Musical Descriptors: Extracting Preference-Bearing Intent in Music Queries

Marion Baranes, Romain Hennequin, Elena V. Epure

Main category: cs.SD

TL;DR: MusicRecoIntent: A manually annotated corpus of 2,291 Reddit music requests with musical descriptors labeled across seven categories with preference roles, used to benchmark LLMs’ ability to extract music descriptors and understand user intent.

DetailsMotivation: Existing music descriptor datasets often overlook user intent behind musical descriptors, which is crucial for effectively meeting user needs in music recommendation systems.

Method: Created MusicRecoIntent corpus by manually annotating 2,291 Reddit music requests, labeling musical descriptors across seven categories with positive, negative, or referential preference-bearing roles. Then investigated how reliably large language models (LLMs) can extract these music descriptors.

Result: LLMs can capture explicit musical descriptors but struggle with context-dependent ones. The dataset serves as a benchmark for fine-grained modeling of user intent and provides insights for improving LLM-based music understanding systems.

Conclusion: MusicRecoIntent addresses the gap in understanding user intent behind musical descriptors and provides a valuable resource for benchmarking and improving LLM-based music understanding and recommendation systems.

Abstract: Although annotated music descriptor datasets for user queries are increasingly common, few consider the user’s intent behind these descriptors, which is essential for effectively meeting their needs. We introduce MusicRecoIntent, a manually annotated corpus of 2,291 Reddit music requests, labeling musical descriptors across seven categories with positive, negative, or referential preference-bearing roles. We then investigate how reliably large language models (LLMs) can extract these music descriptors, finding that they do capture explicit descriptors but struggle with context-dependent ones. This work can further serve as a benchmark for fine-grained modeling of user intent and for gaining insights into improving LLM-based music understanding systems.

[225] DisSR: Disentangling Speech Representation for Degradation-Prior Guided Cross-Domain Speech Restoration

Ziqi Liang, Zhijun Jia, Chang Liu, Minghui Yang, Zhihong Lu, Jian Wang

Main category: cs.SD

TL;DR: DisSR: A general speech restoration model using disentangled speech representations with degradation-prior guidance and domain adaptation for handling various distortions and unseen domains.

DetailsMotivation: Current speech restoration focuses on single-task approaches that lack generality, require separate models for different distortions, and ignore generalization across unseen domains.

Method: Proposes DisSR with two key components: 1) Degradation-prior guidance that extracts speaker-invariant degradation representations to guide diffusion-based restoration, and 2) Cross-domain alignment training for domain adaptation and generalization.

Result: Experimental results show the method can produce high-quality restored speech under various distortion conditions and demonstrates good generalization capabilities.

Conclusion: DisSR provides a general speech restoration solution that addresses multiple distortions and domain generalization challenges through disentangled representations and domain adaptation techniques.

Abstract: Previous speech restoration (SR) primarily focuses on single-task speech restoration (SSR), which cannot address general speech restoration problems. Training specific SSR models for different distortions is time-consuming and lacks generality. In addition, most studies ignore the problem of model generalization across unseen domains. To overcome those limitations, we propose DisSR, a Disentangling Speech Representation based general speech restoration model with two properties: 1) Degradation-prior guidance, which extracts speaker-invariant degradation representation to guide the diffusion-based speech restoration model. 2) Domain adaptation, where we design cross-domain alignment training to enhance the model’s adaptability and generalization on cross-domain data. Experimental results demonstrate that our method can produce high-quality restored speech under various distortion conditions. Audio samples can be found at https://itspsp.github.io/DisSR.

[226] Towards explainable reference-free speech intelligibility evaluation of people with pathological speech

Bence Mark Halpern, Thomas Tienkamp, Defne Abur

Main category: cs.SD

TL;DR: Proposes a reference-free, explainable ASR Inconsistency Score for objective assessment of pathological speech, achieving high correlation with expert perceptual ratings across multiple languages.

DetailsMotivation: Existing objective speech assessments lack explainability and require labor-intensive manual transcriptions. There's a need for reference-free methods that can capture meaningful communication changes in pathological speech.

Method: Develops a reference-free ASR Inconsistency Score that measures inconsistencies in automatic speech recognition outputs without requiring manual transcriptions. Evaluated on pathological speech in Dutch, Spanish, and English, comparing against reference-based Word Error Rate (WER) baseline.

Result: The ASR Inconsistency Score achieves high correlation with expert perceptual ratings, performing closely to and in one case exceeding the standard reference-based WER baseline across multiple languages.

Conclusion: The proposed reference-free, explainable ASR Inconsistency Score is an effective alternative to reference-based methods for objective assessment of pathological speech, addressing limitations of existing approaches.

Abstract: Objective assessment of speech that reflects meaningful changes in communication is crucial for clinical decision making and reproducible research. While existing objective assessments, particularly reference-based approaches, can capture intelligibility changes, they are often hindered by lack of explainability and the need for labor-intensive manual transcriptions. To address these issues, this work proposes the reference-free, explainable ASR Inconsistency Score. We evaluate this method on pathological speech in Dutch, Spanish and English, and compare its performance to a reference-based Word Error Rate (WER) baseline. Our results demonstrate that the ASR Inconsistency Score achieves a high correlation with expert perceptual ratings, with performance closely matching, and in one case exceeding, the WER baseline.
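
The abstract does not define the score itself; one plausible reference-free construction (our assumption, not necessarily the authors' formula) is to transcribe the same utterance with several ASR passes or systems and take the mean pairwise WER as the inconsistency, so no manual reference transcript is needed:

```python
def wer(ref, hyp):
    """Word error rate via word-level edit distance."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (r[i - 1] != h[j - 1]))
    return d[len(r)][len(h)] / max(1, len(r))

def asr_inconsistency(transcripts):
    """Mean pairwise WER across several ASR outputs for the SAME utterance;
    higher disagreement suggests less intelligible speech."""
    pairs = [(a, b) for i, a in enumerate(transcripts)
             for b in transcripts[i + 1:]]
    return sum(wer(a, b) for a, b in pairs) / len(pairs)
```

Identical transcripts give a score of 0; disagreement between passes raises it, which is the behavior a perceptual-intelligibility correlate would need.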

[227] Can Large Audio Language Models Understand Audio Well? Speech, Scene and Events Understanding Benchmark for LALMs

Han Yin, Jung-Woo Choi

Main category: cs.SD

TL;DR: SSEU-Bench: A new audio understanding benchmark that addresses energy differences between speech/non-speech components and enables joint understanding of speech, scene, and events, with Chain-of-Thought prompting to improve performance.

DetailsMotivation: Existing audio benchmarks don't adequately address real-world audio characteristics where speech and non-speech components coexist with varying energy levels, and lack joint understanding of speech, scene, and events within the same audio clip.

Method: Introduces SSEU-Bench with independent and joint understanding settings for speech, scene, and events, explicitly accounting for energy differences. Proposes Chain-of-Thought prompting to decompose complex audio understanding tasks into simpler reasoning steps.

Result: Shows that some Large Audio Language Models underperform on joint understanding tasks, and demonstrates that Chain-of-Thought effectively improves LALMs’ joint audio understanding performance.

Conclusion: SSEU-Bench provides a more realistic benchmark for audio understanding, and Chain-of-Thought is an effective technique for enhancing LALMs’ performance on complex joint audio understanding tasks.

Abstract: Recently, Large Audio Language Models (LALMs) have progressed rapidly, demonstrating their strong efficacy in universal audio understanding through cross-modal integration. To evaluate LALMs’ audio understanding performance, researchers have proposed different benchmarks. However, key aspects for real-world interactions are underexplored in existing benchmarks, i.e., audio signals typically contain both speech and non-speech components, and energy levels of these components can vary significantly across different scenarios. Moreover, most benchmarks do not consider the joint understanding of speech, scene, and events within the same audio clip. In this work, we introduce SSEU-Bench, the first versatile audio understanding benchmark that explicitly accounts for energy differences between speech and non-speech audio, with both independent and joint understanding settings for speech, scene, and events. Furthermore, we demonstrate that some LALMs tend to underperform on certain tasks in a joint understanding setting. To address this issue, we introduce Chain-of-Thought, which effectively improves LALMs’ joint audio understanding performance by decomposing complex tasks into simpler reasoning steps.

[228] Eliminating Stability Hallucinations in LLM-Based TTS Models via Attention Guidance

ShiMing Wang, ZhiHao Du, Yang Xiang, TianYu Zhao, Han Zhao, Qian Chen, XianGang Li, HanJie Guo, ZhenHua Ling

Main category: cs.SD

TL;DR: The paper addresses stability hallucinations in LLM-based TTS models by improving attention mechanisms, proposing an Optimal Alignment Score metric, and using chain-of-thought guidance to enhance text-speech alignment.

DetailsMotivation: LLM-based Text-to-Speech models suffer from stability hallucinations like repetitive or omitted speech, which degrade speech quality. The paper aims to resolve these issues by improving attention mechanisms between text and speech tokens.

Method: 1) Analyzed alignment mechanism between text and speech tokens in LLMs; 2) Proposed Optimal Alignment Score (OAS) using Viterbi algorithm to evaluate text-speech alignment quality; 3) Integrated OAS into CosyVoice2 training to help LLMs learn stable alignment; 4) Used pre-trained attention values with chain-of-thought guidance to train student model.

Result: Experiments on Seed-TTS-Eval and CV3-Eval test sets show the methods effectively reduce stability hallucinations in CosyVoice2 without introducing negative effects.

Conclusion: The proposed attention improvement and alignment evaluation methods successfully mitigate stability hallucinations in LLM-based TTS systems, enhancing speech synthesis quality.

Abstract: This paper focuses on resolving stability hallucinations (e.g., repetitive or omitted speech) in LLM-based Text-to-Speech (TTS) models by improving and leveraging the attention mechanism. First, we analyzed the alignment mechanism between text tokens and speech tokens in LLMs. We then proposed a metric termed the Optimal Alignment Score (OAS), which employs the Viterbi algorithm to evaluate text-speech alignment quality. Subsequently, OAS was integrated into the training of CosyVoice2 to assist LLMs in learning continuous, stable alignment. Additionally, the pre-trained attention value is employed to guide the training of the student CosyVoice2 via chain-of-thought (CoT), which further reduces stability hallucinations in synthesized speech. Experiments on the Seed-TTS-Eval and CV3-Eval test sets demonstrate that the proposed methods can effectively reduce the stability hallucinations of CosyVoice2 without introducing additional negative effects. The appendix is available at https://wsmzzz.github.io/llm_attn.
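The paper does not spell out the OAS dynamic program, but a Viterbi-style search for the best monotonic text-speech path over an attention matrix can be sketched as follows. The stay/advance transition structure and the averaging are assumptions for illustration, not the authors' exact formulation:

```python
def optimal_alignment_score(attn):
    """Viterbi-style score of the best monotonic alignment path.

    attn[t][s]: attention weight between text token t and speech frame s.
    The path scans speech frames left to right and may either stay on the
    current text token or advance to the next one (no skips, no backtracking),
    which is the expected behavior of stable TTS attention.
    """
    T, S = len(attn), len(attn[0])
    NEG = float("-inf")
    dp = [[NEG] * S for _ in range(T)]
    dp[0][0] = attn[0][0]  # alignment must start at the first text token
    for s in range(1, S):
        for t in range(T):
            stay = dp[t][s - 1]
            advance = dp[t - 1][s - 1] if t > 0 else NEG
            best = max(stay, advance)
            if best > NEG:
                dp[t][s] = best + attn[t][s]
    # require the path to end on the last text token; average per frame
    return dp[T - 1][S - 1] / S
```

A high score means the attention mass concentrates along one clean monotonic path; repetitions or omissions (the stability hallucinations) break monotonicity and drag the score down.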

[229] AudioToolAgent: An Agentic Framework for Audio-Language Models

Gijs Wijngaard, Elia Formisano, Michel Dumontier, Jenia Jitsev

Main category: cs.SD

TL;DR: AudioToolAgent is a framework that coordinates audio-language models as tools via a central LLM agent for audio question answering and speech-to-text, enabling multistep reasoning and tool-calling capabilities.

DetailsMotivation: Large Audio-Language Models perform well on audio understanding but lack the multistep reasoning and tool-calling capabilities found in recent Large Language Models, limiting their ability to handle complex audio tasks.

Method: The framework uses a central LLM agent that coordinates audio-language models as tools via tool adapters. The agent reasons about which tools to invoke, formulates follow-up queries, and arbitrates conflicting tool outputs without accessing the audio directly.

Result: State-of-the-art accuracy on MMAU (77.50%), MMAR (77.00%), and MMAU-Pro (61.90%) benchmarks. Shapley-based analysis identifies effective agent-tool combinations.

Conclusion: AudioToolAgent successfully bridges the gap between audio understanding models and advanced reasoning capabilities, enabling more sophisticated audio-language tasks through coordinated tool use.

Abstract: Large Audio-Language Models (LALMs) perform well on audio understanding tasks but lack multistep reasoning and tool-calling found in recent Large Language Models (LLMs). This paper presents AudioToolAgent, a framework that coordinates audio-language models as tools via a central LLM agent that accesses tool adapters for audio question answering and speech-to-text. The agent reasons about which tools to invoke, how to formulate follow-up queries, and how to arbitrate conflicting tool outputs, without accessing the audio. Experiments with MMAU, MMAR, and MMAU-Pro show state-of-the-art accuracy: up to 77.50% in MMAU, 77.00% in MMAR, and 61.90% in MMAU-Pro. Shapley-based analysis identifies effective agent-tool combinations. The code and reproduction materials are available at https://github.com/GLJS/AudioToolAgent.

cs.LG

[230] OptiML: An End-to-End Framework for Program Synthesis and CUDA Kernel Optimization

Arijit Bhattacharjee, Heng Ping, Son Vu Le, Paul Bogdan, Nesreen K. Ahmed, Ali Jannesari

Main category: cs.LG

TL;DR: OptiML is an end-to-end framework that optimizes CUDA kernel performance using LLM-guided search with hardware feedback verification.

DetailsMotivation: Generating high-performance CUDA kernels is challenging due to combinatorial optimization spaces and noisy hardware feedback, requiring systematic exploration of transformations.

Method: Two-stage framework: 1) Mixture-of-Thoughts generator produces initial kernels from natural language, 2) Search-based optimizer uses Monte Carlo Tree Search over LLM-driven edits with hardware-aware rewards from profiler feedback and verification.

Result: OptiML consistently discovers verified performance improvements over strong LLM baselines and produces interpretable optimization trajectories grounded in profiler evidence.

Conclusion: OptiML effectively maps natural language or CUDA code to performance-optimized kernels through search under verification, addressing the challenge of combinatorial optimization spaces.

Abstract: Generating high-performance CUDA kernels remains challenging due to the need to navigate a combinatorial space of low-level transformations under noisy and expensive hardware feedback. Although large language models can synthesize functionally correct CUDA code, achieving competitive performance requires systematic exploration and verification of optimization choices. We present OptiML, an end-to-end framework that maps either natural-language intent or input CUDA code to performance-optimized CUDA kernels by formulating kernel optimization as search under verification. OptiML consists of two decoupled stages. When the input is natural language, a Mixture-of-Thoughts generator (OptiML-G) acts as a proposal policy over kernel implementation strategies, producing an initial executable program. A search-based optimizer (OptiML-X) then refines either synthesized or user-provided kernels using Monte Carlo Tree Search over LLM-driven edits, guided by a hardware-aware reward derived from profiler feedback. Each candidate transformation is compiled, verified, and profiled with Nsight Compute, and evaluated by a composite objective that combines runtime with hardware bottleneck proxies and guardrails against regressions. We evaluate OptiML in both synthesis-and-optimize and optimization-only settings on a diverse suite of CUDA kernels. Results show that OptiML consistently discovers verified performance improvements over strong LLM baselines and produces interpretable optimization trajectories grounded in profiler evidence.
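OptiML-X's Monte Carlo Tree Search selects which candidate edit to explore next; the standard node-selection rule for MCTS is UCT, sketched below. The `(total_reward, visit_count)` representation is a simplifying assumption, and the reward here stands in for the paper's composite profiler-derived objective:

```python
import math

def uct_select(children, c=1.4):
    """UCT node selection for MCTS: mean reward (exploitation) plus an
    exploration bonus that shrinks as a child is visited more often.

    children: list of (total_reward, visit_count) pairs, one per candidate
    kernel edit. Returns the index of the child to expand next.
    """
    total_visits = sum(n for _, n in children)

    def score(child):
        total_reward, n = child
        if n == 0:
            return float("inf")  # profile every untried edit at least once
        return total_reward / n + c * math.sqrt(math.log(total_visits) / n)

    return max(range(len(children)), key=lambda i: score(children[i]))
```

The exploration constant `c` trades off re-profiling edits that already look fast against trying transformations whose effect is still noisy, which matters given the expensive, noisy hardware feedback the abstract highlights.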

[231] FISHER: A Foundation Model for Multi-Modal Industrial Signal Comprehensive Representation

Pingyi Fan, Anbai Jiang, Shuwei Zhang, Zhiqiang Lv, Bing Han, Xinhu Zheng, Wenrui Liang, Junjie Li, Wei-Qiang Zhang, Yanmin Qian, Xie Chen, Cheng Lu, Jia Liu

Main category: cs.LG

TL;DR: FISHER is a foundation model for multi-modal industrial signal representation that addresses the M5 problem (multi-modal, multi-rate, multi-scale, multi-task, multi-domain) using a unified approach with STFT sub-band modeling and teacher-student SSL pre-training.

DetailsMotivation: Industrial SCADA systems generate heterogeneous signals (M5 problem) that previous specialized models fail to handle effectively. The authors argue these signals have intrinsic similarities and can be modeled uniformly to leverage cross-modal synergies and scaling laws.

Method: FISHER treats varying sampling rates as concatenation of sub-band information, uses STFT sub-bands as modeling units, and employs a teacher-student self-supervised learning framework for pre-training. Also introduces RMIS benchmark for evaluation.

Result: FISHER outperforms top SSL models with up to 4.2% performance gain across multiple health management tasks, shows more efficient scaling curves, and demonstrates versatile capabilities on M5 industrial signals.

Conclusion: FISHER successfully addresses the M5 problem through unified modeling, establishes scaling laws for downstream tasks, and provides open-source tools (FISHER and RMIS benchmark) for industrial signal analysis.

Abstract: With the rapid deployment of SCADA systems, how to effectively analyze industrial signals and detect abnormal states is an urgent need for the industry. Due to the significant heterogeneity of these signals, which we summarize as the M5 problem, previous works only focus on small sub-problems and employ specialized models, failing to utilize the synergies between modalities and the powerful scaling law. However, we argue that the M5 signals can be modeled in a unified manner due to the intrinsic similarity. As a result, we propose FISHER, a Foundation model for multi-modal Industrial Signal compreHEnsive Representation. To support arbitrary sampling rates, FISHER considers the increment of sampling rate as the concatenation of sub-band information. Specifically, FISHER takes the STFT sub-band as the modeling unit and adopts a teacher-student SSL framework for pre-training. We also develop the RMIS benchmark, which evaluates the representations of M5 industrial signals on multiple health management tasks. Compared with top SSL models, FISHER showcases versatile and outstanding capabilities with a general performance gain up to 4.2%, along with much more efficient scaling curves. We also investigate the scaling law on downstream tasks and derive potential avenues for future work. Both FISHER and RMIS are now open-sourced.

[232] Abstractive Red-Teaming of Language Model Character

Nate Rahn, Allison Qi, Avery Griffin, Jonathan Michala, Henry Sleight, Erik Jones

Main category: cs.LG

TL;DR: Abstractive red-teaming for identifying query categories that cause language models to violate character specifications, using RL and LLM-based methods to find problematic query types pre-deployment.

DetailsMotivation: Language model assistants need to reliably follow character specifications across diverse user interactions, but can occasionally violate them in large-scale deployments. Current methods require deployment-level compute to identify violations, so the paper aims to find problematic query types pre-deployment with less compute.

Method: Introduces abstractive red-teaming: searching for natural-language query categories (e.g., “The query is in Chinese. The query asks about family roles”) that routinely elicit character violations. Two algorithms: 1) RL on a category generator LLM, and 2) leveraging a strong LLM to iteratively synthesize categories from high-scoring queries. Both use character-trait-specific reward models.

Result: Across 12-principle character specification and 7 target models, algorithms consistently outperform baselines. Generated interesting categories: queries asking Llama-3.1-8B-Instruct to predict future led to responses about AI dominating humanity; queries asking GPT-4.1-Mini for prison survival items led to recommending illegal weapons.

Conclusion: Abstractive red-teaming represents an important step towards realistic pre-deployment auditing of language model character by efficiently identifying problematic query categories that cause character violations.

Abstract: We want language model assistants to conform to a character specification, which asserts how the model should act across diverse user interactions. While models typically follow these character specifications, they can occasionally violate them in large-scale deployments. In this work, we aim to identify types of queries that are likely to produce such character violations at deployment, using much less than deployment-level compute. To do this, we introduce abstractive red-teaming, where we search for natural-language query categories, e.g. “The query is in Chinese. The query asks about family roles,” that routinely elicit violations. These categories abstract over the many possible variants of a query which could appear in the wild. We introduce two algorithms for efficient category search against a character-trait-specific reward model: one based on reinforcement learning on a category generator LLM, and another which leverages a strong LLM to iteratively synthesize categories from high-scoring queries. Across a 12-principle character specification and 7 target models, we find that our algorithms consistently outperform baselines, and generate qualitatively interesting categories; for example, queries which ask Llama-3.1-8B-Instruct to predict the future lead to responses saying that AI will dominate humanity, and queries that ask GPT-4.1-Mini for essential prison survival items lead to enthusiastic recommendation of illegal weapons. Overall, we believe our results represent an important step towards realistic pre-deployment auditing of language model character.

[233] The Appeal and Reality of Recycling LoRAs with Adaptive Merging

Haokun Liu, Gyung Hyun Je, Marco Ciccone, Zhenlin Xu, Prasanth YSS, Colin Raffel

Main category: cs.LG

TL;DR: Adaptive merging of recycled LoRA modules from public repositories shows limited benefits over training new LoRAs, with performance driven more by regularization than cross-task transfer.

DetailsMotivation: With many fine-tuned LoRA modules available on platforms like Hugging Face Hub, researchers want to understand if adaptively merging these "recycled" LoRAs can improve performance, and whether this works through cross-task transfer or other mechanisms.

Method: Empirical study using ~1,000 user-contributed LoRAs from Llama 3.1 8B-Instruct, testing adaptive/non-adaptive merging methods plus a new method from design space search. Compares recycled LoRA merging to training new LoRAs on same data.

Result: Adaptive merging improves over base model but offers limited benefit over training new LoRAs. LoRA selection matters little, and even randomly initialized LoRAs yield similar performance, suggesting regularization rather than cross-task transfer drives improvements.

Conclusion: Recycling LoRAs from public repositories provides limited practical benefit; improvements come from regularization effects rather than meaningful knowledge transfer, except when highly relevant LoRAs are available.

Abstract: The widespread availability of fine-tuned LoRA modules for open pre-trained models has led to an interest in methods that can adaptively merge LoRAs to improve performance. These methods typically include some way of selecting LoRAs from a pool and tune merging coefficients based on a task-specific dataset. While adaptive merging methods have demonstrated improvements in some settings, no past work has attempted to recycle LoRAs found “in the wild” on model repositories like the Hugging Face Hub. To address this gap, we consider recycling from a pool of nearly 1,000 user-contributed LoRAs trained from the Llama 3.1 8B-Instruct language model. Our empirical study includes a range of adaptive and non-adaptive merging methods in addition to a new method designed via a wide search over the methodological design space. We demonstrate that adaptive merging methods can improve performance over the base model but provide limited benefit over training a new LoRA on the same data used to set merging coefficients. We additionally find not only that the specific choice of LoRAs to merge has little importance, but that using LoRAs with randomly initialized parameter values yields similar performance. This raises the possibility that adaptive merging from recycled LoRAs primarily works via some kind of regularization effect, rather than by enabling positive cross-task transfer. To better understand why past work has proven successful, we confirm that positive transfer is indeed possible when there are highly relevant LoRAs in the pool. We release the model checkpoints and code online.
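The merging operation the paper studies is the usual low-rank LoRA update, W' = W + Σᵢ cᵢ·(Bᵢ·Aᵢ), where the coefficients cᵢ are what adaptive methods tune on a task-specific dataset. A pure-Python sketch on small nested-list matrices (names are illustrative, not the paper's API):

```python
def merge_loras(base_weight, loras, coeffs):
    """Merged weight W' = W + sum_i c_i * (B_i @ A_i).

    base_weight: base-model weight matrix (list of rows).
    loras: list of (B, A) low-rank factor pairs, one per recycled LoRA.
    coeffs: merging coefficients, the quantities adaptive merging tunes.
    """
    def matmul(B, A):
        return [[sum(B[r][k] * A[k][col] for k in range(len(A)))
                 for col in range(len(A[0]))] for r in range(len(B))]

    merged = [row[:] for row in base_weight]  # copy, don't mutate the base
    for coeff, (B, A) in zip(coeffs, loras):
        delta = matmul(B, A)
        for r in range(len(merged)):
            for col in range(len(merged[0])):
                merged[r][col] += coeff * delta[r][col]
    return merged
```

The paper's finding that randomly initialized (B, A) pairs merge nearly as well as trained ones is what suggests the coefficients act as a regularizer on W rather than transferring task knowledge.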

[234] Wireless TokenCom: RL-Based Tokenizer Agreement for Multi-User Wireless Token Communications

Farshad Zeinali, Mahdi Boloursaz Mashhadi, Dusit Niyato, Rahim Tafazolli

Main category: cs.LG

TL;DR: A hybrid RL framework for tokenizer agreement and resource allocation in multi-user wireless TokenCom systems that improves semantic quality and reduces video freezing by 68% compared to H.265.

DetailsMotivation: Token Communications (TokenCom) enables efficient semantic-oriented communications using tokens as unified units, but requires transmitters/receivers to agree on identical tokenizer models and codebooks. The paper addresses the Tokenizer Agreement (TA) problem in multi-user downlink wireless scenarios where efficient resource allocation is needed.

Method: Proposes a hybrid reinforcement learning framework combining DQN for joint tokenizer agreement and sub-channel assignment, with DDPG for beamforming optimization. Formulates the problem as a mixed-integer non-convex optimization and solves it through this RL approach.

Result: The proposed framework outperforms baseline methods in semantic quality and resource efficiency, reducing video freezing events by 68% compared to conventional H.265-based schemes.

Conclusion: The hybrid RL approach effectively solves the joint tokenizer agreement and resource allocation problem in multi-user TokenCom systems, demonstrating significant improvements in semantic communication quality and transmission reliability.

Abstract: Token Communications (TokenCom) has recently emerged as an effective new paradigm, where tokens are the unified units of multimodal communications and computations, enabling efficient digital semantic- and goal-oriented communications in future wireless networks. To establish a shared semantic latent space, the transmitters/receivers in TokenCom need to agree on an identical tokenizer model and codebook. To this end, an initial Tokenizer Agreement (TA) process is carried out in each communication episode, where the transmitter/receiver cooperate to choose from a set of pre-trained tokenizer models/codebooks available to them both for efficient TokenCom. In this correspondence, we investigate TA in a multi-user downlink wireless TokenCom scenario, where the base station equipped with multiple antennas transmits video token streams to multiple users. We formulate the corresponding mixed-integer non-convex problem, and propose a hybrid reinforcement learning (RL) framework that integrates a deep Q-network (DQN) for joint tokenizer agreement and sub-channel assignment, with a deep deterministic policy gradient (DDPG) for beamforming. Simulation results show that the proposed framework outperforms baseline methods in terms of semantic quality and resource efficiency, while reducing the freezing events in video transmission by 68% compared to the conventional H.265-based scheme.

[235] Intrinsic Credit Assignment for Long Horizon Interaction

Ilze Amanda Auzina, Joschka Strüber, Sergio Hernández-Gutiérrez, Shashwat Goel, Ameya Prabhu, Matthias Bethge

Main category: cs.LG

TL;DR: ΔBelief-RL uses the change in a language model’s own belief in the target solution as an intrinsic reward for credit assignment in long-horizon RL tasks, outperforming purely outcome-based rewards.

DetailsMotivation: To train agents that can effectively navigate uncertainty over long horizons by addressing the credit assignment problem - determining which intermediate actions contribute to eventual success.

Method: Proposes ΔBelief-RL which uses the change in probability that a language model assigns to target solutions as intrinsic rewards. Trains on synthetic interaction data to teach information-seeking capabilities by rewarding intermediate progress based on belief updates.

Result: Outperforms purely outcome-based rewards for RL, with improvements generalizing to out-of-distribution applications (customer service, personalization). Performance continues to improve beyond the training horizon, with increasing interaction efficiency on Pass@k metrics.

Conclusion: Introduces scalable training strategy for long-horizon uncertainty navigation by enabling credit assignment to intermediate actions via intrinsic ΔBelief rewards derived from language model’s changing beliefs.

Abstract: How can we train agents to navigate uncertainty over long horizons? In this work, we propose ΔBelief-RL, which leverages a language model’s own intrinsic beliefs to reward intermediate progress. Our method utilizes the change in the probability an agent assigns to the target solution for credit assignment. By training on synthetic interaction data, ΔBelief-RL teaches information-seeking capabilities that consistently outperform purely outcome-based rewards for Reinforcement Learning, with improvements generalizing to out-of-distribution applications ranging from customer service to personalization. Notably, the performance continues to improve as we scale test-time interactions beyond the training horizon, with interaction-efficiency increasing even on Pass@k metrics. Overall, our work introduces a scalable training strategy for navigating uncertainty over a long-horizon, by enabling credit assignment to intermediate actions via intrinsic ΔBelief rewards.
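The reward construction can be sketched in a few lines: after each interaction turn, the intrinsic reward is the change in the probability the model assigns to the target solution. The `belief_fn` below is a hypothetical stand-in for querying the language model's own belief; only the differencing scheme is from the paper:

```python
def delta_belief_rewards(belief_fn, turns, target):
    """Per-turn intrinsic reward = change in the model's belief in the target.

    belief_fn(history, target) -> probability in [0, 1] the model assigns
    to `target` given the interaction so far (hypothetical scorer).
    turns: the agent's interaction turns, applied in order.
    """
    rewards, history = [], []
    previous = belief_fn(history, target)
    for turn in turns:
        history.append(turn)
        current = belief_fn(history, target)
        rewards.append(current - previous)  # > 0 iff the turn raised belief
        previous = current
    return rewards
```

Note that the per-turn rewards telescope: their sum equals the final belief minus the initial one, so the scheme redistributes the outcome signal across intermediate actions rather than changing its total, which is exactly what credit assignment needs.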

[236] Bench-MFG: A Benchmark Suite for Learning in Stationary Mean Field Games

Lorenzo Magnino, Jiacheng Shen, Matthieu Geist, Olivier Pietquin, Mathieu Laurière

Main category: cs.LG

TL;DR: A benchmark suite (Bench-MFG) for evaluating Mean Field Games reinforcement learning algorithms with standardized environments and evaluation protocols.

DetailsMotivation: The field lacks standardized evaluation protocols for MFG-RL algorithms, forcing researchers to use bespoke, isolated environments, making it difficult to assess robustness, generalization, and failure modes of methods.

Method: Proposes Bench-MFG benchmark suite with taxonomy of problem classes (no-interaction, monotone, potential, dynamics-coupled games), prototypical environments for each, and MF-Garnets method for generating random MFG instances for statistical testing.

Result: Benchmarked various learning algorithms including novel black-box approach (MF-PSO) for exploitability minimization, and proposed guidelines for standardizing future experimental comparisons.

Conclusion: Bench-MFG provides a comprehensive benchmark suite to address fragmentation in MFG-RL evaluation, enabling standardized assessment of algorithm robustness and generalization.

Abstract: The intersection of Mean Field Games (MFGs) and Reinforcement Learning (RL) has fostered a growing family of algorithms designed to solve large-scale multi-agent systems. However, the field currently lacks a standardized evaluation protocol, forcing researchers to rely on bespoke, isolated, and often simplistic environments. This fragmentation makes it difficult to assess the robustness, generalization, and failure modes of emerging methods. To address this gap, we propose a comprehensive benchmark suite for MFGs (Bench-MFG), focusing on the discrete-time, discrete-space, stationary setting for the sake of clarity. We introduce a taxonomy of problem classes, ranging from no-interaction and monotone games to potential and dynamics-coupled games, and provide prototypical environments for each. Furthermore, we propose MF-Garnets, a method for generating random MFG instances to facilitate rigorous statistical testing. We benchmark a variety of learning algorithms across these environments, including a novel black-box approach (MF-PSO) for exploitability minimization. Based on our extensive empirical results, we propose guidelines to standardize future experimental comparisons. Code available at \href{https://github.com/lorenzomagnino/Bench-MFG}{https://github.com/lorenzomagnino/Bench-MFG}.

[237] A Machine Learning Approach to the Nirenberg Problem

Gianfranco Cortés, Maria Esteban-Casadevall, Yueqing Feng, Jonas Henkel, Edward Hirst, Tancredi Schettini Gherardini, Alexander G. Stapleton

Main category: cs.LG

TL;DR: A neural network approach to solve the Nirenberg problem of prescribing Gaussian curvature on spheres using physics-informed neural networks with geometry-aware loss functions.

DetailsMotivation: To develop a computational approach for the classical geometric analysis problem of prescribing Gaussian curvature on S², using neural networks as exploratory tools for existence questions.

Method: Mesh-free physics-informed neural network (PINN) that directly parametrizes the conformal factor globally, trained with geometry-aware loss enforcing the curvature equation, with additional validation via Gauss-Bonnet theorem and spherical-harmonic expansions.

Result: The network achieves very low losses (10⁻⁷ - 10⁻¹⁰) for realizable curvatures, while unrealizable curvatures yield significantly higher losses, enabling assessment of unknown cases.

Conclusion: Neural solvers can serve as exploratory tools in geometric analysis, offering quantitative computational perspectives on longstanding existence questions.

Abstract: This work introduces the Nirenberg Neural Network: a numerical approach to the Nirenberg problem of prescribing Gaussian curvature on $S^2$ for metrics that are pointwise conformal to the round metric. Our mesh-free physics-informed neural network (PINN) approach directly parametrises the conformal factor globally and is trained with a geometry-aware loss enforcing the curvature equation. Additional consistency checks were performed via the Gauss-Bonnet theorem, and spherical-harmonic expansions were fit to the learnt models to provide interpretability. For prescribed curvatures with known realisability, the neural network achieves very low losses ($10^{-7} - 10^{-10}$), while unrealisable curvatures yield significantly higher losses. This distinction enables the assessment of unknown cases, separating likely realisable functions from non-realisable ones. The current capabilities of the Nirenberg Neural Network demonstrate that neural solvers can serve as exploratory tools in geometric analysis, offering a quantitative computational perspective on longstanding existence questions.
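For context, the curvature equation the geometry-aware loss enforces is the classical Nirenberg PDE. Writing the candidate metric as a conformal deformation $g = e^{2u} g_0$ of the round metric $g_0$ (which has curvature $1$), the prescribed curvature $K$ must satisfy (sign conventions for the Laplace–Beltrami operator $\Delta_{g_0}$ vary across references):

```latex
K_g \;=\; e^{-2u}\bigl(1 - \Delta_{g_0} u\bigr)
\qquad\Longrightarrow\qquad
\Delta_{g_0} u + K\, e^{2u} \;=\; 1 \quad \text{on } S^2 .
```

The PINN parametrizes $u$ globally and penalizes the squared residual of the second equation; integrating $K_g \, e^{2u}$ over $S^2$ must give $4\pi$ by Gauss–Bonnet, which is the consistency check the authors mention.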

[238] Multi-Agent Model-Based Reinforcement Learning with Joint State-Action Learned Embeddings

Zhizun Wang, David Meger

Main category: cs.LG

TL;DR: A model-based multi-agent reinforcement learning framework that combines joint state-action representation learning with imaginative roll-outs for improved coordination in partially observable dynamic environments.

DetailsMotivation: Learning to coordinate many agents in partially observable and highly dynamic environments requires both informative representations and data-efficient training. Current approaches need better integration of representation learning with planning capabilities.

Method: Proposes a framework with world model trained using variational auto-encoders, augmented with state-action learned embedding (SALE). SALE is injected into both the imagination module for future roll-outs and the joint agent network where individual action values are combined through a mixing network to estimate joint action-value function.

Result: Empirical studies on StarCraft II Micro-Management, Multi-Agent MuJoCo, and Level-Based Foraging challenges demonstrate consistent gains over baseline algorithms, showing effectiveness of joint state-action learned embeddings in multi-agent model-based paradigm.

Conclusion: The framework enables agents to acquire richer understanding of how their choices influence collective outcomes, leading to improved long-term planning and optimization under limited real-environment interactions.

Abstract: Learning to coordinate many agents in partially observable and highly dynamic environments requires both informative representations and data-efficient training. To address this challenge, we present a novel model-based multi-agent reinforcement learning framework that unifies joint state-action representation learning with imaginative roll-outs. We design a world model trained with variational auto-encoders and augment the model using the state-action learned embedding (SALE). SALE is injected into both the imagination module that forecasts plausible future roll-outs and the joint agent network whose individual action values are combined through a mixing network to estimate the joint action-value function. By coupling imagined trajectories with SALE-based action values, the agents acquire a richer understanding of how their choices influence collective outcomes, leading to improved long-term planning and optimization under limited real-environment interactions. Empirical studies on well-established multi-agent benchmarks, including StarCraft II Micro-Management, Multi-Agent MuJoCo, and Level-Based Foraging challenges, demonstrate consistent gains of our method over baseline algorithms and highlight the effectiveness of joint state-action learned embeddings within a multi-agent model-based paradigm.
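The "mixing network" that combines individual action values into a joint action-value function follows the QMIX family; its defining constraint is monotonicity of the joint Q in each agent's Q. A deliberately minimal sketch (a single linear layer with state-conditioned weights forced non-negative; the real mixer is a hypernetwork, so names and shape here are simplifying assumptions):

```python
def mix_joint_q(agent_qs, state_weights, state_bias):
    """Monotonic mixing of per-agent action values into a joint Q.

    abs() keeps the weights non-negative so dQ_joint/dQ_i >= 0: improving
    any agent's own value never lowers the joint value, which makes each
    agent's greedy action consistent with the joint greedy action.
    state_weights / state_bias stand in for hypernetwork outputs
    conditioned on the global state.
    """
    return sum(abs(w) * q for w, q in zip(state_weights, agent_qs)) + state_bias
```

The monotonicity constraint is what allows decentralized execution (each agent maximizes its own Q) while training against the centralized joint value.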

[239] Policy4OOD: A Knowledge-Guided World Model for Policy Intervention Simulation against the Opioid Overdose Crisis

Yijun Ma, Zehong Wang, Weixiang Sun, Zheyuan Zhang, Kaiwen Shi, Nitesh Chawla, Yanfang Ye

Main category: cs.LG

TL;DR: Policy4OOD: A knowledge-guided spatio-temporal world model for opioid policy evaluation that unifies forecasting, counterfactual reasoning, and optimization through transformer-based simulation.

DetailsMotivation: The opioid epidemic requires effective policy evaluation, but current approaches struggle with complex system dynamics where policies interact and targeting one risk pathway may amplify others. There's a need for unified capabilities in forecasting, counterfactual reasoning, and optimization.

Method: Policy4OOD jointly encodes policy knowledge graphs, state-level spatial dependencies, and socioeconomic time series into a policy-conditioned Transformer. The model serves as a simulator for forecasting (forward pass), counterfactual analysis (substituting policy encodings), and policy optimization (Monte Carlo Tree Search).
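
All three uses of the trained world model share one forward-pass primitive. A toy linear stand-in for the policy-conditioned Transformer shows how counterfactual analysis works by substituting a policy encoding mid-sequence; the dynamics x_{t+1} = A x_t + B p_t and all dimensions are purely illustrative:

```python
import numpy as np

def rollout(x0, policies, A, B):
    """Forward pass of a toy linear 'world model': x_{t+1} = A x_t + B p_t."""
    xs, x = [x0], x0
    for p in policies:
        x = A @ x + B @ p
        xs.append(x)
    return np.stack(xs)

rng = np.random.default_rng(0)
d, k, T = 3, 2, 5
A = 0.9 * np.eye(d)                      # stable toy dynamics
B = rng.normal(size=(d, k))
x0 = rng.normal(size=d)

factual = [rng.normal(size=k) for _ in range(T)]
counterfactual = list(factual)
counterfactual[2] = rng.normal(size=k)   # substitute one policy encoding

f_traj = rollout(x0, factual, A, B)
cf_traj = rollout(x0, counterfactual, A, B)
# Trajectories agree up to the substitution point and diverge afterwards.
```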

Result: Experiments show that spatial dependencies and structured policy knowledge significantly improve forecasting accuracy. The framework validates each architectural component and demonstrates the potential of world modeling for data-driven public health decision support.

Conclusion: World modeling provides a unified approach for opioid policy evaluation, enabling forecasting, counterfactual reasoning, and optimization through a single trained simulator. The method shows promise for data-driven public health decision-making.

Abstract: The opioid epidemic remains one of the most severe public health crises in the United States, yet evaluating policy interventions before implementation is difficult: multiple policies interact within a dynamic system where targeting one risk pathway may inadvertently amplify another. We argue that effective opioid policy evaluation requires three capabilities – forecasting future outcomes under current policies, counterfactual reasoning about alternative past decisions, and optimization over candidate interventions – and propose to unify them through world modeling. We introduce Policy4OOD, a knowledge-guided spatio-temporal world model that addresses three core challenges: what policies prescribe, where effects manifest, and when effects unfold. Policy4OOD jointly encodes policy knowledge graphs, state-level spatial dependencies, and socioeconomic time series into a policy-conditioned Transformer that forecasts future opioid outcomes. Once trained, the world model serves as a simulator: forecasting requires only a forward pass, counterfactual analysis substitutes alternative policy encodings in the historical sequence, and policy optimization employs Monte Carlo Tree Search over the learned simulator. To support this framework, we construct a state-level monthly dataset (2019–2024) integrating opioid mortality, socioeconomic indicators, and structured policy encodings. Experiments demonstrate that spatial dependencies and structured policy knowledge significantly improve forecasting accuracy, validating each architectural component and the potential of world modeling for data-driven public health decision support.

[240] Value Bonuses using Ensemble Errors for Exploration in Reinforcement Learning

Abdul Wahab, Raksha Kumaraswamy, Martha White

Main category: cs.LG

TL;DR: VBE introduces an exploration algorithm using ensemble errors to create value bonuses that provide first-visit optimism and deep exploration in RL.

DetailsMotivation: Existing value bonus methods only increase bonuses retroactively after seeing higher rewards, failing to encourage first-time visits to states and actions. There's a need for exploration algorithms that provide optimism for first visits.

Method: VBE maintains an ensemble of random action-value functions (RQFs) and uses their estimation errors to design value bonuses. The key innovation is designing rewards for these RQFs so that value bonuses can decrease to zero, enabling first-visit optimism and deep exploration.
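
The bonus mechanism can be sketched with a tabular toy: each RQF regresses toward a fixed random target, so the ensemble error is large at unvisited states (first-visit optimism) and decays toward zero with visitation. Table sizes and the learning rate below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, K = 10, 5

# Fixed random targets: one per state for each of K random Q-functions (RQFs).
targets = rng.normal(size=(K, n_states))
# Learned estimates start far from the targets everywhere.
estimates = np.zeros((K, n_states))

def value_bonus(s):
    # Ensemble estimation error serves as the exploration bonus for state s.
    return float(np.mean((estimates[:, s] - targets[:, s]) ** 2))

def visit(s, lr=0.5, steps=20):
    # Visiting a state lets each RQF regress toward its target.
    for _ in range(steps):
        estimates[:, s] += lr * (targets[:, s] - estimates[:, s])

before = value_bonus(3)   # large: state 3 never visited
visit(3)
after = value_bonus(3)    # near zero: the bonus decays with visitation
```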

Result: VBE outperforms Bootstrap DQN and reward bonus approaches (RND and ACB) on several classic exploration test environments. It also scales effectively to more complex environments like Atari.

Conclusion: VBE provides an effective approach for directed exploration in RL by using ensemble errors to create value bonuses that encourage first-visit exploration, addressing limitations of existing methods.

Abstract: Optimistic value estimates provide one mechanism for directed exploration in reinforcement learning (RL). The agent acts greedily with respect to an estimate of the value plus what can be seen as a value bonus. The value bonus can be learned by estimating a value function on reward bonuses, propagating local uncertainties around rewards. However, this approach only increases the value bonus for an action retroactively, after seeing a higher reward bonus from that state and action. Such an approach does not encourage the agent to visit a state and action for the first time. In this work, we introduce an algorithm for exploration called Value Bonuses with Ensemble errors (VBE), that maintains an ensemble of random action-value functions (RQFs). VBE uses the errors in the estimation of these RQFs to design value bonuses that provide first-visit optimism and deep exploration. The key idea is to design the rewards for these RQFs in such a way that the value bonus can decrease to zero. We show that VBE outperforms Bootstrap DQN and two reward bonus approaches (RND and ACB) on several classic environments used to test exploration and provide demonstrative experiments that it can scale easily to more complex environments like Atari.

[241] TRACE: Temporal Reasoning via Agentic Context Evolution for Streaming Electronic Health Records (EHRs)

Zhan Qu, Michael Färber

Main category: cs.LG

TL;DR: TRACE enables temporal clinical reasoning with frozen LLMs using a dual-memory architecture and agentic components, improving performance on longitudinal patient data without fine-tuning.

DetailsMotivation: LLMs struggle with longitudinal patient trajectories due to evolving clinical states, irregular timing, and heterogeneous events, while existing adaptation methods have computational, privacy, and stability limitations.

Method: TRACE uses a dual-memory architecture (Global Protocol for institutional rules, Individual Protocol for patient state) with four agentic components (Router, Reasoner, Auditor, Steward) to structure and maintain context for temporal reasoning without extending context windows or updating parameters.

Result: On MIMIC-IV longitudinal clinical event streams, TRACE significantly improves next-event prediction accuracy, protocol adherence, and clinical safety over long-context and retrieval-augmented baselines while maintaining bounded inference cost.

Conclusion: TRACE provides an effective framework for temporal clinical reasoning with frozen LLMs that improves performance, safety, and interpretability on longitudinal patient data without the drawbacks of fine-tuning or retrieval augmentation.

Abstract: Large Language Models (LLMs) encode extensive medical knowledge but struggle to apply it reliably to longitudinal patient trajectories, where evolving clinical states, irregular timing, and heterogeneous events degrade performance over time. Existing adaptation strategies rely on fine-tuning or retrieval-based augmentation, which introduce computational overhead, privacy constraints, or instability under long contexts. We introduce TRACE (Temporal Reasoning via Agentic Context Evolution), a framework that enables temporal clinical reasoning with frozen LLMs by explicitly structuring and maintaining context rather than extending context windows or updating parameters. TRACE operates over a dual-memory architecture consisting of a static Global Protocol encoding institutional clinical rules and a dynamic Individual Protocol tracking patient-specific state. Four agentic components, Router, Reasoner, Auditor, and Steward, coordinate over this structured memory to support temporal inference and state evolution. The framework maintains bounded inference cost via structured state compression and selectively audits safety-critical clinical decisions. Evaluated on longitudinal clinical event streams from MIMIC-IV, TRACE significantly improves next-event prediction accuracy, protocol adherence, and clinical safety over long-context and retrieval-augmented baselines, while producing interpretable and auditable reasoning traces.

[242] Deep Doubly Debiased Longitudinal Effect Estimation with ICE G-Computation

Wenxin Chen, Weishen Pan, Kyra Gan, Fei Wang

Main category: cs.LG

TL;DR: D3-Net is a framework that mitigates error propagation in iterative conditional expectation (ICE) G-computation for longitudinal treatment effect estimation by using sequential doubly robust pseudo-outcomes during training and applying longitudinal targeted minimum loss-based estimation for final correction.

DetailsMotivation: Estimating longitudinal treatment effects is challenging due to treatment-confounder feedback. While ICE G-computation offers a principled approach, its recursive structure suffers from error propagation that corrupts learned outcome regression models, leading to biased estimates.

Method: 1) Train ICE sequence using Sequential Doubly Robust (SDR) pseudo-outcomes to interrupt error propagation during learning; 2) Employ multi-task Transformer with covariate simulator head for auxiliary supervision and target network for stability; 3) Apply Longitudinal Targeted Minimum Loss-Based Estimation (LTMLE) on original outcomes using uncorrected nuisance models for final robust estimation.
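
The ICE recursion at the core of this pipeline is easiest to see on a two-step toy. The sketch below runs plain ICE G-computation (no SDR pseudo-outcomes, no LTMLE correction, and ordinary least squares in place of the Transformer), on a simulated sequence chosen so the true value under treatment at both steps is 2:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Toy sequential data: L0 -> A0 -> L1 -> A1 -> Y, treatments randomized.
L0 = rng.normal(size=n)
A0 = (rng.random(n) < 0.5).astype(float)
L1 = L0 + A0 + rng.normal(size=n)
A1 = (rng.random(n) < 0.5).astype(float)
Y = L1 + A1 + rng.normal(size=n)

def ols(X, y):
    """Least-squares fit with intercept; returns a prediction function."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ beta

# Backward ICE recursion for the regime (A0 = 1, A1 = 1).
# Step T: regress Y on (L1, A1), then evaluate under A1 = 1.
q2 = ols(np.column_stack([L1, A1]), Y)
pseudo = q2(np.column_stack([L1, np.ones(n)]))
# Step T-1: regress the pseudo-outcome on (L0, A0), evaluate under A0 = 1.
q1 = ols(np.column_stack([L0, A0]), pseudo)
est = q1(np.column_stack([L0, np.ones(n)])).mean()
# Ground truth: E[Y | do(A0=1, A1=1)] = E[L0] + 1 + 1 = 2.
```

Errors in the step-T regression feed directly into the step-(T-1) targets, which is exactly the propagation the SDR pseudo-outcomes are designed to interrupt.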

Result: Comprehensive experiments show D3-Net robustly reduces bias and variance across different horizons, counterfactuals, and time-varying confoundings compared to existing state-of-the-art ICE-based estimators.

Conclusion: D3-Net effectively addresses error propagation in ICE training through SDR pseudo-outcomes during learning and LTMLE for final correction, providing robust longitudinal treatment effect estimation with improved finite-sample properties.

Abstract: Estimating longitudinal treatment effects is essential for sequential decision-making but is challenging due to treatment-confounder feedback. While Iterative Conditional Expectation (ICE) G-computation offers a principled approach, its recursive structure suffers from error propagation, corrupting the learned outcome regression models. We propose D3-Net, a framework that mitigates error propagation in ICE training and then applies a robust final correction. First, to interrupt error propagation during learning, we train the ICE sequence using Sequential Doubly Robust (SDR) pseudo-outcomes, which provide bias-corrected targets for each regression. Second, we employ a multi-task Transformer with a covariate simulator head for auxiliary supervision, regularizing representations against corruption by noisy pseudo-outcomes, and a target network to stabilize training dynamics. For the final estimate, we discard the SDR correction and instead use the uncorrected nuisance models to perform Longitudinal Targeted Minimum Loss-Based Estimation (LTMLE) on the original outcomes. This second-stage, targeted debiasing ensures robustness and optimal finite-sample properties. Comprehensive experiments demonstrate that our model, D3-Net, robustly reduces bias and variance across different horizons, counterfactuals, and time-varying confoundings, compared to existing state-of-the-art ICE-based estimators.

[243] TFT-ACB-XML: Decision-Level Integration of Customized Temporal Fusion Transformer and Attention-BiLSTM with XGBoost Meta-Learner for BTC Price Forecasting

Raiz Ud Din, Saddam Hussain Khan

Main category: cs.LG

TL;DR: Hybrid TFT-ACB-XML framework combines Temporal Fusion Transformer and Attention-Customized BiLSTM with XGBoost for Bitcoin price prediction, achieving improved accuracy over baseline models.

DetailsMotivation: Bitcoin forecasting is challenging due to non-linear, volatile decentralized markets with temporal irregularities. Existing deep learning models lack interpretability and generalization across diverse market conditions.

Method: Hybrid stacked-generalization framework with two parallel base learners: customized Temporal Fusion Transformer (handles long-range dependencies) and Attention-Customized BiLSTM (captures short-term dependencies). Predictions are weighted using error-reciprocal weighting strategy, then concatenated and fed to XGBoost regressor as meta-learner.
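
The error-reciprocal weighting step admits a one-line formula: w_i is proportional to 1/e_i, normalized to sum to one. A sketch, where the validation errors and predictions are made-up numbers, not results from the paper:

```python
import numpy as np

def reciprocal_weights(val_errors):
    """Error-reciprocal weighting: a base learner with lower validation
    error receives a proportionally higher weight."""
    inv = 1.0 / np.asarray(val_errors, dtype=float)
    return inv / inv.sum()

# Hypothetical validation MAEs for the TFT and ACB base learners.
w = reciprocal_weights([210.0, 290.0])
# Blend the two base-learner predictions (hypothetical BTC closing prices).
blended = w[0] * 101_000.0 + w[1] * 99_500.0
```

In the full framework these weighted outputs are concatenated into a feature vector for the XGBoost meta-learner rather than summed directly.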

Result: Empirical validation using BTC data from 2014-2026 shows MAPE of 0.65%, MAE of 198.15, and RMSE of 258.30 for one-step-ahead out-of-sample predictions under walk-forward evaluation, outperforming recent Deep Learning and Transformer baselines.

Conclusion: The proposed TFT-ACB-XML framework effectively addresses Bitcoin forecasting challenges by combining temporal modeling approaches with ensemble techniques, demonstrating improved predictive performance during major market events like halving and ETF periods.

Abstract: Accurate forecasting of Bitcoin (BTC) has always been a challenge because decentralized markets are non-linear, highly volatile, and have temporal irregularities. Existing deep learning models often struggle with interpretability and generalization across diverse market conditions. This research presents a hybrid stacked-generalization framework, TFT-ACB-XML, for BTC closing price prediction. The framework integrates two parallel base learners: a customized Temporal Fusion Transformer (TFT) and an Attention-Customized Bidirectional Long Short-Term Memory network (ACB), followed by an XGBoost regressor as the meta-learner. The customized TFT model handles long-range dependencies and global temporal dynamics via variable selection networks and interpretable single-head attention. The ACB module uses a new attention mechanism alongside the customized BiLSTM to capture short-term sequential dependencies. Predictions from both customized TFT and ACB are weighted through an error-reciprocal weighting strategy. These weights are derived from validation performance, where a model showing lower prediction error receives a higher weight. Finally, the framework concatenates these weighted outputs into a feature vector and feeds the vector to an XGBoost regressor, which captures non-linear residuals and produces the final BTC closing price prediction. Empirical validation using BTC data from October 1, 2014, to January 5, 2026, shows improved performance of the proposed framework compared to recent Deep Learning and Transformer baseline models. The results show a MAPE of 0.65%, an MAE of 198.15, and an RMSE of 258.30 for one-step-ahead out-of-sample under a walk-forward evaluation on the test block. The evaluation period spans the 2024 BTC halving and the spot ETFs (exchange-traded funds) period, which coincide with major liquidity and volatility shifts.

[244] Why Deep Jacobian Spectra Separate: Depth-Induced Scaling and Singular-Vector Alignment

Nathanaël Haas, François Gatine, Augustin M Cosse, Zied Bouraoui

Main category: cs.LG

TL;DR: The paper analyzes deep network Jacobians, showing depth-induced exponential scaling of singular values and spectral separation, leading to decoupled singular-value dynamics that explain implicit bias in gradient-based training.

DetailsMotivation: Understanding implicit bias in deep network training is challenging because tractable singular-value dynamics are typically only available for balanced deep linear models. The authors seek alternative theoretical foundations to explain why gradient-based training exhibits strong implicit bias in practical deep networks.

Method: Adopting a fixed-gates view of piecewise-linear networks where Jacobians reduce to products of masked linear maps within activation regions. The authors prove existence of Lyapunov exponents governing top singular values at initialization, give closed-form expressions in a tractable masked model, quantify finite-depth corrections, and show that strong spectral separation forces singular-vector alignment in matrix products.
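
Under the fixed-gates view, a Jacobian is just a product of masked Gaussian maps, which is easy to simulate: dividing log singular values by depth gives a finite-depth estimate of the Lyapunov exponents the paper proves exist. Width, depth, and keep-probability below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_jacobian(width, depth, keep=0.75):
    """Product of masked linear maps: the Jacobian of a fixed-gates
    piecewise-linear network within one activation region."""
    J = np.eye(width)
    for _ in range(depth):
        W = rng.normal(size=(width, width)) / np.sqrt(width)
        gates = (rng.random(width) < keep).astype(float)  # frozen ReLU gates
        J = (gates[:, None] * W) @ J
    return J

depth = 30
s = np.linalg.svd(masked_jacobian(16, depth), compute_uv=False)
# Depth-normalized log singular values estimate the top Lyapunov exponents;
# the gap between them drives the spectral separation the paper analyzes.
lyap = np.log(s[:2]) / depth
```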

Result: Theoretical results demonstrate depth-induced exponential scaling of ordered singular values and strong spectral separation in deep Jacobians. Experiments in fixed-gates settings validate predicted scaling, alignment, and resulting dynamics, supporting the mechanistic account of emergent low-rank Jacobian structure.

Conclusion: The analysis reveals that deep Jacobians exhibit exponential singular value scaling and spectral separation, leading to effectively decoupled singular-value dynamics that mirror classical balanced deep-linear analyses without requiring balancing. This provides a mechanistic explanation for emergent low-rank Jacobian structure as a driver of implicit bias in deep network training.

Abstract: Understanding why gradient-based training in deep networks exhibits strong implicit bias remains challenging, in part because tractable singular-value dynamics are typically available only for balanced deep linear models. We propose an alternative route based on two theoretically grounded and empirically testable signatures of deep Jacobians: depth-induced exponential scaling of ordered singular values and strong spectral separation. Adopting a fixed-gates view of piecewise-linear networks, where Jacobians reduce to products of masked linear maps within a single activation region, we prove the existence of Lyapunov exponents governing the top singular values at initialization, give closed-form expressions in a tractable masked model, and quantify finite-depth corrections. We further show that sufficiently strong separation forces singular-vector alignment in matrix products, yielding an approximately shared singular basis for intermediate Jacobians. Together, these results motivate an approximation regime in which singular-value dynamics become effectively decoupled, mirroring classical balanced deep-linear analyses without requiring balancing. Experiments in fixed-gates settings validate the predicted scaling, alignment, and resulting dynamics, supporting a mechanistic account of emergent low-rank Jacobian structure as a driver of implicit bias.

[245] Rational Neural Networks have Expressivity Advantages

Maosen Tang, Alex Townsend

Main category: cs.LG

TL;DR: Rational activation functions outperform standard fixed activations in neural networks, offering better expressivity and parameter efficiency with provable approximation advantages.

DetailsMotivation: Standard neural network activations (ReLU, sigmoid, tanh, etc.) have limitations in expressivity and parameter efficiency. The paper explores whether trainable low-degree rational functions can serve as better activation functions, potentially offering superior approximation capabilities with fewer parameters.

Method: The authors study neural networks with trainable low-degree rational activation functions. They establish theoretical approximation bounds comparing rational activations to standard fixed activations, proving separation results. They also integrate rational activations into standard architectures and training pipelines for practical evaluation.
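
A trainable rational activation is simply r(x) = P(x)/Q(x) for low-degree polynomials P and Q. The sketch below keeps Q strictly positive via a sum of squares, one common way to avoid poles on the real line; this positivity construction and the coefficients are assumptions, not the paper's parameterization:

```python
import numpy as np

def rational(x, p, q):
    """r(x) = P(x)/Q(x) with Q(x) = 1 + sum of squares, so Q > 0 everywhere."""
    num = np.polyval(p[::-1], x)   # p[0] + p[1]*x + p[2]*x**2 + ...
    den = 1.0 + sum((c * x ** (i + 1)) ** 2 for i, c in enumerate(q))
    return num / den

x = np.linspace(-3.0, 3.0, 7)
# Illustrative coefficients only; in training these would be learnable
# parameters updated alongside the network weights.
y = rational(x, p=[0.03, 0.5, 0.3], q=[0.4])
```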

Result: Theoretical results show exponential gaps: rational networks can approximate standard networks with only poly(log log(1/ε)) overhead, while the converse requires Ω(log(1/ε)) parameters. Practically, rational activations match or outperform fixed activations under identical architectures and optimizers.

Conclusion: Trainable rational activation functions are more expressive and parameter-efficient than standard fixed activations, offering both theoretical advantages and practical benefits when integrated into existing neural network architectures.

Abstract: We study neural networks with trainable low-degree rational activation functions and show that they are more expressive and parameter-efficient than modern piecewise-linear and smooth activations such as ELU, LeakyReLU, LogSigmoid, PReLU, ReLU, SELU, CELU, Sigmoid, SiLU, Mish, Softplus, Tanh, Softmin, Softmax, and LogSoftmax. For an error target of $\varepsilon>0$, we establish approximation-theoretic separations: Any network built from standard fixed activations can be uniformly approximated on compact domains by a rational-activation network with only $\mathrm{poly}(\log\log(1/\varepsilon))$ overhead in size, while the converse provably requires $\Omega(\log(1/\varepsilon))$ parameters in the worst case. This exponential gap persists at the level of full networks and extends to gated activations and transformer-style nonlinearities. In practice, rational activations integrate seamlessly into standard architectures and training pipelines, allowing rationals to match or outperform fixed activations under identical architectures and optimizers.

[246] High-dimensional Level Set Estimation with Trust Regions and Double Acquisition Functions

Giang Ngo, Dat Phan Trong, Dang Nguyen, Sunil Gupta

Main category: cs.LG

TL;DR: TRLSE is an active learning algorithm for high-dimensional level set estimation that uses dual acquisition functions (global and local) to efficiently identify and refine regions near the threshold boundary.

DetailsMotivation: Level set estimation is fundamental in many applications but becomes challenging in high-dimensional spaces where search volume grows exponentially. Existing methods struggle with sample efficiency in high dimensions.

Method: TRLSE uses dual acquisition functions: global exploration to identify promising regions near the threshold boundary, and local refinement to precisely characterize the boundary. The algorithm iteratively acquires informative points to build an accurate classifier.

Result: Theoretical analysis shows TRLSE’s accuracy guarantees, and extensive evaluations on synthetic and real-world problems demonstrate superior sample efficiency compared to existing methods.

Conclusion: TRLSE provides an effective solution for high-dimensional level set estimation with strong theoretical guarantees and practical performance improvements in sample efficiency.

Abstract: Level set estimation (LSE) classifies whether an unknown function’s value exceeds a specified threshold for given inputs, a fundamental problem in many real-world applications. In active learning settings with limited initial data, we aim to iteratively acquire informative points to construct an accurate classifier for this task. In high-dimensional spaces, this becomes challenging where the search volume grows exponentially with increasing dimensionality. We propose TRLSE, an algorithm for high-dimensional LSE, which identifies and refines regions near the threshold boundary with dual acquisition functions operating at both global and local levels. We provide a theoretical analysis of TRLSE’s accuracy and show its superior sample efficiency against existing methods through extensive evaluations on multiple synthetic and real-world LSE problems.

[247] Synthetic Interaction Data for Scalable Personalization in Large Language Models

Yuchen Ma, Yue Huang, Wenjie Wang, Xiaonan Luo, Xiangliang Zhang, Stefan Feuerriegel

Main category: cs.LG

TL;DR: Personalized prompt optimization framework (PPOpt) using synthetic data (PersonaGym) to optimize LLM prompts for individual user preferences without model modification.

DetailsMotivation: Existing prompt optimization methods focus on task-level optimization but overlook user-specific preferences and constraints, lacking personalized interaction data and robust reward signals for individual preferences.

Method: Introduces PersonaGym synthetic data generation framework to create realistic multi-turn interaction trajectories, and PPOpt framework that uses a reason-then-optimize paradigm to infer user profiles and rewrite prompts without modifying the deployed LLM.

Result: PPOpt shows consistent improvements over state-of-the-art baselines in task performance, personalization quality, and robustness to noisy/sparse preference signals.

Conclusion: The approach enables scalable personalized prompt optimization without model modification, addressing data limitations through synthetic data generation.

Abstract: Personalized prompting offers large opportunities for deploying large language models (LLMs) to diverse users, yet existing prompt optimization methods primarily focus on task-level optimization while largely overlooking user-specific preferences and latent constraints of individual users. This gap is primarily due to (i) the absence of high-quality, privacy-sensitive data that capture personalized user-LLM interactions at scale, and (ii) the lack of robust reward signals for individual preferences. To overcome existing data limitations, we introduce a high-fidelity synthetic data generation framework called PersonaGym. Unlike prior work that treats personalization as static persona-preference pairs, PersonaGym models a dynamic preference process via an agentic LLM system to simulate realistic preference behaviors and semantic-aware noise in order to generate personalized multi-turn interaction trajectories. Using PersonaGym, we release PersonaAtlas, a large-scale, high-quality, and diverse synthetic dataset of high-fidelity multi-turn personalized interaction trajectories that closely mirror real-world preference expression and noise patterns. We further propose Personalized Prompt Optimization (PPOpt), a scalable and model-agnostic framework that optimizes user prompts based on interaction histories without modifying the deployed LLM. PPOpt adopts a reason-then-optimize paradigm that infers an explicit user profile and conditions prompt rewriting on the user profile to avoid reward hacking. Our training procedure for PPOpt integrates a cold-start supervised prior with outcome-driven multi-objective reinforcement learning. We present extensive experiments to demonstrate consistent improvements over state-of-the-art baselines in terms of task performance, personalization quality, and robustness to noisy as well as to sparse preference signals.

[248] AstRL: Analog and Mixed-Signal Circuit Synthesis with Deep Reinforcement Learning

Felicia B. Guo, Ken T. Ho, Andrei Vladimirescu, Borivoje Nikolic

Main category: cs.LG

TL;DR: AstRL: A deep reinforcement learning method that treats analog circuit design as a graph generation problem, creating transistor-level topologies optimized for specific targets through simulator-embedded training with behavioral cloning and discriminator rewards.

DetailsMotivation: Analog and mixed-signal IC design complexity has increased but automation remains limited due to challenges in creating generalized optimization methods that work across diverse, constrained, non-differentiable circuit design spaces.

Method: Casts circuit design as graph generation problem using deep reinforcement learning (policy-gradient approach). Generates circuits directly optimized for user targets in simulator-embedded environment with ground-truth feedback. Uses behavioral cloning and discriminator-based similarity rewards for expert-aligned generation. Operates at individual transistor level with strong inductive biases in action space and environment for structural consistency.

Result: Substantial improvements in conventional design metrics over state-of-the-art baselines for three realistic design tasks. 100% of generated designs are structurally correct and over 90% demonstrate required functionality.

Conclusion: AstRL demonstrates an expert-aligned paradigm for generalized circuit generation validated in simulation, enabling highly expressive, fine-grained topology generation at transistor level with strong structural consistency.

Abstract: Analog and mixed-signal (AMS) integrated circuits (ICs) lie at the core of modern computing and communications systems. However, despite the continued rise in design complexity, advances in AMS automation remain limited. This reflects the central challenge in developing a generalized optimization method applicable across diverse circuit design spaces, many of which are distinct, constrained, and non-differentiable. To address this, our work casts circuit design as a graph generation problem and introduces a novel method of AMS synthesis driven by deep reinforcement learning (AstRL). Based on a policy-gradient approach, AstRL generates circuits directly optimized for user-specified targets within a simulator-embedded environment that provides ground-truth feedback during training. Through behavioral-cloning and discriminator-based similarity rewards, our method demonstrates, for the first time, an expert-aligned paradigm for generalized circuit generation validated in simulation. Importantly, the proposed approach operates at the level of individual transistors, enabling highly expressive, fine-grained topology generation. Strong inductive biases encoded in the action space and environment further drive structurally consistent and valid generation. Experimental results for three realistic design tasks illustrate substantial improvements in conventional design metrics over state-of-the-art baselines, with 100% of generated designs being structurally correct and over 90% demonstrating required functionality.

[249] Soft Contamination Means Benchmarks Test Shallow Generalization

Ari Spiesberger, Juan J. Vazquez, Nicky Pochinkov, Tomáš Gavenčiak, Peli Grietzer, Gavin Leech, Nandi Schoots

Main category: cs.LG

TL;DR: Paper studies soft contamination in LLM training data where semantic duplicates of benchmark test data inflate performance metrics, showing widespread contamination and its confounding effects on benchmark gains.

DetailsMotivation: LLM benchmark performance can be biased when training data contains benchmark test data, but current decontamination methods using n-gram matching fail to detect semantic duplicates (sentences with equivalent content but different wording), leading to soft contamination that inflates performance estimates.

Method: Embed the Olmo3 training corpus to detect semantic duplicates of benchmark data, analyze contamination prevalence, and conduct experiments to measure how including semantic duplicates affects benchmark performance and out-of-distribution generalization.
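
Detecting semantic duplicates reduces to a nearest-neighbor search in embedding space. A sketch with random vectors standing in for real sentence embeddings; the 0.95 threshold is an illustrative choice, not the paper's:

```python
import numpy as np

def semantic_duplicates(train_emb, bench_emb, threshold=0.95):
    """Flag benchmark items whose nearest training embedding exceeds a
    cosine-similarity threshold: candidates for soft contamination."""
    def unit(M):
        return M / np.linalg.norm(M, axis=1, keepdims=True)
    sims = unit(bench_emb) @ unit(train_emb).T   # cosine similarity matrix
    nearest = sims.max(axis=1)
    return nearest >= threshold, nearest

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 32))     # stand-in for corpus embeddings
bench = rng.normal(size=(5, 32))       # stand-in for benchmark embeddings
bench[0] = train[7] + 0.01 * rng.normal(size=32)   # plant a near-duplicate
flags, scores = semantic_duplicates(train, bench)
```

Unlike n-gram matching, a rewording of a benchmark problem still lands close to the original in embedding space and gets flagged.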

Result: Found widespread contamination: semantic duplicates for 78% of CodeForces and exact duplicates for 50% of ZebraLogic problems; including semantic duplicates improves benchmark performance; fine-tuning on duplicates also improves performance on truly-held-out datapoints from same benchmark.

Conclusion: Recent benchmark gains are confounded by soft contamination - improvements reflect both genuine capability gains and accumulation of test data (and effective test data via semantic duplicates) in growing training corpora, questioning the validity of current benchmark evaluations.

Abstract: If LLM training data is polluted with benchmark test data, then benchmark performance gives biased estimates of out-of-distribution (OOD) generalization. Typical decontamination filters use n-gram matching which fail to detect semantic duplicates: sentences with equivalent (or near-equivalent) content that are not close in string space. We study this soft contamination of training data by semantic duplicates. Among other experiments, we embed the Olmo3 training corpus and find that: 1) contamination remains widespread, e.g. we find semantic duplicates for 78% of CodeForces and exact duplicates for 50% of ZebraLogic problems; 2) including semantic duplicates of benchmark data in training does improve benchmark performance; and 3) when finetuning on duplicates of benchmark datapoints, performance also improves on truly-held-out datapoints from the same benchmark. We argue that recent benchmark gains are thus confounded: the prevalence of soft contamination means gains reflect both genuine capability improvements and the accumulation of test data and effective test data in growing training corpora.

[250] Stabilizing Native Low-Rank LLM Pretraining

Paul Janson, Edouard Oyallon, Eugene Belilovsky

Main category: cs.LG

TL;DR: Training LLMs from scratch using exclusively low-rank factorized weights without full-rank guidance, addressing instability via spectral-norm control with the Spectron method.

Motivation: Foundation models face computational and memory challenges due to growing parameter counts. Low-rank factorization offers cost reduction but lacks stable recipes for training from scratch with exclusively low-rank weights while matching dense model performance.

Method: Introduces Spectron: Spectral renormalization with orthogonalization, which dynamically bounds weight updates based on current spectral norms of factors to address instability from uncontrolled spectral norm growth during low-rank training.

Result: Enables stable, end-to-end factorized training with negligible overhead. Establishes compute-optimal scaling laws for natively low-rank transformers showing predictable power-law behavior and improved inference efficiency relative to dense models.

Conclusion: LLMs can be trained from scratch using exclusively low-rank factorized weights without auxiliary full-rank guidance, overcoming previous instability issues through spectral norm control, enabling more efficient training and inference.

Abstract: Foundation models have achieved remarkable success, yet their growing parameter counts pose significant computational and memory challenges. Low-rank factorization offers a promising route to reduce training and inference costs, but the community lacks a stable recipe for training models from scratch using exclusively low-rank weights while matching the performance of the dense model. We demonstrate that Large Language Models (LLMs) can be trained from scratch using exclusively low-rank factorized weights for all non-embedding matrices without auxiliary “full-rank” guidance required by prior methods. While native low-rank training often suffers from instability and loss spikes, we identify uncontrolled growth in the spectral norm (largest singular value) of the weight matrix update as the dominant factor. To address this, we introduce Spectron: Spectral renormalization with orthogonalization, which dynamically bounds the resultant weight updates based on the current spectral norms of the factors. Our method enables stable, end-to-end factorized training with negligible overhead. Finally, we establish compute-optimal scaling laws for natively low-rank transformers, demonstrating predictable power-law behavior and improved inference efficiency relative to dense models.
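The core idea, bounding the spectral norm of each factor's update, can be sketched as follows; the `max_ratio` rule below is an illustrative cap, not the paper's exact Spectron update:

```python
import numpy as np

def spectral_norm(m):
    # Largest singular value of a matrix.
    return np.linalg.svd(m, compute_uv=False)[0]

def renormalize_update(update, factor, max_ratio=0.1):
    """Rescale a gradient update so its spectral norm stays below a
    fraction of the current factor's spectral norm (illustrative rule;
    max_ratio is an assumed hyperparameter, not from the paper)."""
    s_upd = spectral_norm(update)
    cap = max_ratio * max(spectral_norm(factor), 1e-8)
    if s_upd > cap:
        update = update * (cap / s_upd)
    return update

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 2))         # one low-rank factor (W ≈ A @ B)
g = rng.normal(size=(8, 2)) * 10.0  # an exaggerated raw update
g_safe = renormalize_update(g, A)
```

The point of such a rule is that the loss spikes the paper attributes to uncontrolled spectral-norm growth cannot occur, since every applied update is bounded relative to the factor it modifies.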

[251] Safe Reinforcement Learning via Recovery-based Shielding with Gaussian Process Dynamics Models

Alexander W. Goodall, Francesco Belardinelli

Main category: cs.LG

TL;DR: A recovery-based shielding framework for safe reinforcement learning with provable safety guarantees for unknown nonlinear continuous systems using Gaussian process uncertainty quantification.

Motivation: Reinforcement learning lacks provable safety guarantees for critical applications, especially for unknown nonlinear continuous dynamical systems where safety is paramount.

Method: Integrates backup policy (shield) with RL agent using Gaussian process uncertainty quantification to predict safety violations, dynamically recovering to safe trajectories when needed. Uses experience from shielded agent to build GP models with policy optimization via internal model-based sampling.

Result: Demonstrates strong performance and strict safety-compliance on continuous control environments while enabling unrestricted exploration and sample-efficient learning.

Conclusion: Proposes a novel safe RL framework with provable safety guarantees that maintains performance while ensuring safety through dynamic shielding and uncertainty quantification.

Abstract: Reinforcement learning (RL) is a powerful framework for optimal decision-making and control but often lacks provable guarantees for safety-critical applications. In this paper, we introduce a novel recovery-based shielding framework that enables safe RL with a provable safety lower bound for unknown and non-linear continuous dynamical systems. The proposed approach integrates a backup policy (shield) with the RL agent, leveraging Gaussian process (GP) based uncertainty quantification to predict potential violations of safety constraints, dynamically recovering to safe trajectories only when necessary. Experience gathered by the ‘shielded’ agent is used to construct the GP models, with policy optimization via internal model-based sampling - enabling unrestricted exploration and sample efficient learning, without compromising safety. Empirically our approach demonstrates strong performance and strict safety-compliance on a suite of continuous control environments.
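A minimal sketch of the shielding logic, assuming a standard exact-GP posterior and an upper-confidence-bound trigger (the kernel, confidence multiplier, and toy dynamics cost are all assumptions, not the paper's construction):

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    # Squared-exponential kernel.
    d = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d / ls**2)

def gp_posterior(X, y, Xs, noise=1e-4):
    """Exact GP regression posterior mean and std at test points Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(Xs, Xs)) - (v**2).sum(0)
    return mu, np.sqrt(np.maximum(var, 0.0))

def needs_shield(mu, std, limit, beta=2.0):
    # Hand control to the backup policy when the UCB of the predicted
    # safety cost crosses the limit (beta is an assumed multiplier).
    return mu + beta * std > limit

# Toy 1-D safety cost: observed near x = 0; far from the data the GP is
# uncertain, so the shield engages even though the mean stays low.
X = np.linspace(-1, 1, 8)[:, None]
y = 0.1 * X.ravel() ** 2
mu, std = gp_posterior(X, y, np.array([[0.0], [4.0]]))
flags = needs_shield(mu, std, limit=0.5)
```

This captures the "recover only when necessary" behavior: inside the well-modeled region exploration is unrestricted, and the backup policy activates only where the GP's uncertainty makes a violation plausible.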

[252] Computationally sufficient statistics for Ising models

Abhijith Jayakumar, Shreya Shukla, Marc Vuffray, Andrey Y. Lokhov, Sidhant Misra

Main category: cs.LG

TL;DR: The paper presents methods for learning Ising model parameters using only limited statistical observations rather than full sample configurations, addressing computational challenges in Gibbs distribution learning.

Motivation: Learning Gibbs distributions typically requires full sample configurations, which is impractical for many physical systems. The paper aims to develop computationally efficient methods that work with only limited statistical observations rather than complete data.

Method: The authors use the Ising model as a case study and demonstrate that model parameters can be reconstructed by observing statistics up to order O(γ), where γ is the ℓ₁ width. They also explore settings where prior structural information is available to further reduce observational requirements.

Result: The paper shows that it’s feasible to learn Ising model parameters (couplings and magnetic fields) and infer model structure using only limited statistical observations, with more efficient learning possible when prior structural information is available.

Conclusion: The work provides theoretical foundations for learning Gibbs distributions with limited observational power, establishing trade-offs between computational power and observation requirements, with potential applications to physical systems where full samples are unavailable.

Abstract: Learning Gibbs distributions using only sufficient statistics has long been recognized as a computationally hard problem. On the other hand, computationally efficient algorithms for learning Gibbs distributions rely on access to full sample configurations generated from the model. For many systems of interest that arise in physical contexts, expecting a full sample to be observed is not practical, and hence it is important to look for computationally efficient methods that solve the learning problem with access to only a limited set of statistics. We examine the trade-offs between the power of computation and observation within this scenario, employing the Ising model as a paradigmatic example. We demonstrate that it is feasible to reconstruct the model parameters for a model with $\ell_1$ width $γ$ by observing statistics up to an order of $O(γ)$. This approach allows us to infer the model’s structure and also learn its couplings and magnetic fields. We also discuss a setting where prior information about structure of the model is available and show that the learning problem can be solved efficiently with even more limited observational power.
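To make the notion of "limited statistics" concrete, the sketch below computes exact first- and second-order Ising statistics (magnetizations and pairwise correlations) by brute-force enumeration on a tiny system; the paper's contribution concerns what can be learned from such low-order observations, which this sketch does not implement:

```python
import itertools
import numpy as np

def ising_moments(J, h):
    """Exact <s_i> and <s_i s_j> for a small Ising model
    p(s) ∝ exp(s·h + Σ_{i<j} J_ij s_i s_j), by enumeration over all
    2^n spin configurations (feasible only for a handful of spins)."""
    n = len(h)
    states = np.array(list(itertools.product([-1, 1], repeat=n)), float)
    # 0.5 factor counts each symmetric pair (i, j) once.
    energies = states @ h + 0.5 * np.einsum('ki,ij,kj->k', states, J, states)
    w = np.exp(energies)
    p = w / w.sum()
    m = p @ states                         # magnetizations <s_i>
    C = (states * p[:, None]).T @ states   # correlations <s_i s_j>
    return m, C

J = np.array([[0.0, 0.8], [0.8, 0.0]])  # ferromagnetic coupling
h = np.zeros(2)
m, C = ising_moments(J, h)
```

For this two-spin example the known closed form gives zero magnetization and pair correlation tanh(J₁₂), which the enumeration reproduces.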

[253] Continuous Diffusion Models Can Obey Formal Syntax

Jinwoo Kim, Taylor Berg-Kirkpatrick, Loris D’Antoni

Main category: cs.LG

TL;DR: Training-free guidance method for diffusion language models that uses analytic scores from regular expressions to steer generation toward syntactically valid outputs without additional training.

Motivation: Diffusion language models have advantages over autoregressive models but struggle with imposing discrete syntactic constraints like JSON schema matching. Current methods require training auxiliary classifiers, which is inefficient.

Method: Constructs analytic score estimating probability that latent state decodes to valid string accepted by given regular expression, uses gradient to guide sampling without training. Implemented as Diffinity on PLAID diffusion model.

Result: Achieves 68-96% constraint satisfaction with small perplexity cost, outperforms autoregressive constrained decoding in both constraint satisfaction and output quality on 180 regex constraints over JSON and natural-language benchmarks.

Conclusion: Training-free guidance enables diffusion language models to satisfy formal syntactic constraints effectively, making them more practical for structured output generation tasks.

Abstract: Diffusion language models offer a promising alternative to autoregressive models due to their global, non-causal generation process, but their continuous latent dynamics make discrete constraints – e.g., the output should be a JSON file that matches a given schema – difficult to impose. We introduce a training-free guidance method for steering continuous diffusion language models to satisfy formal syntactic constraints expressed using regular expressions. Our approach constructs an analytic score estimating the probability that a latent state decodes to a valid string accepted by a given regular expression, and uses its gradient to guide sampling, without training auxiliary classifiers. The denoising process targets the base model conditioned on syntactic validity. We implement our method in Diffinity on top of the PLAID diffusion model and evaluate it on 180 regular-expression constraints over JSON and natural-language benchmarks. Diffinity achieves 68-96% constraint satisfaction while incurring only a small perplexity cost relative to unconstrained sampling, outperforming autoregressive constrained decoding in both constraint satisfaction and output quality.
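The quantity being estimated, the probability that a decoded string is accepted by a regular expression, can be computed exactly by enumeration in a toy setting; this stands in for the paper's analytic score over latents (which must also be differentiable), and the alphabet, distributions, and pattern here are invented:

```python
import itertools
import re
import numpy as np

def match_probability(pos_probs, alphabet, pattern):
    """Exact probability that a string drawn from independent
    per-position categorical distributions is accepted by `pattern`
    (brute force over all strings; tractable only for tiny spaces)."""
    rx = re.compile(pattern)
    total = 0.0
    for chars in itertools.product(range(len(alphabet)), repeat=len(pos_probs)):
        s = ''.join(alphabet[c] for c in chars)
        p = np.prod([pos_probs[i][c] for i, c in enumerate(chars)])
        if rx.fullmatch(s):
            total += p
    return total

alphabet = ['a', 'b']
# Three positions; position 0 strongly prefers 'a'.
probs = [np.array([0.9, 0.1]), np.array([0.5, 0.5]), np.array([0.5, 0.5])]
p_valid = match_probability(probs, alphabet, r'ab[ab]')
```

Guidance then amounts to following the gradient of such a validity score with respect to the continuous latent, which requires the analytic (rather than enumerative) construction the paper introduces.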

[254] Regularized Meta-Learning for Improved Generalization

Noor Islam S. Mohammad, Md Muntaqim Meherab

Main category: cs.LG

TL;DR: A regularized meta-learning framework that addresses redundancy, multicollinearity, and overfitting in ensemble methods through redundancy-aware projection, statistical meta-feature augmentation, and cross-validated regularized meta-models.

Motivation: Deep ensemble methods suffer from three practical limitations: redundancy among base models that inflates computational cost and degrades conditioning, unstable weighting under multicollinearity, and overfitting in meta-learning pipelines.

Method: Four-stage pipeline combining redundancy-aware projection (using correlation and MSE thresholds to remove near-collinear predictors), statistical meta-feature augmentation, cross-validated regularized meta-models (Ridge, Lasso, ElasticNet), and final inverse-RMSE blending to mitigate regularizer-selection variance.

Result: Achieved out-of-fold RMSE of 8.582 on Playground Series S6E1 benchmark (100K samples, 72 base models), improving over simple averaging (8.894) and conventional Ridge stacking (8.627), while matching greedy hill climbing (8.603) with 4x faster runtime. Conditioning analysis shows 53.7% reduction in effective matrix condition number.

Conclusion: Regularized meta-learning is positioned as a stable and deployment-efficient stacking strategy for high-dimensional ensemble systems, with consistent contributions from de-duplication, statistical meta-features, and meta-ensemble blending.

Abstract: Deep ensemble methods often improve predictive performance, yet they suffer from three practical limitations: redundancy among base models that inflates computational cost and degrades conditioning, unstable weighting under multicollinearity, and overfitting in meta-learning pipelines. We propose a regularized meta-learning framework that addresses these challenges through a four-stage pipeline combining redundancy-aware projection, statistical meta-feature augmentation, and cross-validated regularized meta-models (Ridge, Lasso, and ElasticNet). Our multi-metric de-duplication strategy removes near-collinear predictors using correlation and MSE thresholds ($τ_{\text{corr}}=0.95$), reducing the effective condition number of the meta-design matrix while preserving predictive diversity. Engineered ensemble statistics and interaction terms recover higher-order structure unavailable to raw prediction columns. A final inverse-RMSE blending stage mitigates regularizer-selection variance. On the Playground Series S6E1 benchmark (100K samples, 72 base models), the proposed framework achieves an out-of-fold RMSE of 8.582, improving over simple averaging (8.894) and conventional Ridge stacking (8.627), while matching greedy hill climbing (8.603) with substantially lower runtime (4 times faster). Conditioning analysis shows a 53.7% reduction in effective matrix condition number after redundancy projection. Comprehensive ablations demonstrate consistent contributions from de-duplication, statistical meta-features, and meta-ensemble blending. These results position regularized meta-learning as a stable and deployment-efficient stacking strategy for high-dimensional ensemble systems.
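The final blending stage is a standard construction that can be sketched directly; this standalone version uses in-sample RMSE for illustration, whereas the paper computes it out-of-fold:

```python
import numpy as np

def inverse_rmse_blend(preds, y_true):
    """Blend several meta-model predictions with weights proportional
    to 1/RMSE, so more accurate models receive more weight."""
    preds = np.asarray(preds)                        # (n_models, n_samples)
    rmse = np.sqrt(((preds - y_true) ** 2).mean(axis=1))
    w = 1.0 / rmse
    w = w / w.sum()                                  # normalize to sum to 1
    return w, w @ preds

y = np.array([1.0, 2.0, 3.0, 4.0])
model_a = y + 0.1   # accurate meta-model
model_b = y + 1.0   # weaker meta-model
w, blend = inverse_rmse_blend([model_a, model_b], y)
```

Because the weights down-weight whichever regularizer happened to fit worse on a given fold, the blend reduces the regularizer-selection variance the paper targets.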

[255] Designing RNAs with Language Models

Milan Gautam, Ning Dai, Tianshuo Zhou, Bowen Xie, David Mathews, Liang Huang

Main category: cs.LG

TL;DR: RNA design reframed as conditional sequence generation using autoregressive language models trained with supervised learning and RL optimization

Motivation: RNA design is computationally challenging due to exponential sequence space and competing folds; traditional optimization approaches rely on per-instance heuristics or constraint-based search

Method: Treat RNA design as conditional sequence generation using autoregressive language model; train with supervised learning on random structure-sequence pairs, then optimize with reinforcement learning using end-to-end metrics; propose methods to select small subset for RL to improve efficiency

Result: Outperforms state-of-the-art systems on key metrics like Boltzmann probability while being 1.7x faster across four datasets

Conclusion: Conditional language model generation is a scalable, task-agnostic alternative to per-instance optimization for RNA design

Abstract: RNA design, the task of finding a sequence that folds into a target secondary structure, has broad biological and biomedical impact but remains computationally challenging due to the exponentially large sequence space and exponentially many competing folds. Traditional approaches treat it as an optimization problem, relying on per-instance heuristics or constraint-based search. We instead reframe RNA design as conditional sequence generation and introduce a reusable neural approximator, instantiated as an autoregressive language model (LM), that maps target structures directly to sequences. We first train our model in a supervised setting on random-induced structure-sequence pairs, and then use reinforcement learning (RL) to optimize end-to-end metrics. We also propose methods to select a small subset for RL that greatly improves RL efficiency and quality. Across four datasets, our approach outperforms state-of-the-art systems on key metrics such as Boltzmann probability while being 1.7x faster, establishing conditional LM generation as a scalable, task-agnostic alternative to per-instance optimization for RNA design. Our code and data are available at https://github.com/KuNyaa/RNA-Design-LM.

[256] Tight Bounds for Logistic Regression with Large Stepsize Gradient Descent in Low Dimension

Michael Crawshaw, Mingrui Liu

Main category: cs.LG

TL;DR: Gradient descent with large step sizes achieves accelerated 1/T² convergence for logistic regression with separable 2D data via analysis of oscillatory dynamics during transition from unstable to stable phases.

Motivation: Recent work showed accelerated convergence rates for logistic regression with separable data using large step sizes, but the analysis was not tight. This paper aims to provide a tighter analysis specifically for two-dimensional data to better understand the oscillatory dynamics during the transition from unstable to stable phases.

Method: The authors analyze gradient descent with large learning rates for logistic regression on separable 2D data. They conduct a fine-grained analysis of the oscillatory dynamics in the subspace orthogonal to the max-margin classifier, focusing on the transition time τ from unstable (non-monotonic loss) to stable (monotonic loss) phases.

Result: They show GD with sufficiently large learning rate η achieves loss smaller than O(1/(ηT)) when T ≥ Ω(n/γ + 1/γ²), where n is dataset size and γ is margin. They provide matching upper and lower bounds on transition time τ up to logarithmic factors, demonstrating tightness of their analysis.

Conclusion: The paper provides a tight analysis of gradient descent dynamics for logistic regression with separable 2D data, revealing that accelerated convergence via large step sizes is achievable through careful understanding of the oscillatory transition phase.

Abstract: We consider the optimization problem of minimizing the logistic loss with gradient descent to train a linear model for binary classification with separable data. With a budget of $T$ iterations, it was recently shown that an accelerated $1/T^2$ rate is possible by choosing a large step size $η= Θ(γ^2 T)$ (where $γ$ is the dataset’s margin) despite the resulting non-monotonicity of the loss. In this paper, we provide a tighter analysis of gradient descent for this problem when the data is two-dimensional: we show that GD with a sufficiently large learning rate $η$ finds a point with loss smaller than $\mathcal{O}(1/(ηT))$, as long as $T \geq Ω(n/γ+ 1/γ^2)$, where $n$ is the dataset size. Our improved rate comes from a tighter bound on the time $τ$ that it takes for GD to transition from unstable (non-monotonic loss) to stable (monotonic loss), via a fine-grained analysis of the oscillatory dynamics of GD in the subspace orthogonal to the max-margin classifier. We also provide a lower bound of $τ$ matching our upper bound up to logarithmic factors, showing that our analysis is tight.
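The setting is easy to reproduce numerically; the sketch below runs plain GD with a large constant step size on separable 2-D data (the data and η are invented, and the loss trajectory may be non-monotone early before the stable phase the paper analyzes):

```python
import numpy as np

def logistic_loss(w, X, y):
    # Numerically stable mean logistic loss log(1 + exp(-y * Xw)).
    margins = y * (X @ w)
    return np.logaddexp(0.0, -margins).mean()

def gd_logistic(X, y, eta, T):
    """Plain gradient descent on the logistic loss with step size eta."""
    w = np.zeros(X.shape[1])
    losses = []
    for _ in range(T):
        margins = y * (X @ w)
        # d/dw mean log(1+exp(-m)) = mean(-sigmoid(-m) * y * x)
        sig = 1.0 / (1.0 + np.exp(np.clip(margins, -500, 500)))
        grad = -(sig * y) @ X / len(y)
        w -= eta * grad
        losses.append(logistic_loss(w, X, y))
    return w, np.array(losses)

# Linearly separable 2-D data with margin close to 1 along e1.
X = np.array([[1.0, 0.2], [1.2, -0.3], [-1.0, 0.4], [-1.1, -0.2]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, losses = gd_logistic(X, y, eta=50.0, T=2000)
```

Despite the aggressive step size, the final loss is small, consistent with the O(1/(ηT)) rate once the iterate passes the transition time τ.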

[257] Geometric separation and constructive universal approximation with two hidden layers

Chanyoung Sung

Main category: cs.LG

TL;DR: Geometric construction of neural networks that separate disjoint compact subsets of R^n, leading to constructive universal approximation theorem for networks with two hidden layers using sigmoidal or ReLU activations.

Motivation: To provide a geometric construction of neural networks that can separate disjoint compact subsets in Euclidean space, and use this construction to obtain a constructive universal approximation theorem that shows neural networks with specific architectures can approximate any continuous function on compact sets.

Method: Geometric construction approach for neural networks that separate disjoint compact subsets. The method shows networks with two hidden layers and either sigmoidal (strictly monotone bounded continuous) or ReLU activation functions can approximate any real-valued continuous function on arbitrary compact sets. For finite sets, the construction simplifies to depth-2 (single hidden layer) networks.

Result: Obtained constructive universal approximation theorem showing networks with two hidden layers and sigmoidal/ReLU activations can approximate any real-valued continuous function on arbitrary compact sets to any prescribed accuracy in uniform norm. For finite sets, achieved sharp depth-2 approximation result.

Conclusion: The paper provides a geometric construction that yields constructive universal approximation results for neural networks with specific architectures, establishing theoretical foundations for neural network approximation capabilities on compact sets.

Abstract: We give a geometric construction of neural networks that separate disjoint compact subsets of $\Bbb R^n$, and use it to obtain a constructive universal approximation theorem. Specifically, we show that networks with two hidden layers and either a sigmoidal activation (i.e., strictly monotone bounded continuous) or the ReLU activation can approximate any real-valued continuous function on an arbitrary compact set $K\subset\Bbb R^n$ to any prescribed accuracy in the uniform norm. For finite $K$, the construction simplifies and yields a sharp depth-2 (single hidden layer) approximation result.
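A concrete 1-D instance of this kind of geometric separation: a one-hidden-layer ReLU network that is exactly 1 on a compact interval and 0 away from it (the paper's construction handles general disjoint compact sets in Rⁿ, which needs the second hidden layer; this sketch only shows the flavor in one dimension):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bump(x, a, b, delta):
    """Four-neuron ReLU network equal to 1 on [a, b], 0 outside
    [a - delta, b + delta], and piecewise linear in between."""
    up = (relu(x - (a - delta)) - relu(x - a)) / delta      # ramp up to 1 at a
    down = (relu(x - b) - relu(x - (b + delta))) / delta    # ramp down after b
    return up - down

x = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
vals = bump(x, a=0.0, b=1.0, delta=0.1)
```

Sums of such separating bumps, weighted by function values on a fine cover, are exactly the ingredient that turns a separation result into a uniform-norm approximation theorem.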

[258] A Theoretical Analysis of Mamba’s Training Dynamics: Filtering Relevant Features for Generalization in State Space Models

Mugunthan Shandirasegaran, Hongkang Li, Songyang Zhang, Meng Wang, Shuai Zhang

Main category: cs.LG

TL;DR: Theoretical analysis of a simplified selective state space model representative of Mamba shows guaranteed generalization with feature selection through gating, providing insights into non-attention sequence modeling.

Motivation: While Mamba and selective SSMs show empirical success as non-attention sequence models, their theoretical foundations remain underexplored compared to Transformers. The paper aims to provide principled theoretical understanding of when and why these models learn efficiently.

Method: Analyzes a simplified Mamba block: single-layer, single-head selective SSM with input-dependent gating followed by a two-layer MLP trained via gradient descent. Uses structured data model with tokens containing class-relevant and irrelevant patterns under noise, examining majority-voting and locality-structured regimes.

Result: Proves guaranteed generalization with non-asymptotic sample complexity and convergence rate bounds that improve with stronger signal and less noise. Shows gating vector aligns with class-relevant features while ignoring irrelevant ones, formalizing feature-selection role similar to attention but through selective recurrence.

Conclusion: Provides theoretical foundation for selective SSMs, showing they can efficiently learn through feature selection via gating, offering theoretical counterpoint to Transformer-centric explanations and insight into non-attention sequence modeling.

Abstract: The recent empirical success of Mamba and other selective state space models (SSMs) has renewed interest in non-attention architectures for sequence modeling, yet their theoretical foundations remain underexplored. We present a first-step analysis of generalization and learning dynamics for a simplified but representative Mamba block: a single-layer, single-head selective SSM with input-dependent gating, followed by a two-layer MLP trained via gradient descent (GD). Our study adopts a structured data model with tokens that include both class-relevant and class-irrelevant patterns under token-level noise and examines two canonical regimes: majority-voting and locality-structured data sequences. We prove that the model achieves guaranteed generalization by establishing non-asymptotic sample complexity and convergence rate bounds, which improve as the effective signal increases and the noise decreases. Furthermore, we show that the gating vector aligns with class-relevant features while ignoring irrelevant ones, thereby formalizing a feature-selection role similar to attention but realized through selective recurrence. Numerical experiments on synthetic data justify our theoretical results. Overall, our results provide principled insight into when and why Mamba-style selective SSMs learn efficiently, offering a theoretical counterpoint to Transformer-centric explanations.
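The feature-selection role of gating can be caricatured in a few lines; this is a deliberately simplified input-dependent-gated recurrence, not Mamba's actual parameterization, and the gate vector, decay, and token vectors are invented:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_ssm(xs, w_gate, a=0.9):
    """Minimal selective recurrence: the gate g_t = sigmoid(w_gate · x_t)
    decides how much of each token enters the state, so tokens aligned
    with w_gate are kept and orthogonal tokens are damped."""
    h = np.zeros_like(xs[0])
    for x in xs:
        g = sigmoid(w_gate @ x)   # input-dependent scalar gate
        h = a * h + g * x         # decayed state plus gated token
    return h

# A gate aligned with the "class-relevant" direction e1: the relevant
# token passes nearly untouched, the irrelevant one is suppressed.
w_gate = np.array([4.0, 0.0])
irrelevant = np.array([0.0, 1.0])
relevant = np.array([1.0, 0.0])
h = selective_ssm([irrelevant, relevant], w_gate)
```

The paper's result is the training-dynamics statement that gradient descent drives the learned gate toward exactly this alignment with class-relevant patterns.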

[259] On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs

Rosie Zhao, Anshul Shah, Xiaoyu Zhu, Xinke Deng, Zhongyu Jiang, Yang Yang, Joerg Liebelt, Arnab Mondal

Main category: cs.LG

TL;DR: RL fine-tuning improves VLMs on visual reasoning but creates vulnerabilities to textual perturbations, revealing an accuracy-faithfulness trade-off where improved benchmark performance comes at the cost of reasoning robustness and faithfulness.

Motivation: While RL fine-tuning has successfully enhanced LLMs for reasoning tasks, its extension to VLMs has revealed vulnerabilities including weak visual grounding, hallucinations, and over-reliance on textual cues. The paper aims to understand these limitations and the trade-offs between accuracy and faithfulness in RL-tuned VLMs.

Method: The study uses controlled textual perturbations (misleading captions or incorrect chain-of-thought traces) to test VLM robustness. It analyzes RL fine-tuning dynamics, employs entropy-based metrics to measure uncertainty, and explores interventions including adversarial augmentation and faithfulness-aware rewards to address identified vulnerabilities.

Result: Textual perturbations cause substantial drops in robustness and confidence, with effects amplified when considering CoT consistency. RL fine-tuning creates an accuracy-faithfulness trade-off: while improving benchmark accuracy, it erodes reasoning reliability and robustness. Adversarial augmentation helps but doesn’t prevent faithfulness drift, and combining it with faithfulness rewards risks shortcut strategies.

Conclusion: Accuracy-only evaluations are insufficient for VLMs; training and assessment must jointly emphasize correctness, robustness, and faithfulness of visually grounded reasoning. The findings highlight fundamental limitations in current RL fine-tuning approaches for multimodal models.

Abstract: Reinforcement learning (RL) fine-tuning has become a key technique for enhancing large language models (LLMs) on reasoning-intensive tasks, motivating its extension to vision language models (VLMs). While RL-tuned VLMs improve on visual reasoning benchmarks, they remain vulnerable to weak visual grounding, hallucinations, and over-reliance on textual cues. We show that simple, controlled textual perturbations–misleading captions or incorrect chain-of-thought (CoT) traces–cause substantial drops in robustness and confidence, and that these effects are more pronounced when CoT consistency is taken into account across open-source multimodal reasoning models. Entropy-based metrics further show that these perturbations reshape model uncertainty and probability mass on the correct option, exposing model-specific trends in miscalibration. To better understand these vulnerabilities, we further analyze RL fine-tuning dynamics and uncover an accuracy-faithfulness trade-off: fine-tuning raises benchmark accuracy, but can simultaneously erode the reliability of the accompanying CoT and its robustness to contextual shifts. Although adversarial augmentation improves robustness, it does not by itself prevent faithfulness drift. Incorporating a faithfulness-aware reward can restore alignment between answers and reasoning, but when paired with augmentation, training risks collapsing onto shortcut strategies and robustness remains elusive. Together, these findings highlight the limitations of accuracy-only evaluations and motivate training and assessment protocols that jointly emphasize correctness, robustness, and the faithfulness of visually grounded reasoning.
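The entropy-based metrics are straightforward to instantiate; this sketch computes the Shannon entropy of the softmax over answer options, the basic quantity behind such uncertainty analyses (the logit values are invented to mimic a confident model destabilized by a misleading caption):

```python
import numpy as np

def option_entropy(logits):
    """Shannon entropy (in nats) of the softmax over answer options;
    higher entropy means the model spreads probability mass more evenly."""
    z = logits - logits.max()              # stabilize the softmax
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(np.maximum(p, 1e-12))).sum())

confident = np.array([5.0, 0.0, 0.0, 0.0])   # before perturbation
perturbed = np.array([1.0, 0.8, 0.0, 0.0])   # after a misleading caption
```

Tracking how this entropy, and the mass on the correct option, shift under controlled perturbations is what exposes the model-specific miscalibration trends the paper reports.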

[260] Constraint-Rectified Training for Efficient Chain-of-Thought

Qinhang Wu, Sen Lin, Ming Zhang, Yingbin Liang, Ness B. Shroff

Main category: cs.LG

TL;DR: CRT is a constrained optimization framework for efficient reasoning in LLMs that balances reasoning length and accuracy through reference-guarded training and two-stage optimization.

Motivation: Chain-of-Thought reasoning improves LLM capabilities but incurs high inference costs and redundancy (overthinking). Existing heuristic approaches for efficient reasoning suffer from accuracy drops and hyperparameter sensitivity.

Method: Constraint-Rectified Training (CRT) uses reference-guarded constrained optimization that alternates between minimizing reasoning length and rectifying accuracy only when performance falls below reference levels. A two-stage scheme first discovers shortest reliable reasoning patterns, then refines accuracy under a learned length budget.

Result: CRT consistently reduces token usage while maintaining answer quality, improves reasoning efficiency by shortening responses and reducing internal language redundancy, and yields intermediate checkpoints for fine-grained control over reasoning verbosity.

Conclusion: CRT provides a principled, stable framework for efficient reasoning that balances length and accuracy, enabling fine-grained control over reasoning verbosity without retraining.

Abstract: Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), especially when combined with reinforcement learning (RL) based post-training methods. While longer reasoning traces can improve answer quality and unlock abilities such as self-correction, they also incur high inference costs and often introduce redundant steps, known as overthinking. Recent research seeks to develop efficient reasoning strategies that balance reasoning length and accuracy, either through length-aware reward design or prompt-based calibration. However, these heuristic-based approaches may suffer from severe accuracy drop and be very sensitive to hyperparameters. To address these problems, we introduce CRT (Constraint-Rectified Training), a principled post-training framework based on reference-guarded constrained optimization, yielding a more stable and interpretable formulation for efficient reasoning. CRT alternates between minimizing reasoning length and rectifying accuracy only when performance falls below the reference, enabling stable and effective pruning of redundant reasoning. We further extend CRT with a two-stage training scheme that first discovers the shortest reliable reasoning patterns and then refines accuracy under a learnt length budget, preventing the re-emergence of verbose CoT. Our comprehensive evaluation shows that this framework consistently reduces token usage while maintaining answer quality at a robust and reliable level. Further analysis reveals that CRT improves reasoning efficiency not only by shortening responses but also by reducing internal language redundancy, leading to a new evaluation metric. Moreover, CRT-based training naturally yields a sequence of intermediate checkpoints that span a spectrum of explanation lengths while preserving correctness, enabling fine-grained control over reasoning verbosity without retraining.
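The alternation at the heart of CRT can be illustrated with a toy scalar problem; here "accuracy" is a made-up increasing function of "length," acc(L) = min(1, L/5), purely to show that the alternation settles at the shortest length meeting the reference:

```python
def crt_alternation(length0, ref_acc=0.8, step=0.01, iters=2000):
    """Toy CRT loop: shrink the reasoning length while accuracy stays
    at or above the reference, and rectify (grow it) whenever accuracy
    dips below. Not the paper's actual optimizer, just its control flow."""
    L = length0
    for _ in range(iters):
        acc = min(1.0, L / 5.0)
        if acc < ref_acc:
            L += step        # rectification step: restore accuracy
        else:
            L -= step        # minimization step: prune length
    return L

L_final = crt_alternation(length0=10.0)
```

With acc(L) = min(1, L/5) and ref_acc = 0.8, the iterate settles near L = 4, the shortest length that still meets the reference, mirroring how CRT prunes redundant reasoning only while performance holds.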

[261] Analytical Results for Two Exponential Family Distributions in Hierarchical Dirichlet Processes

Naiqi Li

Main category: cs.LG

TL;DR: Derives closed-form expressions for Gamma-Poisson and Normal-Gamma-Normal conjugate pairs within Hierarchical Dirichlet Process framework, extending HDP beyond Dirichlet-multinomial to exponential family distributions.

Motivation: HDP provides flexible Bayesian nonparametric framework for grouped data, but existing applications focus mainly on Dirichlet-multinomial conjugate structure. The framework is more general and can accommodate broader class of conjugate prior-likelihood pairs, particularly exponential family distributions which offer unified modeling paradigm.

Method: Investigates analytic results for Poisson and normal distributions within HDP framework. Derives explicit closed-form expressions for Gamma-Poisson and Normal-Gamma-Normal conjugate pairs under hierarchical Dirichlet process construction. Provides detailed derivations and proofs to clarify mathematical structure and demonstrate systematic exploitation of conjugacy.

Result: Successfully extends applicability of HDP beyond Dirichlet-multinomial setting. Provides practical analytic results for researchers employing hierarchical Bayesian nonparametrics with exponential family distributions.

Conclusion: The work demonstrates how conjugacy can be systematically exploited in hierarchical nonparametric models, extending HDP’s applicability to broader class of distributions including Poisson and normal distributions through explicit analytic derivations.

Abstract: The Hierarchical Dirichlet Process (HDP) provides a flexible Bayesian nonparametric framework for modeling grouped data with a shared yet unbounded collection of mixture components. While existing applications of the HDP predominantly focus on the Dirichlet-multinomial conjugate structure, the framework itself is considerably more general and, in principle, accommodates a broad class of conjugate prior-likelihood pairs. In particular, exponential family distributions offer a unified and analytically tractable modeling paradigm that encompasses many commonly used distributions. In this paper, we investigate analytic results for two important members of the exponential family within the HDP framework: the Poisson distribution and the normal distribution. We derive explicit closed-form expressions for the corresponding Gamma-Poisson and Normal-Gamma-Normal conjugate pairs under the hierarchical Dirichlet process construction. Detailed derivations and proofs are provided to clarify the underlying mathematical structure and to demonstrate how conjugacy can be systematically exploited in hierarchical nonparametric models. Our work extends the applicability of the HDP beyond the Dirichlet-multinomial setting and furnishes practical analytic results for researchers employing hierarchical Bayesian nonparametrics.
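The Gamma-Poisson conjugacy the paper builds on has a simple textbook closed form; a minimal sketch of the component-level update (this is the standard conjugate update, not the paper's HDP-level expressions):

```python
def gamma_poisson_posterior(alpha, beta, counts):
    """Gamma(alpha, beta) prior (rate parameterization) with a
    Poisson likelihood: the posterior is again Gamma, with
    alpha' = alpha + sum(x_i) and beta' = beta + n."""
    return alpha + sum(counts), beta + len(counts)

def posterior_predictive_mean(alpha, beta):
    """Mean of the (negative binomial) posterior predictive."""
    return alpha / beta
```

Within an HDP mixture, updates of this form are applied per component, with the hierarchy sharing components across groups; the paper's contribution is the explicit closed forms at that hierarchical level.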

[262] Flow-Factory: A Unified Framework for Reinforcement Learning in Flow-Matching Models

Bowen Ping, Chengyou Jia, Minnan Luo, Hangwei Qian, Ivor Tsang

Main category: cs.LG

TL;DR: Flow-Factory is a unified framework for aligning diffusion and flow-matching models with human preferences via reinforcement learning, offering modular architecture, production optimizations, and support for multiple algorithms and models.

DetailsMotivation: Current reinforcement learning approaches for aligning diffusion and flow-matching models suffer from fragmented codebases, model-specific implementations, and engineering complexity, creating barriers for researchers and practitioners.

Method: Flow-Factory uses a modular, registry-based architecture that decouples algorithms, models, and rewards, enabling seamless integration of new components. It supports GRPO, DiffusionNFT, and AWM algorithms across Flux, Qwen-Image, and WAN video models.

Result: The framework provides production-ready memory optimization, flexible multi-reward training, and seamless distributed training support, empowering researchers to rapidly prototype and scale innovations with minimal implementation overhead.

Conclusion: Flow-Factory addresses the fragmentation in RL-based alignment for generative models by offering a unified, extensible framework that reduces engineering complexity and accelerates research in this domain.

Abstract: Reinforcement learning has emerged as a promising paradigm for aligning diffusion and flow-matching models with human preferences, yet practitioners face fragmented codebases, model-specific implementations, and engineering complexity. We introduce Flow-Factory, a unified framework that decouples algorithms, models, and rewards through a modular, registry-based architecture. This design enables seamless integration of new algorithms and architectures, as demonstrated by our support for GRPO, DiffusionNFT, and AWM across Flux, Qwen-Image, and WAN video models. By minimizing implementation overhead, Flow-Factory empowers researchers to rapidly prototype and scale future innovations with ease. Flow-Factory provides production-ready memory optimization, flexible multi-reward training, and seamless distributed training support. The codebase is available at https://github.com/X-GenGroup/Flow-Factory.
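A registry-based decoupling of the kind the abstract describes is commonly implemented with a decorator pattern; a minimal sketch (names and API are assumptions, not Flow-Factory's actual interface):

```python
ALGORITHM_REGISTRY = {}

def register_algorithm(name):
    """Registry decorator in the spirit of the abstract's design;
    the real Flow-Factory names and signatures may differ."""
    def decorator(cls):
        ALGORITHM_REGISTRY[name] = cls
        return cls
    return decorator

@register_algorithm("grpo")
class GRPOTrainer:
    def step(self, batch):
        # placeholder for a real policy-gradient update
        return f"grpo update on {len(batch)} samples"

# New algorithms plug in by name, without touching model or reward code:
trainer = ALGORITHM_REGISTRY["grpo"]()
```

The point of the pattern is that adding DiffusionNFT- or AWM-style trainers only requires registering a new class, leaving model and reward code untouched.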

[263] AMPS: Adaptive Modality Preference Steering via Functional Entropy

Zihan Huang, Xintong Li, Rohan Surana, Tong Yu, Rui Wang, Julian McAuley, Jingbo Shang, Junda Wu

Main category: cs.LG

TL;DR: Proposes instance-aware steering for MLLMs to address modality preference issues by dynamically adjusting steering intensity based on sample-specific sensitivity, outperforming uniform steering approaches.

DetailsMotivation: MLLMs often exhibit modality preference (favoring one modality over another), and existing uniform steering approaches are problematic - strong steering impairs standard inference while weak steering is ineffective. Different multimodal instances have varying steering sensitivity, making a single global strength difficult to calibrate.

Method: 1) Introduces instance-aware diagnostic metric to quantify each modality’s information contribution and reveal sample-specific susceptibility to steering. 2) Proposes scaling strategy that reduces steering for sensitive samples. 3) Develops learnable module that infers scaling patterns for instance-aware control of modality preference.

Result: Experimental results show instance-aware steering outperforms conventional steering in modulating modality preference, achieving effective adjustment while keeping generation error rates low.

Conclusion: Instance-aware steering provides a more effective approach to address modality preference in MLLMs by dynamically adjusting steering intensity based on sample-specific characteristics, avoiding the limitations of uniform steering approaches.

Abstract: Multimodal Large Language Models (MLLMs) often exhibit significant modality preference, which is a tendency to favor one modality over another. Depending on the input, they may over-rely on linguistic priors relative to visual evidence, or conversely over-attend to visually salient cues while overlooking facts in textual contexts. Prior work has applied a uniform steering intensity to adjust the modality preference of MLLMs. However, strong steering can impair standard inference and increase error rates, whereas weak steering is often ineffective. In addition, because steering sensitivity varies substantially across multimodal instances, a single global strength is difficult to calibrate. To address this limitation with minimal disruption to inference, we introduce an instance-aware diagnostic metric that quantifies each modality’s information contribution and reveals sample-specific susceptibility to steering. Building on these insights, we propose a scaling strategy that reduces steering for sensitive samples and a learnable module that infers scaling patterns, enabling instance-aware control of modality preference. Experimental results show that our instance-aware steering outperforms conventional steering in modulating modality preference, achieving effective adjustment while keeping generation error rates low.
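The sensitivity-dependent scaling can be sketched as follows (the inverse scaling rule and all names are assumptions for illustration; the paper learns the scaling with a module rather than fixing a formula):

```python
import numpy as np

def steer_hidden(hidden, direction, base_strength, sensitivity):
    """Apply a steering vector to a hidden state, shrinking the
    applied strength for steering-sensitive samples. The
    1/(1+sensitivity) rule here is an illustrative choice."""
    scale = base_strength / (1.0 + sensitivity)
    return hidden + scale * direction
```

A sample diagnosed as highly sensitive (large `sensitivity`) gets a gentler push along the steering direction, while robust samples receive close to the full base strength.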

[264] Exploring Accurate and Transparent Domain Adaptation in Predictive Healthcare via Concept-Grounded Orthogonal Inference

Pengfei Hu, Chang Lu, Feifan Liu, Yue Ning

Main category: cs.LG

TL;DR: ExtraCare: A domain adaptation method for clinical event prediction that decomposes patient representations into invariant and covariant components for better performance and interpretability in EHR data.

DetailsMotivation: Clinical event prediction models on EHR data suffer performance degradation under different data distributions, and existing domain adaptation methods lack transparency needed for clinical trust and safety.

Method: Decomposes patient representations into invariant (domain-invariant) and covariant (domain-specific) components, supervises both components, enforces orthogonality between them, and provides interpretability by mapping sparse latent dimensions to medical concepts with targeted ablations.

Result: Demonstrates superior performance over feature alignment models on two real-world EHR datasets across multiple domain partition settings, with enhanced transparency through accurate predictions and explanations from case studies.

Conclusion: ExtraCare provides both improved domain adaptation performance and human-understandable explanations for clinical event prediction, addressing the transparency gap in clinical AI adoption.

Abstract: Deep learning models for clinical event prediction on electronic health records (EHR) often suffer performance degradation when deployed under different data distributions. While domain adaptation (DA) methods can mitigate such shifts, its “black-box” nature prevents widespread adoption in clinical practice where transparency is essential for trust and safety. We propose ExtraCare to decompose patient representations into invariant and covariant components. By supervising these two components and enforcing their orthogonality during training, our model preserves label information while exposing domain-specific variation at the same time for more accurate predictions than most feature alignment models. More importantly, it offers human-understandable explanations by mapping sparse latent dimensions to medical concepts and quantifying their contributions via targeted ablations. ExtraCare is evaluated on two real-world EHR datasets across multiple domain partition settings, demonstrating superior performance along with enhanced transparency, as evidenced by its accurate predictions and explanations from extensive case studies.
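The orthogonality constraint between invariant and covariant components is often enforced through a cross-covariance penalty; a minimal sketch (a common surrogate, assumed here; the paper's exact loss may differ):

```python
import numpy as np

def orthogonality_penalty(z_inv, z_cov):
    """Squared Frobenius norm of the batch cross-covariance between
    invariant and covariant representations (batch x dim arrays).
    Driving this to zero decorrelates the two components."""
    zi = z_inv - z_inv.mean(axis=0)
    zc = z_cov - z_cov.mean(axis=0)
    cross = zi.T @ zc / len(zi)
    return float((cross ** 2).sum())
```

Added to the two supervised losses, this term pushes domain-invariant label information and domain-specific variation into separate, non-redundant subspaces.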

[265] SD-MoE: Spectral Decomposition for Effective Expert Specialization

Ruijun Huang, Fang Dong, Xin Zhang, Hengjie Cao, Zhendong Huang, Anrui Chen, Jixian Zhou, Mengyi Chen, Yifeng Yang, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Fan Yang, Tun Lu, Chun Zhang, Li Shang

Main category: cs.LG

TL;DR: SD-MoE addresses expert specialization failure in Mixture-of-Experts models by decoupling dominant spectral components in parameter and gradient spaces to improve model performance and effective capacity.

DetailsMotivation: Current MoE architectures suffer from poor expert specialization where experts become functionally similar or act as shared experts, limiting effective capacity and model performance. This is driven by overlapping dominant spectral components in parameters and aligned gradient subspaces across experts.

Method: Proposes Spectral-Decoupled MoE (SD-MoE) which decomposes both parameter and gradient in spectral space to address the overlapping dominant spectral components. The method can be seamlessly integrated into existing MoE architectures like Qwen and DeepSeek with minimal additional computation.

Result: SD-MoE improves performance across downstream tasks and enables effective expert specialization. The approach works with various existing MoE architectures while adding minimal computational overhead.

Conclusion: Spectral analysis reveals fundamental limitations in MoE expert specialization, and SD-MoE provides an effective solution through spectral decoupling that enhances model performance and specialization without significant computational cost.

Abstract: Mixture-of-Experts (MoE) architectures scale Large Language Models via expert specialization induced by conditional computation. In practice, however, expert specialization often fails: some experts become functionally similar, while others function as de facto shared experts, limiting the effective capacity and model performance. In this work, we analyze parameter and gradient spaces from a spectral perspective and uncover that (1) experts share highly overlapping dominant spectral components in their parameters, (2) dominant gradient subspaces are strongly aligned across experts, driven by ubiquitous low-rank structure in human corpora, and (3) gating mechanisms preferentially route inputs along these dominant directions, further limiting specialization. To address this, we propose Spectral-Decoupled MoE (SD-MoE), which decomposes both parameter and gradient in the spectral space. SD-MoE improves performance across downstream tasks, enables effective expert specialization, incurs minimal additional computation, and can be seamlessly integrated into a wide range of existing MoE architectures, including Qwen and DeepSeek.
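The spectral overlap diagnosed in point (1) can be measured directly via the SVD; a small sketch of such a diagnostic (illustrative only, not SD-MoE's training procedure):

```python
import numpy as np

def dominant_subspace_overlap(W1, W2, k=2):
    """Overlap between the top-k right singular subspaces of two
    expert weight matrices, in [0, 1]. Values near 1 indicate the
    shared dominant spectral components the paper identifies."""
    V1 = np.linalg.svd(W1)[2][:k]   # top-k right singular vectors
    V2 = np.linalg.svd(W2)[2][:k]
    return float(np.linalg.norm(V1 @ V2.T) ** 2 / k)
```

Two functionally redundant experts score near 1 under this metric; SD-MoE's decoupling aims to push expert pairs toward the orthogonal (near-0) end.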

[266] Fractional Order Federated Learning for Battery Electric Vehicle Energy Consumption Modeling

Mohammad Partohaghighi, Roummel Marcia, Bruce J. West, YangQuan Chen

Main category: cs.LG

TL;DR: FO-RI-FedAvg improves federated learning stability for electric vehicles by combining roughness-informed regularization and fractional-order optimization to handle intermittent connectivity and client variation.

DetailsMotivation: Federated learning on battery electric vehicles faces severe instability due to intermittent connectivity, time-varying client participation, and pronounced client-to-client variation from diverse operating conditions. Conventional FedAvg and advanced methods suffer from excessive drift and degraded convergence under these realistic constraints.

Method: Introduces Fractional-Order Roughness-Informed Federated Averaging (FO-RI-FedAvg), a lightweight modular extension of FedAvg with two client-side mechanisms: (1) adaptive roughness-informed proximal regularization that dynamically tunes pull toward global model based on local loss-landscape roughness, and (2) non-integer-order local optimization incorporating short-term memory to smooth conflicting update directions. Preserves standard FedAvg server aggregation with only element-wise operations.

Result: Experiments on two real-world BEV energy prediction datasets (VED and eVED) show FO-RI-FedAvg achieves improved accuracy and more stable convergence compared to strong federated baselines, particularly under reduced client participation.

Conclusion: FO-RI-FedAvg effectively addresses federated learning instability in connected electric vehicles through complementary roughness-informed regularization and fractional-order optimization, offering a lightweight solution with improved performance under realistic constraints.

Abstract: Federated learning on connected electric vehicles (BEVs) faces severe instability due to intermittent connectivity, time-varying client participation, and pronounced client-to-client variation induced by diverse operating conditions. Conventional FedAvg and many advanced methods can suffer from excessive drift and degraded convergence under these realistic constraints. This work introduces Fractional-Order Roughness-Informed Federated Averaging (FO-RI-FedAvg), a lightweight and modular extension of FedAvg that improves stability through two complementary client-side mechanisms: (i) adaptive roughness-informed proximal regularization, which dynamically tunes the pull toward the global model based on local loss-landscape roughness, and (ii) non-integer-order local optimization, which incorporates short-term memory to smooth conflicting update directions. The approach preserves standard FedAvg server aggregation, adds only element-wise operations with amortizable overhead, and allows independent toggling of each component. Experiments on two real-world BEV energy prediction datasets, VED and its extended version eVED, show that FO-RI-FedAvg achieves improved accuracy and more stable convergence compared to strong federated baselines, particularly under reduced client participation.
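The two client-side mechanisms combine naturally in a single update rule; a simplified sketch (the geometric memory decay here is an assumption standing in for the Grunwald-Letnikov-style fractional-order coefficients, and all constants and names are illustrative):

```python
import numpy as np

def fo_ri_client_step(w, w_global, grad_history, roughness,
                      lr=0.1, mu0=0.5, decay=0.7):
    """One illustrative client update: (i) a short-term-memory
    gradient built from decaying weights over past gradients
    (most recent first), plus (ii) a proximal pull toward the
    global model scaled by local loss-landscape roughness."""
    weights = [decay ** k for k in range(len(grad_history))]
    mem = sum(c * g for c, g in zip(weights, reversed(grad_history)))
    mem = mem / sum(weights)
    prox = mu0 * roughness * (w - w_global)
    return w - lr * (mem + prox)
```

On smooth local landscapes (`roughness` near 0) the client behaves like memory-smoothed SGD; as roughness grows, the pull toward the global model strengthens, damping client drift.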

[267] VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction

Xin-Qiang Cai, Masashi Sugiyama

Main category: cs.LG

TL;DR: VI-CuRL is a verifier-independent curriculum reinforcement learning framework that uses model confidence to create training curricula, addressing destructive gradient variance in verifier-free RL for LLMs.

DetailsMotivation: Current RLVR methods rely on external verifiers which limit scalability, and verifier-free methods suffer from destructive gradient variance leading to training collapse. The paper aims to develop a verifier-independent approach that can maintain training stability.

Method: VI-CuRL uses the model’s intrinsic confidence to construct a curriculum, prioritizing high-confidence samples to manage bias-variance trade-off and reduce action and problem variance. The framework provides theoretical guarantees of asymptotic unbiasedness.

Result: VI-CuRL promotes training stability and consistently outperforms verifier-independent baselines across six challenging benchmarks, both with and without verifiers.

Conclusion: The proposed verifier-independent curriculum reinforcement learning framework effectively addresses gradient variance issues in verifier-free settings, enabling scalable RL for LLMs without external verification.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a dominant paradigm for enhancing Large Language Model (LLM) reasoning, yet its reliance on external verifiers limits its scalability. Recent findings suggest that RLVR primarily functions by eliciting latent capabilities, motivating the development of verifier-free algorithms. However, in such settings, standard methods like Group Relative Policy Optimization face a critical challenge: destructive gradient variance that often leads to training collapse. To address this issue, we introduce Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that leverages the model’s intrinsic confidence to construct a curriculum independent from external verifiers. By prioritizing high-confidence samples, VI-CuRL effectively manages the bias-variance trade-off, specifically targeting the reduction of action and problem variance. We provide a rigorous theoretical analysis, proving that our estimator guarantees asymptotic unbiasedness. Empirically, VI-CuRL promotes stability and consistently outperforms verifier-independent baselines across six challenging benchmarks, with and without verifiers.
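The confidence-prioritized selection at the heart of the curriculum can be sketched in a few lines (the hard threshold and names are illustrative; VI-CuRL's actual schedule is more elaborate):

```python
def confidence_curriculum(samples, confidences, threshold=0.5):
    """Order training samples by the model's own confidence,
    highest first, keeping only those above a threshold. No
    external verifier is consulted at any point."""
    kept = [(c, s) for c, s in zip(confidences, samples) if c >= threshold]
    kept.sort(key=lambda pair: -pair[0])
    return [s for _, s in kept]
```

Training on the high-confidence head of this ordering is what suppresses the destructive gradient variance that otherwise collapses verifier-free training.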

[268] Power Interpretable Causal ODE Networks: A Unified Model for Explainable Anomaly Detection and Root Cause Analysis in Power Systems

Yue Sun, Likai Wang, Rick S. Blum, Parv Venkitasubramaniam

Main category: cs.LG

TL;DR: PICODE Networks: A causality-informed architecture for time series anomaly detection that provides explanations including root cause localization, anomaly type classification, and shape characterization in power systems.

DetailsMotivation: Existing machine learning models for time series anomaly detection in cyber-physical systems like power grids operate as black boxes, providing only binary outputs without explanations. There's a need for interpretable models that can explain why something is detected as an anomaly, including identifying anomaly type, origin, and shape.

Method: Proposes Power Interpretable Causality Ordinary Differential Equation (PICODE) Networks, a unified architecture that jointly performs anomaly detection with explanation. The method uses causality-informed ordinary differential equations to extract causal graphs and align anomaly shapes with weight changes in these graphs.

Result: Experimental results in power systems show PICODE achieves competitive detection performance while offering improved interpretability and reduced reliance on labeled data or external causal graphs. Theoretical results demonstrate alignment between anomaly function shapes and changes in extracted causal graph weights.

Conclusion: PICODE Networks provide a unified solution for interpretable anomaly detection in cyber-physical systems, addressing the black-box limitation of existing models by offering comprehensive explanations including root cause localization, anomaly type classification, and shape characterization.

Abstract: Anomaly detection and root cause analysis (RCA) are critical for ensuring the safety and resilience of cyber-physical systems such as power grids. However, existing machine learning models for time series anomaly detection often operate as black boxes, offering only binary outputs without any explanation, such as identifying anomaly type and origin. To address this challenge, we propose Power Interpretable Causality Ordinary Differential Equation (PICODE) Networks, a unified, causality-informed architecture that jointly performs anomaly detection and explains why an input is flagged as anomalous, including root cause localization, anomaly type classification, and anomaly shape characterization. Experimental results in power systems demonstrate that PICODE achieves competitive detection performance while offering improved interpretability and reduced reliance on labeled data or external causal graphs. We provide theoretical results demonstrating the alignment between the shape of anomaly functions and the changes in the weights of the extracted causal graphs.
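The causal-ODE view can be illustrated with a minimal linear system in which matrix entries play the role of causal-graph edge weights (a stand-in for the learned dynamics, not the paper's network):

```python
import numpy as np

def simulate_causal_ode(A, x0, dt=0.01, steps=100):
    """Euler-integrate dx/dt = A x, where A[j, i] encodes the causal
    influence of measurement i on measurement j."""
    x = np.asarray(x0, dtype=float).copy()
    traj = [x.copy()]
    for _ in range(steps):
        x = x + dt * (A @ x)
        traj.append(x.copy())
    return np.array(traj)
```

Under this view, an anomaly at one node shows up as a change in the weights of its outgoing edges in the extracted graph, which is what makes root-cause localization possible.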

[269] Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers

Anrui Chen, Ruijun Huang, Xin Zhang, Fang Dong, Hengjie Cao, Zhendong Huang, Yifeng Yang, Mengyi Chen, Jixian Zhou, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Tun Lu, Fan Yang, Li Shang

Main category: cs.LG

TL;DR: MH-MoE improves continual learning in MoE Transformers by addressing pre-routing bottlenecks through head-wise routing to reduce feature composition collisions and forgetting.

DetailsMotivation: MoE architectures should be good for continual learning due to sparse routing reducing interference, but MoE Transformers still experience substantial forgetting. The authors identify a pre-routing bottleneck where multi-head attention concatenates head signals into a single router input, forcing routing to act on co-occurring feature compositions rather than separable head channels.

Method: Proposes MH-MoE (Multi-Head Mixture-of-Experts) which performs head-wise routing over sub-representations to increase routing granularity and reduce composition collisions. Quantifies routing collisions via route-wise effective composition number N_eff.

Result: MH-MoE effectively mitigates forgetting, reducing Backward Transfer (BWT) on Qwen3-0.6B from 11.2% (LoRAMoE) to 4.5% on TRACE benchmark. Higher N_eff is associated with larger old-task loss increases after continual training.

Conclusion: Head-wise routing addresses the pre-routing bottleneck in MoE Transformers for continual learning, reducing feature composition collisions and substantially mitigating forgetting compared to standard MoE approaches.

Abstract: Mixture-of-Experts (MoE) architectures are often considered a natural fit for continual learning because sparse routing should localize updates and reduce interference, yet MoE Transformers still forget substantially even with sparse, well-balanced expert utilization. We attribute this gap to a pre-routing bottleneck: multi-head attention concatenates head-specific signals into a single post-attention router input, forcing routing to act on co-occurring feature compositions rather than separable head channels. We show that this router input simultaneously encodes multiple separately decodable semantic and structural factors with uneven head support, and that different feature compositions induce weakly aligned parameter-gradient directions; as a result, routing maps many distinct compositions to the same route. We quantify this collision effect via a route-wise effective composition number $N_{eff}$ and find that higher $N_{eff}$ is associated with larger old-task loss increases after continual training. Motivated by these findings, we propose MH-MoE, which performs head-wise routing over sub-representations to increase routing granularity and reduce composition collisions. On TRACE with Qwen3-0.6B/8B, MH-MoE effectively mitigates forgetting, reducing BWT on Qwen3-0.6B from 11.2% (LoRAMoE) to 4.5%.
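The head-wise routing that replaces the single concatenated router input can be sketched as follows (shapes and the argmax gate are illustrative simplifications of MH-MoE):

```python
import numpy as np

def headwise_route(x, n_heads, routers):
    """Split the post-attention vector into per-head sub-vectors and
    route each independently, instead of routing the concatenation
    once (the pre-routing bottleneck). `routers` holds one
    (n_experts x sub_dim) scoring matrix per head."""
    subs = np.split(np.asarray(x, dtype=float), n_heads)
    return [int(np.argmax(r @ s)) for r, s in zip(routers, subs)]
```

Because each head's sub-representation gets its own expert choice, two inputs that collide as full concatenations can still be separated head by head, lowering the effective composition number per route.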

[270] Vehicle behaviour estimation for abnormal event detection using distributed fiber optic sensing

Hemant Prasad, Daisuke Ikefuji, Shin Tominaga, Hitoshi Sakurai, Manabu Otani

Main category: cs.LG

TL;DR: A method for detecting single-lane traffic abnormalities using distributed fiber-optic sensing by tracking vehicle paths and detecting lane changes through vibration spectral centroid analysis.

DetailsMotivation: While distributed fiber-optic sensing (DFOS) systems are effective for wide-area traffic monitoring and congestion detection, they struggle to identify single-lane abnormalities that cause congestion. These abnormalities can be detected by monitoring vehicle lane changes as drivers avoid problematic lanes.

Method: The method tracks individual vehicle paths using clustering techniques to estimate vehicle positions over time. Lane changes are detected by monitoring changes in the spectral centroid of vehicle vibrations, tracking a reference vehicle along the highway to identify when vehicles switch lanes to avoid abnormalities.

Result: Evaluation with real traffic data showed 80% accuracy for lane change detection events, which correspond to the presence of single-lane abnormalities.

Conclusion: The proposed method successfully addresses the challenge of detecting single-lane abnormalities in DFOS systems by leveraging vehicle lane change detection through vibration analysis, providing a practical solution for more granular traffic monitoring.

Abstract: The distributed fiber-optic sensing (DFOS) system is a cost-effective wide-area traffic monitoring technology that utilizes existing fiber infrastructure to effectively detect traffic congestion. However, detecting the single-lane abnormalities that lead to congestion is still a challenge. These single-lane abnormalities can be detected by monitoring the lane-change behaviour of vehicles, performed to avoid congestion along the monitored section of a road. This paper presents a method to detect single-lane abnormalities by tracking individual vehicle paths and detecting vehicle lane changes along a section of a road. We propose a method to estimate the vehicle position at all time instances and fit a path using clustering techniques. We detect vehicle lane changes by monitoring changes in the spectral centroid of vehicle vibrations while tracking a reference vehicle along a highway. The evaluation of our proposed method with real traffic data showed 80% accuracy for lane change detection events that indicate the presence of abnormalities.
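The spectral centroid the method tracks is a standard signal-processing quantity; a minimal sketch of computing it and flagging jumps between analysis windows (the threshold and windowing are illustrative, not the paper's pipeline):

```python
import numpy as np

def spectral_centroid(signal, fs):
    """Amplitude-weighted mean frequency of the magnitude spectrum
    of a vibration trace sampled at fs Hz."""
    mag = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return float((freqs * mag).sum() / mag.sum())

def lane_change_flags(windows, fs, jump_hz=20.0):
    """Flag a lane change between consecutive windows when the
    centroid jumps by more than `jump_hz`."""
    cents = [spectral_centroid(w, fs) for w in windows]
    return [abs(b - a) > jump_hz for a, b in zip(cents, cents[1:])]
```

A vehicle switching lanes changes its distance to the buried fiber, shifting the frequency content of the sensed vibration and hence the centroid between windows.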

[271] HyperMLP: An Integrated Perspective for Sequence Modeling

Jiecheng Lu, Shihao Yang

Main category: cs.LG

TL;DR: The paper reinterprets self-attention as a dynamic two-layer MLP rather than probabilistic query-key lookup, introducing HyperMLP and HyperGLU that learn dynamic mixing in both feature and sequence space using a reverse-offset layout.

DetailsMotivation: The authors challenge the conventional view of self-attention as probabilistic query-key lookup, arguing that this perspective leads to designs constrained by normalized attention scores and fixed positional semantics. They propose a simpler, more unified perspective that could lead to more flexible and effective architectures.

Method: The paper introduces HyperMLP and HyperGLU, which reinterpret attention heads as dynamic two-layer MLPs whose weights are instantiated from context history. These models use a reverse-offset (lag) layout to align temporal mixing with autoregressive semantics, learning dynamic mixing in both feature space and sequence space.

Result: Empirical results show that HyperMLP/HyperGLU consistently outperform strong softmax-attention baselines under matched parameter budgets, demonstrating the effectiveness of the proposed formulation.

Conclusion: The paper provides a novel theoretical framework for understanding attention mechanisms and introduces practical architectures that outperform traditional attention-based models, suggesting that the MLP-based perspective offers advantages over the probabilistic query-key lookup view.

Abstract: Self-attention is often viewed as probabilistic query-key lookup, motivating designs that preserve normalized attention scores and fixed positional semantics. We advocate a simpler and more unified perspective: an autoregressive attention head can be viewed as a dynamic two-layer MLP whose weights are instantiated from the context history. From this view, attention scores form an ever-growing hidden representation, and standard MLP activations such as ReLU or GLU naturally implement input-conditioned selection over a context-dependent memory pool rather than a probability distribution. Based on this formulation, we introduce HyperMLP and HyperGLU, which learn dynamic mixing in both feature space and sequence space, using a reverse-offset (lag) layout to align temporal mixing with autoregressive semantics. We provide theoretical characterizations of the expressivity and implications of this structure, and empirically show that HyperMLP/HyperGLU consistently outperform strong softmax-attention baselines under matched parameter budgets.
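The reinterpretation in the abstract can be written out directly: for one query, standard attention is literally a two-layer MLP whose weights are the context's keys and values (the ReLU/GLU swap the paper proposes is noted in the comment but not implemented here):

```python
import numpy as np

def attention_as_dynamic_mlp(q, K, V):
    """One causal attention head for a single query, written as a
    two-layer MLP with context-instantiated weights: keys form the
    first layer, values the second. With softmax as the
    'activation' this equals standard attention; substituting ReLU
    or GLU gives the selection-over-memory reading of HyperMLP."""
    h = K @ q                        # layer 1: hidden of size t (grows with context)
    a = np.exp(h - h.max())
    a = a / a.sum()                  # softmax as the activation
    return V.T @ a                   # layer 2: project through the values
```

The hidden layer `h` is exactly the vector of attention scores, which makes the "ever-growing hidden representation" phrasing in the abstract concrete.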

[272] Block-Sample MAC-Bayes Generalization Bounds

Matthias Frey, Jingge Zhu, Michael C. Gastpar

Main category: cs.LG

TL;DR: Novel MAC-Bayes bounds that bound expected generalization error using block-sample divergence terms, improving tightness over traditional PAC-Bayes bounds.

DetailsMotivation: Traditional PAC-Bayes bounds provide high-probability guarantees but can be loose or vacuous. The authors aim to develop tighter bounds for expected generalization error by exploiting structure in training data.

Method: Propose a family of MAC-Bayes bounds that generalize expectation versions of PAC-Bayes bounds. Key innovation: divergence terms depend only on subsets (blocks) of training data rather than entire dataset.

Result: The new bounds can be significantly tighter than traditional PAC-Bayes bounds. Numerical example shows original PAC-Bayes bound is vacuous while proposed bounds are finite. Also prove impossibility of high-probability versions with same fast convergence rates.

Conclusion: MAC-Bayes bounds offer tighter expected generalization error guarantees by exploiting block structure, but cannot be directly converted to high-probability PAC-Bayes bounds with same convergence rates.

Abstract: We present a family of novel block-sample MAC-Bayes bounds (mean approximately correct). While PAC-Bayes bounds (probably approximately correct) typically give bounds for the generalization error that hold with high probability, MAC-Bayes bounds have a similar form but bound the expected generalization error instead. The family of bounds we propose can be understood as a generalization of an expectation version of known PAC-Bayes bounds. Compared to standard PAC-Bayes bounds, the new bounds contain divergence terms that only depend on subsets (or \emph{blocks}) of the training data. The proposed MAC-Bayes bounds hold the promise of significantly improving upon the tightness of traditional PAC-Bayes and MAC-Bayes bounds. This is illustrated with a simple numerical example in which the original PAC-Bayes bound is vacuous regardless of the choice of prior, while the proposed family of bounds are finite for appropriate choices of the block size. We also explore the question whether high-probability versions of our MAC-Bayes bounds (i.e., PAC-Bayes bounds of a similar form) are possible. We answer this question in the negative with an example that shows that in general, it is not possible to establish a PAC-Bayes bound which (a) vanishes with a rate faster than $\mathcal{O}(1/\log n)$ whenever the proposed MAC-Bayes bound vanishes with rate $\mathcal{O}(n^{-1/2})$ and (b) exhibits a logarithmic dependence on the permitted error probability.
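For context, the best-known expectation-level bound of this flavor is the mutual-information bound of Xu and Raginsky, shown here for comparison only (this is not the paper's block-sample bound):

```latex
% For a \sigma-sub-Gaussian loss and an n-sample training set S,
\left| \mathbb{E}\,\mathrm{gen}(W, S) \right|
  \;\le\; \sqrt{\frac{2\sigma^{2}\, I(W; S)}{n}},
% where I(W; S) is the mutual information between the learned
% hypothesis W and the full dataset S.
```

The bounds in this paper keep this expectation-level ("MAC") form but replace the full-sample dependence term with divergence terms that involve only blocks of $S$, which is what allows them to stay finite where the full-sample term blows up.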

[273] RelBench v2: A Large-Scale Benchmark and Repository for Relational Data

Justin Gu, Rishabh Ranjan, Charilaos Kanatsoulis, Haiming Tang, Martin Jurkovic, Valter Hudovernik, Mark Znidar, Pranshu Chaturvedi, Parth Shroff, Fengyu Li, Jure Leskovec

Main category: cs.LG

TL;DR: RelBench v2 expands relational deep learning benchmarks with 4 large-scale datasets, introduces autocomplete tasks, integrates external benchmarks, and shows RDL models outperform single-table baselines.

DetailsMotivation: As relational deep learning evolves toward larger models and relational foundation models, scalable and realistic benchmarks are essential for enabling systematic evaluation and progress.

Method: Introduces RelBench v2 with four new large-scale relational datasets, adds autocomplete tasks (inferring missing attribute values within tables), integrates external benchmarks (Temporal Graph Benchmark, ReDeLEx, 4DBInfer), and provides unified evaluation frameworks.

Result: RelBench v2 now contains 11 datasets with over 22 million rows across 29 tables. Experimental results show RDL models consistently outperform single-table baselines across autocomplete, forecasting, and recommendation tasks.

Conclusion: RelBench v2 provides a comprehensive benchmark for relational deep learning, demonstrating the importance of modeling relational structure explicitly and enabling systematic evaluation of larger relational foundation models.

Abstract: Relational deep learning (RDL) has emerged as a powerful paradigm for learning directly on relational databases by modeling entities and their relationships across multiple interconnected tables. As this paradigm evolves toward larger models and relational foundation models, scalable and realistic benchmarks are essential for enabling systematic evaluation and progress. In this paper, we introduce RelBench v2, a major expansion of the RelBench benchmark for RDL. RelBench v2 adds four large-scale relational datasets spanning scholarly publications, enterprise resource planning, consumer platforms, and clinical records, increasing the benchmark to 11 datasets comprising over 22 million rows across 29 tables. We further introduce autocomplete tasks, a new class of predictive objectives that require models to infer missing attribute values directly within relational tables while respecting temporal constraints, expanding beyond traditional forecasting tasks constructed via SQL queries. In addition, RelBench v2 expands beyond its native datasets by integrating external benchmarks and evaluation frameworks: we translate event streams from the Temporal Graph Benchmark into relational schemas for unified relational-temporal evaluation, interface with ReDeLEx to provide uniform access to 70+ real-world databases suitable for pretraining, and incorporate 4DBInfer datasets and tasks to broaden multi-table prediction coverage. Experimental results demonstrate that RDL models consistently outperform single-table baselines across autocomplete, forecasting, and recommendation tasks, highlighting the importance of modeling relational structure explicitly.

[274] Coden: Efficient Temporal Graph Neural Networks for Continuous Prediction

Zulun Zhu, Siqiang Luo

Main category: cs.LG

TL;DR: Coden is a Temporal Graph Neural Network designed for efficient continuous predictions on dynamic graphs, overcoming computational bottlenecks while maintaining accuracy.

DetailsMotivation: Existing TGNNs focus on one-time predictions, but many real-world applications require frequent continuous predictions over time. Direct adaptation of current TGNNs to continuous scenarios leads to high computational costs or quality issues, especially for large graphs.

Method: Coden is a TGNN model that innovatively overcomes the key complexity bottleneck in existing TGNNs. It provides theoretical analyses to substantiate effectiveness and efficiency, and clarifies its duality relationship with both RNN-based and attention-based models.

Result: Evaluations across five dynamic datasets show Coden surpasses existing performance benchmarks in both efficiency and effectiveness, establishing it as a superior solution for continuous prediction in evolving graph environments.

Conclusion: Coden provides an efficient and effective solution for continuous predictions on dynamic graphs, addressing a critical gap in temporal graph neural network applications.

Abstract: Temporal Graph Neural Networks (TGNNs) are pivotal in processing dynamic graphs. However, existing TGNNs primarily target one-time predictions for a given temporal span, whereas many practical applications require continuous predictions, i.e., predictions issued frequently over time. Directly adapting existing TGNNs to continuous-prediction scenarios introduces either significant computational overhead or prediction quality issues, especially for large graphs. This paper revisits the challenge of continuous predictions in TGNNs and introduces Coden, a TGNN model designed for efficient and effective learning on dynamic graphs. Coden innovatively overcomes the key complexity bottleneck in existing TGNNs while preserving comparable predictive accuracy. Moreover, we provide theoretical analyses that substantiate the effectiveness and efficiency of Coden, and clarify its duality relationship with both RNN-based and attention-based models. Our evaluations across five dynamic datasets show that Coden surpasses existing performance benchmarks in both efficiency and effectiveness, establishing it as a superior solution for continuous prediction in evolving graph environments.

[275] Unifying Model-Free Efficiency and Model-Based Representations via Latent Dynamics

Jashaswimalya Acharjee, Balaraman Ravindran

Main category: cs.LG

TL;DR: ULD is a reinforcement learning algorithm that unifies model-free efficiency with model-based representational strengths by embedding state-action pairs into a latent space where the value function is approximately linear, enabling cross-domain competence with minimal tuning.

DetailsMotivation: The paper aims to bridge the gap between model-free RL (efficient but limited representation) and model-based RL (strong representation but planning overhead). The goal is to achieve the adaptability and sample efficiency of model-based approaches without the computational cost of planning.

Method: ULD embeds state-action pairs into a latent space where the true value function is approximately linear. It uses synchronized updates of encoder, value, and policy networks with auxiliary losses for short-horizon predictive dynamics and reward-scale normalization. The approach is theoretically grounded with proofs showing the fixed point of embedding-based TD updates coincides with linear model-based value expansion.

Result: Evaluated on 80 environments spanning Gym locomotion, DeepMind Control (proprioceptive and visual), and Atari, ULD matches or exceeds specialized model-free and general model-based baselines. It achieves cross-domain competence with minimal tuning and a fraction of the parameter footprint.

Conclusion: Value-aligned latent representations alone can deliver the adaptability and sample efficiency traditionally attributed to full model-based planning, demonstrating that unified latent dynamics can bridge model-free and model-based approaches effectively.

Abstract: We present Unified Latent Dynamics (ULD), a novel reinforcement learning algorithm that unifies the efficiency of model-free methods with the representational strengths of model-based approaches, without incurring planning overhead. By embedding state-action pairs into a latent space in which the true value function is approximately linear, our method supports a single set of hyperparameters across diverse domains – from continuous control with low-dimensional and pixel inputs to high-dimensional Atari games. We prove that, under mild conditions, the fixed point of our embedding-based temporal-difference updates coincides with that of a corresponding linear model-based value expansion, and we derive explicit error bounds relating embedding fidelity to value approximation quality. In practice, ULD employs synchronized updates of encoder, value, and policy networks, auxiliary losses for short-horizon predictive dynamics, and reward-scale normalization to ensure stable learning under sparse rewards. Evaluated on 80 environments spanning Gym locomotion, DeepMind Control (proprioceptive and visual), and Atari, our approach matches or exceeds the performance of specialized model-free and general model-based baselines – achieving cross-domain competence with minimal tuning and a fraction of the parameter footprint. These results indicate that value-aligned latent representations alone can deliver the adaptability and sample efficiency traditionally attributed to full model-based planning.

[276] Efficient Personalized Federated PCA with Manifold Optimization for IoT Anomaly Detection

Xianchao Xiu, Chenyi Huang, Wei Zhang, Wanquan Liu

Main category: cs.LG

TL;DR: FedEP: Personalized federated PCA for IoT anomaly detection using ℓ₁-norm for local sparsity and ℓ₂,₁-norm for robustness, solved via manifold optimization with ADMM.

DetailsMotivation: IoT networks need secure anomaly detection but face privacy and resource constraints. Current federated PCA methods lack personalization and robustness needed for effective IoT security.

Method: Proposes FedEP with ℓ₁-norm for element-wise sparsity (personalization) and ℓ₂,₁-norm for row-wise sparsity (robustness). Uses manifold optimization with ADMM for convergence guarantees.

Result: Outperforms state-of-the-art FedPG with excellent F1-scores and accuracy across various IoT security scenarios.

Conclusion: FedEP provides effective personalized and robust anomaly detection for IoT networks while preserving privacy through federated learning.

Abstract: Internet of things (IoT) networks face increasing security threats due to their distributed nature and resource constraints. Although federated learning (FL) has gained prominence as a privacy-preserving framework for distributed IoT environments, current federated principal component analysis (PCA) methods lack the integration of personalization and robustness, which are critical for effective anomaly detection. To address these limitations, we propose an efficient personalized federated PCA (FedEP) method for anomaly detection in IoT networks. The proposed model achieves personalization through introducing local representations with the $\ell_1$-norm for element-wise sparsity, while maintaining robustness via enforcing local models with the $\ell_{2,1}$-norm for row-wise sparsity. To solve this non-convex problem, we develop a manifold optimization algorithm based on the alternating direction method of multipliers (ADMM) with rigorous theoretical convergence guarantees. Experimental results confirm that the proposed FedEP outperforms the state-of-the-art FedPG, achieving excellent F1-scores and accuracy in various IoT security scenarios. Our code will be available at https://github.com/xianchaoxiu/FedEP.
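Both sparsity-inducing norms used here have closed-form proximal operators, which is what makes an ADMM scheme of the kind described tractable. A minimal NumPy sketch of the two prox maps (the regularization weights are placeholders, not values from the paper):

```python
import numpy as np

def prox_l1(X, lam):
    """Element-wise soft-thresholding: prox of lam * ||X||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - lam, 0.0)

def prox_l21(X, lam):
    """Row-wise shrinkage: prox of lam * ||X||_{2,1} (sum of row 2-norms)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return X * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
S = prox_l1(X, 0.5)    # element-wise sparsity (the personalization term)
R = prox_l21(X, 1.0)   # zeroes out whole rows (the robustness term)
print("zero entries:", int((S == 0).sum()),
      "| zero rows:", int((np.abs(R).sum(axis=1) == 0).sum()))
```

In an ADMM iteration, each prox would be applied to the relevant splitting variable plus its scaled dual, alternating with the manifold-constrained PCA update.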

[277] Formalizing the Sampling Design Space of Diffusion-Based Generative Models via Adaptive Solvers and Wasserstein-Bounded Timesteps

Sangwoo Jo, Sungjoon Choi

Main category: cs.LG

TL;DR: SDM: A principled framework for efficient diffusion model sampling using geometric analysis and adaptive solver scheduling based on trajectory properties.

DetailsMotivation: Diffusion models have high sampling costs that limit practical deployment. Current approaches use static heuristics for solver selection and scheduling, lacking principled design.

Method: Analyzes ODE dynamics of diffusion trajectories to show low-order solvers suffice in early high-noise stages while higher-order solvers handle later non-linearity. Introduces Wasserstein-bounded optimization framework for adaptive timestep scheduling that bounds local discretization error.

Result: Achieves SOTA performance: FID of 1.93 on CIFAR-10, 2.41 on FFHQ, and 1.98 on AFHQv2 with reduced function evaluations compared to existing samplers.

Conclusion: SDM provides a principled geometric framework for efficient diffusion sampling without requiring additional training or architectural modifications.

Abstract: Diffusion-based generative models have achieved remarkable performance across various domains, yet their practical deployment is often limited by high sampling costs. While prior work focuses on training objectives or individual solvers, the holistic design of sampling, specifically solver selection and scheduling, remains dominated by static heuristics. In this work, we revisit this challenge through a geometric lens, proposing SDM, a principled framework that aligns the numerical solver with the intrinsic properties of the diffusion trajectory. By analyzing the ODE dynamics, we show that efficient low-order solvers suffice in early high-noise stages while higher-order solvers can be progressively deployed to handle the increasing non-linearity of later stages. Furthermore, we formalize the scheduling by introducing a Wasserstein-bounded optimization framework. This method systematically derives adaptive timesteps that explicitly bound the local discretization error, ensuring the sampling process remains faithful to the underlying continuous dynamics. Without requiring additional training or architectural modifications, SDM achieves state-of-the-art performance across standard benchmarks, including an FID of 1.93 on CIFAR-10, 2.41 on FFHQ, and 1.98 on AFHQv2, with a reduced number of function evaluations compared to existing samplers. Our code is available at https://github.com/aiimaginglab/sdm.
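The "low-order solvers early, higher-order solvers late" schedule can be illustrated on a toy ODE. Everything below is hypothetical: the dynamics, the fixed timestep grid, and the switching threshold are placeholders, whereas SDM derives adaptive timesteps from a Wasserstein-bounded error criterion:

```python
import numpy as np

# Toy probability-flow-style ODE dx/dt = f(x, t), integrated from t=1 to
# t=0, switching from a 1st-order (Euler) to a 2nd-order (Heun) step.
f = lambda x, t: -x / (t + 0.1)  # illustrative dynamics, not the paper's

def euler_step(x, t, dt):
    return x + dt * f(x, t)

def heun_step(x, t, dt):
    k1 = f(x, t)
    k2 = f(x + dt * k1, t + dt)
    return x + dt * 0.5 * (k1 + k2)

x = 1.0
ts = np.linspace(1.0, 0.0, 11)  # fixed grid; SDM derives adaptive ones
for t0, t1 in zip(ts[:-1], ts[1:]):
    dt = t1 - t0                # negative: integrating backward in time
    # Cheap solver in the early high-noise stage, higher order later.
    step = euler_step if t0 > 0.5 else heun_step
    x = step(x, t0, dt)
print(round(x, 4))
```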

[278] SLA2: Sparse-Linear Attention with Learnable Routing and QAT

Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang, Ion Stoica, Jianfei Chen, Jun Zhu, Joseph E. Gonzalez

Main category: cs.LG

TL;DR: SLA2 improves sparse-linear attention for diffusion models with learnable routing, better decomposition, and quantization to achieve high sparsity and speedup while maintaining quality.

DetailsMotivation: Existing Sparse-Linear Attention (SLA) for diffusion models uses heuristic split based on attention-weight magnitude which is suboptimal, and has mismatch between SLA formulation and direct sparse-linear decomposition.

Method: SLA2 introduces: (I) learnable router for dynamic sparse/linear branch selection, (II) more faithful sparse-linear attention formulation with learnable combination ratio, (III) sparse + low-bit attention via quantization-aware fine-tuning.

Result: On video diffusion models, SLA2 achieves 97% attention sparsity, 18.6x attention speedup while preserving generation quality.

Conclusion: SLA2 provides efficient attention mechanism for diffusion models with learnable optimization of sparse-linear decomposition and quantization integration.

Abstract: Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) after formally analyzing the attention error in SLA, we identify a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, where low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.
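A NumPy sketch of the sparse/linear decomposition being combined. The top-k mask, the elu+1 feature map, and the fixed `alpha` are illustrative stand-ins: SLA2's combination ratio is learnable, and its router assigns each computation to one branch rather than blending fixed branches:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Sparse branch: keep only the top-k attention scores per query.
k = 4
S = Q @ K.T / np.sqrt(d)
thresh = np.sort(S, axis=-1)[:, -k][:, None]
out_sparse = softmax(np.where(S < thresh, -np.inf, S)) @ V

# Linear branch: kernelized attention with an elu+1 feature map
# (one common choice; runs in O(n) rather than O(n^2)).
phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
Qf, Kf = phi(Q), phi(K)
out_linear = (Qf @ (Kf.T @ V)) / (Qf @ Kf.sum(axis=0)[:, None] + 1e-6)

# Fixed placeholder for the learnable combination ratio.
alpha = 0.8
out = alpha * out_sparse + (1 - alpha) * out_linear
print(out.shape)
```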

[279] Dual-Granularity Contrastive Reward via Generated Episodic Guidance for Efficient Embodied RL

Xin Liu, Yixuan Li, Yuhui Chen, Yuxing Qin, Haoran Li, Dongbin Zhao

Main category: cs.LG

TL;DR: DEG: A framework using large video generation models to create dense rewards for RL without human annotations, improving sample efficiency in embodied manipulation tasks.

DetailsMotivation: Current RL methods for embodied manipulation face challenges with sparse rewards (low sample efficiency) or require extensive human annotations/expert supervision for dense rewards. There's a need for sample-efficient dense rewards without heavy human involvement.

Method: DEG uses large video generation models with minimal expert videos for domain adaptation to generate task guidance for each RL episode. It employs dual-granularity contrastive rewards (coarse-grained exploration + fine-grained matching) in self-supervised latent space to guide agents toward generated guidance videos.

Result: Extensive experiments on 18 diverse tasks across simulation and real-world settings show DEG effectively helps agents discover sparse success rewards and enables stable policy convergence independently.

Conclusion: DEG provides a novel approach to obtain sample-efficient dense rewards without human annotations or extensive supervision, leveraging video generation models and contrastive learning for embodied manipulation RL.

Abstract: Designing suitable rewards poses a significant challenge in reinforcement learning (RL), especially for embodied manipulation. Trajectory success rewards are suitable for human judges or model fitting, but their sparsity severely limits RL sample efficiency. While recent methods have effectively improved RL via dense rewards, they rely heavily on high-quality human-annotated data or abundant expert supervision. To tackle these issues, this paper proposes Dual-granularity contrastive reward via generated Episodic Guidance (DEG), a novel framework to seek sample-efficient dense rewards without requiring human annotations or extensive supervision. Leveraging the prior knowledge of large video generation models, DEG only needs a small number of expert videos for domain adaptation to generate dedicated task guidance for each RL episode. Then, the proposed dual-granularity reward, which balances coarse-grained exploration and fine-grained matching, guides the agent to efficiently approximate the generated guidance video sequentially in the contrastive self-supervised latent space and finally complete the target task. Extensive experiments on 18 diverse tasks across both simulation and real-world settings show that DEG can not only serve as an efficient exploration stimulus to help the agent quickly discover sparse success rewards, but also guide effective RL and stable policy convergence independently.
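One way the dual-granularity idea could be sketched: a coarse reward that tolerates temporal misalignment (best similarity to any guidance frame) plus a fine reward that enforces sequential matching (similarity to the expected frame at the current step). The random latents and the equal mixing weights are hypothetical illustrations, not the paper's learned representations:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Hypothetical self-supervised latents: T frames of the generated
# guidance video and the agent's current observation embedding.
T, d = 20, 16
guidance = rng.normal(size=(T, d))
obs_latent = rng.normal(size=d)

# Coarse-grained: similarity to the closest guidance frame
# (exploration signal that tolerates temporal misalignment).
coarse = max(cosine(obs_latent, g) for g in guidance)

# Fine-grained: similarity to the frame the agent should match now
# (strict sequential matching).
step = 5
fine = cosine(obs_latent, guidance[step])

# Dual-granularity reward: a weighted mix (weights are placeholders).
reward = 0.5 * coarse + 0.5 * fine
print(round(reward, 3))
```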

[280] Trust the uncertain teacher: distilling dark knowledge via calibrated uncertainty

Jeonghyun Kim, SooKyung Kim, Richeng Xuan, Hyunsoo Cho

Main category: cs.LG

TL;DR: CUD improves knowledge distillation by calibrating teacher uncertainty to provide better dark knowledge transfer, enhancing student accuracy and robustness.

DetailsMotivation: Traditional knowledge distillation suffers from teachers with overconfident, sharp probability distributions that fail to preserve informative uncertainty signals, especially problematic for high-cardinality tasks and distribution shifts.

Method: Proposes Calibrated Uncertainty Distillation (CUD) framework that shapes teacher’s predictive distribution before transfer, encouraging calibrated uncertainty where informative and guiding students to learn from balanced targets.

Result: CUD yields students that are more accurate, better calibrated under distribution shift, and more reliable on ambiguous, long-tail inputs across diverse benchmarks.

Conclusion: CUD successfully addresses overconfidence in knowledge distillation by making dark knowledge more faithfully accessible through calibrated uncertainty, improving student performance and robustness.

Abstract: The core of knowledge distillation lies in transferring the teacher’s rich ‘dark knowledge’: subtle probabilistic patterns that reveal how classes are related and the distribution of uncertainties. While this idea is well established, teachers trained with conventional cross-entropy often fail to preserve such signals. Their distributions collapse into sharp, overconfident peaks that appear decisive but are in fact brittle, offering little beyond the hard label or subtly hindering representation-level transfer. This overconfidence is especially problematic in high-cardinality tasks, where the nuances among many plausible classes matter most for guiding a compact student. Moreover, such brittle targets reduce robustness under distribution shift, leaving students vulnerable to miscalibration in real-world conditions. To address this limitation, we revisit distillation from a distributional perspective and propose Calibrated Uncertainty Distillation (CUD), a framework designed to make dark knowledge more faithfully accessible. Instead of uncritically adopting the teacher’s overconfidence, CUD encourages teachers to reveal uncertainty where it is informative and guides students to learn from targets that convey calibrated rather than sharpened certainty. By directly shaping the teacher’s predictive distribution before transfer, our approach balances accuracy and calibration, allowing students to benefit from both confident signals on easy cases and structured uncertainty on hard ones. Across diverse benchmarks, CUD yields students that are not only more accurate, but also more calibrated under shift and more reliable on ambiguous, long-tail inputs.
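The effect of shaping the teacher's distribution before transfer can be illustrated with plain temperature scaling as a crude stand-in for the paper's calibration procedure (the logit scale, temperature value, and loss weighting below are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, tau=1.0):
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

# Overconfident teacher logits on a 10-class toy problem.
logits = rng.normal(size=(4, 10)) * 8.0
sharp = softmax(logits)                 # near one-hot: little dark knowledge
calibrated = softmax(logits, tau=4.0)   # smoothed proxy for a calibrated teacher

# Distillation target: cross-entropy of the student against the shaped
# teacher distribution, which now exposes inter-class relations.
def kd_loss(student_logits, teacher_probs, tau=1.0):
    log_q = np.log(softmax(student_logits, tau) + 1e-12)
    return -(teacher_probs * log_q).sum(axis=-1).mean()

student_logits = rng.normal(size=(4, 10))
print("teacher entropy, sharp vs calibrated:",
      entropy(sharp).mean().round(3), entropy(calibrated).mean().round(3))
print("KD loss vs calibrated teacher:", round(kd_loss(student_logits, calibrated), 3))
```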

[281] Uncovering spatial tissue domains and cell types in spatial omics through cross-scale profiling of cellular and genomic interactions

Rui Yan, Xiaohan Xing, Xun Wang, Zixia Zhou, Md Tauhidul Islam, Lei Xing

Main category: cs.LG

TL;DR: CellScape is a deep learning framework for spatial transcriptomics data analysis that jointly models cellular spatial interactions and genomic relationships to uncover biologically informative patterns.

DetailsMotivation: Spatial transcriptomics provides valuable in situ gene expression data but is inherently noisy and complex, making it difficult for existing methods to capture the interplay between spatial interactions and genomic relationships, limiting biological pattern discovery.

Method: CellScape is a deep learning framework that jointly models cellular interactions in tissue space and genomic relationships among cells, producing comprehensive representations that integrate spatial signals with gene regulatory mechanisms.

Result: The technique uncovers biologically informative patterns that improve spatial domain segmentation and supports comprehensive spatial cellular analyses across diverse transcriptomics datasets.

Conclusion: CellScape offers an accurate and versatile framework for deep analysis and interpretation of spatial transcriptomics data by effectively capturing spatial-genomic relationships.

Abstract: Cellular identity and function are linked to both their intrinsic genomic makeup and extrinsic spatial context within the tissue microenvironment. Spatial transcriptomics (ST) offers an unprecedented opportunity to study this, providing in situ gene expression profiles at single-cell resolution and illuminating the spatial and functional organization of cells within tissues. However, a significant hurdle remains: ST data is inherently noisy, large, and structurally complex. This complexity makes it intractable for existing computational methods to effectively capture the interplay between spatial interactions and intrinsic genomic relationships, thus limiting our ability to discern critical biological patterns. Here, we present CellScape, a deep learning framework designed to overcome these limitations for high-performance ST data analysis and pattern discovery. CellScape jointly models cellular interactions in tissue space and genomic relationships among cells, producing comprehensive representations that seamlessly integrate spatial signals with underlying gene regulatory mechanisms. This technique uncovers biologically informative patterns that improve spatial domain segmentation and supports comprehensive spatial cellular analyses across diverse transcriptomics datasets, offering an accurate and versatile framework for deep analysis and interpretation of ST data.

[282] Flow Matching from Viewpoint of Proximal Operators

Kenji Fukumizu, Wei Huang, Han Bao, Shuntuo Xu, Nisha Chandramoothy

Main category: cs.LG

TL;DR: OT-CFM generative models can be reformulated with exact proximal formulation via extended Brenier potential, enabling explicit vector field expressions and proving terminal normal hyperbolicity for manifold-supported targets.

DetailsMotivation: To provide a more rigorous mathematical foundation for Optimal Transport Conditional Flow Matching (OT-CFM) models by establishing exact proximal formulations without density assumptions on target distributions, and to understand their geometric properties on manifold-supported data.

Method: Reformulates OT-CFM using extended Brenier potential theory to obtain exact proximal operator expressions for target recovery and vector fields. Analyzes minibatch convergence to population formulation and uses second epi-derivatives of convex potentials to prove terminal normal hyperbolicity for manifold-supported targets.

Result: Shows OT-CFM admits exact proximal formulation without density assumptions, provides explicit proximal expressions for vector fields, proves minibatch convergence, and demonstrates terminal normal hyperbolicity where dynamics contract exponentially normal to data manifold while remaining neutral tangentially.

Conclusion: The paper establishes rigorous mathematical foundations for OT-CFM, providing exact proximal formulations and geometric insights into how these generative models behave on manifold-supported data, with implications for theoretical understanding and practical implementation.

Abstract: We reformulate Optimal Transport Conditional Flow Matching (OT-CFM), a class of dynamical generative models, showing that it admits an exact proximal formulation via an extended Brenier potential, without assuming that the target distribution has a density. In particular, the mapping to recover the target point is exactly given by a proximal operator, which yields an explicit proximal expression of the vector field. We also discuss the convergence of minibatch OT-CFM to the population formulation as the batch size increases. Finally, using second epi-derivatives of convex potentials, we prove that, for manifold-supported targets, OT-CFM is terminally normally hyperbolic: after time rescaling, the dynamics contracts exponentially in directions normal to the data manifold while remaining neutral along tangential directions.
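For readers less familiar with the machinery, the central object is the proximal operator; the sketch below gives its standard definition together with the OT-CFM interpolant it is applied to. The schematic identities are paraphrased from the abstract; the precise statement (and the extended Brenier potential $\varphi$) is in the paper:

```latex
% Standard proximal operator of a (convex) function \varphi:
\operatorname{prox}_{\varphi}(y)
  \;=\; \operatorname*{arg\,min}_{x}\; \varphi(x) + \tfrac{1}{2}\,\lVert x - y \rVert^{2}

% OT-CFM trains on straight-line interpolants between a source sample
% x_0 and a target sample x_1:
x_t = (1-t)\,x_0 + t\,x_1, \qquad v_t(x_t) = x_1 - x_0

% The abstract's claim: the map recovering the target point x_1 from the
% interpolant is exactly a prox of an extended Brenier potential, which
% in turn yields an explicit proximal expression for the vector field v_t.
```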

[283] Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL

Yixiao Zhou, Yang Li, Dongzhou Cheng, Hehe Fan, Yu Cheng

Main category: cs.LG

TL;DR: RLVR framework learns to control sampling temperature during LLM generation through hierarchical reinforcement learning, optimizing temperature and token policies jointly from downstream rewards.

DetailsMotivation: Existing methods use static or heuristic temperature settings decoupled from task rewards, missing the opportunity to dynamically control the exploration-exploitation trade-off during generation based on model uncertainty and task needs.

Method: Introspective LLM: a hierarchical RL framework where, at each decoding step, the model selects a temperature based on its hidden state, samples the next token from the resulting distribution, and jointly optimizes temperature and token policies using coordinate ascent from downstream rewards.

Result: Experiments on mathematical reasoning benchmarks show learned temperature policies outperform fixed/heuristic baselines and exhibit interpretable exploration behaviors aligned with reasoning uncertainty.

Conclusion: Decoding strategy should be a learned component of RL training, not just an inference-time choice; learned temperature control enables adaptive exploration-exploitation aligned with task uncertainty.

Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) trains large language models (LLMs) from sampled trajectories, making decoding strategy a core component of learning rather than a purely inference-time choice. Sampling temperature directly controls the exploration–exploitation trade-off by modulating policy entropy, yet existing methods rely on static values or heuristic adaptations that are decoupled from task-level rewards. We propose Introspective LLM, a hierarchical reinforcement learning framework that learns to control sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. Temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme. Experiments on mathematical reasoning benchmarks show that learned temperature policies outperform fixed and heuristic baselines, while exhibiting interpretable exploration behaviors aligned with reasoning uncertainty.
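The per-step hierarchy (hidden state → temperature choice → token sample) can be sketched with toy linear heads. The discrete temperature set, the head shapes, and the random weights are all hypothetical; the paper's policies are trained jointly from downstream rewards via coordinate ascent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: hidden state -> temperature head and token head.
d_model, vocab, n_temps = 32, 50, 3
temps = np.array([0.5, 1.0, 1.5])  # assumed discrete temperature choices

W_temp = rng.normal(size=(d_model, n_temps)) * 0.1  # temperature policy head
W_lm = rng.normal(size=(d_model, vocab)) * 0.1      # token head

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def decode_step(hidden):
    # High-level policy: pick a temperature from the hidden state.
    t_probs = softmax(hidden @ W_temp)
    tau = temps[rng.choice(n_temps, p=t_probs)]
    # Low-level policy: sample the next token at that temperature.
    tok_probs = softmax((hidden @ W_lm) / tau)
    token = rng.choice(vocab, p=tok_probs)
    return token, tau

hidden = rng.normal(size=d_model)
token, tau = decode_step(hidden)
print(f"sampled token {token} at temperature {tau}")
```

During RLVR training, both heads would receive gradient from the trajectory-level reward, alternating updates between the temperature and token policies.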

[284] Leverage-Weighted Conformal Prediction

Shreyas Fadnavis

Main category: cs.LG

TL;DR: LWCP is a conformal prediction method that weights nonconformity scores by leverage scores to produce adaptive prediction intervals without auxiliary models, addressing heteroscedasticity while maintaining distribution-free coverage guarantees.

DetailsMotivation: Standard split conformal prediction produces constant-width intervals that don't adapt to local variance, leading to overcoverage in low-variance regions and undercoverage in high-variance regions. Existing adaptive methods require training auxiliary models, adding complexity.

Method: Leverage-Weighted Conformal Prediction (LWCP) weights nonconformity scores by a function of the statistical leverage (diagonal of the hat matrix), deriving adaptivity from the geometry of the design matrix rather than auxiliary model fitting.

Result: LWCP preserves finite-sample marginal validity, achieves asymptotically optimal conditional coverage with minimal width cost when heteroscedasticity factors through leverage, recovers classical Gaussian prediction intervals while retaining distribution-free guarantees, and eliminates persistent conditional coverage gaps of vanilla CP.

Conclusion: LWCP provides an efficient, distribution-free method for adaptive prediction intervals that addresses heteroscedasticity without auxiliary models, offering theoretical guarantees and practical benefits with negligible computational overhead.

Abstract: Split conformal prediction provides distribution-free prediction intervals with finite-sample marginal coverage, but produces constant-width intervals that overcover in low-variance regions and undercover in high-variance regions. Existing adaptive methods require training auxiliary models. We propose Leverage-Weighted Conformal Prediction (LWCP), which weights nonconformity scores by a function of the statistical leverage – the diagonal of the hat matrix – deriving adaptivity from the geometry of the design matrix rather than from auxiliary model fitting. We prove that LWCP preserves finite-sample marginal validity for any weight function; achieves asymptotically optimal conditional coverage at essentially no width cost when heteroscedasticity factors through leverage; and recovers the form and width of classical prediction intervals under Gaussian assumptions while retaining distribution-free guarantees. We further establish that randomized leverage approximations preserve coverage exactly with controlled width perturbation, and that vanilla CP suffers a persistent, sample-size-independent conditional coverage gap that LWCP eliminates. The method requires no hyperparameters beyond the choice of weight function and adds negligible computational overhead to vanilla CP. Experiments on synthetic and real data confirm the theoretical predictions, demonstrating substantial reductions in conditional coverage disparity across settings.
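Because the leverage is just the diagonal of the hat matrix, the method is easy to sketch end to end. The weight function `1 + leverage` below is one simple increasing choice assumed for illustration; the paper leaves the weight function as a design choice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy heteroscedastic regression: noise grows with |x_0|.
n, d = 400, 3
X = rng.normal(size=(n, d))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(size=n) * (0.5 + np.abs(X[:, 0]))

# Split: fit half / calibration half.
Xf, yf, Xc, yc = X[:200], y[:200], X[200:], y[200:]
coef, *_ = np.linalg.lstsq(Xf, yf, rcond=None)   # OLS on the fit half

def leverage(Xq, Xfit):
    """h(x) = x^T (Xfit^T Xfit)^{-1} x: the hat-matrix diagonal,
    extended to query points."""
    G = np.linalg.inv(Xfit.T @ Xfit)
    return np.einsum('ij,jk,ik->i', Xq, G, Xq)

# Leverage-weighted nonconformity scores on the calibration set.
w_cal = 1.0 + leverage(Xc, Xf)
scores = np.abs(yc - Xc @ coef) / w_cal

alpha = 0.1
m = len(scores)
q = np.quantile(scores, np.ceil((1 - alpha) * (m + 1)) / m)

# Adaptive interval at a new point: width scales with w(x).
x_new = rng.normal(size=d)
w_new = 1.0 + leverage(x_new[None, :], Xf)[0]
lo, hi = x_new @ coef - q * w_new, x_new @ coef + q * w_new
print(f"90% interval: [{lo:.2f}, {hi:.2f}]")
```

High-leverage (geometrically extreme) points get wider intervals, which is exactly the adaptivity the method derives from the design matrix instead of an auxiliary model.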

[285] Memory-Efficient Structured Backpropagation for On-Device LLM Fine-Tuning

Juneyoung Park, Yuri Hong, Seongwan Kim, Jaeho Lee

Main category: cs.LG

TL;DR: MeSP enables memory-efficient on-device fine-tuning of LLMs by exploiting LoRA’s low-rank structure to recompute intermediate projections during backward pass, achieving 49% memory reduction while maintaining exact gradients.

DetailsMotivation: On-device fine-tuning of LLMs faces severe memory constraints (6-12GB on mobile devices), forcing trade-offs between exact gradients with high memory (MeBP) and low memory with noisy estimates (MeZO). There's a need for methods that maintain exact gradients while reducing memory usage.

Method: Memory-efficient Structured Backpropagation (MeSP) manually derives backward passes that exploit LoRA’s low-rank structure. The key insight is that the intermediate projection h = xA can be recomputed during backward at minimal cost since rank r ≪ d_in, eliminating the need to store it.
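
The recomputation trick is easy to state in code. Below is a minimal numpy sketch of a LoRA linear layer with a hand-derived MeSP-style backward pass; class and variable names are illustrative, and the real system targets on-device kernels rather than numpy.

```python
import numpy as np

class LoRALinearMeSP:
    """LoRA linear layer with a hand-derived backward pass (MeSP-style sketch).
    Only the input x is cached; the rank-r projection h = x @ A is recomputed
    in backward, which is cheap because r << d_in."""

    def __init__(self, d_in, d_out, r, seed=0):
        g = np.random.default_rng(seed)
        self.W = g.normal(size=(d_in, d_out)) * 0.02  # frozen base weight
        self.A = g.normal(size=(d_in, r)) * 0.02      # trainable down-projection
        self.B = np.zeros((r, d_out))                 # trainable up-projection

    def forward(self, x):
        self.x = x                        # cache only the input, not h = x @ A
        return x @ self.W + (x @ self.A) @ self.B

    def backward(self, grad_out):
        h = self.x @ self.A               # recompute instead of storing
        self.grad_B = h.T @ grad_out
        self.grad_A = self.x.T @ (grad_out @ self.B.T)
        return grad_out @ (self.W + self.A @ self.B).T  # gradient w.r.t. x
```

Storing h would cost O(batch x r) per layer; recomputing it costs one extra rank-r matmul in backward, and the gradients are mathematically identical to standard backpropagation.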

Result: MeSP achieves 49% average memory reduction compared to MeBP on Qwen2.5 models (0.5B-3B) while computing mathematically identical gradients. For Qwen2.5-0.5B, peak memory reduces from 361MB to 136MB. Analysis shows MeZO’s gradient estimates have near-zero correlation with true gradients (cosine similarity ≈0.001).

Conclusion: MeSP bridges the gap between exact gradients and memory efficiency, enabling fine-tuning scenarios previously infeasible on memory-constrained devices by exploiting LoRA’s low-rank structure to recompute intermediate values rather than storing them.

Abstract: On-device fine-tuning enables privacy-preserving personalization of large language models, but mobile devices impose severe memory constraints, typically 6–12GB shared across all workloads. Existing approaches force a trade-off between exact gradients with high memory (MeBP) and low memory with noisy estimates (MeZO). We propose Memory-efficient Structured Backpropagation (MeSP), which bridges this gap by manually deriving backward passes that exploit LoRA’s low-rank structure. Our key insight is that the intermediate projection $h = xA$ can be recomputed during backward at minimal cost since rank $r \ll d_{in}$, eliminating the need to store it. MeSP achieves 49% average memory reduction compared to MeBP on Qwen2.5 models (0.5B–3B) while computing mathematically identical gradients. Our analysis also reveals that MeZO’s gradient estimates show near-zero correlation with true gradients (cosine similarity $\approx$0.001), explaining its slow convergence. MeSP reduces peak memory from 361MB to 136MB for Qwen2.5-0.5B, enabling fine-tuning scenarios previously infeasible on memory-constrained devices.

[286] SWING: Unlocking Implicit Graph Representations for Graph Random Features

Alessandro Manenti, Avinava Dubey, Arijit Sehanobish, Cesare Alippi, Krzysztof Choromanski

Main category: cs.LG

TL;DR: SWING is a new algorithm for computations on implicitly defined graphs using space walks instead of graph walks, with Gumbel-softmax sampling and random features for efficient approximation.

Motivation: The paper addresses computational challenges with implicitly represented graphs (i-graphs) where edge weights are defined as functions of node feature vectors. Traditional graph algorithms require materializing the graph, which is inefficient for large implicit graphs commonly used in machine learning (like ε-neighborhood graphs).

Method: SWING conducts walks in continuous embedding spaces rather than on graph nodes. It uses customized Gumbel-softmax sampling with linearized kernels via random features, coupled with importance sampling techniques. The method leverages connections between implicitly defined graphs and Fourier analysis.

Result: The algorithm is accelerator-friendly, doesn’t require graph materialization, and provides efficient approximations of original combinatorial calculations. Thorough experiments on different i-graph classes demonstrate its effectiveness.

Conclusion: SWING offers a novel approach to computations on implicitly defined graphs by operating in continuous spaces rather than discrete graph structures, with practical efficiency benefits for machine learning applications.

Abstract: We propose SWING: Space Walks for Implicit Network Graphs, a new class of algorithms for computations involving Graph Random Features on graphs given by implicit representations (i-graphs), where edge-weights are defined as bi-variate functions of feature vectors in the corresponding nodes. These classes of graphs include several prominent examples, such as $ε$-neighborhood graphs, used on a regular basis in machine learning. Rather than conducting walks on graphs’ nodes, these methods rely on walks in continuous spaces, in which those graphs are embedded. To accurately and efficiently approximate the original combinatorial calculations, SWING applies a customized Gumbel-softmax sampling mechanism with linearized kernels, obtained via random features coupled with importance sampling techniques. This algorithm is of independent interest. SWING relies on the deep connection between implicitly defined graphs and Fourier analysis, presented in this paper. SWING is accelerator-friendly and does not require input graph materialization. We provide a detailed analysis of SWING and complement it with thorough experiments on different classes of i-graphs.

[287] LCSB: Layer-Cyclic Selective Backpropagation for Memory-Efficient On-Device LLM Fine-Tuning

Juneyoung Park, Eunbeen Yoon, Seongwan Kim, Jaeho Lee

Main category: cs.LG

TL;DR: LCSB is a memory-efficient fine-tuning method that selectively computes gradients for only a subset of transformer layers per step, achieving speedup with minimal quality loss and improved stability in quantized settings.

Motivation: Current memory-efficient backpropagation (MeBP) methods for fine-tuning LLMs on mobile devices still suffer from computational overhead, with weight decompression alone taking 32-42% of backward time. There's a need for more efficient fine-tuning that reduces computation while maintaining quality.

Method: Layer-Cyclic Selective Backpropagation (LCSB) computes gradients for only a subset of transformer layers per training step. It leverages residual connections to maintain gradient flow through identity paths, and uses AdamW momentum to provide implicit updates for non-selected layers. The method is interpreted as Block Coordinate Descent on LoRA parameter space.
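
The selection schedule itself is a few lines of code. A sketch is shown below; the contiguous block structure and block size are assumptions, and the AdamW-momentum implicit updates for frozen layers are not modeled here.

```python
import itertools

def lcsb_schedule(n_layers, block_size):
    """Layer-cyclic selection (sketch): each training step backpropagates
    through one contiguous block of layers; blocks are visited round-robin
    so every layer is updated once per cycle. Residual connections keep
    gradients flowing through the frozen layers' identity paths."""
    blocks = [tuple(range(i, min(i + block_size, n_layers)))
              for i in range(0, n_layers, block_size)]
    return itertools.cycle(blocks)
```

A training loop would call `next(schedule)` each step and enable gradients only for the returned block, skipping weight decompression and backward matmuls for all other layers.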

Result: LCSB achieves up to 1.40× speedup with less than 2% quality degradation across five models and three tasks. In 4-bit quantized settings, it shows superior stability - a 3B model that diverges under full backpropagation converges smoothly with LCSB, suggesting implicit regularization.

Conclusion: LCSB provides an effective approach for memory-efficient fine-tuning of LLMs that reduces computational overhead while maintaining model quality and even improving stability in quantized settings through selective gradient computation.

Abstract: Memory-efficient backpropagation (MeBP) has enabled first-order fine-tuning of large language models (LLMs) on mobile devices with less than 1GB memory. However, MeBP requires backward computation through all transformer layers at every step, where weight decompression alone accounts for 32–42% of backward time. We propose Layer-Cyclic Selective Backpropagation (LCSB), which computes gradients for only a subset of layers per step. Our key insight is that residual connections guarantee gradient flow through identity paths, while AdamW momentum provides implicit updates for non-selected layers. We interpret LCSB as Block Coordinate Descent on the LoRA parameter space, providing theoretical justification for convergence. LCSB achieves up to 1.40$\times$ speedup with less than 2% quality degradation across five models and three tasks. Surprisingly, in 4-bit quantized settings, LCSB exhibits superior stability: a 3B model that completely diverges under full backpropagation converges smoothly with LCSB, suggesting an implicit regularization effect from selective gradient computation.

[288] Can Neural Networks Provide Latent Embeddings for Telemetry-Aware Greedy Routing?

Andreas Boltres, Niklas Freymuth, Gerhard Neumann

Main category: cs.LG

TL;DR: Placer uses Message Passing Networks to create latent node embeddings for explainable, greedy routing decisions in computer networks

Motivation: Current ML-based routing solutions sacrifice explainability due to black-box neural networks, making it hard to understand routing decisions and respond to network events.

Method: Uses Message Passing Networks to transform network states into latent node embeddings, enabling quick greedy next-hop routing without solving all-pairs shortest paths
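
The greedy decision rule that the embeddings enable can be sketched directly; the `emb` mapping below is a hypothetical stand-in for the Message Passing Network's output.

```python
import numpy as np

def greedy_route(src, dst, neighbors, emb, max_hops=32):
    """Greedy next-hop routing over latent embeddings (sketch of the idea):
    at each node, forward the packet to the neighbor whose embedding is
    closest to the destination's embedding. `neighbors` maps node -> list
    of adjacent nodes; `emb` maps node -> vector."""
    path, node = [src], src
    while node != dst and len(path) <= max_hops:
        node = min(neighbors[node],
                   key=lambda v: np.linalg.norm(emb[v] - emb[dst]))
        path.append(node)
    return path
```

Each hop is a local nearest-neighbor comparison, so no all-pairs shortest-path computation is needed, and the embedding geometry itself can be visualized to explain why a hop was chosen.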

Result: Provides explainable routing decisions with visualization capabilities to show how network events shape routing choices

Conclusion: Placer offers both effective routing and explainability through latent node embeddings and visualization of decision processes

Abstract: Telemetry-Aware routing promises to increase efficacy and responsiveness to traffic surges in computer networks. Recent research leverages Machine Learning to deal with the complex dependency between network state and routing, but sacrifices explainability of routing decisions due to the black-box nature of the proposed neural routing modules. We propose \emph{Placer}, a novel algorithm using Message Passing Networks to transform network states into latent node embeddings. These embeddings facilitate quick greedy next-hop routing without directly solving the all-pairs shortest paths problem, and let us visualize how certain network events shape routing decisions.

[289] QTabGAN: A Hybrid Quantum-Classical GAN for Tabular Data Synthesis

Subhangi Kumari, Rakesh Achutha, Vignesh Sivaraman

Main category: cs.LG

TL;DR: QTabGAN is a hybrid quantum-classical GAN framework for synthesizing realistic tabular data, showing significant improvements over classical methods.

Motivation: Tabular data synthesis is challenging due to heterogeneous feature types and high dimensionality, especially when real data is scarce or privacy-restricted. Quantum computing offers potential advantages for learning complex distributions.

Method: A hybrid quantum-classical GAN framework where quantum circuits learn complex data distributions and classical neural networks map these to tabular features. The model leverages quantum expressive power for distribution learning.

Result: QTabGAN achieves up to 54.07% improvement across various classification datasets and evaluation metrics compared to state-of-the-art generative models, demonstrating scalable quantum-assisted tabular data synthesis.

Conclusion: QTabGAN establishes a scalable quantum approach to tabular data synthesis and highlights the potential of quantum-assisted generative modeling for data-scarce or privacy-sensitive scenarios.

Abstract: Synthesizing realistic tabular data is challenging due to heterogeneous feature types and high dimensionality. We introduce QTabGAN, a hybrid quantum-classical generative adversarial framework for tabular data synthesis. QTabGAN is especially designed for settings where real data are scarce or restricted by privacy constraints. The model exploits the expressive power of quantum circuits to learn complex data distributions, which are then mapped to tabular features using classical neural networks. We evaluate QTabGAN on multiple classification and regression datasets and benchmark it against leading state-of-the-art generative models. Experiments show that QTabGAN achieves up to 54.07% improvement across various classification datasets and evaluation metrics, thus establishing a scalable quantum approach to tabular data synthesis and highlighting its potential for quantum-assisted generative modelling.

[290] Quantization-Robust LLM Unlearning via Low-Rank Adaptation

João Vitor Boer Abitante, Joana Meneguzzo Pasquali, Luan Fonseca Garcia, Ewerton de Oliveira, Thomas da Silva Paula, Rodrigo C. Barros, Lucas S. Kupssinskü

Main category: cs.LG

TL;DR: LoRA-based unlearning preserves unlearning updates under aggressive 4-bit quantization, preventing reversion to pre-unlearning behavior while maintaining utility and privacy.

Motivation: Standard LLM unlearning methods fail under aggressive post-training quantization (PTQ) because small parameter changes get erased, causing quantized models to revert to pre-unlearning behavior.

Method: Propose quantization-robust unlearning via Low-Rank Adaptation (LoRA): freeze base model and concentrate unlearning updates into trainable adapters to preserve effective changes after 4-bit quantization.
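
The underlying failure mode is easy to reproduce with a toy symmetric round-to-nearest 4-bit quantizer; the sizes, scale handling, and quantizer below are illustrative, not the paper's PTQ setup.

```python
import numpy as np

def quantize_4bit(w, scale):
    """Toy symmetric round-to-nearest 4-bit PTQ: integer grid -7..7 on a
    fixed scale (real PTQ schemes are per-group and calibrated)."""
    step = scale / 7.0
    return np.clip(np.round(w / step), -7, 7) * step

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))             # base weight
delta = 1e-3 * rng.normal(size=W.shape)   # small full-parameter unlearning update
s = np.abs(W).max()

# Full-parameter unlearning: the update is folded into W before quantization,
# and round-to-nearest erases it almost everywhere.
survived = quantize_4bit(W + delta, s) - quantize_4bit(W, s)

# LoRA-style unlearning: the quantized base is frozen; the update lives in a
# full-precision adapter applied on top, so it survives quantization exactly.
W_q, adapter_update = quantize_4bit(W, s), delta
```

Because the per-parameter update is far smaller than the quantization step, folding it into the base weights leaves almost no trace after rounding, while the adapter path preserves it bit-for-bit.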

Result: LoRA improves 4-bit utility by up to 7.93 points on MUSE dataset, reduces privacy leakage under 4-bit PTQ (PrivLeak moves from -25.68 to -5.86), while maintaining strong forgetting performance.

Conclusion: LoRA-based unlearning is beneficial for deployment scenarios requiring quantization, as it preserves unlearning updates that would otherwise be erased by aggressive low-bit quantization.

Abstract: Large Language Model (LLM) unlearning aims to remove targeted knowledge from a trained model, but practical deployments often require post-training quantization (PTQ) for efficient inference. However, aggressive low-bit PTQ can mask or erase unlearning updates, causing quantized models to revert to pre-unlearning behavior. We show that standard full-parameter fine-tuning often induces parameter changes that are too small to survive 4-bit quantization. We propose quantization-robust unlearning via low-rank adaptation (LoRA): we freeze the base model and concentrate unlearning into trainable adapters so that the effective update is preserved after quantization. On Llama-2-7B evaluated with the MUSE dataset (BOOKS and NEWS), LoRA improves 4-bit utility by up to 7.93 points (NPO+GDR on BOOKS: 50.17 to 58.10) and yields higher 4-bit utility on NEWS for GA+GDR (40.06 to 44.82, increase of 4.76). LoRA also substantially reduces privacy leakage under 4-bit PTQ, e.g., for GA+KLR on BOOKS, PrivLeak moves from -25.68 to -5.86 (closer to ideal 0), while maintaining strong forgetting (VerMem and KnowMem near 0). Thus, using LoRA for Machine Unlearning is beneficial for scenarios where quantization is necessary for model deployment.

[291] GRAIL: Geometry-Aware Retrieval-Augmented Inference with LLMs over Hyperbolic Representations of Patient Trajectories

Zhan Qu, Michael Färber

Main category: cs.LG

TL;DR: GRAIL is a framework for predicting future clinical events from longitudinal EHRs using structured geometric representations and structure-aware retrieval, addressing challenges of sparse multi-type events and hierarchical medical vocabularies.

Motivation: Predicting clinical events from EHRs is challenging due to sparse multi-type events, hierarchical medical vocabularies, and LLM hallucination issues when reasoning over long structured histories. The paper aims to improve next-visit event prediction.

Method: GRAIL constructs a unified clinical graph combining deterministic coding-system hierarchies with data-driven temporal associations, embeds it in hyperbolic space, summarizes visits as probabilistic Central Events, and uses structure-aware retrieval with optional LLM reranking.
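
For reference, distance in the Poincaré ball (a standard model of the hyperbolic space such embeddings use) is computed as follows; this is the textbook formula, not the paper's code.

```python
import numpy as np

def poincare_dist(u, v):
    """Geodesic distance in the Poincaré ball. Distances blow up near the
    boundary, which is what lets tree-like hierarchies (such as medical
    coding systems) embed with low distortion."""
    d2 = np.dot(u - v, u - v)
    denom = (1 - np.dot(u, u)) * (1 - np.dot(v, v))
    return np.arccosh(1 + 2 * d2 / denom)
```

Coarse (high-level) codes sit near the origin and fine-grained codes near the boundary, so hierarchy-consistent retrieval reduces to nearest-neighbor search under this metric.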

Result: Experiments on MIMIC-IV show GRAIL consistently improves multi-type next-visit prediction and yields more hierarchy-consistent forecasts compared to baseline methods.

Conclusion: GRAIL effectively addresses EHR prediction challenges through structured geometric representations and retrieval-based approaches, demonstrating improved performance and clinical plausibility.

Abstract: Predicting future clinical events from longitudinal electronic health records (EHRs) is challenging due to sparse multi-type clinical events, hierarchical medical vocabularies, and the tendency of large language models (LLMs) to hallucinate when reasoning over long structured histories. We study next-visit event prediction, which aims to forecast a patient’s upcoming clinical events based on prior visits. We propose GRAIL, a framework that models longitudinal EHRs using structured geometric representations and structure-aware retrieval. GRAIL constructs a unified clinical graph by combining deterministic coding-system hierarchies with data-driven temporal associations across event types, embeds this graph in hyperbolic space, and summarizes each visit as a probabilistic Central Event that denoises sparse observations. At inference time, GRAIL retrieves a structured set of clinically plausible future events aligned with hierarchical and temporal progression, and optionally refines their ranking using an LLM as a constrained inference-time reranker. Experiments on MIMIC-IV show that GRAIL consistently improves multi-type next-visit prediction and yields more hierarchy-consistent forecasts.

[292] Physics-Informed Laplace Neural Operator for Solving Partial Differential Equations

Heechang Kim, Qianying Cao, Hyomin Shin, Seungchul Lee, George Em Karniadakis, Minseok Choi

Main category: cs.LG

TL;DR: PILNO enhances Laplace Neural Operator with physics-informed training using virtual inputs and temporal-causality weighting for better accuracy in small-data regimes and OOD generalization for PDE solving.

Motivation: Purely data-driven neural operators require extensive training data and generalize poorly in small-data regimes and under out-of-distribution input functions. There's a need for more data-efficient and robust PDE solvers.

Method: Proposes Physics-Informed Laplace Neural Operator (PILNO) that embeds governing physics through PDE, boundary condition, and initial condition residuals. Uses Advanced LNO backbone with pole-residue transient representation and FNO-style Fourier multiplier. Employs virtual inputs (unlabeled ensemble for broad spectral coverage) and temporal-causality weighting (time-decaying residual weighting).
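
The temporal-causality weighting can be sketched as a weighted residual loss; the exponential form of the decay and the rate `lam` below are assumptions based on the description, not the paper's exact scheme.

```python
import numpy as np

def causal_physics_loss(residuals, t, lam=1.0):
    """Temporal-causality weighting (sketch): the physics residual at time t
    is down-weighted by w(t) = exp(-lam * t), so the optimizer must resolve
    early-time dynamics before later ones."""
    w = np.exp(-lam * np.asarray(t, dtype=float))
    return float(np.mean(w * np.asarray(residuals, dtype=float) ** 2))
```

With this weighting, a given residual magnitude at an early time contributes far more to the loss than the same residual at a late time, which stabilizes optimization for time-dependent PDEs.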

Result: PILNO consistently improves accuracy in small-data settings (N_train <= 27), reduces run-to-run variability across random seeds, and achieves stronger OOD generalization than purely data-driven baselines across four PDE benchmarks.

Conclusion: Physics-informed training with virtual inputs and temporal-causality weighting makes neural operators more data-efficient and robust for solving parametric PDEs, especially in small-data and OOD scenarios.

Abstract: Neural operators have emerged as fast surrogate solvers for parametric partial differential equations (PDEs). However, purely data-driven models often require extensive training data and can generalize poorly, especially in small-data regimes and under unseen (out-of-distribution) input functions that are not represented in the training data. To address these limitations, we propose the Physics-Informed Laplace Neural Operator (PILNO), which enhances the Laplace Neural Operator (LNO) by embedding governing physics into training through PDE, boundary condition, and initial condition residuals. To improve expressivity, we first introduce an Advanced LNO (ALNO) backbone that retains a pole-residue transient representation while replacing the steady-state branch with an FNO-style Fourier multiplier. To make physics-informed training both data-efficient and robust, PILNO further leverages (i) virtual inputs: an unlabeled ensemble of input functions spanning a broad spectral range that provides abundant physics-only supervision and explicitly targets out-of-distribution (OOD) regimes; and (ii) temporal-causality weighting: a time-decaying reweighting of the physics residual that prioritizes early-time dynamics and stabilizes optimization for time-dependent PDEs. Across four representative benchmarks – Burgers’ equation, Darcy flow, a reaction-diffusion system, and a forced KdV equation – PILNO consistently improves accuracy in small-data settings (e.g., N_train <= 27), reduces run-to-run variability across random seeds, and achieves stronger OOD generalization than purely data-driven baselines.

[293] FLAC: Maximum Entropy RL via Kinetic Energy Regularized Bridge Matching

Lei Lv, Yunfei Li, Yu Luo, Fuchun Sun, Xiao Ma

Main category: cs.LG

TL;DR: FLAC is a likelihood-free RL framework that uses kinetic energy regularization instead of explicit action densities for maximum entropy control with iterative generative policies.

Motivation: Iterative generative policies like diffusion models are expressive for continuous control but complicate Maximum Entropy RL because their action log-densities are not directly accessible. Existing methods struggle with density estimation for these policies.

Method: Proposes Field Least-Energy Actor-Critic (FLAC) that formulates policy optimization as a Generalized Schrödinger Bridge problem relative to a high-entropy reference process. Uses kinetic energy of the velocity field as a proxy for divergence from reference, avoiding explicit density estimation. Derives energy-regularized policy iteration with automatic tuning via Lagrangian dual mechanism.
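
The kinetic-energy penalty is a plain path integral; a discretized numpy sketch is below (the coupling to the actor-critic update and the Lagrangian tuning are omitted).

```python
import numpy as np

def kinetic_energy(velocities, dt):
    """Path-space kinetic energy E = sum_t ||v_t||^2 * dt (Riemann-sum
    sketch). In FLAC's view, penalizing E bounds how far the induced
    terminal action distribution drifts from the high-entropy reference,
    without ever evaluating an action density."""
    return float(np.sum(velocities ** 2) * dt)
```

Among discretized paths with the same displacement, the constant-velocity path has the smallest energy, which is the sense in which the penalty favors smooth, low-divergence transport from the reference.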

Result: FLAC achieves superior or comparable performance on high-dimensional benchmarks relative to strong baselines while avoiding explicit density estimation. The kinetic energy regularization effectively controls policy stochasticity.

Conclusion: FLAC provides a physically grounded, likelihood-free framework for maximum entropy RL with iterative generative policies, offering a principled alternative to density-based approaches by using kinetic energy as a regularization proxy.

Abstract: Iterative generative policies, such as diffusion models and flow matching, offer superior expressivity for continuous control but complicate Maximum Entropy Reinforcement Learning because their action log-densities are not directly accessible. To address this, we propose Field Least-Energy Actor-Critic (FLAC), a likelihood-free framework that regulates policy stochasticity by penalizing the kinetic energy of the velocity field. Our key insight is to formulate policy optimization as a Generalized Schrödinger Bridge (GSB) problem relative to a high-entropy reference process (e.g., uniform). Under this view, the maximum-entropy principle emerges naturally as staying close to a high-entropy reference while optimizing return, without requiring explicit action densities. In this framework, kinetic energy serves as a physically grounded proxy for divergence from the reference: minimizing path-space energy bounds the deviation of the induced terminal action distribution. Building on this view, we derive an energy-regularized policy iteration scheme and a practical off-policy algorithm that automatically tunes the kinetic energy via a Lagrangian dual mechanism. Empirically, FLAC achieves superior or comparable performance on high-dimensional benchmarks relative to strong baselines, while avoiding explicit density estimation.

[294] Mixture of Predefined Experts: Maximizing Data Usage on Vertical Federated Learning

Jon Irureta, Gorka Azkune, Jon Imaz, Aizea Lojo, Javier Fernandez-Marques

Main category: cs.LG

TL;DR: Split-MoPE: A vertical federated learning framework using split learning with predefined experts to handle sample misalignment without requiring full data overlap, achieving efficient single-round communication with robustness and interpretability.

Motivation: Existing vertical federated learning frameworks assume full sample alignment across participants, which rarely holds in real-world scenarios. There's a need for methods that can handle sample misalignment while maintaining privacy, efficiency, and performance.

Method: Combines split learning with a Mixture of Predefined Experts (MoPE) architecture. Unlike standard MoE with learned routing, MoPE uses predefined experts to process specific data alignments. Leverages pretrained encoders for target domains and operates with single communication round.
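
The "predefined" routing is a lookup keyed by which parties hold features for a given sample, rather than a learned gate. A toy sketch with hypothetical party names:

```python
def route(available_parties, experts):
    """Predefined routing (sketch): the expert handling a sample is fixed
    by which parties' features are present for it. No gating network is
    trained, so every alignment pattern can be used at train and inference
    time."""
    return experts[frozenset(available_parties)]

# Hypothetical experts, one per feature-availability pattern
experts = {
    frozenset({"bank"}): "expert_bank_only",
    frozenset({"insurer"}): "expert_insurer_only",
    frozenset({"bank", "insurer"}): "expert_joint",
}
```

Because routing is deterministic, each collaborator's contribution to a prediction is directly attributable, which is the source of the per-sample interpretability claimed above.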

Result: Outperforms state-of-the-art systems like LASER and Vertical SplitNN on vision (CIFAR-10/100) and tabular (Breast Cancer Wisconsin) datasets, especially in scenarios with high data missingness. Achieves SOTA performance with reduced communication overhead.

Conclusion: Split-MoPE effectively addresses sample misalignment in vertical federated learning, providing efficient single-round communication, robustness against malicious participants, and per-sample interpretability while maximizing data usage.

Abstract: Vertical Federated Learning (VFL) has emerged as a critical paradigm for collaborative model training in privacy-sensitive domains such as finance and healthcare. However, most existing VFL frameworks rely on the idealized assumption of full sample alignment across participants, a premise that rarely holds in real-world scenarios. To bridge this gap, this work introduces Split-MoPE, a novel framework that integrates Split Learning with a specialized Mixture of Predefined Experts (MoPE) architecture. Unlike standard Mixture of Experts (MoE), where routing is learned dynamically, MoPE uses predefined experts to process specific data alignments, effectively maximizing data usage during both training and inference without requiring full sample overlap. By leveraging pretrained encoders for target data domains, Split-MoPE achieves state-of-the-art performance in a single communication round, significantly reducing the communication footprint compared to multi-round end-to-end training. Furthermore, unlike existing proposals that address sample misalignment, this novel architecture provides inherent robustness against malicious or noisy participants and offers per-sample interpretability by quantifying each collaborator’s contribution to each prediction. Extensive evaluations on vision (CIFAR-10/100) and tabular (Breast Cancer Wisconsin) datasets demonstrate that Split-MoPE consistently outperforms state-of-the-art systems such as LASER and Vertical SplitNN, particularly in challenging scenarios with high data missingness.

[295] Amortized Reasoning Tree Search: Decoupling Proposal and Decision in Large Language Models

Zesheng Hong, Jiadong Yu, Hui Pan

Main category: cs.LG

TL;DR: ARTS decouples generation from verification to prevent suppression of rare reasoning paths in RL-aligned LLMs, achieving comparable performance to fine-tuned models without modifying the generative backbone.

Motivation: RLVR systematically suppresses valid but rare reasoning paths in LLMs due to the "Normalization Squeeze": mode-seeking policy gradients act as high-pass likelihood filters that drive rare correct traces to statistical extinction.
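
The claimed collapse is easy to reproduce in a toy simulation; the update rule, boost factor, and numbers below are illustrative, not the paper's analysis.

```python
import numpy as np

def squeeze_step(p, correct, n_samples, rng, lr=0.5):
    """One toy reweighting step: draw n_samples traces, upweight the sampled
    correct ones, renormalize. A valid trace too rare to be sampled receives
    no credit and loses mass to the dominant correct mode every step."""
    counts = rng.multinomial(n_samples, p)
    p = p * (1.0 + lr * correct * (counts > 0))
    return p / p.sum()

rng = np.random.default_rng(0)
p = np.array([0.900, 0.099, 0.001])       # trace 2 is valid but rare
correct = np.array([1.0, 0.0, 1.0])
for _ in range(50):
    p = squeeze_step(p, correct, n_samples=16, rng=rng)
# p[2] is driven toward statistical extinction despite being correct
```

With 16 samples per step, the rare correct trace is almost never drawn, so renormalization transfers its mass to the dominant mode; this is the finite-sampling high-pass filter the paper formalizes.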

Method: Proposes Amortized Reasoning Tree Search (ARTS) that decouples generation from verification. Uses Flow Matching objective to repurpose verifier for estimating probability flow conservation, enabling navigation through sparse, high-entropy search spaces.

Result: On MATH-500 benchmark: ARTS achieves 74.6% (BoN@16), matching fully fine-tuned policies (74.7%). On long-tail subset where RL optimization collapses to 0% pass@k, ARTS uniquely recovers significant performance.

Conclusion: Disentangling verification from generation offers a more robust pathway for solving complex reasoning tasks, preserving rare but valid reasoning paths that RLVR systematically suppresses.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has established itself as the dominant paradigm for instilling rigorous reasoning capabilities in Large Language Models. While effective at amplifying dominant behaviors, we identify a critical pathology in this alignment process: the systematic suppression of valid but rare (low-likelihood under the base model distribution) reasoning paths. We theoretically characterize this phenomenon as a “Normalization Squeeze,” where the interplay between mode-seeking policy gradients and finite sampling acts as a high-pass likelihood filter, driving the probability of rare correct traces to statistical extinction. To counteract this collapse without discarding the base model’s latent diversity, we propose Amortized Reasoning Tree Search (ARTS). Unlike standard approaches that force internalization via parameter updates, ARTS prioritizes deliberation by decoupling generation from verification. We introduce a Flow Matching objective that repurposes the verifier to estimate the conservation of probability flow, enabling robust navigation through sparse, high-entropy search spaces where traditional discriminative objectives fail. Extensive experiments on the MATH-500 benchmark demonstrate that ARTS achieves a performance of 74.6% (BoN@16), effectively matching fully fine-tuned policies (74.7%) without modifying the generative backbone. Crucially, on the long-tail subset where coupled RL optimization collapses to 0% pass@k, ARTS uniquely recovers significant performance, suggesting that disentangling verification from generation offers a more robust pathway for solving complex reasoning tasks.

[296] ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools – From Consensus Learning to Ambiguity-Driven Emotion Reasoning

Esther Sun, Bo-Hao Su, Abinay Reddy Naini, Shinji Watanabe, Carlos Busso

Main category: cs.LG

TL;DR: ADEPT is a framework that transforms Speech LLMs into agents for emotion recognition through multi-turn inquiry with acoustic and semantic probing tools, shifting from consensus learning to ambiguity-driven reasoning.

Motivation: Speech LLMs enable high-level emotion reasoning but produce ungrounded, text-biased judgments without verifiable acoustic evidence, while self-supervised speech encoders provide strong acoustic representations but lack interpretability. There's a need to bridge this gap and handle the inherent complexity and co-occurrence of emotions in human affect.

Method: ADEPT reframes emotion recognition as a multi-turn inquiry process where an SLLM agent maintains an evolving candidate emotion set and adaptively invokes dedicated semantic and acoustic probing tools. It uses a structured pipeline of candidate generation, evidence collection, and adjudication, with Group Relative Policy Optimization (GRPO) and an Evidence Trust Gate to couple tool-usage with prediction quality.

Result: ADEPT improves primary emotion accuracy in most settings while substantially improving minor emotion characterization, producing explanations grounded in auditable acoustic and semantic evidence.

Conclusion: ADEPT successfully bridges the gap between high-level reasoning of SLLMs and acoustic grounding of speech encoders, enabling evidence-grounded emotion recognition that handles ambiguity and co-occurring emotions effectively.

Abstract: Speech Large Language Models (SLLMs) enable high-level emotion reasoning but often produce ungrounded, text-biased judgments without verifiable acoustic evidence. In contrast, self-supervised speech encoders such as WavLM provide strong acoustic representations yet remain opaque discriminative models with limited interpretability. To bridge this gap, we introduce ADEPT (Agentic Decoding of Emotion via Evidence Probing Tools), a framework that reframes emotion recognition as a multi-turn inquiry process rather than a single-pass prediction. ADEPT transforms an SLLM into an agent that maintains an evolving candidate emotion set and adaptively invokes dedicated semantic and acoustic probing tools within a structured pipeline of candidate generation, evidence collection, and adjudication. Crucially, ADEPT enables a paradigm shift from consensus learning to ambiguity-driven emotion reasoning. Since human affect exhibits inherent complexity and frequent co-occurrence of emotions, we treat minority annotations as informative perceptual signals rather than discarding them as noise. Finally, we integrate Group Relative Policy Optimization (GRPO) with an Evidence Trust Gate to explicitly couple tool-usage behaviors with prediction quality and enforce evidence-grounded reasoning. Experiments show that ADEPT improves primary emotion accuracy in most settings while substantially improving minor emotion characterization, producing explanations grounded in auditable acoustic and semantic evidence.

[297] Adaptive Structured Pruning of Convolutional Neural Networks for Time Series Classification

Javidan Abdullayev, Maxime Devanne, Cyril Meyer, Ali Ismail-Fawaz, Jonathan Weber, Germain Forestier

Main category: cs.LG

TL;DR: Dynamic Structured Pruning (DSP) is an automatic pruning framework for time series classification models that removes redundant filters without manual hyperparameter tuning, achieving significant model compression while maintaining accuracy.

DetailsMotivation: Deep learning models for Time Series Classification (TSC) have high computational and memory requirements that limit deployment on resource-constrained devices. Existing structured pruning methods rely on manually tuned hyperparameters like pruning ratios, which limit scalability and generalization across datasets.

Method: DSP introduces an instance-wise sparsity loss during training to induce channel-level sparsity, followed by a global activation analysis to identify and prune redundant filters automatically without needing any predefined pruning ratio.

Result: Validated on 128 UCR datasets using LITETime and InceptionTime architectures, DSP achieves average compression of 58% for LITETime and 75% for InceptionTime while maintaining classification accuracy. Redundancy analyses confirm DSP produces compact and informative representations.

Conclusion: DSP offers a practical path for scalable and efficient deep TSC deployment by automatically pruning redundant filters without manual hyperparameter tuning, addressing computational bottlenecks for resource-constrained devices.

Abstract: Deep learning models for Time Series Classification (TSC) have achieved strong predictive performance but their high computational and memory requirements often limit deployment on resource-constrained devices. While structured pruning can address these issues by removing redundant filters, existing methods typically rely on manually tuned hyperparameters such as pruning ratios which limit scalability and generalization across datasets. In this work, we propose Dynamic Structured Pruning (DSP), a fully automatic, structured pruning framework for convolution-based TSC models. DSP introduces an instance-wise sparsity loss during training to induce channel-level sparsity, followed by a global activation analysis to identify and prune redundant filters without needing any predefined pruning ratio. This work tackles computational bottlenecks of deep TSC models for deployment on resource-constrained devices. We validate DSP on 128 UCR datasets using two different deep state-of-the-art architectures: LITETime and InceptionTime. Our approach achieves an average compression of 58% for LITETime and 75% for InceptionTime architectures while maintaining classification accuracy. Redundancy analyses confirm that DSP produces compact and informative representations, offering a practical path for scalable and efficient deep TSC deployment.
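The ratio-free pruning step can be sketched roughly as follows. This is a hypothetical illustration, not the authors' code: the random activation matrix, the per-channel score, and the relative threshold stand in for DSP's sparsity-trained activations and its global activation analysis.

```python
import numpy as np

# Hypothetical sketch of DSP-style ratio-free pruning: given per-channel
# activations gathered over a dataset, drop channels whose activation mass is
# negligible relative to the layer's strongest channel, instead of cutting a
# fixed user-chosen fraction of filters.
rng = np.random.default_rng(0)
activations = rng.random((1000, 32))          # 1000 samples, 32 conv channels
activations[:, 20:] *= 1e-4                   # channels 20..31 are nearly dead

channel_score = np.abs(activations).mean(axis=0)     # global activation analysis
keep = channel_score > 1e-2 * channel_score.max()    # data-driven cut, no preset ratio

pruned_channels = int((~keep).sum())
```

The threshold here is a relative one derived from the data itself, which is the point of the "no predefined pruning ratio" claim: how many filters go depends on how many are actually redundant.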

[298] X-VORTEX: Spatio-Temporal Contrastive Learning for Wake Vortex Trajectory Forecasting

Zhan Qu, Michael Färber

Main category: cs.LG

TL;DR: X-VORTEX is a spatio-temporal contrastive learning framework that learns physics-aware representations from unlabeled LiDAR point cloud sequences to track aircraft wake vortices, addressing sensor sparsity and time-varying dynamics with minimal labeled data.

DetailsMotivation: Wake vortices pose safety and capacity challenges for air traffic management. Tracking them from LiDAR measurements is difficult due to sparse scans, fading signatures, and expensive point-wise annotation. Existing approaches treat each scan independently, overlooking temporal structure and not scaling to vast unlabeled archives.

Method: X-VORTEX uses spatio-temporal contrastive learning grounded in Augmentation Overlap Theory. It constructs paired inputs from the same flight event by combining weakly perturbed sequences with strongly augmented counterparts via temporal subsampling and spatial masking. A time-distributed geometric encoder extracts per-scan features, and a sequential aggregator models evolving vortex state across variable-length sequences.

Result: On a real-world dataset of over one million LiDAR scans, X-VORTEX achieves superior vortex center localization while using only 1% of the labeled data required by supervised baselines. The learned representations also support accurate trajectory forecasting.

Conclusion: X-VORTEX demonstrates that spatio-temporal contrastive learning can effectively learn physics-aware representations from unlabeled LiDAR sequences, enabling accurate vortex tracking with minimal supervision and scaling to large unlabeled archives.

Abstract: Wake vortices are strong, coherent air turbulences created by aircraft, and they pose a major safety and capacity challenge for air traffic management. Tracking how vortices move, weaken, and dissipate over time from LiDAR measurements is still difficult because scans are sparse, vortex signatures fade as the flow breaks down under atmospheric turbulence and instabilities, and point-wise annotation is prohibitively expensive. Existing approaches largely treat each scan as an independent, fully supervised segmentation problem, which overlooks temporal structure and does not scale to the vast unlabeled archives collected in practice. We present X-VORTEX, a spatio-temporal contrastive learning framework grounded in Augmentation Overlap Theory that learns physics-aware representations from unlabeled LiDAR point cloud sequences. X-VORTEX addresses two core challenges: sensor sparsity and time-varying vortex dynamics. It constructs paired inputs from the same underlying flight event by combining a weakly perturbed sequence with a strongly augmented counterpart produced via temporal subsampling and spatial masking, encouraging the model to align representations across missing frames and partial observations. Architecturally, a time-distributed geometric encoder extracts per-scan features and a sequential aggregator models the evolving vortex state across variable-length sequences. We evaluate on a real-world dataset of over one million LiDAR scans. X-VORTEX achieves superior vortex center localization while using only 1% of the labeled data required by supervised baselines, and the learned representations support accurate trajectory forecasting.
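The weak/strong pair construction described in the abstract can be sketched as below. The scan sizes, augmentation rates, and noise scale are assumptions for illustration; the real inputs are LiDAR point cloud sequences from one flight event.

```python
import numpy as np

# Illustrative sketch (not the authors' code) of X-VORTEX-style pair building:
# the weak view lightly jitters every scan, while the strong view drops frames
# (temporal subsampling) and masks points (spatial masking).
rng = np.random.default_rng(1)
sequence = [rng.normal(size=(200, 3)) for _ in range(10)]   # 10 scans, 200 points each

def weak_view(seq, noise=0.01):
    return [s + noise * rng.normal(size=s.shape) for s in seq]

def strong_view(seq, keep_frames=0.5, keep_points=0.6):
    frames = sorted(rng.choice(len(seq), size=int(keep_frames * len(seq)), replace=False))
    out = []
    for i in frames:
        mask = rng.random(len(seq[i])) < keep_points        # random spatial mask
        out.append(seq[i][mask])
    return out

anchor, positive = weak_view(sequence), strong_view(sequence)
```

A contrastive objective would then pull the encoded `anchor` and `positive` together, forcing the model to align representations across missing frames and partial observations.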

[299] Hierarchical Successor Representation for Robust Transfer

Changmin Yu, Máté Lengyel

Main category: cs.LG

TL;DR: HSR (Hierarchical Successor Representation) learns stable, policy-agnostic state features via temporal abstractions and NMF, enabling efficient task transfer and exploration in complex environments.

DetailsMotivation: Classical Successor Representations are limited by policy dependence (becoming obsolete when policies change) and suffer from spectral diffusion in complex environments, leading to poor scaling and dense features.

Method: Proposes Hierarchical Successor Representation (HSR) incorporating temporal abstractions into predictive representations, then applies non-negative matrix factorization (NMF) to obtain sparse, low-rank state representations.

Result: HSR-NMF learns stable state features robust to policy changes, enables highly sample-efficient transfer to novel tasks in multi-compartmental environments, discovers interpretable topological structures, and scales to large procedurally generated environments.

Conclusion: HSR provides a policy-agnostic hierarchical map bridging model-free optimality and model-based flexibility, useful for task transfer and efficient exploration in complex environments.

Abstract: The successor representation (SR) provides a powerful framework for decoupling predictive dynamics from rewards, enabling rapid generalisation across reward configurations. However, the classical SR is limited by its inherent policy dependence: policies change due to ongoing learning, environmental non-stationarities, and changes in task demands, making established predictive representations obsolete. Furthermore, in topologically complex environments, SRs suffer from spectral diffusion, leading to dense and overlapping features that scale poorly. Here we propose the Hierarchical Successor Representation (HSR) for overcoming these limitations. By incorporating temporal abstractions into the construction of predictive representations, HSR learns stable state features which are robust to task-induced policy changes. Applying non-negative matrix factorisation (NMF) to the HSR yields a sparse, low-rank state representation that facilitates highly sample-efficient transfer to novel tasks in multi-compartmental environments. Further analysis reveals that HSR-NMF discovers interpretable topological structures, providing a policy-agnostic hierarchical map that effectively bridges model-free optimality and model-based flexibility. Beyond providing a useful basis for task-transfer, we show that HSR’s temporally extended predictive structure can also be leveraged to drive efficient exploration, effectively scaling to large, procedurally generated environments.
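The SR-plus-NMF building block the paper extends can be shown concretely on a toy environment. This sketch omits HSR's temporal abstractions entirely: it computes the classical SR for a random walk on a small ring and then applies a plain multiplicative-update NMF.

```python
import numpy as np

# For a random-walk policy with transition matrix T, the successor
# representation is M = (I - gamma * T)^{-1}; a non-negative factorisation
# M ~ W @ H then yields sparse, low-rank state features.
n, gamma, rank = 8, 0.9, 3
T = np.zeros((n, n))
for s in range(n):                            # uniform random walk on a ring
    T[s, (s - 1) % n] = T[s, (s + 1) % n] = 0.5

M = np.linalg.inv(np.eye(n) - gamma * T)      # successor representation

rng = np.random.default_rng(2)                # multiplicative-update NMF
W, H = rng.random((n, rank)), rng.random((rank, n))
for _ in range(500):
    H *= (W.T @ M) / (W.T @ W @ H + 1e-12)
    W *= (M @ H.T) / (W @ H @ H.T + 1e-12)

recon_error = np.linalg.norm(M - W @ H) / np.linalg.norm(M)
```

Each row of `M` sums to 1/(1-gamma), the expected discounted occupancy mass, and the non-negative factors give an interpretable low-rank basis over states.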

[300] Closing the Loop: A Control-Theoretic Framework for Provably Stable Time Series Forecasting with LLMs

Xingyu Zhang, Hanyun Du, Zeen Song, Jianqi Zhang, Changwen Zheng, Wenwen Qiang

Main category: cs.LG

TL;DR: F-LLM introduces a closed-loop feedback control framework for LLM-based time series forecasting to mitigate error accumulation in autoregressive generation.

DetailsMotivation: Current LLM approaches for time series forecasting use naive autoregressive generation, which suffers from exposure bias and error accumulation during inference as models consume their own generated outputs recursively, leading to trajectory drift over long horizons.

Method: Reformulates autoregressive forecasting through control theory, proposing F-LLM (Feedback-driven LLM) with a closed-loop framework featuring a learnable residual estimator (Observer) and a feedback controller to actively stabilize trajectories.

Result: Extensive experiments show F-LLM significantly mitigates error propagation and achieves good performance on time series benchmarks, with theoretical guarantees of uniformly bounded error under local Lipschitz constraints.

Conclusion: The closed-loop feedback mechanism provides a principled solution to error accumulation in LLM-based time series forecasting, offering both theoretical guarantees and empirical improvements over standard autoregressive approaches.

Abstract: Large Language Models (LLMs) have recently shown exceptional potential in time series forecasting, leveraging their inherent sequential reasoning capabilities to model complex temporal dynamics. However, existing approaches typically employ a naive autoregressive generation strategy. We identify a critical theoretical flaw in this paradigm: during inference, the model operates in an open-loop manner, consuming its own generated outputs recursively. This leads to inevitable error accumulation (exposure bias), where minor early deviations cascade into significant trajectory drift over long horizons. In this paper, we reformulate autoregressive forecasting through the lens of control theory, proposing \textbf{F-LLM} (Feedback-driven LLM), a novel closed-loop framework. Unlike standard methods that passively propagate errors, F-LLM actively stabilizes the trajectory via a learnable residual estimator (Observer) and a feedback controller. Furthermore, we provide a theoretical guarantee that our closed-loop mechanism ensures uniformly bounded error, provided the base model satisfies a local Lipschitz constraint. Extensive experiments demonstrate that F-LLM significantly mitigates error propagation, achieving good performance on time series benchmarks.
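Why a feedback term bounds rollout error can be seen in a toy linear example. This is my own construction, not the paper's model: the "base forecaster" underestimates a drift term, an Observer estimates the one-step residual from held-out data, and the controller feeds it back at every autoregressive step.

```python
import numpy as np

# Open-loop vs closed-loop autoregressive rollout on a toy affine system.
a, drift, horizon, y0 = 0.95, 0.5, 50, 1.0

def truth(y): return a * y + drift
def model(y): return a * y                    # biased base forecaster (misses drift)

# Observer: average one-step residual truth(y) - model(y) on training points.
train = np.linspace(0.0, 5.0, 20)
residual_hat = np.mean([truth(y) - model(y) for y in train])

y_true = y_open = y_closed = y0
for _ in range(horizon):
    y_true = truth(y_true)
    y_open = model(y_open)                    # open loop: errors compound
    y_closed = model(y_closed) + residual_hat # closed loop: controller correction

open_err, closed_err = abs(y_open - y_true), abs(y_closed - y_true)
```

The open-loop trajectory drifts steadily away from the truth over the horizon, while the feedback-corrected rollout stays bounded, which is the qualitative behaviour the paper's Lipschitz-based guarantee formalises.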

[301] Transporting Task Vectors across Different Architectures without Training

Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Angelo Porrello, Simone Calderara

Main category: cs.LG

TL;DR: Theseus enables training-free transfer of task-specific updates across models of different widths by matching functional effects on representations rather than parameters.

DetailsMotivation: Current methods for transferring task-specific parameter updates only work between identical architectures, limiting practical utility. There's a need to transfer updates across models of different widths without retraining.

Method: Theseus characterizes task updates by their functional effect on intermediate representations rather than direct parameter matching. It formalizes task-vector transport as functional matching on activations, using orthogonal Procrustes analysis to align representation spaces, yielding a stable closed-form solution that preserves update geometry.

Result: Theseus shows consistent improvements over strong baselines on vision and language models across different widths, enabling meaningful transfer of task updates without additional training or backpropagation.

Conclusion: Task updates can be effectively transferred across heterogeneous architectures when defined functionally rather than parametrically, enabling more flexible model adaptation.

Abstract: Adapting large pre-trained models to downstream tasks often produces task-specific parameter updates that are expensive to relearn for every model variant. While recent work has shown that such updates can be transferred between models with identical architectures, transferring them across models of different widths remains largely unexplored. In this work, we introduce Theseus, a training-free method for transporting task-specific updates across heterogeneous models. Rather than matching parameters directly, we characterize a task update by the functional effect it induces on intermediate representations. We formalize task-vector transport as a functional matching problem on observed activations and show that, after aligning representation spaces via orthogonal Procrustes analysis, it admits a stable closed-form solution that preserves the geometry of the update. We evaluate Theseus on vision and language models across different widths, showing consistent improvements over strong baselines without additional training or backpropagation. Our results show that task updates can be meaningfully transferred across architectures when task identity is defined functionally rather than parametrically.
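The alignment step named in the abstract, orthogonal Procrustes analysis, has a standard closed form that a short sketch can make concrete (here with equal widths for simplicity; the paper's contribution is making the transport work across widths):

```python
import numpy as np

# Orthogonal Procrustes: R = argmin ||A @ R - B||_F over orthogonal R, solved
# in closed form from an SVD of A.T @ B. A and B play the role of observed
# activations of the two models on the same inputs.
rng = np.random.default_rng(3)
A = rng.normal(size=(100, 16))                # source-model activations
R_true, _ = np.linalg.qr(rng.normal(size=(16, 16)))
B = A @ R_true                                # target activations: rotated copy

U, _, Vt = np.linalg.svd(A.T @ B)             # closed-form Procrustes solution
R = U @ Vt

alignment_error = np.linalg.norm(A @ R - B)
```

Because the solution is orthogonal, mapping a task update through `R` preserves its geometry (norms and angles), which is the stability property the abstract emphasises.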

[302] Ca-MCF: Category-level Multi-label Causal Feature selection

Wanfu Gao, Yanan Wang, Yonghao Li

Main category: cs.LG

TL;DR: Ca-MCF is a category-level multi-label causal feature selection method that decomposes labels into category nodes for fine-grained causal modeling, using competition-based recovery mechanisms to identify relevant features obscured by label correlations.

DetailsMotivation: Current multi-label causal feature selection methods operate at the label level, treating each label as monolithic and overlooking fine-grained causal mechanisms unique to individual categories within labels.

Method: Uses label category flattening to decompose label variables into specific category nodes, explanatory competition-based category-aware recovery with SCSMI and DCSMI metrics, structural symmetry checks, and cross-dimensional redundancy removal for robust Markov Blanket identification.

Result: Extensive experiments on seven real-world datasets show Ca-MCF significantly outperforms state-of-the-art benchmarks, achieving superior predictive accuracy with reduced feature dimensionality.

Conclusion: Ca-MCF effectively addresses limitations of label-level methods by enabling category-level causal feature selection, improving both accuracy and feature compactness in multi-label learning.

Abstract: Multi-label causal feature selection has attracted extensive attention in recent years. However, current methods primarily operate at the label level, treating each label variable as a monolithic entity and overlooking the fine-grained causal mechanisms unique to individual categories. To address this, we propose a Category-level Multi-label Causal Feature selection method named Ca-MCF. Ca-MCF utilizes label category flattening to decompose label variables into specific category nodes, enabling precise modeling of causal structures within the label space. Furthermore, we introduce an explanatory competition-based category-aware recovery mechanism that leverages the proposed Specific Category-Specific Mutual Information (SCSMI) and Distinct Category-Specific Mutual Information (DCSMI) to salvage causal features obscured by label correlations. The method also incorporates structural symmetry checks and cross-dimensional redundancy removal to ensure the robustness and compactness of the identified Markov Blankets. Extensive experiments across seven real-world datasets demonstrate that Ca-MCF significantly outperforms state-of-the-art benchmarks, achieving superior predictive accuracy with reduced feature dimensionality.

[303] Extending confidence calibration to generalised measures of variation

Andrew Thompson, Vivek Desai

Main category: cs.LG

TL;DR: VCE is a new calibration metric that extends Expected Calibration Error to assess calibration of any variation metric (like entropy), not just confidence, showing better properties than existing entropy-based metrics.

DetailsMotivation: Existing calibration metrics like ECE only assess calibration of maximum probability/confidence, ignoring the full probability distribution. There's a need for metrics that can assess calibration of other variation measures like entropy that consider the entire distribution.

Method: Extends the ECE framework to assess calibration of any metric of variation (not just confidence). Proposes Variation Calibration Error (VCE) as a general extension, with specific focus on entropy-based variation measures. Validates through synthetic predictions designed to be perfectly calibrated.

Result: VCE approaches zero as sample size increases for perfectly calibrated synthetic predictions, unlike existing UCE metric. Demonstrates VCE has desired statistical properties for assessing calibration of variation metrics.

Conclusion: VCE provides a principled extension of ECE to assess calibration of any variation metric, offering better properties than existing entropy-based calibration metrics like UCE.

Abstract: We propose the Variation Calibration Error (VCE) metric for assessing the calibration of machine learning classifiers. The metric can be viewed as an extension of the well-known Expected Calibration Error (ECE), which assesses the calibration of the maximum probability or confidence. Other ways of measuring the variation of a probability distribution exist that have the advantage of taking into account the full probability distribution, for example the Shannon entropy. We show how the ECE approach can be extended from assessing confidence calibration to assessing the calibration of any metric of variation. We present numerical examples on synthetic predictions that are perfectly calibrated by design, demonstrating that, in this scenario, the VCE has the desired property of approaching zero as the number of data samples increases, in contrast to another entropy-based calibration metric (the UCE) that has been proposed in the literature.
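The exact VCE definition is in the paper; the sketch below only shows the ECE mechanics it generalises: bin predictions by a chosen variation statistic (here max-probability, i.e. confidence), and average the per-bin gap between that statistic and the empirical accuracy. Swapping `variation` for an entropy-based statistic is the direction the paper takes; the function and variable names here are my own.

```python
import numpy as np

def binned_calibration_error(probs, labels, variation, empirical, n_bins=10):
    """ECE-style binned gap between a variation statistic and an empirical one."""
    v = variation(probs)
    edges = np.linspace(v.min(), v.max() + 1e-12, n_bins + 1)
    err, n = 0.0, len(labels)
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = (v >= lo) & (v < hi)
        if idx.any():
            err += idx.sum() / n * abs(v[idx].mean() - empirical(probs[idx], labels[idx]))
    return err

confidence = lambda p: p.max(axis=1)                     # ECE's statistic
accuracy = lambda p, y: (p.argmax(axis=1) == y).mean()

# Perfectly calibrated binary predictions: labels sampled from the stated P(y=1).
rng = np.random.default_rng(4)
p1 = rng.random(20000)
probs = np.stack([1 - p1, p1], axis=1)
labels = (rng.random(20000) < p1).astype(int)

ece = binned_calibration_error(probs, labels, confidence, accuracy)
```

On these by-construction calibrated predictions the binned error is small and shrinks with sample size, which is the sanity property the paper demands of the VCE.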

[304] Drift-Aware Variational Autoencoder-based Anomaly Detection with Two-level Ensembling

Jin Li, Kleanthis Malialis, Christos G. Panayiotou, Marios M. Polycarpou

Main category: cs.LG

TL;DR: VAE++ESDD: An incremental learning method using ensemble of VAEs for anomaly detection and ensemble of statistical drift detectors for handling concept drift in streaming data.

DetailsMotivation: Addressing challenges in streaming data analysis where data is often unlabeled and environments are nonstationary, causing model performance deterioration due to concept drift, especially for anomaly detection tasks with low anomaly rates.

Method: Uses incremental learning with two-level ensembling: ensemble of Variational AutoEncoders for anomaly prediction, and ensemble of statistical-based concept drift detectors to handle changing data distributions.

Result: Significantly outperforms both strong baselines and state-of-the-art methods on real-world and synthetic datasets with severely/extremely low anomaly rates and various drift characteristics.

Conclusion: VAE++ESDD effectively addresses the challenges of anomaly detection in nonstationary streaming environments with concept drift, demonstrating superior performance over existing approaches.

Abstract: In today’s digital world, the generation of vast amounts of streaming data in various domains has become ubiquitous. However, many of these data are unlabeled, making it challenging to identify events, particularly anomalies. This task becomes even more formidable in nonstationary environments where model performance can deteriorate over time due to concept drift. To address these challenges, this paper presents a novel method, VAE++ESDD, which employs incremental learning and two-level ensembling: an ensemble of Variational AutoEncoders (VAEs) for anomaly prediction, along with an ensemble of concept drift detectors. Each drift detector utilizes a statistical-based concept drift mechanism. To evaluate the effectiveness of VAE++ESDD, we conduct a comprehensive experimental study using real-world and synthetic datasets characterized by severely or extremely low anomaly rates and various drift characteristics. Our study reveals that the proposed method significantly outperforms both strong baselines and state-of-the-art methods.
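One member of a statistical drift-detector ensemble can be sketched as below. This is a hedged illustration, not the paper's exact test: it compares a sliding window of recent anomaly scores (e.g. VAE reconstruction errors) against reference statistics, and a large standardised shift flags concept drift. The class name, window size, and threshold are my assumptions.

```python
import numpy as np

# Window-based mean-shift detector on a stream of reconstruction errors.
class WindowShiftDetector:
    def __init__(self, window=50, threshold=4.0):
        self.window, self.threshold = window, threshold
        self.ref_mean = self.ref_std = None
        self.recent = []

    def fit_reference(self, errors):
        self.ref_mean, self.ref_std = np.mean(errors), np.std(errors) + 1e-12

    def update(self, error):
        self.recent.append(error)
        if len(self.recent) < self.window:
            return False
        self.recent = self.recent[-self.window:]
        # standardised shift of the window mean vs the reference distribution
        z = (np.mean(self.recent) - self.ref_mean) / (self.ref_std / np.sqrt(self.window))
        return abs(z) > self.threshold

rng = np.random.default_rng(5)
det = WindowShiftDetector()
det.fit_reference(rng.normal(1.0, 0.1, size=500))            # pre-drift errors

flags = [det.update(e) for e in rng.normal(1.0, 0.1, size=100)]        # stationary
drift_flags = [det.update(e) for e in rng.normal(1.5, 0.1, size=100)]  # after drift
```

An ensemble would run several such detectors (different windows/thresholds) and trigger adaptation of the VAE ensemble when they agree that the score distribution has shifted.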

[305] MAUNet-Light: A Concise MAUNet Architecture for Bias Correction and Downscaling of Precipitation Estimates

Sumanta Chandra Mishra Sharma, Adway Mitra, Auroop Ratan Ganguly

Main category: cs.LG

TL;DR: Lightweight neural network architecture (MAUNet-Light) for bias correction and downscaling of precipitation data using teacher-student learning to reduce computational requirements while maintaining accuracy.

DetailsMotivation: Satellite-derived data and climate models often have systematic biases compared to ground measurements. Traditional bias correction and downscaling methods are being replaced by deep learning approaches, but these often have high computational and memory requirements.

Method: Proposes MAUNet-Light, a compact neural network architecture developed using teacher-student learning paradigm. Knowledge is transferred from trained MAUNet (teacher) to create a lightweight model that performs both downscaling and bias correction with reduced computational needs.

Result: MAUNet-Light achieves reduced computational requirements without significant loss in accuracy compared to state-of-the-art methods, demonstrating effectiveness for both bias correction and spatial downscaling tasks.

Conclusion: The research successfully adapts MAUNet for bias correction and introduces a lightweight alternative that maintains performance while reducing computational burden, making neural network approaches more practical for operational weather forecasting systems.

Abstract: Satellite-derived data products and climate model simulations of geophysical variables like precipitation often exhibit systematic biases compared to in-situ measurements. Bias correction and spatial downscaling are fundamental components in developing operational weather forecast systems, as they seek to improve the consistency between coarse-resolution climate model simulations or satellite-based estimates and ground-based observations. In recent years, deep learning-based models have increasingly replaced traditional statistical methods to generate high-resolution, bias-free projections of climate variables. For example, the Max-Average U-Net (MAUNet) architecture has demonstrated its ability to downscale precipitation estimates. The versatility and adaptability of these neural models make them highly effective across a range of applications, though this often comes at the cost of high computational and memory requirements. The aim of this research is to develop lightweight neural network architectures for both bias correction and downscaling of precipitation, for which the teacher-student learning paradigm is explored. This research demonstrates the adaptability of MAUNet to the task of bias correction, and further introduces a compact, lightweight neural network architecture termed MAUNet-Light. The proposed MAUNet-Light model is developed by transferring knowledge from the trained MAUNet, and it is designed to perform both downscaling and bias correction with reduced computational requirements, without any significant loss in accuracy compared to the state-of-the-art.
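The teacher-student training signal can be sketched generically. The real models are convolutional U-Nets; for brevity both are linear maps here, and the blending weight `alpha` is an assumed hyperparameter: the student loss mixes the ground-truth error with a distillation term matching the trained teacher's output.

```python
import numpy as np

# Generic knowledge-distillation sketch: gradient descent on
# alpha * MSE(student, target) + (1 - alpha) * MSE(student, teacher).
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 8))
W_true = rng.normal(size=(8, 1))
y = X @ W_true
W_teacher = W_true + 0.01 * rng.normal(size=(8, 1))   # a well-trained "teacher"

W_student, lr, alpha = np.zeros((8, 1)), 0.01, 0.5
for _ in range(2000):
    pred = X @ W_student
    # blended gradient of the two MSE terms
    grad = X.T @ (alpha * (pred - y) + (1 - alpha) * (pred - X @ W_teacher)) / len(X)
    W_student -= lr * grad

student_mse = float(np.mean((X @ W_student - y) ** 2))
```

The student converges to a blend of the ground truth and the teacher's function; with a good teacher that blend stays close to the target, which is how a compact model can inherit accuracy at reduced cost.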

[306] Multi-Dimensional Visual Data Recovery: Scale-Aware Tensor Modeling and Accelerated Randomized Computation

Wenjin Qin, Hailin Wang, Jiangjun Peng, Jianjun Wang, Tingwen Huang

Main category: cs.LG

TL;DR: Proposes improved FCTN decomposition methods for multi-dimensional data recovery with nonconvex regularization, quantized observations, and randomized compression for computational efficiency.

DetailsMotivation: Existing FCTN decomposition methods for multi-dimensional data recovery have limitations in computational efficiency and modeling capability, particularly when dealing with large-scale data and quantized observations.

Method: Proposes FCTN-based generalized nonconvex regularization via gradient mapping, develops models for both unquantized and quantized observations, uses ADMM framework with convergence guarantees, and incorporates randomized compression techniques using sketching methods for computational acceleration.

Result: The proposed method shows effectiveness and superiority over state-of-the-art methods in quantitative metrics, visual quality, and running time through extensive numerical experiments.

Conclusion: The developed framework provides efficient, scalable solutions for multi-dimensional data recovery with theoretical guarantees and practical computational advantages.

Abstract: The recently proposed fully-connected tensor network (FCTN) decomposition has demonstrated significant advantages in correlation characterization and transpositional invariance, and has achieved notable achievements in multi-dimensional data processing and analysis. However, existing multi-dimensional data recovery methods leveraging FCTN decomposition still have room for further enhancement, particularly in computational efficiency and modeling capability. To address these issues, we first propose a FCTN-based generalized nonconvex regularization paradigm from the perspective of gradient mapping. Then, reliable and scalable multi-dimensional data recovery models are investigated, where the model formulation is shifted from unquantized observations to coarse-grained quantized observations. Based on the alternating direction method of multipliers (ADMM) framework, we derive efficient optimization algorithms with convergence guarantees to solve the formulated models. To alleviate the computational bottleneck encountered when processing large-scale multi-dimensional data, fast and efficient randomized compression algorithms are devised using sketching techniques from numerical linear algebra. These dimensionality-reduction techniques serve as the computational acceleration core of our proposed algorithm framework. Theoretical results on approximation error upper bounds and convergence analysis for the proposed method are derived. Extensive numerical experiments illustrate the effectiveness and superiority of the proposed algorithm over other state-of-the-art methods in terms of quantitative metrics, visual quality, and running time.
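The sketching idea itself is a standard tool from randomized numerical linear algebra; the sketch below shows a basic randomized range finder of the kind such solvers compress with (the paper applies compressions like this inside its FCTN-based ADMM algorithm, which this toy does not reproduce):

```python
import numpy as np

# Randomized range finder: project a large matrix through a small Gaussian
# test matrix, orthonormalise, and work with the compressed factor Q.T @ A
# instead of A.
rng = np.random.default_rng(7)
A = rng.normal(size=(500, 12)) @ rng.normal(size=(12, 400))   # exact rank 12

k = 20                                        # sketch size > true rank
Omega = rng.normal(size=(400, k))             # Gaussian test matrix
Q, _ = np.linalg.qr(A @ Omega)                # orthonormal basis for range(A)
A_small = Q.T @ A                             # compressed (k x 400) problem

approx_error = np.linalg.norm(A - Q @ A_small) / np.linalg.norm(A)
```

Because the sketch size exceeds the true rank, `Q @ (Q.T @ A)` recovers `A` to machine precision, while downstream computations only ever touch the much smaller `A_small`.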

[307] Prior-Guided Symbolic Regression: Towards Scientific Consistency in Equation Discovery

Jing Xiao, Xinhai Chen, Jiaming Peng, Qinglin Wang, Menghan Jia, Zhiquan Lai, Guangping Yu, Dongsheng Li, Tiejun Li, Jie Liu

Main category: cs.LG

TL;DR: PG-SR is a prior-guided symbolic regression framework that uses domain priors as executable constraints to avoid pseudo-equations that violate scientific principles, employing a three-stage pipeline with prior annealing for better generalization.

DetailsMotivation: Existing symbolic regression methods often produce equations that fit data well but violate fundamental scientific principles (pseudo-equation trap), as they focus on empirical risk minimization without explicit scientific consistency constraints.

Method: Three-stage pipeline: warm-up, evolution, and refinement. Introduces prior constraint checker encoding domain priors as executable programs, and Prior Annealing Constrained Evaluation (PACE) mechanism to progressively steer discovery toward scientifically consistent regions during evolution.

Result: PG-SR outperforms state-of-the-art baselines across diverse domains, maintains robustness to varying prior quality, noisy data, and data scarcity. Theoretically proven to reduce Rademacher complexity and provide generalization bounds.

Conclusion: PG-SR effectively bridges the gap between data fitting and scientific consistency in symbolic regression by incorporating domain priors as explicit constraints, providing theoretical guarantees against pseudo-equations.

Abstract: Symbolic Regression (SR) aims to discover interpretable equations from observational data, with the potential to reveal underlying principles behind natural phenomena. However, existing approaches often fall into the Pseudo-Equation Trap: producing equations that fit observations well but remain inconsistent with fundamental scientific principles. A key reason is that these approaches are dominated by empirical risk minimization, lacking explicit constraints to ensure scientific consistency. To bridge this gap, we propose PG-SR, a prior-guided SR framework built upon a three-stage pipeline consisting of warm-up, evolution, and refinement. Throughout the pipeline, PG-SR introduces a prior constraint checker that explicitly encodes domain priors as executable constraint programs, and employs a Prior Annealing Constrained Evaluation (PACE) mechanism during the evolution stage to progressively steer discovery toward scientifically consistent regions. Theoretically, we prove that PG-SR reduces the Rademacher complexity of the hypothesis space, yielding tighter generalization bounds and establishing a guarantee against pseudo-equations. Experimentally, PG-SR outperforms state-of-the-art baselines across diverse domains, maintaining robustness to varying prior quality, noisy data, and data scarcity.
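"Domain priors as executable constraint programs" can be made concrete with a hypothetical checker. Everything here is invented for illustration (the paper's checker and its PACE annealing schedule are more elaborate): each prior is a function that probes a candidate equation numerically, and a candidate that fits data but violates a prior is rejected as a pseudo-equation.

```python
import numpy as np

# A candidate equation is a callable f(x); priors are executable checks on it.
x = np.linspace(0.1, 5.0, 100)

priors = [
    lambda f: np.all(f(x) > 0),                # quantity must stay positive
    lambda f: np.all(np.diff(f(x)) < 0),       # and decay monotonically
]

def consistent(f):
    return all(p(f) for p in priors)

inverse_square = lambda t: 1.0 / t**2                          # plausible candidate
wiggly_fit = lambda t: 1.0 / t**2 + 0.5 * np.sin(8 * t)        # fits-but-violates

ok_true, ok_pseudo = consistent(inverse_square), consistent(wiggly_fit)
```

The oscillating candidate may fit noisy observations as well as the clean law does, but it dips negative and is non-monotone, so the checker rejects it; steering search away from such regions is what the PACE mechanism anneals toward during evolution.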

[308] Uncertainty in Federated Granger Causality: From Origins to Systemic Consequences

Ayush Mohanty, Nazal Mohamed, Nagi Gebraeel

Main category: cs.LG

TL;DR: Federated Granger Causality with uncertainty quantification: first framework to model uncertainty propagation in federated causal inference, distinguishing aleatoric vs epistemic uncertainty with closed-form recursions.

DetailsMotivation: Existing federated Granger Causality methods only provide deterministic point estimates without uncertainty quantification, limiting reliability and interpretability for distributed time-series applications like smart grids with data sovereignty constraints.

Method: Systematically classifies uncertainty sources (aleatoric vs epistemic), derives closed-form recursions modeling uncertainty evolution through client-server interactions, identifies four novel cross-covariance components coupling data and model uncertainties, and establishes convergence conditions with steady-state variance analysis.

Result: Convergence analysis shows steady-state variances depend only on client data statistics (eliminating dependence on initial priors), empirical evaluations on synthetic and real-world industrial datasets demonstrate improved reliability and interpretability of federated causal inference.

Conclusion: Explicit uncertainty characterization significantly enhances federated Granger Causality frameworks, making them more robust and interpretable for distributed time-series causal inference applications.

Abstract: Granger Causality (GC) provides a rigorous framework for learning causal structures from time-series data. Recent federated variants of GC have targeted distributed infrastructure applications (e.g., smart grids) with distributed clients that generate high-dimensional data bound by data-sovereignty constraints. However, Federated GC algorithms only yield deterministic point estimates of causality and neglect uncertainty. This paper establishes the first methodology for rigorously quantifying uncertainty and its propagation within federated GC frameworks. We systematically classify sources of uncertainty, explicitly differentiating aleatoric (data noise) from epistemic (model variability) effects. We derive closed-form recursions that model the evolution of uncertainty through client-server interactions and identify four novel cross-covariance components that couple data uncertainties with model parameter uncertainties across the federated architecture. We also define rigorous convergence conditions for these uncertainty recursions and obtain explicit steady-state variances for both server and client model parameters. Our convergence analysis demonstrates that steady-state variances depend exclusively on client data statistics, thus eliminating dependence on initial epistemic priors and enhancing robustness. Empirical evaluations on synthetic benchmarks and real-world industrial datasets demonstrate that explicitly characterizing uncertainty significantly improves the reliability and interpretability of federated causal inference.
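The flavour of the steady-state claim above can be checked numerically on a scalar toy recursion (my own construction, far simpler than the paper's coupled client-server recursions): a variance recursion v_{t+1} = a^2 v_t + q contracts when |a| < 1 and converges to q / (1 - a^2) regardless of the initial prior variance.

```python
# Scalar variance recursion: the fixed point depends only on (a, q), i.e. on
# the data statistics, not on the initial epistemic prior v0.
a, q = 0.8, 0.5
steady_state = q / (1 - a**2)

def iterate(v0, steps=200):
    v = v0
    for _ in range(steps):
        v = a**2 * v + q
    return v

from_small_prior = iterate(0.01)
from_large_prior = iterate(100.0)
```

Both trajectories collapse onto the same fixed point, illustrating why eliminating the dependence on initial priors enhances robustness.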

[309] Geometric Manifold Rectification for Imbalanced Learning

Xubin Wang, Qing Li, Weijia Jia

Main category: cs.LG

TL;DR: GMR is a novel undersampling framework for imbalanced tabular data that uses geometric manifold analysis with asymmetric cleaning rules to better preserve minority class samples while removing noisy majority samples.

Motivation: Traditional undersampling methods like ENN use symmetric cleaning rules and uniform voting, which fail to capture local manifold structure and often remove informative minority samples, especially in noisy, imbalanced datasets with overlapping class boundaries.

Method: GMR uses geometric confidence estimation with inverse-distance weighted kNN voting and adaptive distance metrics to capture local reliability, plus asymmetric cleaning that is strict on majority samples while conservatively protecting minority samples via a safeguarding cap on minority removal.

Result: Extensive experiments on multiple benchmark datasets show GMR is competitive with strong sampling baselines for imbalanced classification tasks.

Conclusion: GMR effectively addresses imbalanced classification challenges by leveraging geometric manifold priors and asymmetric cleaning strategies to better preserve the true decision boundary in noisy, overlapping datasets.

Abstract: Imbalanced classification presents a formidable challenge in machine learning, particularly when tabular datasets are plagued by noise and overlapping class boundaries. From a geometric perspective, the core difficulty lies in the topological intrusion of the majority class into the minority manifold, which obscures the true decision boundary. Traditional undersampling techniques, such as Edited Nearest Neighbours (ENN), typically employ symmetric cleaning rules and uniform voting, failing to capture the local manifold structure and often inadvertently removing informative minority samples. In this paper, we propose GMR (Geometric Manifold Rectification), a novel framework designed to robustly handle imbalanced structured data by exploiting local geometric priors. GMR makes two contributions: (1) Geometric confidence estimation that uses inverse-distance weighted kNN voting with an adaptive distance metric to capture local reliability; and (2) asymmetric cleaning that is strict on majority samples while conservatively protecting minority samples via a safeguarding cap on minority removal. Extensive experiments on multiple benchmark datasets show that GMR is competitive with strong sampling baselines.
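The two ingredients named above can be sketched in a few lines: inverse-distance weighted kNN voting for local confidence, and an asymmetric rule that cleans the majority class aggressively while capping minority removal. All thresholds and function names here are illustrative assumptions, and GMR's adaptive distance metric is omitted.

```python
import math

def geometric_confidence(x, neighbors, labels, own_label, eps=1e-9):
    """Inverse-distance weighted kNN vote: the fraction of neighborhood
    weight agreeing with x's own label (a toy stand-in for GMR's
    geometric confidence estimate)."""
    weights = [1.0 / (math.dist(x, n) + eps) for n in neighbors]
    agree = sum(w for w, l in zip(weights, labels) if l == own_label)
    return agree / sum(weights)

def asymmetric_clean(samples, confidences, labels, minority_label,
                     maj_thresh=0.5, min_thresh=0.2, minority_cap=0.1):
    """Strictly remove low-confidence majority samples; remove minority
    samples only below a much lower threshold and only up to a cap."""
    keep, minority_removed = [], 0
    minority_total = sum(1 for l in labels if l == minority_label)
    for s, c, l in zip(samples, confidences, labels):
        if l == minority_label:
            if c < min_thresh and minority_removed < minority_cap * minority_total:
                minority_removed += 1
                continue
        elif c < maj_thresh:
            continue
        keep.append((s, l))
    return keep
```

Note the asymmetry: a majority sample needs moderate confidence to survive, while a minority sample is dropped only when its confidence is very low and the removal budget is not yet exhausted.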

[310] Machine Learning-Based Classification of Jhana Advanced Concentrative Absorption Meditation (ACAM-J) using 7T fMRI

Puneet Kumar, Winson F. Z. Yang, Alakhsimar Singh, Xiaobai Li, Matthew D. Sacchet

Main category: cs.LG

TL;DR: Machine learning classification of advanced meditation states using fMRI regional homogeneity patterns achieves 66.82% accuracy, with prefrontal and anterior cingulate regions being most important for distinguishing meditation from non-meditative states.

Motivation: To investigate whether functional MRI-derived regional homogeneity patterns can be used to classify advanced concentration absorption meditation states using machine learning approaches, providing insights into neural correlates of altered consciousness states.

Method: Collected fMRI data from 20 advanced meditators for training, plus intensive single-case data for evaluation. Computed regional homogeneity maps, extracted features from predefined brain regions, and trained multiple machine learning classifiers using stratified cross-validation to distinguish meditation from control conditions.

Result: Ensemble models achieved 66.82% accuracy (p < 0.05) in distinguishing meditation from control conditions. Feature importance analysis showed prefrontal and anterior cingulate areas contributed most to classification decisions, aligning with their known roles in attentional regulation and metacognition.

Conclusion: Machine learning can feasibly classify advanced meditation states using fMRI patterns, with prefrontal and anterior cingulate regions playing key roles. This supports future research on neuromodulation and mechanistic models of advanced meditation.

Abstract: Jhana advanced concentration absorption meditation (ACAM-J) is related to profound changes in consciousness and cognitive processing, making the study of its neural correlates vital for insights into consciousness and well-being. This study evaluates whether functional MRI-derived regional homogeneity (ReHo) can be used to classify ACAM-J using machine-learning approaches. We collected group-level fMRI data from 20 advanced meditators to train the classifiers, and intensive single-case data from an advanced practitioner performing ACAM-J and control tasks to evaluate generalization. ReHo maps were computed, and features were extracted from predefined brain regions of interest. We trained multiple machine learning classifiers using stratified cross-validation to evaluate whether ReHo patterns distinguish ACAM-J from non-meditative states. Ensemble models achieved 66.82% accuracy (p < 0.05) in distinguishing ACAM-J from control conditions. Feature-importance analysis indicated that prefrontal and anterior cingulate areas contributed most to model decisions, aligning with established involvement of these regions in attentional regulation and metacognitive processes. Moreover, moderate agreement reflected in Cohen’s kappa supports the feasibility of using machine learning to distinguish ACAM-J from non-meditative states. These findings support the feasibility of machine learning for classifying advanced meditation states and motivate future research on neuromodulation and mechanistic models of advanced meditation.

[311] Diverging Flows: Detecting Extrapolations in Conditional Generation

Constantinos Tsakonas, Serena Ivaldi, Jean-Baptiste Mouret

Main category: cs.LG

TL;DR: Diverging Flows enables flow models to detect extrapolation hazards while maintaining predictive performance by enforcing inefficient transport for off-manifold inputs

Motivation: Flow Matching models have extrapolation hazards where they generate plausible outputs even for off-manifold conditions, causing silent failures in safety-critical applications

Method: Introduces Diverging Flows that structurally enforce inefficient transport for off-manifold inputs, enabling a single model to perform both conditional generation and native extrapolation detection

Result: Achieves effective detection of extrapolations without compromising predictive fidelity or inference latency on synthetic manifolds, cross-domain style transfer, and weather temperature forecasting

Conclusion: Diverging Flows provides a robust solution for trustworthy flow models, enabling reliable deployment in safety-critical domains like medicine, robotics, and climate science

Abstract: The ability of Flow Matching (FM) to model complex conditional distributions has established it as the state-of-the-art for prediction tasks (e.g., robotics, weather forecasting). However, deployment in safety-critical settings is hindered by a critical extrapolation hazard: driven by smoothness biases, flow models yield plausible outputs even for off-manifold conditions, resulting in silent failures indistinguishable from valid predictions. In this work, we introduce Diverging Flows, a novel approach that enables a single model to simultaneously perform conditional generation and native extrapolation detection by structurally enforcing inefficient transport for off-manifold inputs. We evaluate our method on synthetic manifolds, cross-domain style transfer, and weather temperature forecasting, demonstrating that it achieves effective detection of extrapolations without compromising predictive fidelity or inference latency. These results establish Diverging Flows as a robust solution for trustworthy flow models, paving the way for reliable deployment in domains such as medicine, robotics, and climate science.

[312] Probabilistic Wind Power Forecasting with Tree-Based Machine Learning and Weather Ensembles

Max Bruninx, Diederik van Binsbergen, Timothy Verstraeten, Ann Nowé, Jan Helsen

Main category: cs.LG

TL;DR: Probabilistic wind power forecasting using gradient boosting trees with weather ensemble data, comparing conformalized quantile regression, natural gradient boosting, and conditional diffusion models.

Motivation: Accurate wind power forecasts are essential for integrating renewable energy into power grids, requiring probabilistic day-ahead predictions to manage uncertainty.

Method: Uses gradient boosting trees with weather forecast ensembles, comparing three probabilistic methods: conformalized quantile regression, natural gradient boosting, and conditional diffusion models. Benchmarked against deterministic engineering methods (power curve and calibrated wake model).

Result: Machine learning methods improved mean absolute error by up to 53% and 33% compared to power curve and calibrated wake model respectively. Conditional diffusion models yielded best probabilistic and point estimates. Weather forecast ensembles improved point forecast accuracy by up to 23%.

Conclusion: Conditional diffusion models combined with gradient boosting trees and weather ensemble data provide superior probabilistic wind power forecasting for grid integration.

Abstract: Accurate production forecasts are essential to continue facilitating the integration of renewable energy sources into the power grid. This paper illustrates how to obtain probabilistic day-ahead forecasts of wind power generation via gradient boosting trees using an ensemble of weather forecasts. To this end, we perform a comparative analysis across three state-of-the-art probabilistic prediction methods (conformalised quantile regression, natural gradient boosting, and conditional diffusion models), all of which can be combined with tree-based machine learning. The methods are validated using four years of data for all wind farms present within the Belgian offshore zone. Additionally, the point forecasts are benchmarked against deterministic engineering methods, using either the power curve or an advanced approach incorporating a calibrated analytical wake model. The experimental results show that the machine learning methods improve the mean absolute error by up to 53% and 33% compared to the power curve and the calibrated wake model. Considering the three probabilistic prediction methods, the conditional diffusion model is found to yield the best overall probabilistic and point estimate of wind power generation. Moreover, the findings suggest that the use of an ensemble of weather forecasts can improve point forecast accuracy by up to 23%.
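The quantile-regression branch of such a comparison rests on the pinball loss, whose empirical minimizer is the requested quantile of the data. A minimal sketch (function names are ours, not from the paper):

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss, the training objective behind quantile
    gradient boosting. Under-prediction is weighted by q, over-prediction
    by (1 - q), so minimizing the mean loss yields the q-th quantile."""
    diff = y_true - y_pred
    return q * diff if diff >= 0 else (q - 1) * diff

def empirical_quantile_score(y_true_list, y_pred, q):
    """Mean pinball loss of a constant prediction over a sample."""
    return sum(pinball_loss(y, y_pred, q) for y in y_true_list) / len(y_true_list)
```

For the 90% quantile of values 1..100, a constant prediction near 90 scores better than one near the median, which is exactly the asymmetry that lets a set of such models trace out a predictive distribution.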

[313] Bus-Conditioned Zero-Shot Trajectory Generation via Task Arithmetic

Shuai Liu, Ning Cao, Yile Chen, Yue Jiang, Gao Cong

Main category: cs.LG

TL;DR: MobTA enables zero-shot mobility trajectory generation for target cities using only source city mobility data and public bus timetables, without needing any real mobility data from the target city.

Motivation: Existing trajectory generation methods require at least some real mobility data from target cities, limiting applicability in data-inaccessible scenarios. The paper addresses the challenge of generating mobility trajectories for cities where no such data is available.

Method: Proposes MobTA, which introduces task arithmetic into trajectory generation. It models the parameter shift from bus-timetable-based trajectory generation to mobility trajectory generation in the source city, then applies this shift to the target city through arithmetic operations on task vectors.

Result: Extensive experiments show MobTA significantly outperforms existing methods and achieves performance close to models finetuned using target city mobility trajectories. The approach also includes theoretical analysis of stability across base and instruction-tuned LLMs.

Conclusion: MobTA enables effective zero-shot trajectory generation using only publicly available bus timetables and source city data, addressing data accessibility challenges in smart city applications.

Abstract: Mobility trajectory data provide essential support for smart city applications. However, such data are often difficult to obtain. Meanwhile, most existing trajectory generation methods implicitly assume that at least a subset of real mobility data from the target city is available, which limits their applicability in data-inaccessible scenarios. In this work, we propose a new problem setting, called bus-conditioned zero-shot trajectory generation, where no mobility trajectories from a target city are accessible. The generation process relies solely on source city mobility data and publicly available bus timetables from both cities. Under this setting, we propose MobTA, the first approach to introduce task arithmetic into trajectory generation. MobTA models the parameter shift from bus-timetable-based trajectory generation to mobility trajectory generation in the source city, and applies this shift to the target city through arithmetic operations on task vectors. This enables trajectory generation that reflects target-city mobility patterns without requiring any real mobility data from it. Furthermore, we theoretically analyze MobTA’s stability across base and instruction-tuned LLMs. Extensive experiments show that MobTA significantly outperforms existing methods, and achieves performance close to models finetuned using target city mobility trajectories.
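The underlying task arithmetic is simple to state. On flat parameter lists (a toy stand-in for real model weights), the MobTA-style transfer reads:

```python
def task_vector(theta_finetuned, theta_base):
    """Task vector: the parameter shift induced by finetuning."""
    return [f - b for f, b in zip(theta_finetuned, theta_base)]

def apply_task_vector(theta_base, vector, scale=1.0):
    """Add a (possibly scaled) task vector to a base parameter list."""
    return [b + scale * v for b, v in zip(theta_base, vector)]

# Toy transfer: the shift from bus-timetable generation to mobility
# generation in the source city, added to the target city's bus model.
src_bus = [0.1, 0.2, 0.3]
src_mob = [0.4, 0.1, 0.5]
tgt_bus = [0.2, 0.3, 0.1]
shift = task_vector(src_mob, src_bus)
tgt_mob = apply_task_vector(tgt_bus, shift)
```

The `scale` factor lets one attenuate or amplify the transferred shift, a common knob in task-arithmetic work.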

[314] Resource-Efficient Gesture Recognition through Convexified Attention

Daniel Schwartz, Dario Salvucci, Yusuf Osmanlioglu, Richard Vallett, Genevieve Dion, Ali Shokoufandeh

Main category: cs.LG

TL;DR: A convexified attention mechanism for wearable e-textile gesture recognition that achieves 100% accuracy with only 120-360 parameters using Euclidean projection onto probability simplex and multi-class hinge loss.

Motivation: Wearable e-textile interfaces need gesture recognition but face severe power, computational, and form factor constraints that make traditional deep learning impractical. Lightweight architectures still require too many parameters for textile-integrated platforms.

Method: Introduces a convexified attention mechanism using Euclidean projection onto the probability simplex (instead of non-convex softmax) combined with multi-class hinge loss, ensuring global convergence. Implemented on a textile-based capacitive sensor with four connection points.

Result: Achieves 100.00% accuracy on both tap and swipe gestures across 10-fold cross-validation and held-out tests. Requires only 120-360 parameters (97% reduction), sub-millisecond inference (290-296μs), and minimal storage (<7KB).

Conclusion: Convex optimization enables efficient on-device machine learning for textile interfaces, demonstrating feasibility for basic gesture interactions. Real-world deployment requires validation across multiple users, environments, and more complex gestures.

Abstract: Wearable e-textile interfaces require gesture recognition capabilities but face severe constraints in power consumption, computational capacity, and form factor that make traditional deep learning impractical. While lightweight architectures like MobileNet improve efficiency, they still demand thousands of parameters, limiting deployment on textile-integrated platforms. We introduce a convexified attention mechanism for wearable applications that dynamically weights features while preserving convexity through nonexpansive simplex projection and convex loss functions. Unlike conventional attention mechanisms using non-convex softmax operations, our approach employs Euclidean projection onto the probability simplex combined with multi-class hinge loss, ensuring global convergence guarantees. Implemented on a textile-based capacitive sensor with four connection points, our approach achieves 100.00% accuracy on tap gestures and 100.00% on swipe gestures, consistent across 10-fold cross-validation and held-out test evaluation, while requiring only 120–360 parameters, a 97% reduction compared to conventional approaches. With sub-millisecond inference times (290–296 μs) and minimal storage requirements (<7 KB), our method enables gesture interfaces directly within e-textiles without external processing. Our evaluation, conducted in controlled laboratory conditions with a single-user dataset, demonstrates feasibility for basic gesture interactions. Real-world deployment would require validation across multiple users, environmental conditions, and more complex gesture vocabularies. These results demonstrate how convex optimization can enable efficient on-device machine learning for textile interfaces.
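The projection at the heart of the mechanism, Euclidean projection onto the probability simplex, has a standard O(n log n) sort-based algorithm, and a Crammer-Singer style multi-class hinge loss supplies the convex training objective. The sketch below shows both pieces; it makes no claim to match the paper's exact implementation:

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1},
    via the classic sort-and-threshold algorithm. This is the convex,
    nonexpansive replacement for softmax normalization."""
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumsum += ui
        t = (cumsum - 1.0) / i
        if ui - t > 0:       # the active prefix determines the threshold
            theta = t
    return [max(x - theta, 0.0) for x in v]

def multiclass_hinge(scores, correct, margin=1.0):
    """Crammer-Singer multi-class hinge loss: penalizes any wrong class
    scoring within `margin` of the correct class."""
    return max(max(margin + s - scores[correct]
                   for j, s in enumerate(scores) if j != correct), 0.0)
```

Unlike softmax, the projection is piecewise linear and idempotent: a vector already on the simplex is returned unchanged, and weights can be exactly zero, which helps with the tiny parameter budgets reported above.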

[315] EXCODER: EXplainable Classification Of DiscretE time series Representations

Yannik Hahn, Antonin Königsfeld, Hasan Tercan, Tobias Meisen

Main category: cs.LG

TL;DR: Discrete latent representations (VQ-VAE/DVAE) enhance time series classification explainability by providing compressed, structured explanations without performance loss, with new SSA metric for validation.

Motivation: Deep learning models for time series classification lack explainability due to high dimensionality and noise in raw data. Current XAI techniques struggle with these challenges, needing more transparent and effective explanations.

Method: Transform time series into discrete latent representations using VQ-VAE and DVAE, then apply XAI methods to these compressed representations. Introduce Similar Subsequence Accuracy (SSA) metric to quantitatively assess alignment between XAI-identified salient subsequences and training data label distribution.

Result: Discrete latent representations preserve classification performance while enabling more concise, structured explanations. SSA provides systematic validation that XAI-highlighted features are representative of learned classification patterns.

Conclusion: Discrete latent representations offer a pathway to more compact, interpretable, and computationally efficient explanations in time series analysis while maintaining classification effectiveness.

Abstract: Deep learning has significantly improved time series classification, yet the lack of explainability in these models remains a major challenge. While Explainable AI (XAI) techniques aim to make model decisions more transparent, their effectiveness is often hindered by the high dimensionality and noise present in raw time series data. In this work, we investigate whether transforming time series into discrete latent representations, using methods such as Vector Quantized Variational Autoencoders (VQ-VAE) and Discrete Variational Autoencoders (DVAE), not only preserves but enhances explainability by reducing redundancy and focusing on the most informative patterns. We show that applying XAI methods to these compressed representations leads to concise and structured explanations that maintain faithfulness without sacrificing classification performance. Additionally, we propose Similar Subsequence Accuracy (SSA), a novel metric that quantitatively assesses the alignment between XAI-identified salient subsequences and the label distribution in the training data. SSA provides a systematic way to validate whether the features highlighted by XAI methods are truly representative of the learned classification patterns. Our findings demonstrate that discrete latent representations not only retain the essential characteristics needed for classification but also offer a pathway to more compact, interpretable, and computationally efficient explanations in time series analysis.
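The shared discretization step is nearest-codebook assignment: a continuous latent sequence becomes a short token sequence, and XAI attributions are then computed over tokens rather than raw samples. A minimal sketch with a made-up codebook (real VQ-VAEs learn the codebook jointly with the encoder):

```python
def quantize(z, codebook):
    """Map a continuous latent vector to the index of its nearest
    codebook entry under squared Euclidean distance (the VQ step)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(z, codebook[k]))

def encode_series(latents, codebook):
    """A series of latent vectors becomes a discrete token sequence,
    the compressed representation the XAI methods operate on."""
    return [quantize(z, codebook) for z in latents]
```

Because each token summarizes a whole latent pattern, a saliency method that highlights token 2, say, points at a reusable motif instead of individual noisy samples.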

[316] TCRL: Temporal-Coupled Adversarial Training for Robust Constrained Reinforcement Learning in Worst-Case Scenarios

Wentao Xu, Zhongming Yao, Weihao Li, Zhenghang Song, Yumeng Song, Tianyi Li, Yushuai Li

Main category: cs.LG

TL;DR: TCRL: A temporal-coupled adversarial training framework for robust constrained reinforcement learning that addresses temporally coupled perturbations in safety-critical domains.

Motivation: Existing robust CRL approaches focus on single-step perturbations and temporally independent adversarial models, lacking explicit modeling of robustness against temporally coupled perturbations, which is crucial for safety-critical applications like autonomous driving and robotics.

Method: Proposes TCRL with two key components: 1) A worst-case-perceived cost constraint function that estimates safety costs under temporally coupled perturbations without explicit adversarial modeling, and 2) A dual-constraint defense mechanism on rewards to counter temporally coupled adversaries while maintaining reward unpredictability.

Result: Experimental results show TCRL consistently outperforms existing methods in robustness against temporally coupled perturbation attacks across various CRL tasks.

Conclusion: TCRL provides an effective framework for robust constrained reinforcement learning against temporally coupled perturbations, addressing limitations of existing approaches and demonstrating superior performance in safety-critical applications.

Abstract: Constrained Reinforcement Learning (CRL) aims to optimize decision-making policies under constraint conditions, making it highly applicable to safety-critical domains such as autonomous driving, robotics, and power grid management. However, existing robust CRL approaches predominantly focus on single-step perturbations and temporally independent adversarial models, lacking explicit modeling of robustness against temporally coupled perturbations. To tackle these challenges, we propose TCRL, a novel temporal-coupled adversarial training framework for robust constrained reinforcement learning in worst-case scenarios. First, TCRL introduces a worst-case-perceived cost constraint function that estimates safety costs under temporally coupled perturbations without the need to explicitly model adversarial attackers. Second, TCRL establishes a dual-constraint defense mechanism on the reward to counter temporally coupled adversaries while maintaining reward unpredictability. Experimental results demonstrate that TCRL consistently outperforms existing methods in terms of robustness against temporally coupled perturbation attacks across a variety of CRL tasks.

[317] GPTZero: Robust Detection of LLM-Generated Texts

George Alexandru Adam, Alexander Cui, Edwin Thomas, Emily Napier, Nazar Shmatko, Jacob Schnell, Jacob Junqi Tian, Alekhya Dronavalli, Edward Tian, Dongwon Lee

Main category: cs.LG

TL;DR: GPTZero is an AI detection system that distinguishes human-authored from AI-generated text using hierarchical multi-task architecture, achieving state-of-the-art accuracy and robustness against adversarial attacks.

Motivation: The rise of LLMs has created new challenges in text authenticity, moving beyond traditional plagiarism to distinguishing human vs AI-generated text. This is crucial for preventing undermining of skill evaluations, mass-production of low-quality content, and proliferation of misinformation.

Method: Introduces a hierarchical, multi-task architecture enabling flexible taxonomy of human and AI texts. Uses multi-tiered automated red teaming for robustness against adversarial attacks and paraphrasing.

Result: Demonstrates state-of-the-art accuracy across various domains with granular predictions. Achieves superior robustness to adversarial attacks and paraphrasing through automated red teaming.

Conclusion: GPTZero provides accurate, explainable detection and educates users on responsible use, ensuring fair and transparent assessment of text authenticity in the age of LLMs.

Abstract: While historical considerations surrounding text authenticity revolved primarily around plagiarism, the advent of large language models (LLMs) has introduced a new challenge: distinguishing human-authored from AI-generated text. This shift raises significant concerns, including the undermining of skill evaluations, the mass-production of low-quality content, and the proliferation of misinformation. Addressing these issues, we introduce GPTZero, a state-of-the-art industrial AI detection solution, offering reliable discernment between human and LLM-generated text. Our key contributions include: introducing a hierarchical, multi-task architecture enabling a flexible taxonomy of human and AI texts, demonstrating state-of-the-art accuracy on a variety of domains with granular predictions, and achieving superior robustness to adversarial attacks and paraphrasing via multi-tiered automated red teaming. GPTZero offers accurate and explainable detection, and educates users on its responsible use, ensuring fair and transparent assessment of text.

[318] Which Algorithms Can Graph Neural Networks Learn?

Solveig Wittig, Antonis Vasileiou, Robert R. Nerem, Timo Stoll, Floris Geerts, Yusu Wang, Christopher Morris

Main category: cs.LG

TL;DR: Theoretical framework for analyzing when message-passing graph neural networks can learn algorithms from small instances and generalize to arbitrary-sized inputs, with applications to graph algorithms and impossibility results for certain tasks.

Motivation: To bridge the gap between empirical work and formal guarantees in neural algorithmic reasoning, specifically understanding when MPNNs can learn algorithms from limited training data and generalize to larger instances.

Method: Develops a theoretical framework characterizing sufficient conditions for MPNNs to learn algorithms from small training instances and generalize to arbitrary-sized inputs. Includes impossibility results for certain tasks and proposes more expressive MPNN-like architectures.

Result: Establishes provable generalization bounds for MPNNs learning algorithms like shortest paths, minimum spanning trees, and dynamic programming problems. Shows limitations of standard MPNNs and proposes enhanced architectures that overcome these limitations.

Conclusion: Provides a rigorous theoretical foundation for neural algorithmic reasoning with MPNNs, offering both positive results for certain algorithms and impossibility results for others, along with architectural improvements for enhanced expressivity.

Abstract: In recent years, there has been growing interest in understanding neural architectures’ ability to learn to execute discrete algorithms, a line of work often referred to as neural algorithmic reasoning. The goal is to integrate algorithmic reasoning capabilities into larger neural pipelines. Many such architectures are based on (message-passing) graph neural networks (MPNNs), owing to their permutation equivariance and ability to deal with sparsity and variable-sized inputs. However, existing work is either largely empirical and lacks formal guarantees or it focuses solely on expressivity, leaving open the question of when and how such architectures generalize beyond a finite training set. In this work, we propose a general theoretical framework that characterizes the sufficient conditions under which MPNNs can learn an algorithm from a training set of small instances and provably approximate its behavior on inputs of arbitrary size. Our framework applies to a broad class of algorithms, including single-source shortest paths, minimum spanning trees, and general dynamic programming problems, such as the $0$-$1$ knapsack problem. In addition, we establish impossibility results for a wide range of algorithmic tasks, showing that standard MPNNs cannot learn them, and we derive more expressive MPNN-like architectures that overcome these limitations. Finally, we refine our analysis for the Bellman-Ford algorithm, yielding a substantially smaller required training set and significantly extending the recent work of Nerem et al. [2025] by allowing for a differentiable regularization loss. Empirical results largely support our theoretical findings.
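Bellman-Ford recurs in this literature because its relaxation step is literally a message-passing round: each node aggregates min(d[u] + w) over incoming edges, then updates its state. A plain-Python rendering of that correspondence (not the paper's neural model):

```python
def bellman_ford_mpnn(n, edges, source):
    """Bellman-Ford phrased as message passing. Each round, every node
    receives messages dist[u] + w along incoming edges, aggregates with
    min, and updates its own distance: the same aggregate/update pattern
    an MPNN layer can implement, which is why the algorithm is a
    canonical target for neural algorithmic reasoning."""
    INF = float("inf")
    dist = [INF] * n
    dist[source] = 0.0
    for _ in range(n - 1):                # one "layer" per round
        msgs = [INF] * n
        for u, v, w in edges:             # messages along edges
            if dist[u] + w < msgs[v]:
                msgs[v] = dist[u] + w
        dist = [min(d, m) for d, m in zip(dist, msgs)]  # node update
    return dist
```

Size generalization in this setting means learning the per-round aggregate/update map from small graphs and running more rounds on larger ones, exactly the regime the paper's framework formalizes.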

[319] Quantization-Aware Collaborative Inference for Large Embodied AI Models

Zhonghao Lyu, Ming Xiao, Mikael Skoglund, Merouane Debbah, H. Vincent Poor

Main category: cs.LG

TL;DR: This paper proposes quantization-aware collaborative inference for embodied AI systems to address computational challenges of large AI models on resource-limited agents, developing distortion approximations and joint optimization of quantization bit-width and computation frequency under delay/energy constraints.

Motivation: Large AI models (LAIMs) are essential for embodied AI applications but face challenges due to massive parameter scale and computational demands on resource-limited embodied agents. The paper aims to address these challenges through efficient quantization and collaborative inference approaches.

Method: 1) Develop tractable approximation for quantization-induced inference distortion; 2) Derive lower and upper bounds on quantization rate-inference distortion function; 3) Formulate joint quantization bit-width and computation frequency design problem under delay and energy constraints; 4) Use the distortion bounds to guide optimization while ensuring solution quality.

Result: Extensive evaluations validate the proposed distortion approximation and rate-distortion bounds. Simulations and real-world testbed experiments demonstrate the effectiveness of the joint design in balancing inference quality, latency, and energy consumption in edge embodied AI systems.

Conclusion: The proposed quantization-aware collaborative inference framework effectively addresses computational challenges of large AI models on resource-limited embodied agents, providing a practical solution for edge embodied AI systems through joint optimization of quantization and computation parameters.

Abstract: Large artificial intelligence models (LAIMs) are increasingly regarded as a core intelligence engine for embodied AI applications. However, the massive parameter scale and computational demands of LAIMs pose significant challenges for resource-limited embodied agents. To address this issue, we investigate quantization-aware collaborative inference (co-inference) for embodied AI systems. First, we develop a tractable approximation for quantization-induced inference distortion. Based on this approximation, we derive lower and upper bounds on the quantization rate-inference distortion function, characterizing its dependence on LAIM statistics, including the quantization bit-width. Next, we formulate a joint quantization bit-width and computation frequency design problem under delay and energy constraints, aiming to minimize the distortion upper bound while ensuring tightness through the corresponding lower bound. Extensive evaluations validate the proposed distortion approximation, the derived rate-distortion bounds, and the effectiveness of the proposed joint design. Particularly, simulations and real-world testbed experiments demonstrate the effectiveness of the proposed joint design in balancing inference quality, latency, and energy consumption in edge embodied AI systems.
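The bit-width/distortion trade-off that the rate-distortion bounds formalize can be previewed with the classic high-rate approximation for a symmetric uniform quantizer: MSE ≈ step²/12, so each extra bit roughly quarters the distortion. A toy scalar sketch (the paper's analysis is over LAIM weight statistics, which this does not model):

```python
def uniform_quantize(x, bits, rng=1.0):
    """Symmetric uniform quantizer over [-rng, rng] with 2**bits levels,
    reconstructing at cell midpoints."""
    levels = 2 ** bits
    step = 2 * rng / levels
    q = round((x + rng) / step - 0.5)
    q = max(0, min(levels - 1, q))        # clamp to the valid range
    return -rng + (q + 0.5) * step

def expected_distortion(bits, rng=1.0):
    """High-rate approximation MSE = step**2 / 12: distortion falls by
    a factor of 4 (about 6 dB) per additional bit."""
    step = 2 * rng / (2 ** bits)
    return step * step / 12.0
```

It is this monotone, explicitly characterizable distortion-vs-bits curve that makes the joint bit-width and computation-frequency optimization tractable under delay and energy constraints.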

[320] Backdoor Attacks on Contrastive Continual Learning for IoT Systems

Alfous Tim, Kuniyilh Simi D

Main category: cs.LG

TL;DR: This paper analyzes security vulnerabilities in contrastive continual learning (CCL) for IoT systems, focusing on how backdoor attacks can exploit embedding alignment and replay reinforcement to implant persistent malicious behaviors that endure through updates.

Motivation: IoT systems increasingly rely on continual learning to adapt to non-stationary environments, but the combination of contrastive representation learning with incremental adaptation introduces new security vulnerabilities, particularly backdoor attacks that can exploit the geometric nature of contrastive objectives.

Method: The paper formalizes embedding-level attack objectives, examines persistence mechanisms unique to IoT deployments, develops a layered taxonomy tailored to IoT, compares vulnerabilities across learning paradigms, and evaluates defense strategies under IoT constraints like limited memory, edge computing, and federated aggregation.

Result: Findings indicate that while CCL is effective for enhancing adaptive IoT intelligence, it may also elevate long-lived representation-level threats if not adequately secured, revealing specific vulnerabilities in embedding alignment and replay reinforcement mechanisms.

Conclusion: CCL introduces new security risks for IoT systems through backdoor attacks that exploit contrastive learning geometry, requiring specialized security measures tailored to IoT constraints to mitigate persistent representation-level threats.

Abstract: Internet of Things (IoT) systems increasingly depend on continual learning to adapt to non-stationary environments. These environments can include factors such as sensor drift, changing user behavior, device aging, and adversarial dynamics. Contrastive continual learning (CCL) combines contrastive representation learning with incremental adaptation, enabling robust feature reuse across tasks and domains. However, the geometric nature of contrastive objectives, when paired with replay-based rehearsal and stability-preserving regularization, introduces new security vulnerabilities. Notably, backdoor attacks can exploit embedding alignment and replay reinforcement, enabling the implantation of persistent malicious behaviors that endure through updates and deployment cycles. This paper provides a comprehensive analysis of backdoor attacks on CCL within IoT systems. We formalize the objectives of embedding-level attacks, examine persistence mechanisms unique to IoT deployments, and develop a layered taxonomy tailored to IoT. Additionally, we compare vulnerabilities across various learning paradigms and evaluate defense strategies under IoT constraints, including limited memory, edge computing, and federated aggregation. Our findings indicate that while CCL is effective for enhancing adaptive IoT intelligence, it may also elevate long-lived representation-level threats if not adequately secured.

[321] Unified Multi-Domain Graph Pre-training for Homogeneous and Heterogeneous Graphs via Domain-Specific Expert Encoding

Chundong Liang, Yongqi Huang, Dongxiao He, Peiyuan Li, Yawen Li, Di Jin, Weixiong Zhang

Main category: cs.LG

TL;DR: GPH²: A unified graph pre-training method that handles both homogeneous and heterogeneous graphs through multi-view construction, domain-specific experts, and adaptive fusion for downstream tasks.

Motivation: Current graph pre-training methods are designed separately for homogeneous or heterogeneous graphs, but real-world applications often involve mixed graph types with distribution shifts between pre-training and deployment. There's a need for unified modeling across diverse graph types.

Method: Proposes GPH² with: 1) Unified Multi-View Graph Construction that encodes both graph types without explicit type-specific designs; 2) Domain-specific expert encoding where each expert is pre-trained on a single graph type to capture domain knowledge; 3) Task-oriented Expert Fusion Strategy that adaptively integrates experts based on discriminative strengths for downstream tasks.
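The expert-fusion step can be sketched in a few lines. This is an illustrative softmax-weighted combination under assumed inputs; the paper's exact weighting by "discriminative strength" may differ:

```python
import numpy as np

def fuse_experts(expert_embs, strengths):
    """Hypothetical task-oriented fusion: each domain expert produces an
    embedding, and the experts are combined with softmax weights derived
    from a per-task discriminative-strength score."""
    s = np.asarray(strengths, dtype=float)
    w = np.exp(s - s.max())          # numerically stable softmax
    w /= w.sum()
    return np.tensordot(w, np.stack(expert_embs), axes=1)
```

With equal strengths this reduces to a plain average; a dominant expert's embedding dominates the fused representation.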

Result: Extensive experiments on mixed graphs show GPH² enables stable transfer across graph types and domains, significantly outperforming existing graph pre-training methods.

Conclusion: A balanced mixture of homogeneous and heterogeneous graph pre-training benefits downstream tasks, and GPH² provides an effective unified framework for handling mixed graph scenarios with distribution shifts.

Abstract: Graph pre-training has achieved remarkable success in recent years, delivering transferable representations for downstream adaptation. However, most existing methods are designed for either homogeneous or heterogeneous graphs, thereby hindering unified graph modeling across diverse graph types. This separation contradicts real-world applications, where mixed homogeneous and heterogeneous graphs are ubiquitous, and distribution shifts between upstream pre-training and downstream deployment are common. In this paper, we empirically demonstrate that a balanced mixture of homogeneous and heterogeneous graph pre-training benefits downstream tasks and propose a unified multi-domain \textbf{G}raph \textbf{P}re-training method across \textbf{H}omogeneous and \textbf{H}eterogeneous graphs ($\mathbf{GPH^{2}}$). To address the lack of a unified encoder for homogeneous and heterogeneous graphs, we propose a Unified Multi-View Graph Construction that simultaneously encodes both without explicit graph-type-specific designs. To cope with the increased cross-domain distribution discrepancies arising from mixed graphs, we introduce domain-specific expert encoding. Each expert is independently pre-trained on a single graph to capture domain-specific knowledge, thereby shielding the pre-training encoder from the adverse effects of cross-domain discrepancies. For downstream tasks, we further design a Task-oriented Expert Fusion Strategy that adaptively integrates multiple experts based on their discriminative strengths. Extensive experiments on mixed graphs demonstrate that $\text{GPH}^{2}$ enables stable transfer across graph types and domains, significantly outperforming existing graph pre-training methods.

[322] R-Diverse: Mitigating Diversity Illusion in Self-Play LLM Training

Gengsheng Li, Jinghan He, Shijie Wang, Dan Zhang, Ruiqi Liu, Renrui Zhang, Zijun Yao, Junfeng Fang, Haiyun Guo, Jinqiao Wang

Main category: cs.LG

TL;DR: R-Diverse addresses the Diversity Illusion problem in self-play LLM reasoning by introducing memory-augmented penalty and skill-aware measurement to sustain improvement across iterations.

Motivation: Existing self-play frameworks for LLM reasoning (like R-Zero) suffer from non-sustained improvement where early gains degrade over time due to Diversity Illusion - where training signals appear diverse but collapse into recurring patterns.

Method: Proposes R-Diverse with two innovations: 1) Memory-Augmented Penalty (MAP) uses a persistent memory bank to discourage question recycling across iterations, and 2) Skill-Aware Measurement (SAM) evaluates diversity based on reasoning skills exercised rather than surface variation of questions.
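A minimal sketch of how a memory-augmented penalty might work, assuming questions are compared via embedding similarity against a persistent bank (function name, threshold, and similarity choice are illustrative, not the paper's exact formulation):

```python
import numpy as np

def map_penalty(question_emb, memory_bank, tau=0.7):
    """Hypothetical Memory-Augmented Penalty: penalize a new question whose
    embedding is too close to any question stored in the persistent memory
    bank from earlier self-play iterations."""
    if len(memory_bank) == 0:
        return 0.0
    q = question_emb / np.linalg.norm(question_emb)
    bank = memory_bank / np.linalg.norm(memory_bank, axis=1, keepdims=True)
    max_sim = float(np.max(bank @ q))   # cosine similarity to closest past question
    return max(0.0, max_sim - tau)      # penalty only above the threshold
</ai>```

Because the bank persists across iterations, the penalty discourages the cross-iteration mode cycling that within-batch diversity checks miss.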

Result: Across 10 math and general reasoning benchmarks, R-Diverse sustains gains over more iterations and consistently outperforms prior self-play methods.

Conclusion: R-Diverse effectively mitigates Diversity Illusion in self-play LLM reasoning, enabling sustained improvement through better diversity management and skill-focused training.

Abstract: Self-play bootstraps LLM reasoning through an iterative Challenger-Solver loop: the Challenger is trained to generate questions that target the Solver’s capabilities, and the Solver is optimized on the generated data to expand its reasoning skills. However, existing frameworks like R-Zero often exhibit non-sustained improvement, where early gains degrade as self-play continues. We identify a key failure mode, Diversity Illusion, where the Solver’s training signals appear diverse yet collapse into recurring underlying patterns. It manifests as (1) Local Diversity Illusion, where diversity is enforced only within-batch, inducing cross-iteration mode cycling; and (2) Surface Diversity Illusion, where questions vary superficially but require near-identical reasoning skills. To mitigate them, we propose R-Diverse with two aligned innovations: Memory-Augmented Penalty (MAP), which uses a persistent memory bank to discourage recycling across iterations, and Skill-Aware Measurement (SAM), which evaluates diversity by the reasoning skills exercised rather than surface variation of questions. Across 10 math and general reasoning benchmarks, R-Diverse sustains gains over more iterations and consistently outperforms prior self-play methods. Code is available at https://github.com/Gengsheng-Li/R-Diverse.

[323] Eventizing Traditionally Opaque Binary Neural Networks as 1-safe Petri net Models

Mohamed Tarraf, Alex Chan, Alex Yakovlev, Rishad Shafik

Main category: cs.LG

TL;DR: A Petri net framework for formal verification and causal analysis of Binary Neural Networks, enabling event-driven modeling of BNN operations for safety-critical applications.

Motivation: Binary Neural Networks (BNNs) offer efficiency but lack transparency due to their discrete, non-linear nature, making them unsuitable for safety-critical domains where causal transparency and behavioral guarantees are essential.

Method: Introduces a Petri net-based framework that models BNN operations as event-driven processes, creating modular PN blueprints for core BNN components (activation, gradient computation, weight updates) and composing them into a complete system-level model.
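A toy illustration of the 1-safe firing rule such a model must respect (data structures hypothetical; the paper builds its nets in Workcraft, not Python):

```python
def fire(marking, pre, post):
    """One step of a 1-safe Petri net: a transition with preset `pre` and
    postset `post` fires only if every input place holds a token and firing
    would not put a second token on any output place."""
    if not all(marking[p] == 1 for p in pre):
        return None                      # transition not enabled
    new = dict(marking)
    for p in pre:
        new[p] = 0                       # consume input tokens
    for p in post:
        if new[p] == 1:
            return None                  # would violate 1-safeness
        new[p] = 1                       # produce output tokens
    return new
```

Checks like deadlock-freeness and mutual exclusion then amount to properties of the reachable markings under this rule.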

Result: The framework enables validation against reference software BNNs, formal verification of reachability, structural properties (1-safeness, deadlock-freeness, mutual exclusion), and causal sequencing, with scalability assessment using Workcraft tools.

Conclusion: The PN framework provides causal introspection and formal verification capabilities for BNNs, making them more transparent and suitable for safety-critical applications through event-driven modeling.

Abstract: Binary Neural Networks (BNNs) offer a low-complexity and energy-efficient alternative to traditional full-precision neural networks by constraining their weights and activations to binary values. However, their discrete, highly non-linear behavior makes them difficult to explain, validate and formally verify. As a result, BNNs remain largely opaque, limiting their suitability in safety-critical domains, where causal transparency and behavioral guarantees are essential. In this work, we introduce a Petri net (PN)-based framework that captures the BNN’s internal operations as event-driven processes. By “eventizing” their operations, we expose their causal relationships and dependencies for a fine-grained analysis of concurrency, ordering, and state evolution. Here, we construct modular PN blueprints for core BNN components including activation, gradient computation and weight updates, and compose them into a complete system-level model. We then validate the composed PN against a reference software-based BNN, verify it against reachability and structural checks to establish 1-safeness, deadlock-freeness, mutual exclusion and correct-by-construction causal sequencing, before we assess its scalability and complexity at segment, component, and system levels using the automated measurement tools in Workcraft. Overall, this framework enables causal introspection of transparent and event-driven BNNs that are amenable to formal reasoning and verification.

[324] Order Matters in Retrosynthesis: Structure-aware Generation via Reaction-Center-Guided Discrete Flow Matching

Chenguang Wang, Zihan Zhou, Lei Bai, Tianshu Yu

Main category: cs.LG

TL;DR: A structure-aware template-free retrosynthesis framework that uses atom ordering as positional inductive bias to improve chemical reaction prediction efficiency and accuracy.

Motivation: Existing template-free methods treat retrosynthesis as black-box sequence generation with limited learning efficiency, while semi-template approaches rely on rigid reaction libraries that constrain generalization. The paper addresses this gap by recognizing that atom ordering in neural representations matters for chemical reasoning.

Method: Proposes a structure-aware template-free framework that encodes the two-stage nature of chemical reactions as positional inductive bias. Places reaction center atoms at sequence head to transform implicit chemical knowledge into explicit positional patterns. Uses RetroDiT backbone (graph transformer with rotary position embeddings) to exploit ordering and prioritize chemically critical regions. Combines with discrete flow matching to decouple training from sampling.
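The positional inductive bias reduces to a simple reordering of atom indices. A hedged sketch (helper name and tie-breaking are illustrative; the paper's actual canonicalization may differ):

```python
def center_first_order(num_atoms, center_atoms):
    """Place reaction-center atoms at the head of the sequence, keeping the
    remaining atoms in their original relative order, so that position
    embeddings consistently mark chemically critical atoms."""
    center = [a for a in center_atoms if a < num_atoms]
    rest = [a for a in range(num_atoms) if a not in set(center)]
    return center + rest
```

With rotary position embeddings, early positions then reliably correspond to the reaction center regardless of molecule size.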

Result: Achieves state-of-the-art performance on USPTO-50k (61.2% top-1) and USPTO-Full (51.3% top-1) with predicted reaction centers. With oracle centers, performance reaches 71.1% and 63.4% respectively, surpassing foundation models trained on 10 billion reactions while using orders of magnitude less data. Generation in 20-50 steps vs 500 for prior diffusion methods.

Conclusion: Structural priors outperform brute-force scaling: a 280K-parameter model with proper ordering matches a 65M-parameter model without it, demonstrating that explicit positional patterns and atom ordering are crucial for efficient chemical reasoning.

Abstract: Template-free retrosynthesis methods treat the task as black-box sequence generation, limiting learning efficiency, while semi-template approaches rely on rigid reaction libraries that constrain generalization. We address this gap with a key insight: atom ordering in neural representations matters. Building on this insight, we propose a structure-aware template-free framework that encodes the two-stage nature of chemical reactions as a positional inductive bias. By placing reaction center atoms at the sequence head, our method transforms implicit chemical knowledge into explicit positional patterns that the model can readily capture. The proposed RetroDiT backbone, a graph transformer with rotary position embeddings, exploits this ordering to prioritize chemically critical regions. Combined with discrete flow matching, our approach decouples training from sampling and enables generation in 20–50 steps versus 500 for prior diffusion methods. Our method achieves state-of-the-art performance on both USPTO-50k (61.2% top-1) and the large-scale USPTO-Full (51.3% top-1) with predicted reaction centers. With oracle centers, performance reaches 71.1% and 63.4% respectively, surpassing foundation models trained on 10 billion reactions while using orders of magnitude less data. Ablation studies further reveal that structural priors outperform brute-force scaling: a 280K-parameter model with proper ordering matches a 65M-parameter model without it.

[325] FlashSchNet: Fast and Accurate Coarse-Grained Neural Network Molecular Dynamics

Pingzhi Li, Hongxuan Li, Zirui Liu, Xingcheng Lin, Tianlong Chen

Main category: cs.LG

TL;DR: FlashSchNet is an IO-aware GNN-MD framework that optimizes GPU memory usage through fused operations and quantization, achieving 6.5x speedup over baseline while maintaining molecular dynamics accuracy.

Motivation: GNN potentials like SchNet improve molecular dynamics simulation accuracy but remain slower than classical force fields due to inefficient GPU memory utilization from fragmented kernels and memory-bound pipelines.

Method: Four techniques: (1) flash radial basis - fuses distance computation, Gaussian expansion, and cosine envelope; (2) flash message passing - fuses cutoff, neighbor gather, filter multiplication, and reduction; (3) flash aggregation - reformulates scatter-add via CSR segment reduce; (4) channel-wise 16-bit quantization of MLP weights.
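The flash-aggregation idea in technique (3) can be illustrated in NumPy: instead of one atomic scatter-add per edge, sort edges by destination node (a CSR layout) and reduce each contiguous segment once. This is a CPU sketch of the concept, not the CUDA kernel:

```python
import numpy as np

def csr_segment_reduce(edge_feats, dst, num_nodes):
    """Aggregate per-edge features into per-node sums via CSR segment
    reduction: sort edges by destination, then reduce each node's
    contiguous segment with a single write per node."""
    order = np.argsort(dst, kind="stable")
    sorted_feats, sorted_dst = edge_feats[order], dst[order]
    # row pointers: where each node's segment of edges begins
    ptr = np.searchsorted(sorted_dst, np.arange(num_nodes + 1))
    out = np.zeros((num_nodes, edge_feats.shape[1]))
    for v in range(num_nodes):
        out[v] = sorted_feats[ptr[v]:ptr[v + 1]].sum(axis=0)
    return out
```

The result matches an atomic scatter-add, but each output row is written once, which is what makes the GPU version contention-free.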

Result: Achieves 1000 ns/day aggregate simulation throughput over 64 parallel replicas on a coarse-grained protein (6.5x faster than baseline with 80% memory reduction), surpassing classical force fields while retaining SchNet-level accuracy.

Conclusion: Making GNN-MD IO-aware through fused operations and memory optimization enables significant speed improvements while maintaining accuracy, making GNN potentials more practical for molecular dynamics simulation.

Abstract: Graph neural network (GNN) potentials such as SchNet improve the accuracy and transferability of molecular dynamics (MD) simulation by learning many-body interactions, but remain slower than classical force fields due to fragmented kernels and memory-bound pipelines that underutilize GPUs. We show that a missing principle is making GNN-MD IO-aware, carefully accounting for reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM. We present FlashSchNet, an efficient and accurate IO-aware SchNet-style GNN-MD framework built on four techniques: (1) flash radial basis, which fuses pairwise distance computation, Gaussian basis expansion, and cosine envelope into a single tiled pass, computing each distance once and reusing it across all basis functions; (2) flash message passing, which fuses cutoff, neighbor gather, filter multiplication, and reduction to avoid materializing edge tensors in HBM; (3) flash aggregation, which reformulates scatter-add via CSR segment reduce, reducing atomic writes by a factor of feature dimension and enabling contention-free accumulation in both forward and backward passes; (4) channel-wise 16-bit quantization that exploits the low per-channel dynamic range in SchNet MLP weights to further improve throughput with negligible accuracy loss. On a single NVIDIA RTX PRO 6000, FlashSchNet achieves 1000 ns/day aggregate simulation throughput over 64 parallel replicas on a coarse-grained (CG) protein containing 269 beads (6.5x faster than the CGSchNet baseline with 80% reduction of peak memory), surpassing classical force fields (e.g. MARTINI) while retaining SchNet-level accuracy and transferability.

[326] Learning to Approximate Uniform Facility Location via Graph Neural Networks

Chendi Qian, Christopher Morris, Stefanie Jegelka, Christian Sohler

Main category: cs.LG

TL;DR: A differentiable message-passing neural network approach for Uniform Facility Location that combines approximation algorithm principles with neural networks, achieving provable guarantees and outperforming classical methods.

Motivation: Bridging the gap between learning-based methods (which adapt to data but lack guarantees) and classical approximation algorithms (which have guarantees but are non-differentiable and can't exploit data patterns) for combinatorial optimization problems.

Method: Develops a fully differentiable message-passing neural network (MPNN) model that embeds approximation-algorithmic principles for Uniform Facility Location, avoiding solver supervision or discrete relaxations.
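For concreteness, the UniFL objective the model approximates can be stated in a few lines (1-D toy instance with a uniform opening cost f; the function is illustrative, not the paper's code):

```python
def unifl_cost(clients, open_facilities, f):
    """Uniform Facility Location objective: a uniform opening cost f per
    open facility, plus each client's distance to its nearest open one."""
    opening = f * len(open_facilities)
    service = sum(min(abs(c - q) for q in open_facilities) for c in clients)
    return opening + service
```

The tension the MPNN must learn is visible even here: opening more facilities raises the opening term but shrinks service distances.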

Result: The approach admits provable approximation and size generalization guarantees, outperforms standard non-learned approximation algorithms in solution quality, and closes the gap with computationally intensive integer linear programming approaches.

Conclusion: Provides a step toward bridging learning-based methods and approximation algorithms for discrete optimization by creating differentiable models with theoretical guarantees.

Abstract: There has been a growing interest in using neural networks, especially message-passing neural networks (MPNNs), to solve hard combinatorial optimization problems heuristically. However, existing learning-based approaches for hard combinatorial optimization tasks often rely on supervised training data, reinforcement learning, or gradient estimators, leading to significant computational overhead, unstable training, or a lack of provable performance guarantees. In contrast, classical approximation algorithms offer such performance guarantees under worst-case inputs but are non-differentiable and unable to adaptively exploit structural regularities in natural input distributions. We address this dichotomy with the fundamental example of Uniform Facility Location (UniFL), a variant of the combinatorial facility location problem with applications in clustering, data summarization, logistics, and supply chain design. We develop a fully differentiable MPNN model that embeds approximation-algorithmic principles while avoiding the need for solver supervision or discrete relaxations. Our approach admits provable approximation and size generalization guarantees to much larger instances than seen during training. Empirically, we show that our approach outperforms standard non-learned approximation algorithms in terms of solution quality, closing the gap with computationally intensive integer linear programming approaches. Overall, this work provides a step toward bridging learning-based methods and approximation algorithms for discrete optimization.

[327] Learning functional components of PDEs from data using neural networks

Torkel E. Loman, Yurij Salmaniw, Antonio Leon Villares, Jose A. Carrillo, Ruth E. Baker

Main category: cs.LG

TL;DR: Neural networks embedded in PDEs can recover unknown functions from data, demonstrated on nonlocal aggregation-diffusion equations to recover interaction kernels and external potentials.

Motivation: Partial differential equations often contain unknown functions that are difficult to measure directly, limiting predictive capabilities. While workflows for recovering scalar PDE parameters are established, methods for recovering entire functions from data need development.

Method: Embed neural networks directly into PDEs and train them on data to approximate unknown functions. Use nonlocal aggregation-diffusion equations as a case study to recover interaction kernels and external potentials from steady state data.
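A minimal 1-D toy of this workflow, assuming the simplest possible PDE u'' + f(x) = 0 rather than the paper's nonlocal aggregation-diffusion setting: the observed steady state pins down the unknown forcing via the residual, and a tiny network is fitted to play the role of f.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 101)
u = np.sin(np.pi * x)                                # observed steady state
h = x[1] - x[0]
u_xx = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / h ** 2     # finite-difference u''
xi, target = x[1:-1], -u_xx                          # residual u'' + f = 0  =>  f = -u''

# tiny 1-hidden-layer tanh network standing in for the unknown function f
W1, b1 = rng.normal(size=(16, 1)), np.zeros(16)
w2, b2 = 0.1 * rng.normal(size=16), 0.0

def f_net(xs):
    hidden = np.tanh(xs[:, None] * W1.T + b1)
    return hidden @ w2 + b2

for _ in range(2000):                                # plain gradient descent on residual MSE
    hidden = np.tanh(xi[:, None] * W1.T + b1)
    err = hidden @ w2 + b2 - target
    grad_w2 = hidden.T @ err / len(xi)
    grad_b2 = err.mean()
    grad_hidden = np.outer(err, w2) * (1.0 - hidden ** 2)
    grad_W1 = grad_hidden.T @ xi[:, None] / len(xi)
    grad_b1 = grad_hidden.mean(axis=0)
    W1 -= 0.05 * grad_W1; b1 -= 0.05 * grad_b1
    w2 -= 0.05 * grad_w2; b2 -= 0.05 * grad_b2

mse = float(np.mean((f_net(xi) - target) ** 2))      # residual after training
```

Once trained, f_net is just another term in the PDE, so the fitted equation can be simulated like any other, which is the advantage the paper emphasizes.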

Result: The approach successfully recovers unknown functions with arbitrary accuracy. The study investigates how factors like number of available solutions, their properties, sampling density, and measurement noise affect recovery success.

Conclusion: The method enables recovery of unknown functions in PDEs using standard parameter-fitting workflows, and the trained PDE can be used normally for system predictions.

Abstract: Partial differential equations often contain unknown functions that are difficult or impossible to measure directly, hampering our ability to derive predictions from the model. Workflows for recovering scalar PDE parameters from data are well studied: here we show how similar workflows can be used to recover functions from data. Specifically, we embed neural networks into the PDE and show how, as they are trained on data, they can approximate unknown functions with arbitrary accuracy. Using nonlocal aggregation-diffusion equations as a case study, we recover interaction kernels and external potentials from steady state data. In particular, we investigate how a wide range of factors, such as the number of available solutions, their properties, sampling density, and measurement noise, affect our ability to successfully recover functions. Our approach is advantageous because it can utilise standard parameter-fitting workflows, and because the trained PDE can be treated as a normal PDE for purposes such as generating system predictions.

[328] R-Zero: Self-Evolving Reasoning LLM from Zero Data

Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, Dong Yu

Main category: cs.LG

TL;DR: R-Zero is a fully autonomous framework for self-evolving LLMs that generates its own training data from scratch through adversarial interaction between Challenger and Solver models.

Motivation: Existing self-evolving LLM methods still rely heavily on human-curated tasks and labels, creating a bottleneck for advancing AI systems beyond human intelligence. The authors aim to create a fully autonomous framework that can generate its own training data without any pre-existing tasks or labels.

Method: R-Zero starts from a single base LLM and initializes two independent models: a Challenger and a Solver. These models co-evolve through interaction - the Challenger is rewarded for proposing tasks near the edge of the Solver’s capability, while the Solver is rewarded for solving increasingly challenging tasks. This creates a self-improving curriculum without external data.
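The Challenger's incentive to target "the edge of the Solver's capability" is often shaped as an uncertainty reward that peaks when the Solver succeeds about half the time. A hedged sketch of one such shaping (the paper's exact reward may differ):

```python
def challenger_reward(solver_success_rate):
    """Uncertainty-style reward: maximal when the Solver answers a proposed
    question correctly about half the time, zero when the question is
    trivially easy or impossibly hard for the current Solver."""
    return 1.0 - 2.0 * abs(solver_success_rate - 0.5)
```

Questions the Solver always solves (rate 1.0) or never solves (rate 0.0) earn the Challenger nothing, which keeps the curriculum at the frontier.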

Result: R-Zero substantially improves reasoning capability across different backbone LLMs. For example, it boosts Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.

Conclusion: R-Zero demonstrates that LLMs can autonomously generate their own training data and improve through self-play, offering a scalable path toward super-intelligence without reliance on human-curated data.

Abstract: Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver's capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks or labels. Empirically, R-Zero substantially improves reasoning capability across different backbone LLMs, e.g., boosting Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.

[329] Don’t Walk the Line: Boundary Guidance for Filtered Generation

Sarah Ball, Andreas Haupt

Main category: cs.LG

TL;DR: Boundary Guidance is a reinforcement learning fine-tuning method that steers generative models away from classifier decision boundaries to improve both safety and utility of outputs.

Motivation: Current safety approaches that fine-tune generators to reduce classifier filtering probability often push models toward classifier decision boundaries, increasing false positives and false negatives, compromising both safety and utility.

Method: Proposes Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from classifier decision margins rather than just reducing filtering probability.
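One way to picture the idea is a reward shaped to penalize outputs whose classifier score sits inside a band around the decision boundary. This is an illustrative sketch under assumed names and a linear hinge, not the paper's actual reward:

```python
def boundary_guided_reward(task_reward, p_flag, margin=0.2):
    """Subtract a penalty when the safety classifier's flag probability
    p_flag falls within `margin` of its 0.5 decision boundary, steering
    generation away from the margin rather than merely below it."""
    dist = abs(p_flag - 0.5)
    penalty = max(0.0, 1.0 - dist / margin)   # linear hinge inside the band
    return task_reward - penalty
```

Outputs confidently far from the boundary (on either side) incur no penalty, so the generator is pushed away from the ambiguous region where the classifier's false positives and false negatives concentrate.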

Result: On benchmarks of jailbreak, ambiguous, and long-context prompts, Boundary Guidance improves both safety and utility of outputs as judged by LLM-as-a-Judge evaluations, with robustness demonstrated across model scales and reward designs.

Conclusion: Boundary Guidance provides an effective approach to safety fine-tuning that avoids the pitfalls of pushing models toward classifier decision boundaries, achieving better safety-utility tradeoffs.

Abstract: Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier’s decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier’s margin. On a benchmark of jailbreak, ambiguous, and long-context prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.

[330] Diffusion-Pretrained Dense and Contextual Embeddings

Sedigheh Eslami, Maksim Gaiduk, Markus Krimmel, Louis Milliken, Bo Wang, Denis Bykov

Main category: cs.LG

TL;DR: pplx-embed is a family of multilingual embedding models using diffusion-pretrained language models with multi-stage contrastive learning for web-scale retrieval, featuring two variants for standard and contextualized embeddings.

Motivation: To create effective multilingual embedding models for large-scale web retrieval that can capture comprehensive bidirectional context within passages and preserve global context across long documents, addressing limitations in existing retrieval systems.

Method: Uses diffusion-pretrained language model backbone with multi-stage contrastive learning, bidirectional attention through diffusion-based pretraining, mean pooling, and late chunking strategy to preserve global context across long documents.
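Late chunking itself is simple to sketch: encode the whole document once so every token embedding already carries global context, then mean-pool tokens per chunk span (rather than encoding each chunk in isolation). A minimal illustration with assumed inputs:

```python
import numpy as np

def late_chunk(token_embeddings, chunk_spans):
    """Late chunking: given contextual token embeddings from a single
    full-document encoding pass, produce one vector per chunk by
    mean-pooling the token embeddings inside each (start, end) span."""
    return np.stack([token_embeddings[s:e].mean(axis=0) for s, e in chunk_spans])
```

The key contrast with naive chunking is where the encoder runs: here each chunk vector reflects document-wide context because pooling happens after, not before, encoding.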

Result: pplx-embed-v1 achieves competitive performance on MTEB (Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet benchmarks, while pplx-embed-context-v1 sets new records on ConTEB benchmark. Strong performance on internal evaluation with 1B production web pages.

Conclusion: The models demonstrate effectiveness in production environments where retrieval quality and efficiency are critical at scale, validating the approach of using diffusion-pretrained backbones with contrastive learning for multilingual web-scale retrieval.

Abstract: In this report, we introduce pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval. By leveraging bidirectional attention through diffusion-based pretraining, our models capture comprehensive bidirectional context within passages, enabling the use of mean pooling and a late chunking strategy to better preserve global context across long documents. We release two model types: pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark. Beyond public benchmarks, pplx-embed-v1 demonstrates strong performance on our internal evaluation suite, focusing on real-world, large-scale search scenarios constructed from 1B production web pages. These results validate the models’ effectiveness in production environments where retrieval quality and efficiency are critical at scale.

[331] Compressible Dynamics in Deep Overparameterized Low-Rank Learning & Adaptation

Can Yaras, Peng Wang, Laura Balzano, Qing Qu

Main category: cs.LG

TL;DR: Deep LoRA improves overparameterized models by leveraging low-dimensional structures in data and model dynamics, enabling efficient training while maintaining benefits of overparameterization for matrix completion and language model fine-tuning.

Motivation: Overparameterization provides optimization and generalization benefits but increases computational costs. The paper aims to achieve these benefits without the computational burden by exploiting inherent low-dimensional structures in data and compressible dynamics in model parameters.

Method: Theoretical analysis shows learning dynamics of weight matrices are confined to low-dimensional subspaces. Based on this, the paper constructs compact factorizations that maintain overparameterization benefits. For language models, proposes “Deep LoRA” which improves existing LoRA technique with reduced overfitting and simplified hyperparameters.
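The structural idea behind Deep LoRA can be sketched as a deeper factorization of the weight update than standard LoRA's two factors, with the inner dimension kept small. Factor shapes here are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def deep_lora_delta(factors):
    """Compose a weight update as a product of low-rank factors,
    e.g. Delta_W = W3 @ W2 @ W1 instead of LoRA's B @ A; the rank of the
    product is still bounded by the smallest inner dimension."""
    delta = factors[0]
    for f in factors[1:]:
        delta = delta @ f
    return delta

d_out, d_in, r = 8, 8, 2
rng = np.random.default_rng(0)
W3 = rng.normal(size=(d_out, r))   # tall factor
W2 = rng.normal(size=(r, r))       # extra depth from overparameterization
W1 = rng.normal(size=(r, d_in))    # wide factor
delta = deep_lora_delta([W3, W2, W1])
```

Extra depth adds optimization benefits while the parameter count stays close to plain LoRA's, since the middle factor is only r x r.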

Result: For deep matrix completion, the approach substantially improves training efficiency while retaining overparameterization advantages. For language model fine-tuning, Deep LoRA outperforms standard LoRA, especially with limited data, reducing overfitting and simplifying hyperparameter setup while maintaining efficiency.

Conclusion: The work demonstrates that by leveraging low-dimensional structures in data and model dynamics, it’s possible to achieve benefits of overparameterization without computational costs, with practical applications in matrix completion and language model fine-tuning via improved LoRA techniques.

Abstract: While overparameterization in machine learning models offers great benefits in terms of optimization and generalization, it also leads to increased computational requirements as model sizes grow. In this work, we show that by leveraging the inherent low-dimensional structures of data and compressible dynamics within the model parameters, we can reap the benefits of overparameterization without the computational burdens. In practice, we demonstrate the effectiveness of this approach for deep low-rank matrix completion as well as fine-tuning language models. Our approach is grounded in theoretical findings for deep overparameterized low-rank matrix recovery, where we show that the learning dynamics of each weight matrix are confined to an invariant low-dimensional subspace. Consequently, we can construct and train compact, highly compressed factorizations possessing the same benefits as their overparameterized counterparts. In the context of deep matrix completion, our technique substantially improves training efficiency while retaining the advantages of overparameterization. For language model fine-tuning, we propose a method called “Deep LoRA”, which improves the existing low-rank adaptation (LoRA) technique, leading to reduced overfitting and a simplified hyperparameter setup, while maintaining comparable efficiency. We validate the effectiveness of Deep LoRA on natural language tasks, particularly when fine-tuning with limited data. Our code is available at https://github.com/cjyaras/deep-lora-transformers.

[332] LTSM-Bundle: A Toolbox and Benchmark on Large Language Models for Time Series Forecasting

Yu-Neng Chuang, Songchen Li, Jiayi Yuan, Guanchu Wang, Kwei-Herng Lai, Joshua Han, Zihang Xu, Songyuan Sui, Leisheng Yu, Sirui Ding, Chia-Yuan Chang, Alfredo Costilla Reyes, Daochen Zha, Xia Hu

Main category: cs.LG

TL;DR: LTSM-Bundle is a comprehensive toolbox and benchmark for training Large Time Series Models (LTSMs) that systematically evaluates various design choices and combines the most effective ones to achieve state-of-the-art performance in time series forecasting.

Motivation: Training Large Time Series Models (LTSMs) on heterogeneous time series data faces challenges due to diverse frequencies, dimensions, and patterns across datasets. Existing design choices for enhancing LTSM training and generalization are typically studied in isolation without collective benchmarking.

Method: Introduces LTSM-Bundle, a comprehensive toolbox and benchmark that modularizes and benchmarks LTSMs across multiple dimensions including prompting strategies, tokenization approaches, training paradigms, base model selection, data quantity, and dataset diversity.

Result: The combination of most effective design choices identified in the study achieves superior zero-shot and few-shot performances compared to state-of-the-art LTSMs and traditional time series forecasting methods on benchmark datasets.

Conclusion: LTSM-Bundle provides a systematic framework for evaluating and combining design choices in LTSM training, leading to improved performance in time series forecasting through better generalization capabilities.

Abstract: Time Series Forecasting (TSF) has long been a challenge in time series analysis. Inspired by the success of Large Language Models (LLMs), researchers are now developing Large Time Series Models (LTSMs), universal transformer-based models that use autoregressive prediction, to improve TSF. However, training LTSMs on heterogeneous time series data poses unique challenges, including diverse frequencies, dimensions, and patterns across datasets. Recent endeavors have studied and evaluated various design choices aimed at enhancing LTSM training and generalization capabilities. However, these design choices are typically studied and evaluated in isolation and are not benchmarked collectively. In this work, we introduce LTSM-Bundle, a comprehensive toolbox and benchmark for training LTSMs, spanning pre-processing techniques, model configurations, and dataset configuration. It modularizes and benchmarks LTSMs along multiple dimensions, encompassing prompting strategies, tokenization approaches, training paradigms, base model selection, data quantity, and dataset diversity. Furthermore, we combine the most effective design choices identified in our study. Empirical results demonstrate that this combination achieves superior zero-shot and few-shot performances compared to state-of-the-art LTSMs and traditional TSF methods on benchmark datasets.

[333] Pixel-Based Similarities as an Alternative to Neural Data for Improving Convolutional Neural Network Adversarial Robustness

Elie Attias, Cengiz Pehlevan, Dina Obeid

Main category: cs.LG

TL;DR: A data-driven variant of brain-inspired regularization that replaces neural recording-based similarity with pixel-based similarity from images, maintaining robustness benefits without requiring neural measurements.

Motivation: CNNs are vulnerable to adversarial attacks, and while brain-inspired regularizers from neural recordings can improve robustness, they require specialized data that limits practical adoption. The authors aim to create a more accessible alternative.

Method: Revisits Li et al.’s (2019) neural representational similarity regularizer and replaces neural recording-based similarity with pixel-based similarity computed directly from images, keeping the same biologically motivated loss formulation.

Result: The data-driven variant provides the same robustness improvements as the neural data version, is lightweight, easily integrates into standard pipelines, and demonstrates that neural representational insights can be leveraged without direct recordings.

Conclusion: Brain-inspired principles can yield robust yet simple methods without specialized data, suggesting further integration of these insights could push performance closer to human levels without complex specialized pipelines.

Abstract: Convolutional Neural Networks (CNNs) excel in many visual tasks but remain susceptible to adversarial attacks-imperceptible perturbations that degrade performance. Prior research reveals that brain-inspired regularizers, derived from neural recordings, can bolster CNN robustness; however, reliance on specialized data limits practical adoption. We revisit a regularizer proposed by Li et al. (2019) that aligns CNN representations with neural representational similarity structures and introduce a data-driven variant. Instead of a neural recording-based similarity, our method computes a pixel-based similarity directly from images. This substitution retains the original biologically motivated loss formulation, preserving its robustness benefits while removing the need for neural measurements or task-specific augmentations. Notably, this data-driven variant provides the same robustness improvements observed with neural data. Our approach is lightweight and integrates easily into standard pipelines. Although we do not surpass cutting-edge specialized defenses, we show that neural representational insights can be leveraged without direct recordings. This underscores the promise of robust yet simple methods rooted in brain-inspired principles, even without specialized data, and raises the possibility that further integrating these insights could push performance closer to human levels without resorting to complex, specialized pipelines.
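The substitution at the heart of the method, matching the network's representational similarity structure to a similarity matrix computed from pixels instead of neural recordings, can be sketched in a few lines. This is a toy stand-in, not the paper's implementation: cosine similarity and random arrays replace whatever similarity measure and CNN features the authors actually use.

```python
import numpy as np

rng = np.random.default_rng(1)
n, pixels, feat = 8, 64, 16             # batch of 8 toy "images"

images = rng.normal(size=(n, pixels))   # flattened images
reps = rng.normal(size=(n, feat))       # CNN representations (stand-in)

def similarity_matrix(X):
    """Cosine-similarity matrix between rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

S_pix = similarity_matrix(images)   # target: pixel-based similarity
S_rep = similarity_matrix(reps)     # model: representational similarity

# Regularizer: push representational similarities toward pixel
# similarities (replacing the neural-recording-based similarity
# of Li et al. 2019 with one computed directly from the images).
reg_loss = np.mean((S_rep - S_pix) ** 2)

assert S_pix.shape == (n, n)
assert np.allclose(np.diag(S_pix), 1.0)
assert reg_loss >= 0.0
```

In training, `reg_loss` would be added to the task loss with a weighting coefficient, exactly as in the original neural-data formulation.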

[334] Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning

Can Yaras, Siyi Chen, Peng Wang, Qing Qu

Main category: cs.LG

TL;DR: Theoretical analysis of modality gap emergence in multimodal models like CLIP, identifying mismatched data pairs and learnable temperature as key causes, with practical guidance for mitigation.

Motivation: Despite the success of multimodal models like CLIP in bridging different modalities, the underlying working mechanisms are not well understood. These models often exhibit a "modality gap" where different modalities occupy distinct regions in the shared representation space, which limits their effectiveness.

Method: Conducts in-depth analysis of modality gap emergence by characterizing gradient flow learning dynamics. Identifies critical roles of mismatched data pairs and learnable temperature parameter in causing and perpetuating the modality gap during training. Theoretical insights are validated through experiments on practical CLIP models.

Result: Theoretical analysis reveals how modality gap emerges and persists during training. Provides principled guidance for mitigating modality gap including appropriate temperature scheduling and modality swapping strategies. Demonstrates that closing the modality gap leads to improved performance on tasks like image-text retrieval.

Conclusion: Understanding and addressing the modality gap is crucial for improving multimodal learning systems. The proposed strategies for mitigating this gap can enhance model performance and provide better alignment between different modalities in shared representation spaces.

Abstract: Multimodal learning has recently gained significant popularity, demonstrating impressive performance across various zero-shot classification tasks and a range of perceptive and generative applications. Models such as Contrastive Language-Image Pretraining (CLIP) are designed to bridge different modalities, such as images and text, by learning a shared representation space through contrastive learning. Despite their success, the working mechanisms underlying multimodal learning are not yet well understood. Notably, these models often exhibit a modality gap, where different modalities occupy distinct regions within the shared representation space. In this work, we conduct an in-depth analysis of the emergence of the modality gap by characterizing the gradient flow learning dynamics. Specifically, we identify the critical roles of mismatched data pairs and a learnable temperature parameter in causing and perpetuating the modality gap during training. Furthermore, our theoretical insights are validated through experiments on practical CLIP models. These findings provide principled guidance for mitigating the modality gap, including strategies such as appropriate temperature scheduling and modality swapping. Additionally, we demonstrate that closing the modality gap leads to improved performance on tasks such as image-text retrieval.
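The learnable temperature the paper analyzes enters the CLIP objective as a scale on the image-text similarity logits. The sketch below is a minimal NumPy rendition of that symmetric contrastive loss with a capped temperature (a crude stand-in for the paper's temperature scheduling); the embedding sizes and the cap value are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def clip_loss(img_emb, txt_emb, log_temp):
    """Symmetric InfoNCE loss over matched image/text pairs."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # exp(log_temp) is the learnable inverse temperature CLIP trains
    logits = (img @ txt.T) * np.exp(log_temp)
    n = len(logits)

    def xent(l):
        # cross-entropy with the matched pair (diagonal) as the target
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(logp[np.arange(n), np.arange(n)])

    return 0.5 * (xent(logits) + xent(logits.T))  # both retrieval directions

rng = np.random.default_rng(2)
img = rng.normal(size=(4, 8))
txt = img + 0.05 * rng.normal(size=(4, 8))  # nearly matched pairs

# Temperature scheduling in miniature: cap the inverse temperature
# rather than letting the learnable parameter grow unboundedly.
log_temp = min(np.log(100.0), 2.0)
loss = clip_loss(img, txt, log_temp)
assert np.isfinite(loss) and loss >= 0.0
```

Because the loss only depends on embeddings through scaled cosine similarities, an overly large `exp(log_temp)` lets the model satisfy the objective while the two modalities remain in separate cones, which is one mechanism behind the persistent gap.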

[335] Data-Driven Worker Activity Recognition and Efficiency Estimation in Manual Fruit Harvesting

Uddhav Bhattarai, Rajkishan Arikapudi, Steven A. Fennimore, Frank N Martin, Stavros G. Vougioukas

Main category: cs.LG

TL;DR: CNN-LSTM model for classifying strawberry picker activities as “Pick” vs “NoPick” using instrumented carts with weight, location, and movement sensors to calculate picker efficiency.

Motivation: Manual fruit harvesting is inefficient due to non-productive activities; accurately identifying picking vs non-picking activities is crucial for estimating picker efficiency and optimizing labor management.

Method: Developed instrumented picking carts (iCarritos) to record harvested fruit weight, geolocation, and cart movement in real time. Used collected data to train a CNN-LSTM deep neural network to classify picker activity into “Pick” and “NoPick” classes.

Result: CNN-LSTM model achieved F1 score of 0.97 for activity recognition. Average picker efficiency was 75.07% with 97.23% accuracy, and average tray fill time was 6.85 minutes with 96.78% accuracy.

Conclusion: The proposed technology can help growers monitor worker activity, optimize harvests, reduce non-productive time, and enhance overall harvest efficiency when integrated into commercial harvesting.

Abstract: Manual fruit harvesting is common in agriculture, but the amount of time pickers spend on non-productive activities can make it very inefficient. Accurately identifying picking vs. non-picking activity is crucial for estimating picker efficiency and optimising labour management and harvest processes. In this study, a practical system was developed to calculate the efficiency of pickers in commercial strawberry harvesting. Instrumented picking carts (iCarritos) were developed to record the harvested fruit weight, geolocation, and iCarrito movement in real time. The iCarritos were deployed during the commercial strawberry harvest season in Santa Maria, CA. The collected data was then used to train a CNN-LSTM-based deep neural network to classify a picker’s activity into “Pick” and “NoPick” classes. Experimental evaluations showed that the CNN-LSTM model showed promising activity recognition performance with an F1 score of 0.97. The recognition results were then used to compute picker efficiency and the time required to fill a tray. Analysis of the season-long harvest data showed that the average picker efficiency was 75.07% with an estimation accuracy of 97.23%. Furthermore, the average tray fill time was 6.85 minutes with an estimation accuracy of 96.78%. When integrated into commercial harvesting, the proposed technology can aid growers through automated monitoring of worker activity and harvest optimisation, reducing non-productive time and enhancing overall harvest efficiency.

[336] Discovering Hierarchy-Grounded Domains with Adaptive Granularity for Clinical Domain Generalization

Pengfei Hu, Xiaoxue Han, Fei Wang, Yue Ning

Main category: cs.LG

TL;DR: UdonCare: A hierarchy-pruning method for domain generalization in healthcare that uses medical ontologies to discover latent patient domains and extract domain-invariant information.

Motivation: Domain generalization is challenging in healthcare due to shifting patient data distributions, lack of domain labels, and insufficient clinical insight integration in existing methods.

Method: UdonCare uses medical ontologies to iteratively divide patients into latent domains through hierarchy-pruning, then retrieves domain-invariant label information from patient data.

Result: Outperforms eight baselines across four clinical prediction tasks on two public datasets, demonstrating substantial improvements in handling domain gaps.

Conclusion: Medical knowledge integration through ontology-based domain discovery significantly enhances model generalization in healthcare settings.

Abstract: Domain generalization has become a critical challenge in predictive healthcare, where different patient groups often exhibit shifting data distributions that degrade model performance. Still, regular domain generalization approaches often struggle in clinical settings due to (1) the absence of domain labels and (2) the lack of clinical insight integration. To address these challenges in healthcare, we aim to explore how medical ontologies can be used to discover dynamic yet hierarchy-grounded patient domains, a partitioning strategy that remains under-explored in prior work. Hence, we introduce UdonCare, a hierarchy-pruning method that iteratively divides patients into latent domains and retrieves domain-invariant (label) information from patient data. On two public datasets, UdonCare shows superiority over eight baselines across four representative clinical prediction tasks with substantial domain gaps, highlighting the potential of medical knowledge for enhancing model generalization.

[337] Holistic Continual Learning under Concept Drift with Adaptive Memory Realignment

Alif Ashrafee, Jedrzej Kozal, Michal Wozniak, Bartosz Krawczyk

Main category: cs.LG

TL;DR: AMR framework addresses concept drift in continual learning by selectively updating replay buffers with drifted data, matching full retraining performance with minimal overhead.

Motivation: Traditional continual learning assumes static data distributions, but real-world data streams experience concept drift that permanently alters previously seen data, requiring both stability and rapid adaptation.

Method: Proposes Adaptive Memory Realignment (AMR) - a lightweight approach that removes outdated samples from replay buffers and repopulates with small number of up-to-date instances to realign memory with new distributions.

Result: AMR matches Full Relearning performance while reducing labeled data and computation needs by orders of magnitude. Tested on four concept drift variants of standard vision benchmarks.

Conclusion: AMR provides scalable solution reconciling stability and plasticity in non-stationary continual learning environments, effectively countering concept drift with minimal overhead.

Abstract: Traditional continual learning methods prioritize knowledge retention and focus primarily on mitigating catastrophic forgetting, implicitly assuming that the data distribution of previously learned tasks remains static. This overlooks the dynamic nature of real-world data streams, where concept drift permanently alters previously seen data and demands both stability and rapid adaptation. We introduce a holistic framework for continual learning under concept drift that simulates realistic scenarios by evolving task distributions. As a baseline, we consider Full Relearning (FR), in which the model is retrained from scratch on newly labeled samples from the drifted distribution. While effective, this approach incurs substantial annotation and computational overhead. To address these limitations, we propose Adaptive Memory Realignment (AMR), a lightweight alternative that equips rehearsal-based learners with a drift-aware adaptation mechanism. AMR selectively removes outdated samples of drifted classes from the replay buffer and repopulates it with a small number of up-to-date instances, effectively realigning memory with the new distribution. This targeted resampling matches the performance of FR while reducing the need for labeled data and computation by orders of magnitude. To enable reproducible evaluation, we introduce four concept drift variants of standard vision benchmarks, where previously seen classes reappear with shifted representations. Comprehensive experiments on these datasets using several rehearsal-based baselines show that AMR consistently counters concept drift, maintaining high accuracy with minimal overhead. These results position AMR as a scalable solution that reconciles stability and plasticity in non-stationary continual learning environments. Full implementation of our framework and benchmark datasets is available at: github.com/AlifAshrafee/CL-Under-Concept-Drift.
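The buffer operation AMR performs, drop stale samples of drifted classes, then add a handful of fresh ones, is simple enough to sketch directly. This is a schematic reading of the abstract, not the released implementation; the function name, the `(x, y)` pair format, and the per-class budget `k` are illustrative assumptions.

```python
def adaptive_memory_realignment(buffer, drifted_classes, fresh_samples, k=2):
    """Drop outdated samples of drifted classes from the replay buffer,
    then repopulate with up to k up-to-date samples per drifted class.

    buffer / fresh_samples: lists of (x, y) pairs.
    """
    # 1) remove stale samples whose class has drifted
    realigned = [(x, y) for (x, y) in buffer if y not in drifted_classes]
    # 2) add a small number of up-to-date instances per drifted class
    for c in drifted_classes:
        fresh_c = [(x, y) for (x, y) in fresh_samples if y == c][:k]
        realigned.extend(fresh_c)
    return realigned

buffer = [([0.1], 0), ([0.2], 0), ([0.9], 1), ([1.1], 1)]
fresh = [([5.0], 1), ([5.2], 1), ([5.4], 1)]   # class 1 after drift

new_buffer = adaptive_memory_realignment(
    buffer, drifted_classes={1}, fresh_samples=fresh, k=2)

labels = [y for _, y in new_buffer]
assert labels.count(0) == 2                    # class 0 untouched
assert labels.count(1) == 2                    # class 1 realigned
assert all(x[0] >= 5.0 for x, y in new_buffer if y == 1)
```

The replay-based learner then rehearses from `new_buffer` as usual, which is why the annotation cost is only the `k` fresh samples per drifted class rather than a full relabeled dataset.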

[338] DART: aDaptive Accept RejecT for non-linear top-K subset identification

Mridul Agarwal, Vaneet Aggarwal, Christopher J. Quinn, Abhishek Umrawal

Main category: cs.LG

TL;DR: DART algorithm for combinatorial bandits with non-linear reward functions and correlated arms, achieving near-optimal regret bounds without requiring individual arm feedback.

Motivation: Existing combinatorial bandit algorithms typically assume linear reward functions and require individual arm feedback, which limits their applicability to real-world problems where rewards can be non-linear functions of selected arms and arms may be correlated.

Method: Proposes DART (aDaptive Accept RejecT) algorithm that sequentially finds good arms and eliminates bad arms based on confidence bounds. It works without individual arm feedback, handles non-linear reward functions, and accommodates correlated rewards. The algorithm is computationally efficient with linear storage in N.

Result: DART achieves a regret bound of Õ(K√(KNT)), which matches the lower bound under bandit feedback up to a factor of √(log(2NT)). It outperforms state-of-the-art algorithms on cross-selling optimization and on maximizing the mean of individual rewards, and significantly outperforms existing methods in both linear and non-linear joint-reward environments.

Conclusion: DART provides an efficient solution for combinatorial bandits with non-linear reward functions and correlated arms, achieving near-optimal regret bounds without requiring individual arm feedback, making it applicable to a wide range of real-world problems.

Abstract: We consider the bandit problem of selecting $K$ out of $N$ arms at each time step. The reward can be a non-linear function of the rewards of the selected individual arms. The direct use of a multi-armed bandit algorithm requires choosing among $\binom{N}{K}$ options, making the action space large. To simplify the problem, existing works on combinatorial bandits typically assume feedback as a linear function of individual rewards. In this paper, we prove the lower bound for top-$K$ subset selection with bandit feedback with possibly correlated rewards. We present a novel algorithm for the combinatorial setting without using individual arm feedback or requiring linearity of the reward function. Additionally, our algorithm works on correlated rewards of individual arms. Our algorithm, aDaptive Accept RejecT (DART), sequentially finds good arms and eliminates bad arms based on confidence bounds. DART is computationally efficient and uses storage linear in $N$. Further, DART achieves a regret bound of $\tilde{\mathcal{O}}(K\sqrt{KNT})$ for a time horizon $T$, which matches the lower bound in bandit feedback up to a factor of $\sqrt{\log{2NT}}$. When applied to the problem of cross-selling optimization and maximizing the mean of individual rewards, the performance of the proposed algorithm surpasses that of state-of-the-art algorithms. We also show that DART significantly outperforms existing methods for both linear and non-linear joint reward environments.
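The accept/reject rule itself is easy to illustrate: an arm is accepted once its lower confidence bound beats the $K$-th largest upper bound among the other arms, and rejected once its upper bound falls below the $K$-th largest lower bound. The sketch below applies that rule to given per-arm statistics; note this is a simplification, since the actual DART never observes per-arm rewards and must build these estimates from subset-level feedback. The radius formula and the toy numbers are illustrative assumptions.

```python
import numpy as np

def accept_reject_sweep(means, counts, K, t):
    """One accept/reject sweep in the spirit of DART."""
    radius = np.sqrt(2 * np.log(t) / counts)   # confidence radius
    ucb, lcb = means + radius, means - radius
    accepted, rejected = [], []
    for i in range(len(means)):
        others_ucb = np.delete(ucb, i)
        others_lcb = np.delete(lcb, i)
        kth_ucb = np.sort(others_ucb)[-K]      # K-th largest among others
        kth_lcb = np.sort(others_lcb)[-K]
        if lcb[i] > kth_ucb:
            accepted.append(i)                 # certainly in the top-K
        elif ucb[i] < kth_lcb:
            rejected.append(i)                 # certainly outside the top-K
    return accepted, rejected

# Toy statistics after 2000 plays per arm: arms 0-2 are good, 3-5 bad.
means = np.array([0.9, 0.85, 0.8, 0.5, 0.2, 0.1])
counts = np.full(6, 2000.0)
acc, rej = accept_reject_sweep(means, counts, K=3, t=2000)
assert acc == [0, 1, 2]
assert rej == [3, 4, 5]
```

Once an arm is accepted or rejected it leaves the active set, which is what keeps DART's storage linear in $N$ despite the $\binom{N}{K}$-sized action space.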

[339] Generating Physical Dynamics under Priors

Zihan Zhou, Xiaoxue Wang, Tianshu Yu

Main category: cs.LG

TL;DR: A novel framework that incorporates physical priors into diffusion-based generative models to generate physically feasible dynamics while adhering to physical laws like energy/momentum conservation and PDE constraints.

Motivation: Existing data-driven generative models often violate basic physical laws and produce physically infeasible dynamics because they overlook integration of physical priors expressed in equations or formulas.

Method: Introduces a framework that seamlessly incorporates two categories of physical priors into diffusion-based generative models: 1) distributional priors (e.g., roto-translational invariance) and 2) physical feasibility priors (e.g., energy/momentum conservation laws, PDE constraints).

Result: Empirical evaluations demonstrate the method produces high-quality, physically realistic dynamics across diverse physical phenomena with remarkable robustness, advancing data-driven studies in AI4Physics.

Conclusion: The framework represents a substantial advancement in generative modeling, offering a robust solution to generate accurate and physically consistent dynamics by integrating physical priors into the generative process.

Abstract: Generating physically feasible dynamics in a data-driven context is challenging, especially when adhering to physical priors expressed in specific equations or formulas. Existing methodologies often overlook the integration of physical priors, resulting in violation of basic physical laws and suboptimal performance. In this paper, we introduce a novel framework that seamlessly incorporates physical priors into diffusion-based generative models to address this limitation. Our approach leverages two categories of priors: 1) distributional priors, such as roto-translational invariance, and 2) physical feasibility priors, including energy and momentum conservation laws and PDE constraints. By embedding these priors into the generative process, our method can efficiently generate physically realistic dynamics, encompassing trajectories and flows. Empirical evaluations demonstrate that our method produces high-quality dynamics across a diverse array of physical phenomena with remarkable robustness, underscoring its potential to advance data-driven studies in AI4Physics. Our contributions signify a substantial advancement in the field of generative modeling, offering a robust solution to generate accurate and physically consistent dynamics.

[340] B3C: A Minimalist Approach to Offline Multi-Agent Reinforcement Learning

Woojun Kim, Katia Sycara

Main category: cs.LG

TL;DR: B3C: Behavior Cloning regularization with Critic Clipping for offline multi-agent RL, addressing overestimation via target critic clipping and leveraging non-linear value factorization.

Motivation: Overestimation from unseen actions is a major challenge in offline RL, especially in multi-agent settings where multiple actions exacerbate the problem. BC regularization helps in single-agent settings but suffers from over-regularization or critic divergence in multi-agent scenarios.

Method: Proposes B3C: clips target critic value in policy evaluation based on maximum dataset return, pushes limit of RL objective weight over BC regularization, and integrates with non-linear value factorization techniques for multi-agent settings.

Result: B3C outperforms state-of-the-art algorithms on various offline multi-agent benchmarks when integrated with non-linear value factorization.

Conclusion: A simple yet effective method for offline multi-agent RL that addresses overestimation through critic clipping and BC regularization, enhanced by value factorization techniques.

Abstract: Overestimation arising from selecting unseen actions during policy evaluation is a major challenge in offline reinforcement learning (RL). A minimalist approach in the single-agent setting – adding behavior cloning (BC) regularization to existing online RL algorithms – has been shown to be effective; however, this approach is understudied in multi-agent settings. In particular, overestimation becomes worse in multi-agent settings due to the presence of multiple actions, resulting in the BC regularization-based approach easily suffering from either over-regularization or critic divergence. To address this, we propose a simple yet effective method, Behavior Cloning regularization with Critic Clipping (B3C), which clips the target critic value in policy evaluation based on the maximum return in the dataset and pushes the limit of the weight on the RL objective over BC regularization, thereby improving performance. Additionally, we leverage existing value factorization techniques, particularly non-linear factorization, which is understudied in offline settings. Integrated with non-linear value factorization, B3C outperforms state-of-the-art algorithms on various offline multi-agent benchmarks.
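The critic-clipping half of B3C is a one-line operation: bound the bootstrapped value by the maximum return seen in the dataset. The sketch below shows that operation in isolation; whether the clip is applied to the next-state value or the full target, and the discount used, are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def clipped_target(reward, next_q, dataset_max_return, gamma=0.99):
    """B3C-style target: cap the bootstrapped critic value by the
    maximum return observed in the dataset, so overestimation on
    unseen (out-of-distribution) actions cannot propagate."""
    return reward + gamma * np.minimum(next_q, dataset_max_return)

# Toy: the critic has diverged to an inflated value on an unseen action.
next_q = np.array([3.0, 120.0, 5.0])     # 120 is clearly overestimated
targets = clipped_target(reward=np.array([1.0, 1.0, 1.0]),
                         next_q=next_q,
                         dataset_max_return=10.0)

assert targets[1] == 1.0 + 0.99 * 10.0   # clipped at the dataset max return
assert targets[0] == 1.0 + 0.99 * 3.0    # in-range values pass through
```

With divergence bounded this way, the weight on the RL objective relative to the BC regularizer can be pushed much higher than vanilla BC-regularized methods tolerate, which is the second half of B3C.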

[341] Memory Injection Attacks on LLM Agents via Query-Only Interaction

Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, Zhen Xiang

Main category: cs.LG

TL;DR: MINJA is a memory injection attack on LLM agents that injects malicious records into memory banks through normal interactions, enabling attackers to influence agent behavior without direct memory access.

Motivation: LLM agents are vulnerable when their memory banks contain malicious records, but existing attacks assume direct memory modification. MINJA aims to demonstrate that attackers can inject harmful content through normal query interactions alone.

Method: Attackers interact with agents via queries to inject malicious records. They design bridging steps linking victim queries to malicious reasoning, use indication prompts to guide agents to generate similar bridging steps, and employ progressive shortening to remove prompts so malicious records are easily retrieved.

Result: Experiments across diverse agents show MINJA effectively compromises agent memory with minimal execution requirements, demonstrating that any user can influence agent memory through normal interactions.

Conclusion: MINJA reveals significant security risks in LLM agent memory systems, showing attackers can inject malicious content without direct memory access, highlighting the need for better memory security measures.

Abstract: Agents powered by large language models (LLMs) have demonstrated strong capabilities in a wide range of complex, real-world applications. However, LLM agents with a compromised memory bank may easily produce harmful outputs when the past records retrieved for demonstration are malicious. In this paper, we propose a novel Memory INJection Attack, MINJA, without assuming that the attacker can directly modify the memory bank of the agent. The attacker injects malicious records into the memory bank by only interacting with the agent via queries and output observations. These malicious records are designed to elicit a sequence of malicious reasoning steps corresponding to a different target query during the agent’s execution of the victim user’s query. Specifically, we introduce a sequence of bridging steps to link victim queries to the malicious reasoning steps. During the memory injection, we propose an indication prompt that guides the agent to autonomously generate similar bridging steps, with a progressive shortening strategy that gradually removes the indication prompt, such that the malicious record will be easily retrieved when processing later victim queries. Our extensive experiments across diverse agents demonstrate the effectiveness of MINJA in compromising agent memory. With minimal requirements for execution, MINJA enables any user to influence agent memory, highlighting the risk.

[342] Diffusion-Based Scenario Tree Generation for Multivariate Time Series Prediction and Multistage Stochastic Optimization

Stelios Zarifis, Ioannis Kordonis, Petros Maragos

Main category: cs.LG

TL;DR: DST uses diffusion models to generate structured scenario trees for stochastic forecasting and control, outperforming conventional methods in energy market applications.

Motivation: Stochastic forecasting is essential for decision-making in uncertain systems like energy markets and finance, where estimating full future distributions is crucial. Current methods lack structured representations of uncertainty for control tasks.

Method: Proposes Diffusion Scenario Tree (DST) framework that uses diffusion-based probabilistic forecasting models to construct scenario trees. Recursively samples future trajectories and organizes them into trees via clustering while ensuring non-anticipativity at each stage.

Result: DST significantly outperforms conventional scenario tree generation methods in energy arbitrage applications in New York’s electricity market. It yields more efficient decision policies than deterministic/stochastic MPC variants and simple RL baselines.

Conclusion: DST provides superior uncertainty representation for control tasks by integrating diffusion-based forecasting with structured scenario trees, enabling better decision-making in stochastic environments.

Abstract: Stochastic forecasting is critical for efficient decision-making in uncertain systems, such as energy markets and finance, where estimating the full distribution of future scenarios is essential. We propose Diffusion Scenario Tree (DST), a general framework for constructing scenario trees using diffusion-based probabilistic forecasting models to provide a structured model of system evolution for control tasks. DST recursively samples future trajectories and organizes them into a tree via clustering, ensuring non-anticipativity (decisions depending only on observed history) at each stage, offering a superior representation of uncertainty compared to using predictive models solely for forecasting system evolution. We integrate DST into Model Predictive Control (MPC) and evaluate it on energy arbitrage in New York State’s day-ahead electricity market. Experimental results show that our approach significantly outperforms the same optimization algorithms that use scenario trees generated by more conventional models. Furthermore, using DST for stochastic optimization yields more efficient decision policies by better handling uncertainty than deterministic and stochastic MPC variants using the same diffusion-based forecaster, and simple Model-Free Reinforcement Learning (RL) baselines.
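The sample-then-cluster step can be sketched with heavy simplifications: a random walk stands in for the diffusion forecaster, and a quantile split stands in for the clustering. The real DST builds a branching tree whose stages condition on the parent node (enforcing non-anticipativity), whereas this toy only summarizes per-stage marginals; all names and sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_trajectories(x0, horizon, n_samples):
    """Stand-in for the diffusion forecaster: random-walk trajectories."""
    steps = rng.normal(size=(n_samples, horizon))
    return x0 + np.cumsum(steps, axis=1)

def build_scenario_stages(x0, horizon, n_samples=200, branches=3):
    """Summarize sampled trajectories stage by stage: `branches`
    representative values per stage, with probabilities proportional
    to group sizes (a quantile split standing in for clustering)."""
    trajs = sample_trajectories(x0, horizon, n_samples)
    stages = []
    for t in range(horizon):
        vals = np.sort(trajs[:, t])
        groups = np.array_split(vals, branches)
        centers = [g.mean() for g in groups]
        probs = [len(g) / n_samples for g in groups]
        stages.append(list(zip(centers, probs)))
    return stages

tree = build_scenario_stages(x0=0.0, horizon=5)
assert len(tree) == 5
for stage in tree:
    assert len(stage) == 3
    assert abs(sum(p for _, p in stage) - 1.0) < 1e-9
```

An MPC controller would then optimize decisions over the resulting tree, with decisions at each node depending only on the history up to that node.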

[343] Leveraging Noisy Manual Labels as Useful Information: An Information Fusion Approach for Enhanced Variable Selection in Penalized Logistic Regression

Xiaofei Wu, Rongmei Liangse

Main category: cs.LG

TL;DR: Label noise from manual annotation can improve variable selection in penalized logistic regression, and a distributed ADMM algorithm leverages this noise effectively for large-scale learning.

Motivation: Label noise in supervised learning is typically seen as a problem, but this paper argues it contains valuable information about classification difficulty that can enhance variable selection in penalized logistic regression.

Method: Proposes a partition-insensitive parallel algorithm based on ADMM that efficiently leverages label noise information in distributed settings where data cannot be stored on a single machine, ensuring solution invariance to data distribution.

Result: Extensive experiments on large-scale datasets show the approach outperforms conventional variable selection techniques in both estimation accuracy and classification performance.

Conclusion: Label noise from manual annotation can be intentionally fused into learning processes to improve variable selection, and the proposed distributed algorithm effectively leverages this information while maintaining reproducibility and stability.

Abstract: In large-scale supervised learning, penalized logistic regression (PLR) effectively mitigates overfitting through regularization, yet its performance critically depends on robust variable selection. This paper demonstrates that label noise introduced during manual annotation, often dismissed as a mere artifact, can serve as a valuable source of information to enhance variable selection in PLR. We theoretically show that such noise, intrinsically linked to classification difficulty, helps refine the estimation of non-zero coefficients compared to using only ground truth labels, effectively turning a common imperfection into a useful information resource. To efficiently leverage this form of information fusion in large-scale settings where data cannot be stored on a single machine, we propose a novel partition-insensitive parallel algorithm based on the alternating direction method of multipliers (ADMM). Our method ensures that the solution remains invariant to how data is distributed across workers, a key property for reproducible and stable distributed learning, while guaranteeing global convergence at a sublinear rate. Extensive experiments on multiple large-scale datasets show that the proposed approach consistently outperforms conventional variable selection techniques in both estimation accuracy and classification performance, affirming the value of intentionally fusing noisy manual labels into the learning process.
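The distributed backbone, consensus ADMM for L1-penalized logistic regression, can be sketched compactly: each shard runs a local logistic solve, then a global soft-thresholding step merges the shards. This is a generic consensus-ADMM sketch, not the paper's partition-insensitive algorithm or its noise-fusion objective; the inner gradient solver, step sizes, and penalty values are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def local_step(X, y, z, u, rho, beta, n_inner=25, lr=0.1):
    """Gradient steps on the local augmented Lagrangian:
       logistic loss + (rho/2)||beta - z + u||^2."""
    for _ in range(n_inner):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (p - y) / len(y) + rho * (beta - z + u)
        beta = beta - lr * grad
    return beta

def consensus_admm_plr(X_parts, y_parts, lam=0.05, rho=1.0, iters=50):
    """Consensus ADMM for L1-penalized logistic regression:
    each worker holds one data shard; z is the shared solution."""
    d, m = X_parts[0].shape[1], len(X_parts)
    betas = [np.zeros(d) for _ in range(m)]
    us = [np.zeros(d) for _ in range(m)]
    z = np.zeros(d)
    for _ in range(iters):
        betas = [local_step(X, y, z, u, rho, b)
                 for X, y, u, b in zip(X_parts, y_parts, us, betas)]
        # consensus update with soft-thresholding (the L1 proximal step)
        z = soft_threshold(
            np.mean([b + u for b, u in zip(betas, us)], axis=0),
            lam / (rho * m))
        us = [u + b - z for u, b in zip(us, betas)]
    return z

rng = np.random.default_rng(5)
n, d = 400, 6
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])  # sparse truth
X = rng.normal(size=(n, d))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

z2 = consensus_admm_plr([X[:200], X[200:]], [y[:200], y[200:]])
assert z2.shape == (d,)
assert z2[0] > 0 and abs(z2[0]) > abs(z2[2])  # true signal dominates
```

The paper's contribution on top of this backbone is making the solution provably invariant to how `X_parts` is partitioned, and reweighting the loss with noisy manual labels.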

[344] Provable Training Data Identification for Large Language Models

Zhenlong Liu, Hao Zeng, Weiran Huang, Hongxin Wei

Main category: cs.LG

TL;DR: PTDI is a statistical method for identifying training data in large models with provable false identification rate control using conformal p-values and Benjamini-Hochberg procedure.

DetailsMotivation: Existing training data identification methods lack statistical reliability and error rate control, which is critical for copyright litigation, privacy auditing, and fair evaluation of large-scale models.

Method: Formalizes training data identification as set-level inference, computes conformal p-values using known unseen data, develops Jackknife-corrected Beta boundary estimator to estimate training-data proportion, scales p-values, and applies Benjamini-Hochberg procedure for selection with false identification rate control.

Result: Extensive experiments across various models and datasets show PTDI achieves higher power than prior methods while strictly controlling the false identification rate.

Conclusion: PTDI provides a statistically reliable approach for training data identification with provable error rate control, addressing critical needs in copyright, privacy, and evaluation contexts.

Abstract: Identifying training data of large-scale models is critical for copyright litigation, privacy auditing, and ensuring fair evaluation. However, existing works typically treat this task as an instance-wise identification without controlling the error rate of the identified set, which cannot provide statistically reliable evidence. In this work, we formalize training data identification as a set-level inference problem and propose Provable Training Data Identification (PTDI), a distribution-free approach that enables provable and strict false identification rate (FIR) control. Specifically, our method computes conformal p-values for each data point using a set of known unseen data and then develops a novel Jackknife-corrected Beta boundary (JKBB) estimator to estimate the training-data proportion of the test set, which allows us to scale these p-values. By applying the Benjamini-Hochberg (BH) procedure to the scaled p-values, we select a subset of data points with provable and strict false identification rate control. Extensive experiments across various models and datasets demonstrate that PTDI achieves higher power than prior methods while strictly controlling the FIR.

[345] Adopting a human developmental visual diet yields robust, shape-based AI vision

Zejin Lu, Sushrut Thorat, Radoslaw M Cichy, Tim C Kietzmann

Main category: cs.LG

TL;DR: AI vision models trained with human-inspired developmental curriculum show improved shape reliance, abstract shape recognition, and robustness to distortions and adversarial attacks.

DetailsMotivation: Address the persistent misalignment between artificial and human vision, where AI systems rely heavily on texture features rather than shape information, lack robustness to image distortions, remain vulnerable to adversarial attacks, and struggle with abstract shape recognition.

Method: Developed a Developmental Visual Diet (DVD) for AI vision inspired by how human vision develops from infancy to adulthood. This curriculum considers the development of visual acuity, contrast sensitivity, and color, guiding AI systems through a human-inspired learning progression.

Result: Models trained with DVD curriculum showed: 1) strongest reported reliance on shape information to date, 2) abstract shape recognition beyond state-of-the-art, 3) higher resilience to image corruptions, and 4) improved robustness to adversarial attacks.

Conclusion: Robust AI vision can be achieved by guiding how a model learns (through human-inspired developmental curriculum) rather than merely how much it learns, offering a resource-efficient route toward safer and more human-like artificial visual systems.

Abstract: Despite years of research and the dramatic scaling of artificial intelligence (AI) systems, a striking misalignment between artificial and human vision persists. Contrary to humans, AI relies heavily on texture-features rather than shape information, lacks robustness to image distortions, remains highly vulnerable to adversarial attacks, and struggles to recognise simple abstract shapes within complex backgrounds. To close this gap, here we take inspiration from how human vision develops from early infancy into adulthood. We quantified visual maturation by synthesising decades of research into a novel developmental visual diet (DVD) for AI vision. Guiding AI systems through this human-inspired curriculum, which considers the development of visual acuity, contrast sensitivity, and colour, produces models that better align with human behaviour on every hallmark of robust vision tested, yielding the strongest reported reliance on shape information to date, abstract shape recognition beyond the state of the art, and higher resilience to image corruptions and adversarial attacks. Our results thus demonstrate that robust AI vision can be achieved by guiding how a model learns, not merely how much it learns, offering a resource-efficient route toward safer and more human-like artificial visual systems.
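
The curriculum idea reduces to a schedule mapping training progress to image degradations. A minimal sketch with illustrative constants (the paper fits its curves to decades of developmental data; the numbers and the omission of blur below are assumptions for brevity):

```python
import numpy as np

def dvd_schedule(progress):
    """Map training progress in [0, 1] to (blur_sigma, contrast, saturation).
    Early training mimics infant vision: heavy blur, low contrast, weak colour.
    Constants are illustrative, not the paper's fitted developmental curves."""
    sigma = 4.0 * (1.0 - progress)       # visual acuity improves
    contrast = 0.3 + 0.7 * progress      # contrast sensitivity improves
    saturation = progress                # colour vision matures
    return sigma, contrast, saturation

def apply_diet(img, progress):
    """img: float array (H, W, 3) in [0, 1]. Blur step omitted for brevity."""
    _, contrast, saturation = dvd_schedule(progress)
    gray = img.mean(axis=-1, keepdims=True)
    img = gray + saturation * (img - gray)   # pull colours toward grayscale
    return 0.5 + contrast * (img - 0.5)      # compress contrast around mid-gray

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))
infant = apply_diet(img, 0.0)   # grayscale, low contrast
adult = apply_diet(img, 1.0)    # unchanged image
```

At progress 1 the transform is the identity, so late training sees undegraded images, matching the idea of guiding how the model learns rather than how much.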

[346] Learning on a Razor’s Edge: Identifiability and Singularity of Polynomial Neural Networks

Vahid Shahverdi, Giovanni Luca Marchetti, Kathlén Kohn

Main category: cs.LG

TL;DR: This paper studies neuromanifolds (function spaces parametrized by neural networks) for MLPs and CNNs with polynomial activations, addressing identifiability, dimension computation, and singularity analysis using algebraic geometry tools.

DetailsMotivation: To understand the geometric structure of neural network function spaces (neuromanifolds), particularly addressing identifiability issues (how many parameter choices yield the same function), computing dimensions, and characterizing singular points to explain training behaviors like sparsity bias.

Method: Uses algebraic geometry tools to analyze neuromanifolds of deep MLPs and CNNs with sufficiently generic polynomial activation functions. Studies identifiability (finite-to-one for MLPs, one-to-one for CNNs), computes dimensions, and characterizes singularities arising from sparse subnetworks.

Result: For MLPs: almost all functions have finitely many parameter choices; singularities correspond to sparse subnetworks and often align with critical points of MSE loss. For CNNs: parametrization is generically one-to-one; singularities also arise from sparse subnetworks but do not correspond to MSE critical points. Provides a geometric explanation for MLP sparsity bias.

Conclusion: Neuromanifolds exhibit rich geometric structure with identifiability differences between MLPs and CNNs. Singularities from sparse subnetworks explain MLP’s sparsity bias during training, while CNNs behave differently. Algebraic geometry provides a powerful framework for analyzing neural network geometry.

Abstract: We study function spaces parametrized by neural networks, referred to as neuromanifolds. Specifically, we focus on deep Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs) with an activation function that is a sufficiently generic polynomial. First, we address the identifiability problem, showing that, for almost all functions in the neuromanifold of an MLP, there exist only finitely many parameter choices yielding that function. For CNNs, the parametrization is generically one-to-one. As a consequence, we compute the dimension of the neuromanifold. Second, we describe singular points of neuromanifolds. We characterize singularities completely for CNNs, and partially for MLPs. In both cases, they arise from sparse subnetworks. For MLPs, we prove that these singularities often correspond to critical points of the mean-squared error loss, which does not hold for CNNs. This provides a geometric explanation of the sparsity bias of MLPs. All of our results leverage tools from algebraic geometry.
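
Why "sufficiently generic" matters is easy to see in a toy case: for the non-generic homogeneous activation σ(t) = t², the fiber over a function is positive-dimensional, because a continuous rescaling of the weights leaves the network function unchanged (a sketch of the symmetry, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # hidden-layer weights
W2 = rng.normal(size=(2, 4))   # output-layer weights

def mlp(x, W1, W2):
    return W2 @ (W1 @ x) ** 2  # one hidden layer, activation sigma(t) = t^2

x = rng.normal(size=3)
a = 2.5
# (a*W1, W2/a**2) computes the same function as (W1, W2) for every a != 0,
# so this (non-generic) activation admits a continuous family of equivalent
# parameters; a generic polynomial activation breaks this symmetry.
same = np.allclose(mlp(x, W1, W2), mlp(x, a * W1, W2 / a**2))
```

For generic polynomial activations the paper shows such continuous symmetries disappear for MLPs, leaving only finitely many preimages.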

[347] Non-Convex Over-the-Air Heterogeneous Federated Learning: A Bias-Variance Trade-off

Muhammad Faraz Ul Abrar, Nicolò Michelusi

Main category: cs.LG

TL;DR: OTA-FL with SGD for non-convex objectives under wireless heterogeneity, proposing biased updates with optimized power control to balance bias-variance trade-off.

DetailsMotivation: Existing OTA-FL designs enforce zero-bias updates under homogeneous wireless conditions, which is unrealistic and inefficient under heterogeneous scenarios. Prior analyses focus on convex objectives while modern AI models are non-convex.

Method: Develop OTA-FL SGD updates allowing structured, time-invariant model bias to reduce variance. Derive finite-time stationarity bound revealing bias-variance trade-off. Propose non-convex joint OTA power-control design solved via successive convex approximation (SCA) algorithm using statistical CSI.

Result: Experiments on non-convex image classification show SCA-based design accelerates convergence via optimized bias and improves generalization over prior OTA-FL baselines.

Conclusion: Structured biased OTA-FL with optimized power control effectively addresses wireless heterogeneity for non-convex objectives, achieving better convergence and generalization than zero-bias approaches.

Abstract: Over-the-air (OTA) federated learning (FL) has been well recognized as a scalable paradigm that exploits the waveform superposition of the wireless multiple-access channel to aggregate model updates in a single use. Existing OTA-FL designs largely enforce zero-bias model updates by either assuming \emph{homogeneous} wireless conditions (equal path loss across devices) or forcing zero-bias updates to guarantee convergence. Under \emph{heterogeneous} wireless scenarios, however, such designs are constrained by the weakest device and inflate the update variance. Moreover, prior analyses of biased OTA-FL largely address convex objectives, while most modern AI models are highly non-convex. Motivated by these gaps, we study OTA-FL with stochastic gradient descent (SGD) for general smooth non-convex objectives under wireless heterogeneity. We develop novel OTA-FL SGD updates that allow a structured, time-invariant model bias while facilitating reduced variance updates. We derive a finite-time stationarity bound (expected time average squared gradient norm) that explicitly reveals a bias-variance trade-off. To optimize this trade-off, we pose a non-convex joint OTA power-control design and develop an efficient successive convex approximation (SCA) algorithm that requires only statistical CSI at the base station. Experiments on a non-convex image classification task validate the approach: the SCA-based design accelerates convergence via an optimized bias and improves generalization over prior OTA-FL baselines.
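
The bias-variance trade-off shows up already in a toy scalar-channel simulation: channel-inverting (zero-bias) power control is throttled by the weakest device and amplifies noise, while full-power transmission biases the aggregate toward strong channels but cuts variance. This is only an illustration with fixed gains, not the paper's SCA power-control design:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 10, 1000
g = rng.normal(size=(K, d))               # per-device gradients
target = g.mean(axis=0)                   # ideal (unbiased) aggregate
h = np.linspace(0.05, 1.0, K)             # heterogeneous channel gains (fixed)
sigma = 0.5                               # receiver noise std

def ota_aggregate(p):
    """Receiver observes sum_k h_k p_k g_k + noise, then rescales."""
    y = (h * p) @ g + sigma * rng.normal(size=d)
    return y / (h * p).sum()

p_zero = h.min() / h                      # zero-bias: invert channels, capped
p_full = np.ones(K)                       #   by the weakest device's budget
mse = lambda p: np.mean((ota_aggregate(p) - target) ** 2)
mse_zero, mse_full = mse(p_zero), mse(p_full)
```

Here the biased full-power scheme has much lower end-to-end error despite aggregating a channel-weighted (hence biased) mean; the paper optimizes this trade-off rather than picking either extreme.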

[348] Optimal Formats for Weight Quantisation

Douglas Orr, Luka Ribar, Carlo Luschi

Main category: cs.LG

TL;DR: A framework for systematic design of weight quantization formats based on classical quantization theory, showing variable-length codes are optimal for minimizing KL divergence under size constraints.

DetailsMotivation: Current quantization format selection is largely empirical despite being essential for efficient deep learning model deployment. There's a need for systematic design principles to optimize quantization formats.

Method: Frame format design as minimizing KL divergence between original and quantized model outputs under size constraints, approximate as squared quantization error problem. Develop non-linear quantization curves for block-scaled data across distributions and derive optimal bit-width allocation using Fisher information.

Result: Variable-length code formats consistently outperform fixed-length formats. Optimal bit-width allocation saves up to 0.25 bits per parameter in large language models.

Conclusion: Systematic quantization format design based on classical theory reveals variable-length encoding as optimal, providing a principled approach to efficient model compression.

Abstract: Weight quantisation is an essential technique for enabling efficient training and deployment of modern deep learning models. However, the recipe book of quantisation formats is large and formats are often chosen empirically. In this paper, we propose a framework for systematic design and analysis of quantisation formats. By connecting the question of format design with the classical quantisation theory, we show that the strong practical performance of popular formats comes from their ability to represent values using variable-length codes. We frame the problem as minimising the KL divergence between original and quantised model outputs under a model size constraint, which can be approximated by minimising the squared quantisation error, a well-studied problem where entropy-constrained quantisers with variable-length codes are optimal. We develop non-linear quantisation curves for block-scaled data across multiple distribution families and observe that these formats, along with sparse outlier formats, consistently outperform fixed-length formats, indicating that they also exploit variable-length encoding. Finally, by using the relationship between the Fisher information and KL divergence, we derive the optimal allocation of bit-widths to individual parameter tensors across the model’s layers, saving up to 0.25 bits per parameter when applied to large language models.
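
As a baseline for what the paper improves on, a fixed-length block-scaled format takes only a few lines; the variable-length, entropy-constrained formats the paper advocates replace the uniform grid below with non-linear levels plus a code-length constraint:

```python
import numpy as np

def block_scaled_quantize(w, block=32, bits=4):
    """Symmetric fixed-length quantiser: each block shares one absmax scale,
    and values snap to a uniform grid with 2**(bits-1) - 1 levels per side."""
    blocks = w.reshape(-1, block)
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                         # avoid 0/0 on all-zero blocks
    q = np.clip(np.round(blocks / scale), -qmax, qmax)
    return (q * scale).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=4096)
w_hat = block_scaled_quantize(w)
sqe = np.mean((w - w_hat) ** 2)   # the squared-error proxy the paper minimises
```

Minimising this squared quantisation error under a size budget is exactly the classical entropy-constrained quantisation problem the paper connects to.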

[349] Beyond All-to-All: Causal-Aligned Transformer with Dynamic Structure Learning for Multivariate Time Series Forecasting

Xingyu Zhang, Hanyun Du, Zeen Song, Siyu Zhao, Changwen Zheng, Wenwen Qiang

Main category: cs.LG

TL;DR: Proposes CDT: Causal Decomposition Transformer for multivariate time series forecasting using all-to-one paradigm with causal structure learning to separate endogenous, direct causal, collider causal, and spurious correlation components.

DetailsMotivation: Existing multivariate time series forecasting methods use all-to-all paradigm that doesn't distinguish variable roles, making it hard to identify variable-specific causal influences and often entangles causally relevant information with spurious correlations.

Method: 1) Construct Structural Causal Model from observational data; 2) For each target variable, partition historical sequence into four subsegments based on inferred causal structure; 3) Propose Causal Decomposition Transformer (CDT) with dynamic causal adapter to learn causal structures; 4) Apply projection-based output constraint to mitigate collider-induced bias.

Result: Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of the CDT model for multivariate time series forecasting.

Conclusion: The proposed all-to-one forecasting paradigm with causal decomposition and CDT effectively addresses limitations of traditional all-to-all approaches by distinguishing causal influences and reducing spurious correlations.

Abstract: Most existing multivariate time series forecasting methods adopt an all-to-all paradigm that feeds all variable histories into a unified model to predict their future values without distinguishing their individual roles. However, this undifferentiated paradigm makes it difficult to identify variable-specific causal influences and often entangles causally relevant information with spurious correlations. To address this limitation, we propose an all-to-one forecasting paradigm that predicts each target variable separately. Specifically, we first construct a Structural Causal Model from observational data and then, for each target variable, we partition the historical sequence into four subsegments according to the inferred causal structure: endogenous, direct causal, collider causal, and spurious correlation. Furthermore, we propose the Causal Decomposition Transformer (CDT), which integrates a dynamic causal adapter to learn causal structures initialized by the inferred graph, enabling correction of imperfect causal discovery during training. In addition, motivated by causal theory, we apply a projection-based output constraint to mitigate collider-induced bias and improve robustness. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of the CDT.
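
How variables might be partitioned for one target given an inferred graph can be sketched as follows. The grouping rule here (direct causes of the target; co-parents of the target's children as collider-linked variables; everything else spurious) is one plausible reading of the four subsegments, not the paper's exact definition:

```python
import numpy as np

def partition_variables(A, target):
    """A is a binary adjacency matrix with A[j, i] = 1 meaning j -> i.
    Returns the four variable groups for one target (illustrative rule)."""
    n = A.shape[0]
    endogenous = {target}
    direct = {j for j in range(n) if A[j, target] and j != target}
    children = {c for c in range(n) if A[target, c]}
    collider = {j for j in range(n) if j != target
                and any(A[j, c] for c in children)} - direct - endogenous
    spurious = set(range(n)) - endogenous - direct - collider
    return endogenous, direct, collider, spurious

# Toy graph on 4 variables: 1 -> 0 (direct cause), 0 -> 2 <- 3 (collider at 2).
A = np.zeros((4, 4), dtype=int)
A[1, 0] = 1; A[0, 2] = 1; A[3, 2] = 1
endo, direct, coll, spur = partition_variables(A, target=0)
```

The model then treats each group's history differently, rather than feeding all histories in undifferentiated as in the all-to-all paradigm.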

[350] On Learning Verifiers and Implications to Chain-of-Thought Reasoning

Maria-Florina Balcan, Avrim Blum, Zhiyuan Li, Dravyansh Sharma

Main category: cs.LG

TL;DR: Formal PAC-learning framework for learning reliable verifiers of natural language Chain-of-Thought reasoning, with sample complexity bounds and impossibility results

DetailsMotivation: Chain-of-Thought reasoning often produces incorrect inferences, and formal mathematical verification is challenging for LLMs. Need reliable verifiers for natural language reasoning steps.

Method: Proposes formal PAC-learning framework for verifier learning, defines different verification goals (strength levels), provides sample complexity upper bounds, and shows impossibility results for certain objectives

Result: Established theoretical foundations for learning verifiers, provided sample complexity bounds for achievable verification goals, and identified limitations/impossibility results for certain objectives without additional assumptions

Conclusion: Formal learning framework enables systematic study of verifier learning for Chain-of-Thought reasoning, with practical implications for improving reasoning reliability in LLMs

Abstract: Chain-of-Thought reasoning has emerged as a powerful approach for solving complex mathematical and logical problems. However, it can often veer off track through incorrect or unsubstantiated inferences. Formal mathematical reasoning, which can be checked with a formal verifier, is one approach to addressing this issue. However, currently LLMs are simply not good enough to solve complex problems in a formal way, and even just formalizing an informal problem statement can be challenging. Motivated by this fact, in this work we consider the problem of learning reliable verifiers for natural language Chain-of-Thought reasoning. That is, given a problem statement and step-by-step solution in natural language, the aim of the verifier is to output [Yes] if the reasoning steps in the solution are all valid, and [No] otherwise. In this work we give a formal PAC-learning framework for studying this problem. We propose and analyze several natural verification goals, at different levels of strength, in this framework. We provide sample complexity upper-bounds for learning verifiers satisfying these goals, as well as lower-bound and impossibility results for learning other natural verification objectives without additional assumptions.
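
For calibration on what "sample complexity upper bounds" mean here, the classical realizable-case result for a finite hypothesis class of verifiers reads as follows (this is the standard PAC bound, not the paper's sharper goal-specific bounds):

```latex
% With probability at least 1 - \delta over m i.i.d. examples, any verifier in a
% finite class \mathcal{H} that is consistent with all m examples has error at
% most \varepsilon, provided
m \;\ge\; \frac{1}{\varepsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right).
```

The paper's contribution is to define verification goals of varying strength and show which of them admit bounds of this flavor and which are impossible without further assumptions.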

[351] N$^2$: A Unified Python Package and Test Bench for Nearest Neighbor-Based Matrix Completion

Caleb Chin, Aashish Khubchandani, Harshvardhan Maskara, Kyuseong Choi, Jacob Feitelberg, Albert Gong, Manit Paul, Tathagata Sadhukhan, Anish Agarwal, Raaz Dwivedi

Main category: cs.LG

TL;DR: N² is a unified Python package for nearest neighbor matrix completion methods with benchmarking on real-world datasets, showing NN methods outperform classical approaches in practical applications.

DetailsMotivation: To consolidate diverse nearest neighbor matrix completion methods into a unified framework for research and practical use, addressing the need for robust tools that perform well on real-world data beyond synthetic scenarios.

Method: Developed N², a modular Python package that implements various NN-based matrix completion methods with extensible interfaces. Introduced a new NN variant and created a benchmark suite of real-world datasets from healthcare, recommender systems, causal inference, and LLM evaluation.

Result: NN-based methods consistently outperform classical matrix completion techniques on real-world datasets, with the new NN variant achieving state-of-the-art results in several settings. The package enables rapid experimentation and benchmarking.

Conclusion: Nearest neighbor methods are robust and effective for matrix completion in practical applications, and the N² package provides a valuable tool for both research and deployment in diverse domains.

Abstract: Nearest neighbor (NN) methods have re-emerged as competitive tools for matrix completion, offering strong empirical performance and recent theoretical guarantees, including entry-wise error bounds, confidence intervals, and minimax optimality. Despite their simplicity, recent work has shown that NN approaches are robust to a range of missingness patterns and effective across diverse applications. This paper introduces N$^2$, a unified Python package and testbed that consolidates a broad class of NN-based methods through a modular, extensible interface. Built for both researchers and practitioners, N$^2$ supports rapid experimentation and benchmarking. Using this framework, we introduce a new NN variant that achieves state-of-the-art results in several settings. We also release a benchmark suite of real-world datasets, from healthcare and recommender systems to causal inference and LLM evaluation, designed to stress-test matrix completion methods beyond synthetic scenarios. Our experiments demonstrate that while classical methods excel on idealized data, NN-based techniques consistently outperform them in real-world settings.
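
The basic estimator underlying this family of methods is short enough to sketch; the unweighted row-neighbor variant below is generic, not any specific N$^2$ method:

```python
import numpy as np

def row_nn_complete(M, mask, k=3):
    """Impute entries where mask is False using the k nearest observed rows,
    with row distances computed on commonly observed columns."""
    n, _ = M.shape
    out = M.copy()
    for i in range(n):
        dists = np.full(n, np.inf)
        for j in range(n):
            shared = mask[i] & mask[j]
            if j != i and shared.any():
                dists[j] = np.mean((M[i, shared] - M[j, shared]) ** 2)
        nbrs = np.argsort(dists)[:k]
        for c in np.where(~mask[i])[0]:
            vals = [M[j, c] for j in nbrs if mask[j, c]]
            if vals:
                out[i, c] = np.mean(vals)
    return out

rng = np.random.default_rng(0)
u = rng.uniform(1, 2, size=20); v = rng.uniform(1, 2, size=15)
M_full = np.outer(u, v)                       # rank-1 ground truth
mask = rng.random(M_full.shape) < 0.7         # ~70% observed
M_obs = np.where(mask, M_full, 0.0)           # zeros stand in for missing
M_hat = row_nn_complete(M_obs, mask)
err = np.abs(M_hat - M_full)[~mask].mean()
```

On this low-rank toy the neighbor average recovers missing entries far better than leaving them empty; the package's variants refine the distance, weighting, and neighbor sets.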

[352] Quasiparticle Interference Kernel Extraction with Variational Autoencoders via Latent Alignment

Yingshuai Ji, Haomin Zhuang, Matthew Toole, James McKenzie, Xiaolong Liu, Xiangliang Zhang

Main category: cs.LG

TL;DR: AI framework extracts single-scatterer QPI patterns from complex multi-scatterer images using two-step learning with variational autoencoder for kernel representation and dedicated encoder for observation-to-kernel inference.

DetailsMotivation: QPI imaging is powerful for probing electronic structures, but extracting single-scatterer patterns from multi-scatterer images is ill-posed. Existing manual methods are infeasible for complex real-world scattering conditions.

Method: Two-step learning strategy: 1) Train variational autoencoder to learn compact latent space of scattering kernels, 2) Align latent representation of QPI observations with pre-learned kernels using dedicated encoder.

Result: Achieves significantly higher extraction accuracy and improved generalization to unseen kernels compared to a direct one-step baseline. Successfully applied to real QPI data from Ag and FeSe samples under complex scattering conditions.

Conclusion: First AI-based framework for QPI kernel extraction that robustly handles complex, entangled scattering conditions by modeling physically valid kernel space and using two-step learning strategy.

Abstract: Quasiparticle interference (QPI) imaging is a powerful tool for probing electronic structures in quantum materials, but extracting the single-scatterer QPI pattern (i.e., the kernel) from a multi-scatterer image remains a fundamentally ill-posed inverse problem, because many different kernels can combine to produce almost the same observed image, and noise or overlaps further obscure the true signal. Existing solutions to this extraction problem rely on manually zooming into small local regions with isolated single-scatterers. This is infeasible for real cases where scattering conditions are too complex. In this work, we propose the first AI-based framework for QPI kernel extraction, which models the space of physically valid kernels and uses this knowledge to guide the inverse mapping. We introduce a two-step learning strategy that decouples kernel representation learning from observation-to-kernel inference. In the first step, we train a variational autoencoder to learn a compact latent space of scattering kernels. In the second step, we align the latent representation of QPI observations with those of the pre-learned kernels using a dedicated encoder. This design enables the model to infer kernels robustly under complex, entangled scattering conditions. We construct a diverse and physically realistic QPI dataset comprising 100 unique kernels and evaluate our method against a direct one-step baseline. Experimental results demonstrate that our approach achieves significantly higher extraction accuracy and improved generalization to unseen kernels. To further validate its effectiveness, we also apply the method to real QPI data from Ag and FeSe samples, where it reliably extracts meaningful kernels under complex scattering conditions.
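
The two-step structure is easy to prototype with linear stand-ins for the encoders (the paper uses a VAE; the shapes, the linear mixing model for observations, and the least-squares fit are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_obs, d_ker, d_lat, n = 64, 16, 4, 500

# Step 1: a frozen "kernel encoder" defines the latent space of valid kernels.
E_k = rng.normal(size=(d_lat, d_ker)) / np.sqrt(d_ker)
kernels = rng.normal(size=(n, d_ker))
z_kernel = kernels @ E_k.T                     # pre-learned kernel latents

# Observations: here, each QPI image is a noisy linear mixing of its kernel.
mix = rng.normal(size=(d_obs, d_ker))
obs = kernels @ mix.T + 0.01 * rng.normal(size=(n, d_obs))

# Step 2: fit an observation encoder so its latents align with the kernel's.
E_o, *_ = np.linalg.lstsq(obs, z_kernel, rcond=None)
align_err = np.mean((obs @ E_o - z_kernel) ** 2)
```

Because the kernel latent space is fixed first, the observation encoder only has to map into an already well-structured space, which is the intuition behind the paper's decoupling.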

[353] PeakWeather: MeteoSwiss Weather Station Measurements for Spatiotemporal Deep Learning

Daniele Zambon, Michele Cattaneo, Ivan Marisca, Jonas Bhend, Daniele Nerini, Cesare Alippi

Main category: cs.LG

TL;DR: PeakWeather is a high-quality 10-minute interval weather dataset from Swiss stations over 8+ years, with topographical context and NWP baseline forecasts for ML research in meteorology.

DetailsMotivation: To provide a comprehensive real-world benchmark dataset for advancing machine learning research in meteorology, supporting various spatiotemporal tasks like forecasting, graph learning, imputation, and virtual sensing.

Method: Collection of surface weather observations every 10 minutes over 8+ years from 302 MeteoSwiss stations across Switzerland’s complex topography, complemented with topographical indices from digital height models and ensemble NWP forecasts as baseline.

Result: Creation of PeakWeather dataset containing diverse meteorological variables from distributed stations with topographical context and operational NWP forecasts, enabling broad spatiotemporal ML research.

Conclusion: PeakWeather serves as a valuable real-world benchmark to advance both foundational ML research and meteorological applications through its rich, high-quality spatiotemporal data.

Abstract: Accurate weather forecasts are essential for supporting a wide range of activities and decision-making processes, as well as mitigating the impacts of adverse weather events. While traditional numerical weather prediction (NWP) remains the cornerstone of operational forecasting, machine learning is emerging as a powerful alternative for fast, flexible, and scalable predictions. We introduce PeakWeather, a high-quality dataset of surface weather observations collected every 10 minutes over more than 8 years from the ground stations of the Federal Office of Meteorology and Climatology MeteoSwiss’s measurement network. The dataset includes a diverse set of meteorological variables from 302 station locations distributed across Switzerland’s complex topography and is complemented with topographical indices derived from digital height models for context. Ensemble forecasts from the currently operational high-resolution NWP model are provided as a baseline forecast against which to evaluate new approaches. The dataset’s richness supports a broad spectrum of spatiotemporal tasks, including time series forecasting at various scales, graph structure learning, imputation, and virtual sensing. As such, PeakWeather serves as a real-world benchmark to advance foundational machine learning research, meteorology, and sensor-based applications.
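
Working with 10-minute station observations typically starts with resampling to a target horizon. The column names and synthetic values below are hypothetical stand-ins, not the PeakWeather schema:

```python
import numpy as np
import pandas as pd

# One synthetic day of 10-minute observations for a single hypothetical station.
idx = pd.date_range("2020-01-01", periods=6 * 24, freq="10min")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "temperature": 5 + 3 * np.sin(np.linspace(0, 2 * np.pi, len(idx)))
                     + rng.normal(0, 0.3, size=len(idx)),
    "precip": rng.exponential(0.02, size=len(idx)),
}, index=idx)

# Aggregate to hourly: mean temperature, accumulated precipitation.
hourly = df.resample("60min").agg({"temperature": "mean", "precip": "sum"})
```

Summing precipitation while averaging temperature preserves each variable's physical meaning, the kind of per-variable care the benchmark's tasks (forecasting, imputation, virtual sensing) require.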

[354] SAGE: Sequence-level Adaptive Gradient Evolution for Generative Recommendation

Yu Xie, Xing Kai Ren, Ying Qi, Hu Yao

Main category: cs.LG

TL;DR: SAGE is a reinforcement learning optimizer for list-wise generative recommenders that addresses symmetric conservatism failures through sequence-level signal alignment, decoupled multi-objective advantage estimation, and asymmetric adaptive bounding.

DetailsMotivation: Existing RL-based preference optimizers like GBPO have structural limitations in recommendation settings, including symmetric conservatism that suppresses learning from rare positive signals, static negative-sample constraints failing to prevent diversity collapse, and group-normalized rewards leading to low-resolution training signals.

Method: SAGE introduces: 1) sequence-level signal alignment via geometric-mean importance ratio, 2) decoupled multi-objective advantage estimator to reduce token-level variance, and 3) asymmetric adaptive bounding with positive Boost updates for successful slates and entropy-aware penalties for low-diversity failures.

Result: Experiments on Amazon Product Reviews and RecIF-Bench show consistent improvements in top-K accuracy, cold-start recall, and diversity across both Semantic-ID and native-text action spaces, while preserving numerical stability during training.

Conclusion: Asymmetric, sequence-aware policy optimization provides a principled and effective framework for addressing optimization failures in generative recommendation systems.

Abstract: Reinforcement learning-based preference optimization is increasingly used to align list-wise generative recommenders with complex, multi-objective user feedback, yet existing optimizers such as Gradient-Bounded Policy Optimization (GBPO) exhibit structural limitations in recommendation settings. We identify a Symmetric Conservatism failure mode in which symmetric update bounds suppress learning from rare positive signals (e.g., cold-start items), static negative-sample constraints fail to prevent diversity collapse under rejection-dominated feedback, and group-normalized multi-objective rewards lead to low-resolution training signals. To address these issues, we propose SAGE (Sequence-level Adaptive Gradient Evolution), a unified optimizer designed for list-wise generative recommendation. SAGE introduces sequence-level signal alignment via a geometric-mean importance ratio and a decoupled multi-objective advantage estimator to reduce token-level variance and mitigate reward collapse, together with asymmetric adaptive bounding that applies positive Boost updates to successful slates and an entropy-aware penalty to discourage low-diversity failures. Experiments on Amazon Product Reviews and the large-scale RecIF-Bench demonstrate consistent improvements in top-K accuracy, cold-start recall, and diversity across both Semantic-ID and native-text action spaces, while preserving numerical stability during training. These results suggest that asymmetric, sequence-aware policy optimization provides a principled and effective framework for addressing optimization failures in generative recommendation.
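
The geometric-mean importance ratio is the one ingredient that is fully generic; how SAGE combines it with decoupled advantages and asymmetric bounds is specific to the paper. A minimal sketch:

```python
import numpy as np

def geometric_mean_ratio(logp_new, logp_old):
    """Sequence-level importance ratio as the geometric mean of per-token
    ratios: exp(mean(log pi_new - log pi_old)). Unlike the product of token
    ratios, it does not shrink or explode with sequence length."""
    return float(np.exp(np.mean(logp_new - logp_old)))

logp_old = np.log(np.array([0.5, 0.25, 0.125]))  # token probs, old policy
logp_new = np.log(np.array([0.5, 0.5, 0.25]))    # token probs, new policy
r = geometric_mean_ratio(logp_new, logp_old)     # (1 * 2 * 2) ** (1/3)
```

Averaging in log space also damps the token-level variance that the paper identifies as a source of low-resolution training signals.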

[355] GAGA: Gaussianity-Aware Gaussian Approximation for Efficient 3D Molecular Generation

Jingxiang Qu, Wenhan Gao, Ruichen Xu, Yi Liu

Main category: cs.LG

TL;DR: GAGA accelerates Gaussian Probability Path Generative Models by identifying when molecular data reaches sufficient Gaussianity during forward diffusion, allowing replacement of later trajectory steps with closed-form Gaussian approximation while preserving full-resolution learning.

DetailsMotivation: GPPGMs achieve state-of-the-art 3D molecular generation but suffer from high computational costs due to long generative trajectories requiring hundreds to thousands of steps during training and sampling, hindering practical deployment.

Method: GAGA identifies a characteristic step where molecular data attains sufficient Gaussianity during the forward diffusion process, after which the trajectory can be replaced by a closed-form Gaussian approximation, preserving full-resolution learning dynamics while avoiding redundant transport.

Result: Experiments on 3D molecular generation benchmarks show GAGA achieves substantial improvements in both generation quality and computational efficiency compared to existing methods.

Conclusion: GAGA provides a principled method to accelerate GPPGMs without sacrificing training granularity or inference fidelity, enabling more efficient 3D molecular generation while maintaining state-of-the-art performance.

Abstract: Gaussian Probability Path based Generative Models (GPPGMs) generate data by reversing a stochastic process that progressively corrupts samples with Gaussian noise. Despite state-of-the-art results in 3D molecular generation, their deployment is hindered by the high cost of long generative trajectories, often requiring hundreds to thousands of steps during training and sampling. In this work, we propose a principled method, named GAGA, to improve generation efficiency without sacrificing training granularity or inference fidelity of GPPGMs. Our key insight is that different data modalities obtain sufficient Gaussianity at markedly different steps during the forward process. Based on this observation, we analytically identify a characteristic step at which molecular data attains sufficient Gaussianity, after which the trajectory can be replaced by a closed-form Gaussian approximation. Unlike existing accelerators that coarsen or reformulate trajectories, our approach preserves full-resolution learning dynamics while avoiding redundant transport through truncated distributional states. Experiments on 3D molecular generation benchmarks demonstrate that our GAGA achieves substantial improvements in both generation quality and computational efficiency.
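
GAGA's contribution is the analysis of where Gaussianity becomes sufficient per modality; the closed-form jump it then exploits is the standard forward marginal of a Gaussian diffusion, which can be sampled in one shot:

```python
import numpy as np

def diffuse_to(x0, t, betas, rng):
    """Closed-form forward marginal of a Gaussian (DDPM-style) process:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, sampled directly at
    step t instead of simulating steps 1..t."""
    abar = np.cumprod(1.0 - betas)[t]
    eps = rng.normal(size=x0.shape)
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps, abar

betas = np.linspace(1e-4, 0.02, 1000)   # a standard linear noise schedule
rng = np.random.default_rng(0)
x0 = np.ones(5)
x_t, abar = diffuse_to(x0, 999, betas, rng)   # abar ~ 0: x_t is near-Gaussian
```

Once abar is negligible past the characteristic step, everything after it can be replaced by pure Gaussian sampling, which is the redundancy GAGA removes.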

[356] Instruction-based Time Series Editing

Jiaxing Qiu, Dongliang Guo, Brynne Sullivan, Teague R. Henry, Thomas Hartvigsen

Main category: cs.LG

TL;DR: InstructTime: A novel instruction-based time series editing framework that uses natural language instructions to edit time series with controllable strength, overcoming limitations of rigid attribute-based diffusion models.

Motivation: Existing time series editing methods rely on rigid, predefined attribute vectors and produce all-or-nothing edits through sampling, limiting flexibility in condition format and lacking customizable control over editing strength.

Method: Introduces instruction-based time series editing where users specify edits via natural language. InstructTime embeds time series and instructions into a shared multimodal representation space, then decodes to generate edited time series. Uses multi-resolution encoders to handle local and global edits, and enables controllable editing strength through interpolation in the learned representation space.

Result: InstructTime achieves state-of-the-art performance as a time series editor: produces high-quality edits with controllable strength, generalizes to unseen instructions, and can be adapted to unseen conditions through few-shot learning on both synthetic and real datasets.

Conclusion: Instruction-based time series editing with InstructTime provides a flexible, accessible approach that overcomes limitations of traditional attribute-based methods, enabling more natural and controllable time series modifications.

Abstract: In time series editing, we aim to modify some properties of a given time series without altering others. For example, when analyzing a hospital patient’s blood pressure, we may add a sudden early drop and observe how it impacts their future while preserving other conditions. Existing diffusion-based editors rely on rigid, predefined attribute vectors as conditions and produce all-or-nothing edits through sampling. This attribute- and sampling-based approach limits flexibility in condition format and lacks customizable control over editing strength. To overcome these limitations, we introduce Instruction-based Time Series Editing, where users specify intended edits using natural language. This allows users to express a wider range of edits in a more accessible format. We then introduce InstructTime, the first instruction-based time series editor. InstructTime takes in time series and instructions, embeds them into a shared multi-modal representation space, then decodes their embeddings to generate edited time series. By learning a structured multi-modal representation space, we can easily interpolate between embeddings to achieve varying degrees of edit. To handle local and global edits together, we propose multi-resolution encoders. In our experiments, we use synthetic and real datasets and find that InstructTime is a state-of-the-art time series editor: InstructTime achieves high-quality edits with controllable strength, can generalize to unseen instructions, and can be easily adapted to unseen conditions through few-shot learning.
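The controllable-strength mechanism, interpolating between embeddings before decoding, can be illustrated with stand-in linear encoder and decoder maps. The paper uses multi-resolution transformer encoders over a learned multimodal space; everything below is a toy assumption kept linear so the effect is easy to see.

```python
import numpy as np

# Toy of embedding-space interpolation for edit strength: encode the input,
# move partway toward the instruction-edited embedding, then decode.

rng = np.random.default_rng(0)
D, L = 8, 16                                  # embedding dim, series length
enc = rng.normal(size=(D, L)) / np.sqrt(L)    # mock series encoder
dec = np.linalg.pinv(enc)                     # mock decoder

x = np.sin(np.linspace(0, 2 * np.pi, L))      # original series
z_src = enc @ x                               # embedding of the input
z_edit = z_src + rng.normal(size=D) * 0.5     # embedding after an instruction

def edit(strength):
    """strength in [0, 1]: 0 returns the input, 1 applies the full edit."""
    z = (1 - strength) * z_src + strength * z_edit
    return dec @ z

weak, strong = edit(0.2), edit(1.0)           # partial vs. full edit
```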

[357] Pareto-Conditioned Diffusion Models for Offline Multi-Objective Optimization

Jatan Shrestha, Santeri Heiskanen, Kari Hepola, Severi Rissanen, Pekka Jääskeläinen, Joni Pajarinen

Main category: cs.LG

TL;DR: Pareto-Conditioned Diffusion (PCD) formulates offline multi-objective optimization as a conditional sampling problem using diffusion models, avoiding explicit surrogate models and enabling exploration beyond training data.

Motivation: Offline multi-objective optimization faces challenges in generalizing beyond observed data in static datasets, requiring methods that can explore novel trade-offs not present in training data.

Method: PCD uses diffusion models conditioned directly on desired trade-offs, with reweighting to focus on high-performing samples and a reference-direction mechanism to guide sampling towards novel regions beyond training data.

Result: Experiments on standard offline MOO benchmarks show PCD achieves highly competitive performance and demonstrates greater consistency across diverse tasks than existing offline MOO approaches.

Conclusion: PCD provides an effective framework for offline MOO by formulating it as a conditional sampling problem, enabling better generalization and exploration of Pareto fronts beyond training data.

Abstract: Multi-objective optimization (MOO) arises in many real-world applications where trade-offs between competing objectives must be carefully balanced. In the offline setting, where only a static dataset is available, the main challenge is generalizing beyond observed data. We introduce Pareto-Conditioned Diffusion (PCD), a novel framework that formulates offline MOO as a conditional sampling problem. By conditioning directly on desired trade-offs, PCD avoids the need for explicit surrogate models. To effectively explore the Pareto front, PCD employs a reweighting strategy that focuses on high-performing samples and a reference-direction mechanism to guide sampling towards novel, promising regions beyond the training data. Experiments on standard offline MOO benchmarks show that PCD achieves highly competitive performance and, importantly, demonstrates greater consistency across diverse tasks than existing offline MOO approaches.
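PCD's two conditioning ingredients can be sketched on synthetic objective data: reweight the offline dataset toward high performers, and build a condition slightly beyond the observed front along a desired reference direction. The softmax temperature, scoring rule, and 10% extrapolation are illustrative choices, not the paper's formulas.

```python
import numpy as np

# Sketch of (1) performance reweighting and (2) reference-direction
# conditioning beyond the training data, for a 2-objective maximization.

rng = np.random.default_rng(1)
Y = rng.uniform(size=(100, 2))        # observed objective values (maximize both)

# (1) softmax reweighting on a scalar quality score (here: sum of objectives)
score = Y.sum(axis=1)
w = np.exp(score / 0.1)
w /= w.sum()                          # sampling weights emphasizing high performers

# (2) condition slightly beyond the best observed point along a unit
# reference direction d (a desired trade-off, e.g. 70/30)
d = np.array([0.7, 0.3])
d = d / np.linalg.norm(d)
t_max = (Y @ d).max()                 # furthest observed projection along d
cond = (t_max * 1.1) * d              # target trade-off outside the data
```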

[358] Bridging Generalization Gap of Heterogeneous Federated Clients Using Generative Models

Ziru Niu, Hai Dong, A. K. Qin

Main category: cs.LG

TL;DR: A model-heterogeneous federated learning framework that shares feature distribution statistics instead of model parameters, uses variational transposed CNNs to generate synthetic data from these distributions, and improves generalization through fine-tuning with synthetic data.

Motivation: Traditional FL struggles with data heterogeneity and assumes homogeneous model architectures. Real-world scenarios often involve clients with heterogeneous models, requiring new approaches that don't rely on parameter aggregation.

Method: Clients share feature distribution statistics (mean/covariance) instead of model parameters. Each client trains a variational transposed CNN using Gaussian latent variables from these distributions to generate synthetic data. Local models are then fine-tuned with this synthetic data.

Result: The approach achieves higher generalization accuracy than existing model-heterogeneous FL frameworks while reducing communication costs and memory consumption.

Conclusion: The proposed framework effectively addresses model heterogeneity in FL by sharing feature statistics instead of parameters, enabling better generalization with reduced resource requirements.

Abstract: Federated Learning (FL) is a privacy-preserving machine learning framework facilitating collaborative training across distributed clients. However, its performance is often compromised by data heterogeneity among participants, which can result in local models with limited generalization capability. Traditional model-homogeneous approaches address this issue primarily by regularizing local training procedures or dynamically adjusting client weights during aggregation. Nevertheless, these methods become unsuitable in scenarios involving clients with heterogeneous model architectures. In this paper, we propose a model-heterogeneous FL framework that enhances clients’ generalization performance on unseen data without relying on parameter aggregation. Instead of model parameters, clients share feature distribution statistics (mean and covariance) with the server. Each client then trains a variational transposed convolutional neural network using Gaussian latent variables sampled from these distributions, and uses it to generate synthetic data. By fine-tuning local models with the synthetic data, clients achieve a significant improvement in generalization ability. Experimental results demonstrate that our approach not only attains higher generalization accuracy compared to existing model-heterogeneous FL frameworks, but also reduces communication costs and memory consumption.
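The statistics-sharing step is simple enough to sketch: a client uploads the mean and covariance of its local features, and anyone holding those statistics can draw Gaussian latents to drive a generator. The variational transposed CNN itself is omitted; the feature data below is mock.

```python
import numpy as np

# Sketch of sharing feature statistics instead of model parameters, then
# sampling Gaussian latents from them (the paper feeds these to a
# variational transposed CNN to synthesize data).

rng = np.random.default_rng(0)

def client_stats(features):
    """What a client uploads: mean and covariance of its local features."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    return mu, cov

def sample_latents(mu, cov, n, rng):
    """Gaussian latents drawn from the shared statistics."""
    return rng.multivariate_normal(mu, cov, size=n)

feats = rng.normal(loc=2.0, scale=0.5, size=(500, 4))   # mock local features
mu, cov = client_stats(feats)
z = sample_latents(mu, cov, 1000, rng)                  # generator inputs
```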

[359] Deep Time-Series Models Meet Volatility: Multi-Horizon Electricity Price Forecasting in the Australian National Electricity Market

Mohammed Osman Gani, Zhipeng He, Chun Ouyang, Sara Khalifa

Main category: cs.LG

TL;DR: This paper evaluates state-of-the-art deep time-series models for electricity price forecasting in volatile markets, finding they often fail to outperform standard deep learning baselines and are vulnerable to extreme market conditions.

Motivation: Electricity price forecasting is challenging in volatile markets with price spikes and structural shifts. While deep learning has been adopted for EPF, the effectiveness of recent state-of-the-art deep time-series models in highly volatile electricity markets remains underexplored, and existing studies rarely assess how model accuracy varies across intraday periods.

Method: Proposes an EPF framework that systematically evaluates SOTA deep time-series models using direct multi-horizon forecasting across day-ahead and two-day-ahead settings. Conducts comprehensive empirical study across all five regions of the Australian National Electricity Market using contemporary, high-volatility data.

Result: Reveals a gap between time-series benchmark expectations and observed performance under real-world price volatility: recent deep time-series models often fail to surpass standard DL baselines. All models degrade under extreme/negative prices, with DL baselines remaining competitive. Intraday analysis shows all models are vulnerable to market conditions - absolute errors peak during evening ramps, relative errors escalate during midday negative-price periods, and directional accuracy deteriorates during abrupt price shifts.

Conclusion: Findings emphasize the need for volatility-aware modeling strategies and richer feature representations to advance electricity price forecasting, as current SOTA models struggle with real-world market volatility and intraday variations.

Abstract: Accurate electricity price forecasting (EPF) is increasingly difficult in markets characterised by extreme volatility, frequent price spikes, and rapid structural shifts. Deep learning (DL) has been increasingly adopted in EPF due to its ability to achieve high forecasting accuracy. Recently, state-of-the-art (SOTA) deep time-series models have demonstrated promising performance across general forecasting tasks. Yet, their effectiveness in highly volatile electricity markets remains underexplored. Moreover, existing EPF studies rarely assess how model accuracy varies across intraday periods, leaving model sensitivity to market conditions unexplored. To address these gaps, this paper proposes an EPF framework that systematically evaluates SOTA deep time-series models using a direct multi-horizon forecasting approach across day-ahead and two-day-ahead settings. We conduct a comprehensive empirical study across all five regions of the Australian National Electricity Market using contemporary, high-volatility data. The results reveal a clear gap between time-series benchmark expectations and observed performance under real-world price volatility: recent deep time-series models often fail to surpass standard DL baselines. All models experience substantial degradation under extreme and negative prices, yet DL baselines often remain competitive. Intraday performance analysis further reveals that all evaluated models are consistently vulnerable to prevailing market conditions, where absolute errors peak during evening ramps, relative errors escalate during midday negative-price periods, and directional accuracy deteriorates sharply during abrupt shifts in price direction. These findings emphasise the need for volatility-aware modelling strategies and richer feature representations to advance EPF.
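The direct multi-horizon setup the framework evaluates means one model emits all H future prices at once from a lookback window, rather than recursing one step at a time. A minimal windowing sketch, with illustrative window sizes rather than the paper's exact configuration:

```python
import numpy as np

# Build (lookback window, H-step-ahead target) pairs for direct
# multi-horizon forecasting: every target row holds the next H values.

def make_windows(series, lookback, horizon):
    """Return (inputs, targets); each target holds the next `horizon` steps."""
    X, Y = [], []
    for i in range(len(series) - lookback - horizon + 1):
        X.append(series[i : i + lookback])
        Y.append(series[i + lookback : i + lookback + horizon])
    return np.array(X), np.array(Y)

prices = np.arange(100.0)   # stand-in for a NEM regional price series
X, Y = make_windows(prices, lookback=48, horizon=48)   # e.g. day-ahead at 30-min steps
```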

[360] Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation

Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C. S. Lui, Wei Chen, Carlee Joe-Wong

Main category: cs.LG

TL;DR: A learning-based framework for semantic cache eviction in LLMs that addresses uncertainty in query distributions and costs, with provable efficiency guarantees.

Motivation: LLMs have high inference costs that challenge scalability, and traditional caching methods fail to account for semantic similarity between queries, leading to unnecessary recomputation. Existing semantic caching approaches lack theoretical foundations and cannot adapt to real-world uncertainty in query distributions and serving costs.

Method: Proposes a principled learning-based framework for semantic cache eviction under unknown query and cost distributions. Formulates both offline optimization and online learning variants, and develops provably efficient algorithms with state-of-the-art theoretical guarantees.

Result: The framework was evaluated on a synthetic dataset, showing that the proposed algorithms achieve performance matching or exceeding baseline methods.

Conclusion: The paper presents a theoretically grounded approach to semantic caching for LLMs that addresses key challenges of uncertainty and adaptation, offering improved efficiency for scalable LLM deployment.

Abstract: Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved without another forward pass through the LLM, has emerged as one possible solution. Traditional exact-match caching, however, overlooks the semantic similarity between queries, leading to unnecessary recomputation. Semantic caching addresses this by retrieving responses based on semantic similarity, but introduces a fundamentally different cache eviction problem: one must account for mismatch costs between incoming queries and cached responses. Moreover, key system parameters, such as query arrival probabilities and serving costs, are often unknown and must be learned over time. Existing semantic caching methods are largely ad-hoc, lacking theoretical foundations and unable to adapt to real-world uncertainty. In this paper, we present a principled, learning-based framework for semantic cache eviction under unknown query and cost distributions. We formulate both offline optimization and online learning variants of the problem, and develop provably efficient algorithms with state-of-the-art guarantees. We also evaluate our framework on a synthetic dataset, showing that our proposed algorithms achieve matching or superior performance compared with baselines.
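The core semantic-cache operation, serving a cached response when an incoming query's embedding is close enough to a cached one, can be sketched as below. The cosine-similarity threshold is exactly the knob that trades recomputation cost against mismatch cost; the paper's contribution is learning that trade-off and the eviction policy online, which this toy omits.

```python
import numpy as np

# Toy semantic cache lookup: return a cached response if the nearest cached
# embedding is similar enough to the query, otherwise signal a recompute.

def lookup(cache, q, threshold=0.9):
    """cache: list of (embedding, response). Returns a response or None."""
    best, best_sim = None, -1.0
    for emb, resp in cache:
        sim = float(q @ emb / (np.linalg.norm(q) * np.linalg.norm(emb)))
        if sim > best_sim:
            best, best_sim = resp, sim
    return best if best_sim >= threshold else None

e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
cache = [(e1, "answer about pricing"), (e2, "answer about refunds")]

hit = lookup(cache, np.array([0.95, 0.1]))    # near e1: serve from cache
miss = lookup(cache, np.array([0.7, 0.7]))    # ambiguous: recompute instead
```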

[361] Dispelling the Curse of Singularities in Neural Network Optimizations

Hengjie Cao, Mengyi Chen, Yifeng Yang, Fang Dong, Ruijun Huang, Anrui Chen, Jixian Zhou, Mingzhi Dong, Yujiang Wang, Dongsheng Li, Wenyi Fang, Yuanyi Lin, Fan Wu, Li Shang

Main category: cs.LG

TL;DR: The paper investigates optimization instability in deep neural networks through the lens of parametric singularities, showing how they grow during training and cause instability, then proposes Parametric Singularity Smoothing (PSS) to mitigate these issues.

Motivation: To understand optimization instability in deep neural networks from a novel perspective - the emergence and amplification of singularities in parametric space, which leads to training instability and loss explosions.

Method: Analyzes how parametric singularities grow with gradient updates and intensify alignment with representations, then proposes Parametric Singularity Smoothing (PSS) - a lightweight method for smoothing singular spectra of weight matrices to prevent instability.

Result: Extensive experiments across diverse datasets, architectures, and optimizers show PSS mitigates instability, restores trainability after failure, and improves both training efficiency and generalization.

Conclusion: Parametric singularities are a fundamental cause of optimization instability in deep networks, and smoothing these singularities via PSS provides an effective solution for stable training and improved generalization.

Abstract: This work investigates the optimization instability of deep neural networks from a less-explored yet insightful perspective: the emergence and amplification of singularities in the parametric space. Our analysis reveals that parametric singularities inevitably grow with gradient updates and further intensify alignment with representations, leading to increased singularities in the representation space. We show that the gradient Frobenius norms are bounded by the top singular values of the weight matrices, and as training progresses, the mutually reinforcing growth of weight and representation singularities, termed the curse of singularities, relaxes these bounds, escalating the risk of sharp loss explosions. To counter this, we propose Parametric Singularity Smoothing (PSS), a lightweight, flexible, and effective method for smoothing the singular spectra of weight matrices. Extensive experiments across diverse datasets, architectures, and optimizers demonstrate that PSS mitigates instability, restores trainability even after failure, and improves both training efficiency and generalization.

[362] Self-Supervised Temporal Super-Resolution of Energy Data using Generative Adversarial Transformer

Xuanhao Mu, Gökhan Demirel, Yuzhe Zhang, Jianlei Liu, Thorsten Schlachter, Veit Hagenmeyer

Main category: cs.LG

TL;DR: A novel Generative Adversarial Transformer (GAT) method for time series upsampling that can be trained without ground-truth high-resolution data, reducing RMSE by 10% compared to conventional interpolation.

Motivation: To address the temporal granularity gap in energy system modeling by developing an upsampling method that overcomes limitations of conventional methods (information loss/noise) and advanced models (supervised learning requirements, application paradox where high-resolution ground truth is unavailable).

Method: Proposes Generative Adversarial Transformers (GATs) that can be trained without access to ground-truth high-resolution data, using an adversarial training approach with transformers to generate high-resolution time series from low-resolution inputs.

Result: The method reduces root mean square error (RMSE) of upsampling tasks by 10% compared to conventional interpolation methods, and improves accuracy in model predictive control (MPC) applications by 13%.

Conclusion: The GAT approach successfully addresses the fundamental application paradox in time series upsampling by enabling training without ground-truth high-resolution data while achieving significant accuracy improvements over conventional methods.

Abstract: To bridge the temporal granularity gap in energy network design and operation based on Energy System Models, resampling of time series is required. While conventional upsampling methods are computationally efficient, they often result in significant information loss or increased noise. Advanced models such as time series generation models, Super-Resolution models and imputation models show potential, but also face fundamental challenges. The goal of time series generative models is to learn the distribution of the original data to generate high-resolution series with similar statistical characteristics. This is not entirely consistent with the definition of upsampling. Time series Super-Resolution models or imputation models can degrade the accuracy of upsampling because the input low-resolution time series are sparse and may have insufficient context. Moreover, such models usually rely on supervised learning paradigms. This presents a fundamental application paradox: their training requires the high-resolution time series that is intrinsically absent in upsampling application scenarios. To address the mentioned upsampling issue, this paper introduces a new method utilizing Generative Adversarial Transformers (GATs), which can be trained without access to any ground-truth high-resolution data. Compared with conventional interpolation methods, the introduced method can reduce the root mean square error (RMSE) of upsampling tasks by 10%, and the accuracy of a model predictive control (MPC) application scenario is improved by 13%.
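What makes ground-truth-free training possible is a consistency constraint: whatever high-resolution series the generator proposes, averaging it back down must reproduce the observed low-resolution input. A minimal sketch of that training signal (the adversarial transformer itself is omitted):

```python
import numpy as np

# Downsampling-consistency loss: needs only the low-resolution input,
# never a high-resolution ground truth.

def downsample(hr, factor):
    """Average non-overlapping blocks of `factor` high-res samples."""
    return hr.reshape(-1, factor).mean(axis=1)

def consistency_loss(lr, hr, factor):
    """MSE between the observed low-res series and the downsampled candidate."""
    return float(np.mean((downsample(hr, factor) - lr) ** 2))

lr = np.array([1.0, 3.0, 2.0])       # hourly energy readings
hr_good = np.repeat(lr, 4)           # a 15-minute candidate consistent with lr
hr_bad = hr_good + 0.5               # a candidate that violates consistency
```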

[363] Finite-Width Neural Tangent Kernels from Feynman Diagrams

Max Guillen, Philipp Misof, Jan E. Gerken

Main category: cs.LG

TL;DR: Feynman diagram framework for computing finite-width corrections to neural tangent kernel statistics, enabling analysis of training dynamics beyond infinite-width limits.

Motivation: Neural tangent kernels (NTKs) provide analytic control over training dynamics in infinite-width limit, but miss important properties like NTK evolution and feature learning that occur at finite widths. Need systematic approach to compute finite-width corrections.

Method: Introduce Feynman diagrams for computing finite-width corrections to NTK statistics. This simplifies algebraic manipulations and enables computation of layer-wise recursion relations for statistics involving preactivations, NTKs, and higher-derivative tensors (dNTK and ddNTK).

Result: Framework enables extension of stability results from preactivations to NTKs, proves absence of finite-width corrections for scale-invariant nonlinearities like ReLU on NTK Gram matrix diagonal. Numerical implementation shows results match sampled neural network statistics for widths n≳20.

Conclusion: Feynman diagram framework provides systematic method to compute finite-width corrections to NTK statistics, bridging gap between infinite-width theory and practical finite-width neural networks.

Abstract: Neural tangent kernels (NTKs) are a powerful tool for analyzing deep, non-linear neural networks. In the infinite-width limit, NTKs can easily be computed for most common architectures, yielding full analytic control over the training dynamics. However, at infinite width, important properties of training such as NTK evolution or feature learning are absent. Nevertheless, finite width effects can be included by computing corrections to the Gaussian statistics at infinite width. We introduce Feynman diagrams for computing finite-width corrections to NTK statistics. These dramatically simplify the necessary algebraic manipulations and enable the computation of layer-wise recursion relations for arbitrary statistics involving preactivations, NTKs and certain higher-derivative tensors (dNTK and ddNTK) required to predict the training dynamics at leading order. We demonstrate the feasibility of our framework by extending stability results for deep networks from preactivations to NTKs and proving the absence of finite-width corrections for scale-invariant nonlinearities such as ReLU on the diagonal of the Gram matrix of the NTK. We numerically implement the complete set of equations necessary to compute the first-order corrections for arbitrary inputs and demonstrate that the results follow the statistics of sampled neural networks for widths $n\gtrsim 20$.
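An empirical companion to the finite-width picture: the NTK of a one-hidden-layer ReLU network, computed from explicit parameter gradients under NTK parameterization, fluctuates across random initializations, and the fluctuation shrinks with width. This is the effect the paper's first-order corrections quantify (matching sampled networks for widths around 20 and above); the sketch is ours, not the diagrammatic machinery.

```python
import numpy as np

# Empirical NTK of f(x) = a^T relu(W x) / sqrt(n), summing the gradient
# inner products for both parameter groups a and W.

def ntk(x, xp, W, a):
    n = len(a)
    h, hp = W @ x, W @ xp
    g, gp = np.maximum(h, 0), np.maximum(hp, 0)
    k_a = g @ gp / n                                    # d f / d a contribution
    k_w = (a * (h > 0) * a * (hp > 0)).sum() * (x @ xp) / n   # d f / d W contribution
    return k_a + k_w

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5])
d = 2

def ntk_std(n, trials=200):
    """Spread of the NTK over random initializations at width n."""
    vals = [ntk(x, x, rng.normal(size=(n, d)), rng.normal(size=n))
            for _ in range(trials)]
    return float(np.std(vals))

narrow, wide = ntk_std(8), ntk_std(128)   # fluctuations shrink with width
```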

[364] AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models

Fengpeng Li, Kemou Li, Qizhou Wang, Bo Han, Jiantao Zhou

Main category: cs.LG

TL;DR: AEGIS framework improves concept erasure in diffusion models by simultaneously enhancing robustness against reactivation and retention of unrelated concepts without requiring retention data

Motivation: Current concept erasure methods for diffusion models face a trade-off between robustness (resisting reactivation of erased concepts) and retention (preserving unrelated concepts), with existing approaches typically improving one at the expense of the other

Method: Adversarial Erasure with Gradient Informed Synergy (AEGIS) - a retention-data-free framework that advances both robustness and retention through gradient-informed synergy

Result: AEGIS advances both robustness and retention simultaneously, overcoming the traditional trade-off in concept erasure methods

Conclusion: AEGIS provides an effective framework for concept erasure in diffusion models that addresses both robustness and retention challenges without requiring retention data

Abstract: Concept erasure helps stop diffusion models (DMs) from generating harmful content, but current methods face a robustness-retention trade-off. Robustness means the model fine-tuned by concept erasure methods resists reactivation of erased concepts, even under semantically related prompts. Retention means unrelated concepts are preserved so the model’s overall utility stays intact. Both are critical for concept erasure in practice, yet addressing them simultaneously is challenging: prior work typically strengthens one while degrading the other, e.g., mapping a single erased prompt to a fixed safe target leaves class-level remnants exploitable by prompt attacks, whereas retention-oriented schemes underperform against adaptive adversaries. This paper introduces Adversarial Erasure with Gradient Informed Synergy (AEGIS), a retention-data-free framework that advances both robustness and retention.

[365] MissionHD: Hyperdimensional Refinement of Distribution-Deficient Reasoning Graphs for Video Anomaly Detection

Sanggeon Yun, Raheeb Hassan, Ryozo Masukawa, Nathaniel D. Bastian, Mohsen Imani

Main category: cs.LG

TL;DR: HDC-GSR refines LLM-generated reasoning graphs for video anomaly detection using hyperdimensional computing without distribution modeling, achieving better performance.

Motivation: LLM-generated reasoning graphs (MSGs) for video anomaly detection are typically treated as fixed despite being generic and distribution-deficient. Conventional graph structure refinement methods fail because they rely on learning structural distributions that don't exist in LLM-generated graphs.

Method: Proposes HDC-constrained Graph Structure Refinement (HDC-GSR) that directly optimizes a decodable, task-aligned graph representation in a single hyperdimensional space without distribution modeling. Uses Hyperdimensional Computing (HDC) to encode graphs via binding and bundling operations, aligns the resulting graph code with downstream loss, and decodes edge contributions to refine the structure. Instantiated as MissionHD for weakly supervised VAD/VAR.

Result: Demonstrates consistent performance gains on benchmark datasets for weakly supervised video anomaly detection and recognition.

Conclusion: HDC-GSR provides a new paradigm for refining LLM-generated reasoning graphs that doesn’t require distribution modeling, making it suitable for distribution-deficient graphs and improving video analysis tasks.

Abstract: LLM-generated reasoning graphs, referred to as mission-specific graphs (MSGs), are increasingly used for video anomaly detection (VAD) and recognition (VAR). However, they are typically treated as fixed despite being generic and distribution-deficient. Conventional graph structure refinement (GSR) methods are ill-suited to this setting, as they rely on learning structural distributions that are absent in LLM-generated graphs. We propose HDC-constrained Graph Structure Refinement (HDC-GSR), a new paradigm that directly optimizes a decodable, task-aligned graph representation in a single hyperdimensional space without distribution modeling. Leveraging Hyperdimensional Computing (HDC), our framework encodes graphs via binding and bundling operations, aligns the resulting graph code with downstream loss, and decodes edge contributions to refine the structure. We instantiate this approach as MissionHD for weakly supervised VAD/VAR and demonstrate consistent performance gains on benchmark datasets.
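The binding/bundling encoding can be sketched with standard bipolar hypervectors: an edge is the bind (elementwise product) of its endpoint vectors, the graph code is the bundle (signed sum) of its edges, and an edge's contribution can be decoded back out by correlation. Dimensions and operators follow common HDC practice, not necessarily MissionHD's exact design.

```python
import numpy as np

# Toy hyperdimensional graph code: bind edge endpoints, bundle edges,
# then decode per-edge contributions from the single graph hypervector.

D = 4096
rng = np.random.default_rng(0)

def node_hv():
    """Random bipolar hypervector for a node."""
    return rng.choice([-1, 1], size=D)

def encode_graph(nodes, edges):
    acc = np.zeros(D)
    for u, v in edges:
        acc += nodes[u] * nodes[v]     # bind endpoints, bundle into the code
    return np.sign(acc + 1e-9)         # re-binarize (ties broken toward +1)

def edge_score(g, nodes, u, v):
    """Decode: correlation of the code with a candidate edge's hypervector."""
    return float(g @ (nodes[u] * nodes[v])) / D

nodes = {k: node_hv() for k in "abc"}
g = encode_graph(nodes, [("a", "b"), ("b", "c")])

present = edge_score(g, nodes, "a", "b")   # edge in the graph: high score
absent = edge_score(g, nodes, "a", "c")    # edge not in the graph: low score
```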

[366] Fourier Learning Machines: Nonharmonic Fourier-Based Neural Networks for Scientific Machine Learning

Mominul Rubel, Adam Meyers, Gabriel Nicolosi

Main category: cs.LG

TL;DR: FLM is a neural network architecture using cosine activations to learn Fourier series parameters, enabling representation of multidimensional nonharmonic Fourier series with separable basis functions.

Motivation: To create a neural network architecture that can represent multidimensional Fourier series with complete separable basis functions, overcoming limitations of previous Fourier-inspired models and providing better function approximation for scientific computing problems.

Method: Uses feedforward network with cosine activation functions where frequencies, amplitudes, and phase shifts are trainable parameters, creating a problem-specific spectral basis adaptable to both periodic and nonperiodic functions.

Result: FLM achieves comparable or superior performance to established architectures like SIREN and vanilla feedforward NNs on benchmark PDEs and optimal control problems, with demonstrated one-to-one correspondence between Fourier coefficients and network parameters.

Conclusion: FLM provides an effective neural network architecture for representing multidimensional Fourier series with separable basis functions, offering strong performance on scientific computing problems while maintaining interpretability through Fourier coefficient correspondence.

Abstract: We introduce the Fourier Learning Machine (FLM), a neural network (NN) architecture designed to represent a multidimensional nonharmonic Fourier series. The FLM uses a simple feedforward structure with cosine activation functions to learn the frequencies, amplitudes, and phase shifts of the series as trainable parameters. This design allows the model to create a problem-specific spectral basis adaptable to both periodic and nonperiodic functions. Unlike previous Fourier-inspired NN models, the FLM is the first architecture able to represent a multidimensional Fourier series with a complete set of basis functions in separable form, doing so by using a standard Multilayer Perceptron-like architecture. A one-to-one correspondence between the Fourier coefficients and the learned amplitudes and phase shifts is demonstrated, allowing translation between the full separable-basis form and the cosine phase-shifted one. Additionally, we evaluate the performance of FLMs on several scientific computing problems, including benchmark Partial Differential Equations (PDEs) and a family of Optimal Control Problems (OCPs). Computational experiments show that the performance of FLMs is comparable, and often superior, to that of established architectures like SIREN and vanilla feedforward NNs.
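The parameterization is easy to see in one dimension: the model is f(x) = sum over k of A_k cos(w_k x + phi_k), with frequencies, amplitudes, and phases all trainable. Below, a single-term toy is fit to a nonharmonic tone by plain gradient descent; the loop and learning rate are illustrative assumptions, not the paper's training setup.

```python
import numpy as np

# One-term FLM toy: fit A cos(w x + phi) to a nonharmonic target by
# gradient descent on A, w, phi simultaneously.

x = np.linspace(0.0, 1.0, 200)
y = 1.5 * np.cos(2.3 * x + 0.4)           # target with nonharmonic frequency

A, w, phi = 1.0, 2.0, 0.0                 # trainable parameters, rough init
mse0 = float(np.mean((A * np.cos(w * x + phi) - y) ** 2))

lr = 0.1
for _ in range(3000):
    s = np.sin(w * x + phi)
    err = A * np.cos(w * x + phi) - y
    # gradients of mean squared error w.r.t. A, w, phi
    gA = 2.0 * np.mean(err * np.cos(w * x + phi))
    gw = 2.0 * np.mean(err * (-A * s) * x)
    gp = 2.0 * np.mean(err * (-A * s))
    A, w, phi = A - lr * gA, w - lr * gw, phi - lr * gp

mse = float(np.mean((A * np.cos(w * x + phi) - y) ** 2))   # should be far below mse0
```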

[367] Learnable Chernoff Baselines for Inference-Time Alignment

Sunil Madhow, Yuchen Liang, Ness Shroff, Yingbin Liang, Yu-Xiang Wang

Main category: cs.LG

TL;DR: Learnable Chernoff Baselines (LCBs) enable efficient inference-time reward-guided alignment for generative models using adaptive rejection sampling with black-box access to pretrained models

Motivation: Existing methods for inference-time reward-guided alignment either require architecture-specific adaptations or are computationally expensive, creating a need for efficient black-box methods that work with any pretrained model

Method: LCBs implement adaptive rejection sampling with learnable acceptance probabilities to sample from exponentially tilted kernels from KL-regularized reward alignment, using only black-box sampling access to pretrained models

Result: LCBs achieve total-variation guarantees to ideal aligned models and demonstrate close matching to ideal rejection sampling with substantially fewer queries to pretrained models in both continuous and discrete diffusion settings

Conclusion: LCBs provide an efficient, black-box method for inference-time reward alignment with fine-grained control over compute scaling, offering practical advantages over existing approaches

Abstract: We study inference-time reward-guided alignment for generative models. Existing methods often rely on either architecture-specific adaptations or computationally costly inference procedures. We introduce Learnable Chernoff Baselines (LCBs) as a method for efficiently and approximately sampling from the exponentially tilted kernels that arise from KL-regularized reward alignment. Using only black-box sampling access to the pretrained model, LCBs implement a form of rejection sampling with adaptively selected acceptance probabilities, which allows fine-grained control over inference-compute scaling. We establish total-variation guarantees to the ideal aligned model, and demonstrate in both continuous and discrete diffusion settings that LCB sampling closely matches ideal rejection sampling while using substantially fewer queries to the pretrained model.
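The rejection-sampling backbone can be sketched directly: to sample from the tilt p(x) proportional to p0(x) exp(r(x)/beta), accept a black-box draw x from p0 with probability exp((r(x) - M)/beta), where M is at least the maximum reward. LCBs learn that baseline adaptively; here it is fixed by hand, and the base model, reward, and beta are illustrative.

```python
import numpy as np

# Rejection sampling from a KL-regularized reward tilt of a black-box
# base sampler (a standard normal stands in for the pretrained model).

rng = np.random.default_rng(0)
beta, M = 0.5, 1.0                        # temperature and Chernoff-style baseline

def r(x):
    """Reward: prefer larger x, clipped so M = 1 upper-bounds it."""
    return float(np.clip(x, -1.0, 1.0))

def tilted_sample(n):
    out = []
    while len(out) < n:
        x = rng.normal()                              # black-box draw from p0
        if rng.uniform() < np.exp((r(x) - M) / beta): # valid prob since r <= M
            out.append(x)
    return np.array(out)

xs = tilted_sample(2000)   # mass shifts toward high-reward (positive) samples
```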

[368] Multipole Semantic Attention: A Fast Approximation of Softmax Attention for Pretraining

Rupert Mitchell, Kristian Kersting

Main category: cs.LG

TL;DR: MuSe accelerates long-context pretraining by clustering queries and keys separately for efficient attention, achieving 36% speedup for 64k context while maintaining quality.

Motivation: Pretraining transformers on long sequences (like code repositories or document collections) is limited by quadratic attention costs, which become prohibitive for contexts up to 64k tokens.

Method: Multipole Semantic Attention (MuSe) clusters queries and keys separately in representation space to create query-specific summaries, enabling efficient attention without architectural changes.

Result: MuSe accelerates 64k-context pretraining by 36% while matching baseline loss, works with existing pretrained models (validated on Llama 3.1-8B and 3.2-1B), and preserves quality and long-context utilization during training.

Conclusion: MuSe provides an effective drop-in solution for efficient long-context pretraining that outperforms spatial blocking approaches and maintains compatibility with existing transformer architectures.

Abstract: Pretraining transformers on long sequences (entire code repositories, collections of related documents) is bottlenecked by quadratic attention costs. We present Multipole Semantic Attention (MuSe), which accelerates 64k-context pretraining by 36% while matching baseline loss, requiring no architectural changes. MuSe clusters queries and keys separately in representation space. This yields query-specific summaries that substantially outperform spatial blocking at matched sparsity, while also enabling drop-in compatibility with existing pretrained models; we validate on Llama 3.1-8B and 3.2-1B without retraining. We pretrain language models up to 1B parameters at 64k context on code and scientific documents, confirming that MuSe preserves quality and long-context utilization during training.
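The clustering idea can be illustrated with a minimal "coarse" attention term: attend each query to k-means centroids of the keys, weighted by cluster population. This is only a sketch of the far-field summary under my own simplifications (MuSe additionally clusters queries and keeps exact local terms; the naive k-means loop here is not its construction):

```python
import numpy as np

def centroid_attention(Q, K, V, n_clusters, seed=0):
    """Coarse attention: replace the N keys by cluster centroids and weight
    each centroid by its cluster size (a far-field summary of the keys)."""
    rng = np.random.default_rng(seed)
    # A few Lloyd's k-means iterations on the keys suffice for a sketch.
    centers = K[rng.choice(len(K), n_clusters, replace=False)]
    for _ in range(10):
        assign = np.argmin(((K[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            if (assign == c).any():
                centers[c] = K[assign == c].mean(0)
    sizes = np.bincount(assign, minlength=n_clusters)
    Vc = np.stack([V[assign == c].mean(0) if sizes[c] else np.zeros(V.shape[1])
                   for c in range(n_clusters)])
    # Softmax over centroids, biased by log cluster population.
    logits = Q @ centers.T + np.log(np.maximum(sizes, 1))
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ Vc

rng = np.random.default_rng(1)
Q = rng.standard_normal((8, 4))
K = rng.standard_normal((32, 4))
V = rng.standard_normal((32, 4))
out = centroid_attention(Q, K, V, n_clusters=4)
```

The cost of the centroid term is O(N·C) rather than O(N²), which is where the speedup at 64k context comes from.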

[369] LLaDA2.1: Speeding Up Text Diffusion via Token Editing

Tiwei Bie, Maosong Cao, Xiang Cao, Bingsen Chen, Fuyuan Chen, Kun Chen, Lun Du, Daozhuo Feng, Haibo Feng, Mingliang Gong, Zhuocheng Gong, Yanmei Gu, Jian Guan, Kaiyuan Guan, Hongliang He, Zenan Huang, Juyong Jiang, Zhonghui Jiang, Zhenzhong Lan, Chengxi Li, Jianguo Li, Zehuan Li, Huabin Liu, Lin Liu, Guoshan Lu, Yuan Lu, Yuxin Ma, Xingyu Mou, Zhenxuan Pan, Kaida Qiu, Yuji Ren, Jianfeng Tan, Yiding Tian, Zian Wang, Lanning Wei, Tao Wu, Yipeng Xing, Wentao Ye, Liangyu Zha, Tianze Zhang, Xiaolu Zhang, Junbo Zhao, Da Zheng, Hao Zhong, Wanli Zhong, Jun Zhou, Junlin Zhou, Liwang Zhu, Muzhi Zhu, Yihong Zhuang

Main category: cs.LG

TL;DR: LLaDA2.1 introduces a joint M2T+T2T decoding scheme with configurable thresholds for speed/quality trade-offs, plus RL alignment for diffusion LLMs, achieving high throughput on coding tasks.

Motivation: Address the trade-off between decoding speed and generation quality in large diffusion language models, aiming to transcend limitations of previous block-diffusion approaches while improving alignment with human intent.

Method: 1) Joint Token-to-Token (T2T) editing integrated with Mask-to-Token (M2T) scheme with configurable thresholds; 2) Two operational modes: Speedy Mode (lower M2T threshold + T2T refinement) and Quality Mode (conservative thresholds); 3) Large-scale RL framework for diffusion LLMs with stable gradient estimation; 4) Expanded context window.

Result: Achieves strong performance across 33 benchmarks with lightning-fast decoding: 892 TPS on HumanEval+, 801 TPS on BigCodeBench, and 663 TPS on LiveCodeBench despite 100B parameters.

Conclusion: LLaDA2.1 successfully balances speed and quality in diffusion LLMs through innovative decoding schemes and RL alignment, enabling practical deployment of large-scale models with exceptional throughput.

Abstract: While LLaDA2.0 showcased the scaling potential of 100B-level block-diffusion models and their inherent parallelization, the delicate equilibrium between decoding speed and generation quality has remained an elusive frontier. Today, we unveil LLaDA2.1, a paradigm shift designed to transcend this trade-off. By seamlessly weaving Token-to-Token (T2T) editing into the conventional Mask-to-Token (M2T) scheme, we introduce a joint, configurable threshold-decoding scheme. This structural innovation gives rise to two distinct personas: the Speedy Mode (S Mode), which audaciously lowers the M2T threshold to bypass traditional constraints while relying on T2T to refine the output; and the Quality Mode (Q Mode), which leans into conservative thresholds to secure superior benchmark performances with manageable efficiency degradation. Furthering this evolution, underpinned by an expansive context window, we implement the first large-scale Reinforcement Learning (RL) framework specifically tailored for dLLMs, anchored by specialized techniques for stable gradient estimation. This alignment not only sharpens reasoning precision but also elevates instruction-following fidelity, bridging the chasm between diffusion dynamics and complex human intent. We culminate this work by releasing LLaDA2.1-Mini (16B) and LLaDA2.1-Flash (100B). Across 33 rigorous benchmarks, LLaDA2.1 delivers strong task performance and lightning-fast decoding speed. Despite its 100B volume, on coding tasks it attains an astounding 892 TPS on HumanEval+, 801 TPS on BigCodeBench, and 663 TPS on LiveCodeBench.
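The joint threshold rule is easy to picture with a toy decoding step over per-position token distributions. This is an illustrative reading of the scheme with made-up thresholds, not LLaDA2.1's exact update:

```python
def joint_threshold_step(probs, tokens, mask_id, tau_m2t, tau_t2t):
    """One joint M2T + T2T step: masked positions are filled when the model's
    top probability clears tau_m2t; already-filled positions are *edited* to
    the new argmax when it disagrees with confidence above tau_t2t."""
    out = list(tokens)
    for i, dist in enumerate(probs):
        top = max(range(len(dist)), key=dist.__getitem__)
        if tokens[i] == mask_id:
            if dist[top] >= tau_m2t:           # Mask-to-Token
                out[i] = top
        elif top != tokens[i] and dist[top] >= tau_t2t:
            out[i] = top                       # Token-to-Token edit
    return out

# Toy vocabulary {0, 1}; position 1 stays masked, position 2 gets edited.
step = joint_threshold_step(
    probs=[[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]],
    tokens=[-1, -1, 0], mask_id=-1, tau_m2t=0.7, tau_t2t=0.7)
```

Lowering tau_m2t fills more positions per step (the Speedy Mode direction) while the T2T clause repairs premature commitments; raising both thresholds recovers the conservative Quality Mode.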

[370] Online reinforcement learning via sparse Gaussian mixture model Q-functions

Minh Vu, Konstantinos Slavakis

Main category: cs.LG

TL;DR: Online reinforcement learning framework using sparse Gaussian mixture model Q-functions with structured parameter updates via Riemannian manifold optimization.

Motivation: To develop an interpretable and structured online RL framework that can leverage streaming data for exploration while controlling model complexity through sparsification, addressing overfitting issues in deep RL methods.

Method: Introduces sparse Gaussian mixture model Q-functions (S-GMM-QFs) with Hadamard overparametrization for sparsification. Uses online policy-iteration framework with Riemannian manifold structure for parameter updates via online gradient descent on smooth objectives.

Result: S-GMM-QFs match performance of dense deep RL methods on standard benchmarks with significantly fewer parameters, and maintain strong performance in low-parameter regimes where sparsified deep RL methods fail.

Conclusion: The structured S-GMM-QF framework provides an interpretable, parameter-efficient alternative to deep RL that generalizes well even with limited parameters through principled sparsification and manifold optimization.

Abstract: This paper introduces a structured and interpretable online policy-iteration framework for reinforcement learning (RL), built around the novel class of sparse Gaussian mixture model Q-functions (S-GMM-QFs). Extending earlier work that trained GMM-QFs offline, the proposed framework develops an online scheme that leverages streaming data to encourage exploration. Model complexity is regulated through sparsification by Hadamard overparametrization, which mitigates overfitting while preserving expressiveness. The parameter space of S-GMM-QFs is naturally endowed with a Riemannian manifold structure, allowing for principled parameter updates via online gradient descent on a smooth objective. Numerical tests show that S-GMM-QFs match the performance of dense deep RL (DeepRL) methods on standard benchmarks while using significantly fewer parameters, and maintain strong performance even in low-parameter-count regimes where sparsified DeepRL methods fail to generalize.
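For intuition, a Q-function represented as a Gaussian mixture with Hadamard-overparametrized weights can be sketched as follows (isotropic components and all names are my simplification, not the paper's exact parametrization):

```python
import numpy as np

def gmm_q(x, means, log_widths, u, v):
    """Sparse GMM Q-function sketch: Q(x) = sum_j w_j * N(x; mu_j, sigma_j^2 I),
    with mixture weights Hadamard-overparametrized as w = u * v."""
    w = u * v                                   # Hadamard overparametrization
    sig2 = np.exp(log_widths)                   # per-component variance
    d = x.shape[-1]
    sq = ((x[None, :] - means) ** 2).sum(-1)    # squared distance to each mean
    dens = np.exp(-0.5 * sq / sig2) / (2 * np.pi * sig2) ** (d / 2)
    return float(w @ dens)

# One standard-normal component in 2D evaluated at its mean: density 1/(2*pi).
val = gmm_q(np.zeros(2), np.zeros((1, 2)), np.zeros(1), np.ones(1), np.ones(1))
# Driving u to zero prunes the component exactly.
pruned = gmm_q(np.zeros(2), np.zeros((1, 2)), np.zeros(1), np.zeros(1), np.ones(1))
```

Writing w = u * v is the standard trick behind the sparsification: plain L2 decay on (u, v) acts like an L1-type penalty on w, pushing component weights to exact zero.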

[371] Kairos: Toward Adaptive and Parameter-Efficient Time Series Foundation Models

Kun Feng, Shaocheng Lan, Yuchen Fang, Wenchao He, Lintao Ma, Xingyu Lu, Kan Ren

Main category: cs.LG

TL;DR: Kairos is a parameter-efficient Time Series Foundation Model that addresses temporal heterogeneity through dynamic patching tokenization and multi-granularity positional embeddings, achieving superior zero-shot performance with fewer parameters.

Motivation: Existing Time Series Foundation Models struggle with inherent temporal heterogeneity (varying sampling densities, periodic structures) due to static tokenization and positional encoding schemes that entangle diverse patterns into fixed representations, encouraging memorization over adaptation.

Method: Proposes Kairos with: 1) Dynamic patching tokenizer and mixture-of-size encoding that adapt observational granularity to local information density, 2) Multi-granularity positional embedding using dynamic rotary encodings conditioned on instance-level spectral features and temporal structure, 3) Trained on Predictability-Stratified Time-Series (PreSTS) corpus.

Result: Achieves superior zero-shot performance with substantially fewer parameters on GIFT-Eval and Time-Series-Library benchmarks compared to existing TSFMs.

Conclusion: Kairos demonstrates that decoupling temporal heterogeneity from model capacity through flexible tokenization and encoding schemes enables parameter-efficient time series foundation models with strong zero-shot generalization capabilities.

Abstract: Inherent temporal heterogeneity, such as varying sampling densities and periodic structures, has posed substantial challenges in zero-shot generalization for Time Series Foundation Models (TSFMs). Existing TSFMs predominantly rely on massive parameterization to absorb such heterogeneity, as their static tokenization and positional encoding schemes entangle diverse temporal patterns into a fixed representation space, encouraging memorization rather than adaptation. To address this limitation, we propose Kairos, a flexible and parameter-efficient TSFM that decouples temporal heterogeneity from model capacity through a novel tokenization perspective. Kairos introduces a dynamic patching tokenizer and a mixture-of-size encoding that adapt observational granularity to local information density, enabling fine-grained temporal abstraction without increasing model width or depth. In addition, we design a multi-granularity positional embedding based on dynamic rotary encodings, which conditions on instance-level spectral features and temporal structure induced by dynamic patching tokenization, allowing robust modeling of diverse temporal dependencies. Trained on a novel Predictability-Stratified Time-Series (PreSTS) corpus, Kairos achieves superior zero-shot performance with substantially fewer parameters on two mainstream benchmarks, GIFT-Eval and Time-Series-Library. The project page is at https://foundation-model-research.github.io/Kairos .
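The idea of adapting patch size to local information density can be sketched with a simple variance rule. The threshold and the halving rule here are illustrative; Kairos's tokenizer is learned, not hand-coded:

```python
def dynamic_patches(x, base=8, var_split=1.0):
    """Dynamic patching sketch: walk the series with a base patch size and
    halve the patch when its internal variance exceeds var_split, so volatile
    regions get finer-grained tokens than flat ones."""
    patches, i = [], 0
    while i < len(x):
        seg = x[i:i + base]
        mean = sum(seg) / len(seg)
        var = sum((v - mean) ** 2 for v in seg) / len(seg)
        size = max(base // 2, 1) if var > var_split else base
        patches.append(x[i:i + size])
        i += size
    return patches

flat = dynamic_patches([0.0] * 16)            # low variance: 2 coarse patches
choppy = dynamic_patches([0.0, 10.0] * 8)     # high variance: 4 fine patches
```

The multi-granularity positional embedding then has to cope with tokens of unequal temporal extent, which is why Kairos conditions its rotary encodings on the patching structure.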

[372] Multi-Agent Stage-wise Conservative Linear Bandits

Amirhossein Afsharrad, Ahmadreza Moradipari, Sanjay Lall

Main category: cs.LG

TL;DR: Multi-agent networked bandit algorithm with stage-wise safety constraints for collaborative learning with local communication and safety guarantees.

Motivation: Real-world applications like recommendation systems require multiple agents to balance exploration and exploitation while maintaining safety guarantees to avoid catastrophic failures. This calls for distributed learning in which agents collaborate through local communication while satisfying conservative constraints.

Method: MA-SCLUCB (Multi-Agent Stage-wise Conservative Linear UCB) algorithm with episodic structure alternating between action selection and consensus-building phases. Agents observe local rewards with unknown parameters but optimize for global parameter (average of local parameters). Communication only with immediate neighbors, with communication rounds incurring additional regret.

Result: Achieves regret $\tilde{O}\left(\frac{d}{\sqrt{N}}\sqrt{T}\cdot\frac{\log(NT)}{\sqrt{\log(1/|\lambda_2|)}}\right)$ with high probability, showing: (i) collaboration yields $\frac{1}{\sqrt{N}}$ improvement despite local communication, (ii) communication overhead grows only logarithmically for well-connected networks, and (iii) stage-wise safety adds only lower-order regret.

Conclusion: Distributed learning with safety guarantees achieves near-optimal performance in reasonably connected networks, demonstrating feasibility of collaborative multi-agent systems with safety constraints.

Abstract: In many real-world applications such as recommendation systems, multiple learning agents must balance exploration and exploitation while maintaining safety guarantees to avoid catastrophic failures. We study the stochastic linear bandit problem in a multi-agent networked setting where agents must satisfy stage-wise conservative constraints. A network of $N$ agents collaboratively maximizes cumulative reward while ensuring that the expected reward at every round is no less than $(1-\alpha)$ times that of a baseline policy. Each agent observes local rewards with unknown parameters, but the network optimizes for the global parameter (average of local parameters). Agents communicate only with immediate neighbors, and each communication round incurs additional regret. We propose MA-SCLUCB (Multi-Agent Stage-wise Conservative Linear UCB), an episodic algorithm alternating between action selection and consensus-building phases. We prove that MA-SCLUCB achieves regret $\tilde{O}\left(\frac{d}{\sqrt{N}}\sqrt{T}\cdot\frac{\log(NT)}{\sqrt{\log(1/|\lambda_2|)}}\right)$ with high probability, where $d$ is the dimension, $T$ is the horizon, and $|\lambda_2|$ is the network’s second largest eigenvalue magnitude. Our analysis shows: (i) collaboration yields $\frac{1}{\sqrt{N}}$ improvement despite local communication, (ii) communication overhead grows only logarithmically for well-connected networks, and (iii) stage-wise safety adds only lower-order regret. Thus, distributed learning with safety guarantees achieves near-optimal performance in reasonably connected networks.
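The stage-wise conservative rule — explore only when a pessimistic estimate already certifies a $(1-\alpha)$ fraction of the baseline's reward — can be sketched for a single agent. Variable names and the fallback convention are mine, not the paper's:

```python
import numpy as np

def conservative_action(theta_hat, A_inv, beta, actions, baseline_reward, alpha):
    """One stage-wise conservative step: pick the UCB-optimistic action, but
    fall back to the baseline policy unless that action's *pessimistic*
    reward estimate already guarantees (1 - alpha) * baseline_reward."""
    def bonus(x):
        return beta * np.sqrt(x @ A_inv @ x)
    ucb = [x @ theta_hat + bonus(x) for x in actions]
    x_opt = actions[int(np.argmax(ucb))]
    pessimistic = x_opt @ theta_hat - bonus(x_opt)
    if pessimistic >= (1 - alpha) * baseline_reward:
        return x_opt                      # safe to explore
    return None                           # signal: play the baseline action

actions = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
theta_hat = np.array([1.0, 0.2])
# Tight confidence set and a weak baseline: exploration is certified safe.
safe = conservative_action(theta_hat, 0.01 * np.eye(2), 1.0, actions, 0.5, 0.1)
# A strong baseline forces the fallback.
fallback = conservative_action(theta_hat, 0.01 * np.eye(2), 1.0, actions, 2.0, 0.1)
```

In the multi-agent algorithm this check runs per agent between consensus rounds, with theta_hat formed from the network's averaged statistics.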

[373] VDW-GNNs: Vector diffusion wavelets for geometric graph neural networks

David R. Johnson, Alexander Sietsema, Rishabh Anand, Deanna Needell, Smita Krishnaswamy, Michael Perlmutter

Main category: cs.LG

TL;DR: Vector diffusion wavelets (VDWs) are a new wavelet family inspired by vector diffusion maps for data on Riemannian manifold tangent bundles, integrated into geometric graph neural networks (VDW-GNNs) with proven frame properties and symmetry preservation.

Motivation: The paper aims to develop wavelet transforms suitable for analyzing data that lies in the tangent bundle of Riemannian manifolds, which is common in geometric data analysis. Traditional wavelets may not capture the geometric structure of such data effectively.

Method: Introduces vector diffusion wavelets (VDWs) inspired by vector diffusion maps algorithm. Incorporates these wavelets into geometric graph neural networks called VDW-GNNs. Theoretically analyzes frame properties and symmetry preservation under rotations and translations.

Result: VDW-GNNs demonstrate effectiveness on synthetic point cloud data, real-world wind-field measurements, and neural activity data. Theoretical proofs show VDW wavelets have desirable frame theoretic properties similar to traditional diffusion wavelets and preserve symmetries.

Conclusion: Vector diffusion wavelets provide a principled approach for analyzing geometric data on tangent bundles, with theoretical guarantees and practical effectiveness demonstrated through integration into graph neural networks for various applications.

Abstract: We introduce vector diffusion wavelets (VDWs), a novel family of wavelets inspired by the vector diffusion maps algorithm that was introduced to analyze data lying in the tangent bundle of a Riemannian manifold. We show that these wavelets may be effectively incorporated into a family of geometric graph neural networks, which we refer to as VDW-GNNs. We demonstrate that such networks are effective on synthetic point cloud data, as well as on real-world data derived from wind-field measurements and neural activity data. Theoretically, we prove that these new wavelets have desirable frame theoretic properties, similar to traditional diffusion wavelets. Additionally, we prove that these wavelets have desirable symmetries with respect to rotations and translations.

[374] What Do Temporal Graph Learning Models Learn?

Abigail J. Hayes, Tobias Schumacher, Markus Strohmaier

Main category: cs.LG

TL;DR: Systematic evaluation of temporal graph learning models reveals they capture some fundamental graph characteristics well but fail at others, exposing limitations in current benchmark evaluations and model capabilities.

Motivation: Recent concerns about reliability of temporal graph learning benchmarks and surprising competitiveness of simple heuristics raise questions about what characteristics temporal graph models actually use for predictions.

Method: Systematically evaluated eight temporal graph learning models on their ability to capture eight fundamental characteristics related to link structure, including structural characteristics (density), temporal patterns (recency), and edge formation mechanisms (homophily), using both synthetic and real-world datasets.

Result: Findings reveal mixed picture: models capture some characteristics well but fail to reproduce others, exposing important limitations in current temporal graph learning approaches.

Conclusion: Results provide practical insights for applying temporal graph learning models and motivate more interpretability-driven evaluations in graph learning research.

Abstract: Learning on temporal graphs has become a central topic in graph representation learning, with numerous benchmarks indicating the strong performance of state-of-the-art models. However, recent work has raised concerns about the reliability of benchmark results, noting issues with commonly used evaluation protocols and the surprising competitiveness of simple heuristics. This contrast raises the question of which characteristics of the underlying graphs temporal graph learning models actually use to form their predictions. We address this by systematically evaluating eight models on their ability to capture eight fundamental characteristics related to the link structure of temporal graphs. These include structural characteristics such as density, temporal patterns such as recency, and edge formation mechanisms such as homophily. Using both synthetic and real-world datasets, we analyze how well models learn these characteristics. Our findings reveal a mixed picture: models capture some characteristics well but fail to reproduce others. With this, we expose important limitations. Overall, we believe that our results provide practical insights for the application of temporal graph learning models and motivate more interpretability-driven evaluations in graph learning research.

[375] A Unified Theory of Random Projection for Influence Functions

Pingbang Hu, Yuzheng Hu, Jiaqi W. Ma, Han Zhao

Main category: cs.LG

TL;DR: Theoretical analysis of when random projection preserves influence functions in overparametrized models, addressing limitations of Johnson-Lindenstrauss lemma for inversion and interactions with regularization.

Motivation: Influence functions are computationally expensive in overparametrized models due to large curvature matrices. Random projection via sketching is used for scalability but lacks theoretical justification for how it interacts with inversion, regularization, and structured approximations.

Method: Develops unified theory analyzing when projection preserves influence functions. Considers three scenarios: 1) Unregularized projection requiring injectivity on range(F), 2) Regularized projection with ridge regularization altering sketching requirements, 3) Factorized influence for Kronecker-factored curvatures with decoupled sketches.

Result: Shows exact preservation requires m ≥ rank(F) for unregularized case, ridge regularization changes sketching barriers based on effective dimension, and guarantees hold for Kronecker-factored curvatures despite row correlations. Also analyzes out-of-range test gradients with leakage terms.

Conclusion: Provides principled guidance for choosing sketch size in practice and develops novel theory characterizing when projection provably preserves influence functions, addressing gaps in existing JL-based justifications.

Abstract: Influence functions and related data attribution scores take the form of $g^{\top}F^{-1}g^{\prime}$, where $F\succeq 0$ is a curvature operator. In modern overparametrized models, forming or inverting $F\in\mathbb{R}^{d\times d}$ is prohibitive, motivating scalable influence computation via random projection with a sketch $P \in \mathbb{R}^{m\times d}$. This practice is commonly justified via the Johnson–Lindenstrauss (JL) lemma, which ensures approximate preservation of Euclidean geometry for a fixed dataset. However, JL does not address how sketching behaves under inversion. Furthermore, there is no existing theory that explains how sketching interacts with other widely-used techniques, such as ridge regularization and structured curvature approximations. We develop a unified theory characterizing when projection provably preserves influence functions. When $g,g^{\prime}\in\text{range}(F)$, we show that: 1) Unregularized projection: exact preservation holds iff $P$ is injective on $\text{range}(F)$, which necessitates $m\geq \text{rank}(F)$; 2) Regularized projection: ridge regularization fundamentally alters the sketching barrier, with approximation guarantees governed by the effective dimension of $F$ at the regularization scale; 3) Factorized influence: for Kronecker-factored curvatures $F=A\otimes E$, the guarantees continue to hold for decoupled sketches $P=P_A\otimes P_E$, even though such sketches exhibit row correlations that violate i.i.d. assumptions. Beyond this range-restricted setting, we analyze out-of-range test gradients and quantify a leakage term that arises when test gradients have components in $\ker(F)$. This yields guarantees for influence queries on general test points. Overall, this work develops a novel theory that characterizes when projection provably preserves influence and provides principled guidance for choosing the sketch size in practice.
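A minimal version of the sketched, ridge-regularized influence score looks like this. The Gaussian sketch and low-rank test problem are generic illustrations; the paper's contribution is the theory of when and how well such projections preserve the exact score:

```python
import numpy as np

def sketched_influence(g, g_prime, F, lam, m, seed=0):
    """Approximate g^T (F + lam*I)^{-1} g' after projecting with a Gaussian
    sketch P in R^{m x d}. Per the paper, the ridge term moves the sketching
    barrier from rank(F) to the effective dimension of F at scale lam."""
    d = F.shape[0]
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((m, d)) / np.sqrt(m)
    F_s = P @ F @ P.T                      # m x m projected curvature
    return float((P @ g) @ np.linalg.solve(F_s + lam * np.eye(m), P @ g_prime))

rng = np.random.default_rng(2)
B = rng.standard_normal((100, 5))
F = B @ B.T                                # rank-5 PSD curvature, d = 100
g = rng.standard_normal(100)
score = sketched_influence(g, g, F, lam=1.0, m=20)
```

With lam > 0 the projected system is always positive definite, so the solve is well posed even when m is far below d; the self-influence of any nonzero gradient stays strictly positive.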

[376] Multi-Window Temporal Analysis for Enhanced Arrhythmia Classification: Leveraging Long-Range Dependencies in Electrocardiogram Signals

Tiezhi Wang, Wilhelm Haverkamp, Nils Strodthoff

Main category: cs.LG

TL;DR: S4ECG uses structured state-space models to analyze multiple consecutive ECG windows (up to 20 min) for arrhythmia classification, improving performance and cross-dataset robustness compared to single-window approaches.

Motivation: Arrhythmia classification from ECGs suffers from high false positive rates and limited cross-dataset generalization. Most deep learning approaches analyze isolated 30-s windows, but many arrhythmias like atrial fibrillation exhibit diagnostic features that emerge over extended time scales.

Method: Introduces S4ECG, a deep learning architecture based on structured state-space models (S4) designed to capture long-range temporal dependencies by jointly analyzing multiple consecutive ECG windows spanning up to 20 minutes. Evaluated on four public databases for multi-class arrhythmia classification with systematic cross-dataset evaluations.

Result: Multi-window analysis consistently outperforms single-window approaches across all datasets, improving macro-averaged AUROC by 1.0-11.6 percentage points. For AF, specificity increases from 0.718-0.979 to 0.967-0.998, yielding 3-10-fold reduction in false positive rates. S4 architecture shows superior performance over CNN baselines.

Conclusion: Structured incorporation of extended temporal context enhances both arrhythmia classification accuracy and cross-dataset robustness. Optimal diagnostic windows are 10-20 minutes, beyond which performance plateaus or degrades. Findings provide practical guidance for ECG monitoring system design.

Abstract: Objective. Arrhythmia classification from electrocardiograms (ECGs) suffers from high false positive rates and limited cross-dataset generalization, particularly for atrial fibrillation (AF) detection where specificity ranges from 0.72 to 0.98 using conventional 30-s analysis windows. While most deep learning approaches analyze isolated 30-s ECG windows, many arrhythmias, including AF and atrial flutter, exhibit diagnostic features that emerge over extended time scales. Approach. We introduce S4ECG, a deep learning architecture based on structured state-space models (S4), designed to capture long-range temporal dependencies by jointly analyzing multiple consecutive ECG windows spanning up to 20 min. We evaluate S4ECG on four publicly available databases for multi-class arrhythmia classification and perform systematic cross-dataset evaluations to assess out-of-distribution robustness. Results. Multi-window analysis consistently outperforms single-window approaches across all datasets, improving macro-averaged AUROC by 1.0-11.6 percentage points. For AF, specificity increases from 0.718-0.979 to 0.967-0.998 at a fixed sensitivity threshold, yielding a 3-10-fold reduction in false positive rates. Significance. Compared with convolutional neural network baselines, the S4 architecture shows superior performance, and multi-window training substantially reduces cross-dataset degradation. Optimal diagnostic windows are 10-20 min, beyond which performance plateaus or degrades. These findings demonstrate that structured incorporation of extended temporal context enhances both arrhythmia classification accuracy and cross-dataset robustness. The identified optimal temporal windows provide practical guidance for ECG monitoring system design and may reflect underlying physiological timescales of arrhythmogenic dynamics.

[377] Low-Dimensional Execution Manifolds in Transformer Learning Dynamics: Evidence from Modular Arithmetic Tasks

Yongzhong Xu

Main category: cs.LG

TL;DR: Transformers trained on modular arithmetic tasks collapse onto low-dimensional execution manifolds (3-4D) despite high-dimensional parameter spaces, explaining attention concentration, SGD dynamics, and interpretability limitations.

Motivation: To understand the geometric structure of learning dynamics in overparameterized transformer models, particularly how they navigate high-dimensional parameter spaces and what underlying computational structures emerge during training.

Method: Carefully controlled modular arithmetic tasks with overparameterized transformer models (d=128), analyzing training trajectories, manifold dimensions, SGD commutators, and sparse autoencoders across random seeds and task difficulties.

Result: Training trajectories rapidly collapse onto low-dimensional execution manifolds (3-4D), with SGD commutators preferentially aligned with execution subspace (10× random baseline), sharp attention concentration emerging as saturation along routing coordinates, and sparse autoencoders capturing auxiliary structure but not core execution.

Conclusion: Transformers learn in dramatically reduced subspaces despite overparameterization, with most parameters absorbing optimization interference while core computation occurs in low-dimensional manifolds, providing a geometric framework for understanding transformer learning.

Abstract: We investigate the geometric structure of learning dynamics in overparameterized transformer models through carefully controlled modular arithmetic tasks. Our primary finding is that despite operating in high-dimensional parameter spaces ($d=128$), transformer training trajectories rapidly collapse onto low-dimensional execution manifolds of dimension $3$–$4$. This dimensional collapse is robust across random seeds and moderate task difficulties, though the orientation of the manifold in parameter space varies between runs. We demonstrate that this geometric structure underlies several empirically observed phenomena: (1) sharp attention concentration emerges as saturation along routing coordinates within the execution manifold, (2) SGD commutators are preferentially aligned with the execution subspace (up to $10\times$ random baseline) early in training, with $>92\%$ of non-commutativity confined to orthogonal staging directions and this alignment decreasing as training converges, and (3) sparse autoencoders capture auxiliary routing structure but fail to isolate execution itself, which remains distributed across the low-dimensional manifold. Our results suggest a unifying geometric framework for understanding transformer learning, where the vast majority of parameters serve to absorb optimization interference while core computation occurs in a dramatically reduced subspace. These findings have implications for interpretability, training curriculum design, and understanding the role of overparameterization in neural network learning.
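The reported manifold dimension can be probed with a standard PCA-style estimate on parameter checkpoints (a common proxy for trajectory dimension, not necessarily the paper's exact estimator):

```python
import numpy as np

def trajectory_dim(checkpoints, var_threshold=0.95):
    """Effective dimension of a training trajectory: stack flattened parameter
    checkpoints, center them, and count the principal components needed to
    explain var_threshold of the variance."""
    X = np.array(checkpoints, dtype=float)   # copy: one row per checkpoint
    X -= X.mean(0)
    s = np.linalg.svd(X, compute_uv=False) ** 2   # variances per component
    ratios = np.cumsum(s) / s.sum()
    return int(np.searchsorted(ratios, var_threshold) + 1)

# Checkpoints confined to a 2D plane inside a 3D ambient space.
X = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]])
dim = trajectory_dim(X)
```

Applied to transformer checkpoints one would flatten each parameter snapshot into a row; the paper's claim is that this count lands at 3–4 even for $d=128$ models.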

[378] Weight Decay may matter more than muP for Learning Rate Transfer in Practice

Atli Kosson, Jeremy Welborn, Yang Liu, Martin Jaggi, Xi Chen

Main category: cs.LG

TL;DR: muP’s learning rate scaling primarily acts as implicit warmup; weight decay is what truly stabilizes update dynamics for learning rate transfer across model widths

Motivation: To understand why Maximal Update Parameterization (muP) enables learning rate transfer from small to large models, and to test its underlying assumptions about stable update dynamics

Method: Large-scale empirical investigation analyzing muP’s assumptions about geometric alignment of layer inputs with weights and gradient updates during training, particularly in LLM training setups

Result: muP’s assumptions hold only briefly at training start; weight decay (not muP) stabilizes update dynamics across widths for most of training; muP primarily acts as implicit learning rate warmup

Conclusion: Weight decay is key for learning rate transfer, challenging prevailing beliefs about muP; muP can be largely replaced with modified warmup schedules

Abstract: Transferring the optimal learning rate from small to large neural networks can enable efficient training at scales where hyperparameter tuning is otherwise prohibitively expensive. To this end, the Maximal Update Parameterization (muP) proposes a learning rate scaling designed to keep the update dynamics of internal representations stable across different model widths. However, the scaling rules of muP rely on strong assumptions, particularly about the geometric alignment of a layer’s inputs with both its weights and gradient updates. In this large-scale empirical investigation, we show that these assumptions hold only briefly at the start of training in the practical setups where learning rate transfer is most valuable, such as LLM training. For the remainder of training it is weight decay rather than muP that correctly stabilizes the update dynamics of internal representations across widths, facilitating learning rate transfer. This suggests muP’s scaling primarily acts as a form of implicit learning rate warmup, allowing us to largely replace it with modified warmup schedules. Together these findings fundamentally challenge prevailing beliefs about learning rate transfer and can explain empirical observations such as why muP requires the independent weight decay variant for good transfer.
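For reference, the width scaling the paper interrogates is the standard muP prescription for hidden (matrix-like) weights under Adam: tune the rate at a small base width, then shrink it inversely with width. Embedding and output layers follow different rules, and this helper is illustrative:

```python
def mup_hidden_lr(base_lr, base_width, width):
    """muP-style Adam learning rate for hidden weights:
    eta(width) = eta(base_width) * base_width / width, so a rate tuned on a
    narrow proxy model transfers to wider ones."""
    return base_lr * base_width / width

lr_wide = mup_hidden_lr(1e-3, 256, 1024)   # 4x wider -> 4x smaller rate
```

The paper's finding is that, beyond the first phase of training, it is weight decay rather than this rescaling that keeps update dynamics width-stable, so the rule mostly functions as an implicit warmup.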

[379] Active Learning with Task-Driven Representations for Messy Pools

Kianoosh Ashouritaklimi, Tom Rainforth

Main category: cs.LG

TL;DR: Active learning with task-driven representations improves performance on messy, uncurated data pools compared to fixed unsupervised representations

Motivation: Current active learning approaches use fixed unsupervised representations for messy data pools, but these representations may fail to capture task-relevant information, limiting effectiveness

Method: Proposes using task-driven representations updated during active learning with collected labels. Two strategies: 1) learning semi-supervised representations directly, 2) supervised fine-tuning of initial unsupervised representation

Result: Both task-driven representation strategies significantly outperform unsupervised or pretrained representations in empirical performance

Conclusion: Periodically updating representations with collected labels during active learning is crucial for handling messy data pools, and task-driven representations substantially improve over fixed representations

Abstract: Active learning has the potential to be especially useful for messy, uncurated pools where datapoints vary in relevance to the target task. However, state-of-the-art approaches to this problem currently rely on using fixed, unsupervised representations of the pool, focusing on modifying the acquisition function instead. We show that this model setup can undermine their effectiveness at dealing with messy pools, as such representations can fail to capture important information relevant to the task. To address this, we propose using task-driven representations that are periodically updated during the active learning process using the previously collected labels. We introduce two specific strategies for learning these representations, one based on directly learning semi-supervised representations and the other based on supervised fine-tuning of an initial unsupervised representation. We find that both significantly improve empirical performance over using unsupervised or pretrained representations.
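The periodic-update loop can be sketched as follows. `fit_representation`, `acquire`, and the oracle below are simplified stand-ins for the paper's semi-supervised/fine-tuned representations and acquisition function; only the control flow (refit the representation every few acquisitions) reflects the proposed method.

```python
# Hypothetical sketch of active learning with task-driven representations.
import random

def fit_representation(labeled):
    """Stand-in for representation fitting: the mean feature per class."""
    means = {}
    for x, y in labeled:
        means.setdefault(y, []).append(x)
    return {y: sum(v) / len(v) for y, v in means.items()}

def acquire(pool, rep):
    """Stand-in acquisition: the pool point farthest from all class means."""
    if not rep:
        return random.choice(sorted(pool))
    return max(sorted(pool), key=lambda x: min(abs(x - m) for m in rep.values()))

def active_learning(pool, oracle, budget, refit_every=2):
    labeled, rep = [], {}
    pool = set(pool)
    for t in range(budget):
        x = acquire(pool, rep)
        pool.discard(x)
        labeled.append((x, oracle(x)))
        if (t + 1) % refit_every == 0:  # the key step: periodic task-driven update
            rep = fit_representation(labeled)
    return labeled, rep
```

The fixed-representation baseline corresponds to never executing the refit branch.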

[380] HiFloat4 Format for Language Model Inference

Yuanyong Luo, Jing Huang, Yu Cheng, Ziwei Yu, Kaihua Tang, Xinda Ma, Xin Wang, Anping Tong, Guipeng Hu, Yun Xu, Mehran Taghian, Peng Wu, Guanglin Li, Yunke Peng, Tianchi Hu, Minqi Chen, Michael Bi Mi, Hu Liu, Xiping Zhou, Junsong Wang, Qiang Lin, Heng Liao

Main category: cs.LG

TL;DR: HiFloat4 (HiF4) is a novel 4-bit block floating-point format for deep learning that packs 64 4-bit elements with shared scaling metadata, achieving better accuracy than state-of-the-art formats while enabling efficient hardware implementation.

DetailsMotivation: The motivation is to develop an efficient low-precision data format for deep learning that balances compression, accuracy, and hardware efficiency, addressing the limitations of existing 4-bit formats like NVFP4.

Method: HiF4 uses a block floating-point format in which 64 4-bit elements share 32 bits of scaling metadata, organized in a three-level hierarchy to capture inter- and intra-group dynamic range. The large group size enables matrix multiplications to be executed largely in fixed-point arithmetic.

Result: HiF4 achieves higher average accuracy than state-of-the-art NVFP4 format across multiple language models (LLaMA, Qwen, Mistral, DeepSeek-V3.1, LongCat) and diverse downstream tasks.

Conclusion: HiFloat4 is an effective 4-bit data format that improves accuracy over existing solutions while enabling hardware-efficient implementations through its block floating-point design with hierarchical scaling.

Abstract: This paper introduces HiFloat4 (HiF4), a block floating-point data format tailored for deep learning. Each HiF4 unit packs 64 4-bit elements with 32 bits of shared scaling metadata, averaging 4.5 bits per value. The metadata specifies a three-level scaling hierarchy, capturing inter- and intra-group dynamic range while improving the utilization of the representational space. In addition, the large 64-element group size enables matrix multiplications to be executed in a highly fixed-point manner, significantly reducing hardware area and power consumption. To evaluate the proposed format, we conducted inference experiments on several language models, including LLaMA, Qwen, Mistral, DeepSeek-V3.1 and LongCat. Results show that HiF4 achieves higher average accuracy than the state-of-the-art NVFP4 format across multiple models and diverse downstream tasks.
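The general block floating-point idea behind formats like HiF4 can be illustrated with a simplified sketch: one shared power-of-two scale per group of values, with 4-bit signed mantissas. This is a single-level toy, not the paper's three-level scaling hierarchy or its actual metadata layout.

```python
# Simplified block floating-point quantization (one shared scale per group).

def quantize_block(block, bits=4):
    """Quantize a group to signed `bits`-bit integers sharing one
    power-of-two scale; returns (scale_exponent, codes)."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit signed
    amax = max(abs(v) for v in block)
    if amax == 0:
        return 0, [0] * len(block)
    e = 0
    while amax / (2.0 ** e) > qmax:     # raise exponent until values fit
        e += 1
    while amax / (2.0 ** (e - 1)) <= qmax:  # lower it to maximize precision
        e -= 1
    scale = 2.0 ** e
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in block]
    return e, codes

def dequantize_block(e, codes):
    return [c * (2.0 ** e) for c in codes]
```

With a 64-element group, the per-value metadata overhead of one shared scale is small, which is what pushes the average cost toward 4.5 bits per value in HiF4.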

[381] LLMs as In-Context Meta-Learners for Model and Hyperparameter Selection

Youssef Attia El Hili, Albert Thomas, Malik Tiomoko, Abdelhakim Benechehab, Corentin Léger, Corinne Ancourt, Balázs Kégl

Main category: cs.LG

TL;DR: LLMs can act as in-context meta-learners for model and hyperparameter selection by using dataset metadata, with zero-shot and meta-informed prompting strategies showing competitive performance without expensive search.

DetailsMotivation: Model and hyperparameter selection is critical but challenging in machine learning, typically requiring expert intuition or expensive automated search. The paper investigates whether LLMs can serve as lightweight, general-purpose assistants for this task.

Method: Convert each dataset into interpretable metadata, then prompt LLMs to recommend both model families and hyperparameters. Two prompting strategies: (1) zero-shot mode relying on pretrained knowledge, and (2) meta-informed mode augmented with examples of models and their performance on past tasks.

Result: Across synthetic and real-world benchmarks, LLMs can exploit dataset metadata to recommend competitive models and hyperparameters without search. Improvements from meta-informed prompting demonstrate their capacity for in-context meta-learning.

Conclusion: LLMs show promise as lightweight, general-purpose assistants for model selection and hyperparameter optimization, highlighting a new role for LLMs in machine learning workflows.

Abstract: Model and hyperparameter selection are critical but challenging in machine learning, typically requiring expert intuition or expensive automated search. We investigate whether large language models (LLMs) can act as in-context meta-learners for this task. By converting each dataset into interpretable metadata, we prompt an LLM to recommend both model families and hyperparameters. We study two prompting strategies: (1) a zero-shot mode relying solely on pretrained knowledge, and (2) a meta-informed mode augmented with examples of models and their performance on past tasks. Across synthetic and real-world benchmarks, we show that LLMs can exploit dataset metadata to recommend competitive models and hyperparameters without search, and that improvements from meta-informed prompting demonstrate their capacity for in-context meta-learning. These results highlight a promising new role for LLMs as lightweight, general-purpose assistants for model selection and hyperparameter optimization.
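The two prompting modes can be sketched as plain string construction. The metadata fields and prompt wording below are choices of this sketch, not taken from the paper.

```python
# Hypothetical prompt construction for zero-shot vs. meta-informed modes.

def dataset_metadata(n_rows, n_features, task, class_balance=None):
    meta = f"rows={n_rows}, features={n_features}, task={task}"
    if class_balance is not None:
        meta += f", class_balance={class_balance:.2f}"
    return meta

def build_prompt(meta, past_tasks=None):
    """Zero-shot when `past_tasks` is None; meta-informed otherwise."""
    lines = ["Recommend a model family and hyperparameters.",
             f"Dataset: {meta}"]
    if past_tasks:  # meta-informed mode: in-context examples from past tasks
        lines.append("Past tasks and results:")
        lines += [f"- {m} -> {model} (score={s})" for m, model, s in past_tasks]
    return "\n".join(lines)
```

The meta-informed mode is what turns the LLM into an in-context meta-learner: past (metadata, model, score) triples serve as demonstrations.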

[382] Context-Specific Causal Graph Discovery with Unobserved Contexts: Non-Stationarity, Regimes and Spatio-Temporal Patterns

Martin Rabel, Jakob Runge

Main category: cs.LG

TL;DR: A framework for causal discovery on non-stationary spatially gridded time series data that accounts for variations in causal structure across space and time while maintaining stability.

DetailsMotivation: Real-world problems like climate applications involve spatially gridded time series data where causal relationships may vary across space and time. These variations encode important information but can negatively affect stability and validity of causal discovery results if not properly accounted for.

Method: Modifies constraint-based causal discovery approaches at the independence testing level to handle non-stationary structure. The framework is modular and extensible, allowing integration with existing methods (PC, PC-stable, FCI, PCMCI, PCMCI+, LPCMCI) and systematic decomposition into simpler subproblems related to change-point detection, clustering, and independence testing.

Result: Developed a framework that addresses challenges of encoding system-states and statistical convergence with imperfectly recoverable non-stationary structure. Numerical experiments support the conceptual suitability of the approach.

Conclusion: Provides a principled framework for stable causal discovery on non-stationary spatially gridded time series data that is modular, extensible, and applicable to various constraint-based causal discovery methods.

Abstract: Real-world problems, for example in climate applications, often require causal reasoning on spatially gridded time series data or data with comparable structure. While the underlying system is often believed to behave similarly at different points in space and time, the variations that do exist are relevant in two ways: they often encode important information in and of themselves, and they may negatively affect the stability and validity of results if not accounted for. We study the information encoded in changes of the causal graph, with stability in mind. Two core challenges arise, related to the complexity of encoding system-states and to statistical convergence properties in the presence of imperfectly recoverable non-stationary structure. We provide a framework realizing principles conceptually suitable to overcome these challenges, an interpretation supported by numerical experiments. Primarily, we modify constraint-based causal discovery approaches on the level of independence testing. This leads to a framework which is additionally highly modular, easily extensible and widely applicable. For example, it allows leveraging existing constraint-based causal discovery methods (demonstrated on PC, PC-stable, FCI, PCMCI, PCMCI+ and LPCMCI), and systematically dividing the problem into simpler subproblems that are easier to analyze and understand and relate more clearly to well-studied problems like change-point detection, clustering, independence testing and more. Code is available at https://github.com/martin-rabel/Causal_GLDF.
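Why modifying the independence test matters can be seen in a toy example: pooling non-stationary data can wash out dependence that is strong within each regime. The per-regime test and max-combination below are illustrative stand-ins for the paper's framework, in which regimes may also have to be estimated.

```python
# Regime-aware dependence scoring vs. naive pooling (illustrative).
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return sxy / (sx * sy) if sx > 0 and sy > 0 else 0.0

def regime_aware_dependence(xs, ys, regimes):
    """Max absolute correlation over regimes: dependence present in any
    one regime counts, instead of being averaged away by pooling."""
    scores = []
    for r in set(regimes):
        idx = [i for i, g in enumerate(regimes) if g == r]
        if len(idx) >= 3:
            scores.append(abs(pearson_r([xs[i] for i in idx],
                                        [ys[i] for i in idx])))
    return max(scores, default=0.0)
```

If the correlation flips sign between two regimes, the pooled estimate is near zero while the regime-aware score stays near one.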

[383] Quantum Temporal Convolutional Neural Networks for Cross-Sectional Equity Return Prediction: A Comparative Benchmark Study

Chi-Sheng Chen, Xinyu Zhang, En-Jui Kuo, Rong Fu, Qiuzhe Xie, Fan Zhang

Main category: cs.LG

TL;DR: Quantum Temporal Convolutional Neural Network (QTCNN) combines classical temporal encoding with quantum circuits for stock prediction, achieving 72% better Sharpe ratio than classical baselines.

DetailsMotivation: Classical forecasting models struggle with noisy financial data, regime shifts, and limited generalization. Quantum machine learning offers potential for enhanced prediction in complex, dynamic financial environments.

Method: Proposes QTCNN with classical temporal encoder for multi-scale pattern extraction from sequential technical indicators, combined with parameter-efficient quantum convolution circuits that leverage superposition and entanglement for enhanced feature representation and overfitting suppression.

Result: On JPX Tokyo Stock Exchange dataset, QTCNN achieves Sharpe ratio of 0.538, outperforming best classical baseline by approximately 72% in out-of-sample portfolio construction.

Conclusion: QTCNN demonstrates practical potential of quantum-enhanced forecasting for robust decision-making in quantitative finance, showing significant improvement over classical approaches.

Abstract: Quantum machine learning offers a promising pathway for enhancing stock market prediction, particularly under complex, noisy, and highly dynamic financial environments. However, many classical forecasting models struggle with noisy input, regime shifts, and limited generalization capacity. To address these challenges, we propose a Quantum Temporal Convolutional Neural Network (QTCNN) that combines a classical temporal encoder with parameter-efficient quantum convolution circuits for cross-sectional equity return prediction. The temporal encoder extracts multi-scale patterns from sequential technical indicators, while the quantum processing leverages superposition and entanglement to enhance feature representation and suppress overfitting. We conduct a comprehensive benchmarking study on the JPX Tokyo Stock Exchange dataset and evaluate predictions through long-short portfolio construction using out-of-sample Sharpe ratio as the primary performance metric. QTCNN achieves a Sharpe ratio of 0.538, outperforming the best classical baseline by approximately 72%. These results highlight the practical potential of the quantum-enhanced forecasting model QTCNN for robust decision-making in quantitative finance.
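The headline metric can be made precise with a short sketch. The top/bottom-bucket long-short construction below is a common convention for cross-sectional evaluation and an assumption here; the abstract only states that long-short portfolios and out-of-sample Sharpe ratio are used.

```python
# Long-short portfolio return and Sharpe ratio (generic definitions).
import math

def long_short_return(preds, rets, k):
    """Equal-weight: long the k names with highest predicted return,
    short the k names with lowest."""
    order = sorted(range(len(preds)), key=lambda i: preds[i])
    shorts, longs = order[:k], order[-k:]
    return (sum(rets[i] for i in longs) / k) - (sum(rets[i] for i in shorts) / k)

def sharpe(daily_returns):
    """Mean over sample standard deviation of the daily return series."""
    n = len(daily_returns)
    mu = sum(daily_returns) / n
    var = sum((r - mu) ** 2 for r in daily_returns) / (n - 1)
    return mu / math.sqrt(var) if var > 0 else 0.0
```

A reported Sharpe of 0.538 vs. a baseline of roughly 0.31 would correspond to the ~72% relative improvement claimed.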

[384] Membership and Dataset Inference Attacks on Large Audio Generative Models

Jakub Proboszcz, Paweł Kochanski, Karol Korszun, Donato Crisostomi, Giorgio Strano, Emanuele Rodolà, Kamil Deja, Jan Dubinski

Main category: cs.LG

TL;DR: Membership inference attacks on generative audio models have limited effectiveness, but dataset inference (aggregating evidence across multiple samples) successfully detects if an artist’s collection was used in training.

DetailsMotivation: As generative audio models advance, copyright concerns arise about whether artists' works were used in training. The paper investigates verification methods to protect copyright holders by detecting if their material was included in model training datasets.

Method: The study investigates membership inference attacks (MIA) on open-source generative audio models to determine if specific audio samples were in training sets. When MIA proves limited at scale, the research focuses on dataset inference (DI), which aggregates membership evidence across multiple samples from an artist’s collection, building on prior work in text and vision domains.

Result: Membership inference alone is ineffective at scale due to weak per-sample signals in models trained on large, diverse datasets. However, dataset inference (DI) successfully detects whether an artist’s collection contributed to model training, offering a practical verification mechanism.

Conclusion: Dataset inference provides a promising approach for copyright protection and dataset accountability in large audio generative models, enabling artists and media owners to verify if their collections were used in training.

Abstract: Generative audio models, based on diffusion and autoregressive architectures, have advanced rapidly in both quality and expressiveness. This progress, however, raises pressing copyright concerns, as such models are often trained on vast corpora of artistic and commercial works. A central question is whether one can reliably verify if an artist’s material was included in training, thereby providing a means for copyright holders to protect their content. In this work, we investigate the feasibility of such verification through membership inference attacks (MIA) on open-source generative audio models, which attempt to determine whether a specific audio sample was part of the training set. Our empirical results show that membership inference alone is of limited effectiveness at scale, as the per-sample membership signal is weak for models trained on large and diverse datasets. However, artists and media owners typically hold collections of works rather than isolated samples. Building on prior work in text and vision domains, we focus on dataset inference (DI), which aggregates diverse membership evidence across multiple samples. We find that DI is successful in the audio domain, offering a more practical mechanism for assessing whether an artist’s works contributed to model training. Our results suggest DI as a promising direction for copyright protection and dataset accountability in the era of large audio generative models.
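The intuition that aggregation rescues a weak per-sample signal can be sketched with a generic statistic: compare the mean membership score of an artist's collection against a reference (non-member) distribution. The z-score below is an illustrative aggregation choice, not necessarily the paper's exact DI statistic.

```python
# Aggregating per-sample membership scores into a dataset-level signal.
import math
import statistics

def dataset_inference_z(collection_scores, reference_scores):
    """z-score of the collection's mean membership score against the
    reference (non-member) score distribution; grows with sqrt(n)."""
    mu = statistics.mean(reference_scores)
    sd = statistics.stdev(reference_scores)
    n = len(collection_scores)
    return (statistics.mean(collection_scores) - mu) / (sd / math.sqrt(n))
```

The sqrt(n) factor is the point: a per-sample signal too weak for reliable MIA becomes detectable once evidence from a whole collection is pooled.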

[385] ATLAS: Adaptive Topology-based Learning at Scale for Homophilic and Heterophilic Graphs

Turja Kundu, Sanjukta Bhowmick

Main category: cs.LG

TL;DR: ATLAS is a propagation-free graph learning framework that encodes graph structure through multi-resolution community features instead of message passing, achieving competitive performance on both homophilic and heterophilic graphs while enabling scalable training.

DetailsMotivation: GNNs struggle with heterophilic graphs where connected nodes don't share labels, and iterative message passing limits scalability due to neighborhood expansion overhead. There's a need for a framework that can handle both homophilic and heterophilic graphs efficiently.

Method: ATLAS uses modularity-guided adaptive search to identify informative community scales, one-hot encodes these communities, projects them into learnable embeddings, and concatenates with node attributes for MLP classification. This avoids message passing entirely.

Result: Across 13 benchmarks including million-node graphs, ATLAS achieves competitive or superior accuracy, with up to 20-point gains over GCN on heterophilic datasets and 12-point gains over MLPs on homophilic graphs.

Conclusion: ATLAS provides a scalable, propagation-free approach to graph learning that adapts intelligently to graph structure, remaining robust when structure is weakly aligned and avoiding propagation when structure misleads.

Abstract: Graph neural networks (GNNs) excel on homophilic graphs where connected nodes share labels, but struggle with heterophilic graphs where edges do not imply similarity. Moreover, iterative message passing limits scalability due to neighborhood expansion overhead. We introduce ATLAS (Adaptive Topology-based Learning at Scale), a propagation-free framework that encodes graph structure through multi-resolution community features rather than message passing. We first prove that community refinement involves a fundamental trade-off: finer partitions increase label-community mutual information but also increase entropy. We formalize when refinement improves normalized mutual information, explaining why intermediate granularities are often most predictive. ATLAS employs modularity-guided adaptive search to automatically identify informative community scales, which are one-hot encoded, projected into learnable embeddings, and concatenated with node attributes for MLP classification. This enables standard mini-batch training and adjacency-free inference after one-time preprocessing. Across 13 benchmarks including million-node graphs, ATLAS achieves competitive or superior accuracy, up to 20-point gains over GCN on heterophilic datasets and 12-point gains over MLPs on homophilic graphs. By treating topology as explicit features, ATLAS adapts intelligently: leveraging structure when informative, remaining robust when weakly aligned, and avoiding propagation when structure misleads, providing both scalable performance and interpretable structural insights.
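ATLAS's feature construction can be sketched in a few lines: community assignments at several resolutions are one-hot encoded and concatenated with node attributes, after which a plain MLP takes over. The interface below is an assumption for illustration; the learnable embedding projection and the modularity-guided scale search are omitted.

```python
# Propagation-free structural features: communities as explicit inputs.

def one_hot(label, n_labels):
    v = [0.0] * n_labels
    v[label] = 1.0
    return v

def atlas_features(node_attrs, partitions):
    """partitions: list of (community_assignment, n_communities) pairs,
    one per resolution; returns one concatenated feature row per node."""
    feats = []
    for i, attrs in enumerate(node_attrs):
        row = list(attrs)
        for assignment, k in partitions:
            row += one_hot(assignment[i], k)
        feats.append(row)
    return feats
```

Because these features are precomputed once, training reduces to standard mini-batch MLP training with no neighborhood expansion at all.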

[386] Imitation Learning for Combinatorial Optimisation under Uncertainty

Prakash Gawas, Antoine Legrain, Louis-Martin Rousseau

Main category: cs.LG

TL;DR: Systematic taxonomy of expert types for imitation learning in combinatorial optimization under uncertainty, with generalized DAgger algorithm and evaluation on dynamic physician assignment problem.

DetailsMotivation: Existing imitation learning approaches for combinatorial optimization use diverse expert constructions without a unifying framework to characterize their assumptions, computational properties, and impact on learning performance.

Method: Proposes taxonomy classifying experts along three dimensions: treatment of uncertainty (myopic, deterministic, full-information, stochastic), level of optimality (task-optimal vs approximate), and interaction mode (one-shot to iterative). Introduces generalized DAgger algorithm supporting multiple expert queries, aggregation, and flexible interaction strategies.

Result: Evaluation on dynamic physician-to-patient assignment problem shows policies from stochastic experts outperform deterministic/full-information experts. Interactive learning improves solution quality with fewer demonstrations. Aggregated deterministic experts provide effective alternative when stochastic optimization is computationally challenging.

Conclusion: Provides systematic framework for understanding expert design in imitation learning for combinatorial optimization, demonstrating importance of stochastic experts and interactive learning while offering practical alternatives for computational constraints.

Abstract: Imitation learning (IL) provides a data-driven framework for approximating policies for large-scale combinatorial optimisation problems formulated as sequential decision problems (SDPs), where exact solution methods are computationally intractable. A central but underexplored aspect of IL in this context is the role of the \emph{expert} that generates training demonstrations. Existing studies employ a wide range of expert constructions, yet lack a unifying framework to characterise their modelling assumptions, computational properties, and impact on learning performance. This paper introduces a systematic taxonomy of experts for IL in combinatorial optimisation under uncertainty. Experts are classified along three dimensions: (i) their treatment of uncertainty, including myopic, deterministic, full-information, two-stage stochastic, and multi-stage stochastic formulations; (ii) their level of optimality, distinguishing task-optimal and approximate experts; and (iii) their interaction mode with the learner, ranging from one-shot supervision to iterative, interactive schemes. Building on this taxonomy, we propose a generalised Dataset Aggregation (DAgger) algorithm that supports multiple expert queries, expert aggregation, and flexible interaction strategies. The proposed framework is evaluated on a dynamic physician-to-patient assignment problem with stochastic arrivals and capacity constraints. Computational experiments compare learning outcomes across expert types and interaction regimes. The results show that policies learned from stochastic experts consistently outperform those learned from deterministic or full-information experts, while interactive learning improves solution quality using fewer expert demonstrations. Aggregated deterministic experts provide an effective alternative when stochastic optimisation becomes computationally challenging.
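The generalized DAgger loop can be sketched as follows. The majority-vote aggregation and the toy environment interface are placeholders; the paper's experts span the myopic/deterministic/stochastic taxonomy and its interaction modes are more flexible than this fixed loop.

```python
# Generalized DAgger with multiple experts and vote aggregation (sketch).
from collections import Counter

def aggregate(actions):
    """Aggregate several experts' actions by majority vote."""
    return Counter(actions).most_common(1)[0][0]

def dagger(policy_fit, experts, rollout, n_iters):
    """policy_fit: dataset -> policy; rollout: policy -> visited states."""
    dataset, policy = [], None
    for _ in range(n_iters):
        states = rollout(policy)                 # learner visits states
        for s in states:                         # experts label them
            dataset.append((s, aggregate([e(s) for e in experts])))
        policy = policy_fit(dataset)             # retrain on aggregate
    return policy, dataset
```

One-shot supervision corresponds to `n_iters=1` with expert rollouts; the interactive regime the paper finds more sample-efficient corresponds to repeated learner rollouts.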

[387] Spectral Ghost in Representation Learning: from Component Analysis to Self-Supervised Learning

Bo Dai, Na Li, Dale Schuurmans

Main category: cs.LG

TL;DR: Theoretical framework for self-supervised learning that unifies diverse SSL objectives through spectral representation analysis, revealing the spectral essence of successful algorithms and enabling principled algorithm design.

DetailsMotivation: The rapid growth of diverse SSL methods lacks unified theoretical understanding, making algorithm design unclear and practical use unjustified. Need principled foundation to understand representation learning.

Method: Develop theoretical framework investigating representation sufficiency from spectral representation view, revealing spectral essence of existing SSL algorithms and providing unified analysis framework.

Result: Framework reveals spectral essence of successful SSL algorithms, provides unified understanding, and inspires development of more efficient and easy-to-use representation learning algorithms.

Conclusion: Spectral representation framework provides principled foundation for understanding SSL, enabling better algorithm design and justified use in real-world applications.

Abstract: Self-supervised learning (SSL) has improved empirical performance by unleashing the power of unlabeled data for practical applications. Specifically, SSL extracts representations from massive unlabeled data, which are then transferred to a variety of downstream tasks with limited data. The significant improvement on diverse applications of representation learning has attracted increasing attention, resulting in a variety of dramatically different self-supervised learning objectives for representation extraction, with an assortment of learning procedures, but the lack of a clear and unified understanding. Such an absence hampers the ongoing development of representation learning, leaving a theoretical understanding missing, principles for efficient algorithm design unclear, and the use of representation learning methods in practice unjustified. The urgency for a unified framework is further motivated by the rapid growth in representation learning methods. In this paper, we therefore develop a principled foundation for representation learning. We first theoretically investigate the sufficiency of the representation from a spectral representation view, which reveals the spectral essence of the existing successful SSL algorithms and paves the path to a unified framework for understanding and analysis. Such a framework also inspires the development of more efficient and easy-to-use representation learning algorithms in a principled way for real-world applications.

[388] Gauss-Newton Natural Gradient Descent for Shape Learning

James King, Arturs Berzins, Siddhartha Mishra, Marius Zeinhofer

Main category: cs.LG

TL;DR: Gauss-Newton method applied to shape learning optimization for faster convergence and better accuracy compared to first-order methods

DetailsMotivation: Shape learning faces challenges like ill-conditioning of differential constraints and mismatch between parameter space optimization and natural function space formulation, requiring better optimization methods

Method: Uses Gauss-Newton method for optimization in shape learning, applied to implicit neural surfaces and geometry-informed neural networks

Result: Significantly faster and more stable convergence than standard first-order methods with far fewer iterations, consistently improving training speed and final solution accuracy across benchmark tasks

Conclusion: Gauss-Newton method effectively addresses key optimization challenges in shape learning, providing superior performance for implicit neural surfaces and geometry-informed networks

Abstract: We explore the use of the Gauss-Newton method for optimization in shape learning, including implicit neural surfaces and geometry-informed neural networks. The method addresses key challenges in shape learning, such as the ill-conditioning of the underlying differential constraints and the mismatch between the optimization problem in parameter space and the function space where the problem is naturally posed. This leads to significantly faster and more stable convergence than standard first-order methods, while also requiring far fewer iterations. Experiments across benchmark shape optimization tasks demonstrate that the Gauss-Newton method consistently improves both training speed and final solution accuracy.
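The Gauss-Newton update itself is standard and easy to state: for residuals r(θ) with Jacobian J, the step is θ ← θ − (JᵀJ)⁻¹Jᵀr. The one-parameter example below illustrates the method generically (the step reduces to a scalar division); the paper applies it to network parameters posed in function space, which this sketch does not capture.

```python
# One-parameter Gauss-Newton: fit y = exp(a*x) by nonlinear least squares.
import math

def gauss_newton_exp_fit(xs, ys, a0, iters=20):
    a = a0
    for _ in range(iters):
        r = [math.exp(a * x) - y for x, y in zip(xs, ys)]   # residuals
        J = [x * math.exp(a * x) for x in xs]               # dr/da
        jtj = sum(j * j for j in J)
        jtr = sum(j * ri for j, ri in zip(J, r))
        if jtj == 0:
            break
        a -= jtr / jtj                                      # GN step: (J^T J)^-1 J^T r
    return a
```

Near the solution the residuals are small and Gauss-Newton converges roughly quadratically, which is the source of the "far fewer iterations" behavior compared to first-order methods.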

[389] Bayesian Neighborhood Adaptation for Graph Neural Networks

Paribesh Regmi, Rui Li, Kishan KC

Main category: cs.LG

TL;DR: A Bayesian framework for adaptively determining neighborhood scopes in Graph Neural Networks using beta processes, improving expressivity and performance on both homophilic and heterophilic graphs.

DetailsMotivation: Current GNNs require time-consuming two-stage approaches to find optimal neighborhood scopes (number of hops) for message aggregation, which is biased by search space design and doesn't adapt well to both homophilic and heterophilic graphs.

Method: Models GNN message-passing as a stochastic process using beta processes to treat the number of hops as a random variable, allowing simultaneous inference of optimal neighborhood scope and optimization of GNN parameters within a Bayesian framework.

Result: Theoretical analysis shows scope inference improves GNN expressivity. Experiments on benchmark datasets demonstrate competitive/superior performance on node classification tasks with well-calibrated predictions, compatible with state-of-the-art GNN variants.

Conclusion: The proposed Bayesian framework enables adaptive neighborhood scope determination for GNNs, addressing limitations of existing approaches and improving performance across different graph types.

Abstract: The neighborhood scope (i.e., number of hops) where graph neural networks (GNNs) aggregate information to characterize a node’s statistical property is critical to GNNs’ performance. The two-stage approach of training and validating GNNs for every pre-specified neighborhood scope to search for the best setting is time-consuming and tends to be biased due to the search-space design. How to adaptively determine proper neighborhood scopes for the aggregation process for both homophilic and heterophilic graphs remains largely unexplored. We thus propose to model the GNNs’ message-passing behavior on a graph as a stochastic process by modeling the number of hops with a beta process. This Bayesian framework allows us to infer the most plausible neighborhood scope for message aggregation simultaneously with the optimization of GNN parameters. Our theoretical analysis shows that the scope inference improves the expressivity of a GNN. Experiments on benchmark homophilic and heterophilic datasets show that the proposed method is compatible with state-of-the-art GNN variants, achieving competitive or superior performance on the node classification task, and providing well-calibrated predictions. Implementation is available at: https://github.com/paribeshregmi/BNA-GNN
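One common way to place a beta-process-style prior over "how many hops to use" is a stick-breaking construction, where deeper hops receive geometrically less prior mass. The construction below is an illustrative assumption, not the paper's actual posterior inference machinery.

```python
# Stick-breaking prior over hop usage (illustrative sketch).
import random

def hop_inclusion_probs(max_hops, alpha, rng):
    """pi_k = prod_{j<=k} nu_j with nu_j ~ Beta(alpha, 1); pi_k is the
    prior probability that hop k contributes to aggregation."""
    probs, stick = [], 1.0
    for _ in range(max_hops):
        nu = rng.betavariate(alpha, 1.0)
        stick *= nu
        probs.append(stick)
    return probs
```

Because each factor lies in [0, 1], the inclusion probabilities are non-increasing in depth, encoding the prior that nearby hops matter more while still letting data pull mass toward deeper scopes.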

[390] tLoRA: Efficient Multi-LoRA Training with Elastic Shared Super-Models

Kevin Li, Dibyadeep Saha, Avni Kanodia, Fan Lai

Main category: cs.LG

TL;DR: tLoRA enables efficient batch training of multiple LoRA fine-tuning jobs by fusing adapters into a shared super-model with adaptive scheduling and fused kernels.

DetailsMotivation: As LoRA becomes standard for fine-tuning LLMs, shared clusters run many concurrent LoRA training jobs over the same backbone. Current approaches face challenges with heterogeneous adapters (different ranks, batch sizes) causing synchronization stalls and communication overheads worse than independent execution.

Method: tLoRA fuses adapters sharing the same base model into an elastic shared super-model, using distributed training frameworks for parallelism. At kernel level: fused LoRA kernel adaptively reconstructs low-rank computation tiles and schedules rank-aware nano-batches. At scheduling layer: online, residual-capacity-aware scheduler adaptively groups jobs to maximize throughput.

Result: Evaluations using real-world cluster traces show tLoRA improves training throughput by 1.2-1.8x, job training completion time by 2.3-5.4x, and GPU utilization by 37%.

Conclusion: tLoRA provides an efficient framework for batch training multiple LoRA jobs, addressing synchronization and communication challenges in shared clusters while significantly improving throughput and resource utilization.

Abstract: As Low-Rank Adaptation (LoRA) becomes the standard approach for efficiently fine-tuning large language models (LLMs), shared clusters increasingly execute many concurrent LoRA training jobs over the same frozen backbone. While recent advances enable batching (co-locating) multiple adapters during serving, efficient training-time co-location of heterogeneous LoRA adapters presents unique challenges. Jobs often differ in adapter rank, batch size, and resource allocation, and naïve batching can introduce synchronization stalls, communication overheads, and per-job slowdowns that are worse than executing independently. We introduce tLoRA, a framework that enables efficient batch training of multiple LoRA jobs. tLoRA fuses adapters that share the same base model into an elastic shared super-model, exploiting existing distributed training frameworks to derive parallelism plans that share resources effectively. At the kernel level, tLoRA employs a fused LoRA kernel that adaptively reconstructs low-rank computation tiles and schedules rank-aware nano-batches to maximize overlap between computation and communication across adapters. At the scheduling layer, tLoRA incorporates an online, residual-capacity-aware scheduler that adaptively groups jobs to maximize collective throughput. Evaluations using real-world cluster traces demonstrate that tLoRA improves training throughput by 1.2–1.8x, job training completion time by 2.3–5.4x, and GPU utilization by 37%.
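The scheduling-layer idea can be sketched as a capacity-aware grouping policy. The memory-cost proxy (rank times batch size) and the first-fit-decreasing packing below are made-up simplifications; tLoRA's actual scheduler is online and considers residual capacity of running super-models.

```python
# Greedy capacity-aware grouping of LoRA jobs into super-model batches.

def group_jobs(jobs, capacity):
    """jobs: list of (name, rank, batch_size). Packs jobs into groups
    under a shared capacity budget, first-fit decreasing on cost."""
    def cost(j):
        return j[1] * j[2]                    # proxy: rank * batch_size
    groups = []                               # each entry: [residual, [names]]
    for job in sorted(jobs, key=cost, reverse=True):
        for g in groups:
            if cost(job) <= g[0]:             # fits in this group's residual
                g[0] -= cost(job)
                g[1].append(job[0])
                break
        else:
            groups.append([capacity - cost(job), [job[0]]])
    return [g[1] for g in groups]
```

Grouping jobs with heterogeneous ranks this way is what creates the synchronization and tiling problems the fused kernel then has to solve with rank-aware nano-batches.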

[391] Riemannian MeanFlow

Dongyeop Woo, Marta Skreta, Seonghyun Park, Kirill Neklyudov, Sungsoo Ahn

Main category: cs.LG

TL;DR: Riemannian MeanFlow (RMF) enables efficient generative modeling on manifolds with single-step inference, achieving comparable quality to diffusion models while requiring 10x fewer function evaluations.

DetailsMotivation: Diffusion and flow models on Riemannian manifolds require many neural network evaluations at inference time, creating computational bottlenecks for large-scale scientific sampling workflows like protein backbone generation and DNA sequence design.

Method: Introduces Riemannian MeanFlow (RMF) framework for learning flow maps directly on manifolds, with three equivalent characterizations of manifold average velocity (Eulerian, Lagrangian, and semigroup identities), plus parameterizations and stabilization techniques for high-dimensional manifolds.

Result: RMF achieves comparable sample quality to prior methods in promoter DNA design and protein backbone generation while requiring up to 10x fewer function evaluations. Few-step flow maps enable efficient reward-guided design through reward look-ahead.

Conclusion: RMF provides an efficient alternative to diffusion models for manifold-based generative modeling, enabling high-quality generation with minimal inference cost and facilitating reward-guided design applications.

Abstract: Diffusion and flow models have become the dominant paradigm for generative modeling on Riemannian manifolds, with successful applications in protein backbone generation and DNA sequence design. However, these methods require tens to hundreds of neural network evaluations at inference time, which can become a computational bottleneck in large-scale scientific sampling workflows. We introduce Riemannian MeanFlow (RMF), a framework for learning flow maps directly on manifolds, enabling high-quality generations with as few as one forward pass. We derive three equivalent characterizations of the manifold average velocity (Eulerian, Lagrangian, and semigroup identities), and analyze parameterizations and stabilization techniques to improve training on high-dimensional manifolds. In promoter DNA design and protein backbone generation settings, RMF achieves comparable sample quality to prior methods while requiring up to 10$\times$ fewer function evaluations. Finally, we show that few-step flow maps enable efficient reward-guided design through reward look-ahead, where terminal states can be predicted from intermediate steps at minimal additional cost.

[392] Thermodynamic Isomorphism of Transformers: A Lagrangian Approach to Attention Dynamics

Gunn Kim

Main category: cs.LG

TL;DR: Transformer attention analyzed through thermodynamic field theory, showing Softmax emerges from minimizing Helmholtz free energy, with attention corresponding to canonical ensemble statistics; experiments on modular addition show critical-like crossover behavior preceding generalization.

DetailsMotivation: To develop a unified theoretical framework for understanding Transformer attention mechanisms using statistical mechanics and thermodynamics, providing principled interpretation of attention scaling, training dynamics, and positional encoding as emergent properties of an effective thermodynamic system.

Method: Constructed field-theoretic framework using Lagrangian on information manifold with Fisher metric; showed Softmax arises as stationary solution minimizing Helmholtz free energy within Shannon-Boltzmann entropy framework; defined effective specific heat for attention energy fluctuations; tested on modular addition tasks (p=19-113).

Result: Established formal correspondence between scaled dot-product attention and canonical ensemble statistics; observed robust peak in fluctuation measure consistently preceding generalization onset; detected reproducible enhancement of energy variance suggesting critical-like crossover during representational reorganization; no asymptotic power-law divergence found in finite-depth regime.

Conclusion: The framework provides unified statistical-mechanical perspective on attention mechanisms, interpreting phenomena as emergent thermodynamic properties; results indicate finite-size crossover behavior rather than strict phase transition, motivating further investigation into scaling limits through fluctuation-based observables.

Abstract: We propose an effective field-theoretic framework for analyzing Transformer attention through a thermodynamic lens. By constructing a Lagrangian on the information manifold equipped with the Fisher metric, we show that, within the Shannon–Boltzmann entropy framework, the Softmax function arises as a stationary solution minimizing a Helmholtz free energy functional. This establishes a formal correspondence between scaled dot-product attention and canonical ensemble statistics. Extending this mapping to macroscopic observables, we define an effective specific heat associated with fluctuations of the attention energy landscape. In controlled experiments on the modular addition task ($p = 19$–$113$), we observe a robust peak in this fluctuation measure that consistently precedes the onset of generalization. While no asymptotic power-law divergence is detected in this finite-depth regime, the reproducible enhancement of energy variance suggests a critical-like crossover accompanying representational reorganization. Our framework provides a unified statistical-mechanical perspective on attention scaling, training dynamics, and positional encoding, interpreting the phenomena as emergent properties of an effective thermodynamic system rather than isolated heuristics. Although the present results indicate finite-size crossover behavior rather than a strict phase transition, they motivate further investigation into scaling limits of deep architectures through fluctuation-based observables.
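
The variational characterization of Softmax invoked here is standard and easy to verify numerically: over the probability simplex, the Helmholtz free energy $F(p) = \langle E \rangle_p - T\,S(p)$ with Shannon entropy $S$ is minimized by the Boltzmann (Softmax) distribution, with minimum value $-T \log Z$. A self-contained check, with arbitrary illustrative energies and temperature:

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=5)           # attention "energies" (e.g. negative scaled dot-products)
T = 1.0                          # temperature (the attention scale plays this role)

def free_energy(p):
    # Helmholtz free energy F = <E>_p - T * S(p), S = Shannon entropy
    return p @ E - T * (-(p * np.log(p)).sum())

p_star = np.exp(-E / T)
p_star /= p_star.sum()           # Softmax of -E/T, i.e. the Boltzmann distribution

# At the minimizer, F equals -T log Z (the canonical-ensemble identity)
assert np.isclose(free_energy(p_star), -T * np.log(np.exp(-E / T).sum()))

# Every other distribution on the simplex has higher free energy
for _ in range(100):
    q = rng.dirichlet(np.ones(5))
    assert free_energy(q) >= free_energy(p_star) - 1e-9
```

Strict convexity of $F$ in $p$ (from the entropy term) makes the Softmax solution unique, which is what licenses reading scaled dot-product attention as a canonical ensemble.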

[393] Epistemic Throughput: Fundamental Limits of Attention-Constrained Inference

Lei You

Main category: cs.LG

TL;DR: The paper introduces Attention-Constrained Inference (ACI) to address the decoder-side bottleneck in AI systems where many candidates are generated but only few can be verified, analyzing epistemic throughput scaling laws.

DetailsMotivation: Modern generative and tool-using AI systems produce many candidates at low cost, but human attention is scarce, creating a bottleneck where only a small fraction can be carefully verified. This requires formalizing how to form reliable posteriors from many records under attention constraints.

Method: Proposes Attention-Constrained Inference (ACI) framework with two stages: cheap screening of K records and expensive verification of at most B records. Analyzes epistemic throughput (reduction in posterior uncertainty per window) under Bayes log-loss, deriving scaling laws.

Result: Derives a “JaKoB” scaling law showing epistemic throughput has a linear baseline term plus an information-leverage term scaling as √(JKB), where J summarizes screening quality. Shows this scaling is tight in weak-screening limit, and that heavy-tailed score distributions are needed for substantial amplification in sparse-verification regimes.

Conclusion: Expanding cheap screening can nonlinearly amplify scarce verification capacity, even when informative records are rare, but requires appropriate score distributions to achieve meaningful information leverage.

Abstract: Recent generative and tool-using AI systems can surface a large volume of candidates at low marginal cost, yet only a small fraction can be checked carefully. This creates a decoder-side bottleneck: downstream decision-makers must form reliable posteriors from many public records under scarce attention. We formalize this regime via Attention-Constrained Inference (ACI), in which a cheap screening stage processes $K$ records and an expensive verification stage can follow up on at most $B$ of them. Under Bayes log-loss, we study the maximum achievable reduction in posterior uncertainty per window, which we call epistemic throughput. Our main result is a "JaKoB" scaling law showing that epistemic throughput has a baseline term that grows linearly with verification and prevalence, and an additional information-leverage term that scales as $\sqrt{JKB}$, where $J$ summarizes screening quality. Thus, expanding cheap screening can nonlinearly amplify scarce verification, even when informative records are rare. We further show that this scaling is tight in a weak-screening limit, and that in the sparse-verification regime ($B \ll K$), substantial leverage requires heavy-tailed score distributions; for light-tailed scores the amplification is only logarithmic.
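
The shape of the scaling law is easy to see with a toy computation. The sketch below uses hypothetical constants (`c_lin`, `c_lev`) in front of the two terms, so it illustrates only the $\sqrt{JKB}$ leverage behavior, not the paper's exact expression:

```python
import math

def epistemic_throughput(K, B, J, prevalence, c_lin=1.0, c_lev=1.0):
    # Linear verification baseline plus sqrt(JKB) information-leverage term.
    # c_lin and c_lev are made-up constants for illustration.
    return c_lin * B * prevalence + c_lev * math.sqrt(J * K * B)

base = epistemic_throughput(K=10_000, B=10, J=0.01, prevalence=1e-3)
more_screen = epistemic_throughput(K=40_000, B=10, J=0.01, prevalence=1e-3)

# Quadrupling cheap screening K doubles the leverage term (sqrt scaling),
# while the linear baseline term is unchanged
lever_base = base - 10 * 1e-3
lever_more = more_screen - 10 * 1e-3
assert math.isclose(lever_more, 2 * lever_base)
```

This is the sense in which cheap screening nonlinearly amplifies a fixed verification budget $B$.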

[394] Rising Multi-Armed Bandits with Known Horizons

Seockbean Song, Chenyu Gan, Youngsik Yoon, Siwei Wang, Wei Chen, Jungseul Ok

Main category: cs.LG

TL;DR: CURE-UCB algorithm for Rising Multi-Armed Bandits that explicitly incorporates horizon information to outperform horizon-agnostic strategies in environments where arm rewards increase with plays.

DetailsMotivation: The paper addresses the underexplored horizon-aware setting in Rising Multi-Armed Bandits (RMAB), where optimal strategies shift dramatically with available budget T, and knowledge of T provides significant utility for decision-making.

Method: Proposes CUmulative Reward Estimation UCB (CURE-UCB) algorithm that explicitly integrates horizon information into the bandit framework, with rigorous analysis establishing new regret bounds.

Result: Theoretical analysis proves CURE-UCB strictly outperforms horizon-agnostic strategies in structured environments like “linear-then-flat” instances, with extensive experiments demonstrating significant superiority over baselines.

Conclusion: Horizon-aware strategies are crucial for RMAB problems, and CURE-UCB provides an effective solution that leverages horizon information to achieve better performance than traditional horizon-agnostic approaches.

Abstract: The Rising Multi-Armed Bandit (RMAB) framework models environments where the expected rewards of arms increase with plays, capturing practical scenarios where the performance of each option improves with repeated usage, such as in robotics and hyperparameter tuning. For instance, in hyperparameter tuning, the validation accuracy of a model configuration (arm) typically increases with each training epoch. A defining characteristic of RMAB is horizon-dependent optimality: unlike standard settings, the optimal strategy here shifts dramatically depending on the available budget $T$. This implies that knowledge of $T$ yields significantly greater utility in RMAB, empowering the learner to align its decision-making with this shifting optimality. However, the horizon-aware setting remains underexplored. To address this, we propose a novel CUmulative Reward Estimation UCB (CURE-UCB) that explicitly integrates the horizon. We provide a rigorous analysis establishing a new regret upper bound and prove that our method strictly outperforms horizon-agnostic strategies in structured environments like "linear-then-flat" instances. Extensive experiments demonstrate its significant superiority over baselines.
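
The horizon-dependence is easy to see on a toy "linear-then-flat" instance: which arm is best to commit to flips with the budget $T$. A minimal sketch (the reward shapes are made up, and this is a property of the instance, not the CURE-UCB algorithm itself):

```python
def cumulative_reward(arm, T):
    """Total reward from committing to one arm for T pulls.

    Arm 0: flat reward of 0.5 per pull.
    Arm 1: "linear-then-flat" - per-pull reward ramps from 0 up to 0.9
    over the first 100 pulls, then stays at 0.9.
    """
    total = 0.0
    for n in range(T):
        total += 0.5 if arm == 0 else min(0.9, 0.009 * n)
    return total

# Short horizon: the flat arm wins (the rising arm hasn't warmed up)
assert cumulative_reward(0, 50) > cumulative_reward(1, 50)
# Long horizon: the rising arm wins (its plateau dominates)
assert cumulative_reward(1, 1000) > cumulative_reward(0, 1000)
```

A horizon-agnostic learner cannot know which side of this flip it is on, which is exactly the gap a horizon-aware algorithm exploits.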

[395] Calibrating an Imperfect Auxiliary Predictor for Unobserved No-Purchase Choice

Jiangkai Xiong, Kalyan Talluri, Hanzhao Wang

Main category: cs.LG

TL;DR: Paper develops calibration methods to convert imperfect black-box auxiliary predictions of outside-option probabilities into statistically valid no-purchase estimates using only purchase data, addressing missing outside-option information in market analysis.

DetailsMotivation: Firms often lack data on key consumer actions like buying from competitors or not buying at all, making market-size and preference estimation difficult with only transaction data. Existing approaches rely on auxiliary data, but these predictors may be biased when trained in different contexts.

Method: Two calibration approaches: 1) Under affine miscalibration in logit space, use simple regression to identify outside-option utility parameters and recover no-purchase probabilities. 2) Under weaker nearly monotone condition, propose rank-based calibration method with finite-sample error bounds that separate auxiliary-predictor quality from utility-learning error.

Result: Methods enable consistent recovery of no-purchase probabilities without collecting new labels for no-purchase events. Error bounds quantify how calibration accuracy affects downstream revenue performance in assortment optimization, with explicit dependence on predictor alignment and utility-learning error.

Conclusion: The calibration methods effectively convert imperfect auxiliary predictions into valid no-purchase estimates using only purchase data, improving market analysis and assortment decisions while providing theoretical guarantees on performance.

Abstract: Firms typically cannot observe key consumer actions: whether customers buy from a competitor, choose not to buy, or even fully consider the firm’s offer. This missing outside-option information makes market-size and preference estimation difficult even in simple multinomial logit (MNL) models, and it is a central obstacle in practice when only transaction data are recorded. Existing approaches often rely on auxiliary market-share, aggregated, or cross-market data. We study a complementary setting in which a black-box auxiliary predictor provides outside-option probabilities, but is potentially biased or miscalibrated because it was trained in a different channel, period, or population, or produced by an external machine-learning system. We develop calibration methods that turn such imperfect predictions into statistically valid no-purchase estimates using purchase-only data from the focal environment. First, under affine miscalibration in logit space, we show that a simple regression identifies outside-option utility parameters and yields consistent recovery of no-purchase probabilities without collecting new labels for no-purchase events. Second, under a weaker nearly monotone condition, we propose a rank-based calibration method and derive finite-sample error bounds that cleanly separate auxiliary-predictor quality from first-stage utility-learning error over observed in-set choices. Our analysis also translates estimation error into downstream decision quality for assortment optimization, quantifying how calibration accuracy affects revenue performance. The bounds provide explicit dependence on predictor alignment and utility-learning error, clarifying when each source dominates. Numerical experiments demonstrate improvements in no-purchase estimation and downstream assortment decisions, and we discuss robust aggregation extensions for combining multiple auxiliary predictors.
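
The first calibration result is easy to illustrate on synthetic data: if the auxiliary predictor is affinely miscalibrated in logit space, ordinary least squares in that space recovers the affine parameters exactly, and inverting the map calibrates the predictions. The sketch below uses ground-truth pairs for clarity, whereas the paper's estimator identifies the parameters from purchase-only data:

```python
import numpy as np

rng = np.random.default_rng(0)
logit = lambda p: np.log(p / (1 - p))
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# True no-purchase probabilities, and a predictor miscalibrated
# affinely in logit space: logit(q) = a + b * logit(p), (a, b) unknown
p_true = rng.uniform(0.05, 0.6, size=500)
a, b = 0.7, 1.3
q_pred = sigmoid(a + b * logit(p_true))

# Simple linear regression in logit space recovers (a, b)
X = np.column_stack([np.ones_like(q_pred), logit(p_true)])
a_hat, b_hat = np.linalg.lstsq(X, logit(q_pred), rcond=None)[0]

# Invert the affine map to calibrate the predictor back to truth
p_recovered = sigmoid((logit(q_pred) - a_hat) / b_hat)
assert np.allclose([a_hat, b_hat], [a, b])
assert np.allclose(p_recovered, p_true)
```

In the noiseless affine case the recovery is exact; the paper's finite-sample bounds quantify what happens when both the predictor and the first-stage utility estimates are imperfect.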

cs.MA

[396] Provably Convergent Actor-Critic in Risk-averse MARL

Yizhou Zhang, Eric Mazumdar

Main category: cs.MA

TL;DR: A novel two-timescale Actor-Critic algorithm for learning stationary policies in infinite-horizon general-sum Markov games using Risk-averse Quantal response Equilibria (RQE) with proven global convergence and finite-sample guarantees.

DetailsMotivation: Learning stationary policies in infinite-horizon general-sum Markov games is a fundamental open problem in MARL. While stationary strategies are practical, computing stationary forms of classic game-theoretic equilibria is computationally intractable, unlike solving single-agent RL or zero-sum games.

Method: Proposes a novel two-timescale Actor-Critic algorithm with fast-timescale actor and slow-timescale critic. Leverages Risk-averse Quantal response Equilibria (RQE), a solution concept from behavioral game theory that incorporates risk aversion and bounded rationality, which possesses strong regularity conditions making it amenable to learning in Markov games.

Result: The algorithm achieves global convergence with finite-sample guarantees. Empirical validation in several environments demonstrates superior convergence properties compared to risk-neutral baselines.

Conclusion: RQE provides a computationally tractable solution concept for learning stationary policies in general-sum Markov games, bridging the gap between theoretical intractability and practical learning algorithms with proven convergence guarantees.

Abstract: Learning stationary policies in infinite-horizon general-sum Markov games (MGs) remains a fundamental open problem in Multi-Agent Reinforcement Learning (MARL). While stationary strategies are preferred for their practicality, computing stationary forms of classic game-theoretic equilibria is computationally intractable – a stark contrast to the comparative ease of solving single-agent RL or zero-sum games. To bridge this gap, we study Risk-averse Quantal response Equilibria (RQE), a solution concept rooted in behavioral game theory that incorporates risk aversion and bounded rationality. We demonstrate that RQE possesses strong regularity conditions that make it uniquely amenable to learning in MGs. We propose a novel two-timescale Actor-Critic algorithm characterized by a fast-timescale actor and a slow-timescale critic. Leveraging the regularity of RQE, we prove that this approach achieves global convergence with finite-sample guarantees. We empirically validate our algorithm in several environments to demonstrate superior convergence properties compared to risk-neutral baselines.

[397] Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

Renjun Xu, Yang Yan

Main category: cs.MA

TL;DR: Survey paper on agent skills - modular, composable packages that extend LLM capabilities without retraining, covering architecture, acquisition, deployment, security, and open challenges.

DetailsMotivation: The transition from monolithic LLMs to modular, skill-equipped agents represents a fundamental shift in AI deployment, enabling dynamic capability extension without model retraining through composable skill packages.

Method: Comprehensive survey organized along four axes: architectural foundations (SKILL.md specification, progressive context loading, MCP integration), skill acquisition (reinforcement learning, autonomous discovery, compositional synthesis), deployment at scale (CUA stack, GUI grounding, benchmarks), and security (vulnerability analysis, Trust Framework).

Result: Identifies that 26.1% of community-contributed skills contain vulnerabilities, proposes Skill Trust and Lifecycle Governance Framework with four-tier permission model, and outlines seven open challenges for trustworthy skill ecosystems.

Conclusion: Agent skills represent an emerging abstraction layer for next-generation agentic systems, requiring focused research on cross-platform portability, capability-based permissions, and trustworthy skill ecosystems.

Abstract: The transition from monolithic language models to modular, skill-equipped agents marks a defining shift in how large language models (LLMs) are deployed in practice. Rather than encoding all procedural knowledge within model weights, agent skills – composable packages of instructions, code, and resources that agents load on demand – enable dynamic capability extension without retraining. This paradigm is formalized through progressive disclosure, portable skill definitions, and integration with the Model Context Protocol (MCP). This survey provides a comprehensive treatment of the agent skills landscape, as it has rapidly evolved during the last few months. We organize the field along four axes: (i) architectural foundations, examining the SKILL.md specification, progressive context loading, and the complementary roles of skills and MCP; (ii) skill acquisition, covering reinforcement learning with skill libraries (SAGE), autonomous skill discovery (SEAgent), and compositional skill synthesis; (iii) deployment at scale, including the computer-use agent (CUA) stack, GUI grounding advances, and benchmark progress on OSWorld and SWE-bench; and (iv) security, where recent empirical analyses reveal that 26.1% of community-contributed skills contain vulnerabilities, motivating our proposed Skill Trust and Lifecycle Governance Framework – a four-tier, gate-based permission model that maps skill provenance to graduated deployment capabilities. We identify seven open challenges – from cross-platform skill portability to capability-based permission models – and propose a research agenda for realizing trustworthy, self-improving skill ecosystems. Unlike prior surveys that broadly cover LLM agents or tool use, this work focuses specifically on the emerging skill abstraction layer and its implications for the next generation of agentic systems. Project repo: https://github.com/scienceaix/agentskills.

[398] Theory of Mind Guided Strategy Adaptation for Zero-Shot Coordination

Andrew Ni, Simon Stepputtis, Stefanos Nikolaidis, Michael Lewis, Katia P. Sycara, Woojun Kim

Main category: cs.MA

TL;DR: Proposes an adaptive ensemble agent for zero-shot coordination in multi-agent RL that uses Theory-of-Mind-based policy selection to infer teammate intentions and choose optimal policies from an ensemble, outperforming single best-response baselines in Overcooked environment.

DetailsMotivation: Current zero-shot coordination methods focus on diversifying training partner pools but produce static generalist policies that don't adapt well to specific teammates. Need for more adaptive specialist policies that can achieve higher synergy through better teammate adaptation.

Method: Proposes adaptive ensemble agent with Theory-of-Mind-based best-response selection: 1) infers teammate’s intentions, 2) selects most suitable policy from a policy ensemble. Evaluated in Overcooked environment under fully and partially observable settings.

Result: Empirical results demonstrate superiority over single best-response baseline in zero-shot coordination performance. The adaptive approach achieves better synergy with diverse teammates compared to static generalist policies.

Conclusion: Adaptive ensemble agents with Theory-of-Mind-based policy selection outperform traditional best-response approaches in zero-shot coordination, enabling better adaptation to unseen teammates through intention inference and specialized policy selection.

Abstract: A central challenge in multi-agent reinforcement learning is enabling agents to adapt to previously unseen teammates in a zero-shot fashion. Prior work in zero-shot coordination often follows a two-stage process, first generating a diverse training pool of partner agents, and then training a best-response agent to collaborate effectively with the entire training pool. While many previous works have achieved strong performance by devising better ways to diversify the partner agent pool, there has been less emphasis on how to leverage this pool to build an adaptive agent. One limitation is that the best-response agent may converge to a static, generalist policy that performs reasonably well across diverse teammates, rather than learning a more adaptive, specialist policy that can better adapt to teammates and achieve higher synergy. To address this, we propose an adaptive ensemble agent that uses Theory-of-Mind-based best-response selection to first infer its teammate’s intentions and then select the most suitable policy from a policy ensemble. We conduct experiments in the Overcooked environment to evaluate zero-shot coordination performance under both fully and partially observable settings. The empirical results demonstrate the superiority of our method over a single best-response baseline.

[399] Building Large-Scale Drone Defenses from Small-Team Strategies

Grant Douglas, Stephen Franklin, Claudia Szabo, Mingyu Guo

Main category: cs.MA

TL;DR: A framework for scaling defense strategies against large adversarial drone swarms by integrating proven small-team tactics as modular components, assembled via dynamic programming decomposition.

DetailsMotivation: Defending against large adversarial drone swarms requires coordination methods that scale effectively beyond conventional multi-agent optimization, as existing approaches don't scale well to very large scenarios.

Method: Proposes a framework that integrates proven small-team defense strategies as modular components, uses dynamic programming decomposition to assemble these components into large teams in polynomial time, and iteratively refines the component pool through sampling and evaluation of large-team outcomes.

Result: The partitioning approach scales to substantially larger scenarios while preserving effectiveness and reveals cooperative behaviors that direct optimization cannot reliably discover.

Conclusion: The framework enables efficient construction of scalable defenses against large drone swarms by leveraging modular small-team strategies and dynamic programming assembly, overcoming scalability limitations of conventional multi-agent optimization.

Abstract: Defending against large adversarial drone swarms requires coordination methods that scale effectively beyond conventional multi-agent optimisation. In this paper, we propose to scale strategies proven effective in small defender teams by integrating them as modular components of larger forces using our proposed framework. A dynamic programming (DP) decomposition assembles these components into large teams in polynomial time, enabling efficient construction of scalable defenses without exhaustive evaluation. Because a unit that is strong in isolation may not remain strong when combined, we sample across multiple small-team candidates. Our framework iterates between evaluating large-team outcomes and refining the pool of modular components, allowing convergence on increasingly effective strategies. Experiments demonstrate that this partitioning approach scales to substantially larger scenarios while preserving effectiveness and revealing cooperative behaviours that direct optimisation cannot reliably discover.
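
The DP assembly step resembles a classic partition dynamic program: given evaluated effectiveness scores for a handful of small-team sizes, choose a multiset of unit sizes summing to the full force that maximizes total score, in polynomial time. A toy sketch with a hypothetical score table (the paper's actual component evaluation and iteration loop are richer than this):

```python
def best_partition(n, score):
    """Split n drones into small units, maximizing the sum of per-unit
    scores. `score` maps a unit size to its evaluated effectiveness."""
    best = [0.0] * (n + 1)    # best[t] = max total score using t drones
    choice = [0] * (n + 1)    # size of the last unit in an optimal split
    for total in range(1, n + 1):
        for size, s in score.items():
            if size <= total and best[total - size] + s > best[total]:
                best[total] = best[total - size] + s
                choice[total] = size
    units, t = [], n          # reconstruct the unit sizes
    while t > 0:
        units.append(choice[t])
        t -= choice[t]
    return best[n], sorted(units)

score = {1: 1.0, 2: 2.5, 3: 3.2}   # made-up small-team effectiveness values
value, units = best_partition(7, score)
# Here pairs are the most score-efficient, so the DP favors units of size 2
```

Because a unit's isolated score need not survive composition, the framework re-evaluates the assembled large team and refines the score table, which this static sketch omits.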

[400] Learning Large-Scale Competitive Team Behaviors with Mean-Field Interactions and Online Opponent Modeling

Bhavini Jeloka, Yue Guan, Panagiotis Tsiotras

Main category: cs.MA

TL;DR: MF-MAPPO extends mean-field PPO to zero-sum team games, enabling scalable multi-agent reinforcement learning for thousands of agents by combining intra-team cooperation with inter-team competition.

DetailsMotivation: Existing multi-agent RL algorithms struggle to scale to large populations, and mean-field approaches typically focus only on fully cooperative or purely competitive settings, lacking solutions for mixed cooperative-competitive scenarios.

Method: MF-MAPPO extends PPO with mean-field theory for zero-sum team games, using shared actors and minimally informed critics per team, trained on finite-population simulators, with extension to partial observability via gradient-regularized training.

Result: MF-MAPPO outperforms existing methods on large-scale benchmarks (offense-defense battlefield tasks and population-based rock-paper-scissors games), exhibiting complex heterogeneous behaviors at scale.

Conclusion: Combining mean-field theory with MARL techniques enables effective scaling to thousands of agents in mixed cooperative-competitive settings, with potential for real-world deployment.

Abstract: While multi-agent reinforcement learning (MARL) has been proven effective across both collaborative and competitive tasks, existing algorithms often struggle to scale to large populations of agents. Recent advancements in mean-field (MF) theory provide scalable solutions by approximating population interactions as a continuum, yet most existing frameworks focus exclusively on either fully cooperative or purely competitive settings. To bridge this gap, we introduce MF-MAPPO, a mean-field extension of PPO designed for zero-sum team games that integrate intra-team cooperation with inter-team competition. MF-MAPPO employs a shared actor and a minimally informed critic per team and is trained directly on finite-population simulators, thereby enabling deployment to realistic scenarios with thousands of agents. We further show that MF-MAPPO naturally extends to partially observable settings through a simple gradient-regularized training scheme. Our evaluation uses large-scale benchmark scenarios built on our own simulation platform for MF team games (MFEnv), including offense-defense battlefield tasks as well as variants of population-based rock-paper-scissors games that admit analytical solutions. Across these benchmarks, MF-MAPPO outperforms existing methods and exhibits complex, heterogeneous behaviors, demonstrating the effectiveness of combining mean-field theory and MARL techniques at scale.

[401] Bayesian Ego-graph Inference for Networked Multi-Agent Reinforcement Learning

Wei Duan, Jie Lu, Junyu Xuan

Main category: cs.MA

TL;DR: BayesG: A decentralized MARL framework using Bayesian variational inference to learn dynamic communication graphs over physical networks, enabling agents to jointly learn interaction topology and decision-making strategies.

DetailsMotivation: Networked-MARL faces limitations with static neighborhood assumptions and centralized approaches that require global state access. Real-world decentralized systems need adaptive communication structures that can handle dynamic or heterogeneous environments without centralized infrastructure.

Method: Proposes stochastic graph-based policy where agents condition decisions on sampled subgraphs over local neighborhoods. Introduces BayesG framework with decentralized actors that learn sparse, context-aware interaction structures via Bayesian variational inference. Each agent operates over ego-graphs, samples latent communication masks for message passing, and trains variational distribution end-to-end with policy using ELBO objective.

Result: BayesG outperforms strong MARL baselines on large-scale traffic control tasks with up to 167 agents, demonstrating superior scalability, efficiency, and performance compared to existing methods.

Conclusion: BayesG provides an effective decentralized solution for Networked-MARL that can learn dynamic communication structures alongside policies, addressing limitations of static neighborhood assumptions and centralized approaches in real-world applications.

Abstract: In networked multi-agent reinforcement learning (Networked-MARL), decentralized agents must act under local observability and constrained communication over fixed physical graphs. Existing methods often assume static neighborhoods, limiting adaptability to dynamic or heterogeneous environments. While centralized frameworks can learn dynamic graphs, their reliance on global state access and centralized infrastructure is impractical in real-world decentralized systems. We propose a stochastic graph-based policy for Networked-MARL, where each agent conditions its decision on a sampled subgraph over its local physical neighborhood. Building on this formulation, we introduce BayesG, a decentralized actor-framework that learns sparse, context-aware interaction structures via Bayesian variational inference. Each agent operates over an ego-graph and samples a latent communication mask to guide message passing and policy computation. The variational distribution is trained end-to-end alongside the policy using an evidence lower bound (ELBO) objective, enabling agents to jointly learn both interaction topology and decision-making strategies. BayesG outperforms strong MARL baselines on large-scale traffic control tasks with up to 167 agents, demonstrating superior scalability, efficiency, and performance.

[402] A Simulation of Ageing and Care Accessibility in Italian Inner Areas

Roberto Garrone

Main category: cs.MA

TL;DR: Agent-based model integrating GIS and synthetic populations to analyze how service configurations affect accessibility and caregiver burden in ageing mountainous communities.

DetailsMotivation: Ageing societies face increasing strain on care systems, especially in low-density mountainous areas where sparse services and difficult terrain constrain access to care services.

Method: Spatially explicit agent-based model integrating road-network GIS, synthetic populations derived through Iterative Proportional Fitting, and behavioral heterogeneity. Applied to Premeno (Piedmont, Italy) comparing baseline ambulatory services with relocation scenario at Villa Bernocchi.

Result: Results show aggregate neutrality but pronounced local redistribution of accessibility. Spatial impedance dominates accessibility while behavioral capacity modulates care effort. Sensitivity analysis reveals distinctive properties of complex adaptive social systems.

Conclusion: Computational social simulation can highlight policy trade-offs between spatial efficiency, social equity, and care sustainability in ageing territories, demonstrating emergence, heterogeneity, and feedback in complex adaptive systems.

Abstract: Ageing societies face increasing strain on formal and informal care systems, particularly in low-density mountainous municipalities where sparse services and steep terrain constrain access. This study presents a spatially explicit agent-based model that integrates a road-network GIS, synthetic populations derived through Iterative Proportional Fitting, and behavioural heterogeneity to examine how alternative service configurations shape accessibility and caregiver burden. The model, applied to Premeno (Piedmont, Italy), compares a baseline distribution of ambulatory services with a relocation scenario at Villa Bernocchi. System-level indicators (Caregiver Effort, Overwhelmed Caregivers, Hours Not Cared, Walkability) and micro-spatial metrics (Walkability, Detour Ratio, Proximity) are analysed across 40 batches and 50 stochastic replications per scenario. Results reveal aggregate neutrality but pronounced local redistribution of accessibility. Sensitivity analysis shows that spatial impedance dominates accessibility, whereas behavioural capacity modulates care effort. The findings illustrate distinctive properties of complex adaptive social systems - emergence, heterogeneity, and feedback - demonstrating how computational social simulation can highlight policy trade-offs between spatial efficiency, social equity, and care sustainability in ageing territories.
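Iterative Proportional Fitting, used above to build the synthetic population, is a concrete and well-known algorithm: a seed contingency table is alternately rescaled until its row and column sums match known (e.g. census) marginals. A minimal 2-D sketch, not the study's implementation:

```python
def ipf(table, row_targets, col_targets, iters=100, tol=1e-9):
    """Iterative Proportional Fitting: rescale a seed table so its row and
    column sums match the given marginal totals."""
    t = [row[:] for row in table]
    for _ in range(iters):
        # Fit row marginals.
        for i, target in enumerate(row_targets):
            s = sum(t[i])
            if s > 0:
                t[i] = [v * target / s for v in t[i]]
        # Fit column marginals.
        for j, target in enumerate(col_targets):
            s = sum(t[i][j] for i in range(len(t)))
            if s > 0:
                for i in range(len(t)):
                    t[i][j] *= target / s
        # Converged when row sums are back on target as well.
        if all(abs(sum(t[i]) - rt) < tol for i, rt in enumerate(row_targets)):
            break
    return t
```

The fitted cell counts are then sampled to instantiate agents with jointly plausible attribute combinations (age by household type, say).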

[403] MASPRM: Multi-Agent System Process Reward Model

Milad Yazdani, Mahdi Mostajabdaveh, Zirui Zhou, Ying Xiong

Main category: cs.MA

TL;DR: MASPRM is a process reward model for multi-agent systems that guides inference-time search using partial transcript values trained from terminal rewards without human annotations.

DetailsMotivation: Multi-agent systems need efficient inference-time computation to improve quality, requiring methods that can selectively allocate compute to promising search branches without relying on human step-level annotations.

Method: Train a process reward model (MASPRM) from multi-agent MCTS rollouts using only terminal outcome rewards, propagate returns to local targets, then use MASPRM to guide step-level beam search and MCTS during inference by scoring partial inter-agent transcripts.

Result: MASPRM improves Hit@1 over policy likelihood by up to +13.4 points and reduces Hit@1→Hit@5 gaps by up to 10.3 points across GSM8K, MATH, MMLU, and LogiQA benchmarks.

Conclusion: MASPRM effectively guides inference-time search in multi-agent systems using process rewards trained from terminal outcomes, improving performance while reducing compute waste.

Abstract: Practical deployment of multi-agent systems (MAS) demands strong performance at test time, motivating methods that guide search during inference and selectively spend compute to improve quality. We present the Multi-Agent System Process Reward Model (MASPRM). It assigns values to partial inter-agent transcripts for each action and each agent, and acts as a controller during inference. MASPRM is trained from multi-agent Monte Carlo Tree Search (MCTS) rollouts labeled only with terminal outcome rewards, without requiring human step-level annotations, by propagating returns to local targets. During inference, MASPRM guides step-level beam search (SBS) and MCTS, focusing computation on promising branches and pruning unpromising ones. We train and test MASPRM across different tasks and domains, using GSM8K, MATH, MMLU, and LogiQA as benchmarks. Averaged across these benchmarks, MASPRM improves Hit@1 over policy likelihood by up to $+13.4$ points and improves ranking quality, reducing Hit@1$\to$Hit@5 gaps by up to $10.3$ points. MASPRM complements inference-time search by scoring intermediate routed transcripts to guide rollouts in MAS with fixed schedules. Code: https://github.com/milad1378yz/MASPRM
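The "propagating returns to local targets" step can be sketched in a few lines. This is an assumed discounted-backup formulation, not necessarily MASPRM's exact target construction: a single terminal outcome reward is backed up along the transcript so every intermediate step gets a training value without any human step-level label.

```python
def propagate_returns(num_steps, terminal_reward, gamma=0.95):
    """Back up one terminal outcome reward into per-step value targets
    along a transcript (sketch of outcome-only supervision)."""
    targets = [0.0] * num_steps
    g = terminal_reward
    for t in reversed(range(num_steps)):
        targets[t] = g
        g *= gamma  # earlier steps receive a discounted share
    return targets
```

A process reward model regressed onto such targets can then score partial transcripts during step-level beam search, keeping only the highest-value branches.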

cs.MM

eess.AS

[404] AcoustiVision Pro: An Open-Source Interactive Platform for Room Impulse Response Analysis and Acoustic Characterization

Mandip Goswami

Main category: eess.AS

TL;DR: AcoustiVision Pro is an open-source web platform for comprehensive room impulse response analysis with interactive 3D visualization, acoustic parameter computation, and standards compliance checking.

DetailsMotivation: Despite standardized acoustic metrics, there's a lack of accessible tools combining rigorous signal processing with intuitive visualization for room acoustics analysis across various domains.

Method: Developed a web-based platform that computes 12 acoustic parameters from RIRs, provides interactive 3D visualizations of early reflections, generates waterfall plots, and checks compliance against international standards. Includes datasets of simulated RIRs with metadata.

Result: Created an open-source platform with real-time auralization, PDF report generation, CSV export, and demonstrated utility across classroom acoustics, healthcare facility design, and recording studio evaluation.

Conclusion: AcoustiVision Pro provides a comprehensive, accessible tool for room acoustics analysis that bridges the gap between rigorous signal processing and intuitive visualization for diverse applications.

Abstract: Room acoustics analysis plays a central role in architectural design, audio engineering, speech intelligibility assessment, and hearing research. Despite the availability of standardized metrics such as reverberation time, clarity, and speech transmission index, accessible tools that combine rigorous signal processing with intuitive visualization remain scarce. This paper presents AcoustiVision Pro, an open-source web-based platform for comprehensive room impulse response (RIR) analysis. The system computes twelve distinct acoustic parameters from uploaded or dataset-sourced RIRs, provides interactive 3D visualizations of early reflections, generates frequency-dependent decay characteristics through waterfall plots, and checks compliance against international standards including ANSI S12.60 and ISO 3382. We introduce the accompanying RIRMega and RIRMega Speech datasets hosted on Hugging Face, containing thousands of simulated room impulse responses with full metadata. The platform supports real-time auralization through FFT-based convolution, exports detailed PDF reports suitable for engineering documentation, and provides CSV data export for further analysis. We describe the mathematical foundations underlying each acoustic metric, detail the system architecture, and present preliminary case studies demonstrating the platform’s utility across diverse application domains including classroom acoustics, healthcare facility design, and recording studio evaluation.
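Most of the decay-based parameters mentioned above (reverberation time, clarity) start from Schroeder backward integration of the RIR's energy. A minimal sketch of that common first step, assuming a simple list-of-samples RIR (not the platform's actual code):

```python
import math

def schroeder_decay_db(rir):
    """Schroeder backward integration: the energy decay curve in dB,
    from which RT60 and clarity metrics are typically derived."""
    energy = [x * x for x in rir]
    total = sum(energy)
    edc, running = [], total
    for e in energy:
        # Remaining energy from this sample to the end, relative to total.
        edc.append(10.0 * math.log10(max(running / total, 1e-12)))
        running -= e
    return edc
```

RT60 is then usually read off by fitting a line to a portion of this curve (e.g. the -5 dB to -25 dB range for T20) and extrapolating to -60 dB.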

[405] Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR

Jaeyoung Lee, Masato Mimura

Main category: eess.AS

TL;DR: Decoder-only Conformer ASR model with modality-aware sparse MoE that processes speech and text in a single stack without external encoders or pretrained LLMs, achieving better accuracy than AED baselines with fewer parameters.

DetailsMotivation: To create a simpler ASR architecture that eliminates the need for external speech encoders or pretrained LLMs, while maintaining or improving accuracy through efficient modality-aware processing.

Method: Uses decoder-only Conformer with modality-aware sparse mixture of experts (MoE) - separate expert pools for speech and text with hard routing and top-1 selection. Employs hybrid-causality Conformer blocks (bidirectional for speech, causal for text). Training combines CTC on speech positions with label-smoothed cross-entropy for text generation.

Result: 113M-parameter model outperforms 139M AED baseline on Librispeech (2.8% vs 3.2% test-clean; 5.6% vs 6.0% test-other). On Common Voice 16.1 multilingual (5 languages), reduces average WER from 12.2% to 10.6%.

Conclusion: First randomly initialized decoder-only ASR that surpasses strong AED baselines through modality-aware routing and sparse MoE, achieving better accuracy with fewer active parameters and without alignment/adaptation modules.

Abstract: We present a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack without external speech encoders or pretrained large language models (LLM). The model uses a modality-aware sparse mixture of experts (MoE): disjoint expert pools for speech and text with hard routing and top-1 selection, embedded in hybrid-causality Conformer blocks (bidirectional for speech, causal for text). Training combines CTC on speech positions with label-smoothed cross-entropy for text generation. Our 113M-parameter model consistently improves WER over a 139M AED baseline on Librispeech (2.8% vs. 3.2% test-clean; 5.6% vs. 6.0% test-other). On Common Voice 16.1 with a single multilingual model across five languages, our approach reduces average WER from 12.2% to 10.6%. To our knowledge, this is the first randomly initialized decoder-only ASR that surpasses strong AED baselines via modality-aware routing and sparse MoE, achieving better accuracy with fewer active parameters and without alignment/adaptation modules.
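The text-side half of the training objective, label-smoothed cross-entropy, is standard enough to sketch. This is a generic per-token version with uniform smoothing over the vocabulary (one common variant; the paper's exact weighting is an assumption):

```python
import math

def label_smoothed_ce(log_probs, target, eps=0.1):
    """Label-smoothed cross-entropy for one token: (1 - eps) mass on the
    target class, eps spread uniformly over the whole vocabulary."""
    vocab = len(log_probs)
    smooth = eps / vocab
    loss = 0.0
    for k, lp in enumerate(log_probs):
        w = (1.0 - eps) + smooth if k == target else smooth
        loss -= w * lp
    return loss
```

In the paper's setup this term supervises text positions while CTC supervises speech positions, with both losses applied to the same single-stack model.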

[406] A two-step approach for speech enhancement in low-SNR scenarios using cyclostationary beamforming and DNNs

Giovanni Bologni, Nicolás Arrieta Larraza, Richard Heusdens, Richard C. Hendriks

Main category: eess.AS

TL;DR: A framework combining cyclostationarity-aware preprocessing (cMPDR beamformer) with lightweight DNNs for speech enhancement in harmonic noise, showing better performance than end-to-end DNNs at low SNRs.

DetailsMotivation: DNNs struggle with noise suppression at low SNRs, especially for harmonic noise. The paper aims to improve speech enhancement by incorporating signal priors about cyclostationary noise rather than just increasing model capacity.

Method: Proposes a pipeline with cyclic minimum power distortionless response (cMPDR) spectral beamformer as preprocessing to suppress harmonic noise components, followed by lightweight DNN denoising (CRNN or ULCNet). The preprocessing exploits spectral correlations of cyclostationary noise without modifying DNN architecture.

Result: Experiments on synthetic data and real-world rotating machinery noise show consistent improvements over end-to-end DNN baselines, especially at low SNRs. A parameter-efficient CRNN with cMPDR preprocessing outperforms larger ULCNet on raw or Wiener-filtered inputs.

Conclusion: Explicitly incorporating cyclostationarity as a signal prior is more effective than increasing model capacity alone for suppressing harmonic interference in speech enhancement.

Abstract: Deep Neural Networks (DNNs) often struggle to suppress noise at low signal-to-noise ratios (SNRs). This paper addresses speech enhancement in scenarios dominated by harmonic noise and proposes a framework that integrates cyclostationarity-aware preprocessing with lightweight DNN-based denoising. A cyclic minimum power distortionless response (cMPDR) spectral beamformer is used as a preprocessing block. It exploits the spectral correlations of cyclostationary noise to suppress harmonic components prior to learning-based enhancement and does not require modifications to the DNN architecture. The proposed pipeline is evaluated in a single-channel setting using two DNN architectures: a simple and lightweight convolutional recurrent neural network (CRNN), and a state-of-the-art model, namely ultra-low complexity network (ULCNet). Experiments on synthetic data and real-world recordings dominated by rotating machinery noise demonstrate consistent improvements over end-to-end DNN baselines, particularly at low SNRs. Remarkably, a parameter-efficient CRNN with cMPDR preprocessing surpasses the performance of the larger ULCNet operating on raw or Wiener-filtered inputs. These results indicate that explicitly incorporating cyclostationarity as a signal prior is more effective than increasing model capacity alone for suppressing harmonic interference.
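For context, the classical (non-cyclic) MPDR beamformer underlying this preprocessing solves a distortionless minimum-power problem; the cyclic variant extends the covariance structure with the spectral correlations of cyclostationary noise (details are in the paper). The standard MPDR solution is

```latex
\min_{\mathbf{w}} \; \mathbf{w}^{\mathsf{H}} \mathbf{R} \, \mathbf{w}
\quad \text{s.t.} \quad \mathbf{w}^{\mathsf{H}} \mathbf{d} = 1
\qquad \Longrightarrow \qquad
\mathbf{w}_{\mathrm{MPDR}} = \frac{\mathbf{R}^{-1}\mathbf{d}}{\mathbf{d}^{\mathsf{H}} \mathbf{R}^{-1} \mathbf{d}},
```

where $\mathbf{R}$ is the noisy-signal covariance and $\mathbf{d}$ the steering vector toward the target; minimizing output power while passing the target undistorted is what suppresses the correlated harmonic components before the DNN sees the signal.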

[407] CUHK-EE Systems for the vTAD Challenge at NCMMSC 2025

Aemon Yat Fei Chiu, Jingyu Li, Yusheng Tian, Guangyan Zhang, Tan Lee

Main category: eess.AS

TL;DR: The paper presents voice timbre attribute detection systems using WavLM-Large embeddings with attentive statistical pooling and Diff-Net variants for comparing timbre attribute intensities between utterance pairs.

DetailsMotivation: To develop robust systems for voice timbre attribute detection that can accurately compare timbre intensities between utterance pairs, addressing challenges in fine-grained speaker modeling and generalization.

Method: Uses WavLM-Large embeddings with attentive statistical pooling (ASTP) to extract speaker representations, followed by two Diff-Net variants: Feed-Forward Neural Network (FFN) and Squeeze-and-Excitation-enhanced Residual FFN (SE-ResFFN) for comparing timbre attributes.

Result: WavLM-Large+FFN achieved 77.96% accuracy and 21.79% EER for unseen speakers, while WavLM-Large+SE-ResFFN achieved 94.42% accuracy and 5.49% EER for seen speakers, showing a trade-off between model complexity and generalization.

Conclusion: Architectural choices significantly impact fine-grained speaker modeling, with simpler models generalizing better to unseen speakers while complex models excel on seen data. Future work should address speaker identity, annotation subjectivity, and data imbalance for improved robustness and fairness.

Abstract: This paper presents the Voice Timbre Attribute Detection (vTAD) systems developed by the Digital Signal Processing & Speech Technology Laboratory (DSP&STL) of the Department of Electronic Engineering (EE) at The Chinese University of Hong Kong (CUHK) for the 20th National Conference on Human-Computer Speech Communication (NCMMSC 2025) vTAD Challenge. The proposed systems leverage WavLM-Large embeddings with attentive statistical pooling (ASTP) to extract robust speaker representations, followed by two variants of Diff-Net, i.e., Feed-Forward Neural Network (FFN) and Squeeze-and-Excitation-enhanced Residual FFN (SE-ResFFN), to compare timbre attribute intensities between utterance pairs. Experimental results demonstrate that the WavLM-Large+FFN system generalises better to unseen speakers, achieving 77.96% accuracy and 21.79% equal error rate (EER), while the WavLM-Large+SE-ResFFN model excels in the ‘Seen’ setting with 94.42% accuracy and 5.49% EER. These findings highlight a trade-off between model complexity and generalisation, and underscore the importance of architectural choices in fine-grained speaker modelling. Our analysis also reveals the impact of speaker identity, annotation subjectivity, and data imbalance on system performance, pointing to future directions for improving robustness and fairness in timbre attribute detection.
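The equal error rate reported above is a threshold-sweep statistic and is easy to sketch. A minimal, generic implementation over raw scores (not the challenge's official scoring tool):

```python
def equal_error_rate(pos_scores, neg_scores):
    """Sweep a decision threshold over all observed scores and return the
    operating point where false-accept and false-reject rates meet."""
    best_gap, eer = 2.0, None
    for th in sorted(set(pos_scores + neg_scores)):
        frr = sum(1 for s in pos_scores if s < th) / len(pos_scores)
        far = sum(1 for s in neg_scores if s >= th) / len(neg_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer
```

On a finite score set the two error curves rarely cross exactly, so the rate at the closest crossing is returned; interpolated variants exist but are omitted here.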

[408] Tuberculosis Screening from Cough Audio: Baseline Models, Clinical Variables, and Uncertainty Quantification

George P. Kafentzis, Efstratios Selisios

Main category: eess.AS

TL;DR: A standardized framework for automatic tuberculosis detection from cough audio and clinical data with reproducible pipeline and consistent evaluation metrics.

DetailsMotivation: Current TB screening from audio studies lack comparability due to varying datasets, methods, and metrics, making it hard to measure real progress and distinguish modeling advances from data/evaluation differences.

Method: Established a strong baseline using cough recordings and clinical metadata from a multi-country dataset with reproducible end-to-end pipeline covering feature extraction, multimodal fusion, cougher-independent evaluation, and uncertainty quantification.

Result: Provides a standardized framework with consistent clinically relevant metrics, quantifies performance for audio-only and multimodal (audio + clinical) models, and releases full experimental protocol for benchmarking.

Conclusion: The baseline serves as a common reference point to reduce methodological variance and facilitate fair comparison, addressing current limitations in TB audio screening research.

Abstract: In this paper, we propose a standardized framework for automatic tuberculosis (TB) detection from cough audio and routinely collected clinical data using machine learning. While TB screening from audio has attracted growing interest, progress is difficult to measure because existing studies vary substantially in datasets, cohort definitions, feature representations, model families, validation protocols, and reported metrics. Consequently, reported gains are often not directly comparable, and it remains unclear whether improvements stem from modeling advances or from differences in data and evaluation. We address this gap by establishing a strong, well-documented baseline for TB prediction using cough recordings and accompanying clinical metadata from a recently compiled dataset from several countries. Our pipeline is reproducible end-to-end, covering feature extraction, multimodal fusion, cougher-independent evaluation, and uncertainty quantification, and it reports a consistent suite of clinically relevant metrics to enable fair comparison. We further quantify performance for cough audio-only and fused (audio + clinical metadata) models, and release the full experimental protocol to facilitate benchmarking. This baseline is intended to serve as a common reference point and to reduce methodological variance that currently holds back progress in the field.
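The "cougher-independent evaluation" above is a grouped split: every recording from the same participant must land on the same side, otherwise the model can shortcut by recognizing the cougher rather than the disease. A minimal sketch of that constraint (illustrative, not the released pipeline):

```python
def grouped_split(sample_groups, test_groups):
    """Cougher-independent split: assign indices by participant ID so no
    participant appears in both train and test."""
    test_groups = set(test_groups)
    train_idx = [i for i, g in enumerate(sample_groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(sample_groups) if g in test_groups]
    return train_idx, test_idx
```

In practice a library routine such as scikit-learn's GroupKFold enforces the same constraint across folds.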

eess.IV

[409] VineetVC: Adaptive Video Conferencing Under Severe Bandwidth Constraints Using Audio-Driven Talking-Head Reconstruction

Vineet Kumar Rakesh, Soumya Mazumdar, Tapas Samanta, Hemendra Kumar Pandey, Amitabha Das, Sarbajit Pal

Main category: eess.IV

TL;DR: Adaptive video conferencing system that switches to audio-driven talking-head synthesis during bandwidth constraints, reducing bandwidth from typical video streams to ~32.8 kbps while maintaining communication.

DetailsMotivation: Bandwidth depletion in consumer networks undermines real-time video conferencing stability, causing encoder saturation, packet loss, frame rate deterioration, and increased latency. Need for adaptive solutions that maintain communication quality under constrained network conditions.

Method: Integrates WebRTC media delivery with supplementary audio-driven talking-head reconstruction pathway and telemetry-driven mode regulation. System includes WebSocket signaling, optional SFU for multi-party transmission, browser client with real-time WebRTC statistics extraction, and AI REST service that processes reference face image and audio to synthesize MP4 video. Implements bandwidth-mode switching strategy and client-side logging.

Result: System can substitute outbound camera track with synthesized talking-head stream using median bandwidth of 32.80 kbps, significantly reducing bandwidth requirements compared to traditional video streaming while maintaining visual communication.

Conclusion: Adaptive conferencing system with audio-driven talking-head synthesis provides effective solution for maintaining video communication quality under bandwidth-constrained network conditions, with substantial bandwidth reduction capabilities.

Abstract: Intense bandwidth depletion within consumer and constrained networks has the potential to undermine the stability of real-time video conferencing: encoder rate management becomes saturated, packet loss escalates, frame rates deteriorate, and end-to-end latency significantly increases. This work delineates an adaptive conferencing system that integrates WebRTC media delivery with a supplementary audio-driven talking-head reconstruction pathway and telemetry-driven mode regulation. The system consists of a WebSocket signaling service, an optional SFU for multi-party transmission, a browser client capable of real-time WebRTC statistics extraction and CSV telemetry export, and an AI REST service that processes a reference face image and recorded audio to produce a synthesized MP4; the browser can substitute its outbound camera track with the synthesized stream with a median bandwidth of 32.80 kbps. The solution incorporates a bandwidth-mode switching strategy and a client-side mode-state logger.
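The bandwidth-mode switching strategy is easiest to picture with hysteresis, so the client does not flap between modes near a single threshold. A hypothetical sketch (the thresholds and class name are assumptions; the paper's exact policy and telemetry inputs may differ):

```python
class ModeSwitcher:
    """Hysteresis-based mode controller: fall back to the synthesized
    talking-head stream when estimated uplink bandwidth drops below
    low_kbps, and restore live video only once it exceeds high_kbps."""

    def __init__(self, low_kbps=64.0, high_kbps=256.0):
        self.low, self.high = low_kbps, high_kbps
        self.mode = "video"

    def update(self, est_kbps):
        if self.mode == "video" and est_kbps < self.low:
            self.mode = "synthesized"
        elif self.mode == "synthesized" and est_kbps > self.high:
            self.mode = "video"
        return self.mode
```

The gap between the two thresholds absorbs short-term bandwidth jitter reported by the WebRTC statistics, so a momentary dip does not trigger a round trip through both modes.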

[410] Quantum walk inspired JPEG compression of images

Abhishek Verma, Sahil Tomar, Sandeep Kumar

Main category: eess.IV

TL;DR: Quantum-inspired optimization improves JPEG compression by learning optimal quantization tables through quantum walk search, achieving better image quality while maintaining JPEG compatibility.

DetailsMotivation: To enhance classical JPEG compression by introducing an optimized quantization table using quantum-inspired optimization techniques, improving compression efficiency and image quality while maintaining backward compatibility.

Method: Proposes a quantum-inspired adaptive quantization framework using Quantum Walk Inspired Optimization (QWIO) to search continuous parameter space of frequency band scaling factors under a unified rate-distortion objective that jointly considers reconstruction fidelity and compression efficiency.

Result: Experimental results on MNIST, CIFAR10, and ImageNet subsets show average gains of 3-6 dB PSNR, better structural preservation of edges, contours, and luminance transitions, while maintaining JPEG decoder compatibility.

Conclusion: The quantum-inspired optimization framework successfully enhances JPEG compression with significant quality improvements while remaining JPEG-compliant and deployable using accessible scientific packages.

Abstract: This work proposes a quantum-inspired adaptive quantization framework that enhances classical JPEG compression by introducing a learned, optimized Q-table derived using a Quantum Walk Inspired Optimization (QWIO) search strategy. The optimizer searches a continuous parameter space of frequency-band scaling factors under a unified rate-distortion objective that jointly considers reconstruction fidelity and compression efficiency. The proposed framework is evaluated on MNIST, CIFAR10, and ImageNet subsets, using Peak Signal to Noise Ratio (PSNR), Structural Similarity Index (SSIM), Bits Per Pixel (BPP), and error heatmap visual analysis as evaluation metrics. Experimental results show average gains ranging from 3 to 6 dB PSNR, along with better structural preservation of edges, contours, and luminance transitions, without modifying decoder compatibility. The structure remains JPEG-compliant and can be implemented using accessible scientific packages, making it ideal for deployment and practical research use.
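The search space described above, per-band scaling factors applied to a base Q-table, and the fidelity half of the objective (PSNR) can both be sketched. The banding rule below (Manhattan distance from the DC coefficient) is an illustrative assumption, not the paper's parameterization:

```python
import math

def scale_qtable(base_qtable, band_factors):
    """Scale an 8x8 JPEG quantization table by per-frequency-band factors,
    clamping entries to the valid [1, 255] range."""
    out = []
    for i in range(8):
        row = []
        for j in range(8):
            band = min((i + j) // 6, len(band_factors) - 1)  # coarse banding
            q = base_qtable[i][j] * band_factors[band]
            row.append(max(1, min(255, round(q))))
        out.append(row)
    return out

def psnr(orig, recon, peak=255.0):
    """Peak Signal-to-Noise Ratio between two equal-length pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(orig, recon)) / len(orig)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak * peak / mse)
```

An optimizer such as QWIO would propose band factors, encode/decode with the scaled table, and score each candidate with a weighted sum of PSNR-style distortion and bits-per-pixel rate.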

[411] Visible and Hyperspectral Imaging for Quality Assessment of Milk: Property Characterisation and Identification

Massimo Martinelli, Elena Tomassi, Nafiou Arouna, Morena Gabriele, Laryssa Perez Fabbri, Luisa Pozzo, Giuseppe Conte, Davide Moroni, Laura Pucci

Main category: eess.IV

TL;DR: Visible and hyperspectral imaging combined with machine learning can accurately assess milk quality parameters like polyphenols, antioxidant capacity, and fatty acids without destructive chemical analysis.

DetailsMotivation: Need for rapid, non-destructive, and cost-effective alternatives to conventional chemical analyses for assessing milk quality parameters including nutritional value and food safety indicators.

Method: Used visible (RGB) smartphone imaging and hyperspectral near-infrared imaging on 52 milk samples, correlated with biochemical measurements (polyphenols, antioxidant capacity, fatty acids) using 11 machine learning algorithms including XGBoost and Random Forest.

Result: Visible imaging achieved 100% accuracy for distinguishing fresh vs. 12-day stored samples and antibiotic-treated vs. untreated groups. XGBoost perfectly predicted polyphenols and antioxidant capacity. Hyperspectral imaging achieved >95% accuracy for fatty acid classification and 94.8% for treatment groups.

Conclusion: Both visible and hyperspectral imaging coupled with machine learning are powerful, non-invasive tools for rapid assessment of milk’s chemical and nutritional profiles, demonstrating strong potential for imaging-based milk quality assessment.

Abstract: Rapid and non-destructive assessment of milk quality is crucial to ensuring both nutritional value and food safety. In this study, we investigated the potential of visible and hyperspectral imaging as cost-effective and quick-response alternatives to conventional chemical analyses for characterizing key properties of cow's milk. A total of 52 milk samples were analysed to determine their biochemical composition (polyphenols, antioxidant capacity, and fatty acids) using spectrophotometer methods and standard gas-liquid and high-performance liquid chromatography (GLC/HPLC). Concurrently, visible (RGB) images were captured using a standard smartphone, and hyperspectral data were acquired in the near-infrared range. A comprehensive analytical framework, including eleven different machine learning algorithms, was employed to correlate imaging features with biochemical measurements. Analysis of visible images accurately distinguished between fresh samples and those stored for 12 days (100 percent accuracy) and achieved perfect discrimination between antibiotic-treated and untreated groups (100 percent accuracy). Moreover, image-derived features enabled perfect prediction of the polyphenol content and the antioxidant capacity using an XGBoost model. Hyperspectral imaging further achieved classification accuracies exceeding 95 percent for several individual fatty acids and 94.8 percent for treatment groups using a Random Forest model. These findings demonstrate that both visible and hyperspectral imaging, when coupled with machine learning, are powerful, non-invasive tools for the rapid assessment of milk's chemical and nutritional profiles, highlighting the strong potential of imaging-based approaches for milk quality assessment.

[412] Conference Proceedings of the Inaugural Conference of the International Society for Tractography (IST 2025 Bordeaux)

Flavio Dell'Acqua, Maxime Descoteaux, Graham Little, Laurent Petit, Dogu Baran Aydogan, Stephanie Forkel, Alexander Leemans, Simona Schiavi, Michel Thiebaut de Schotten

Main category: eess.IV

TL;DR: Conference proceedings from the International Society for Tractography 2025 covering neuroanatomy, tractography methods, and clinical applications of diffusion MRI

DetailsMotivation: To document and disseminate the latest research presented at the inaugural IST conference, fostering collaboration across neuroanatomy, tractography methods, and clinical applications

Method: Collection of abstracts from poster, power pitch, and oral sessions presented at the conference, covering various aspects of tractography research

Result: Comprehensive documentation of state-of-the-art research in tractography, diffusion MRI, neurological disorders, deep brain stimulation, and brain development

Conclusion: The conference successfully brought together experts to advance tractography research and establish future directions for the field

Abstract: This collection comprises the abstracts presented during poster, power pitch and oral sessions at the Inaugural Conference of the International Society for Tractography (IST Conference 2025), held in Bordeaux, France, from October 13-16, 2025. The conference was designed to foster meaningful exchange and collaboration between disparate fields. The overall focus was on advancing research, innovation, and community in the common fields of interest: neuroanatomy, tractography methods and scientific/clinical applications of tractography. The included abstracts cover the latest advancements in tractography, Diffusion MRI, and related fields including new work on; neurological and psychiatric disorders, deep brain stimulation targeting, and brain development. This landmark event brought together world-leading experts to discuss critical challenges and chart the future direction of the field.

[413] Lung nodule classification on CT scan patches using 3D convolutional neural networks

Volodymyr Sydorskyi

Main category: eess.IV

TL;DR: Advanced CT scan lung nodule classification system with improved cropping, label filtering, and augmentation techniques achieves state-of-the-art performance for lung cancer detection.

DetailsMotivation: Early lung cancer detection is critical but challenging due to large study volumes, multiple small nodules, and visual assessment difficulties, necessitating automated systems with accurate classification modules.

Method: Three key improvements: (1) advanced CT scan cropping strategy to focus on target nodules and reduce computation, (2) target filtering techniques to remove noisy labels, (3) novel augmentation methods to improve model robustness.

Result: Multiclass model achieved Macro ROC AUC of 0.9176 and Macro F1-score of 0.7658; binary model reached Binary ROC AUC of 0.9383 and Binary F1-score of 0.8668 on LIDC-IDRI dataset, outperforming previous approaches.

Conclusion: The integrated techniques enable a robust classification subsystem within a comprehensive Clinical Decision Support System for lung cancer detection, capable of operating across diverse acquisition protocols and scanner types.

Abstract: Lung cancer remains one of the most common and deadliest forms of cancer worldwide. The likelihood of successful treatment depends strongly on the stage at which the disease is diagnosed. Therefore, early detection of lung cancer represents a critical medical challenge. However, this task poses significant difficulties for thoracic radiologists due to the large number of studies to review, the presence of multiple nodules within the lungs, and the small size of many nodules, which complicates visual assessment. Consequently, the development of automated systems that incorporate highly accurate and computationally efficient lung nodule detection and classification modules is essential. This study introduces three methodological improvements for lung nodule classification: (1) an advanced CT scan cropping strategy that focuses the model on the target nodule while reducing computational cost; (2) target filtering techniques for removing noisy labels; (3) novel augmentation methods to improve model robustness. The integration of these techniques enables the development of a robust classification subsystem within a comprehensive Clinical Decision Support System for lung cancer detection, capable of operating across diverse acquisition protocols, scanner types, and upstream models (segmentation or detection). The multiclass model achieved a Macro ROC AUC of 0.9176 and a Macro F1-score of 0.7658, while the binary model reached a Binary ROC AUC of 0.9383 and a Binary F1-score of 0.8668 on the LIDC-IDRI dataset. These results outperform several previously reported approaches and demonstrate state-of-the-art performance for this task.
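The cropping strategy, centering a fixed-size patch on the target nodule while staying inside the scan, comes down to clamped bounds arithmetic. A minimal sketch (the function name and the shift-to-fit behavior are assumptions about a typical implementation, not the paper's code):

```python
def crop_bounds(center, patch, volume):
    """Clamped [start, stop) bounds for a patch centred on a nodule:
    the window is shifted, not shrunk, to stay inside the volume
    (assumes each patch dimension fits within the volume)."""
    bounds = []
    for c, p, v in zip(center, patch, volume):
        start = max(0, min(c - p // 2, v - p))
        bounds.append((start, start + p))
    return bounds
```

Restricting the classifier's input this way focuses the 3D CNN on the candidate nodule and cuts the volume of voxels it must process per study.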

[414] 3DLAND: 3D Lesion Abdominal Anomaly Localization Dataset

Mehran Advand, Zahra Dehghanian, Navid Faraji, Reza Barati, Seyed Amir Ahmad Safavi-Naini, Hamid R. Rabiee

Main category: eess.IV

TL;DR: 3DLAND is a large-scale 3D medical imaging dataset with over 6,000 CT volumes and 20,000 lesion annotations across 7 abdominal organs, enabling robust evaluation of organ-aware 3D segmentation models.

DetailsMotivation: Existing medical imaging datasets lack comprehensive 3D annotations, multi-organ coverage, and precise lesion-to-organ associations, limiting robust representation learning and clinical applications in medical AI.

Method: A three-phase pipeline integrates automated spatial reasoning, prompt-optimized 2D segmentation, and memory-guided 3D propagation, validated by expert radiologists with surface dice scores >0.75.

Result: Created 3DLAND dataset with over 6,000 contrast-enhanced CT volumes and 20,000+ high-fidelity 3D lesion annotations linked to 7 abdominal organs, establishing a new benchmark for organ-aware 3D segmentation.

Conclusion: 3DLAND enables scalable evaluation of anomaly detection, localization, and cross-organ transfer learning, paving the way for healthcare-oriented AI advancements. Dataset and code are publicly available.

Abstract: Existing medical imaging datasets for abdominal CT often lack three-dimensional annotations, multi-organ coverage, or precise lesion-to-organ associations, hindering robust representation learning and clinical applications. To address this gap, we introduce 3DLAND, a large-scale benchmark dataset comprising over 6,000 contrast-enhanced CT volumes with over 20,000 high-fidelity 3D lesion annotations linked to seven abdominal organs: liver, kidneys, pancreas, spleen, stomach, and gallbladder. Our streamlined three-phase pipeline integrates automated spatial reasoning, prompt-optimized 2D segmentation, and memory-guided 3D propagation, validated by expert radiologists with surface dice scores exceeding 0.75. By providing diverse lesion types and patient demographics, 3DLAND enables scalable evaluation of anomaly detection, localization, and cross-organ transfer learning for medical AI. Our dataset establishes a new benchmark for evaluating organ-aware 3D segmentation models, paving the way for advancements in healthcare-oriented AI. To facilitate reproducibility and further research, the 3DLAND dataset and implementation code are publicly available at https://mehrn79.github.io/3DLAND.

[415] Dual-Phase Cross-Modal Contrastive Learning for CMR-Guided ECG Representations for Cardiovascular Disease Assessment

Laura Alvarez-Florez, Angel Bujalance-Gomez, Femke Raijmakers, Samuel Ruiperez-Campillo, Maarten Z. H. Kolk, Jesse Wiers, Julia Vogt, Erik J. Bekkers, Ivana Išgum, Fleur V. Y. Tjong

Main category: eess.IV

TL;DR: Contrastive learning framework aligns ECG representations with 3D cardiac MRI volumes to extract clinically relevant cardiac phenotypes from ubiquitous ECG data.

DetailsMotivation: ECG is ubiquitous and inexpensive but offers limited insight into cardiac structure/function, while CMR provides detailed evaluation but has limited accessibility. Need to bridge this gap by extracting image-derived phenotypes from ECG.

Method: Contrastive learning framework that aligns ECG representations with 3D CMR volumes at end-diastole and end-systole phases using dual-phase contrastive loss to anchor each ECG jointly with both cardiac phases in shared latent space.

Result: Using over 34,000 ECG-CMR pairs from UK Biobank, demonstrated improved extraction of image-derived phenotypes from ECG (particularly functional parameters: ↑9.2%), while improvements in clinical outcome prediction were modest (↑0.7%).

Conclusion: The framework enables scalable and cost-effective extraction of image-derived traits from ECG, bridging the gap between ubiquitous ECG and detailed CMR imaging.

Abstract: Cardiac magnetic resonance imaging (CMR) offers detailed evaluation of cardiac structure and function, but its limited accessibility restricts use to selected patient populations. In contrast, the electrocardiogram (ECG) is ubiquitous and inexpensive, and provides rich information on cardiac electrical activity and rhythm, yet offers limited insight into underlying cardiac structure and mechanical function. To address this, we introduce a contrastive learning framework that improves the extraction of clinically relevant cardiac phenotypes from ECG by learning from paired ECG-CMR data. Our approach aligns ECG representations with 3D CMR volumes at end-diastole (ED) and end-systole (ES), with a dual-phase contrastive loss to anchor each ECG jointly with both cardiac phases in a shared latent space. Unlike prior methods limited to 2D CMR representations with or without a temporal component, our framework models 3D anatomy at both ED and ES phases as distinct latent representations, enabling flexible disentanglement of structural and functional cardiac properties. Using over 34,000 ECG-CMR pairs from the UK Biobank, we demonstrate improved extraction of image-derived phenotypes from ECG, particularly for functional parameters ($\uparrow$ 9.2%), while improvements in clinical outcome prediction remained modest ($\uparrow$ 0.7%). This strategy could enable scalable and cost-effective extraction of image-derived traits from ECG. The code for this research is publicly available.
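The dual-phase contrastive loss can be illustrated with a standard InfoNCE term applied twice, once per cardiac phase. This is a hedged NumPy sketch: the paper's actual loss, temperature, and embedding dimensions may differ, and `info_nce` / `dual_phase_loss` are names invented here.

```python
import numpy as np

def info_nce(anchors, targets, tau=0.1):
    """Standard InfoNCE over a batch; matching row indices are positives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = a @ t.T / tau                           # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # positives sit on the diagonal

def dual_phase_loss(ecg_emb, ed_emb, es_emb, tau=0.1):
    """Anchor each ECG embedding jointly against both cardiac phases (ED and ES)."""
    return 0.5 * (info_nce(ecg_emb, ed_emb, tau) + info_nce(ecg_emb, es_emb, tau))
```

Keeping ED and ES as distinct targets, rather than averaging them, is what lets the shared latent space separate structural (single-phase) from functional (phase-difference) information.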

[416] Efficient Plug-and-Play method for Dynamic Imaging Via Kalman Smoothing

Benjamin Hawkes, Mike Davies, Victor Elvira, Audrey Repetti

Main category: eess.IV

TL;DR: Proposes a Plug-and-Play Kalman Smoothing ADMM algorithm that combines state-space models with deep learning denoisers for improved computational efficiency in temporal imaging problems.

DetailsMotivation: Traditional Kalman smoothing methods lack expressivity as they don't incorporate spatial prior information. While recent ADMM approaches combine state-space fidelity with sparsity-based priors, there's a need to leverage more powerful deep learning denoisers within this framework for better performance.

Method: Develops a PnP algorithm based on KS-ADMM iterations that efficiently handles state-space models through Kalman smoothing while enabling the use of powerful denoising networks. The approach combines the KS-ADMM method with deep learning to achieve higher expressivity.

Result: Simulations on a 2D+t imaging problem show that the proposed PnP-KS-ADMM algorithm improves computational efficiency over standard PnP-ADMM for large numbers of timesteps.

Conclusion: The proposed method successfully integrates state-space modeling with deep learning denoisers, offering improved computational efficiency for temporal imaging problems with many timesteps.

Abstract: State-space models (SSM) are common in signal processing, where Kalman smoothing (KS) methods are state-of-the-art. However, traditional KS techniques lack expressivity as they do not incorporate spatial prior information. Recently, [1] proposed an ADMM algorithm that handles the state-space fidelity term with KS while regularizing the object via a sparsity-based prior with proximity operators. Plug-and-Play (PnP) methods are a popular class of iterative algorithms that replace proximal operators encoding prior knowledge with powerful denoisers such as deep neural networks. These methods are widely used in image processing, achieving state-of-the-art results. In this work, we build on the KS-ADMM method, combining it with deep learning to achieve higher expressivity. We propose a PnP algorithm based on KS-ADMM iterations, efficiently handling the SSM through KS, while enabling the use of powerful denoising networks. Simulations on a 2D+t imaging problem show that the proposed PnP-KS-ADMM algorithm improves the computational efficiency over standard PnP-ADMM for large numbers of timesteps.
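As a toy illustration of the generic PnP-ADMM pattern the paper builds on (not the proposed PnP-KS-ADMM, whose state-space term is handled by Kalman smoothing), the sketch below denoises a 1D signal, with a moving-average filter standing in for a learned denoiser; all names and parameters are illustrative:

```python
import numpy as np

def pnp_admm(y, denoise, rho=1.0, n_iter=30):
    """PnP-ADMM for min_x 0.5||x - y||^2 + prior(x),
    with the prior's proximal step replaced by a denoiser."""
    x = y.copy()
    z = y.copy()
    u = np.zeros_like(y)
    for _ in range(n_iter):
        x = (y + rho * (z - u)) / (1.0 + rho)  # prox of the data-fidelity term
        z = denoise(x + u)                     # denoiser in place of the prior's prox
        u = u + x - z                          # dual (running residual) update
    return x

def box_denoise(v, k=5):
    # stand-in "denoiser": a moving average; a deep network in practice
    return np.convolve(v, np.ones(k) / k, mode="same")
```

The paper's contribution is to make the x-update an efficient Kalman smoothing pass over all timesteps instead of a per-frame proximal step, which is where the claimed efficiency gain for long sequences comes from.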

[417] A Plug-and-Play Method for Guided Multi-contrast MRI Reconstruction based on Content/Style Modeling

Chinmay Rao, Matthias van Osch, Nicola Pezzotti, Jeroen de Bresser, Mark van Buchem, Laurens Beljaards, Jakob Meineke, Elwin de Weerdt, Huangling Lu, Mariya Doneva, Marius Staring

Main category: eess.IV

TL;DR: PnP-CoSMo: A modular two-stage approach for guided MRI reconstruction that disentangles content/style from multi-contrast images without requiring k-space training data, using largely unpaired image-domain datasets.

DetailsMotivation: Existing end-to-end learning-based guided MRI reconstruction methods require large paired training datasets with raw k-space data and aligned reference images, which is challenging to obtain. The authors aim to develop a more flexible approach that doesn't need k-space training data and can work with largely unpaired image-domain datasets.

Method: Two-stage approach: 1) Learn a content/style model from largely unpaired image-domain datasets that disentangles contrast-independent (content) and contrast-specific (style) factors. 2) Apply this as a plug-and-play operator in iterative reconstruction, replacing aliased content with high-quality content from reference scans, combined with data consistency and corrective processes.

Result: The method shows improved generalizability compared to end-to-end methods on the NYU fastMRI DICOM dataset, and achieves up to 32.6% more acceleration over learning-based non-guided reconstruction for a given SSIM on in-house multi-coil raw datasets.

Conclusion: PnP-CoSMo provides a practical, interpretable framework for guided MRI reconstruction that doesn’t require k-space training data, works with largely unpaired datasets, and offers better acceleration and generalizability than existing methods.

Abstract: Since multiple MRI contrasts of the same anatomy contain redundant information, one contrast can guide the reconstruction of an undersampled subsequent contrast. To this end, several end-to-end learning-based guided reconstruction methods have been proposed. However, a key challenge is the requirement of large paired training datasets comprising raw data and aligned reference images. We propose a modular two-stage approach that does not require any k-space training data, relying solely on image-domain datasets, a large part of which can be unpaired. Additionally, our approach provides an explanatory framework for the multi-contrast problem based on the shared and non-shared generative factors underlying two given contrasts. A content/style model of two-contrast image data is learned from a largely unpaired image-domain dataset and is subsequently applied as a plug-and-play operator in iterative reconstruction. The disentanglement of content and style allows explicit representation of contrast-independent and contrast-specific factors. Consequently, incorporating prior information into the reconstruction reduces to a simple replacement of the aliased content of the reconstruction iterate with high-quality content derived from the reference scan. Combining this component with a data consistency step and introducing a general corrective process for the content yields an iterative scheme. We name this novel approach PnP-CoSMo. Various aspects like interpretability and convergence are explored via simulations. Furthermore, its practicality is demonstrated on the public NYU fastMRI DICOM dataset, showing improved generalizability compared to end-to-end methods, and on two in-house multi-coil raw datasets, offering up to 32.6% more acceleration over learning-based non-guided reconstruction for a given SSIM.
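One building block named in the abstract, the data-consistency step, is standard in MRI reconstruction and can be sketched as re-imposing the acquired k-space samples on the current iterate. This is a single-coil Cartesian simplification; the paper's multi-coil operator is necessarily more involved:

```python
import numpy as np

def data_consistency(x, k_meas, mask):
    """Hard data-consistency step: re-impose acquired k-space samples on the
    current image iterate (single-coil Cartesian sketch; returns a complex image)."""
    k = np.fft.fft2(x)
    k = np.where(mask, k_meas, k)  # keep measured entries where sampled
    return np.fft.ifft2(k)
```

In the PnP-CoSMo loop this step alternates with the content-replacement operator, so the iterate stays consistent with the measurements while its content is pulled toward that of the reference contrast.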

[418] A Synthetic Data-Driven Radiology Foundation Model for Pan-tumor Clinical Diagnosis

Wenhui Lei, Hanyu Chen, Zitian Zhang, Luyang Luo, Qiong Xiao, Yannian Gu, Peng Gao, Yankai Jiang, Ci Wang, Guangtao Wu, Tongjia Xu, Yingjie Zhang, Pranav Rajpurkar, Xiaofan Zhang, Shaoting Zhang, Zhenning Wang

Main category: eess.IV

TL;DR: PASTA is a pan-tumor radiology foundation model trained on synthetic CT data that achieves SOTA on 45/46 oncology tasks and improves radiologist performance in clinical workflows.

DetailsMotivation: Developing robust oncology foundation models is limited by scarce, privacy-restricted annotated medical imaging datasets and high labeling costs.

Method: Created PASTA-Gen synthetic data framework generating 30,000 3D CT scans with lesion masks and structured reports across 10 organ systems, then trained PASTA foundation model on this data.

Result: Achieved state-of-the-art on 45 of 46 oncology tasks; PASTA-AID clinical system increased radiologist throughput 11.1-25.1%, improved sensitivity 17.0-31.4%, reduced segmentation time up to 78.2% and reporting time up to 36.5%.

Conclusion: Established end-to-end synthetic data-driven pipeline for pan-tumor research and clinical translation, demonstrating substantial potential for improving oncology imaging workflows.

Abstract: AI-assisted imaging has made substantial advances in tumor diagnosis and management. However, a major barrier to developing robust oncology foundation models is the scarcity of large-scale, high-quality annotated datasets, which are limited by privacy restrictions and the high cost of manual labeling. To address this gap, we present PASTA, a pan-tumor radiology foundation model built on PASTA-Gen, a synthetic data framework that generated 30,000 3D CT scans with pixel-level lesion masks and structured reports of tumors across ten organ systems. Leveraging this resource, PASTA achieves state-of-the-art performance on 45 of 46 oncology tasks, including non-contrast CT tumor screening, lesion segmentation, structured reporting, tumor staging, survival prediction, and MRI-modality transfer. To assess clinical applicability, we developed PASTA-AID, a clinical decision support system, and ran a retrospective simulated clinical trial across two scenarios. For pan-tumor screening on plain CT with fixed reading time, PASTA-AID increased radiologists’ throughput by 11.1-25.1% and improved sensitivity by 17.0-31.4% and precision by 10.5-24.9%; additionally, in a diagnosis-aid workflow, it reduced segmentation time by up to 78.2% and reporting time by up to 36.5%. Beyond gains in accuracy and efficiency, PASTA-AID narrowed the expertise gap, enabling less-experienced radiologists to approach expert-level performance. Together, this work establishes an end-to-end, synthetic data-driven pipeline spanning data generation, model development, and clinical validation, thereby demonstrating substantial potential for pan-tumor research and clinical translation.

[419] LesionDiffusion: Towards Text-controlled General Lesion Synthesis

Wenhui Lei, Henrui Tian, Linrui Dai, Hanyu Chen, Xiaofan Zhang

Main category: eess.IV

TL;DR: LesionDiffusion: A text-controllable lesion synthesis framework for 3D CT imaging that generates both lesions and corresponding masks using structured lesion reports, improving segmentation performance and generalization.

DetailsMotivation: Fully-supervised lesion recognition methods require expensive annotated datasets. Synthetic lesion generation addresses this but existing models lack scalability, fine-grained control, and ability to generate complex structures.

Method: Uses structured lesion report template for control over attributes. Consists of two components: lesion mask synthesis network (LMNet) and lesion inpainting network (LINet), both guided by lesion attributes and image features.

Result: Significantly improves segmentation performance with strong generalization to unseen lesion types and organs, outperforming current state-of-the-art models. Dataset includes 1,505 annotated CT scans with paired masks and structured reports covering 14 lesion types across 8 organs.

Conclusion: LesionDiffusion provides a scalable, controllable framework for synthetic lesion generation that enhances medical imaging analysis and reduces reliance on expensive annotated datasets.

Abstract: Fully-supervised lesion recognition methods in medical imaging face challenges due to the reliance on large annotated datasets, which are expensive and difficult to collect. To address this, synthetic lesion generation has become a promising approach. However, existing models struggle with scalability, fine-grained control over lesion attributes, and the generation of complex structures. We propose LesionDiffusion, a text-controllable lesion synthesis framework for 3D CT imaging that generates both lesions and corresponding masks. By utilizing a structured lesion report template, our model provides greater control over lesion attributes and supports a wider variety of lesion types. We introduce a dataset of 1,505 annotated CT scans with paired lesion masks and structured reports, covering 14 lesion types across 8 organs. LesionDiffusion consists of two components: a lesion mask synthesis network (LMNet) and a lesion inpainting network (LINet), both guided by lesion attributes and image features. Extensive experiments demonstrate that LesionDiffusion significantly improves segmentation performance, with strong generalization to unseen lesion types and organs, outperforming current state-of-the-art models. Code is available at https://github.com/HengruiTianSJTU/LesionDiffusion.

[420] Active Sampling for MRI-based Sequential Decision Making

Yuning Du, Jingshuai Liu, Rohan Dharmakumar, Sotirios A. Tsaftaris

Main category: eess.IV

TL;DR: Multi-objective reinforcement learning framework for sequential MRI diagnosis from undersampled k-space data, enabling adaptive sampling for multiple diagnostic decisions with minimal samples.

DetailsMotivation: To make MRI affordable as a Point-of-Care device by reducing magnetic field strength and improving sampling strategies, enabling comprehensive sequential diagnostic evaluation from limited k-space samples.

Method: Novel multi-objective reinforcement learning framework with active adaptation during inference to optimally sample k-space. Uses step-wise weighting reward function to identify samples that best contribute to each diagnostic objective.

Result: Achieves competitive diagnostic performance on ACL sprain detection and cartilage thickness loss assessment tasks compared to policy-based benchmarks, while substantially reducing k-space samples.

Conclusion: The approach enables comprehensive sequential MRI diagnosis with minimal sampling, paving the way for affordable Point-of-Care MRI devices.

Abstract: Despite the superior diagnostic capability of Magnetic Resonance Imaging (MRI), its use as a Point-of-Care (PoC) device remains limited by high cost and complexity. To enable such a future by reducing the magnetic field strength, one key approach is to improve sampling strategies. Previous work has shown that it is possible to make diagnostic decisions directly from k-space with fewer samples. Such work shows that single diagnostic decisions can be made, but if we aspire to see MRI as a true PoC, multiple and sequential decisions are necessary while minimizing the number of samples acquired. We present a novel multi-objective reinforcement learning framework enabling comprehensive, sequential, diagnostic evaluation from undersampled k-space data. Our approach during inference actively adapts to sequential decisions to optimally sample. To achieve this, we introduce a training methodology that identifies the samples that contribute the best to each diagnostic objective using a step-wise weighting reward function. We evaluate our approach in two sequential knee pathology assessment tasks: ACL sprain detection and cartilage thickness loss assessment. Our framework achieves diagnostic performance competitive with various policy-based benchmarks on disease detection, severity quantification, and overall sequential diagnosis, while substantially reducing the number of k-space samples acquired. Our approach paves the way for the future of MRI as a comprehensive and affordable PoC device. Our code is publicly available at https://github.com/vios-s/MRI_Sequential_Active_Sampling
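The step-wise weighting reward is not specified in detail in the abstract; a minimal hedged sketch, assuming per-step gains for each diagnostic objective and normalized step-wise weights (both hypothetical constructs, not the paper's exact formulation), is:

```python
import numpy as np

def stepwise_weighted_reward(objective_gains, weights):
    """objective_gains: (T, K) per-step gain for each of K diagnostic objectives
    (e.g. improvement in a classifier's confidence after a new k-space sample).
    weights: (T, K) step-wise weights; rows are normalized to sum to 1.
    Returns the (T,) per-step scalar rewards."""
    weights = weights / weights.sum(axis=1, keepdims=True)
    return (objective_gains * weights).sum(axis=1)
```

Letting the weight rows vary over acquisition steps is what allows the policy to prioritize different diagnostic objectives at different points in the sequential exam.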

[421] Direct Kernel Optimization: Efficient Design for Opto-Electronic Convolutional Neural Networks

Ali Almuallem, Harshana Weligampola, Abhiram Gnanasambandam, Wei Xu, Dilshan Godaliyadda, Hamid R. Sheikh, Stanley H. Chan, Qi Guo

Main category: eess.IV

TL;DR: DKO is a two-stage training framework for hybrid opto-electronic neural networks that first trains an electronic CNN then synthesizes optical kernels to replicate first-layer filters, reducing computational cost compared to end-to-end optimization.

DetailsMotivation: Joint end-to-end optimization of optical and electronic components in hybrid neural networks is computationally expensive due to large parameter spaces and repeated optical convolutions, creating a need for more efficient optimization methods.

Method: Direct Kernel Optimization (DKO) uses a two-stage approach: 1) Train a conventional electronic CNN, 2) Synthesize optical kernels to replicate the first-layer convolutional filters, reducing optimization dimensionality and avoiding simulated optical convolutions during optimization.

Result: DKO achieves twice the accuracy of end-to-end training under equal computational budgets while significantly reducing training time in monocular depth estimation simulations.

Conclusion: DKO provides a scalable optimization approach for training hybrid opto-electronic systems by addressing computational challenges through decoupled optimization of optical and electronic components.

Abstract: Hybrid opto-electronic neural networks combine optical front-ends with electronic back-ends to perform vision tasks, but joint end-to-end (E2E) optimization of optical and electronic components is computationally expensive due to large parameter spaces and repeated optical convolutions. We propose Direct Kernel Optimization (DKO), a two-stage training framework that first trains a conventional electronic CNN and then synthesizes optical kernels to replicate the first-layer convolutional filters, reducing optimization dimensionality and avoiding costly simulated optical convolutions during optimization. We evaluate DKO in simulation on a monocular depth estimation model and show that it achieves twice the accuracy of E2E training under equal computational budgets while reducing training time. Given the substantial computational challenges of optimizing hybrid opto-electronic systems, our results position DKO as a scalable optimization approach to train and realize these systems.
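The second DKO stage, synthesizing optical kernels that replicate the trained first-layer filters, can be caricatured as a least-squares fit onto a set of physically realizable basis kernels. The basis representation and function names below are assumptions made here; the paper's actual optical model is certainly richer:

```python
import numpy as np

def fit_optical_kernels(target_filters, basis):
    """Fit each target first-layer CNN filter (flattened, length P) as a linear
    combination of B realizable optical basis kernels.
    target_filters: (F, P), basis: (P, B).
    Returns the (F, B) coefficients and the (F, P) achieved approximations."""
    coeffs, *_ = np.linalg.lstsq(basis, target_filters.T, rcond=None)
    approx = (basis @ coeffs).T
    return coeffs.T, approx
```

Because this fit only involves the first-layer filters, its dimensionality is tiny compared with end-to-end training, which is the source of the claimed efficiency gain.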

[422] Towards a Unified Theoretical Framework for Self-Supervised MRI Reconstruction

Siying Xu, Kerstin Hammernik, Daniel Rueckert, Sergios Gatidis, Thomas Küstner

Main category: eess.IV

TL;DR: UNITS provides a unified theoretical framework for self-supervised MRI reconstruction, proving SSL can match supervised performance while improving generalization and clinical applicability.

DetailsMotivation: MRI acquisition times are long, and while deep learning accelerates reconstruction, supervised methods require fully-sampled reference data that's hard to acquire. Existing SSL approaches are fragmented and lack theoretical foundation.

Method: UNITS unifies prior SSL strategies into a common formalism, introduces sampling stochasticity and flexible data utilization, and provides theoretical guarantees that SSL can achieve supervised performance.

Result: The framework enables consistent interpretation and systematic benchmarking, improves network generalization under out-of-domain distributions, stabilizes training, and establishes a foundation for clinically applicable self-supervised MRI reconstruction.

Conclusion: UNITS serves as both theoretical foundation and practical paradigm for interpretable, generalizable self-supervised MRI reconstruction, addressing key limitations of current supervised and SSL approaches.

Abstract: The demand for high-resolution, non-invasive imaging continues to drive innovation in magnetic resonance imaging (MRI), yet prolonged acquisition times hinder accessibility and real-time applications. While deep learning-based reconstruction methods have accelerated MRI, their predominant supervised paradigm depends on fully-sampled reference data that are challenging to acquire. Recently, self-supervised learning (SSL) approaches have emerged as promising alternatives, but most are empirically designed and fragmented. Therefore, we introduce UNITS (Unified Theory for Self-supervision), a general framework for self-supervised MRI reconstruction. UNITS unifies prior SSL strategies within a common formalism, enabling consistent interpretation and systematic benchmarking. We prove that SSL can achieve the same expected performance as supervised learning. Under this theoretical guarantee, we introduce sampling stochasticity and flexible data utilization, which improve network generalization under out-of-domain distributions and stabilize training. Together, these contributions establish UNITS as a theoretical foundation and a practical paradigm for interpretable, generalizable, and clinically applicable self-supervised MRI reconstruction.
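One concrete SSL strategy that a framework like UNITS subsumes is SSDU-style mask splitting, where the acquired k-space locations are randomly partitioned into a network-input subset and a disjoint loss subset; re-drawing the split each epoch is one way to realize the "sampling stochasticity" the abstract mentions. A minimal sketch (the function name and split fraction are ours):

```python
import numpy as np

def split_mask(mask, rng, loss_frac=0.4):
    """Randomly partition the acquired k-space locations into a disjoint
    network-input mask and loss mask (SSDU-style self-supervision)."""
    idx = rng.permutation(np.flatnonzero(mask))
    n_loss = int(loss_frac * idx.size)
    loss_mask = np.zeros(mask.shape, dtype=bool)
    loss_mask.flat[idx[:n_loss]] = True
    input_mask = mask.astype(bool) & ~loss_mask
    return input_mask, loss_mask
```

Because the loss is evaluated only on samples the network never saw as input, no fully-sampled reference is needed, which is exactly the supervised-data bottleneck the paper targets.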

[423] Toward Agentic AI: Task-Oriented Communication for Hierarchical Planning of Long-Horizon Tasks

Sin-Yu Huang, Lele Wang, Vincent W. S. Wong

Main category: eess.IV

TL;DR: Hierarchical task-oriented communication framework for agentic AI that transmits only subtask-relevant information using conditional variational information bottleneck to reduce bandwidth while maintaining task success.

DetailsMotivation: Existing task-oriented communication schemes are not designed for hierarchical agentic AI where different subtasks have distinct goals, requiring adaptive information transmission for each subtask to reduce bandwidth while maintaining performance.

Method: Proposes HiTOC framework with high-level planner and low-level actor modules on edge server, robot transmits only subtask-relevant observations using conditional variational information bottleneck (cVIB) approach to minimize transmitted information.

Result: Outperforms three state-of-the-art schemes in success rate on the MAP-THOR benchmark in AI2-THOR platform simulations.

Conclusion: HiTOC framework enables efficient hierarchical task-oriented communication for agentic AI by adaptively transmitting minimal information required for each subtask, improving success rates while reducing bandwidth.

Abstract: Agentic artificial intelligence (AI) is an AI paradigm that can perceive the environment, reason over observations, and execute actions to achieve specific goals. Task-oriented communication supports agentic AI by transmitting only the task-related information instead of full raw data in order to reduce the bandwidth requirement. In real-world scenarios, AI agents often need to perform a sequence of actions to complete complex tasks. Completing these long-horizon tasks requires a hierarchical agentic AI architecture, where a high-level planner module decomposes a task into subtasks, and a low-level actor module executes each subtask sequentially. Since each subtask has a distinct goal, the existing task-oriented communication schemes are not designed to handle different goals for different subtasks. To address this challenge, in this paper, we develop a hierarchical task-oriented communication (HiTOC) framework. We consider a system with an edge server and a robot as an edge device. The high-level planner and low-level actor modules reside on the edge server. The robot transmits only the environmental observation that is relevant to the current subtask to the edge server. We propose a conditional variational information bottleneck (cVIB) approach to train the HiTOC framework to adaptively transmit minimal information required for each subtask. Simulations conducted on the AI2-THOR platform demonstrate that the proposed HiTOC framework outperforms three state-of-the-art schemes in terms of success rate on the MAP-THOR benchmark.
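The cVIB training objective can be sketched as the usual variational information bottleneck surrogate, a task loss plus a β-weighted rate term from a Gaussian encoder that, in the conditional variant, is conditioned on the current subtask goal. The function names and the closed-form KL below are illustrative, not the paper's exact objective:

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over the latent dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def cvib_objective(task_nll, mu, logvar, beta=1e-3):
    """VIB-style surrogate: task loss plus beta times the rate term. In the
    conditional variant, (mu, logvar) come from an encoder that also sees the
    current subtask goal, so the rate penalty adapts per subtask."""
    return float(np.mean(task_nll + beta * gaussian_kl(mu, logvar)))
```

Minimizing the rate term is what pressures the robot to transmit only the bits of its observation that the current subtask actually needs.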

Last updated: 2026-03-06
Built with Hugo; theme based on Stack