Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 170]
- cs.CV [Total: 303]
- cs.AI [Total: 103]
- cs.SD [Total: 12]
- cs.LG [Total: 308]
- cs.MA [Total: 12]
- cs.MM [Total: 4]
- eess.AS [Total: 11]
- eess.IV [Total: 18]
cs.CL
[1] PlotCraft: Pushing the Limits of LLMs for Complex and Interactive Data Visualization
Jiajun Zhang, Jianke Zhang, Zeyu Cui, Jiaxi Yang, Lei Zhang, Binyuan Hui, Qiang Liu, Zilei Wang, Liang Wang, Junyang Lin
Main category: cs.CL
TL;DR: PlotCraft is a new benchmark for evaluating LLMs on complex visualization tasks, revealing performance gaps. The authors developed the SynthVis-30K dataset and the PlotCraftor model to address these deficiencies, achieving significant improvements on hard tasks.
Details
Motivation: Current LLMs show remarkable code generation capabilities but their ability to create complex visualizations for scaled and structured data remains largely unevaluated and underdeveloped.
Method: Introduced PlotCraft benchmark with 1k challenging visualization tasks across 7 high-level tasks and 48 chart types. Developed SynthVis-30K dataset via collaborative agent framework and created PlotCraftor model for complex data visualization.
Result: Evaluation of 23 leading LLMs revealed performance deficiencies in sophisticated visualization tasks. PlotCraftor achieved performance comparable to leading proprietary approaches, with over 50% improvement on hard tasks across multiple benchmarks.
Conclusion: The work addresses a significant gap in LLM capabilities for complex data visualization and provides a comprehensive benchmark, dataset, and model that substantially improves performance on challenging visualization tasks.
Abstract: Recent Large Language Models (LLMs) have demonstrated remarkable proficiency in code generation. However, their ability to create complex visualizations for scaled and structured data remains largely unevaluated and underdeveloped. To address this gap, we introduce PlotCraft, a new benchmark featuring 1k challenging visualization tasks that cover a wide range of topics, such as finance, scientific research, and sociology. The benchmark is structured around seven high-level visualization tasks and encompasses 48 distinct chart types. Crucially, it is the first to systematically evaluate both single-turn generation and multi-turn refinement across a diverse spectrum of task complexities. Our comprehensive evaluation of 23 leading LLMs on PlotCraft reveals obvious performance deficiencies in handling sophisticated visualization tasks. To bridge this performance gap, we develop SynthVis-30K, a large-scale, high-quality dataset of complex visualization code synthesized via a collaborative agent framework. Building upon this dataset, we develop PlotCraftor, a novel code generation model that achieves strong capabilities in complex data visualization with a remarkably small size. Across VisEval, PandasPlotBench, and our proposed PlotCraft, PlotCraftor shows performance comparable to that of leading proprietary approaches. Notably, on hard tasks, our model achieves over a 50% performance improvement. We will release the benchmark, dataset, and code at https://github.com/Speakn0w/PlotCraft-Benchmark.
[2] Cognitive Alignment in Personality Reasoning: Leveraging Prototype Theory for MBTI Inference
Haoyuan Li, Yuanbo Tong, Yuchen Li, Zirui Wang, Chunhou Liu, Jiamou Liu
Main category: cs.CL
TL;DR: ProtoMBTI is a prototype-based framework for MBTI personality recognition from text that improves accuracy and interpretability by aligning with psychological prototype theory rather than traditional hard-label classification.
Details
Motivation: Traditional personality recognition uses hard-label classification which doesn't capture the graded, prototype-like nature of human personality judgments. There's a need for cognitively aligned approaches that better reflect how humans actually make personality assessments.
Method: Uses an LLM-based pipeline with: 1) LLM-guided multi-dimensional corpus augmentation, 2) LoRA-fine-tuned lightweight encoder for embeddings and prototype standardization, 3) Retrieve-reuse-revise-retain inference cycle with prototype voting and continuous prototype library enrichment. An illustrative sketch follows the abstract below.
Result: ProtoMBTI outperforms baselines on both MBTI dichotomies and full 16-type tasks across Kaggle and Pandora benchmarks, and shows robust cross-dataset generalization.
Conclusion: Aligning personality inference with psychological prototype reasoning improves accuracy, interpretability, and transfer learning for text-based personality modeling.
Abstract: Personality recognition from text is typically cast as hard-label classification, which obscures the graded, prototype-like nature of human personality judgments. We present ProtoMBTI, a cognitively aligned framework for MBTI inference that operationalizes prototype theory within an LLM-based pipeline. First, we construct a balanced, quality-controlled corpus via LLM-guided multi-dimensional augmentation (semantic, linguistic, sentiment). Next, we LoRA-fine-tune a lightweight (<=2B) encoder to learn discriminative embeddings and to standardize a bank of personality prototypes. At inference, we retrieve top-k prototypes for a query post and perform a retrieve–reuse–revise–retain cycle: the model aggregates prototype evidence via prompt-based voting, revises when inconsistencies arise, and, upon correct prediction, retains the sample to continually enrich the prototype library. Across Kaggle and Pandora benchmarks, ProtoMBTI improves over baselines on both the four MBTI dichotomies and the full 16-type task, and exhibits robust cross-dataset generalization. Our results indicate that aligning the inference process with psychological prototype reasoning yields gains in accuracy, interpretability, and transfer for text-based personality modeling.
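The retrieve-reuse-revise-retain cycle can be pictured with a minimal sketch. This is an illustration, not the authors' implementation: the `embed` encoder and `llm_vote` prompt-based aggregator are hypothetical stand-ins, and cosine retrieval with an accuracy-gated retention rule is an assumption.

```python
# Illustrative sketch (not the authors' code) of a retrieve-reuse-revise-retain
# loop over a prototype library. `embed` and `llm_vote` are hypothetical callables.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class PrototypeLibrary:
    vectors: list = field(default_factory=list)   # prototype embeddings
    labels: list = field(default_factory=list)    # MBTI labels, e.g. "INTJ"

    def retrieve(self, query_vec, k=5):
        # Cosine similarity against every stored prototype.
        sims = [float(np.dot(query_vec, v) /
                      (np.linalg.norm(query_vec) * np.linalg.norm(v)))
                for v in self.vectors]
        top = np.argsort(sims)[::-1][:k]
        return [(self.labels[i], sims[i]) for i in top]

    def retain(self, query_vec, label):
        # Retain: add a correctly classified sample as a new prototype.
        self.vectors.append(query_vec)
        self.labels.append(label)


def infer_type(post, library, embed, llm_vote, gold=None, k=5):
    vec = embed(post)                       # encode the query post
    neighbors = library.retrieve(vec, k=k)  # retrieve: top-k prototypes
    prediction = llm_vote(post, neighbors)  # reuse + revise: prompt-based voting
    if gold is not None and prediction == gold:
        library.retain(vec, prediction)     # retain: grow the prototype library
    return prediction
```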
[3] ParaScopes: What do Language Models Activations Encode About Future Text?
Nicky Pochinkov, Yulia Volkova, Anna Vasileva, Sai V R Chereddy
Main category: cs.CL
TL;DR: The paper introduces Residual Stream Decoders as a method to probe language model activations for paragraph-scale and document-scale planning information, finding that small models can encode information equivalent to 5+ future tokens.
Details
Motivation: As language models handle longer time horizon tasks, existing interpretability methods remain limited to testing specific concepts or tokens, creating a need for better ways to understand how models encode longer-term planning information.
Method: Developed a framework of Residual Stream Decoders to probe model activations for paragraph-scale and document-scale plans, testing several decoding methods. A schematic probe is sketched after the abstract below.
Result: Found that information equivalent to 5+ tokens of future context can be decoded from activations in small models.
Conclusion: These results establish groundwork for improved monitoring of language models and better understanding of how they encode longer-term planning information.
Abstract: Interpretability studies in language models often investigate forward-looking representations of activations. However, as language models become capable of doing ever longer time horizon tasks, methods for understanding activations often remain limited to testing specific concepts or tokens. We develop a framework of Residual Stream Decoders as a method of probing model activations for paragraph-scale and document-scale plans. We test several methods and find information can be decoded equivalent to 5+ tokens of future context in small models. These results lay the groundwork for better monitoring of language models and better understanding how they might encode longer-term planning information.
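The probing idea can be illustrated with a linear decoder from residual-stream activations to embeddings of upcoming text. The ridge-regression formulation and the random placeholder data below are assumptions; the paper's actual Residual Stream Decoders may differ.

```python
# Minimal sketch: learn a linear map from a model's residual-stream activation
# at position t to an embedding of the text that follows. Data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_embed, n = 512, 256, 2000

H = rng.normal(size=(n, d_model))   # residual-stream activations (one per prompt)
Y = rng.normal(size=(n, d_embed))   # embeddings of the following paragraph

# Closed-form ridge regression: W = (H^T H + lambda I)^-1 H^T Y
lam = 1e-2
W = np.linalg.solve(H.T @ H + lam * np.eye(d_model), H.T @ Y)

pred = H @ W
# Cosine similarity between predicted and true future-text embeddings indicates
# how much information about upcoming text the activation encodes.
cos = np.sum(pred * Y, axis=1) / (np.linalg.norm(pred, axis=1) * np.linalg.norm(Y, axis=1))
print(f"mean cosine similarity: {cos.mean():.3f}")
```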
[4] Training LLMs Beyond Next Token Prediction - Filling the Mutual Information Gap
Chun-Hao Yang, Bo-Han Feng, Tzu-Yuan Lai, Yan Yu Chen, Yin-Kai Dean Huang, Shou-De Lin
Main category: cs.CL
TL;DR: Challenges conventional next-token prediction training for LLMs, proposing to predict information-rich tokens instead to optimize training performance while maintaining computational costs.
Details
Motivation: To improve LLM training efficiency and performance by moving beyond conventional next-token prediction, addressing the challenge of optimizing training while controlling computational costs.
Method: Proposes predicting information-rich tokens during training instead of standard next-token prediction, evaluated across arithmetic, multi-label text classification, and natural-language generation tasks.
Result: The paper demonstrates improved training effectiveness through selective token prediction strategies, though specific performance metrics are not detailed in the abstract.
Conclusion: Provides a principled approach to LLM training optimization that advances both model performance and theoretical understanding of target-token selection strategies.
Abstract: Optimizing training performance in large language models (LLMs) remains an essential challenge, particularly in improving model performance while maintaining computational costs. This work challenges the conventional approach of training LLMs using next-token prediction (NTP), arguing that by predicting information-rich tokens during training, there is a more effective way to train LLMs. We investigate the impact of the proposed solution in three kinds of tasks for LLMs: arithmetic, multi-label classification of text, and natural-language generation. This work offers a principled approach to optimizing LLM training, advancing both model performance and theoretical understanding of the target-token selection strategies.
[5] Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning
Marwa Abdulhai, Ryan Cheng, Donovan Clay, Tim Althoff, Sergey Levine, Natasha Jaques
Main category: cs.CL
TL;DR: A framework for evaluating and improving persona consistency in LLM-generated dialogue using automatic metrics and reinforcement learning fine-tuning.
Details
Motivation: LLMs often drift from assigned personas, contradict earlier statements, or abandon role-appropriate behavior when simulating human users in interactive settings like therapy, education, and social role-play.
Method: Introduced three automatic metrics (prompt-to-line, line-to-line, and Q&A consistency), validated against human annotations, and used them as reward signals for multi-turn reinforcement learning to fine-tune LLMs for patient, student, and social chat partner roles. A schematic reward combination is sketched after the abstract below.
Result: The method reduces inconsistency by over 55%, resulting in more coherent and faithful simulated users.
Conclusion: The proposed framework effectively improves persona consistency in LLM-generated dialogue through automatic evaluation metrics and reinforcement learning fine-tuning.
Abstract: Large Language Models (LLMs) are increasingly used to simulate human users in interactive settings such as therapy, education, and social role-play. While these simulations enable scalable training and evaluation of AI agents, off-the-shelf LLMs often drift from their assigned personas, contradict earlier statements, or abandon role-appropriate behavior. We introduce a unified framework for evaluating and improving persona consistency in LLM-generated dialogue. We define three automatic metrics: prompt-to-line consistency, line-to-line consistency, and Q&A consistency, that capture different types of persona drift and validate each against human annotations. Using these metrics as reward signals, we apply multi-turn reinforcement learning to fine-tune LLMs for three user roles: a patient, a student, and a social chat partner. Our method reduces inconsistency by over 55%, resulting in more coherent and faithful simulated users.
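One way to picture the reward used for multi-turn RL is a simple average of the three consistency scores. The scorer callables and the equal weighting are illustrative assumptions, not the authors' exact recipe.

```python
# Schematic reward for multi-turn RL fine-tuning, assuming three scorer
# functions in [0, 1] corresponding to the paper's three metrics.
def persona_reward(persona_prompt, dialogue, qa_pairs,
                   prompt_to_line, line_to_line, qa_consistency):
    r1 = prompt_to_line(persona_prompt, dialogue)   # lines agree with the persona prompt
    r2 = line_to_line(dialogue)                     # lines agree with earlier lines
    r3 = qa_consistency(persona_prompt, qa_pairs)   # probe questions answered consistently
    return (r1 + r2 + r3) / 3.0                     # scalar reward for the RL update
```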
[6] AgentBnB: A Browser-Based Cybersecurity Tabletop Exercise with Large Language Model Support and Retrieval-Aligned Scaffolding
Arman Anwar, Zefang Liu
Main category: cs.CL
TL;DR: AgentBnB is a browser-based cybersecurity training system that uses LLM teammates and a retrieval-augmented copilot to provide scalable, on-demand hints for tabletop exercises, showing promising results for lightweight, repeatable practice.
Details
Motivation: Traditional cybersecurity tabletop exercises are scripted, resource-intensive, and difficult to scale, creating a need for more accessible training solutions.
Method: The system integrates large language model teammates with a Bloom-aligned, retrieval-augmented copilot (C2D2) that expands curated content into cognitive snippets and uses prompt-engineered agents with scaffolding that fades as learners gain confidence.
Result: In a pilot with four graduate students, participants preferred the agent-based version over physical card decks and viewed it as more scalable, though a ceiling effect was observed on simple knowledge quizzes.
Conclusion: LLM-augmented tabletop exercises can provide lightweight, repeatable cybersecurity practice without traditional logistical burdens, with planned extensions including multi-player modes and larger comparative studies.
Abstract: Traditional cybersecurity tabletop exercises (TTXs) provide valuable training but are often scripted, resource-intensive, and difficult to scale. We introduce AgentBnB, a browser-based re-imagining of the Backdoors & Breaches game that integrates large language model teammates with a Bloom-aligned, retrieval-augmented copilot (C2D2). The system expands a curated corpus into factual, conceptual, procedural, and metacognitive snippets, delivering on-demand, cognitively targeted hints. Prompt-engineered agents employ a scaffolding ladder that gradually fades as learner confidence grows. In a solo-player pilot with four graduate students, participants reported greater intention to use the agent-based version compared to the physical card deck and viewed it as more scalable, though a ceiling effect emerged on a simple knowledge quiz. Despite limitations of small sample size, single-player focus, and narrow corpus, these early findings suggest that large language model augmented TTXs can provide lightweight, repeatable practice without the logistical burden of traditional exercises. Planned extensions include multi-player modes, telemetry-driven coaching, and comparative studies with larger cohorts.
[7] IL-PCSR: Legal Corpus for Prior Case and Statute Retrieval
Shounak Paul, Dhananjay Ghumare, Pawan Goyal, Saptarshi Ghosh, Ashutosh Modi
Main category: cs.CL
TL;DR: IL-PCR is a new Indian legal corpus that provides a unified testbed for both statute retrieval and precedent retrieval tasks, addressing the gap where these related tasks were previously handled independently.
Details
Motivation: Legal practitioners need to retrieve both relevant statutes and prior cases for given legal situations, but existing research has treated these inherently related tasks independently with separate datasets and models.
Method: Created IL-PCR corpus for both retrieval tasks, experimented with lexical models, semantic models, and GNN-based ensembles, and developed an LLM-based re-ranking approach to exploit task interdependence.
Result: The LLM-based re-ranking approach achieved the best performance by leveraging the dependence between statute retrieval and precedent retrieval tasks.
Conclusion: IL-PCR provides a unified framework for legal retrieval tasks, and exploiting the interdependence between statute and precedent retrieval through LLM-based re-ranking yields superior performance.
Abstract: Identifying/retrieving relevant statutes and prior cases/precedents for a given legal situation are common tasks exercised by law practitioners. Researchers to date have addressed the two tasks independently, thus developing completely different datasets and models for each task; however, both retrieval tasks are inherently related, e.g., similar cases tend to cite similar statutes (due to similar factual situation). In this paper, we address this gap. We propose IL-PCR (Indian Legal corpus for Prior Case and Statute Retrieval), which is a unique corpus that provides a common testbed for developing models for both the tasks (Statute Retrieval and Precedent Retrieval) that can exploit the dependence between the two. We experiment extensively with several baseline models on the tasks, including lexical models, semantic models and ensemble based on GNNs. Further, to exploit the dependence between the two tasks, we develop an LLM-based re-ranking approach that gives the best performance.
[8] POSESTITCH-SLT: Linguistically Inspired Pose-Stitching for End-to-End Sign Language Translation
Abhinav Joshi, Vaibhav Sharma, Sanjeet Singh, Ashutosh Modi
Main category: cs.CL
TL;DR: POSESTITCH-SLT: A novel pre-training method using template-generated sentence pairs that significantly improves sign language translation performance on low-resource datasets.
Details
Motivation: Sign language translation faces challenges due to limited large-scale, sentence-aligned datasets, requiring innovative approaches to overcome data scarcity.
Method: Proposes POSESTITCH-SLT, a pre-training scheme inspired by linguistic-template-based sentence generation, using template-generated sentence pairs to train a transformer-based encoder-decoder architecture.
Result: Achieves significant BLEU-4 score improvements: from 1.97 to 4.56 on How2Sign and from 0.55 to 3.43 on iSign, outperforming prior state-of-the-art methods for pose-based gloss-free translation.
Conclusion: Template-driven synthetic supervision is effective for low-resource sign language translation settings, demonstrating the value of synthetic data generation in overcoming data scarcity.
Abstract: Sign language translation remains a challenging task due to the scarcity of large-scale, sentence-aligned datasets. Prior arts have focused on various feature extraction and architectural changes to support neural machine translation for sign languages. We propose POSESTITCH-SLT, a novel pre-training scheme that is inspired by linguistic-templates-based sentence generation technique. With translation comparison on two sign language datasets, How2Sign and iSign, we show that a simple transformer-based encoder-decoder architecture outperforms the prior art when considering template-generated sentence pairs in training. We achieve BLEU-4 score improvements from 1.97 to 4.56 on How2Sign and from 0.55 to 3.43 on iSign, surpassing prior state-of-the-art methods for pose-based gloss-free translation. The results demonstrate the effectiveness of template-driven synthetic supervision in low-resource sign language settings.
[9] Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning
Xinyi Wu, Yanhao Jia, Qinglin Zhang, Yiran Qin, Luwei Xiao, Shuai Zhao
Main category: cs.CL
TL;DR: PBLBench is a novel benchmark for evaluating multimodal large language models in Project-Based Learning contexts, addressing gaps in existing benchmarks through expert-validated structured evaluation and complex reasoning tasks.
Details
Motivation: Existing benchmarks lack free-form output structure and rigorous human expert validation, limiting their effectiveness for real-world educational tasks. Model hallucination and instability also hinder the development of automated pipelines to assist teachers.
Method: Introduces PBLBench benchmark with Analytic Hierarchy Process (AHP) for expert-driven pairwise comparisons to establish reliable ground truth and structured evaluation criteria. Evaluates 15 leading MLLMs/LLMs on complex reasoning tasks requiring domain-specific knowledge and long-context understanding.
Result: Even the most advanced models achieve only 59% rank accuracy on PBLBench, highlighting significant challenges in handling complex educational reasoning tasks that resemble those handled by human experts.
Conclusion: PBLBench serves as a catalyst for developing more capable AI agents to alleviate teacher workload and enhance educational productivity, addressing current limitations in MLLM evaluation for educational applications.
Abstract: Project-Based Learning (PBL) involves a variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines. With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings. However, existing benchmarks fall short in providing both a free-form output structure and a rigorous human expert validation process, limiting their effectiveness in evaluating real-world educational tasks. Additionally, few methods have developed automated pipelines to assist with the complex responsibilities of teachers leveraging MLLMs, largely due to model hallucination and instability, which lead to unreliable implementation. To address this gap, we introduce PBLBench, a novel benchmark designed to evaluate complex reasoning grounded in domain-specific knowledge and long-context understanding, thereby challenging models with tasks that closely resemble those handled by human experts. To establish reliable ground truth, we adopt the Analytic Hierarchy Process (AHP), utilizing expert-driven pairwise comparisons to derive structured and weighted evaluation criteria. We assess the performance of 15 leading MLLMs/LLMs using PBLBench and demonstrate that even the most advanced models achieve only 59% rank accuracy, underscoring the significant challenges presented by this benchmark. We believe PBLBench will serve as a catalyst for the development of more capable AI agents, ultimately aiming to alleviate teacher workload and enhance educational productivity.
[10] Language Modeling With Factorization Memory
Lee Xiong, Maksim Tkachenko, Johanes Effendi, Ting Cai
Main category: cs.CL
TL;DR: Factorization Memory is an efficient RNN architecture that matches Transformer performance on short-context tasks while excelling at long-context generalization, building on Mamba-2 with parallel training and constant inference complexity.
Details
Motivation: To develop an RNN architecture that combines the efficiency of recurrent models with competitive performance across both short and long-context language modeling tasks, addressing limitations of existing approaches.
Method: Builds upon Mamba-2 architecture with parallel training capabilities and constant inference complexity. Introduces a sparse formulation that updates only a subset of recurrent states while maintaining performance comparable to dense models. A toy sketch of the sparse-update idea follows the abstract below.
Result: Achieves performance comparable to Transformers on short-context tasks while demonstrating superior generalization in long-context scenarios. The sparse version preserves strong performance while improving efficiency.
Conclusion: Factorization Memory represents the first RNN architecture that successfully combines sparse memory activation with competitive performance across both short and long-context settings, providing an efficient alternative to Transformers.
Abstract: We propose Factorization Memory, an efficient recurrent neural network (RNN) architecture that achieves performance comparable to Transformer models on short-context language modeling tasks while also demonstrating superior generalization in long-context scenarios. Our model builds upon Mamba-2, enabling Factorization Memory to exploit parallel computations during training while preserving constant computational and memory complexity during inference. To further optimize model efficiency and representational capacity, we develop a sparse formulation of Factorization Memory that updates only a subset of recurrent states at each step while preserving the strong performance of its dense counterpart. To our knowledge, this represents the first RNN architecture that successfully combines sparse memory activation with competitive performance across both short and long-context settings. This work provides a systematic empirical analysis of Factorization Memory in comparison to Transformer and Mamba-2 architectures.
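The sparse-update idea can be illustrated with a toy recurrence that rewrites only the top-k state slots per step. This is a schematic stand-in, not the paper's Mamba-2-based formulation; the gating rule and dimensions are arbitrary.

```python
# Toy illustration of sparse recurrent-state updates: at each step only the
# top-k state slots (ranked by a gate score) are rewritten, the rest carry over.
import numpy as np

def sparse_recurrent_step(state, x, W_gate, W_in, k=4):
    gate = W_gate @ x                       # per-slot relevance scores
    active = np.argsort(gate)[-k:]          # indices of the k slots to update
    new_state = state.copy()
    new_state[active] = np.tanh(W_in[active] @ x + state[active])
    return new_state

rng = np.random.default_rng(0)
n_slots, d_in = 16, 8
state = np.zeros(n_slots)
W_gate = rng.normal(size=(n_slots, d_in))
W_in = rng.normal(size=(n_slots, d_in))
for x in rng.normal(size=(5, d_in)):        # run a short toy sequence
    state = sparse_recurrent_step(state, x, W_gate, W_in)
print(state.round(3))
```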
[11] Spatial Knowledge Graph-Guided Multimodal Synthesis
Yida Xue, Zhen Bi, Jinnan Yang, Jungang Lou, Kehai Chen, Min Zhang, Huajun Chen, Ningyu Zhang
Main category: cs.CL
TL;DR: SKG2DATA is a multimodal synthesis framework that uses spatial knowledge graphs to generate spatially coherent data, improving MLLMs’ spatial perception abilities.
Details
Motivation: To address the limitation of spatial perception in Multimodal Large Language Models (MLLMs) through systematic generation of spatially coherent data, overcoming the challenge of ensuring spatial common sense in synthesized data.
Method: Uses automated pipeline to construct Spatial Knowledge Graphs (SKG) capturing human-like spatial cognition (directional and distance relationships), then integrates diffusion models for image generation and MLLMs for text description to create spatially-consistent multimodal data.
Result: Data synthesized from diverse spatial knowledge types (direction and distance) significantly enhances MLLMs’ spatial perception and reasoning abilities, though with slight reduction in general capabilities. The approach enables scalable generation of realistic spatial configurations.
Conclusion: Knowledge-based data synthesis using spatial knowledge graphs can effectively advance spatial intelligence development in MLLMs, providing a systematic framework for generating spatially coherent multimodal data.
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. Our approach addresses this critical gap by providing a systematic framework for generating spatially coherent data. In this work, we introduce SKG2DATA, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2DATA employs an automated pipeline for constructing Spatial Knowledge Graph (SKG) that effectively captures human-like spatial cognition, including directional and distance relationships. These structured representations then serve as precise guidance for our integrated synthesis pipeline, where a diffusion model generates spatially-consistent images while a MLLM produces corresponding textual descriptions. The automated construction of SKG enables scalable generation of diverse yet realistic spatial configurations, overcoming the limitations of manual data collection and annotation. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, enhance the spatial perception and reasoning abilities of MLLMs markedly, albeit with a slight cost to their general capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence. Code is available at https://github.com/zjunlp/Knowledge2Data.
[12] Reversal Invariance in Autoregressive Language Models
Mihir Sahasrabudhe
Main category: cs.CL
TL;DR: The paper identifies reversal invariance in causal language modeling - the objective treats forward and reversed text identically, which may limit capturing directional dependencies in natural language.
Details
Motivation: To understand why models trained on reversed text perform similarly to forward-trained models, despite language's inherent time-asymmetry, and to highlight limitations of current pretraining objectives.
Method: Formal analysis of the causal language modeling objective’s structural properties, showing it assigns identical likelihood to corpora and their reversals. The underlying identity is written out after the abstract below.
Result: Demonstrates that standard CLM pretraining is direction-blind due to reversal invariance, explaining comparable performance between forward and reversed text training.
Conclusion: Proposes viewing pretraining through temporal asymmetry lens, suggesting future work on loss functions and architectures that explicitly model language’s directional dependencies while maintaining modeling capacity.
Abstract: We formalize a structural property of the causal (autoregressive) language modeling (CLM) objective: reversal invariance. Formally, the next-token prediction loss assigns identical likelihood to a corpus and its reversal, implying that standard CLM pretraining is direction-blind. This symmetry explains why models trained on reversed text can achieve comparable performance to those trained on forward text, despite the inherently time-asymmetric nature of human language and reasoning. We argue that this invariance represents a limitation of current pretraining objectives rather than a benign artifact. If natural language encodes directional dependencies - phonological, morphological, or causal - a symmetric objective may fail to capture them. We therefore propose viewing pretraining through the lens of temporal asymmetry, motivating future work on loss functions and architectures that explicitly model the arrow of language while retaining standard language modeling capacity.
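The symmetry argument, reconstructed from the abstract in our own notation: the chain rule factorizes a joint sequence likelihood either left-to-right or right-to-left, and the entropy that lower-bounds the expected next-token loss is unchanged by reversing the sequence.

```latex
% Chain rule applied in either direction gives the same joint likelihood, so the
% best achievable next-token loss on a corpus equals that on its reversal
% (sequence entropy is invariant under reversal).
\[
\log p(x_{1:T})
  \;=\; \sum_{t=1}^{T} \log p\!\left(x_t \mid x_{<t}\right)
  \;=\; \sum_{t=1}^{T} \log p\!\left(x_t \mid x_{>t}\right),
\qquad
\min_{\theta}\, \mathbb{E}\!\left[-\log p_{\theta}(x_{1:T})\right]
  \;=\; H(X_{1:T})
  \;=\; H(X_{T},\dots,X_{1}).
\]
```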
[13] LingGym: How Far Are LLMs from Thinking Like Field Linguists?
Changbing Yang, Franklin Ma, Freda Shi, Jian Zhu
Main category: cs.CL
TL;DR: LingGym is a benchmark for evaluating LLMs’ meta-linguistic reasoning using Interlinear Glossed Text from 18 diverse languages, focusing on word-gloss inference tasks.
Details
Motivation: To assess whether LLMs can generalize linguistic inference across low-resource languages and structures not seen during training, moving beyond specific downstream tasks.
Method: Uses controlled Word-Gloss Inference task where models infer missing words and glosses using varying levels of linguistic information (glosses, grammatical explanations, translations) from typologically diverse reference grammars.
Result: Incorporating structured linguistic cues leads to consistent improvements in reasoning performance across all tested models.
Conclusion: Highlights both the promise and current limitations of using LLMs for typologically informed linguistic analysis and low-resource language documentation.
Abstract: This paper introduces LingGym, a new benchmark that evaluates LLMs’ capacity for meta-linguistic reasoning using Interlinear Glossed Text (IGT) and grammatical descriptions extracted from 18 typologically diverse reference grammars. Unlike previous work that focuses on specific downstream tasks, we assess whether LLMs can generalize linguistic inference across low-resource languages and structures not seen during training. We present a controlled evaluation task: Word-Gloss Inference, in which the model must infer a missing word and gloss from context using varying levels of linguistic information (e.g., glosses, grammatical explanations, translations). Our results show that incorporating structured linguistic cues leads to consistent improvements in reasoning performance across all models. This work highlights both the promise and current limitations of using LLMs for typologically informed linguistic analysis and low-resource language documentation.
[14] MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation
Khai Le-Duc, Tuyen Tran, Bach Phan Tat, Nguyen Kim Hai Bui, Quan Dang, Hung-Phong Tran, Thanh-Thuy Nguyen, Ly Nguyen, Tuan-Minh Phan, Thi Thu Phuong Tran, Chris Ngo, Nguyen X. Khanh, Thanh Nguyen-Tang
Main category: cs.CL
TL;DR: This paper introduces MultiMed-ST, the first large-scale multilingual speech translation dataset for the medical domain spanning 5 languages, and provides comprehensive analysis of medical speech translation.
Details
Motivation: To enhance patient care by enabling efficient communication across language barriers in healthcare, alleviate specialized workforce shortages, and facilitate improved diagnosis and treatment, especially during pandemics.
Method: Created MultiMed-ST dataset with 290,000 samples across 5 languages (Vietnamese, English, German, French, Chinese), conducted comprehensive analysis including empirical baselines, bilingual-multilingual comparison, end-to-end vs. cascaded approaches, task-specific vs. multi-task models, code-switch analysis, and error analysis.
Result: MultiMed-ST is the largest medical machine translation dataset and the largest many-to-many multilingual speech translation dataset across all domains. The study provides the most comprehensive speech translation analysis in the field’s history.
Conclusion: The work establishes foundational resources for medical speech translation with publicly available code, data, and models, enabling future research and practical applications in healthcare communication.
Abstract: Multilingual speech translation (ST) and machine translation (MT) in the medical domain enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we present the first systematic study on medical ST, to our best knowledge, by releasing MultiMed-ST, a large-scale ST dataset for the medical domain, spanning all translation directions in five languages: Vietnamese, English, German, French, and Simplified/Traditional Chinese, together with the models. With 290,000 samples, this is the largest medical MT dataset and the largest many-to-many multilingual ST among all domains. Secondly, we present the most comprehensive ST analysis in the field’s history, to our best knowledge, including: empirical baselines, bilingual-multilingual comparative study, end-to-end vs. cascaded comparative study, task-specific vs. multi-task sequence-to-sequence comparative study, code-switch analysis, and quantitative-qualitative error analysis. All code, data, and models are available online: https://github.com/leduckhai/MultiMed-ST
[15] Reasoning Trajectories for Socratic Debugging of Student Code: From Misconceptions to Contradictions and Updated Beliefs
Erfan Al-Hossami, Razvan Bunescu
Main category: cs.CL
TL;DR: The paper introduces reasoning trajectory generation for Socratic debugging, where LLMs guide students to discover programming misconceptions through contradiction-based reasoning paths.
Details
Motivation: To help novice programmers identify and fix bugs caused by programming misconceptions through guided Socratic debugging instead of direct bug fixes.
Method: Created a dataset of debugging problems with manually annotated reasoning trajectories, then developed LLM-based solutions to generate reasoning trajectories and Socratic conversations based on them.
Result: Frontier models achieved 91% correct reasoning trajectories and 98.7% valid conversation turns in large-scale LLM-as-judge evaluation.
Conclusion: LLMs can effectively generate reasoning trajectories and Socratic conversations for debugging, enabling students to identify and correct programming misconceptions through guided discovery.
Abstract: In Socratic debugging, instructors guide students towards identifying and fixing a bug on their own, instead of providing the bug fix directly. Most novice programmer bugs are caused by programming misconceptions, namely false beliefs about a programming concept. In this context, Socratic debugging can be formulated as a guided Reasoning Trajectory (RT) leading to a statement about the program behavior that contradicts the bug-causing misconception. Upon reaching this statement, the ensuing cognitive dissonance leads the student to first identify and then update their false belief. In this paper, we introduce the task of reasoning trajectory generation, together with a dataset of debugging problems manually annotated with RTs. We then describe LLM-based solutions for generating RTs and Socratic conversations that are anchored on them. A large-scale LLM-as-judge evaluation shows that frontier models can generate up to 91% correct reasoning trajectories and 98.7% valid conversation turns.
[16] PADBen: A Comprehensive Benchmark for Evaluating AI Text Detectors Against Paraphrase Attacks
Yiwei Zha, Rui Min, Shanu Sushmita
Main category: cs.CL
TL;DR: Current AI-generated text detectors fail against iteratively-paraphrased content, which creates an intermediate laundering region that evades detection. The paper introduces PADBen benchmark to evaluate detector robustness against paraphrase attacks.
Details
Motivation: To investigate why iteratively-paraphrased text evades AI-generated text detection systems and address vulnerabilities in current detection methods.
Method: Intrinsic mechanism analysis of iterative paraphrasing effects, creation of PADBen benchmark with five-type text taxonomy and five progressive detection tasks, evaluation of 11 state-of-the-art detectors.
Result: Detectors achieve over 90% accuracy on direct LLM outputs but fail catastrophically against iteratively-paraphrased content. Critical asymmetry found: detectors identify plagiarism evasion but fail at authorship obfuscation.
Conclusion: Current detection approaches cannot handle the intermediate laundering region created by iterative paraphrasing, requiring fundamental advances beyond existing semantic and stylistic discrimination methods.
Abstract: While AI-generated text (AIGT) detectors achieve over 90% accuracy on direct LLM outputs, they fail catastrophically against iteratively-paraphrased content. We investigate why iteratively-paraphrased text – itself AI-generated – evades detection systems designed for AIGT identification. Through intrinsic mechanism analysis, we reveal that iterative paraphrasing creates an intermediate laundering region characterized by semantic displacement with preserved generation patterns, which brings up two attack categories: paraphrasing human-authored text (authorship obfuscation) and paraphrasing LLM-generated text (plagiarism evasion). To address these vulnerabilities, we introduce PADBen, the first benchmark systematically evaluating detector robustness against both paraphrase attack scenarios. PADBen comprises a five-type text taxonomy capturing the full trajectory from original content to deeply laundered text, and five progressive detection tasks across sentence-pair and single-sentence challenges. We evaluate 11 state-of-the-art detectors, revealing critical asymmetry: detectors successfully identify the plagiarism evasion problem but fail for the case of authorship obfuscation. Our findings demonstrate that current detection approaches cannot effectively handle the intermediate laundering region, necessitating fundamental advances in detection architectures beyond existing semantic and stylistic discrimination methods. For detailed code implementation, please see https://github.com/JonathanZha47/PadBen-Paraphrase-Attack-Benchmark.
[17] MedRECT: A Medical Reasoning Benchmark for Error Correction in Clinical Texts
Naoto Iwase, Hiroki Okuyama, Junichiro Iwasawa
Main category: cs.CL
TL;DR: MedRECT is a cross-lingual benchmark for medical error correction in Japanese and English, evaluating LLMs on error detection, localization, and correction. Reasoning models outperform standard architectures, and fine-tuned models can exceed human expert performance.
Details
Motivation: LLMs show promise in medical applications but their error detection and correction capabilities remain under-evaluated, especially beyond English, which is crucial for safe deployment.
Method: Built MedRECT benchmark from Japanese Medical Licensing Exams and curated English counterpart, evaluating 9 LLMs across three subtasks: error detection, localization, and correction. Used LoRA fine-tuning for targeted improvements.
Result: Reasoning models showed substantial improvements (up to 13.5% in detection, 51% in extraction). Cross-lingual gaps of 5-10% from English to Japanese. Fine-tuned models achieved asymmetric improvements in correction and exceeded human expert performance.
Conclusion: MedRECT is the first comprehensive cross-lingual benchmark for medical error correction, providing a reproducible framework for developing safer medical LLMs across languages.
Abstract: Large language models (LLMs) show increasing promise in medical applications, but their ability to detect and correct errors in clinical texts – a prerequisite for safe deployment – remains under-evaluated, particularly beyond English. We introduce MedRECT, a cross-lingual benchmark (Japanese/English) that formulates medical error handling as three subtasks: error detection, error localization (sentence extraction), and error correction. MedRECT is built with a scalable, automated pipeline from the Japanese Medical Licensing Examinations (JMLE) and a curated English counterpart, yielding MedRECT-ja (663 texts) and MedRECT-en (458 texts) with comparable error/no-error balance. We evaluate 9 contemporary LLMs spanning proprietary, open-weight, and reasoning families. Key findings: (i) reasoning models substantially outperform standard architectures, with up to 13.5% relative improvement in error detection and 51.0% in sentence extraction; (ii) cross-lingual evaluation reveals 5-10% performance gaps from English to Japanese, with smaller disparities for reasoning models; (iii) targeted LoRA fine-tuning yields asymmetric improvements in error correction performance (Japanese: +0.078, English: +0.168) while preserving reasoning capabilities; and (iv) our fine-tuned model exceeds human expert performance on structured medical error correction tasks. To our knowledge, MedRECT is the first comprehensive cross-lingual benchmark for medical error correction, providing a reproducible framework and resources for developing safer medical LLMs across languages.
[18] G2: Guided Generation for Enhanced Output Diversity in LLMs
Zhiwen Ruan, Yixia Li, Yefeng Liu, Yun Chen, Weihua Luo, Peng Li, Yang Liu, Guanhua Chen
Main category: cs.CL
TL;DR: G2 is a training-free plug-and-play method that enhances LLM output diversity while preserving quality, using dual Guides to intervene in decoding without compromising generation quality.
Details
Motivation: LLMs generate highly similar content across attempts, limiting their effectiveness for tasks requiring diverse outputs like creative writing and reasoning. Existing solutions like temperature scaling improve diversity but sacrifice output quality.
Method: G2 employs a base generator with dual Guides that guide generation through decoding-based interventions, encouraging diverse outputs conditioned on the original query without requiring training.
Result: Comprehensive experiments show G2 effectively improves output diversity while maintaining optimal balance between diversity and quality.
Conclusion: G2 successfully addresses LLM output diversity limitations through a training-free approach that preserves generation quality while enhancing diversity.
Abstract: Large Language Models (LLMs) have demonstrated exceptional performance across diverse natural language processing tasks. However, these models exhibit a critical limitation in output diversity, often generating highly similar content across multiple attempts. This limitation significantly affects tasks requiring diverse outputs, from creative writing to reasoning. Existing solutions, like temperature scaling, enhance diversity by modifying probability distributions but compromise output quality. We propose Guide-to-Generation (G2), a training-free plug-and-play method that enhances output diversity while preserving generation quality. G2 employs a base generator alongside dual Guides, which guide the generation process through decoding-based interventions to encourage more diverse outputs conditioned on the original query. Comprehensive experiments demonstrate that G2 effectively improves output diversity while maintaining an optimal balance between diversity and quality.
[19] Remembering Unequally: Global and Disciplinary Bias in LLM-Generated Co-Authorship Networks
Ghazal Kalhor, Afra Mashhadi
Main category: cs.CL
TL;DR: This study examines how LLM memorization affects co-authorship networks, revealing biases favoring highly cited researchers globally, with some disciplines and regions showing more balanced representation.
Details
Motivation: As LLMs reshape search and recommendation platforms, they introduce fairness and bias issues that could undermine information integrity, particularly in scholarly tools where memorization impacts co-authorship network accuracy.
Method: The study assesses memorization effects across three LLMs (DeepSeek R1, Llama 4 Scout, Mixtral 8x7B), analyzing how memorization-driven outputs vary across academic disciplines and world regions.
Result: Global analysis shows consistent bias favoring highly cited researchers, but this pattern is not uniform - Clinical Medicine and parts of Africa show more balanced representation, suggesting areas where LLM training data may reflect greater equity.
Conclusion: The findings highlight both risks and opportunities in deploying LLMs for scholarly discovery, emphasizing the need to address memorization-driven biases while recognizing potential for more equitable representation in certain contexts.
Abstract: Ongoing breakthroughs in Large Language Models (LLMs) are reshaping search and recommendation platforms at their core. While this shift unlocks powerful new scientometric tools, it also exposes critical fairness and bias issues that could erode the integrity of the information ecosystem. Additionally, as LLMs become more integrated into web-based searches for scholarly tools, their ability to generate summarized research work based on memorized data introduces new dimensions to these challenges. The extent of memorization in LLMs can impact the accuracy and fairness of the co-authorship networks they produce, potentially reflecting and amplifying existing biases within the scientific community and across different regions. This study critically examines the impact of LLM memorization on the co-authorship networks. To this end, we assess memorization effects across three prominent models, DeepSeek R1, Llama 4 Scout, and Mixtral 8x7B, analyzing how memorization-driven outputs vary across academic disciplines and world regions. While our global analysis reveals a consistent bias favoring highly cited researchers, this pattern is not uniformly observed. Certain disciplines, such as Clinical Medicine, and regions, including parts of Africa, show more balanced representation, pointing to areas where LLM training data may reflect greater equity. These findings underscore both the risks and opportunities in deploying LLMs for scholarly discovery.
[20] Leveraging the Cross-Domain & Cross-Linguistic Corpus for Low Resource NMT: A Case Study On Bhili-Hindi-English Parallel Corpus
Pooja Singh, Shashwat Bhardwaj, Vaibhav Sharma, Sandeep Kumar
Main category: cs.CL
TL;DR: Created the first large-scale Bhili-Hindi-English parallel corpus (110K sentences) and established comprehensive machine translation benchmarks for the underrepresented Bhili language, showing fine-tuned NLLB-200 model performs best.
Details
Motivation: Address the linguistic diversity challenge in India, particularly for underrepresented tribal languages like Bhili that lack high-quality linguistic resources and machine translation capabilities.
Method: Built BHEPC corpus with expert human translators across education, administration, and news domains. Evaluated proprietary and open-source MLLMs on bidirectional translation tasks, including fine-tuning NLLB-200 and testing in-context learning capabilities.
Result: Fine-tuned NLLB-200 distilled 600M variant model outperformed other models in bidirectional translation between English/Hindi and Bhili. Comprehensive evaluation established benchmarks and assessed cross-domain generalization.
Conclusion: This work bridges critical resource gaps for low-resource languages and promotes inclusive NLP technologies, demonstrating multilingual models’ potential in low-resource scenarios.
Abstract: The linguistic diversity of India poses significant machine translation challenges, especially for underrepresented tribal languages like Bhili, which lack high-quality linguistic resources. This paper addresses the gap by introducing Bhili-Hindi-English Parallel Corpus (BHEPC), the first and largest parallel corpus worldwide comprising 110,000 meticulously curated sentences across Bhili, Hindi, and English. The corpus was created with the assistance of expert human translators. BHEPC spans critical domains such as education, administration, and news, establishing a valuable benchmark for research in low resource machine translation. To establish a comprehensive Bhili Machine Translation benchmark, we evaluated a wide range of proprietary and open-source Multilingual Large Language Models (MLLMs) on bidirectional translation tasks between English/Hindi and Bhili. Comprehensive evaluation demonstrates that the fine-tuned NLLB-200 distilled 600M variant model outperforms others, highlighting the potential of multilingual models in low resource scenarios. Furthermore, we investigated the generative translation capabilities of multilingual LLMs on BHEPC using in-context learning, assessing performance under cross-domain generalization and quantifying distributional divergence. This work bridges a critical resource gap and promotes inclusive natural language processing technologies for low-resource and marginalized languages globally.
[21] With Privacy, Size Matters: On the Importance of Dataset Size in Differentially Private Text Rewriting
Stephen Meisenbacher, Florian Matthes
Main category: cs.CL
TL;DR: The paper investigates how dataset size affects the privacy-utility trade-off in differentially private NLP text rewriting mechanisms, finding that dataset size is a critical factor that should be considered in evaluation procedures.
Details
Motivation: Previous work in DP NLP has ignored the impact of dataset size on mechanism efficacy for utility and privacy preservation, creating a gap in understanding how these mechanisms perform at scale.
Method: Designed utility and privacy tests on large-scale datasets with dynamic split sizes, running experiments on datasets of varying sizes (up to 1 million texts) to quantify the effect of increasing dataset size.
Result: Findings reveal that dataset size plays an integral role in evaluating DP text rewriting mechanisms, significantly affecting the privacy-utility trade-off.
Conclusion: The study calls for more rigorous evaluation procedures in DP NLP and provides insights into the future of DP NLP in practice and at scale.
Abstract: Recent work in Differential Privacy with Natural Language Processing (DP NLP) has proposed numerous promising techniques in the form of text rewriting mechanisms. In the evaluation of these mechanisms, an often-ignored aspect is that of dataset size, or rather, the effect of dataset size on a mechanism’s efficacy for utility and privacy preservation. In this work, we are the first to introduce this factor in the evaluation of DP text privatization, where we design utility and privacy tests on large-scale datasets with dynamic split sizes. We run these tests on datasets of varying size with up to one million texts, and we focus on quantifying the effect of increasing dataset size on the privacy-utility trade-off. Our findings reveal that dataset size plays an integral part in evaluating DP text rewriting mechanisms; additionally, these findings call for more rigorous evaluation procedures in DP NLP, as well as shed light on the future of DP NLP in practice and at scale.
[22] ToM: Leveraging Tree-oriented MapReduce for Long-Context Reasoning in Large Language Models
Jiani Guo, Zuchao Li, Jie Wu, Qianren Wang, Yun Li, Lefei Zhang, Hai Zhao, Yujiu Yang
Main category: cs.CL
TL;DR: ToM is a Tree-oriented MapReduce framework that improves long-context reasoning in LLMs by leveraging document hierarchical structure through recursive bottom-up aggregation, outperforming existing methods.
Details
Motivation: Existing methods like RAG and divide-and-conquer frameworks struggle with logical coherence and long-range dependencies when reasoning over long contexts due to similarity-based rankings and isolated chunk processing.
Method: ToM constructs a DocTree through hierarchical semantic parsing of document structure, then performs recursive reasoning using Tree MapReduce: Map step generates rationales at child nodes, Reduce step aggregates rationales across siblings to resolve conflicts at parent nodes. A schematic recursion is sketched after the abstract below.
Result: Experimental results on 70B+ LLMs show ToM significantly outperforms existing divide-and-conquer frameworks and retrieval-augmented generation methods, achieving better logical coherence and long-context reasoning.
Conclusion: ToM effectively addresses limitations of current approaches by leveraging document hierarchy for coherent long-context reasoning through recursive bottom-up aggregation.
Abstract: Large Language Models (LLMs), constrained by limited context windows, often face significant performance degradation when reasoning over long contexts. To address this, Retrieval-Augmented Generation (RAG) retrieves and reasons over chunks but frequently sacrifices logical coherence due to its reliance on similarity-based rankings. Similarly, divide-and-conquer frameworks (DCF) split documents into small chunks for independent reasoning and aggregation. While effective for local reasoning, DCF struggles to capture long-range dependencies and risks inducing conflicts by processing chunks in isolation. To overcome these limitations, we propose ToM, a novel Tree-oriented MapReduce framework for long-context reasoning. ToM leverages the inherent hierarchical structure of long documents (e.g., main headings and subheadings) by constructing a DocTree through hierarchical semantic parsing and performing bottom-up aggregation. Using a Tree MapReduce approach, ToM enables recursive reasoning: in the Map step, rationales are generated at child nodes; in the Reduce step, these rationales are aggregated across sibling nodes to resolve conflicts or reach consensus at parent nodes. Experimental results on 70B+ LLMs show that ToM significantly outperforms existing divide-and-conquer frameworks and retrieval-augmented generation methods, achieving better logical coherence and long-context reasoning. Our code is available at https://github.com/gjn12-31/ToM .
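A schematic version of the Tree MapReduce recursion, assuming hypothetical `llm_map` and `llm_reduce` calls; the paper's prompts, conflict-resolution logic, and DocTree construction are not reproduced here.

```python
# Schematic recursion over a document tree: Map produces a rationale at each
# leaf section, Reduce merges sibling rationales at the parent node.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DocNode:
    heading: str
    text: str = ""
    children: List["DocNode"] = field(default_factory=list)


def tree_mapreduce(node: DocNode, question: str, llm_map, llm_reduce) -> str:
    if not node.children:
        # Map step: reason over a leaf chunk in isolation.
        return llm_map(question, node.heading, node.text)
    # Recurse into children first (bottom-up aggregation).
    child_rationales = [tree_mapreduce(c, question, llm_map, llm_reduce)
                        for c in node.children]
    # Reduce step: reconcile sibling rationales at the parent node.
    return llm_reduce(question, node.heading, child_rationales)
```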
[23] Zero-RAG: Towards Retrieval-Augmented Generation with Zero Redundant Knowledge
Qi Luo, Xiaonan Li, Junqi Dai, Shuang Cheng, Xipeng Qiu
Main category: cs.CL
TL;DR: Zero-RAG addresses knowledge redundancy in RAG systems by pruning redundant external corpus knowledge and improving LLM’s utilization of internal knowledge, achieving 30% corpus reduction and 22% retrieval speedup without performance loss.
Details
Motivation: Current RAG systems suffer from significant knowledge redundancy between external corpora and LLMs' internal knowledge, which increases retrieval costs and hurts performance on questions that LLMs can answer themselves.
Method: Proposes Mastery-Score metric to identify redundant knowledge for corpus pruning, Query Router to avoid irrelevant documents, and Noise-Tolerant Tuning to improve LLM’s internal knowledge utilization with pruned corpus. A schematic pruning-and-routing sketch follows the abstract below.
Result: Zero-RAG prunes Wikipedia corpus by 30%, accelerates retrieval stage by 22%, and maintains RAG performance while reducing knowledge redundancy.
Conclusion: Zero-RAG effectively addresses knowledge redundancy in RAG systems through corpus pruning and improved internal knowledge utilization, demonstrating practical efficiency gains without compromising performance.
Abstract: Retrieval-Augmented Generation has shown remarkable results to address Large Language Models’ hallucinations, which usually uses a large external corpus to supplement knowledge to LLMs. However, with the development of LLMs, the internal knowledge of LLMs has expanded significantly, thus causing significant knowledge redundancy between the external corpus and LLMs. On the one hand, the indexing cost of dense retrieval is highly related to the corpus size and thus significant redundant knowledge intensifies the dense retrieval’s workload. On the other hand, the redundant knowledge in the external corpus is not helpful to LLMs and our exploratory analysis shows that it instead hurts the RAG performance on those questions which the LLM can answer by itself. To address these issues, we propose Zero-RAG to tackle these challenges. Specifically, we first propose the Mastery-Score metric to identify redundant knowledge in the RAG corpus to prune it. After pruning, answers to “mastered” questions rely primarily on internal knowledge of the LLM. To better harness the internal capacity, we propose Query Router and Noise-Tolerant Tuning to avoid the irrelevant documents’ distraction and thus further improve the LLM’s utilization of internal knowledge with pruned corpus. Experimental results show that Zero-RAG prunes the Wikipedia corpus by 30% and accelerates the retrieval stage by 22%, without compromising RAG’s performance.
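A rough sketch of mastery-based pruning and query routing. The summary does not spell out the Mastery-Score definition, so it is approximated here as closed-book accuracy on probe questions tied to each chunk; `llm_answer`, `probe_questions`, the router, and the 0.8 threshold are all illustrative assumptions, not the paper's recipe.

```python
# Schematic corpus pruning and query routing around a hypothetical mastery score.
def mastery_score(chunk, probe_questions, llm_answer):
    qa = probe_questions(chunk)  # (question, answer) pairs about the chunk
    correct = sum(llm_answer(q).strip().lower() == a.strip().lower() for q, a in qa)
    return correct / max(len(qa), 1)


def prune_corpus(corpus, probe_questions, llm_answer, threshold=0.8):
    # Drop chunks the model already "masters"; keep the rest for retrieval.
    return [c for c in corpus
            if mastery_score(c, probe_questions, llm_answer) < threshold]


def route_query(query, router, retriever, llm_answer):
    # Query Router: answer from internal knowledge when retrieval is unnecessary.
    if router(query) == "internal":
        return llm_answer(query)
    return llm_answer(query, context=retriever(query))
```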
[24] Fine-Tuning DialoGPT on Common Diseases in Rural Nepal for Medical Conversations
Birat Poudel, Satyam Ghimire, Er. Prakash Chandra Prasad
Main category: cs.CL
TL;DR: Fine-tuned DialoGPT model for offline medical conversations in rural Nepal, showing coherent and medically appropriate responses despite limited training data.
Details
Motivation: To support healthcare delivery in resource-constrained rural Nepal where internet connectivity and cloud infrastructure are often unavailable.
Method: Fine-tuned DialoGPT (a lightweight generative dialogue model) on a synthetically constructed dataset of doctor-patient interactions covering ten common diseases prevalent in rural Nepal.
Result: The fine-tuned model produced coherent, contextually relevant, and medically appropriate responses, demonstrating understanding of symptoms, disease context, and empathetic communication.
Conclusion: Compact, offline-capable dialogue models with targeted datasets are effective for domain adaptation in low-resource healthcare environments, offering promising directions for rural medical conversational AI.
Abstract: Conversational agents are increasingly being explored to support healthcare delivery, particularly in resource-constrained settings such as rural Nepal. Large-scale conversational models typically rely on internet connectivity and cloud infrastructure, which may not be accessible in rural areas. In this study, we fine-tuned DialoGPT, a lightweight generative dialogue model that can operate offline, on a synthetically constructed dataset of doctor-patient interactions covering ten common diseases prevalent in rural Nepal, including common cold, seasonal fever, diarrhea, typhoid fever, gastritis, food poisoning, malaria, dengue fever, tuberculosis, and pneumonia. Despite being trained on a limited, domain-specific dataset, the fine-tuned model produced coherent, contextually relevant, and medically appropriate responses, demonstrating an understanding of symptoms, disease context, and empathetic communication. These results highlight the adaptability of compact, offline-capable dialogue models and the effectiveness of targeted datasets for domain adaptation in low-resource healthcare environments, offering promising directions for future rural medical conversational AI.
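A minimal fine-tuning sketch in the spirit of this setup, using Hugging Face Transformers; the checkpoint size, example dialogue, and hyperparameters are placeholders rather than the paper's actual configuration.

```python
# Minimal DialoGPT fine-tuning sketch with Hugging Face Transformers.
# The dataset entry, checkpoint size, and hyperparameters are illustrative.
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
tokenizer.pad_token = tokenizer.eos_token  # DialoGPT defines no pad token by default

dialogues = [
    {"patient": "I have had a fever and chills for two days.",
     "doctor": "That sounds like a seasonal fever. Rest, drink fluids, and take "
               "paracetamol; visit the health post if it lasts more than three days."},
]

def to_features(example):
    # DialoGPT convention: dialogue turns separated by the EOS token.
    text = example["patient"] + tokenizer.eos_token + example["doctor"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=256)

dataset = Dataset.from_list(dialogues).map(to_features, remove_columns=["patient", "doctor"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dialogpt-rural-nepal",
                           num_train_epochs=3, per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```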
[25] Exploring and Mitigating Gender Bias in Encoder-Based Transformer Models
Ariyan Hossain, Khondokar Mohammad Ahanaf Hannan, Rakinul Haque, Nowreen Tarannum Rafa, Humayra Musarrat, Shoaib Ahmed Dipu, Farig Yousuf Sadeque
Main category: cs.CL
TL;DR: This paper investigates gender bias in transformer models (BERT, ALBERT, RoBERTa, DistilBERT) using a novel metric MALoR and proposes a mitigation approach using Counterfactual Data Augmentation that significantly reduces bias without performance loss.
Details
Motivation: Encoder-based transformer models exhibit strong gender biases inherited from training data, which is a growing concern in NLP that needs to be addressed.
Method: Introduces MALoR metric to quantify bias based on masked token probabilities, and proposes mitigation through continued pre-training on gender-balanced datasets generated via Counterfactual Data Augmentation.
Result: Significant bias reduction across models: BERT-base “he-she” bias dropped from 1.27 to 0.08, “his-her” from 2.51 to 0.36; BERT-large “male-female” bias decreased from 1.82 to 0.10. Similar improvements observed in other models.
Conclusion: The proposed approach effectively reduces gender bias in transformer models without compromising downstream task performance, providing a practical solution for bias mitigation.
Abstract: Gender bias in language models has gained increasing attention in the field of natural language processing. Encoder-based transformer models, which have achieved state-of-the-art performance in various language tasks, have been shown to exhibit strong gender biases inherited from their training data. This paper investigates gender bias in contextualized word embeddings, a crucial component of transformer-based models. We focus on prominent architectures such as BERT, ALBERT, RoBERTa, and DistilBERT to examine their vulnerability to gender bias. To quantify the degree of bias, we introduce a novel metric, MALoR, which assesses bias based on model probabilities for filling masked tokens. We further propose a mitigation approach involving continued pre-training on a gender-balanced dataset generated via Counterfactual Data Augmentation. Our experiments reveal significant reductions in gender bias scores across different pronoun pairs. For instance, in BERT-base, bias scores for “he-she” dropped from 1.27 to 0.08, and “his-her” from 2.51 to 0.36 following our mitigation approach. We also observed similar improvements across other models, with “male-female” bias decreasing from 1.82 to 0.10 in BERT-large. Our approach effectively reduces gender bias without compromising model performance on downstream tasks.
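The summary does not give the MALoR formula, so the sketch below only illustrates the two ingredients it builds on: probing masked-pronoun probabilities with an encoder model, and generating counterfactually gender-swapped text for continued pre-training. The probe ratio and swap table are illustrative assumptions.

```python
# Sketch of (a) a masked-pronoun probability probe and (b) naive counterfactual
# data augmentation by swapping gendered terms. The exact MALoR formula is not
# reproduced here; the ratio below is only an illustrative bias signal.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def pronoun_gap(template: str, pair=("he", "she")) -> float:
    # Higher value -> the model prefers the first pronoun in this context.
    scores = {r["token_str"]: r["score"] for r in fill(template, targets=list(pair))}
    return scores.get(pair[0], 0.0) / max(scores.get(pair[1], 1e-9), 1e-9)

SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his",
         "man": "woman", "woman": "man"}

def counterfactual(sentence: str) -> str:
    # Word-level CDA: swap gendered tokens to build a gender-balanced corpus.
    return " ".join(SWAPS.get(w.lower(), w) for w in sentence.split())

print(pronoun_gap("The engineer said [MASK] would finish the design soon."))
print(counterfactual("He finished his shift at the hospital."))
```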
[26] Word Salad Chopper: Reasoning Models Waste A Ton Of Decoding Budget On Useless Repetitions, Self-Knowingly
Wenya Xie, Shaochen Zhong, Hoang Anh Duy Le, Zhaozhuo Xu, Jianwen Xie, Zirui Liu
Main category: cs.CL
TL;DR: Large Reasoning Models produce useless self-repetitions called “word salad” that waste output tokens. The paper presents WordSaladChopper (WSC), a lightweight component that detects and removes these redundant tokens using hidden state patterns, achieving significant length savings with minimal quality loss.
Details
Motivation: Large Reasoning Models are bottlenecked by the high cost of output tokens, with a significant portion being useless self-repetitions that exhaust decoding budgets without adding value.
Method: Detect word salad behavior using a single-layer linear classifier that analyzes hidden states of <\n\n> tokens trailing reasoning chunks. Once detected, apply a simple chop followed by a straightforward regeneration prompt.
Result: Substantial length savings with minimal quality loss. WSC is lightweight, minimally invasive, and only removes semantically redundant tokens.
Conclusion: WSC or similar components are essential for all LRM applications focused on user experience, given their low overhead, strong savings, and the lack of semantic value in word salad tokens.
Abstract: Large Reasoning Models (LRMs) are often bottlenecked by the high cost of output tokens. We show that a significant portion of these tokens are useless self-repetitions - what we call “word salad” - that exhaust the decoding budget without adding value. Interestingly, we observe that LRMs are self-aware when trapped in these loops: the hidden states of <\n\n> tokens trailing each reasoning chunk exhibit patterns that allow us to detect word salad behavior on-the-fly via a single-layer linear classifier. Once detected, a simple chop followed by a straightforward regeneration prompt yields substantial length savings with minimal quality loss. Our work offers WordSaladChopper (WSC) - a lightweight, turnkey component for LRMs that is minimally invasive to the reasoning trajectory, removing only semantically redundant tokens. Given its low overhead, strong savings, and the lack of semantic value of word salad tokens, we believe it is not too far-fetched to argue that WSC - or a similar component - is a must-have for all LRM applications with user experience in mind. Our code is publicly available at https://github.com/wenyaxie023/WordSaladChopper.
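A minimal sketch of the detection-and-chop idea, assuming access to the hidden state of the "\n\n" token that closes each reasoning chunk; the probe weights, threshold, and regeneration text are hypothetical.

```python
# Sketch of word-salad detection and chopping: a single-layer linear probe over
# the hidden state of the "\n\n" token that closes each reasoning chunk.
import torch
import torch.nn as nn

class SaladProbe(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: hidden state of the trailing "\n\n" token, shape (hidden_size,)
        return torch.sigmoid(self.linear(h))

def maybe_chop(chunks: list[str], chunk_states: list[torch.Tensor],
               probe: SaladProbe, threshold: float = 0.5) -> list[str]:
    kept = []
    for text, h in zip(chunks, chunk_states):
        if probe(h).item() > threshold:
            # Word salad detected: drop the rest and ask the model to wrap up.
            kept.append("Earlier reasoning is repeating itself. "
                        "Stop and state the final answer now.")
            break
        kept.append(text)
    return kept
```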
[27] Multi-refined Feature Enhanced Sentiment Analysis Using Contextual Instruction
Peter Atandoh, Jie Zou, Weikang Guo, Jiwei Wei, Zheng Wang
Main category: cs.CL
TL;DR: CISEA-MRFE is a novel PLM-based framework that improves sentiment analysis by integrating contextual instructions, semantic enhancement augmentation, and multi-refined feature extraction to address limitations in handling nuanced emotions, domain shifts, and imbalanced distributions.
Details
Motivation: Existing sentiment analysis approaches underperform with nuanced emotional cues, domain shifts, and imbalanced sentiment distributions due to inadequate semantic grounding, poor generalization, and biases toward dominant sentiment classes.
Method: Proposes CISEA-MRFE framework with three components: Contextual Instruction (CI) for domain-aware sentiment disambiguation, Semantic Enhancement Augmentation (SEA) for sentiment-consistent paraphrastic augmentation, and Multi-Refined Feature Extraction (MRFE) combining Scale-Adaptive Depthwise Encoder (SADE) for multi-scale features and Emotion Evaluator Context Encoder (EECE) for affect-aware modeling.
Result: Outperforms strong baselines on four benchmark datasets with relative accuracy improvements: 4.6% on IMDb, 6.5% on Yelp, 30.3% on Twitter, and 4.1% on Amazon.
Conclusion: CISEA-MRFE demonstrates effective generalization ability for sentiment classification across varied domains, validating the proposed approach’s effectiveness.
Abstract: Sentiment analysis using deep learning and pre-trained language models (PLMs) has gained significant traction due to their ability to capture rich contextual representations. However, existing approaches often underperform in scenarios involving nuanced emotional cues, domain shifts, and imbalanced sentiment distributions. We argue that these limitations stem from inadequate semantic grounding, poor generalization to diverse linguistic patterns, and biases toward dominant sentiment classes. To overcome these challenges, we propose CISEA-MRFE, a novel PLM-based framework integrating Contextual Instruction (CI), Semantic Enhancement Augmentation (SEA), and Multi-Refined Feature Extraction (MRFE). CI injects domain-aware directives to guide sentiment disambiguation; SEA improves robustness through sentiment-consistent paraphrastic augmentation; and MRFE combines a Scale-Adaptive Depthwise Encoder (SADE) for multi-scale feature specialization with an Emotion Evaluator Context Encoder (EECE) for affect-aware sequence modeling. Experimental results on four benchmark datasets demonstrate that CISEA-MRFE consistently outperforms strong baselines, achieving relative improvements in accuracy of up to 4.6% on IMDb, 6.5% on Yelp, 30.3% on Twitter, and 4.1% on Amazon. These results validate the effectiveness and generalization ability of our approach for sentiment classification across varied domains.
[28] Friend or Foe: How LLMs’ Safety Mind Gets Fooled by Intent Shift Attack
Peng Ding, Jun Kuang, Wen Sun, Zongyu Wang, Xuezhi Cao, Xunliang Cai, Jiajun Chen, Shujian Huang
Main category: cs.CL
TL;DR: ISA (Intent Shift Attack) is a new jailbreaking method that transforms harmful intents into benign-seeming requests through minimal edits, achieving over 70% higher attack success rates than direct attacks and nearly 100% success after fine-tuning.
Details
Motivation: Existing jailbreaking attacks primarily distract LLMs with additional context or adversarial tokens without changing the core harmful intent, leaving fundamental vulnerabilities unaddressed.
Method: ISA establishes a taxonomy of intent transformations to generate attacks that mislead LLMs into perceiving harmful requests as benign information-seeking queries, using minimal natural edits rather than complex tokens or lengthy context.
Result: Extensive experiments show ISA achieves over 70% improvement in attack success rate compared to direct harmful prompts, and fine-tuning models on ISA-reformulated benign data elevates success rates to nearly 100%. Existing defenses prove inadequate against ISA.
Conclusion: ISA reveals fundamental challenges in intent inference for LLM safety and underscores the need for more effective defenses beyond current methods.
Abstract: Large language models (LLMs) remain vulnerable to jailbreaking attacks despite their impressive capabilities. Investigating these weaknesses is crucial for robust safety mechanisms. Existing attacks primarily distract LLMs by introducing additional context or adversarial tokens, leaving the core harmful intent unchanged. In this paper, we introduce ISA (Intent Shift Attack), which obfuscates LLMs about the intent of the attacks. More specifically, we establish a taxonomy of intent transformations and leverage them to generate attacks that may be misperceived by LLMs as benign requests for information. Unlike prior methods relying on complex tokens or lengthy context, our approach only needs minimal edits to the original request, and yields natural, human-readable, and seemingly harmless prompts. Extensive experiments on both open-source and commercial LLMs show that ISA achieves over 70% improvement in attack success rate compared to direct harmful prompts. More critically, fine-tuning models on only benign data reformulated with ISA templates elevates success rates to nearly 100%. For defense, we evaluate existing methods and demonstrate their inadequacy against ISA, while exploring both training-free and training-based mitigation strategies. Our findings reveal fundamental challenges in intent inference for LLMs safety and underscore the need for more effective defenses. Our code and datasets are available at https://github.com/NJUNLP/ISA.
[29] FlashEVA: Accelerating LLM inference via Efficient Attention
Juan Gabriel Kostelec, Qinghai Guo
Main category: cs.CL
TL;DR: FlashEVA is an efficient implementation of EVA attention that enables fine-tuning transformers with minimal tokens while maintaining performance, achieving up to 6.7x higher throughput and 5x lower GPU memory usage during inference.
Details
Motivation: Transformer models have high memory demands during inference due to maintaining full context in memory, which poses significant challenges for practical deployment.
Method: Present FlashEVA, an efficient implementation of EVA (Efficient Attention via Control Variates), and demonstrate how to finetune transformers to adapt to FlashEVA attention with as few as 1.5B tokens.
Result: FlashEVA achieves up to 6.7x higher throughput and 5x lower peak GPU memory usage during inference compared to standard Transformer implementations, while preserving effectiveness across various downstream tasks.
Conclusion: FlashEVA represents a significant step towards more efficient and adaptable Transformer-based models for inference, offering control over throughput-accuracy trade-offs through adjustable hyperparameters, though limitations exist in retrieval-focused tasks.
Abstract: Transformer models have revolutionized natural language processing, achieving state-of-the-art performance and demonstrating remarkable scalability. However, their memory demands, particularly due to maintaining full context in memory, pose significant challenges for inference. In this paper, we present FlashEVA, an efficient implementation of EVA (Efficient Attention via Control Variates), and demonstrate how to finetune transformers to adapt to FlashEVA attention. Our method enables fine-tuning of Transformer models with as few as 1.5B tokens while preserving effectiveness across various downstream tasks. Notably, FlashEVA achieves up to 6.7x higher throughput and 5x lower peak GPU memory usage during inference compared to standard Transformer implementations. Despite these improvements, we observe limitations in retrieval-focused tasks. Our implementation offers control over the trade-off between throughput and accuracy through adjustable hyperparameters, providing flexibility for diverse use cases. This work represents a significant step towards more efficient and adaptable Transformer-based models for inference.
[30] OpenSIR: Open-Ended Self-Improving Reasoner
Wai-Chung Kwan, Joshua Ong Jun Leang, Pavlos Vougiouklis, Jeff Z. Pan, Marco Valentino, Pasquale Minervini
Main category: cs.CL
TL;DR: OpenSIR is a self-play framework where LLMs alternate teacher and student roles to generate and solve novel problems without external supervision, enabling open-ended mathematical discovery and significant performance improvements.
Details
Motivation: Existing LLM reasoning methods rely on annotated datasets that limit surpassing human-level performance, while self-play approaches depend on external verifiers or cannot learn open-endedly.
Method: OpenSIR uses alternating teacher-student roles where the LLM generates novel problems optimizing for difficulty and diversity, and solves them without external supervision through self-play.
Result: Substantial improvements: Llama-3.2-3B-Instruct advanced from 73.9 to 78.3 on GSM8K and 28.8 to 34.4 on College Math; Gemma-2-2B-Instruct rose from 38.5 to 58.7 on GSM8K.
Conclusion: OpenSIR achieves open-ended learning through co-evolving teacher-student roles that adaptively calibrate difficulty and drive diverse exploration, enabling autonomous progression from basic to advanced mathematics.
Abstract: Recent advances in large language model (LLM) reasoning through reinforcement learning rely on annotated datasets for verifiable rewards, which may limit models’ ability to surpass human-level performance. While self-play offers a promising alternative, existing approaches depend on external verifiers or cannot learn open-endedly. We present Open-Ended Self-Improving Reasoner (OpenSIR), a self-play framework where an LLM learns to generate and solve novel problems by alternating teacher and student roles without external supervision. To generate novel problems, OpenSIR optimises for both difficulty and diversity, rewarding problems that challenge appropriately while exploring distinct concepts, enabling open-ended mathematical discovery. Starting from a single trivial seed problem, OpenSIR substantially improves instruction models: Llama-3.2-3B-Instruct advances from 73.9 to 78.3 on GSM8K, and from 28.8 to 34.4 on College Math, while Gemma-2-2B-Instruct rises from 38.5 to 58.7 on GSM8K. Our analyses reveal that OpenSIR achieves open-ended learning through co-evolving teacher-student roles that adaptively calibrate difficulty and drive diverse exploration, progressing autonomously from basic to advanced mathematics.
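At a high level, one self-play round alternates a teacher step (propose a problem) with student attempts (solve it), and rewards the teacher for problems that are both appropriately difficult and novel. The sketch below is only a schematic of that loop; `propose`, `solve`, `check`, and `novelty` are hypothetical stand-ins, and the actual RL updates and reward shaping are not shown.

```python
# Schematic of one teacher/student self-play round with a difficulty- and
# diversity-shaped teacher reward. All model methods are hypothetical stand-ins.
def self_play_round(model, seen_problems: list, target_rate: float = 0.5, n_attempts: int = 8):
    problem = model.propose(avoid=seen_problems)                 # teacher role
    solved = sum(model.check(problem, model.solve(problem))      # student role
                 for _ in range(n_attempts)) / n_attempts
    # Reward problems that are neither trivial nor impossible, and that are novel.
    difficulty_reward = 1.0 - abs(solved - target_rate)
    diversity_reward = model.novelty(problem, seen_problems)
    seen_problems.append(problem)
    return problem, difficulty_reward + diversity_reward
```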
[31] SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding
Jameson Sandler, Jacob K. Christopher, Thomas Hartvigsen, Nando Fioretto
Main category: cs.CL
TL;DR: SpecDiff-2 is a novel speculative decoding framework that uses discrete diffusion as a non-autoregressive drafter to overcome parallelism limitations and misalignment issues in current approaches, achieving up to 5.5x speed-up over standard decoding without accuracy loss.
Details
Motivation: Current speculative decoding approaches are limited by two bottlenecks: (1) autoregressive dependency during drafting that limits parallelism, and (2) frequent rejections of draft tokens due to misalignment between draft and verify models.
Method: Proposes SpecDiff-2 framework that leverages discrete diffusion as a non-autoregressive drafter to address parallelism limitations, and develops novel techniques to calibrate discrete diffusion drafters with autoregressive verifiers to address misalignment issues.
Result: Achieves state-of-the-art performance across reasoning, coding, and mathematical benchmarks with up to +55% improvement in tokens-per-second over previous baselines and up to 5.5x average speed-up over standard decoding, with no accuracy loss.
Conclusion: SpecDiff-2 successfully addresses the fundamental bottlenecks in speculative decoding through non-autoregressive drafting and improved calibration, establishing a new state-of-the-art for LLM inference acceleration.
Abstract: Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups. Yet, current speculative decoding approaches remain limited by two fundamental bottlenecks: (1) the autoregressive dependency during drafting which limits parallelism, and (2) frequent rejections of draft tokens caused by misalignment between the draft and verify models. This paper proposes SpecDiff-2, a novel framework to jointly address these two bottlenecks. It leverages discrete diffusion as a non-autoregressive drafter to address bottleneck (1) and develops novel techniques to calibrate discrete diffusion drafters with autoregressive verifiers, addressing bottleneck (2). Experimental results across a comprehensive benchmark suite show that SpecDiff-2 achieves a new state-of-the-art across reasoning, coding, and mathematical benchmarks, improving tokens-per-second by up to an average of +55% over previous baselines and obtaining up to 5.5x average speed-up over standard decoding, without any loss of accuracy.
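For context, the draft-then-verify procedure that speculative decoding relies on can be sketched as below; this is the generic lossless acceptance loop, not the discrete-diffusion drafter or the calibration techniques SpecDiff-2 introduces, and the probability functions are hypothetical wrappers.

```python
# Generic draft-then-verify skeleton behind speculative decoding.
# `draft_prob` and `verify_prob` are hypothetical per-token probability functions.
import random

def speculative_step(prefix: list[int], drafts: list[int],
                     draft_prob, verify_prob) -> list[int]:
    """Accept drafted tokens left-to-right; stop at the first rejection."""
    accepted = []
    for tok in drafts:
        p_d = draft_prob(prefix + accepted, tok)   # drafter's probability
        p_v = verify_prob(prefix + accepted, tok)  # verifier's probability
        # Standard acceptance test: keep the token with probability min(1, p_v / p_d).
        if random.random() < min(1.0, p_v / max(p_d, 1e-12)):
            accepted.append(tok)
        else:
            break  # rejected: the verifier resamples from a corrected distribution
    return accepted
```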
[32] Certain but not Probable? Differentiating Certainty from Probability in LLM Token Outputs for Probabilistic Scenarios
Autumn Toney-Wails, Ryan Wails
Main category: cs.CL
TL;DR: This paper investigates how well LLMs’ token-level probabilities align with theoretical distributions in probabilistic scenarios, finding that while models achieve perfect accuracy, their probability outputs diverge from expected theoretical distributions.
Details
Motivation: Reliable uncertainty quantification is essential for trustworthy LLM deployment in decision-support applications, but current approaches using token logits may be inadequate for probabilistic scenarios where output probabilities should align with theoretical distributions.
Method: Evaluated GPT-4.1 and DeepSeek-Chat on ten probabilistic prompts (e.g., rolling dice) with and without explicit probability cues, measuring response validity and alignment between token-level probabilities and theoretical distributions.
Result: Both models achieved perfect response accuracy across all scenarios, but their token-level probability and entropy values consistently diverged from the corresponding theoretical distributions.
Conclusion: Current LLM uncertainty quantification methods using token probabilities may not reliably reflect theoretical probability distributions, even when models produce accurate responses.
Abstract: Reliable uncertainty quantification (UQ) is essential for ensuring trustworthy downstream use of large language models, especially when they are deployed in decision-support and other knowledge-intensive applications. Model certainty can be estimated from token logits, with derived probability and entropy values offering insight into performance on the prompt task. However, this approach may be inadequate for probabilistic scenarios, where the probabilities of token outputs are expected to align with the theoretical probabilities of the possible outcomes. We investigate the relationship between token certainty and alignment with theoretical probability distributions in well-defined probabilistic scenarios. Using GPT-4.1 and DeepSeek-Chat, we evaluate model responses to ten prompts involving probability (e.g., roll a six-sided die), both with and without explicit probability cues in the prompt (e.g., roll a fair six-sided die). We measure two dimensions: (1) response validity with respect to scenario constraints, and (2) alignment between token-level output probabilities and theoretical probabilities. Our results indicate that, while both models achieve perfect in-domain response accuracy across all prompt scenarios, their token-level probability and entropy values consistently diverge from the corresponding theoretical distributions.
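The alignment check described above amounts to comparing a model's token-level distribution over outcomes with the theoretical one. Below is a small sketch of that comparison for a fair six-sided die, using entropy and KL divergence; the example probabilities are placeholders, not measured model outputs.

```python
# Compare a (placeholder) model distribution over die outcomes with the
# theoretical uniform distribution via entropy and KL divergence.
import math

def entropy(p: dict) -> float:
    return -sum(x * math.log2(x) for x in p.values() if x > 0)

def kl_to_uniform(p: dict) -> float:
    u = 1.0 / len(p)
    return sum(x * math.log2(x / u) for x in p.values() if x > 0)

# Placeholder token probabilities for "roll a fair six-sided die".
model_probs = {"1": 0.05, "2": 0.08, "3": 0.40, "4": 0.30, "5": 0.10, "6": 0.07}
print(f"entropy = {entropy(model_probs):.2f} bits (uniform would be {math.log2(6):.2f})")
print(f"KL(model || uniform) = {kl_to_uniform(model_probs):.2f} bits")
```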
[33] Modeling the Construction of a Literary Archetype: The Case of the Detective Figure in French Literature
Jean Barré, Olga Seminck, Antoine Bourgois, Thierry Poibeau
Main category: cs.CL
TL;DR: Computational analysis shows the detective archetype in French fiction evolves from secondary character to central “reasoning machine” in classical stories, then becomes more complex with moral ambiguity after WWII hardboiled influence.
Details
Motivation: To understand how the detective archetype evolved in French detective fiction over 150 years using computational methods.
Method: Used quantitative methods and character-level embeddings with a supervised model to analyze detective characters across French literature from 1866 to 2017.
Result: The model successfully captured the unity of the detective archetype across 150 years, showing evolution from secondary narrative role to central character and “reasoning machine” in classical detective stories, with increased complexity and moral ambiguity after WWII hardboiled influence.
Conclusion: The detective archetype in French fiction follows an evolutionary path from secondary character to complex protagonist, reflecting genre transformations and cultural influences over time.
Abstract: This research explores the evolution of the detective archetype in French detective fiction through computational analysis. Using quantitative methods and character-level embeddings, we show that a supervised model is able to capture the unity of the detective archetype across 150 years of literature, from M. Lecoq (1866) to Commissaire Adamsberg (2017). Building on this finding, the study demonstrates how the detective figure evolves from a secondary narrative role to become the central character and the “reasoning machine” of the classical detective story. In the aftermath of the Second World War, with the importation of the hardboiled tradition into France, the archetype becomes more complex, navigating the genre’s turn toward social violence and moral ambiguity.
[34] Do You Know About My Nation? Investigating Multilingual Language Models’ Cultural Literacy Through Factual Knowledge
Eshaan Tanwar, Anwoy Chatterjee, Michael Saxon, Alon Albalak, William Yang Wang, Tanmoy Chakraborty
Main category: cs.CL
TL;DR: XNationQA is a multilingual QA benchmark covering 49,280 questions about geography, culture, and history of 9 countries in 7 languages, revealing cultural literacy gaps in multilingual LLMs.
Details
Motivation: Most multilingual QA benchmarks are Western-centric and don't capture regional diversity, creating gaps in evaluating models' understanding of factual information from diverse geographical locations.
Method: Created XNationQA benchmark with questions on 9 countries’ geography, culture, and history in 7 languages, then benchmarked 8 multilingual LLMs using two novel transference metrics.
Result: Models show significant discrepancy in accessing culturally specific facts across languages, often knowing more cultural information in English than the culture’s dominant language. Models perform better in Western languages but aren’t necessarily more literate about Western countries.
Conclusion: Multilingual LLMs have limited ability to transfer knowledge across languages (especially open-source models), and current evaluation methods fail to capture cultural literacy fairly across different regions.
Abstract: Most multilingual question-answering benchmarks, while covering a diverse pool of languages, do not factor in regional diversity in the information they capture and tend to be Western-centric. This introduces a significant gap in fairly evaluating multilingual models’ comprehension of factual information from diverse geographical locations. To address this, we introduce XNationQA for investigating the cultural literacy of multilingual LLMs. XNationQA encompasses a total of 49,280 questions on the geography, culture, and history of nine countries, presented in seven languages. We benchmark eight standard multilingual LLMs on XNationQA and evaluate them using two novel transference metrics. Our analyses uncover a considerable discrepancy in the models' accessibility to culturally specific facts across languages. Notably, we often find that a model demonstrates greater knowledge of cultural information in English than in the dominant language of the respective culture. The models exhibit better performance in Western languages, although this does not necessarily translate to being more literate for Western countries, which is counterintuitive. Furthermore, we observe that models have a very limited ability to transfer knowledge across languages, particularly evident in open-source models.
[35] Do Methods to Jailbreak and Defend LLMs Generalize Across Languages?
Berk Atil, Rebecca J. Passonneau, Fred Morstatter
Main category: cs.CL
TL;DR: First systematic multilingual evaluation of jailbreak attacks and defenses across 10 languages shows safety alignment varies significantly by language, with high-resource languages being safer for standard queries but more vulnerable to adversarial attacks.
Details
Motivation: While LLMs undergo safety alignment, jailbreak attacks can bypass safety measures, but cross-lingual generalization of these attacks and defenses remains underexplored.
Method: Systematic multilingual evaluation across 10 languages (high-, medium-, low-resource) using 6 LLMs on HarmBench and AdvBench, assessing logical-expression-based and adversarial-prompt-based jailbreaks.
Result: Attack success and defense robustness vary across languages; high-resource languages are safer under standard queries but more vulnerable to adversarial attacks; simple defenses are effective but language- and model-dependent.
Conclusion: Findings demonstrate the need for language-aware and cross-lingual safety benchmarks for LLMs to address varying safety alignment across different languages.
Abstract: Large language models (LLMs) undergo safety alignment after training and tuning, yet recent work shows that safety can be bypassed through jailbreak attacks. While many jailbreaks and defenses exist, their cross-lingual generalization remains underexplored. This paper presents the first systematic multilingual evaluation of jailbreaks and defenses across ten languages–spanning high-, medium-, and low-resource languages–using six LLMs on HarmBench and AdvBench. We assess two jailbreak types: logical-expression-based and adversarial-prompt-based. For both types, attack success and defense robustness vary across languages: high-resource languages are safer under standard queries but more vulnerable to adversarial ones. Simple defenses can be effective, but are language- and model-dependent. These findings call for language-aware and cross-lingual safety benchmarks for LLMs.
[36] Optimizing Native Sparse Attention with Latent Attention and Local Global Alternating Strategies
Yuxuan Hu, Jianchao Tan, Jiaqi Zhang, Wen Zan, Pingwei Sun, Yifan Lu, Yerui Sun, Yuchen Xie, Xunliang Cai, Jing Zhang
Main category: cs.CL
TL;DR: Proposes improved Native Sparse Attention with alternating local-global attention patterns and latent attention mechanisms, reducing KV-cache by 50% while enhancing long-context modeling.
Details
Motivation: To enhance long-context modeling capabilities of sparse attention mechanisms by addressing limitations in long-range dependency propagation.
Method: Alternates between local (sliding-window) and global (compression, selective) attention across layers, enhanced with Multi-head Latent Attention for sliding-window and Group-head Latent Attention for compression/selective branches.
Result: Reduces KV-cache memory by 50% versus Native Sparse Attention while improving common-sense reasoning and long-text understanding. Matches or exceeds full attention and native sparse attention in benchmarks.
Conclusion: Alternating attention patterns with latent attention mechanisms effectively improves long-context modeling efficiency and performance across various model sizes.
Abstract: In this work, we conduct a systematic analysis of Native Sparse Attention (NSA) and propose targeted improvements that enhance long-context modeling. A key insight is that alternating between local (sliding-window) and global (compression, selective) attention across layers, rather than using fixed patterns, enables more effective propagation of long-range dependencies and substantially boosts performance on long-sequence tasks. Meanwhile, we further refine NSA’s branches with Latent Attention: the sliding-window branch is enhanced with Multi-head Latent Attention (MLA), while the compression and selective branches adopt Group-head Latent Attention (GLA). These changes reduce KV-cache memory by 50% versus NSA while improving the model’s common-sense reasoning and long-text understanding capabilities. Experiments on models from 340M to 1.3B parameters (trained on 15B and 100B tokens) show our method matches or exceeds full attention and native sparse attention in both common-sense reasoning and long-context understanding tasks.
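The layer-wise alternation can be pictured as a simple schedule assigning each layer either the local or the global branch set, as in the toy sketch below; the exact pattern, branch composition, and naming here are assumptions, not the paper's configuration.

```python
# Toy layer schedule alternating local and global attention branches.
def attention_schedule(num_layers: int) -> list[str]:
    return ["local (sliding-window, MLA)" if i % 2 == 0
            else "global (compression + selective, GLA)"
            for i in range(num_layers)]

for i, branch in enumerate(attention_schedule(6)):
    print(f"layer {i}: {branch}")
```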
[37] TriCon-Fair: Triplet Contrastive Learning for Mitigating Social Bias in Pre-trained Language Models
Chong Lyu, Lin Li, Shiqing Wu, Jingling Yuan
Main category: cs.CL
TL;DR: TriCon-Fair is a contrastive learning framework that addresses social bias in LLMs by decoupling biased and unbiased samples using triplet loss and language modeling to eliminate negative-positive coupling.
Details
Motivation: Existing debiasing methods treat biased and unbiased samples independently, ignoring their mutual relationship and creating negative-positive coupling where improvements for one group compromise the other, allowing residual bias to persist.
Method: TriCon-Fair uses a decoupled loss combining triplet and language modeling terms. It assigns each anchor an explicitly biased negative and an unbiased positive to decouple push-pull dynamics, while jointly optimizing an LM objective to preserve general capability.
Result: Experimental results show TriCon-Fair reduces discriminatory output beyond existing debiasing baselines while maintaining strong downstream performance.
Conclusion: TriCon-Fair offers a practical and ethical solution for sensitive NLP applications by effectively addressing social bias in LLMs without compromising model performance.
Abstract: The increasing utilization of large language models raises significant concerns about the propagation of social biases, which may result in harmful and unfair outcomes. However, existing debiasing methods treat the biased and unbiased samples independently, thus ignoring their mutual relationship. This oversight enables a hidden negative-positive coupling, where improvements for one group inadvertently compromise the other, allowing residual social bias to persist. In this paper, we introduce TriCon-Fair, a contrastive learning framework that employs a decoupled loss that combines triplet and language modeling terms to eliminate positive-negative coupling. Our TriCon-Fair assigns each anchor an explicitly biased negative and an unbiased positive, decoupling the push-pull dynamics and avoiding positive-negative coupling, and jointly optimizes a language modeling (LM) objective to preserve general capability. Experimental results demonstrate that TriCon-Fair reduces discriminatory output beyond existing debiasing baselines while maintaining strong downstream performance. This suggests that our proposed TriCon-Fair offers a practical and ethical solution for sensitive NLP applications.
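A sketch of what a combined triplet-plus-LM objective of this kind can look like in PyTorch, with anchor, unbiased-positive, and biased-negative sentence embeddings; the encoder, margin, and weighting are illustrative and not the paper's exact loss.

```python
# Sketch of a decoupled objective: a triplet margin term pulling the anchor
# toward an unbiased positive and away from an explicitly biased negative,
# plus a language-modeling loss to preserve general capability.
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=1.0)
lm_loss_fn = nn.CrossEntropyLoss()

def combined_loss(anchor_emb, positive_emb, negative_emb,
                  lm_logits, lm_labels, alpha: float = 0.5):
    # lm_logits: (batch, seq, vocab); lm_labels: (batch, seq)
    l_triplet = triplet(anchor_emb, positive_emb, negative_emb)
    l_lm = lm_loss_fn(lm_logits.view(-1, lm_logits.size(-1)), lm_labels.view(-1))
    return alpha * l_triplet + (1 - alpha) * l_lm

# Toy shapes to show the call signature.
a, p, n = (torch.randn(4, 768) for _ in range(3))
logits, labels = torch.randn(4, 16, 50257), torch.randint(0, 50257, (4, 16))
print(combined_loss(a, p, n, logits, labels))
```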
[38] Assessing LLM Reasoning Steps via Principal Knowledge Grounding
Hyeon Hwang, Yewon Cho, Chanwoong Yoon, Yein Park, Minju Song, Kyungjae Lee, Gangwoo Kim, Jaewoo Kang
Main category: cs.CL
TL;DR: A novel evaluation suite that systematically assesses how well LLMs ground their step-by-step reasoning in prerequisite knowledge, with metrics for recall and application of knowledge, plus applications in preference optimization.
Details
Motivation: To verify that LLM reasoning is accurately grounded in knowledge, addressing the fundamental question of how to ensure intermediate reasoning steps are based on proper knowledge foundations.
Method: Three-component framework: (1) Principal Knowledge Collection - large-scale repository of atomic knowledge; (2) knowledge-grounded evaluation metrics for measuring recall and application of prerequisite knowledge; (3) evaluator LLM for cost-effective metric computation.
Result: The evaluation suite effectively identifies missing or misapplied knowledge elements, revealing fundamental reasoning deficiencies in LLMs.
Conclusion: The framework provides crucial insights into LLM reasoning quality and demonstrates applications beyond evaluation, including integration into preference optimization.
Abstract: Step-by-step reasoning has become a standard approach for large language models (LLMs) to tackle complex tasks. While this paradigm has proven effective, it raises a fundamental question: How can we verify that an LLM’s reasoning is accurately grounded in knowledge? To address this question, we introduce a novel evaluation suite that systematically assesses the knowledge grounding of intermediate reasoning. Our framework comprises three key components. (1) Principal Knowledge Collection, a large-scale repository of atomic knowledge essential for reasoning. Based on the collection, we propose (2) knowledge-grounded evaluation metrics designed to measure how well models recall and apply prerequisite knowledge in reasoning. These metrics are computed by our (3) evaluator LLM, a lightweight model optimized for cost-effective and reliable metric computation. Our evaluation suite demonstrates remarkable effectiveness in identifying missing or misapplied knowledge elements, providing crucial insights for uncovering fundamental reasoning deficiencies in LLMs. Beyond evaluation, we demonstrate how these metrics can be integrated into preference optimization, showcasing further applications of knowledge-grounded evaluation.
[39] ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval
Ahmed Masry, Megh Thakkar, Patrice Bechard, Sathwik Tejaswi Madhusudhan, Rabiul Awal, Shambhavi Mishra, Akshay Kalkunte Suresh, Srivatsava Daruru, Enamul Hoque, Spandana Gella, Torsten Scholak, Sai Rajeswar
Main category: cs.CL
TL;DR: ColMate is a multimodal document retrieval model that improves over existing methods by using OCR-based pretraining, self-supervised masked contrastive learning, and late interaction scoring tailored to multimodal document structures.
Details
Motivation: Existing multimodal document retrieval methods often replicate text-only retrieval techniques without considering multimodal document structures and visual characteristics, limiting their effectiveness.
Method: Uses OCR-based pretraining objective, self-supervised masked contrastive learning, and late interaction scoring mechanism specifically designed for multimodal document structures.
Result: Achieves 3.61% improvement over existing retrieval models on ViDoRe V2 benchmark and shows stronger generalization to out-of-domain benchmarks.
Conclusion: ColMate successfully bridges the gap between multimodal representation learning and document retrieval, demonstrating superior performance through specialized multimodal techniques.
Abstract: Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or compute similarity scores. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate utilizes a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late interaction scoring mechanism more relevant to multimodal document structures and visual characteristics. ColMate obtains 3.61% improvements over existing retrieval models on the ViDoRe V2 benchmark, demonstrating stronger generalization to out-of-domain benchmarks.
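Late interaction scoring in the ColBERT style, which ColMate's scoring mechanism builds on, can be sketched as a MaxSim sum: each query token embedding is matched to its best document token or patch embedding and the maxima are summed. The sketch below shows only this generic form, not ColMate's exact scoring head.

```python
# Generic late interaction ("MaxSim") scoring between query and document token
# embeddings; ColMate's exact scoring head is not reproduced here.
import torch
import torch.nn.functional as F

def late_interaction_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    # q_emb: (num_query_tokens, dim), d_emb: (num_doc_tokens_or_patches, dim)
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    sim = q @ d.T                        # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=-1).values.sum()  # best document match per query token, summed

score = late_interaction_score(torch.randn(12, 128), torch.randn(900, 128))
print(score)
```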
[40] The Biased Oracle: Assessing LLMs’ Understandability and Empathy in Medical Diagnoses
Jianzhou Yao, Shunchang Liu, Guillaume Drui, Rikard Pettersson, Alessandro Blasimme, Sara Kijewski
Main category: cs.CL
TL;DR: LLMs show promise for clinical diagnostic communication but produce overly complex explanations and biased affective empathy, creating uneven accessibility despite adapting to patient variables.
Details
Motivation: To evaluate LLMs' ability to generate understandable and empathetic medical explanations for patients in diagnostic scenarios.
Method: Assessed two leading LLMs on medical diagnostic scenarios using readability metrics for understandability and LLM-as-a-Judge ratings for empathy, compared with human evaluations.
Result: LLMs adapt explanations to socio-demographic variables and patient conditions but generate overly complex content and display biased affective empathy, leading to uneven accessibility.
Conclusion: Systematic calibration is needed to ensure equitable patient communication with LLMs.
Abstract: Large language models (LLMs) show promise for supporting clinicians in diagnostic communication by generating explanations and guidance for patients. Yet their ability to produce outputs that are both understandable and empathetic remains uncertain. We evaluate two leading LLMs on medical diagnostic scenarios, assessing understandability using readability metrics as a proxy and empathy through LLM-as-a-Judge ratings compared to human evaluations. The results indicate that LLMs adapt explanations to socio-demographic variables and patient conditions. However, they also generate overly complex content and display biased affective empathy, leading to uneven accessibility and support. These patterns underscore the need for systematic calibration to ensure equitable patient communication. The code and data are released: https://github.com/Jeffateth/Biased_Oracle
[41] The Riddle of Reflection: Evaluating Reasoning and Self-Awareness in Multilingual LLMs using Indian Riddles
Abhinav P M, Ojasva Saxena, Oswald C, Parameswari Krishnamurthy
Main category: cs.CL
TL;DR: LLMs show limited culturally-grounded reasoning across 7 Indian languages, with top models being overconfident while weaker models are more self-aware of their mistakes.
Details
Motivation: To examine LLMs' reasoning and self-assessment abilities across non-English languages, particularly Indian languages where cultural grounding is crucial.
Method: Created multilingual riddle dataset with traditional and context-reconstructed variants, evaluated 5 LLMs using 7 prompting strategies, and conducted two-stage evaluation (riddle-solving and self-assessment).
Result: Gemini 2.5 Pro performed best overall but showed minimal gains from few-shot learning. Key finding: initial accuracy inversely correlated with self-awareness - top models were overconfident (4.34% TNR) while weaker models were more self-aware (42.09% TNR).
Conclusion: Clear gaps exist in multilingual reasoning, highlighting the need for models that not only reason effectively but also recognize their own limitations across diverse cultural contexts.
Abstract: The extent to which large language models (LLMs) can perform culturally grounded reasoning across non-English languages remains underexplored. This paper examines the reasoning and self-assessment abilities of LLMs across seven major Indian languages-Bengali, Gujarati, Hindi, Kannada, Malayalam, Tamil, and Telugu. We introduce a multilingual riddle dataset combining traditional riddles with context-reconstructed variants and evaluate five LLMs-Gemini 2.5 Pro, Gemini 2.5 Flash, Mistral-Saba, LLaMA 4 Scout, and LLaMA 4 Maverick-under seven prompting strategies. In the first stage, we assess riddle-solving performance and find that while Gemini 2.5 Pro performs best overall, few-shot methods yield only marginal gains, and accuracy varies notably across languages. In the second stage, we conduct a self-evaluation experiment to measure reasoning consistency. The results reveal a key finding: a model’s initial accuracy is inversely correlated with its ability to identify its own mistakes. Top-performing models such as Gemini 2.5 Pro are overconfident (4.34% True Negative Rate), whereas lower-performing models like LLaMA 4 Scout are substantially more self-aware (42.09% True Negative Rate). These results point to clear gaps in multilingual reasoning and highlight the need for models that not only reason effectively but also recognize their own limitations.
[42] Advancing Machine-Generated Text Detection from an Easy to Hard Supervision Perspective
Chenwang Wu, Yiu-ming Cheung, Bo Han, Defu Lian
Main category: cs.CL
TL;DR: Proposes an easy-to-hard enhancement framework for machine-generated text detection that addresses boundary ambiguity and inexact learning by using an easy supervisor on longer texts to enhance a target detector.
Details
Motivation: Traditional MGT detection methods assume exact labels as the golden standard, but boundary ambiguity and the limitations of human cognition make inexact learning widespread and inevitable.
Method: Uses an easy supervisor targeting longer-text detection tasks to enhance a target detector. Longer texts alleviate the impact of inexact labels, and structurally incorporating the detector into the supervisor models the supervisor as a lower performance bound for the detector.
Result: Extensive experiments across cross-LLM, cross-domain, mixed text, and paraphrase attack scenarios demonstrate significant detection effectiveness.
Conclusion: The framework provides reliable supervision under inexact conditions and effectively approximates underlying golden labels through indirect optimization.
Abstract: Existing machine-generated text (MGT) detection methods implicitly assume labels as the “golden standard”. However, we reveal boundary ambiguity in MGT detection, implying that traditional training paradigms are inexact. Moreover, limitations of human cognition and the superintelligence of detectors make inexact learning widespread and inevitable. To this end, we propose an easy-to-hard enhancement framework to provide reliable supervision under such inexact conditions. Distinct from knowledge distillation, our framework employs an easy supervisor targeting relatively simple longer-text detection tasks (despite weaker capabilities), to enhance the more challenging target detector. Firstly, longer texts targeted by supervisors theoretically alleviate the impact of inexact labels, laying the foundation for reliable supervision. Secondly, by structurally incorporating the detector into the supervisor, we theoretically model the supervisor as a lower performance bound for the detector. Thus, optimizing the supervisor indirectly optimizes the detector, ultimately approximating the underlying “golden” labels. Extensive experiments across diverse practical scenarios, including cross-LLM, cross-domain, mixed text, and paraphrase attacks, demonstrate the framework’s significant detection effectiveness. The code is available at: https://github.com/tmlr-group/Easy2Hard.
[43] MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL
Haolin Yang, Jipeng Zhang, Zhitao He, Yi R. Fung
Main category: cs.CL
TL;DR: MARS-SQL is a multi-agent framework that combines task decomposition and interactive RL for robust natural language to SQL translation, achieving state-of-the-art performance on BIRD and Spider benchmarks.
Details
Motivation: Natural language to SQL translation remains challenging for complex queries that require environmental interaction and self-correction capabilities.
Method: Uses three specialized agents: Grounding Agent for schema linking, Generation Agent trained via multi-turn RL policy with ReAct-style loop (Think-Act-Observe), and Validation Agent for trajectory selection. Generates multiple interaction trajectories and selects optimal one using next-token prediction.
Result: Achieves Execution Accuracy of 77.84% on BIRD dev set and 89.75% on Spider test set, demonstrating state-of-the-art performance.
Conclusion: The structured multi-agent workflow combining interactive RL for generation and generative modeling for verification proves highly effective for robust and accurate SQL generation.
Abstract: Translating natural language to SQL remains difficult for complex queries. Such queries often need environmental interaction and self-correction. To address this, we introduce MARS-SQL, a novel multi-agent framework that combines principled task decomposition and interactive reinforcement learning (RL). Our system comprises three specialized agents: a Grounding Agent for schema linking, a Generation Agent for query generation, and a Validation Agent for final selection. The core of our framework is the Generation agent, which is trained via a multi-turn RL policy. Adopting a ReAct-style Think-Act-Observe loop, the agent iteratively generates thoughts, executes SQL actions against a live database, and revises its strategy based on execution feedback, enabling dynamic, stateful reasoning and self-correction. At inference time, we generate multiple interaction trajectories to explore diverse reasoning paths. The Validation agent, then selects the optimal trajectory by modeling verification as a next-token prediction task and choosing the solution with the highest generation probability. This structured workflow pipelines specialized agents. It combines interactive RL for generation with generative modeling for verification. The approach proves highly effective for robust and accurate SQL generation. Experiments show that MARS-SQL achieves state-of-the-art Execution Accuracy of 77.84% on the BIRD dev set and 89.75% on the Spider test set. Our code is available at https://github.com/YangHaolin0526/MARS-SQL.
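A minimal sketch of a ReAct-style Think-Act-Observe loop over a live SQLite database, in the spirit of the Generation Agent described above; `llm()` is a hypothetical completion call, and the grounding, RL training, and trajectory-selection stages are not reproduced.

```python
# ReAct-style Think-Act-Observe loop for SQL generation against a live database.
import sqlite3

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def react_sql(question: str, db_path: str, max_turns: int = 4) -> str:
    conn = sqlite3.connect(db_path)
    history = f"Question: {question}\n"
    sql = ""
    for _ in range(max_turns):
        sql = llm(history + "Think step by step, then output one SQL query.")  # Think
        try:
            rows = conn.execute(sql).fetchall()                                # Act
            history += f"SQL: {sql}\nObservation: {rows[:5]}\n"                # Observe
            if rows:   # crude success check; real systems verify more carefully
                break
        except sqlite3.Error as e:
            history += f"SQL: {sql}\nObservation: error: {e}\n"  # revise next turn
    conn.close()
    return sql
```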
[44] IF-CRITIC: Towards a Fine-Grained LLM Critic for Instruction-Following Evaluation
Bosi Wen, Yilin Niu, Cunxiang Wang, Pei Ke, Xiaoying Ling, Ying Zhang, Aohan Zeng, Hongning Wang, Minlie Huang
Main category: cs.CL
TL;DR: IF-CRITIC is an LLM critic that provides efficient and reliable assessments of constraint following in instructions, outperforming existing LLM-as-a-Judge methods and enabling better instruction-following optimization with lower computational costs.
Details
Motivation: Existing evaluation models for instruction following have deficiencies including substantial costs and unreliable assessments, creating a need for more efficient and reliable assessment methods.
Method: Develop a checklist generator to decompose instructions into constraint checklists, collect high-quality critique training data through multi-stage critique filtering, and employ constraint-level preference optimization to train IF-CRITIC.
Result: IF-CRITIC beats strong LLM-as-a-Judge baselines including Deepseek-R1 and o4-mini, and enables LLMs to achieve substantial performance gains in instruction-following optimization with lower computational overhead.
Conclusion: IF-CRITIC provides scalable reward signals that significantly improve instruction-following evaluation and optimization while reducing computational costs compared to existing methods.
Abstract: Instruction following is a fundamental ability of Large Language Models (LLMs), requiring their generated outputs to follow multiple constraints imposed in input instructions. Numerous studies have attempted to enhance this ability through preference optimization or reinforcement learning based on reward signals from LLM-as-a-Judge. However, existing evaluation models for instruction following still possess many deficiencies, such as substantial costs and unreliable assessments. To this end, we propose IF-CRITIC, an LLM critic that can provide efficient and reliable assessments of constraint following in the instructions. We first develop a checklist generator to decompose instructions and generate constraint checklists. With the assistance of the checklists, we collect high-quality critique training data through a multi-stage critique filtering mechanism and employ a constraint-level preference optimization method to train IF-CRITIC. Extensive experiments demonstrate that the evaluation performance of IF-CRITIC can beat strong LLM-as-a-Judge baselines, including Deepseek-R1 and o4-mini. With the scalable reward signals provided by IF-CRITIC, LLMs can achieve substantial performance gains in instruction-following optimization under lower computational overhead compared to strong LLM critic baselines.
[45] Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning
Wenjin Liu, Haoran Luo, Xueyuan Lin, Haoming Liu, Tiesunlong Shen, Jiapu Wang, Rui Mao, Erik Cambria
Main category: cs.CL
TL;DR: Prompt-R1 is an RL framework that uses a small LLM to generate prompts for large LLMs, improving performance on complex tasks through multi-turn collaboration and dual-constrained rewards.
Details
Motivation: Users struggle to provide effective prompts for complex problems, limiting LLM performance. The paper aims to automate prompt generation to enhance LLM capabilities.
Method: End-to-end reinforcement learning framework with small LLM generating prompts and large LLM performing reasoning. Uses multi-turn interaction and dual-constrained rewards for correctness, quality, and accuracy.
Result: Significantly outperforms baseline models across multiple public datasets and tasks. The framework is plug-and-play and supports various large-scale LLMs.
Conclusion: Prompt-R1 effectively addresses the prompt engineering challenge through automated collaboration between small and large LLMs, demonstrating superior performance across diverse tasks.
Abstract: Recently, advanced large language models (LLMs) have emerged at an increasingly rapid pace. However, when faced with complex problems, most users are often unable to provide accurate and effective prompts to interact with LLMs, thus limiting the performance of LLMs. To address this challenge, we propose Prompt-R1, an end-to-end reinforcement learning framework that uses a small-scale LLM to collaborate with large-scale LLMs, replacing user interaction to solve problems better. This collaboration is cast as a multi-turn prompt interaction, where the small-scale LLM thinks and generates prompts, and the large-scale LLM performs complex reasoning. A dual-constrained reward is designed to optimize for correctness, generation quality, and reasoning accuracy. Prompt-R1 provides a plug-and-play framework that supports both inference and training with various large-scale LLMs. Experiments on multiple public datasets show that Prompt-R1 significantly outperforms baseline models across tasks. Our code is publicly available at https://github.com/QwenQKing/Prompt-R1.
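The dual-constrained reward can be pictured as a weighted sum of an answer-correctness term and a prompt-quality term, as in the small sketch below; the helper functions and weights are hypothetical, and the actual reward used for RL training is not specified in this summary.

```python
# Schematic dual-constrained reward for the prompt-writing policy.
def dual_constrained_reward(prompt: str, answer: str, reference: str,
                            is_correct, prompt_quality,
                            w_correct: float = 0.7, w_quality: float = 0.3) -> float:
    # is_correct: (answer, reference) -> bool; prompt_quality: prompt -> score in [0, 1]
    r_correct = 1.0 if is_correct(answer, reference) else 0.0
    return w_correct * r_correct + w_quality * prompt_quality(prompt)
```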
[46] OceanAI: A Conversational Platform for Accurate, Transparent, Near-Real-Time Oceanographic Insights
Bowen Chen, Jayesh Gajbhar, Gregory Dusek, Rob Redmon, Patrick Hogan, Paul Liu, DelWayne Bohnenstiehl, Dongkuan Xu, Ruoying He
Main category: cs.CL
TL;DR: OceanAI is a conversational platform that combines LLMs with real-time access to NOAA oceanographic data to provide verified, reproducible responses with data references, unlike other AI systems that generate unverified hallucinations.
Details
Motivation: To address the problem of AI systems generating unverified "hallucinations" that undermine scientific rigor, by creating a platform that grounds responses in authoritative data sources.
Method: Integrates open-source LLMs with real-time API calls to NOAA data streams, automatically identifying, parsing, and synthesizing relevant datasets into natural-language responses and visualizations.
Result: In blind comparison with three other AI chat products, only OceanAI produced NOAA-sourced values with original data references; others either declined or provided unsupported results.
Conclusion: OceanAI advances transparency, reproducibility, and trust by grounding outputs in verifiable observations, offering a scalable framework for AI-enabled decision support in ocean sciences.
Abstract: Artificial intelligence is transforming the sciences, yet general conversational AI systems often generate unverified “hallucinations” that undermine scientific rigor. We present OceanAI, a conversational platform that integrates the natural-language fluency of open-source large language models (LLMs) with real-time, parameterized access to authoritative oceanographic data streams hosted by the National Oceanic and Atmospheric Administration (NOAA). Each query, such as “What was Boston Harbor’s highest water level in 2024?”, triggers real-time API calls that identify, parse, and synthesize relevant datasets into reproducible natural-language responses and data visualizations. In a blind comparison with three widely used AI chat-interface products, only OceanAI produced NOAA-sourced values with original data references; others either declined to answer or provided unsupported results. Designed for extensibility, OceanAI connects to multiple NOAA data products and variables, supporting applications in marine hazard forecasting, ecosystem assessment, and water-quality monitoring. By grounding outputs in verifiable observations, OceanAI advances transparency, reproducibility, and trust, offering a scalable framework for AI-enabled decision support within the ocean sciences. A public demonstration is available at https://oceanai.ai4ocean.xyz.
[47] VayuChat: An LLM-Powered Conversational Interface for Air Quality Data Analytics
Vedant Acharya, Abhay Pisharodi, Rishabh Mondal, Mohammad Rafiuddin, Nipun Batra
Main category: cs.CL
TL;DR: VayuChat is a conversational AI system that enables natural language queries about air quality, meteorology, and policy programs in India, generating executable Python code and interactive visualizations to make environmental analytics accessible to non-experts.
Details
Motivation: Air pollution causes 1.6 million premature deaths annually in India, but existing tools require expertise and provide static dashboards, leaving key policy questions unresolved. Decision makers struggle to turn dispersed data into actionable insights.Method: VayuChat integrates data from CPCB monitoring stations, state-level demographics, and NCAP funding records into a unified interface powered by large language models. It answers natural language questions and responds with executable Python code and interactive visualizations.
Result: The system enables users to perform complex environmental analytics through simple conversations, making data science accessible to policymakers, researchers, and citizens. The platform is publicly deployed and includes a live demonstration.
Conclusion: VayuChat provides an accessible conversational interface that transforms dispersed environmental data into actionable insights, bridging the gap between data availability and policy decision-making for air quality management in India.
Abstract: Air pollution causes about 1.6 million premature deaths each year in India, yet decision makers struggle to turn dispersed data into decisions. Existing tools require expertise and provide static dashboards, leaving key policy questions unresolved. We present VayuChat, a conversational system that answers natural language questions on air quality, meteorology, and policy programs, and responds with both executable Python code and interactive visualizations. VayuChat integrates data from Central Pollution Control Board (CPCB) monitoring stations, state-level demographics, and National Clean Air Programme (NCAP) funding records into a unified interface powered by large language models. Our live demonstration will show how users can perform complex environmental analytics through simple conversations, making data science accessible to policymakers, researchers, and citizens. The platform is publicly deployed at https://huggingface.co/spaces/SustainabilityLabIITGN/VayuChat. For further information, see the demonstration video at https://www.youtube.com/watch?v=d6rklL05cs4.
[48] Building a Silver-Standard Dataset from NICE Guidelines for Clinical LLMs
Qing Ding, Eric Hua Qing Zhang, Felix Jozsa, Julia Ive
Main category: cs.CL
TL;DR: This study introduces a validated dataset for evaluating LLMs’ clinical reasoning based on guidelines, created using GPT with realistic patient scenarios, and benchmarks recent LLMs to demonstrate dataset validity.
Details
Motivation: Standardized benchmarks for evaluating guideline-based clinical reasoning in LLMs are currently missing, despite their increasing use in healthcare.Method: Created a validated dataset from publicly available guidelines using GPT, containing realistic patient scenarios and clinical questions, then benchmarked a range of recent popular LLMs.
Result: The framework supports systematic evaluation of LLMs’ clinical utility and guideline adherence, showcasing the validity of the dataset through benchmarking.
Conclusion: The study provides a validated framework and dataset for systematically evaluating LLMs’ clinical reasoning capabilities and adherence to medical guidelines.
Abstract: Large language models (LLMs) are increasingly used in healthcare, yet standardised benchmarks for evaluating guideline-based clinical reasoning are missing. This study introduces a validated dataset derived from publicly available guidelines across multiple diagnoses. The dataset was created with the help of GPT and contains realistic patient scenarios, as well as clinical questions. We benchmark a range of recent popular LLMs to showcase the validity of our dataset. The framework supports systematic evaluation of LLMs’ clinical utility and guideline adherence.
[49] HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models
Stephan Oepen, Nikolay Arefev, Mikko Aulamo, Marta Bañón, Maja Buljan, Laurie Burchell, Lucas Charpentier, Pinzhen Chen, Mariya Fedorova, Ona de Gibert, Barry Haddow, Jan Hajič, Jindřich Helcl, Andrey Kutuzov, Zihao Li, Risto Luukkonen, Bhavitvya Malik, Vladislav Mikhailov, Amanda Myntti, Dayyán O’Brien, Lucie Poláková, Sampo Pyysalo, Gema Ramírez Sánchez, Janine Siewert, Pavel Stepachev, Jörg Tiedemann, Teemu Vahtola, Fedor Vitiugin, Tea Vojtěchová, Jaume Zaragoza
Main category: cs.CL
TL;DR: Creation of the largest open multilingual dataset (30 trillion tokens) for LLM pre-training across 200 languages, with comprehensive processing pipeline and evaluation benchmarks.
Details
Motivation: To address the lack of large-scale, high-quality, and richly annotated multilingual datasets for LLM pre-training that are openly available to the research community.Method: Derived datasets from web crawls with complete open-source pipeline including document selection, text extraction, language identification, deduplication, annotation (register labels, quality estimates, PII), and filtering. Evaluated through quality probes, manual inspection, and end-to-end model training.
Result: Successfully created 30 trillion token dataset - likely the largest available multilingual collection. Also produced comprehensive evaluation benchmarks for 9 European languages, trained 57 monolingual encoder-decoder models, and generated large parallel text collections.
Conclusion: The initiative provides a foundational resource for multilingual NLP research with high-quality data, processing tools, and evaluation frameworks that can advance language model development across diverse languages.
Abstract: We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and accompanied by a complete, open-source pipeline for document selection from web archives, text extraction from HTML, language identification for noisy texts, exact and near-deduplication, annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for 24 languages, and through end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. Additionally, we train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.
[50] Improving Romanian LLM Pretraining Data using Diversity and Quality Filtering
Vlad Negoita, Mihai Masala, Traian Rebedea
Main category: cs.CL
TL;DR: This paper analyzes Romanian pretraining corpora characteristics compared to English data, develops a lightweight multitask model for multi-level filtering of Romanian texts, and demonstrates improved LLM pretraining performance through data curation.
Details
Motivation: Data quality is crucial for training LLMs, especially for under-represented languages like Romanian where high-quality corpora are scarce. The authors aim to understand how Romanian pretraining data differs from English and improve data curation methods.Method: Train a lightweight multitask model on LLM-annotated Romanian texts to perform multi-level filtering (educational value, topic, format) and generate high-quality pretraining datasets.
Result: Experiments reveal noteworthy trends in topic differences between Romanian and English data, and demonstrate that filtered data leads to improved LLM pretraining performance across multiple benchmarks.
Conclusion: Careful data curation and filtering significantly enhance LLM pretraining for under-represented languages, with the proposed method effectively improving model performance through multi-level data quality assessment.
Abstract: Large Language Models (LLMs) have recently exploded in popularity, often matching or outperforming human abilities on many tasks. One of the key factors in training LLMs is the availability and curation of high-quality data. Data quality is especially crucial for under-represented languages, where high-quality corpora are scarce. In this work, we study the characteristics and coverage of Romanian pretraining corpora and examine how they differ from English data. By training a lightweight multitask model on carefully LLM-annotated Romanian texts, we are able to analyze and perform multi-level filtering (e.g., educational value, topic, format) to generate high-quality pretraining datasets. Our experiments reveal noteworthy trends in the topics present in Romanian and English data, and demonstrate the effectiveness of our filtering through improved LLM pretraining performance across multiple benchmarks.
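As an illustration of what multi-level filtering over model annotations can look like, the sketch below assumes hypothetical annotation fields (educational value, topic, format) and thresholds; it is not the paper's actual configuration.

```python
# Schematic multi-level document filter over annotations produced by a lightweight
# multitask model. Field names, score ranges, and thresholds are assumptions.

from dataclasses import dataclass

@dataclass
class DocAnnotation:
    educational_value: float   # assumed predicted score, e.g. in [0, 5]
    topic: str                 # e.g. "science", "news", "forum"
    fmt: str                   # e.g. "article", "list", "boilerplate"

def keep_document(a: DocAnnotation,
                  min_edu: float = 2.5,
                  banned_formats: frozenset = frozenset({"boilerplate"})) -> bool:
    """Keep a document only if it clears the educational-value bar and has an allowed format."""
    return a.educational_value >= min_edu and a.fmt not in banned_formats
```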
[51] TSVer: A Benchmark for Fact Verification Against Time-Series Evidence
Marek Strong, Andreas Vlachos
Main category: cs.CL
TL;DR: TSVer is a new benchmark dataset for temporal and numerical reasoning fact verification with time-series evidence, containing 287 real-world claims and 400 time series, showing that even state-of-the-art models struggle with this task.
Details
Motivation: Existing fact-checking datasets lack structured evidence, sufficient justifications, or use synthetic claims, limiting evaluation of temporal and numerical reasoning systems.Method: Created TSVer dataset with 287 real claims from 38 fact-checking organizations and 400 time series using LLM-assisted multi-step annotation process with high inter-annotator agreement (kappa=0.745).
Result: State-of-the-art models like Gemini-2.5-Pro achieve only 63.37% accuracy on verdicts and 48.63 Ev2R score on verdict justifications, showing significant challenges in temporal reasoning.
Conclusion: TSVer provides a high-quality benchmark for temporal and numerical reasoning in fact verification, revealing current limitations in AI systems’ ability to reason with time-series evidence.
Abstract: Reasoning over temporal and numerical data, such as time series, is a crucial aspect of fact-checking. While many systems have recently been developed to handle this form of evidence, their evaluation remains limited by existing datasets, which often lack structured evidence, provide insufficient justifications for verdicts, or rely on synthetic claims. In this paper, we introduce TSVer, a new benchmark dataset for fact verification focusing on temporal and numerical reasoning with time-series evidence. TSVer contains 287 real-world claims sourced from 38 fact-checking organizations and a curated database of 400 time series covering diverse domains. Each claim is annotated with time frames across all pertinent time series, along with a verdict and justifications reflecting how the evidence is used to reach the verdict. Using an LLM-assisted multi-step annotation process, we improve the quality of our annotations and achieve an inter-annotator agreement of kappa=0.745 on verdicts. We also develop a baseline for verifying claims against time-series evidence and show that even state-of-the-art reasoning models like Gemini-2.5-Pro are challenged by time series, achieving only 63.37% accuracy on verdicts and an Ev2R score of 48.63 on verdict justifications.
[52] MicroRemed: Benchmarking LLMs in Microservices Remediation
Lingzhe Zhang, Yunpeng Zhai, Tong Jia, Chiming Duan, Minghua He, Leyi Pan, Zhaoyang Liu, Bolin Ding, Ying Li
Main category: cs.CL
TL;DR: MicroRemed is the first benchmark for evaluating LLMs in end-to-end microservice remediation, requiring models to generate executable Ansible playbooks from diagnosis reports. ThinkRemed, a multi-agent framework, improves performance through iterative reasoning.
Details
Motivation: Existing approaches rely on human-crafted prompts with LLMs merely converting text to code, lacking true autonomous remediation capabilities. The goal is to advance research in automatic microservice system recovery.Method: Proposed ThinkRemed - a multi-agent framework that emulates SREs’ reflective and perceptive reasoning through iterative reasoning and system reflection.
Result: MicroRemed presents substantial challenges to current LLMs, while ThinkRemed improves end-to-end remediation performance compared to existing approaches.
Conclusion: The benchmark enables evaluation of LLMs in autonomous microservice remediation, and ThinkRemed demonstrates improved performance through multi-agent reasoning frameworks.
Abstract: Large Language Models (LLMs) integrated with agent-based reasoning frameworks have recently shown strong potential for autonomous decision-making and system-level operations. One promising yet underexplored direction is microservice remediation, where the goal is to automatically recover faulty microservice systems. Existing approaches, however, still rely on human-crafted prompts from Site Reliability Engineers (SREs), with LLMs merely converting textual instructions into executable code. To advance research in this area, we introduce MicroRemed, the first benchmark for evaluating LLMs in end-to-end microservice remediation, where models must directly generate executable Ansible playbooks from diagnosis reports to restore system functionality. We further propose ThinkRemed, a multi-agent framework that emulates the reflective and perceptive reasoning of SREs. Experimental results show that MicroRemed presents substantial challenges to current LLMs, while ThinkRemed improves end-to-end remediation performance through iterative reasoning and system reflection. The benchmark is available at https://github.com/LLM4AIOps/MicroRemed.
[53] Learning When to Quit in Sales Conversations
Emaad Manzoor, Eva Ascarza, Oded Netzer
Main category: cs.CL
TL;DR: The paper develops an AI stopping agent that optimizes when salespeople should quit sales conversations, reducing time on failed calls by 54% while maintaining sales, and increasing overall sales by up to 37%.
Details
Motivation: Salespeople face dynamic screening decisions about when to persist or abandon conversations, but little is known about how these decisions are made, their efficiency, or how to improve them in high-volume outbound sales where time is scarce.Method: Formalized the dynamic screening decision as an optimal stopping problem and developed a generative language model-based sequential decision agent that learns when to quit conversations by imitating a retrospectively-inferred optimal stopping policy.
Result: The stopping agent reduced time spent on failed calls by 54% while preserving nearly all sales; reallocating saved time increased expected sales by up to 37%. Analysis showed salespeople overweight salient expressions of disinterest and mispredict call failure risk.
Conclusion: AI algorithms can correct cognitively-bounded human decisions and improve salesforce efficiency by optimizing conversational persistence decisions in real-time.
Abstract: Salespeople frequently face the dynamic screening decision of whether to persist in a conversation or abandon it to pursue the next lead. Yet, little is known about how these decisions are made, whether they are efficient, or how to improve them. We study these decisions in the context of high-volume outbound sales where leads are ample, but time is scarce and failure is common. We formalize the dynamic screening decision as an optimal stopping problem and develop a generative language model-based sequential decision agent - a stopping agent - that learns whether and when to quit conversations by imitating a retrospectively-inferred optimal stopping policy. Our approach handles high-dimensional textual states, scales to large language models, and works with both open-source and proprietary language models. When applied to calls from a large European telecommunications firm, our stopping agent reduces the time spent on failed calls by 54% while preserving nearly all sales; reallocating the time saved increases expected sales by up to 37%. Upon examining the linguistic cues that drive salespeople’s quitting decisions, we find that they tend to overweight a few salient expressions of consumer disinterest and mispredict call failure risk, suggesting cognitive bounds on their ability to make real-time conversational decisions. Our findings highlight the potential of artificial intelligence algorithms to correct cognitively-bounded human decisions and improve salesforce efficiency.
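A minimal sketch of how a retrospectively-inferred stopping label could be constructed per conversation turn, assuming a simple value-versus-opportunity-cost rule; the paper's actual policy-inference procedure and features are not reproduced here.

```python
# Hypothetical hindsight labeling for one call: quit (0) whenever the known-in-hindsight
# value of continuing is below the opportunity cost of the remaining call time.

from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    text: str            # partial transcript up to this turn
    minutes_left: float  # expected remaining duration of the call from this point

def label_optimal_stops(turns: List[Turn], sale_value: float, sold: bool,
                        value_per_minute: float) -> List[int]:
    """Return 1 = continue, 0 = quit for each turn, inferred with hindsight."""
    labels = []
    for t in turns:
        continuation_value = sale_value if sold else 0.0
        opportunity_cost = value_per_minute * t.minutes_left
        labels.append(1 if continuation_value > opportunity_cost else 0)
    return labels

# A text classifier (e.g., a fine-tuned language-model head) would then be trained to map
# each partial transcript t.text to its hindsight label, yielding the stopping agent.
```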
[54] Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs
Muhammed Saeed, Muhammad Abdul-mageed, Shady Shehata
Main category: cs.CL
TL;DR: DebateBias-8K is a multilingual benchmark revealing narrative biases in LLMs across 7 languages, showing models reproduce stereotypes despite safety alignment, with biases worsening in low-resource languages.
Details
Motivation: Current bias evaluations rely on English classification tasks, missing how narrative bias appears in realistic generative settings across different languages and cultural contexts.Method: Created DebateBias-8K with 8,400 structured debate prompts across 4 sensitive domains in 7 languages, tested 4 flagship models (GPT-4o, Claude 3, DeepSeek, LLaMA 3), generating and automatically classifying over 100,000 responses.
Result: All models reproduced entrenched stereotypes: Arabs linked to terrorism/religion (≥95%), Africans to socioeconomic backwardness (up to 77%), Western groups framed as modern/progressive. Biases grew sharply in lower-resource languages.
Conclusion: Current alignment methods trained primarily in English don’t generalize globally, reducing explicit toxicity but failing to prevent biased outputs in open-ended contexts, highlighting persistent multilingual fairness divide.
Abstract: Large language models (LLMs) are widely deployed for open-ended communication, yet most bias evaluations still rely on English, classification-style tasks. We introduce DebateBias-8K, a new multilingual, debate-style benchmark designed to reveal how narrative bias appears in realistic generative settings. Our dataset includes 8,400 structured debate prompts spanning four sensitive domains: women’s rights, socioeconomic development, terrorism, and religion, across seven languages ranging from high-resource (English, Chinese) to low-resource (Swahili, Nigerian Pidgin). Using four flagship models (GPT-4o, Claude 3, DeepSeek, and LLaMA 3), we generate and automatically classify over 100,000 responses. Results show that all models reproduce entrenched stereotypes despite safety alignment: Arabs are overwhelmingly linked to terrorism and religion (≥95%), Africans to socioeconomic “backwardness” (up to 77%), and Western groups are consistently framed as modern or progressive. Biases grow sharply in lower-resource languages, revealing that alignment trained primarily in English does not generalize globally. Our findings highlight a persistent divide in multilingual fairness: current alignment methods reduce explicit toxicity but fail to prevent biased outputs in open-ended contexts. We release our DebateBias-8K benchmark and analysis framework to support the next generation of multilingual bias evaluation and safer, culturally inclusive model alignment.
[55] ZoFia: Zero-Shot Fake News Detection with Entity-Guided Retrieval and Multi-LLM Interaction
Lvhua Wu, Xuefeng Jiang, Sheng Sun, Tian Wen, Yuwei Wang, Min Liu
Main category: cs.CL
TL;DR: ZoFia is a two-stage zero-shot fake news detection framework that uses hierarchical salience scoring and multi-LLM collaborative analysis to overcome limitations of static models and LLM hallucinations in detecting fast-evolving fake news.
Details
Motivation: Existing fake news detection methods struggle with time-bounded knowledge coverage, LLM hallucinations, and lack of generalization for emerging news topics, making them unreliable for fast-evolving news streams.Method: Two-stage framework: 1) Hierarchical Salience scoring with SC-MMR algorithm to select informative keywords for retrieving up-to-date evidence, 2) Multi-LLM interactive system with different agent roles performing collaborative analysis and adversarial debate over news content.
Result: Comprehensive experiments on two public datasets show ZoFia significantly outperforms existing zero-shot baselines and most few-shot methods.
Conclusion: ZoFia provides an effective solution for zero-shot fake news detection that handles fast-evolving news streams through external evidence retrieval and multi-agent collaborative analysis, with plans to open-source the code.
Abstract: The rapid spread of fake news threatens social stability and public trust, rendering its detection an imperative research priority. Although large language models (LLMs) excel at numerous natural language processing tasks with their remarkable contextual understanding and extensive prior knowledge, their time-bounded knowledge coverage and tendency to generate hallucinated content reduce their reliability when handling fast-evolving news streams. Furthermore, models trained on existing static datasets also often lack the generalization needed for emerging news topics. To address these challenges, we propose ZoFia, a novel two-stage zero-shot fake news detection framework. First, we introduce Hierarchical Salience to quantify the importance of entities in the news content, and propose the SC-MMR algorithm to effectively select an informative and diverse set of keywords that serve as queries for retrieving up-to-date external evidence. Subsequently, a multi-LLM interactive system, in which each agent assumes a distinct role, performs multi-view collaborative analysis and adversarial debate over the news text and its related information, and finally produces an interpretable and robust judgment. Comprehensive experiments on two public datasets demonstrate that ZoFia clearly outperforms existing zero-shot baselines and most few-shot methods. Our code will be open-sourced to facilitate related research communities.
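Since the abstract does not spell out SC-MMR, the sketch below shows a generic maximal-marginal-relevance selection over salience-scored keywords of the kind such a step builds on; the salience scores, similarity function, and trade-off weight are assumptions, not the paper's algorithm.

```python
# Generic MMR-style keyword selection: trade off salience (relevance) against redundancy.

from typing import Callable, Dict, List

def mmr_select(salience: Dict[str, float],
               sim: Callable[[str, str], float],
               k: int, lam: float = 0.7) -> List[str]:
    """Greedily pick k keywords that are salient yet dissimilar to those already chosen."""
    selected: List[str] = []
    candidates = set(salience)
    while candidates and len(selected) < k:
        def score(w: str) -> float:
            redundancy = max((sim(w, s) for s in selected), default=0.0)
            return lam * salience[w] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```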
[56] Self-Harmony: Learning to Harmonize Self-Supervision and Self-Play in Test-Time Reinforcement Learning
Ru Wang, Wei Huang, Qi Cao, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo
Main category: cs.CL
TL;DR: Self-Harmony is a test-time reinforcement learning framework that uses a single model as both Solver and Reframer to generate stable answers across original and paraphrased questions, employing harmonic mean aggregation to avoid spurious solutions.
Details
Motivation: To address the problem of standard test-time RL approaches collapsing to spurious popular answers through majority voting, by leveraging the intuition that correct answers should remain stable across question paraphrases.Method: Uses a single model in dual roles: Solver produces answers and Reframer rephrases inputs. Aggregates answer frequencies across original and reframed views using harmonic mean instead of majority voting, selecting solutions stable under reframing.
Result: Achieves state-of-the-art results in label-free test-time setting, ranking first in 28 of 30 settings across multiple reasoning benchmarks. Shows unprecedented robustness with zero training failures in all experiments.
Conclusion: Self-Harmony provides a stable and reliable test-time adaptation method that avoids spurious answers without requiring human supervision or auxiliary models, demonstrating strong performance and robustness across diverse reasoning tasks.
Abstract: Test-time reinforcement learning (TTRL) offers a label-free paradigm for adapting models using only synthetic signals at inference, but its success hinges on constructing reliable learning signals. Standard approaches such as majority voting often collapse to spurious yet popular answers. We introduce Self-Harmony, a framework built on a simple intuition: the correct answer should remain stable across both an original question and its paraphrase. Self-Harmony operationalizes this by employing a single model in two complementary roles: a Solver to produce answers and a Reframer to rephrase the input. Based on this, we further propose a pseudo-label method: instead of majority voting, it aggregates answer frequencies across these original and reframed views using the harmonic mean. This process naturally selects for solutions stable under reframing, thereby avoiding the common trap of favoring view-dependent, spurious answers. Crucially, this requires no human supervision or auxiliary models. Across diverse reasoning benchmarks, Self-Harmony achieves state-of-the-art results in the label-free test-time setting, ranking first in 28 of 30 settings across multiple methods. Beyond accuracy, it demonstrates unprecedented robustness, with zero training failures in all experiments, underscoring its stability and reliability.
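A minimal sketch of the harmonic-mean pseudo-labeling step, using answer frequencies from the original and reframed views; the example values below are illustrative only.

```python
# Harmonic-mean aggregation of answer frequencies across two views of the same question:
# only answers that are stable under both the original phrasing and its paraphrase score well.

from collections import Counter
from typing import List

def harmonic_pseudo_label(orig_answers: List[str], reframed_answers: List[str]) -> str:
    f_orig = Counter(orig_answers)
    f_refr = Counter(reframed_answers)

    def hmean(a: str) -> float:
        p = f_orig[a] / max(len(orig_answers), 1)      # frequency under the original question
        q = f_refr[a] / max(len(reframed_answers), 1)  # frequency under the paraphrase
        return 2 * p * q / (p + q) if p + q > 0 else 0.0

    return max(set(orig_answers) | set(reframed_answers), key=hmean)

# An answer popular only under one phrasing scores 0; "42" appears under both views and wins.
print(harmonic_pseudo_label(["42", "42", "7"], ["42", "42", "13"]))  # -> "42"
```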
[57] DEER: Disentangled Mixture of Experts with Instance-Adaptive Routing for Generalizable Machine-Generated Text Detection
Guoxin Ma, Xiaoming Liu, Zhanhan Zhang, Chengzhengxu Li, Shengchao Liu, Yu Lan
Main category: cs.CL
TL;DR: A novel DEER framework for machine-generated text detection that uses disentangled mixture-of-experts with reinforcement learning routing to handle domain shift, achieving state-of-the-art performance on both in-domain and out-of-domain datasets.
Details
Motivation: Current machine-generated text detection methods suffer from significant performance degradation under domain shift, making them unreliable for real-world applications where domain labels are often unavailable during inference.Method: Two-stage DEER architecture: 1) Disentangled mixture-of-experts with domain-specific experts for fine-grained distinctions and shared experts for cross-domain features; 2) Reinforcement learning-based routing mechanism that dynamically selects experts without requiring domain labels during inference.
Result: DEER outperforms state-of-the-art methods with average F1-score improvements of 1.39% (in-domain) and 5.32% (out-of-domain), and accuracy gains of 1.35% (in-domain) and 3.61% (out-of-domain) across five in-domain and five out-of-domain benchmark datasets.
Conclusion: The disentangled expert specialization and adaptive routing mechanism are critical for robust machine-generated text detection under domain shift, effectively bridging the train-inference gap caused by domain uncertainty.
Abstract: Detecting machine-generated text (MGT) has emerged as a critical challenge, driven by the rapid advancement of large language models (LLMs) capable of producing highly realistic, human-like content. However, the performance of current approaches often degrades significantly under domain shift. To address this challenge, we propose a novel framework designed to capture both domain-specific and domain-general MGT patterns through a two-stage Disentangled mixturE-of-ExpeRts (DEER) architecture. First, we introduce a disentangled mixture-of-experts module, in which domain-specific experts learn fine-grained, domain-local distinctions between human and machine-generated text, while shared experts extract transferable, cross-domain features. Second, to mitigate the practical limitation of unavailable domain labels during inference, we design a reinforcement learning-based routing mechanism that dynamically selects the appropriate experts for each input instance, effectively bridging the train-inference gap caused by domain uncertainty. Extensive experiments on five in-domain and five out-of-domain benchmark datasets demonstrate that DEER consistently outperforms state-of-the-art methods, achieving average F1-score improvements of 1.39% and 5.32% on in-domain and out-of-domain datasets respectively, along with accuracy gains of 1.35% and 3.61% respectively. Ablation studies confirm the critical contributions of both disentangled expert specialization and adaptive routing to model performance.
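The following PyTorch sketch shows the general shape of a disentangled mixture-of-experts layer with shared and domain-specific experts and an instance-level router; the paper trains its router with reinforcement learning, whereas the plain softmax gate here is a simplifying assumption.

```python
# Schematic disentangled MoE layer: shared experts always contribute, while an
# instance-adaptive gate weights domain-specific experts without a domain label.

import torch
import torch.nn as nn

class DisentangledMoE(nn.Module):
    def __init__(self, dim: int, n_domain: int = 5, n_shared: int = 2):
        super().__init__()
        self.domain_experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_domain)])
        self.shared_experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_shared)])
        self.router = nn.Linear(dim, n_domain)  # instance-level gate over domain experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, dim)
        gate = torch.softmax(self.router(x), dim=-1)              # (batch, n_domain)
        domain_out = torch.stack([e(x) for e in self.domain_experts], dim=1)
        routed = (gate.unsqueeze(-1) * domain_out).sum(dim=1)     # weighted domain-local features
        shared = sum(e(x) for e in self.shared_experts) / len(self.shared_experts)
        return routed + shared                                    # fuse domain-local and cross-domain features
```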
[58] AraFinNews: Arabic Financial Summarisation with Domain-Adapted LLMs
Mo El-Haj, Paul Rayson
Main category: cs.CL
TL;DR: This paper introduces AraFinNews, the largest Arabic financial news dataset, and shows that domain-adapted LLMs generate more accurate and coherent summaries for Arabic financial texts.
Details
Motivation: To address the lack of domain-specific resources for Arabic financial text summarization and evaluate how financial-domain pretraining improves factual accuracy and numerical reliability.Method: Created AraFinNews dataset with 212,500 article-headline pairs, then evaluated transformer models (mT5, AraT5, FinAraT5) on factual accuracy, numerical reliability, and stylistic alignment.
Result: Domain-adapted models (FinAraT5) generated more faithful and coherent summaries, especially for quantitative and entity-centric information.
Conclusion: Domain-specific adaptation is crucial for improving factual consistency and narrative fluency in Arabic financial summarization.
Abstract: This paper investigates the impact of domain specificity on abstractive summarisation of Arabic financial texts using large language models (LLMs). We introduce AraFinNews, the largest publicly available Arabic financial news dataset to date, comprising 212,500 article–headline pairs spanning nearly a decade of reporting from October 2015 to July 2025. Designed as the Arabic equivalent of major English summarisation corpora such as CNN/DailyMail, AraFinNews provides a robust benchmark for evaluating domain-specific language understanding and generation in financial contexts. Using this resource, we evaluate transformer-based models – including mT5, AraT5, and the domain-adapted FinAraT5 – to examine how financial-domain pretraining influences factual accuracy, numerical reliability, and stylistic alignment with professional reporting. Experimental results show that domain-adapted models generate more faithful and coherent summaries, particularly in handling quantitative and entity-centric information. The findings highlight the importance of domain-specific adaptation for improving factual consistency and narrative fluency in Arabic financial summarisation. The dataset is freely available for non-commercial research at https://github.com/ArabicNLP-UK/AraFinNews.
[59] When, What, and How: Rethinking Retrieval-Enhanced Speculative Decoding
Min Fang, Zhihui Fu, Qibin Zhao, Jun Wang
Main category: cs.CL
TL;DR: ReSpec is a novel retrieval-enhanced speculative decoding framework that uses adaptive decision-making to optimize draft selection and verification, achieving significant speedup over existing methods while maintaining output quality.
Details
Motivation: Current speculative decoding methods face limitations: model-based approaches are accurate but costly, while retrieval-enhanced methods use heuristic switching that triggers unnecessary retrievals, reducing efficiency.Method: ReSpec introduces three innovations: 1) entropy-guided adaptive trigger to initiate retrieval only when uncertainty is low, 2) feedback-driven candidate selection to organize multiple high-quality candidates, and 3) source-aware relaxed verification strategy with strict checks for model drafts and relaxed verification for retrieved drafts.
Result: Extensive experiments on Spec-Bench show ReSpec achieves state-of-the-art acceleration, outperforming EAGLE-2 by over 33% and SAM-Decoding by over 25%, while maintaining output quality.
Conclusion: ReSpec effectively addresses the limitations of existing speculative decoding methods by transforming heuristic switching into adaptive decision-making, achieving superior acceleration without compromising quality.
Abstract: Speculative decoding (SD) has emerged as an effective technique to accelerate large language model (LLM) inference without compromising output quality. However, the achievable speedup largely depends on the effectiveness of the drafting model. While model-based methods like EAGLE-2 are accurate but costly, retrieval-enhanced methods like SAM-Decoding rely on heuristic switching strategies that often trigger unnecessary retrievals. To address this, we propose ReSpec (Retrieval-enhanced Speculative Decoding), a novel framework that transforms heuristic drafter switching into adaptive decision-making. ReSpec features three core innovations: 1) An entropy-guided adaptive trigger quantifies contextual predictability to initiate retrieval only when uncertainty is low, avoiding costly low-quality speculations. 2) A feedback-driven candidate selection leverages historical feedback to organize multiple high-quality candidates for parallel verification, maximizing retrieval utility. 3) A source-aware relaxed verification strategy applies strict checks to model-generated drafts while using relaxed verification for retrieved drafts, achieving a better balance between accuracy and efficiency. Extensive experiments on Spec-Bench demonstrate that ReSpec achieves state-of-the-art acceleration, outperforming EAGLE-2 and SAM-Decoding by over 33% and 25%, respectively, while maintaining output quality.
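A minimal sketch of the entropy-guided trigger idea: retrieval-based drafting is attempted only when the target model's next-token distribution is low-entropy (i.e., the context is predictable). The threshold value is an illustrative assumption.

```python
# Entropy-guided retrieval trigger: compute the entropy of the next-token distribution
# and allow a retrieval draft only when uncertainty is low.

import torch

def should_retrieve(next_token_logits: torch.Tensor, threshold: float = 2.0) -> bool:
    """next_token_logits: (vocab,) logits from the target model at the current decoding step."""
    probs = torch.softmax(next_token_logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum()
    return entropy.item() < threshold  # low entropy -> a retrieved draft is worth verifying
```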
[60] “Give a Positive Review Only”: An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI Reviewers
Qin Zhou, Zhexin Zhang, Zhi Li, Limin Sun
Main category: cs.CL
TL;DR: This paper investigates prompt injection attacks on AI-assisted peer review systems, showing how hidden prompts can manipulate AI reviewers into giving favorable evaluations, and proposes detection-based defenses.
Details
Motivation: With AI models increasingly used for scientific paper review, emerging threats involve hidden prompts that manipulate AI reviewers into providing overly favorable evaluations, requiring systematic investigation.Method: Proposes two attack classes: (1) static attack using fixed injection prompts, and (2) iterative attack optimizing prompts against simulated reviewer models. Also explores detection-based defense mechanisms.
Result: Both attacks achieve striking performance, frequently inducing full evaluation scores from frontier AI reviewers. Attacks are robust across settings. Detection defense reduces success rate but adaptive attackers can partially circumvent it.
Conclusion: Findings highlight the need for greater attention and rigorous safeguards against prompt-injection threats in AI-assisted peer review systems.
Abstract: With the rapid advancement of AI models, their deployment across diverse tasks has become increasingly widespread. A notable emerging application is leveraging AI models to assist in reviewing scientific papers. However, recent reports have revealed that some papers contain hidden, injected prompts designed to manipulate AI reviewers into providing overly favorable evaluations. In this work, we present an early systematic investigation into this emerging threat. We propose two classes of attacks: (1) static attack, which employs a fixed injection prompt, and (2) iterative attack, which optimizes the injection prompt against a simulated reviewer model to maximize its effectiveness. Both attacks achieve striking performance, frequently inducing full evaluation scores when targeting frontier AI reviewers. Furthermore, we show that these attacks are robust across various settings. To counter this threat, we explore a simple detection-based defense. While it substantially reduces the attack success rate, we demonstrate that an adaptive attacker can partially circumvent this defense. Our findings underscore the need for greater attention and rigorous safeguards against prompt-injection threats in AI-assisted peer review.
[61] FirstAidQA: A Synthetic Dataset for First Aid and Emergency Response in Low-Connectivity Settings
Saiyma Sittul Muna, Rezwan Islam Salvi, Mushfiqur Rahman Mushfique, Ajwad Abrar
Main category: cs.CL
TL;DR: Created FirstAidQA, a synthetic dataset of 5,500 high-quality first aid question-answer pairs using ChatGPT-4o-mini and human validation, designed for training lightweight AI models in emergency response applications.
Details
Motivation: Address the lack of high-quality datasets for first aid and emergency response, enabling deployment of LLMs in time-sensitive, low-connectivity environments where current models are too computationally intensive for low-tier devices.Method: Generated dataset using ChatGPT-4o-mini with prompt-based in-context learning from the Vital First Aid Book (2019), followed by preprocessing (text cleaning, contextual chunking, filtering) and human validation for accuracy and safety.
Result: Successfully created FirstAidQA containing 5,500 validated QA pairs covering diverse first aid scenarios, publicly released on Hugging Face to support instruction-tuning of LLMs and SLMs for emergency applications.
Conclusion: FirstAidQA enables development of faster, more reliable, offline-capable AI systems for emergency settings, advancing research on safety-critical and resource-constrained AI applications in first aid and emergency response.
Abstract: In emergency situations, every second counts. The deployment of Large Language Models (LLMs) in time-sensitive, low or zero-connectivity environments remains limited. Current models are computationally intensive and unsuitable for low-tier devices often used by first responders or civilians. A major barrier to developing lightweight, domain-specific solutions is the lack of high-quality datasets tailored to first aid and emergency response. To address this gap, we introduce FirstAidQA, a synthetic dataset containing 5,500 high-quality question answer pairs that encompass a wide range of first aid and emergency response scenarios. The dataset was generated using a Large Language Model, ChatGPT-4o-mini, with prompt-based in-context learning, using texts from the Vital First Aid Book (2019). We applied preprocessing steps such as text cleaning, contextual chunking, and filtering, followed by human validation to ensure accuracy, safety, and practical relevance of the QA pairs. FirstAidQA is designed to support instruction-tuning and fine-tuning of LLMs and Small Language Models (SLMs), enabling faster, more reliable, and offline-capable systems for emergency settings. We publicly release the dataset to advance research on safety-critical and resource-constrained AI applications in first aid and emergency response. The dataset is available on Hugging Face at https://huggingface.co/datasets/i-am-mushfiq/FirstAidQA.
[62] DeepSpecs: Expert-Level Questions Answering in 5G
Aman Ganapathy Manvattira, Yifei Xu, Ziyue Dang, Songwu Lu
Main category: cs.CL
TL;DR: DeepSpecs is a RAG system enhanced with structural and temporal reasoning for 5G specifications, using three metadata-rich databases to resolve cross-references and trace specification evolution, outperforming existing methods.
Details
Motivation: Existing RAG frameworks cannot reliably resolve cross-references or reason about specification evolution in 5G standards, which are complex and constantly evolving across thousands of pages.Method: Uses three databases (SpecDB, ChangeDB, TDocDB) with structural and temporal reasoning, explicitly resolves cross-references through recursive metadata lookup, and traces evolution by mining changes linked to Change Requests.
Result: Outperforms base models and state-of-the-art telecom RAG systems across multiple LLM backends; ablations confirm explicit cross-reference resolution and evolution-aware retrieval substantially improve answer quality.
Conclusion: Modeling structural and temporal properties of 5G standards through metadata-rich databases and explicit reasoning significantly enhances RAG system performance for expert-level QA.
Abstract: 5G technology enables mobile Internet access for billions of users. Answering expert-level questions about 5G specifications requires navigating thousands of pages of cross-referenced standards that evolve across releases. Existing retrieval-augmented generation (RAG) frameworks, including telecom-specific approaches, rely on semantic similarity and cannot reliably resolve cross-references or reason about specification evolution. We present DeepSpecs, a RAG system enhanced by structural and temporal reasoning via three metadata-rich databases: SpecDB (clause-aligned specification text), ChangeDB (line-level version diffs), and TDocDB (standardization meeting documents). DeepSpecs explicitly resolves cross-references by recursively retrieving referenced clauses through metadata lookup, and traces specification evolution by mining changes and linking them to Change Requests that document design rationale. We curate two 5G QA datasets: 573 expert-annotated real-world questions from practitioner forums and educational resources, and 350 evolution-focused questions derived from approved Change Requests. Across multiple LLM backends, DeepSpecs outperforms base models and state-of-the-art telecom RAG systems; ablations confirm that explicit cross-reference resolution and evolution-aware retrieval substantially improve answer quality, underscoring the value of modeling the structural and temporal properties of 5G standards.
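A sketch of recursive cross-reference resolution over a clause-aligned specification database; the SpecDB interface (a `get_clause` method returning clause text plus the clause IDs it references) is a hypothetical schema inferred from the description, not the system's actual API.

```python
# Recursive metadata lookup: fetch a clause and every clause it (transitively) references,
# up to a depth limit, avoiding cycles.

from typing import Dict, Optional, Set

def resolve_clause(spec_db, clause_id: str, max_depth: int = 2,
                   seen: Optional[Set[str]] = None) -> Dict[str, str]:
    """Return {clause_id: clause_text} for the clause and its referenced clauses."""
    seen = seen if seen is not None else set()
    if clause_id in seen or max_depth < 0:
        return {}
    seen.add(clause_id)
    text, referenced_ids = spec_db.get_clause(clause_id)   # hypothetical metadata lookup
    resolved = {clause_id: text}
    for ref in referenced_ids:
        resolved.update(resolve_clause(spec_db, ref, max_depth - 1, seen))
    return resolved
```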
[63] DEEPAMBIGQA: Ambiguous Multi-hop Questions for Benchmarking LLM Answer Completeness
Jiabao Ji, Min Li, Priyanshu Kumar, Shiyu Chang, Saloni Potdar
Main category: cs.CL
TL;DR: DeepAmbigQA dataset addresses LLMs’ limitations in handling complex questions with name ambiguity and multi-step reasoning, showing even state-of-the-art models struggle with answer completeness.
Details
Motivation: LLMs with search tools struggle with complex questions requiring name ambiguity resolution and multi-hop reasoning across large evidence sets, which existing benchmarks don't adequately evaluate.Method: Developed DeepAmbigQAGen pipeline to automatically generate QA tasks grounded in text corpora and knowledge graphs, creating natural questions with systematic name ambiguity and multi-step reasoning challenges.
Result: Created DeepAmbigQA dataset with 3,600 questions (half with explicit name ambiguity). GPT-5 achieved only 0.13 exact match on ambiguous questions and 0.21 on non-ambiguous questions, showing significant performance gaps.
Conclusion: Current QA systems need substantial improvement for robust information gathering and answer completeness, particularly for questions involving name ambiguity and complex reasoning.
Abstract: Large language models (LLMs) with integrated search tools show strong promise in open-domain question answering (QA), yet they often struggle to produce a complete answer set for complex questions such as “Which actor from the film Heat won at least one Academy Award?”, which require (1) distinguishing between multiple films sharing the same title and (2) reasoning across a large set of actors to gather and integrate evidence. Existing QA benchmarks rarely evaluate both challenges jointly. To address this, we introduce DeepAmbigQAGen, an automatic data generation pipeline that constructs QA tasks grounded in text corpora and a linked knowledge graph, generating natural and verifiable questions that systematically embed name ambiguity and multi-step reasoning. Based on this, we build DeepAmbigQA, a dataset of 3,600 questions requiring multi-hop reasoning, half of which require explicit name-ambiguity resolution. Experiments reveal that even the state-of-the-art GPT-5 produces incomplete answers, achieving only 0.13 exact match on ambiguous questions and 0.21 on non-ambiguous questions. These findings highlight the need for more robust QA systems aimed at information gathering and answer completeness.
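For clarity, answer-set exact match of the kind reported here can be sketched as follows; the normalization step is an assumption.

```python
# Answer-set exact match: a prediction scores 1 only if the predicted set of answers
# equals the gold set exactly (after light normalization), so partial lists score 0.

from typing import Iterable

def answer_set_exact_match(predicted: Iterable[str], gold: Iterable[str]) -> int:
    norm = lambda s: s.strip().lower()
    return int({norm(a) for a in predicted} == {norm(a) for a in gold})

# A model that lists only one of two correct answers still scores 0,
# which is why answer completeness is the hard part of this benchmark.
print(answer_set_exact_match(["answer A"], ["answer A", "answer B"]))  # -> 0
```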
[64] Thinking with DistilQwen: A Tale of Four Distilled Reasoning and Reward Model Series
Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang
Main category: cs.CL
TL;DR: The paper extends the DistilQwen model family with four specialized series: slow-thinking models for high-accuracy reasoning, adaptive-thinking models for dynamic strategy adjustment, and distilled reward models for reinforcement learning, all optimized for industrial deployment.
Details
Motivation: To meet industrial demand for small and efficient reasoning models that balance performance and inference speed for real-world applications.Method: Developed four model series through knowledge distillation from Qwen models: slow-thinking models for accuracy, adaptive-thinking models for dynamic efficiency, and distilled reward models for reinforcement learning.
Result: Comprehensive evaluations show high inference efficiency and strong reasoning performance across multiple benchmarks, with practical utility demonstrated for distilled reward models.
Conclusion: The distilled models successfully provide scalable training and inference capabilities on Alibaba Cloud PAI platform, supporting industry practitioners with efficient reasoning solutions.
Abstract: Recently, the demand for small and efficient reasoning models to support real-world applications has driven the development of knowledge distillation techniques that balance reasoning performance and inference speed. In this paper, we further extend the DistilQwen model family, initialized from the Qwen models, by introducing four model series specifically designed to meet industrial requirements. The distilled model collection comprises: (1) slow-thinking models, optimized for reasoning tasks that require high accuracy; (2) two series of adaptive-thinking models, which dynamically adjust reasoning strategies based on input tasks to maximize efficiency across diverse scenarios; and (3) distilled reward models, which enable further reinforcement learning of reasoning models using distilled knowledge. Comprehensive evaluations across multiple benchmarks demonstrate both high inference efficiency and strong reasoning performance for these models, as well as the practical utility of distilled reward models. We further show that these models support industry practitioners by providing scalable training and inference functionalities on the Alibaba Cloud PAI (Platform for Artificial Intelligence) platform.
[65] PrefixNLI: Detecting Factual Inconsistencies as Soon as They Arise
Sapir Harary, Eran Hirsch, Aviv Slobodkin, David Wan, Mohit Bansal, Ido Dagan
Main category: cs.CL
TL;DR: The paper introduces MiniTruePrefixes, a specialized model for detecting factual inconsistencies in text prefixes during autoregressive generation, which improves factual consistency in LLM outputs when integrated into controlled decoding frameworks.
Details
Motivation: Current NLI models are trained on complete sentences but autoregressive generation makes decisions at the prefix level, creating a mismatch that limits their effectiveness for improving factual consistency during text generation.Method: Generalized entailment detection to text prefixes, created evaluation datasets, trained MiniTruePrefixes model, and integrated it into controlled decoding framework for abstractive summarization.
Result: MiniTruePrefixes outperforms baseline NLI models by 5-14 F1 points in prefix-level entailment. When integrated, LLaMA-3.2-3B-Instruct matches faithfulness and runtime of 8B model while using half the memory.
Conclusion: Specialized prefix-level entailment models like MiniTruePrefixes can significantly improve factual consistency in autoregressive generation and enable smaller models to achieve similar performance to larger ones with better efficiency.
Abstract: Natural Language Inference (NLI) models have been used in various ways to improve the factuality of LLM outputs. This is typically done by applying an NLI model to judge whether the model output is entailed from the supposed evidence, triggering some corrective actions, such as beam reranking at inference time or RL rewards during training. While NLI models are trained to detect factual inconsistencies over complete sentences, decisions in the common autoregressive generation architecture are made for each evolving text prefix, during decoding. Addressing this setting, we generalize the entailment detection task to apply over arbitrary text prefixes, and suggest its utility for improving generation faithfulness. Providing suitable evaluation and training datasets for this task, we train MiniTruePrefixes, a novel specialized model that better detects factual inconsistencies over text prefixes, outperforming comparable baseline NLI models by 5-14 F1 points in prefix-level entailment. We further demonstrate that integrating MiniTruePrefixes into a controlled decoding framework substantially improves factual consistency in abstractive summarization. When guided by MiniTruePrefixes, LLaMA-3.2-3B-Instruct matches the faithfulness and runtime of the 8B model from the same model family, while using only half the memory.
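A schematic of how a prefix-level entailment scorer can guide decoding via beam reranking; the scorer interface and mixing weight are assumptions rather than the paper's exact controlled-decoding setup.

```python
# Beam reranking with a faithfulness bonus: order candidate prefixes by their LM score
# plus an entailment score of the prefix against the evidence.

from typing import Callable, List, Tuple

def rerank_beams(evidence: str,
                 beams: List[Tuple[str, float]],          # (prefix text, LM log-probability)
                 prefix_entailment: Callable[[str, str], float],  # stand-in for a prefix NLI model
                 alpha: float = 1.0) -> List[Tuple[str, float]]:
    """Return beams sorted by combined fluency-plus-faithfulness score."""
    def combined(beam: Tuple[str, float]) -> float:
        prefix, lm_logprob = beam
        return lm_logprob + alpha * prefix_entailment(evidence, prefix)
    return sorted(beams, key=combined, reverse=True)
```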
[66] Safer in Translation? Presupposition Robustness in Indic Languages
Aadi Palnitkar, Arjun Suresh, Rishi Rajesh, Puneet Puli
Main category: cs.CL
TL;DR: Created Cancer-Myth-Indic, a multilingual benchmark for evaluating LLMs on cancer-related medical advice in 5 Indic languages, addressing the gap in non-English medical evaluation.
Details
Motivation: Address the gap in multilingual LLM evaluation for healthcare advice, as existing medical benchmarks are almost universally in English despite increasing use of LLMs for medical consultation across different languages.Method: Translated a 500-item subset of Cancer-Myth benchmark into 5 Indic languages (2,500 items total), using native-speaker translators following a style guide to preserve implicit presuppositions. Items contain false presuppositions about cancer.
Result: Evaluated several popular LLMs under presupposition stress using the translated benchmark.
Conclusion: The work helps address the multilingual evaluation gap in LLM healthcare responses by providing a specialized benchmark for Indic languages.
Abstract: Increasingly, people are turning to large language models (LLMs) for healthcare advice and consultation, making it important to gauge the efficacy and accuracy of LLM responses to such queries. While pre-existing medical benchmarks seek to accomplish this very task, they are almost universally in English, leaving a notable gap in the literature on multilingual LLM evaluation. In this work, we seek to help address this gap with Cancer-Myth-Indic, an Indic-language benchmark built by translating a 500-item subset of Cancer-Myth, sampled evenly across its original categories, into five under-served but widely used languages from the subcontinent (500 per language; 2,500 translated items total). Native-speaker translators followed a style guide for preserving implicit presuppositions in translation; items feature false presuppositions relating to cancer. We evaluate several popular LLMs under this presupposition stress.
[67] The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation
İbrahim Ethem Deveci, Duygu Ataman
Main category: cs.CL
TL;DR: This paper questions whether surpassing benchmarks truly demonstrates reasoning ability or just tracks numbers divorced from claimed capabilities, analyzing performance trends of OpenAI, Anthropic, and Google models across reasoning benchmarks over time.
Details
Motivation: The rapid saturation of benchmarks due to model scaling and training advances creates continuous need for new benchmarks, raising questions about whether benchmark performance truly measures reasoning capabilities or just tracks numbers.Method: Investigation of three model families (OpenAI, Anthropic, Google) analyzing how their reasoning capabilities evolve across different benchmarks over years, examining performance trends across reasoning tasks.
Result: Analysis reveals trends in model performance across reasoning benchmarks over time, highlighting the current situation of benchmarking and remaining challenges in evaluating true reasoning capabilities.
Conclusion: The paper provides a comprehensive overview of benchmarks and reasoning tasks to serve as a reference for future research in reasoning evaluation and model development, addressing the gap between benchmark performance and actual reasoning capabilities.
Abstract: The rapid rise of Large Language Models (LLMs) and Large Reasoning Models (LRMs) has been accompanied by an equally rapid increase in the benchmarks used to assess them. However, because model competence keeps improving through scaling and novel training advances, and because many of these datasets have likely been included in pre- or post-training data, results become saturated, driving a continuous need for new and more challenging replacements. In this paper, we discuss whether surpassing a benchmark truly demonstrates reasoning ability, or whether we are simply tracking numbers divorced from the capabilities we claim to measure. We present an investigation focused on three model families, OpenAI, Anthropic, and Google, and how their reasoning capabilities across different benchmarks evolve over the years. We also analyze performance trends over the years across different reasoning tasks and discuss the current state of benchmarking and its remaining challenges. By offering a comprehensive overview of benchmarks and reasoning tasks, our work aims to serve as a first reference to ground future research in reasoning evaluation and model development.
[68] Confounding Factors in Relating Model Performance to Morphology
Wessel Poelman, Thomas Bauwens, Miryam de Lhoneux
Main category: cs.CL
TL;DR: This paper examines how language characteristics affect tokenization and language modeling, particularly focusing on morphological complexity. It critiques previous conflicting evidence due to confounding factors and re-assesses hypotheses about why agglutinative languages show higher perplexities than fusional languages in modeling.
Details
Motivation: To resolve conflicting evidence about whether morphological differences affect language modeling performance, and to identify confounding factors in previous experimental setups that make comparisons difficult.Method: Identifies confounding factors in existing analyses, re-assesses three hypotheses about morphological alignment of tokenization, tokenization efficiency, and dataset size, and introduces token bigram metrics as intrinsic predictors of language modeling difficulty.
Result: Shows that previous conclusions about morphology and language modeling include confounding factors. Token bigram metrics serve as gradient proxies for morphological complexity that don’t require expert annotation.
Conclusion: Outlines necessary conditions to reliably determine whether and how morphology relates to language modeling, emphasizing the need to control for confounding factors in experimental design.
Abstract: The extent to which individual language characteristics influence tokenization and language modeling is an open question. Differences in morphological systems have been suggested as both unimportant and crucial to consider (Cotterell et al., 2018; Gerz et al., 2018a; Park et al., 2021, inter alia). We argue this conflicting evidence is due to confounding factors in experimental setups, making it hard to compare results and draw conclusions. We identify confounding factors in analyses trying to answer the question of whether, and how, morphology relates to language modeling. Next, we re-assess three hypotheses by Arnett & Bergen (2025) for why modeling agglutinative languages results in higher perplexities than fusional languages: they look at morphological alignment of tokenization, tokenization efficiency, and dataset size. We show that each conclusion includes confounding factors. Finally, we introduce token bigram metrics as an intrinsic way to predict the difficulty of causal language modeling, and find that they are gradient proxies for morphological complexity that do not require expert annotation. Ultimately, we outline necessities to reliably answer whether, and how, morphology relates to language modeling.
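As one concrete, annotation-free instance of a token bigram metric, the sketch below computes the conditional entropy of the next token given the previous one over a tokenized corpus; the paper's exact metrics may differ from this choice.

```python
# Conditional bigram entropy: an intrinsic, expert-annotation-free statistic of how hard
# next-token prediction is for a tokenized corpus.

import math
from collections import Counter
from typing import List

def bigram_conditional_entropy(tokens: List[str]) -> float:
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens[:-1])
    total = sum(bigrams.values())
    h = 0.0
    for (prev, _), count in bigrams.items():
        p_joint = count / total            # P(prev, next)
        p_cond = count / unigrams[prev]    # P(next | prev)
        h -= p_joint * math.log2(p_cond)
    return h  # higher values suggest a harder causal language modeling task

print(bigram_conditional_entropy("the cat sat on the mat and the cat slept".split()))
```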
[69] RAGSmith: A Framework for Finding the Optimal Composition of Retrieval-Augmented Generation Methods Across Datasets
Muhammed Yusuf Kartal, Suha Kagan Kose, Korhan Sevinç, Burak Aktas
Main category: cs.CL
TL;DR: RAGSmith is a framework that uses genetic search to optimize RAG pipeline configurations across 9 technique families and 46,080 possible combinations, achieving consistent improvements over naive RAG baselines.
Details
Motivation: RAG quality depends on many interacting choices across retrieval, ranking, augmentation, prompting, and generation, making isolated module optimization brittle and ineffective.
Method: A genetic search algorithm optimizes a scalar objective that jointly aggregates retrieval metrics (recall@k, mAP, nDCG, MRR) and generation metrics (LLM-Judge and semantic similarity) across 46,080 feasible pipeline configurations.
Result: RAGSmith finds configurations that consistently outperform naive RAG baseline by +3.8% on average (range +1.2% to +6.9% across domains), with gains up to +12.5% in retrieval and +7.5% in generation. The search explores only ~0.2% of the space (~100 candidates).
Conclusion: The framework discovers a robust backbone of vector retrieval plus post-generation reflection/revision, with domain-dependent choices in expansion, reranking, augmentation, and prompt reordering. Evolutionary search proves effective for full-pipeline optimization in RAG systems.
Abstract: Retrieval-Augmented Generation (RAG) quality depends on many interacting choices across retrieval, ranking, augmentation, prompting, and generation, so optimizing modules in isolation is brittle. We introduce RAGSmith, a modular framework that treats RAG design as an end-to-end architecture search over nine technique families and 46,080 feasible pipeline configurations. A genetic search optimizes a scalar objective that jointly aggregates retrieval metrics (recall@k, mAP, nDCG, MRR) and generation metrics (LLM-Judge and semantic similarity). We evaluate on six Wikipedia-derived domains (Mathematics, Law, Finance, Medicine, Defense Industry, Computer Science), each with 100 questions spanning factual, interpretation, and long-answer types. RAGSmith finds configurations that consistently outperform naive RAG baseline by +3.8% on average (range +1.2% to +6.9% across domains), with gains up to +12.5% in retrieval and +7.5% in generation. The search typically explores approximately 0.2% of the space (~100 candidates) and discovers a robust backbone – vector retrieval plus post-generation reflection/revision – augmented by domain-dependent choices in expansion, reranking, augmentation, and prompt reordering; passage compression is never selected. Improvement magnitude correlates with question type, with larger gains on factual/long-answer mixes than interpretation-heavy sets. These results provide practical, domain-aware guidance for assembling effective RAG systems and demonstrate the utility of evolutionary search for full-pipeline optimization.
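A minimal sketch of the kind of search RAGSmith performs: a scalar fitness averages retrieval and generation metrics, and a small genetic loop mutates pipeline configurations. The technique families, options, weighting, and the stand-in evaluator are illustrative assumptions rather than RAGSmith's actual search space or objective.

```python
import random

# Hypothetical technique families and options; the real search space differs.
SEARCH_SPACE = {
    "query_expansion": [None, "hyde", "multi_query"],
    "retriever":       ["vector", "hybrid"],
    "reranker":        [None, "cross_encoder"],
    "augmentation":    [None, "compression", "reordering"],
    "post_generation": [None, "reflection_revision"],
}

def fitness(config, eval_fn):
    """Scalar objective jointly aggregating retrieval and generation metrics."""
    m = eval_fn(config)  # dict of metric name -> score in [0, 1]
    retrieval = (m["recall@k"] + m["map"] + m["ndcg"] + m["mrr"]) / 4
    generation = (m["llm_judge"] + m["semantic_sim"]) / 2
    return (retrieval + generation) / 2

def random_config():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def mutate(config):
    child = dict(config)
    key = random.choice(list(SEARCH_SPACE))
    child[key] = random.choice(SEARCH_SPACE[key])
    return child

def genetic_search(eval_fn, generations=10, population=10, elite=3):
    pop = [random_config() for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, eval_fn), reverse=True)
        parents = pop[:elite]
        pop = parents + [mutate(random.choice(parents))
                         for _ in range(population - elite)]
    return max(pop, key=lambda c: fitness(c, eval_fn))

# Toy usage with a stand-in evaluator (real evaluation would run the pipeline).
def dummy_eval(config):
    score = 0.9 if config["post_generation"] == "reflection_revision" else 0.6
    return {k: score for k in
            ["recall@k", "map", "ndcg", "mrr", "llm_judge", "semantic_sim"]}

print(genetic_search(dummy_eval))
```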
[70] LiveSearchBench: An Automatically Constructed Benchmark for Retrieval and Reasoning over Dynamic Knowledge
Heng Zhou, Ao Yu, Yuchen Fan, Jianing Shi, Li Kang, Hejia Geng, Yongting Zhang, Yutao Fan, Yuhao Wu, Tiancheng He, Yiran Qin, Lei Bai, Zhenfei Yin
Main category: cs.CL
TL;DR: LiveSearchBench is an automated pipeline for creating retrieval-dependent benchmarks from recent knowledge updates to evaluate LLMs’ ability to handle dynamic world knowledge rather than static memorization.
Details
Motivation: Static benchmarks for LLM evaluation reward memorization and fail to capture the dynamic nature of world knowledge, understating the role of retrieval in question answering.
Method: Automated pipeline that computes deltas between successive Wikidata snapshots, filters candidate triples for quality, and synthesizes natural-language questions at three reasoning difficulty levels with unique, verifiable answers through SPARQL validation.
Result: Experiments show significant performance drop when models face facts post-dating pretraining, especially on multi-hop queries. Retrieval augmentation and larger models provide partial improvements but fail to close the recency gap.
Conclusion: LiveSearchBench shifts evaluation from static memorization toward tasks requiring up-to-date retrieval and reasoning, enabling systematic long-term assessment of LLMs under evolving knowledge.
Abstract: Evaluating large language models (LLMs) on question answering often relies on static benchmarks that reward memorization and understate the role of retrieval, failing to capture the dynamic nature of world knowledge. We present LiveSearchBench, an automated pipeline for constructing retrieval-dependent benchmarks from recent knowledge updates. Our method computes deltas between successive Wikidata snapshots, filters candidate triples for quality, and synthesizes natural-language questions at three levels of reasoning difficulty, each guaranteed to admit a unique, verifiable answer through SPARQL validation. The pipeline is fully automated, scalable across time, and minimizes human intervention, enabling continual regeneration of temporally grounded benchmarks. Experiments show a pronounced performance drop when models confront facts that post-date pretraining, with the gap most salient on multi-hop queries. Retrieval augmented methods and larger, instruction-tuned models provide partial gains but fail to close this recency gap. By design, LiveSearchBench shifts evaluation from static memorization toward tasks that require up-to-date retrieval and reasoning, offering a foundation for systematic, long-term assessment of LLMs under evolving knowledge.
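The core of the pipeline is a delta between knowledge snapshots. The sketch below shows one way to extract added and changed facts from two sets of (subject, predicate, object) triples; the real pipeline additionally filters triples, synthesizes questions, and validates answer uniqueness with SPARQL, none of which is reproduced here.

```python
def snapshot_delta(old_triples, new_triples):
    """Compute added and changed facts between two Wikidata-style snapshots.

    Each snapshot is a set of (subject, predicate, object) triples. A triple
    whose object changed for an existing (subject, predicate) pair is a
    natural seed for a retrieval-dependent question.
    """
    old_index = {(s, p): o for s, p, o in old_triples}
    changed = []
    for s, p, o in new_triples:
        if (s, p) in old_index and old_index[(s, p)] != o:
            changed.append(((s, p, old_index[(s, p)]), (s, p, o)))
    added = [t for t in new_triples if (t[0], t[1]) not in old_index]
    return added, changed

# Toy usage with made-up entity ids and values.
old = {("Q1", "head_of_state", "Alice")}
new = {("Q1", "head_of_state", "Bob"), ("Q2", "population", "1000")}
print(snapshot_delta(old, new))
```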
[71] “Don’t Teach Minerva”: Guiding LLMs Through Complex Syntax for Faithful Latin Translation with RAG
Sergio Torres Aguilar
Main category: cs.CL
TL;DR: A draft-based refinement pipeline using a fine-tuned NLLB-1.3B model and zero-shot LLMs achieves performance comparable to GPT-5 for Latin translation without task-specific LLM fine-tuning.
Details
Motivation: Translating morphology-rich, low-resource languages like Latin is challenging, and existing methods struggle with structural faithfulness and performance.
Method: Two-step pipeline: 1) Fine-tuned NLLB-1.3B generates structurally faithful drafts, 2) Zero-shot LLMs (Llama-3.3 or Qwen3) polish drafts, optionally enhanced with RAG using retrieved examples.
Result: The open-source RAG system achieves performance statistically comparable to GPT-5 baseline on both in-domain (Rosenthal, 2023) and challenging out-of-domain (12th-century Latin letters) benchmarks.
Conclusion: The reproducible pipeline enables open-source models to match proprietary system performance for Latin translation, with released code, datasets, and models to support further research.
Abstract: Translating a morphology-rich, low-resource language like Latin poses significant challenges. This paper introduces a reproducible draft-based refinement pipeline that elevates open-source Large Language Models (LLMs) to a performance level statistically comparable to top-tier proprietary systems. Our method first uses a fine-tuned NLLB-1.3B model to generate a high-quality, structurally faithful draft. A zero-shot LLM (Llama-3.3 or Qwen3) then polishes this draft, a process that can be further enhanced by augmenting the context with retrieved out-context examples (RAG). We demonstrate the robustness of this approach on two distinct benchmarks: a standard in-domain test set (Rosenthal, 2023) and a new, challenging out-of-domain (OOD) set of 12th-century Latin letters (2025). Our central finding is that this open-source RAG system achieves performance statistically comparable to the GPT-5 baseline, without any task-specific LLM fine-tuning. We release the pipeline, the Chartres OOD set, and evaluation scripts and models to facilitate replicability and further research.
[72] BARD: budget-aware reasoning distillation
Lujie Niu, Lei Shen, Yi Jiang, Caixia Yuan, Xiaojie Wang, Wenbo Su, Bo Zheng
Main category: cs.CL
TL;DR: BARD is a framework that distills reasoning capability to smaller models while enabling fine-grained control over reasoning length using budget constraints.
Details
Motivation: Long CoT distillation creates redundant reasoning processes with uncontrollable computational costs, leading to inefficient resource usage.
Method: Two-phase training: 1) SFT on teacher-generated CoT data compressed to various budget levels, 2) RL with reward balancing reasoning performance and budget fidelity.
Result: An 8B student model achieves strong performance on challenging reasoning benchmarks (AIME24, AIME25, GPQA) with precise control over reasoning length.
Conclusion: BARD successfully enables smaller models to balance reasoning performance and computational efficiency through budget-aware reasoning distillation.
Abstract: While long Chain-of-Thought (CoT) distillation effectively transfers reasoning capability to smaller language models, the reasoning process often remains redundant and computational budget uncontrollable, leading to inefficient resource usage. To address this limitation, we propose Budget-Aware Reasoning Distillation (BARD), a novel framework that simultaneously distills reasoning capability and enables fine-grained control over the reasoning length. BARD uses the thinking budget as a user-specified control signal, allowing the model to dynamically balance reasoning performance and computational efficiency. To achieve this concept, BARD introduces a two-phase training regimen. The first phase applies Supervised Fine-Tuning (SFT) on teacher-generated long CoT data compressed to various budget levels, bootstrapping the model’s understanding of budget constraints. The second phase leverages Reinforcement Learning (RL) with a reward signal that considers reasoning performance and budget fidelity simultaneously. Incorporating the two-phase regimen is crucial to avoiding policy degradation and ensuring that both objectives are optimized jointly. Extensive experiments demonstrate that our method empowers an 8B student model to achieve strong performance on challenging reasoning benchmarks (AIME24, AIME25, GPQA) while providing precise and adaptive control over its reasoning length across a wide range of budgets.
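The RL phase needs a reward that trades off answer quality against budget adherence. Below is a hypothetical form of such a reward (a correctness term plus a budget-fidelity term that decays linearly with deviation from the budget); the weights and exact functional form are assumptions, not taken from the paper.

```python
def budget_aware_reward(is_correct, num_reasoning_tokens, budget_tokens,
                        alpha=0.7, beta=0.3):
    """Reward balancing reasoning performance and budget fidelity (a sketch).

    alpha weights correctness, beta weights how closely the generated
    reasoning length tracks the user-specified thinking budget.
    """
    correctness = 1.0 if is_correct else 0.0
    deviation = abs(num_reasoning_tokens - budget_tokens) / max(budget_tokens, 1)
    fidelity = max(0.0, 1.0 - deviation)
    return alpha * correctness + beta * fidelity

print(budget_aware_reward(True, 480, 512))   # close to budget -> high reward
print(budget_aware_reward(True, 2000, 512))  # overshoots budget -> penalized
```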
[73] Towards Consistent Detection of Cognitive Distortions: LLM-Based Annotation and Dataset-Agnostic Evaluation
Neha Sharma, Navneet Agarwal, Kairit Sirts
Main category: cs.CL
TL;DR: LLMs can serve as consistent annotators for subjective cognitive distortion detection, with GPT-4 achieving high annotation consistency and outperforming models trained on human-labeled data.
Details
Motivation: Cognitive distortion detection is highly subjective with low agreement among human annotators, leading to unreliable annotations. The paper explores using LLMs as more consistent and reliable alternatives.
Method: Used multiple independent LLM runs to identify stable labeling patterns, and introduced a dataset-agnostic evaluation framework using Cohen’s kappa as an effect size measure for fair cross-dataset comparisons.
Result: GPT-4 produced highly consistent annotations (Fleiss’s Kappa = 0.78), and models trained on LLM-generated annotations showed improved test set performance compared to those trained on human-labeled data.
Conclusion: LLMs provide a scalable and internally consistent alternative for generating training data that supports strong downstream performance in subjective NLP tasks.
Abstract: Text-based automated Cognitive Distortion detection is a challenging task due to its subjective nature, with low agreement scores observed even among expert human annotators, leading to unreliable annotations. We explore the use of Large Language Models (LLMs) as consistent and reliable annotators, and propose that multiple independent LLM runs can reveal stable labeling patterns despite the inherent subjectivity of the task. Furthermore, to fairly compare models trained on datasets with different characteristics, we introduce a dataset-agnostic evaluation framework using Cohen’s kappa as an effect size measure. This methodology allows for fair cross-dataset and cross-study comparisons where traditional metrics like F1 score fall short. Our results show that GPT-4 can produce consistent annotations (Fleiss’s Kappa = 0.78), resulting in improved test set performance for models trained on these annotations compared to those trained on human-labeled data. Our findings suggest that LLMs can offer a scalable and internally consistent alternative for generating training data that supports strong downstream performance in subjective NLP tasks.
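Annotation consistency across independent LLM runs can be quantified with Fleiss's kappa, as reported above. The sketch below implements the standard formula over a small toy label matrix; the labels and category set are illustrative, not from the paper's data.

```python
import numpy as np

def fleiss_kappa(label_matrix, categories):
    """Fleiss's kappa for agreement across multiple independent annotation runs.

    label_matrix: list of per-item label lists (one label per run per item).
    """
    n_items = len(label_matrix)
    n_raters = len(label_matrix[0])
    counts = np.zeros((n_items, len(categories)))
    for i, labels in enumerate(label_matrix):
        for lab in labels:
            counts[i, categories.index(lab)] += 1
    # Per-item observed agreement, then chance agreement from category frequencies.
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    p_e = ((counts.sum(axis=0) / (n_items * n_raters)) ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

# Toy usage: three independent LLM runs labeling four texts for one distortion.
runs = [["distortion", "distortion", "distortion"],
        ["none", "none", "distortion"],
        ["distortion", "distortion", "distortion"],
        ["none", "none", "none"]]
print(fleiss_kappa(runs, ["distortion", "none"]))
```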
[74] Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning
Max Schaffelder, Albert Gatt
Main category: cs.CL
TL;DR: This paper examines how synthetic data source diversity affects fine-tuned LLMs, focusing on distribution collapse, adversarial robustness, and self-preference bias.
Details
Motivation: As synthetic data becomes widely used in language model development, understanding its impact on model behavior is crucial.
Method: Investigates the impact of synthetic data source diversity on fine-tuned LLMs across three dimensions: distribution collapse, adversarial robustness, and self-preference bias.
Result: Diverse synthetic data mitigates distribution collapse; while both human and synthetic fine-tuning data can remove safeguards, synthetic data preserves higher output quality. Fine-tuning reduces self-preference bias, with human data being most effective.
Conclusion: Synthetic data from diverse sources can preserve output distribution breadth and quality while reducing biases, though it may make outputs more usable and potentially dangerous.
Abstract: As synthetic data becomes widely used in language model development, understanding its impact on model behavior is crucial. This paper investigates the impact of the diversity of sources of synthetic data on fine-tuned large language models. We focus on three key dimensions: distribution collapse, adversarial robustness, and self-preference bias. Our findings reveal that fine-tuning models on synthetic data from diverse sources can mitigate distribution collapse, preserving the breadth of the output distribution and the diversity of the output text. Furthermore, while both human and synthetic fine-tuning data can remove safeguards, the latter preserves higher output quality, thus making outputs potentially more usable and dangerous. Finally, fine-tuning reduces self-preference bias, with human data being the most effective, followed by multi-source synthetic data.
[75] BanglaNirTox: A Large-scale Parallel Corpus for Explainable AI in Bengali Text Detoxification
Ayesha Afroza Mohsin, Mashrur Ahsan, Nafisa Maliyat, Shanta Maria, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan
Main category: cs.CL
TL;DR: A pipeline combining Pareto-optimized LLMs and Chain-of-Thought prompting for Bengali text detoxification, supported by a new dataset BanglaNirTox with 68,041 toxic sentences.
Details
Motivation: Toxic language in Bengali is prevalent online with few effective precautions, and text detoxification remains underexplored in Bengali due to limited resources.
Method: Proposed pipeline uses Pareto class-optimized LLMs and Chain-of-Thought prompting to generate detoxified sentences. Created BanglaNirTox dataset with toxic sentences, class-wise toxicity labels, reasonings, and detoxified paraphrases.
Result: Pareto-optimized LLMs with CoT prompting significantly enhance the quality and consistency of Bengali text detoxification. The dataset enables fine-tuning of language models for better detoxification.
Conclusion: The proposed approach effectively addresses Bengali text detoxification challenges through optimized LLMs and comprehensive dataset creation, showing significant improvements in detoxification quality.
Abstract: Toxic language in Bengali remains prevalent, especially in online environments, with few effective precautions against it. Although text detoxification has seen progress in high-resource languages, Bengali remains underexplored due to limited resources. In this paper, we propose a novel pipeline for Bengali text detoxification that combines Pareto class-optimized large language models (LLMs) and Chain-of-Thought (CoT) prompting to generate detoxified sentences. To support this effort, we construct BanglaNirTox, an artificially generated parallel corpus of 68,041 toxic Bengali sentences with class-wise toxicity labels, reasonings, and detoxified paraphrases, using Pareto-optimized LLMs evaluated on random samples. The resulting BanglaNirTox dataset is used to fine-tune language models to produce better detoxified versions of Bengali sentences. Our findings show that Pareto-optimized LLMs with CoT prompting significantly enhance the quality and consistency of Bengali text detoxification.
[76] Difficulty-Controllable Cloze Question Distractor Generation
Seokhoon Kang, Yejin Jeon, Seonjeong Hwang, Gary Geunbae Lee
Main category: cs.CL
TL;DR: A novel framework for generating distractors with controllable difficulty in multiple-choice cloze questions using data augmentation and multitask learning, outperforming GPT-4o in difficulty alignment.
Details
Motivation: Existing methods for generating distractors in multiple-choice cloze questions lack adaptability and control over difficulty levels, and the absence of difficulty-annotated datasets hinders progress.
Method: Two-step approach: 1) Create difficulty-annotated dataset through two-way distractor generation, filtering, and ensemble QA system categorization; 2) Train difficulty-controllable generation model via multitask learning with auxiliary tasks for semantic understanding and difficulty estimation.
Result: The method generates high-quality distractors across difficulty levels and substantially outperforms GPT-4o in aligning distractor difficulty with human perception.
Conclusion: The proposed framework effectively addresses the challenge of generating distractors with controllable difficulty, demonstrating superior performance compared to existing methods including GPT-4o.
Abstract: Multiple-choice cloze questions are commonly used to assess linguistic proficiency and comprehension. However, generating high-quality distractors remains challenging, as existing methods often lack adaptability and control over difficulty levels, and the absence of difficulty-annotated datasets further hinders progress. To address these issues, we propose a novel framework for generating distractors with controllable difficulty by leveraging both data augmentation and a multitask learning strategy. First, to create a high-quality, difficulty-annotated dataset, we introduce a two-way distractor generation process in order to produce diverse and plausible distractors. These candidates are subsequently refined through filtering and then categorized by difficulty using an ensemble QA system. Second, this newly created dataset is leveraged to train a difficulty-controllable generation model via multitask learning. The framework includes carefully designed auxiliary tasks that enhance the model’s semantic understanding of distractors and its ability to estimate their difficulty. Experimental results demonstrate that our method generates high-quality distractors across difficulty levels and substantially outperforms GPT-4o in aligning distractor difficulty with human perception.
[77] Math anxiety and associative knowledge structure are entwined in psychology students but not in Large Language Models like GPT-3.5 and GPT-4o
Luciana Ciringione, Emma Franchino, Simone Reigl, Isaia D’Onofrio, Anna Serbati, Oleksandra Poquet, Florence Gabriel, Massimo Stella
Main category: cs.CL
TL;DR: This study uses behavioral forma mentis networks to analyze math anxiety in psychology students, comparing human students with GPT-simulated students. Key findings show that positive valence for ‘anxiety’ and negative ratings for ‘math’ predict higher math anxiety in humans, but these patterns don’t apply to GPT models due to cognitive differences.
Details
Motivation: Math anxiety significantly impacts university psychology students' career choices and well-being, making it crucial to understand how students perceive and associate concepts related to math and anxiety.
Method: Used behavioral forma mentis networks to map cognitive associations and emotional perceptions. Conducted 4 experiments with psychology undergraduates (n=127) compared against GPT-simulated students (GPT-3.5: n=300; GPT-4o: n=300). Experiments 1-3 predicted math anxiety scores using individual network features, while Experiment 4 analyzed group-level perceptions.
Result: In human students: positive valence for ‘anxiety’ and negative ratings for ‘math’ predicted higher math anxiety. High math-anxiety students framed ‘anxiety’ in emotionally polarizing ways and contrasted ‘science’ (positive) against ‘math’ (negative). These patterns didn’t apply to GPT models due to cognitive differences in network structures and psychometric scores.
Conclusion: Understanding concept perception and associations is crucial for managing math anxiety. The study highlights cognitive differences between human and AI models in emotional framing of STEM concepts, emphasizing the need for tailored interventions for math-anxious students.
Abstract: Math anxiety poses significant challenges for university psychology students, affecting their career choices and overall well-being. This study employs a framework based on behavioural forma mentis networks (i.e. cognitive models that map how individuals structure their associative knowledge and emotional perceptions of concepts) to explore individual and group differences in the perception and association of concepts related to math and anxiety. We conducted 4 experiments involving psychology undergraduates from 2 samples (n1 = 70, n2 = 57) compared against GPT-simulated students (GPT-3.5: n2 = 300; GPT-4o: n4 = 300). Experiments 1, 2, and 3 employ individual-level network features to predict psychometric scores for math anxiety and its facets (observational, social and evaluational) from the Math Anxiety Scale. Experiment 4 focuses on group-level perceptions extracted from human students, GPT-3.5 and GPT-4o’s networks. Results indicate that, in students, positive valence ratings and higher network degree for “anxiety”, together with negative ratings for “math”, can predict higher total and evaluative math anxiety. In contrast, these models do not work on GPT-based data because of differences in simulated networks and psychometric scores compared to humans. These results were also reconciled with differences found in the ways that high/low subgroups of simulated and real students framed semantically and emotionally STEM concepts. High math-anxiety students collectively framed “anxiety” in an emotionally polarising way, absent in the negative perception of low math-anxiety students. “Science” was rated positively, but contrasted against the negative perception of “math”. These findings underscore the importance of understanding concept perception and associations in managing students’ math anxiety.
[78] ECO Decoding: Entropy-Based Control for Controllability and Fluency in Controllable Dialogue Generation
Seungmin Shin, Dooyoung Kim, Youngjoong Ko
Main category: cs.CL
TL;DR: ECO decoding dynamically adjusts control strength in dialogue generation based on entropy from both language model and attribute classifier, improving controllability while maintaining fluency across single and multi-attribute scenarios.
Details
Motivation: Fixed constant values for managing attribute probability bias in weighted decoding methods make it difficult to find ideal control strength that balances both controllability and fluency in dialogue generation.
Method: Proposed ECO decoding (Entropy-based COntrol) that dynamically adjusts control strength at each generation step based on the model’s entropy in both language model and attribute classifier probability distributions.
Result: Experiments on DailyDialog and MultiWOZ datasets show ECO decoding consistently improves controllability while maintaining fluency and grammaticality, outperforming prior decoding methods across various models and settings.
Conclusion: ECO decoding effectively addresses probability interpolation issues in multi-attribute generation and demonstrates strong performance in both single and multi-attribute controllable dialogue generation scenarios.
Abstract: Controllable Dialogue Generation (CDG) enables chatbots to generate responses with desired attributes, and weighted decoding methods have achieved significant success in the CDG task. However, using a fixed constant value to manage the bias of attribute probabilities makes it challenging to find an ideal control strength that satisfies both controllability and fluency. To address this issue, we propose ECO decoding (Entropy-based COntrol), which dynamically adjusts the control strength at each generation step according to the model’s entropy in both the language model and attribute classifier probability distributions. Experiments on the DailyDialog and MultiWOZ datasets demonstrate that ECO decoding consistently improves controllability while maintaining fluency and grammaticality, outperforming prior decoding methods across various models and settings. Furthermore, ECO decoding alleviates probability interpolation issues in multi-attribute generation and consequently demonstrates strong performance in both single and multi-attribute scenarios.
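A minimal sketch of entropy-controlled weighted decoding in the spirit of ECO: the attribute classifier's bias on the next-token distribution is scaled by a factor derived from the entropies of the language model and the classifier. The specific scaling rule and base strength here are assumptions, not the paper's exact formula.

```python
import numpy as np

def entropy(p, eps=1e-12):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps))

def eco_step(lm_probs, attr_probs, base_strength=2.0):
    """One step of entropy-controlled weighted decoding (a sketch).

    Weighted decoding combines LM token probabilities with an attribute
    classifier's per-token probabilities. Here the control strength grows
    when the LM is uncertain (high entropy) and shrinks when the attribute
    classifier itself is uncertain -- one plausible realization of
    entropy-based control, not necessarily ECO's exact rule.
    """
    vocab = len(lm_probs)
    max_ent = np.log(vocab)
    lm_uncertainty = entropy(lm_probs) / max_ent          # in [0, 1]
    attr_confidence = 1.0 - entropy(attr_probs) / max_ent  # in [0, 1]
    strength = base_strength * lm_uncertainty * attr_confidence
    logits = (np.log(np.asarray(lm_probs) + 1e-12)
              + strength * np.log(np.asarray(attr_probs) + 1e-12))
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

lm_probs = [0.4, 0.3, 0.2, 0.1]
attr_probs = [0.05, 0.05, 0.1, 0.8]   # classifier prefers token 3
print(eco_step(lm_probs, attr_probs))
```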
[79] BIRD: Bronze Inscription Restoration and Dating
Wenjie Hua, Hoang H. Nguyen, Gangyan Ge
Main category: cs.CL
TL;DR: BIRD dataset for bronze inscription restoration and dating with allograph-aware modeling framework
Details
Motivation: Bronze inscriptions from early China are fragmentary and difficult to date, requiring better computational methods.
Method: Allograph-aware masked language modeling framework with domain-adaptive pretraining and Glyph Net linking graphemes and allographs
Result: Glyph Net improves restoration performance, while glyph-biased sampling improves dating accuracy
Conclusion: Proposed framework effectively addresses challenges in bronze inscription analysis through integrated glyph modeling
Abstract: Bronze inscriptions from early China are fragmentary and difficult to date. We introduce BIRD (Bronze Inscription Restoration and Dating), a fully encoded dataset grounded in standard scholarly transcriptions and chronological labels. We further propose an allograph-aware masked language modeling framework that integrates domain- and task-adaptive pretraining with a Glyph Net (GN), which links graphemes and allographs. Experiments show that GN improves restoration, while glyph-biased sampling yields gains in dating.
[80] Imperfect Language, Artificial Intelligence, and the Human Mind: An Interdisciplinary Approach to Linguistic Errors in Native Spanish Speakers
Francisco Portillo López
Main category: cs.CL
TL;DR: This paper proposes an interdisciplinary study of linguistic errors by native Spanish speakers to analyze how large language models interpret and handle these errors, aiming to develop more cognitively informed NLP systems.
Details
Motivation: Linguistic errors provide insights into cognitive language architecture and expose limitations of AI systems in replicating human language behavior.
Method: Uses a corpus of 500+ authentic Spanish errors, analyzed through theoretical linguistics, neurolinguistics, and NLP perspectives, and tested against AI models like GPT and Gemini.
Result: The research evaluates AI models’ interpretative accuracy and ability to generalize human linguistic error patterns.
Conclusion: The project contributes to understanding Spanish as a native language and developing NLP systems that better handle the imperfect and ambiguous nature of real human language.
Abstract: Linguistic errors are not merely deviations from normative grammar; they offer a unique window into the cognitive architecture of language and expose the current limitations of artificial systems that seek to replicate them. This project proposes an interdisciplinary study of linguistic errors produced by native Spanish speakers, with the aim of analyzing how current large language models (LLM) interpret, reproduce, or correct them. The research integrates three core perspectives: theoretical linguistics, to classify and understand the nature of the errors; neurolinguistics, to contextualize them within real-time language processing in the brain; and natural language processing (NLP), to evaluate their interpretation against linguistic errors. A purpose-built corpus of authentic errors of native Spanish (+500) will serve as the foundation for empirical analysis. These errors will be tested against AI models such as GPT or Gemini to assess their interpretative accuracy and their ability to generalize patterns of human linguistic behavior. The project contributes not only to the understanding of Spanish as a native language but also to the development of NLP systems that are more cognitively informed and capable of engaging with the imperfect, variable, and often ambiguous nature of real human language.
[81] ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian
Nikola Ljubešić, Peter Rupnik, Ivan Porupski, Taja Kuzman Pungeršek
Main category: cs.CL
TL;DR: ParlaSpeech is a 6,000-hour multilingual spoken parliamentary corpus covering four Slavic languages with automatic annotations including linguistic features, sentiment analysis, disfluencies, and alignment data.
Details
Motivation: To create a comprehensive spoken parliamentary dataset for Slavic languages that bridges the gap between parliamentary transcripts and actual speech recordings, enabling multidisciplinary research.
Method: Automatically built from ParlaMint transcripts and aligned with speech recordings from parliaments, then enriched with automatic annotation layers including linguistic annotations, sentiment predictions, filled pauses detection, word/grapheme alignments, and stress position annotation.
Result: Created a 6,000-hour corpus spanning Croatian, Czech, Polish, and Serbian with multiple annotation layers, demonstrating utility through acoustic sentiment analysis and making data available in JSONL, TextGrid formats and via concordancer.
Conclusion: The enriched ParlaSpeech corpora significantly increase research utility across disciplines and provide valuable resources for Slavic language processing and parliamentary speech analysis.
Abstract: ParlaSpeech is a collection of spoken parliamentary corpora currently spanning four Slavic languages - Croatian, Czech, Polish and Serbian - all together 6 thousand hours in size. The corpora were built in an automatic fashion from the ParlaMint transcripts and their corresponding metadata, which were aligned to the speech recordings of each corresponding parliament. In this release of the dataset, each of the corpora is significantly enriched with various automatic annotation layers. The textual modality of all four corpora has been enriched with linguistic annotations and sentiment predictions. Similar to that, their spoken modality has been automatically enriched with occurrences of filled pauses, the most frequent disfluency in typical speech. Two out of the four languages have been additionally enriched with detailed word- and grapheme-level alignments, and the automatic annotation of the position of primary stress in multisyllabic words. With these enrichments, the usefulness of the underlying corpora has been drastically increased for downstream research across multiple disciplines, which we showcase through an analysis of acoustic correlates of sentiment. All the corpora are made available for download in JSONL and TextGrid formats, as well as for search through a concordancer.
[82] A Graph-based RAG for Energy Efficiency Question Answering
Riccardo Campi, Nicolò Oreste Pinciroli Vago, Mathyas Giudici, Pablo Barrachina Rodriguez-Guisado, Marco Brambilla, Piero Fraternali
Main category: cs.CL
TL;DR: The paper presents a graph-based RAG system using LLMs for energy efficiency QA, achieving 75.2% accuracy with multilingual capabilities.
Details
Motivation: To improve energy efficiency question answering by leveraging LLMs with structured knowledge from regulatory documents through graph-based retrieval.
Method: Extract Knowledge Graph from energy documents, then use graph navigation and reasoning within a RAG architecture for multilingual QA, validated using RAGAs framework and domain experts.
Result: System achieves 75.2% ± 2.7% accuracy, with higher performance on general EE questions (81.0% ± 4.1%) and minimal multilingual degradation (4.4% accuracy loss).
Conclusion: Graph-based RAG architecture shows strong potential for energy efficiency QA with promising multilingual abilities, though some limitations remain.
Abstract: In this work, we investigate the use of Large Language Models (LLMs) within a graph-based Retrieval Augmented Generation (RAG) architecture for Energy Efficiency (EE) Question Answering. First, the system automatically extracts a Knowledge Graph (KG) from guidance and regulatory documents in the energy field. Then, the generated graph is navigated and reasoned upon to provide users with accurate answers in multiple languages. We implement a human-based validation using the RAGAs framework properties, a validation dataset comprising 101 question-answer pairs, and domain experts. Results confirm the potential of this architecture and identify its strengths and weaknesses. Validation results show how the system correctly answers in about three out of four of the cases (75.2 ± 2.7%), with higher results on questions related to more general EE answers (up to 81.0 ± 4.1%), and featuring promising multilingual abilities (4.4% accuracy loss due to translation).
[83] Evaluating Cultural Knowledge Processing in Large Language Models: A Cognitive Benchmarking Framework Integrating Retrieval-Augmented Generation
Hung-Shin Lee, Chen-Chi Chang, Ching-Yuan Chen, Yun-Hsiang Hsu
Main category: cs.CL
TL;DR: A cognitive benchmarking framework that evaluates LLMs’ processing of culturally specific knowledge using Bloom’s Taxonomy and RAG, tested on Taiwanese Hakka cultural archive.
Details
Motivation: To assess how large language models handle and apply culturally specific knowledge, particularly in the context of cultural heritage preservation.
Method: Integrates Bloom’s Taxonomy with Retrieval-Augmented Generation (RAG) to evaluate LLM performance across six cognitive domains, using a Taiwanese Hakka digital cultural archive as test data.
Result: The framework measures LLM-generated responses for semantic accuracy and cultural relevance across hierarchical cognitive levels.
Conclusion: Proposes a systematic approach for benchmarking LLMs’ cultural knowledge processing capabilities, with potential applications in cultural heritage and AI evaluation.
Abstract: This study proposes a cognitive benchmarking framework to evaluate how large language models (LLMs) process and apply culturally specific knowledge. The framework integrates Bloom’s Taxonomy with Retrieval-Augmented Generation (RAG) to assess model performance across six hierarchical cognitive domains: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Using a curated Taiwanese Hakka digital cultural archive as the primary testbed, the evaluation measures LLM-generated responses’ semantic accuracy and cultural relevance.
[84] EngChain: A Symbolic Benchmark for Verifiable Multi-Step Reasoning in Engineering
Ayesha Gull, Muhammad Usman Safder, Rania Elbadry, Preslav Nakov, Zhuohan Xie
Main category: cs.CL
TL;DR: EngChain is a new benchmark for evaluating LLMs’ multi-step engineering problem-solving capabilities, featuring 90 diverse problems across 3 engineering branches with verifiable reasoning steps.
Details
Motivation: Current benchmarks don't assess the integrative reasoning needed in engineering domains where scientific principles, quantitative modeling, and practical constraints must converge.
Method: Created EngChain benchmark with 90 problems from symbolic templates with randomization; uses two-stage evaluation: quantitative verification of reasoning steps and LLM-As-A-Judge for qualitative error categorization.
Result: Developed a comprehensive benchmark that moves beyond final answer accuracy to verify step-by-step reasoning validity in engineering contexts.
Conclusion: EngChain addresses the gap in evaluating LLMs’ complex engineering reasoning capabilities through verifiable multi-step problem-solving assessment.
Abstract: Large Language Models (LLMs) are increasingly being applied to specialized, high-stakes domains like engineering, which demands rigorous evaluation of their complex reasoning capabilities. While current benchmarks assess language understanding, factual recall, mathematics or code generation, none capture the integrative reasoning central to engineering where scientific principles, quantitative modeling and practical constraints must converge. To address this gap, we introduce EngChain, a benchmark for verifiable multi-step engineering problem-solving. EngChain contains 90 problems spanning three engineering branches, organized into 9 domains and 20 distinct areas. The problems are generated from symbolic templates with a high degree of randomization to ensure diversity and eliminate the risk of contamination. With this benchmark, we move beyond final answer accuracy with a two-stage evaluation: we first quantitatively verify the numerical and semantic validity of each reasoning step and then introduce LLM-As-A-Judge, an automated system to qualitatively categorize the identified reasoning errors.
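Problems generated from symbolic templates with randomization can carry verifiable intermediate values, which is what enables step-level checking. The sketch below shows an illustrative template (basic circuit analysis); the template content and fields are not from EngChain.

```python
import random

def ohms_law_template(seed=None):
    """Generate a randomized circuit problem with verifiable reasoning steps.

    An illustrative symbolic template: parameters are sampled at generation
    time so each instance is unique, and each reasoning step carries a
    numeric value that can be checked automatically.
    """
    rng = random.Random(seed)
    voltage = rng.choice(range(6, 49, 6))           # volts
    resistance = rng.choice([100, 220, 470, 1000])  # ohms
    current = voltage / resistance
    power = voltage * current
    problem = (f"A resistor of {resistance} ohm is connected across a "
               f"{voltage} V source. Find the current and the dissipated power.")
    steps = [
        {"step": "I = V / R", "value": round(current, 4), "unit": "A"},
        {"step": "P = V * I", "value": round(power, 4), "unit": "W"},
    ]
    return {"problem": problem, "steps": steps, "answer": steps[-1]}

print(ohms_law_template(seed=0))
```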
[85] SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia
Chaoqun Liu, Mahani Aljunied, Guizhen Chen, Hou Pong Chan, Weiwen Xu, Yu Rong, Wenxuan Zhang
Main category: cs.CL
TL;DR: SeaLLMs-Audio is the first large audio-language model for Southeast Asian languages (Indonesian, Thai, Vietnamese) plus English and Chinese, supporting multimodal inputs and multiple audio tasks.
Details
Motivation: To address the lack of large audio-language models specifically designed for Southeast Asian languages and advance audio LLM research in the region.
Method: Trained on a large-scale audio corpus, the model supports flexible input modalities (audio only, text only, audio+text) and multiple tasks including audio captioning, speech recognition, translation, emotion recognition, Q&A, and summarization.
Result: SeaLLMs-Audio achieves competitive performance compared to other LALMs on Southeast Asian languages, as evaluated by the new SeaBench-Audio benchmark.
Conclusion: The model represents a significant advancement for audio LLMs in Southeast Asia and is expected to benefit both regional research and industry.
Abstract: We introduce SeaLLMs-Audio, the first large audio-language model (LALM) tailored for multiple Southeast Asian (SEA) languages-Indonesian (id), Thai (th), and Vietnamese (vi)-alongside English (en) and Chinese (zh). Trained on a large-scale audio corpus, SeaLLMs-Audio exhibits strong performance across diverse audio-centric tasks, spanning fine-grained audio understanding and voice-based interaction. Its key features include: 1) Multilingual: the model primarily supports 5 languages, namely Indonesian, Thai, Vietnamese, English, and Chinese; 2) Multimodal: the model accepts flexible input modalities, including audio only, text only, as well as audio with text; 3) Multi-task: the model supports a wide range of tasks, including audio analysis tasks such as Audio Captioning, Automatic Speech Recognition, Speech-to-Text Translation, Speech Emotion Recognition, Speech Question Answering, and Speech Summarization. It also enables voice-based dialogue, including answering factual, mathematical, and general knowledge queries. As a significant step towards advancing audio LLMs in Southeast Asia, we expect SeaLLMs-Audio to benefit both the regional research community and industry. To automate LALM evaluation for Southeast Asia, we introduce SeaBench-Audio, a benchmark spanning multiple tasks. Experiments show that SeaLLMs-Audio achieves competitive performance compared with other LALMs on SEA languages.
[86] Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI
Sharan Maiya, Henning Bartsch, Nathan Lambert, Evan Hubinger
Main category: cs.CL
TL;DR: This paper introduces the first open implementation of character training for AI assistants, using Constitutional AI and synthetic introspective data to shape persona more effectively than existing methods, with minimal impact on general capabilities.
Details
Motivation: Character training is critical for shaping AI assistant personas (values, beliefs, ethics) but remains unstudied in academic literature despite being important for interaction quality and alignment.
Method: Fine-tuned three popular open-weights models using 11 personas via Constitutional AI and synthetic introspective data pipeline, with revealed preferences analysis to track character changes.
Result: Character changes are more robust to adversarial prompting than alternatives, lead to more coherent and realistic generations, and have little to no effect on general benchmark performance.
Conclusion: The approach provides effective and controlled character training that outperforms existing methods while maintaining model capabilities, with full implementation open-sourced.
Abstract: The character of the “AI assistant” persona generated by modern chatbot large language models influences both surface-level behavior and apparent values, beliefs, and ethics. These all affect interaction quality, perceived intelligence, and alignment with both developer and user intentions. The shaping of this persona, known as character training, is a critical component of industry post-training, yet remains effectively unstudied in the academic literature. We introduce the first open implementation of character training, leveraging Constitutional AI and a new data pipeline using synthetic introspective data to shape the assistant persona in a more effective and controlled manner than alternatives such as constraining system prompts or activation steering. Specifically, we fine-tune three popular open-weights models using 11 example personas, such as humorous, deeply caring, or even malevolent. To track the effects of our approach, we introduce a method which analyzes revealed preferences, uncovering clear and holistic changes in character. We find these changes are more robust to adversarial prompting than the above two alternatives, while also leading to more coherent and realistic generations. Finally, we demonstrate this fine-tuning has little to no effect on general capabilities as measured by common benchmarks. We describe and open-source our full post-training method, the implementation of which can be found at https://github.com/maiush/OpenCharacterTraining.
[87] Multi-Step Knowledge Interaction Analysis via Rank-2 Subspace Disentanglement
Sekh Mainul Islam, Pepa Atanasova, Isabelle Augenstein
Main category: cs.CL
TL;DR: This paper proposes a rank-2 projection subspace to better disentangle and analyze the interaction between Parametric Knowledge (PK) and Context Knowledge (CK) in Large Language Models’ Natural Language Explanations, revealing diverse knowledge interactions that are missed by traditional rank-1 approaches.
Details
Motivation: Understanding how LLMs combine external context knowledge and internal parametric knowledge in generating Natural Language Explanations is crucial for assessing explanation grounding, but current methods only model this as a binary choice in rank-1 subspace, missing richer interaction forms.
Method: Proposes a novel rank-2 projection subspace that more accurately disentangles PK and CK contributions, enabling the first multi-step analysis of knowledge interactions across longer NLE sequences on four QA datasets with three open-weight instruction-tuned LLMs.
Result: Diverse knowledge interactions are poorly represented in rank-1 subspace but effectively captured in rank-2 formulation. Hallucinated NLEs align strongly with PK direction, context-faithful ones balance PK and CK, and Chain-of-Thought prompting shifts NLEs toward CK by reducing PK reliance.
Conclusion: This work provides the first framework for systematic studies of multi-step knowledge interactions in LLMs through a richer rank-2 subspace disentanglement, enabling better understanding of how LLMs ground their explanations in different knowledge sources.
Abstract: Natural Language Explanations (NLEs) describe how Large Language Models (LLMs) make decisions, drawing on both external Context Knowledge (CK) and Parametric Knowledge (PK) stored in model weights. Understanding their interaction is key to assessing the grounding of NLEs, yet it remains underexplored. Prior work has largely examined only single-step generation, typically the final answer, and has modelled PK and CK interaction only as a binary choice in a rank-1 subspace. This overlooks richer forms of interaction, such as complementary or supportive knowledge. We propose a novel rank-2 projection subspace that disentangles PK and CK contributions more accurately and use it for the first multi-step analysis of knowledge interactions across longer NLE sequences. Experiments on four QA datasets and three open-weight instruction-tuned LLMs show that diverse knowledge interactions are poorly represented in a rank-1 subspace but are effectively captured in our rank-2 formulation. Our multi-step analysis reveals that hallucinated NLEs align strongly with the PK direction, context-faithful ones balance PK and CK, and Chain-of-Thought prompting for NLEs shifts generated NLEs toward CK by reducing PK reliance. This work provides the first framework for systematic studies of multi-step knowledge interactions in LLMs through a richer rank-2 subspace disentanglement. Code and data: https://github.com/copenlu/pk-ck-knowledge-disentanglement.
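A minimal sketch of the rank-2 disentanglement idea: given separately estimated PK and CK directions in hidden-state space, orthonormalize them and read off a hidden state's coordinates along each. How the paper estimates those directions, and which layer's states it uses, are not reproduced here; the random vectors below are purely illustrative.

```python
import numpy as np

def rank2_disentangle(hidden, pk_direction, ck_direction):
    """Project a hidden state onto a rank-2 subspace spanned by PK and CK.

    pk_direction / ck_direction are assumed to be estimated separately
    (e.g., from contrastive parametric- vs context-grounded examples);
    QR orthonormalizes them so the two coordinates are comparable.
    """
    basis = np.stack([pk_direction, ck_direction], axis=1)  # (d, 2)
    q, _ = np.linalg.qr(basis)                              # orthonormal (d, 2)
    coords = q.T @ hidden                                    # coordinates along PK, CK
    return {"pk_contribution": float(coords[0]),
            "ck_contribution": float(coords[1])}

# Toy usage with random directions and a hidden state biased toward PK.
d = 8
rng = np.random.default_rng(0)
pk_dir, ck_dir = rng.normal(size=d), rng.normal(size=d)
hidden_state = 0.9 * pk_dir + 0.2 * ck_dir + 0.05 * rng.normal(size=d)
print(rank2_disentangle(hidden_state, pk_dir, ck_dir))
```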
[88] Efficient Tool-Calling Multi-Expert NPC Agent for Commonsense Persona-Grounded Dialogue
Mahammad Nuriyev
Main category: cs.CL
TL;DR: A multi-expert system using Qwen3 with LoRA adapters creates NPCs capable of natural dialogue and contextual actions, achieving second place in CPDC 2025 while maintaining computational efficiency.
Details
Motivation: To develop NPCs that can engage in natural dialogue and execute contextual actions in interactive environments, addressing the need for computationally efficient yet capable AI agents.
Method: Uses Qwen3 as base model with LoRA adapters to create three specialists: tool calling, tool-response interpretation, and direct dialogue, ensuring computational efficiency on L40S GPUs.
Result: System achieved second place overall in the Commonsense Persona-Grounded Dialogue Challenge 2025, delivering fast responses with modest resource usage.
Conclusion: The multi-expert approach with specialized LoRA adapters provides an effective and computationally efficient solution for creating dialogue-capable NPCs with contextual action execution.
Abstract: We present a multi-expert system for creating Non-Player Characters (NPCs) capable of both natural dialogue and contextual action execution in interactive environments. Using Qwen3 as the base model and Low-Rank Adaptation (LoRA) adapters, we instantiate three specialists: tool calling, tool-response interpretation, and direct dialogue. Our system comfortably meets the computational efficiency requirements, delivering fast responses and maintaining modest resource usage on L40S GPUs. In the Commonsense Persona-Grounded Dialogue Challenge 2025, our method ranked second overall. Code available at: https://github.com/MahammadNuriyev62/CPDC-challenge-2025-solution/
[89] Accumulating Context Changes the Beliefs of Language Models
Jiayi Geng, Howard Chen, Ryan Liu, Manoel Horta Ribeiro, Robb Willer, Graham Neubig, Thomas L. Griffiths
Main category: cs.CL
TL;DR: Language models’ beliefs can significantly shift through extended interactions and text processing, making their responses unreliable.
Details
Motivation: To investigate how accumulating context in language models leads to silent changes in their belief profiles, potentially causing inconsistent user experiences and deviations from original alignment.
Method: Examined belief shifts through moral dilemma discussions, political text reading, and tool use tasks that reflect implicit beliefs.
Result: GPT-5 showed 54.7% belief shift after 10 rounds of moral discussions, while Grok 4 shifted 27.2% on political issues after reading opposing texts. Behavioral changes in tool use aligned with stated belief shifts.
Conclusion: Extended talking and reading sessions pose hidden risks of belief shift in language models, rendering their opinions and actions unreliable in agentic systems.
Abstract: Language model (LM) assistants are increasingly used in applications such as brainstorming and research. Improvements in memory and context size have allowed these models to become more autonomous, which has also resulted in more text accumulation in their context windows without explicit user intervention. This comes with a latent risk: the belief profiles of models – their understanding of the world as manifested in their responses or actions – may silently change as context accumulates. This can lead to subtly inconsistent user experiences, or shifts in behavior that deviate from the original alignment of the models. In this paper, we explore how accumulating context by engaging in interactions and processing text – talking and reading – can change the beliefs of language models, as manifested in their responses and behaviors. Our results reveal that models’ belief profiles are highly malleable: GPT-5 exhibits a 54.7% shift in its stated beliefs after 10 rounds of discussion about moral dilemmas and queries about safety, while Grok 4 shows a 27.2% shift on political issues after reading texts from the opposing position. We also examine models’ behavioral changes by designing tasks that require tool use, where each tool selection corresponds to an implicit belief. We find that these changes align with stated belief shifts, suggesting that belief shifts will be reflected in actual behavior in agentic systems. Our analysis exposes the hidden risk of belief shift as models undergo extended sessions of talking or reading, rendering their opinions and actions unreliable.
[90] Plan-and-Write: Structure-Guided Length Control for LLMs without Model Retraining
Adewale Akinfaderin, Shreyas Subramanian, Akarsha Sehwag
Main category: cs.CL
TL;DR: A prompt engineering method for precise length control in LLMs without retraining, using structure-guided planning and word counting to improve length adherence by up to 37.6% while maintaining output quality.
Details
Motivation: Length control in LLMs is crucial for applications like voice interfaces and research summaries, but current approaches require expensive retraining or complex tooling.
Method: Structure-guided prompt engineering with deliberate planning and word counting mechanisms that encourage models to track and adhere to specified length constraints.
Result: Significantly improved length fidelity across six LLMs, particularly for shorter-to-medium lengths (up to 37.6% improvement), while maintaining or enhancing output quality compared to standard prompting.
Conclusion: Provides an immediately deployable solution for precise length control in production environments where model retraining is impractical or cost-prohibitive.
Abstract: Length control in Large Language Models (LLMs) is a crucial but under-addressed challenge, with applications ranging from voice interfaces requiring concise responses to research summaries needing comprehensive outputs. Current approaches to length control, including Regularized DPO, Length-Instruction Fine Tuning, and tool-augmented methods, typically require expensive model retraining or complex inference-time tooling. This paper presents a prompt engineering methodology that enables precise length control without model retraining. Our structure-guided approach implements deliberate planning and word counting mechanisms within the prompt, encouraging the model to carefully track and adhere to specified length constraints. Comprehensive evaluations across six state-of-the-art LLMs demonstrate that our method significantly improves length fidelity for several models compared to standard prompting when applied to document summarization tasks, particularly for shorter-to-medium length constraints. The proposed technique shows varying benefits across different model architectures, with some models demonstrating up to 37.6% improvement in length adherence. Quality evaluations further reveal that our approach maintains or enhances overall output quality compared to standard prompting techniques. Our approach provides an immediately deployable solution for applications requiring precise length control, particularly valuable for production environments where model retraining is impractical or cost-prohibitive.
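Because the method is pure prompt engineering, it can be shown as a prompt template. The wording below is an illustrative reconstruction of a structure-guided plan/count/check prompt, not the exact prompt used in the paper.

```python
def plan_and_write_prompt(document, target_words):
    """Build a structure-guided prompt for length-controlled summarization.

    The phrasing is illustrative; the key idea is explicit planning and
    word-count tracking inside the prompt, with no model retraining.
    """
    return f"""Summarize the document below in exactly {target_words} words.

Follow these steps:
1. Plan: list 3-5 key points and allocate a word budget to each so the
   budgets sum to {target_words}.
2. Write: draft the summary point by point, keeping a running word count
   after each sentence in [brackets].
3. Check: count the words of the draft; if the total differs from
   {target_words}, revise and recount before answering.
4. Output only the final summary, with no counts or plan.

Document:
{document}
"""

print(plan_and_write_prompt("Large language models ...", target_words=120))
```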
[91] KV Cache Transform Coding for Compact Storage in LLM Inference
Konrad Staniszewski, Adrian Łańcucki
Main category: cs.CL
TL;DR: KVTC is a lightweight transform coder that compresses KV caches for efficient LLM serving, achieving up to 20× compression while maintaining accuracy.
Details
Motivation: KV caches consume significant GPU memory in LLM serving, especially with shared-prefix prompts in iterative tasks like code editing and chat, creating memory bottlenecks.
Method: KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding inspired by classical media compression, requiring only brief initial calibration without changing model parameters.
Result: Achieves up to 20× compression while maintaining reasoning and long-context accuracy, with 40× or higher for specific use cases. Outperforms token eviction, quantization, and SVD-based methods across multiple benchmarks and models.
Conclusion: KVTC serves as a practical building block for memory-efficient LLM serving with reusable KV caches, enabling efficient on-GPU and off-GPU storage.
Abstract: Serving large language models (LLMs) at scale necessitates efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts that are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. Drawing on classical media compression, KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding. It requires only a brief initial calibration and leaves model parameters unchanged. By exploiting redundancies in KV caches, KVTC achieves up to 20× compression while maintaining reasoning and long-context accuracy, and 40× or higher for specific use cases. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across benchmarks including AIME25, LiveCodeBench, GSM8K, MMLU, Qasper, RULER, and MATH-500. It consistently outperforms inference-time baselines such as token eviction, quantization, and SVD-based methods, while achieving higher compression ratios. These results support KVTC as a practical building block for memory-efficient LLM serving with reusable KV caches.
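A stripped-down sketch of the transform-coding idea behind KVTC: fit a PCA basis on calibration KV vectors, project and uniformly quantize the coefficients, and invert at load time. Adaptive bit allocation and entropy coding, which the paper also uses, are omitted, and the shapes, component count, and step size are illustrative.

```python
import numpy as np

def fit_pca_basis(calibration_kv, n_components):
    """Fit a PCA basis on calibration KV vectors (rows = tokens, cols = channels)."""
    mean = calibration_kv.mean(axis=0)
    centered = calibration_kv - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]           # (d,), (k, d)

def compress(kv, mean, basis, step=0.05):
    """Decorrelate with the PCA basis and uniformly quantize the coefficients."""
    coeffs = (kv - mean) @ basis.T            # (tokens, k)
    return np.round(coeffs / step).astype(np.int16), step

def decompress(q_coeffs, step, mean, basis):
    """Dequantize and map back to the original channel space."""
    return (q_coeffs.astype(np.float32) * step) @ basis + mean

# Toy usage with random stand-ins for calibration and live KV vectors.
rng = np.random.default_rng(0)
calib = rng.normal(size=(1024, 128)).astype(np.float32)
kv = rng.normal(size=(256, 128)).astype(np.float32)
mean, basis = fit_pca_basis(calib, n_components=32)
q, step = compress(kv, mean, basis)
rec = decompress(q, step, mean, basis)
print("reconstruction MSE:", float(((rec - kv) ** 2).mean()))
```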
[92] Towards Robust Mathematical Reasoning
Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V. Le, Junehyuk Jung
Main category: cs.CL
TL;DR: IMO-Bench is a suite of advanced mathematical reasoning benchmarks at International Mathematical Olympiad level, including IMO-AnswerBench for short answers and IMO-ProofBench for proof-writing, which helped achieve gold-level performance at IMO 2025.
Details
Motivation: Existing evaluations for mathematical reasoning are either too easy or only focus on correct short answers, lacking comprehensive assessment of advanced reasoning capabilities needed for IMO-level problems.
Method: Developed IMO-Bench with 400 diverse Olympiad problems for short answers (IMO-AnswerBench) and proof-writing evaluation (IMO-ProofBench) with detailed grading guidelines for automatic grading. Also created IMO-GradingBench with 1000 human gradings.
Result: Gemini Deep Think achieved 80.0% on IMO-AnswerBench and 65.7% on advanced IMO-ProofBench, surpassing best non-Gemini models by 6.9% and 42.4% margins respectively. Autograders built with Gemini reasoning correlate well with human evaluations.
Conclusion: IMO-Bench provides robust benchmarks for advancing mathematical reasoning capabilities and was crucial for achieving gold-level performance at IMO 2025, with potential to drive further progress in automatic evaluation of long-form answers.
Abstract: Finding the right north-star metrics is highly critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or only focus on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks, vetted by a panel of top specialists and that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-Proof Bench is the next-level evaluation for proof-writing capabilities, which includes both basic and advanced IMO level problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of the gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-Proof Bench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4% respectively. We also showed that autograders built with Gemini reasoning correlate well with human evaluations and construct IMO-GradingBench, with 1000 human gradings on proofs, to enable further progress in automatic evaluation of long-form answers. We hope that IMO-Bench will help the community towards advancing robust mathematical reasoning and release it at https://imobench.github.io/.
[93] Tool-to-Agent Retrieval: Bridging Tools and Agents for Scalable LLM Multi-Agent Systems
Elias Lumer, Faheem Nizar, Anmol Gulati, Pradeep Honaganahalli Basavaraju, Vamse Kumar Subbiah
Main category: cs.CL
TL;DR: Tool-to-Agent Retrieval improves multi-agent systems by embedding tools and agents in shared vector space, enabling granular retrieval and achieving 19.4% Recall@5 improvement.
Details
Motivation: Existing retrieval methods match queries against coarse agent-level descriptions, obscuring fine-grained tool functionality and leading to suboptimal agent selection.
Method: Unified framework that embeds both tools and their parent agents in shared vector space, connecting them through metadata relationships to enable granular tool-level or agent-level retrieval.
Result: Achieves consistent improvements of 19.4% in Recall@5 and 17.7% in nDCG@5 over previous state-of-the-art agent retrievers on LiveMCPBench benchmark across eight embedding models.
Conclusion: Tool-to-Agent Retrieval ensures agents and their underlying tools are equally represented without context dilution, significantly improving retrieval performance in multi-agent systems.
Abstract: Recent advances in LLM Multi-Agent Systems enable scalable orchestration of sub-agents, each coordinating hundreds or thousands of tools or Model Context Protocol (MCP) servers. However, existing retrieval methods typically match queries against coarse agent-level descriptions before routing, which obscures fine-grained tool functionality and often results in suboptimal agent selection. We introduce Tool-to-Agent Retrieval, a unified framework that embeds both tools and their parent agents in a shared vector space and connects them through metadata relationships. By explicitly representing tool capabilities and traversing metadata to the agent level, Tool-to-Agent Retrieval enables granular tool-level or agent-level retrieval, ensuring that agents and their underlying tools or MCP servers are equally represented without the context dilution that arises from chunking many tools together. Evaluating Tool-to-Agent Retrieval across eight embedding models, our approach achieves consistent improvements of 19.4% in Recall@5 and 17.7% in nDCG@5 over previous state-of-the-art agent retrievers on the LiveMCPBench benchmark.
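The routing idea can be sketched as follows: tools and agents share one embedding index, queries are matched at the tool level, and a metadata link resolves each hit to its parent agent. The `embed` placeholder, the example records, and the index layout below are invented for illustration and do not reflect the paper's implementation.

```python
# Illustrative sketch of tool-level retrieval that resolves to a parent agent.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: hash-seeded pseudo-embedding so the sketch runs end to end;
    # swap in any real sentence-embedding model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

# Both tools and their parent agents live in one vector index.
index = [
    {"kind": "tool",  "name": "get_weather", "agent": "weather_agent",
     "text": "Return the current weather for a city."},
    {"kind": "tool",  "name": "book_flight", "agent": "travel_agent",
     "text": "Search and book airline tickets."},
    {"kind": "agent", "name": "travel_agent", "agent": "travel_agent",
     "text": "Agent that plans trips, books flights and hotels."},
]
for item in index:
    item["vec"] = embed(item["text"])

def route(query: str, top_k: int = 2) -> list[str]:
    qv = embed(query)
    scored = sorted(index, key=lambda it: float(qv @ it["vec"]), reverse=True)
    # Metadata traversal: a tool hit is resolved to its parent agent.
    return [it["agent"] for it in scored[:top_k]]

print(route("I need a plane ticket to Berlin"))
```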
[94] Complex QA and language models hybrid architectures, Survey
Xavier Daull, Patrice Bellot, Emmanuel Bruno, Vincent Martin, Elisabeth Murisasco
Main category: cs.CL
TL;DR: This paper reviews hybrid architectures and strategies for enabling large language models to handle complex question-answering tasks that require specialized knowledge, reasoning, and multi-step resolution.
Details
Motivation: While LLM-based chatbots show potential for common problems, they face limitations when addressing complex questions requiring domain expertise, decomposition, deep reasoning, data protection, and explainability.
Method: The review covers: (1) necessary skills and LLM limitations for complex QA, (2) datasets and evaluation metrics, (3) solution families including training/reinforcement, hybridization, prompting, and agentic architectures with extended reasoning.
Result: The paper systematically analyzes the state-of-the-art approaches for overcoming LLM limitations in complex question-answering scenarios.
Conclusion: Hybrid architectures combining multiple strategies are essential for enabling LLMs to effectively handle complex questions that require specialized knowledge, reasoning capabilities, and human-in-the-loop processes.
Abstract: This paper reviews the state-of-the-art of large language models (LLM) architectures and strategies for “complex” question-answering with a focus on hybrid architectures. LLM based chatbot services have allowed anyone to grasp the potential of LLM to solve many common problems, but soon discovered their limitations for complex questions. Addressing more specific, complex questions (e.g., “What is the best mix of power-generation methods to reduce climate change ?”) often requires specialized architectures, domain knowledge, new skills, decomposition and multi-step resolution, deep reasoning, sensitive data protection, explainability, and human-in-the-loop processes. Therefore, we review: (1) necessary skills and tasks for handling complex questions and common LLM limits to overcome; (2) dataset, cost functions and evaluation metrics for measuring and improving (e.g. accuracy, explainability, fairness, robustness, groundedness, faithfulness, toxicity…); (3) family of solutions to overcome LLM limitations by (a) training and reinforcement (b) hybridization, (c) prompting, (d) agentic-architectures (agents, tools) and extended reasoning.
[95] MaiBaam Annotation Guidelines
Verena Blaschke, Barbara Kovačić, Siyao Peng, Barbara Plank
Main category: cs.CL
TL;DR: Annotation guidelines for MaiBaam, a Bavarian corpus with POS tags, dependencies, and German lemmas following Universal Dependencies standards.
Details
Motivation: To create standardized annotation guidelines for Bavarian language data within the Universal Dependencies framework, addressing both general linguistic principles and Bavarian-specific grammatical features.
Method: Developed detailed preprocessing, tokenization, POS tagging, and dependency annotation procedures that build upon existing UD guidelines while incorporating Bavarian-specific grammatical considerations.
Result: Comprehensive annotation guidelines that cover data preprocessing, POS tags, syntactic dependencies, and lemma annotation, with specific adaptations for Bavarian grammar.
Conclusion: Successfully established annotation standards for Bavarian within the UD project, providing a framework that can handle both general linguistic patterns and language-specific features of Bavarian.
Abstract: This document provides the annotation guidelines for MaiBaam, a Bavarian corpus manually annotated with part-of-speech (POS) tags, syntactic dependencies, and German lemmas. MaiBaam belongs to the Universal Dependencies (UD) project, and our annotations elaborate on the general and German UD version 2 guidelines. In this document, we detail how to preprocess and tokenize Bavarian data, provide an overview of the POS tags and dependencies we use, explain annotation decisions that would also apply to closely related languages like German, and lastly we introduce and motivate decisions that are specific to Bavarian grammar.
[96] CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists
Yukyung Lee, Joonghoon Kim, Jaehee Kim, Hyowon Cho, Jaewook Kang, Pilsung Kang, Najoung Kim
Main category: cs.CL
TL;DR: CheckEval is a checklist-based evaluation framework that improves rating reliability for LLM-as-a-Judge approaches by using decomposed binary questions instead of Likert scales, reducing rating inconsistencies and improving agreement across evaluator models.
Details
Motivation: Existing LLM-as-a-Judge approaches suffer from rating inconsistencies with low agreement and high variance across different evaluator models due to subjective evaluation criteria and Likert scale scoring.
Method: Introduces CheckEval, a checklist-based evaluation framework that replaces traditional Likert-scale scoring with decomposed binary questions to improve rating reliability.
Result: CheckEval strongly correlates with human judgments, improves average agreement across evaluator models by 0.45, reduces score variance, and provides more interpretable scores through traceable binary decisions.
Conclusion: CheckEval effectively addresses rating inconsistencies in LLM evaluation by providing a more reliable, interpretable framework through decomposed binary questions.
Abstract: Existing LLM-as-a-Judge approaches for evaluating text generation suffer from rating inconsistencies, with low agreement and high rating variance across different evaluator models. We attribute this to subjective evaluation criteria combined with Likert scale scoring in existing protocols. To address this issue, we introduce CheckEval, a checklist-based evaluation framework that improves rating reliability via decomposed binary questions. Through experiments with 12 evaluator models across multiple datasets, we first demonstrate that CheckEval strongly correlates with human judgments. More importantly, CheckEval dramatically improves the average agreement across evaluator models by 0.45 and reduces the score variance. CheckEval scores furthermore have the benefit of being more interpretable because it decomposes evaluation criteria into traceable binary decisions, allowing analyses of specific attributes driving quality judgments.
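A toy sketch of checklist-based scoring: each quality criterion is decomposed into yes/no questions, and the final score is the fraction answered affirmatively, so every point in the score traces back to a binary decision. The `ask_judge` stub and the example questions are placeholders, not items from the released CheckEval checklists.

```python
# Checklist-style scoring: decomposed binary questions aggregated into a score.
def ask_judge(question: str, candidate: str) -> bool:
    """Placeholder for a yes/no LLM judge call on one checklist item."""
    return len(candidate) > 0  # trivially "yes" so the sketch runs; swap in a real judge

CHECKLIST = [
    "Does the summary preserve the key facts and numbers of the source?",
    "Is the summary free of claims unsupported by the source?",
    "Is the summary written in fluent, grammatical English?",
]

def checkeval_score(candidate: str) -> float:
    answers = [ask_judge(q, candidate) for q in CHECKLIST]
    return sum(answers) / len(answers)   # traceable binary decisions -> score

print(checkeval_score("Revenue grew 12% year over year, the report says."))
```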
[97] Can Large Language Models Analyze Graphs like Professionals? A Benchmark, Datasets and Models
Xin Li, Weize Chen, Qizhi Chu, Haopeng Li, Zhaojun Sun, Ran Li, Chen Qian, Yiwei Wei, Zhiyuan Liu, Chuan Shi, Maosong Sun, Cheng Yang
Main category: cs.CL
TL;DR: ProGraph is a benchmark for evaluating LLMs’ ability to analyze graphs through programming rather than direct reasoning, revealing current models’ limitations and proposing LLM4Graph datasets to improve performance.
Details
Motivation: Current LLM benchmarks for graph analysis are limited to small graphs with few nodes, while human experts use programming libraries to handle graphs of various scales. The paper investigates whether LLMs can analyze graphs like professionals.
Method: Created ProGraph benchmark with 3 categories of graph tasks requiring programming solutions. Proposed LLM4Graph datasets containing crawled documents and auto-generated codes from 6 popular graph libraries. Enhanced LLMs through document retrieval for closed-source models and fine-tuning open-source models on the generated codes.
Result: Current LLMs performed poorly on ProGraph, with the best model achieving only 36% accuracy. The proposed LLM4Graph approach showed significant improvements - 11-32% absolute accuracy gains across different LLMs.
Conclusion: LLMs’ capabilities in handling structured data like graphs are still under-explored. The LLM4Graph approach effectively enhances LLMs’ graph analysis proficiency, demonstrating the value of programming-based approaches over direct reasoning for complex graph tasks.
Abstract: The need to analyze graphs is ubiquitous across various fields, from social networks to biological research and recommendation systems. Therefore, enabling the ability of large language models (LLMs) to process graphs is an important step toward more advanced general intelligence. However, current LLM benchmarks on graph analysis require models to directly reason over the prompts describing graph topology, and are thus limited to small graphs with only a few dozens of nodes. In contrast, human experts typically write programs based on popular libraries for task solving, and can thus handle graphs with different scales. To this end, a question naturally arises: can LLMs analyze graphs like professionals? In this paper, we introduce ProGraph, a manually crafted benchmark containing 3 categories of graph tasks. The benchmark expects solutions based on programming instead of directly reasoning over raw inputs. Our findings reveal that the performance of current LLMs is unsatisfactory, with the best model achieving only 36% accuracy. To bridge this gap, we propose LLM4Graph datasets, which include crawled documents and auto-generated codes based on 6 widely used graph libraries. By augmenting closed-source LLMs with document retrieval and fine-tuning open-source ones on the codes, we show 11-32% absolute improvements in their accuracies. Our results underscore that the capabilities of LLMs in handling structured data are still under-explored, and show the effectiveness of LLM4Graph in enhancing LLMs’ proficiency of graph analysis. The benchmark, datasets and enhanced open-source models are available at https://github.com/BUPT-GAMMA/ProGraph.
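The programming-based setting the benchmark expects can be illustrated with a short NetworkX snippet: rather than reasoning over an edge list in the prompt, the model is supposed to emit library code like the following (the toy graph and tasks are invented, not ProGraph items).

```python
# Library-based graph analysis instead of token-by-token reasoning over the prompt.
import networkx as nx

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]
G = nx.Graph(edges)

# Tasks like shortest paths or centrality scale far beyond prompt-sized graphs
# when delegated to a library.
print(nx.shortest_path(G, source=0, target=2))
print(nx.degree_centrality(G))
```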
[98] IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?
Akhilesh Aravapalli, Mounika Marreddy, Radhika Mamidi, Manish Gupta, Subba Reddy Oota
Main category: cs.CL
TL;DR: This paper investigates the encoding capability and robustness of multilingual Transformer models for 6 Indic languages across 8 linguistic properties and 13 perturbations, revealing that universal models show better robustness while Indic-specific models capture linguistic properties better.
Details
Motivation: Previous studies on Transformer models' linguistic encoding and robustness have mainly focused on BERT and English, leaving a gap in understanding how these models perform for Indic languages.
Method: Created IndicSentEval benchmark dataset (~47K sentences) and conducted probing analysis on 9 multilingual Transformer models (7 universal, 2 Indic-specific) across 8 linguistic properties and 13 perturbations in 6 Indic languages.
Result: Multilingual models show consistent encoding for English but mixed results for Indic languages. Indic-specific models capture linguistic properties better, while universal models exhibit better robustness, especially under perturbations like dropping nouns/verbs.
Conclusion: The study provides insights into strengths and weaknesses of multilingual Transformer models for Indic languages, highlighting the trade-off between linguistic encoding capability and robustness across different model types.
Abstract: Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately $\sim$47K sentences. Surprisingly, our probing analysis of surface, syntactic, and semantic properties reveals that while almost all multilingual models demonstrate consistent encoding performance for English, they show mixed results for Indic languages. As expected, Indic-specific multilingual models capture linguistic properties in Indic languages better than universal models. Intriguingly, universal models broadly exhibit better robustness compared to Indic-specific models, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns. Overall, this study provides valuable insights into probing and perturbation-specific strengths and weaknesses of popular multilingual Transformer-based models for different Indic languages. We make our code and dataset publicly available [https://github.com/aforakhilesh/IndicBertology].
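Probing analyses of this kind typically freeze the encoder and fit a small classifier on its sentence representations to predict one linguistic property. Below is a generic scikit-learn sketch with mock features standing in for encoder outputs; it illustrates the standard probing setup, not the paper's exact configuration.

```python
# Generic probing sketch: frozen-encoder representations -> small property classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Mock "frozen encoder" outputs: 500 sentences x 768-dim representations.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 768))
y = (X[:, 0] > 0).astype(int)        # stand-in for a property label (e.g. a sentence-length bin)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probing accuracy:", probe.score(X_te, y_te))
```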
[99] Exploring Large Language Models for Detecting Mental Disorders
Gleb Kuzmin, Petr Strepetov, Maksim Stankevich, Natalia Chudova, Artem Shelmanov, Ivan Smirnov
Main category: cs.CL
TL;DR: LLMs outperform traditional methods for depression/anxiety detection, especially on noisy/small datasets, but psycholinguistic features and encoder models can match LLM performance when trained on clinically confirmed cases.
Details
Motivation: To compare the effectiveness of traditional ML methods, encoder-based models, and LLMs for detecting depression and anxiety across different Russian-language datasets with varying formats and pathology definition methods.
Method: Tested AutoML models with linguistic features, various BERT-based encoder Transformers, and state-of-the-art LLMs on five Russian-language datasets with different text formats and pathology classification approaches.
Result: LLMs outperformed traditional methods, particularly on noisy and small datasets with varying text lengths and genres. Psycholinguistic features and encoder models achieved comparable performance to LLMs when trained on clinically confirmed depression cases.
Conclusion: LLMs are superior for general depression/anxiety detection, especially in challenging data conditions, but traditional methods remain effective for targeted clinical applications with confirmed cases.
Abstract: This paper compares the effectiveness of traditional machine learning methods, encoder-based models, and large language models (LLMs) on the task of detecting depression and anxiety. Five Russian-language datasets were considered, each differing in format and in the method used to define the target pathology class. We tested AutoML models based on linguistic features, several variations of encoder-based Transformers such as BERT, and state-of-the-art LLMs as pathology classification models. The results demonstrated that LLMs outperform traditional methods, particularly on noisy and small datasets where training examples vary significantly in text length and genre. However, psycholinguistic features and encoder-based models can achieve performance comparable to language models when trained on texts from individuals with clinically confirmed depression, highlighting their potential effectiveness in targeted clinical applications.
[100] A Comprehensive Evaluation of Cognitive Biases in LLMs
Simon Malberg, Roman Poletukhin, Carolin M. Schuster, Georg Groh
Main category: cs.CL
TL;DR: Large-scale evaluation of 30 cognitive biases in 20 state-of-the-art LLMs reveals presence of all tested biases in at least some models, with a novel test framework and 30,000-test benchmark dataset.
Details
Motivation: To systematically evaluate the presence of cognitive biases in modern large language models across various decision-making scenarios, building on previous findings about biases in LLMs.
Method: Developed a novel general-purpose test framework for reliable large-scale generation of tests, created a benchmark dataset with 30,000 tests for detecting cognitive biases, and evaluated 20 state-of-the-art LLMs.
Result: Found evidence of all 30 tested cognitive biases in at least some of the 20 evaluated LLMs, confirming and broadening previous findings about bias presence in LLMs.
Conclusion: Cognitive biases are prevalent across state-of-the-art LLMs, with the published framework enabling future research on bias detection and mitigation in language models.
Abstract: We present a large-scale evaluation of 30 cognitive biases in 20 state-of-the-art large language models (LLMs) under various decision-making scenarios. Our contributions include a novel general-purpose test framework for reliable and large-scale generation of tests for LLMs, a benchmark dataset with 30,000 tests for detecting cognitive biases in LLMs, and a comprehensive assessment of the biases found in the 20 evaluated LLMs. Our work confirms and broadens previous findings suggesting the presence of cognitive biases in LLMs by reporting evidence of all 30 tested biases in at least some of the 20 LLMs. We publish our framework code to encourage future research on biases in LLMs: https://github.com/simonmalberg/cognitive-biases-in-llms
[101] Incivility and Rigidity: Evaluating the Risks of Fine-Tuning LLMs for Political Argumentation
Svetlana Churina, Kokil Jaidka
Main category: cs.CL
TL;DR: Fine-tuning GPT-3.5 Turbo on different political discourse datasets shows that training data composition significantly affects argument quality - Reddit data produces safer but rigid arguments, while cross-platform training increases adversarial tone.
Details
Motivation: Incivility on social media platforms complicates AI development for productive political argumentation, requiring understanding how training data and prompting affect argument quality.
Method: Fine-tuned GPT-3.5 Turbo on two datasets: high-incivility Twitter replies to US Congress and low-incivility posts from Reddit’s r/ChangeMyView, evaluating rhetorical framing and deliberative quality.
Result: Reddit-finetuned models generate safer but rhetorically rigid arguments; cross-platform fine-tuning amplifies adversarial tone and toxicity; prompt-based steering reduces overt toxicity but cannot fully offset noisy training data.
Conclusion: Training data composition significantly influences argument quality, and introduced rhetorical evaluation rubric provides guidelines for authoring, moderation, and deliberation-support systems.
Abstract: Incivility on platforms such as Twitter (now X) and Reddit complicates the development of AI systems that can support productive, rhetorically sound political argumentation. We present experiments with GPT-3.5 Turbo fine-tuned on two contrasting datasets of political discourse: high-incivility Twitter replies to U.S. Congress and low-incivility posts from Reddit’s r/ChangeMyView. Our evaluation examines how data composition and prompting strategies affect the rhetorical framing and deliberative quality of model-generated arguments. Results show that Reddit-finetuned models generate safer but rhetorically rigid arguments, while cross-platform fine-tuning amplifies adversarial tone and toxicity. Prompt-based steering reduces overt toxicity (e.g., personal attacks) but cannot fully offset the influence of noisy training data. We introduce a rhetorical evaluation rubric - covering justification, reciprocity, alignment, and authority - and provide implementation guidelines for authoring, moderation, and deliberation-support systems.
[102] Training Large Language Models to Reason in a Continuous Latent Space
Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, Yuandong Tian
Main category: cs.CL
TL;DR: Coconut introduces continuous thought representations using LLM hidden states for reasoning, enabling breadth-first search instead of deterministic chain-of-thought paths.
Details
Motivation: Language space may not be optimal for reasoning as many word tokens focus on textual coherence rather than reasoning, and some critical tokens require complex planning that challenges LLMs.
Method: Uses last hidden state of LLM as continuous thought representation, feeds it back as next input embedding directly in continuous space, enabling breadth-first search by encoding multiple alternative next steps.
Result: Outperforms chain-of-thought on logical reasoning tasks requiring substantial search during planning, achieves better trade-off between accuracy and efficiency.
Conclusion: Continuous thought paradigm enables advanced reasoning patterns beyond language space, allowing more flexible and efficient reasoning through latent representations.
Abstract: Large language models (LLMs) are typically constrained to reason in the language space, where they express the reasoning process through a chain-of-thought (CoT) to solve complex problems. However, the language space may not always be optimal for reasoning. Most word tokens primarily ensure textual coherence and are not essential for reasoning, while some critical tokens require complex planning and pose challenges to LLMs. To explore the potential of reasoning beyond language, we introduce a new paradigm called Coconut (Chain of Continuous Thought). Coconut utilizes the last hidden state of the LLM as a representation of the reasoning state, termed “continuous thought.” Instead of decoding this state into words, we feed it back to the model as the next input embedding directly in the continuous space. This latent reasoning paradigm enables an advanced reasoning pattern, where continuous thoughts can encode multiple alternative next steps, allowing the model to perform a breadth-first search (BFS) rather than committing prematurely to a single deterministic path as in CoT. Coconut outperforms CoT on logical reasoning tasks that require substantial search during planning and achieves a better trade-off between accuracy and efficiency.
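The feed-back mechanism can be sketched in a few lines of Hugging Face code: instead of decoding a token, the last hidden state is appended as the next input embedding for a few "continuous thoughts" before switching back to token decoding. GPT-2 is used here only because its hidden size matches its embedding size; Coconut's training procedure and special markers are omitted, so this is a mechanism sketch, not the authors' implementation.

```python
# Rough sketch of latent (continuous-thought) decoding with a GPT-2-style model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Question: if x + 2 = 5, what is x? Thought:"
ids = tok(prompt, return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids)          # (1, seq, hidden)

with torch.no_grad():
    for _ in range(3):  # a few continuous thoughts instead of decoded tokens
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1:, :]    # (1, 1, hidden)
        embeds = torch.cat([embeds, last_hidden], dim=1)  # feed it back as the next "token"
    # After latent reasoning, switch back to ordinary token decoding.
    logits = model(inputs_embeds=embeds).logits[:, -1, :]
    print(tok.decode(logits.argmax(-1)))
```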
[103] AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding
Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar
Main category: cs.CL
TL;DR: AlignVLM is a novel vision-language alignment method that maps visual features to weighted averages of LLM text embeddings, leveraging linguistic priors for better cross-modal alignment, especially in document understanding tasks.
Details
Motivation: Existing connectors like MLPs lack inductive bias to constrain visual features within the LLM's linguistic structure, making them data-hungry and prone to cross-modal misalignment.
Method: Proposes AlignVLM which maps visual features to a weighted average of LLM text embeddings, leveraging the linguistic priors encoded by the LLM to ensure visual features are mapped to interpretable regions of the embedding space.
Result: Achieves state-of-the-art performance compared to prior alignment methods, with larger gains on document understanding tasks and under low-resource setups. Shows efficiency and robustness to noise.
Conclusion: AlignVLM effectively bridges the vision-language gap by using LLM’s linguistic priors to guide visual feature mapping, particularly benefiting document understanding and low-resource scenarios.
Abstract: Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), lack inductive bias to constrain visual features within the linguistic structure of the LLM’s embedding space, making them data-hungry and prone to cross-modal misalignment. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where visual and textual modalities are highly correlated. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods, with larger gains on document understanding tasks and under low-resource setups. We provide further analysis demonstrating its efficiency and robustness to noise.
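The core mapping can be sketched as a linear layer that scores the LLM vocabulary followed by a softmax-weighted average of the text embedding table, which keeps every projected visual token inside the convex hull of embeddings the LLM already interprets. The dimensions and single-layer projection below are illustrative assumptions, not the paper's exact connector.

```python
# Sketch of mapping visual features to weighted averages of LLM text embeddings.
import torch
import torch.nn as nn

class AlignConnector(nn.Module):
    def __init__(self, vision_dim: int, vocab_size: int, text_embed: nn.Embedding):
        super().__init__()
        self.to_vocab = nn.Linear(vision_dim, vocab_size)  # scores over the vocabulary
        self.text_embed = text_embed                       # LLM embedding table (frozen in practice)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        weights = self.to_vocab(visual_feats).softmax(dim=-1)  # convex weights per visual token
        # Each visual token becomes a weighted average of text embeddings,
        # so it stays inside the region the LLM already understands.
        return weights @ self.text_embed.weight

vocab, d_txt, d_vis = 32000, 512, 1024
connector = AlignConnector(d_vis, vocab, nn.Embedding(vocab, d_txt))
patches = torch.randn(1, 196, d_vis)        # e.g. ViT patch features
aligned = connector(patches)                # (1, 196, d_txt)
print(aligned.shape)
```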
[104] SafeDialBench: A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks
Hongye Cao, Yanming Wang, Sijia Jing, Ziyue Peng, Zhixin Bai, Zhe Cao, Meng Fang, Fan Feng, Boyan Wang, Jiaheng Liu, Tianpei Yang, Jing Huo, Yang Gao, Fanyu Meng, Xi Yang, Chao Deng, Junlan Feng
Main category: cs.CL
TL;DR: SafeDialBench is a fine-grained benchmark for evaluating LLM safety across various jailbreak attacks in multi-turn dialogues, featuring a hierarchical safety taxonomy and innovative assessment framework.
Details
Motivation: Current safety benchmarks focus on single-turn dialogues or single jailbreak methods, and lack detailed evaluation of LLMs' capability to identify and handle unsafe information.
Method: Created a two-tier hierarchical safety taxonomy with 6 safety dimensions, generated over 4000 multi-turn dialogues in Chinese and English across 22 scenarios, using 7 jailbreak attack strategies including reference attack and purpose reverse.
Result: Evaluation of 17 LLMs showed Yi-34B-Chat and GLM4-9B-Chat have superior safety performance, while Llama3.1-8B-Instruct and o3-mini exhibit safety vulnerabilities.
Conclusion: The proposed SafeDialBench provides comprehensive safety assessment for LLMs in multi-turn dialogues, revealing significant performance variations across different models.
Abstract: With the rapid advancement of Large Language Models (LLMs), the safety of LLMs has been a critical concern requiring precise assessment. Current benchmarks primarily concentrate on single-turn dialogues or a single jailbreak attack method to assess the safety. Additionally, these benchmarks have not taken into account the LLM’s capability of identifying and handling unsafe information in detail. To address these issues, we propose a fine-grained benchmark SafeDialBench for evaluating the safety of LLMs across various jailbreak attacks in multi-turn dialogues. Specifically, we design a two-tier hierarchical safety taxonomy that considers 6 safety dimensions and generates more than 4000 multi-turn dialogues in both Chinese and English under 22 dialogue scenarios. We employ 7 jailbreak attack strategies, such as reference attack and purpose reverse, to enhance the dataset quality for dialogue generation. Notably, we construct an innovative assessment framework of LLMs, measuring capabilities in detecting, and handling unsafe information and maintaining consistency when facing jailbreak attacks. Experimental results across 17 LLMs reveal that Yi-34B-Chat and GLM4-9B-Chat demonstrate superior safety performance, while Llama3.1-8B-Instruct and o3-mini exhibit safety vulnerabilities.
[105] Eye Tracking Based Cognitive Evaluation of Automatic Readability Assessment Measures
Keren Gruteke Klein, Shachar Frenkel, Omer Shubi, Yevgeni Berzak
Main category: cs.CL
TL;DR: Existing readability scoring methods are poor predictors of real-time reading ease measured through eye tracking, and are often outperformed by simple psycholinguistic word properties.
Details
Motivation: To evaluate readability scoring methods using real-time reading ease (eye tracking data) rather than traditional offline measures like comprehension tests and readability ratings.
Method: Introduced an evaluation framework that quantifies readability methods’ ability to account for reading ease while controlling for content variation. Applied this to traditional formulas, ML systems, LLMs, and commercial educational systems.
Result: All existing readability scoring methods were poor predictors of reading ease across native/non-native speakers, reading regimes, and text lengths. Simple psycholinguistic word properties often outperformed complex methods.
Conclusion: Highlights fundamental limitations of current readability approaches and the need for new cognitively-driven scoring methods that better account for reading ease, drawing from psycholinguistics.
Abstract: Methods for scoring text readability have been studied for over a century, and are widely used in research and in user-facing applications in many domains. Thus far, the development and evaluation of such methods have primarily relied on two types of offline behavioral data, performance on reading comprehension tests and ratings of text readability levels. In this work, we instead focus on a fundamental and understudied aspect of readability, real-time reading ease, captured with online reading measures using eye tracking. We introduce an evaluation framework for readability scoring methods which quantifies their ability to account for reading ease, while controlling for content variation across texts. Applying this evaluation to prominent traditional readability formulas, modern machine learning systems, frontier Large Language Models and commercial systems used in education, suggests that they are all poor predictors of reading ease in English. This outcome holds across native and non-native speakers, reading regimes, and textual units of different lengths. The evaluation further reveals that existing methods are often outperformed by word properties commonly used in psycholinguistics for prediction of reading times. Our results highlight a fundamental limitation of existing approaches to readability scoring, the utility of psycholinguistics for readability research, and the need for new, cognitively driven readability scoring approaches that can better account for reading ease.
[106] Auto-Search and Refinement: An Automated Framework for Gender Bias Mitigation in Large Language Models
Yue Xu, Chengyan Fu, Li Xiong, Sibei Yang, Wenjie Wang
Main category: cs.CL
TL;DR: FaIRMaker is an automated framework that generates Fairwords through auto-search and refinement to reduce gender bias in LLMs without compromising task performance.
Details
Motivation: Existing methods for mitigating gender bias in LLMs are resource-intensive, not adaptable to closed-source models, and often sacrifice task performance. There's a need for a flexible, model-independent approach.
Method: Uses an auto-search and refinement paradigm to generate Fairwords, which are instructions integrated into input queries to reduce gender bias while maintaining response quality.
Result: FaIRMaker effectively mitigates gender bias while preserving task integrity and works with both API-based and open-source LLMs.
Conclusion: The proposed framework provides an automated, adaptable solution for gender bias mitigation that maintains performance and works across different LLM types.
Abstract: Pre-training large language models (LLMs) on vast text corpora enhances natural language processing capabilities but risks encoding social biases, particularly gender bias. While parameter-modification methods like fine-tuning mitigate bias, they are resource-intensive, unsuitable for closed-source models, and lack adaptability to evolving societal norms. Instruction-based approaches offer flexibility but often compromise task performance. To address these limitations, we propose FaIRMaker, an automated and model-independent framework that employs an auto-search and refinement paradigm to adaptively generate Fairwords, which act as instructions integrated into input queries to reduce gender bias and enhance response quality. Extensive experiments demonstrate that FaIRMaker automatically searches for and dynamically refines Fairwords, effectively mitigating gender bias while preserving task integrity and ensuring compatibility with both API-based and open-source LLMs.
[107] Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps
Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasović, Yonatan Belinkov
Main category: cs.CL
TL;DR: FUR framework measures faithfulness of chain-of-thought reasoning by unlearning reasoning steps from model parameters and observing prediction changes.
Details
Motivation: To determine if reasoning verbalized in chain-of-thought (CoT) actually reflects language models' parametric beliefs, addressing concerns about faithfulness of generated reasoning.
Method: Proposed Faithfulness by Unlearning Reasoning steps (FUR) framework that erases information from reasoning steps in model parameters and measures resulting prediction changes.
Result: FUR frequently precisely changed models’ predictions by unlearning key steps across four LMs and five multi-hop MCQA datasets, indicating parametric faithfulness of CoTs.
Conclusion: CoT reasoning can be parametrically faithful, and unlearning reasoning steps has deeper effects as models generate CoTs supporting different answers post-unlearning.
Abstract: When prompted to think step-by-step, language models (LMs) produce a chain of thought (CoT), a sequence of reasoning steps that the model supposedly used to produce its prediction. Despite much work on CoT prompting, it is unclear if reasoning verbalized in a CoT is faithful to the models’ parametric beliefs. We introduce a framework for measuring parametric faithfulness of generated reasoning, and propose Faithfulness by Unlearning Reasoning steps (FUR), an instance of this framework. FUR erases information contained in reasoning steps from model parameters, and measures faithfulness as the resulting effect on the model’s prediction. Our experiments with four LMs and five multi-hop multi-choice question answering (MCQA) datasets show that FUR is frequently able to precisely change the underlying models’ prediction for a given instance by unlearning key steps, indicating when a CoT is parametrically faithful. Further analysis shows that CoTs generated by models post-unlearning support different answers, hinting at a deeper effect of unlearning.
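Conceptually, the faithfulness test runs like the sketch below: record the answer, apply a few gradient-ascent steps on one reasoning step to make it less likely under the model, and check whether the answer changes. The model choice, the plain gradient-ascent objective, and the hyperparameters are placeholders; FUR's actual unlearning procedure is more involved than this.

```python
# Conceptual sketch of "faithfulness by unlearning" a single reasoning step.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def answer(model, tok, question: str) -> str:
    ids = tok(question, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=5, do_sample=False)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

def unlearn_step(model, tok, step_text: str, lr: float = 5e-5, iters: int = 3):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    ids = tok(step_text, return_tensors="pt").input_ids
    for _ in range(iters):
        loss = model(ids, labels=ids).loss
        (-loss).backward()            # gradient ascent: make the step less likely
        opt.step()
        opt.zero_grad()

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
q = "Q: Which city is the capital of France? A:"
before = answer(model, tok, q)
unlearn_step(model, tok, "Paris is the capital of France.")
after = answer(model, tok, q)
print(before, "->", after)            # a flipped answer suggests the step mattered
```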
[108] Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding
Yiming Wang, Pei Zhang, Siyuan Huang, Baosong Yang, Zhuosheng Zhang, Fei Huang, Rui Wang
Main category: cs.CL
TL;DR: ST-BoN is a decoding method that improves Best-of-N sampling efficiency by using early sampling consistency to identify promising paths and truncate suboptimal ones, reducing GPU memory by 80% and latency by 50% while maintaining performance.
Details
Motivation: Best-of-N sampling faces efficiency challenges: high GPU memory consumption from generating N full samples and overhead from reward models. Current methods don't address both issues simultaneously.
Method: Self-Truncation Best-of-N (ST-BoN) leverages early sampling consistency in model’s internal states to identify the most promising path and truncate suboptimal ones, avoiding full generation of all N samples and eliminating reward models.
Result: ST-BoN reduces dynamic GPU memory usage by over 80% and inference latency by 50%. It achieves same performance as Full-BoN with 70%-80% computational cost savings, and improves accuracy by 3-4 points under same cost.
Conclusion: ST-BoN provides an efficient alternative to traditional Best-of-N sampling by addressing both memory and reward model challenges simultaneously, offering significant cost-performance improvements.
Abstract: Test-time scaling enhances large language model performance by allocating additional compute resources during inference. Best-of-N (BoN) sampling serves as a common sampling-based scaling technique, broadening the search space in parallel to find better solutions from the model distribution. However, its cost-performance trade-off is still underexplored. Two main challenges limit the efficiency of BoN sampling: (1) Generating N full samples consumes substantial GPU memory, reducing inference capacity under limited resources. (2) Reward models add extra memory and latency overhead, and training strong reward models introduces potential training data costs. Although some studies have explored efficiency improvements, none have addressed both challenges at once. To address this gap, we propose Self-Truncation Best-of-N (ST-BoN), a decoding method that avoids fully generating all N samples and eliminates the need for reward models. It leverages early sampling consistency in the model’s internal states to identify the most promising path and truncate suboptimal ones. In terms of cost, ST-BoN reduces dynamic GPU memory usage by over 80% and inference latency by 50%. In terms of cost-performance trade-off, ST-BoN achieves the same performance as Full-BoN while saving computational cost by 70%-80%, and under the same cost, it can improve accuracy by 3-4 points.
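The self-truncation idea reduces to a consistency vote over the early hidden states of the N samples: the path most similar to the others is kept and the rest are truncated. The sketch below mocks hidden states with random vectors and uses mean cosine similarity as the consistency measure, which may differ from the paper's exact criterion.

```python
# Schematic of self-truncation: keep the most mutually consistent early path.
import numpy as np

def pick_most_consistent(early_states: np.ndarray) -> int:
    """early_states: (N, hidden) summary of each sample's early decoding."""
    normed = early_states / np.linalg.norm(early_states, axis=1, keepdims=True)
    sims = normed @ normed.T                 # pairwise cosine similarity
    consistency = sims.mean(axis=1)          # how much each path agrees with the rest
    return int(consistency.argmax())

N, hidden = 8, 256
states = np.random.randn(N, hidden).astype(np.float32)
keep = pick_most_consistent(states)
print(f"continue sample {keep}, truncate the other {N - 1}")
```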
[109] Targeted Distillation for Sentiment Analysis
Yice Zhang, Guangyu Xie, Jingjie Lin, Jianzhu Bao, Qianlong Wang, Xi Zeng, Ruifeng Xu
Main category: cs.CL
TL;DR: The paper proposes a two-stage distillation framework for sentiment analysis that separates knowledge and alignment, and introduces SentiBench benchmark with 12 datasets to evaluate model performance.
Details
Motivation: To build compact and practical sentiment analysis models that maintain strong and generalizable capabilities while being efficient.
Method: Conceptually decouples distillation into knowledge and alignment components, then implements a two-stage distillation framework. Also creates SentiBench benchmark covering diverse sentiment tasks across 12 datasets.
Result: The approach substantially enhances compact model performance across diverse sentiment analysis tasks and shows strong generalization to unseen tasks, demonstrating robust competitiveness against existing small-scale models.
Conclusion: The proposed targeted distillation method effectively builds compact sentiment analysis models with preserved strong capabilities and good generalization, validated through comprehensive benchmarking.
Abstract: This paper explores targeted distillation methods for sentiment analysis, aiming to build compact and practical models that preserve strong and generalizable sentiment analysis capabilities. To this end, we conceptually decouple the distillation target into knowledge and alignment and accordingly propose a two-stage distillation framework. Moreover, we introduce SentiBench, a comprehensive and systematic sentiment analysis benchmark that covers a diverse set of tasks across 12 datasets. We evaluate a wide range of models on this benchmark. Experimental results show that our approach substantially enhances the performance of compact models across diverse sentiment analysis tasks, and the resulting models demonstrate strong generalization to unseen tasks, showcasing robust competitiveness against existing small-scale models.
[110] Medical Hallucinations in Foundation Models and Their Impact on Healthcare
Yubin Kim, Hyewon Jeong, Shan Chen, Shuyue Stella Li, Chanwoo Park, Mingyu Lu, Kumail Alhamoud, Jimin Mun, Cristina Grau, Minseok Jung, Rodrigo Gameiro, Lizhou Fan, Eugene Park, Tristan Lin, Joonsik Yoon, Wonjin Yoon, Maarten Sap, Yulia Tsvetkov, Paul Liang, Xuhai Xu, Xin Liu, Chunjong Park, Hyeonhoon Lee, Hae Won Park, Daniel McDuff, Samir Tulebaev, Cynthia Breazeal
Main category: cs.CL
TL;DR: Medical hallucinations in foundation models stem from autoregressive training prioritizing token likelihood over accuracy. General-purpose models outperformed medical-specialized ones (76.6% vs 51.3% hallucination-free responses), with chain-of-thought prompting significantly reducing errors. Physician surveys confirm real-world harm potential.
Details
Motivation: To understand and quantify medical hallucinations in foundation models, defined as factually incorrect, logically inconsistent, or unsupported outputs that could alter clinical decisions, and evaluate their real-world impact.
Method: Evaluated 11 foundation models (7 general-purpose, 4 medical-specialized) across seven medical hallucination tasks using medical reasoning and biomedical information retrieval. Used chain-of-thought prompting and physician audits to analyze error patterns.
Result: General-purpose models significantly outperformed medical-specialized models (25.2% difference in hallucination-free responses). Chain-of-thought reduced hallucinations in 86.4% of comparisons. 64-72% of residual errors were reasoning failures. 91.8% of clinicians reported encountering hallucinations, with 84.7% considering them harmful.
Conclusion: Medical safety emerges from sophisticated reasoning and broad knowledge integration developed during large-scale pre-training, not from narrow domain optimization. Chain-of-thought reasoning is crucial for reducing hallucinations through self-verification capabilities.
Abstract: Hallucinations in foundation models arise from autoregressive training objectives that prioritize token-likelihood optimization over epistemic accuracy, fostering overconfidence and poorly calibrated uncertainty. We define medical hallucination as any model-generated output that is factually incorrect, logically inconsistent, or unsupported by authoritative clinical evidence in ways that could alter clinical decisions. We evaluated 11 foundation models (7 general-purpose, 4 medical-specialized) across seven medical hallucination tasks spanning medical reasoning and biomedical information retrieval. General-purpose models achieved significantly higher proportions of hallucination-free responses than medical-specialized models (median: 76.6% vs 51.3%, difference = 25.2%, 95% CI: 18.7-31.3%, Mann-Whitney U = 27.0, p = 0.012, rank-biserial r = -0.64). Top-performing models such as Gemini-2.5 Pro exceeded 97% accuracy when augmented with chain-of-thought prompting (base: 87.6%), while medical-specialized models like MedGemma ranged from 28.6-61.9% despite explicit training on medical corpora. Chain-of-thought reasoning significantly reduced hallucinations in 86.4% of tested comparisons after FDR correction (q < 0.05), demonstrating that explicit reasoning traces enable self-verification and error detection. Physician audits confirmed that 64-72% of residual hallucinations stemmed from causal or temporal reasoning failures rather than knowledge gaps. A global survey of clinicians (n = 70) validated real-world impact: 91.8% had encountered medical hallucinations, and 84.7% considered them capable of causing patient harm. The underperformance of medical-specialized models despite domain training indicates that safety emerges from sophisticated reasoning capabilities and broad knowledge integration developed during large-scale pre-training, not from narrow optimization.
[111] XIFBench: Evaluating Large Language Models on Multilingual Instruction Following
Zhenyu Li, Kehai Chen, Yunfei Long, Xuefeng Bai, Yaoyin Zhang, Xuchen Wei, Juntao Li, Min Zhang
Main category: cs.CL
TL;DR: XIFBench is a comprehensive benchmark for evaluating multilingual instruction-following in LLMs, featuring 558 instructions with 0-5 constraints across 5 categories in 6 languages, with methodological innovations for reliable cross-lingual evaluation.
Details
Motivation: Existing evaluations of LLMs in multilingual settings lack systematic investigation and fine-grained constraint analysis across diverse linguistic contexts.
Method: Developed XIFBench with 558 instructions across 5 constraint categories in 6 languages, implementing cultural accessibility annotation, constraint-level translation validation, and requirement-based evaluation using English as semantic anchors.
Result: Extensive experiments quantified performance disparities across resource levels and provided detailed insights into how language resources, constraint categories, instruction complexity, and cultural specificity influence multilingual instruction-following.
Conclusion: XIFBench enables systematic evaluation of multilingual instruction-following capabilities in LLMs, revealing important factors that affect performance across different linguistic contexts.
Abstract: Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. However, their performance in multilingual settings lacks systematic investigation, with existing evaluations lacking fine-grained constraint analysis across diverse linguistic contexts. We introduce XIFBench, a comprehensive constraint-based benchmark for evaluating multilingual instruction-following abilities of LLMs, comprising 558 instructions with 0-5 additional constraints across five categories (Content, Style, Situation, Format, and Numerical) in six languages spanning different resource levels. To support reliable and consistent cross-lingual evaluation, we implement three methodological innovations: cultural accessibility annotation, constraint-level translation validation, and requirement-based evaluation using English requirements as semantic anchors across languages. Extensive experiments with various LLMs not only quantify performance disparities across resource levels but also provide detailed insights into how language resources, constraint categories, instruction complexity, and cultural specificity influence multilingual instruction-following. Our code and data are available at https://github.com/zhenyuli801/XIFBench.
[112] Natural Language Generation
Emiel van Miltenburg, Chenghua Lin
Main category: cs.CL
TL;DR: Overview of Natural Language Generation (NLG) as a field that studies systems verbalizing information through natural language, covering data-to-text, text-to-text, and image-to-text applications, and discussing its relationship with other NLP subfields.
Details
Motivation: To provide a comprehensive overview of the NLG field, clarify its scope and boundaries with related subfields like Machine Translation and Dialog Systems, and discuss how LLMs have influenced convergence in NLP methodologies.
Method: Conceptual analysis and field overview approach, examining different NLG applications (data-to-text, summarization, image captioning) and discussing relationships with other NLP subfields.
Result: Clear definition of NLG scope, distinction from Machine Translation (no content selection) and Dialog Systems (NLG as component), and observation of methodological convergence across NLP subfields due to LLMs.
Conclusion: NLG encompasses diverse applications for verbalizing information through natural language, with evolving boundaries and methodological convergence with other NLP subfields driven by Large Language Models.
Abstract: This article provides a brief overview of the field of Natural Language Generation. The term Natural Language Generation (NLG), in its broadest definition, refers to the study of systems that verbalize some form of information through natural language. That information could be stored in a large database or knowledge graph (in data-to-text applications), but NLG researchers may also study summarisation (text-to-text) or image captioning (image-to-text), for example. As a subfield of Natural Language Processing, NLG is closely related to other sub-disciplines such as Machine Translation (MT) and Dialog Systems. Some NLG researchers exclude MT from their definition of the field, since there is no content selection involved where the system has to determine what to say. Conversely, dialog systems do not typically fall under the header of Natural Language Generation since NLG is just one component of dialog systems (the others being Natural Language Understanding and Dialog Management). However, with the rise of Large Language Models (LLMs), different subfields of Natural Language Processing have converged on similar methodologies for the production of natural language and the evaluation of automatically generated text.
[113] JudgeLRM: Large Reasoning Models as a Judge
Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, Bingsheng He
Main category: cs.CL
TL;DR: JudgeLRM is a family of judgment-oriented LLMs trained with reinforcement learning to improve reasoning capabilities for evaluation tasks, outperforming SFT approaches and even state-of-the-art reasoning models.
Details
Motivation: Existing supervised fine-tuning approaches fall short in domains requiring complex reasoning for judgment tasks, which involve evidence verification, error identification, and decision justification.
Method: Used reinforcement learning with judge-wise, outcome-driven rewards to activate reasoning capabilities, creating JudgeLRM models of various sizes (3B to 14B parameters).
Result: JudgeLRM consistently outperformed SFT-tuned baselines and other RL/SFT variants. JudgeLRM-3B/4B exceeded GPT-4, while JudgeLRM-7B/8B/14B outperformed DeepSeek-R1 by over 2% in F1 score, with particularly strong gains on reasoning-heavy tasks.
Conclusion: Reinforcement learning is valuable for unlocking reasoning-aligned LLM judges, addressing the limitations of supervised fine-tuning in reasoning-intensive evaluation scenarios.
Abstract: Large Language Models (LLMs) are increasingly adopted as evaluators, offering a scalable alternative to human annotation. However, existing supervised fine-tuning (SFT) approaches often fall short in domains that demand complex reasoning. Judgment is inherently reasoning-intensive: beyond surface-level scoring, it requires verifying evidence, identifying errors, and justifying decisions. Through the analysis of evaluation tasks, we find a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, revealing the limits of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs, trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards to activate reasoning capabilities. JudgeLRM consistently outperform SFT-tuned baselines in the same size, as well as other RL and SFT variants, and even surpass state-of-the-art reasoning models: notably, JudgeLRM-3B/4B exceeds GPT-4, while JudgeLRM-7B/8B/14B outperforms DeepSeek-R1 by over 2% in F1 score, with particularly strong gains on reasoning-heavy tasks. Our findings underscore the value of RL in unlocking reasoning-aligned LLM judges.
[114] Do LLM Evaluators Prefer Themselves for a Reason?
Wei-Lin Chen, Zhepei Wei, Xinyu Zhu, Shi Feng, Yu Meng
Main category: cs.CL
TL;DR: LLMs show self-preference bias in evaluation, but this study distinguishes harmful vs legitimate self-preference using objective benchmarks. Stronger models prefer themselves mostly legitimately due to superior performance, but also show more harmful self-preference when wrong.
Details
Motivation: To determine whether LLM self-preference is harmful or simply reflects genuinely higher-quality outputs, addressing limitations of previous studies that relied on subjective tasks without objective ground truth.
Method: Used verifiable benchmarks (mathematical reasoning, factual knowledge, code generation) with objective ground-truth assessment. Conducted large-scale experiments across diverse model families under controlled evaluation conditions.
Result: 1) Stronger models exhibit greater self-preference but mostly legitimately due to superior performance. 2) Harmful self-preference persists when evaluator models err, with stronger models showing more pronounced harmful self-preference when wrong. 3) Chain-of-Thought strategies effectively reduce harmful self-preference.
Conclusion: Provides nuanced understanding of LLM-based evaluation: self-preference is mostly legitimate for stronger models but harmful when models err, and inference-time scaling can mitigate harmful bias, offering practical insights for improving evaluation reliability.
Abstract: Large language models (LLMs) are increasingly used as automatic evaluators in applications like benchmarking, reward modeling, and self-refinement. Prior work highlights a potential self-preference bias where LLMs favor their own generated responses, a tendency often intensifying with model size and capability. This raises a critical question: Is self-preference harmful, or does it simply reflect the genuinely higher-quality outputs of stronger models? Answering this has been difficult as previous studies relied primarily on subjective tasks. These tasks lack an objective ground truth, meaning that either preference can be reasonably justified. To address this ambiguity, we investigate self-preference using verifiable benchmarks (mathematical reasoning, factual knowledge, code generation) that allow objective ground-truth assessment. This enables us to distinguish harmful self-preference (favoring objectively worse responses) from legitimate self-preference (favoring genuinely superior ones). We conduct large-scale experiments under controlled evaluation conditions across diverse model families (e.g., Llama, Qwen, Gemma, Mistral, Phi, GPT, DeepSeek). Our findings reveal three key insights: (1) While stronger models exhibit greater self-preference, much of this preference aligns with objectively superior performance, indicating stronger models prefer themselves mostly legitimately. (2) Harmful self-preference persists when evaluator models err as generators, and stronger models display more pronounced harmful self-preference when they do err. This suggests stronger models struggle more to recognize when they are wrong. (3) Inference-time scaling strategies, such as generating a long Chain-of-Thought before evaluation, effectively reduce harmful self-preference. These results provide a more nuanced understanding of LLM-based evaluation and practical insights for improving its reliability.
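On verifiable tasks, the harmful/legitimate split is a simple function of ground truth, as in the sketch below (field names are illustrative): preferring one's own answer is legitimate when it is objectively correct and the rival's is not, harmful in the reverse case, and indeterminate when both or neither are correct.

```python
# Classifying a judge's self-preference against objective ground truth.
def classify_self_preference(own_correct: bool, other_correct: bool,
                             prefers_own: bool) -> str:
    if not prefers_own:
        return "no self-preference"
    if own_correct and not other_correct:
        return "legitimate"        # favoring the genuinely better response
    if other_correct and not own_correct:
        return "harmful"           # favoring an objectively worse response
    return "indeterminate"         # both right or both wrong: ground truth cannot separate them

print(classify_self_preference(own_correct=False, other_correct=True, prefers_own=True))
```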
[115] Self-Adaptive Cognitive Debiasing for Large Language Models in Decision-Making
Yougang Lyu, Shijie Ren, Yue Feng, Zihan Wang, Zhumin Chen, Zhaochun Ren, Maarten de Rijke
Main category: cs.CL
TL;DR: SACD is a cognitive debiasing approach that iteratively refines prompts through bias determination, analysis, and debiasing steps to mitigate multiple cognitive biases in LLMs for decision-making tasks.
Details
Motivation: Existing cognitive bias mitigation strategies only handle single biases, limiting effectiveness in real-world scenarios where multiple cognitive biases often co-occur in LLM decision-making.Method: Self-adaptive cognitive debiasing (SACD) with three sequential steps: bias determination, bias analysis, and cognitive debiasing, applied iteratively to refine prompts and mitigate biases.
Result: SACD achieves the lowest average bias scores compared to advanced prompt engineering methods and existing debiasing techniques in both single-bias and multi-bias settings across finance, healthcare, and legal domains.
Conclusion: SACD effectively enhances LLM reliability by iteratively mitigating multiple cognitive biases, outperforming existing methods in challenging multi-bias scenarios.
Abstract: Large language models (LLMs) have shown potential in supporting decision-making applications, particularly as personal assistants in the financial, healthcare, and legal domains. While prompt engineering strategies have enhanced the capabilities of LLMs in decision-making, cognitive biases inherent to LLMs present significant challenges. Cognitive biases are systematic patterns of deviation from norms or rationality in decision-making that can lead to the production of inaccurate outputs. Existing cognitive bias mitigation strategies assume that input prompts only contain one type of cognitive bias, limiting their effectiveness in more challenging scenarios involving multiple cognitive biases. To fill this gap, we propose a cognitive debiasing approach, self-adaptive cognitive debiasing (SACD), that enhances the reliability of LLMs by iteratively refining prompts. Our method follows three sequential steps - bias determination, bias analysis, and cognitive debiasing - to iteratively mitigate potential cognitive biases in prompts. We evaluate SACD on finance, healthcare, and legal decision-making tasks using both open-weight and closed-weight LLMs. Compared to advanced prompt engineering methods and existing cognitive debiasing techniques, SACD achieves the lowest average bias scores in both single-bias and multi-bias settings.
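A minimal sketch of the iterative determine-analyze-debias loop described above, assuming a generic `llm` callable that maps a prompt to text; the step prompts are paraphrases of the three SACD stages, not the paper's actual templates.

```python
from typing import Callable

def sacd_refine(prompt: str, llm: Callable[[str], str], max_iters: int = 3) -> str:
    """Iteratively determine, analyze, and remove cognitive biases in a prompt."""
    for _ in range(max_iters):
        # Step 1: bias determination - is any cognitive bias present at all?
        verdict = llm("Does the following prompt contain a cognitive bias? "
                      f"Answer 'yes' or 'no'.\n\nPrompt: {prompt}")
        if verdict.strip().lower().startswith("no"):
            break                                  # nothing left to debias
        # Step 2: bias analysis - name and explain the bias that was found.
        analysis = llm(f"Identify and explain the cognitive bias in this prompt:\n{prompt}")
        # Step 3: cognitive debiasing - rewrite the prompt to remove it.
        prompt = llm("Rewrite the prompt below so the identified bias is removed, "
                     f"keeping the task unchanged.\nBias analysis: {analysis}\nPrompt: {prompt}")
    return prompt
```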
[116] The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning
Tianshi Zheng, Yixiang Chen, Chengxi Li, Chunyang Li, Qing Zong, Haochen Shi, Baixuan Xu, Yangqiu Song, Ginny Y. Wong, Simon See
Main category: cs.CL
TL;DR: CoT prompting underperforms direct answering in pattern-based ICL tasks across 16 LLMs, revealing a hybrid explicit-implicit reasoning mechanism where flawed explicit reasoning undermines performance despite partial compensation by implicit reasoning.
Details
Motivation: To challenge the prevailing assumption that Chain-of-Thought (CoT) universally enhances reasoning in LLMs, particularly in fundamental pattern-based in-context learning tasks where its effectiveness was unknown.Method: Extensive experiments with 16 state-of-the-art LLMs across nine diverse pattern-based ICL datasets, systematically testing various hypotheses about CoT’s underperformance through designed validation experiments.
Result: CoT and its reasoning variants consistently underperform direct answering across all model scales and benchmark complexities. Analysis reveals a hybrid explicit-implicit reasoning mechanism where explicit reasoning fails due to LLMs’ inability to infer patterns from demonstrations, while implicit reasoning partially compensates but is disrupted by increased contextual distance.
Conclusion: CoT’s universal efficacy is challenged, revealing fundamental limitations in pattern-based ICL. Findings provide novel insights into CoT’s mechanisms and guide future research toward more nuanced reasoning methodologies for LLMs.
Abstract: Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models (LLMs). However, our study reveals a surprising contradiction to this prevailing perspective within the fundamental domain of pattern-based in-context learning (ICL). Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based ICL datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. To systematically investigate this unexpected phenomenon, we designed extensive experiments to validate several hypothetical explanations. Our analysis uncovers a fundamental hybrid mechanism of explicit-implicit reasoning driving CoT's performance in pattern-based ICL: while explicit reasoning falters due to LLMs' struggles to infer underlying patterns from demonstrations, implicit reasoning, disrupted by the increased contextual distance of CoT rationales, often compensates, delivering correct answers despite flawed rationales. This hybrid mechanism explains CoT's relative underperformance, as noise from weak explicit inference undermines the process, even as implicit mechanisms partially salvage outcomes. Notably, even long-CoT reasoning models, which excel in abstract and symbolic reasoning, fail to fully overcome these limitations despite higher computational costs. Our findings challenge existing assumptions regarding the universal efficacy of CoT, yielding novel insights into its limitations and guiding future research toward more nuanced and effective reasoning methodologies for LLMs.
[117] Enhancing Reasoning Abilities of Small LLMs with Cognitive Alignment
Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang
Main category: cs.CL
TL;DR: Proposes CRV+CogPO framework to train small reasoning models by aligning their reasoning processes with cognitive capacities through multi-agent critique-rethink-verify system and cognitive preference optimization.
Details
Motivation: Large reasoning models have high resource demands, creating need for effective small reasoning models. Direct knowledge distillation from large to small models is often ineffective due to different cognitive capabilities.Method: CRV system with multiple LLM agents: one critiques CoT rationales based on small model capabilities, another rethinks/refines CoTs, third verifies correctness. CogPO algorithm aligns reasoning processes with cognitive capacities.
Result: Outperforms other methods by a large margin on challenging reasoning benchmarks.
Conclusion: CRV+CogPO framework effectively trains small reasoning models by considering their unique cognitive capabilities rather than direct distillation from large models.
Abstract: The reasoning capabilities of large reasoning models (LRMs), such as OpenAI’s o1 and DeepSeek-R1, have seen substantial advancements through deep thinking. However, these enhancements come with significant resource demands, underscoring the need for training effective small reasoning models. A critical challenge is that small models possess different reasoning capacities and cognitive trajectories compared with their larger counterparts. Hence, directly distilling chain-of-thought (CoT) rationales from large LRMs to smaller ones can sometimes be ineffective and often requires a substantial amount of annotated data. In this paper, we first introduce a novel Critique-Rethink-Verify (CRV) system, designed for training smaller yet powerful LRMs. Our CRV system consists of multiple LLM agents, each specializing in unique tasks: (i) critiquing the CoT rationales according to the cognitive capabilities of smaller models, (ii) rethinking and refining these CoTs based on the critiques, and (iii) verifying the correctness of the refined results. Building on the CRV system, we further propose the Cognitive Preference Optimization (CogPO) algorithm to continuously enhance the reasoning abilities of smaller models by aligning their reasoning processes with their cognitive capacities. Comprehensive evaluations on challenging reasoning benchmarks demonstrate the efficacy of our CRV+CogPO framework, which outperforms other methods by a large margin.
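A compact sketch of the critique-rethink-verify flow for one CoT rationale, assuming three generic LLM callables for the critic, rethinker, and verifier agents; the prompts and the acceptance rule are illustrative, not the paper's actual agents.

```python
from typing import Callable, Optional

LLM = Callable[[str], str]

def crv_refine(question: str, cot: str, answer: str,
               critic: LLM, rethinker: LLM, verifier: LLM) -> Optional[str]:
    """Return a refined CoT suitable for training a small model, or None if rejected."""
    # (i) Critique the rationale with the smaller model's capacity in mind.
    critique = critic("Critique this chain of thought for a smaller model, "
                      "flagging steps that are too implicit or too long.\n"
                      f"Question: {question}\nCoT: {cot}")
    # (ii) Rethink and rewrite the rationale according to the critique.
    refined = rethinker("Rewrite the chain of thought to address the critique.\n"
                        f"Critique: {critique}\nQuestion: {question}\nCoT: {cot}")
    # (iii) Verify that the refined rationale still yields the correct answer.
    verdict = verifier(f"Does this reasoning lead to the answer '{answer}'? "
                       f"Answer 'yes' or 'no'.\nReasoning: {refined}")
    return refined if verdict.strip().lower().startswith("yes") else None
```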
[118] Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access
Xiang Hu, Jiaqi Leng, Jun Zhao, Kewei Tu, Wei Wu
Main category: cs.CL
TL;DR: Proposes Hierarchical Sparse Attention (HSA) to enhance RNNs with long-range random access while maintaining efficiency, achieving perfect accuracy on 64M-length tasks despite 4K-length pre-training.
Details
Motivation: RNNs have linear complexity advantages over Transformers but lack random access to historical context. Simply adding attention undermines RNN efficiency benefits.Method: HSA divides inputs into chunks, selects top-k chunks using token-level relevance, and hierarchically aggregates information with hardware-aligned kernel design.
Result: RAMba (Mamba + HSA) achieves perfect accuracy on 64M-length passkey retrieval despite 4K pre-training, with significant downstream improvements and near-constant memory footprint.
Conclusion: RAMba demonstrates huge potential for long-context modeling by combining RNN efficiency with flexible random access through hierarchical sparse attention.
Abstract: A key advantage of Recurrent Neural Networks (RNNs) over Transformers is that their linear computational and space complexity enables faster training and inference for long sequences. However, RNNs are fundamentally unable to randomly access historical context, and simply integrating attention mechanisms may undermine their efficiency advantages. To overcome this limitation, we propose Hierarchical Sparse Attention (HSA), a novel attention mechanism that enhances RNNs with long-range random access flexibility while preserving their merits in efficiency and length generalization. HSA divides inputs into chunks, selects the top-k chunks, and hierarchically aggregates information. The core innovation lies in learning token-to-chunk relevance based on fine-grained token-level information inside each chunk. This approach enhances the precision of chunk selection across both in-domain and out-of-domain context lengths. To make HSA efficient, we further introduce a hardware-aligned kernel design. By combining HSA with Mamba, we introduce RAMba, which achieves perfect accuracy in passkey retrieval across 64-million-token contexts despite pre-training on only 4K-length contexts, along with significant improvements on various downstream tasks and a nearly constant memory footprint. These results show RAMba's huge potential in long-context modeling.
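The chunk-then-select idea can be illustrated with a single-query PyTorch sketch; the chunk scoring rule (max over token scores) and the flat softmax over selected tokens are simplifying assumptions, not the paper's hardware-aligned kernel.

```python
import torch
import torch.nn.functional as F

def hierarchical_sparse_attention(q, k, v, chunk_size=4, top_k=2):
    """q: (d,), k and v: (n, d). Select the top_k most relevant chunks by
    token-level scores, then attend only over tokens in those chunks."""
    n, d = k.shape
    n_chunks = n // chunk_size
    k_chunks = k[: n_chunks * chunk_size].view(n_chunks, chunk_size, d)
    v_chunks = v[: n_chunks * chunk_size].view(n_chunks, chunk_size, d)

    # Token-level relevance inside each chunk, reduced to one score per chunk.
    token_scores = torch.einsum("d,csd->cs", q, k_chunks) / d ** 0.5
    chunk_scores = token_scores.max(dim=-1).values            # (n_chunks,)

    # Keep only the top-k chunks.
    top_idx = chunk_scores.topk(min(top_k, n_chunks)).indices
    sel_scores = token_scores[top_idx].reshape(-1)            # (top_k * chunk_size,)
    sel_values = v_chunks[top_idx].reshape(-1, d)

    # Standard softmax attention restricted to the selected tokens.
    weights = F.softmax(sel_scores, dim=-1)
    return weights @ sel_values                               # (d,)

q = torch.randn(16)
k, v = torch.randn(64, 16), torch.randn(64, 16)
print(hierarchical_sparse_attention(q, k, v).shape)           # torch.Size([16])
```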
[119] PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts
Yiming Wang, Pei Zhang, Jialong Tang, Haoran Wei, Baosong Yang, Rui Wang, Chenshu Sun, Feitong Sun, Jiran Zhang, Junxuan Wu, Qiqian Cang, Yichang Zhang, Fei Huang, Junyang Lin, Fei Huang, Jingren Zhou
Main category: cs.CL
TL;DR: PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 difficulty levels, designed to evaluate LLMs’ multilingual reasoning capabilities.
Details
Motivation: To create a highly discriminative multilingual mathematical benchmark that ensures difficulty comprehensiveness, language diversity, and high-quality translation for evaluating reasoning LLMs.Method: Developed a benchmark with 18 languages and 4 difficulty levels, then conducted comprehensive evaluation of advanced LLMs including Qwen-3-235B-A22B-Thinking and Gemini-2.5-pro.
Result: Even top models achieved only 54.6 and 52.2 benchmark scores, with about 40% accuracy at the highest difficulty level. The benchmark revealed key challenges: wide performance variation across languages, low input-output language consistency, and significant thinking length differences by language.
Conclusion: Controlling output language in instructions can affect reasoning performance, especially for low-resource languages, suggesting a promising direction for improving multilingual capabilities in LLMs.
Abstract: In this paper, we introduce PolyMath, a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs. We conduct a comprehensive evaluation of advanced LLMs and find that even Qwen-3-235B-A22B-Thinking and Gemini-2.5-pro achieve only 54.6 and 52.2 benchmark scores, respectively, with about 40% accuracy at the highest difficulty level. From a language perspective, our benchmark reveals several key challenges of LLMs in multilingual reasoning: (1) Reasoning performance varies widely across languages for current LLMs; (2) Input-output language consistency is low in reasoning LLMs and may be correlated with performance; (3) The thinking length differs significantly by language for current LLMs. Additionally, we demonstrate that controlling the output language in the instructions has the potential to affect reasoning performance, especially for some low-resource languages, suggesting a promising direction for improving multilingual capabilities in LLMs.
[120] JobHop: A Large-Scale Dataset of Career Trajectories
Iman Johary, Raphael Romero, Alexandru C. Mara, Tijl De Bie
Main category: cs.CL
TL;DR: JobHop is a large-scale public dataset of career trajectories extracted from anonymized resumes using LLMs, providing structured career information with ESCO occupation codes for labor market analysis.
Details
Motivation: Comprehensive datasets capturing real-world career trajectories are scarce, making it difficult to understand labor market dynamics for policymakers, employers, and job seekers.Method: Used Large Language Models to process unstructured resume data from VDAB, extracted structured career information, and normalized it to standardized ESCO occupation codes using a multi-label classification model.
Result: Created a rich dataset with over 1.67 million work experiences from more than 361,000 user resumes, mapped to standardized ESCO occupation codes.
Conclusion: The JobHop dataset enables diverse applications for labor market research, including analyzing mobility, job stability, career breaks, and supports career path prediction and data-driven decision-making.
Abstract: Understanding labor market dynamics is essential for policymakers, employers, and job seekers. However, comprehensive datasets that capture real-world career trajectories are scarce. In this paper, we introduce JobHop, a large-scale public dataset derived from anonymized resumes provided by VDAB, the public employment service in Flanders, Belgium. Utilizing Large Language Models (LLMs), we process unstructured resume data to extract structured career information, which is then normalized to standardized ESCO occupation codes using a multi-label classification model. This results in a rich dataset of over 1.67 million work experiences, extracted from more than 361,000 user resumes and mapped to standardized ESCO occupation codes, offering valuable insights into real-world occupational transitions. This dataset enables diverse applications, such as analyzing labor market mobility, job stability, and the effects of career breaks on occupational transitions. It also supports career path prediction and other data-driven decision-making processes. To illustrate its potential, we explore key dataset characteristics, including job distributions, career breaks, and job transitions, demonstrating its value for advancing labor market research.
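A hedged sketch of the normalization step, mapping a free-text job title to ESCO occupation codes with a multi-label decision rule; the label set, the random embeddings, and the threshold are placeholders rather than the dataset's actual classifier.

```python
import numpy as np

ESCO_LABELS = ["2512 software developers", "2431 advertising professionals", "5120 cooks"]

def normalize_to_esco(title_embedding: np.ndarray,
                      label_embeddings: np.ndarray,
                      threshold: float = 0.5) -> list[str]:
    """Keep every ESCO code whose score clears the threshold, since one work
    experience can legitimately map to more than one occupation code."""
    scores = label_embeddings @ title_embedding          # (n_labels,)
    probs = 1.0 / (1.0 + np.exp(-scores))                # per-label sigmoid
    return [code for code, p in zip(ESCO_LABELS, probs) if p >= threshold]

# Toy usage with random embeddings standing in for a trained encoder.
rng = np.random.default_rng(0)
title_vec = rng.normal(size=16)
label_vecs = rng.normal(size=(len(ESCO_LABELS), 16))
print(normalize_to_esco(title_vec, label_vecs))
```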
[121] New Encoders for German Trained from Scratch: Comparing ModernGBERT with Converted LLM2Vec Models
Julia Wunderle, Anton Ehrmanntraut, Jan Pfister, Fotis Jannidis, Andreas Hotho
Main category: cs.CL
TL;DR: This paper compares two approaches for creating German encoders: training from scratch (ModernGBERT) and converting decoders via LLM2Vec (LLaMmleinVec). ModernGBERT 1B achieves state-of-the-art performance on German benchmarks, outperforming larger converted models.
Details
Motivation: Encoders remain essential for efficient German NLP despite the rise of decoder-only LLMs. The study aims to provide guidance on creating high-quality German encoders under identical data and training constraints.Method: Two approaches: 1) Training ModernGBERT from scratch (134M, 1B parameters) and 2) Converting decoders via LLM2Vec to create LLaMmleinVec (120M, 1B, 7B parameters). Both undergo context extension to 8,192 tokens and are trained with masked next-token prediction.
Result: ModernGBERT 1B sets new SOTA on SuperGLEBer (avg 0.808), surpassing GBERT Large (+4%) and the 7B converted model (0.787). On German MTEB after fine-tuning, ModernGBERT 1B (0.551) approaches the converted 7B model (0.557).
Conclusion: From-scratch encoders dominate when parameter efficiency and latency matter. When a pre-trained decoder exists and compute is limited, conversion offers an effective alternative. All models and resources are released under research-only RAIL license.
Abstract: Encoders remain essential for efficient German NLP and NLU scenarios despite the rise of decoder-only LLMs. This work studies two routes to high-quality German encoders under identical data and training constraints: 1) training from scratch and 2) converting decoders via LLM2Vec. We introduce two resources: ModernGBERT (134M, 1B), fully transparent German encoders in the ModernBERT style, and LLäMmleinVec (120M, 1B, 7B), decoder-to-encoder conversions trained with masked next-token prediction, both undergoing a context extension to 8,192 tokens. Across SuperGLEBer, ModernGBERT 1B sets a new state of the art (avg 0.808), surpassing GBERT Large (+4%) and the seven-times larger converted 7B model (0.787). On German MTEB after supervised fine-tuning, ModernGBERT 1B (0.551) approaches the converted 7B model (0.557). We release all models, checkpoints, datasets, and full training records, and introduce an encoder-adapted QA-NIAH evaluation. All in all, our results provide actionable guidance: when parameter efficiency and latency matter, from-scratch encoders dominate; when a pre-trained decoder exists and compute is limited, conversion offers an effective alternative. ModernGBERT and LLäMmleinVec, including all code, data, and intermediary checkpoints, are published under a research-only RAIL license.
[122] Editing Across Languages: A Survey of Multilingual Knowledge Editing
Nadir Durrani, Basel Mousi, Fahim Dalvi
Main category: cs.CL
TL;DR: This survey systematizes research on Multilingual Knowledge Editing (MKE), presenting a comprehensive taxonomy of methods, available benchmarks, key findings on effectiveness and transfer patterns, and identifies challenges in cross-lingual propagation.
Details
Motivation: Knowledge Editing has been extensively studied in monolingual settings but remains underexplored in multilingual contexts, creating a need to consolidate research on ensuring factual edits generalize reliably across languages.Method: The paper presents a comprehensive taxonomy of MKE methods covering parameter-based, memory-based, fine-tuning, and hypernetwork approaches, and surveys available benchmarks.
Result: The survey summarizes key findings on method effectiveness and transfer patterns, and identifies challenges in cross-lingual propagation.
Conclusion: The analysis consolidates a rapidly evolving area and lays the groundwork for future progress in editable language-aware LLMs, highlighting open problems related to language anisotropy, evaluation coverage, and edit scalability.
Abstract: While Knowledge Editing has been extensively studied in monolingual settings, it remains underexplored in multilingual contexts. This survey systematizes recent research on Multilingual Knowledge Editing (MKE), a growing subdomain of model editing focused on ensuring factual edits generalize reliably across languages. We present a comprehensive taxonomy of MKE methods, covering parameter-based, memory-based, fine-tuning, and hypernetwork approaches. We survey available benchmarks, summarize key findings on method effectiveness and transfer patterns, identify challenges in cross-lingual propagation, and highlight open problems related to language anisotropy, evaluation coverage, and edit scalability. Our analysis consolidates a rapidly evolving area and lays the groundwork for future progress in editable language-aware LLMs.
[123] Beyond Empathy: Integrating Diagnostic and Therapeutic Reasoning with Large Language Models for Mental Health Counseling
He Hu, Yucheng Zhou, Juzheng Si, Qianning Wang, Hengheng Zhang, Fuji Ren, Fei Ma, Laizhong Cui, Qi Tian
Main category: cs.CL
TL;DR: PsyLLM is the first large language model designed for mental health counseling that systematically integrates diagnostic and therapeutic reasoning using clinical standards like DSM/ICD and multiple therapeutic frameworks.
Details
Motivation: Existing LLM-based mental health approaches lack clinical grounding, particularly in explicit diagnostic reasoning aligned with standards like DSM/ICD and incorporating diverse therapeutic modalities beyond basic empathy or single strategies.Method: Developed an automated data synthesis pipeline that processes real-world mental health posts from Reddit, generates multi-turn dialogue structures, and uses LLMs guided by international diagnostic standards and multiple therapeutic frameworks (CBT, ACT, psychodynamic) to simulate clinical reasoning processes with rigorous multi-dimensional filtering.
Result: PsyLLM significantly outperforms state-of-the-art baseline models on a new benchmark assessing counseling quality across four key dimensions.
Conclusion: The proposed PsyLLM model successfully addresses critical limitations in existing approaches by integrating systematic diagnostic and therapeutic reasoning for mental health counseling, with publicly released model weights and dataset.
Abstract: Large language models (LLMs) hold significant potential for mental health support, capable of generating empathetic responses and simulating therapeutic conversations. However, existing LLM-based approaches often lack the clinical grounding necessary for real-world psychological counseling, particularly in explicit diagnostic reasoning aligned with standards like the DSM/ICD and incorporating diverse therapeutic modalities beyond basic empathy or single strategies. To address these critical limitations, we propose PsyLLM, the first large language model designed to systematically integrate both diagnostic and therapeutic reasoning for mental health counseling. To develop PsyLLM, we design a novel automated data synthesis pipeline that processes real-world mental health posts collected from Reddit, where users frequently share psychological distress and seek community support. The pipeline generates multi-turn dialogue structures from these posts and leverages LLMs guided by international diagnostic standards (e.g., DSM/ICD) and multiple therapeutic frameworks (e.g., CBT, ACT, psychodynamic) to simulate detailed clinical reasoning processes. Rigorous multi-dimensional filtering ensures the generation of high-quality, clinically aligned dialogue data. In addition, we introduce a new benchmark and evaluation protocol, assessing counseling quality across four key dimensions. Our experiments demonstrate that PsyLLM significantly outperforms state-of-the-art baseline models on this benchmark. The model weights and dataset have been publicly released at https://github.com/Emo-gml/PsyLLM.
[124] The Language of Interoception: Examining Embodiment and Emotion Through a Corpus of Body Part Mentions
Sophie Wu, Jan Philip Wahle, Saif M. Mohammad
Main category: cs.CL
TL;DR: First large-scale study linking emotion, embodiment, and body part mentions in natural language, showing BPMs are common in online text and correlate with emotional intensity and poorer health outcomes.
Details
Motivation: To investigate the connection between emotion, embodiment, and everyday language using large-scale natural language data, exploring how body part mentions relate to emotional expression and wellbeing.Method: Created corpora of body part mentions in online English text (blogs and tweets), including human-annotated subset for emotions, and analyzed using word-emotion association lexicons and statistical correlation analysis.
Result: BPMs are common (5-10% of posts), usage varies by time/location, text with BPMs is more emotionally charged even without explicit physical reactions, and strong correlation exists between body-related language and poorer health outcomes.
Conclusion: Investigating body-part related words in language opens valuable research avenues at the intersection of NLP, affective sciences, and human wellbeing studies.
Abstract: This paper is the first investigation of the connection between emotion, embodiment, and everyday language in a large sample of natural language data. We created corpora of body part mentions (BPMs) in online English text (blog posts and tweets). This includes a subset featuring human annotations for the emotions of the person whose body part is mentioned in the text. We show that BPMs are common in personal narratives and tweets (~5% to 10% of posts include BPMs) and that their usage patterns vary markedly by time and geographic location. Using word-emotion association lexicons and our annotated data, we show that text containing BPMs tends to be more emotionally charged, even when the BPM is not explicitly used to describe a physical reaction to the emotion in the text. Finally, we discover a strong and statistically significant correlation between body-related language and a variety of poorer health outcomes. In sum, we argue that investigating the role of body-part related words in language can open up valuable avenues of future research at the intersection of NLP, the affective sciences, and the study of human wellbeing.
[125] Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning
Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang, Jian Wang, Wenjie Li, Xiaoyu Shen
Main category: cs.CL
TL;DR: This paper provides a comprehensive survey of latent Chain-of-Thought reasoning in LLMs, where reasoning occurs in latent spaces rather than through explicit language generation, establishing a systematic taxonomy and analyzing recent advances.
Details
Motivation: Conventional CoT reasoning relies on explicit verbalization of intermediate steps, which limits applicability in abstract reasoning tasks beyond language. Latent CoT offers richer cognitive representations and more flexible, faster inference by decoupling reasoning from language generation.Method: The paper presents a systematic taxonomy of latent CoT methods, categorizing them from token-wise horizontal approaches to layer-wise vertical strategies. It provides in-depth analysis of design principles, applications, and challenges.
Result: The survey establishes a structured foundation for advancing latent CoT reasoning in LLMs, offering comprehensive overview of this emerging paradigm and recent methodological advances.
Conclusion: Latent CoT reasoning represents a promising direction for enhancing LLM reasoning capabilities, with potential for broader applicability in abstract reasoning tasks through embedded reasoning processes in latent spaces.
Abstract: Large Language Models (LLMs) have shown impressive performance on complex tasks through Chain-of-Thought (CoT) reasoning. However, conventional CoT relies on explicitly verbalized intermediate steps, which constrains its broader applicability, particularly in abstract reasoning tasks beyond language. To address this, there has been growing research interest in latent CoT reasoning, where the reasoning process is embedded within latent spaces. By decoupling reasoning from explicit language generation, latent CoT offers the promise of richer cognitive representations and facilitates more flexible, faster inference. This paper aims to present a comprehensive overview of this emerging paradigm and establish a systematic taxonomy. We analyze recent advances in methods, categorizing them from token-wise horizontal approaches to layer-wise vertical strategies. We then provide in-depth discussions of these methods, highlighting their design principles, applications, and remaining challenges. We hope that our survey provides a structured foundation for advancing this promising direction in LLM reasoning. The relevant papers will be regularly updated at https://github.com/EIT-NLP/Awesome-Latent-CoT.
[126] Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally
Agam Shah, Siddhant Sukhani, Huzaifa Pardawala, Saketh Budideti, Riya Bhadani, Rudra Gopal, Siddhartha Somani, Rutwik Routu, Michael Galarnyk, Soungmin Lee, Arnav Hiray, Akshar Ravichandran, Eric Kim, Pranav Aluru, Joshua Zhang, Sebastian Jaskowski, Veer Guda, Meghaj Tarte, Liqin Ye, Spencer Gosden, Rachel Yuh, Sloka Chava, Sahasra Chava, Dylan Patrick Kelly, Aiden Chiang, Harsit Mittal, Sudheer Chava
Main category: cs.CL
TL;DR: The paper introduces the World Central Banks (WCB) dataset - the most comprehensive monetary policy corpus with 380k+ sentences from 25 central banks across 28 years, and benchmarks various language models on three annotation tasks.
Details
Motivation: To address the need for better understanding central bank communications, as misinterpretations can disproportionately impact vulnerable populations, and to provide a comprehensive dataset for monetary policy analysis.Method: Created WCB dataset with 25k uniformly sampled sentences annotated using dual annotators, disagreement resolutions, and expert reviews. Defined three tasks: Stance Detection, Temporal Classification, and Uncertainty Estimation. Benchmarked 7 PLMs and 9 LLMs across 15,075 experiments.
Result: A model trained on aggregated data across banks significantly outperforms models trained on individual bank data, confirming that “the whole is greater than the sum of its parts.” Human evaluations and error analyses validate the framework’s economic utility.
Conclusion: The WCB dataset enables robust analysis of central bank communications, demonstrating the value of cross-bank aggregated training data, with artifacts made publicly available under CC-BY-NC-SA 4.0 license.
Abstract: Central banks around the world play a crucial role in maintaining economic stability. Deciphering policy implications in their communications is essential, especially as misinterpretations can disproportionately impact vulnerable populations. To address this, we introduce the World Central Banks (WCB) dataset, the most comprehensive monetary policy corpus to date, comprising over 380k sentences from 25 central banks across diverse geographic regions, spanning 28 years of historical data. After uniformly sampling 1k sentences per bank (25k total) across all available years, we annotate and review each sentence using dual annotators, disagreement resolutions, and secondary expert reviews. We define three tasks: Stance Detection, Temporal Classification, and Uncertainty Estimation, with each sentence annotated for all three. We benchmark seven Pretrained Language Models (PLMs) and nine Large Language Models (LLMs) (Zero-Shot, Few-Shot, and with annotation guide) on these tasks, running 15,075 benchmarking experiments. We find that a model trained on aggregated data across banks significantly surpasses a model trained on an individual bank’s data, confirming the principle “the whole is greater than the sum of its parts.” Additionally, rigorous human evaluations, error analyses, and predictive tasks validate our framework’s economic utility. Our artifacts are accessible through the HuggingFace and GitHub under the CC-BY-NC-SA 4.0 license.
[127] Forging Time Series with Language: A Large Language Model Approach to Synthetic Data Generation
Cécile Rousseau, Tobia Boschi, Giandomenico Cornacchia, Dhaval Salwala, Alessandra Pascale, Juan Bernabe Moreno
Main category: cs.CL
TL;DR: SDForger is an LLM-based framework for generating high-quality multivariate time series using text embeddings, outperforming existing models in similarity and forecasting tasks.
Details
Motivation: To create a flexible and efficient method for synthetic time series generation that can work with few samples and enable multimodal integration with textual information.Method: Transforms univariate/multivariate signals into tabular embeddings, encodes them as text to fine-tune autoregressive LLMs, then samples and decodes new textual embeddings into synthetic time series.
Result: Outperforms existing generative models across diverse datasets in both similarity-based evaluations and downstream forecasting tasks.
Conclusion: SDForger enables efficient time series generation with textual conditioning, paving the way for multimodal modeling and streamlined integration of time series with textual data.
Abstract: SDForger is a flexible and efficient framework for generating high-quality multivariate time series using LLMs. Leveraging a compact data representation, SDForger provides synthetic time series generation from a few samples and low-computation fine-tuning of any autoregressive LLM. Specifically, the framework transforms univariate and multivariate signals into tabular embeddings, which are then encoded into text and used to fine-tune the LLM. At inference, new textual embeddings are sampled and decoded into synthetic time series that retain the original data’s statistical properties and temporal dynamics. Across a diverse range of datasets, SDForger outperforms existing generative models in many scenarios, both in similarity-based evaluations and downstream forecasting tasks. By enabling textual conditioning in the generation process, SDForger paves the way for multimodal modeling and the streamlined integration of time series with textual information. The model is open-sourced at https://github.com/IBM/fms-dgt/tree/main/fms_dgt/public/databuilders/time_series.
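To illustrate the text-encoding step, the sketch below serializes a multivariate series into text an LLM could be fine-tuned on and decodes generated text back into numbers; SDForger itself uses a compact tabular embedding, which this rounding-based serialization only approximates.

```python
import numpy as np

def series_to_text(series: np.ndarray, decimals: int = 2) -> str:
    """series: (timesteps, channels). Returns one comma-separated line per channel."""
    lines = []
    for c in range(series.shape[1]):
        vals = ", ".join(f"{x:.{decimals}f}" for x in series[:, c])
        lines.append(f"channel_{c}: {vals}")
    return "\n".join(lines)

def text_to_series(text: str) -> np.ndarray:
    """Inverse of series_to_text: decode generated text back into numbers."""
    channels = []
    for line in text.strip().splitlines():
        _, vals = line.split(":", 1)
        channels.append([float(v) for v in vals.split(",")])
    return np.array(channels).T

x = np.cumsum(np.random.randn(8, 2), axis=0)        # toy 2-channel series
assert text_to_series(series_to_text(x)).shape == x.shape
```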
[128] GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations
Odysseas S. Chlapanis, Dimitrios Galanis, Nikolaos Aletras, Ion Androutsopoulos
Main category: cs.CL
TL;DR: GreekBarBench is a new benchmark for evaluating LLMs on Greek Bar exam legal questions requiring citations, using a three-dimensional scoring system with LLM-as-judge and meta-evaluation to improve alignment with human experts.
Details
Motivation: To create a specialized benchmark for evaluating LLMs on legal reasoning tasks that require citation of statutory articles and case facts, addressing the challenges of free-text evaluation in legal domains.Method: Developed GreekBarBench with questions from five Greek legal areas, implemented a three-dimensional scoring system using LLM-as-judge approach, and created meta-evaluation to assess correlation between LLM and human expert evaluations.
Result: Evaluation of 13 LLMs showed best models outperform average expert scores but fall short of the 95th percentile of human experts. Simple span-based rubrics improved alignment between LLM-judges and human evaluations.
Conclusion: While current LLMs show promising performance on legal reasoning tasks, they still cannot match top human expert performance, and specialized evaluation methods with improved rubrics are needed for accurate assessment.
Abstract: We introduce GreekBarBench, a benchmark that evaluates LLMs on legal questions across five different legal areas from the Greek Bar exams, requiring citations to statutory articles and case facts. To tackle the challenges of free-text evaluation, we propose a three-dimensional scoring system combined with an LLM-as-a-judge approach. We also develop a meta-evaluation benchmark to assess the correlation between LLM-judges and human expert evaluations, revealing that simple, span-based rubrics improve their alignment. Our systematic evaluation of 13 proprietary and open-weight LLMs shows that even though the best models outperform average expert scores, they fall short of the 95th percentile of experts.
[129] KIT’s Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization
Zhaolin Li, Yining Liu, Danni Liu, Tuan Nam Nguyen, Enes Yavuz Ugan, Tu Anh Dinh, Carlos Mullov, Alexander Waibel, Jan Niehues
Main category: cs.CL
TL;DR: KIT’s IWSLT 2025 submission develops cascaded (ASR+MT) and end-to-end speech translation systems for Bemba, North Levantine Arabic, and Tunisian Arabic to English, using fine-tuning, synthetic data, model regularization, and system combination techniques.
Details
Motivation: To develop effective speech translation systems for low-resource languages where parallel training data is scarce, particularly exploring efficient resource utilization and synthetic data strategies.Method: Fine-tuned pre-trained models with different strategies, used MT-augmented ST by generating translations from ASR data, employed text-to-speech for synthetic speech generation, applied intra-distillation for regularization, and used Minimum Bayes Risk decoding for system combination.
Result: For North Levantine Arabic, a system trained solely on synthetic data slightly surpassed the cascaded system trained on real data. Synthetic data also improved ASR and ST performance for Bemba. Intra-distillation consistently improved ASR, MT, and ST tasks. System combination achieved an improvement of roughly 1.5 BLEU points.
Conclusion: Synthetic data generation and model regularization techniques effectively enhance speech translation performance for low-resource languages, with system combination providing additional gains.
Abstract: This paper presents KIT’s submissions to the IWSLT 2025 low-resource track. We develop both cascaded systems, consisting of Automatic Speech Recognition (ASR) and Machine Translation (MT) models, and end-to-end (E2E) Speech Translation (ST) systems for three language pairs: Bemba, North Levantine Arabic, and Tunisian Arabic into English. Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently. This study further explores system enhancement with synthetic data and model regularization. Specifically, we investigate MT-augmented ST by generating translations from ASR data using MT models. For North Levantine, which lacks parallel ST training data, a system trained solely on synthetic data slightly surpasses the cascaded system trained on real data. We also explore augmentation using text-to-speech models by generating synthetic speech from MT data, demonstrating the benefits of synthetic data in improving both ASR and ST performance for Bemba. Additionally, we apply intra-distillation to enhance model performance. Our experiments show that this approach consistently improves results across ASR, MT, and ST tasks, as well as across different pre-trained models. Finally, we apply Minimum Bayes Risk decoding to combine the cascaded and end-to-end systems, achieving an improvement of approximately 1.5 BLEU points.
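A small sketch of Minimum Bayes Risk decoding over a candidate pool drawn from both systems; the Jaccard token-overlap utility stands in for whatever BLEU-style metric the submission actually uses.

```python
def utility(hyp: str, ref: str) -> float:
    """Token-set Jaccard overlap, a stand-in for a BLEU-style utility metric."""
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(len(h | r), 1)

def mbr_decode(candidates: list[str]) -> str:
    """Return the candidate with the highest expected utility against the pool."""
    def expected_utility(cand: str) -> float:
        others = [o for o in candidates if o is not cand]
        return sum(utility(cand, o) for o in others) / max(len(others), 1)
    return max(candidates, key=expected_utility)

# Candidates pooled from the cascaded and end-to-end systems (toy example).
pool = ["the meeting starts at noon", "meeting starts at noon", "the dog barks loudly"]
print(mbr_decode(pool))   # picks one of the two mutually supporting hypotheses
```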
[130] Exploring the Hidden Capacity of LLMs for One-Step Text Generation
Gleb Mezentsev, Ivan Oseledets
Main category: cs.CL
TL;DR: Frozen LLMs can generate hundreds of accurate tokens in one forward pass using just two learned embeddings, revealing multi-token generation capability without autoregressive decoding.
Details
Motivation: To explore whether autoregressive decoding is essential for text reconstruction in LLMs and investigate alternative multi-token generation methods.Method: Use frozen LLMs with only two learned embeddings to generate multiple tokens in a single token-parallel forward pass, analyzing the embedding properties and information encoding.
Result: LLMs can generate hundreds of accurate tokens in one pass, with embeddings forming connected local regions in embedding space, suggesting potential for practical encoder training.
Conclusion: Multi-token generation may be natively accessible in off-the-shelf LLMs via learned input encoders, eliminating autoregressive decoding bottlenecks without requiring model retraining.
Abstract: A recent study showed that large language models (LLMs) can reconstruct surprisingly long texts - up to thousands of tokens - via autoregressive generation from just one trained input embedding. In this work, we explore whether autoregressive decoding is essential for such reconstruction. We show that frozen LLMs can generate hundreds of accurate tokens in just one token-parallel forward pass, when provided with only two learned embeddings. This reveals a surprising and underexplored multi-token generation capability of autoregressive LLMs. We examine these embeddings and characterize the information they encode. We also empirically show that, although these representations are not unique for a given text, they form connected and local regions in embedding space - suggesting the potential to train a practical encoder. The existence of such representations hints that multi-token generation may be natively accessible in off-the-shelf LLMs via a learned input encoder, eliminating heavy retraining and helping to overcome the fundamental bottleneck of autoregressive decoding while reusing already-trained models.
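A conceptual sketch of token-parallel decoding from two learned input embeddings, assuming a frozen decoder `model` that maps input embeddings of shape (1, seq, d) to logits of shape (1, seq, vocab); the names and the placeholder-embedding scheme are illustrative assumptions, not the paper's exact setup.

```python
import torch

def one_step_generate(model, learned_embeds: torch.Tensor,
                      pad_embed: torch.Tensor, n_tokens: int) -> torch.Tensor:
    """learned_embeds: (2, d) trained to encode the target text.
    pad_embed: (d,) a fixed placeholder embedding fed at every output slot.
    model: frozen decoder mapping input embeddings (1, seq, d) to logits (1, seq, vocab)."""
    d = learned_embeds.shape[1]
    placeholders = pad_embed.expand(n_tokens, d)               # (n, d)
    inputs = torch.cat([learned_embeds, placeholders], dim=0)  # (2 + n, d)
    with torch.no_grad():
        logits = model(inputs.unsqueeze(0))                    # (1, 2 + n, vocab)
    # All n tokens are read off the placeholder positions in one forward pass.
    return logits[0, 2:].argmax(dim=-1)                        # (n,) token ids
```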
[131] Advancing Expert Specialization for Better MoE
Hongcan Guo, Haolang Lu, Guoshun Nan, Bolun Chu, Jialin Zhuang, Yuan Yang, Wenhao Che, Sicong Leng, Qimei Cui, Xudong Jiang
Main category: cs.CL
TL;DR: Proposes two complementary objectives - orthogonality loss and variance loss - to improve expert specialization in Mixture-of-Experts models by reducing expert overlap and encouraging discriminative routing, achieving up to 23.79% improvement over baseline methods.
Details
Motivation: The commonly used auxiliary load balancing loss in MoE models leads to expert overlap and overly uniform routing, which hinders expert specialization and degrades overall performance during post-training.Method: Introduces two complementary objectives: (1) an orthogonality loss to encourage experts to process distinct types of tokens, and (2) a variance loss to encourage more discriminative routing decisions. These are compatible with existing auxiliary loss and optimize training through gradient-level analysis.
Result: Experimental results show significant enhancement in expert specialization across various model architectures and benchmarks. The method improves classic MoE baselines with auxiliary loss by up to 23.79% while maintaining load balancing in downstream tasks without architectural modifications.
Conclusion: The proposed simple yet effective solution successfully addresses expert overlap and routing uniformity issues in MoE models, significantly improving performance through better expert specialization while maintaining load balancing properties.
Abstract: Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input. However, we observe that the commonly used auxiliary load balancing loss often leads to expert overlap and overly uniform routing, which hinders expert specialization and degrades overall performance during post-training. To address this, we propose a simple yet effective solution that introduces two complementary objectives: (1) an orthogonality loss to encourage experts to process distinct types of tokens, and (2) a variance loss to encourage more discriminative routing decisions. Gradient-level analysis demonstrates that these objectives are compatible with the existing auxiliary loss and contribute to optimizing the training process. Experimental results over various model architectures and across multiple benchmarks show that our method significantly enhances expert specialization. Notably, our method improves classic MoE baselines with auxiliary loss by up to 23.79%, while also maintaining load balancing in downstream tasks, without any architectural modifications or additional components. We will release our code to contribute to the community.
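A hedged sketch of what the two auxiliary objectives could look like, assuming per-expert representations of shape (experts, d) and routing probabilities of shape (tokens, experts); the paper's exact formulations may differ.

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(expert_reprs: torch.Tensor) -> torch.Tensor:
    """expert_reprs: (experts, d). Penalize overlap by pushing pairwise
    cosine similarity between experts toward zero."""
    normed = F.normalize(expert_reprs, dim=-1)
    gram = normed @ normed.T                                     # (E, E)
    off_diag = gram - torch.eye(gram.shape[0], device=gram.device)
    return off_diag.pow(2).mean()

def variance_loss(router_probs: torch.Tensor) -> torch.Tensor:
    """router_probs: (tokens, experts). Encourage discriminative routing by
    rewarding high per-token variance over experts (negated for minimization)."""
    return -router_probs.var(dim=-1).mean()

# Toy check: near-identical experts and near-uniform routing give the worst values.
experts = torch.randn(8, 64)
probs = torch.softmax(torch.randn(32, 8), dim=-1)
print(orthogonality_loss(experts).item(), variance_loss(probs).item())
```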
[132] Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios
Zhongzhen Huang, Linjie Mu, Yakun Zhu, Xiangyu Zhao, Shaoting Zhang, Xiaofan Zhang
Main category: cs.CL
TL;DR: MedE² is a two-stage post-training pipeline that enhances multimodal reasoning for medical domains through text-only fine-tuning followed by multimodal medical case training.
Details
Motivation: Multimodal reasoning models have shown success in mathematics and science but remain underexplored in medical domains, where effective clinical decision-making depends on iterative, multimodal reasoning across diverse evidence sources.Method: Two-stage pipeline: Stage-I fine-tunes models using 2,000 text-only data with precise reasoning demonstrations; Stage-II enhances reasoning using 1,500 curated multimodal medical cases aligned with multimodal medical reasoning preferences.
Result: MedE² consistently outperforms baselines across multiple medical multimodal benchmarks, with additional validation confirming robustness and practical utility on larger models and under inference-time scaling.
Conclusion: The proposed MedE² pipeline effectively improves reasoning performance of medical multimodal models, demonstrating efficacy, reliability, and practical utility in medical domains.
Abstract: Effective clinical decision-making depends on iterative, multimodal reasoning across diverse sources of evidence. The recent emergence of multimodal reasoning models has significantly transformed the landscape of solving complex tasks. Although such models have achieved notable success in mathematics and science, their application to medical domains remains underexplored. In this work, we propose MedE², a two-stage post-training pipeline that elicits and then enhances multimodal reasoning for medical domains. In Stage-I, we fine-tune models using 2,000 text-only data samples containing precisely orchestrated reasoning demonstrations to elicit reasoning behaviors. In Stage-II, we further enhance the model's reasoning capabilities using 1,500 rigorously curated multimodal medical cases, aligning model reasoning outputs with our proposed multimodal medical reasoning preference. Extensive experiments demonstrate the efficacy and reliability of MedE² in improving the reasoning performance of medical multimodal models. Notably, models trained with MedE² consistently outperform baselines across multiple medical multimodal benchmarks. Additional validation on larger models and under inference-time scaling further confirms the robustness and practical utility of our approach.
[133] A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models
Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi
Main category: cs.CL
TL;DR: This paper presents the first comprehensive study of Chain-of-Thought (CoT) faithfulness in large vision-language models (LVLMs), revealing that subtle image-based biases are rarely articulated compared to explicit text-based ones, and identifying a new phenomenon called “inconsistent reasoning” where models correctly reason before abruptly changing answers.
Details
Motivation: To investigate whether CoT reasoning traces faithfully reflect the internal processes of models, particularly how both text-based and previously unexplored image-based biases affect reasoning and bias articulation in LVLMs.Method: The authors introduce a novel, fine-grained evaluation pipeline for categorizing bias articulation patterns, enabling more precise analysis of CoT reasoning than previous methods. They apply this framework to evaluate both LVLMs and LLMs across various levels of implicit cues.
Result: Findings show that subtle image-based biases are rarely articulated compared to explicit text-based ones, even in reasoning-specialized models. Many models exhibit “inconsistent reasoning” - correctly reasoning before abruptly changing answers. Current language-only reasoning models continue to struggle with articulating cues that are not overtly stated.
Conclusion: The study reveals critical distinctions in how models process different types of biases, providing new insights into LVLM CoT faithfulness. Inconsistent reasoning serves as a potential canary for detecting biased reasoning from unfaithful CoTs, highlighting ongoing challenges in model reasoning transparency.
Abstract: Chain-of-thought (CoT) reasoning enhances performance of large language models, but questions remain about whether these reasoning traces faithfully reflect the internal processes of the model. We present the first comprehensive study of CoT faithfulness in large vision-language models (LVLMs), investigating how both text-based and previously unexplored image-based biases affect reasoning and bias articulation. Our work introduces a novel, fine-grained evaluation pipeline for categorizing bias articulation patterns, enabling significantly more precise analysis of CoT reasoning than previous methods. This framework reveals critical distinctions in how models process and respond to different types of biases, providing new insights into LVLM CoT faithfulness. Our findings reveal that subtle image-based biases are rarely articulated compared to explicit text-based ones, even in models specialized for reasoning. Additionally, many models exhibit a previously unidentified phenomenon we term "inconsistent" reasoning - correctly reasoning before abruptly changing answers, serving as a potential canary for detecting biased reasoning from unfaithful CoTs. We then apply the same evaluation pipeline to revisit CoT faithfulness in LLMs across various levels of implicit cues. Our findings reveal that current language-only reasoning models continue to struggle with articulating cues that are not overtly stated.
[134] Trustworthy Medical Question Answering: An Evaluation-Centric Survey
Yinuo Wang, Baiyang Wang, Robert E. Mercer, Frank Rudzicz, Sudipta Singha Roy, Pengjie Ren, Zhumin Chen, Xindi Wang
Main category: cs.CL
TL;DR: This survey systematically examines six key dimensions of trustworthiness in medical question-answering systems using large language models: Factuality, Robustness, Fairness, Safety, Explainability, and Calibration.
Details
Motivation: Trustworthiness in healthcare QA systems is crucial for patient safety, clinical effectiveness, and user confidence, especially as LLMs become integrated into medical settings where reliability directly impacts clinical decision-making and patient outcomes.Method: The survey reviews how each trustworthiness dimension is evaluated in existing LLM-based medical QA systems, compiles and compares major benchmarks, and analyzes evaluation-guided techniques like retrieval-augmented grounding, adversarial fine-tuning, and safety alignment.
Result: The study identifies current evaluation methods and improvement techniques for trustworthy medical QA systems, highlighting the need for better assessment frameworks.
Conclusion: The paper identifies open challenges including scalable expert evaluation, integrated multi-dimensional metrics, and real-world deployment studies, proposing future research directions to advance safe, reliable, and transparent deployment of LLM-powered medical QA.
Abstract: Trustworthiness in healthcare question-answering (QA) systems is important for ensuring patient safety, clinical effectiveness, and user confidence. As large language models (LLMs) become increasingly integrated into medical settings, the reliability of their responses directly influences clinical decision-making and patient outcomes. However, achieving comprehensive trustworthiness in medical QA poses significant challenges due to the inherent complexity of healthcare data, the critical nature of clinical scenarios, and the multifaceted dimensions of trustworthy AI. In this survey, we systematically examine six key dimensions of trustworthiness in medical QA, i.e., Factuality, Robustness, Fairness, Safety, Explainability, and Calibration. We review how each dimension is evaluated in existing LLM-based medical QA systems. We compile and compare major benchmarks designed to assess these dimensions and analyze evaluation-guided techniques that drive model improvements, such as retrieval-augmented grounding, adversarial fine-tuning, and safety alignment. Finally, we identify open challenges, such as scalable expert evaluation, integrated multi-dimensional metrics, and real-world deployment studies, and propose future research directions to advance the safe, reliable, and transparent deployment of LLM-powered medical QA.
[135] SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat
Yuru Jiang, Wenxuan Ding, Shangbin Feng, Greg Durrett, Yulia Tsvetkov
Main category: cs.CL
TL;DR: SPARTA ALIGNMENT is a collective alignment algorithm where multiple LLMs compete and evaluate each other through duels, using an adapted Elo-ranking system to aggregate scores and create preference pairs for iterative learning.
Details
Motivation: To address the limitations of single models, including lack of diversity in generation and biases in evaluation, by leveraging multiple LLMs in a competitive framework.Method: Multiple LLMs form a ‘sparta tribe’ where they compete in duels to fulfill instructions while serving as judges. An adapted Elo-ranking system aggregates evaluation scores, and combat results become preference pairs for collective learning.
Result: Outperforms initial models and 4 self-alignment baselines across 10 out of 12 tasks with 7.0% average improvement, showing better generalization to unseen tasks and more logical, direct outputs.
Conclusion: SPARTA ALIGNMENT enables effective self-evolution of multiple LLMs through collective competition, leveraging expertise diversity for improved performance and generalization.
Abstract: We propose SPARTA ALIGNMENT, an algorithm to collectively align multiple LLMs through competition and combat. To complement a single model's lack of diversity in generation and biases in evaluation, multiple LLMs form a "sparta tribe" to compete against each other in fulfilling instructions while serving as judges for the competition of others. For each iteration, one instruction and two models are selected for a duel, the other models evaluate the two responses, and their evaluation scores are aggregated through an adapted Elo-ranking-based reputation system, where winners/losers of combat gain/lose weight in evaluating others. The peer-evaluated combat results then become preference pairs where the winning response is preferred over the losing one, and all models learn from these preferences at the end of each iteration. SPARTA ALIGNMENT enables the self-evolution of multiple LLMs in an iterative and collective competition process. Extensive experiments demonstrate that SPARTA ALIGNMENT outperforms initial models and 4 self-alignment baselines across 10 out of 12 tasks and datasets with 7.0% average improvement. Further analysis reveals that SPARTA ALIGNMENT generalizes more effectively to unseen tasks and leverages the expertise diversity of participating models to produce more logical, direct and informative outputs.
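A minimal Elo-style update of the kind an adapted reputation system might use; the judge-weighting rule tied to a model's own rating is a simplifying assumption, not the paper's exact scheme.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a: 1.0 if model A won the duel, 0.0 if it lost, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

def judge_weight(rating: float, base: float = 1000.0) -> float:
    """Higher-rated models get more say when scoring other models' duels."""
    return max(rating / base, 0.1)

r_a, r_b = elo_update(1000.0, 1000.0, score_a=1.0)
print(r_a, r_b)           # 1016.0 984.0
print(judge_weight(r_a))  # 1.016
```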
[136] Transferring Linear Features Across Language Models With Model Stitching
Alan Chen, Jack Merullo, Alessandro Stolfo, Ellie Pavlick
Main category: cs.CL
TL;DR: Affine mappings between language model residual streams enable efficient feature transfer between models of different sizes, reducing SAE training costs by 50% while maintaining performance.
Details
Motivation: To improve training efficiency of expensive components like Sparse Autoencoders (SAEs) by transferring learned representations between models of different sizes.Method: Using affine mappings between residual streams to transfer SAE weights, probes, and steering vectors between small and large language models.
Result: Small and large models learn similar representation spaces; transferred SAEs achieve 50% cheaper training; transferred probes and steering vectors recover ground truth performance; semantic and structural features transfer differently.
Conclusion: Linear representation spaces of small and large models show both similarities and differences, and feature transfer provides an effective method for improving SAE training efficiency.
Abstract: In this work, we demonstrate that affine mappings between the residual streams of language models are a cheap way to effectively transfer represented features between models. We apply this technique to transfer the weights of Sparse Autoencoders (SAEs) between models of different sizes to compare their representations. We find that small and large models learn similar representation spaces, which motivates training expensive components like SAEs on a smaller model and transferring to a larger model at a FLOPs savings. In particular, using a small-to-large transferred SAE as initialization can lead to 50% cheaper training runs when training SAEs on larger models. Next, we show that transferred probes and steering vectors can effectively recover ground truth performance. Finally, we dive deeper into feature-level transferability, finding that semantic and structural features transfer noticeably differently while specific classes of functional features have their roles faithfully mapped. Overall, our findings illustrate similarities and differences in the linear representation spaces of small and large models and demonstrate a method for improving the training efficiency of SAEs.
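A short sketch of fitting an affine map between two models' residual streams by least squares on paired activations; the closed-form fit, the toy data, and the note on reuse are illustrative assumptions, not the paper's training recipe.

```python
import torch

def fit_affine_map(src_acts: torch.Tensor, tgt_acts: torch.Tensor):
    """src_acts: (n, d_src), tgt_acts: (n, d_tgt), collected on the same inputs.
    Returns (W, b) such that tgt ≈ src @ W + b."""
    ones = torch.ones(src_acts.shape[0], 1)
    X = torch.cat([src_acts, ones], dim=1)            # (n, d_src + 1)
    sol = torch.linalg.lstsq(X, tgt_acts).solution    # (d_src + 1, d_tgt)
    return sol[:-1], sol[-1]                          # W: (d_src, d_tgt), b: (d_tgt,)

# Toy usage: map small-model activations into a larger residual space, e.g. to
# initialize an SAE there or carry over a probe or steering direction.
small = torch.randn(512, 64)
large = small @ torch.randn(64, 128) + 0.1 * torch.randn(512, 128)
W, b = fit_affine_map(small, large)
print(((small @ W + b) - large).pow(2).mean().item())   # small reconstruction error
```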
[137] Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque
Oscar Sainz, Naiara Perez, Julen Etxaniz, Joseba Fernandez de Landa, Itziar Aldabe, Iker García-Ferrero, Aimar Zabala, Ekhi Azurmendi, German Rigau, Eneko Agirre, Mikel Artetxe, Aitor Soroa
Main category: cs.CL
TL;DR: This paper presents a method for adapting language models to low-resource languages like Basque using synthetic instructions from multilingual instructed models, without requiring human-created instruction datasets.
Details
Motivation: Instruction datasets are scarce for low-resource languages, limiting the development of instructed language models. The paper aims to find alternatives to conventional instruction adaptation pipelines in such scenarios.Method: Uses available components: target language corpora, multilingual base models, instructed backbone LLMs, and synthetically generated instructions from the instructed backbone. Systematically experiments with different combinations for Basque language adaptation.
Result: Target language corpora are essential, synthetic instructions yield robust models, and using instruction-tuned backbones outperforms base models. Scaling to Llama 3.1 Instruct 70B produces models competitive with much larger frontier models for Basque, without using Basque instructions.
Conclusion: The approach enables effective instruction adaptation for low-resource languages using synthetic instructions and existing multilingual models, achieving near-state-of-the-art performance without human-created instruction datasets.
Abstract: Instructing language models with user intent requires large instruction datasets, which are only available for a limited set of languages. In this paper, we explore alternatives to conventional instruction adaptation pipelines in low-resource scenarios. We assume a realistic scenario for low-resource languages, where only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetically generated instructions sampled from the instructed backbone. We present a comprehensive set of experiments for Basque that systematically study different combinations of these components evaluated on benchmarks and human preferences from 1,680 participants. Our conclusions show that target language corpora are essential, with synthetic instructions yielding robust models, and, most importantly, that using an instruction-tuned model as the backbone outperforms using a base non-instructed model. Scaling up to Llama 3.1 Instruct 70B as the backbone, our model comes close to frontier models of much larger size for Basque, without using any Basque instructions. We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation. https://github.com/hitz-zentroa/latxa-instruct
[138] MultiMatch: Multihead Consistency Regularization Matching for Semi-Supervised Text Classification
Iustin Sirbu, Robert-Adrian Popovici, Cornelia Caragea, Stefan Trausan-Matu, Traian Rebedea
Main category: cs.CL
TL;DR: MultiMatch is a novel semi-supervised learning algorithm that combines co-training and consistency regularization with pseudo-labeling, featuring a pseudo-label weighting module that improves robustness and achieves state-of-the-art results on NLP datasets.
Details
Motivation: To develop a more robust semi-supervised learning approach that effectively handles pseudo-label selection and filtering, addressing limitations in existing methods for text classification tasks.Method: Combines co-training and consistency regularization with pseudo-labeling, featuring a pseudo-label weighting module that uses head agreement, model confidence, and classification difficulty to select, filter, and weight pseudo-labels.
Result: Achieves state-of-the-art results on 8 out of 10 setups from 5 NLP datasets, ranks first among 21 methods in Friedman test, and shows exceptional robustness in imbalanced settings (3.26% improvement over second-best approach).
Conclusion: MultiMatch provides a holistic SSL approach that significantly improves performance and robustness, particularly valuable for real-world text classification tasks with imbalanced data.
Abstract: We introduce MultiMatch, a novel semi-supervised learning (SSL) algorithm combining the paradigms of co-training and consistency regularization with pseudo-labeling. At its core, MultiMatch features a pseudo-label weighting module designed for selecting and filtering pseudo-labels based on head agreement and model confidence, and weighting them according to the perceived classification difficulty. This novel module enhances and unifies three existing techniques – heads agreement from Multihead Co-training, self-adaptive thresholds from FreeMatch, and Average Pseudo-Margins from MarginMatch – resulting in a holistic approach that improves robustness and performance in SSL settings. Experimental results on benchmark datasets highlight the superior performance of MultiMatch, i.e., MultiMatch achieves state-of-the-art results on 8 out of 10 setups from 5 natural language processing datasets and ranks first according to the Friedman test among 21 methods. Furthermore, MultiMatch demonstrates exceptional robustness in highly imbalanced settings, outperforming the second-best approach by 3.26%, a critical advantage for real-world text classification tasks. Our code is available on GitHub.
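An illustrative sketch of a pseudo-label weighting module that combines head agreement, model confidence, and a difficulty signal, as the abstract describes; the exact thresholds and margin definitions here are assumptions, not the paper's formulas.

```python
import torch

def pseudo_label_weights(head_logits: torch.Tensor, threshold: float, margin: torch.Tensor):
    """head_logits: (num_heads, batch, num_classes) predictions on unlabeled data.
    Returns per-example weights combining head agreement, mean confidence, and
    a margin-based difficulty term (illustrative, not the paper's exact recipe)."""
    probs = head_logits.softmax(dim=-1)
    preds = probs.argmax(dim=-1)                        # (num_heads, batch)
    agree = (preds == preds[0]).all(dim=0).float()      # all heads agree on the label
    confidence = probs.max(dim=-1).values.mean(dim=0)   # mean max-probability per example
    passes = (confidence >= threshold).float()          # self-adaptive threshold stand-in
    difficulty = margin.clamp(min=0.0, max=1.0)         # e.g. a running pseudo-margin
    return agree * passes * difficulty
```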
[139] Representation Consistency for Accurate and Coherent LLM Answer Aggregation
Junqi Jiang, Tom Bewley, Salim I. Amoukou, Francesco Leofante, Antonio Rago, Saumitra Mishra, Francesca Toni
Main category: cs.CL
TL;DR: Representation Consistency (RC) is a test-time scaling method that improves LLM performance by aggregating answers based on both frequency and internal activation consistency across multiple responses, without requiring additional model queries.
Details
Motivation: Existing test-time scaling methods require complex modifications to prompting and sampling strategies. RC aims to provide a simpler approach that works regardless of how candidate responses were generated.Method: RC aggregates answers by considering both the frequency of each answer and the consistency of the model’s internal activations (dense or sparse) during response generation. Inconsistent representations indicate incoherent reasoning and are down-weighted.
Result: Experiments with four open-source LLMs and four reasoning datasets show consistent accuracy improvements (up to 4%) over strong test-time scaling baselines, with sparse activations aligning well with coherent reasoning.
Conclusion: RC effectively improves LLM performance during inference by leveraging representation consistency for answer aggregation, requiring only cached activations and lightweight computations without additional model queries.
Abstract: Test-time scaling improves large language models’ (LLMs) performance by allocating more compute budget during inference. To achieve this, existing methods often require intricate modifications to prompting and sampling strategies. In this work, we introduce representation consistency (RC), a test-time scaling method for aggregating answers drawn from multiple candidate responses of an LLM regardless of how they were generated, including variations in prompt phrasing and sampling strategy. RC enhances answer aggregation by not only considering the number of occurrences of each answer in the candidate response set, but also the consistency of the model’s internal activations while generating the set of responses leading to each answer. These activations can be either dense (raw model activations) or sparse (encoded via pretrained sparse autoencoders). Our rationale is that if the model’s representations of multiple responses converging on the same answer are highly variable, this answer is more likely to be the result of incoherent reasoning and should be down-weighted during aggregation. Importantly, our method only uses cached activations and lightweight similarity computations and requires no additional model queries. Through experiments with four open-source LLMs and four reasoning datasets, we validate the effectiveness of RC for improving task performance during inference, with consistent accuracy improvements (up to 4%) over strong test-time scaling baselines. We also show that consistency in the sparse activation signals aligns well with the common notion of coherent reasoning.
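A small sketch of the aggregation idea: answers are scored by how often they occur and how consistent the cached activations behind them are. The count-times-mean-cosine scoring rule is an assumption for illustration, not the paper's exact formula.

```python
import numpy as np
from collections import defaultdict

def rc_aggregate(answers, activations):
    """answers: final answer per response; activations: one cached activation
    vector per response. Score each answer by count * mean pairwise cosine
    similarity, so answers backed by inconsistent representations are down-weighted."""
    groups = defaultdict(list)
    for ans, act in zip(answers, activations):
        groups[ans].append(act / np.linalg.norm(act))
    scores = {}
    for ans, vecs in groups.items():
        if len(vecs) == 1:
            consistency = 1.0
        else:
            sims = [float(v @ w) for i, v in enumerate(vecs) for w in vecs[i + 1:]]
            consistency = float(np.mean(sims))
        scores[ans] = len(vecs) * consistency
    return max(scores, key=scores.get)
```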
[140] Discourse Heuristics For Paradoxically Moral Self-Correction
Guangliang Liu, Zimo Qi, Xitong Zhang, Kristen Marie Johnson
Main category: cs.CL
TL;DR: Moral self-correction in LLMs operates superficially and struggles with identifying causes of moral inconsistency, relying on heuristic shortcuts from training data that create paradoxes when trying to improve both self-correction and self-diagnosis capabilities.
Details
Motivation: To understand and address the paradoxes in moral self-correction where LLMs can superficially correct outputs but struggle to identify root causes of moral inconsistency, and to improve this capability.Method: Analyzed discourse constructions in fine-tuning corpora to uncover heuristic shortcuts, demonstrated reliance on these heuristics, and proposed solutions leveraging curated datasets while examining generalization challenges across contexts and model scales.
Result: Found that moral self-correction depends on heuristic shortcuts from training data, and that attempting to enhance both self-correction and self-diagnosis simultaneously leads to inconsistency due to these heuristics.
Conclusion: Moral self-correction in LLMs is fundamentally limited by heuristic-based approaches from training data, requiring careful dataset curation and facing significant generalization challenges across different contexts and model sizes.
Abstract: Moral self-correction has emerged as a promising approach for aligning the output of Large Language Models (LLMs) with human moral values. However, moral self-correction techniques are subject to two primary paradoxes. First, despite empirical and theoretical evidence to support the effectiveness of self-correction, this LLM capability only operates at a superficial level. Second, while LLMs possess the capability of self-diagnosing immoral aspects of their output, they struggle to identify the cause of this moral inconsistency during their self-correction process. To better understand and address these paradoxes, we analyze the discourse constructions in fine-tuning corpora designed to enhance moral self-correction, uncovering the existence of the heuristics underlying effective constructions. We demonstrate that moral self-correction relies on discourse constructions that reflect heuristic shortcuts, and that the presence of these heuristic shortcuts during self-correction leads to inconsistency when attempting to enhance both self-correction and self-diagnosis capabilities jointly. Based on our findings, we propose a solution to improve moral self-correction by leveraging the heuristics of curated datasets. We also highlight the generalization challenges of this capability, particularly in terms of learning from situated context and model scales.
[141] Context Tuning for In-Context Optimization
Jack Lu, Ryan Teehan, Zhenbang Yang, Mengye Ren
Main category: cs.CL
TL;DR: Context Tuning is a method that enhances few-shot learning in LLMs by initializing trainable prompts with task-specific examples instead of irrelevant tokens, leveraging ICL capabilities for better performance without fine-tuning.
Details
Motivation: Traditional prompt-based adaptation methods initialize prompts with irrelevant tokens, which limits their effectiveness. Context Tuning aims to improve few-shot learning by using task-specific demonstration examples to initialize prompts.Method: Context Tuning initializes trainable prompts or prefixes with task-specific demonstration examples, utilizing the model’s In-Context Learning ability to extract relevant information for improved few-shot adaptation without fine-tuning model parameters.
Result: Extensive evaluations on CrossFit, UnifiedQA, MMLU, BIG-Bench Hard, and ARC show that Context Tuning outperforms traditional prompt-based methods and achieves competitive accuracy to Test-Time Training with significantly higher training efficiency.
Conclusion: Context Tuning provides a simple yet effective approach for few-shot adaptation of LLMs, demonstrating superior performance over traditional prompt-based methods while maintaining high training efficiency.
Abstract: We introduce Context Tuning, a simple and effective method to significantly enhance few-shot adaptation of large language models (LLMs) without fine-tuning model parameters. While prompt-based adaptation techniques have demonstrated the effectiveness of lightweight adaptation methods for LLMs, they typically initialize a trainable prompt or prefix with irrelevant tokens for the task at hand. In contrast, Context Tuning initializes the trainable prompt or prefix with task-specific demonstration examples, leveraging the model’s inherent In-Context Learning (ICL) ability to extract relevant information for improved few-shot learning performance. Extensive evaluations on benchmarks such as CrossFit, UnifiedQA, MMLU, BIG-Bench Hard, and ARC demonstrate that Context Tuning outperforms traditional prompt-based adaptation methods and achieves competitive accuracy to Test-Time Training with significantly higher training efficiency.
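A minimal sketch of the initialization step: the trainable soft prompt starts from the embeddings of task-specific demonstrations rather than random or irrelevant tokens. It assumes a Hugging Face-style causal LM with `get_input_embeddings()`; the rest of the training loop is unchanged prompt tuning.

```python
import torch
import torch.nn as nn

def init_context_prompt(model, tokenizer, demonstrations: str) -> nn.Parameter:
    """Build a trainable soft prompt initialized from demonstration-example
    embeddings (a sketch of the idea, not the paper's exact implementation)."""
    ids = tokenizer(demonstrations, return_tensors="pt").input_ids
    with torch.no_grad():
        init = model.get_input_embeddings()(ids).squeeze(0)  # (prompt_len, hidden)
    return nn.Parameter(init.clone())  # only this tensor is optimized
```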
[142] On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study
Riccardo Alberghi, Elizaveta Demyanenko, Luca Biggio, Luca Saglietti
Main category: cs.CL
TL;DR: Training LLMs on inefficient reasoning traces with backtracking improves generalization better than optimal dynamic programming traces, despite using the same token budget, due to increased model confidence in next-token prediction.
Details
Motivation: To study how reasoning trace efficiency affects LLM generalization, particularly whether longer, less efficient traces provide better training signals than optimal but shorter traces.Method: Used shortest-path tasks in layered graphs, trained decoder-only transformers on question-trace-answer triples with custom tokenizer, comparing models trained on optimal dynamic programming traces vs. longer backtracking traces.
Result: Models trained on inefficient backtracking traces generalized better to unseen graphs than those trained on optimal traces, with generalization correlating with model confidence in next-token prediction.
Conclusion: Long, coherent, and locally incremental reasoning traces make training signals easier to optimize, leading to better generalization, rather than trace length alone.
Abstract: Recent advances in natural language processing highlight two key factors for improving reasoning in large language models (LLMs): (i) allocating more test-time compute tends to help on harder problems but often introduces redundancy in the reasoning trace, and (ii) compute is most effective when reasoning is systematic and incremental, forming structured chains of thought (CoTs) akin to human problem-solving. To study these factors in isolation, we introduce a controlled setting based on shortest-path tasks in layered graphs. We train decoder-only transformers on question-trace-answer triples using a custom tokenizer, comparing models trained on optimal bottom-up dynamic programming traces with those trained on longer, valid traces involving backtracking. Surprisingly, with the same training-token budget, models trained on inefficient traces generalize better to unseen graphs. This benefit is not due to length alone: injecting arbitrary redundancy into reasoning traces fails to help and can even hurt performance. Instead, we find that generalization correlates with the model’s confidence in next-token prediction, suggesting that long, coherent, and locally incremental traces make the training signal easier to optimize.
[143] SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains
Krithika Ramesh, Daniel Smolyak, Zihao Zhao, Nupoor Gandhi, Ritu Agarwal, Margrét Bjarnadóttir, Anjalie Field
Main category: cs.CL
TL;DR: SynthTextEval is a toolkit for comprehensive evaluation of synthetic text across multiple dimensions including utility, fairness, privacy risk, distributional differences, and expert feedback.
Details
Motivation: The fluency of LLM outputs makes synthetic text viable for applications like privacy preservation in high-stakes AI systems, but requires principled evaluation methods.Method: Provides a toolkit that allows users to conduct multi-dimensional evaluations over uploaded or generated synthetic data, with functionality demonstrated on healthcare and law datasets.
Result: The toolkit enables standardized evaluation metrics across key dimensions to assess synthetic text quality and risks.
Conclusion: By consolidating and standardizing evaluation metrics, SynthTextEval aims to improve the viability of synthetic text and enhance privacy-preservation in AI development.
Abstract: We present SynthTextEval, a toolkit for conducting comprehensive evaluations of synthetic text. The fluency of large language model (LLM) outputs has made synthetic text potentially viable for numerous applications, such as reducing the risks of privacy violations in the development and deployment of AI systems in high-stakes domains. Realizing this potential, however, requires principled, consistent evaluations of synthetic data across multiple dimensions: its utility in downstream systems, the fairness of these systems, the risk of privacy leakage, general distributional differences from the source text, and qualitative feedback from domain experts. SynthTextEval allows users to conduct evaluations along all of these dimensions over synthetic data that they upload or generate using the toolkit’s generation module. While our toolkit can be run over any data, we highlight its functionality and effectiveness over datasets from two high-stakes domains: healthcare and law. By consolidating and standardizing evaluation metrics, we aim to improve the viability of synthetic text and, in turn, privacy preservation in AI development.
[144] An Exploration of Knowledge Editing for Arabic
Basel Mousi, Nadir Durrani, Fahim Dalvi
Main category: cs.CL
TL;DR: First study of Knowledge Editing in Arabic, evaluating four methods on Arabic benchmarks and showing parameter-based methods struggle with cross-lingual generalization while instruction-tuned methods perform better.
Details
Motivation: Knowledge Editing has been widely explored in English but remains underexamined in morphologically rich languages like Arabic.Method: Evaluated four KE methods (ROME, MEMIT, ICE, LTE) on Arabic translations of ZsRE and Counterfact benchmarks, analyzing multilingual and cross-lingual settings on Llama-2-7B-chat.
Result: Parameter-based methods struggle with cross-lingual generalization, while instruction-tuned methods perform more robustly. Extending LTE to multilingual setting with joint Arabic-English training improves both editability and transfer.
Conclusion: Released Arabic KE benchmarks and multilingual training data for LTE to support future research in Arabic knowledge editing.
Abstract: While Knowledge Editing (KE) has been widely explored in English, its behavior in morphologically rich languages like Arabic remains underexamined. In this work, we present the first study of Arabic KE. We evaluate four methods (ROME, MEMIT, ICE, and LTE) on Arabic translations of the ZsRE and Counterfact benchmarks, analyzing both multilingual and cross-lingual settings. Our experiments on Llama-2-7B-chat show that parameter-based methods struggle with cross-lingual generalization, while instruction-tuned methods perform more robustly. We extend Learning-To-Edit (LTE) to a multilingual setting and show that joint Arabic-English training improves both editability and transfer. We release Arabic KE benchmarks and multilingual training data for LTE to support future research.
[145] Large Language Models as Medical Codes Selectors: a benchmark using the International Classification of Primary Care
Vinicius Anjos de Almeida, Vinicius de Camargo, Raquel Gómez-Bravo, Egbert van der Haring, Kees van Boven, Marcelo Finger, Luis Fernandez Lopez
Main category: cs.CL
TL;DR: LLMs can effectively automate ICPC-2 medical coding using semantic search engine outputs, with top models achieving F1-scores >0.85 without fine-tuning.
Details
Motivation: To assess the potential of large language models for automating medical coding tasks, specifically assigning ICPC-2 codes using domain-specific search engine outputs, which could improve healthcare data processing for research and policy.Method: Used 437 Brazilian Portuguese clinical expressions annotated with ICPC-2 codes. A semantic search engine retrieved candidates from 73,563 labeled concepts, and 33 LLMs were prompted with queries and retrieved results to select the best-matching ICPC-2 code.
Result: 28 models achieved F1-score >0.8, with 10 exceeding 0.85. Top performers included gpt-4.5-preview, o3, and gemini-2.5-pro. Retriever optimization improved performance by up to 4 points. Most models returned valid codes with reduced hallucinations, though smaller models (<3B) struggled with formatting and input length.
Conclusion: LLMs show strong potential for automating ICPC-2 coding without fine-tuning. However, broader multilingual evaluations and clinical validation are needed, as findings are limited by dataset scope and setup.
Abstract: Background: Medical coding structures healthcare data for research, quality monitoring, and policy. This study assesses the potential of large language models (LLMs) to assign ICPC-2 codes using the output of a domain-specific search engine. Methods: A dataset of 437 Brazilian Portuguese clinical expressions, each annotated with ICPC-2 codes, was used. A semantic search engine (OpenAI’s text-embedding-3-large) retrieved candidates from 73,563 labeled concepts. Thirty-three LLMs were prompted with each query and retrieved results to select the best-matching ICPC-2 code. Performance was evaluated using F1-score, along with token usage, cost, response time, and format adherence. Results: Twenty-eight models achieved F1-score > 0.8; ten exceeded 0.85. Top performers included gpt-4.5-preview, o3, and gemini-2.5-pro. Retriever optimization can improve performance by up to 4 points. Most models returned valid codes in the expected format, with reduced hallucinations. Smaller models (<3B) struggled with formatting and input length. Conclusions: LLMs show strong potential for automating ICPC-2 coding, even without fine-tuning. This work offers a benchmark and highlights challenges, but findings are limited by dataset scope and setup. Broader, multilingual, end-to-end evaluations are needed for clinical validation.
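A small sketch of the retrieve-then-select prompt the pipeline implies: the semantic search engine supplies candidate codes and the LLM is asked to pick one. The prompt wording and candidate format here are hypothetical, not the paper's exact prompt.

```python
def build_selection_prompt(expression: str, candidates) -> str:
    """Compose a prompt asking an LLM to choose the best-matching ICPC-2 code
    from retrieved candidates (illustrative format)."""
    lines = [f"Clinical expression: {expression}", "Candidate ICPC-2 codes:"]
    lines += [f"- {code}: {label}" for code, label in candidates]
    lines.append("Answer with exactly one code from the list above.")
    return "\n".join(lines)

# Example usage with hypothetical retrieved candidates:
# prompt = build_selection_prompt("dor lombar", [("L03", "Low back symptom/complaint"),
#                                                ("L84", "Back syndrome without radiating pain")])
```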
[146] Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation
Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef van Genabith, Simon Ostermann
Main category: cs.CL
TL;DR: Identifies language-specific neurons in multilingual LLMs and demonstrates how to manipulate them through language arithmetics to control language behavior across various multilingual tasks.
Details
Motivation: To understand the neural mechanisms behind language-specific processing in multilingual LLMs and develop methods to control language behavior.Method: Used Language Activation Probability Entropy (LAPE) method to identify language-specific neurons, then applied language arithmetics (activation addition and multiplication) to steer models by deactivating unwanted languages and activating desired ones.
Result: Successfully manipulated language behavior across five multilingual tasks (language forcing, translation, QA, comprehension, NLI), with better performance for high-resource languages and improved effectiveness with typological similarity. Cross-lingual neuron steering enhanced downstream performance.
Conclusion: Language-specific neurons cluster in deeper layers, related languages share overlapping neurons, and systematic neuron manipulation can effectively control language behavior in multilingual LLMs, revealing internal fallback mechanisms for language selection.
Abstract: Large language models (LLMs) exhibit strong multilingual abilities, yet the neural mechanisms behind language-specific processing remain unclear. We analyze language-specific neurons in Llama-3.1-8B, Mistral-Nemo-12B, and Aya-Expanse-8B & 32B across 21 typologically diverse languages, identifying neurons that control language behavior. Using the Language Activation Probability Entropy (LAPE) method, we show that these neurons cluster in deeper layers, with non-Latin scripts showing greater specialization. Related languages share overlapping neurons, reflecting internal representations of linguistic proximity. Through language arithmetics, i.e. systematic activation addition and multiplication, we steer models to deactivate unwanted languages and activate desired ones, outperforming simpler replacement approaches. These interventions effectively guide behavior across five multilingual tasks: language forcing, translation, QA, comprehension, and NLI. Manipulation is more successful for high-resource languages, while typological similarity improves effectiveness. We also demonstrate that cross-lingual neuron steering enhances downstream performance and reveal internal “fallback” mechanisms for language selection when neurons are progressively deactivated. Our code is made publicly available at https://github.com/d-gurgurov/Language-Neurons-Manipulation.
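A minimal sketch of "language arithmetics" as an activation intervention: a forward hook that multiplies and adds to selected neurons so one language can be dampened and another boosted. The layer choice, neuron indices, and intervention values below are placeholders; the paper derives them per language via LAPE.

```python
import torch

def make_language_steering_hook(neuron_ids, add: float = 0.0, scale: float = 1.0):
    """Forward hook applying activation multiplication and addition to the
    chosen neurons of a module's output (illustrative sketch)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[..., neuron_ids] = hidden[..., neuron_ids] * scale + add
        return output
    return hook

# Usage (assumed decoder architecture with model.model.layers): suppress one
# language's neurons (scale=0) or boost another's (add>0) at a chosen layer.
# layer = model.model.layers[20].mlp
# handle = layer.register_forward_hook(make_language_steering_hook([11, 42], add=2.0))
# ... run generation ...; handle.remove()
```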
[147] Mafoko: Structuring and Building Open Multilingual Terminologies for South African NLP
Vukosi Marivate, Isheanesu Dzingirai, Fiskani Banda, Richard Lastrucci, Thapelo Sindane, Keabetswe Madumo, Kayode Olaleye, Abiodun Modupe, Unarine Netshifhefhe, Herkulaas Combrink, Mohlatlego Nakeng, Matome Ledwaba
Main category: cs.CL
TL;DR: Mafoko addresses the lack of structured terminological data for South Africa’s official languages by aggregating fragmented terminology resources into open, interoperable datasets, demonstrating improved machine translation accuracy through RAG integration.
Details
Motivation: The critical lack of structured terminological data for South Africa's official languages hampers multilingual NLP progress, with valuable terminology resources remaining fragmented and locked in non-machine-readable formats.Method: Systematically aggregating, cleaning, and standardising scattered terminology resources into open datasets under the NOODL framework, then integrating the terminology into a Retrieval-Augmented Generation (RAG) pipeline.
Result: Experiments show substantial improvements in accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models.
Conclusion: Mafoko provides a scalable foundation for developing robust and equitable NLP technologies, ensuring South Africa’s rich linguistic diversity is represented in the digital age.
Abstract: The critical lack of structured terminological data for South Africa’s official languages hampers progress in multilingual NLP, despite the existence of numerous government and academic terminology lists. These valuable assets remain fragmented and locked in non-machine-readable formats, rendering them unusable for computational research and development. Mafoko addresses this challenge by systematically aggregating, cleaning, and standardising these scattered resources into open, interoperable datasets. We introduce the foundational Mafoko dataset, released under the equitable, Africa-centered NOODL framework. To demonstrate its immediate utility, we integrate the terminology into a Retrieval-Augmented Generation (RAG) pipeline. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models. Mafoko provides a scalable foundation for developing robust and equitable NLP technologies, ensuring South Africa’s rich linguistic diversity is represented in the digital age.
[148] Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency
Aman Goel, Daniel Schwartz, Yanjun Qi
Main category: cs.CL
TL;DR: Finch-Zk is a black-box framework that detects and mitigates hallucinations in LLM outputs using cross-model consistency checking and targeted corrections, improving detection F1 scores by 6-39% and answer accuracy by up to 9 percentage points.
Details
Motivation: LLMs are susceptible to hallucinations - generating plausible but factually inaccurate content - which limits their reliability in production systems.Method: Uses fine-grained cross-model consistency checking by comparing responses from diverse models on semantically-equivalent prompts, plus targeted mitigation that applies precise corrections to problematic segments while preserving accurate content.
Result: Improves hallucination detection F1 scores by 6-39% on FELM dataset and achieves up to 9% absolute improvement in answer accuracy on GPQA-diamond dataset with state-of-the-art models like Llama 4 Maverick and Claude 4 Sonnet.
Conclusion: Finch-Zk provides a practical, deployment-ready safeguard for enhancing factual reliability in production LLM systems without requiring external knowledge sources.
Abstract: Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, but they remain susceptible to hallucinations: generating content that appears plausible but contains factual inaccuracies. We present Finch-Zk, a black-box framework that leverages fine-grained cross-model consistency to detect and mitigate hallucinations in LLM outputs without requiring external knowledge sources. Finch-Zk introduces two key innovations: 1) a cross-model consistency checking strategy that reveals fine-grained inaccuracies by comparing responses generated by diverse models from semantically-equivalent prompts, and 2) a targeted mitigation technique that applies precise corrections to problematic segments while preserving accurate content. Experiments on the FELM dataset show Finch-Zk improves hallucination detection F1 scores by 6-39% compared to existing approaches. For mitigation, Finch-Zk achieves up to 9 absolute percentage points improvement in answer accuracy on the GPQA-diamond dataset when applied to state-of-the-art models like Llama 4 Maverick and Claude 4 Sonnet. Extensive evaluation on multiple datasets demonstrates that Finch-Zk provides a practical, deployment-ready safeguard for enhancing factual reliability in production LLM systems.
[149] From BERT to LLMs: Comparing and Understanding Chinese Classifier Prediction in Language Models
Ziqi Zhang, Jianfei Ma, Emmanuele Chersoni, Jieshun You, Zhaoxin Feng
Main category: cs.CL
TL;DR: LLMs perform worse than BERT in predicting Chinese classifiers, even with fine-tuning. Bidirectional attention and noun information are crucial for accurate predictions.
Details
Motivation: To evaluate whether popular LLMs possess proper knowledge of Chinese classifiers, which are important for educational applications but remain unexplored in NLP literature.Method: Used various masking strategies to evaluate LLMs’ intrinsic ability, contribution of sentence elements, and attention mechanisms. Also explored fine-tuning LLMs to enhance classifier performance.
Result: LLMs performed worse than BERT in predicting Chinese classifiers, even with fine-tuning. Prediction greatly benefits from information about the following noun, explaining the advantage of bidirectional attention models like BERT.
Conclusion: BERT’s bidirectional attention mechanism provides significant advantages over standard LLMs for Chinese classifier prediction, highlighting the importance of contextual noun information in this task.
Abstract: Classifiers are an important and defining feature of the Chinese language, and their correct prediction is key to numerous educational applications. Yet, whether the most popular Large Language Models (LLMs) possess proper knowledge of Chinese classifiers is an issue that has largely remained unexplored in the Natural Language Processing (NLP) literature. To address this question, we employ various masking strategies to evaluate the LLMs’ intrinsic ability, the contribution of different sentence elements, and the working of the attention mechanisms during prediction. In addition, we explore fine-tuning LLMs to enhance classifier performance. Our findings reveal that LLMs perform worse than BERT, even with fine-tuning. The prediction, as expected, greatly benefits from the information about the following noun, which also explains the advantage of models with a bidirectional attention mechanism such as BERT.
[150] OpinioRAG: Towards Generating User-Centric Opinion Highlights from Large-scale Online Reviews
Mir Tafseer Nayeem, Davood Rafiei
Main category: cs.CL
TL;DR: OpinioRAG is a scalable, training-free framework that uses RAG-based evidence retrieval with LLMs to generate personalized opinion summaries from thousands of user reviews, addressing the limitations of existing methods that produce generic summaries.
Details
Motivation: Existing methods for opinion highlights generation from large volumes of user reviews either fail to scale or produce generic, one-size-fits-all summaries that overlook personalized needs.Method: Introduces OpinioRAG, a scalable framework combining RAG-based evidence retrieval with LLMs, and proposes novel reference-free verification metrics for sentiment-rich domains. Also contributes a large-scale dataset of long-form user reviews with expert summaries and annotated queries.
Result: Through extensive experiments, the framework demonstrates effectiveness in generating accurate, relevant, and structured summaries at scale, while identifying key challenges and providing actionable insights.
Conclusion: OpinioRAG positions itself as a robust framework for scalable opinion summary generation, paving the way for future research in this domain.
Abstract: We study the problem of opinion highlights generation from large volumes of user reviews, often exceeding thousands per entity, where existing methods either fail to scale or produce generic, one-size-fits-all summaries that overlook personalized needs. To tackle this, we introduce OpinioRAG, a scalable, training-free framework that combines RAG-based evidence retrieval with LLMs to efficiently produce tailored summaries. Additionally, we propose novel reference-free verification metrics designed for sentiment-rich domains, where accurately capturing opinions and sentiment alignment is essential. These metrics offer a fine-grained, context-sensitive assessment of factual consistency. To facilitate evaluation, we contribute the first large-scale dataset of long-form user reviews, comprising entities with over a thousand reviews each, paired with unbiased expert summaries and manually annotated queries. Through extensive experiments, we identify key challenges, provide actionable insights into improving systems, pave the way for future research, and position OpinioRAG as a robust framework for generating accurate, relevant, and structured summaries at scale.
[151] Verbalized Algorithms
Supriya Lall, Christian Farrell, Hari Pathanjaly, Marko Pavic, Sarvesh Chezhian, Masataro Asai
Main category: cs.CL
TL;DR: Verbalized Algorithms (VAs) use LLMs as simple operation oracles within classical algorithms instead of one-shot querying for complex reasoning tasks.
Details
Motivation: To improve reliability by limiting LLMs to simple, well-defined operations within proven algorithms rather than relying on them for complex reasoning in one-shot queries.Method: Decompose tasks into elementary operations on natural language strings, using LLMs as binary comparison oracles within established algorithms like bitonic sorting networks.
Result: Demonstrated effectiveness on sorting and clustering tasks by leveraging classical algorithms with theoretical guarantees.
Conclusion: Verbalized Algorithms provide a more reliable approach by combining LLMs’ natural language capabilities with the theoretical soundness of classical algorithms.
Abstract: Instead of querying LLMs in a one-shot manner and hoping to get the right answer for a reasoning task, we propose a paradigm we call verbalized algorithms (VAs), which leverage classical algorithms with established theoretical understanding. VAs decompose a task into simple elementary operations on natural language strings that they should be able to answer reliably, and limit the scope of LLMs to only those simple tasks. For example, for sorting a series of natural language strings, verbalized sorting uses an LLM as a binary comparison oracle in a known and well-analyzed sorting algorithm (e.g., bitonic sorting network). We demonstrate the effectiveness of this approach on sorting and clustering tasks.
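A minimal sketch of verbalized sorting with the LLM restricted to a binary comparison oracle. It uses Python's built-in sort for brevity (the paper uses, e.g., a bitonic sorting network); `ask_llm` is a placeholder for any yes/no question-answering callable.

```python
from functools import cmp_to_key

def llm_compare(a: str, b: str, ask_llm) -> int:
    """Elementary operation delegated to the LLM: a single yes/no comparison."""
    answer = ask_llm(f"Should '{a}' come before '{b}'? Answer yes or no.")
    return -1 if answer.strip().lower().startswith("yes") else 1

def verbalized_sort(items, ask_llm):
    """A classical, well-understood algorithm drives the process; the model
    only serves as the comparison oracle."""
    return sorted(items, key=cmp_to_key(lambda a, b: llm_compare(a, b, ask_llm)))
```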
[152] Teaching According to Talents! Instruction Tuning LLMs with Competence-Aware Curriculum Learning
Yangning Li, Tingwei Lu, Yinghui Li, Yankai Chen, Wei-Chieh Huang, Wenhao Jiang, Hui Wang, Hai-Tao Zheng, Philip S. Yu
Main category: cs.CL
TL;DR: CAMPUS is a dynamic curriculum learning framework for instruction tuning that adapts to model capabilities during training, overcoming the rigidity of static difficulty metrics in traditional methods.
Details
Motivation: Current curriculum learning methods for instruction tuning suffer from curriculum rigidity - they use static difficulty metrics that don't adapt to the model's evolving capabilities during training, leading to suboptimal learning trajectories.Method: CAMPUS framework features: (1) dynamic selection for sub-curriculum, (2) competency-aware adjustment to curriculum schedule, and (3) multiple difficulty-based scheduling to adapt to model capabilities.
Result: Extensive experiments show CAMPUS achieves superior performance compared to other state-of-the-art baselines for efficient instruction tuning.
Conclusion: CAMPUS effectively addresses curriculum rigidity in instruction tuning by dynamically adapting to model capabilities, resulting in improved performance over static curriculum methods.
Abstract: Efficient instruction tuning aims to enhance the ultimate performance of large language models (LLMs) trained on a given instruction dataset. Curriculum learning as a typical data organization strategy has shown preliminary effectiveness in instruction tuning. However, current curriculum tuning methods suffer from curriculum rigidity, since they rely solely on static heuristic difficulty metrics. These methods fail to adapt to the evolving capabilities of models during training, resulting in a fixed and potentially sub-optimal learning trajectory. To address this issue, we propose CAMPUS, a Competence-Aware Multi-Perspective cUrriculum inStruction tuning framework. CAMPUS offers several advantages: (1) Dynamic selection for sub-curriculum. (2) Competency-aware adjustment to the curriculum schedule. (3) Multiple difficulty-based scheduling. Extensive experiments demonstrate the superior performance of CAMPUS compared to other state-of-the-art baselines for efficient instruction tuning.
[153] Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses
Fangyi Yu, Nabeel Seedat, Dasha Herrmannova, Frank Schilder, Jonathan Richard Schwarz
Main category: cs.CL
TL;DR: DeCE is a decomposed LLM evaluation framework that separates precision (factual accuracy) and recall (coverage) for evaluating long-form answers in expert domains like law and medicine, achieving strong correlation with expert judgments.
Details
Motivation: Standard metrics like BLEU and ROUGE fail to capture semantic correctness in high-stakes domains, and current LLM-based evaluators often reduce nuanced answer quality into a single undifferentiated score.Method: DeCE separates precision (factual accuracy and relevance) and recall (coverage of required concepts) using instance-specific criteria automatically extracted from gold answer requirements. It is model-agnostic and domain-general without requiring predefined taxonomies or handcrafted rubrics.
Result: DeCE achieves substantially stronger correlation with expert judgments (r=0.78) compared to traditional metrics (r=0.12), pointwise LLM scoring (r=0.35), and modern multidimensional evaluators (r=0.48). It reveals interpretable trade-offs between generalist and specialized models.
Conclusion: DeCE offers an interpretable and actionable LLM evaluation framework in expert domains, with only 11.95% of LLM-generated criteria requiring expert revision, underscoring its scalability.
Abstract: Evaluating long-form answers in high-stakes domains such as law or medicine remains a fundamental challenge. Standard metrics like BLEU and ROUGE fail to capture semantic correctness, and current LLM-based evaluators often reduce nuanced aspects of answer quality into a single undifferentiated score. We introduce DeCE, a decomposed LLM evaluation framework that separates precision (factual accuracy and relevance) and recall (coverage of required concepts), using instance-specific criteria automatically extracted from gold answer requirements. DeCE is model-agnostic and domain-general, requiring no predefined taxonomies or handcrafted rubrics. We instantiate DeCE to evaluate different LLMs on a real-world legal QA task involving multi-jurisdictional reasoning and citation grounding. DeCE achieves substantially stronger correlation with expert judgments ($r=0.78$), compared to traditional metrics ($r=0.12$), pointwise LLM scoring ($r=0.35$), and modern multidimensional evaluators ($r=0.48$). It also reveals interpretable trade-offs: generalist models favor recall, while specialized models favor precision. Importantly, only 11.95% of LLM-generated criteria required expert revision, underscoring DeCE’s scalability. DeCE offers an interpretable and actionable LLM evaluation framework in expert domains.
[154] EmbeddingGemma: Powerful and Lightweight Text Representations
Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Martins, Jinhyuk Lee, Mark Sherwood, Juyeong Ji, Renjie Wu, Jingxiao Zheng, Jyotinder Singh, Abheesht Sharma, Divyashree Sreepathihalli, Aashi Jain, Adham Elarabawy, AJ Co, Andreas Doumanoglou, Babak Samari, Ben Hora, Brian Potetz, Dahun Kim, Enrique Alfonseca, Fedor Moiseev, Feng Han, Frank Palma Gomez, Gustavo Hernández Ábrego, Hesen Zhang, Hui Hui, Jay Han, Karan Gill, Ke Chen, Koert Chen, Madhuri Shanbhogue, Michael Boratko, Paul Suganthan, Sai Meher Karthik Duddu, Sandeep Mariserla, Setareh Ariafar, Shanfeng Zhang, Shijie Zhang, Simon Baumgartner, Sonam Goenka, Steve Qiu, Tanmaya Dabral, Trevor Walker, Vikram Rao, Waleed Khawaja, Wenlei Zhou, Xiaoqi Ren, Ye Xia, Yichang Chen, Yi-Ting Chen, Zhe Dong, Zhongli Ding, Francesco Visin, Gaël Liu, Jiageng Zhang, Kathleen Kenealy, Michelle Casbon, Ravin Kumar, Thomas Mesnard, Zach Gleicher, Cormac Brick, Olivier Lacombe, Adam Roberts, Qin Yin, Yunhsuan Sung, Raphael Hoffmann, Tris Warkentin, Armand Joulin, Tom Duerig, Mojtaba Seyedhosseini
Main category: cs.CL
TL;DR: EmbeddingGemma is a lightweight 300M parameter text embedding model that achieves state-of-the-art performance on MTEB benchmark, outperforming larger models through innovative training techniques.
Details
Motivation: To create an efficient text embedding model that provides exceptional performance-to-cost ratio for practical applications like on-device use, while being open-source and lightweight.Method: Uses encoder-decoder initialization and geometric embedding distillation to capture knowledge from larger models, spread-out regularizer for robustness, and merges checkpoints from optimized mixtures for generalizability.
Result: Achieves state-of-the-art results on MTEB across multilingual, English, and code domains, outperforms prior top models with fewer than 500M parameters, and provides comparable performance to models double its size.
Conclusion: EmbeddingGemma offers exceptional efficiency and performance, making it ideal for low-latency, high-throughput applications while maintaining strong performance even when quantized or truncated.
Abstract: We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and geometric embedding distillation. We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. Notably, it outperforms prior top models, both proprietary and open, with fewer than 500M parameters, and provides performance comparable to models double its size, offering an exceptional performance-to-cost ratio. Remarkably, this lead persists when quantizing model weights or truncating embedding outputs. This makes EmbeddingGemma particularly well-suited for low-latency and high-throughput use cases such as on-device applications. We provide ablation studies exploring our key design choices. We release EmbeddingGemma to the community to promote further research.
[155] Measuring Algorithmic Partisanship via Zero-Shot Classification and Its Implications on Political Discourse
Nathan Junzi Chen
Main category: cs.CL
TL;DR: This study evaluates political biases in six mainstream LLMs using zero-shot classification to measure ideological alignment, topicality, sentiment, and objectivity, finding amplified liberal-authoritarian alignment across models.
Details
Motivation: To address the problem of internalized political biases in generative AI systems stemming from training data skews, human prejudice, and algorithmic flaws that can influence political discourse.Method: Used zero-shot classification approach with 1800 model responses across six LLMs, analyzed through four distinct fine-tuned classification algorithms measuring ideological alignment, topicality, response sentiment, and objectivity.
Result: Found amplified liberal-authoritarian alignment across all six LLMs, with instances of reasoning supersessions and canned refusals. The biases can manifest as conformity or polarization depending on regional socio-political structures.
Conclusion: The study highlights how intrinsic biases in AI systems can permeate public discourse and distort the political landscape, emphasizing the psychological influences in human-computer interactions.
Abstract: Amidst the rapid normalization of generative artificial intelligence (GAI), intelligent systems have come to dominate political discourse across information media. However, internalized political biases stemming from training data skews, human prejudice, and algorithmic flaws continue to plague this novel technology. This study employs a zero-shot classification approach to evaluate algorithmic political partisanship through a methodical combination of ideological alignment, topicality, response sentiment, and objectivity. A total of 1800 model responses across six mainstream large language models (LLMs) were individually input into four distinct fine-tuned classification algorithms, each responsible for computing one of the aforementioned metrics. The results show an amplified liberal-authoritarian alignment across the six LLMs evaluated, with notable instances of reasoning supersessions and canned refusals. The study subsequently highlights the psychological influences underpinning human-computer interactions and how intrinsic biases can permeate public discourse. The resulting distortion of the political landscape can ultimately manifest as conformity or polarization, depending on the region’s pre-existing socio-political structures.
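For readers unfamiliar with the technique, a zero-shot classification call over a model response looks like the sketch below. The `facebook/bart-large-mnli` checkpoint and the label set are common defaults used for illustration only; the study uses its own four fine-tuned classifiers and metrics.

```python
from transformers import pipeline

# Zero-shot classification of an LLM response along an ideological-alignment axis.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "Example LLM response about a policy question.",
    candidate_labels=["liberal", "conservative", "neutral"],
)
print(result["labels"][0], result["scores"][0])  # top label and its score
```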
[156] Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time
Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor
Main category: cs.CL
TL;DR: Inoculation prompting modifies finetuning data by prepending prompts that deliberately elicit undesirable traits, reducing their expression at test time while maintaining desired behaviors.
Details
Motivation: Language model finetuning often results in learning undesirable traits alongside desired ones, creating a need for selective learning techniques.Method: Prepend short system-prompt instructions to finetuning data that deliberately elicit undesirable traits, then evaluate without these instructions at test time.
Result: Inoculated models show much lower expression of undesirable traits across multiple settings including reducing emergent misalignment, defending against backdoor injections, and mitigating trait transmission via subliminal learning.
Conclusion: Inoculation is an effective technique for selective learning that reduces optimization pressure by making traits less surprising, contributing to better understanding of how language models generalize.
Abstract: Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., "You always speak in Spanish.") teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.
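A small sketch of the data transformation: prepend the trait-eliciting instruction to every training example's system prompt, then evaluate without it. The chat-dict record format is an assumption for illustration.

```python
def inoculate(dataset, inoculation_instruction: str):
    """Prepend a system-prompt instruction that deliberately elicits the
    undesirable trait to each finetuning example (sketch over a list of
    {'system', 'user', 'assistant'} dicts)."""
    out = []
    for ex in dataset:
        ex = dict(ex)  # avoid mutating the caller's records
        ex["system"] = (inoculation_instruction + " " + ex.get("system", "")).strip()
        out.append(ex)
    return out

# e.g. train_data = inoculate(train_data, "You always speak in Spanish.")
# At test time, prompts are used without the inoculation instruction.
```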
[157] Debiasing LLMs by Masking Unfairness-Driving Attention Heads
Tingxu Han, Wei Song, Ziqi Ding, Ziming Li, Chunrong Fang, Yuekang Li, Dongfang Liu, Zhenyu Chen, Zhenting Wang
Main category: cs.CL
TL;DR: DiffHeads is a lightweight debiasing framework that identifies and masks bias heads in LLMs through differential activation analysis between Direct-Answer and Chain-of-Thought prompting, reducing unfairness by 49.4% and 40.3% respectively without harming utility.
Details
Motivation: LLMs increasingly mediate decisions in domains requiring fair treatment, but existing bias mitigation approaches are fragile and lack insight into the underlying mechanisms that generate biased outputs.Method: 1) Compare DA vs CoT prompting across 8 LLMs; 2) Define token-to-head contribution score to trace bias to specific attention heads; 3) Propose DiffHeads framework that identifies bias heads through differential activation analysis and selectively masks them.
Result: DA triggering increases unfairness by 534.5%-391.9%; identified small cluster of bias heads that activate under DA but stay dormant with CoT; DiffHeads reduces unfairness by 49.4% under DA and 40.3% under CoT without harming model utility.
Conclusion: DiffHeads provides a causal link between prompting strategy and bias emergence, offering an effective lightweight debiasing approach by targeting specific bias heads rather than the entire model.
Abstract: Large language models (LLMs) increasingly mediate decisions in domains where unfair treatment of demographic groups is unacceptable. Existing work probes when biased outputs appear, but gives little insight into the mechanisms that generate them, leaving existing mitigations largely fragile. In this paper, we conduct a systematic investigation of LLM unfairness and propose DiffHeads, a lightweight debiasing framework for LLMs. We first compare Direct-Answer (DA) prompting to Chain-of-Thought (CoT) prompting across eight representative open- and closed-source LLMs. DA prompting triggers the model’s inherent bias and increases measured unfairness by 534.5%-391.9% in both one-turn and two-turn dialogues. Next, we define a token-to-head contribution score that traces each token’s influence back to individual attention heads. This reveals a small cluster of bias heads that activate under DA but stay largely dormant with CoT, providing the first causal link between prompting strategy and bias emergence. Finally, building on this insight, we propose DiffHeads, which identifies bias heads through differential activation analysis between DA and CoT, and selectively masks only those heads. DiffHeads reduces unfairness by 49.4% and 40.3% under DA and CoT, respectively, without harming model utility.
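A minimal sketch of the masking step once bias heads have been identified: zero out the per-head outputs before the attention output projection. The tensor layout and the head indices are assumptions for illustration; DiffHeads selects the heads via differential activation analysis between DA and CoT.

```python
import torch

def mask_attention_heads(per_head_outputs: torch.Tensor, bias_heads) -> torch.Tensor:
    """per_head_outputs: (batch, num_heads, seq_len, head_dim) attention outputs
    before the output projection. Returns a copy with the selected heads zeroed."""
    masked = per_head_outputs.clone()
    masked[:, bias_heads, :, :] = 0.0
    return masked

# e.g. masked = mask_attention_heads(per_head_outputs, bias_heads=[3, 17])
```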
[158] Dynamic Topic Evolution with Temporal Decay and Attention in Large Language Models
Di Wu, Shuaidong Pan
Main category: cs.CL
TL;DR: A framework for dynamic topic evolution using temporal LLMs with decay functions and attention mechanisms to capture topic changes over time through semantic embeddings and state transitions.
Details
Motivation: To systematically model how topics evolve, expand, and decline over time in large-scale text data, addressing the need for temporal-aware topic analysis.Method: Uses LLM embeddings with temporal decay functions and attention mechanisms, maps to latent topic space with state transition matrix, and applies joint optimization for semantic and temporal consistency.
Result: Effectively captures topic generation, expansion, and decline; outperforms existing models across multiple metrics on real-world corpora.
Conclusion: Provides a systematic solution for dynamic semantic pattern understanding, enriches topic modeling research, and supports complex text analysis across domains.
Abstract: This paper proposes a modeling framework for dynamic topic evolution based on temporal large language models. The method first uses a large language model to obtain contextual embeddings of text and then introduces a temporal decay function and an attention mechanism. These components allow the model to adjust the importance of semantic units according to time intervals and capture topic variations across different periods. The temporal representations are then mapped into a latent topic space, where a state transition matrix is applied to describe the dynamic evolution of topics. A joint optimization objective constrains both semantic modeling and temporal consistency, ensuring diversity and smoothness in topic generation. The design emphasizes the unified modeling of semantic representation and temporal evolution, which improves topic coherence and diversity while enhancing stability and interpretability over time. Experiments on real-world corpora show that the framework effectively captures the generation, expansion, and decline of topics and outperforms existing models across multiple metrics. Overall, the proposed method provides a systematic solution for understanding dynamic semantic patterns in large-scale text, enriches the research paradigm of topic modeling, and supports complex text analysis tasks in multiple domains.
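An illustrative sketch of the temporal-decay weighting applied to attention scores, assuming an exponential decay in the time gap; the paper's exact decay function and how it composes with attention may differ.

```python
import numpy as np

def temporal_decay_weights(timestamps, query_time, lam: float = 0.1) -> np.ndarray:
    """Exponential temporal decay: semantic units further from query_time
    contribute less (lam controls how fast importance fades)."""
    deltas = np.abs(np.asarray(timestamps, dtype=float) - query_time)
    return np.exp(-lam * deltas)

def decayed_attention(scores, timestamps, query_time, lam: float = 0.1) -> np.ndarray:
    """Combine raw attention scores with decay weights, then renormalize."""
    weights = np.asarray(scores, dtype=float) * temporal_decay_weights(timestamps, query_time, lam)
    return weights / weights.sum()
```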
[159] MedREK: Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts
Shujun Xia, Haokun Lin, Yichen Wu, Yinan Zhou, Zixuan Li, Zhongwei Wan, Xingrun Xing, Yefeng Zheng, Xiang Li, Caifeng Shan, Zhenan Sun, Quanzheng Li
Main category: cs.CL
TL;DR: MedREK is a retrieval-based editing framework for medical LLMs that addresses representation overlap and enables batch-editing, outperforming existing methods on medical benchmarks.
Details
Motivation: LLMs generate outdated/inaccurate medical information due to rapid knowledge evolution and training errors, limiting clinical use. Current editing methods have locality issues or can't handle batch edits needed for real-world applications.Method: Proposed MedREK framework with shared query-key module for precise matching and attention-based prompt encoder for guidance. Also created MedVersa benchmark for evaluating single/batch edits under locality constraints.
Result: MedREK achieves superior performance across various medical benchmarks and provides the first validated solution for batch-editing in medical LLMs.
Conclusion: MedREK effectively addresses critical challenges in medical LLM editing through improved retrieval accuracy and batch-editing capability, enabling more reliable clinical applications.
Abstract: LLMs hold great promise for healthcare applications, but the rapid evolution of medical knowledge and errors in training data often cause them to generate outdated or inaccurate information, limiting their applicability in high-stakes clinical practice. Model editing has emerged as a potential remedy without full retraining. While parameter-based editing often compromises locality and is thus ill-suited for the medical domain, retrieval-based editing offers a more viable alternative. However, it still faces two critical challenges: (1) representation overlap within the medical knowledge space often causes inaccurate retrieval and reduces editing accuracy; (2) existing methods are restricted to single-sample edits, while batch-editing remains largely unexplored despite its importance for real-world medical applications. To address these challenges, we first construct MedVersa, an enhanced benchmark with broader coverage of medical subjects, designed to evaluate both single and batch edits under strict locality constraints. We then propose MedREK, a retrieval-based editing framework that integrates a shared query-key module for precise matching with an attention-based prompt encoder for informative guidance. Experimental results on various medical benchmarks demonstrate that our MedREK achieves superior performance across different core metrics and provides the first validated solution for batch-editing in medical LLMs. Our code and dataset are available at https://github.com/mylittleriver/MedREK.
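The retrieval-based editing idea can be pictured as a key-value store of edits consulted before answering. The sketch below is my own simplification rather than MedREK's shared query-key module: it returns a stored edited answer only when the query embedding is close enough to a stored edit key, and otherwise defers to the base model.

```python
import torch

class EditMemory:
    """Toy retrieval-based editor: store (key embedding, edited answer) pairs and
    return the stored answer when a query is similar enough, else return None."""
    def __init__(self, threshold=0.8):
        self.keys, self.answers, self.threshold = [], [], threshold

    def add_edit(self, key_emb: torch.Tensor, answer: str):
        self.keys.append(torch.nn.functional.normalize(key_emb, dim=0))
        self.answers.append(answer)

    def query(self, query_emb: torch.Tensor):
        if not self.keys:
            return None
        q = torch.nn.functional.normalize(query_emb, dim=0)
        sims = torch.stack([q @ k for k in self.keys])   # cosine similarity to each edit key
        best = int(sims.argmax())
        return self.answers[best] if sims[best] >= self.threshold else None
```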
[160] Interpreting the Latent Structure of Operator Precedence in Language Models
Dharunish Yugeswardeenoo, Harshil Nukala, Ved Shah, Cole Blondin, Sean O Brien, Vasu Sharma, Kevin Zhu
Main category: cs.CL
TL;DR: LLMs struggle with arithmetic tasks despite strong reasoning capabilities. This paper investigates whether LLaMA 3.2-3B encodes operator precedence internally using interpretability techniques.
Details
Motivation: To understand the internal structure through which LLMs perform arithmetic computation, specifically whether they encode operator precedence in their representations.
Method: Used a dataset of arithmetic expressions with varying parentheses, applied interpretability techniques (logit lens, linear classification probes, UMAP visualization), and introduced partial embedding swap to modify operator precedence.
Result: Found that intermediate computations appear in the residual stream (especially after MLP blocks) and that the model linearly encodes precedence in operator embeddings post attention layer.
Conclusion: LLMs internally represent arithmetic operator precedence, with intermediate computations detectable in the residual stream and precedence encoded in operator embeddings.
Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities but continue to struggle with arithmetic tasks. Prior works largely focus on outputs or prompting strategies, leaving the open question of the internal structure through which models do arithmetic computation. In this work, we investigate whether LLMs encode operator precedence in their internal representations via the open-source instruction-tuned LLaMA 3.2-3B model. We constructed a dataset of arithmetic expressions with three operands and two operators, varying the order and placement of parentheses. Using this dataset, we trace whether intermediate results appear in the residual stream of the instruction-tuned LLaMA 3.2-3B model. We apply interpretability techniques such as logit lens, linear classification probes, and UMAP geometric visualization. Our results show that intermediate computations are present in the residual stream, particularly after MLP blocks. We also find that the model linearly encodes precedence in each operator’s embeddings post attention layer. We introduce partial embedding swap, a technique that modifies operator precedence by exchanging high-impact embedding dimensions between operators.
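The "partial embedding swap" intervention can be sketched as exchanging a few high-impact dimensions between two operator token embeddings. The token ids and the way dimensions are ranked below are placeholders; the probe the authors use to identify high-impact dimensions is not reproduced.

```python
import torch

def partial_embedding_swap(emb: torch.nn.Embedding, tok_a: int, tok_b: int, dims: torch.Tensor):
    """Swap the selected embedding dimensions between two operator tokens in place."""
    with torch.no_grad():
        tmp = emb.weight[tok_a, dims].clone()
        emb.weight[tok_a, dims] = emb.weight[tok_b, dims]
        emb.weight[tok_b, dims] = tmp

# toy usage: swap the 8 dimensions that differ most between two hypothetical operator tokens
emb = torch.nn.Embedding(100, 64)
plus_id, times_id = 10, 11                     # hypothetical token ids for '+' and '*'
dims = (emb.weight[plus_id] - emb.weight[times_id]).abs().topk(8).indices
partial_embedding_swap(emb, plus_id, times_id, dims)
```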
[161] Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers
Tuhin Chakrabarty, Jane C. Ginsburg, Paramveer Dhillon
Main category: cs.CL
TL;DR: Fine-tuning AI models on authors’ complete works dramatically improves their ability to emulate literary styles, making AI-generated text preferred over expert human writing by both experts and lay readers, while reducing detectable AI stylistic quirks.
Details
Motivation: To investigate whether AI models can generate high-quality literary text that effectively emulates authors' styles, addressing copyright concerns about AI's ability to create derivative content.
Method: Preregistered study comparing MFA-trained expert writers with ChatGPT, Claude, and Gemini using both in-context prompting and fine-tuning approaches. 159 expert and lay readers conducted blind pairwise evaluations of 450-word excerpts emulating 50 award-winning authors’ styles.
Result: In-context prompting was strongly disfavored by experts for stylistic fidelity and writing quality, but fine-tuning completely reversed these findings: experts now preferred AI-generated text. Fine-tuned outputs were rarely detected as AI-generated (3% vs 97% for in-context) and showed dramatic cost reduction ($81 per author vs typical writer compensation).
Conclusion: Author-specific fine-tuning enables AI to produce non-verbatim writing that readers prefer to expert human writing, providing empirical evidence relevant to copyright’s fair-use considerations regarding market impact.
Abstract: The use of copyrighted books for training AI models has led to numerous lawsuits from authors concerned about AI’s ability to generate derivative content. Yet it’s unclear if these models can generate high quality literary text while emulating authors’ styles. To answer this we conducted a preregistered study comparing MFA-trained expert writers with three frontier AI models: ChatGPT, Claude & Gemini in writing up to 450 word excerpts emulating 50 award-winning authors’ diverse styles. In blind pairwise evaluations by 159 representative expert & lay readers, AI-generated text from in-context prompting was strongly disfavored by experts for both stylistic fidelity (OR=0.16, p<10^-8) & writing quality (OR=0.13, p<10^-7) but showed mixed results with lay readers. However, fine-tuning ChatGPT on individual authors’ complete works completely reversed these findings: experts now favored AI-generated text for stylistic fidelity (OR=8.16, p<10^-13) & writing quality (OR=1.87, p=0.010), with lay readers showing similar shifts. These effects generalize across authors & styles. The fine-tuned outputs were rarely flagged as AI-generated (3% rate v. 97% for in-context prompting) by best AI detectors. Mediation analysis shows this reversal occurs because fine-tuning eliminates detectable AI stylistic quirks (e.g., cliche density) that penalize in-context outputs. While we do not account for additional costs of human effort required to transform raw AI output into cohesive, publishable prose, the median fine-tuning & inference cost of $81 per author represents a dramatic 99.7% reduction compared to typical professional writer compensation. Author-specific fine-tuning thus enables non-verbatim AI writing that readers prefer to expert human writing, providing empirical evidence directly relevant to copyright’s fourth fair-use factor, the “effect upon the potential market or value” of the source works.
[162] Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting
Chenchen Tan, Youyang Qu, Xinghao Li, Hui Zhang, Shujie Cui, Cunjian Chen, Longxiang Gao
Main category: cs.CL
TL;DR: The Attention-Shifting (AS) framework addresses the dilemma in machine unlearning for LLMs by balancing utility preservation with effective data removal through attention-level interventions.
Details
Motivation: Existing unlearning approaches either compromise model utility or risk hallucinated responses, limiting LLMs' reliability in knowledge-intensive applications.
Method: AS uses two attention-level interventions: importance-aware suppression for unlearning data and attention-guided retention enhancement for retained data, jointly optimized via a dual-loss objective.
Result: AS improves performance preservation by up to 15% higher accuracy on ToFU benchmark and 10% on TDEC benchmark while maintaining competitive hallucination-free unlearning effectiveness.
Conclusion: AS demonstrates superior balance between unlearning effectiveness, generalization, and response reliability compared to existing methods.
Abstract: The increase in computing power and the necessity of AI-assisted decision-making boost the growing application of large language models (LLMs). Along with this, the potential retention of sensitive data of LLMs has spurred increasing research into machine unlearning. However, existing unlearning approaches face a critical dilemma: Aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses. This significantly limits LLMs’ reliability in knowledge-intensive applications. To address this, we introduce a novel Attention-Shifting (AS) framework for selective unlearning. AS is driven by two design objectives: (1) context-preserving suppression that attenuates attention to fact-bearing tokens without disrupting LLMs’ linguistic structure; and (2) hallucination-resistant response shaping that discourages fabricated completions when queried about unlearning content. AS realizes these objectives through two attention-level interventions, which are importance-aware suppression applied to the unlearning set to reduce reliance on memorized knowledge and attention-guided retention enhancement that reinforces attention toward semantically essential tokens in the retained dataset to mitigate unintended degradation. These two components are jointly optimized via a dual-loss objective, which forms a soft boundary that localizes unlearning while preserving unrelated knowledge under representation superposition. Experimental results show that AS improves performance preservation over the state-of-the-art unlearning methods, achieving up to 15% higher accuracy on the ToFU benchmark and 10% on the TDEC benchmark, while maintaining competitive hallucination-free unlearning effectiveness. Compared to existing methods, AS demonstrates a superior balance between unlearning effectiveness, generalization, and response reliability.
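The dual-loss structure, a suppression term on the forget set balanced against a retention term on the retained data, can be sketched generically as below. This shows only the overall shape of such an objective; the paper's attention-level suppression and retention-enhancement terms are not reproduced, and the loss signs and weighting are assumptions.

```python
import torch.nn.functional as F

def dual_unlearning_loss(logits_forget, targets_forget, logits_retain, targets_retain, lam=1.0):
    """Generic dual objective: penalize confident recall on the forget set while
    preserving standard next-token accuracy on the retain set."""
    suppress = -F.cross_entropy(logits_forget, targets_forget)   # gradient-ascent-style suppression
    retain = F.cross_entropy(logits_retain, targets_retain)      # keep retained knowledge intact
    return suppress + lam * retain
```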
[163] Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?
Teague McMillan, Gabriele Dominici, Martin Gjoreski, Marc Langheinrich
Main category: cs.CL
TL;DR: LLMs often generate unfaithful explanations that don’t reflect their actual reasoning, which is especially problematic in healthcare. The study examines how inference and training choices affect explanation faithfulness across different models and datasets.
Details
Motivation: Unfaithful explanations in LLMs can undermine clinician trust and lead to unsafe decision support in healthcare settings by omitting clinical cues or masking spurious shortcuts.
Method: Evaluated three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on BBQ (social bias) and MedQA datasets, manipulating few-shot examples, prompting strategies, and training procedures to measure faithfulness.
Result: Few-shot example quantity and quality significantly impact faithfulness; faithfulness is sensitive to prompting design; instruction-tuning improves faithfulness on MedQA.
Conclusion: The findings provide strategies for enhancing LLM interpretability and trustworthiness in sensitive domains through careful design of inference and training procedures.
Abstract: Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. In healthcare settings, such unfaithfulness is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. We study how inference and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets-BBQ (social bias) and MedQA (medical licensing questions), and manipulate the number and type of few-shot examples, prompting strategies, and training procedure. Our results show: (i) both the quantity and quality of few-shot examples significantly impact model faithfulness; (ii) faithfulness is sensitive to prompting design; (iii) the instruction-tuning phase improves measured faithfulness on MedQA. These findings offer insights into strategies for enhancing the interpretability and trustworthiness of LLMs in sensitive domains.
[164] Comprehensive and Efficient Distillation for Lightweight Sentiment Analysis Models
Guangyu Xie, Yice Zhang, Jianzhu Bao, Qianlong Wang, Yang Sun, Bingbing Wang, Ruifeng Xu
Main category: cs.CL
TL;DR: CompEffDist is a comprehensive and efficient distillation framework for sentiment analysis that uses attribute-based automatic instruction construction and difficulty-based data filtering to enable 3B student models to match 20x larger teacher models’ performance with only 10% of the data.
Details
Motivation: Current sentiment analysis distillation methods face two key challenges: manually written instructions lack diversity and quantity for comprehensive knowledge coverage, and large-scale user texts incur high computational costs that hinder practicality.
Method: The framework consists of two key modules: attribute-based automatic instruction construction to generate diverse instructions, and difficulty-based data filtering to reduce computational costs by selecting the most valuable training samples.
Result: 3B student models achieved performance comparable to 20x larger teacher models on most tasks, and the method greatly outperformed baselines in data efficiency, achieving the same performance level with only 10% of the data.
Conclusion: CompEffDist provides an effective solution for developing lightweight sentiment analysis models through comprehensive knowledge distillation while maintaining high data efficiency and computational practicality.
Abstract: Recent efforts leverage knowledge distillation techniques to develop lightweight and practical sentiment analysis models. These methods are grounded in human-written instructions and large-scale user texts. Despite the promising results, two key challenges remain: (1) manually written instructions are limited in diversity and quantity, making them insufficient to ensure comprehensive coverage of distilled knowledge; (2) large-scale user texts incur high computational cost, hindering the practicality of these methods. To this end, we introduce CompEffDist, a comprehensive and efficient distillation framework for sentiment analysis. Our framework consists of two key modules: attribute-based automatic instruction construction and difficulty-based data filtering, which correspondingly tackle the aforementioned challenges. Applying our method across multiple model series (Llama-3, Qwen-3, and Gemma-3), we enable 3B student models to match the performance of 20x larger teacher models on most tasks. In addition, our approach greatly outperforms baseline methods in data efficiency, attaining the same performance level with only 10% of the data.
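A plausible reading of the difficulty-based filtering module is to score each user text (for example, by teacher loss or teacher-student disagreement) and keep only the most informative slice. The scoring function, difficulty band, and budget below are assumptions for illustration, not the paper's criteria.

```python
def filter_by_difficulty(samples, difficulty_fn, low=0.2, high=0.8, budget=None):
    """Keep samples whose difficulty lies in a mid range, assuming very easy and
    very hard items contribute little to distillation; optionally cap the total count."""
    scored = sorted(((difficulty_fn(s), s) for s in samples), key=lambda x: x[0])
    kept = [s for d, s in scored if low <= d <= high]
    return kept[:budget] if budget else kept
```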
[165] Scaling Latent Reasoning via Looped Language Models
Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou, Wei Ye, Jiaheng Liu, Jian Yang, Yunfeng Shi, Chenghua Lin, Enduo Zhao, Tianle Cai, Ge Zhang, Wenhao Huang, Yoshua Bengio, Jason Eshraghian
Main category: cs.CL
TL;DR: Ouro is a family of pre-trained Looped Language Models that integrate reasoning into pre-training through iterative latent space computation, entropy-regularized depth allocation, and large-scale training on 7.7T tokens, achieving superior performance with smaller models.
Details
Motivation: Current LLMs rely on explicit text generation like chain-of-thought for reasoning, which defers reasoning to post-training and under-utilizes pre-training data. The authors aim to build reasoning capabilities directly into the pre-training phase.
Method: Three key components: (1) iterative computation in latent space, (2) entropy-regularized objective for learned depth allocation, (3) scaling to 7.7T tokens. The approach is named after the recursive Ouroboros concept.
Result: Ouro 1.4B and 2.6B models match the performance of up to 12B state-of-the-art LLMs across various benchmarks. The advantage comes from superior knowledge manipulation capabilities rather than increased knowledge capacity. LoopLM produces reasoning traces more aligned with final outputs than explicit chain-of-thought.
Conclusion: LoopLM represents a promising novel scaling direction for reasoning capabilities in language models, demonstrating the potential of integrating reasoning directly into pre-training rather than relying on post-training techniques.
Abstract: Modern LLMs are trained to “think” primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and 2.6B models enjoy superior performance that match the results of up to 12B SOTA LLMs across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our model is available here: http://ouro-llm.github.io.
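The looped-computation idea, one shared block iterated in latent space with a learned signal for how much depth to spend, might look roughly like the sketch below. The block choice, loop count, and halting head are illustrative assumptions rather than Ouro's actual architecture or its entropy-regularized objective.

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """Sketch of a looped language model layer: one shared block applied up to
    max_loops times, with a per-step halting score for learned depth allocation."""
    def __init__(self, d_model=512, max_loops=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.halt = nn.Linear(d_model, 1)
        self.max_loops = max_loops

    def forward(self, h):
        halting = []
        for _ in range(self.max_loops):
            h = self.block(h)                                  # reuse the same weights each iteration
            halting.append(torch.sigmoid(self.halt(h)).mean())  # how much depth this step "wants"
        return h, torch.stack(halting)                         # halting scores feed a depth objective
```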
[166] AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
Dinghong Song, Yuan Feng, Yiwei Wang, Shangye Chen, Cyril Guyot, Filip Blagojevic, Hyeran Jeon, Pengfei Su, Dong Li
Main category: cs.CL
TL;DR: AttnCache is a framework that accelerates LLM prefill inference by reusing similar attention maps from a cache, reducing self-attention computation overhead.
Details
Motivation: Many real-world workloads use only the prefill stage of LLM inference, where self-attention's quadratic complexity becomes a major performance bottleneck.
Method: The framework builds an attention map memorization database and uses efficient caching and similarity search to retrieve and reuse similar attention maps during inference.
Result: Achieves 1.2-1.6x end-to-end speedup and 2-3x attention speedup on CPU/GPU with negligible accuracy degradation.
Conclusion: AttnCache effectively accelerates prefill-only LLM inference by exploiting attention map similarity across different inputs.
Abstract: Large Language Models (LLMs) are widely used in generative applications such as chatting, code generation, and reasoning. However, many real-world workloads such as classification, question answering, recommendation, and text embedding rely solely on the prefill stage of inference, where the model encodes input sequences without performing autoregressive decoding. In these prefill-only scenarios, the self-attention computation becomes the primary performance bottleneck due to its quadratic complexity with respect to sequence length. In this paper, we observe that semantically different sentences often produce similar attention maps across layers and heads. Building on this insight, we propose AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps. Based on an attention map memorization database, AttnCache employs efficient caching and similarity search techniques to identify and reuse pre-cached attention maps during inference, thereby reducing the computational overhead of self-attention. Experimental results show that AttnCache achieves an average of 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with negligible accuracy degradation.
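Conceptually, the cache pairs a compact representation of each input with its attention maps and reuses them when a new input is similar enough. The sketch below uses brute-force cosine similarity and a fixed threshold as stand-ins for the paper's similarity-search machinery; the key representation is an assumption.

```python
import torch

class AttnCache:
    """Toy attention-map cache: reuse a stored map when the current key vector
    is close enough (cosine similarity) to a previously cached one."""
    def __init__(self, threshold=0.98):
        self.keys, self.maps, self.threshold = [], [], threshold

    def lookup(self, key: torch.Tensor):
        if not self.keys:
            return None
        sims = torch.stack([torch.cosine_similarity(key, k, dim=0) for k in self.keys])
        best = sims.argmax()
        return self.maps[best] if sims[best] >= self.threshold else None

    def store(self, key: torch.Tensor, attn_map: torch.Tensor):
        self.keys.append(key)
        self.maps.append(attn_map)
```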
[167] QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback
Taku Mikuriya, Tatsuya Ishigaki, Masayuki Kawarada, Shunya Minami, Tadashi Kadowaki, Yohichi Suzuki, Soshun Naito, Shunya Takata, Takumi Kato, Tamotsu Basseda, Reo Yamada, Hiroya Takamura
Main category: cs.CL
TL;DR: QCoder Benchmark is an evaluation framework that assesses LLMs on quantum programming with hardware feedback, showing reasoning-based models outperform both standard LLMs and human coders.
Details
Motivation: To address the gap in evaluating LLMs for quantum programming domains that require interaction with hardware devices, where conventional code generation approaches are insufficient.
Method: Created QCoder Benchmark with quantum simulator environment for domain-specific metrics feedback and incorporated human-written code from real programming contests for comparison.
Result: Advanced models like GPT-4o achieved only 18.97% accuracy, while reasoning-based models like o3 reached 78% accuracy, outperforming human-written code success rates (39.98%).
Conclusion: The benchmark demonstrates the difficulty of quantum programming for LLMs and shows reasoning-based approaches are more effective, with the framework released to support further research.
Abstract: Large language models (LLMs) have increasingly been applied to automatic programming code generation. This task can be viewed as a language generation task that bridges natural language, human knowledge, and programming logic. However, it remains underexplored in domains that require interaction with hardware devices, such as quantum programming, where human coders write Python code that is executed on a quantum computer. To address this gap, we introduce QCoder Benchmark, an evaluation framework that assesses LLMs on quantum programming with feedback from simulated hardware devices. Our benchmark offers two key features. First, it supports evaluation using a quantum simulator environment beyond conventional Python execution, allowing feedback of domain-specific metrics such as circuit depth, execution time, and error classification, which can be used to guide better generation. Second, it incorporates human-written code submissions collected from real programming contests, enabling both quantitative comparisons and qualitative analyses of LLM outputs against human-written codes. Our experiments reveal that even advanced models like GPT-4o achieve only around 18.97% accuracy, highlighting the difficulty of the benchmark. In contrast, reasoning-based models such as o3 reach up to 78% accuracy, outperforming averaged success rates of human-written codes (39.98%). We release the QCoder Benchmark dataset and public evaluation API to support further research. (Codes and datasets are available at https://qcoder-bench.github.io/ )
[168] Hebrew Diacritics Restoration using Visual Representation
Yair Elboher, Yuval Pinter
Main category: cs.CL
TL;DR: DIVRIT is a novel Hebrew diacritization system that frames the task as zero-shot classification using a Hebrew Visual Language Model to process undiacritized text as images.
Details
Motivation: Hebrew diacritics restoration is crucial for accurate pronunciation and meaning disambiguation, as the language is highly ambiguous when unvocalized.
Method: Operates at word level using zero-shot classification with dynamically generated candidate sets. Uses a Hebrew Visual Language Model that processes undiacritized text as images to embed diacritic information directly in vector representations.
Result: Achieves high accuracy in oracle settings where correct diacritized forms are guaranteed among candidates. Architectural enhancements and optimized training significantly improve generalization capabilities.
Conclusion: Visual representations show promising potential for accurate and automated Hebrew diacritization without complex linguistic analysis.
Abstract: Diacritics restoration in Hebrew is a fundamental task for ensuring accurate word pronunciation and disambiguating textual meaning. Despite the language’s high degree of ambiguity when unvocalized, recent machine learning approaches have significantly advanced performance on this task. In this work, we present DIVRIT, a novel system for Hebrew diacritization that frames the task as a zero-shot classification problem. Our approach operates at the word level, selecting the most appropriate diacritization pattern for each undiacritized word from a dynamically generated candidate set, conditioned on the surrounding textual context. A key innovation of DIVRIT is its use of a Hebrew Visual Language Model, which processes undiacritized text as an image, allowing diacritic information to be embedded directly within the input’s vector representation. Through a comprehensive evaluation across various configurations, we demonstrate that the system effectively performs diacritization without relying on complex, explicit linguistic analysis. Notably, in an “oracle” setting where the correct diacritized form is guaranteed to be among the provided candidates, DIVRIT achieves a high level of accuracy. Furthermore, strategic architectural enhancements and optimized training methodologies yield significant improvements in the system’s overall generalization capabilities. These findings highlight the promising potential of visual representations for accurate and automated Hebrew diacritization.
[169] SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding
Yiqiao Jin, Rachneet Kaur, Zhen Zeng, Sumitra Ganesh, Srijan Kumar
Main category: cs.CL
TL;DR: SlideAgent is an agentic framework that improves multi-page visual document understanding through specialized agents for global, page, and element-level reasoning, achieving significant performance gains over existing models.
Details
Motivation: Current LLM systems struggle with complex multi-page visual documents that convey information through layout, colors, icons, and cross-slide references, particularly in fine-grained reasoning over elements and pages.
Method: SlideAgent employs specialized agents and decomposes reasoning into three levels (global, page, element) to construct a structured, query-agnostic representation. During inference, it selectively activates specialized agents for multi-level reasoning and integrates their outputs.
Result: Extensive experiments show SlideAgent achieves significant improvement over both proprietary (+7.9 overall) and open-source models (+9.8 overall).
Conclusion: SlideAgent provides an effective framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks, through hierarchical agentic reasoning.
Abstract: Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While large language models (LLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three specialized levels (global, page, and element) to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent achieves significant improvement over both proprietary (+7.9 overall) and open-source models (+9.8 overall).
[170] Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo Pang, Junjie Yan, Zhejun Jiang, Weixiao Huang, Bohong Yin, Jiacheng You, Chu Wei, Zhengtao Wang, Chao Hong, Yutian Chen, Guanduo Chen, Yucheng Wang, Huabin Zheng, Feng Wang, Yibo Liu, Mengnan Dong, Zheng Zhang, Siyuan Pan, Wenhao Wu, Yuhao Wu, Longyu Guan, Jiawen Tao, Guohong Fu, Xinran Xu, Yuzhi Wang, Guokun Lai, Yuxin Wu, Xinyu Zhou, Zhilin Yang, Yulun Du
Main category: cs.CL
TL;DR: Kimi Linear is a hybrid linear attention architecture that outperforms full attention across various scenarios using Kimi Delta Attention (KDA) with improved gating and efficient chunkwise algorithms, achieving better performance while reducing KV cache usage by up to 75% and increasing decoding throughput up to 6x.
Details
Motivation: To develop a linear attention architecture that can outperform full attention under fair comparisons across different scenarios (short-context, long-context, RL scaling) while maintaining efficiency.
Method: Uses Kimi Delta Attention (KDA), an expressive linear attention module extending Gated DeltaNet with finer-grained gating, combined with a bespoke chunkwise algorithm using specialized DPLR transition matrices for hardware efficiency. Implements a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA).
Result: A 3B activated parameter model (48B total) outperforms full MLA across all evaluated tasks, reduces KV cache usage by up to 75%, and achieves up to 6 times decoding throughput for 1M context length.
Conclusion: Kimi Linear serves as a drop-in replacement for full attention architectures with superior performance and efficiency, particularly for tasks with longer input and output lengths.
Abstract: We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios – including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
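As background for the gated delta-rule family that KDA extends, the sketch below shows a naive token-by-token recurrence with a scalar forget gate on the state. The real KDA uses finer-grained (channel-wise) gating and a chunkwise DPLR kernel for hardware efficiency, neither of which is reproduced here; this is a conceptual illustration only.

```python
import torch

def gated_delta_rule(q, k, v, beta, alpha):
    """Toy recurrent form of gated delta-rule linear attention.
    q, k: [T, d_k]; v: [T, d_v]; beta (write strength), alpha (forget gate): [T]."""
    d_k, d_v = k.shape[1], v.shape[1]
    S = torch.zeros(d_v, d_k)                                    # finite-state memory (value x key)
    outs = []
    for t in range(q.shape[0]):
        S = alpha[t] * S                                         # forget part of the state
        S = S + beta[t] * torch.outer(v[t] - S @ k[t], k[t])     # delta-rule error correction
        outs.append(S @ q[t])                                    # read out with the query
    return torch.stack(outs)
```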
cs.CV
[171] Generative human motion mimicking through feature extraction in denoising diffusion settings
Alexander Okupnik, Johannes Schneider, Kyriakos Flouris
Main category: cs.CV
TL;DR: The paper presents an interactive AI model for creative dance collaboration that generates responsive movement sequences by mimicking and enhancing human motion input using diffusion models, motion inpainting, and style transfer.
Details
Motivation: To explore embodied human-AI interaction through dance, complementing the verbal interaction capabilities of large language models with physical, creative movement collaboration.
Method: Uses single-person motion capture data and high-level features with diffusion models, motion inpainting, and motion style transfer to generate temporally coherent movement representations that respond to human movement input.
Result: The model successfully generates diverse and realistic dance movements that show various deviations from human input while maintaining coherence, with quantitative assessment showing convergence between generated and test set feature distributions.
Conclusion: The approach represents first steps toward creative dancing with AI, enabling realistic movement generation that creatively enhances human input without relying on human-human interaction data.
Abstract: Recent success with large language models has sparked a new wave of verbal human-AI interaction. While such models support users in a variety of creative tasks, they lack the embodied nature of human interaction. Dance, as a primal form of human expression, is predestined to complement this experience. To explore creative human-AI interaction exemplified by dance, we build an interactive model based on motion capture (MoCap) data. It generates an artificial other by partially mimicking and also “creatively” enhancing an incoming sequence of movement data. It is the first model that leverages single-person motion data and high-level features to do so; thus, it does not rely on low-level human-human interaction data. It combines ideas of two diffusion models, motion inpainting, and motion style transfer to generate movement representations that are both temporally coherent and responsive to a chosen movement reference. The success of the model is demonstrated by quantitatively assessing the convergence of the feature distribution of the generated samples and the test set, which serves as a stand-in for the human performer. We show that our generations are first steps toward creative dancing with AI, as they are both diverse, showing various deviations from the human partner, and realistic in appearance.
[172] Med-Banana-50K: A Cross-modality Large-Scale Dataset for Text-guided Medical Image Editing
Zhihui Chen, Mengling Feng
Main category: cs.CV
TL;DR: Med-Banana-50K is a comprehensive 50K-image dataset for instruction-based medical image editing across three modalities and 23 disease types, featuring systematic medical quality control and including failed attempts for preference learning.
Details
Motivation: The research community lacks large-scale, high-quality, openly accessible datasets specifically designed for medical image editing with strict anatomical and clinical constraints.
Method: Dataset constructed using Gemini-2.5-Flash-Image to generate bidirectional edits from real medical images, with LLM-as-Judge medical quality control using a medically grounded rubric and history-aware iterative refinement up to five rounds.
Result: Created Med-Banana-50K dataset spanning chest X-ray, brain MRI, and fundus photography with 23 disease types, including 37K failed attempts with full conversation logs for preference learning.
Conclusion: Med-Banana-50K establishes a foundation for training and evaluating next-generation medical image editing models by providing a large-scale, medically validated, and fully documented resource.
Abstract: Recent advances in multimodal large language models have enabled remarkable medical image editing capabilities. However, the research community’s progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built specifically for medical image editing with strict anatomical and clinical constraints. We introduce Med-Banana-50K, a comprehensive 50K-image dataset for instruction-based medical image editing spanning three modalities (chest X-ray, brain MRI, fundus photography) and 23 disease types. Our dataset is constructed by leveraging Gemini-2.5-Flash-Image to generate bidirectional edits (lesion addition and removal) from real medical images. What distinguishes Med-Banana-50K from general-domain editing datasets is our systematic approach to medical quality control: we employ LLM-as-Judge with a medically grounded rubric (instruction compliance, structural plausibility, realism, and fidelity preservation) and history-aware iterative refinement up to five rounds. Beyond single-turn editing, Med-Banana-50K includes 37K failed attempts with full conversation logs for preference learning and alignment research. By providing this large-scale, medically validated, and fully documented resource, Med-Banana-50K establishes a foundation for training and evaluating the next generation of medical image editing models. Our dataset and code are publicly available at [https://github.com/richardChenzhihui/med-banana-50k].
[173] Deep Learning Models for Coral Bleaching Classification in Multi-Condition Underwater Image Datasets
Julio Jerison E. Macrohon, Gordon Hung
Main category: cs.CV
TL;DR: A machine learning system for coral bleaching classification using CNN, ResNet, and ViT models, with CNN achieving 88% accuracy on a diverse global dataset.
Details
Motivation: Coral reefs are vital marine ecosystems facing increasing threats from pollution, ocean acidification, and temperature anomalies, making efficient monitoring urgent.
Method: Developed a coral bleaching classification system using three state-of-the-art models (ResNet, ViT, CNN) on a diverse global dataset with samples from various environments including deep seas, marshes, and coastal zones.
Result: After hyperparameter tuning, the CNN model achieved the highest accuracy of 88%, outperforming existing benchmarks.
Conclusion: The study provides important insights for autonomous coral monitoring and presents comprehensive analysis of widely used computer vision models for coral health assessment.
Abstract: Coral reefs support numerous marine organisms and are an important source of coastal protection from storms and floods, representing a major part of marine ecosystems. However, coral reefs face increasing threats from pollution, ocean acidification, and sea temperature anomalies, making efficient protection and monitoring urgently needed. Therefore, this study presents a novel machine-learning-based coral bleaching classification system based on a diverse global dataset with samples of healthy and bleached corals under varying environmental conditions, including deep seas, marshes, and coastal zones. We benchmarked and compared three state-of-the-art models: Residual Neural Network (ResNet), Vision Transformer (ViT), and Convolutional Neural Network (CNN). After comprehensive hyperparameter tuning, the CNN model achieved the highest accuracy of 88%, outperforming existing benchmarks. Our findings offer important insights into autonomous coral monitoring and present a comprehensive analysis of the most widely used computer vision models.
[174] SEPS: Semantic-enhanced Patch Slimming Framework for fine-grained cross-modal alignment
Xinyu Mao, Junsi Li, Haoji Zhang, Yu Liang, Ming Sun
Main category: cs.CV
TL;DR: SEPS framework addresses patch redundancy and ambiguity in cross-modal alignment by integrating dense and sparse text semantics and using relevance-aware selection to improve patch-word correspondences.
Details
Motivation: Current approaches struggle with patch redundancy and ambiguity due to information density disparities across vision and language modalities, and MLLMs' dense outputs may conflict with sparse captions.
Method: Two-stage mechanism that integrates unified semantics from dense and sparse texts to identify salient visual patches, plus relevance-aware selection with mean value computation to highlight crucial patch-word correspondences.
Result: Achieves 23%-86% improvement in rSum across diverse model architectures on Flickr30K and MS-COCO datasets, with notable enhancements in text-to-image retrieval.
Conclusion: SEPS effectively addresses patch redundancy and ambiguity in cross-modal alignment, demonstrating superior performance over existing approaches.
Abstract: Fine-grained cross-modal alignment aims to establish precise local correspondences between vision and language, forming a cornerstone for visual question answering and related multimodal applications. Current approaches face challenges in addressing patch redundancy and ambiguity, which arise from the inherent information density disparities across modalities. Recently, Multimodal Large Language Models (MLLMs) have emerged as promising solutions to bridge this gap through their robust semantic generation capabilities. However, the dense textual outputs from MLLMs may introduce conflicts with the original sparse captions. Furthermore, accurately quantifying semantic relevance between rich visual patches and concise textual descriptions remains a core challenge. To overcome these limitations, we introduce the Semantic-Enhanced Patch Slimming (SEPS) framework, which systematically addresses patch redundancy and ambiguity. Our approach employs a two-stage mechanism to integrate unified semantics from both dense and sparse texts, enabling the identification of salient visual patches. Additionally, it leverages relevance-aware selection with mean value computation to highlight crucial patch-word correspondences, thereby improving cross-modal similarity assessment. Comprehensive experiments on Flickr30K and MS-COCO datasets validate that SEPS achieves superior performance, surpassing existing approaches by 23%-86% in rSum across diverse model architectures, with notable enhancements in text-to-image retrieval scenarios. Our implementation is available at https://github.com/Sweet4tars/seps.git.
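The relevance-aware selection step can be approximated as scoring each visual patch by its mean similarity to the caption words and keeping only the top fraction. The keep ratio and the use of plain cosine similarity below are assumptions for illustration, not SEPS's exact scoring.

```python
import torch

def select_salient_patches(patch_emb, word_emb, keep_ratio=0.5):
    """patch_emb: [P, d] visual patches; word_emb: [W, d] caption words.
    Keep the patches with the highest mean patch-word similarity."""
    p = torch.nn.functional.normalize(patch_emb, dim=-1)
    w = torch.nn.functional.normalize(word_emb, dim=-1)
    relevance = (p @ w.T).mean(dim=1)              # mean similarity of each patch to all words
    k = max(1, int(keep_ratio * p.shape[0]))
    idx = relevance.topk(k).indices                # indices of the retained (slimmed) patches
    return patch_emb[idx], idx
```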
[175] Automating Coral Reef Fish Family Identification on Video Transects Using a YOLOv8-Based Deep Learning Pipeline
Jules Gerard, Leandro Di Bella, Filip Huyghe, Marc Kochzius
Main category: cs.CV
TL;DR: YOLOv8-based deep learning pipeline for automated family-level fish identification from video transects in Kenya and Tanzania, achieving mAP@0.5 of 0.52.
Details
Motivation: Coral reef monitoring in the Western Indian Ocean is limited by labor demands of underwater visual censuses, requiring automated solutions.
Method: YOLOv8-based deep learning pipeline tested on curated dataset of 24 fish families under different configurations.
Result: Best model achieved mAP@0.5 of 0.52, with high accuracy for abundant families but weaker detection of rare or complex taxa.
Conclusion: Deep learning shows potential as a scalable complement to traditional reef monitoring methods in the Western Indian Ocean.
Abstract: Coral reef monitoring in the Western Indian Ocean is limited by the labor demands of underwater visual censuses. This work evaluates a YOLOv8-based deep learning pipeline for automating family-level fish identification from video transects collected in Kenya and Tanzania. A curated dataset of 24 families was tested under different configurations, providing the first region-specific benchmark for automated reef fish monitoring in the Western Indian Ocean. The best model achieved mAP@0.5 of 0.52, with high accuracy for abundant families but weaker detection of rare or complex taxa. Results demonstrate the potential of deep learning as a scalable complement to traditional monitoring methods.
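For readers wanting to reproduce a comparable setup, a minimal Ultralytics YOLOv8 training-and-inference sketch is shown below. The dataset YAML name, checkpoint size, and hyperparameters are placeholders, not the configuration used in the paper.

```python
from ultralytics import YOLO

# Minimal family-level detector sketch: the data file "reef_fish_families.yaml"
# (24 family classes plus train/val image paths) is a hypothetical placeholder.
model = YOLO("yolov8n.pt")                       # pretrained nano checkpoint
model.train(data="reef_fish_families.yaml", epochs=100, imgsz=640)
metrics = model.val()                            # reports mAP@0.5 among other metrics
preds = model.predict("transect_frame.jpg", conf=0.25)
```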
[176] How Far Are Surgeons from Surgical World Models? A Pilot Study on Zero-shot Surgical Video Generation with Expert Assessment
Zhen Chen, Qing Xu, Jinlin Wu, Biao Yang, Yuhao Zhai, Geng Guo, Jing Zhang, Yinlu Ding, Nassir Navab, Jiebo Luo
Main category: cs.CV
TL;DR: SurgVeo is the first expert-curated benchmark for evaluating video generation models in surgery, using a novel Surgical Plausibility Pyramid framework. Testing Veo-3 model reveals it achieves excellent visual plausibility but fails at higher-level surgical understanding.
Details
Motivation: Foundation models show promise as world simulators, but their application in high-stakes surgical domains requiring specialized causal knowledge remains unexplored. There's a critical gap in evaluating whether these models can truly understand surgical procedures beyond general physical rules.
Method: Created SurgVeo benchmark and the Surgical Plausibility Pyramid (SPP), a four-tiered framework to assess model outputs from basic appearance to complex surgical strategy. Tested Veo-3 model with zero-shot prediction on surgical clips from laparoscopic and neurosurgical procedures, evaluated by four board-certified surgeons using SPP.
Result: Revealed a distinct “plausibility gap”: Veo-3 achieves exceptional Visual Perceptual Plausibility but fails critically at higher SPP levels including Instrument Operation Plausibility, Environment Feedback Plausibility, and Surgical Intent Plausibility.
Conclusion: This provides first quantitative evidence of the chasm between visually convincing mimicry and causal understanding in surgical AI. SurgVeo and SPP establish crucial foundation for developing future models capable of navigating specialized healthcare domains.
Abstract: Foundation models in video generation are demonstrating remarkable capabilities as potential world models for simulating the physical world. However, their application in high-stakes domains like surgery, which demand deep, specialized causal knowledge rather than general physical rules, remains a critical unexplored gap. To systematically address this challenge, we present SurgVeo, the first expert-curated benchmark for video generation model evaluation in surgery, and the Surgical Plausibility Pyramid (SPP), a novel, four-tiered framework tailored to assess model outputs from basic appearance to complex surgical strategy. On the basis of the SurgVeo benchmark, we task the advanced Veo-3 model with a zero-shot prediction task on surgical clips from laparoscopic and neurosurgical procedures. A panel of four board-certified surgeons evaluates the generated videos according to the SPP. Our results reveal a distinct “plausibility gap”: while Veo-3 achieves exceptional Visual Perceptual Plausibility, it fails critically at higher levels of the SPP, including Instrument Operation Plausibility, Environment Feedback Plausibility, and Surgical Intent Plausibility. This work provides the first quantitative evidence of the chasm between visually convincing mimicry and causal understanding in surgical AI. Our findings from SurgVeo and the SPP establish a crucial foundation and roadmap for developing future models capable of navigating the complexities of specialized, real-world healthcare domains.
[177] Mutual Information guided Visual Contrastive Learning
Hanyang Chen, Yanchao Yang
Main category: cs.CV
TL;DR: The paper proposes using mutual information from real-world distributions to select training data for contrastive learning, replacing human-engineered data augmentation with more principled selection based on natural perturbations.
Details
Motivation: Current contrastive learning methods rely on human-designed data augmentation (like color jittering) which may be suboptimal. The authors aim to use mutual information from real-world distributions to select better training data that improves generalization in open environments.
Method: Select patches with high mutual information under natural perturbations (color changes, motion) as positive samples for contrastive learning. This replaces traditional human-engineered augmentation with data selection based on mutual information computed from real-world distributions.
Result: The proposed mutual-information-informed data augmentation method was evaluated on multiple benchmarks across state-of-the-art representation learning frameworks and demonstrated effectiveness.
Conclusion: Mutual-information-based data selection is a promising direction for improving representation learning, providing better generalization than human-engineered augmentation methods.
Abstract: Representation learning methods utilizing the InfoNCE loss have demonstrated considerable capacity in reducing human annotation effort by training invariant neural feature extractors. Although different variants of the training objective adhere to the information maximization principle between the data and learned features, data selection and augmentation still rely on human hypotheses or engineering, which may be suboptimal. For instance, data augmentation in contrastive learning primarily focuses on color jittering, aiming to emulate real-world illumination changes. In this work, we investigate the potential of selecting training data based on their mutual information computed from real-world distributions, which, in principle, should endow the learned features with better generalization when applied in open environments. Specifically, we consider patches attached to scenes that exhibit high mutual information under natural perturbations, such as color changes and motion, as positive samples for learning with contrastive loss. We evaluate the proposed mutual-information-informed data augmentation method on several benchmarks across multiple state-of-the-art representation learning frameworks, demonstrating its effectiveness and establishing it as a promising direction for future research.
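The training objective here remains the standard InfoNCE loss; what the paper changes is how positives are chosen (patches with high mutual information under natural perturbations rather than hand-crafted augmentations). A minimal InfoNCE sketch with illustrative tensor shapes follows; the positive-selection step itself is not shown.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.07):
    """anchor, positive: [B, d]; negatives: [N, d]. The positive would be a patch
    selected for high mutual information with the anchor under natural perturbations."""
    a = F.normalize(anchor, dim=-1)
    pos = F.normalize(positive, dim=-1)
    neg = F.normalize(negatives, dim=-1)
    logits = torch.cat([(a * pos).sum(-1, keepdim=True), a @ neg.T], dim=-1) / temperature
    labels = torch.zeros(a.shape[0], dtype=torch.long)   # the positive sits at index 0
    return F.cross_entropy(logits, labels)
```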
[178] Benchmarking Federated Learning Frameworks for Medical Imaging Deployment: A Comparative Study of NVIDIA FLARE, Flower, and Owkin Substra
Riya Gupta, Alexander Chowdhury, Sahil Nalawade
Main category: cs.CV
TL;DR: This study benchmarks three FL frameworks (NVIDIA FLARE, Flower, Owkin Substra) for medical imaging using PathMNIST dataset, evaluating performance, convergence, communication overhead, scalability, and developer experience.
Details
Motivation: Federated Learning enables collaborative model training across medical institutions without direct data sharing, addressing privacy concerns in healthcare AI applications.
Method: Benchmarking three FL frameworks using PathMNIST dataset to assess model performance, convergence efficiency, communication overhead, scalability, and developer experience in medical imaging applications.
Result: NVIDIA FLARE offers superior production scalability, Flower provides flexibility for prototyping and academic research, and Owkin Substra demonstrates exceptional privacy and compliance features.
Conclusion: Each FL framework has distinct strengths optimized for different use cases, making them relevant for practical deployment in healthcare environments based on specific requirements.
Abstract: Federated Learning (FL) has emerged as a transformative paradigm in medical AI, enabling collaborative model training across institutions without direct data sharing. This study benchmarks three prominent FL frameworks, NVIDIA FLARE, Flower, and Owkin Substra, to evaluate their suitability for medical imaging applications in real-world settings. Using the PathMNIST dataset, we assess model performance, convergence efficiency, communication overhead, scalability, and developer experience. Results indicate that NVIDIA FLARE offers superior production scalability, Flower provides flexibility for prototyping and academic research, and Owkin Substra demonstrates exceptional privacy and compliance features. Each framework exhibits strengths optimized for distinct use cases, emphasizing their relevance to practical deployment in healthcare environments.
[179] A Low-Resolution Image is Worth 1x1 Words: Enabling Fine Image Super-Resolution with Transformers and TaylorShift
Sanath Budakegowdanadoddi Nagaraju, Brian Bernhard Moser, Tobias Christian Nauen, Stanislav Frolov, Federico Raue, Andreas Dengel
Main category: cs.CV
TL;DR: TaylorIR is a plug-and-play framework for image super-resolution that uses 1x1 patch embeddings and TaylorShift attention to achieve state-of-the-art performance with near-linear complexity and 60% memory reduction.
Details
Motivation: Transformer-based SR models face scalability limitations due to quadratic attention costs and coarse patch embeddings that weaken pixel-level fidelity.
Method: Proposes TaylorIR with 1x1 patch embeddings for pixel-wise reasoning and TaylorShift attention mechanism based on Taylor series for full token interactions with near-linear complexity.
Result: Achieves state-of-the-art performance across multiple SR benchmarks while reducing memory consumption by up to 60%.
Conclusion: TaylorIR effectively bridges the gap between fine-grained detail restoration and efficient transformer scaling in super-resolution tasks.
Abstract: Transformer-based architectures have recently advanced the image reconstruction quality of super-resolution (SR) models. Yet, their scalability remains limited by quadratic attention costs and coarse patch embeddings that weaken pixel-level fidelity. We propose TaylorIR, a plug-and-play framework that enforces 1x1 patch embeddings for true pixel-wise reasoning and replaces conventional self-attention with TaylorShift, a Taylor-series-based attention mechanism enabling full token interactions with near-linear complexity. Across multiple SR benchmarks, TaylorIR delivers state-of-the-art performance while reducing memory consumption by up to 60%, effectively bridging the gap between fine-grained detail restoration and efficient transformer scaling.
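To convey the flavor of Taylor-series attention, the sketch below replaces the softmax's exponential with its second-order expansion 1 + x + x^2/2, which stays positive for all real x. This is a conceptual illustration only, not the TaylorShift kernel or its near-linear reformulation.

```python
import torch

def taylor_attention(q, k, v):
    """q, k, v: [..., T, d]. Approximate exp(scores) with 1 + x + x^2/2 before normalizing."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    weights = 1.0 + scores + 0.5 * scores ** 2          # positive surrogate for exp(scores)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights @ v
```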
[180] Enhancing rice leaf images: An overview of image denoising techniques
Rupjyoti Chutia, Dibya Jyoti Bora
Main category: cs.CV
TL;DR: Comparative study of image denoising methods combined with CLAHE for enhancing rice leaf images to improve agricultural analysis tasks like disease detection and nutrient evaluation.
Details
Motivation: Image enhancement is crucial for rice leaf analysis in agriculture, particularly for disease detection, nutrient deficiency evaluation, and growth analysis. Denoising and contrast enhancement are essential preprocessing steps to make subsequent image processing tasks more reliable.
Method: Extensive comparative study of well-known image denoising methods combined with CLAHE (Contrast Limited Adaptive Histogram Equalization). Experiments performed on a rice leaf image dataset using various metrics to comprehensively test enhancement methods.
Result: Results were examined using various metrics to assess the effectiveness of different denoising methods combined with CLAHE for rice leaf image enhancement.
Conclusion: The approach provides a strong basis for assessing the effectiveness of image enhancement methodologies in digital image processing and reveals insights useful for future adaptation in agricultural research and other domains.
Abstract: Digital image processing involves the systematic handling of images using advanced computer algorithms, and has gained significant attention in both academic and practical fields. Image enhancement is a crucial preprocessing stage in the image-processing chain, improving image quality and emphasizing features. This makes subsequent tasks (segmentation, feature extraction, classification) more reliable. Image enhancement is essential for rice leaf analysis, aiding in disease detection, nutrient deficiency evaluation, and growth analysis. Denoising followed by contrast enhancement are the primary steps. Image filters, generally employed for denoising, transform or enhance visual characteristics like brightness, contrast, and sharpness, playing a crucial role in improving overall image quality and enabling the extraction of useful information. This work provides an extensive comparative study of well-known image-denoising methods combined with CLAHE (Contrast Limited Adaptive Histogram Equalization) for efficient denoising of rice leaf images. The experiments were performed on a rice leaf image dataset to ensure the data is relevant and representative. Results were examined using various metrics to comprehensively test enhancement methods. This approach provides a strong basis for assessing the effectiveness of methodologies in digital image processing and reveals insights useful for future adaptation in agricultural research and other domains.
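A typical preprocessing pipeline of the kind compared in the paper, denoising followed by CLAHE on the lightness channel, can be written with OpenCV as below. The specific denoiser, parameter values, and channel choice are illustrative assumptions rather than the study's best-performing configuration.

```python
import cv2

def enhance_rice_leaf(path, h=10):
    """Denoise a rice-leaf image, then apply CLAHE to the L channel of LAB color space."""
    img = cv2.imread(path)                                            # BGR image
    denoised = cv2.fastNlMeansDenoisingColored(img, None, h, h, 7, 21)
    lab = cv2.cvtColor(denoised, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))       # contrast-limited equalization
    merged = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(merged, cv2.COLOR_LAB2BGR)
```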
[181] EgoBlind: Towards Egocentric Visual Assistance for the Blind
Junbin Xiao, Nanxin Huang, Hao Qiu, Zhulin Tao, Xun Yang, Richang Hong, Meng Wang, Angela Yao
Main category: cs.CV
TL;DR: EgoBlind is the first egocentric VideoQA dataset from blind individuals to evaluate MLLMs’ assistive capabilities, showing current models perform poorly (60% accuracy) compared to humans (87.4%).
Details
Motivation: To create a benchmark for evaluating multimodal large language models' ability to provide visual assistance to blind and visually impaired individuals through first-person video understanding.
Method: Collected 1,392 first-person videos from blind individuals' daily lives with 5,311 questions posed or verified by blind participants, each with an average of 3 manually annotated reference answers to reduce subjectivity.
Result: Evaluation of 16 advanced MLLMs shows all models struggle, with best performers achieving only 60% accuracy compared to human performance of 87.4%. Major limitations in egocentric visual assistance were identified.
Conclusion: EgoBlind serves as a foundation for developing effective AI assistants to enhance blind individuals’ independence, with identified limitations guiding future improvements in MLLMs for visual assistance.
Abstract: We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,392 first-person videos from the daily lives of blind and visually impaired individuals. It also features 5,311 questions directly posed or verified by the blind to reflect their in-situation needs for visual assistance. Each question has an average of 3 manually annotated reference answers to reduce subjectiveness. Using EgoBlind, we comprehensively evaluate 16 advanced MLLMs and find that all models struggle. The best performers achieve an accuracy near 60%, which is far behind human performance of 87.4%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and explore heuristic solutions for improvement. With these efforts, we hope that EgoBlind will serve as a foundation for developing effective AI assistants to enhance the independence of the blind and visually impaired. Data and code are available at https://github.com/doc-doc/EgoBlind.
[182] Which LiDAR scanning pattern is better for roadside perception: Repetitive or Non-repetitive?
Zhiqi Qi, Runxin Zhao, Hanyang Zhuang, Chunxiang Wang, Ming Yang
Main category: cs.CV
TL;DR: This paper investigates how different LiDAR scanning patterns (repetitive vs non-repetitive) affect roadside perception performance, introduces a new benchmark dataset, and finds that non-repetitive LiDAR offers comparable detection performance to high-end repetitive LiDAR at lower cost.
Details
Motivation: While LiDAR placement optimization has been studied, the impact of different scanning patterns on perceptual performance remains under-investigated, especially for infrastructure-based ITS applications.
Method: Created the "InfraLiDARs' Benchmark" dataset in CARLA simulation with concurrent infrastructure LiDARs using both scanning paradigms, then conducted statistical analysis and evaluated various 3D object detection algorithms.
Result: Non-repetitive scanning LiDAR and 128-line repetitive LiDAR showed comparable detection performance across scenarios. Non-repetitive LiDAR is cost-effective despite limited perception range.
Conclusion: Provides guidance for optimal LiDAR scanning pattern selection and compatible algorithms for roadside perception systems, and releases the benchmark dataset to support further research.
Abstract: LiDAR-based roadside perception is a cornerstone of advanced Intelligent Transportation Systems (ITS). While considerable research has addressed optimal LiDAR placement for infrastructure, the profound impact of differing LiDAR scanning patterns on perceptual performance remains comparatively under-investigated. The inherent nature of various scanning modes - such as traditional repetitive (mechanical/solid-state) versus emerging non-repetitive (e.g. prism-based) systems - leads to distinct point cloud distributions at varying distances, critically dictating the efficacy of object detection and overall environmental understanding. To systematically investigate these differences in infrastructure-based contexts, we introduce the “InfraLiDARs’ Benchmark,” a novel dataset meticulously collected in the CARLA simulation environment using concurrently operating infrastructure-based LiDARs exhibiting both scanning paradigms. Leveraging this benchmark, we conduct a comprehensive statistical analysis of the respective LiDAR scanning abilities and evaluate the impact of these distinct patterns on the performance of various leading 3D object detection algorithms. Our findings reveal that non-repetitive scanning LiDAR and the 128-line repetitive LiDAR were found to exhibit comparable detection performance across various scenarios. Despite non-repetitive LiDAR’s limited perception range, it’s a cost-effective option considering its low price. Ultimately, this study provides insights for setting up roadside perception system with optimal LiDAR scanning patterns and compatible algorithms for diverse roadside applications, and publicly releases the “InfraLiDARs’ Benchmark” dataset to foster further research.
[183] SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding
Qianqian Sun, Jixiang Luo, Dell Zhang, Xuelong Li
Main category: cs.CV
TL;DR: SmartFreeEdit is a novel framework that integrates multimodal large language models with hypergraph-enhanced inpainting for precise, mask-free image editing guided by natural language instructions.
Details
Motivation: Conventional image editing methods face challenges in spatial reasoning, precise region segmentation, and maintaining semantic consistency in complex scenes.
Method: Uses region aware tokens and mask embedding for spatial understanding, reasoning segmentation pipeline for mask generation, and hypergraph-augmented inpainting module for structural and semantic preservation.
Result: Outperforms state-of-the-art methods on Reason-Edit benchmark across segmentation accuracy, instruction adherence, and visual quality preservation metrics.
Conclusion: SmartFreeEdit effectively addresses limitations of local-based image generation and improves global consistency in edited images.
Abstract: Recent advancements in image editing have utilized large-scale multimodal models to enable intuitive, natural instruction-driven interactions. However, conventional methods still face significant challenges, particularly in spatial reasoning, precise region segmentation, and maintaining semantic consistency, especially in complex scenes. To overcome these challenges, we introduce SmartFreeEdit, a novel end-to-end framework that integrates a multimodal large language model (MLLM) with a hypergraph-enhanced inpainting architecture, enabling precise, mask-free image editing guided exclusively by natural language instructions. The key innovations of SmartFreeEdit include:(1)the introduction of region aware tokens and a mask embedding paradigm that enhance the spatial understanding of complex scenes;(2) a reasoning segmentation pipeline designed to optimize the generation of editing masks based on natural language instructions;and (3) a hypergraph-augmented inpainting module that ensures the preservation of both structural integrity and semantic coherence during complex edits, overcoming the limitations of local-based image generation. Extensive experiments on the Reason-Edit benchmark demonstrate that SmartFreeEdit surpasses current state-of-the-art methods across multiple evaluation metrics, including segmentation accuracy, instruction adherence, and visual quality preservation, while addressing the issue of local information focus and improving global consistency in the edited image. Our project will be available at https://github.com/smileformylove/SmartFreeEdit.
[184] World Simulation with Video Foundation Models for Physical AI
NVIDIA, :, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, Jinwei Gu, Aryaman Gupta, Siddharth Gururani, Imad El Hanafi, Ali Hassani, Zekun Hao, Jacob Huffman, Joel Jang, Pooya Jannaty, Jan Kautz, Grace Lam, Xuan Li, Zhaoshuo Li, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Yen-Chen Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Seungjun Nah, Yashraj Narang, Abhijeet Panaskar, Lindsey Pavao, Trung Pham, Morteza Ramezanali, Fitsum Reda, Scott Reed, Xuanchi Ren, Haonan Shao, Yue Shen, Stella Shi, Shuran Song, Bartosz Stefaniak, Shangkun Sun, Shitao Tang, Sameena Tasmeen, Lyne Tchapmi, Wei-Cheng Tseng, Jibin Varghese, Andrew Z. Wang, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Jiashu Xu, Dinghao Yang, Xiaodong Yang, Haotian Ye, Seonghyeon Ye, Xiaohui Zeng, Jing Zhang, Qinsheng Zhang, Kaiwen Zheng, Andrew Zhu, Yuke Zhu
Main category: cs.CV
TL;DR: Cosmos-Predict2.5 is a next-generation Physical AI foundation model that unifies Text2World, Image2World, and Video2World generation using flow-based architecture, with improved video quality and instruction alignment over previous versions.
Details
Motivation: To create more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems by developing unified world generation models with better control and fidelity.
Method: Built on flow-based architecture, leverages Cosmos-Reason1 for text grounding and world simulation control, trained on 200M curated video clips with reinforcement learning-based post-training, and includes Cosmos-Transfer2.5 for Sim2Real/Real2Real translation.
Result: Achieves substantial improvements over Cosmos-Predict1 in video quality and instruction alignment, with models at 2B and 14B scales. Cosmos-Transfer2.5 is 3.5x smaller than previous version but delivers higher fidelity and robust long-horizon video generation.
Conclusion: These advances establish Cosmos-Predict2.5 and Cosmos-Transfer2.5 as versatile tools for scaling embodied intelligence, with open-source release to accelerate Physical AI research and deployment.
Abstract: We introduce Cosmos-Predict2.5, the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, Cosmos-Predict2.5 unifies Text2World, Image2World, and Video2World generation in a single model and leverages Cosmos-Reason1, a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, Cosmos-Predict2.5 achieves substantial improvements over Cosmos-Predict1 in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with Cosmos-Transfer2.5, a control-net style framework for Sim2Real and Real2Real world translation. Despite being 3.5× smaller than Cosmos-Transfer1, it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish Cosmos-Predict2.5 and Cosmos-Transfer2.5 as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.
[185] An Efficient and Generalizable Transfer Learning Method for Weather Condition Detection on Ground Terminals
Wenxuan Zhang, Peng Hu
Main category: cs.CV
TL;DR: Proposes a transfer learning method for fine-grained weather condition detection on satellite ground terminals to improve reliability of LEO satellite Internet.
Details
Motivation: Weather events significantly impact satellite Internet performance, disrupting space-ground links. Current solutions lack fine-grained detection capability for ground terminal weather conditions needed for fault diagnostics.
Method: Efficient transfer learning method that enables ground components to locally detect weather-related conditions (snow, wet, etc.) from adverse weather events.
Result: Superior performance compared to typical deep learning methods like YOLOv7, YOLOv9, Faster R-CNN, and R-YOLO. Method shows advantage of being generalizable to various scenarios.
Conclusion: The transfer learning approach provides effective and generalizable solution for fine-grained weather condition detection, addressing reliability challenges in satellite Internet deployments.
Abstract: The increasing adoption of satellite Internet with low-Earth-orbit (LEO) satellites in mega-constellations allows ubiquitous connectivity to rural and remote areas. However, weather events have a significant impact on the performance and reliability of satellite Internet. Adverse weather events such as snow and rain can disturb the performance and operations of satellite Internet’s essential ground terminal components, such as satellite antennas, significantly disrupting the space-ground link conditions between LEO satellites and ground stations. This challenge calls for not only region-based weather forecasts but also fine-grained detection capability on ground terminal components of fine-grained weather conditions. Such a capability can assist in fault diagnostics and mitigation for reliable satellite Internet, but its solutions are lacking, not to mention the effectiveness and generalization that are essential in real-world deployments. This paper discusses an efficient transfer learning (TL) method that can enable a ground component to locally detect representative weather-related conditions. The proposed method can detect snow, wet, and other conditions resulting from adverse and typical weather events and shows superior performance compared to the typical deep learning methods, such as YOLOv7, YOLOv9, Faster R-CNN, and R-YOLO. Our TL method also shows the advantage of being generalizable to various scenarios.
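A minimal sketch of how such a transfer-learning classifier could be set up: a pretrained backbone is frozen and a small head is fine-tuned on weather-condition labels. The backbone choice, class list, and training details below are assumptions made for illustration; the paper does not specify them here.

```python
# Minimal transfer-learning sketch (backbone and class names are assumptions).
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

NUM_CONDITIONS = 3  # e.g., snow, wet, clear -- illustrative labels

model = resnet18(weights=ResNet18_Weights.DEFAULT)
for p in model.parameters():          # freeze the pretrained backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, NUM_CONDITIONS)  # new task head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One fine-tuning step on a batch of ground-terminal images."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```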
[186] Habitat and Land Cover Change Detection in Alpine Protected Areas: A Comparison of AI Architectures
Harald Kristen, Daniel Kulmer, Manuela Hirschmugl
Main category: cs.CV
TL;DR: Deep learning approaches for change detection in complex alpine habitats, comparing geospatial foundation models (GFMs) against traditional U-Net CNNs for both post-classification and direct change detection methods.
Details
Motivation: Rapid climate change and disturbances in alpine ecosystems require frequent habitat monitoring, but manual mapping is too expensive for the needed temporal resolution.
Method: Compared two paradigms: post-classification change detection using GFMs (Prithvi-EO-2.0, Clay v1.0) vs U-Net CNNs, and direct change detection using ChangeViT transformer vs U-Net baselines. Used high-resolution multimodal data (RGB, NIR, LiDAR, terrain attributes) covering 4,480 documented changes over 15.3 km².
Result: Clay v1.0 achieved 51% overall accuracy vs U-Net’s 41% for multi-class habitat change; both reached 67% for binary change detection. Direct CD yielded superior IoU (0.53 vs 0.35) for binary but only 28% accuracy for multi-class detection. LiDAR integration improved semantic segmentation from 30% to 50% accuracy.
Conclusion: Although accuracies are lower than in homogeneous landscapes, they reflect realistic performance for complex alpine habitats. GFMs show robustness in cross-temporal evaluation. Future work will integrate object-based post-processing and physical constraints.
Abstract: Rapid climate change and other disturbances in alpine ecosystems demand frequent habitat monitoring, yet manual mapping remains prohibitively expensive for the required temporal resolution. We employ deep learning for change detection using long-term alpine habitat data from Gesaeuse National Park, Austria, addressing a major gap in applying geospatial foundation models (GFMs) to complex natural environments with fuzzy class boundaries and highly imbalanced classes. We compare two paradigms: post-classification change detection (CD) versus direct CD. For post-classification CD, we evaluate GFMs Prithvi-EO-2.0 and Clay v1.0 against U-Net CNNs; for direct CD, we test the transformer ChangeViT against U-Net baselines. Using high-resolution multimodal data (RGB, NIR, LiDAR, terrain attributes) covering 4,480 documented changes over 15.3 km2, results show Clay v1.0 achieves 51% overall accuracy versus U-Net’s 41% for multi-class habitat change, while both reach 67% for binary change detection. Direct CD yields superior IoU (0.53 vs 0.35) for binary but only 28% accuracy for multi-class detection. Cross-temporal evaluation reveals GFM robustness, with Clay maintaining 33% accuracy on 2020 data versus U-Net’s 23%. Integrating LiDAR improves semantic segmentation from 30% to 50% accuracy. Although overall accuracies are lower than in more homogeneous landscapes, they reflect realistic performance for complex alpine habitats. Future work will integrate object-based post-processing and physical constraints to enhance applicability.
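To make the post-classification change detection paradigm concrete, a small sketch follows: classify each acquisition date independently, then compare the label maps per pixel. The `seg_model` placeholder stands in for whichever classifier is used (GFM or U-Net); this illustrates the paradigm, not the authors' pipeline.

```python
# Sketch of post-classification change detection on two dates.
import numpy as np

def post_classification_cd(seg_model, image_t1, image_t2):
    """seg_model: callable returning an (H, W) integer habitat class map."""
    labels_t1 = seg_model(image_t1)          # class map, date 1
    labels_t2 = seg_model(image_t2)          # class map, date 2
    binary_change = labels_t1 != labels_t2   # (H, W) bool: did the class change?
    # "From-to" codes record which habitat class turned into which
    from_to = labels_t1.astype(np.int32) * 1000 + labels_t2.astype(np.int32)
    return binary_change, from_to
```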
[187] OmniTrack++: Omnidirectional Multi-Object Tracking by Learning Large-FoV Trajectory Feedback
Kai Luo, Hao Shi, Kunyu Peng, Fei Teng, Sheng Wu, Kaiwei Wang, Kailun Yang
Main category: cs.CV
TL;DR: OmniTrack++ is a feedback-driven MOT framework for panoramic imagery that addresses 360° FoV challenges through feature stabilization, trajectory-informed tracking, expert memory for appearance cues, and adaptive mode switching, achieving state-of-the-art performance on panoramic MOT benchmarks.
Details
Motivation: Conventional MOT methods designed for narrow-FoV pinhole cameras perform poorly on panoramic imagery due to 360° FoV, resolution dilution, and severe view-dependent distortions, requiring specialized solutions.
Method: Uses DynamicSSM block for feature stabilization, FlexiTrack Instances with trajectory feedback for flexible localization, ExpertTrack Memory with Mixture-of-Experts for appearance consolidation, and adaptive Tracklet Management that switches between end-to-end and tracking-by-detection modes.
Result: Achieves substantial HOTA improvements: +25.5% on JRDB and +43.07% on QuadTrack over original OmniTrack, demonstrating state-of-the-art performance on panoramic MOT benchmarks.
Conclusion: OmniTrack++ effectively addresses panoramic MOT challenges through its feedback-driven framework and adaptive components, with the EmboTrack benchmark providing comprehensive evaluation for real-world panoramic perception.
Abstract: This paper investigates Multi-Object Tracking (MOT) in panoramic imagery, which introduces unique challenges including a 360° Field of View (FoV), resolution dilution, and severe view-dependent distortions. Conventional MOT methods designed for narrow-FoV pinhole cameras generalize unsatisfactorily under these conditions. To address panoramic distortion, large search space, and identity ambiguity under a 360° FoV, OmniTrack++ adopts a feedback-driven framework that progressively refines perception with trajectory cues. A DynamicSSM block first stabilizes panoramic features, implicitly alleviating geometric distortion. On top of normalized representations, FlexiTrack Instances use trajectory-informed feedback for flexible localization and reliable short-term association. To ensure long-term robustness, an ExpertTrack Memory consolidates appearance cues via a Mixture-of-Experts design, enabling recovery from fragmented tracks and reducing identity drift. Finally, a Tracklet Management module adaptively switches between end-to-end and tracking-by-detection modes according to scene dynamics, offering a balanced and scalable solution for panoramic MOT. To support rigorous evaluation, we establish the EmboTrack benchmark, a comprehensive dataset for panoramic MOT that includes QuadTrack, captured with a quadruped robot, and BipTrack, collected with a bipedal wheel-legged robot. Together, these datasets span wide-angle environments and diverse motion patterns, providing a challenging testbed for real-world panoramic perception. Extensive experiments on JRDB and EmboTrack demonstrate that OmniTrack++ achieves state-of-the-art performance, yielding substantial HOTA improvements of +25.5% on JRDB and +43.07% on QuadTrack over the original OmniTrack. Datasets and code will be made publicly available at https://github.com/xifen523/OmniTrack.
[188] LeMiCa: Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video Generation
Huanlin Gao, Ping Chen, Fuyuan Shi, Chao Tan, Zhaoxiang Liu, Fang Zhao, Kai Wang, Shiguo Lian
Main category: cs.CV
TL;DR: LeMiCa is a training-free acceleration framework for diffusion-based video generation that uses lexicographic minimax path optimization to bound global errors, achieving 2.9x speedup on Latte model with minimal quality degradation.
Details
Motivation: Existing caching strategies for video generation focus on reducing local errors but overlook global error accumulation, leading to content degradation between accelerated and original videos.
Method: Formulates cache scheduling as a directed graph with error-weighted edges and introduces Lexicographic Minimax Path Optimization to explicitly bound worst-case path error, improving global content and style consistency.
Result: Achieves 2.9x speedup on Latte model and LPIPS score of 0.05 on Open-Sora, outperforming prior caching techniques with minimal perceptual quality degradation.
Conclusion: LeMiCa provides a robust and generalizable paradigm for accelerating diffusion-based video generation that balances speed and quality, serving as a foundation for future efficient video synthesis research.
Abstract: We present LeMiCa, a training-free and efficient acceleration framework for diffusion-based video generation. While existing caching strategies primarily focus on reducing local heuristic errors, they often overlook the accumulation of global errors, leading to noticeable content degradation between accelerated and original videos. To address this issue, we formulate cache scheduling as a directed graph with error-weighted edges and introduce a Lexicographic Minimax Path Optimization strategy that explicitly bounds the worst-case path error. This approach substantially improves the consistency of global content and style across generated frames. Extensive experiments on multiple text-to-video benchmarks demonstrate that LeMiCa delivers dual improvements in both inference speed and generation quality. Notably, our method achieves a 2.9x speedup on the Latte model and reaches an LPIPS score of 0.05 on Open-Sora, outperforming prior caching techniques. Importantly, these gains come with minimal perceptual quality degradation, making LeMiCa a robust and generalizable paradigm for accelerating diffusion-based video generation. We believe this approach can serve as a strong foundation for future research on efficient and reliable video synthesis. Our code is available at :https://github.com/UnicomAI/LeMiCa
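For intuition on the minimax-path view of cache scheduling, the sketch below finds, on a small error-weighted graph, a path that minimizes the worst single edge error (a bottleneck shortest path). It covers only the primary minimax objective; LeMiCa's full lexicographic refinement of the remaining errors is omitted, and the toy graph is purely illustrative.

```python
# Sketch: on a graph of cache schedules with error-weighted edges, pick a path
# that minimizes the maximum edge error (path cost = worst edge so far).
import heapq

def minimax_path(graph, source, target):
    """graph: {node: [(neighbor, edge_error), ...]}. Returns (worst_error, path)."""
    best = {source: 0.0}
    heap = [(0.0, source, [source])]
    while heap:
        worst, node, path = heapq.heappop(heap)
        if node == target:
            return worst, path
        for nxt, err in graph.get(node, []):
            cand = max(worst, err)            # largest edge error along this path
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                heapq.heappush(heap, (cand, nxt, path + [nxt]))
    return float("inf"), []

# Toy example: nodes are denoising steps, edges carry estimated caching error
g = {"t0": [("t1", 0.02), ("t2", 0.08)], "t1": [("t2", 0.03)], "t2": []}
print(minimax_path(g, "t0", "t2"))  # -> (0.03, ['t0', 't1', 't2'])
```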
[189] Class Agnostic Instance-level Descriptor for Visual Instance Search
Qi-Ying Sun, Wan-Lei Zhao, Hui-Ying Xie, Yi-Bo Miao, Chong-Wah Ngo
Main category: cs.CV
TL;DR: The paper proposes a hierarchical instance-level feature representation method using self-supervised ViT to unify image retrieval, multi-instance search, and instance search into one framework.
Details
Motivation: Current deep features lack effective instance-level representation for visual instance search, and supervised object detection methods perform poorly on unknown object categories.
Method: Models instance-level region discovery as detecting compact feature subsets hierarchically from self-supervised ViT outputs, producing a hierarchy of instance regions with uniform-length features.
Result: Empirical studies on three benchmarks show effectiveness on both known and unknown object categories, achieving superior performance on single-instance and multi-instance search, as well as image retrieval tasks.
Conclusion: The hierarchical instance-level descriptor successfully addresses object embedding and occlusion problems while unifying multiple retrieval tasks into a single framework.
Abstract: Despite the great success of the deep features in content-based image retrieval, the visual instance search remains challenging due to the lack of effective instance-level feature representation. Supervised or weakly supervised object detection methods are not the appropriate solutions due to their poor performance on the unknown object categories. In this paper, based on the feature set output from self-supervised ViT, the instance-level region discovery is modeled as detecting the compact feature subsets in a hierarchical fashion. The hierarchical decomposition results in a hierarchy of instance regions. On the one hand, this kind of hierarchical decomposition well addresses the problem of object embedding and occlusions, which are widely observed in real scenarios. On the other hand, the non-leaf nodes and leaf nodes on the hierarchy correspond to the instance regions in different granularities within an image. Therefore, features in uniform length are produced for these instance regions, which may cover across a dominant image region, an integral of multiple instances, or various individual instances. Such a collection of features allows us to unify the image retrieval, multi-instance search, and instance search into one framework. The empirical studies on three benchmarks show that such an instance-level descriptor remains effective on both the known and unknown object categories. Moreover, the superior performance is achieved on single-instance and multi-instance search, as well as image retrieval tasks.
[190] Self-Improving Vision-Language-Action Models with Data Generation via Residual RL
Wenli Xiao, Haotian Lin, Andy Peng, Haoru Xue, Tairan He, Yuqi Xie, Fengyuan Hu, Jimmy Wu, Zhengyi Luo, Linxi “Jim” Fan, Guanya Shi, Yuke Zhu
Main category: cs.CV
TL;DR: PLD is a three-stage framework that improves vision-language-action models through residual RL and distribution-aware data collection, achieving near-perfect task success across multiple benchmarks.
Details
Motivation: Supervised fine-tuning (SFT) relies on costly human demonstrations, limiting scalability and generalization of large vision-language-action models.
Method: Three-stage framework: 1) Train lightweight residual actors to probe failure regions, 2) Use hybrid rollout scheme for distribution-aligned trajectory collection with recovery behaviors, 3) Distill curated trajectories back into the generalist with SFT.
Result: Achieves 99% task success on LIBERO, over 50% gains in SimplerEnv, and 100% success on real-world Franka and YAM arm manipulation tasks.
Conclusion: Residual probing and distribution-aware replay are key to collecting deployment-aligned data that improves both seen and unseen tasks, offering a scalable path toward self-improving VLA models.
Abstract: Supervised fine-tuning (SFT) has become the de facto post-training strategy for large vision-language-action (VLA) models, but its reliance on costly human demonstrations limits scalability and generalization. We propose Probe, Learn, Distill (PLD), a three-stage plug-and-play framework that improves VLAs through residual reinforcement learning (RL) and distribution-aware data collection. In Stage 1, we train lightweight residual actors to probe failure regions of the VLA generalist. In Stage 2, we use a hybrid rollout scheme that aligns collected trajectories with the generalist’s deployment distribution while capturing recovery behaviors. In Stage 3, we distill the curated trajectories back into the generalist with standard SFT. PLD achieves near-saturated 99% task success on LIBERO, over 50% gains in SimplerEnv, and 100% success on real-world Franka and YAM arm manipulation tasks. Ablations show that residual probing and distribution-aware replay are key to collecting deployment-aligned data that improves both seen and unseen tasks, offering a scalable path toward self-improving VLA models.
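A hedged sketch of the residual-actor idea from Stage 1: the frozen VLA generalist proposes an action and a lightweight residual policy, trainable with RL, adds a bounded correction. All names, shapes, and the scaling constant are illustrative assumptions rather than the paper's interfaces.

```python
# Sketch: residual policy adds a small correction to a frozen generalist's action.
import torch
import torch.nn as nn

class ResidualActor(nn.Module):
    def __init__(self, obs_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # bounded correction
        )

    def forward(self, obs, base_action):
        return self.net(torch.cat([obs, base_action], dim=-1))

def act(generalist, residual, obs, residual_scale=0.1):
    """Compose the frozen generalist's action with a scaled residual correction."""
    with torch.no_grad():
        base = generalist(obs)                           # frozen VLA proposal
    return base + residual_scale * residual(obs, base)   # learned correction
```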
[191] SpinalSAM-R1: A Vision-Language Multimodal Interactive System for Spine CT Segmentation
Jiaming Liu, Dingwei Fan, Junyong Zhao, Chunlin Li, Haipeng Si, Liang Sun
Main category: cs.CV
TL;DR: SpinalSAM-R1 is a multimodal vision-language system that combines fine-tuned SAM with DeepSeek-R1 for spine CT image segmentation, achieving superior performance with natural language-guided refinement.
Details
Motivation: Spine CT image segmentation faces challenges due to low contrast and complex vertebral boundaries, while existing models like SAM have limited performance in this domain due to high annotation requirements and poor adaptability.
Method: Proposed SpinalSAM-R1 integrates fine-tuned SAM with DeepSeek-R1, using anatomy-guided attention mechanism and LoRA for efficient adaptation. Includes PyQt5-based interactive software supporting point, box, and text prompts.
Result: Achieves superior segmentation performance on spine anatomical structures, supports 11 clinical operations with 94.3% parsing accuracy and sub-800ms response times.
Conclusion: SpinalSAM-R1 effectively addresses limitations of existing models in spine CT segmentation through multimodal integration and natural language interaction, with released open-source software.
Abstract: The anatomical structure segmentation of the spine and adjacent structures from computed tomography (CT) images is a key step for spinal disease diagnosis and treatment. However, the segmentation of CT images is impeded by low contrast and complex vertebral boundaries. Although advanced models such as the Segment Anything Model (SAM) have shown promise in various segmentation tasks, their performance in spinal CT imaging is limited by high annotation requirements and poor domain adaptability. To address these limitations, we propose SpinalSAM-R1, a multimodal vision-language interactive system that integrates a fine-tuned SAM with DeepSeek-R1, for spine CT image segmentation. Specifically, our SpinalSAM-R1 introduces an anatomy-guided attention mechanism to improve spine segmentation performance, and a semantics-driven interaction protocol powered by DeepSeek-R1, enabling natural language-guided refinement. The SpinalSAM-R1 is fine-tuned using Low-Rank Adaptation (LoRA) for efficient adaptation. We validate our SpinalSAM-R1 on the spine anatomical structure with CT images. Experimental results suggest that our method achieves superior segmentation performance. Meanwhile, we develop a PyQt5-based interactive software, which supports point, box, and text-based prompts. The system supports 11 clinical operations with 94.3% parsing accuracy and sub-800 ms response times. The software is released on https://github.com/6jm233333/spinalsam-r1.
[192] A filtering scheme for confocal laser endomicroscopy (CLE)-video sequences for self-supervised learning
Nils Porsche, Flurin Müller-Diesing, Sweta Banerjee, Miguel Goncalves, Marc Aubreville
Main category: cs.CV
TL;DR: This paper proposes a filter method to reduce redundancy in CLE video sequences for self-supervised learning, improving training efficiency and performance on tumor classification tasks.
Details
Motivation: CLE imaging is hard to interpret for non-experts and suffers from limited labeled data, leading to overfitting in machine learning models. Self-supervised learning can help but faces challenges due to high inter-frame correlation in CLE videos.
Method: Proposed a filter functionality on CLE video sequences to reduce dataset redundancy for SSL training. Used four state-of-the-art baseline networks and a SSL teacher-student network with vision transformer backbone, evaluated on sinonasal tumor and skin carcinoma datasets.
Result: Filtered SSL-pretrained models achieved highest test accuracy: 67.48% on sinonasal tumor dataset and 73.52% on skin carcinoma dataset, significantly outperforming non-SSL baselines. Training time reduced by 67%.
Conclusion: SSL is effective for CLE pretraining, and the proposed video filter improves training efficiency in self-supervised scenarios while enhancing downstream task performance.
Abstract: Confocal laser endomicroscopy (CLE) is a non-invasive, real-time imaging modality that can be used for in-situ, in-vivo imaging and the microstructural analysis of mucous structures. The diagnosis using CLE is, however, complicated by images being hard to interpret for non-experienced physicians. Utilizing machine learning as an augmentative tool would hence be beneficial, but is complicated by the shortage of histopathology-correlated CLE imaging sequences with respect to the plurality of patterns in this domain, leading to overfitting of machine learning models. To overcome this, self-supervised learning (SSL) can be employed on larger unlabeled datasets. CLE is a video-based modality with high inter-frame correlation, leading to a non-stratified data distribution for SSL training. In this work, we propose a filter functionality on CLE video sequences to reduce the dataset redundancy in SSL training and improve SSL training convergence and training efficiency. We use four state-of-the-art baseline networks and a SSL teacher-student network with a vision transformer small backbone for the evaluation. These networks were evaluated on downstream tasks for a sinonasal tumor dataset and a squamous cell carcinoma of the skin dataset. On both datasets, we found the highest test accuracy on the filtered SSL-pretrained model, with 67.48% and 73.52%, both considerably outperforming their non-SSL baselines. Our results show that SSL is an effective method for CLE pretraining. Further, we show that our proposed CLE video filter can be utilized to improve training efficiency in self-supervised scenarios, resulting in a reduction of 67% in training time.
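One way to picture such a redundancy filter is to keep a frame only when it differs sufficiently from the last kept frame. The similarity measure and threshold in this sketch are assumptions for illustration; the paper's actual filtering criterion may differ.

```python
# Sketch: drop frames that are nearly identical to the last kept frame.
import numpy as np

def filter_redundant_frames(frames, threshold=0.95):
    """frames: list of grayscale arrays. Returns indices of kept frames."""
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        a = frames[kept[-1]].astype(np.float32).ravel()
        b = frames[i].astype(np.float32).ravel()
        # normalized cross-correlation as a cheap inter-frame similarity
        sim = float(np.dot(a - a.mean(), b - b.mean()) /
                    (np.linalg.norm(a - a.mean()) * np.linalg.norm(b - b.mean()) + 1e-8))
        if sim < threshold:          # frame adds new content -> keep it
            kept.append(i)
    return kept
```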
[193] FreeSliders: Training-Free, Modality-Agnostic Concept Sliders for Fine-Grained Diffusion Control in Images, Audio, and Video
Rotem Ezra, Hedi Zisling, Nimrod Berman, Ilan Naiman, Alexey Gorkor, Liran Nochumsohn, Eliya Nachmani, Omri Azencot
Main category: cs.CV
TL;DR: FreeSliders is a training-free, modality-agnostic method for fine-grained controllable generation in diffusion models by partially estimating Concept Sliders formula during inference, with extended benchmarks for video/audio and automatic scale selection.
Details
Motivation: Existing Concept Sliders require per-concept training and architecture-specific fine-tuning (e.g., LoRA), limiting scalability to new modalities. There's a need for training-free, modality-agnostic approaches for fine-grained concept control.
Method: Partially estimate Concept Sliders formula during inference without training. Introduce two-stage procedure for automatic scale selection and non-linear traversals that detects saturation points and reparameterizes traversal for perceptually uniform edits.
Result: Enables plug-and-play, training-free concept control across modalities (images, video, audio). Improves over existing baselines and establishes tools for principled controllable generation.
Conclusion: FreeSliders provides an effective training-free solution for fine-grained concept control in diffusion models, working across multiple modalities with improved evaluation metrics and automatic scale selection.
Abstract: Diffusion models have become state-of-the-art generative models for images, audio, and video, yet enabling fine-grained controllable generation, i.e., continuously steering specific concepts without disturbing unrelated content, remains challenging. Concept Sliders (CS) offer a promising direction by discovering semantic directions through textual contrasts, but they require per-concept training and architecture-specific fine-tuning (e.g., LoRA), limiting scalability to new modalities. In this work we introduce FreeSliders, a simple yet effective approach that is fully training-free and modality-agnostic, achieved by partially estimating the CS formula during inference. To support modality-agnostic evaluation, we extend the CS benchmark to include both video and audio, establishing the first suite for fine-grained concept generation control with multiple modalities. We further propose three evaluation properties along with new metrics to improve evaluation quality. Finally, we identify an open problem of scale selection and non-linear traversals and introduce a two-stage procedure that automatically detects saturation points and reparameterizes traversal for perceptually uniform, semantically meaningful edits. Extensive experiments demonstrate that our method enables plug-and-play, training-free concept control across modalities, improves over existing baselines, and establishes new tools for principled controllable generation. An interactive presentation of our benchmark and method is available at: https://azencot-group.github.io/FreeSliders/
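A hedged sketch of what "partially estimating the Concept Sliders formula during inference" could look like in code: the concept direction is taken as the difference between noise predictions for contrasting attribute prompts and added to the base prediction with a user-controlled scale. The `denoiser` callable and its signature are placeholders, and this is a simplified reading, not the authors' exact procedure.

```python
# Sketch: training-free concept steering via contrasting prompt predictions.
def slider_noise_prediction(denoiser, latents, t, base_prompt,
                            pos_prompt, neg_prompt, scale):
    eps_base = denoiser(latents, t, base_prompt)   # ordinary conditional prediction
    eps_pos = denoiser(latents, t, pos_prompt)     # e.g. "... , very old person"
    eps_neg = denoiser(latents, t, neg_prompt)     # e.g. "... , very young person"
    direction = eps_pos - eps_neg                  # concept axis in noise space
    return eps_base + scale * direction            # scale slides the concept strength
```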
[194] AI Powered High Quality Text to Video Generation with Enhanced Temporal Consistency
Piyushkumar Patel
Main category: cs.CV
TL;DR: MOVAI is a hierarchical framework for text-to-video generation that integrates compositional scene understanding with temporal-aware diffusion models, achieving state-of-the-art performance through three key innovations: compositional scene parsing, temporal-spatial attention, and progressive video refinement.
Details
Motivation: Existing text-to-video generation approaches struggle with maintaining temporal consistency, compositional understanding, and fine-grained control over visual narratives, creating a need for more sophisticated frameworks.
Method: MOVAI introduces three key components: 1) Compositional Scene Parser (CSP) that decomposes text into hierarchical scene graphs with temporal annotations, 2) Temporal-Spatial Attention Mechanism (TSAM) for coherent motion dynamics and spatial detail preservation, and 3) Progressive Video Refinement (PVR) module for iterative quality enhancement through multi-scale temporal reasoning.
Result: MOVAI achieves state-of-the-art performance with improvements of 15.3% in LPIPS, 12.7% in FVD, and 18.9% in user preference studies compared to existing methods. It excels at generating complex multi-object scenes with realistic temporal dynamics and fine-grained semantic control.
Conclusion: The MOVAI framework successfully addresses key challenges in text-to-video generation by integrating compositional scene understanding with temporal-aware diffusion models, demonstrating superior performance in generating high-fidelity videos with consistent temporal dynamics and fine-grained control.
Abstract: Text to video generation has emerged as a critical frontier in generative artificial intelligence, yet existing approaches struggle with maintaining temporal consistency, compositional understanding, and fine grained control over visual narratives. We present MOVAI (Multimodal Original Video AI), a novel hierarchical framework that integrates compositional scene understanding with temporal aware diffusion models for high fidelity text to video synthesis. Our approach introduces three key innovations: (1) a Compositional Scene Parser (CSP) that decomposes textual descriptions into hierarchical scene graphs with temporal annotations, (2) a Temporal-Spatial Attention Mechanism (TSAM) that ensures coherent motion dynamics across frames while preserving spatial details, and (3) a Progressive Video Refinement (PVR) module that iteratively enhances video quality through multi-scale temporal reasoning. Extensive experiments on standard benchmarks demonstrate that MOVAI achieves state-of-the-art performance, improving video quality metrics by 15.3% in LPIPS, 12.7% in FVD, and 18.9% in user preference studies compared to existing methods. Our framework shows particular strength in generating complex multi-object scenes with realistic temporal dynamics and fine-grained semantic control.
[195] Chain of Time: In-Context Physical Simulation with Image Generation Models
YingQiao Wang, Eric Bigelow, Boyi Li, Tomer Ullman
Main category: cs.CV
TL;DR: Chain of Time method improves physical simulation in vision-language models by generating intermediate images during simulation, inspired by human mental simulation and in-context reasoning.
Details
Motivation: To improve and interpret physical simulation capabilities in vision-language models, drawing inspiration from human mental simulation and machine learning in-context reasoning.
Method: Chain of Time approach generates series of intermediate images during simulation at inference time without additional fine-tuning, applied to 2-D graphics and 3-D videos testing physical properties like velocity, acceleration, fluid dynamics, and momentum conservation.
Result: Substantially improves performance of state-of-the-art image generation model, reveals hidden physical reasoning capabilities including simulation of velocity, gravity, and collisions over time, but also identifies limitations in inferring physical parameters from input images.
Conclusion: Chain of Time method effectively enhances physical simulation in vision-language models and provides interpretability insights into their reasoning capabilities and limitations.
Abstract: We propose a novel cognitively-inspired method to improve and interpret physical simulation in vision-language models. Our "Chain of Time" method involves generating a series of intermediate images during a simulation, and it is motivated by in-context reasoning in machine learning, as well as mental simulation in humans. Chain of Time is used at inference time, and requires no additional fine-tuning. We apply the Chain-of-Time method to synthetic and real-world domains, including 2-D graphics simulations and natural 3-D videos. These domains test a variety of particular physical properties, including velocity, acceleration, fluid dynamics, and conservation of momentum. We found that using Chain-of-Time simulation substantially improves the performance of a state-of-the-art image generation model. Beyond examining performance, we also analyzed the specific states of the world simulated by an image model at each time step, which sheds light on the dynamics underlying these simulations. This analysis reveals insights that are hidden from traditional evaluations of physical reasoning, including cases where an image generation model is able to simulate physical properties that unfold over time, such as velocity, gravity, and collisions. Our analysis also highlights particular cases where the image generation model struggles to infer particular physical parameters from input images, despite being capable of simulating relevant physical processes.
[196] End-to-End Framework Integrating Generative AI and Deep Reinforcement Learning for Autonomous Ultrasound Scanning
Hanae Elmekki, Amanda Spilkin, Ehsan Zakeri, Antonela Mariel Zanuttini, Ahmed Alagha, Hani Sami, Jamal Bentahar, Lyes Kadem, Wen-Fang Xie, Philippe Pibarot, Rabeb Mizouni, Hadi Otrok, Azzam Mourad, Sami Muhaidat
Main category: cs.CV
TL;DR: This paper presents an end-to-end AI framework combining generative AI and deep reinforcement learning for autonomous cardiac ultrasound scanning, addressing limitations of operator dependence and accessibility.
Details
Motivation: Cardiac ultrasound effectiveness is limited by operator dependence, time constraints, human error, and a shortage of trained professionals, especially in remote areas, creating a need for automated solutions.
Method: Framework integrates conditional generative simulator (GANs + VAEs) for realistic action-conditioned US images and DRL module for learning autonomous scanning policies, with expert-validated classification models for image type and quality assessment.
Result: The solution delivers AI-driven guidance, supports conditional generation of realistic US images, establishes a reproducible foundation extendable to other organs, and releases a public dataset for reproducibility. The VAE-GAN is benchmarked against existing GAN variants, and the DRL-based scanning system is evaluated under varying configurations.
Conclusion: The framework enables autonomous and reproducible cardiac US scanning through integrated generative AI and DRL approach, validated through experiments and public dataset release to ensure reproducibility.
Abstract: Cardiac ultrasound (US) is among the most widely used diagnostic tools in cardiology for assessing heart health, but its effectiveness is limited by operator dependence, time constraints, and human error. The shortage of trained professionals, especially in remote areas, further restricts access. These issues underscore the need for automated solutions that can ensure consistent, and accessible cardiac imaging regardless of operator skill or location. Recent progress in artificial intelligence (AI), especially in deep reinforcement learning (DRL), has gained attention for enabling autonomous decision-making. However, existing DRL-based approaches to cardiac US scanning lack reproducibility, rely on proprietary data, and use simplified models. Motivated by these gaps, we present the first end-to-end framework that integrates generative AI and DRL to enable autonomous and reproducible cardiac US scanning. The framework comprises two components: (i) a conditional generative simulator combining Generative Adversarial Networks (GANs) with Variational Autoencoders (VAEs), that models the cardiac US environment producing realistic action-conditioned images; and (ii) a DRL module that leverages this simulator to learn autonomous, accurate scanning policies. The proposed framework delivers AI-driven guidance through expert-validated models that classify image type and assess quality, supports conditional generation of realistic US images, and establishes a reproducible foundation extendable to other organs. To ensure reproducibility, a publicly available dataset of real cardiac US scans is released. The solution is validated through several experiments. The VAE-GAN is benchmarked against existing GAN variants, with performance assessed using qualitative and quantitative approaches, while the DRL-based scanning system is evaluated under varying configurations to demonstrate effectiveness.
[197] VLM6D: VLM based 6Dof Pose Estimation based on RGB-D Images
Md Selim Sarowar, Sungho Kim
Main category: cs.CV
TL;DR: VLM6D is a dual-stream architecture for robust 6D object pose estimation that combines visual features from RGB images using DINOv2 transformer and geometric features from point clouds using PointNet++, achieving state-of-the-art performance on challenging occlusion scenarios.
Details
Motivation: Current 6D pose estimation methods struggle with generalization from synthetic to real-world data, particularly under varying lighting, textureless objects, and severe occlusions.
Method: Dual-stream architecture with separate encoders: DINOv2 Vision Transformer for RGB images and PointNet++ for 3D point clouds from depth data, followed by feature fusion and multi-task prediction head.
Result: VLM6D achieved new state-of-the-art performance on the challenging Occluded-LineMOD benchmark, demonstrating superior robustness and accuracy.
Conclusion: The integration of complementary visual and geometric features enables robust 6D pose estimation that handles texture variations, lighting changes, and severe occlusions effectively.
Abstract: The primary challenge in computer vision is precisely calculating the 6D pose of objects; however, many current approaches are still fragile and have trouble generalizing from synthetic data to real-world situations with fluctuating lighting, textureless objects, and significant occlusions. To address these limitations, we propose VLM6D, a novel dual-stream architecture that leverages the distinct strengths of visual and geometric data from RGB-D input for robust and precise pose estimation. Our framework uniquely integrates two specialized encoders: a powerful, self-supervised Vision Transformer (DINOv2) processes the RGB modality, harnessing its rich, pre-trained understanding of visual grammar to achieve remarkable resilience against texture and lighting variations. Concurrently, a PointNet++ encoder processes the 3D point cloud derived from depth data, enabling robust geometric reasoning that excels even with the sparse, fragmented data typical of severe occlusion. These complementary feature streams are effectively fused to inform a multi-task prediction head. We demonstrate through comprehensive experiments that VLM6D obtained new SOTA performance on the challenging Occluded-LineMOD, validating its superior robustness and accuracy.
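To illustrate the dual-stream design, the sketch below fuses a pooled image feature and a pooled point-cloud feature and predicts rotation and translation with a small multi-task head. The encoders themselves (DINOv2 and PointNet++) are treated as black boxes, and the feature dimensions and 6D rotation parameterization are illustrative assumptions.

```python
# Sketch: fuse RGB and point-cloud features, predict rotation and translation.
import torch
import torch.nn as nn

class DualStreamPoseHead(nn.Module):
    def __init__(self, rgb_dim=768, pts_dim=256, hidden=512):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(rgb_dim + pts_dim, hidden), nn.ReLU())
        self.rot_head = nn.Linear(hidden, 6)    # 6D rotation representation
        self.trans_head = nn.Linear(hidden, 3)  # translation vector

    def forward(self, rgb_feat, pts_feat):
        fused = self.fuse(torch.cat([rgb_feat, pts_feat], dim=-1))
        return self.rot_head(fused), self.trans_head(fused)

# Usage with pooled per-object features from the two (pretrained) encoders:
head = DualStreamPoseHead()
rot6d, trans = head(torch.randn(2, 768), torch.randn(2, 256))
```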
[198] Integrating ConvNeXt and Vision Transformers for Enhancing Facial Age Estimation
Gaby Maroun, Salah Eddine Bekhouche, Fadi Dornaika
Main category: cs.CV
TL;DR: A novel hybrid architecture combining ConvNeXt and Vision Transformers (ViT) for age estimation from facial images, achieving superior performance on benchmark datasets.
Details
Motivation: To leverage complementary strengths of CNNs' localized feature extraction and Transformers' global attention mechanisms for improved age estimation accuracy.
Method: Integrated ConvNeXt with ViT, used pre-trained models, linear layers, advanced regularization techniques, and adapted attention mechanisms within CNN framework.
Result: Achieved superior performance in mean absolute error (MAE) on MORPH II, CACD, and AFAD datasets, outperforming traditional methods.
Conclusion: Hybrid ConvNeXt-ViT architecture provides robust foundation for future advances in age estimation and demonstrates transformative potential of combining CNNs and transformers.
Abstract: Age estimation from facial images is a complex and multifaceted challenge in computer vision. In this study, we present a novel hybrid architecture that combines ConvNeXt, a state-of-the-art advancement of convolutional neural networks (CNNs), with Vision Transformers (ViT). While each model independently delivers excellent performance on a variety of tasks, their integration leverages the complementary strengths of the CNNs localized feature extraction capabilities and the Transformers global attention mechanisms. Our proposed ConvNeXt-ViT hybrid solution was thoroughly evaluated on benchmark age estimation datasets, including MORPH II, CACD, and AFAD, and achieved superior performance in terms of mean absolute error (MAE). To address computational constraints, we leverage pre-trained models and systematically explore different configurations, using linear layers and advanced regularization techniques to optimize the architecture. Comprehensive ablation studies highlight the critical role of individual components and training strategies, and in particular emphasize the importance of adapted attention mechanisms within the CNN framework to improve the model focus on age-relevant facial features. The results show that the ConvNeXt-ViT hybrid not only outperforms traditional methods, but also provides a robust foundation for future advances in age estimation and related visual tasks. This work underscores the transformative potential of hybrid architectures and represents a promising direction for the seamless integration of CNNs and transformers to address complex computer vision challenges.
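A minimal sketch of the hybrid idea: extract pooled features from a ConvNeXt branch and a ViT branch, concatenate them, and regress a scalar age. The specific timm model names, head width, and regularization below are illustrative assumptions, not the paper's configuration.

```python
# Sketch: ConvNeXt + ViT feature concatenation with a small regression head.
import timm
import torch
import torch.nn as nn

class ConvNeXtViTAge(nn.Module):
    def __init__(self):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits
        self.cnn = timm.create_model("convnext_tiny", pretrained=True, num_classes=0)
        self.vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
        self.head = nn.Sequential(
            nn.Linear(self.cnn.num_features + self.vit.num_features, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 1),   # scalar age estimate, typically trained with an L1/MAE loss
        )

    def forward(self, x):
        feats = torch.cat([self.cnn(x), self.vit(x)], dim=-1)
        return self.head(feats).squeeze(-1)
```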
[199] FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding
Janghoon Cho, Jungsoo Lee, Munawar Hayat, Kyuwoong Hwang, Fatih Porikli, Sungha Choi
Main category: cs.CV
TL;DR: FLoC is an efficient visual token compression framework for long video understanding that uses facility location function and lazy greedy algorithm to select a compact, representative subset of visual tokens while maintaining near-optimal performance.
Details
Motivation: The scalability of video-LMMs is limited by the overwhelming volume of visual tokens from extended video sequences, creating a need for efficient token compression methods.
Method: Proposes FLoC framework based on facility location function with lazy greedy algorithm to select a compact, representative subset of visual tokens within a predefined budget. The approach is training-free, model-agnostic, and query-agnostic.
Result: Extensive evaluations on Video-MME, MLVU, and LongVideoBench show FLoC consistently surpasses recent compression techniques in effectiveness, robustness, and processing speed.
Conclusion: FLoC provides a versatile, efficient solution for visual token compression in long video understanding that seamlessly integrates with diverse video-LLMs and existing workflows while maintaining performance.
Abstract: Recent studies in long video understanding have harnessed the advanced visual-language reasoning capabilities of Large Multimodal Models (LMMs), driving the evolution of video-LMMs specialized for processing extended video sequences. However, the scalability of these models is severely limited by the overwhelming volume of visual tokens generated from extended video sequences. To address this challenge, this paper proposes FLoC, an efficient visual token compression framework based on the facility location function, a principled approach that swiftly selects a compact yet highly representative and diverse subset of visual tokens within a predefined budget on the number of visual tokens. By integrating the lazy greedy algorithm, our method achieves remarkable efficiency gains by swiftly selecting a compact subset of tokens, drastically reducing the number of visual tokens while guaranteeing near-optimal performance. Notably, our approach is training-free, model-agnostic, and query-agnostic, providing a versatile solution that seamlessly integrates with diverse video-LLMs and existing workflows. Extensive evaluations on large-scale benchmarks, such as Video-MME, MLVU, and LongVideoBench, demonstrate that our framework consistently surpasses recent compression techniques, highlighting not only its effectiveness and robustness in addressing the critical challenges of long video understanding, but also its efficiency in processing speed.
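To make the facility-location selection concrete, the sketch below greedily maximizes F(S) = Σ_i max_{j∈S} sim(i, j) over visual-token features under a token budget, using the standard lazy greedy trick of re-evaluating a stale gain only when it reaches the top of the heap. This illustrates the principle, not the paper's implementation.

```python
# Sketch: facility-location token selection with lazy greedy maximization.
import heapq
import numpy as np

def facility_location_select(tokens, budget):
    """tokens: (N, d) array of visual token features. Returns selected indices."""
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = np.clip(t @ t.T, 0, None)                  # non-negative cosine similarities
    coverage = np.zeros(len(t))                      # best similarity to the selected set
    heap = [(-float(sim[:, j].sum()), j) for j in range(len(t))]  # initial gains
    heapq.heapify(heap)
    selected = []
    while heap and len(selected) < budget:
        neg_stale, j = heapq.heappop(heap)
        gain = float(np.maximum(sim[:, j] - coverage, 0).sum())   # refreshed marginal gain
        if not heap or gain >= -heap[0][0] - 1e-12:
            # still beats the best remaining stale gain -> genuinely the argmax
            selected.append(j)
            coverage = np.maximum(coverage, sim[:, j])
        else:
            heapq.heappush(heap, (-gain, j))         # lazy re-insertion with updated gain
    return selected

# Example: keep 32 of 1,000 visual tokens
idx = facility_location_select(np.random.rand(1000, 64), budget=32)
```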
[200] BEN: Using Confidence-Guided Matting for Dichotomous Image Segmentation
Maxwell Meyer, Jack Spruyt
Main category: cs.CV
TL;DR: Proposes Confidence-Guided Matting (CGM) for dichotomous image segmentation, combining image matting and segmentation techniques through a two-component BEN model that achieves state-of-the-art results on DIS5K.
Details
Motivation: Current approaches treat image matting and object segmentation as separate tasks, limiting architectural innovation. Combining these techniques offers promising directions for improving segmentation quality.
Method: Proposed Confidence-Guided Matting (CGM) with Background Erase Network (BEN): BEN Base for initial segmentation and BEN Refiner for confidence-based refinement using matting techniques.
Result: Achieves substantial improvements over state-of-the-art methods on DIS5K validation dataset, demonstrating significant enhancement in segmentation quality through matting-based refinement.
Conclusion: Introduces a new paradigm for integrating matting and segmentation techniques, improving fine-grained object boundary prediction in computer vision.
Abstract: Current approaches to dichotomous image segmentation (DIS) treat image matting and object segmentation as fundamentally different tasks. As improvements in image segmentation become increasingly challenging to achieve, combining image matting and grayscale segmentation techniques offers promising new directions for architectural innovation. Inspired by the possibility of aligning these two model tasks, we propose a new architectural approach for DIS called Confidence-Guided Matting (CGM). We created the first CGM model called Background Erase Network (BEN). BEN consists of two components: BEN Base for initial segmentation and BEN Refiner for confidence-based refinement. Our approach achieves substantial improvements over current state-of-the-art methods on the DIS5K validation dataset, demonstrating that matting-based refinement can significantly enhance segmentation quality. This work introduces a new paradigm for integrating matting and segmentation techniques, improving fine-grained object boundary prediction in computer vision.
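A hedged sketch of what confidence-guided refinement can look like: the base network's soft mask yields a per-pixel confidence, and the refiner's prediction replaces the base output only where confidence is low. The module interfaces, confidence definition, and blending rule are assumptions, not BEN's actual design.

```python
# Sketch: refine only the pixels where the base segmentation is uncertain.
import torch

def confidence_guided_matting(base_net, refiner_net, image, conf_thresh=0.8):
    base_prob = torch.sigmoid(base_net(image))          # (B, 1, H, W) soft mask
    confidence = (base_prob - 0.5).abs() * 2.0          # 0 = uncertain, 1 = certain
    refined_prob = torch.sigmoid(refiner_net(image, base_prob))
    uncertain = (confidence < conf_thresh).float()
    return uncertain * refined_prob + (1.0 - uncertain) * base_prob
```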
[201] BlurGuard: A Simple Approach for Robustifying Image Protection Against AI-Powered Editing
Jinsu Kim, Yunhun Nam, Minseon Kim, Sangpil Kim, Jongheon Jeong
Main category: cs.CV
TL;DR: Proposes BlurGuard, a method that applies adaptive Gaussian blur to adversarial noise to make image protection more robust against reversal techniques like JPEG compression, while maintaining imperceptibility.
Details
Motivation: Existing adversarial noise for protecting images from text-to-image model editing is easily reversed by simple techniques like JPEG compression, limiting practical effectiveness.
Method: Applies adaptive per-region Gaussian blur on adversarial noise to adjust the frequency spectrum, making the noise more robust against reversal while maintaining imperceptibility.
Result: Consistently improves worst-case protection performance against various reversal techniques across diverse editing scenarios, while reducing quality degradation in perceptual metrics.
Conclusion: BlurGuard effectively enhances the robustness of image protection methods by making adversarial noise more irreversible, addressing a key limitation of prior approaches.
Abstract: Recent advances in text-to-image models have increased the exposure of powerful image editing techniques as a tool, raising concerns about their potential for malicious use. An emerging line of research to address such threats focuses on implanting “protective” adversarial noise into images before their public release, so future attempts to edit them using text-to-image models can be impeded. However, subsequent works have shown that these adversarial noises are often easily “reversed,” e.g., with techniques as simple as JPEG compression, casting doubt on the practicality of the approach. In this paper, we argue that adversarial noise for image protection should not only be imperceptible, as has been a primary focus of prior work, but also irreversible, viz., it should be difficult to detect as noise provided that the original image is hidden. We propose a surprisingly simple method to enhance the robustness of image protection methods against noise reversal techniques. Specifically, it applies an adaptive per-region Gaussian blur on the noise to adjust the overall frequency spectrum. Through extensive experiments, we show that our method consistently improves the per-sample worst-case protection performance of existing methods against a wide range of reversal techniques on diverse image editing scenarios, while also reducing quality degradation due to noise in terms of perceptual metrics. Code is available at https://github.com/jsu-kim/BlurGuard.
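The core operation is simple to prototype. The sketch below applies a per-tile Gaussian blur to a protective noise map, with the blur strength chosen from local noise statistics; the tiling scheme and the variance-based sigma rule are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch: blur the protective noise tile-by-tile, using a larger Gaussian
# sigma where the noise has more energy, then re-add it to the clean image.
# Assumes noise is an (H, W, C) float array in [-eps, eps].
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_noise_per_region(noise: np.ndarray, tile: int = 32,
                          max_sigma: float = 2.0) -> np.ndarray:
    out = noise.copy()
    h, w = noise.shape[:2]
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch = noise[y:y + tile, x:x + tile]
            # Heuristic: stronger smoothing for noisier (higher-variance) regions.
            sigma = max_sigma * float(patch.std()) / (float(noise.std()) + 1e-8)
            out[y:y + tile, x:x + tile] = gaussian_filter(patch, sigma=(sigma, sigma, 0))
    return out

# Usage (illustrative): protected = np.clip(clean + blur_noise_per_region(adv_noise), 0.0, 1.0)
```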
[202] A Racing Dataset and Baseline Model for Track Detection in Autonomous Racing
Shreya Ghosh, Yi-Huan Chen, Ching-Hsiang Huang, Abu Shafin Mohammad Mahdee Jameel, Chien Chou Ho, Aly El Gamal, Samuel Labi
Main category: cs.CV
TL;DR: RoRaTrack dataset with annotated multi-camera racing images for track detection, collected using Dallara AV-21 at Indiana racing circuit. RaceGAN baseline model outperforms state-of-the-art in handling racing-specific challenges.
Details
Motivation: Lack of publicly available datasets with raw images and annotations for racing-related research, particularly for track detection tasks.
Method: RaceGAN - a Generative Adversarial Network (GAN) baseline model designed to handle racing-specific challenges like blurriness, color inversion, and absence of lane markings.
Result: The proposed RaceGAN model demonstrates superior performance compared to current state-of-the-art machine learning models in track detection.
Conclusion: RoRaTrack dataset and RaceGAN model effectively address racing-specific challenges in track detection, with dataset and code publicly available for research use.
Abstract: A significant challenge in racing-related research is the lack of publicly available datasets containing raw images with corresponding annotations for the downstream task. In this paper, we introduce RoRaTrack, a novel dataset that contains annotated multi-camera image data from racing scenarios for track detection. The data is collected on a Dallara AV-21 at a racing circuit in Indiana, in collaboration with the Indy Autonomous Challenge (IAC). RoRaTrack addresses common problems such as blurriness due to high speed, color inversion from the camera, and absence of lane markings on the track. Consequently, we propose RaceGAN, a baseline model based on a Generative Adversarial Network (GAN) that effectively addresses these challenges. The proposed model demonstrates superior performance compared to current state-of-the-art machine learning models in track detection. The dataset and code for this work are available at https://github.com/ghosh64/RaceGAN.
[203] CompAgent: An Agentic Framework for Visual Compliance Verification
Rahul Ghosh, Baishali Chaudhury, Hari Prasanna Das, Meghana Ashok, Ryan Razkenari, Sungmin Hong, Chun-Hao Liu
Main category: cs.CV
TL;DR: CompAgent is an agentic framework that augments multi-modal LLMs with visual tools for visual compliance verification, outperforming specialized classifiers and direct MLLM prompting.
Details
Motivation: Visual compliance verification is critical in domains like media and advertising, but existing methods rely on costly task-specific models with limited generalizability, while MLLMs struggle with fine-grained visual reasoning.
Method: CompAgent uses a planning agent to dynamically select visual tools (object detectors, face analyzers, etc.) and a verification agent that integrates image, tool outputs, and policy context for multi-modal reasoning.
Result: CompAgent achieves up to 76% F1 score and 10% improvement over state-of-the-art on UnsafeBench dataset, outperforming specialized classifiers, direct MLLM prompting, and routing baselines.
Conclusion: Agentic planning and tool-augmented reasoning enable scalable, accurate, and adaptable visual compliance verification.
Abstract: Visual compliance verification is a critical yet underexplored problem in computer vision, especially in domains such as media, entertainment, and advertising where content must adhere to complex and evolving policy rules. Existing methods often rely on task-specific deep learning models trained on manually labeled datasets, which are costly to build and limited in generalizability. While recent multi-modal large language models (MLLMs) offer broad real-world knowledge and policy understanding, they struggle to reason over fine-grained visual details and apply structured compliance rules effectively on their own. In this paper, we propose CompAgent, the first agentic framework for visual compliance verification. CompAgent augments MLLMs with a suite of visual tools - such as object detectors, face analyzers, NSFW detectors, and captioning models - and introduces a planning agent that dynamically selects appropriate tools based on the compliance policy. A verification agent then integrates image, tool outputs, and policy context to perform multi-modal reasoning. Experiments on public benchmarks show that CompAgent outperforms specialized classifiers, direct MLLM prompting, and curated routing baselines, achieving up to 76% F1 score and a 10% improvement over the state-of-the-art on the UnsafeBench dataset. Our results demonstrate the effectiveness of agentic planning and tool-augmented reasoning for scalable, accurate, and adaptable visual compliance verification.
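The planner/verifier split can be pictured with a small skeleton. In the hedged sketch below the planner is a keyword heuristic and the tools are stubs standing in for real detectors; in the actual framework both the planner and the verifier would be LLM/MLLM calls, and all names here are hypothetical.

```python
# Minimal sketch of the agentic pattern described above (hypothetical names,
# not CompAgent's API): a planner picks visual tools relevant to a policy,
# and a verifier fuses the tool outputs with the policy into a verdict.
from typing import Callable

TOOLS: dict[str, Callable[[bytes], dict]] = {
    "face_analyzer": lambda img: {"faces": 2},           # stand-in tool stubs
    "nsfw_detector": lambda img: {"nsfw_score": 0.03},
    "object_detector": lambda img: {"objects": ["bottle", "logo"]},
}

def plan_tools(policy: str) -> list[str]:
    # A real planner would be an LLM call; a keyword heuristic stands in here.
    wanted = []
    if "minor" in policy or "face" in policy:
        wanted.append("face_analyzer")
    if "explicit" in policy:
        wanted.append("nsfw_detector")
    wanted.append("object_detector")
    return wanted

def verify(image: bytes, policy: str) -> dict:
    evidence = {name: TOOLS[name](image) for name in plan_tools(policy)}
    # A real verifier would prompt an MLLM with the image, evidence, and policy text.
    compliant = evidence.get("nsfw_detector", {}).get("nsfw_score", 0.0) < 0.5
    return {"evidence": evidence, "compliant": compliant}

print(verify(b"...image bytes...", policy="no explicit content, no minors"))
```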
[204] Keep It Real: Challenges in Attacking Compression-Based Adversarial Purification
Samuel Räber, Till Aczel, Andreas Plesner, Roger Wattenhofer
Main category: cs.CV
TL;DR: Lossy compression preprocessing can defend against adversarial attacks, with high-realism reconstructions being substantially more resistant to attacks than low-realism ones, due to distributional alignment with natural images rather than gradient masking.
Details
Motivation: Previous work suggested lossy compression could defend against adversarial perturbations, but lacked comprehensive attack evaluations. This paper aims to construct strong attacks against compression models to understand their true defensive capabilities.
Method: Constructed strong white-box and adaptive attacks against various compression models, with rigorous evaluation across multiple attack scenarios to test robustness.
Result: Compression models producing realistic, high-fidelity reconstructions are substantially more resistant to attacks, while low-realism compression models can be broken. The resistance is not due to gradient masking but rather realistic reconstructions maintaining distributional alignment with natural images.
Conclusion: High realism in reconstructed images presents a significant obstacle for adversarial attacks, and developing techniques to overcome this realism represents an essential challenge for comprehensive security evaluation.
Abstract: Previous work has suggested that preprocessing images through lossy compression can defend against adversarial perturbations, but comprehensive attack evaluations have been lacking. In this paper, we construct strong white-box and adaptive attacks against various compression models and identify a critical challenge for attackers: high realism in reconstructed images significantly increases attack difficulty. Through rigorous evaluation across multiple attack scenarios, we demonstrate that compression models capable of producing realistic, high-fidelity reconstructions are substantially more resistant to our attacks. In contrast, low-realism compression models can be broken. Our analysis reveals that this is not due to gradient masking. Rather, realistic reconstructions maintaining distributional alignment with natural images seem to offer inherent robustness. This work highlights a significant obstacle for future adversarial attacks and suggests that developing more effective techniques to overcome realism represents an essential challenge for comprehensive security evaluation.
[205] From Evidence to Verdict: An Agent-Based Forensic Framework for AI-Generated Image Detection
Mengfei Liang, Yiting Qu, Yukun Jiang, Michael Backes, Yang Zhang
Main category: cs.CV
TL;DR: AIFo is a training-free multi-agent framework for AI-generated image detection that emulates human forensic investigation through collaborative agents using various forensic tools and structured debate mechanisms.
Details
Motivation: Existing AI-generated image detection methods have limitations: traditional classifiers lack interpretability and generalization, while vision-language models are constrained to single-shot analysis and pixel-level reasoning.
Method: Uses multi-agent collaboration with forensic tools (reverse image search, metadata extraction, classifiers, VLM analysis) coordinated by LLM-based agents, plus structured multi-agent debate and memory-augmented reasoning from historical cases.
Result: Achieves 97.05% accuracy on 6,000 images across controlled and real-world scenarios, substantially outperforming traditional classifiers and state-of-the-art VLMs.
Conclusion: Agent-based procedural reasoning offers a new paradigm for more robust, interpretable, and adaptable AI-generated image detection.
Abstract: The rapid evolution of AI-generated images poses unprecedented challenges to information integrity and media authenticity. Existing detection approaches suffer from fundamental limitations: traditional classifiers lack interpretability and fail to generalize across evolving generative models, while vision-language models (VLMs), despite their promise, remain constrained to single-shot analysis and pixel-level reasoning. To address these challenges, we introduce AIFo (Agent-based Image Forensics), a novel training-free framework that emulates human forensic investigation through multi-agent collaboration. Unlike conventional methods, our framework employs a set of forensic tools, including reverse image search, metadata extraction, pre-trained classifiers, and VLM analysis, coordinated by specialized LLM-based agents that collect, synthesize, and reason over cross-source evidence. When evidence is conflicting or insufficient, a structured multi-agent debate mechanism allows agents to exchange arguments and reach a reliable conclusion. Furthermore, we enhance the framework with a memory-augmented reasoning module that learns from historical cases to improve future detection accuracy. Our comprehensive evaluation spans 6,000 images across both controlled laboratory settings and challenging real-world scenarios, including images from modern generative platforms and diverse online sources. AIFo achieves 97.05% accuracy, substantially outperforming traditional classifiers and state-of-the-art VLMs. These results demonstrate that agent-based procedural reasoning offers a new paradigm for more robust, interpretable, and adaptable AI-generated image detection.
[206] A Retrospect to Multi-prompt Learning across Vision and Language
Ziliang Chen, Xin Huang, Quanlong Guan, Liang Lin, Weiqi Luo
Main category: cs.CV
TL;DR: The paper proposes Energy-based Multi-prompt Learning (EMPL), a parameter-efficient method that generates multiple prompt embeddings from an energy-based distribution defined by Vision-Language Models to improve generalization.
Details
Motivation: Existing research focuses on single-prompt paradigms, but the potential of multi-prompt learning for Vision-Language Models remains unexplored. The paper aims to provide principled analysis of vision-language multi-prompt learning.
Method: EMPL generates multiple prompt embeddings by drawing instances from an energy-based distribution implicitly defined by VLMs. This approach extends the constant modality gap phenomenon to learnable prompts.
Result: Comprehensive experiments justify the superiority of vision-language transfer with multi-prompt augmentation and demonstrate EMPL’s excellence in achieving balance between in-domain and out-of-domain open-vocabulary generalization.
Conclusion: Multi-prompt learning with EMPL is parameter-efficient and rigorously leads to better generalization balance, showing significant potential beyond single-prompt paradigms in vision-language pretraining.
Abstract: The vision community is undergoing unprecedented progress with the emergence of Vision-Language Pretraining Models (VLMs). Prompt learning serves as the holy grail of accessing VLMs since it enables their fast adaptation to downstream tasks with limited resources. Whereas existing research centers on single-prompt paradigms, it rarely investigates the technical potential behind their multi-prompt learning counterparts. This paper aims to provide a principled retrospect for vision-language multi-prompt learning. We extend the recent constant modality gap phenomenon to learnable prompts and then justify the superiority of vision-language transfer with multi-prompt augmentation, empirically and theoretically. In light of this observation, we propose an Energy-based Multi-prompt Learning (EMPL) method to generate multiple prompt embeddings by drawing instances from an energy-based distribution, which is implicitly defined by VLMs. Our EMPL is thus not only parameter-efficient but also rigorously leads to a balance between in-domain and out-of-domain open-vocabulary generalization. Comprehensive experiments have been conducted to justify our claims and the excellence of EMPL.
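Drawing several prompt embeddings from an energy-based distribution is commonly done with Langevin dynamics. The sketch below shows that generic sampler; the toy energy function, the step sizes, and the dimensionality are illustrative assumptions rather than the paper's choices.

```python
# Hedged sketch: sample multiple prompt embeddings from an energy-based
# distribution via unadjusted Langevin dynamics. The energy here (distance to
# a fixed anchor) is a toy stand-in for an energy defined by a frozen VLM.
import torch

def sample_prompts(energy_fn, num_prompts: int, dim: int,
                   steps: int = 50, step_size: float = 0.01) -> torch.Tensor:
    prompts = torch.randn(num_prompts, dim)
    for _ in range(steps):
        prompts = prompts.detach().requires_grad_(True)
        energy = energy_fn(prompts).sum()
        (grad,) = torch.autograd.grad(energy, prompts)
        with torch.no_grad():
            # Gradient step on the energy plus Gaussian exploration noise.
            prompts = prompts - 0.5 * step_size * grad \
                      + (step_size ** 0.5) * torch.randn_like(prompts)
    return prompts.detach()

anchor = torch.randn(512)                      # hypothetical class anchor in embedding space
prompt_bank = sample_prompts(lambda p: ((p - anchor) ** 2).sum(dim=-1),
                             num_prompts=8, dim=512)
print(prompt_bank.shape)                       # torch.Size([8, 512])
```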
[207] DM-QPMNET: Dual-modality fusion network for cell segmentation in quantitative phase microscopy
Rajatsubhra Chakraborty, Ana Espinosa-Momox, Riley Haskin, Depeng Xu, Rosario Porras-Aguilar
Main category: cs.CV
TL;DR: DM-QPMNet is a dual-encoder network for cell segmentation in single-shot quantitative phase microscopy that treats polarized intensity images and phase maps as distinct modalities, using multi-head attention for content-aware feature fusion.
Details
Motivation: Traditional thresholding methods are sensitive to noise and cell density, while deep learning approaches using simple channel concatenation fail to exploit the complementary nature of polarized intensity images and phase maps in ssQPM.
Method: Dual-encoder network with separate encoding streams for each modality, fusing modality-specific features at intermediate depth via multi-head attention, with dual-source skip connections and per-modality normalization.
Result: Substantial improvements over monolithic concatenation and single-modality baselines, demonstrating effective exploitation of ssQPM’s simultaneous capture of complementary illumination and phase cues.
Conclusion: Modality-specific encoding with learnable fusion effectively exploits ssQPM’s complementary illumination and phase cues for robust cell segmentation, preserving training stability while adding principled multi-modal integration.
Abstract: Cell segmentation in single-shot quantitative phase microscopy (ssQPM) faces challenges from traditional thresholding methods that are sensitive to noise and cell density, while deep learning approaches using simple channel concatenation fail to exploit the complementary nature of polarized intensity images and phase maps. We introduce DM-QPMNet, a dual-encoder network that treats these as distinct modalities with separate encoding streams. Our architecture fuses modality-specific features at intermediate depth via multi-head attention, enabling polarized edge and texture representations to selectively integrate complementary phase information. This content-aware fusion preserves training stability while adding principled multi-modal integration through dual-source skip connections and per-modality normalization at minimal overhead. Our approach demonstrates substantial improvements over monolithic concatenation and single-modality baselines, showing that modality-specific encoding with learnable fusion effectively exploits ssQPM’s simultaneous capture of complementary illumination and phase cues for robust cell segmentation.
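A minimal version of the described fusion step, assuming both encoders emit token sequences of equal width: intensity features attend over phase features via multi-head attention, with per-modality normalization and a residual connection. Module names and shapes are my own, not the released code.

```python
# Sketch of cross-modal fusion between two encoder streams (illustrative only).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)    # per-modality normalization
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, intensity_feats: torch.Tensor,
                phase_feats: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, tokens, dim), e.g. flattened spatial features.
        q = self.norm_q(intensity_feats)
        kv = self.norm_kv(phase_feats)
        fused, _ = self.attn(q, kv, kv)    # intensity queries attend over phase
        return intensity_feats + fused     # residual keeps training stable

fusion = CrossModalFusion()
a = torch.randn(2, 16 * 16, 256)
b = torch.randn(2, 16 * 16, 256)
print(fusion(a, b).shape)                  # torch.Size([2, 256, 256])
```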
[208] Towards 1000-fold Electron Microscopy Image Compression for Connectomics via VQ-VAE with Transformer Prior
Fuming Yang, Yicong Li, Hanspeter Pfister, Jeff W. Lichtman, Yaron Meirovitch
Main category: cs.CV
TL;DR: VQ-VAE compression framework for EM data with 16x-1024x compression, featuring pay-as-you-decode capability and ROI-driven selective reconstruction.
Details
Motivation: Petascale EM datasets challenge storage, transfer, and analysis capabilities, requiring efficient compression methods.
Method: Vector-quantized variational autoencoder (VQ-VAE) with Transformer prior for texture restoration via FiLM and concatenation, plus ROI-driven workflow for selective high-resolution reconstruction.
Result: Enables extreme compression (1024x) with flexible decoding options and targeted high-resolution reconstruction from compressed latents.
Conclusion: The framework provides scalable compression for EM data while maintaining reconstruction quality through adaptive decoding strategies.
Abstract: Petascale electron microscopy (EM) datasets push storage, transfer, and downstream analysis toward their current limits. We present a vector-quantized variational autoencoder-based (VQ-VAE) compression framework for EM that spans 16x to 1024x and enables pay-as-you-decode usage: top-only decoding for extreme compression, with an optional Transformer prior that predicts bottom tokens (without changing the compression ratio) to restore texture via feature-wise linear modulation (FiLM) and concatenation; we further introduce an ROI-driven workflow that performs selective high-resolution reconstruction from 1024x-compressed latents only where needed.
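FiLM itself is a small, standard module: a conditioning vector predicts a per-channel scale and shift for the decoder features. The sketch below shows that generic mechanism; the shapes and the idea of summarizing the predicted bottom tokens into a vector are illustrative assumptions.

```python
# Minimal FiLM (feature-wise linear modulation) sketch: a conditioning vector
# scales and shifts decoder feature maps channel-wise.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, feats: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) decoder features; cond: (B, cond_dim) summary of
        # the Transformer prior's predicted bottom tokens (assumed pooling).
        gamma, beta = self.to_scale_shift(cond).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]
        beta = beta[:, :, None, None]
        return gamma * feats + beta

film = FiLM(cond_dim=128, num_channels=64)
x = torch.randn(4, 64, 32, 32)
c = torch.randn(4, 128)
print(film(x, c).shape)                    # torch.Size([4, 64, 32, 32])
```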
[209] Hyperbolic Optimal Transport
Yan Bin Ng, Xianfeng Gu
Main category: cs.CV
TL;DR: Proposes a novel algorithm for computing optimal transport maps in hyperbolic space using geometric variational techniques, extending methods from Euclidean and spherical geometry.
Details
Motivation: Optimal transport has diverse applications but existing methods are primarily for Euclidean spaces and spheres. Hyperbolic space naturally arises in contexts with hierarchical data, networks, and multi-genus Riemann surfaces.
Method: Extends geometric variational techniques from Euclidean and spherical geometry to hyperbolic space to compute optimal transport maps efficiently.
Result: The proposed method is validated through experiments on synthetic data and multi-genus surface models, demonstrating its efficacy.
Conclusion: The paper successfully develops an efficient algorithm for optimal transport in hyperbolic space, addressing a gap in existing methods and showing practical applicability in hierarchical data contexts.
Abstract: The optimal transport (OT) problem aims to find the most efficient mapping between two probability distributions under a given cost function, and has diverse applications in many fields such as machine learning, computer vision and computer graphics. However, existing methods for computing optimal transport maps are primarily developed for Euclidean spaces and the sphere. In this paper, we explore the problem of computing the optimal transport map in hyperbolic space, which naturally arises in contexts involving hierarchical data, networks, and multi-genus Riemann surfaces. We propose a novel and efficient algorithm for computing the optimal transport map in hyperbolic space using a geometric variational technique by extending methods for Euclidean and spherical geometry to the hyperbolic setting. We also perform experiments on synthetic data and multi-genus surface models to validate the efficacy of the proposed method.
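For reference, the standard Poincaré-ball distance that typically serves as the ground metric in hyperbolic settings, together with the Monge form of the transport problem it induces. The squared-distance cost shown here is a common but assumed choice, not necessarily the paper's.

```latex
% Standard reference formulas (not specific to the paper): hyperbolic distance
% in the Poincare ball model, and the Monge optimal transport problem with the
% squared hyperbolic distance as cost.
\[
  d_{\mathbb{H}}(u, v) \;=\; \operatorname{arcosh}\!\left(
    1 + \frac{2\,\lVert u - v \rVert^{2}}
             {\bigl(1 - \lVert u \rVert^{2}\bigr)\bigl(1 - \lVert v \rVert^{2}\bigr)}
  \right),
  \qquad
  \min_{T_{\#}\mu = \nu} \int d_{\mathbb{H}}\bigl(x, T(x)\bigr)^{2}\, d\mu(x).
\]
```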
[210] Object-Aware 4D Human Motion Generation
Shurui Gui, Deep Anil Patel, Xiner Li, Martin Renqiang Min
Main category: cs.CV
TL;DR: Proposes MSDI framework for generating realistic 4D human motions using 3D Gaussian representations and motion diffusion priors, enabling zero-shot object-aware motion generation without retraining.
Details
Motivation: Current video diffusion models produce unrealistic deformations and physical inconsistencies due to lack of 3D physical priors, requiring better spatial and semantic constraints for human-object interactions.
Method: Combines Motion Score Distilled Interaction (MSDI) with Motion Diffusion Score Distillation Sampling (MSDS) and LLMs to distill score gradients from pre-trained motion models while respecting object and semantic constraints.
Result: Produces natural and physically plausible human motions that respect 3D spatial context, generalizing to out-of-distribution object-aware motions without retraining.
Conclusion: Offers a scalable zero-shot solution for realistic 4D human motion generation by integrating 3D physical priors with motion diffusion and semantic constraints.
Abstract: Recent advances in video diffusion models have enabled the generation of high-quality videos. However, these videos still suffer from unrealistic deformations, semantic violations, and physical inconsistencies that are largely rooted in the absence of 3D physical priors. To address these challenges, we propose an object-aware 4D human motion generation framework grounded in 3D Gaussian representations and motion diffusion priors. With pre-generated 3D humans and objects, our method, Motion Score Distilled Interaction (MSDI), employs the spatial and prompt semantic information in large language models (LLMs) and motion priors through the proposed Motion Diffusion Score Distillation Sampling (MSDS). The combination of MSDS and LLMs enables our spatial-aware motion optimization, which distills score gradients from pre-trained motion diffusion models, to refine human motion while respecting object and semantic constraints. Unlike prior methods requiring joint training on limited interaction datasets, our zero-shot approach avoids retraining and generalizes to out-of-distribution object aware human motions. Experiments demonstrate that our framework produces natural and physically plausible human motions that respect 3D spatial context, offering a scalable solution for realistic 4D generation.
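For context, the standard score-distillation gradient that such distillation schemes build on (generic form, not necessarily the exact MSDS objective): here x = g(θ) is the generated motion, ε̂_φ is the frozen motion-diffusion model's noise prediction under condition y, and w(t) is a timestep weighting.

```latex
% Generic score distillation sampling (SDS) gradient, shown for reference.
\[
  \nabla_{\theta}\,\mathcal{L}_{\mathrm{SDS}}(\theta)
  \;=\; \mathbb{E}_{t,\epsilon}\!\left[
    w(t)\,\bigl(\hat{\epsilon}_{\phi}(x_t;\, y, t) - \epsilon\bigr)\,
    \frac{\partial x}{\partial \theta}
  \right],
  \qquad
  x = g(\theta), \quad x_t = \alpha_t x + \sigma_t \epsilon .
\]
```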
[211] Merlin L48 Spectrogram Dataset
Aaron Sun, Subhransu Maji, Grant Van Horn
Main category: cs.CV
TL;DR: The paper introduces L48, a real-world fine-grained multi-label dataset for single-positive multi-label learning, highlighting limitations of synthetic benchmarks and showing significant performance differences for existing methods.
Details
Motivation: Existing SPML methods use synthetic datasets created from fully-annotated data, which don't reflect real-world complexities and fine-grained challenges that cause difficult misclassifications.
Method: Created L48 dataset - a fine-grained, real-world multi-label dataset from bird sound recordings with natural single-positive annotations and two extended settings with domain priors for additional negative labels.
Result: Benchmarked existing SPML methods on L48 and observed significant performance differences compared to synthetic datasets, revealing method weaknesses.
Conclusion: There’s a need for more realistic and difficult benchmarks in SPML learning, as current synthetic approaches fail to capture real-world fine-grained complexities.
Abstract: In the single-positive multi-label (SPML) setting, each image in a dataset is labeled with the presence of a single class, while the true presence of other classes remains unknown. The challenge is to narrow the performance gap between this partially-labeled setting and fully-supervised learning, which often requires a significant annotation budget. Prior SPML methods were developed and benchmarked on synthetic datasets created by randomly sampling single positive labels from fully-annotated datasets like Pascal VOC, COCO, NUS-WIDE, and CUB200. However, this synthetic approach does not reflect real-world scenarios and fails to capture the fine-grained complexities that can lead to difficult misclassifications. In this work, we introduce the L48 dataset, a fine-grained, real-world multi-label dataset derived from recordings of bird sounds. L48 provides a natural SPML setting with single-positive annotations on a challenging, fine-grained domain, as well as two extended settings in which domain priors give access to additional negative labels. We benchmark existing SPML methods on L48 and observe significant performance differences compared to synthetic datasets and analyze method weaknesses, underscoring the need for more realistic and difficult benchmarks.
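The most common SPML baseline that such benchmarks stress-test is the "assume negative" loss: the single observed positive is a positive target and every unobserved class is treated as negative. A minimal sketch of that generic baseline (not a method introduced by the paper):

```python
# Generic "assume negative" SPML baseline loss (illustrative).
import torch
import torch.nn.functional as F

def assume_negative_loss(logits: torch.Tensor,
                         positive_idx: torch.Tensor) -> torch.Tensor:
    # logits: (batch, num_classes); positive_idx: (batch,) observed class ids.
    targets = torch.zeros_like(logits)
    targets[torch.arange(logits.size(0)), positive_idx] = 1.0
    # Every unobserved class is (possibly incorrectly) treated as a negative.
    return F.binary_cross_entropy_with_logits(logits, targets)

logits = torch.randn(8, 48, requires_grad=True)
pos = torch.randint(0, 48, (8,))
print(assume_negative_loss(logits, pos).item())
```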
[212] BeetleFlow: An Integrative Deep Learning Pipeline for Beetle Image Processing
Fangxun Liu, S M Rayeed, Samuel Stevens, Alyson East, Cheng Hsuan Chiang, Colin Lee, Daniel Yi, Junke Yang, Tejas Naik, Ziyi Wang, Connor Kilrain, Elijah H Buckwalter, Jiacheng Hou, Saul Ibaven Bueno, Shuheng Wang, Xinyue Ma, Yifan Liu, Zhiyuan Tao, Ziheng Zhang, Eric Sokol, Michael Belitz, Sydne Record, Charles V. Stewart, Wei-Lun Chao
Main category: cs.CV
TL;DR: A 3-stage automated pipeline for processing large-scale beetle images: detection using transformer-based methods, sorting/cropping, and morphological segmentation with fine-tuned transformer models.
Details
Motivation: Biologists need to process thousands of beetle tray images efficiently for entomology research, requiring automated solutions for large-scale data processing.
Method: 3-stage pipeline: 1) Detection using transformer-based open-vocabulary object detector and vision-language model, 2) Sorting and cropping individual beetles, 3) Morphological segmentation using fine-tuned transformer models trained on 670 manually labeled beetle images.
Result: Developed a specialized pipeline that achieves relatively high accuracy for fine-grained beetle segmentation and can process large-scale beetle data efficiently.
Conclusion: The integrated deep learning pipeline significantly improves efficiency in processing beetle images and accelerates biological research by automating large-scale data analysis.
Abstract: In entomology and ecology research, biologists often need to collect a large number of insects, among which beetles are the most common species. A common practice for biologists to organize beetles is to place them on trays and take a picture of each tray. Given the images of thousands of such trays, it is important to have an automated pipeline to process the large-scale data for further research. Therefore, we develop a 3-stage pipeline to detect all the beetles on each tray, sort and crop the image of each beetle, and do morphological segmentation on the cropped beetles. For detection, we design an iterative process utilizing a transformer-based open-vocabulary object detector and a vision-language model. For segmentation, we manually labeled 670 beetle images and fine-tuned two variants of a transformer-based segmentation model to achieve fine-grained segmentation of beetles with relatively high accuracy. The pipeline integrates multiple deep learning methods and is specialized for beetle image processing, which can greatly improve the efficiency to process large-scale beetle data and accelerate biological research.
[213] MambaNetLK: Enhancing Colonoscopy Point Cloud Registration with Mamba
Linzhe Jiang, Jiayuan Huang, Sophia Bano, Matthew J. Clarkson, Zhehua Mao, Mobarak I. Hoque
Main category: cs.CV
TL;DR: MambaNetLK is a novel correspondence-free 3D registration framework that integrates Mamba State Space Model with PointNetLK architecture, achieving superior performance on clinical endoscopic data with 56.04% reduction in rotation error and 26.19% reduction in translation error compared to state-of-the-art methods.
Details
Motivation: Accurate 3D point cloud registration is critical for reliable image-guided colonoscopy, but faces challenges from repetitive textures, homogeneous geometry, and domain shifts between pre-operative and intra-operative data that cause feature degeneracy and alignment instability.
Method: Proposed MambaNetLK framework enhances PointNetLK by integrating Mamba State Space Model as cross-modal feature extractor to capture long-range dependencies with linear-time complexity, using Lucas-Kanade algorithm for iterative alignment. Also introduced C3VD-Raycasting-10k dataset with 10,014 geometrically aligned point cloud pairs from clinical CT data.
Result: On C3VD-Raycasting-10k clinical dataset, MambaNetLK achieved best performance: 56.04% reduction in median rotation error and 26.19% reduction in RMSE translation error over second-best method. Also demonstrated strong generalization on ModelNet40 and superior robustness to initial pose perturbations.
Conclusion: MambaNetLK provides robust foundation for 3D registration in surgical navigation, enabling more accurate and reliable guidance systems in minimally invasive procedures through globally expressive SSM-based feature extraction and large-scale clinical dataset.
Abstract: Accurate 3D point cloud registration underpins reliable image-guided colonoscopy, directly affecting lesion localization, margin assessment, and navigation safety. However, biological tissue exhibits repetitive textures and locally homogeneous geometry that cause feature degeneracy, while substantial domain shifts between pre-operative anatomy and intra-operative observations further degrade alignment stability. To address these clinically critical challenges, we introduce a novel 3D registration method tailored for endoscopic navigation and a high-quality, clinically grounded dataset to support rigorous and reproducible benchmarking. We introduce C3VD-Raycasting-10k, a large-scale benchmark dataset with 10,014 geometrically aligned point cloud pairs derived from clinical CT data. We propose MambaNetLK, a novel correspondence-free registration framework, which enhances the PointNetLK architecture by integrating a Mamba State Space Model (SSM) as a cross-modal feature extractor. As a result, the proposed framework efficiently captures long-range dependencies with linear-time complexity. The alignment is achieved iteratively using the Lucas-Kanade algorithm. On the clinical dataset, C3VD-Raycasting-10k, MambaNetLK achieves the best performance compared with the state-of-the-art methods, reducing median rotation error by 56.04% and RMSE translation error by 26.19% over the second-best method. The model also demonstrates strong generalization on ModelNet40 and superior robustness to initial pose perturbations. MambaNetLK provides a robust foundation for 3D registration in surgical navigation. The combination of a globally expressive SSM-based feature extractor and a large-scale clinical dataset enables more accurate and reliable guidance systems in minimally invasive procedures like colonoscopy.
[214] Spot The Ball: A Benchmark for Visual Social Inference
Neha Balamurugan, Sarah Wu, Adam Chun, Gabe Gaw, Cristobal Eyzaguirre, Tobias Gerstenberg
Main category: cs.CV
TL;DR: A benchmark called ‘Spot The Ball’ evaluates visual social inference in vision-language models by having them locate removed sports balls from images, revealing humans significantly outperform AI models.
Details
Motivation: To assess visual social inference capabilities in AI by testing if models can infer hidden scene elements from behavioral cues like gaze and pose, which humans excel at.
Method: Created a benchmark using soccer, basketball, and volleyball images with balls removed, evaluated four state-of-the-art VLMs with three prompting strategies, and compared against human baselines.
Result: Humans consistently outperformed all models (20-34% accuracy vs ≤17% for models), with models relying on superficial spatial heuristics while humans used social cues like gaze and body pose.
Conclusion: There’s a persistent gap in visual social reasoning between humans and AI, highlighting the need for architectures that explicitly encode structured behavioral cues for human-like inference.
Abstract: Humans excel at visual social inference, the ability to infer hidden elements of a scene from subtle behavioral cues such as other people’s gaze, pose, and orientation. This ability drives everyday social reasoning in humans and is critical for developing more human-like AI agents. We introduce Spot The Ball, a challenging benchmark for evaluating visual social inference in vision-language models (VLMs) using sports as a test domain. The task is to localize a removed sports ball from soccer, basketball, and volleyball images. We present a curated evaluation set with human baselines and a scalable pipeline for generating additional test items. We evaluate four state-of-the-art VLMs (Gemini, GPT, LLaMA, Qwen) using three prompting strategies, finding that humans are consistently two to three times more accurate (20-34%) than models (≤17%) across all sports. Our analyses show that models rely on superficial spatial heuristics, such as guessing near the image center or nearby players, while humans leverage social cues like gaze direction and body pose. These findings reveal a persistent human-model gap in visual social reasoning and underscore the need for architectures that explicitly encode structured behavioral cues to achieve robust, human-like inference.
[215] FedReplay: A Feature Replay Assisted Federated Transfer Learning Framework for Efficient and Privacy-Preserving Smart Agriculture
Long Li, Jiajia Li, Dong Chen, Lina Pu, Haibo Yao, Yanbo Huang
Main category: cs.CV
TL;DR: A federated learning framework using frozen CLIP ViT with lightweight transformer classifier achieves 86.6% accuracy in agricultural classification, outperforming baseline FL methods by 4x while reducing communication costs and handling non-IID data.
Details
Motivation: To address privacy concerns in centralized training and overcome limitations of standard federated learning with non-IID data and high communication costs in smart agriculture applications.
Method: Integrates frozen CLIP vision transformer for feature extraction with lightweight transformer classifier, shares 1% of CLIP-extracted features across clients to mitigate non-IID issues while preserving privacy.
Result: Achieves 86.6% accuracy on agricultural classification tasks, which is more than 4 times higher compared to baseline federated learning approaches.
Conclusion: The combination of vision-language model features with federated learning provides an effective and efficient solution for privacy-preserving and scalable agricultural intelligence.
Abstract: Accurate classification plays a pivotal role in smart agriculture, enabling applications such as crop monitoring, fruit recognition, and pest detection. However, conventional centralized training often requires large-scale data collection, which raises privacy concerns, while standard federated learning struggles with non-independent and identically distributed (non-IID) data and incurs high communication costs. To address these challenges, we propose a federated learning framework that integrates a frozen Contrastive Language-Image Pre-training (CLIP) vision transformer (ViT) with a lightweight transformer classifier. By leveraging the strong feature extraction capability of the pre-trained CLIP ViT, the framework avoids training large-scale models from scratch and restricts federated updates to a compact classifier, thereby reducing transmission overhead significantly. Furthermore, to mitigate performance degradation caused by non-IID data distribution, a small subset (1%) of CLIP-extracted feature representations from all classes is shared across clients. These shared features are non-reversible to raw images, ensuring privacy preservation while aligning class representation across participants. Experimental results on agricultural classification tasks show that the proposed method achieves 86.6% accuracy, which is more than four times that of baseline federated learning approaches. This demonstrates the effectiveness and efficiency of combining vision-language model features with federated learning for privacy-preserving and scalable agricultural intelligence.
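A toy version of the described loop, with random vectors standing in for frozen CLIP ViT features: each client fine-tunes a lightweight classifier on its own features plus the small shared replay set, and the server averages the resulting weights. The classifier size, optimizer, and round structure are illustrative assumptions, not the paper's configuration.

```python
# Toy federated loop with a frozen feature extractor and a shared feature-replay
# set (illustrative sketch; random tensors stand in for CLIP embeddings).
import copy
import torch
import torch.nn as nn

def client_update(global_model, feats, labels, replay_feats, replay_labels,
                  epochs=1, lr=1e-3):
    model = copy.deepcopy(global_model)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    x = torch.cat([feats, replay_feats])          # local data + shared replay features
    y = torch.cat([labels, replay_labels])
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    return model.state_dict()

def fed_avg(state_dicts):
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
    return avg

dim, classes, clients = 512, 10, 4
global_model = nn.Linear(dim, classes)            # lightweight classifier head
replay_x, replay_y = torch.randn(50, dim), torch.randint(0, classes, (50,))
for round_ in range(3):
    states = []
    for _ in range(clients):
        x, y = torch.randn(200, dim), torch.randint(0, classes, (200,))
        states.append(client_update(global_model, x, y, replay_x, replay_y))
    global_model.load_state_dict(fed_avg(states))
print("finished", round_ + 1, "rounds")
```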
[216] Multi-View Consistent Human Image Customization via In-Context Learning
Hengjia Li, Jianjin Xu, Keli Cheng, Lei Wang, Ning Bi, Boxi Wu, Fernando De la Torre, Deng Cai
Main category: cs.CV
TL;DR: PersonalView is a lightweight adaptation method that enables existing models to generate consistent multi-view images of a person with only 100 training samples, outperforming baselines trained on large multi-view datasets.
Details
Motivation: Most personalized generative models cannot control viewpoint or generate consistent multiple views of the same person, limiting their practical applications.
Method: Uses a conditioning architecture leveraging pre-trained diffusion transformer’s in-context learning ability, and preserves original generative capabilities with Semantic Correspondence Alignment Loss.
Result: Significantly outperforms baselines trained on large multi-view datasets across multi-view consistency, text alignment, identity similarity, and visual quality metrics.
Conclusion: PersonalView provides an efficient solution for multi-view generation with minimal training data, demonstrating strong performance compared to methods requiring extensive training corpora.
Abstract: Recent advances in personalized generative models demonstrate impressive results in creating identity-consistent images of the same person under diverse settings. Yet, we note that most methods cannot control the viewpoint of the generated image, nor generate consistent multiple views of the person. To address this problem, we propose a lightweight adaptation method, PersonalView, capable of enabling an existing model to acquire multi-view generation capability with as few as 100 training samples. PersonalView consists of two key components: First, we design a conditioning architecture to take advantage of the in-context learning ability of the pre-trained diffusion transformer. Second, we preserve the original generative ability of the pretrained model with a new Semantic Correspondence Alignment Loss. We evaluate the multi-view consistency, text alignment, identity similarity, and visual quality of PersonalView and compare it to recent baselines with potential capability of multi-view customization. PersonalView significantly outperforms baselines trained on a large corpus of multi-view data with only 100 training samples.
[217] Towards Automated Petrography
Isai Daniel Chacón, Paola Ruiz Puentes, Jillian Pearse, Pablo Arbeláez
Main category: cs.CV
TL;DR: LITHOS is the largest public dataset for automated petrography, containing 211,604 polarized light patches and 105,802 expert-annotated mineral grains across 25 categories, with a dual-encoder transformer model that outperforms single-polarization approaches.
Details
Motivation: Petrography is labor-intensive and requires expert visual examination, limiting scalability. There's a need for automated techniques to analyze mineral composition from thin section samples.
Method: Created LITHOS dataset with high-resolution RGB patches and expert annotations. Proposed a dual-encoder transformer architecture that integrates both polarization modalities for mineral classification.
Result: The dual-encoder transformer consistently outperforms single-polarization models, demonstrating the value of polarization synergy in mineral classification.
Conclusion: LITHOS provides a comprehensive benchmark for automated petrographic analysis, and the proposed method shows improved performance by leveraging multiple polarization modalities.
Abstract: Petrography is a branch of geology that analyzes the mineralogical composition of rocks from microscopical thin section samples. It is essential for understanding rock properties across geology, archaeology, engineering, mineral exploration, and the oil industry. However, petrography is a labor-intensive task requiring experts to conduct detailed visual examinations of thin section samples through optical polarization microscopes, thus hampering scalability and highlighting the need for automated techniques. To address this challenge, we introduce the Large-scale Imaging and Thin section Optical-polarization Set (LITHOS), the largest and most diverse publicly available experimental framework for automated petrography. LITHOS includes 211,604 high-resolution RGB patches of polarized light and 105,802 expert-annotated grains across 25 mineral categories. Each annotation consists of the mineral class, spatial coordinates, and expert-defined major and minor axes represented as intersecting vector paths, capturing grain geometry and orientation. We evaluate multiple deep learning techniques for mineral classification in LITHOS and propose a dual-encoder transformer architecture that integrates both polarization modalities as a strong baseline for future reference. Our method consistently outperforms single-polarization models, demonstrating the value of polarization synergy in mineral classification. We have made the LITHOS Benchmark publicly available, comprising our dataset, code, and pretrained models, to foster reproducibility and further research in automated petrographic analysis.
[218] Beyond ImageNet: Understanding Cross-Dataset Robustness of Lightweight Vision Models
Weidong Zhang, Pak Lun Kevin Ding, Huan Liu
Main category: cs.CV
TL;DR: Systematic evaluation of 11 lightweight vision models across 7 datasets reveals ImageNet accuracy doesn’t predict cross-domain performance, introduces xScore metric for robustness assessment, and identifies architectural elements that drive generalization.
Details
Motivation: Lightweight vision models are deployed on mobile devices but primarily benchmarked on ImageNet, raising questions about their cross-domain generalization and how to systematically quantify robustness across diverse datasets.
Method: Evaluated 11 lightweight vision models (2.5M parameters) trained under fixed 100-epoch schedule across 7 diverse datasets, introduced Cross-Dataset Score (xScore) metric to quantify performance consistency and robustness.
Result: ImageNet accuracy doesn’t reliably predict performance on fine-grained or medical datasets; xScore provides scalable predictor of mobile model performance; isotropic convolutions with higher spatial resolution and channel-wise attention promote generalization, while Transformer blocks offer little benefit despite higher parameter cost.
Conclusion: Provides reproducible framework for evaluating lightweight vision models beyond ImageNet, highlights key design principles for mobile-friendly architectures, and guides development of models that generalize robustly across diverse domains.
Abstract: Lightweight vision classification models such as MobileNet, ShuffleNet, and EfficientNet are increasingly deployed in mobile and embedded systems, yet their performance has been predominantly benchmarked on ImageNet. This raises critical questions: Do models that excel on ImageNet also generalize across other domains? How can cross-dataset robustness be systematically quantified? And which architectural elements consistently drive generalization under tight resource constraints? Here, we present the first systematic evaluation of 11 lightweight vision models (2.5M parameters), trained under a fixed 100-epoch schedule across 7 diverse datasets. We introduce the Cross-Dataset Score (xScore), a unified metric that quantifies the consistency and robustness of model performance across diverse visual domains. Our results show that (1) ImageNet accuracy does not reliably predict performance on fine-grained or medical datasets, (2) xScore provides a scalable predictor of mobile model performance that can be estimated from just four datasets, and (3) certain architectural components–such as isotropic convolutions with higher spatial resolution and channel-wise attention–promote broader generalization, while Transformer-based blocks yield little additional benefit, despite incurring higher parameter overhead. This study provides a reproducible framework for evaluating lightweight vision models beyond ImageNet, highlights key design principles for mobile-friendly architectures, and guides the development of future models that generalize robustly across diverse application domains.
[219] A DeepONet joint Neural Tangent Kernel Hybrid Framework for Physics-Informed Inverse Source Problems and Robust Image Reconstruction
Yuhao Fang, Zijian Wang, Yao Lu, Ye Zhang, Chun Li
Main category: cs.CV
TL;DR: A hybrid DeepONet-NTK framework for solving complex inverse problems like source localization and image reconstruction, handling nonlinearity, sparsity, and noise through physics-informed constraints.
Details
Motivation: To address challenges in solving complex inverse problems governed by physical laws, particularly dealing with nonlinearity, sparse data, and noise in applications like source localization and image reconstruction.
Method: Integrates Deep Operator Networks (DeepONet) with Neural Tangent Kernel (NTK), incorporating physics-informed constraints and task-specific regularization into the loss function to ensure physical consistency.
Result: Validated on diverse synthetic and real datasets, demonstrating robustness, scalability, and precision in solving inverse problems.
Conclusion: The framework shows broad potential applications in computational physics and imaging sciences for solving complex inverse problems with physical constraints.
Abstract: This work presents a novel hybrid approach that integrates Deep Operator Networks (DeepONet) with the Neural Tangent Kernel (NTK) to solve complex inverse problems. The method effectively addresses tasks such as source localization governed by the Navier-Stokes equations and image reconstruction, overcoming challenges related to nonlinearity, sparsity, and noisy data. By incorporating physics-informed constraints and task-specific regularization into the loss function, the framework ensures solutions that are both physically consistent and accurate. Validation on diverse synthetic and real datasets demonstrates its robustness, scalability, and precision, showcasing its broad potential applications in computational physics and imaging sciences.
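The physics-informed part of such a loss is easy to sketch in isolation: a data-fit term plus a PDE-residual term computed with automatic differentiation. The example below uses a toy 1D Poisson equation u''(x) = f(x) rather than the Navier-Stokes setting of the paper, purely to show the mechanism; the network and weighting are assumptions.

```python
# Hedged sketch of a physics-informed composite loss: data fit + PDE residual,
# demonstrated on a toy 1D Poisson problem u''(x) = f(x).
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
f = lambda x: -torch.sin(x)              # forcing term; exact solution is u = sin(x)

def physics_informed_loss(x_data, u_data, x_colloc):
    data_loss = ((net(x_data) - u_data) ** 2).mean()
    x = x_colloc.requires_grad_(True)
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    pde_loss = ((d2u - f(x)) ** 2).mean()  # residual of u'' - f = 0 at collocation points
    return data_loss + pde_loss

x_obs = torch.linspace(0, 3.14, 10).unsqueeze(1)
loss = physics_informed_loss(x_obs, torch.sin(x_obs), torch.rand(64, 1) * 3.14)
loss.backward()
print(float(loss))
```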
[220] Federated Dialogue-Semantic Diffusion for Emotion Recognition under Incomplete Modalities
Xihang Qiu, Jiarong Cheng, Yuhao Fang, Wanpeng Zhang, Yao Lu, Ye Zhang, Chun Li
Main category: cs.CV
TL;DR: FedDISC is a federated learning framework for multimodal emotion recognition that addresses modality absence by using diffusion models and semantic consistency mechanisms.
Details
Motivation: Real-world multimodal emotion recognition suffers from unpredictable modality absence, which degrades performance. Existing methods relying on complete multimodal data fail under extreme data distributions like fixed-modality absence.
Method: Proposes FedDISC framework integrating federated learning with diffusion models. Uses modality-specific diffusion models trained on clients and broadcast to clients missing modalities. Includes DISC-Diffusion module with Dialogue Graph Network and Semantic Conditioning Network for consistency. Implements Alternating Frozen Aggregation strategy.
Result: Extensive experiments on IEMOCAP, CMUMOSI, and CMUMOSEI datasets show superior emotion classification performance across diverse missing modality patterns, outperforming existing approaches.
Conclusion: FedDISC effectively addresses modality absence in multimodal emotion recognition through federated learning and semantic-consistent diffusion, achieving robust performance in real-world scenarios.
Abstract: Multimodal Emotion Recognition in Conversations (MERC) enhances emotional understanding through the fusion of multimodal signals. However, unpredictable modality absence in real-world scenarios significantly degrades the performance of existing methods. Conventional missing-modality recovery approaches, which depend on training with complete multimodal data, often suffer from semantic distortion under extreme data distributions, such as fixed-modality absence. To address this, we propose the Federated Dialogue-guided and Semantic-Consistent Diffusion (FedDISC) framework, pioneering the integration of federated learning into missing-modality recovery. By federated aggregation of modality-specific diffusion models trained on clients and broadcasting them to clients missing corresponding modalities, FedDISC overcomes single-client reliance on modality completeness. Additionally, the DISC-Diffusion module ensures consistency in context, speaker identity, and semantics between recovered and available modalities, using a Dialogue Graph Network to capture conversational dependencies and a Semantic Conditioning Network to enforce semantic alignment. We further introduce a novel Alternating Frozen Aggregation strategy, which cyclically freezes recovery and classifier modules to facilitate collaborative optimization. Extensive experiments on the IEMOCAP, CMUMOSI, and CMUMOSEI datasets demonstrate that FedDISC achieves superior emotion classification performance across diverse missing modality patterns, outperforming existing approaches.
[221] OSMGen: Highly Controllable Satellite Image Synthesis using OpenStreetMap Data
Amir Ziashahabi, Narges Ghasemi, Sajjad Shahabi, John Krumm, Salman Avestimehr, Cyrus Shahabi
Main category: cs.CV
TL;DR: OSMGen is a generative framework that creates realistic satellite imagery from OpenStreetMap (OSM) JSON data, enabling generation of consistent before-after image pairs for urban monitoring and training data.
Details
Motivation: Automating urban monitoring is challenging due to scarce curated datasets of specific urban features and their changes. There's a need for accurate geospatial data for urban planning, infrastructure monitoring, and environmental management.
Method: OSMGen uses raw OpenStreetMap JSON data including vector geometries, semantic tags, location, and time to generate satellite imagery. It enables user edits to OSM inputs to create targeted visual changes while preserving the rest of the scene.
Result: The framework produces realistic satellite imagery directly from OSM data and can generate consistent before-after image pairs. It addresses data scarcity and class imbalance for training, and allows planners to preview proposed interventions.
Conclusion: OSMGen creates paired (JSON, image) data for both static and changed states, paving the way toward a closed-loop system where satellite imagery can automatically drive structured OSM updates.
Abstract: Accurate and up-to-date geospatial data are essential for urban planning, infrastructure monitoring, and environmental management. Yet, automating urban monitoring remains difficult because curated datasets of specific urban features and their changes are scarce. We introduce OSMGen, a generative framework that creates realistic satellite imagery directly from raw OpenStreetMap (OSM) data. Unlike prior work that relies on raster tiles, OSMGen uses the full richness of OSM JSON, including vector geometries, semantic tags, location, and time, giving fine-grained control over how scenes are generated. A central feature of the framework is the ability to produce consistent before-after image pairs: user edits to OSM inputs translate into targeted visual changes, while the rest of the scene is preserved. This makes it possible to generate training data that addresses scarcity and class imbalance, and to give planners a simple way to preview proposed interventions by editing map data. More broadly, OSMGen produces paired (JSON, image) data for both static and changed states, paving the way toward a closed-loop system where satellite imagery can automatically drive structured OSM updates. Source code is available at https://github.com/amir-zsh/OSMGen.
[222] Detecting AI-Generated Images via Diffusion Snap-Back Reconstruction: A Forensic Approach
Mohd Ruhul Ameen, Akif Islam
Main category: cs.CV
TL;DR: A diffusion-based forensic framework using multi-strength image reconstruction dynamics (diffusion snap-back) to detect AI-generated images from systems like Stable Diffusion and DALL-E.
Details
Motivation: Traditional deepfake detection methods fail against modern text-to-image systems that produce photorealistic, artifact-free results, making it challenging to distinguish authentic from synthetic visual content.
Method: Analyzes how reconstruction metrics (LPIPS, SSIM, PSNR) evolve across varying noise strengths to extract interpretable manifold-based features that differentiate real and synthetic images.
Result: Achieves 0.993 AUROC under cross-validation on a balanced dataset of 4,000 images, remaining robust to common distortions like compression and noise.
Conclusion: The method demonstrates strong generalization and interpretability despite limited data and a single diffusion backbone, offering a foundation for scalable, model-agnostic synthetic media forensics.
Abstract: The rapid rise of generative diffusion models has made distinguishing authentic visual content from synthetic imagery increasingly challenging. Traditional deepfake detection methods, which rely on frequency or pixel-level artifacts, fail against modern text-to-image systems such as Stable Diffusion and DALL-E that produce photorealistic and artifact-free results. This paper introduces a diffusion-based forensic framework that leverages multi-strength image reconstruction dynamics, termed diffusion snap-back, to identify AI-generated images. By analysing how reconstruction metrics (LPIPS, SSIM, and PSNR) evolve across varying noise strengths, we extract interpretable manifold-based features that differentiate real and synthetic images. Evaluated on a balanced dataset of 4,000 images, our approach achieves 0.993 AUROC under cross-validation and remains robust to common distortions such as compression and noise. Despite using limited data and a single diffusion backbone (Stable Diffusion v1.5), the proposed method demonstrates strong generalization and interpretability, offering a foundation for scalable, model-agnostic synthetic media forensics.
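The feature-extraction step can be sketched as follows: reconstruct the image at several noise strengths and record how similarity metrics evolve across the sweep. In this hedged sketch the `reconstruct` callable is a stub standing in for a diffusion img2img pass; SSIM and PSNR come from scikit-image, and LPIPS (omitted here) would require the separate lpips package.

```python
# Sketch of "snap-back" curve features across noise strengths (illustrative).
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def snapback_features(image: np.ndarray, reconstruct,
                      strengths=(0.1, 0.3, 0.5, 0.7, 0.9)) -> np.ndarray:
    feats = []
    for s in strengths:
        recon = reconstruct(image, strength=s)   # diffusion img2img at strength s
        feats.append(structural_similarity(image, recon, channel_axis=-1,
                                           data_range=1.0))
        feats.append(peak_signal_noise_ratio(image, recon, data_range=1.0))
    return np.asarray(feats)                     # feed to a small downstream classifier

# Stub for demonstration only; a real pipeline would call a diffusion model here.
fake_reconstruct = lambda img, strength: np.clip(
    img + strength * 0.05 * np.random.rand(*img.shape), 0.0, 1.0)
img = np.random.rand(64, 64, 3)
print(snapback_features(img, fake_reconstruct).shape)   # (10,)
```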
[223] Transfer Learning for Onboard Cloud Segmentation in Thermal Earth Observation: From Landsat to a CubeSat Constellation
Niklas Wölki, Lukas Kondmann, Christian Mollière, Martin Langer, Julia Gottfriedsen, Martin Werner
Main category: cs.CV
TL;DR: Transfer learning with UNet and MobileNet encoder enables efficient thermal-only cloud segmentation for CubeSats, achieving 0.877 F1 score and under 5-second inference on Jetson Nano hardware.
Details
Motivation: CubeSat missions face challenges in cloud segmentation due to limited hardware, single thermal band data, and insufficient labeled data, making conventional cloud masking techniques infeasible.
Method: Used transfer learning with UNet architecture and lightweight MobileNet encoder, pretrained on Landsat-7 Cloud Cover Assessment Dataset and fine-tuned with mission-specific samples in joint-training setup.
Result: Improved macro F1 from 0.850 to 0.877 over FOREST-2-only baselines, with full-image inference in under 5 seconds on NVIDIA Jetson Nano using TensorRT engine.
Conclusion: Public datasets and lightweight architectures can enable accurate, efficient thermal-only cloud masking on-orbit, supporting real-time decision-making in data-limited Earth observation missions.
Abstract: Onboard cloud segmentation is a critical yet underexplored task in thermal Earth observation (EO), particularly for CubeSat missions constrained by limited hardware and spectral information. CubeSats often rely on a single thermal band and lack sufficient labeled data, making conventional cloud masking techniques infeasible. This work addresses these challenges by applying transfer learning to thermal cloud segmentation for the FOREST-2 CubeSat, using a UNet with a lightweight MobileNet encoder. We pretrain the model on the public Landsat-7 Cloud Cover Assessment Dataset and fine-tune it with a small set of mission-specific samples in a joint-training setup, improving the macro F1 from 0.850 to 0.877 over FOREST-2-only baselines. We convert the model to a TensorRT engine and demonstrate full-image inference in under 5 seconds on an NVIDIA Jetson Nano. These results show that leveraging public datasets and lightweight architectures can enable accurate, efficient thermal-only cloud masking on-orbit, supporting real-time decision-making in data-limited EO missions.
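As a rough illustration of the model described above, the following sketch instantiates a UNet with a MobileNet encoder for single-band thermal input using the segmentation-models-pytorch library; the exact encoder variant, tile size, and training setup are assumptions, not details taken from the paper.

```python
# Minimal sketch: UNet with a lightweight MobileNet encoder for 1-band thermal tiles.
import torch
import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="mobilenet_v2",   # lightweight encoder (assumed variant)
    encoder_weights="imagenet",    # initialization before Landsat-7 pretraining
    in_channels=1,                 # single thermal band
    classes=2,                     # cloud vs. clear
)

x = torch.randn(1, 1, 256, 256)    # one thermal tile (size assumed)
logits = model(x)                  # (1, 2, 256, 256) per-pixel class scores
```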
[224] Oitijjo-3D: Generative AI Framework for Rapid 3D Heritage Reconstruction from Street View Imagery
Momen Khandoker Ope, Akif Islam, Mohd Ruhul Ameen, Abu Saleh Musa Miah, Md Rashedul Islam, Jungpil Shin
Main category: cs.CV
TL;DR: Oitijjo-3D is a free generative AI framework that uses Google Street View imagery to create 3D models of cultural heritage sites in Bangladesh, overcoming resource and expertise limitations.
Details
Motivation: To address the challenges of cultural heritage restoration in developing countries like Bangladesh, where traditional 3D digitization methods are too expensive and require specialized expertise.Method: Two-stage pipeline: multimodal visual reasoning with Gemini 2.5 Flash Image for structure-texture synthesis, and neural image-to-3D generation through Hexagen for geometry recovery using Google Street View imagery.
Result: Produces photorealistic, metrically coherent 3D reconstructions in seconds with significant speedups compared to conventional methods, while preserving visual and structural fidelity.
Conclusion: The framework democratizes cultural preservation by turning open imagery into digital heritage, making it a community-driven, AI-assisted solution for resource-limited nations.
Abstract: Cultural heritage restoration in Bangladesh faces a dual challenge of limited resources and scarce technical expertise. Traditional 3D digitization methods, such as photogrammetry or LiDAR scanning, require expensive hardware, expert operators, and extensive on-site access, which are often infeasible in developing contexts. As a result, many of Bangladesh’s architectural treasures, from the Paharpur Buddhist Monastery to Ahsan Manzil, remain vulnerable to decay and inaccessible in digital form. This paper introduces Oitijjo-3D, a cost-free generative AI framework that democratizes 3D cultural preservation. By using publicly available Google Street View imagery, Oitijjo-3D reconstructs faithful 3D models of heritage structures through a two-stage pipeline - multimodal visual reasoning with Gemini 2.5 Flash Image for structure-texture synthesis, and neural image-to-3D generation through Hexagen for geometry recovery. The system produces photorealistic, metrically coherent reconstructions in seconds, achieving significant speedups compared to conventional Structure-from-Motion pipelines, without requiring any specialized hardware or expert supervision. Experiments on landmarks such as Ahsan Manzil, Choto Sona Mosque, and Paharpur demonstrate that Oitijjo-3D preserves both visual and structural fidelity while drastically lowering economic and technical barriers. By turning open imagery into digital heritage, this work reframes preservation as a community-driven, AI-assisted act of cultural continuity for resource-limited nations.
[225] Who Can We Trust? Scope-Aware Video Moment Retrieval with Multi-Agent Conflict
Chaochen Wu, Guan Luo, Meiyun Zuo, Zhitao Fan
Main category: cs.CV
TL;DR: A reinforcement learning-based video moment retrieval model that uses multi-agent systems with evidential learning to resolve conflicts between agents’ localization outputs and detect out-of-scope queries.
Details
Motivation: Current video moment retrieval methods don't handle conflicts between different models' location results, preventing proper integration of multiple models to improve performance.Method: Proposed a reinforcement learning model that scans videos once to find moment boundaries with locational evidence, and a multi-agent framework using evidential learning to resolve conflicts between agents’ outputs.
Result: Extensive experiments on benchmark datasets show effectiveness compared to state-of-the-art approaches, with the ability to detect out-of-scope queries without additional training.
Conclusion: Modeling competition and conflict in multi-agent systems effectively improves RL performance in moment retrieval, and evidential learning plays a new role in multi-agent frameworks.
Abstract: Video moment retrieval uses a text query to locate a moment in a given untrimmed reference video. Locating the corresponding video moments for text queries helps people interact with videos efficiently. Current solutions for this task do not account for conflicts among the localization results of different models, so multiple models cannot be integrated properly to produce better results. This study introduces a reinforcement learning-based video moment retrieval model that scans the whole video once to find the moment’s boundary while producing locational evidence for it. Moreover, we propose a multi-agent framework that uses evidential learning to resolve conflicts between agents’ localization outputs. As a by-product of observing and handling conflicts between agents, we can decide whether a query has no corresponding moment in a video (out-of-scope) without additional training, which suits real-world applications. Extensive experiments on benchmark datasets show the effectiveness of our proposed methods compared with state-of-the-art approaches. Furthermore, the results of our study reveal that modeling competition and conflict within the multi-agent system is an effective way to improve RL performance in moment retrieval, and they highlight a new role for evidential learning in multi-agent frameworks.
[226] VisionCAD: An Integration-Free Radiology Copilot Framework
Jiaming Li, Junlei Wu, Sheng Wang, Honglin Xiong, Jiangdong Cai, Zihao Zhao, Yitao Zhu, Yuan Yin, Dinggang Shen, Qian Wang
Main category: cs.CV
TL;DR: VisionCAD is a vision-based radiological assistance framework that captures medical images from displays using cameras, enabling AI-assisted diagnosis without modifying existing hospital IT infrastructure.
Details
Motivation: To overcome the barrier of integrating computer-aided diagnosis systems with existing hospital IT infrastructure, which hinders widespread clinical deployment.Method: Uses a camera system to capture medical images from displays, then processes them through an automated pipeline that detects, restores, and analyzes on-screen medical images to transform camera-captured data into diagnostic-quality images suitable for automated analysis and report generation.
Result: Achieves diagnostic performance comparable to conventional CAD systems with F1-score degradation typically less than 2% across classification tasks, and natural language generation metrics for automated reports remain within 1% of those derived from original images.
Conclusion: VisionCAD offers an accessible approach for AI-assisted diagnosis that can be deployed in diverse clinical settings using only a camera device and standard computing resources, without requiring modifications to existing infrastructure.
Abstract: Widespread clinical deployment of computer-aided diagnosis (CAD) systems is hindered by the challenge of integrating with existing hospital IT infrastructure. Here, we introduce VisionCAD, a vision-based radiological assistance framework that circumvents this barrier by capturing medical images directly from displays using a camera system. The framework operates through an automated pipeline that detects, restores, and analyzes on-screen medical images, transforming camera-captured visual data into diagnostic-quality images suitable for automated analysis and report generation. We validated VisionCAD across diverse medical imaging datasets, demonstrating that our modular architecture can flexibly utilize state-of-the-art diagnostic models for specific tasks. The system achieves diagnostic performance comparable to conventional CAD systems operating on original digital images, with an F1-score degradation typically less than 2% across classification tasks, while natural language generation metrics for automated reports remain within 1% of those derived from original images. By requiring only a camera device and standard computing resources, VisionCAD offers an accessible approach for AI-assisted diagnosis, enabling the deployment of diagnostic capabilities in diverse clinical settings without modifications to existing infrastructure.
[227] Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models: Benchmark, Datasets, and Beyond
Fan Zhang, Haoxuan Li, Shengju Qian, Xin Wang, Zheng Lian, Hao Wu, Zhihong Zhu, Yuan Gao, Qiankun Li, Yefeng Zheng, Zhouchen Lin, Pheng-Ann Heng
Main category: cs.CV
TL;DR: FERBench benchmark reveals MLLMs’ limitations in facial expression reasoning, leading to development of UniFER-7B model with post-training strategies that outperforms major MLLMs.
Details
Motivation: To address the unexplored performance of cutting-edge Multimodal Large Language Models (MLLMs) on facial expression recognition (FER) tasks and their limitations in reasoning and interpretability.Method: Created FERBench benchmark with 20 state-of-the-art MLLMs across 4 FER datasets, then developed post-training strategies using two curated datasets (UniFER-CoT-230K for cold-start initialization and UniFER-RLVR-360K for RLVR) to build UniFER-7B foundation model.
Result: MLLMs show good classification performance but significant limitations in reasoning and interpretability. UniFER-7B outperforms many open-sourced and closed-source generalist MLLMs including Gemini-2.5-Pro and Qwen2.5-VL-72B.
Conclusion: Post-training strategies effectively enhance MLLMs’ facial expression reasoning capabilities, and the unified UniFER-7B model demonstrates superior performance in FER tasks compared to existing generalist MLLMs.
Abstract: Multimodal Large Language Models (MLLMs) have revolutionized numerous research fields, including computer vision and affective computing. As a pivotal challenge in this interdisciplinary domain, facial expression recognition (FER) has evolved from separate, domain-specific models to more unified approaches. One promising avenue to unify FER tasks is converting conventional FER datasets into visual question-answering (VQA) formats, enabling the direct application of powerful generalist MLLMs for inference. However, despite the success of cutting-edge MLLMs in various tasks, their performance on FER tasks remains largely unexplored. To address this gap, we provide FERBench, a systematic benchmark that incorporates 20 state-of-the-art MLLMs across four widely used FER datasets. Our results reveal that, while MLLMs exhibit good classification performance, they still face significant limitations in reasoning and interpretability. To this end, we introduce post-training strategies aimed at enhancing the facial expression reasoning capabilities of MLLMs. Specifically, we curate two high-quality and large-scale datasets: UniFER-CoT-230K for cold-start initialization and UniFER-RLVR-360K for reinforcement learning with verifiable rewards (RLVR), respectively. Building upon them, we develop a unified and interpretable FER foundation model termed UniFER-7B, which outperforms many open-sourced and closed-source generalist MLLMs (e.g., Gemini-2.5-Pro and Qwen2.5-VL-72B).
[228] VinciCoder: Unifying Multimodal Code Generation via Coarse-to-fine Visual Reinforcement Learning
Xuanle Zhao, Deyang Jiang, Zhixiong Zeng, Lei Chen, Haibo Qiu, Jing Huang, Yufeng Zhong, Liming Zheng, Yilin Cao, Lin Ma
Main category: cs.CV
TL;DR: VinciCoder is a unified multimodal code generation model that uses a two-stage training framework with supervised finetuning on 1.6M image-code pairs and visual reinforcement learning with coarse-to-fine reward mechanism to achieve state-of-the-art performance.
Details
Motivation: Current vision-language models rely on single-task training which limits their generalization for visual code intelligence, creating a need for more unified multimodal code generation approaches.Method: Two-stage training: 1) Supervised Finetuning with 1.6M image-code pairs for direct code generation and visual-based code refinement, 2) Visual Reinforcement Learning (ViRL) with coarse-to-fine reward mechanism calculating visual similarity across local and global image patches.
Result: VinciCoder achieves state-of-the-art performance on various multimodal code generation benchmarks, demonstrating the effectiveness of the coarse-to-fine ViRL strategy.
Conclusion: The proposed VinciCoder model with its two-stage training framework and visual reinforcement learning strategy successfully addresses the limitations of single-task training and advances multimodal code generation capabilities.
Abstract: Multimodal code generation has garnered significant interest within the research community. Despite the notable success of recent vision-language models (VLMs) on specialized tasks like Chart-to-code generation, their reliance on single-task training regimens fosters a narrow paradigm that hinders the development of generalized VIsioN Code Intelligence. In this work, we introduce VinciCoder, a unified multimodal code generation model that addresses this limitation via a two-stage training framework. We begin by constructing a large-scale Supervised Finetuning (SFT) corpus comprising 1.6M image-code pairs for tasks involving direct code generation and visual-based code refinement. Subsequently, we introduce a Visual Reinforcement Learning (ViRL) strategy, which employs a coarse-to-fine reward mechanism to improve visual fidelity by calculating visual similarity across local and global image patches. Extensive experiments on various multimodal code generation benchmarks demonstrate that VinciCoder achieves state-of-the-art performance, underscoring the effectiveness of our coarse-to-fine ViRL strategy. The code and model will be available at https://github.com/DocTron-hub/VinciCoder.
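The coarse-to-fine reward can be pictured as combining a global similarity with patch-level similarities between the image rendered from generated code and the target image. The sketch below is one plausible instantiation under assumed choices (cosine similarity, a 4x4 patch grid, equal weighting); it is not the authors' reward function.

```python
# Illustrative coarse-to-fine visual reward: global plus patch-level similarity.
import torch
import torch.nn.functional as F

def coarse_to_fine_reward(rendered, target, patch=4, w_global=0.5):
    """rendered, target: (C, H, W) images in [0, 1]; H and W divisible by patch."""
    global_sim = F.cosine_similarity(rendered.flatten(), target.flatten(), dim=0)
    C, H, W = target.shape
    hs, ws = H // patch, W // patch
    r = rendered.unfold(1, hs, hs).unfold(2, ws, ws).reshape(C, patch * patch, -1)
    t = target.unfold(1, hs, hs).unfold(2, ws, ws).reshape(C, patch * patch, -1)
    local_sim = F.cosine_similarity(r, t, dim=-1).mean()   # fine-grained term
    return w_global * global_sim + (1 - w_global) * local_sim
```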
[229] CoT-Saliency: Unified Chain-of-Thought Reasoning for Heterogeneous Saliency Tasks
Long Li, Shuichen Ji, Ziyang Luo, Nian Liu, Dingwen Zhang, Junwei Han
Main category: cs.CV
TL;DR: A unified framework that handles three saliency tasks (SOD, CoSOD, SIS) using Chain-of-Thought reasoning in Vision-Language Models with a two-stage training approach (SFT + RL) and a novel Confidence-Guided Policy Optimization method.
Details
Motivation: To address operational heterogeneity across different saliency tasks by creating a unified framework that can handle multiple tasks through a common reasoning process.Method: Uses Chain-of-Thought reasoning in VLMs with two-stage training: Supervised Fine-Tuning and Reinforcement Learning. Introduces Confidence-Guided Policy Optimization (CGPO) for RL and “output-to-reasoning” strategy for SFT data construction.
Result: Matches or outperforms specialized SOTA methods and strong closed-source VLMs across all tasks, achieving S-measure of 0.899 on CoCA for CoSOD (8.0 percentage points improvement over prior best) with less training data.
Conclusion: The proposed unified framework successfully bridges task heterogeneity through CoT reasoning and achieves superior performance across multiple saliency tasks with efficient training.
Abstract: We present the first unified framework that jointly handles three operationally heterogeneous saliency tasks, namely SOD, CoSOD, and SIS, by casting each as a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM) to bridge task heterogeneity. CoT training follows a two-stage paradigm: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). To enhance CoT quality in RL, we propose Confidence-Guided Policy Optimization (CGPO), a lightweight single-sample algorithm that leverages the discrepancy between reward and model confidence as a per-sample advantage signal. This design naturally focuses updates on informative responses while eliminating group sampling, thereby addressing GRPO’s key limitations: confidence-agnostic learning, signal dilution, and prohibitive computational overhead. We also introduce an “output-to-reasoning” strategy to construct high-fidelity SFT data that ensures logical consistency with ground-truth masks. Experiments show our model matches or outperforms specialized SOTA methods and strong closed-source VLMs across all tasks, especially achieving an S-measure of 0.899 on CoCA for CoSOD, surpassing the prior best by 8.0 percentage points, despite using far less training data.
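For intuition, CGPO's per-sample advantage can be read as "reward minus confidence". The sketch below shows one way that signal might be computed, under the assumption that confidence is taken as the mean token probability of the sampled response; the paper's exact definition may differ.

```python
# Hypothetical confidence-guided advantage: reward minus a per-response confidence.
import torch

def cgpo_advantage(rewards, token_logprobs):
    """rewards: (B,) scalar rewards for B sampled responses;
    token_logprobs: list of B tensors, each (T_i,) log-probs of generated tokens."""
    confidence = torch.stack([lp.exp().mean() for lp in token_logprobs])  # (B,)
    # large when the reward exceeds what the model's own confidence would suggest
    return rewards - confidence
```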
[230] LGCA: Enhancing Semantic Representation via Progressive Expansion
Thanh Hieu Cao, Trung Khang Tran, Gia Thinh Pham, Tuong Nghiem Diep, Thanh Binh Nguyen
Main category: cs.CV
TL;DR: LGCA is a framework that addresses misinformation in CLIP by capturing local features, selecting salient regions for expansion, and combining local-global features to improve zero-shot image classification.
Details
Motivation: CLIP's sensitivity to random image crops can introduce misinformation and bias due to similar features at small scales, which needs to be addressed for better performance.Method: LGCA captures local features, repeatedly selects the most salient regions and expands them, then computes similarity scores incorporating both original and expanded images to capture local and global features.
Result: Extensive experiments show LGCA substantially improves zero-shot performance across diverse datasets, outperforming state-of-the-art baselines.
Conclusion: LGCA effectively minimizes misinformation while maintaining efficiency and scalability, providing a robust solution for enhanced zero-shot image classification.
Abstract: Recent advancements in large-scale pretraining in natural language processing have enabled pretrained vision-language models such as CLIP to effectively align images and text, significantly improving performance in zero-shot image classification tasks. Subsequent studies have further demonstrated that cropping images into smaller regions and using large language models to generate multiple descriptions for each caption can further enhance model performance. However, due to the inherent sensitivity of CLIP, random image crops can introduce misinformation and bias, as many images share similar features at small scales. To address this issue, we propose Localized-Globalized Cross-Alignment (LGCA), a framework that first captures the local features of an image and then repeatedly selects the most salient regions and expands them. The similarity score is designed to incorporate both the original and expanded images, enabling the model to capture both local and global features while minimizing misinformation. Additionally, we provide a theoretical analysis demonstrating that the time complexity of LGCA remains the same as that of the original model prior to the repeated expansion process, highlighting its efficiency and scalability. Extensive experiments demonstrate that our method substantially improves zero-shot performance across diverse datasets, outperforming state-of-the-art baselines.
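A minimal sketch of how local and global evidence could be combined at inference time, assuming CLIP embeddings for the full image, its expanded salient crops, and the class prompts are already computed; the weighting alpha and mean-pooling over crops are assumptions rather than the paper's exact score.

```python
# Illustrative local-global score fusion over CLIP embeddings.
import torch
import torch.nn.functional as F

def lgca_score(img_emb, crop_embs, text_embs, alpha=0.5):
    """img_emb: (D,) whole-image embedding; crop_embs: (K, D) embeddings of
    expanded salient regions; text_embs: (C, D) class-prompt embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    crop_embs = F.normalize(crop_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    global_sim = img_emb @ text_embs.T                 # (C,) global evidence
    local_sim = (crop_embs @ text_embs.T).mean(dim=0)  # (C,) averaged local evidence
    return alpha * global_sim + (1 - alpha) * local_sim  # argmax gives the class
```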
[231] Leveraging Hierarchical Image-Text Misalignment for Universal Fake Image Detection
Daichi Zhang, Tong Zhang, Jianmin Bao, Shiming Ge, Sabine Süsstrunk
Main category: cs.CV
TL;DR: ITEM is a fake image detector that uses image-text misalignment in CLIP space as discriminative clues, outperforming existing methods with better generalization across generative models.
Details
Motivation: Existing fake image detection methods focus only on visual clues and suffer from overfitting to specific image patterns, lacking generalization to unseen generative models.Method: Proposes ITEM detector that measures image-text misalignment in pre-trained CLIP space using a hierarchical scheme - global image alignment and fine-grained object-level alignment - then tunes an MLP head for detection.
Result: Extensive experiments show superior performance against state-of-the-art competitors with impressive generalization and robustness across various recent generative models.
Conclusion: Leveraging multi-modal image-text misalignment provides more effective and generalizable fake image detection compared to visual-only approaches.
Abstract: With the rapid development of generative models, detecting generated fake images to prevent their malicious use has recently become a critical issue. Existing methods frame this challenge as a naive binary image classification task. However, such methods focus only on visual clues, yielding trained detectors susceptible to overfitting specific image patterns and incapable of generalizing to unseen models. In this paper, we address this issue from a multi-modal perspective and find that, compared to real images, fake images cannot be properly aligned with their corresponding captions. Based on this observation, we propose a simple yet effective detector termed ITEM that leverages image-text misalignment in a joint visual-language space as a discriminative clue. Specifically, we first measure the misalignment of the images and captions in pre-trained CLIP’s space, and then tune an MLP head to perform the usual detection task. Furthermore, we propose a hierarchical misalignment scheme that first focuses on the whole image and then on each semantic object described in the caption, which can explore both global and fine-grained local semantic misalignment as clues. Extensive experiments demonstrate the superiority of our method against other state-of-the-art competitors with impressive generalization and robustness on various recent generative models.
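The detector's core idea, global and object-level image-caption similarity feeding a small MLP, can be sketched as follows. The layer sizes, number of object regions, and use of plain cosine similarity are assumptions for illustration only.

```python
# Rough sketch of a misalignment-based fake-image head over CLIP embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MisalignmentHead(nn.Module):
    def __init__(self, num_regions=4):
        super().__init__()
        # one global similarity plus one similarity per described object region
        self.mlp = nn.Sequential(nn.Linear(1 + num_regions, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, img_emb, obj_embs, cap_emb):
        """img_emb: (D,); obj_embs: (num_regions, D); cap_emb: (D,)."""
        g = F.cosine_similarity(img_emb, cap_emb, dim=-1).unsqueeze(0)    # (1,)
        l = F.cosine_similarity(obj_embs, cap_emb.unsqueeze(0), dim=-1)   # (num_regions,)
        return self.mlp(torch.cat([g, l]))  # logit: real vs. generated
```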
[232] Enhancing Frequency Forgery Clues for Diffusion-Generated Image Detection
Daichi Zhang, Tong Zhang, Shiming Ge, Sabine Süsstrunk
Main category: cs.CV
TL;DR: Proposes a frequency-based detection method (F^2C) that enhances discriminative frequency bands to identify diffusion-generated images, achieving superior generalization to unseen models and robustness to perturbations.
Details
Motivation: Address limitations of existing detectors that struggle with generalization across different diffusion models and robustness to various perturbations, by leveraging frequency domain differences between real and generated images.Method: Introduces a frequency-selective function as a weighted filter to Fourier spectrum, suppressing less discriminative bands while enhancing more informative ones based on progressive frequency differences observed between natural and diffusion-generated images.
Result: Extensive experiments show the method outperforms state-of-the-art detectors with superior generalization to unseen diffusion models and robust resilience to various perturbations across multiple datasets.
Conclusion: The frequency-based approach (F^2C) provides an effective solution for detecting diffusion-generated images by leveraging frequency domain characteristics, demonstrating strong generalization and robustness capabilities.
Abstract: Diffusion models have achieved remarkable success in image synthesis, but the generated high-quality images raise concerns about potential malicious use. Existing detectors often struggle to capture discriminative clues across different models and settings, limiting their generalization to unseen diffusion models and robustness to various perturbations. To address this issue, we observe that diffusion-generated images exhibit progressively larger differences from natural real images across low- to high-frequency bands. Based on this insight, we propose a simple yet effective representation by enhancing the Frequency Forgery Clue (F^2C) across all frequency bands. Specifically, we introduce a frequency-selective function which serves as a weighted filter to the Fourier spectrum, suppressing less discriminative bands while enhancing more informative ones. This approach, grounded in a comprehensive analysis of frequency-based differences between natural real and diffusion-generated images, enables general detection of images from unseen diffusion models and provides robust resilience to various perturbations. Extensive experiments on various diffusion-generated image datasets demonstrate that our method outperforms state-of-the-art detectors with superior generalization and robustness.
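The frequency-selective filter can be thought of as a radial re-weighting of the Fourier spectrum. Below is a generic sketch under the assumption that band weights are given by some function of normalized radial frequency; in the paper these weights come from an analysis of real-vs-generated frequency differences, which is not reproduced here.

```python
# Generic radial re-weighting of an image's Fourier spectrum.
import torch

def frequency_filtered(image, band_weights):
    """image: (C, H, W) tensor; band_weights: callable mapping normalized radial
    frequency in [0, 1] to a weight (assumed interface)."""
    spec = torch.fft.fftshift(torch.fft.fft2(image), dim=(-2, -1))
    _, H, W = image.shape
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    radius = torch.sqrt(xx ** 2 + yy ** 2) / (2 ** 0.5)   # 0 at DC, 1 at the corners
    weights = band_weights(radius)                         # (H, W) per-frequency weights
    filtered = spec * weights                              # suppress/enhance bands
    return torch.fft.ifft2(torch.fft.ifftshift(filtered, dim=(-2, -1))).real
```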
[233] ToxicTextCLIP: Text-Based Poisoning and Backdoor Attacks on CLIP Pre-training
Xin Yao, Haiyang Zhao, Yimin Chen, Jiawei Guo, Kecheng Huang, Ming Zhao
Main category: cs.CV
TL;DR: ToxicTextCLIP is a framework that generates adversarial texts to poison CLIP during pre-training, achieving high success rates while bypassing existing defenses.
Details
Motivation: CLIP's reliance on uncurated web data makes it vulnerable to data poisoning and backdoor attacks, particularly through the text modality which has been underexplored compared to image-based attacks.Method: The framework uses iterative background-aware selection to find texts aligned with target classes, and background-driven augmentation to generate diverse poisoned samples while maintaining semantic coherence.
Result: Achieves up to 95.83% poisoning success rate and 98.68% backdoor Hit@1 on classification and retrieval tasks, successfully bypassing RoCLIP, CleanCLIP and SafeCLIP defenses.
Conclusion: ToxicTextCLIP demonstrates significant vulnerabilities in CLIP’s text modality, highlighting the need for more robust defense mechanisms against text-based poisoning attacks.
Abstract: The Contrastive Language-Image Pretraining (CLIP) model has significantly advanced vision-language modeling by aligning image-text pairs from large-scale web data through self-supervised contrastive learning. Yet, its reliance on uncurated Internet-sourced data exposes it to data poisoning and backdoor risks. While existing studies primarily investigate image-based attacks, the text modality, which is equally central to CLIP’s training, remains underexplored. In this work, we introduce ToxicTextCLIP, a framework for generating high-quality adversarial texts that target CLIP during the pre-training phase. The framework addresses two key challenges: semantic misalignment caused by background inconsistency with the target class, and the scarcity of background-consistent texts. To this end, ToxicTextCLIP iteratively applies: 1) a background-aware selector that prioritizes texts with background content aligned to the target class, and 2) a background-driven augmenter that generates semantically coherent and diverse poisoned samples. Extensive experiments on classification and retrieval tasks show that ToxicTextCLIP achieves up to 95.83% poisoning success and 98.68% backdoor Hit@1, while bypassing RoCLIP, CleanCLIP and SafeCLIP defenses. The source code can be accessed via https://github.com/xinyaocse/ToxicTextCLIP/.
[234] Weakly Supervised Pneumonia Localization from Chest X-Rays Using Deep Neural Network and Grad-CAM Explanations
Kiran Shahi, Anup Bagale
Main category: cs.CV
TL;DR: Weakly supervised deep learning framework for pneumonia classification and localization from chest X-rays using Grad-CAM explanations, achieving 98% accuracy with ResNet-18 and EfficientNet-B0.
Details
Motivation: To develop a pneumonia screening system that doesn't require costly pixel-level annotations and provides clinically meaningful localization of affected regions to enhance transparency and clinical trust in AI-assisted medical imaging.Method: Uses image-level labels with Grad-CAM explanations to generate pneumonia localization heatmaps. Evaluates seven ImageNet-pretrained architectures (ResNet-18/50, DenseNet-121, EfficientNet-B0, MobileNet-V2/V3, ViT-B16) under identical training conditions with focal loss and patient-wise splits to prevent data leakage.
Result: ResNet-18 and EfficientNet-B0 achieve best overall test accuracy of 98%, ROC-AUC = 0.997, and F1 = 0.987. MobileNet-V2 provides optimal trade-off between accuracy and computational cost. Grad-CAM visualizations confirm models focus on clinically relevant lung regions.
Conclusion: The proposed weakly supervised explainable models enhance pneumonia screening transparency and clinical trust in AI-assisted medical imaging, demonstrating the potential of interpretable AI for radiological diagnostics.
Abstract: This study proposes a weakly supervised deep learning framework for pneumonia classification and localization from chest X-rays, utilizing Grad-CAM explanations. Instead of costly pixel-level annotations, our approach uses image-level labels to generate clinically meaningful heatmaps that highlight regions affected by pneumonia. We evaluate seven ImageNet-pretrained architectures (ResNet-18/50, DenseNet-121, EfficientNet-B0, MobileNet-V2/V3, and ViT-B16) under identical training conditions with focal loss and patient-wise splits to prevent data leakage. Experimental results on the Kermany CXR dataset demonstrate that ResNet-18 and EfficientNet-B0 achieve the best overall test accuracy of 98%, ROC-AUC = 0.997, and F1 = 0.987, while MobileNet-V2 provides an optimal trade-off between accuracy and computational cost. Grad-CAM visualizations confirm that the proposed models focus on clinically relevant lung regions, supporting the use of interpretable AI for radiological diagnostics. This work highlights the potential of weakly supervised explainable models to enhance the transparency of pneumonia screening and clinical trust in AI-assisted medical imaging. https://github.com/kiranshahi/pneumonia-analysis
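Since the localization relies on standard Grad-CAM, a compact sketch of that step for a ResNet-18 classifier is shown below; it is a generic Grad-CAM implementation, not the authors' training or preprocessing pipeline.

```python
# Minimal Grad-CAM over the last convolutional block of a ResNet-18 classifier.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(num_classes=2).eval()          # normal vs. pneumonia (untrained here)
feats, grads = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(a=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)                 # stand-in for a preprocessed chest X-ray
model(x)[0, 1].backward()                       # backprop from the "pneumonia" logit

w = grads["a"].mean(dim=(2, 3), keepdim=True)   # channel weights from pooled gradients
cam = F.relu((w * feats["a"]).sum(dim=1, keepdim=True))           # (1, 1, 7, 7)
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)          # heatmap in [0, 1]
```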
[235] HumanCrafter: Synergizing Generalizable Human Reconstruction and Semantic 3D Segmentation
Panwang Pan, Tingting Shen, Chenxin Li, Yunlong Lin, Kairun Wen, Jingjing Zhao, Yixuan Yuan
Main category: cs.CV
TL;DR: HumanCrafter is a unified framework that jointly models appearance and human-part semantics from single images, integrating geometric priors for reconstruction and self-supervised semantic priors for segmentation.
Details
Motivation: Current generative models achieve high-fidelity 3D human reconstruction but lack utility for specific tasks like human 3D segmentation. There's also a scarcity of labeled 3D human datasets.Method: Integrates human geometric priors in reconstruction and self-supervised semantic priors in segmentation. Uses pixel-aligned aggregation for cross-task synergy and multi-task objective for simultaneous texture modeling and semantic consistency optimization. Includes interactive annotation for generating data-label pairs.
Result: Extensive experiments show HumanCrafter surpasses state-of-the-art methods in both 3D human-part segmentation and 3D human reconstruction from single images.
Conclusion: HumanCrafter provides an effective unified framework that addresses the limitations of current methods by enabling joint modeling of appearance and semantics while overcoming dataset scarcity through interactive annotation.
Abstract: Recent advances in generative models have achieved high fidelity in 3D human reconstruction, yet their utility for specific tasks (e.g., human 3D segmentation) remains constrained. We propose HumanCrafter, a unified framework that enables the joint modeling of appearance and human-part semantics from a single image in a feed-forward manner. Specifically, we integrate human geometric priors in the reconstruction stage and self-supervised semantic priors in the segmentation stage. To address the scarcity of labeled 3D human datasets, we further develop an interactive annotation procedure for generating high-quality data-label pairs. Our pixel-aligned aggregation enables cross-task synergy, while the multi-task objective simultaneously optimizes texture modeling fidelity and semantic consistency. Extensive experiments demonstrate that HumanCrafter surpasses existing state-of-the-art methods in both 3D human-part segmentation and 3D human reconstruction from a single image.
[236] Longitudinal Vestibular Schwannoma Dataset with Consensus-based Human-in-the-loop Annotations
Navodini Wijethilake, Marina Ivory, Oscar MacCormac, Siddhant Kumar, Aaron Kujawa, Lorena Garcia-Foncillas Macias, Rebecca Burger, Amanda Hitchings, Suki Thomson, Sinan Barazi, Eleni Maratos, Rupert Obholzer, Dan Jiang, Fiona McClenaghan, Kazumi Chia, Omar Al-Salihi, Nick Thomas, Steve Connor, Tom Vercauteren, Jonathan Shapey
Main category: cs.CV
TL;DR: A bootstrapped deep learning framework for automated vestibular schwannoma segmentation in MRI that combines iterative segmentation with expert quality refinement, achieving high accuracy and 37.4% efficiency improvement over manual annotation.
Details
Motivation: Manual segmentation of vestibular schwannoma on MRI is time-consuming and requires expert annotation. Current automated methods lack robustness across diverse datasets and complex clinical cases.Method: Human-in-the-loop model training approach combining data from multiple centers with expert consensus for annotation trustworthiness. Uses iterative segmentation and quality refinement through a bootstrapped DL framework.
Result: Significant improvement in segmentation accuracy (DSC increased from 0.9125 to 0.9670) on internal validation dataset while maintaining stable performance on external datasets. Expert evaluation on 143 scans identified areas for refinement.
Conclusion: The approach provides a clinically adaptable and generalizable strategy for automated VS segmentation with high accuracy and efficiency gains, making it suitable for diverse clinical settings.
Abstract: Accurate segmentation of vestibular schwannoma (VS) on Magnetic Resonance Imaging (MRI) is essential for patient management but often requires time-intensive manual annotations by experts. While recent advances in deep learning (DL) have facilitated automated segmentation, challenges remain in achieving robust performance across diverse datasets and complex clinical cases. We present an annotated dataset stemming from a bootstrapped DL-based framework for iterative segmentation and quality refinement of VS in MRI. We combine data from multiple centres and rely on expert consensus for trustworthiness of the annotations. We show that our approach enables effective and resource-efficient generalisation of automated segmentation models to a target data distribution. The framework achieved a significant improvement in segmentation accuracy with a Dice Similarity Coefficient (DSC) increase from 0.9125 to 0.9670 on our target internal validation dataset, while maintaining stable performance on representative external datasets. Expert evaluation on 143 scans further highlighted areas for model refinement, revealing nuanced cases where segmentation required expert intervention. The proposed approach is estimated to enhance efficiency by approximately 37.4% compared to the conventional manual annotation process. Overall, our human-in-the-loop model training approach achieved high segmentation accuracy, highlighting its potential as a clinically adaptable and generalisable strategy for automated VS segmentation in diverse clinical settings. The dataset includes 190 patients, with tumour annotations available for 534 longitudinal contrast-enhanced T1-weighted (T1CE) scans from 184 patients, and non-annotated T2-weighted scans from 6 patients. This dataset is publicly accessible on The Cancer Imaging Archive (TCIA) (https://doi.org/10.7937/bq0z-xa62).
[237] FedMGP: Personalized Federated Learning with Multi-Group Text-Visual Prompts
Weihao Bo, Yanpeng Sun, Yu Wang, Xinyu Zhang, Zechao Li
Main category: cs.CV
TL;DR: FedMGP introduces personalized federated prompt learning for vision-language models using multiple prompt groups with diversity loss and dynamic similarity-based aggregation, achieving state-of-the-art performance with minimal communication parameters.
Details
Motivation: To address the challenge of capturing diverse, fine-grained semantic and instance-level cues in federated learning while maintaining parameter efficiency and enabling effective personalization across clients.Method: Equips each client with multiple groups of paired textual and visual prompts, uses diversity loss to specialize groups in distinct semantic aspects, and employs dynamic prompt aggregation via similarity-guided probabilistic sampling for knowledge sharing.
Result: Achieves state-of-the-art performance with the lowest communication parameters among federated prompt learning methods, consistently outperforming prior approaches in both personalization and domain generalization across diverse benchmarks.
Conclusion: FedMGP provides an effective framework for personalized federated prompt learning that balances common knowledge preservation with client-specific features through multi-group prompts and dynamic aggregation, demonstrating superior performance and parameter efficiency.
Abstract: In this paper, we introduce FedMGP, a new paradigm for personalized federated prompt learning in vision-language models. FedMGP equips each client with multiple groups of paired textual and visual prompts, enabling the model to capture diverse, fine-grained semantic and instance-level cues. A diversity loss is introduced to drive each prompt group to specialize in distinct and complementary semantic aspects, ensuring that the groups collectively cover a broader range of local characteristics. During communication, FedMGP employs a dynamic prompt aggregation strategy based on similarity-guided probabilistic sampling: each client computes the cosine similarity between its prompt groups and the global prompts from the previous round, then samples s groups via a softmax-weighted distribution. This soft selection mechanism preferentially aggregates semantically aligned knowledge while still enabling exploration of underrepresented patterns, effectively balancing the preservation of common knowledge with client-specific features. Notably, FedMGP maintains parameter efficiency by redistributing a fixed prompt capacity across multiple groups, achieving state-of-the-art performance with the lowest communication parameters among all federated prompt learning methods. Theoretical analysis shows that our dynamic aggregation strategy promotes robust global representation learning by reinforcing shared semantics while suppressing client-specific noise. Extensive experiments demonstrate that FedMGP consistently outperforms prior approaches in both personalization and domain generalization across diverse federated vision-language benchmarks. The code will be released on https://github.com/weihao-bo/FedMGP.git.
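The dynamic aggregation step can be summarized as similarity-scored softmax sampling of prompt groups. The snippet below sketches that selection under assumed shapes and a temperature hyperparameter; it omits the surrounding federated training loop.

```python
# Sketch of similarity-guided probabilistic sampling of local prompt groups.
import torch
import torch.nn.functional as F

def sample_prompt_groups(local_groups, global_prompt, s=2, temperature=1.0):
    """local_groups: (G, D) flattened per-group prompt vectors;
    global_prompt: (D,) aggregated prompt from the previous round."""
    sims = F.cosine_similarity(local_groups, global_prompt.unsqueeze(0), dim=-1)  # (G,)
    probs = F.softmax(sims / temperature, dim=0)        # soft preference for aligned groups
    idx = torch.multinomial(probs, num_samples=s, replacement=False)
    return local_groups[idx], idx                       # groups sent for aggregation
```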
[238] Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models
Panwang Pan, Chenguo Lin, Jingjing Zhao, Chenxin Li, Yuchen Lin, Haopeng Li, Honglei Yan, Kairun Wen, Yunlong Lin, Yixuan Yuan, Yadong Mu
Main category: cs.CV
TL;DR: Diff4Splat is a feed-forward method that synthesizes controllable 4D scenes from a single image using video diffusion models and 3D Gaussian primitives, achieving high-quality results in 30 seconds without test-time optimization.
Details
Motivation: To enable efficient synthesis of controllable 4D scenes from single images, overcoming the limitations of optimization-based methods that require extensive computation and time.Method: Unifies video diffusion model priors with geometry/motion constraints from 4D datasets, using a video latent transformer to predict deformable 3D Gaussian fields in a single forward pass.
Result: Synthesizes high-quality 4D scenes in 30 seconds, matching or surpassing optimization-based methods in video generation, novel view synthesis, and geometry extraction while being significantly more efficient.
Conclusion: Diff4Splat provides an efficient feed-forward alternative to optimization-based dynamic scene synthesis, enabling rapid 4D scene generation from single images with competitive quality.
Abstract: We introduce Diff4Splat, a feed-forward method that synthesizes controllable and explicit 4D scenes from a single image. Our approach unifies the generative priors of video diffusion models with geometry and motion constraints learned from large-scale 4D datasets. Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion, all in a single forward pass, without test-time optimization or post-hoc refinement. At the core of our framework lies a video latent transformer, which augments video diffusion models to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives. Training is guided by objectives on appearance fidelity, geometric accuracy, and motion consistency, enabling Diff4Splat to synthesize high-quality 4D scenes in 30 seconds. We demonstrate the effectiveness of Diff4Splat across video generation, novel view synthesis, and geometry extraction, where it matches or surpasses optimization-based methods for dynamic scene synthesis while being significantly more efficient.
[239] VinDr-CXR-VQA: A Visual Question Answering Dataset for Explainable Chest X-Ray Analysis with Multi-Task Learning
Hai-Dang Nguyen, Ha-Hieu Pham, Hao T. Nguyen, Huy-Hieu Pham
Main category: cs.CV
TL;DR: VinDr-CXR-VQA is a large-scale chest X-ray dataset for medical visual question answering with spatial grounding, containing 17,597 QA pairs across 4,394 images with radiologist-verified bounding boxes and clinical reasoning.
Details
Motivation: To advance reproducible and clinically grounded medical visual question answering research by providing a comprehensive dataset with spatial grounding and balanced distribution to mitigate hallucinations in normal cases.Method: Created a dataset with 17,597 question-answer pairs across 4,394 chest X-ray images, annotated with radiologist-verified bounding boxes and clinical reasoning explanations. Used a six-type question taxonomy covering Where, What, Is there, How many, Which, and Yes/No questions.
Result: Benchmarking with MedGemma-4B-it showed improved performance (F1 = 0.624, +11.8% over baseline) while enabling lesion localization. The dataset has balanced distribution with 41.7% positive and 58.3% negative samples.
Conclusion: VinDr-CXR-VQA successfully advances reproducible and clinically grounded Med-VQA research by providing a comprehensive dataset with spatial grounding capabilities, improved performance metrics, and publicly available resources for the research community.
Abstract: We present VinDr-CXR-VQA, a large-scale chest X-ray dataset for explainable Medical Visual Question Answering (Med-VQA) with spatial grounding. The dataset contains 17,597 question-answer pairs across 4,394 images, each annotated with radiologist-verified bounding boxes and clinical reasoning explanations. Our question taxonomy spans six diagnostic types-Where, What, Is there, How many, Which, and Yes/No-capturing diverse clinical intents. To improve reliability, we construct a balanced distribution of 41.7% positive and 58.3% negative samples, mitigating hallucinations in normal cases. Benchmarking with MedGemma-4B-it demonstrates improved performance (F1 = 0.624, +11.8% over baseline) while enabling lesion localization. VinDr-CXR-VQA aims to advance reproducible and clinically grounded Med-VQA research. The dataset and evaluation tools are publicly available at huggingface.co/datasets/Dangindev/VinDR-CXR-VQA.
[240] ID-Composer: Multi-Subject Video Synthesis with Hierarchical Identity Preservation
Panwang Pan, Jingjing Zhao, Yuchen Lin, Chenguo Lin, Chenxin Li, Haopeng Li, Honglei Yan, Tingting Shen, Yadong Mu
Main category: cs.CV
TL;DR: ID-Composer is a novel framework for multi-subject video generation from text prompts and reference images, using hierarchical identity-preserving attention, VLM semantic understanding, and reinforcement learning to improve identity preservation and temporal consistency.
Details
Motivation: Existing video generative models are limited to text or single image conditioning, lacking controllability for multi-subject scenarios where preserving multiple subject identities and maintaining temporal consistency is challenging.Method: Uses hierarchical identity-preserving attention mechanism to aggregate features across subjects and modalities, leverages pretrained VLM for semantic understanding, and employs online reinforcement learning (RLVR) to align critical concepts like subject ID.
Result: Extensive experiments show the model surpasses existing methods in identity preservation, temporal consistency, and video quality.
Conclusion: ID-Composer effectively addresses multi-subject video generation challenges through its hierarchical attention, VLM integration, and reinforcement learning approach, achieving superior performance in preserving identities and maintaining consistency.
Abstract: Video generative models pretrained on large-scale datasets can produce high-quality videos, but are often conditioned on text or a single image, limiting controllability and applicability. We introduce ID-Composer, a novel framework that addresses this gap by tackling multi-subject video generation from a text prompt and reference images. This task is challenging as it requires preserving subject identities, integrating semantics across subjects and modalities, and maintaining temporal consistency. To faithfully preserve subject consistency and textual information in synthesized videos, ID-Composer designs a hierarchical identity-preserving attention mechanism, which effectively aggregates features within and across subjects and modalities. To effectively follow the semantics of the user's intention, we introduce semantic understanding via a pretrained vision-language model (VLM), leveraging the VLM’s superior semantic understanding to provide fine-grained guidance and capture complex interactions between multiple subjects. Considering that the standard diffusion loss often fails to align critical concepts like subject ID, we employ an online reinforcement learning phase that casts the overall training objective of ID-Composer as RLVR. Extensive experiments demonstrate that our model surpasses existing methods in identity preservation, temporal consistency, and video quality.
[241] SegDebias: Test-Time Bias Mitigation for ViT-Based CLIP via Segmentation
Fangyu Wu, Yujun Cai
Main category: cs.CV
TL;DR: A test-time debiasing method for CLIP models that uses segmentation to isolate target attributes and remove bias from non-target regions without requiring training data or bias annotations.
Details
Motivation: Existing debiasing methods require training data and group labels, limiting practicality. Test-time methods often need prior knowledge of dataset biases, reducing generalizability in open-set settings.Method: Uses pretrained segmentation model to isolate target visual attribute, then adjusts non-target regions so their embeddings are uniformly similar to all class-specific text prompts, removing unintended bias signals.
Result: Outperforms existing test-time debiasing approaches on Waterbirds and CelebA datasets in both group robustness metrics and Attention IoU.
Conclusion: Segmentation-guided interventions are effective for scalable and annotation-free bias mitigation in vision language models.
Abstract: Vision-language models such as CLIP have shown remarkable performance in zero-shot classification, but remain susceptible to spurious correlations, where irrelevant visual features influence predictions. Existing debiasing methods often require access to training data and explicit group labels to perform fine-tuning or adjust embeddings, which limits their practicality in real-world settings. Test-time methods attempt to avoid this constraint, but many still depend on prior knowledge of dataset-specific biases, limiting their generalizability in open-set settings. In this work, we propose a test-time debiasing method for ViT-based CLIP models that requires no additional training or assumed bias annotations. Our approach uses a pretrained segmentation model to isolate the target visual attribute, then adjusts the non-target regions so that their embeddings are uniformly similar to all class-specific text prompts. This procedure removes unintended bias signals from confounding visual regions while preserving the target attribute. Experiments on Waterbirds and CelebA show that our method outperforms existing test-time debiasing approaches in both group robustness metrics and Attention IoU. These results demonstrate the effectiveness of segmentation-guided interventions for scalable and annotation-free bias mitigation in vision-language models.
[242] Text-guided Fine-Grained Video Anomaly Detection
Jihao Gu, Kun Li, He Wang, Kaan Akşit
Main category: cs.CV
TL;DR: T-VAD is a text-guided fine-grained video anomaly detection framework using Large Vision-Language Models that generates pixel-level anomaly heatmaps and provides detailed textual descriptions, achieving state-of-the-art performance.
Details
Motivation: Traditional VAD methods are semi-automated with limited binary outputs (normal/anomalous), lacking fine-grained detection and interactivity needed for surveillance and industrial monitoring applications.Method: Built on LVLM with Anomaly Heatmap Decoder for pixel-wise visual-textual feature alignment to generate anomaly heatmaps, and Region-aware Anomaly Encoder to transform heatmaps into textual embeddings for precise anomaly identification and localization.
Result: Achieved 94.8% AUC on UBnormal dataset, 67.8%/76.7% accuracy in anomaly heatmaps, and high BLEU-4 scores (62.67-88.84) and Yes/No accuracy (89.73%-97.67%) for textual descriptions on multiple datasets.
Conclusion: T-VAD significantly enhances anomaly detection granularity and interactivity by combining fine-grained heatmaps with detailed textual guidance, demonstrating superior performance over existing methods.
Abstract: Video Anomaly Detection (VAD) aims to identify anomalous events within video segments. In scenarios such as surveillance or industrial process monitoring, anomaly detection is of critical importance. Existing approaches are semi-automated, requiring human assessment for anomaly detection, and traditional VAD methods provide only a limited binary output of normal or anomalous. We propose Text-guided Fine-Grained Video Anomaly Detection (T-VAD), a framework built upon a Large Vision-Language Model (LVLM). T-VAD introduces an Anomaly Heatmap Decoder (AHD) that performs pixel-wise visual-textual feature alignment to generate fine-grained anomaly heatmaps. Furthermore, we design a Region-aware Anomaly Encoder (RAE) that transforms the heatmaps into learnable textual embeddings, guiding the LVLM to accurately identify and localize anomalous events in videos. This significantly enhances both the granularity and interactivity of anomaly detection. The proposed method achieves SOTA performance, demonstrating 94.8% Area Under the Curve (AUC, specifically micro-AUC) and 67.8%/76.7% accuracy in anomaly heatmaps (RBDC/TBDC) on the UBnormal dataset, along with textual descriptions judged preferable in subjective evaluation on the ShanghaiTech-based dataset (BLEU-4: 62.67 for targets, 88.84 for trajectories; Yes/No accuracy: 97.67%) and on the UBnormal dataset (BLEU-4: 50.32 for targets, 78.10 for trajectories; Yes/No accuracy: 89.73%).
[243] Real-IAD Variety: Pushing Industrial Anomaly Detection Dataset to a Modern Era
Wenbing Zhu, Chengjie Wang, Bin-Bin Gao, Jiangning Zhang, Guannan Jiang, Jie Hu, Zhenye Gan, Lidong Wang, Ziqing Zhou, Linjie Cheng, Yurui Pan, Bo Peng, Mingmin Chi, Lizhuang Ma
Main category: cs.CV
TL;DR: Introduces Real-IAD Variety, the largest and most diverse industrial anomaly detection benchmark with 198,960 images across 160 categories, showing that vision-language models maintain robust performance when scaled to more categories while traditional methods degrade.
Details
Motivation: Existing IAD benchmarks have limited category diversity and scale, causing metric saturation and poor real-world transferability, creating a need for more comprehensive evaluation resources.Method: Created Real-IAD Variety benchmark with 198,960 high-resolution images across 160 object categories, covering 28 industries, 24 material types, and 22 color variations, with rigorous evaluation protocols.
Result: State-of-the-art multi-class unsupervised methods show significant performance degradation when scaled from 30 to 160 categories, while vision-language models exhibit remarkable robustness with minimal performance variation across category counts.
Conclusion: Real-IAD Variety provides an essential resource for training and evaluating next-generation foundation models, enabling development of scalable, general-purpose anomaly detection systems beyond domain-specific constraints.
Abstract: Industrial Anomaly Detection (IAD) is critical for enhancing operational safety, ensuring product quality, and optimizing manufacturing efficiency across global industries. However, the IAD algorithms are severely constrained by the limitations of existing public benchmarks. Current datasets exhibit restricted category diversity and insufficient scale, frequently resulting in metric saturation and limited model transferability to real-world scenarios. To address this gap, we introduce Real-IAD Variety, the largest and most diverse IAD benchmark, comprising 198,960 high-resolution images across 160 distinct object categories. Its diversity is ensured through comprehensive coverage of 28 industries, 24 material types, and 22 color variations. Our comprehensive experimental analysis validates the benchmark’s substantial challenge: state-of-the-art multi-class unsupervised anomaly detection methods experience significant performance degradation when scaled from 30 to 160 categories. Crucially, we demonstrate that vision-language models exhibit remarkable robustness to category scale-up, with minimal performance variation across different category counts, significantly enhancing generalization capabilities in diverse industrial contexts. The unprecedented scale and complexity of Real-IAD Variety position it as an essential resource for training and evaluating next-generation foundation models for anomaly detection. By providing this comprehensive benchmark with rigorous evaluation protocols across multi-class unsupervised, multi-view, and zero-/few-shot settings, we aim to accelerate research beyond domain-specific constraints, enabling the development of scalable, general-purpose anomaly detection systems. Real-IAD Variety will be made publicly available to facilitate innovation in this critical field.
[244] MIFO: Learning and Synthesizing Multi-Instance from One Image
Kailun Su, Ziqi He, Xi Wang, Yang Zhou
Main category: cs.CV
TL;DR: Proposes a method for learning and synthesizing multi-instance semantics from a single image using penalty-based attention optimization and box control to handle similar semantics and achieve precise layout control.
Details
Motivation: The challenge lies in limited training data and the difficulty of disentangling similar semantics or appearance when learning from a single image.Method: Uses penalty-based attention optimization to disentangle similar semantics during learning, and introduces box control in attention layers during synthesis to mitigate semantic leakage while controlling output layout.
Result: Achieves disentangled and high-quality semantic learning and synthesis, balancing editability and instance consistency. Remains robust with semantically/visually similar instances or rare objects.
Conclusion: The method effectively addresses multi-instance semantic learning from single images, providing robust performance even with challenging cases of similar semantics.
Abstract: This paper proposes a method for precisely learning and synthesizing multi-instance semantics from a single image. The difficulty of this problem lies in the limited training data, and it becomes even more challenging when the instances to be learned have similar semantics or appearance. To address this, we propose a penalty-based attention optimization to disentangle similar semantics during the learning stage. Then, during synthesis, we introduce and optimize box control in the attention layers to further mitigate semantic leakage while precisely controlling the output layout. Experimental results demonstrate that our method achieves disentangled and high-quality semantic learning and synthesis, striking a balance between editability and instance consistency. Our method remains robust when dealing with semantically or visually similar instances or rarely seen objects. The code is publicly available at https://github.com/Kareneveve/MIFO
[245] 4D Neural Voxel Splatting: Dynamic Scene Rendering with Voxelized Gaussian Splatting
Chun-Tin Wu, Jun-Cheng Chen
Main category: cs.CV
TL;DR: 4D Neural Voxel Splatting (4D-NVS) combines voxel-based representations with neural Gaussian splatting to efficiently model dynamic scenes, reducing memory overhead and accelerating training while maintaining high image quality.
Details
Motivation: To address the substantial memory overhead from replicating Gaussians across frames in dynamic 3D Gaussian Splatting (3D-GS) scenes.
Method: Uses a compact set of neural voxels with learned deformation fields to model temporal dynamics instead of generating separate Gaussian sets per timestamp, and introduces a novel view refinement stage for selective optimization of challenging viewpoints.
Result: Outperforms state-of-the-art approaches with significant memory reduction and faster training, enabling real-time rendering with superior visual fidelity.
Conclusion: 4D-NVS provides an efficient solution for dynamic scene modeling that reduces memory consumption while preserving high rendering quality.
Abstract: Although 3D Gaussian Splatting (3D-GS) achieves efficient rendering for novel view synthesis, extending it to dynamic scenes still results in substantial memory overhead from replicating Gaussians across frames. To address this challenge, we propose 4D Neural Voxel Splatting (4D-NVS), which combines voxel-based representations with neural Gaussian splatting for efficient dynamic scene modeling. Instead of generating separate Gaussian sets per timestamp, our method employs a compact set of neural voxels with learned deformation fields to model temporal dynamics. The design greatly reduces memory consumption and accelerates training while preserving high image quality. We further introduce a novel view refinement stage that selectively improves challenging viewpoints through targeted optimization, maintaining global efficiency while enhancing rendering quality for difficult viewing angles. Experiments demonstrate that our method outperforms state-of-the-art approaches with significant memory reduction and faster training, enabling real-time rendering with superior visual fidelity.
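The deformation-field idea (one shared canonical voxel set warped per timestamp) can be sketched roughly as below; the tiny MLP, its raw (x, y, z, t) input encoding, and all names are assumptions for illustration rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class VoxelDeformation(nn.Module):
    """Tiny MLP predicting a per-voxel offset for a given timestamp, so one
    canonical voxel set can represent every frame of a dynamic scene."""
    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),       # input: (x, y, z, t)
            nn.Linear(hidden, 3),                  # output: (dx, dy, dz)
        )

    def forward(self, voxel_centers, t):
        # voxel_centers: (N, 3); t: scalar time in [0, 1]
        t_col = torch.full_like(voxel_centers[:, :1], float(t))
        offsets = self.mlp(torch.cat([voxel_centers, t_col], dim=-1))
        return voxel_centers + offsets             # deformed positions at time t

if __name__ == "__main__":
    centers = torch.rand(1024, 3)                  # canonical neural voxel centers
    deform = VoxelDeformation()
    print(deform(centers, t=0.5).shape)            # torch.Size([1024, 3])
```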
[246] Generalized Category Discovery under Domain Shift: A Frequency Domain Perspective
Wei Feng, Zongyuan Ge
Main category: cs.CV
TL;DR: FREE is a frequency-guided framework for Domain-Shifted Generalized Category Discovery that uses frequency-domain analysis to handle distribution shifts between known and unknown domains, improving category discovery performance.
Details
Motivation: Existing GCD methods perform poorly under distribution shifts, and real-world unlabeled data often comes from unknown domains with different distributions than the labeled data.
Method: Uses frequency-based domain separation, cross-domain and intra-domain frequency perturbation strategies, extended self-supervised contrastive learning, semantic clustering loss, and clustering-difficulty-aware resampling.
Result: Extensive experiments show FREE effectively mitigates distribution shift impacts and achieves superior performance in discovering both known and unknown categories across benchmark datasets.
Conclusion: Frequency-domain information is effective for handling distribution shifts in GCD, and the proposed FREE framework significantly improves category discovery performance in domain-shifted scenarios.
Abstract: Generalized Category Discovery (GCD) aims to leverage labeled samples from known categories to cluster unlabeled data that may include both known and unknown categories. While existing methods have achieved impressive results under standard conditions, their performance often deteriorates in the presence of distribution shifts. In this paper, we explore a more realistic task: Domain-Shifted Generalized Category Discovery (DS-GCD), where the unlabeled data includes not only unknown categories but also samples from unknown domains. To tackle this challenge, we propose a Frequency-guided Generalized Category Discovery framework (FREE) that enhances the model’s ability to discover categories under distributional shift by leveraging frequency-domain information. Specifically, we first propose a frequency-based domain separation strategy that partitions samples into known and unknown domains by measuring their amplitude differences. We then propose two types of frequency-domain perturbation strategies: a cross-domain strategy, which adapts to new distributions by exchanging amplitude components across domains, and an intra-domain strategy, which enhances robustness to intra-domain variations within the unknown domain. Furthermore, we extend the self-supervised contrastive objective and semantic clustering loss to better guide the training process. Finally, we introduce a clustering-difficulty-aware resampling technique to adaptively focus on harder-to-cluster categories, further enhancing model performance. Extensive experiments demonstrate that our method effectively mitigates the impact of distributional shifts across various benchmark datasets and achieves superior performance in discovering both known and unknown categories.
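The cross-domain amplitude-exchange perturbation is closely related to Fourier-style domain mixing. A minimal sketch, assuming a simple low-frequency amplitude band swap; the band size beta and function names are illustrative, not the authors' exact recipe.

```python
import torch

def swap_amplitude(x_src, x_tgt, beta=0.1):
    """Replace the low-frequency amplitude of x_src with that of x_tgt while
    keeping x_src's phase (a cross-domain frequency perturbation)."""
    # x_src, x_tgt: (C, H, W) images
    fft_src = torch.fft.fft2(x_src)
    fft_tgt = torch.fft.fft2(x_tgt)
    amp_src, phase_src = fft_src.abs(), fft_src.angle()
    amp_tgt = fft_tgt.abs()

    _, H, W = x_src.shape
    h, w = int(H * beta), int(W * beta)            # half-size of the swapped band
    amp_src = torch.fft.fftshift(amp_src, dim=(-2, -1))
    amp_tgt = torch.fft.fftshift(amp_tgt, dim=(-2, -1))
    cy, cx = H // 2, W // 2
    amp_src[:, cy - h:cy + h, cx - w:cx + w] = amp_tgt[:, cy - h:cy + h, cx - w:cx + w]
    amp_src = torch.fft.ifftshift(amp_src, dim=(-2, -1))

    mixed = amp_src * torch.exp(1j * phase_src)    # recombine amplitude and phase
    return torch.fft.ifft2(mixed).real

if __name__ == "__main__":
    a, b = torch.rand(3, 64, 64), torch.rand(3, 64, 64)
    print(swap_amplitude(a, b).shape)              # torch.Size([3, 64, 64])
```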
[247] TRACES: Temporal Recall with Contextual Embeddings for Real-Time Video Anomaly Detection
Yousuf Ahmed Siddiqui, Sufiyaan Usmani, Umer Tariq, Jawwad Ahmed Shamsi, Muhammad Burhan Khan
Main category: cs.CV
TL;DR: A memory-augmented pipeline for context-aware zero-shot anomaly detection that fuses temporal and appearance features with textual memory traces using cross-attention, achieving state-of-the-art performance on benchmark datasets.
Details
Motivation: Video anomalies depend on contextual information and temporal evolution, but most detectors ignore context, limiting their real-world generalization. The paper addresses context-aware zero-shot anomaly detection to adaptively learn and detect new events in real-time.
Method: Memory-augmented pipeline using cross-attention to correlate temporal signals with visual embeddings, with real-time zero-shot anomaly classification through contextual similarity scoring.
Result: Achieved 90.4% AUC on UCF-Crime and 83.67% AP on XD-Violence, setting new state-of-the-art among zero-shot models. The model enables real-time inference with high precision and explainability.
Conclusion: Fusing cross-attention temporal fusion and contextual memory enables high-fidelity anomaly detection, advancing zero-shot models’ applicability in real-world surveillance and infrastructure monitoring.
Abstract: Video anomalies often depend on the available contextual information and its temporal evolution: a non-anomalous action in one context can be anomalous in another. Most anomaly detectors, however, do not account for this context, which seriously limits their ability to generalize to new, real-life situations. Our work addresses the context-aware zero-shot anomaly detection challenge, in which systems need to learn adaptively to detect new events by correlating temporal and appearance features with textual memory traces in real time. Our approach defines a memory-augmented pipeline, correlating temporal signals with visual embeddings using cross-attention, and performing real-time zero-shot anomaly classification by contextual similarity scoring. We achieve 90.4% AUC on UCF-Crime and 83.67% AP on XD-Violence, a new state-of-the-art among zero-shot models. Our model achieves real-time inference with high precision and explainability for deployment. We show that, by fusing cross-attention temporal fusion and contextual memory, we achieve high-fidelity anomaly detection, a step towards the applicability of zero-shot models in real-world surveillance and infrastructure monitoring.
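A minimal sketch of zero-shot scoring by contextual similarity, assuming precomputed CLIP-style embeddings for the video clip and for textual memory traces of normal and anomalous contexts; the scoring rule and names are illustrative, not the paper's pipeline.

```python
import torch
import torch.nn.functional as F

def contextual_anomaly_score(clip_embed, normal_memory, anomaly_memory, tau=0.07):
    """Compare a fused clip embedding against stored textual memory traces for
    normal vs. anomalous contexts and return an anomaly probability."""
    clip_embed = F.normalize(clip_embed, dim=-1)            # (D,)
    normal_memory = F.normalize(normal_memory, dim=-1)      # (Nn, D)
    anomaly_memory = F.normalize(anomaly_memory, dim=-1)    # (Na, D)

    s_normal = (normal_memory @ clip_embed).max()           # best-matching normal trace
    s_anom = (anomaly_memory @ clip_embed).max()            # best-matching anomalous trace
    # softmax over the two best matches -> probability the clip is anomalous
    return torch.softmax(torch.stack([s_normal, s_anom]) / tau, dim=0)[1]

if __name__ == "__main__":
    score = contextual_anomaly_score(torch.randn(512),
                                     torch.randn(20, 512),
                                     torch.randn(20, 512))
    print(float(score))                                      # value in (0, 1)
```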
[248] CueBench: Advancing Unified Understanding of Context-Aware Video Anomalies in Real-World
Yating Yu, Congqi Cao, Zhaoying Wang, Weihua Meng, Jie Li, Yuxin Li, Zihao Wei, Zhongpei Shen, Jiajun Zhang
Main category: cs.CV
TL;DR: CueBench is a comprehensive benchmark for context-aware video anomaly understanding, introducing hierarchical taxonomy and unified evaluation across multiple tasks, with Cue-R1 model achieving significant improvements over existing methods.
Details
Motivation: Current video anomaly understanding methods have superficial comprehension of real-world anomalies, lacking the ability to distinguish complex principles and subtle contextual differences that define anomalies in realistic scenarios.
Method: Proposed CueBench benchmark with event-centric hierarchical taxonomy covering 14 conditional and 18 absolute anomaly events across 174 scenes and 198 attributes. Developed Cue-R1 model using R1-style reinforcement fine-tuning with verifiable, task-aligned, and hierarchy-refined rewards in unified generative framework.
Result: Extensive evaluation shows existing vision-language models are far from satisfactory real-world anomaly understanding, while Cue-R1 surpasses state-of-the-art approaches by over 24% on average across recognition, temporal grounding, detection, and anticipation tasks.
Conclusion: CueBench provides a rigorous evaluation framework that reveals significant gaps in current models’ understanding of context-aware video anomalies, while the proposed Cue-R1 demonstrates substantial improvements through a unified generative approach with refined rewards.
Abstract: How far are deep models from real-world video anomaly understanding (VAU)? Current works typically emphasize detecting unexpected occurrences that deviate from normal patterns or comprehending anomalous events with interpretable descriptions. However, they exhibit only a superficial comprehension of real-world anomalies, with limited breadth in the complex principles and subtle context that distinguish anomalies from normalities, e.g., climbing cliffs with safety gear vs. without it. To this end, we introduce CueBench, the first benchmark of its kind, devoted to Context-aware video anomalies within a Unified Evaluation framework. We comprehensively establish an event-centric hierarchical taxonomy that anchors two core event types: 14 conditional and 18 absolute anomaly events, defined by their refined semantics from diverse contexts across 174 scenes and 198 attributes. Based on this, we propose to unify and benchmark context-aware VAU with various challenging tasks across recognition, temporal grounding, detection, and anticipation. This also serves as a rigorous and fair probing evaluation suite for generative-discriminative as well as generalized-specialized vision-language models (VLMs). To address the challenges underlying CueBench, we further develop Cue-R1 based on R1-style reinforcement fine-tuning with verifiable, task-aligned, and hierarchy-refined rewards in a unified generative manner. Extensive results on CueBench reveal that existing VLMs are still far from satisfactory real-world anomaly understanding, while our Cue-R1 surpasses these state-of-the-art approaches by over 24% on average.
[249] Grounding Surgical Action Triplets with Instrument Instance Segmentation: A Dataset and Target-Aware Fusion Approach
Oluwatosin Alabi, Meng Wei, Charlie Budd, Tom Vercauteren, Miaojing Shi
Main category: cs.CV
TL;DR: This paper proposes triplet segmentation - a new task that spatially grounds surgical action triplets (<instrument, verb, target>) using instrument instance segmentation, addressing limitations of existing methods that lack spatial precision.
Details
Motivation: Existing surgical action recognition methods are limited to frame-level classification and fail to reliably link actions to specific instrument instances, while previous spatial grounding approaches using class activation maps lack the precision needed for detailed instrument-tissue interaction analysis.
Method: The authors propose TargetFusionNet, a novel architecture that extends Mask2Former with a target-aware fusion mechanism to fuse weak anatomy priors with instrument instance queries for accurate anatomical target prediction. They also introduce CholecTriplet-Seg, a large-scale dataset with over 30,000 annotated frames.
Result: TargetFusionNet consistently improves performance over existing baselines across recognition, detection, and triplet segmentation metrics, demonstrating that strong instance supervision combined with weak target priors significantly enhances surgical action understanding accuracy and robustness.
Conclusion: Triplet segmentation establishes a unified framework for spatially grounding surgical action triplets, and the proposed benchmark and architecture pave the way for more interpretable surgical scene understanding.
Abstract: Understanding surgical instrument-tissue interactions requires not only identifying which instrument performs which action on which anatomical target, but also grounding these interactions spatially within the surgical scene. Existing surgical action triplet recognition methods are limited to learning from frame-level classification, failing to reliably link actions to specific instrument instances. Previous attempts at spatial grounding have primarily relied on class activation maps, which lack the precision and robustness required for detailed instrument-tissue interaction analysis. To address this gap, we propose grounding surgical action triplets with instrument instance segmentation, or triplet segmentation for short, a new unified task which produces spatially grounded <instrument, verb, target> outputs. We start by presenting CholecTriplet-Seg, a large-scale dataset containing over 30,000 annotated frames, linking instrument instance masks with action verb and anatomical target annotations, and establishing the first benchmark for strongly supervised, instance-level triplet grounding and evaluation. To learn triplet segmentation, we propose TargetFusionNet, a novel architecture that extends Mask2Former with a target-aware fusion mechanism to address the challenge of accurate anatomical target prediction by fusing weak anatomy priors with instrument instance queries. Evaluated across recognition, detection, and triplet segmentation metrics, TargetFusionNet consistently improves performance over existing baselines, demonstrating that strong instance supervision combined with weak target priors significantly enhances the accuracy and robustness of surgical action understanding. Triplet segmentation establishes a unified framework for spatially grounding surgical action triplets. The proposed benchmark and architecture pave the way for more interpretable surgical scene understanding.
[250] Benchmarking individual tree segmentation using multispectral airborne laser scanning data: the FGI-EMIT dataset
Lassi Ruoppa, Tarmo Hietala, Verneri Seppänen, Josef Taher, Teemu Hakala, Xiaowei Yu, Antero Kukko, Harri Kaartinen, Juha Hyyppä
Main category: cs.CV
TL;DR: This paper introduces FGI-EMIT, the first large-scale multispectral LiDAR benchmark dataset for individual tree segmentation, and benchmarks both unsupervised and supervised deep learning methods, showing that DL approaches significantly outperform traditional methods, particularly for understory trees.
Details
Motivation: The lack of large-scale multispectral LiDAR benchmark datasets has hindered progress in individual tree segmentation, despite evidence that multispectral reflectance can improve accuracy. There's a need to evaluate both traditional and modern approaches on comprehensive data.
Method: Created FGI-EMIT dataset with 1,561 manually annotated trees captured at 532, 905, and 1,550 nm wavelengths. Benchmarking included four unsupervised algorithms (with Bayesian hyperparameter optimization) and four supervised DL approaches (trained from scratch). Conducted ablation studies on multispectral reflectance and performance analysis across point densities.
Result: Unsupervised Treeiso achieved best F1-score of 52.7%, while DL ForestFormer3D achieved 73.3% F1-score. DL methods significantly outperformed unsupervised approaches, especially for understory trees (25.9 percentage point difference). Current DL approaches generally fail to effectively leverage multispectral reflectance, though single channel reflectance provides marginal improvements for understory trees. DL methods remain superior even at low point densities (10 points/m²).
Conclusion: Deep learning approaches significantly outperform traditional unsupervised methods for individual tree segmentation, particularly for challenging understory trees. However, current DL models are not effectively utilizing multispectral reflectance information, indicating an area for future improvement in multispectral LiDAR data processing.
Abstract: Individual tree segmentation (ITS) from LiDAR point clouds is fundamental for applications such as forest inventory, carbon monitoring and biodiversity assessment. Traditionally, ITS has been achieved with unsupervised geometry-based algorithms, while more recent advances have shifted toward supervised deep learning (DL). In the past, progress in method development was hindered by the lack of large-scale benchmark datasets, and the availability of novel data formats, particularly multispectral (MS) LiDAR, remains limited to this day, despite evidence that MS reflectance can improve the accuracy of ITS. This study introduces FGI-EMIT, the first large-scale MS airborne laser scanning benchmark dataset for ITS. Captured at wavelengths 532, 905, and 1,550 nm, the dataset consists of 1,561 manually annotated trees, with a particular focus on small understory trees. Using FGI-EMIT, we comprehensively benchmarked four conventional unsupervised algorithms and four supervised DL approaches. Hyperparameters of unsupervised methods were optimized using a Bayesian approach, while DL models were trained from scratch. Among the unsupervised methods, Treeiso achieved the highest test set F1-score of 52.7%. The DL approaches performed significantly better overall, with the best model, ForestFormer3D, attaining an F1-score of 73.3%. The most significant difference was observed in understory trees, where ForestFormer3D exceeded Treeiso by 25.9 percentage points. An ablation study demonstrated that current DL-based approaches generally fail to leverage MS reflectance information when it is provided as additional input features, although single channel reflectance can improve accuracy marginally, especially for understory trees. A performance analysis across point densities further showed that DL methods consistently remain superior to unsupervised algorithms, even at densities as low as 10 points/m$^2$.
[251] Metadata-Aligned 3D MRI Representations for Contrast Understanding and Quality Control
Mehmet Yigit Avci, Pedro Borges, Virginia Fernandez, Paul Wright, Mehmet Yigitsoy, Sebastien Ourselin, Jorge Cardoso
Main category: cs.CV
TL;DR: MR-CLIP is a framework that learns unified MRI contrast representations by aligning volumetric images with DICOM acquisition parameters, enabling label-efficient analysis across diverse clinical datasets.
Details
Motivation: MRI suffers from data heterogeneity and lack of standardized contrast labels across scanners, protocols, and institutions, limiting large-scale automated analysis. A unified representation would enable automatic sequence recognition, harmonization, and quality control without manual annotations.
Method: MR-CLIP uses a metadata-guided framework that aligns volumetric images with their DICOM acquisition parameters to learn MRI contrast representations.
Result: The embeddings show distinct clusters of MRI sequences and outperform supervised 3D baselines in few-shot sequence classification under data scarcity. It also enables unsupervised data quality control by identifying corrupted metadata through image-metadata embedding distances.
Conclusion: By transforming routinely available acquisition metadata into a supervisory signal, MR-CLIP provides a scalable foundation for label-efficient MRI analysis across diverse clinical datasets.
Abstract: Magnetic Resonance Imaging suffers from substantial data heterogeneity and the absence of standardized contrast labels across scanners, protocols, and institutions, which severely limits large-scale automated analysis. A unified representation of MRI contrast would enable a wide range of downstream utilities, from automatic sequence recognition to harmonization and quality control, without relying on manual annotations. To this end, we introduce MR-CLIP, a metadata-guided framework that learns MRI contrast representations by aligning volumetric images with their DICOM acquisition parameters. The resulting embeddings show distinct clusters of MRI sequences and outperform supervised 3D baselines under data scarcity in few-shot sequence classification. Moreover, MR-CLIP enables unsupervised data quality control by identifying corrupted or inconsistent metadata through image-metadata embedding distances. By transforming routinely available acquisition metadata into a supervisory signal, MR-CLIP provides a scalable foundation for label-efficient MRI analysis across diverse clinical datasets.
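Aligning image embeddings with metadata embeddings of this kind typically uses a symmetric CLIP-style contrastive loss. A minimal sketch under that assumption (standard InfoNCE, not necessarily the exact objective used in the paper):

```python
import torch
import torch.nn.functional as F

def clip_alignment_loss(img_emb, meta_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning volume embeddings with embeddings of their
    DICOM acquisition metadata (matched pairs share the same batch index)."""
    img_emb = F.normalize(img_emb, dim=-1)
    meta_emb = F.normalize(meta_emb, dim=-1)
    logits = img_emb @ meta_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))                 # positives on the diagonal
    loss_i2m = F.cross_entropy(logits, targets)
    loss_m2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2m + loss_m2i)

if __name__ == "__main__":
    print(float(clip_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))))
```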
[252] Outlier-Aware Post-Training Quantization for Image Super-Resolution
Hailing Wang, jianglin Lu, Yitian Zhang, Yun Fu
Main category: cs.CV
TL;DR: A dual-region quantization strategy for image super-resolution networks that handles activation outliers by partitioning them into outlier and dense regions, with sensitivity-aware finetuning to address layer-specific quantization sensitivity.
Details
Motivation: Existing post-training quantization methods for SR networks fail due to overlooking activation outliers, which are correlated with image color information and cause significant performance degradation when removed.
Method: Proposes dual-region quantization that partitions activations into outlier and dense regions with independent uniform quantization, plus sensitivity-aware finetuning that focuses more on highly sensitive layers.
Result: Outperforms existing PTQ approaches across various SR networks and datasets, achieving performance comparable to QAT methods in most scenarios with at least 75x speedup.
Conclusion: The proposed dual-region quantization with sensitivity-aware finetuning effectively addresses activation outliers and layer sensitivity, making PTQ viable for SR networks with performance close to QAT but much faster.
Abstract: Quantization techniques, including quantization-aware training (QAT) and post-training quantization (PTQ), have become essential for inference acceleration of image super-resolution (SR) networks. Compared to QAT, PTQ has garnered significant attention as it eliminates the need for ground truth and model retraining. However, existing PTQ methods for SR often fail to achieve satisfactory performance as they overlook the impact of outliers in activation. Our empirical analysis reveals that these prevalent activation outliers are strongly correlated with image color information, and directly removing them leads to significant performance degradation. Motivated by this, we propose a dual-region quantization strategy that partitions activations into an outlier region and a dense region, applying uniform quantization to each region independently to better balance bit-width allocation. Furthermore, we observe that different network layers exhibit varying sensitivities to quantization, leading to different levels of performance degradation. To address this, we introduce sensitivity-aware finetuning that encourages the model to focus more on highly sensitive layers, further enhancing quantization performance. Extensive experiments demonstrate that our method outperforms existing PTQ approaches across various SR networks and datasets, while achieving performance comparable to QAT methods in most scenarios with at least a 75× speedup.
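A toy sketch of the dual-region idea: split activations by a magnitude threshold and quantize the dense and outlier regions with independent uniform quantizers. The percentile threshold, bit handling, and function names are illustrative assumptions, not the paper's calibration procedure.

```python
import torch

def dual_region_quantize(x, bits=4, outlier_pct=0.01):
    """Quantize the dense region and the outlier region of an activation tensor
    separately with uniform quantizers, then recombine."""
    k = max(1, int(x.numel() * outlier_pct))
    thresh = x.abs().flatten().topk(k).values.min()          # magnitude cut-off
    outlier_mask = x.abs() >= thresh

    def uniform_q(v, n_bits):
        if v.numel() == 0:
            return v
        lo, hi = v.min(), v.max()
        scale = (hi - lo).clamp(min=1e-8) / (2 ** n_bits - 1)
        return torch.round((v - lo) / scale) * scale + lo

    out = x.clone()
    out[~outlier_mask] = uniform_q(x[~outlier_mask], bits)   # dense region
    out[outlier_mask] = uniform_q(x[outlier_mask], bits)     # outlier region, own scale
    return out

if __name__ == "__main__":
    act = torch.randn(1, 64, 32, 32) * 5
    q = dual_region_quantize(act)
    print(float((act - q).abs().mean()))                     # mean quantization error
```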
[253] Evolve to Inspire: Novelty Search for Diverse Image Generation
Alex Inch, Passawis Chaiyapattanaporn, Yuchen Zhu, Yuan Lu, Ting-Wen Ko, Davide Paglieri
Main category: cs.CV
TL;DR: WANDER is a novelty search-based approach that generates diverse image sets from single prompts using LLM semantic evolution and CLIP embeddings for novelty quantification, outperforming existing methods in diversity.
Details
Motivation: Text-to-image diffusion models suffer from limited output diversity, hindering their use in creative exploration tasks, while existing prompt optimization techniques are either focused on aesthetics or unsuitable for visual creativity.
Method: Uses novelty search with LLM for semantic evolution of prompts, CLIP embeddings to quantify novelty, and emitters to guide search into distinct regions of prompt space.
Result: WANDER significantly outperforms existing evolutionary prompt optimization baselines in diversity metrics, with ablation studies confirming the efficacy of emitters.
Conclusion: The approach successfully addresses the diversity limitation in text-to-image generation, making it more suitable for creative and exploratory applications.
Abstract: Text-to-image diffusion models, while proficient at generating high-fidelity images, often suffer from limited output diversity, hindering their application in exploratory and ideation tasks. Existing prompt optimization techniques typically target aesthetic fitness or are ill-suited to the creative visual domain. To address this shortcoming, we introduce WANDER, a novelty search-based approach to generating diverse sets of images from a single input prompt. WANDER operates directly on natural language prompts, employing a Large Language Model (LLM) for semantic evolution of diverse sets of images, and using CLIP embeddings to quantify novelty. We additionally apply emitters to guide the search into distinct regions of the prompt space, and demonstrate that they boost the diversity of the generated images. Empirical evaluations using FLUX-DEV for generation and GPT-4o-mini for mutation demonstrate that WANDER significantly outperforms existing evolutionary prompt optimization baselines in diversity metrics. Ablation studies confirm the efficacy of emitters.
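Novelty in novelty search is commonly measured as the distance to the nearest neighbours in an archive of previous solutions. A minimal sketch using CLIP-style embeddings and cosine distance; the exact metric and k are assumptions for illustration.

```python
import numpy as np

def novelty_score(candidate, archive, k=5):
    """Novelty of a candidate image = mean cosine distance to its k nearest
    neighbours in an archive of embeddings of previously generated images."""
    candidate = candidate / np.linalg.norm(candidate)
    archive = archive / np.linalg.norm(archive, axis=1, keepdims=True)
    dists = 1.0 - archive @ candidate                        # cosine distances
    k = min(k, len(dists))
    return float(np.sort(dists)[:k].mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(novelty_score(rng.normal(size=512), rng.normal(size=(50, 512))))
```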
[254] Toward Better Optimization of Low-Dose CT Enhancement: A Critical Analysis of Loss Functions and Image Quality Assessment Metrics
Taifour Yousra, Beghdadi Azeddine, Marie Luong, Zuheng Ming
Main category: cs.CV
TL;DR: Analysis of loss functions for low-dose CT image enhancement reveals inconsistencies between loss functions and image quality metrics, highlighting the need to consider perceptual quality metrics when developing new loss functions.
Details
Motivation: Current deep learning models for low-dose CT enhancement achieve good PSNR/SSIM scores but these metrics don't adequately reflect perceptual quality, especially for medical images where diagnostic accuracy is critical.
Method: Conducted objective analysis of different loss functions used in DL-based LDCT enhancement models and their consistency with image quality metrics.
Result: Found inconsistencies between loss functions and quality metrics, showing that current loss functions don’t align well with perceptual image quality.
Conclusion: When developing new loss functions for image quality enhancement, it’s crucial to consider image quality metrics to ensure better perceptual quality and diagnostic accuracy.
Abstract: Low-dose CT (LDCT) imaging is widely used to reduce radiation exposure and mitigate its side effects, but it often suffers from noise and artifacts that affect diagnostic accuracy. To tackle this issue, deep learning models have been developed to enhance LDCT images. Various loss functions have been employed, including classical approaches such as Mean Square Error and adversarial losses, as well as customized loss functions (LFs) designed for specific architectures. Although these models achieve remarkable performance in terms of PSNR and SSIM, these metrics are limited in their ability to reflect perceptual quality, especially for medical images. In this paper, we focus on one of the most critical elements of DL-based architectures, namely the loss function. We conduct an objective analysis of the relevance of different loss functions for LDCT image quality enhancement and their consistency with image quality metrics. Our findings reveal inconsistencies between LFs and quality metrics, and highlight the need to consider image quality metrics when developing new loss functions for image quality enhancement.
[255] Validating Deep Models for Alzheimer’s 18F-FDG PET Diagnosis Across Populations: A Study with Latin American Data
Hugo Massaroli, Hernan Chaves, Pilar Anania, Mauricio Farez, Emmanuel Iarussi, Viviana Siless
Main category: cs.CV
TL;DR: Deep learning models trained on North American Alzheimer’s data (ADNI) show poor generalization to Latin American populations (FLENI), with performance dropping from AUC 0.96-0.97 to 0.80-0.82, revealing significant domain shift issues.
Details
Motivation: To evaluate the generalization of Alzheimer's diagnostic AI models from North American cohorts to underrepresented Latin American populations, addressing the gap in validation across diverse populations.
Method: Benchmarked convolutional and Transformer-based models on ADNI dataset and tested generalization on FLENI Latin American cohort. Conducted ablation studies on normalization and sampling, plus occlusion sensitivity analysis.
Result: All models showed high AUCs on ADNI (0.96-0.97) but substantial performance drops on FLENI (0.80-0.82). Transformers showed no clear advantage over CNNs. Per-image normalization and correct sampling were key for generalization.
Conclusion: Population-aware validation is crucial for diagnostic AI models. Domain adaptation and cohort diversification are needed to ensure generalizability across different populations.
Abstract: Deep learning models have shown strong performance in diagnosing Alzheimer’s disease (AD) using neuroimaging data, particularly 18F-FDG PET scans, with training datasets largely composed of North American cohorts such as those in the Alzheimer’s Disease Neuroimaging Initiative (ADNI). However, their generalization to underrepresented populations remains underexplored. In this study, we benchmark convolutional and Transformer-based models on the ADNI dataset and assess their generalization performance on a novel Latin American clinical cohort from the FLENI Institute in Buenos Aires, Argentina. We show that while all models achieve high AUCs on ADNI (up to .96, .97), their performance drops substantially on FLENI (down to .82, .80, respectively), revealing a significant domain shift. The tested architectures demonstrated similar performance, calling into question the supposed advantages of transformers for this specific task. Through ablation studies, we identify per-image normalization and a correct sampling selection as key factors for generalization. Occlusion sensitivity analysis further reveals that models trained on ADNI, generally attend to canonical hypometabolic regions for the AD class, but focus becomes unclear for the other classes and for FLENI scans. These findings highlight the need for population-aware validation of diagnostic AI models and motivate future work on domain adaptation and cohort diversification.
[256] Towards classification-based representation learning for place recognition on LiDAR scans
Dmitrii Khizbullin, Maksim Konoplia
Main category: cs.CV
TL;DR: The paper proposes framing place recognition as multi-class classification instead of contrastive learning, using discrete location labels for LiDAR scans and achieving competitive performance on NuScenes dataset.
Details
Motivation: Most existing place recognition methods rely on contrastive learning, so the authors explore an alternative classification-based approach to potentially improve training efficiency and stability.
Method: Assign discrete location labels to LiDAR scans and train an encoder-decoder model to directly classify each scan's position as a multi-class classification problem.
Result: The method achieves competitive performance compared to contrastive learning-based methods on the NuScenes dataset.
Conclusion: The classification-based approach for place recognition is viable and offers advantages in training efficiency and stability while maintaining competitive performance.
Abstract: Place recognition is a crucial task in autonomous driving, allowing vehicles to determine their position using sensor data. While most existing methods rely on contrastive learning, we explore an alternative approach by framing place recognition as a multi-class classification problem. Our method assigns discrete location labels to LiDAR scans and trains an encoder-decoder model to classify each scan’s position directly. We evaluate this approach on the NuScenes dataset and show that it achieves competitive performance compared to contrastive learning-based methods while offering advantages in training efficiency and stability.
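Framing place recognition as classification amounts to discretizing continuous positions into cell labels. A toy sketch of such a labeling scheme; the grid size and coordinate handling are illustrative, not the paper's exact scheme.

```python
def position_to_class(xy, origin, cell_size, grid_cols):
    """Map a continuous (x, y) position to a discrete grid-cell label so place
    recognition can be trained as multi-class classification."""
    col = int((xy[0] - origin[0]) // cell_size)
    row = int((xy[1] - origin[1]) // cell_size)
    return row * grid_cols + col

if __name__ == "__main__":
    # A 100 m x 100 m area split into 10 m cells -> 100 classes.
    label = position_to_class((37.2, 81.5), origin=(0.0, 0.0),
                              cell_size=10.0, grid_cols=10)
    print(label)                                             # 83 (row 8, col 3)
```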
[257] Erasing ‘Ugly’ from the Internet: Propagation of the Beauty Myth in Text-Image Models
Tanvi Dinkar, Aiqi Jiang, Gavin Abercrombie, Ioannis Konstas
Main category: cs.CV
TL;DR: This study investigates how generative AI models encode Western beauty standards and erase ‘ugliness’, finding significant demographic biases including light skin tone preference (86.5%), younger age depiction (74%), and hypersexualization of non-binary individuals.
Details
Motivation: Social media exacerbates Western beauty norms causing negative self-image and body dysmorphia, with concerns that AI-generated content may further exaggerate these harmful standards.
Method: Created two image generation pipelines using text-to-image and text-to-language-to-image models, developed a structured beauty taxonomy, generated 5984 images using three language models and two text-to-image models, and conducted a Likert-scale study with women and non-binary social media users evaluating 1200 images.
Result: 86.5% of images depicted lighter skin tones, 22% contained explicit content despite SFW training, 74% showed younger age demographics, non-binary individuals were rated as younger and more hypersexualized, and ‘ugly’ traits consistently produced higher NSFW ratings regardless of gender.
Conclusion: Generative AI models contain pervasive demographic biases related to beauty standards that are actively perpetuated by developers through practices like negative prompting, leading to pollution of data streams and erasure of features outside stereotypical beauty norms.
Abstract: Social media has exacerbated the promotion of Western beauty norms, leading to negative self-image, particularly in women and girls, and causing harm such as body dysmorphia. Increasingly content on the internet has been artificially generated, leading to concerns that these norms are being exaggerated. The aim of this work is to study how generative AI models may encode ‘beauty’ and erase ‘ugliness’, and discuss the implications of this for society. To investigate these aims, we create two image generation pipelines: a text-to-image model and a text-to-language model-to image model. We develop a structured beauty taxonomy which we use to prompt three language models (LMs) and two text-to-image models to cumulatively generate 5984 images using our two pipelines. We then recruit women and non-binary social media users to evaluate 1200 of the images through a Likert-scale within-subjects study. Participants show high agreement in their ratings. Our results show that 86.5% of generated images depicted people with lighter skin tones, 22% contained explicit content despite Safe for Work (SFW) training, and 74% were rated as being in a younger age demographic. In particular, the images of non-binary individuals were rated as both younger and more hypersexualised, indicating troubling intersectional effects. Notably, prompts encoded with ’negative’ or ‘ugly’ beauty traits (such as “a wide nose”) consistently produced higher Not SFW (NSFW) ratings regardless of gender. This work sheds light on the pervasive demographic biases related to beauty standards present in generative AI models – biases that are actively perpetuated by model developers, such as via negative prompting. We conclude by discussing the implications of this on society, which include pollution of the data streams and active erasure of features that do not fall inside the stereotype of what is considered beautiful by developers.
[258] A Hybrid YOLOv5-SSD IoT-Based Animal Detection System for Durian Plantation Protection
Anis Suttan Shahrir, Zakiah Ayop, Syarulnaziah Anawar, Norulzahrah Mohd Zainudin
Main category: cs.CV
TL;DR: An IoT-based animal detection system for durian plantations that combines YOLOv5 and SSD algorithms for improved accuracy, provides real-time monitoring with Telegram notifications, and triggers automated deterrent sounds when animals are detected.
Details
Motivation: Durian plantations face significant crop damage and financial losses from animal intrusions, and traditional farming practices are ineffective due to lack of continuous monitoring without human intervention.
Method: The system integrates YOLOv5 and SSD object detection algorithms, provides real-time monitoring with Telegram notifications for rapid farmer response, and includes an automated sound deterrent mechanism (e.g., tiger roar) triggered upon animal detection.
Result: The YOLO+SSD model achieved accuracy rates of 90% for elephants, 85% for boars, and 70% for monkeys. Detection accuracy was highest during daytime and decreased at night, regardless of whether the input was still images or video.
Conclusion: This study presents a comprehensive framework combining detection, notification, and deterrence mechanisms, providing a practical solution for automated farming that paves the way for future innovations in agricultural technology.
Abstract: Durian plantations suffer from animal intrusions that cause crop damage and financial loss. Traditional farming practices prove ineffective because they lack continuous monitoring without human intervention. The fast growth of machine learning and Internet of Things (IoT) technology has led to new ways to detect animals. However, current systems are limited by dependence on single object detection algorithms, less accessible notification platforms, and limited deterrent mechanisms. This research proposes an IoT-enabled animal detection system for durian crops. The system integrates YOLOv5 and SSD object detection algorithms to improve detection accuracy. The system provides real-time monitoring, with detected intrusions automatically reported to farmers via Telegram notifications for rapid response. An automated sound mechanism (e.g., tiger roar) is triggered once an animal is detected. The YOLO+SSD model achieved accuracy rates of 90%, 85%, and 70% for elephants, boars, and monkeys, respectively. The system shows the highest accuracy in daytime and decreases at night, regardless of whether the input is a still image or a video. Overall, this study contributes a comprehensive and practical framework that combines detection, notification, and deterrence, paving the way for future innovations in automated farming solutions.
[259] Class-agnostic 3D Segmentation by Granularity-Consistent Automatic 2D Mask Tracking
Juan Wang, Yasutomo Kawanishi, Tomo Miyazaki, Zhijie Wang, Shinichiro Omachi
Main category: cs.CV
TL;DR: A method for 3D instance segmentation that uses granularity-consistent 2D mask tracking and curriculum learning to generate accurate 3D pseudo labels from fragmented 2D annotations.
Details
Motivation: Existing methods for 3D instance segmentation rely on transferring 2D masks from foundation models, but process video frames independently, leading to inconsistent segmentation granularity and conflicting 3D pseudo labels that degrade accuracy.
Method: Introduces Granularity-Consistent automatic 2D Mask Tracking to maintain temporal correspondences across frames, combined with a three-stage curriculum learning framework that progressively trains from fragmented single-view data to unified multi-view annotations and globally coherent full-scene supervision.
Result: The method effectively generated consistent and accurate 3D segmentations, achieving state-of-the-art results on standard benchmarks and demonstrating open-vocabulary ability.
Conclusion: The proposed approach enables robust distillation of consistent 3D representations from initially fragmented and contradictory 2D priors through structured progressive learning.
Abstract: 3D instance segmentation is an important task for real-world applications. To avoid costly manual annotations, existing methods have explored generating pseudo labels by transferring 2D masks from foundation models to 3D. However, this approach is often suboptimal since the video frames are processed independently. This causes inconsistent segmentation granularity and conflicting 3D pseudo labels, which degrades the accuracy of the final segmentation. To address this, we introduce a Granularity-Consistent automatic 2D Mask Tracking approach that maintains temporal correspondences across frames, eliminating conflicting pseudo labels. Combined with a three-stage curriculum learning framework, our approach progressively trains from fragmented single-view data to unified multi-view annotations, and ultimately to globally coherent full-scene supervision. This structured learning pipeline progressively exposes the model to pseudo-labels of increasing consistency. Thus, we can robustly distill a consistent 3D representation from initially fragmented and contradictory 2D priors. Experimental results demonstrated that our method effectively generated consistent and accurate 3D segmentations. Furthermore, the proposed method achieved state-of-the-art results on standard benchmarks and demonstrated open-vocabulary ability.
[260] FedOnco-Bench: A Reproducible Benchmark for Privacy-Aware Federated Tumor Segmentation with Synthetic CT Data
Viswa Chaitanya Marella, Suhasnadh Reddy Veluru, Sai Teja Erukude
Main category: cs.CV
TL;DR: FedOnco-Bench is a reproducible benchmark for privacy-aware federated learning using synthetic oncologic CT scans, evaluating segmentation performance and privacy leakage across FL methods.
Details
Motivation: FL systems remain vulnerable to membership-inference attacks and data heterogeneity, especially in privacy-sensitive medical environments.
Method: Evaluates segmentation performance and privacy leakage across FL methods: FedAvg, FedProx, FedBN, and FedAvg with DP-SGD using synthetic oncologic CT scans with tumor annotations.
Result: FedAvg shows high performance (Dice ~0.85) with more privacy leakage (attack AUC ~0.72), while DP-SGD provides higher privacy (AUC ~0.25) at the cost of accuracy (Dice ~0.79). FedProx and FedBN offer balanced performance under heterogeneous data.
Conclusion: FedOnco-Bench serves as a standardized, open-source platform for benchmarking and developing privacy-preserving FL methods for medical image segmentation.
Abstract: Federated Learning (FL) allows multiple institutions to cooperatively train machine learning models while retaining sensitive data at the source, which has great utility in privacy-sensitive environments. However, FL systems remain vulnerable to membership-inference attacks and data heterogeneity. This paper presents FedOnco-Bench, a reproducible benchmark for privacy-aware FL using synthetic oncologic CT scans with tumor annotations. It evaluates segmentation performance and privacy leakage across FL methods: FedAvg, FedProx, FedBN, and FedAvg with DP-SGD. Results show a distinct trade-off between privacy and utility: FedAvg achieves high performance (Dice around 0.85) with more privacy leakage (attack AUC about 0.72), while DP-SGD provides a higher level of privacy (AUC around 0.25) at the cost of accuracy (Dice about 0.79). FedProx and FedBN offer balanced performance under heterogeneous data, especially with non-identically distributed client data. FedOnco-Bench serves as a standardized, open-source platform for benchmarking and developing privacy-preserving FL methods for medical image segmentation.
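For reference, the FedAvg aggregation evaluated in this benchmark is a data-size-weighted average of client weights. A minimal sketch of standard FedAvg (not benchmark-specific code):

```python
import copy
import torch

def fedavg(client_states, client_sizes):
    """FedAvg: aggregate client model state_dicts by a data-size-weighted average."""
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            state[key].float() * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return global_state

if __name__ == "__main__":
    model = torch.nn.Linear(4, 2)
    clients = [copy.deepcopy(model).state_dict() for _ in range(3)]
    agg = fedavg(clients, client_sizes=[100, 50, 25])
    model.load_state_dict(agg)                               # new global model
```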
[261] GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding
Shijie Zhou, Viet Dac Lai, Hao Tan, Jihyung Kil, Wanrong Zhu, Changyou Chen, Ruiyi Zhang
Main category: cs.CV
TL;DR: GUI-AIMA is an attention-based, coordinate-free framework for GUI grounding that aligns MLLMs’ intrinsic multimodal attention with patch-wise grounding signals, achieving state-of-the-art performance with exceptional data efficiency.
Details
Motivation: Existing MLLM-based GUI grounding approaches directly generate precise coordinates from visual inputs, which is challenging and computationally intensive. The authors observed that general MLLMs have native grounding capability nested within their attentions.
Method: Proposes GUI-AIMA framework that aligns MLLMs' multimodal attention with patch-wise grounding signals calculated adaptively via multi-head aggregation on simplified query-visual attention matrices. Uses coordinate-free approach that easily integrates plug-and-play zoom-in stage.
Result: GUI-AIMA-3B trained with only 85k screenshots achieves state-of-the-art performance among 3B models: 58.6% average accuracy on ScreenSpot-Pro and 62.2% on OSWorld-G, demonstrating exceptional data efficiency.
Conclusion: Light training can effectively trigger the native grounding capability of MLLMs, and the attention-based coordinate-free approach provides an efficient solution for GUI grounding tasks.
Abstract: Graphical user interface (GUI) grounding is a key function of computer-use agents, which maps natural-language instructions to actionable screen regions. Existing approaches based on Multimodal Large Language Models (MLLMs) typically formulate it as a text-based coordinate generation task, yet directly generating precise coordinates from visual inputs remains challenging and computationally intensive. An intuitive way to implement GUI grounding is to first select visual patches relevant to the instructions and then determine the precise click location within those patches. Based on the observations that general MLLMs have some native grounding capability, nested within their attentions, we propose GUI-AIMA, an attention-based and coordinate-free supervised fine-tuning framework for efficient GUI grounding. GUI-AIMA aligns the intrinsic multimodal attention of MLLMs with patch-wise grounding signals. These signals are calculated adaptively for diverse user instructions by multi-head aggregation on simplified query-visual attention matrices. Besides, its coordinate-free manner can easily integrate a plug-and-play zoom-in stage. GUI-AIMA-3B was trained with only 85k screenshots, demonstrating exceptional data efficiency and verifying that light training can trigger the native grounding capability of MLLMs. It achieves state-of-the-art performance among 3B models, attaining an average accuracy of 58.6% on ScreenSpot-Pro and 62.2% on OSWorld-G. Project page: https://github.com/sjz5202/GUI-AIMA
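A rough sketch of coordinate-free grounding from attention: aggregate the instruction query's attention over visual patches across heads and return the centre of the top-scoring patch as the click point. The simple mean aggregation here stands in for the paper's adaptive multi-head aggregation and is only an illustration.

```python
import torch

def attention_to_click(attn, grid_hw, image_hw):
    """Aggregate multi-head query-to-patch attention and return the centre of the
    highest-scoring patch as a click point (coordinate-free grounding)."""
    # attn: (num_heads, num_patches) attention of the instruction query over patches
    patch_scores = attn.mean(dim=0)                          # naive head aggregation
    idx = int(patch_scores.argmax())
    H, W = grid_hw
    img_h, img_w = image_hw
    row, col = divmod(idx, W)
    cx = (col + 0.5) * img_w / W                             # patch centre in pixels
    cy = (row + 0.5) * img_h / H
    return cx, cy

if __name__ == "__main__":
    attn = torch.rand(16, 24 * 24).softmax(dim=-1)           # 16 heads, 24x24 patch grid
    print(attention_to_click(attn, grid_hw=(24, 24), image_hw=(1080, 1920)))
```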
[262] TA-LSDiff:Topology-Aware Diffusion Guided by a Level Set Energy for Pancreas Segmentation
Yue Gou, Fanghui Song, Yuming Xing, Shengzhu Shi, Zhichang Guo, Boying Wu
Main category: cs.CV
TL;DR: TA-LSDiff is a novel pancreas segmentation model that combines topology-aware diffusion probabilistic models with level set energy, achieving state-of-the-art accuracy without explicit geometric evolution.
Details
Motivation: Pancreas segmentation is challenging due to small size, low contrast, and topological variations. Traditional level set methods ignore topological effects, while deep learning networks sacrifice structural details.
Method: Combines topology-aware diffusion probabilistic model with level set energy that integrates input image and deep features through four complementary terms. Includes pixel-adaptive refinement module for boundary precision using affinity weighting.
Result: Achieves state-of-the-art accuracy on four public pancreas datasets, outperforming existing methods. Ablation studies confirm the contribution of each component.
Conclusion: TA-LSDiff establishes a practical and accurate solution for pancreas segmentation, bridging the gap between traditional level set methods and deep learning approaches.
Abstract: Pancreas segmentation in medical image processing is a persistent challenge due to its small size, low contrast against adjacent tissues, and significant topological variations. Traditional level set methods drive boundary evolution using gradient flows, often ignoring pointwise topological effects. Conversely, deep learning-based segmentation networks extract rich semantic features but frequently sacrifice structural details. To bridge this gap, we propose a novel model named TA-LSDiff, which combines a topology-aware diffusion probabilistic model with a level set energy, achieving segmentation without explicit geometric evolution. This energy function guides implicit curve evolution by integrating the input image and deep features through four complementary terms. To further enhance boundary precision, we introduce a pixel-adaptive refinement module that locally modulates the energy function using affinity weighting from neighboring evidence. Ablation studies systematically quantify the contribution of each proposed component. Evaluations on four public pancreas datasets demonstrate that TA-LSDiff achieves state-of-the-art accuracy, outperforming existing methods. These results establish TA-LSDiff as a practical and accurate solution for pancreas segmentation.
[263] OMEGA: Optimized Multimodal Position Encoding Index Derivation with Global Adaptive Scaling for Vision-Language Models
Ruoxiang Huang, Xindian Ma, Rundong Kong, Zhen Yuan, Peng Zhang
Main category: cs.CV
TL;DR: OMEGA is a novel position encoding framework for Vision-Language Models that uses modality-specific position encoding and adaptive step scaling to better handle the distinct structural properties of text and vision modalities.
Details
Motivation: Current VLMs use unified positional indexing that treats text and visual tokens uniformly, ignoring their distinct structural properties - sequential continuity for text and spatial coherence for vision.
Method: OMEGA employs Modality-Specific Position Encoding (MSPE) to assign positional indices while preserving inherent modality structures, and Global Adaptive Encoding Step Scaling (GAESS) to adaptively adjust visual token position encoding step size based on embedding entropy.
Result: OMEGA consistently enhances VLM performance across diverse architectures and VQA benchmarks, achieving up to 3.43% improvement on visual-intensive tasks with Qwen2.5-VL-3B, with consistent gains on larger models including Qwen2.5-VL-7B and LLaVA-v1.5-7B.
Conclusion: OMEGA’s modality-specific approach to position encoding effectively addresses the limitations of unified positional indexing in VLMs, leading to significant performance improvements across various model architectures and tasks.
Abstract: Vision-Language Models (VLMs) have demonstrated strong performance across various multimodal tasks, where position encoding plays a vital role in modeling both the sequential structure of textual information and the spatial structure of visual information. However, current VLMs commonly adopt modality-unified 1D or 2D positional indexing strategies, which treat textual and visual tokens uniformly without accounting for their distinct structural properties and sequential continuity for text and spatial coherence for vision. To address this limitation, we propose OMEGA, a novel position encoding framework that employs Modality-Specific Position Encoding (MSPE) to assign positional indices while preserving the inherent structures of each modality across separate coordinate dimensions. Additionally, to align the information density of multimodal data in the positional index space, OMEGA introduces Global Adaptive Encoding Step Scaling (GAESS), which adaptively adjusts the position encoding step size of visual tokens based on the embedding entropy of both modalities. Experimental results demonstrate that OMEGA consistently enhances VLM performance across diverse architectures and VQA benchmarks. On visual-intensive tasks, OMEGA achieves up to 3.43% improvement over baseline position encoding strategies on Qwen2.5-VL-3B, with consistent gains observed across larger models including Qwen2.5-VL-7B and LLaVA-v1.5-7B.
[264] Enhancing Adversarial Transferability in Visual-Language Pre-training Models via Local Shuffle and Sample-based Attack
Xin Liu, Aoyang Zhou
Main category: cs.CV
TL;DR: LSSA is a novel attack method that enhances adversarial transferability in VLP models by randomly shuffling local image blocks and sampling around adversarial images to generate more diverse adversarial texts.
Details
Motivation: Existing multimodal adversarial attacks suffer from overfitting due to lack of input diversity when crafting cross-modal attacks, limiting their transferability across different VLP models.
Method: LSSA randomly shuffles local image blocks to expand image-text pairs, generates adversarial images, samples around them, and uses both original and sampled images to create adversarial texts.
Result: Extensive experiments show LSSA significantly improves adversarial transferability across diverse VLP models and downstream tasks, outperforming other advanced attacks on Large Vision-Language Models.
Conclusion: LSSA effectively addresses overfitting in multimodal adversarial attacks by increasing input diversity through local shuffling and sampling, achieving superior transferability performance.
Abstract: Visual-Language Pre-training (VLP) models have achieved significant performance across various downstream tasks. However, they remain vulnerable to adversarial examples. While prior efforts focus on improving the adversarial transferability of multimodal adversarial examples through cross-modal interactions, these approaches suffer from overfitting issues, due to a lack of input diversity by relying excessively on information from adversarial examples in one modality when crafting attacks in another. To address this issue, we draw inspiration from strategies in some adversarial training methods and propose a novel attack called Local Shuffle and Sample-based Attack (LSSA). LSSA randomly shuffles one of the local image blocks, thus expanding the original image-text pairs, generating adversarial images, and sampling around them. Then, it utilizes both the original and sampled images to generate the adversarial texts. Extensive experiments on multiple models and datasets demonstrate that LSSA significantly enhances the transferability of multimodal adversarial examples across diverse VLP models and downstream tasks. Moreover, LSSA outperforms other advanced attacks on Large Vision-Language Models.
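The local-shuffle step can be sketched as swapping two randomly chosen blocks of the image to create an additional image-text pair; the 2x2 grid and swap rule below are illustrative assumptions, not the authors' exact augmentation.

```python
import torch

def local_block_shuffle(img, grid=2):
    """Split an image into grid x grid blocks and swap two randomly chosen blocks,
    a simple local-shuffle augmentation to diversify inputs for attack crafting."""
    c, h, w = img.shape
    bh, bw = h // grid, w // grid
    blocks = [img[:, r * bh:(r + 1) * bh, col * bw:(col + 1) * bw].clone()
              for r in range(grid) for col in range(grid)]
    i, j = torch.randperm(len(blocks))[:2].tolist()          # pick two blocks to swap
    blocks[i], blocks[j] = blocks[j], blocks[i]
    out = img.clone()
    for k, blk in enumerate(blocks):
        r, col = divmod(k, grid)
        out[:, r * bh:(r + 1) * bh, col * bw:(col + 1) * bw] = blk
    return out

if __name__ == "__main__":
    print(local_block_shuffle(torch.rand(3, 224, 224)).shape)
```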
[265] Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials
Yifan Pu, Jixuan Ying, Qixiu Li, Tianzhu Ye, Dongchen Han, Xiaochen Wang, Ziyi Wang, Xinyu Shao, Gao Huang, Xiu Li
Main category: cs.CV
TL;DR: VCA (Visual-Contrast Attention) replaces MHSA in Vision Transformers, reducing complexity from O(N²C) to O(NnC) while improving performance by injecting explicit discrimination through visual-contrast tokens with positive/negative streams.
Details
Motivation: MHSA in Vision Transformers performs quadratic computations on all token pairs, spending most computation on visually weak or redundant correlations, which is inefficient.
Method: VCA distills dense query fields into pooled visual-contrast tokens, splits them into positive/negative streams for differential interaction, and uses dual positional embeddings for contrastive reasoning.
Result: Improves DeiT-Tiny from 72.2% to 75.6% on ImageNet-1K (+3.4%), boosts hierarchical ViTs by up to 3.1%, and lowers FID in image generation by 2.1-5.2 points across diffusion and flow models.
Conclusion: VCA offers a simple path to faster and sharper Vision Transformers with minimal parameter overhead and no extra FLOPs, making it architecture-agnostic and effective.
Abstract: Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for every token pair, spending the bulk of computation on visually weak or redundant correlations. We introduce Visual-Contrast Attention (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from O(N²C) to O(NnC) with n ≪ N. VCA first distils each head’s dense query field into a handful of spatially pooled visual-contrast tokens, then splits them into a learnable positive and negative stream whose differential interaction highlights what truly separates one region from another. The module adds fewer than 0.3M parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K from 72.2% to 75.6% (+3.4) and improves three strong hierarchical ViTs by up to 3.1%, while in class-conditional ImageNet generation it lowers FID-50K by 2.1 to 5.2 points across both diffusion (DiT) and flow (SiT) models. Extensive ablations confirm that (i) spatial pooling supplies low-variance global cues, (ii) dual positional embeddings are indispensable for contrastive reasoning, and (iii) combining the two in both stages yields the strongest synergy. VCA therefore offers a simple path towards faster and sharper Vision Transformers. The source code is available at https://github.com/LeapLabTHU/LinearDiff.
[266] Parameter Interpolation Adversarial Training for Robust Image Classification
Xin Liu, Yichen Yang, Kun He, John E. Hopcroft
Main category: cs.CV
TL;DR: PIAT is a novel adversarial training framework that reduces oscillations and overfitting by interpolating model parameters between epochs, achieving better robustness for CNNs and ViTs.
Details
Motivation: Existing adversarial training methods suffer from robustness oscillations and overfitting during training, which degrades defense effectiveness against adversarial attacks.Method: Parameter Interpolation Adversarial Training (PIAT) tunes model parameters by interpolating between previous and current epoch parameters, and uses Normalized Mean Square Error (NMSE) to align relative logit magnitudes between clean and adversarial examples.
Result: Extensive experiments on benchmark datasets show PIAT prominently improves robustness for both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs).
Conclusion: PIAT effectively addresses overfitting and oscillation issues in adversarial training, enabling better model convergence and higher robustness against adversarial attacks.
Abstract: Though deep neural networks exhibit superior performance on various tasks, they are still plagued by adversarial examples. Adversarial training has been demonstrated to be the most effective method to defend against adversarial attacks. However, with existing adversarial training methods, model robustness exhibits apparent oscillations and overfitting during training, degrading the defense efficacy. To address these issues, we propose a novel framework called Parameter Interpolation Adversarial Training (PIAT). PIAT tunes the model parameters between each epoch by interpolating the parameters of the previous and current epochs. It makes the decision boundary of the model change more smoothly and alleviates the overfitting issue, helping the model converge better and achieve higher robustness. In addition, we suggest using the Normalized Mean Square Error (NMSE) to further improve the robustness by aligning the relative magnitude of logits between clean and adversarial examples rather than the absolute magnitude. Extensive experiments conducted on several benchmark datasets demonstrate that our framework prominently improves the robustness of both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs).
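A hedged sketch of the two mechanisms the abstract names follows: an epoch-level parameter interpolation between the previous and current weights, and a normalized MSE that compares the relative rather than absolute magnitude of clean and adversarial logits. The mixing coefficient, the normalization choice, and the function names are assumptions.

```python
# Sketch of (i) epoch-level parameter interpolation and (ii) an NMSE-style
# logit alignment loss, following the abstract's description. The coefficient
# `alpha` and the per-sample L2 normalization are illustrative assumptions.
import torch
import torch.nn as nn

@torch.no_grad()
def interpolate_parameters(model: nn.Module, prev_state: dict, alpha: float = 0.6):
    """Blend current parameters with the snapshot taken at the previous epoch."""
    for name, p in model.named_parameters():
        p.mul_(1.0 - alpha).add_(prev_state[name], alpha=alpha)

def nmse_logit_alignment(clean_logits: torch.Tensor, adv_logits: torch.Tensor) -> torch.Tensor:
    """Compare logits after per-sample normalization, so only their relative
    magnitudes matter, not their absolute scale."""
    c = clean_logits / clean_logits.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    a = adv_logits / adv_logits.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return ((c - a) ** 2).sum(dim=-1).mean()

# per-epoch usage sketch:
# prev = {k: v.detach().clone() for k, v in model.named_parameters()}
# ... run one epoch of adversarial training ...
# interpolate_parameters(model, prev, alpha=0.6)
```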
[267] OmniBrainBench: A Comprehensive Multimodal Benchmark for Brain Imaging Analysis Across Multi-stage Clinical Tasks
Zhihao Peng, Cheng Wang, Shengyuan Liu, Zhiying Liang, Yixuan Yuan
Main category: cs.CV
TL;DR: OmniBrainBench is the first comprehensive multimodal VQA benchmark for brain imaging analysis, covering 15 imaging modalities and 15 clinical tasks, revealing that current MLLMs significantly lag behind physicians, especially in complex preoperative tasks.
Details
Motivation: Current brain-oriented VQA benchmarks are limited in imaging modalities and pathological descriptions, hindering comprehensive assessment of MLLMs throughout the full clinical continuum in brain imaging analysis.Method: Created OmniBrainBench with 15 brain imaging modalities from 30 medical sources, containing 9,527 VQA pairs and 31,706 images, simulating clinical workflows and covering 15 multi-stage clinical tasks validated by professional radiologists.
Result: Evaluation of 24 MLLMs shows: proprietary models outperform open-source and medical models but still lag physicians; medical MLLMs have wide performance variation; open-source models trail overall but excel in specific tasks; all MLLMs significantly underperform in complex preoperative tasks.
Conclusion: OmniBrainBench sets a new standard for evaluating MLLMs in brain imaging analysis, highlighting substantial gaps between current MLLM capabilities and expert clinical reasoning, particularly in complex clinical scenarios.
Abstract: Brain imaging analysis is vital for diagnosing and treating brain disorders, and multimodal large language models (MLLMs) are increasingly assisting in that analysis. However, current brain-oriented visual question-answering (VQA) benchmarks either cover a few imaging modalities or are limited to coarse-grained pathological descriptions, hindering a comprehensive assessment of MLLMs throughout the full clinical continuum. To address these limitations, we introduce OmniBrainBench, the first comprehensive multimodal VQA benchmark specifically designed to assess the multimodal comprehension capabilities of MLLMs in brain imaging analysis. OmniBrainBench consists of 15 distinct brain imaging modalities collected from 30 verified medical sources, yielding 9,527 validated VQA pairs and 31,706 images. It simulates clinical workflows and encompasses 15 multi-stage clinical tasks rigorously validated by a professional radiologist. Evaluation of 24 state-of-the-art models, including open-source, medical, and proprietary MLLMs, highlights the substantial challenges posed by OmniBrainBench. Our experiments reveal: (1) proprietary MLLMs (e.g., GPT-5) beat open-source and medical models but lag physicians; (2) medical MLLMs vary widely in performance; (3) open-source MLLMs trail overall but excel in specific tasks; (4) MLLMs underperform sharply in complex preoperative tasks, revealing a visual-to-clinical reasoning gap. OmniBrainBench sets a new standard for evaluating and advancing MLLMs in brain imaging analysis, highlighting gaps compared to expert clinical reasoning. The benchmark and code are publicly released.
[268] Occlusion-Aware Diffusion Model for Pedestrian Intention Prediction
Yu Liu, Zhijie Liu, Zedong Yang, You-Fu Li, He Kong
Main category: cs.CV
TL;DR: Proposes an Occlusion-Aware Diffusion Model (ODM) that reconstructs occluded pedestrian motion patterns to improve intention prediction in occlusion scenarios.
Details
Motivation: Existing deep learning models for pedestrian crossing intention prediction don't adequately handle incomplete observations under occlusion scenarios, which is crucial for mobile robots and intelligent vehicles.Method: Uses an occlusion-aware diffusion transformer architecture during denoising to estimate noise features of occluded patterns, and introduces an occlusion mask-guided reverse process to utilize observation information and reduce prediction error accumulation.
Result: Comprehensive evaluation on PIE and JAAD benchmarks shows the method achieves more robust performance than existing methods under various occlusion scenarios.
Conclusion: The proposed ODM effectively handles occlusion scenarios and improves pedestrian crossing intention prediction accuracy through motion pattern reconstruction and contextual relationship capture.
Abstract: Predicting pedestrian crossing intentions is crucial for the navigation of mobile robots and intelligent vehicles. Although recent deep learning-based models have shown significant success in forecasting intentions, few consider incomplete observation under occlusion scenarios. To tackle this challenge, we propose an Occlusion-Aware Diffusion Model (ODM) that reconstructs occluded motion patterns and leverages them to guide future intention prediction. During the denoising stage, we introduce an occlusion-aware diffusion transformer architecture to estimate noise features associated with occluded patterns, thereby enhancing the model’s ability to capture contextual relationships in occluded semantic scenarios. Furthermore, an occlusion mask-guided reverse process is introduced to effectively utilize observation information, reducing the accumulation of prediction errors and enhancing the accuracy of reconstructed motion features. The performance of the proposed method under various occlusion scenarios is comprehensively evaluated and compared with existing methods on popular benchmarks, namely PIE and JAAD. Extensive experimental results demonstrate that the proposed method achieves more robust performance than existing methods in the literature.
[269] Layer-Wise Modality Decomposition for Interpretable Multimodal Sensor Fusion
Jaehyun Park, Konyul Park, Daehun Kim, Junseo Park, Jun Won Choi
Main category: cs.CV
TL;DR: LMD is a post-hoc interpretability method that disentangles modality-specific contributions in sensor fusion models for autonomous driving, enabling attribution of predictions to individual input modalities.
Details
Motivation: Transparency in perception models is critical for autonomous driving safety, but multi-sensor fusion makes it difficult to determine how each modality contributes to predictions due to information entanglement.Method: Layer-Wise Modality Decomposition (LMD) - a post-hoc, model-agnostic method that disentangles modality-specific information across all layers of pretrained fusion models.
Result: LMD effectively attributes predictions to individual modalities in camera-radar, camera-LiDAR, and camera-radar-LiDAR fusion systems, validated through perturbation-based metrics and visual decompositions.
Conclusion: LMD provides practical interpretability for high-capacity multimodal architectures in autonomous driving, addressing the critical need for transparency in sensor fusion decision-making.
Abstract: In autonomous driving, transparency in the decision-making of perception models is critical, as even a single misperception can be catastrophic. Yet with multi-sensor inputs, it is difficult to determine how each modality contributes to a prediction because sensor information becomes entangled within the fusion network. We introduce Layer-Wise Modality Decomposition (LMD), a post-hoc, model-agnostic interpretability method that disentangles modality-specific information across all layers of a pretrained fusion model. To our knowledge, LMD is the first approach to attribute the predictions of a perception model to individual input modalities in a sensor-fusion system for autonomous driving. We evaluate LMD on pretrained fusion models under camera-radar, camera-LiDAR, and camera-radar-LiDAR settings for autonomous driving. Its effectiveness is validated using structured perturbation-based metrics and modality-wise visual decompositions, demonstrating practical applicability to interpreting high-capacity multimodal architectures. Code is available at https://github.com/detxter-jvb/Layer-Wise-Modality-Decomposition.
[270] GraphGeo: Multi-Agent Debate Framework for Visual Geo-localization with Heterogeneous Graph Neural Networks
Heng Zheng, Yuling Shi, Xiaodong Gu, Haochen You, Zijian Zhang, Lubin Gan, Hao Zhang, Wenjun Huang, Jin Huang
Main category: cs.CV
TL;DR: GraphGeo is a multi-agent debate framework using heterogeneous graph neural networks for visual geo-localization, modeling diverse debate relationships and enabling co-evolution between graph structure and agent representations.
Details
Motivation: Traditional retrieval methods are constrained by database coverage, and individual LVLMs struggle with diverse geographic regions and complex scenes. Existing multi-agent systems lack mechanisms to handle conflicting predictions effectively.Method: Uses heterogeneous graph neural networks with typed edges for supportive collaboration, competitive argumentation, and knowledge transfer. Features dual-level debate mechanism with node-level refinement and edge-level argumentation modeling, plus cross-level topology refinement.
Result: Significantly outperforms state-of-the-art methods on multiple benchmarks.
Conclusion: The framework successfully transforms cognitive conflicts between agents into enhanced geo-localization accuracy through structured debate.
Abstract: Visual geo-localization requires extensive geographic knowledge and sophisticated reasoning to determine image locations without GPS metadata. Traditional retrieval methods are constrained by database coverage and quality. Recent Large Vision-Language Models (LVLMs) enable direct location reasoning from image content, yet individual models struggle with diverse geographic regions and complex scenes. Existing multi-agent systems improve performance through model collaboration but treat all agent interactions uniformly. They lack mechanisms to handle conflicting predictions effectively. We propose GraphGeo, a multi-agent debate framework using heterogeneous graph neural networks for visual geo-localization. Our approach models diverse debate relationships through typed edges, distinguishing supportive collaboration, competitive argumentation, and knowledge transfer. We introduce a dual-level debate mechanism combining node-level refinement and edge-level argumentation modeling. A cross-level topology refinement strategy enables co-evolution between graph structure and agent representations. Experiments on multiple benchmarks demonstrate GraphGeo significantly outperforms state-of-the-art methods. Our framework transforms cognitive conflicts between agents into enhanced geo-localization accuracy through structured debate.
[271] Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs
Yan Shu, Chi Liu, Robin Chen, Derek Li, Bryan Dai
Main category: cs.CV
TL;DR: Fleming-VL is a unified end-to-end framework for medical visual understanding across heterogeneous modalities (2D images, 3D volumetric scans, temporal videos) that addresses domain gaps through data-centric strategies including pretraining scaling, rare data fine-tuning, and extended evaluation frameworks.
Details
Motivation: Medical data presents unique challenges due to its heterogeneous nature with diverse modalities (2D images, 3D volumetric scans, temporal videos), substantial domain gaps, and data format inconsistencies that hinder unified medical MLLM development.Method: Three key data-centric strategies: (1) scaling up pretraining with long-context natural and medical data, (2) fine-tuning with rare medical data including video analysis and underrepresented 2D modalities, (3) extending evaluation frameworks to include 3D and video benchmarks. Uses supervised fine-tuning (SFT) and group relative policy optimization (GRPO).
Result: Fleming-VL achieves state-of-the-art performance across multiple benchmarks including medical VQA, video QA, and 3D medical image understanding.
Conclusion: The framework successfully addresses medical data heterogeneity and enables comprehensive medical visual understanding across diverse modalities, with public release to promote transparent and reproducible medical AI progress.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable effectiveness in various general-domain scenarios, such as visual question answering and image captioning. Recently, researchers have increasingly focused on empowering MLLMs with medical conversational abilities, which hold significant promise for clinical applications. However, medical data presents unique challenges due to its heterogeneous nature – encompassing diverse modalities including 2D images, 3D volumetric scans, and temporal video sequences. The substantial domain gap and data format inconsistencies across these modalities have hindered the development of unified medical MLLMs. To address these challenges, we propose Fleming-VL, a unified end-to-end framework for comprehensive medical visual understanding across heterogeneous modalities. Fleming-VL tackles this problem from a data-centric perspective through three key strategies: (1) scaling up pretraining by integrating long-context data from both natural and medical-specific domains; (2) complementing fine-tuning with rare medical data, including holistic video analysis and underrepresented 2D modalities such as ultrasound and dermoscopy images; (3) extending existing evaluation frameworks to incorporate 3D volumetric and video understanding benchmarks. Through supervised fine-tuning (SFT) and group relative policy optimization (GRPO), we develop Fleming-VL in multiple model scales. Extensive experiments demonstrate that Fleming-VL achieves state-of-the-art performance across multiple benchmarks, including medical VQA, video QA, and 3D medical image understanding. We publicly release Fleming-VL to promote transparent, reproducible, and auditable progress in medical AI.
[272] Dynamic Multi-level Weighted Alignment Network for Zero-shot Sketch-based Image Retrieval
Hanwen Su, Ge Song, Jiyan Wang, Yuanbo Zhu
Main category: cs.CV
TL;DR: A Dynamic Multi-level Weighted Alignment Network for zero-shot sketch-based image retrieval that addresses modality imbalance and inconsistent information through multi-level weighting and weighted quadruplet loss.
Details
Motivation: Previous ZS-SBIR methods suffer from imbalanced modality samples and inconsistent low-quality information during training, leading to sub-optimal performance.Method: Three-component approach: (1) Uni-modal Feature Extraction using CLIP text encoder and ViT, (2) Cross-modal Multi-level Weighting Module with local/global aggregation blocks, (3) Weighted Quadruplet Loss for domain balance.
Result: Superior performance over state-of-the-art methods on Sketchy, TU-Berlin, and QuickDraw benchmark datasets.
Conclusion: The proposed method effectively addresses modality imbalance and information inconsistency in ZS-SBIR, achieving improved retrieval performance.
Abstract: The problem of zero-shot sketch-based image retrieval (ZS-SBIR) has attracted increasing attention due to its wide applications, e.g. e-commerce. Despite progress made in this field, previous works suffer from using imbalanced samples of modalities and inconsistent low-quality information during training, resulting in sub-optimal performance. Therefore, in this paper, we introduce an approach called Dynamic Multi-level Weighted Alignment Network for ZS-SBIR. It consists of three components: (i) a Uni-modal Feature Extraction Module that includes a CLIP text encoder and a ViT for extracting textual and visual tokens, (ii) a Cross-modal Multi-level Weighting Module that produces an alignment weight list by the local and global aggregation blocks to measure the aligning quality of sketch and image samples, (iii) a Weighted Quadruplet Loss Module aiming to improve the balance of domains in the triplet loss. Experiments on three benchmark datasets, i.e., Sketchy, TU-Berlin, and QuickDraw, show our method delivers superior performance over state-of-the-art ZS-SBIR methods.
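The abstract gives no formula for the weighted quadruplet loss; one plausible form, under stated assumptions, is sketched below: an anchor sketch is contrasted against a matching image, a non-matching image, and a non-matching sketch, with per-sample weights (for example, the alignment-quality weights produced by the weighting module) rebalancing the two domains. The margins, the source of the weights, and the names are illustrative, not the paper's definition.

```python
# Hedged sketch of a weighted quadruplet loss for sketch-image retrieval.
# `weights` is assumed to come from an alignment-quality estimate per sample.
import torch
import torch.nn.functional as F

def weighted_quadruplet_loss(anchor, pos_img, neg_img, neg_sketch,
                             weights, margin_img=0.3, margin_sketch=0.15):
    d_pos = F.pairwise_distance(anchor, pos_img)        # anchor vs matching image
    d_neg_img = F.pairwise_distance(anchor, neg_img)    # anchor vs non-matching image
    d_neg_sk = F.pairwise_distance(anchor, neg_sketch)  # anchor vs non-matching sketch
    # one hinge term per domain, each weighted per sample to balance the domains
    term_img = F.relu(d_pos - d_neg_img + margin_img)
    term_sk = F.relu(d_pos - d_neg_sk + margin_sketch)
    return (weights * (term_img + term_sk)).mean()
```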
[273] EVTAR: End-to-End Try on with Additional Unpaired Visual Reference
Liuzhuozheng Li, Yue Gong, Shanyuan Liu, Bo Cheng, Yuhang Ma, Liebucha Wu, Dengyang Jiang, Zanyi Wang, Dawei Leng, Yuhui Yin
Main category: cs.CV
TL;DR: EVTAR is an end-to-end virtual try-on model that directly fits target garments onto person images using additional reference images, eliminating the need for complex inputs like masks, densepose, or segmentation maps.
Details
Motivation: Existing virtual try-on methods require complex inputs (agnostic person images, human pose, densepose, body keypoints) which are labor-intensive and impractical for real-world applications.Method: Two-stage training strategy with simple inference using only source image and target garment inputs. Leverages additional reference images of different individuals wearing the same clothes to preserve garment texture and fine-grained details.
Result: Evaluated on two widely used benchmarks and diverse tasks, with results consistently validating the effectiveness of the approach.
Conclusion: EVTAR provides a more practical and realistic virtual try-on solution by simulating how humans consider reference models when choosing outfits, achieving high-quality dressing effects without complex preprocessing requirements.
Abstract: We propose EVTAR, an End-to-End Virtual Try-on model with Additional Reference, that directly fits the target garment onto the person image while incorporating reference images to enhance try-on accuracy. Most existing virtual try-on approaches rely on complex inputs such as agnostic person images, human pose, densepose, or body keypoints, making them labor-intensive and impractical for real-world applications. In contrast, EVTAR adopts a two-stage training strategy, enabling simple inference with only the source image and the target garment inputs. Our model generates try-on results without masks, densepose, or segmentation maps. Moreover, EVTAR leverages additional reference images of different individuals wearing the same clothes to preserve garment texture and fine-grained details better. This mechanism is analogous to how humans consider reference models when choosing outfits, thereby simulating a more realistic and high-quality dressing effect. We enrich the training data with supplementary references and unpaired person images to support these capabilities. We evaluate EVTAR on two widely used benchmarks and diverse tasks, and the results consistently validate the effectiveness of our approach.
[274] A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis
Dongheng Lin, Mengxue Qu, Kunyang Han, Jianbo Jiao, Xiaojie Jin, Yunchao Wei
Main category: cs.CV
TL;DR: A unified zero-shot framework for video anomaly analysis that connects temporal detection, spatial localization, and textual explanation through chained reasoning without additional training.
Details
Motivation: Current video anomaly methods lack explainability, providing only frame-wise scores without spatial or semantic context, and existing approaches are data-dependent and task-specific.Method: Chained test-time reasoning process that leverages intra-task reasoning for temporal detection refinement and inter-task chaining for spatial and semantic understanding, using careful prompt design with foundation models.
Result: Achieves state-of-the-art zero-shot performance across multiple video anomaly detection, localization, and explanation benchmarks without additional data or gradients.
Conclusion: Task-wise chaining can unlock foundation models’ reasoning power for practical, interpretable video anomaly analysis in a fully zero-shot manner.
Abstract: Most video-anomaly research stops at frame-wise detection, offering little insight into why an event is abnormal, typically outputting only frame-wise anomaly scores without spatial or semantic context. Recent video anomaly localization and video anomaly understanding methods improve explainability but remain data-dependent and task-specific. We propose a unified reasoning framework that bridges the gap between temporal detection, spatial localization, and textual explanation. Our approach is built upon a chained test-time reasoning process that sequentially connects these tasks, enabling holistic zero-shot anomaly analysis without any additional training. Specifically, our approach leverages intra-task reasoning to refine temporal detections and inter-task chaining for spatial and semantic understanding, yielding improved interpretability and generalization in a fully zero-shot manner. Without any additional data or gradients, our method achieves state-of-the-art zero-shot performance across multiple video anomaly detection, localization, and explanation benchmarks. The results demonstrate that careful prompt design with task-wise chaining can unlock the reasoning power of foundation models, enabling practical, interpretable video anomaly analysis in a fully zero-shot manner. Project Page: https://rathgrith.github.io/Unified_Frame_VAA/.
[275] VesSAM: Efficient Multi-Prompting for Segmenting Complex Vessel
Suzhong Fu, Rui Sun, Xuan Ding, Jingqi Dong, Yiming Yang, Yao Zhu, Min Chang Jordan Ren, Delin Deng, Angelica Aviles-Rivero, Shuguang Cui, Zhen Li
Main category: cs.CV
TL;DR: VesSAM is a specialized 2D vessel segmentation framework that enhances SAM’s performance on vascular structures through convolutional adapters, multi-prompt encoding, and lightweight mask decoding, achieving significant improvements over existing methods.
Details
Motivation: Vessel segmentation is crucial for clinical applications but challenging due to thin, branching structures and low texture contrast. Foundation models like SAM perform sub-optimally on vascular structures, requiring specialized approaches.Method: VesSAM integrates: (1) convolutional adapter for local texture enhancement, (2) multi-prompt encoder fusing anatomical prompts (skeletons, bifurcation points, segment midpoints) via hierarchical cross-attention, (3) lightweight mask decoder to reduce artifacts. Includes automated pipeline for multi-prompt annotation generation.
Result: Outperforms state-of-the-art PEFT-based SAM variants by over 10% Dice and 13% IoU. Achieves competitive performance vs fully fine-tuned methods with significantly fewer parameters. Generalizes well to out-of-distribution settings, outperforming all baselines in average OoD Dice and IoU.
Conclusion: VesSAM provides an efficient and powerful framework for vessel segmentation that significantly improves upon foundation models while maintaining parameter efficiency and strong generalization capabilities across diverse imaging modalities.
Abstract: Accurate vessel segmentation is critical for clinical applications such as disease diagnosis and surgical planning, yet remains challenging due to thin, branching structures and low texture contrast. While foundation models like the Segment Anything Model (SAM) have shown promise in generic segmentation, they perform sub-optimally on vascular structures. In this work, we present VesSAM, a powerful and efficient framework tailored for 2D vessel segmentation. VesSAM integrates (1) a convolutional adapter to enhance local texture features, (2) a multi-prompt encoder that fuses anatomical prompts, including skeletons, bifurcation points, and segment midpoints, via hierarchical cross-attention, and (3) a lightweight mask decoder to reduce jagged artifacts. We also introduce an automated pipeline to generate structured multi-prompt annotations, and curate a diverse benchmark dataset spanning 8 datasets across 5 imaging modalities. Experimental results demonstrate that VesSAM consistently outperforms state-of-the-art PEFT-based SAM variants by over 10% Dice and 13% IoU, and achieves competitive performance compared to fully fine-tuned methods, with significantly fewer parameters. VesSAM also generalizes well to out-of-distribution (OoD) settings, outperforming all baselines in average OoD Dice and IoU.
[276] MID: A Self-supervised Multimodal Iterative Denoising Framework
Chang Nie, Tianchen Deng, Zhe Liu, Hesheng Wang
Main category: cs.CV
TL;DR: MID is a self-supervised multimodal iterative denoising framework that models noisy data as states in a continuous noise accumulation process, learning to estimate and remove noise iteratively without requiring clean-noisy data pairs.
Details
Motivation: Traditional rule-based denoising methods are inadequate for real-world data corrupted by complex, non-linear noise, necessitating a more robust and adaptable approach.Method: Models noisy data as states in continuous non-linear noise accumulation, learns two neural networks to estimate noise step and predict/subtract noise increments, uses first-order Taylor expansion for local linearization of complex non-linear noise.
Result: Demonstrates robustness, adaptability, and state-of-the-art performance across four classic computer vision tasks, plus strong performance in biomedical and bioinformatics domains.
Conclusion: MID provides an effective self-supervised framework for complex noise removal that doesn’t require paired clean-noisy datasets and performs well across multiple domains.
Abstract: Data denoising is a persistent challenge across scientific and engineering domains. Real-world data is frequently corrupted by complex, non-linear noise, rendering traditional rule-based denoising methods inadequate. To overcome these obstacles, we propose a novel self-supervised multimodal iterative denoising (MID) framework. MID models the collected noisy data as a state within a continuous process of non-linear noise accumulation. By iteratively introducing further noise, MID learns two neural networks: one to estimate the current noise step and another to predict and subtract the corresponding noise increment. For complex non-linear contamination, MID employs a first-order Taylor expansion to locally linearize the noise process, enabling effective iterative removal. Crucially, MID does not require paired clean-noisy datasets, as it learns noise characteristics directly from the noisy inputs. Experiments across four classic computer vision tasks demonstrate MID’s robustness, adaptability, and consistent state-of-the-art performance. Moreover, MID exhibits strong performance and adaptability in tasks within the biomedical and bioinformatics domains.
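To make the iterative procedure concrete, here is a minimal inference-time sketch of the loop the abstract describes: one network estimates the current noise step and another predicts the noise increment to subtract, repeated until the estimated step reaches zero. The network interfaces, stopping rule, and step count are assumptions; the self-supervised training procedure is not shown.

```python
# Inference-loop sketch for an iterative denoiser in the spirit of MID.
# `step_estimator` and `noise_predictor` are placeholder modules; their
# signatures (x) and (x, t) are assumptions, not the authors' API.
import torch
import torch.nn as nn

@torch.no_grad()
def iterative_denoise(x: torch.Tensor, step_estimator: nn.Module,
                      noise_predictor: nn.Module, max_steps: int = 10) -> torch.Tensor:
    for _ in range(max_steps):
        t = step_estimator(x)              # estimated current noise step, e.g. (B, 1)
        if (t <= 0).all():                 # judged clean: stop early
            break
        increment = noise_predictor(x, t)  # predicted noise increment for this step
        x = x - increment                  # peel one layer of noise off
    return x
```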
[277] Integrating Visual and X-Ray Machine Learning Features in the Study of Paintings by Goya
Hassan Ugail, Ismail Lujain Jaleel
Main category: cs.CV
TL;DR: A multimodal machine learning framework for authenticating Goya paintings using identical feature extraction on both visual and X-ray images, achieving 97.8% accuracy with One-Class SVM.
Details
Motivation: Art authentication of Goya's works is challenging due to his stylistic evolution and forgery history, requiring advanced computational methods.Method: Unified feature extraction (GLCM, LBP, entropy, energy, color analysis) applied to both visual and X-ray images, processed through optimized One-Class SVM with hyperparameter tuning.
Result: 97.8% classification accuracy with 0.022 false positive rate on 24 authenticated Goya paintings; case study on “Un Gigante” achieved 92.3% authentication confidence.
Conclusion: Multimodal approach significantly outperforms single-modal methods, demonstrating effectiveness of identical computational techniques across visual and radiographic imagery for art authentication.
Abstract: Art authentication of Francisco Goya’s works presents complex computational challenges due to his heterogeneous stylistic evolution and extensive historical patterns of forgery. We introduce a novel multimodal machine learning framework that applies identical feature extraction techniques to both visual and X-ray radiographic images of Goya paintings. The unified feature extraction pipeline incorporates Grey-Level Co-occurrence Matrix descriptors, Local Binary Patterns, entropy measures, energy calculations, and colour distribution analysis applied consistently across both imaging modalities. The extracted features from both visual and X-ray images are processed through an optimised One-Class Support Vector Machine with hyperparameter tuning. Using a dataset of 24 authenticated Goya paintings with corresponding X-ray images, split into an 80/20 train-test configuration with 10-fold cross-validation, the framework achieves 97.8% classification accuracy with a 0.022 false positive rate. Case study analysis of "Un Gigante" demonstrates the practical efficacy of our pipeline, achieving 92.3% authentication confidence through unified multimodal feature analysis. Our results indicate substantial performance improvement over single-modal approaches, establishing the effectiveness of applying identical computational methods to both visual and radiographic imagery in art authentication applications.
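A small sketch of the kind of unified feature extraction plus One-Class SVM pipeline the abstract describes is given below, applying identical texture descriptors (GLCM, LBP, entropy, energy) to a visual image and its X-ray counterpart. Descriptor settings, histogram bins, and SVM hyperparameters are assumptions rather than the paper's configuration.

```python
# Hedged sketch: identical texture features on both modalities, concatenated,
# then a One-Class SVM trained only on authenticated works.
import numpy as np
from skimage.feature import graycomatrix, graycoprops, local_binary_pattern
from sklearn.svm import OneClassSVM

def texture_features(gray: np.ndarray) -> np.ndarray:
    """gray: 2D uint8 image. Returns a small GLCM + LBP + entropy feature vector."""
    glcm = graycomatrix(gray, distances=[1], angles=[0], levels=256, normed=True)
    contrast = graycoprops(glcm, "contrast")[0, 0]
    energy = graycoprops(glcm, "energy")[0, 0]
    homogeneity = graycoprops(glcm, "homogeneity")[0, 0]
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    p = np.bincount(gray.ravel(), minlength=256) / gray.size
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return np.concatenate([[contrast, energy, homogeneity, entropy], lbp_hist])

def paired_features(visual: np.ndarray, xray: np.ndarray) -> np.ndarray:
    # identical extraction on both modalities, concatenated into one vector
    return np.concatenate([texture_features(visual), texture_features(xray)])

# training sketch on authenticated pairs only, then scoring unseen works:
# X = np.stack([paired_features(v, x) for v, x in authenticated_pairs])
# clf = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X)
```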
[278] HyFormer-Net: A Synergistic CNN-Transformer with Interpretable Multi-Scale Fusion for Breast Lesion Segmentation and Classification in Ultrasound Images
Mohammad Amanour Rahman
Main category: cs.CV
TL;DR: HyFormer-Net is a hybrid CNN-Transformer model for breast ultrasound that simultaneously performs segmentation and classification with intrinsic interpretability, achieving state-of-the-art performance and demonstrating strong cross-dataset generalization through progressive fine-tuning.
Details
Motivation: B-mode ultrasound for breast cancer diagnosis faces challenges including speckle noise, operator dependency, and indistinct boundaries. Existing deep learning approaches suffer from single-task learning, architectural limitations (CNNs lack global context while Transformers lack local features), and black-box decision-making, which hinder clinical adoption.Method: HyFormer-Net uses a dual-branch encoder integrating EfficientNet-B3 and Swin Transformer via multi-scale hierarchical fusion blocks, with an attention-gated decoder for precision and explainability. It features dual-pipeline interpretability: intrinsic attention validation with quantitative IoU verification and Grad-CAM for classification reasoning.
Result: On BUSI dataset: Dice Score 0.761 +/- 0.072, accuracy 93.2%, malignant recall 92.1 +/- 2.2%. Ensemble modeling achieves exceptional Dice 90.2%, accuracy 99.5%, and perfect 100% malignant recall. Ablation studies show multi-scale fusion contributes +16.8% Dice and attention gates add +5.9%. Cross-dataset generalization: progressive fine-tuning with only 10% target data recovers 92.5% performance; with 50% data achieves 77.3% Dice, exceeding source-domain performance.
Conclusion: HyFormer-Net demonstrates superior performance for breast ultrasound analysis with intrinsic interpretability and strong generalization capabilities. The model’s ability to achieve true generalization across datasets with minimal fine-tuning data represents a significant advancement for clinical adoption of deep learning in medical imaging.
Abstract: B-mode ultrasound for breast cancer diagnosis faces challenges: speckle, operator dependency, and indistinct boundaries. Existing deep learning suffers from single-task learning, architectural constraints (CNNs lack global context, Transformers local features), and black-box decision-making. These gaps hinder clinical adoption. We propose HyFormer-Net, a hybrid CNN-Transformer for simultaneous segmentation and classification with intrinsic interpretability. Its dual-branch encoder integrates EfficientNet-B3 and Swin Transformer via multi-scale hierarchical fusion blocks. An attention-gated decoder provides precision and explainability. We introduce dual-pipeline interpretability: (1) intrinsic attention validation with quantitative IoU verification (mean: 0.86), and (2) Grad-CAM for classification reasoning. On the BUSI dataset, HyFormer-Net achieves Dice Score 0.761 +/- 0.072 and accuracy 93.2%, outperforming U-Net, Attention U-Net, and TransUNet. Malignant Recall of 92.1 +/- 2.2% ensures minimal false negatives. Ensemble modeling yields exceptional Dice 90.2%, accuracy 99.5%, and perfect 100% Malignant Recall, eliminating false negatives. Ablation studies confirm multi-scale fusion contributes +16.8% Dice and attention gates add +5.9%. Crucially, we conduct the first cross-dataset generalization study for hybrid CNN-Transformers in breast ultrasound. Zero-shot transfer fails (Dice: 0.058), confirming domain shift. However, progressive fine-tuning with only 10% target-domain data (68 images) recovers 92.5% performance. With 50% data, our model achieves 77.3% Dice, exceeding source-domain performance (76.1%) and demonstrating true generalization.
[279] FastBoost: Progressive Attention with Dynamic Scaling for Efficient Deep Learning
JunXi Yuan
Main category: cs.CV
TL;DR: FastBoost is a parameter-efficient neural architecture that achieves state-of-the-art performance on CIFAR benchmarks using a novel Dynamically Scaled Progressive Attention (DSPA) mechanism, achieving significant parameter reduction while improving accuracy.
Details
Motivation: To develop a highly parameter-efficient neural architecture that can achieve state-of-the-art performance on image classification benchmarks while being suitable for deployment in resource-constrained edge devices, addressing the trade-off between model size and accuracy.Method: Uses Dynamically Scaled Progressive Attention (DSPA) with three innovations: Adaptive Fusion (learnt channel-spatial attention blending), Phase Scaling (training-stage-aware intensity modulation), and Residual Adaptation (self-optimized skip connections). Integrated with enhanced MBConv blocks and features dual attention pathways with real-time weight adjustment and cascaded refinement layers.
Result: Achieved CIFAR-10: 95.57% accuracy (0.85M parameters) and 93.80% (0.37M parameters); CIFAR-100: 81.37% accuracy (0.92M parameters) and 74.85% (0.44M parameters). 2.1 times parameter reduction over MobileNetV3 with +3.2 percentage points accuracy improvement on CIFAR-10. Hardware-friendly design with 0.28G FLOPs.
Conclusion: FastBoost demonstrates unprecedented parameter-accuracy trade-offs through co-optimization of dynamic attention and efficient convolution operations, enabling deployment in resource-constrained edge devices without accuracy degradation.
Abstract: We present FastBoost, a parameter-efficient neural architecture that achieves state-of-the-art performance on CIFAR benchmarks through a novel Dynamically Scaled Progressive Attention (DSPA) mechanism. Our design establishes new efficiency frontiers: on CIFAR-10, 95.57% accuracy (0.85M parameters) and 93.80% (0.37M parameters); on CIFAR-100, 81.37% accuracy (0.92M parameters) and 74.85% (0.44M parameters). The breakthrough stems from three fundamental innovations in DSPA: (1) Adaptive Fusion: learnt channel-spatial attention blending with dynamic weights; (2) Phase Scaling: training-stage-aware intensity modulation (from 0.5 to 1.0); (3) Residual Adaptation: self-optimized skip connections (gamma from 0.5 to 0.72). By integrating DSPA with enhanced MBConv blocks, FastBoost achieves a 2.1 times parameter reduction over MobileNetV3 while improving accuracy by +3.2 percentage points on CIFAR-10. The architecture features dual attention pathways with real-time weight adjustment, cascaded refinement layers (increasing gradient flow by 12.7%), and a hardware-friendly design (0.28G FLOPs). This co-optimization of dynamic attention and efficient convolution operations demonstrates unprecedented parameter-accuracy trade-offs, enabling deployment in resource-constrained edge devices without accuracy degradation.
[280] T-MLA: A Targeted Multiscale Log–Exponential Attack Framework for Neural Image Compression
Nikolay I. Kalmykov, Razan Dibo, Kaiyu Shen, Xu Zhonghan, Anh-Huy Phan, Yipeng Liu, Ivan Oseledets
Main category: cs.CV
TL;DR: T-MLA is the first targeted multiscale log-exponential attack framework that crafts adversarial perturbations in the wavelet domain to compromise neural image compression systems while maintaining visual stealth.
Details
Motivation: Existing adversarial attacks on neural image compression (NIC) are naive adaptations of pixel-space methods, overlooking the unique structured nature of compression pipelines. There's a need to understand advanced vulnerabilities in NIC systems.Method: Proposed T-MLA framework crafts adversarial perturbations in wavelet domain by directly targeting quality of attacked and reconstructed images. Uses strategic confinement to specific wavelet subbands for maximum distortion with perceptual stealth.
Result: Extensive evaluation across multiple state-of-the-art NIC architectures shows large drop in reconstruction quality while perturbations remain visually imperceptible. Reveals critical security flaws in generative and content delivery pipelines.
Conclusion: The work demonstrates that neural image compression systems have fundamental security vulnerabilities that can be exploited through sophisticated wavelet-domain attacks, posing serious risks to real-world content delivery systems.
Abstract: Neural image compression (NIC) has become the state-of-the-art for rate-distortion performance, yet its security vulnerabilities remain significantly less understood than those of classifiers. Existing adversarial attacks on NICs are often naive adaptations of pixel-space methods, overlooking the unique, structured nature of the compression pipeline. In this work, we propose a more advanced class of vulnerabilities by introducing T-MLA, the first targeted multiscale log–exponential attack framework. Our approach crafts adversarial perturbations in the wavelet domain by directly targeting the quality of the attacked and reconstructed images. This allows for a principled, offline attack where perturbations are strategically confined to specific wavelet subbands, maximizing distortion while ensuring perceptual stealth. Extensive evaluation across multiple state-of-the-art NIC architectures on standard image compression benchmarks reveals a large drop in reconstruction quality while the perturbations remain visually imperceptible. Our findings reveal a critical security flaw at the core of generative and content delivery pipelines.
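As a rough illustration of what confining perturbations to specific wavelet subbands means, the sketch below uses PyWavelets to modify only the level-1 detail subbands of an image while leaving the approximation band untouched. In the actual attack the perturbation would be optimized against a targeted quality objective; here random noise stands in purely to show the confinement step, and the wavelet, epsilon, and function name are assumptions.

```python
# Hedged sketch of wavelet-subband confinement (not the authors' optimized
# targeted attack): perturb only the detail subbands, keep coarse structure.
import numpy as np
import pywt

def perturb_in_subbands(img: np.ndarray, eps: float = 2.0) -> np.ndarray:
    """img: 2D float array. Adds a small random perturbation to the level-1
    detail subbands only; the approximation band is left unchanged."""
    cA, (cH, cV, cD) = pywt.dwt2(img, "db2")
    cH = cH + np.random.uniform(-eps, eps, cH.shape)
    cV = cV + np.random.uniform(-eps, eps, cV.shape)
    cD = cD + np.random.uniform(-eps, eps, cD.shape)
    # reconstruct; for odd-sized inputs the output may be one pixel larger
    return pywt.idwt2((cA, (cH, cV, cD)), "db2")
```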
[281] GeoToken: Hierarchical Geolocalization of Images via Next Token Prediction
Narges Ghasemi, Amir Ziashahabi, Salman Avestimehr, Cyrus Shahabi
Main category: cs.CV
TL;DR: Proposes hierarchical sequence prediction for image geolocalization using S2 cells, inspired by human location narrowing and autoregressive text generation, achieving state-of-the-art performance.
Details
Motivation: Address challenges in image geolocalization caused by visual similarities across locations and large search space by mimicking how humans narrow down locations hierarchically.Method: Uses S2 cells as nested multiresolution grid, predicts geographic tokens hierarchically from broad regions to specific locations, incorporates beam search and multi-sample inference for uncertainty management.
Result: Achieves state-of-the-art performance on Im2GPS3k and YFCC4k datasets, with up to 13.9% accuracy gains in MLLM-free setting and outperforms all baselines when augmented with MLLM.
Conclusion: Hierarchical sequence prediction with S2 cells and autoregressive sampling strategies effectively addresses image geolocalization challenges, setting new state-of-the-art performance.
Abstract: Image geolocalization, the task of determining an image’s geographic origin, poses significant challenges, largely due to visual similarities across disparate locations and the large search space. To address these issues, we propose a hierarchical sequence prediction approach inspired by how humans narrow down locations from broad regions to specific addresses. Analogously, our model predicts geographic tokens hierarchically, first identifying a general region and then sequentially refining predictions to increasingly precise locations. Rather than relying on explicit semantic partitions, our method uses S2 cells, a nested, multiresolution global grid, and sequentially predicts finer-level cells conditioned on visual inputs and previous predictions. This procedure mirrors autoregressive text generation in large language models. Much like in language modeling, final performance depends not only on training but also on inference-time strategy. We investigate multiple top-down traversal methods for autoregressive sampling, incorporating techniques from test-time compute scaling used in language models. Specifically, we integrate beam search and multi-sample inference while exploring various selection strategies to determine the final output. This enables the model to manage uncertainty by exploring multiple plausible paths through the hierarchy. We evaluate our method on the Im2GPS3k and YFCC4k datasets against two distinct sets of baselines: those that operate without a Multimodal Large Language Model (MLLM) and those that leverage one. In the MLLM-free setting, our model surpasses other comparable baselines on nearly all metrics, achieving state-of-the-art performance with accuracy gains of up to 13.9%. When augmented with an MLLM, our model outperforms all baselines, setting a new state-of-the-art across all metrics. The source code is available at https://github.com/NNargesNN/GeoToken.
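The decoding strategy described above amounts to a beam search over a coarse-to-fine sequence of cell tokens. A minimal, model-agnostic sketch follows; `score_next` stands in for the model's conditional distribution over child cells given the image and the prefix, and the depth and beam width are assumptions.

```python
# Beam-search sketch over hierarchical location tokens (coarse to fine).
# `score_next(prefix)` is a placeholder returning (child_token, probability)
# pairs conditioned on the image and the prefix decoded so far.
import math
from typing import Callable, List, Tuple

def hierarchical_beam_search(score_next: Callable[[Tuple[int, ...]], List[Tuple[int, float]]],
                             depth: int, beam_width: int = 4) -> Tuple[int, ...]:
    beams = [((), 0.0)]  # (token prefix, cumulative log-probability)
    for _ in range(depth):
        candidates = []
        for prefix, logp in beams:
            for token, p in score_next(prefix):
                candidates.append((prefix + (token,), logp + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]  # most probable coarse-to-fine cell sequence
```

At each level the beam keeps the `beam_width` most probable prefixes, so the decoder can recover from an early coarse-cell mistake instead of committing greedily.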
[282] $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$: A Large and Diverse Multimodal Benchmark for evaluating the ability of Vision-Language Models to understand Rebus Puzzles
Trishanu Das, Abhilash Nandy, Khush Bajaj, Deepiha S
Main category: cs.CV
TL;DR: A new benchmark called |↻BUS| with 1,333 English Rebus Puzzles across 18 categories, plus a model-agnostic framework RebusDescProgICE that improves VLM performance by 2.1-4.1% (closed-source) and 20-30% (open-source) over Chain-of-Thought.
Details
Motivation: Rebus puzzles require complex skills like image recognition, cognitive reasoning, and wordplay, making them challenging for current Vision-Language Models.Method: Proposed RebusDescProgICE framework combining unstructured descriptions with code-based structured reasoning and improved reasoning-based in-context example selection.
Result: The framework improved performance on the |↻BUS| benchmark by 2.1-4.1% for closed-source models and 20-30% for open-source models compared to Chain-of-Thought Reasoning.
Conclusion: The benchmark and framework advance VLM capabilities on complex reasoning tasks like rebus puzzles, showing significant improvements especially for open-source models.
Abstract: Understanding Rebus Puzzles (Rebus Puzzles use pictures, symbols, and letters to represent words or phrases creatively) requires a variety of skills such as image recognition, cognitive skills, commonsense reasoning, multi-step reasoning, image-based wordplay, etc., making this a challenging task for even current Vision-Language Models. In this paper, we present $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$, a large and diverse benchmark of 1,333 English Rebus Puzzles containing different artistic styles and levels of difficulty, spread across 18 categories such as food, idioms, sports, finance, entertainment, etc. We also propose RebusDescProgICE, a model-agnostic framework which uses a combination of an unstructured description and code-based, structured reasoning, along with better, reasoning-based in-context example selection, improving the performance of Vision-Language Models on $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$ by 2.1-4.1% and 20-30% using closed-source and open-source models respectively, compared to Chain-of-Thought Reasoning.
[283] SliceVision-F2I: A Synthetic Feature-to-Image Dataset for Visual Pattern Representation on Network Slices
Md. Abid Hasan Rafi, Mst. Fatematuj Johora, Pankaj Bhowmik
Main category: cs.CV
TL;DR: SliceVision-F2I is a synthetic dataset that transforms network KPI vectors into visual representations using four encoding methods for visual learning and machine learning applications in network slicing.
Details
Motivation: The emergence of 5G/6G networks requires refined identification methods and robust datasets for network slicing in service-oriented architectures.Method: Created synthetic dataset by transforming multivariate KPI vectors into RGB images using four encoding methods: physically inspired mappings, Perlin noise, neural wallpapering, and fractal branching. Generated 30,000 samples per method with realistic network noise.
Result: Produced SliceVision-F2I dataset containing raw KPI vectors and corresponding low-resolution RGB images, simulating operational uncertainties and measurement imperfections.
Conclusion: The dataset is publicly available and suitable for visual learning, network state classification, anomaly detection, and benchmarking image-based ML techniques for network data analysis.
Abstract: The emergence of 5G and 6G networks has established network slicing as a significant part of future service-oriented architectures, demanding refined identification methods supported by robust datasets. The article presents SliceVision-F2I, a dataset of synthetic samples for studying feature visualization in network slicing for next-generation networking systems. The dataset transforms multivariate Key Performance Indicator (KPI) vectors into visual representations through four distinct encoding methods: physically inspired mappings, Perlin noise, neural wallpapering, and fractal branching. For each encoding method, 30,000 samples are generated, each comprising a raw KPI vector and a corresponding low-resolution RGB image. The dataset simulates realistic and noisy network conditions to reflect operational uncertainties and measurement imperfections. SliceVision-F2I is suitable for tasks involving visual learning, network state classification, anomaly detection, and benchmarking of image-based machine learning techniques applied to network data. The dataset is publicly available and can be reused in various research contexts, including multivariate time series analysis, synthetic data generation, and feature-to-image transformations.
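As a toy illustration of the feature-to-image idea (not one of the four encodings shipped with the dataset), the sketch below tiles a normalized KPI vector into a small RGB image, using simple gradients of the same field for the extra channels. The image size and the channel construction are assumptions.

```python
# Toy KPI-vector-to-image encoding sketch; the dataset's physically inspired,
# Perlin, wallpapering, and fractal encodings are far more elaborate.
import numpy as np

def kpi_to_rgb(kpi: np.ndarray, size: int = 32) -> np.ndarray:
    """kpi: 1D array of KPI values. Returns a (size, size, 3) uint8 image."""
    span = kpi.max() - kpi.min() + 1e-8
    norm = (kpi - kpi.min()) / span                 # scale values to [0, 1]
    gray = np.resize(norm, size * size).reshape(size, size)  # tile to fill grid
    # raw values in R, horizontal and vertical gradients of the field in G and B
    rgb = np.stack([gray,
                    np.gradient(gray, axis=1) * 0.5 + 0.5,
                    np.gradient(gray, axis=0) * 0.5 + 0.5], axis=-1)
    return (np.clip(rgb, 0.0, 1.0) * 255).astype(np.uint8)
```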
[284] Epanechnikov nonparametric kernel density estimation based feature-learning in respiratory disease chest X-ray images
Veronica Marsico, Antonio Quintero-Rincon, Hadj Batatia
Main category: cs.CV
TL;DR: A novel method for diagnosing respiratory diseases using Epanechnikov’s kernel density estimation (EKDE) with bimodal logistic regression on chest X-rays, achieving moderate performance with 70.14% accuracy.
Details
Motivation: To develop a flexible diagnostic method for respiratory diseases that can handle medical image data without assuming specific distribution shapes, addressing pixel intensity variations in chest X-rays.Method: Combines Epanechnikov’s non-parametric kernel density estimation (EKDE) with bimodal logistic regression classifier in a statistical-model-based learning scheme for feature extraction from medical images.
Result: Tested on 13,808 chest X-rays from COVID-19 Radiography Dataset, achieving 70.14% accuracy, 59.26% sensitivity, and 74.18% specificity, showing moderate performance with room for improvement in sensitivity.
Conclusion: EKDE-based approaches show potential to enhance diagnostic accuracy in medical imaging, though clinical expertise remains essential for further model refinement.
Abstract: This study presents a novel method for diagnosing respiratory diseases using image data. It combines Epanechnikov’s non-parametric kernel density estimation (EKDE) with a bimodal logistic regression classifier in a statistical-model-based learning scheme. EKDE’s flexibility in modeling data distributions without assuming specific shapes and its adaptability to pixel intensity variations make it valuable for extracting key features from medical images. The method was tested on 13,808 randomly selected chest X-rays from the COVID-19 Radiography Dataset, achieving an accuracy of 70.14%, a sensitivity of 59.26%, and a specificity of 74.18%, demonstrating moderate performance in detecting respiratory disease while leaving room for improvement in sensitivity. While clinical expertise remains essential for further refining the model, this study highlights the potential of EKDE-based approaches to enhance diagnostic accuracy and reliability in medical imaging.
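For readers unfamiliar with the Epanechnikov kernel, a minimal sketch is given below: the kernel K(u) = 0.75(1 - u²) for |u| ≤ 1 estimates the pixel-intensity density of each image on a fixed grid, and those density values feed a logistic-regression classifier. The grid size, bandwidth, subsampling, and the use of density evaluations as features are assumptions; the paper's exact feature construction may differ.

```python
# Hedged sketch: Epanechnikov KDE of pixel intensities as per-image features,
# classified with logistic regression. Bandwidth, grid, and subsampling are
# illustrative assumptions, not the paper's configuration.
import numpy as np
from sklearn.linear_model import LogisticRegression

def epanechnikov_kde(samples: np.ndarray, grid: np.ndarray, h: float = 0.1) -> np.ndarray:
    """Evaluate the Epanechnikov-kernel density estimate of `samples` at `grid`."""
    u = (grid[:, None] - samples[None, :]) / h
    k = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
    return k.mean(axis=1) / h

def image_density_features(gray: np.ndarray, n_points: int = 16) -> np.ndarray:
    intensities = gray.astype(float).ravel() / 255.0
    intensities = intensities[:: max(1, intensities.size // 5000)]  # subsample for speed
    return epanechnikov_kde(intensities, np.linspace(0.0, 1.0, n_points))

# usage sketch (labels: 0 = normal, 1 = disease):
# X = np.stack([image_density_features(img) for img in chest_xrays])
# clf = LogisticRegression(max_iter=1000).fit(X, labels)
```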
[285] Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models
Xiaoyu Zhan, Wenxuan Huang, Hao Sun, Xinyu Fu, Changfeng Ma, Shaosheng Cao, Bohan Jia, Shaohui Lin, Zhenfei Yin, Lei Bai, Wanli Ouyang, Yuanqi Li, Jie Guo, Yanwen Guo
Main category: cs.CV
TL;DR: The paper introduces Viewpoint Learning to evaluate and improve spatial reasoning in MLLMs using a 100K dataset and two-stage fine-tuning with SFT and GRPO, achieving significant performance improvements.
Details
Motivation: To address whether MLLMs can effectively capture detailed spatial information for robust 3D reasoning, particularly cross-view consistency needed for real-world applications.Method: Two-stage fine-tuning: 1) Supervised Fine-Tuning on Viewpoint-100K dataset, 2) Reinforcement Learning with GRPO algorithm. Also uses hybrid cold-start initialization for viewpoint representation learning.
Result: Significantly activates spatial reasoning ability in MLLMs, improving performance on both in-domain and out-of-domain reasoning tasks.
Conclusion: Developing foundational spatial skills in MLLMs is valuable for future progress in robotics, autonomous systems, and 3D scene understanding.
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved 2D visual understanding, prompting interest in their application to complex 3D reasoning tasks. However, it remains unclear whether these models can effectively capture the detailed spatial information required for robust real-world performance, especially cross-view consistency, a key requirement for accurate 3D reasoning. Considering this issue, we introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy: first, foundational knowledge is injected into the baseline MLLM via Supervised Fine-Tuning (SFT) on Viewpoint-100K, resulting in significant improvements across multiple tasks; second, generalization is enhanced through Reinforcement Learning using the Group Relative Policy Optimization (GRPO) algorithm on a broader set of questions. Additionally, we introduce a hybrid cold-start initialization method designed to simultaneously learn viewpoint representations and maintain coherent reasoning. Experimental results show that our approach significantly activates the spatial reasoning ability of MLLMs, improving performance on both in-domain and out-of-domain reasoning tasks. Our findings highlight the value of developing foundational spatial skills in MLLMs, supporting future progress in robotics, autonomous systems, and 3D scene understanding.
[286] Anatomically Constrained Transformers for Echocardiogram Analysis
Alexander Thorley, Agis Chartsias, Jordan Strom, Jeremy Slivnick, Dipak Kotecha, Alberto Gomez, Jinming Duan
Main category: cs.CV
TL;DR: ViACT integrates anatomical priors into video transformers for echocardiogram analysis, focusing representation learning on anatomical regions through masked autoencoding of anatomical patches.
Details
Motivation: Video transformers for echo analysis tend to learn spurious correlations from non-diagnostic regions like image backgrounds, limiting their effectiveness for clinically relevant tasks.Method: Proposes ViACT framework that represents deforming anatomical structures as point sets, encodes spatial geometry and image patches into tokens, and uses masked autoencoding that only reconstructs anatomical patches during pre-training.
Result: ViACT produces interpretable attention maps aligned with cardiac pathology regions, generalizes to myocardium point tracking without specialized components, and achieves strong performance on EF regression and CA detection tasks.
Conclusion: Anatomical constraints effectively focus transformer attention on clinically relevant regions, improving interpretability and performance for echocardiogram analysis tasks while maintaining generalization capabilities.
Abstract: Video transformers have recently demonstrated strong potential for echocardiogram (echo) analysis, leveraging self-supervised pre-training and flexible adaptation across diverse tasks. However, like other models operating on videos, they are prone to learning spurious correlations from non-diagnostic regions such as image backgrounds. To overcome this limitation, we propose the Video Anatomically Constrained Transformer (ViACT), a novel framework that integrates anatomical priors directly into the transformer architecture. ViACT represents a deforming anatomical structure as a point set and encodes both its spatial geometry and corresponding image patches into transformer tokens. During pre-training, ViACT follows a masked autoencoding strategy that masks and reconstructs only anatomical patches, enforcing that representation learning is focused on the anatomical region. The pre-trained model can then be fine-tuned for tasks localized to this region. In this work we focus on the myocardium, demonstrating the framework on echo analysis tasks such as left ventricular ejection fraction (EF) regression and cardiac amyloidosis (CA) detection. The anatomical constraint focuses transformer attention within the myocardium, yielding interpretable attention maps aligned with regions of known CA pathology. Moreover, ViACT generalizes to myocardium point tracking without requiring task-specific components such as correlation volumes used in specialized tracking networks.
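A small sketch of the masking rule described above follows: only patches that overlap the anatomical point set (the myocardium in this paper) are eligible for masking and reconstruction during pre-training. The patch grid, mask ratio, and the point-to-patch assignment are illustrative assumptions.

```python
# Hedged sketch of anatomically constrained masking for masked autoencoding:
# masking candidates are restricted to patches touched by the anatomical points.
import torch

def anatomical_mask(points: torch.Tensor, img_size: int = 224, patch: int = 16,
                    mask_ratio: float = 0.75) -> torch.Tensor:
    """points: (P, 2) tensor of (x, y) pixel coordinates of the structure.
    Returns a boolean mask over the patch grid marking patches to reconstruct."""
    grid = img_size // patch
    patch_ids = (points[:, 1] // patch).long() * grid + (points[:, 0] // patch).long()
    anatomical = torch.unique(patch_ids.clamp(0, grid * grid - 1))
    n_mask = max(1, int(mask_ratio * anatomical.numel()))
    chosen = anatomical[torch.randperm(anatomical.numel())[:n_mask]]
    mask = torch.zeros(grid * grid, dtype=torch.bool)
    mask[chosen] = True  # True = masked anatomical patch to be reconstructed
    return mask
```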
[287] Boosting performance of computer vision applications through embedded GPUs on the edge
Fabio Diniz Rossi
Main category: cs.CV
TL;DR: Using GPU-equipped embedded devices in edge computing to improve performance for computer vision applications, achieving better user experience compared to CPU-only solutions.
Details
Motivation: Computer vision and AR applications are resource-intensive, and edge computing devices have limited capacity, which can impact user experience quality.Method: Proposes using embedded devices with GPUs in edge computing to offload intensive computer vision tasks from mobile devices.
Result: Experiments showed GPUs achieve performance gains compared to using only CPUs, ensuring better user experience for computer vision applications.
Conclusion: GPU-equipped embedded devices in edge computing can overcome resource limitations and provide improved performance for computer vision applications.
Abstract: Computer vision applications, especially those using augmented reality technology, are becoming quite popular on mobile devices. However, this type of application is known for placing significant demands on resources. In order to enable its use on devices with more modest resources, edge computing can be used to offload certain compute-intensive tasks. Still, edge computing is usually composed of devices with limited capacity, which may impact users' quality of experience when running computer vision applications. This work proposes the use of embedded devices with graphics processing units (GPUs) to overcome this limitation. The experiments performed show that GPUs attain a performance gain compared to using only CPUs, which ensures a better experience for users of such applications.
[288] Weakly Supervised Concept Learning with Class-Level Priors for Interpretable Medical Diagnosis
Md Nahiduzzaman, Steven Korevaar, Alireza Bab-Hadiashar, Ruwan Tennakoon
Main category: cs.CV
TL;DR: PCP is a weakly supervised framework that enables concept prediction in medical imaging without requiring concept annotations, using class-level concept priors and refinement mechanisms to achieve competitive performance.
Details
Motivation: Most interpretable-by-design frameworks need costly concept annotations for training data, which are impractical in clinical settings. Existing annotation-free methods struggle with domain-specific medical features.Method: Uses Prior-guided Concept Predictor (PCP) with class-level concept priors as weak supervision, incorporating KL divergence and entropy regularization for refinement and clinical reasoning alignment.
Result: Improves concept-level F1-score by over 33% compared to zero-shot baselines on PH2 and WBCatt datasets, with competitive classification performance on four medical datasets relative to fully supervised models.
Conclusion: PCP provides an effective weakly supervised alternative for interpretable medical imaging predictions without requiring costly concept annotations.
Abstract: Human-interpretable predictions are essential for deploying AI in medical imaging, yet most interpretable-by-design (IBD) frameworks require concept annotations for training data, which are costly and impractical to obtain in clinical contexts. Recent attempts to bypass annotation, such as zero-shot vision-language models or concept-generation frameworks, struggle to capture domain-specific medical features, leading to poor reliability. In this paper, we propose a novel Prior-guided Concept Predictor (PCP), a weakly supervised framework that enables concept answer prediction without explicit supervision or reliance on language models. PCP leverages class-level concept priors as weak supervision and incorporates a refinement mechanism with KL divergence and entropy regularization to align predictions with clinical reasoning. Experiments on PH2 (dermoscopy) and WBCatt (hematology) show that PCP improves concept-level F1-score by over 33% compared to zero-shot baselines, while delivering competitive classification performance on four medical datasets (PH2, WBCatt, HAM10000, and CXR4) relative to fully supervised concept bottleneck models (CBMs) and V-IP.
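As a rough illustration of weak supervision from class-level concept priors, the sketch below pulls predicted concept probabilities toward the prior of each image's class with a KL term and adds an entropy regularizer. The prior table, the KL direction, and the loss weights are assumptions, not the paper's implementation.

```python
# Minimal sketch of a prior-guided concept loss (illustrative, not PCP's code).
import torch

def pcp_style_loss(concept_logits, class_labels, class_priors, lam_kl=1.0, lam_ent=0.1):
    """
    concept_logits: (B, C) raw scores for C binary concepts
    class_labels:   (B,) integer class index per image
    class_priors:   (K, C) prior probability of each concept for each class
    """
    p = torch.sigmoid(concept_logits)                       # predicted concept probabilities
    q = class_priors[class_labels].clamp(1e-6, 1 - 1e-6)    # class-level prior per sample
    # Bernoulli KL(q || p), summed over concepts
    kl = (q * (q / p.clamp_min(1e-6)).log()
          + (1 - q) * ((1 - q) / (1 - p).clamp_min(1e-6)).log()).sum(dim=1)
    # entropy regularizer pushes each concept prediction away from 0.5
    ent = -(p * p.clamp_min(1e-6).log()
            + (1 - p) * (1 - p).clamp_min(1e-6).log()).sum(dim=1)
    return (lam_kl * kl + lam_ent * ent).mean()

# toy usage: 3 classes, 8 concepts
priors = torch.rand(3, 8)
loss = pcp_style_loss(torch.randn(4, 8), torch.tensor([0, 2, 1, 0]), priors)
```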
[289] Learning with Category-Equivariant Architectures for Human Activity Recognition
Yoshihiro Maruyama
Main category: cs.CV
TL;DR: CatEquiv is a category-equivariant neural network for HAR that encodes temporal, amplitude, and structural symmetries through categorical symmetry products, achieving superior robustness to perturbations.
Details
Motivation: To systematically encode temporal, amplitude, and structural symmetries in HAR data to improve robustness against out-of-distribution perturbations.Method: Introduces categorical symmetry product combining cyclic time shifts, positive gains, and sensor-hierarchy poset to capture categorical symmetry structure, achieving equivariance with respect to this product.
Result: On UCI-HAR under out-of-distribution perturbations, CatEquiv achieves markedly higher robustness compared to circularly padded CNNs and plain CNNs.
Conclusion: Enforcing categorical symmetries yields strong invariance and generalization without requiring additional model capacity.
Abstract: We propose CatEquiv, a category-equivariant neural network for Human Activity Recognition (HAR) from inertial sensors that systematically encodes temporal, amplitude, and structural symmetries. In particular, we introduce the categorical symmetry product where cyclic time shifts, positive gains and the sensor-hierarchy poset together capture the categorical symmetry structure of the data. CatEquiv achieves equivariance with respect to the categorical symmetry product. On UCI-HAR under out-of-distribution perturbations, CatEquiv attains markedly higher robustness compared with circularly padded CNNs and plain CNNs. These results demonstrate that enforcing categorical symmetries yields strong invariance and generalization without additional model capacity.
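To give a feel for the symmetry product, here is a loose sketch of the group action on an inertial window: a cyclic time shift composed with a positive per-channel gain (the sensor-hierarchy poset is omitted). An equivariant encoder should respond predictably to inputs transformed this way; the window length, channel count, and gain range are assumptions.

```python
# Sketch of the cyclic-shift x positive-gain action on an IMU window (illustrative).
import torch

def act(x, shift, gains):
    """x: (T, S) window of T timesteps over S sensor channels."""
    x = torch.roll(x, shifts=shift, dims=0)        # cyclic time shift
    return x * gains                               # positive gain per channel

x = torch.randn(128, 9)                            # e.g. 9 IMU channels (UCI-HAR style)
gains = torch.rand(9) * 1.5 + 0.5                  # positive gains in [0.5, 2.0)
x_shifted_scaled = act(x, shift=16, gains=gains)
```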
[290] MicroAUNet: Boundary-Enhanced Multi-scale Fusion with Knowledge Distillation for Colonoscopy Polyp Image Segmentation
Ziyi Wang, Yuanmei Zhang, Dorna Esrafilzadeh, Ali R. Jalili, Suncheng Xiang
Main category: cs.CV
TL;DR: MicroAUNet is a lightweight attention-based segmentation network for real-time colorectal polyp segmentation that combines depthwise-separable dilated convolutions with channel-spatial attention and uses progressive knowledge distillation from a high-capacity teacher model.
Details
Motivation: Current deep learning polyp segmentation models either provide ambiguous polyp margins compromising clinical decisions or use heavy architectures with insufficient inference speeds for real-time endoscopic applications.Method: Proposes MicroAUNet with depthwise-separable dilated convolutions and parameter-shared channel-spatial attention block for multi-scale boundary features, plus progressive two-stage knowledge distillation from a high-capacity teacher model.
Result: Achieves state-of-the-art accuracy under extremely low model complexity, suitable for real-time clinical polyp segmentation.
Conclusion: MicroAUNet effectively addresses the trade-off between accuracy and computational efficiency for real-time colorectal polyp segmentation in clinical applications.
Abstract: Early and accurate segmentation of colorectal polyps is critical for reducing colorectal cancer mortality, which has been extensively explored by academia and industry. However, current deep learning-based polyp segmentation models either compromise clinical decision-making by providing ambiguous polyp margins in segmentation outputs or rely on heavy architectures with high computational complexity, resulting in insufficient inference speeds for real-time colorectal endoscopic applications. To address this problem, we propose MicroAUNet, a light-weighted attention-based segmentation network that combines depthwise-separable dilated convolutions with a single-path, parameter-shared channel-spatial attention block to strengthen multi-scale boundary features. On the basis of it, a progressive two-stage knowledge-distillation scheme is introduced to transfer semantic and boundary cues from a high-capacity teacher. Extensive experiments on benchmarks also demonstrate the state-of-the-art accuracy under extremely low model complexity, indicating that MicroAUNet is suitable for real-time clinical polyp segmentation. The code is publicly available at https://github.com/JeremyXSC/MicroAUNet.
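The two building blocks named in the abstract can be sketched briefly in PyTorch: a depthwise-separable dilated convolution and a single-path channel-spatial attention gate. Layer sizes, the reduction ratio, and the exact composition are assumptions rather than the released MicroAUNet architecture.

```python
# Illustrative PyTorch sketch of a depthwise-separable dilated conv block
# followed by a lightweight channel-spatial attention gate.
import torch
import torch.nn as nn

class DSDilatedConv(nn.Module):
    """Depthwise-separable 3x3 convolution with a dilation rate."""
    def __init__(self, ch, dilation=2):
        super().__init__()
        self.depthwise = nn.Conv2d(ch, ch, 3, padding=dilation,
                                   dilation=dilation, groups=ch, bias=False)
        self.pointwise = nn.Conv2d(ch, ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class ChannelSpatialAttention(nn.Module):
    """Channel gate from global pooling, spatial gate from a 7x7 conv."""
    def __init__(self, ch, r=8):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(ch, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)
        return x * self.spatial(x)

block = nn.Sequential(DSDilatedConv(32), ChannelSpatialAttention(32))
out = block(torch.randn(1, 32, 64, 64))   # spatial size preserved
```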
[291] ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation
Yongyuan Liang, Wei Chow, Feng Li, Ziqiao Ma, Xiyao Wang, Jiageng Mao, Jiuhai Chen, Jiatao Gu, Yue Wang, Furong Huang
Main category: cs.CV
TL;DR: ROVER is a benchmark for evaluating reciprocal cross-modal reasoning in unified multimodal models, testing how models use one modality to guide or verify outputs in another modality.
Details
Motivation: Existing evaluations treat multimodal abilities in isolation, focusing on unimodal reasoning rather than testing the reciprocal cross-modal reasoning that is central to true unified multimodal intelligence.Method: Created a human-annotated benchmark with 1312 tasks grounded in 1876 images, spanning two settings: verbally-augmented reasoning for visual generation and visually-augmented reasoning for verbal generation.
Result: Experiments on 17 unified models showed that cross-modal reasoning determines visual generation quality, with interleaved models outperforming non-interleaved ones, and revealed dissociation between physical and symbolic reasoning capabilities.
Conclusion: Reciprocal cross-modal reasoning is a critical frontier for enabling true omnimodal generation, as current models struggle with constructing visual abstractions for symbolic tasks despite good perceptual interpretation.
Abstract: Unified multimodal models (UMMs) have emerged as a powerful paradigm for seamlessly unifying text and image understanding and generation. However, prevailing evaluations treat these abilities in isolation, such that tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning, i.e., textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in the pixels. We introduce ROVER to address this pressing need to test reciprocal cross-modal reasoning, the use of one modality to guide, verify, or refine outputs in the other, an ability central to the vision of unified multimodal intelligence. ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning, which contains 1312 tasks grounded in 1876 images, spanning two complementary settings. Verbally-augmented reasoning for visual generation evaluates whether models can use verbal prompts and reasoning chains to guide faithful image synthesis. Visually-augmented reasoning for verbal generation evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes for question answering. Experiments on 17 unified models reveal two key findings: (i) Cross-modal reasoning determines visual generation quality, with interleaved models significantly outperforming non-interleaved ones; notably, combining strong unimodal models fails to achieve comparable reasoning. (ii) Models show dissociation between physical and symbolic reasoning: they succeed at interpreting perceptual concepts literally but fail to construct visual abstractions for symbolic tasks, where faulty reasoning harms performance. These results highlight reciprocal cross-modal reasoning as a critical frontier for enabling true omnimodal generation.
[292] Web-Scale Collection of Video Data for 4D Animal Reconstruction
Brian Nlong Zhao, Jiajun Wu, Shangzhe Wu
Main category: cs.CV
TL;DR: The paper introduces an automated pipeline for mining YouTube videos to create a large-scale animal video dataset (30K videos, 2M frames), presents Animal-in-Motion benchmark with 230 sequences for 4D reconstruction, and establishes baseline methods showing model-based approaches score better on 2D metrics but produce unrealistic 3D shapes, while model-free methods yield more natural reconstructions.
Details
Motivation: Current animal video datasets are limited in scale (only 2.4K clips) and lack processing for animal-centric 3D/4D tasks. There's a need for large-scale, non-invasive data collection methods for wildlife research that can support advanced computer vision tasks.Method: Developed an automated pipeline to mine and process YouTube videos into object-centric clips with auxiliary annotations. Created Animal-in-Motion benchmark with 230 manually filtered sequences. Evaluated state-of-the-art model-based and model-free methods, and enhanced a model-free approach with sequence-level optimization.
Result: Collected 30K videos (2M frames) - an order of magnitude larger than prior works. Found that model-based methods score better on 2D metrics but produce unrealistic 3D shapes, while model-free methods yield more natural reconstructions but score lower, revealing an evaluation gap. Established the first 4D animal reconstruction baseline.
Conclusion: The pipeline, benchmark, and baseline advance large-scale, markerless 4D animal reconstruction from in-the-wild videos. The work addresses limitations in current datasets and evaluation methods for animal-centric computer vision tasks.
Abstract: Computer vision for animals holds great promise for wildlife research but often depends on large-scale data, while existing collection methods rely on controlled capture setups. Recent data-driven approaches show the potential of single-view, non-invasive analysis, yet current animal video datasets are limited–offering as few as 2.4K 15-frame clips and lacking key processing for animal-centric 3D/4D tasks. We introduce an automated pipeline that mines YouTube videos and processes them into object-centric clips, along with auxiliary annotations valuable for downstream tasks like pose estimation, tracking, and 3D/4D reconstruction. Using this pipeline, we amass 30K videos (2M frames)–an order of magnitude more than prior works. To demonstrate its utility, we focus on the 4D quadruped animal reconstruction task. To support this task, we present Animal-in-Motion (AiM), a benchmark of 230 manually filtered sequences with 11K frames showcasing clean, diverse animal motions. We evaluate state-of-the-art model-based and model-free methods on Animal-in-Motion, finding that 2D metrics favor the former despite unrealistic 3D shapes, while the latter yields more natural reconstructions but scores lower–revealing a gap in current evaluation. To address this, we enhance a recent model-free approach with sequence-level optimization, establishing the first 4D animal reconstruction baseline. Together, our pipeline, benchmark, and baseline aim to advance large-scale, markerless 4D animal reconstruction and related tasks from in-the-wild videos. Code and datasets are available at https://github.com/briannlongzhao/Animal-in-Motion.
[293] Diffusion Transformer meets Multi-level Wavelet Spectrum for Single Image Super-Resolution
Peng Du, Hui Li, Han Xu, Paul Barom Jeon, Dongwook Lee, Daehyun Ji, Ran Yang, Feng Zhu
Main category: cs.CV
TL;DR: DTWSR is a Diffusion Transformer model that uses wavelet spectra for image super-resolution, capturing interrelations among multiscale frequency sub-bands to produce consistent and realistic results.
Details
Motivation: Existing DWT-based super-resolution methods neglect interrelations among multiscale frequency sub-bands, causing inconsistencies and artifacts in reconstructed images.Method: Uses Multi-level Discrete Wavelet Transform to decompose images into wavelet spectra, pyramid tokenization for transformer processing, and a dual-decoder to handle low-frequency and high-frequency sub-bands while maintaining alignment.
Result: Extensive experiments show high performance on both perception quality and fidelity across multiple benchmark datasets.
Conclusion: DTWSR effectively addresses frequency sub-band interrelation issues in super-resolution, producing superior results through diffusion models and transformers.
Abstract: The Discrete Wavelet Transform (DWT) has been widely explored to enhance the performance of image super-resolution (SR). Although some DWT-based methods improve SR by capturing fine-grained frequency signals, most existing approaches neglect the interrelations among multiscale frequency sub-bands, resulting in inconsistencies and unnatural artifacts in the reconstructed images. To address this challenge, we propose a Diffusion Transformer model based on image Wavelet spectra for SR (DTWSR). DTWSR combines the strengths of diffusion models and transformers to capture the interrelations among multiscale frequency sub-bands, leading to more consistent and realistic SR images. Specifically, we use a Multi-level Discrete Wavelet Transform (MDWT) to decompose images into wavelet spectra. A pyramid tokenization method is proposed that embeds the spectra into a sequence of tokens for the transformer model, facilitating the capture of features from both the spatial and frequency domains. A dual-decoder is carefully designed to handle the distinct variances in the low-frequency (LF) and high-frequency (HF) sub-bands, without omitting their alignment in image generation. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our method, with high performance on both perception quality and fidelity.
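A small sketch, assuming PyWavelets, of decomposing an image into a multi-level wavelet spectrum and flattening the sub-bands into one coarse-to-fine token sequence, in the spirit of the pyramid tokenization described above. The wavelet, level count, and token layout are assumptions.

```python
# Multi-level DWT -> flattened sub-band "tokens" (illustrative, not DTWSR's code).
import numpy as np
import pywt

def wavelet_pyramid_tokens(img, wavelet="haar", level=2):
    """img: (H, W) grayscale array -> list of (band_name, token_matrix) pairs."""
    coeffs = pywt.wavedec2(img, wavelet, level=level)
    tokens = [("LL%d" % level, coeffs[0].reshape(-1, 1))]        # coarsest low-frequency band
    for i, (cH, cV, cD) in enumerate(coeffs[1:], start=1):
        lvl = level - i + 1                                      # bands get finer as we go
        for name, band in zip(("H", "V", "D"), (cH, cV, cD)):
            tokens.append(("%s%d" % (name, lvl), band.reshape(-1, 1)))
    return tokens

img = np.random.rand(64, 64).astype(np.float32)
for name, tok in wavelet_pyramid_tokens(img):
    print(name, tok.shape)    # LL2 (256, 1), H2/V2/D2 (256, 1), H1/V1/D1 (1024, 1)
```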
[294] A Topology-Aware Graph Convolutional Network for Human Pose Similarity and Action Quality Assessment
Minmin Zeng
Main category: cs.CV
TL;DR: Proposes GCN-PSN, a topology-aware Graph Convolutional Network that models human skeleton as a graph to learn discriminative pose embeddings for Action Quality Assessment, achieving competitive performance on benchmarks.
Details
Motivation: Action Quality Assessment requires fine-grained understanding of human motion and precise evaluation of pose similarity, which benefits from modeling skeletal topology.Method: Uses a topology-aware Graph Convolutional Network (GCN-PSN) with Siamese architecture trained with contrastive regression objective to learn discriminative pose embeddings from human skeleton graphs.
Result: Outperforms coordinate-based baselines and achieves competitive performance on AQA-7 and FineDiving benchmarks. Ablation studies validate effectiveness of leveraging skeletal topology.
Conclusion: Modeling human skeleton topology through GCNs is effective for pose similarity and action quality assessment, demonstrating the importance of structural information in motion analysis.
Abstract: Action Quality Assessment (AQA) requires fine-grained understanding of human motion and precise evaluation of pose similarity. This paper proposes a topology-aware Graph Convolutional Network (GCN) framework, termed GCN-PSN, which models the human skeleton as a graph to learn discriminative, topology-sensitive pose embeddings. Using a Siamese architecture trained with a contrastive regression objective, our method outperforms coordinate-based baselines and achieves competitive performance on AQA-7 and FineDiving benchmarks. Experimental results and ablation studies validate the effectiveness of leveraging skeletal topology for pose similarity and action quality assessment.
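A hedged sketch of the core idea: a graph convolution over the skeleton topology yields a pose embedding, and a Siamese pair of embeddings is trained so that a distance-based score regresses toward a similarity or quality target. The joint count, adjacency, pooling, and loss weighting below are illustrative assumptions.

```python
# Siamese skeleton-GCN with a contrastive-regression objective (illustrative).
import torch
import torch.nn as nn

class SkeletonGCN(nn.Module):
    def __init__(self, adj, in_dim=3, hid=64, out=128):
        super().__init__()
        deg = adj.sum(1, keepdim=True).clamp_min(1.0)
        self.register_buffer("A_hat", adj / deg)           # row-normalized adjacency
        self.w1 = nn.Linear(in_dim, hid)
        self.w2 = nn.Linear(hid, out)

    def forward(self, x):                                   # x: (B, J, 3) joint coordinates
        h = torch.relu(self.w1(self.A_hat @ x))             # neighborhood aggregation
        h = self.w2(self.A_hat @ h)
        return h.mean(dim=1)                                 # (B, out) pose embedding

J = 17
adj = torch.eye(J)                                           # self-loops; real edges follow the skeleton
adj[0, 1] = adj[1, 0] = 1.0                                  # one example bone
enc = SkeletonGCN(adj)
e1, e2 = enc(torch.randn(4, J, 3)), enc(torch.randn(4, J, 3))
pred_sim = torch.exp(-(e1 - e2).pow(2).sum(dim=1))           # predicted similarity in (0, 1]
loss = nn.functional.mse_loss(pred_sim, torch.rand(4))       # regress toward a similarity target
```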
[295] MoSa: Motion Generation with Scalable Autoregressive Modeling
Mengyuan Liu, Sheng Yan, Yong Wang, Yingjie Li, Gui-Bin Bian, Hong Liu
Main category: cs.CV
TL;DR: MoSa is a hierarchical motion generation framework that uses a coarse-to-fine scalable generation process with Multi-scale Token Preservation Strategy and a novel CAQ-VAE architecture, achieving state-of-the-art performance in text-driven 3D human motion generation.
Details
Motivation: To improve text-driven 3D human motion generation by addressing limitations of traditional Vector Quantization-guided Generative Transformers, particularly the inefficiency of requiring many inference steps and potential reconstruction degradation from interpolation.Method: Proposes Multi-scale Token Preservation Strategy (MTPS) with hierarchical residual vector quantization variational autoencoder (RQ-VAE), Scalable Autoregressive (SAR) modeling that predicts multiple scale tokens per step, and CAQ-VAE - a lightweight convolution-attention hybrid VQ-VAE to address interpolation degradation.
Result: Achieves FID of 0.06 on Motion-X dataset (vs MoMask’s 0.20) with 27% reduction in inference time, requiring only 10 inference steps matching RQ-VAE quantization layers. Shows strong generalization to downstream tasks like motion editing without fine-tuning.
Conclusion: MoSa demonstrates superior generation quality and efficiency for text-driven 3D human motion generation, outperforming prior methods in both fidelity and speed while maintaining good generalization capabilities.
Abstract: We introduce MoSa, a novel hierarchical motion generation framework for text-driven 3D human motion generation that enhances the Vector Quantization-guided Generative Transformers (VQ-GT) paradigm through a coarse-to-fine scalable generation process. In MoSa, we propose a Multi-scale Token Preservation Strategy (MTPS) integrated into a hierarchical residual vector quantization variational autoencoder (RQ-VAE). MTPS employs interpolation at each hierarchical quantization to effectively retain coarse-to-fine multi-scale tokens. With this, the generative transformer supports Scalable Autoregressive (SAR) modeling, which predicts scale tokens, unlike traditional methods that predict only one token at each step. Consequently, MoSa requires only 10 inference steps, matching the number of RQ-VAE quantization layers. To address potential reconstruction degradation from frequent interpolation, we propose CAQ-VAE, a lightweight yet expressive convolution-attention hybrid VQ-VAE. CAQ-VAE enhances residual block design and incorporates attention mechanisms to better capture global dependencies. Extensive experiments show that MoSa achieves state-of-the-art generation quality and efficiency, outperforming prior methods in both fidelity and speed. On the Motion-X dataset, MoSa achieves an FID of 0.06 (versus MoMask’s 0.20) while reducing inference time by 27 percent. Moreover, MoSa generalizes well to downstream tasks such as motion editing, requiring no additional fine-tuning. The code is available at https://mosa-web.github.io/MoSa-web
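The coarse-to-fine, interpolation-based residual quantization behind MTPS can be illustrated with a short sketch: each quantization level works at a different temporal scale, interpolation moves between scales, and the generator would then emit one whole scale of tokens per step. Codebook sizes, scales, and interpolation modes are assumptions, not the released design.

```python
# Multi-scale residual quantization with interpolation (illustrative sketch).
import torch
import torch.nn.functional as F

def quantize(x, codebook):
    """Nearest-neighbour vector quantization. x: (B, T, D), codebook: (K, D)."""
    d = torch.cdist(x.reshape(-1, x.size(-1)), codebook)     # (B*T, K)
    idx = d.argmin(dim=1)
    return codebook[idx].view_as(x), idx.view(x.shape[:2])

def multiscale_rq(x, codebooks, scales=(4, 8, 16)):
    """Residual quantization across temporal scales; returns per-scale token ids."""
    B, T, D = x.shape
    residual, ids = x, []
    for cb, s in zip(codebooks, scales):
        coarse = F.interpolate(residual.transpose(1, 2), size=s).transpose(1, 2)
        q, tok = quantize(coarse, cb)
        ids.append(tok)                                       # one "scale" of tokens
        up = F.interpolate(q.transpose(1, 2), size=T, mode="linear",
                           align_corners=False).transpose(1, 2)
        residual = residual - up                              # pass the residual to the next level
    return ids

x = torch.randn(2, 16, 32)                                    # (batch, frames, latent dim)
books = [torch.randn(512, 32) for _ in range(3)]
token_ids = multiscale_rq(x, books)                           # 3 scales: 4, 8, 16 tokens each
```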
[296] OmniVLA: Unifying Multi-Sensor Perception for Physically-Grounded Multimodal VLA
Heyu Guo, Shanmu Wang, Ruichun Ma, Shiqi Jiang, Yasaman Ghasempour, Omid Abari, Baining Guo, Lili Qi
Main category: cs.CV
TL;DR: OmniVLA is an omni-modality vision-language-action model that integrates multiple sensing modalities (infrared camera, mmWave radar, microphone array) through sensor-masked images to enhance perception and manipulation capabilities beyond RGB-only models.
Details
Motivation: Most existing VLA models rely solely on RGB cameras, which limits their perception and manipulation capabilities. There's a need to incorporate additional sensing modalities for better physically-grounded spatial intelligence.Method: Uses sensor-masked images - a unified representation that overlays spatially grounded masks from different sensors onto RGB images. Built on RGB-pretrained VLA backbone with lightweight per-sensor projectors for data-efficient learning.
Result: Achieves 84% average task success rate, outperforming RGB-only models by 59% and raw-sensor-input baselines by 28%. Shows higher learning efficiency and stronger generalization capability.
Conclusion: Integrating multiple sensing modalities through sensor-masked images significantly enhances VLA model performance for manipulation tasks, providing better perception and spatial intelligence than RGB-only approaches.
Abstract: Vision-language-action (VLA) models have shown strong generalization for action prediction through large-scale vision-language pretraining. However, most existing models rely solely on RGB cameras, limiting their perception and, consequently, manipulation capabilities. We present OmniVLA, an omni-modality VLA model that integrates novel sensing modalities for physically-grounded spatial intelligence beyond RGB perception. The core of our approach is the sensor-masked image, a unified representation that overlays spatially grounded and physically meaningful masks onto the RGB images, derived from sensors including an infrared camera, a mmWave radar, and a microphone array. This image-native unification keeps sensor input close to RGB statistics to facilitate training, provides a uniform interface across sensor hardware, and enables data-efficient learning with lightweight per-sensor projectors. Built on this, we present a multisensory vision-language-action model architecture and train the model based on an RGB-pretrained VLA backbone. We evaluate OmniVLA on challenging real-world tasks where sensor-modality perception is needed to guide the manipulation. OmniVLA achieves an average task success rate of 84%, significantly outperforming both RGB-only and raw-sensor-input baseline models by 59% and 28% respectively, while showing higher learning efficiency and stronger generalization capability.
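The sensor-masked image idea can be sketched in a few lines: a spatially grounded map from an extra sensor (thermal, radar, audio direction-of-arrival), already registered to the camera, is rendered as a colored overlay on the RGB frame so every modality enters the model as an RGB-like image. The threshold, color, and blending below are illustrative assumptions.

```python
# Compose an RGB frame with a registered sensor map into a "sensor-masked image" (sketch).
import numpy as np

def sensor_masked_image(rgb, sensor_map, threshold=0.6, color=(255, 0, 0), alpha=0.5):
    """
    rgb:        (H, W, 3) uint8 camera frame
    sensor_map: (H, W) float map registered to the camera (e.g. normalized thermal
                intensity or radar return strength)
    """
    mask = sensor_map >= threshold                       # region the sensor highlights
    out = rgb.astype(np.float32)
    overlay = np.array(color, dtype=np.float32)
    out[mask] = (1 - alpha) * out[mask] + alpha * overlay
    return out.astype(np.uint8)

rgb = np.zeros((240, 320, 3), dtype=np.uint8)
thermal = np.random.rand(240, 320)                       # stand-in for a registered infrared map
masked_img = sensor_masked_image(rgb, thermal)           # fed to the VLA like a normal image
```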
[297] Thought-For-Food: Reasoning Chain Induced Food Visual Question Answering
Riddhi Jain, Manasi Patwardhan, Parijat Deshpande, Venkataramana Runkana
Main category: cs.CV
TL;DR: This paper proposes a multi-step reasoning approach for Indian Food VQA, using auto-validated reasoning chains and reinforcement learning to improve accuracy by 10 percentage points over baseline methods.
Details
Motivation: Existing VQA systems are biased towards Western foods and fail to handle the complex culinary context and relationships in diverse Indian cuisines. Current Indian food VQA approaches use a two-step process that doesn't adequately capture the required multi-step reasoning.Method: Created reasoning chains for QA with minimal human intervention, fine-tuned smaller LLMs and VLMs with auto-validated reasoning chains, and used reinforcement learning with larger datasets for training.
Result: Achieved an average 10 percentage point improvement in accuracy on the baseline Indian Food VQA task through reasoning chain augmentation.
Conclusion: Multi-step reasoning chains are essential for accurate Indian Food VQA, and the proposed approach effectively captures the complex culinary context and relationships in Indian cuisine.
Abstract: The immense cultural and culinary diversity of Indian cuisines calls attention to a major shortcoming of existing Visual Question Answering (VQA) systems, which are inclined towards foods from Western regions. A recent attempt at building a VQA dataset for Indian food is a step towards addressing this challenge. However, its approach to VQA follows a two-step process in which the answer is generated first, followed by the explanation of the expected answer. In this work, we claim that food VQA requires a multi-step reasoning process to arrive at an accurate answer, especially in the context of Indian food, which involves understanding complex culinary context and identifying relationships between various food items. With this hypothesis, we create reasoning chains over the QA pairs with minimal human intervention. We fine-tune smaller LLMs and VLMs with auto-validated reasoning chains and further train them using reinforcement learning on larger data. With the augmentation of reasoning chains, we observed an average accuracy improvement of 10 percentage points over the baseline. We provide a detailed analysis of the effect of adding reasoning chains for the Indian Food VQA task. Index Terms - FoodVQA, Reasoning Chains, Reinforcement Learning, Knowledge Graph.
[298] Saliency-Guided Domain Adaptation for Left-Hand Driving in Autonomous Steering
Zahra Mehraban, Sebastien Glaser, Michael Milford, Ronald Schroeter
Main category: cs.CV
TL;DR: This paper explores domain adaptation methods for autonomous driving models, specifically adapting PilotNet from right-hand to left-hand driving conditions using Australian highway data. The study evaluates four training approaches and finds that pretraining on flipped data followed by fine-tuning yields the best results.
Details
Motivation: Domain adaptation is needed for automated driving models to generalize across diverse road conditions, particularly when adapting from right-hand to left-hand driving environments.Method: Four training methods were evaluated: baseline model on US data, model on flipped US data, pretrained on US data then fine-tuned on Australian highways, and pretrained on flipped US data then fine-tuned on Australian highways. Saliency-based analysis was used to measure attention shifts.
Result: Pretraining on flipped data alone worsens prediction stability, but significantly improves adaptation when followed by fine-tuning, leading to lower prediction error and stronger focus on left-side cues. Similar trends were confirmed with ResNet architecture.
Conclusion: Preprocessing techniques like flipped-data pretraining followed by fine-tuning improve model adaptation with minimal retraining requirements, emphasizing the importance of proper domain adaptation strategies.
Abstract: Domain adaptation is required for automated driving models to generalize well across diverse road conditions. This paper explores a training method for domain adaptation to adapt PilotNet, an end-to-end deep learning-based model, for left-hand driving conditions using real-world Australian highway data. Four training methods were evaluated: (1) a baseline model trained on U.S. right-hand driving data, (2) a model trained on flipped U.S. data, (3) a model pretrained on U.S. data and then fine-tuned on Australian highways, and (4) a model pretrained on flipped U.S. data and then finetuned on Australian highways. This setup examines whether incorporating flipped data enhances the model adaptation by providing an initial left-hand driving alignment. The paper compares model performance regarding steering prediction accuracy and attention, using saliency-based analysis to measure attention shifts across significant road regions. Results show that pretraining on flipped data alone worsens prediction stability due to misaligned feature representations, but significantly improves adaptation when followed by fine-tuning, leading to lower prediction error and stronger focus on left-side cues. To validate this approach across different architectures, the same experiments were done on ResNet, which confirmed similar adaptation trends. These findings emphasize the importance of preprocessing techniques, such as flipped-data pretraining, followed by fine-tuning to improve model adaptation with minimal retraining requirements.
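A minimal sketch of the flipped-data idea described above: mirroring a right-hand-traffic frame horizontally and negating the steering label gives a rough left-hand-traffic sample for pretraining before fine-tuning on real Australian data. The input resolution and the sign convention for steering are assumptions.

```python
# Horizontal-flip augmentation for right-to-left-hand driving adaptation (sketch).
import numpy as np

def flip_for_left_hand(frame, steering_angle):
    """frame: (H, W, 3) image; steering_angle: degrees, positive = right turn (assumed)."""
    flipped = frame[:, ::-1].copy()          # mirror left-right
    return flipped, -steering_angle          # a right curve becomes a left curve

frame = np.zeros((66, 200, 3), dtype=np.uint8)   # PilotNet-style input resolution
aug_frame, aug_angle = flip_for_left_hand(frame, 5.0)
```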
[299] Gesture Generation (Still) Needs Improved Human Evaluation Practices: Insights from a Community-Driven State-of-the-Art Benchmark
Rajmund Nagy, Hendric Voss, Thanh Hoang-Minh, Mihail Tsakov, Teodor Nikolov, Zeyi Zhang, Tenglong Ao, Sicheng Yang, Shaoli Huang, Yongkang Cheng, M. Hamza Mughal, Rishabh Dabral, Kiran Chhatre, Christian Theobalt, Libin Liu, Stefan Kopp, Rachel McDonnell, Michael Neff, Taras Kucherenko, Youngwoo Yoon, Gustav Eje Henter
Main category: cs.CV
TL;DR: The paper identifies flaws in human evaluation practices for speech-driven 3D gesture generation and introduces a standardized evaluation protocol for the BEAT2 dataset, benchmarking six models to reveal that newer models don’t consistently outperform older ones and published claims may not hold under rigorous testing.
Details
Motivation: To address the lack of standardization and flawed experimental setups in human evaluation of automated speech-driven 3D gesture generation, which makes it impossible to compare methods or determine state-of-the-art.Method: Introduced a detailed human evaluation protocol for BEAT2 dataset, conducted large-scale crowdsourced evaluation of six recent gesture-generation models across motion realism and speech-gesture alignment dimensions.
Result: Found that newer models don’t consistently outperform earlier approaches, published claims of high motion realism or alignment may not hold under rigorous evaluation, and the field needs disentangled assessments of motion quality and multimodal alignment.
Conclusion: The field must adopt standardized, disentangled evaluation protocols for accurate benchmarking. The authors will release synthetic motion data, video stimuli, rendering scripts, and human preference votes to drive standardization and enable new evaluation research.
Abstract: We review human evaluation practices in automated, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. This leads to a situation where it is impossible to know how different methods compare, or what the state of the art is. In order to address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset. Using this protocol, we conduct large-scale crowdsourced evaluation to rank six recent gesture-generation models – each trained by its original authors – across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results provide strong evidence that 1) newer models do not consistently outperform earlier approaches; 2) published claims of high motion realism or speech-gesture alignment may not hold up under rigorous evaluation; and 3) the field must adopt disentangled assessments of motion quality and multimodal alignment for accurate benchmarking in order to make progress. Finally, in order to drive standardisation and enable new evaluation research, we will release five hours of synthetic motion from the benchmarked models; over 750 rendered video stimuli from the user studies – enabling new evaluations without model reimplementation required – alongside our open-source rendering script, and the 16,000 pairwise human preference votes collected for our benchmark.
[300] Eyes on Target: Gaze-Aware Object Detection in Egocentric Video
Vishakha Lall, Yisi Liu
Main category: cs.CV
TL;DR: Eyes on Target is a gaze-guided object detection framework that integrates human gaze features into Vision Transformers to bias attention toward human-attended regions in egocentric videos, improving detection accuracy.
Details
Motivation: Human gaze provides valuable supervisory signals for understanding visual attention in complex environments, especially in egocentric videos where viewer attention is crucial for task assessment.Method: Inject gaze-derived features into Vision Transformer’s attention mechanism to bias spatial feature selection toward human-attended regions, and introduce gaze-aware attention head importance metric.
Result: Consistent gains in detection accuracy over gaze-agnostic baselines on custom simulator dataset and public benchmarks (Ego4D Ego-Motion, Ego-CH-Gaze datasets).
Conclusion: The framework effectively leverages gaze cues to enhance object detection in egocentric videos and provides interpretability through gaze-aware attention analysis.
Abstract: Human gaze offers rich supervisory signals for understanding visual attention in complex visual environments. In this paper, we propose Eyes on Target, a novel depth-aware and gaze-guided object detection framework designed for egocentric videos. Our approach injects gaze-derived features into the attention mechanism of a Vision Transformer (ViT), effectively biasing spatial feature selection toward human-attended regions. Unlike traditional object detectors that treat all regions equally, our method emphasises viewer-prioritised areas to enhance object detection. We validate our method on an egocentric simulator dataset where human visual attention is critical for task assessment, illustrating its potential in evaluating human performance in simulation scenarios. We evaluate the effectiveness of our gaze-integrated model through extensive experiments and ablation studies, demonstrating consistent gains in detection accuracy over gaze-agnostic baselines on both the custom simulator dataset and public benchmarks, including Ego4D Ego-Motion and Ego-CH-Gaze datasets. To interpret model behaviour, we also introduce a gaze-aware attention head importance metric, revealing how gaze cues modulate transformer attention dynamics.
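One way to picture injecting gaze into a ViT, sketched below under clear assumptions: a per-patch gaze salience score (for instance, a Gaussian around the fixation point) is added as a bias to the attention logits so keys near the gaze receive more attention mass. The bias scale and rasterization are not the paper's exact mechanism.

```python
# Gaze-biased self-attention (illustrative sketch, not the authors' code).
import torch
import torch.nn.functional as F

def gaze_biased_attention(q, k, v, gaze_scores, bias_scale=2.0):
    """
    q, k, v:      (B, heads, N, d) token projections
    gaze_scores:  (B, N) per-patch gaze salience in [0, 1]
    """
    logits = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5      # (B, heads, N, N)
    bias = bias_scale * gaze_scores[:, None, None, :]          # bias keys by their gaze score
    attn = F.softmax(logits + bias, dim=-1)
    return attn @ v

B, H, N, d = 1, 8, 196, 64
out = gaze_biased_attention(torch.randn(B, H, N, d), torch.randn(B, H, N, d),
                            torch.randn(B, H, N, d), torch.rand(B, N))
```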
[301] Beyond Deceptive Flatness: Dual-Order Solution for Strengthening Adversarial Transferability
Zhixuan Zhang, Pingyu Wang, Xingjian Zheng, Linbo Qing, Qi Liu
Main category: cs.CV
TL;DR: The paper introduces Adversarial Flatness Attack (AFA) to address deceptive flatness in transferable attacks, using dual-order information and MonteCarlo Adversarial Sampling to improve adversarial transferability across models.
Details
Motivation: Current transferable attacks focus on flat losses but still fall into suboptimal regions (deceptive flatness), limiting their effectiveness against unknown victim models.Method: Proposes Adversarial Flatness (AF) to solve deceptive flatness, develops AFA attack with efficient approximation, and introduces MonteCarlo Adversarial Sampling (MCAS) for better sampling efficiency.
Result: Comprehensive results on ImageNet-compatible dataset show superiority over six baselines, generating flatter adversarial examples and boosting transferability across model architectures. Outperforms baselines on input transformation attacks and Baidu Cloud API.
Conclusion: The proposed AFA method effectively addresses deceptive flatness and significantly improves adversarial transferability through dual-order information and efficient sampling techniques.
Abstract: Transferable attacks generate adversarial examples on surrogate models to fool unknown victim models, posing real-world threats and attracting growing research interest. Despite focusing on flat losses for transferable adversarial examples, recent studies still fall into suboptimal regions, especially flat-yet-sharp areas, termed deceptive flatness. In this paper, we introduce a novel black-box gradient-based transferable attack from the perspective of dual-order information. Specifically, we propose Adversarial Flatness (AF) to address the deceptive flatness problem, together with a theoretical assurance for adversarial transferability. Based on this, using an efficient approximation of our objective, we instantiate our attack as the Adversarial Flatness Attack (AFA), addressing the altered gradient sign issue. Additionally, to further improve attack ability, we devise MonteCarlo Adversarial Sampling (MCAS) by enhancing the inner-loop sampling efficiency. Comprehensive results on an ImageNet-compatible dataset demonstrate superiority over six baselines, generating adversarial examples in flatter regions and boosting transferability across model architectures. When tested on input transformation attacks or the Baidu Cloud API, our method outperforms the baselines.
[302] CenterMamba-SAM: Center-Prioritized Scanning and Temporal Prototypes for Brain Lesion Segmentation
Yu Tian, Zhongheng Yang, Chenshi Liu, Yiyun Su, Ziwei Hong, Zexi Gong, Jingyuan Xu
Main category: cs.CV
TL;DR: CenterMamba-SAM is an end-to-end framework for brain lesion segmentation that uses a frozen pretrained backbone with lightweight adapters, featuring a novel 3x3 corner-axis-center scanning strategy and memory-driven structural prompts for improved boundary sensitivity and inter-slice coherence.
Details
Motivation: Brain lesion segmentation faces challenges including small, low-contrast lesions, anisotropic sampling, and cross-slice discontinuities that make accurate segmentation difficult.Method: The method employs CenterMamba encoder with 3x3 corner-axis-center short-sequence scanning for center-prioritized information aggregation, memory-driven structural prompt generator for automatic prompt synthesis, and memory-augmented multi-scale decoder with deep supervision.
Result: Extensive experiments on public benchmarks show that CenterMamba-SAM achieves state-of-the-art performance in brain lesion segmentation.
Conclusion: The proposed framework effectively addresses brain lesion segmentation challenges through innovative scanning strategies and memory-driven components, demonstrating superior performance compared to existing methods.
Abstract: Brain lesion segmentation remains challenging due to small, low-contrast lesions, anisotropic sampling, and cross-slice discontinuities. We propose CenterMamba-SAM, an end-to-end framework that freezes a pretrained backbone and trains only lightweight adapters for efficient fine-tuning. At its core is the CenterMamba encoder, which employs a novel 3x3 corner-axis-center short-sequence scanning strategy to enable center-prioritized, axis-reinforced, and diagonally compensated information aggregation. This design enhances sensitivity to weak boundaries and tiny foci while maintaining sparse yet effective feature representation. A memory-driven structural prompt generator maintains a prototype bank across neighboring slices, enabling automatic synthesis of reliable prompts without user interaction, thereby improving inter-slice coherence. The memory-augmented multi-scale decoder integrates memory attention modules at multiple levels, combining deep supervision with progressive refinement to restore fine details while preserving global consistency. Extensive experiments on public benchmarks demonstrate that CenterMamba-SAM achieves state-of-the-art performance in brain lesion segmentation.
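A hedged sketch of what a "corner-axis-center" scan might look like on a 3x3 window: visit the four corners, then the four axis-aligned neighbors, then the center, so the center token closes out its whole neighborhood. Applying this per 3x3 block of a feature map, and the exact ordering, are assumptions about the design rather than the released code.

```python
# Corner -> axis -> center scan order over 3x3 blocks of a feature map (illustrative).
import numpy as np

CORNER = [(0, 0), (0, 2), (2, 0), (2, 2)]
AXIS   = [(0, 1), (1, 0), (1, 2), (2, 1)]
CENTER = [(1, 1)]
SCAN_3X3 = CORNER + AXIS + CENTER            # short sequence of length 9

def scan_blocks(feat):
    """feat: (H, W, C) with H, W divisible by 3 -> (num_blocks, 9, C) scan sequences."""
    H, W, C = feat.shape
    seqs = []
    for i in range(0, H, 3):
        for j in range(0, W, 3):
            block = feat[i:i + 3, j:j + 3]
            seqs.append(np.stack([block[r, c] for r, c in SCAN_3X3]))
    return np.stack(seqs)

feat = np.random.rand(6, 6, 16).astype(np.float32)
sequences = scan_blocks(feat)                 # (4, 9, 16), ready for a sequence model
```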
[303] Source-Only Cross-Weather LiDAR via Geometry-Aware Point Drop
YoungJae Cheong, Jhonghyun An
Main category: cs.CV
TL;DR: A Light Geometry-aware adapter improves LiDAR semantic segmentation in adverse weather by preserving neighbor continuity and applying region-aware regularization to structurally fragile areas.
Details
Motivation: LiDAR semantic segmentation degrades in adverse weather due to corrupted geometry from refraction, scattering, and point dropouts. Prior methods overlook structural vulnerabilities near boundaries, corners, and sparse regions.Method: The adapter aligns azimuth and applies horizontal circular padding to preserve neighbor continuity. It uses local-window K-Nearest Neighbors to compute local statistics, compressed into geometry-aware cues that drive region-aware regularization during training.
Result: In source-only cross-weather evaluation (training on SemanticKITTI, testing on SemanticSTF), the adapter improves mIoU by 7.9 percentage points over data-centric augmentation baseline and by 0.6 points over class-centric regularization baseline.
Conclusion: Geometry-driven regularization is a key direction for all-weather LiDAR segmentation, with the plug-and-play adapter providing significant improvements with negligible inference cost.
Abstract: LiDAR semantic segmentation degrades in adverse weather because refraction, scattering, and point dropouts corrupt geometry. Prior work in weather simulation, mixing-based augmentation, domain randomization, and uncertainty or boundary regularization improves robustness but still overlooks structural vulnerabilities near boundaries, corners, and sparse regions. We present a Light Geometry-aware adapter. The module aligns azimuth and applies horizontal circular padding to preserve neighbor continuity across the 0~360 degree wrap-around boundary. A local-window K-Nearest Neighbors gathers nearby points and computes simple local statistics, which are compressed into compact geometry-aware cues. During training, these cues drive region-aware regularization that stabilizes predictions in structurally fragile areas. The adapter is plug and play, complements augmentation, and can be enabled only during training with negligible inference cost. We adopt a source-only cross-weather setup where models train on SemanticKITTI and are evaluated on SemanticSTF without target labels or fine-tuning. The adapter improves mIoU by 7.9 percentage points over the data-centric augmentation baseline and by 0.6 points over the class-centric regularization baseline. These results indicate that geometry-driven regularization is a key direction for all-weather LiDAR segmentation.
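The two mechanisms named in the abstract can be sketched on a range-image view of a LiDAR scan: horizontal circular padding so the 0/360 degree seam keeps its neighbors, and simple local statistics from a k-nearest-neighbor window as geometry-aware cues. Window sizes and the choice of statistics below are illustrative assumptions.

```python
# Circular azimuth padding and kNN-based local geometry cues (illustrative sketch).
import torch
import torch.nn.functional as F

def circular_pad_width(range_img, pad=2):
    """range_img: (B, C, H, W); wrap the azimuth (width) dimension circularly."""
    return F.pad(range_img, (pad, pad, 0, 0), mode="circular")

def local_point_stats(points, k=8):
    """points: (N, 3); per-point mean kNN distance and local spread as geometry cues."""
    d = torch.cdist(points, points)                     # (N, N) pairwise distances
    knn_d, knn_idx = d.topk(k + 1, largest=False)       # includes the point itself
    mean_dist = knn_d[:, 1:].mean(dim=1)                # sparsity cue
    neigh = points[knn_idx[:, 1:]]                      # (N, k, 3)
    centered = neigh - neigh.mean(dim=1, keepdim=True)
    cov_trace = (centered ** 2).sum(dim=(1, 2)) / k     # local spread / planarity cue
    return torch.stack([mean_dist, cov_trace], dim=1)   # (N, 2) geometry-aware cues

img = torch.randn(1, 5, 64, 2048)                        # typical range-image layout
padded = circular_pad_width(img)
cues = local_point_stats(torch.randn(1000, 3))
```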
[304] MotionStream: Real-Time Video Generation with Interactive Motion Controls
Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Schechtman, Xun Huang
Main category: cs.CV
TL;DR: MotionStream enables real-time motion-conditioned video generation with sub-second latency and up to 29 FPS streaming, addressing the prohibitive latency and non-causal processing limitations of current methods.
Details
Motivation: Current motion-conditioned video generation methods suffer from high latency (minutes per video) and non-causal processing that prevents real-time interaction, creating a need for streaming video generation.Method: Distill a bidirectional text-to-video teacher model into a causal student using Self Forcing with Distribution Matching Distillation. Key innovations include sliding-window causal attention with attention sinks, and self-rollout with KV cache rolling during training to simulate inference-time extrapolations.
Result: Achieves state-of-the-art results in motion following and video quality while being two orders of magnitude faster than existing methods. Enables constant-speed generation of arbitrarily long videos with up to 29 FPS streaming on a single GPU.
Conclusion: MotionStream uniquely enables infinite-length streaming video generation, allowing users to paint trajectories, control cameras, or transfer motion and see results unfold in real-time for a truly interactive experience.
Abstract: Current motion-conditioned video generation methods suffer from prohibitive latency (minutes per video) and non-causal processing that prevents real-time interaction. We present MotionStream, enabling sub-second latency with up to 29 FPS streaming generation on a single GPU. Our approach begins by augmenting a text-to-video model with motion control, which generates high-quality videos that adhere to the global text prompt and local motion guidance, but does not perform inference on the fly. As such, we distill this bidirectional teacher into a causal student through Self Forcing with Distribution Matching Distillation, enabling real-time streaming inference. Several key challenges arise when generating videos of long, potentially infinite time-horizons: (1) bridging the domain gap from training on finite length and extrapolating to infinite horizons, (2) sustaining high quality by preventing error accumulation, and (3) maintaining fast inference, without incurring growth in computational cost due to increasing context windows. A key to our approach is introducing carefully designed sliding-window causal attention, combined with attention sinks. By incorporating self-rollout with attention sinks and KV cache rolling during training, we properly simulate inference-time extrapolations with a fixed context window, enabling constant-speed generation of arbitrarily long videos. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming. With MotionStream, users can paint trajectories, control cameras, or transfer motion, and see results unfold in real-time, delivering a truly interactive experience.
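An illustrative sketch (not the released code) of a rolling KV cache with attention sinks: the first few "sink" tokens are never evicted, while the rest of the cache is a sliding window, keeping memory and per-step cost constant for arbitrarily long streams. The sink count and window size are assumptions.

```python
# KV cache with attention sinks and a sliding window (illustrative sketch).
import torch

class RollingKVCache:
    def __init__(self, num_sink=4, window=256):
        self.num_sink, self.window = num_sink, window
        self.k = None
        self.v = None

    def append(self, k_new, v_new):
        """k_new, v_new: (B, heads, T_new, d); returns the cache to attend over."""
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        limit = self.num_sink + self.window
        if self.k.size(2) > limit:                     # evict the middle, keep sinks + recent window
            self.k = torch.cat([self.k[:, :, :self.num_sink],
                                self.k[:, :, -self.window:]], dim=2)
            self.v = torch.cat([self.v[:, :, :self.num_sink],
                                self.v[:, :, -self.window:]], dim=2)
        return self.k, self.v

cache = RollingKVCache(num_sink=4, window=8)
for _ in range(20):                                     # stream of decoding steps
    k, v = cache.append(torch.randn(1, 2, 1, 16), torch.randn(1, 2, 1, 16))
print(k.shape)                                          # capped at (1, 2, 12, 16)
```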
[305] PRevivor: Reviving Ancient Chinese Paintings using Prior-Guided Color Transformers
Tan Tang, Yanhong Wu, Junming Gao, Yingcai Wu
Main category: cs.CV
TL;DR: PRevivor is a prior-guided color transformer that restores ancient Chinese paintings by learning from recent paintings, using a two-stage approach of luminance enhancement and hue correction with localized priors.
Details
Motivation: Ancient Chinese paintings suffer from irreversible color degradation due to complex chemistry, and there's a lack of comprehensive datasets and end-to-end digital restoration tools for color revival.Method: Two-stage approach: 1) Luminance enhancement using variational U-Nets and multi-scale mapping, 2) Hue correction using dual-branch color query module guided by localized hue priors from faded paintings.
Result: Extensive experiments show PRevivor achieves superior performance both quantitatively and qualitatively compared to state-of-the-art colorization methods.
Conclusion: PRevivor effectively revives colors in ancient Chinese paintings through its prior-guided transformer approach with sequential luminance enhancement and hue correction.
Abstract: Ancient Chinese paintings are a valuable cultural heritage that is damaged by irreversible color degradation. Reviving color-degraded paintings is extraordinarily difficult due to the complex chemistry mechanism. Progress is further slowed by the lack of comprehensive, high-quality datasets, which hampers the creation of end-to-end digital restoration tools. To revive colors, we propose PRevivor, a prior-guided color transformer that learns from recent paintings (e.g., Ming and Qing Dynasty) to restore ancient ones (e.g., Tang and Song Dynasty). To develop PRevivor, we decompose color restoration into two sequential sub-tasks: luminance enhancement and hue correction. For luminance enhancement, we employ two variational U-Nets and a multi-scale mapping module to translate faded luminance into restored counterparts. For hue correction, we design a dual-branch color query module guided by localized hue priors extracted from faded paintings. Specifically, one branch focuses attention on regions guided by masked priors, enforcing localized hue correction, whereas the other branch remains unconstrained to maintain a global reasoning capability. To evaluate PRevivor, we conduct extensive experiments against state-of-the-art colorization methods. The results demonstrate superior performance both quantitatively and qualitatively.
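The two-stage decomposition can be pictured with a tiny sketch that splits a faded painting into a luminance channel (restored first) and hue/chroma channels (corrected second), here via a Lab conversion with scikit-image. The actual restoration networks are not shown; the Lab split itself is an assumption about how the decomposition might be realized.

```python
# Luminance / hue-chroma decomposition for two-stage restoration (illustrative).
import numpy as np
from skimage import color

def split_luminance_hue(rgb):
    """rgb: (H, W, 3) float image in [0, 1] -> (L, ab) channels."""
    lab = color.rgb2lab(rgb)
    return lab[..., 0], lab[..., 1:]          # stage 1 works on L, stage 2 on a/b

def recombine(L_restored, ab_corrected):
    lab = np.concatenate([L_restored[..., None], ab_corrected], axis=-1)
    return np.clip(color.lab2rgb(lab), 0, 1)

painting = np.random.rand(64, 64, 3)
L, ab = split_luminance_hue(painting)
revived = recombine(L, ab)
```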
[306] Adaptation of Foundation Models for Medical Image Analysis: Strategies, Challenges, and Future Directions
Karma Phuntsho, Abdullah, Kyungmi Lee, Ickjai Lee, Euijoon Ahn
Main category: cs.CV
TL;DR: This review paper examines adaptation strategies for foundation models in medical imaging, addressing challenges like domain shifts, data scarcity, computational demands, and privacy requirements.
Details
Motivation: Foundation models offer transformative potential for medical image analysis but face challenges in adapting to real-world clinical practice due to domain shifts, limited annotated data, computational demands, and privacy constraints.Method: The review comprehensively assesses adaptation strategies including supervised fine-tuning, domain-specific pretraining, parameter-efficient fine-tuning, self-supervised learning, hybrid methods, and multimodal frameworks. It also explores emerging directions like continual learning, federated approaches, hybrid self-supervised learning, data-centric pipelines, and systematic benchmarking.
Result: The review evaluates performance gains, clinical applicability, and limitations of various adaptation approaches, identifying trade-offs and unresolved challenges that prior reviews have often overlooked.
Conclusion: The paper provides a roadmap for developing adaptive, trustworthy, and clinically integrated foundation models capable of meeting real-world medical imaging demands by outlining strategies and associated research gaps.
Abstract: Foundation models (FMs) have emerged as a transformative paradigm in medical image analysis, offering the potential to provide generalizable, task-agnostic solutions across a wide range of clinical tasks and imaging modalities. Their capacity to learn transferable representations from large-scale data has the potential to address the limitations of conventional task-specific models. However, adaptation of FMs to real-world clinical practice remains constrained by key challenges, including domain shifts, limited availability of high-quality annotated data, substantial computational demands, and strict privacy requirements. This review presents a comprehensive assessment of strategies for adapting FMs to the specific demands of medical imaging. We examine approaches such as supervised fine-tuning, domain-specific pretraining, parameter-efficient fine-tuning, self-supervised learning, hybrid methods, and multimodal or cross-modal frameworks. For each, we evaluate reported performance gains, clinical applicability, and limitations, while identifying trade-offs and unresolved challenges that prior reviews have often overlooked. Beyond these established techniques, we also highlight emerging directions aimed at addressing current gaps. These include continual learning to enable dynamic deployment, federated and privacy-preserving approaches to safeguard sensitive data, hybrid self-supervised learning to enhance data efficiency, data-centric pipelines that combine synthetic generation with human-in-the-loop validation, and systematic benchmarking to assess robust generalization under real-world clinical variability. By outlining these strategies and associated research gaps, this review provides a roadmap for developing adaptive, trustworthy, and clinically integrated FMs capable of meeting the demands of real-world medical imaging.
[307] Detecting Generated Images by Fitting Natural Image Distributions
Yonggang Zhang, Jun Nie, Xinmei Tian, Mingming Gong, Kun Zhang, Bo Han
Main category: cs.CV
TL;DR: A novel framework for detecting generated images by exploiting geometric differences between natural and generated image manifolds, using orthogonal gradient subspaces and normalizing flows to amplify detectable differences.
Details
Motivation: Increasing realism of generated images raises concerns about misuse, requiring robust detection methods that don't depend heavily on training data quantity and quality like current binary classifiers.Method: Uses functions that yield consistent outputs for natural images but divergent outputs for generated images, leveraging orthogonal gradient subspaces. Detects generated images when transformations along data manifold cause significant loss changes in self-supervised models pre-trained on natural images. Employs normalizing flows to amplify differences by extruding generated images away from natural image manifold.
Result: Extensive experiments demonstrate the method’s efficacy in detecting generated images, particularly addressing diminishing manifold disparities in advanced generative models.
Conclusion: The proposed framework provides an effective detection method for generated images by exploiting geometric manifold differences and amplifying detectable disparities, offering a robust alternative to binary classification approaches.
Abstract: The increasing realism of generated images has raised significant concerns about their potential misuse, necessitating robust detection methods. Current approaches mainly rely on training binary classifiers, which depend heavily on the quantity and quality of available generated images. In this work, we propose a novel framework that exploits geometric differences between the data manifolds of natural and generated images. To exploit this difference, we employ a pair of functions engineered to yield consistent outputs for natural images but divergent outputs for generated ones, leveraging the property that their gradients reside in mutually orthogonal subspaces. This design enables a simple yet effective detection method: an image is identified as generated if a transformation along its data manifold induces a significant change in the loss value of a self-supervised model pre-trained on natural images. Furthermore, to address diminishing manifold disparities in advanced generative models, we leverage normalizing flows to amplify detectable differences by extruding generated images away from the natural image manifold. Extensive experiments demonstrate the efficacy of this method. Code is available at https://github.com/tmlr-group/ConV.
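A very rough sketch of the detection rule as described: perturb an image along an (approximate) on-manifold direction and score it by how much the loss of a model pre-trained on natural images changes. The stand-in model, the loss, the perturbation, and the threshold below are all assumptions, not the released detector.

```python
# Loss-change detection score under an on-manifold perturbation (illustrative).
import torch

def manifold_loss_change(model, loss_fn, x, direction, eps=1e-2):
    """x: (B, C, H, W); direction: same shape, an approximate on-manifold move."""
    with torch.no_grad():
        base = loss_fn(model(x))
        moved = loss_fn(model(x + eps * direction))
    return (moved - base).abs()            # large change -> flag as generated

def is_generated(score, threshold=0.05):
    return score > threshold               # threshold calibrated on held-out natural images

# toy usage with a stand-in "self-supervised" model and loss
net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
loss_fn = lambda z: z.pow(2).mean()
x = torch.rand(1, 3, 32, 32)
score = manifold_loss_change(net, loss_fn, x, torch.randn_like(x))
```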
[308] UniREditBench: A Unified Reasoning-based Image Editing Benchmark
Feng Han, Yibin Wang, Chenglin Li, Zheming Liang, Dianyi Wang, Yang Jiao, Zhipeng Wei, Chao Gong, Cheng Jin, Jingjing Chen, Jiaqi Wang
Main category: cs.CV
TL;DR: UniREditBench is a comprehensive benchmark for evaluating reasoning-based image editing models, covering 2,700 samples across real- and game-world scenarios with multimodal dual-reference evaluation to improve reliability.
Details
Motivation: Current generative models struggle with complex image editing tasks requiring implicit reasoning, and existing benchmarks focus mainly on single-object transformations while overlooking multi-object interactions and game-world scenarios with human-defined rules.Method: Proposed UniREditBench with 2,700 curated samples across 8 primary dimensions and 18 sub-dimensions, introduced multimodal dual-reference evaluation (textual + ground-truth image references), and created UniREdit-Data-100K synthetic dataset with CoT reasoning annotations.
Result: Fine-tuned Bagel on the synthetic dataset to create UniREdit-Bagel, which showed substantial improvements in both in-domain and out-of-distribution settings. Benchmarking revealed strengths and weaknesses of various image editing models.
Conclusion: UniREditBench provides a systematic framework for evaluating reasoning-based image editing, addressing limitations of existing benchmarks and enabling more reliable assessment of model capabilities across diverse scenarios.
Abstract: Recent advances in multi-modal generative models have driven substantial improvements in image editing. However, current generative models still struggle with handling diverse and complex image editing tasks that require implicit reasoning, underscoring the need for a comprehensive benchmark to systematically assess their performance across various reasoning scenarios. Existing benchmarks primarily focus on single-object attribute transformation in realistic scenarios, which, while effective, encounter two key challenges: (1) they largely overlook multi-object interactions as well as game-world scenarios that involve human-defined rules, which are common in real-life applications; (2) they only rely on textual references to evaluate the generated images, potentially leading to systematic misjudgments, especially in complex reasoning scenarios. To this end, this work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. To improve evaluation reliability, we introduce multimodal dual-reference evaluation, providing both textual and ground-truth image references for each sample assessment. Furthermore, we design an automated multi-scenario data synthesis pipeline and construct UniREdit-Data-100K, a large-scale synthetic dataset with high-quality chain-of-thought (CoT) reasoning annotations. We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings. Through thorough benchmarking of both open-source and closed-source image editing models, we reveal their strengths and weaknesses across various aspects.
[309] REASON: Probability map-guided dual-branch fusion framework for gastric content assessment
Nu-Fnag Xiao, De-Xing Huang, Le-Tian Wang, Mei-Jiang Gui, Qi Fu, Xiao-Liang Xie, Shi-Qi Liu, Shuangyi Wang, Zeng-Guang Hou, Ying-Wei Wang, Xiao-Hu Zhou
Main category: cs.CV
TL;DR: REASON is a two-stage framework that uses probability maps and dual-branch fusion of ultrasound views to automate gastric content assessment for aspiration risk stratification.
Details
Motivation: Traditional manual tracing methods for gastric content assessment are inefficient and inaccurate, creating a need for automated solutions to improve aspiration risk assessment during anesthesia induction.
Method: Two-stage framework: Stage 1 uses segmentation to generate probability maps that suppress artifacts and highlight gastric anatomy; Stage 2 employs dual-branch classifier that fuses information from right lateral decubitus and supine ultrasound views.
Result: Outperforms current state-of-the-art approaches by significant margin on self-collected dataset.
Conclusion: The framework shows great promise for automated preoperative aspiration risk assessment, offering robust, efficient, and accurate clinical solution.
Abstract: Accurate assessment of gastric content from ultrasound is critical for stratifying aspiration risk at induction of general anesthesia. However, traditional methods rely on manual tracing of gastric antra and empirical formulas, which face significant limitations in both efficiency and accuracy. To address these challenges, a novel two-stage probability map-guided dual-branch fusion framework (REASON) for gastric content assessment is proposed. In stage 1, a segmentation model generates probability maps that suppress artifacts and highlight gastric anatomy. In stage 2, a dual-branch classifier fuses information from two standard views, right lateral decubitus (RLD) and supine (SUP), to improve the discrimination of learned features. Experimental results on a self-collected dataset demonstrate that the proposed framework outperforms current state-of-the-art approaches by a significant margin. This framework shows great promise for automated preoperative aspiration risk assessment, offering a more robust, efficient, and accurate solution for clinical practice.
[310] Perturb a Model, Not an Image: Towards Robust Privacy Protection via Anti-Personalized Diffusion Models
Tae-Young Lee, Juwon Seo, Jong Hwan Ko, Gyeong-Moon Park
Main category: cs.CV
TL;DR: APDM is a novel framework that protects against unauthorized personalization in diffusion models by shifting protection from images to the model itself, using a new loss function and dual-path optimization strategy.
Details
Motivation: To address privacy risks from malicious misuse of personalization techniques in diffusion models, as existing adversarial perturbation methods are ineffective against simple image transformations or clean images.
Method: Proposes Direct Protective Optimization (DPO) loss function and Learning to Protect (L2P) dual-path optimization strategy that alternates between personalization and protection paths to disrupt subject personalization.
Result: APDM outperforms existing methods and achieves state-of-the-art performance in preventing unauthorized personalization without compromising generative quality.
Conclusion: The framework effectively protects diffusion models from unauthorized personalization through theoretical analysis and practical optimization strategies, providing robust privacy protection.
Abstract: Recent advances in diffusion models have enabled high-quality synthesis of specific subjects, such as identities or objects. This capability, while unlocking new possibilities in content creation, also introduces significant privacy risks, as personalization techniques can be misused by malicious users to generate unauthorized content. Although several studies have attempted to counter this by generating adversarially perturbed samples designed to disrupt personalization, they rely on unrealistic assumptions and become ineffective in the presence of even a few clean images or under simple image transformations. To address these challenges, we shift the protection target from the images to the diffusion model itself to hinder the personalization of specific subjects, through our novel framework called Anti-Personalized Diffusion Models (APDM). We first provide a theoretical analysis demonstrating that a naive application of existing loss functions to diffusion models is inherently incapable of ensuring convergence for robust anti-personalization. Motivated by this finding, we introduce Direct Protective Optimization (DPO), a novel loss function that effectively disrupts subject personalization in the target model without compromising generative quality. Moreover, we propose a new dual-path optimization strategy, coined Learning to Protect (L2P). By alternating between personalization and protection paths, L2P simulates future personalization trajectories and adaptively reinforces protection at each step. Experimental results demonstrate that our framework outperforms existing methods, achieving state-of-the-art performance in preventing unauthorized personalization. The code is available at https://github.com/KU-VGI/APDM.
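To make the alternating dual-path idea concrete, here is a rough training-loop sketch under stated assumptions: `personalization_loss` and `protective_loss` are hypothetical placeholders rather than the paper's DPO loss, and simulating the personalization path with a per-step model copy is only one plausible reading of L2P.

```python
# Hedged sketch of an alternating personalization/protection loop in the spirit
# of L2P; not the paper's implementation.
import copy
import torch

def l2p_style_training(model, subject_batches, personalization_loss, protective_loss,
                       steps=1000, inner_lr=1e-5, outer_lr=1e-5):
    opt = torch.optim.AdamW(model.parameters(), lr=outer_lr)
    for _ in range(steps):
        batch = next(subject_batches)

        # Path 1: simulate a future personalization trajectory on a copy
        # (expensive for large diffusion models; shown only for clarity).
        sim = copy.deepcopy(model)
        sim_opt = torch.optim.SGD(sim.parameters(), lr=inner_lr)
        sim_opt.zero_grad()
        personalization_loss(sim, batch).backward()
        sim_opt.step()

        # Path 2: update the protected model so that the simulated
        # personalization no longer reproduces the subject.
        opt.zero_grad()
        protective_loss(model, sim, batch).backward()
        opt.step()
    return model
```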
[311] Positive Semi-definite Latent Factor Grouping-Boosted Cluster-reasoning Instance Disentangled Learning for WSI Representation
Chentao Li, Behzad Bozorgtabar, Yifang Ping, Pan Huang, Jing Qin
Main category: cs.CV
TL;DR: A novel MIL framework using latent factor grouping and cluster-reasoning to disentangle spatial, semantic, and decision entanglements in whole-slide pathology images, achieving superior performance and interpretability.
Details
Motivation: To address limitations of multiple instance learning (MIL) in whole-slide pathology images caused by spatial, semantic, and decision entanglements among instances that limit representation and interpretability.
Method: Three-phase framework: 1) Positive semi-definite latent factor grouping to mitigate spatial entanglement, 2) Instance probability counterfactual inference via cluster-reasoning for semantic disentanglement, 3) Generalized linear weighted decision with instance effect re-weighting for decision entanglement.
Result: Outperforms all state-of-the-art models on multicentre datasets and achieves pathologist-aligned interpretability through disentangled representations and transparent decision-making.
Conclusion: The proposed framework effectively addresses entanglement issues in MIL for whole-slide images, providing both superior performance and enhanced interpretability aligned with pathologist reasoning.
Abstract: Multiple instance learning (MIL) has been widely used for representing whole-slide pathology images. However, spatial, semantic, and decision entanglements among instances limit its representation and interpretability. To address these challenges, we propose a latent factor grouping-boosted cluster-reasoning instance disentangled learning framework for whole-slide image (WSI) interpretable representation in three phases. First, we introduce a novel positive semi-definite latent factor grouping that maps instances into a latent subspace, effectively mitigating spatial entanglement in MIL. To alleviate semantic entanglement, we employ instance probability counterfactual inference and optimization via cluster-reasoning instance disentangling. Finally, we employ a generalized linear weighted decision via instance effect re-weighting to address decision entanglement. Extensive experiments on multicentre datasets demonstrate that our model outperforms all state-of-the-art models. Moreover, it attains pathologist-aligned interpretability through disentangled representations and a transparent decision-making process.
[312] MVSMamba: Multi-View Stereo with State Space Model
Jianfei Jiang, Qiankun Liu, Hongyuan Liu, Haochen Yu, Liyong Wang, Jiansheng Chen, Huimin Ma
Main category: cs.CV
TL;DR: MVSMamba is the first Mamba-based Multi-View Stereo network that achieves state-of-the-art performance with linear computational complexity, overcoming the quadratic complexity limitations of Transformer-based MVS methods.
Details
Motivation: Current Transformer-based MVS methods suffer from quadratic computational complexity, making it challenging to balance performance and efficiency. The Mamba architecture offers global modeling capability with linear complexity, making it suitable for efficient MVS.
Method: Proposes MVSMamba with a Dynamic Mamba module using reference-centered dynamic scanning strategy for efficient intra- and inter-view feature interaction, omnidirectional multi-view feature representations, and multi-scale global feature aggregation.
Result: MVSMamba outperforms state-of-the-art MVS methods on DTU dataset and Tanks-and-Temples benchmark with superior performance and efficiency.
Conclusion: MVSMamba demonstrates that Mamba architecture is highly effective for MVS tasks, achieving better performance than Transformer-based methods while maintaining linear computational complexity.
Abstract: Robust feature representations are essential for learning-based Multi-View Stereo (MVS), which relies on accurate feature matching. Recent MVS methods leverage Transformers to capture long-range dependencies based on local features extracted by conventional feature pyramid networks. However, the quadratic complexity of Transformer-based MVS methods poses challenges to balance performance and efficiency. Motivated by the global modeling capability and linear complexity of the Mamba architecture, we propose MVSMamba, the first Mamba-based MVS network. MVSMamba enables efficient global feature aggregation with minimal computational overhead. To fully exploit Mamba’s potential in MVS, we propose a Dynamic Mamba module (DM-module) based on a novel reference-centered dynamic scanning strategy, which enables: (1) Efficient intra- and inter-view feature interaction from the reference to source views, (2) Omnidirectional multi-view feature representations, and (3) Multi-scale global feature aggregation. Extensive experimental results demonstrate MVSMamba outperforms state-of-the-art MVS methods on the DTU dataset and the Tanks-and-Temples benchmark with both superior performance and efficiency. The source code is available at https://github.com/JianfeiJ/MVSMamba.
[313] A Generative Adversarial Approach to Adversarial Attacks Guided by Contrastive Language-Image Pre-trained Model
Sampriti Soor, Alik Pramanick, Jothiprakash K, Arijit Sur
Main category: cs.CV
TL;DR: A generative adversarial attack method using CLIP model to create effective and imperceptible adversarial perturbations that deceive multilabel classifiers while maintaining high visual similarity to original images.
Details
Motivation: Address the vulnerability of deep learning models to adversarial attacks, particularly the need for attacks that are both effective at deceiving models and visually imperceptible to humans.
Method: Integrates CLIP model’s text-image alignment with guided loss to incorporate natural language semantics, combining concentrated perturbation strategy from SSAE with dissimilar text embeddings from GAMA for multi-object scene manipulation.
Result: Achieves competitive performance comparable or superior to existing techniques across various black-box victim models while preserving greater visual fidelity.
Conclusion: The proposed method successfully creates effective adversarial examples that deceive classification models while maintaining high structural similarity to original inputs, demonstrating the power of integrating CLIP’s semantic understanding with perturbation strategies.
Abstract: The rapid growth of deep learning has brought about powerful models that can handle various tasks, like identifying images and understanding language. However, adversarial attacks, which introduce unnoticed alterations, can deceive models and lead to inaccurate predictions. In this paper, a generative adversarial attack method is proposed that uses the CLIP model to create highly effective and visually imperceptible adversarial perturbations. The CLIP model’s ability to align text and image representation helps incorporate natural language semantics with a guided loss to generate effective adversarial examples that look identical to the original inputs. This integration allows extensive scene manipulation, creating perturbations in multi-object environments specifically designed to deceive multilabel classifiers. Our approach integrates the concentrated perturbation strategy from Saliency-based Auto-Encoder (SSAE) with dissimilar text embeddings, as in Generative Adversarial Multi-Object Scene Attacks (GAMA), resulting in perturbations that both deceive classification models and maintain high structural similarity to the original images. The model was tested on various tasks across diverse black-box victim models. The experimental results show that our method performs competitively, achieving comparable or superior results to existing techniques, while preserving greater visual fidelity.
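As a loose illustration of CLIP-guided perturbation (not the paper's generator-based SSAE/GAMA pipeline), the sketch below directly optimizes an image-space perturbation so the CLIP image embedding drifts toward a mismatched caption while staying small; the loss weights and step counts are arbitrary assumptions.

```python
# Hedged sketch of a CLIP-guided adversarial perturbation.
import torch
import torch.nn.functional as F
import clip  # assumes the openai/CLIP package is installed

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # avoid fp16 gradient quirks in this simple sketch

CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

def clip_guided_perturbation(image, wrong_caption, steps=100, eps=8 / 255, lr=1e-2):
    """image: (1, 3, 224, 224) tensor in [0, 1] on `device`; wrong_caption: misleading text."""
    text = clip.tokenize([wrong_caption]).to(device)
    with torch.no_grad():
        text_feat = F.normalize(model.encode_text(text), dim=-1)

    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        adv = (image + delta).clamp(0, 1)
        img_feat = F.normalize(model.encode_image((adv - CLIP_MEAN) / CLIP_STD), dim=-1)
        # Pull the image embedding toward the mismatched caption, keep delta small.
        loss = -(img_feat * text_feat).sum() + 10.0 * delta.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        delta.data.clamp_(-eps, eps)
    return (image + delta).clamp(0, 1).detach()
```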
[314] RDTE-UNet: A Boundary and Detail Aware UNet for Precise Medical Image Segmentation
Jierui Qu, Jianchun Zhao
Main category: cs.CV
TL;DR: RDTE-UNet is a medical image segmentation network that combines local modeling with global context to improve boundary delineation and detail preservation for fine structures.
Details
Motivation: Medical image segmentation faces challenges due to anatomical variability and boundary ambiguity, which hinder reliable delineation of fine structures.
Method: Uses a hybrid ResBlock detail-aware Transformer backbone with three modules: ASBE for adaptive boundary enhancement, HVDA for fine-grained feature modeling, and EulerFF for fusion weighting guided by Euler’s formula.
Result: Achieved comparable segmentation accuracy and boundary quality on Synapse and BUSI datasets.
Conclusion: RDTE-UNet improves structural consistency and boundary accuracy across morphology, orientation, and scale in medical image segmentation.
Abstract: Medical image segmentation is essential for computer-assisted diagnosis and treatment planning, yet substantial anatomical variability and boundary ambiguity hinder reliable delineation of fine structures. We propose RDTE-UNet, a segmentation network that unifies local modeling with global context to strengthen boundary delineation and detail preservation. RDTE-UNet employs a hybrid ResBlock detail-aware Transformer backbone and three modules: ASBE for adaptive boundary enhancement, HVDA for fine-grained feature modeling, and EulerFF for fusion weighting guided by Euler’s formula. Together, these components improve structural consistency and boundary accuracy across morphology, orientation, and scale. On the Synapse and BUSI datasets, RDTE-UNet achieves comparable segmentation accuracy and boundary quality.
[315] MIQ-SAM3D: From Single-Point Prompt to Multi-Instance Segmentation via Competitive Query Refinement
Jierui Qu, Jianchun Zhao
Main category: cs.CV
TL;DR: MIQ-SAM3D is a multi-instance 3D segmentation framework that enables single-point-to-multi-instance segmentation using competitive query optimization and hybrid CNN-Transformer architecture.
Details
Motivation: Current SAM-based interactive segmentation methods follow single-point-to-single-object paradigm, limiting multi-lesion segmentation. ViT backbones capture global context but miss local details.
Method: Uses prompt-conditioned instance-query generator to transform single point into multiple queries, hybrid CNN-Transformer encoder with spatial gating, and competitively optimized query decoder for parallel multi-instance prediction.
Result: Achieved comparable performance on LiTS17 and KiTS21 datasets with strong robustness to prompts.
Conclusion: Provides practical solution for efficient annotation of clinically relevant multi-lesion cases.
Abstract: Accurate segmentation of medical images is fundamental to tumor diagnosis and treatment planning. SAM-based interactive segmentation has gained attention for its strong generalization, but most methods follow a single-point-to-single-object paradigm, which limits multi-lesion segmentation. Moreover, ViT backbones capture global context but often miss high-fidelity local details. We propose MIQ-SAM3D, a multi-instance 3D segmentation framework with a competitive query optimization strategy that shifts from single-point-to-single-mask to single-point-to-multi-instance. A prompt-conditioned instance-query generator transforms a single point prompt into multiple specialized queries, enabling retrieval of all semantically similar lesions across the 3D volume from a single exemplar. A hybrid CNN-Transformer encoder injects CNN-derived boundary saliency into ViT self-attention via spatial gating. A competitively optimized query decoder then enables end-to-end, parallel, multi-instance prediction through inter-query competition. On the LiTS17 and KiTS21 datasets, MIQ-SAM3D achieves comparable performance and exhibits strong robustness to prompts, providing a practical solution for efficient annotation of clinically relevant multi-lesion cases.
[316] Expanding the Content-Style Frontier: a Balanced Subspace Blending Approach for Content-Style LoRA Fusion
Linhao Huang
Main category: cs.CV
TL;DR: A novel method using Content-Style Subspace Blending and Content-Style Balance loss to expand the content-style frontier in text-to-image diffusion models, improving content preservation across varying style intensities.
Details
Motivation: Previous studies only assessed content similarity under single style intensity, but increasing style intensity causes significant content feature loss, leading to suboptimal content-style frontier.
Method: Proposed Content-Style Subspace Blending and Content-Style Balance loss to better balance content preservation and style application across different intensity levels.
Result: Outperforms existing techniques in both qualitative and quantitative evaluations, achieving superior content-style trade-off with significantly lower Inverted Generational Distance (IGD) and Generational Distance (GD) scores.
Conclusion: The proposed approach successfully expands the content-style frontier by improving content similarity across varying style intensities, providing better balance between content preservation and style application.
Abstract: Recent advancements in text-to-image diffusion models have significantly improved the personalization and stylization of generated images. However, previous studies have only assessed content similarity under a single style intensity. In our experiments, we observe that increasing style intensity leads to a significant loss of content features, resulting in a suboptimal content-style frontier. To address this, we propose a novel approach to expand the content-style frontier by leveraging Content-Style Subspace Blending and a Content-Style Balance loss. Our method improves content similarity across varying style intensities, significantly broadening the content-style frontier. Extensive experiments demonstrate that our approach outperforms existing techniques in both qualitative and quantitative evaluations, achieving superior content-style trade-off with significantly lower Inverted Generational Distance (IGD) and Generational Distance (GD) scores compared to current methods.
[317] CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering
Qiangguo Jin, Xianyao Zheng, Hui Cui, Changming Sun, Yuqi Fang, Cong Cong, Ran Su, Leyi Wei, Ping Xuan, Junbo Wang
Main category: cs.CV
TL;DR: CMI-MTL is a novel framework for medical visual question answering that addresses cross-modal alignment challenges and free-form answer diversity through multi-task learning with three key modules: fine-grained visual-text alignment, cross-modal feature representation, and free-form answer enhancement.
Details
Motivation: Existing self-attention methods struggle with cross-modal semantic alignment between vision and language, and classification-based approaches are limited by predefined answer sets, making them unable to handle the diversity of free-form answers in medical VQA.
Method: The CMI-MTL framework uses three modules: FVTA for fine-grained visual-text feature alignment, CIFR for cross-modal interleaved feature representation, and FFAE for free-form answer-enhanced multi-task learning that leverages auxiliary knowledge from open-ended questions.
Result: CMI-MTL outperforms state-of-the-art methods on three Med-VQA datasets (VQA-RAD, SLAKE, and OVQA) and demonstrates effectiveness through interpretability experiments.
Conclusion: The proposed CMI-MTL framework effectively addresses cross-modal alignment challenges and adapts to free-form answer diversity in medical VQA, achieving superior performance across multiple datasets.
Abstract: Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention based methods struggle to effectively handle cross-modal semantic alignments between vision and language. Moreover, classification-based methods rely on predefined answer sets. Treating this task as a simple classification problem may make it unable to adapt to the diversity of free-form answers and overlook the detailed semantic information of free-form answers. In order to tackle these challenges, we introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) framework that learns cross-modal feature representations from images and texts. CMI-MTL comprises three key modules: fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning (FFAE). FVTA extracts the most relevant regions in image-text pairs through fine-grained visual-text feature alignment. CIFR captures cross-modal sequential interactions via cross-modal interleaved feature representation. FFAE leverages auxiliary knowledge from open-ended questions through free-form answer-enhanced multi-task learning, improving the model’s capability for open-ended Med-VQA. Experimental results show that CMI-MTL outperforms the existing state-of-the-art methods on three Med-VQA datasets: VQA-RAD, SLAKE, and OVQA. Furthermore, we conduct more interpretability experiments to prove the effectiveness. The code is publicly available at https://github.com/BioMedIA-repo/CMI-MTL.
[318] EREBUS: End-to-end Robust Event Based Underwater Simulation
Hitesh Kyatham, Arjun Suresh, Aadi Palnitkar, Yiannis Aloimonos
Main category: cs.CV
TL;DR: A pipeline for generating realistic synthetic data of event-based cameras on AUVs in underwater environments to train vision models, demonstrated through rock detection in poor visibility conditions.
Details
Motivation: Underwater environments pose challenges like poor lighting and high dynamic range that traditional vision techniques struggle with, while event-based cameras offer advantages by tracking frame-by-frame changes.
Method: Developed a pipeline to generate synthetic data simulating event-based cameras mounted on Autonomous Underwater Vehicles in underwater conditions.
Result: The pipeline effectively generates realistic synthetic data for training vision models, demonstrated specifically for rock detection tasks in poor visibility with suspended particulate matter.
Conclusion: The approach successfully addresses underwater vision challenges using synthetic event-based camera data and can be generalized to other underwater tasks beyond rock detection.
Abstract: The underwater domain presents a vast array of challenges for roboticists and computer vision researchers alike, such as poor lighting conditions and high dynamic range scenes. In these adverse conditions, traditional vision techniques struggle to adapt and lead to suboptimal performance. Event-based cameras present an attractive solution to this problem, mitigating the issues of traditional cameras by tracking changes in the footage on a frame-by-frame basis. In this paper, we introduce a pipeline which can be used to generate realistic synthetic data of an event-based camera mounted to an AUV (Autonomous Underwater Vehicle) in an underwater environment for training vision models. We demonstrate the effectiveness of our pipeline using the task of rock detection with poor visibility and suspended particulate matter, but the approach can be generalized to other underwater tasks.
[319] UniSOT: A Unified Framework for Multi-Modality Single Object Tracking
Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Xu Zhou, Feng Wu
Main category: cs.CV
TL;DR: UniSOT is a unified tracker that handles three reference modalities (bounding box, natural language, or both) across four video modalities (RGB, RGB+Depth, RGB+Thermal, RGB+Event) with uniform parameters, outperforming modality-specific trackers.
Details
Motivation: Existing trackers are designed for single or few video and reference modalities, leading to separate model designs that limit practical applications. A unified tracker is needed to handle various requirements across different modalities.
Method: Proposed UniSOT, a unified tracker that can perform tracking with three reference modalities (bounding box, natural language, or both) across four video modalities (RGB, RGB+Depth, RGB+Thermal, RGB+Event) using uniform parameters.
Result: Extensive experiments on 18 benchmarks show UniSOT outperforms modality-specific counterparts. It achieves over 3.0% AUC improvement on TNL2K across all three reference modalities and over 2.0% main metric improvement on Un-Track across all three RGB+X video modalities.
Conclusion: UniSOT demonstrates superior performance as a unified tracker capable of handling multiple reference and video modalities simultaneously, addressing the limitations of modality-specific trackers in practical applications.
Abstract: Single object tracking aims to localize a target object with specific reference modalities (bounding box, natural language, or both) in a sequence of specific video modalities (RGB, RGB+Depth, RGB+Thermal, or RGB+Event). Different reference modalities enable various human-machine interactions, and different video modalities are demanded in complex scenarios to enhance tracking robustness. Existing trackers are designed for single or several video modalities with single or several reference modalities, which leads to separate model designs and limits practical applications. Practically, a unified tracker is needed to handle various requirements. To the best of our knowledge, there is still no tracker that can perform tracking with all of the above reference modalities across these video modalities simultaneously. Thus, in this paper, we present a unified tracker, UniSOT, for different combinations of three reference modalities and four video modalities with uniform parameters. Extensive experimental results on 18 visual tracking, vision-language tracking and RGB+X tracking benchmarks demonstrate that UniSOT shows superior performance against modality-specific counterparts. Notably, UniSOT outperforms previous counterparts by over 3.0% AUC on TNL2K across all three reference modalities and outperforms Un-Track by over 2.0% main metric across all three RGB+X video modalities.
[320] Semantic BIM enrichment for firefighting assets: Fire-ART dataset and panoramic image-based 3D reconstruction
Ya Wen, Yutong Qiao, Chi Chiu Lam, Ioannis Brilakis, Sanghoon Lee, Mun On Wong
Main category: cs.CV
TL;DR: This paper introduces Fire-ART dataset and a panoramic image-based reconstruction approach for automated firefighting asset recognition and BIM integration, achieving good performance in real-world validations.
Details
Motivation: Conventional firefighting asset management methods are inefficient due to limited automated recognition and reconstruction capabilities, which hinders emergency preparedness and risk assessment.
Method: Developed Fire-ART dataset with 15 asset types (2,626 images, 6,627 instances) and a reconstruction approach using modified cube-map conversion and radius-based spherical camera projection for semantic enrichment into BIM models.
Result: Achieved F1-scores of 73% and 88% with localization errors of 0.620 and 0.428 meters respectively in two real-world case studies.
Conclusion: The Fire-ART dataset and reconstruction approach provide valuable resources and robust technical solutions for enhancing accurate digital management of fire safety equipment.
Abstract: Inventory management of firefighting assets is crucial for emergency preparedness, risk assessment, and on-site fire response. However, conventional methods are inefficient due to limited capabilities in automated asset recognition and reconstruction. To address the challenge, this research introduces the Fire-ART dataset and develops a panoramic image-based reconstruction approach for semantic enrichment of firefighting assets into BIM models. The Fire-ART dataset covers 15 fundamental assets, comprising 2,626 images and 6,627 instances, making it an extensive and publicly accessible dataset for asset recognition. In addition, the reconstruction approach integrates modified cube-map conversion and radius-based spherical camera projection to enhance recognition and localization accuracy. Through validations with two real-world case studies, the proposed approach achieves F1-scores of 73% and 88% and localization errors of 0.620 and 0.428 meters, respectively. The Fire-ART dataset and the reconstruction approach offer valuable resources and robust technical solutions to enhance the accurate digital management of fire safety equipment.
[321] Privacy Preserving Ordinal-Meta Learning with VLMs for Fine-Grained Fruit Quality Prediction
Riddhi Jain, Manasi Patwardhan, Aayush Mishra, Parijat Deshpande, Beena Rai
Main category: cs.CV
TL;DR: A Model-Agnostic Ordinal Meta-Learning (MAOML) algorithm is proposed to train smaller VLMs for fruit freshness classification, achieving 92.71% accuracy by addressing data scarcity through meta-learning and leveraging label ordinality.
Details
Motivation: Managing perishable fruit wastage requires accurate freshness prediction using non-invasive visual methods, but expert labeling is costly and data is scarce. Proprietary VLMs perform well but raise privacy concerns, while open-source VLMs underperform with limited data.
Method: Developed MAOML algorithm that combines meta-learning to handle data sparsity and leverages ordinal relationships in freshness labels to train smaller vision language models effectively.
Result: Achieved state-of-the-art performance with 92.71% accuracy across all fruits in both zero-shot and few-shot settings, outperforming existing open-source VLMs and matching proprietary model performance.
Conclusion: MAOML provides an effective solution for fruit freshness classification that addresses data scarcity and privacy concerns while achieving high accuracy comparable to proprietary models.
Abstract: To effectively manage the wastage of perishable fruits, it is crucial to accurately predict their freshness or shelf life using non-invasive methods that rely on visual data. In this regard, deep learning techniques can offer a viable solution. However, obtaining fine-grained fruit freshness labels from experts is costly, leading to a scarcity of data. Closed proprietary Vision Language Models (VLMs), such as Gemini, have demonstrated strong performance in fruit freshness detection task in both zero-shot and few-shot settings. Nonetheless, food retail organizations are unable to utilize these proprietary models due to concerns related to data privacy, while existing open-source VLMs yield sub-optimal performance for the task. Fine-tuning these open-source models with limited data fails to achieve the performance levels of proprietary models. In this work, we introduce a Model-Agnostic Ordinal Meta-Learning (MAOML) algorithm, designed to train smaller VLMs. This approach utilizes meta-learning to address data sparsity and leverages label ordinality, thereby achieving state-of-the-art performance in the fruit freshness classification task under both zero-shot and few-shot settings. Our method achieves an industry-standard accuracy of 92.71%, averaged across all fruits. Keywords: Fruit Quality Prediction, Vision Language Models, Meta Learning, Ordinal Regression
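The method leans on label ordinality. One standard way to exploit ordered freshness levels, shown below as a hedged sketch rather than the paper's exact MAOML objective, is a cumulative-threshold ordinal head trained with binary cross-entropy.

```python
# Hedged sketch of a generic ordinal-regression head: a label with K ordered
# levels is decomposed into K-1 cumulative "is the level greater than k?" targets.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrdinalHead(nn.Module):
    def __init__(self, feat_dim, num_levels):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_levels - 1)  # one logit per threshold

    def forward(self, feats):
        return self.fc(feats)  # (B, K-1) threshold logits

def ordinal_loss(logits, labels):
    """logits: (B, K-1); labels: integer freshness levels in [0, K-1]."""
    k = logits.shape[1]
    thresholds = torch.arange(k, device=labels.device).unsqueeze(0)
    targets = (labels.unsqueeze(1) > thresholds).float()  # cumulative encoding
    return F.binary_cross_entropy_with_logits(logits, targets)
```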
[322] Extremal Contours: Gradient-driven contours for compact visual attribution
Reza Karimzadeh, Albert Alonso, Frans Zdyb, Julius B. Kirkegaard, Bulat Ibragimov
Main category: cs.CV
TL;DR: A training-free explanation method that uses smooth tunable contours instead of dense masks for vision model explanations, achieving compact and interpretable regions with high fidelity.
Details
Motivation: Dense perturbation masks for vision model explanations are often fragmented, overfitted, and require post-processing, motivating the need for more compact and stable explanation methods.
Method: Parameterizes star-convex regions using truncated Fourier series and optimizes under extremal preserve/delete objectives using classifier gradients, guaranteeing single connected masks with fewer parameters.
Result: Matches fidelity of dense masks while producing compact, interpretable regions with improved consistency; achieves higher relevance mass and lower complexity than baselines, especially on DINO models with 15%+ improvement.
Conclusion: The contour-based approach provides faithful yet compact explanations with stable boundaries, explicit area control, and extensibility to multi-object localization, outperforming gradient and perturbation methods.
Abstract: Faithful yet compact explanations for vision models remain a challenge, as commonly used dense perturbation masks are often fragmented and overfitted, needing careful post-processing. Here, we present a training-free explanation method that replaces dense masks with smooth tunable contours. A star-convex region is parameterized by a truncated Fourier series and optimized under an extremal preserve/delete objective using the classifier gradients. The approach guarantees a single, simply connected mask, cuts the number of free parameters by orders of magnitude, and yields stable boundary updates without cleanup. Restricting solutions to low-dimensional, smooth contours makes the method robust to adversarial masking artifacts. On ImageNet classifiers, it matches the extremal fidelity of dense masks while producing compact, interpretable regions with improved run-to-run consistency. Explicit area control also enables importance contour maps, yielding transparent fidelity-area profiles. Finally, we extend the approach to multiple contours and show how it can localize multiple objects within the same framework. Across benchmarks, the method achieves higher relevance mass and lower complexity than gradient- and perturbation-based baselines, with especially strong gains on self-supervised DINO models where it improves relevance mass by over 15% and maintains positive faithfulness correlations.
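The star-convex, truncated-Fourier parameterization described above is concrete enough to sketch: the code below builds a differentiable soft mask whose boundary radius is a Fourier series in the polar angle. The preserve/delete optimization against classifier gradients is omitted, and the sigmoid sharpness is an assumed detail.

```python
# Hedged sketch of a star-convex contour mask parameterized by a truncated Fourier series.
import torch

def fourier_radius(theta, coeffs_a, coeffs_b, r0):
    """theta: (...,) angles; coeffs_a, coeffs_b: (K,) Fourier coefficients; r0: base radius."""
    k = torch.arange(1, coeffs_a.numel() + 1, device=theta.device, dtype=theta.dtype)
    ang = theta.unsqueeze(-1) * k                     # (..., K)
    return r0 + (coeffs_a * torch.cos(ang) + coeffs_b * torch.sin(ang)).sum(-1)

def star_convex_mask(h, w, center, coeffs_a, coeffs_b, r0, sharpness=20.0):
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dy, dx = ys.float() - center[0], xs.float() - center[1]
    r = torch.sqrt(dx ** 2 + dy ** 2)
    theta = torch.atan2(dy, dx)
    boundary = fourier_radius(theta, coeffs_a, coeffs_b, r0)
    # Soft indicator: ~1 inside the contour, ~0 outside, differentiable in the coefficients.
    return torch.sigmoid(sharpness * (boundary - r))
```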
[323] Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation
Jie Du, Xinyu Gong, Qingshan Tan, Wen Li, Yangming Cheng, Weitao Wang, Chenlu Zhan, Suhui Wu, Hao Zhang, Jun Zhang
Main category: cs.CV
TL;DR: The paper introduces Reg-DPO, a method that enhances Direct Preference Optimization for video generation by using automatically constructed GT-Pairs and incorporating SFT loss as regularization, achieving superior video quality with improved training stability and capacity.
Details
Motivation: Existing DPO methods for video generation follow image-domain paradigms and are limited to small-scale models, failing to address video-specific challenges like costly data construction, unstable training, and heavy memory consumption.
Method: Proposes GT-Pairs that automatically build preference pairs using real videos as positives and model-generated videos as negatives, and Reg-DPO that incorporates SFT loss as regularization into DPO objective. Also uses FSDP framework with memory optimization techniques to increase training capacity.
Result: Achieves nearly three times higher training capacity than using FSDP alone. Extensive experiments on I2V and T2V tasks across multiple datasets show consistent outperformance of existing approaches with superior video generation quality.
Conclusion: The proposed Reg-DPO method effectively addresses video generation challenges by eliminating external annotation needs, enhancing training stability, and significantly improving training capacity, leading to consistently better video generation performance.
Abstract: Recent studies have identified Direct Preference Optimization (DPO) as an efficient and reward-free approach to improving video generation quality. However, existing methods largely follow image-domain paradigms and are mainly developed on small-scale models (approximately 2B parameters), limiting their ability to address the unique challenges of video tasks, such as costly data construction, unstable training, and heavy memory consumption. To overcome these limitations, we introduce a GT-Pair that automatically builds high-quality preference pairs by using real videos as positives and model-generated videos as negatives, eliminating the need for any external annotation. We further present Reg-DPO, which incorporates the SFT loss as a regularization term into the DPO objective to enhance training stability and generation fidelity. Additionally, by combining the FSDP framework with multiple memory optimization techniques, our approach achieves nearly three times higher training capacity than using FSDP alone. Extensive experiments on both I2V and T2V tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches, delivering superior video generation quality.
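A minimal sketch of the combined objective suggested by the abstract, assuming the video model exposes some approximate log-likelihood: a DPO preference term over GT-Pair batches (real winner, generated loser) plus an SFT term as regularization. The paper's exact Diffusion-DPO formulation and weighting may differ.

```python
# Hedged sketch: DPO preference loss regularized by an SFT term.
# model_logp / ref_logp / sft_loss are placeholder callables, not the paper's API.
import torch.nn.functional as F

def reg_dpo_loss(model_logp, ref_logp, real_batch, gen_batch, sft_loss, beta=0.1, lam=1.0):
    # Preference term: prefer real ("winner") videos over model-generated ("loser") ones.
    logits = beta * ((model_logp(real_batch) - ref_logp(real_batch))
                     - (model_logp(gen_batch) - ref_logp(gen_batch)))
    dpo = -F.logsigmoid(logits).mean()
    # SFT regularizer keeps the policy anchored to the supervised denoising objective.
    return dpo + lam * sft_loss(real_batch)
```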
[324] Towards One-step Causal Video Generation via Adversarial Self-Distillation
Yongqi Yang, Huayang Huang, Xu Peng, Xiaobin Hu, Donghao Luo, Jiangning Zhang, Chengjie Wang, Yu Wu
Main category: cs.CV
TL;DR: A distillation framework for efficient causal video generation that enables high-quality synthesis with very few denoising steps (1-2 steps) through adversarial self-distillation and first-frame enhancement strategies.
Details
Motivation: Current hybrid video generation models suffer from error accumulation and long inference times due to their sequential, iterative nature, making them inefficient for practical applications.
Method: Proposes Adversarial Self-Distillation (ASD) that aligns student model’s n-step denoising with (n+1)-step outputs at distribution level, and First-Frame Enhancement (FFE) that allocates more denoising steps to initial frames while applying larger skipping steps to later frames.
Result: Extensive experiments on VBench show the method surpasses state-of-the-art approaches in both one-step and two-step video generation, producing a single distilled model that flexibly supports multiple inference-step settings.
Conclusion: The framework enables efficient, high-quality video synthesis with extremely limited denoising steps, eliminating the need for repeated re-distillation while maintaining generation quality.
Abstract: Recent hybrid video generation models combine autoregressive temporal dynamics with diffusion-based spatial denoising, but their sequential, iterative nature leads to error accumulation and long inference times. In this work, we propose a distillation-based framework for efficient causal video generation that enables high-quality synthesis with extremely limited denoising steps. Our approach builds upon the Distribution Matching Distillation (DMD) framework and proposes a novel Adversarial Self-Distillation (ASD) strategy, which aligns the outputs of the student model’s n-step denoising process with its (n+1)-step version at the distribution level. This design provides smoother supervision by bridging small intra-student gaps and more informative guidance by combining teacher knowledge with locally consistent student behavior, substantially improving training stability and generation quality in extremely few-step scenarios (e.g., 1-2 steps). In addition, we present a First-Frame Enhancement (FFE) strategy, which allocates more denoising steps to the initial frames to mitigate error propagation while applying larger skipping steps to later frames. Extensive experiments on VBench demonstrate that our method surpasses state-of-the-art approaches in both one-step and two-step video generation. Notably, our framework produces a single distilled model that flexibly supports multiple inference-step settings, eliminating the need for repeated re-distillation and enabling efficient, high-quality video synthesis.
[325] When to Trust the Answer: Question-Aligned Semantic Nearest Neighbor Entropy for Safer Surgical VQA
Dennis Pierantozzi, Luca Carlini, Mauro Orazio Drago, Chiara Lena, Cesare Hassan, Elena De Momi, Danail Stoyanov, Sophia Bano, Mobarak I. Hoque
Main category: cs.CV
TL;DR: QA-SNNE improves surgical VQA safety by estimating uncertainty through semantic nearest neighbor entropy, enhancing failure detection and clinician trust.
Details
Motivation: Safety and reliability are critical in surgical VQA where incorrect responses can harm patients, but most research overlooks safety behaviors like ambiguity awareness and referral to experts.
Method: Question Aligned Semantic Nearest Neighbor Entropy (QA-SNNE) - a black box uncertainty estimator that incorporates question semantics into prediction confidence by measuring semantic entropy in medical text embedding space.
Result: QA-SNNE improves AUROC by 15-38% for zero-shot models, enhances hallucination detection, and maintains gains under out-of-template stress. PEFT models degrade under paraphrasing while LVLMs are more resilient.
Conclusion: QA-SNNE provides a practical step toward automatic failure detection in surgical VQA by linking semantic uncertainty to question context, improving safety and clinician trust when combined with LVLM backbones.
Abstract: Safety and reliability are essential for deploying Visual Question Answering (VQA) in surgery, where incorrect or ambiguous responses can harm the patient. Most surgical VQA research focuses on accuracy or linguistic quality while overlooking safety behaviors such as ambiguity awareness, referral to human experts, or triggering a second opinion. Inspired by Automatic Failure Detection (AFD), we study uncertainty estimation as a key enabler of safer decision making. We introduce Question Aligned Semantic Nearest Neighbor Entropy (QA-SNNE), a black box uncertainty estimator that incorporates question semantics into prediction confidence. It measures semantic entropy by comparing generated answers with nearest neighbors in a medical text embedding space, conditioned on the question. We evaluate five models, including domain specific Parameter-Efficient Fine-Tuned (PEFT) models and zero-shot Large Vision-Language Models (LVLMs), on EndoVis18-VQA and PitVQA. PEFT models degrade under mild paraphrasing, while LVLMs are more resilient. Across three LVLMs and two PEFT baselines, QA-SNNE improves AUROC in most in-template settings and enhances hallucination detection. The Area Under the ROC Curve (AUROC) increases by 15-38% for zero-shot models, with gains maintained under out-of-template stress. QA-SNNE offers a practical and interpretable step toward AFD in surgical VQA by linking semantic uncertainty to question context. Combining LVLM backbones with question aligned uncertainty estimation can improve safety and clinician trust. The code and model are available at https://github.com/DennisPierantozzi/QASNNE
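One plausible reading of a question-aligned semantic-entropy estimator, sketched below under explicit assumptions: answers sampled for the same question are embedded jointly with the question by a medical text encoder (the `embed` callable is hypothetical), grouped by a cosine-similarity threshold, and the entropy of the group sizes serves as the uncertainty score. This is an illustration in the spirit of QA-SNNE, not the paper's exact estimator.

```python
# Hedged sketch of a question-conditioned semantic entropy score.
import numpy as np

def semantic_entropy(question, answers, embed, sim_threshold=0.85):
    """embed: callable mapping a string to a unit-norm vector (e.g., a medical text encoder)."""
    vecs = np.stack([embed(f"{question} [SEP] {a}") for a in answers])
    groups, counts = [], []          # representative index and size of each semantic group
    for i, v in enumerate(vecs):
        for g, rep in enumerate(groups):
            if float(vecs[rep] @ v) >= sim_threshold:
                counts[g] += 1
                break
        else:
            groups.append(i)
            counts.append(1)
    p = np.array(counts, dtype=float)
    p /= p.sum()
    # Higher entropy -> answers disagree semantically -> consider referral to a clinician.
    return float(-(p * np.log(p)).sum())
```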
[326] Terrain-Enhanced Resolution-aware Refinement Attention for Off-Road Segmentation
Seongkyu Choi, Jhonghyun An
Main category: cs.CV
TL;DR: A resolution-aware token decoder for off-road semantic segmentation that balances global semantics, local consistency, and boundary fidelity while being robust to imperfect supervision and noise.
Details
Motivation: Off-road semantic segmentation suffers from thick boundaries, sparse supervision for rare classes, and pervasive label noise. Existing methods either blur edges at low resolution or are costly and fragile to noise at high resolution.
Method: Uses a resolution-aware token decoder with global self-attention, gated cross-attention for fine-scale detail injection, and class-aware point refinement. Includes a boundary-band consistency regularizer during training for coherent predictions around edges.
Result: The approach achieves competitive performance and improved stability across transitions in off-road semantic segmentation.
Conclusion: The proposed method effectively balances computational efficiency with boundary fidelity and noise robustness in challenging off-road segmentation scenarios.
Abstract: Off-road semantic segmentation suffers from thick, inconsistent boundaries, sparse supervision for rare classes, and pervasive label noise. Designs that fuse only at low resolution blur edges and propagate local errors, whereas maintaining high-resolution pathways or repeating high-resolution fusions is costly and fragile to noise. We introduce a resolution-aware token decoder that balances global semantics, local consistency, and boundary fidelity under imperfect supervision. Most computation occurs at a low-resolution bottleneck; a gated cross-attention injects fine-scale detail, and only a sparse, uncertainty-selected set of pixels is refined. The components are co-designed and tightly integrated: global self-attention with lightweight dilated depthwise refinement restores local coherence; a gated cross-attention integrates fine-scale features from a standard high-resolution encoder stream without amplifying noise; and a class-aware point refinement corrects residual ambiguities with negligible overhead. During training, we add a boundary-band consistency regularizer that encourages coherent predictions in a thin neighborhood around annotated edges, with no inference-time cost. Overall, the results indicate competitive performance and improved stability across transitions.
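The abstract mentions a boundary-band consistency regularizer that encourages coherent predictions near annotated edges. The sketch below is one possible instantiation, not the paper's definition: the band comes from dilating the label edge map, and predictions inside the band are pulled toward a locally averaged copy of themselves.

```python
# Hedged sketch of a boundary-band consistency term.
import torch.nn.functional as F

def boundary_band_consistency(logits, labels, band_radius=2):
    """logits: (B, C, H, W); labels: (B, H, W) integer segmentation mask."""
    lab = labels.unsqueeze(1).float()
    # Edge map: pixels where the label changes within a 3x3 neighborhood.
    edge = (F.max_pool2d(lab, 3, 1, 1) != -F.max_pool2d(-lab, 3, 1, 1)).float()
    band = F.max_pool2d(edge, 2 * band_radius + 1, 1, band_radius)  # dilate edges into a band
    probs = logits.softmax(dim=1)
    smoothed = F.avg_pool2d(probs, 3, 1, 1)
    # Penalize local disagreement only inside the boundary band.
    return ((probs - smoothed).pow(2) * band).mean()
```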
[327] Efficiently Training A Flat Neural Network Before It has been Quantizated
Peng Xia, Junbiao Pang, Tianyang Cai
Main category: cs.CV
TL;DR: Proposes a framework for post-training quantization of vision transformers by modeling quantization errors as Gaussian noise and using noise injection to achieve flat minima for better low-bit quantization.
Details
Motivation: Existing PTQ methods overlook the relationship between well-trained models and quantized versions, leading to high quantization errors. There's a need for model-agnostic approaches tailored for low-bit quantization.
Method: Models Activation Quantization Error (AQE) and Weight Quantization Error (WQE) as independent Gaussian noises. Uses noise injection optimization methods to obtain flat minima in the neural network.
Result: Experimental results demonstrate the effectiveness of the approach in achieving better low-bit PTQ models.
Conclusion: The method opens novel pathways for obtaining low-bit PTQ models by proactively pre-conditioning models through error measurement and disentanglement.
Abstract: Post-training quantization (PTQ) for vision transformers (ViTs) has garnered significant attention due to its efficiency in compressing models. However, existing methods typically overlook the relationship between a well-trained NN and the quantized model, leading to considerable quantization error for PTQ. Moreover, it is unclear how to efficiently train a model-agnostic neural network that is tailored to a predefined-precision low-bit model. In this paper, we first discover that a flat full precision neural network is crucial for low-bit quantization. To achieve this, we propose a framework that proactively pre-conditions the model by measuring and disentangling the error sources. Specifically, both the Activation Quantization Error (AQE) and the Weight Quantization Error (WQE) are statistically modeled as independent Gaussian noises. We study several noise injection optimization methods to obtain a flat minimum. Experimental results attest to the effectiveness of our approach. These results open novel pathways for obtaining low-bit PTQ models.
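To illustrate the noise-injection idea, here is a hedged sketch of a linear layer that perturbs weights and activations with Gaussian noise during training as stand-ins for WQE and AQE; the noise scales and placement are assumptions, not the paper's statistical model.

```python
# Hedged sketch: Gaussian noise injection to encourage a flat minimum before quantization.
import torch
import torch.nn as nn

class NoisyLinear(nn.Linear):
    def __init__(self, in_f, out_f, weight_noise=0.01, act_noise=0.01):
        super().__init__(in_f, out_f)
        self.weight_noise = weight_noise
        self.act_noise = act_noise

    def forward(self, x):
        if self.training:
            # Simulated WQE: perturb weights with zero-mean Gaussian noise.
            scale_w = self.weight_noise * self.weight.detach().abs().mean()
            w = self.weight + scale_w * torch.randn_like(self.weight)
            out = nn.functional.linear(x, w, self.bias)
            # Simulated AQE: perturb the activations similarly.
            scale_a = self.act_noise * out.detach().abs().mean()
            return out + scale_a * torch.randn_like(out)
        return super().forward(x)
```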
[328] Contrast-Guided Cross-Modal Distillation for Thermal Object Detection
SiWoo Kim, JhongHyun An
Main category: cs.CV
TL;DR: Training-only method to improve thermal-infrared object detection by sharpening decision boundaries and injecting RGB semantic priors, achieving SOTA performance without multi-modal inference.
Details
Motivation: Thermal-infrared detection suffers from low contrast and weak cues causing duplicate boxes, missed objects, and class confusion. Existing methods either translate TIR to RGB (fragile to artifacts) or fuse RGB-TIR at test time (requires extra sensors and calibration).
Method: Training-only objectives that: 1) sharpen instance-level decision boundaries via feature pulling/pushing to suppress duplicates and confusion, 2) inject cross-modal semantic priors by aligning student’s pyramid features with RGB-trained teacher to strengthen texture-poor thermal features.
Result: Outperformed prior approaches and achieved state-of-the-art performance in experiments.
Conclusion: The method effectively addresses root causes of thermal detection issues during training, enabling robust mono-modality inference without requiring visible input at test time.
Abstract: Robust perception at night remains challenging for thermal-infrared detection: low contrast and weak high-frequency cues lead to duplicate, overlapping boxes, missed small objects, and class confusion. Prior remedies either translate TIR to RGB and hope pixel fidelity transfers to detection – making performance fragile to color or structure artifacts – or fuse RGB and TIR at test time, which requires extra sensors, precise calibration, and higher runtime cost. Both lines can help in favorable conditions, but do not directly shape the thermal representation used by the detector. We keep mono-modality inference and tackle the root causes during training. Specifically, we introduce training-only objectives that sharpen instance-level decision boundaries by pulling together features of the same class and pushing apart those of different classes – suppressing duplicate and confusing detections – and that inject cross-modal semantic priors by aligning the student’s multi-level pyramid features with an RGB-trained teacher, thereby strengthening texture-poor thermal features without visible input at test time. In experiments, our method outperformed prior approaches and achieved state-of-the-art performance.
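A rough sketch of the two training-only terms summarized above, under assumptions: an instance-level pull/push loss in the style of supervised contrastive learning over ROI features, and an MSE alignment of the thermal student's pyramid features to a frozen RGB-trained teacher. The paper's exact loss forms and weights may differ.

```python
# Hedged sketch of pull/push and cross-modal alignment losses.
import torch
import torch.nn.functional as F

def pull_push_loss(feats, labels, temperature=0.1):
    """feats: (N, D) per-instance features; labels: (N,) class ids."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / temperature
    n = feats.shape[0]
    self_mask = torch.eye(n, dtype=torch.bool, device=feats.device)
    sim = sim.masked_fill(self_mask, float("-inf"))   # never contrast an instance with itself
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    per_anchor = log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return -per_anchor.mean()

def pyramid_alignment_loss(student_feats, teacher_feats):
    """Both are lists of feature maps from matching pyramid levels; the teacher is frozen."""
    return sum(F.mse_loss(s, t.detach()) for s, t in zip(student_feats, teacher_feats))
```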
[329] HMVLM: Human Motion-Vision-Language Model via MoE LoRA
Lei Hu, Yongjing Ye, Shihong Xia
Main category: cs.CV
TL;DR: HMVLM is a unified framework that integrates 3D human motion with foundation models using MoE LoRA strategy, addressing catastrophic forgetting and pose representation challenges in multimodal learning.
Details
Motivation: To bridge the modality gap between human motion and text, prevent catastrophic forgetting during integration, and develop autoregressive-compatible pose representations that maintain generalizability across tasks.
Method: Uses Mixture of Expert Low-Rank Adaptation (MoE LoRA) with dynamic weight allocation based on input prompts, introduces zero expert to preserve pre-trained parameters, and implements body-part-specific tokenization for enhanced spatial resolution.
Result: Effectively alleviates knowledge forgetting during instruction-tuning and achieves remarkable performance across diverse human motion downstream tasks.
Conclusion: The proposed HMVLM framework successfully addresses key challenges in multimodal integration of human motion with foundation models, demonstrating improved performance while mitigating catastrophic forgetting.
Abstract: The expansion of instruction-tuning data has enabled foundation language models to exhibit improved instruction adherence and superior performance across diverse downstream tasks. Semantically-rich 3D human motion is being progressively integrated with these foundation models to enhance multimodal understanding and cross-modal generation capabilities. However, the modality gap between human motion and text raises unresolved concerns about catastrophic forgetting during this integration. In addition, developing autoregressive-compatible pose representations that preserve generalizability across heterogeneous downstream tasks remains a critical technical barrier. To address these issues, we propose the Human Motion-Vision-Language Model (HMVLM), a unified framework based on the Mixture of Expert Low-Rank Adaptation (MoE LoRA) strategy. The framework leverages the gating network to dynamically allocate LoRA expert weights based on the input prompt, enabling synchronized fine-tuning of multiple tasks. To mitigate catastrophic forgetting during instruction-tuning, we introduce a novel zero expert that preserves the pre-trained parameters for general linguistic tasks. For pose representation, we implement body-part-specific tokenization by partitioning the human body into different joint groups, enhancing the spatial resolution of the representation. Experiments show that our method effectively alleviates knowledge forgetting during instruction-tuning and achieves remarkable performance across diverse human motion downstream tasks.
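The MoE LoRA with a zero expert can be sketched as a drop-in linear layer: a gating network softmaxes over several LoRA experts plus one expert that contributes no update, so the frozen pre-trained weights can dominate for general language inputs. Dimensions, initialization, and the dense (non-top-k) routing below are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of a MoE-LoRA linear layer with a zero expert.
import torch
import torch.nn as nn

class MoELoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, num_experts=4, rank=8):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False  # keep pre-trained weights frozen
        in_f, out_f = base_linear.in_features, base_linear.out_features
        self.A = nn.ParameterList([nn.Parameter(torch.randn(rank, in_f) * 0.01) for _ in range(num_experts)])
        self.B = nn.ParameterList([nn.Parameter(torch.zeros(out_f, rank)) for _ in range(num_experts)])
        # Gate over num_experts LoRA experts plus one zero expert.
        self.gate = nn.Linear(in_f, num_experts + 1)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)   # (..., E+1)
        out = self.base(x)
        for e in range(len(self.A)):
            delta = (x @ self.A[e].t()) @ self.B[e].t()
            out = out + weights[..., e : e + 1] * delta
        # weights[..., -1] belongs to the zero expert: it adds no update by construction,
        # so the frozen pre-trained path is preserved when the gate routes there.
        return out
```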
[330] SecDiff: Diffusion-Aided Secure Deep Joint Source-Channel Coding Against Adversarial Attacks
Changyuan Zhao, Jiacheng Wang, Ruichen Zhang, Dusit Niyato, Hongyang Du, Zehui Xiong, Dong In Kim, Ping Zhang
Main category: cs.CV
TL;DR: SecDiff is a diffusion-aided decoding framework that enhances security and robustness of deep joint source-channel coding (JSCC) against adversarial wireless attacks like pilot spoofing and subcarrier jamming, achieving better reconstruction quality with lower computational cost.
Details
Motivation: Existing JSCC frameworks are vulnerable to physical-layer adversarial threats that compromise semantic fidelity, creating a need for more secure and robust semantic communication systems.
Method: Uses pseudoinverse-guided sampling and adaptive guidance weighting for flexible step-size control and efficient semantic reconstruction. For jamming attacks: power-based subcarrier masking recasts recovery as masked inpainting. For pilot spoofing: formulates channel estimation as blind inverse problem with EM-driven reconstruction algorithm that alternates between pilot recovery and channel estimation.
Result: Extensive experiments over OFDM channels show SecDiff outperforms existing secure and generative JSCC baselines, achieving favorable trade-off between reconstruction quality and computational cost under adversarial conditions.
Conclusion: SecDiff represents a promising step toward practical, low-latency, and attack-resilient semantic communications with enhanced security and robustness.
Abstract: Deep joint source-channel coding (JSCC) has emerged as a promising paradigm for semantic communication, delivering significant performance gains over conventional separate coding schemes. However, existing JSCC frameworks remain vulnerable to physical-layer adversarial threats, such as pilot spoofing and subcarrier jamming, compromising semantic fidelity. In this paper, we propose SecDiff, a plug-and-play, diffusion-aided decoding framework that significantly enhances the security and robustness of deep JSCC under adversarial wireless environments. Different from prior diffusion-guided JSCC methods that suffer from high inference latency, SecDiff employs pseudoinverse-guided sampling and adaptive guidance weighting, enabling flexible step-size control and efficient semantic reconstruction. To counter jamming attacks, we introduce a power-based subcarrier masking strategy and recast recovery as a masked inpainting problem, solved via diffusion guidance. For pilot spoofing, we formulate channel estimation as a blind inverse problem and develop an expectation-minimization (EM)-driven reconstruction algorithm, guided jointly by reconstruction loss and a channel operator. Notably, our method alternates between pilot recovery and channel estimation, enabling joint refinement of both variables throughout the diffusion process. Extensive experiments over orthogonal frequency-division multiplexing (OFDM) channels under adversarial conditions show that SecDiff outperforms existing secure and generative JSCC baselines by achieving a favorable trade-off between reconstruction quality and computational cost. This balance makes SecDiff a promising step toward practical, low-latency, and attack-resilient semantic communications.
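The power-based subcarrier masking step can be pictured in a few lines: subcarriers whose received power is an extreme outlier are flagged as jammed and dropped, and recovery then proceeds as masked inpainting under diffusion guidance. The robust z-score rule and variable names below are assumptions made for illustration, not SecDiff's actual implementation.

```python
import numpy as np

def power_mask(rx_symbols, z_thresh=3.0):
    """Flag likely-jammed OFDM subcarriers by their received power.

    rx_symbols: complex array of shape (num_symbols, num_subcarriers).
    Returns a boolean mask (True = keep) usable as an inpainting mask.
    """
    power = np.mean(np.abs(rx_symbols) ** 2, axis=0)   # per-subcarrier power
    med = np.median(power)
    mad = np.median(np.abs(power - med)) + 1e-12       # robust spread estimate
    z = (power - med) / (1.4826 * mad)                  # robust z-score
    return z < z_thresh                                 # high-power outliers are masked out

rng = np.random.default_rng(0)
rx = (rng.standard_normal((64, 128)) + 1j * rng.standard_normal((64, 128))) / np.sqrt(2)
rx[:, 40:44] *= 30.0                                    # simulate a jammed subcarrier block
keep = power_mask(rx)
print(keep.sum(), "of", keep.size, "subcarriers kept")  # the jammed block is dropped
```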
[331] EPAN: Robust Pedestrian Re-Identification via Enhanced Alignment Network for IoT Surveillance
Zhiyang Jia, Hongyan Cui, Ge Gao, Bo Li, Minjie Zhang, Zishuo Gao, Huiwen Huang, Caisheng Zhuo
Main category: cs.CV
TL;DR: EPAN is a dual-branch network for person re-identification in IoT surveillance, achieving 90.09% Rank-1 accuracy and 78.82% mAP on the Inspection-Personnel dataset.
Details
Motivation: To address challenges in person re-identification across diverse IoT surveillance conditions with varying perspectives and environmental changes.
Method: Uses a dual-branch architecture to extract alignment information under varying scales and viewpoints, mitigating perspective and environmental impacts.
Result: Achieved outstanding performance with 90.09% Rank-1 accuracy and 78.82% mAP on the Inspection-Personnel dataset.
Conclusion: EPAN demonstrates strong potential for real-world IoT applications, enabling effective and reliable person re-identification across diverse surveillance cameras.
Abstract: Person re-identification (ReID) plays a pivotal role in computer vision, particularly in surveillance and security applications within IoT-enabled smart environments. This study introduces the Enhanced Pedestrian Alignment Network (EPAN), tailored for robust ReID across diverse IoT surveillance conditions. EPAN employs a dual-branch architecture to mitigate the impact of perspective and environmental changes, extracting alignment information under varying scales and viewpoints. Here, we demonstrate EPAN’s strong feature extraction capabilities, achieving outstanding performance on the Inspection-Personnel dataset with a Rank-1 accuracy of 90.09% and a mean Average Precision (mAP) of 78.82%. This highlights EPAN’s potential for real-world IoT applications, enabling effective and reliable person ReID across diverse cameras in surveillance and security systems. The code and data are available at: https://github.com/ggboy2580/EPAN
[332] SE(3)-PoseFlow: Estimating 6D Pose Distributions for Uncertainty-Aware Robotic Manipulation
Yufeng Jin, Niklas Funk, Vignesh Prasad, Zechu Li, Mathias Franzius, Jan Peters, Georgia Chalvatzaki
Main category: cs.CV
TL;DR: A probabilistic framework using flow matching on SE(3) manifold for estimating 6D object pose distributions, addressing pose ambiguity from occlusions and symmetries with sample-based uncertainty modeling.
Details
Motivation: Object pose estimation faces challenges from partial observability, occlusions, and object symmetries that create pose ambiguity. Deterministic deep networks are overconfident and fail to capture multi-modal pose distributions in ambiguous cases.
Method: Proposes flow matching on the SE(3) manifold to model full pose distributions with sample-based estimates, enabling uncertainty reasoning for symmetric objects and severe occlusions.
Result: Achieves state-of-the-art performance on Real275, YCB-V, and LM-O datasets. Demonstrates practical applications in robotic manipulation tasks like active perception and uncertainty-aware grasp synthesis.
Conclusion: The probabilistic framework successfully addresses pose ambiguity through sample-based distribution modeling, providing uncertainty-aware pose estimates that benefit downstream robotic applications.
Abstract: Object pose estimation is a fundamental problem in robotics and computer vision, yet it remains challenging due to partial observability, occlusions, and object symmetries, which inevitably lead to pose ambiguity and multiple hypotheses consistent with the same observation. While deterministic deep networks achieve impressive performance under well-constrained conditions, they are often overconfident and fail to capture the multi-modality of the underlying pose distribution. To address these challenges, we propose a novel probabilistic framework that leverages flow matching on the SE(3) manifold for estimating 6D object pose distributions. Unlike existing methods that regress a single deterministic output, our approach models the full pose distribution with a sample-based estimate and enables reasoning about uncertainty in ambiguous cases such as symmetric objects or severe occlusions. We achieve state-of-the-art results on Real275, YCB-V, and LM-O, and demonstrate how our sample-based pose estimates can be leveraged in downstream robotic manipulation tasks such as active perception for disambiguating uncertain viewpoints or guiding grasp synthesis in an uncertainty-aware manner.
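One way to read the sample-based estimate: a learned velocity field on SE(3) transports randomly initialised poses toward the data distribution, and the empirical spread of the transported samples approximates the pose posterior. The display below is a generic flow-matching sketch under that reading, with symbols chosen for illustration rather than taken from the paper.

$$
\frac{\mathrm{d} g_t}{\mathrm{d} t} = v_\theta(g_t, t), \qquad
g_{t+\Delta t} \approx g_t \exp\!\bigl(\Delta t \, v_\theta(g_t, t)\bigr), \qquad g_t \in SE(3),
$$
$$
\hat{p}(g \mid \mathbf{o}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta\bigl(g - g_1^{(i)}\bigr), \qquad g_0^{(i)} \sim p_0,
$$

where $\mathbf{o}$ is the observation, $p_0$ is the distribution of initial pose samples, and the endpoints $g_1^{(i)}$ form the sample-based pose distribution whose spread reflects ambiguity from symmetry or occlusion.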
[333] Driving scenario generation and evaluation using a structured layer representation and foundational models
Arthur Hubert, Gamal Elghazaly, Raphaël Frank
Main category: cs.CV
TL;DR: Proposes a structured five-layer model for generating and evaluating rare driving scenarios using generative models and data augmentation, with metrics for diversity and originality.
Details
Motivation: Rare driving scenarios are critical for autonomous vehicle development but difficult to encounter, requiring simulation and generation methods.
Method: Uses a structured five-layer model with subclasses and characteristics for agents, combined with large foundational models and data augmentation to generate scenarios. Introduces embedding-based comparison and two evaluation metrics.
Result: Developed diversity and originality metrics to evaluate synthetic datasets, demonstrated in various generation setups with qualitative evaluation of synthetic videos from structured descriptions.
Conclusion: The structured five-layer model effectively improves rare scenario generation and evaluation for autonomous driving applications.
Abstract: Rare and challenging driving scenarios are critical for autonomous vehicle development. Since they are difficult to encounter, simulating or generating them using generative models is a popular approach. Following previous efforts to structure driving scenario representations in a layer model, we propose a structured five-layer model to improve the evaluation and generation of rare scenarios. We use this model alongside large foundational models to generate new driving scenarios using a data augmentation strategy. Unlike previous representations, our structure introduces subclasses and characteristics for every agent of the scenario, allowing us to compare them using an embedding specific to our layer-model. We study and adapt two metrics to evaluate the relevance of a synthetic dataset in the context of a structured representation: the diversity score estimates how different the scenarios of a dataset are from one another, while the originality score calculates how similar a synthetic dataset is to a real reference set. This paper showcases both metrics in different generation setups, as well as a qualitative evaluation of synthetic videos generated from structured scenario descriptions. The code and extended results can be found at https://github.com/Valgiz/5LMSG.
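Both scores reduce to simple statistics over scenario embeddings. The sketch below shows one plausible reading (mean pairwise distance for diversity, mean nearest-real-neighbour distance for originality); the exact definitions used in the paper may differ, so treat the functions as illustrative.

```python
import numpy as np

def diversity_score(synth_emb):
    """Mean pairwise distance among synthetic scenario embeddings (higher = more diverse)."""
    d = np.linalg.norm(synth_emb[:, None, :] - synth_emb[None, :, :], axis=-1)
    n = len(synth_emb)
    return d.sum() / (n * (n - 1))            # exclude the zero diagonal

def originality_score(synth_emb, real_emb):
    """Mean distance from each synthetic scenario to its nearest real reference scenario."""
    d = np.linalg.norm(synth_emb[:, None, :] - real_emb[None, :, :], axis=-1)
    return d.min(axis=1).mean()               # higher = further from the reference set

rng = np.random.default_rng(1)
synthetic = rng.normal(size=(50, 16))         # placeholder layer-model embeddings
reference = rng.normal(size=(200, 16))
print(diversity_score(synthetic), originality_score(synthetic, reference))
```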
[334] Discriminately Treating Motion Components Evolves Joint Depth and Ego-Motion Learning
Mengtan Zhang, Zizhan Guo, Hongbo Zhao, Yi Feng, Zuyi Xiong, Yue Wang, Shaoyi Du, Hanli Wang, Rui Fan
Main category: cs.CV
TL;DR: DiMoDE introduces a discriminative treatment of motion components in unsupervised depth and ego-motion learning, leveraging geometric regularities of rigid flows to improve both tasks through targeted constraints and reformulated joint learning.
Details
Motivation: Most existing methods treat ego-motion as auxiliary, either mixing all motion types or excluding depth-independent motions, limiting geometric constraints and reducing reliability under diverse conditions.
Method: Network outputs align optical axes and imaging planes between frames, transform optical flows through these alignments, and quantify deviations to impose geometric constraints on each ego-motion component individually. This reformulates joint learning into coaxial and coplanar forms with closed-form geometric relationships.
Result: DiMoDE achieves state-of-the-art performance on multiple public datasets and a newly collected diverse real-world dataset, particularly under challenging conditions.
Conclusion: The discriminative treatment of motion components and geometric reformulation significantly improve depth and ego-motion estimation robustness and reliability.
Abstract: Unsupervised learning of depth and ego-motion, two fundamental 3D perception tasks, has made significant strides in recent years. However, most methods treat ego-motion as an auxiliary task, either mixing all motion types or excluding depth-independent rotational motions in supervision. Such designs limit the incorporation of strong geometric constraints, reducing reliability and robustness under diverse conditions. This study introduces a discriminative treatment of motion components, leveraging the geometric regularities of their respective rigid flows to benefit both depth and ego-motion estimation. Given consecutive video frames, network outputs first align the optical axes and imaging planes of the source and target cameras. Optical flows between frames are transformed through these alignments, and deviations are quantified to impose geometric constraints individually on each ego-motion component, enabling more targeted refinement. These alignments further reformulate the joint learning process into coaxial and coplanar forms, where depth and each translation component can be mutually derived through closed-form geometric relationships, introducing complementary constraints that improve depth robustness. DiMoDE, a general depth and ego-motion joint learning framework incorporating these designs, achieves state-of-the-art performance on multiple public datasets and a newly collected diverse real-world dataset, particularly under challenging conditions. Our source code will be publicly available at mias.group/DiMoDE upon publication.
[335] Luminance-Aware Statistical Quantization: Unsupervised Hierarchical Learning for Illumination Enhancement
Derong Kong, Zhixiong Yang, Shengxi Li, Shuaifeng Zhi, Li Liu, Zhen Liu, Jingyuan Xia
Main category: cs.CV
TL;DR: LASQ reformulates low-light image enhancement as a statistical sampling process using power-law distributed luminance transitions, enabling unsupervised enhancement without normal-light references.
Details
Motivation: Existing LLIE methods struggle with balancing reconstruction fidelity and cross-scenario generalization, especially when normal-light references are unavailable, due to their reliance on deterministic pixel-level mappings.
Method: Introduces Luminance-Aware Statistical Quantification (LASQ) that models luminance transitions as power-law distributions, uses stratified power functions, and employs a diffusion forward process to discover optimal transition paths between luminance layers.
Result: Achieves superior performance on domain-specific datasets and better generalization across non-reference datasets, enabling more adaptable and versatile light restoration in practical situations.
Conclusion: LASQ provides a probabilistic framework that replaces deterministic mappings, significantly improving LLIE performance in both reference and non-reference scenarios while maintaining better generalization capabilities.
Abstract: Low-light image enhancement (LLIE) faces persistent challenges in balancing reconstruction fidelity with cross-scenario generalization. While existing methods predominantly focus on deterministic pixel-level mappings between paired low/normal-light images, they often neglect the continuous physical process of luminance transitions in real-world environments, leading to performance drop when normal-light references are unavailable. Inspired by empirical analysis of natural luminance dynamics revealing power-law distributed intensity transitions, this paper introduces Luminance-Aware Statistical Quantification (LASQ), a novel framework that reformulates LLIE as a statistical sampling process over hierarchical luminance distributions. Our LASQ re-conceptualizes luminance transition as a power-law distribution in intensity coordinate space that can be approximated by stratified power functions, thereby replacing deterministic mappings with probabilistic sampling over continuous luminance layers. A diffusion forward process is designed to autonomously discover optimal transition paths between luminance layers, achieving unsupervised distribution emulation without normal-light references. In this way, it considerably improves the performance in practical situations, enabling more adaptable and versatile light restoration. This framework is also readily applicable to cases with normal-light references, where it achieves superior performance on domain-specific datasets alongside better generalization ability across non-reference datasets.
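Schematically, the statistical assumption can be written as a power law over luminance transitions that is approximated by a stratified family of power functions sampled by the diffusion process. The display below is an illustrative formulation under that reading; the symbols are not taken from the paper.

$$
p(\Delta \ell) \propto \Delta \ell^{-\alpha}, \qquad
T_k(x) = x^{\gamma_k},\ k = 1, \dots, K, \qquad
\hat{y} = \sum_{k=1}^{K} \pi_k\, T_k(x), \quad \pi \sim q_\phi(\pi \mid x),
$$

where $x$ is the low-light intensity, the $T_k$ are stratified power functions (luminance layers), and $q_\phi$ stands in for the sampling distribution realised by the diffusion forward process.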
[336] Example-Based Feature Painting on Textures
Andrei-Timotei Ardelean, Tim Weyrich
Main category: cs.CV
TL;DR: A system for controlled authoring and editing of textures with local characteristics like stains, tears, and holes using unsupervised learning and diffusion-based generation.
Details
Motivation: Realistic textures require including natural alterations like stains and abrasions that are ubiquitous in nature, but current methods lack efficient ways to incorporate these features without manual annotation.
Method: Uses unsupervised anomaly detection to identify appearance-altering features from unlabeled examples, automatically clusters them into semantic groups, and employs diffusion-based editing for conditional generation of arbitrary-sized textures.
Result: Developed a complete pipeline from small image collections to versatile generative models that enable interactive feature painting on textures of any size.
Conclusion: The approach successfully creates realistic textures with natural blemishes without manual annotation, and the introduced algorithms for diffusion-based editing and infinite texture generation are generic enough for broader applications.
Abstract: In this work, we propose a system that covers the complete workflow for achieving controlled authoring and editing of textures that present distinctive local characteristics. These include various effects that change the surface appearance of materials, such as stains, tears, holes, abrasions, discoloration, and more. Such alterations are ubiquitous in nature, and including them in the synthesis process is crucial for generating realistic textures. We introduce a novel approach for creating textures with such blemishes, adopting a learning-based approach that leverages unlabeled examples. Our approach does not require manual annotations by the user; instead, it detects the appearance-altering features through unsupervised anomaly detection. The various textural features are then automatically clustered into semantically coherent groups, which are used to guide the conditional generation of images. Our pipeline as a whole goes from a small image collection to a versatile generative model that enables the user to interactively create and paint features on textures of arbitrary size. Notably, the algorithms we introduce for diffusion-based editing and infinite stationary texture generation are generic and should prove useful in other contexts as well. Project page: https://reality.tf.fau.de/pub/ardelean2025examplebased.html
[337] NSYNC: Negative Synthetic Image Generation for Contrastive Training to Improve Stylized Text-To-Image Translation
Serkan Ozturk, Samet Hicsonmez, Pinar Duygulu
Main category: cs.CV
TL;DR: NSYNC introduces a contrastive learning framework using synthetic negative examples to improve text-to-image diffusion models’ ability to capture specific artistic styles by orthogonalizing gradients.
Details
Motivation: Current text-to-image models generate realistic images but fail to capture specific artistic styles. Fine-tuning on style datasets alone is insufficient for grasping style features effectively.
Method: Proposes a contrastive training scheme using synthetic negative image sets alongside real positive images. Refines gradients by subtracting positive gradient’s projection onto negative gradient, updating parameters based on the orthogonal component to eliminate trivial shared attributes.
Result: Experiments on various painter and illustrator styles show improved performance over baseline methods both quantitatively and qualitatively.
Conclusion: The NSYNC framework successfully enhances stylization capability of diffusion models by leveraging synthetic negative examples in contrastive learning to capture more unique style characteristics.
Abstract: Current text conditioned image generation methods output realistic looking images, but they fail to capture specific styles. Simply finetuning them on the target style datasets still struggles to grasp the style features. In this work, we present a novel contrastive learning framework to improve the stylization capability of large text-to-image diffusion models. Motivated by the astonishing advance in image generation models that makes synthetic data an intrinsic part of model training in various computer vision tasks, we exploit synthetic image generation in our approach. Usually, the generated synthetic data is dependent on the task, and most of the time it is used to enlarge the available real training dataset. With NSYNC, alternatively, we focus on generating negative synthetic sets to be used in a novel contrastive training scheme along with real positive images. In our proposed training setup, we forward negative data along with positive data and obtain negative and positive gradients, respectively. We then refine the positive gradient by subtracting its projection onto the negative gradient to get the orthogonal component, based on which the parameters are updated. This orthogonal component eliminates the trivial attributes that are present in both positive and negative data and directs the model towards capturing a more unique style. Experiments on various styles of painters and illustrators show that our approach improves the performance over the baseline methods both quantitatively and qualitatively. Our code is available at https://github.com/giddyyupp/NSYNC.
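The gradient-refinement step has a closed form: the update keeps only the component of the positive gradient orthogonal to the negative one, $g \leftarrow g_{+} - \frac{\langle g_{+}, g_{-}\rangle}{\lVert g_{-}\rVert^{2}} g_{-}$. A minimal sketch with flattened gradient vectors (an assumption made for readability):

```python
import torch

def orthogonal_update(g_pos, g_neg, eps=1e-12):
    """Remove from g_pos the component shared with g_neg, keeping only the orthogonal part."""
    proj = (g_pos @ g_neg) / (g_neg @ g_neg + eps) * g_neg
    return g_pos - proj

g_pos = torch.tensor([1.0, 2.0, 0.0])
g_neg = torch.tensor([1.0, 0.0, 0.0])
print(orthogonal_update(g_pos, g_neg))   # tensor([0., 2., 0.]) -- shared direction removed
```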
[338] DINO-MX: A Modular & Flexible Framework for Self-Supervised Learning
Mahmut Selman Gokmen, Cody Bumgardner
Main category: cs.CV
TL;DR: DINO-MX is a modular and extensible training framework that unifies DINO family methods, supporting various transformer architectures and training strategies while reducing computational costs and improving interpretability.
Details
Motivation: Existing vision foundation model training pipelines are inflexible, domain-specific, and computationally expensive, limiting their usability across different domains and resource settings.
Method: Combines core principles of DINO, DINOv2 and DINOv3 in a unified configuration-driven system, supports multiple transformer architectures, and includes training strategies like LoRA, layer freezing, knowledge distillation, DDP, and FSDP.
Result: Achieves competitive performance on diverse datasets while significantly reducing computational costs, and provides interpretability tools with label-guided data augmentation for improved attention-based localization.
Conclusion: DINO-MX provides a reproducible and scalable foundation for developing, adapting, and benchmarking self-supervised vision models across research and real-world applications.
Abstract: Vision Foundation Models (VFMs) have advanced representation learning through self-supervised methods. However, existing training pipelines are often inflexible, domain-specific, or computationally expensive, which limits their usability across different domains and resource settings. DINO-MX is a modular and extensible training framework that combines the core principles of DINO, DINOv2 and DINOv3 within a unified configuration-driven system. It supports a variety of transformer-based architectures and is fully compatible with the Hugging Face ecosystem. The framework includes multiple training strategies such as low-rank adaptation (LoRA), layer freezing, and knowledge distillation, along with support for distributed training through both Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP). DINO-MX is designed to work with both natural and specialized data types, including single- and multi-channel images. Experimental results on diverse datasets show that DINO-MX achieves competitive performance while significantly reducing computational costs. Additionally, it offers interpretability tools and a label-guided data augmentation method that improves attention-based localization without the need for extra detection or segmentation heads. DINO-MX provides a reproducible and scalable foundation for developing, adapting, and benchmarking self-supervised vision models across a range of research and real-world applications.
[339] PCD-ReID: Occluded Person Re-Identification for Base Station Inspection
Ge Gao, Zishuo Gao, Hongyan Cui, Zhiyang Jia, Zhuang Luo, ChaoPeng Liu
Main category: cs.CV
TL;DR: PCD-ReID algorithm uses Transformer architecture to extract shared component features for occluded pedestrian re-identification, achieving 79.0% mAP and 82.7% Rank-1 accuracy with 15.9% improvement over ResNet50 methods.
Details
Motivation: Traditional ResNet-based ReID algorithms fail to effectively handle occlusions that obscure key body features in surveillance scenarios, requiring new methods for occluded pedestrian re-identification.
Method: Transformer-based PCD network that extracts shared component features (helmets, uniforms) and uses new real-world patrol surveillance dataset with 10,000 individuals and 50,000+ images collected over six months.
Result: Achieved 79.0% mAP and 82.7% Rank-1 accuracy, representing 15.9% Rank-1 improvement over ResNet50-based methods in occlusion-aware ReID performance.
Conclusion: PCD-ReID effectively addresses occlusion challenges in pedestrian re-identification for tower inspection scenarios and shows strong potential for practical deployment in surveillance and security applications.
Abstract: Occluded pedestrian re-identification (ReID) in base station environments is a critical task in computer vision, particularly for surveillance and security applications. This task faces numerous challenges, as occlusions often obscure key body features, increasing the complexity of identification. Traditional ResNet-based ReID algorithms often fail to address occlusions effectively, necessitating new ReID methods. We propose the PCD-ReID (Pedestrian Component Discrepancy) algorithm to address these issues. The contributions of this work are as follows: To tackle the occlusion problem, we design a Transformer-based PCD network capable of extracting shared component features, such as helmets and uniforms. To mitigate overfitting on public datasets, we collected new real-world patrol surveillance images for model training, covering six months, 10,000 individuals, and over 50,000 images. Comparative experiments with existing ReID algorithms demonstrate that our model achieves a mean Average Precision (mAP) of 79.0% and a Rank-1 accuracy of 82.7%, marking a 15.9% Rank-1 improvement over ResNet50-based methods. Experimental evaluations indicate that PCD-ReID effectively achieves occlusion-aware ReID performance for personnel in tower inspection scenarios, highlighting its potential for practical deployment in surveillance and security applications.
[340] NOA: a versatile, extensible tool for AI-based organoid analysis
Mikhail Konov, Lion J. Gleiter, Khoa Co, Monica Yabal, Tingying Peng
Main category: cs.CV
TL;DR: NOA is a graphical user interface plugin for napari that simplifies AI-based organoid image analysis by integrating detection, segmentation, tracking, feature extraction, and ML prediction modules.
Details
Motivation: AI tools for organoid microscopy analysis are inaccessible to biologists without programming experience, leading to labor-intensive manual workflows, and existing tools are narrowly focused on specific tasks.
Method: Developed NOA as an open-source napari plugin that integrates multiple state-of-the-art algorithms for organoid detection, segmentation, tracking, feature extraction, custom annotation, and ML-based feature prediction.
Result: Demonstrated NOA’s versatility through three case studies: quantifying morphological changes during organoid differentiation, assessing phototoxicity effects, and predicting organoid viability and differentiation state.
Conclusion: NOA enables comprehensive, AI-driven organoid image analysis within an accessible and extensible framework, making advanced analysis tools available to non-programming biologists.
Abstract: AI tools can greatly enhance the analysis of organoid microscopy images, from detection and segmentation to feature extraction and classification. However, their limited accessibility to biologists without programming experience remains a major barrier, resulting in labor-intensive and largely manual workflows. Although a few AI models for organoid analysis have been developed, most existing tools remain narrowly focused on specific tasks. In this work, we introduce the Napari Organoid Analyzer (NOA), a general purpose graphical user interface to simplify AI-based organoid analysis. NOA integrates modules for detection, segmentation, tracking, feature extraction, custom feature annotation and ML-based feature prediction. It interfaces multiple state-of-the-art algorithms and is implemented as an open-source napari plugin for maximal flexibility and extensibility. We demonstrate the versatility of NOA through three case studies, involving the quantification of morphological changes during organoid differentiation, assessment of phototoxicity effects, and prediction of organoid viability and differentiation state. Together, these examples illustrate how NOA enables comprehensive, AI-driven organoid image analysis within an accessible and extensible framework.
[341] PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model
Wenqi Liang, Gan Sun, Yao He, Jiahua Dong, Suyan Dai, Ivan Laptev, Salman Khan, Yang Cong
Main category: cs.CV
TL;DR: PixelVLA is a new Vision-Language-Action model that addresses limitations of current VLAs by enabling pixel-level reasoning and multimodal prompting with both text and visual inputs, achieving significant performance improvements with much lower training costs.
Details
Motivation: Current VLAs struggle with pixel-level scene understanding and rely heavily on textual prompts, limiting their flexibility in real-world robot control applications.
Method: Built on a visuomotor instruction tuning framework with multiscale pixel-aware encoder and visual prompting encoder, trained using a two-stage automated annotation pipeline that generates Pixel-160K dataset with pixel-level annotations.
Result: Improves manipulation success rates by 10.1%-17.8% over OpenVLA while requiring only 1.5% of its pretraining cost on three standard VLA benchmarks.
Conclusion: PixelVLA can be integrated into existing VLAs to enable more accurate, efficient, and versatile robot control in complex environments, with the dataset and code being released as open source.
Abstract: Vision-Language-Action models (VLAs) are emerging as powerful tools for learning generalizable visuomotor control policies. However, current VLAs are mostly trained on large-scale image-text-action data and remain limited in two key ways: (i) they struggle with pixel-level scene understanding, and (ii) they rely heavily on textual prompts, which reduces their flexibility in real-world settings. To address these challenges, we introduce PixelVLA, the first VLA model designed to support both pixel-level reasoning and multimodal prompting with text and visual inputs. Our approach is built on a new visuomotor instruction tuning framework that integrates a multiscale pixel-aware encoder with a visual prompting encoder. To train PixelVLA effectively, we further propose a two-stage automated annotation pipeline that generates Pixel-160K, a large-scale dataset with pixel-level annotations derived from existing robot data. Experiments on three standard VLA benchmarks and two VLA model variants show that PixelVLA improves manipulation success rates by 10.1%-17.8% over OpenVLA, while requiring only 1.5% of its pretraining cost. These results demonstrate that PixelVLA can be integrated into existing VLAs to enable more accurate, efficient, and versatile robot control in complex environments. The dataset and code will be released as open source.
[342] Generative Adversarial Synthesis and Deep Feature Discrimination of Brain Tumor MRI Images
Md Sumon Ali, Muzammil Behzad
Main category: cs.CV
TL;DR: Proposed DL methodology using DC-GAN to generate synthetic MRI data and CNN classifier to evaluate synthetic image quality through brain tumor classification.
Details
Motivation: Limited availability of original MRI data and the difficulty of generating realistic medical images create a need for synthetic data generation in medical imaging.
Method: Used Deep Convolutional Generative Adversarial Network (DC-GAN) to create synthetic MRI data and Convolutional Neural Network (CNN) classifier to evaluate synthetic images through brain tumor classification.
Result: Classification performance on synthetic images was comparable to real images, validating the effectiveness of GAN-generated images for downstream tasks.
Conclusion: GAN-generated synthetic MRI data is effective and can address the problem of limited medical imaging data while maintaining utility for classification tasks.
Abstract: Compared to traditional methods, Deep Learning (DL) has become a key technology for computer vision tasks. Synthetic data generation is an interesting use case for DL, especially in the field of medical imaging such as Magnetic Resonance Imaging (MRI). The need for this task arises because original MRI data is limited. Generating realistic medical images is difficult and challenging. Generative Adversarial Networks (GANs) are useful for creating synthetic medical images. In this paper, we propose a DL based methodology for creating synthetic MRI data using the Deep Convolutional Generative Adversarial Network (DC-GAN) to address the problem of limited data. We also employ a Convolutional Neural Network (CNN) classifier to classify the brain tumor using synthetic data and real MRI data. CNN is used to evaluate the quality and utility of the synthetic images. The classification result demonstrates comparable performance on real and synthetic images, which validates the effectiveness of GAN-generated images for downstream tasks.
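The DC-GAN generator itself is a standard stack of transposed convolutions with batch normalisation and ReLU that upsamples a latent vector into an image. Below is a minimal PyTorch sketch for single-channel 64x64 MRI-like outputs; the layer widths and output size are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    """Minimal DC-GAN generator: latent vector -> 1x64x64 synthetic image."""
    def __init__(self, z_dim=100, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, base * 8, 4, 1, 0, bias=False),     # 4x4
            nn.BatchNorm2d(base * 8), nn.ReLU(True),
            nn.ConvTranspose2d(base * 8, base * 4, 4, 2, 1, bias=False),  # 8x8
            nn.BatchNorm2d(base * 4), nn.ReLU(True),
            nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1, bias=False),  # 16x16
            nn.BatchNorm2d(base * 2), nn.ReLU(True),
            nn.ConvTranspose2d(base * 2, base, 4, 2, 1, bias=False),      # 32x32
            nn.BatchNorm2d(base), nn.ReLU(True),
            nn.ConvTranspose2d(base, 1, 4, 2, 1, bias=False),             # 64x64
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

g = DCGANGenerator()
fake = g(torch.randn(8, 100))
print(fake.shape)  # torch.Size([8, 1, 64, 64])
```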
[343] Wave-Particle (Continuous-Discrete) Dualistic Visual Tokenization for Unified Understanding and Generation
Yizhu Chen, Chen Ju, Zhicheng Wang, Shuai Xiao, Xu Chen, Jinsong Lan, Xiaoyong Zhu, Ying Chen
Main category: cs.CV
TL;DR: CDD-VT proposes a dualistic visual tokenizer that adaptively combines continuous and discrete tokenization approaches based on image complexity, achieving superior performance with a more scalable architecture.
Details
Motivation: To overcome the limitations of continuous tokenizers (complex pipelines) and discrete tokenizers (information loss) in multi-modal large models by creating a unified approach that adapts to image complexity.
Method: Uses a dualistic approach with Diverse Quantitative Primitives for orthogonal primitive representation and Dynamic Primitive Allocator to determine optimal primitive count based on image complexity.
Result: Achieves superior performance in reconstruction, retrieval and classification tasks compared to specialized continuous and discrete tokenizers.
Conclusion: CDD-VT effectively bridges the continuous-discrete dichotomy in visual tokenization, enabling strong performance within a concise and scalable MLLM framework.
Abstract: The unification of understanding and generation within a single multi-modal large model (MLLM) remains a significant challenge, largely due to the dichotomy between continuous and discrete visual tokenizations. Continuous tokenizer (CT) achieves strong performance by bridging multiple independently-trained understanding modules and generation modules, but suffers from complex multi-stage pipelines and substantial engineering overhead. Conversely, discrete tokenizers (DT) offer a conceptually elegant idea by quantizing each image into a primitive, but inevitably lead to information loss and performance degradation. To resolve this tension, we question the binary choice between CT and DT, inspired by the wave-particle duality of light, and propose the Continuous-Discrete Dualistic Visual Tokenizer (CDD-VT). We treat visual data as a flexible composition of image primitives derived from quantized codebooks, with the crucial insight that the primitive number assigned to each visual sample is adaptively determined according to its complexity: simple instances use a few primitives, emulating discrete tokenization, while complex instances use many, approximating continuous tokenization. Two core components are designed: Diverse Quantitative Primitives, which encourages primitive orthogonality to better populate the information space, and Dynamic Primitive Allocator, which assesses sample complexity to determine the optimal set of primitives. Extensive experiments on reconstruction, retrieval and classification show that CDD-VT achieves superior performance over specialized CT and DT, delivering strong results within a concise and scalable MLLM.
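The continuous-discrete behaviour hinges on how many codebook primitives a sample is allotted. The toy sketch below uses a histogram-entropy complexity proxy and a linear mapping to a primitive budget; both choices are assumptions made for illustration and do not reproduce CDD-VT's actual allocator.

```python
import numpy as np

def complexity(image):
    """Cheap complexity proxy: entropy of the grey-level histogram."""
    hist, _ = np.histogram(image, bins=64, range=(0.0, 1.0), density=True)
    p = hist / (hist.sum() + 1e-12)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def allocate_primitives(image, min_k=1, max_k=256):
    """Map complexity to a primitive budget: simple images get few primitives
    (discrete-like tokenization), complex images get many (continuous-like)."""
    c_norm = np.clip(complexity(image) / np.log(64), 0.0, 1.0)  # normalise by max entropy
    return int(round(min_k + c_norm * (max_k - min_k)))

flat = np.full((32, 32), 0.5)                       # near-constant patch -> few primitives
noisy = np.random.default_rng(0).random((32, 32))   # texture-like patch -> many primitives
print(allocate_primitives(flat), allocate_primitives(noisy))
```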
[344] Lite ENSAM: a lightweight cancer segmentation model for 3D Computed Tomography
Agnar Martin Bjørnstad, Elias Stenhede, Arian Ranjbar
Main category: cs.CV
TL;DR: Lite ENSAM is a lightweight adaptation of ENSAM architecture for efficient volumetric tumor segmentation from CT scans with RECIST annotations, achieving competitive performance in MICCAI FLARE 2025 Task 1.
Details
Motivation: Current clinical practice relies on RECIST v1.1 for tumor size measurement using longest diameter, but volumetric measurements provide more reliable treatment assessment. Manual volumetric annotation is labor-intensive, limiting clinical adoption of volumetric assessment.
Method: Lite ENSAM, a lightweight adaptation of the ENSAM architecture, designed for efficient volumetric tumor segmentation from CT scans annotated with RECIST annotations.
Result: Achieved Dice Similarity Coefficient (DSC) of 60.7% and Normalized Surface Dice (NSD) of 63.6% on hidden test set in MICCAI FLARE 2025 Task 1. Average total RAM time of 50.6 GBs and average inference time of 14.4 s on CPU on public validation dataset.
Conclusion: Lite ENSAM provides an efficient solution for volumetric tumor segmentation from CT scans with RECIST annotations, addressing the labor-intensive nature of manual volumetric annotation and enabling more reliable treatment assessment.
Abstract: Accurate tumor size measurement is a cornerstone of evaluating cancer treatment response. The most widely adopted standard for this purpose is the Response Evaluation Criteria in Solid Tumors (RECIST) v1.1, which relies on measuring the longest tumor diameter in a single plane. However, volumetric measurements have been shown to provide a more reliable assessment of treatment effect. Their clinical adoption has been limited, though, due to the labor-intensive nature of manual volumetric annotation. In this paper, we present Lite ENSAM, a lightweight adaptation of the ENSAM architecture designed for efficient volumetric tumor segmentation from CT scans annotated with RECIST annotations. Lite ENSAM was submitted to the MICCAI FLARE 2025 Task 1: Pan-cancer Segmentation in CT Scans, Subtask 2, where it achieved a Dice Similarity Coefficient (DSC) of 60.7% and a Normalized Surface Dice (NSD) of 63.6% on the hidden test set, and an average total RAM time of 50.6 GBs and an average inference time of 14.4 s on CPU on the public validation dataset.
[345] Benchmark-Ready 3D Anatomical Shape Classification
Tomáš Krsička, Tibor Kubík
Main category: cs.CV
TL;DR: The paper introduces PSPooling, a non-learnable mesh pooling operator for 3D anatomical shape classification, and MedShapeNet19 benchmark dataset, showing improved reconstruction and classification in low-label settings.
Details
Motivation: Progress in anatomical 3D shape classification is limited by mesh complexity and lack of standardized benchmarks, highlighting the need for robust learning methods and reproducible evaluation.
Method: Proposes Precomputed Structural Pooling (PSPooling) for efficient graph coarsening, integrated into a self-supervised graph autoencoder that learns anatomy-aware representations from unlabeled surface meshes.
Result: PSPooling significantly improves reconstruction fidelity and classification accuracy in low-label regimes on the MedShapeNet19 benchmark dataset.
Conclusion: PSPooling establishes a strong baseline for medical 3D shape learning, and MedShapeNet19 serves as a widely adopted benchmark for anatomical shape classification research.
Abstract: Progress in anatomical 3D shape classification is limited by the complexity of mesh data and the lack of standardized benchmarks, highlighting the need for robust learning methods and reproducible evaluation. We introduce two key steps toward clinically and benchmark-ready anatomical shape classification via self-supervised graph autoencoding. We propose Precomputed Structural Pooling (PSPooling), a non-learnable mesh pooling operator designed for efficient and structure-preserving graph coarsening in 3D anatomical shape analysis. PSPooling precomputes node correspondence sets based on geometric proximity, enabling parallelizable and reversible pooling and unpooling operations with guaranteed support structure. This design avoids the sparsity and reconstruction issues of selection-based methods and the sequential overhead of edge contraction approaches, making it particularly suitable for high-resolution medical meshes. To demonstrate its effectiveness, we integrate PSPooling into a self-supervised graph autoencoder that learns anatomy-aware representations from unlabeled surface meshes. We evaluate the downstream benefits on MedShapeNet19, a new curated benchmark dataset we derive from MedShapeNet, consisting of 19 anatomical classes with standardized training, validation, and test splits. Experiments show that PSPooling significantly improves reconstruction fidelity and classification accuracy in low-label regimes, establishing a strong baseline for medical 3D shape learning. We hope that MedShapeNet19 will serve as a widely adopted benchmark for anatomical shape classification and further research in medical 3D shape analysis. Access the complete codebase, model weights, and dataset information here: https://github.com/TomasKrsicka/MedShapeNet19-PSPooling.
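The non-learnable pooling can be pictured as a precomputed assignment matrix: each coarse node averages a fixed set of geometrically nearby fine nodes, and unpooling reuses the same assignment in reverse, which is what makes the operator reversible and parallel-friendly. The nearest-centre clustering and matrix form below are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def precompute_pooling(vertices, num_coarse):
    """Assign each fine vertex to its nearest coarse centre (centres chosen by uniform subsampling)."""
    centres = vertices[np.linspace(0, len(vertices) - 1, num_coarse, dtype=int)]
    d = np.linalg.norm(vertices[:, None, :] - centres[None, :, :], axis=-1)
    assign = d.argmin(axis=1)                                   # fine -> coarse correspondence
    P = np.zeros((num_coarse, len(vertices)))
    P[assign, np.arange(len(vertices))] = 1.0
    P /= P.sum(axis=1, keepdims=True) + 1e-12                   # row-normalised averaging
    return P

def pool(P, features):        # coarsen features: (num_coarse, F)
    return P @ features

def unpool(P, coarse_feats):  # reverse: copy each coarse feature back to its member vertices
    return (P > 0).astype(float).T @ coarse_feats

verts = np.random.default_rng(0).random((1000, 3))
feats = np.random.default_rng(1).random((1000, 16))
P = precompute_pooling(verts, num_coarse=100)
print(pool(P, feats).shape, unpool(P, pool(P, feats)).shape)    # (100, 16) (1000, 16)
```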
[346] Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers
Mohamed Eltahir, Ali Habibullah, Lama Ayash, Tanveer Hussain, Naeemullah Khan
Main category: cs.CV
TL;DR: Vote-in-Context (ViC) is a training-free framework that uses Vision-Language Models for zero-shot reranking and fusion in video retrieval, achieving state-of-the-art performance by serializing content evidence and retriever metadata in prompts.
Details
Motivation: Existing fusion techniques for heterogeneous retrievers rely only on rank or score signals, ignoring candidates' representations, which is particularly problematic for complex multi-modal data like videos.
Method: ViC serializes both content evidence and retriever metadata directly in VLM prompts, using S-Grid (compact image grid representation of videos) to enable list-wise reasoning over video candidates as a zero-shot task.
Result: ViC achieves state-of-the-art zero-shot performance: 87.1% Recall@1 (t2v) / 89.0% (v2t) on MSR-VTT and 99.6% (v2t) on VATEX, with gains up to +40 Recall@1 over previous baselines.
Conclusion: ViC provides a simple, reproducible, and highly effective recipe for turning modern VLMs into powerful zero-shot rerankers and fusers for complex multi-modal retrieval tasks.
Abstract: In the retrieval domain, candidates’ fusion from heterogeneous retrievers is a long-standing challenge, particularly for complex, multi-modal data such as videos. While typical fusion techniques are training-free, they rely solely on rank or score signals, disregarding candidates’ representations. This work introduces Vote-in-Context (ViC), a generalized, training-free framework that re-thinks list-wise reranking and fusion as a zero-shot reasoning task for a Vision-Language Model (VLM). The core insight is to serialize both content evidence and retriever metadata directly within the VLM’s prompt, allowing the model to adaptively weigh retriever consensus against visual-linguistic content. We demonstrate the generality of this framework by applying it to the challenging domain of cross-modal video retrieval. To this end, we introduce the S-Grid, a compact serialization map that represents each video as an image grid, optionally paired with subtitles to enable list-wise reasoning over video candidates. ViC is evaluated both as a single-list reranker, where it dramatically improves the precision of individual retrievers, and as an ensemble fuser, where it consistently outperforms strong baselines like CombSUM. Across video retrieval benchmarks including ActivityNet and VATEX, the framework establishes new state-of-the-art zero-shot retrieval performance, demonstrating its effectiveness in handling complex visual and temporal signals alongside text. In zero-shot settings, ViC achieves Recall@1 scores of 87.1% (t2v) / 89.0% (v2t) on MSR-VTT and 99.6% (v2t) on VATEX, representing massive gains of up to +40 Recall@1 over previous state-of-the-art baselines. We present ViC as a simple, reproducible, and highly effective recipe for turning modern VLMs into powerful zero-shot rerankers and fusers. Code and resources are publicly available at: https://github.com/mohammad2012191/ViC
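The core move is prompt construction: each candidate's retriever ranks/scores and a pointer to its visual evidence (the S-Grid image, optionally with subtitles) are serialised into one instruction the VLM reasons over list-wise. Below is a hedged sketch of such a serialisation step; the prompt wording and field names (e.g. `grid_image_path`) are assumptions, not the released ViC code.

```python
import json

def build_vic_prompt(query, candidates):
    """Serialise retriever metadata + content-evidence references into one reranking prompt.

    candidates: list of dicts like
      {"video_id": "v1", "grid_image_path": "v1_grid.jpg",
       "retriever_ranks": {"clip": 1, "blip": 4}, "subtitles": "..."}
    """
    lines = [f"Query: {query}", "Candidates (one JSON object per line):"]
    for c in candidates:
        lines.append(json.dumps(c, ensure_ascii=False))
    lines.append(
        "Weigh retriever consensus against the visual/subtitle evidence in each grid image, "
        "then output the candidate video_ids as a JSON list, best match first."
    )
    return "\n".join(lines)

prompt = build_vic_prompt(
    "a dog catching a frisbee on a beach",
    [{"video_id": "v7", "grid_image_path": "v7_grid.jpg",
      "retriever_ranks": {"clip": 2, "blip": 1}, "subtitles": ""},
     {"video_id": "v3", "grid_image_path": "v3_grid.jpg",
      "retriever_ranks": {"clip": 1, "blip": 5}, "subtitles": "waves crashing"}],
)
print(prompt)
```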
[347] Enhancing Diffusion-based Restoration Models via Difficulty-Adaptive Reinforcement Learning with IQA Reward
Xiaogang Xu, Ruihang Chu, Jian Wang, Kun Zhou, Wenjie Shu, Harry Yang, Ser-Nam Lim, Hao Chen, Liang Lin
Main category: cs.CV
TL;DR: This paper proposes an RL framework for diffusion-based image restoration models that uses MLLM-based IQA models for reward functions and adaptively combines RL with supervised fine-tuning based on sample difficulty.
Details
Motivation: Directly applying existing RL methods to diffusion-based image restoration is suboptimal because restoration emphasizes fidelity over pure generation, requiring different optimization approaches.
Method: Uses MLLM-based IQA models for RL rewards, focuses on challenging samples far from ground truth, and adaptively combines RL with SFT through automatic weighting based on sample difficulty.
Result: The proposed RL framework effectively boosts performance across various restoration tasks and can be seamlessly applied to diffusion-based restoration models.
Conclusion: The adaptive RL strategy with IQA-based rewards provides an effective plug-and-play solution for improving diffusion-based image restoration models.
Abstract: Reinforcement Learning (RL) has recently been incorporated into diffusion models, e.g., tasks such as text-to-image. However, directly applying existing RL methods to diffusion-based image restoration models is suboptimal, as the objective of restoration fundamentally differs from that of pure generation: it places greater emphasis on fidelity. In this paper, we investigate how to effectively integrate RL into diffusion-based restoration models. First, through extensive experiments with various reward functions, we find that an effective reward can be derived from an Image Quality Assessment (IQA) model, instead of intuitive ground-truth-based supervision, which has already been optimized during the Supervised Fine-Tuning (SFT) stage prior to RL. Moreover, our strategy focuses on using RL for challenging samples that are significantly distant from the ground truth, and our RL approach is innovatively implemented using MLLM-based IQA models to align distributions with high-quality images initially. As the samples approach the ground truth’s distribution, RL is adaptively combined with SFT for more fine-grained alignment. This dynamic process is facilitated through an automatic weighting strategy that adjusts based on the relative difficulty of the training samples. Our strategy is plug-and-play that can be seamlessly applied to diffusion-based restoration models, boosting its performance across various restoration tasks. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our proposed RL framework.
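The adaptive combination can be sketched as a per-sample weight that grows with distance from the ground truth: hard samples lean on the IQA-based RL reward, easy samples lean on supervised alignment. The sigmoid weighting rule and loss names below are illustrative assumptions, not the paper's exact schedule.

```python
import torch

def combined_loss(pred, target, iqa_reward, tau=0.1):
    """Blend SFT (fidelity) and RL (IQA-reward) objectives by per-sample difficulty."""
    sft = ((pred - target) ** 2).flatten(1).mean(dim=1)       # per-sample fidelity loss
    difficulty = torch.sigmoid((sft.detach() - tau) / tau)    # far from GT -> weight near 1
    rl = -iqa_reward                                          # maximise the IQA score
    # In practice the reward term would be differentiable or optimised with a
    # policy-gradient estimator; here it is a stand-in to show the weighting.
    return (difficulty * rl + (1.0 - difficulty) * sft).mean()

pred = torch.rand(4, 3, 8, 8, requires_grad=True)
target = torch.rand(4, 3, 8, 8)
iqa_reward = torch.rand(4)                 # stand-in for an MLLM-based IQA score
loss = combined_loss(pred, target, iqa_reward)
loss.backward()
print(float(loss))
```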
[348] UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback
Ropeway Liu, Hangjie Yuan, Bo Dong, Jiazheng Xing, Jinwang Wang, Rui Zhao, Yan Xing, Weihua Chen, Fan Wang
Main category: cs.CV
TL;DR: UniLumos is a unified relighting framework that uses RGB-space geometry feedback and path consistency learning to achieve physically plausible relighting for images and videos with 20x speedup.
Details
Motivation: Existing diffusion-based relighting methods produce unrealistic results like overexposed highlights and misaligned shadows due to optimization in semantic latent space without physical correctness guarantees.
Method: Uses flow matching backbone with RGB-space geometry feedback from depth/normal maps, path consistency learning for few-step training, and structured 6D lighting annotation protocol.
Result: Achieves state-of-the-art relighting quality with improved physical consistency and 20x speedup for both image and video relighting.
Conclusion: UniLumos effectively bridges the gap between semantic control and physical plausibility in relighting, while introducing LumosBench for automatic evaluation of lighting controllability.
Abstract: Relighting is a crucial task with both practical demand and artistic value, and recent diffusion models have shown strong potential by enabling rich and controllable lighting effects. However, as they are typically optimized in semantic latent space, where proximity does not guarantee physical correctness in visual space, they often produce unrealistic results, such as overexposed highlights, misaligned shadows, and incorrect occlusions. We address this with UniLumos, a unified relighting framework for both images and videos that brings RGB-space geometry feedback into a flow matching backbone. By supervising the model with depth and normal maps extracted from its outputs, we explicitly align lighting effects with the scene structure, enhancing physical plausibility. Nevertheless, this feedback requires high-quality outputs for supervision in visual space, making standard multi-step denoising computationally expensive. To mitigate this, we employ path consistency learning, allowing supervision to remain effective even under few-step training regimes. To enable fine-grained relighting control and supervision, we design a structured six-dimensional annotation protocol capturing core illumination attributes. Building upon this, we propose LumosBench, a disentangled attribute-level benchmark that evaluates lighting controllability via large vision-language models, enabling automatic and interpretable assessment of relighting precision across individual dimensions. Extensive experiments demonstrate that UniLumos achieves state-of-the-art relighting quality with significantly improved physical consistency, while delivering a 20x speedup for both image and video relighting. Code is available at https://github.com/alibaba-damo-academy/Lumos-Custom.
[349] Progressive Translation of H&E to IHC with Enhanced Structural Fidelity
Yuhang Kang, Ziyu Su, Tianyang Wang, Zaibo Li, Wei Chen, Muhammad Khalid Khan Niazi
Main category: cs.CV
TL;DR: A progressive network architecture for synthesizing IHC images from H&E slides that decouples structure, color, and cell boundary generation to overcome limitations of traditional linear loss weighting approaches.
Details
Motivation: IHC staining is costly and labor-intensive with limited scalability, while existing computational stain translation methods using linear loss functions fail to preserve both structural authenticity and color fidelity simultaneously.
Method: Progressive network architecture with stage-wise optimization of visual aspects, building on Adaptive Supervised PatchNCE framework with additional DAB chromogen concentration and image gradient loss functions.
Result: Experiments on HER2 and ER datasets showed significant improvement in visual quality and finer structural details compared to baseline methods.
Conclusion: The proposed progressive mechanism effectively addresses the interdependence problem in stain translation, enabling better preservation of both structural features and color fidelity in synthesized IHC images.
Abstract: Compared to hematoxylin-eosin (H&E) staining, immunohistochemistry (IHC) not only maintains the structural features of tissue samples, but also provides high-resolution protein localization, which is essential for aiding in pathology diagnosis. Despite its diagnostic value, IHC remains a costly and labor-intensive technique. Its limited scalability and constraints in multiplexing further hinder widespread adoption, especially in resource-limited settings. Consequently, researchers are increasingly exploring computational stain translation techniques to synthesize IHC-equivalent images from H&E-stained slides, aiming to extract protein-level information more efficiently and cost-effectively. However, most existing stain translation techniques rely on a linearly weighted summation of multiple loss terms within a single objective function, a strategy that often overlooks the interdependence among these components, resulting in suboptimal image quality and an inability to simultaneously preserve structural authenticity and color fidelity. To address this limitation, we propose a novel network architecture that follows a progressive structure, incorporating color and cell border generation logic, which enables each visual aspect to be optimized in a stage-wise and decoupled manner. To validate the effectiveness of our proposed network architecture, we build upon the Adaptive Supervised PatchNCE (ASP) framework as our baseline. We introduce additional loss functions based on 3,3’-diaminobenzidine (DAB) chromogen concentration and image gradient, enhancing color fidelity and cell boundary clarity in the generated IHC images. By reconstructing the generation pipeline using our structure-color-cell boundary progressive mechanism, experiments on HER2 and ER datasets demonstrated that the model significantly improved visual quality and achieved finer structural details.
[350] Learnable Fractional Reaction-Diffusion Dynamics for Under-Display ToF Imaging and Beyond
Xin Qiao, Matteo Poggi, Xing Wei, Pengchao Deng, Yanhui Zhou, Stefano Mattoccia
Main category: cs.CV
TL;DR: LFRD2 is a hybrid framework combining neural networks with physical modeling to address severe degradations in under-display ToF imaging caused by TOLED layers, using learnable fractional reaction-diffusion dynamics for iterative depth refinement.
Details
Motivation: Under-display ToF imaging suffers from severe degradations including signal attenuation, multi-path interference, and temporal noise when cameras are placed beneath transparent OLED screens, which significantly compromise depth quality.
Method: Proposes Learnable Fractional Reaction-Diffusion Dynamics (LFRD2) with time-fractional reaction-diffusion module for iterative depth refinement with dynamic differential orders, and efficient continuous convolution via coefficient prediction and repeated differentiation.
Result: Experiments on four benchmark datasets demonstrate the effectiveness of the approach in improving depth quality for under-display ToF imaging.
Conclusion: LFRD2 successfully combines neural networks with physical modeling to address the challenges of under-display ToF imaging, providing an effective solution for depth restoration through interpretable physical dynamics.
Abstract: Under-display ToF imaging aims to achieve accurate depth sensing through a ToF camera placed beneath a screen panel. However, transparent OLED (TOLED) layers introduce severe degradations-such as signal attenuation, multi-path interference (MPI), and temporal noise-that significantly compromise depth quality. To alleviate this drawback, we propose Learnable Fractional Reaction-Diffusion Dynamics (LFRD2), a hybrid framework that combines the expressive power of neural networks with the interpretability of physical modeling. Specifically, we implement a time-fractional reaction-diffusion module that enables iterative depth refinement with dynamically generated differential orders, capturing long-term dependencies. In addition, we introduce an efficient continuous convolution operator via coefficient prediction and repeated differentiation to further improve restoration quality. Experiments on four benchmark datasets demonstrate the effectiveness of our approach. The code is publicly available at https://github.com/wudiqx106/LFRD2.
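The module's name points to a concrete PDE family: a reaction-diffusion update whose time derivative has fractional, dynamically generated order, which introduces memory over earlier refinement iterations. A schematic form under that reading, with symbols chosen for illustration rather than taken from the paper:

$$
{}^{C}\!D_t^{\alpha}\, u(x, t) = \nabla \cdot \bigl(\kappa(x, t)\, \nabla u(x, t)\bigr) + R_\theta\bigl(u(x, t)\bigr), \qquad 0 < \alpha \le 1,
$$

where $u$ is the depth map being refined, ${}^{C}\!D_t^{\alpha}$ denotes a Caputo-type fractional time derivative whose order $\alpha$ is predicted dynamically, $\kappa$ is a diffusion coefficient, and $R_\theta$ is a learned reaction term.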
[351] Wonder3D++: Cross-domain Diffusion for High-fidelity 3D Generation from a Single Image
Yuxiao Yang, Xiao-Xiao Long, Zhiyang Dou, Cheng Lin, Yuan Liu, Qingsong Yan, Yuexin Ma, Haoqian Wang, Zhiqiang Wu, Wei Yin
Main category: cs.CV
TL;DR: Wonder3D++ is a novel method for efficient high-fidelity textured mesh generation from single-view images using cross-domain diffusion and multi-view attention.
Details
Motivation: Existing methods either suffer from slow per-shape optimization with inconsistent geometry (SDS-based) or produce low-quality results lacking geometric details (fast inference methods).
Method: Proposes cross-domain diffusion model generating multi-view normal maps and color images, multi-view cross-domain attention for consistency, and cascaded 3D mesh extraction in coarse-to-fine manner.
Result: Achieves high-quality reconstruction results with robust generalization and good efficiency (~3 minutes per mesh), outperforming prior works.
Conclusion: Wonder3D++ holistically improves quality, consistency, and efficiency of single-view 3D reconstruction through its novel cross-domain approach.
Abstract: In this work, we introduce \textbf{Wonder3D++}, a novel method for efficiently generating high-fidelity textured meshes from single-view images. Recent methods based on Score Distillation Sampling (SDS) have shown the potential to recover 3D geometry from 2D diffusion priors, but they typically suffer from time-consuming per-shape optimization and inconsistent geometry. In contrast, certain works directly produce 3D information via fast network inferences, but their results are often of low quality and lack geometric details. To holistically improve the quality, consistency, and efficiency of single-view reconstruction tasks, we propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. To ensure the consistency of generation, we employ a multi-view cross-domain attention mechanism that facilitates information exchange across views and modalities. Lastly, we introduce a cascaded 3D mesh extraction algorithm that derives high-quality surfaces from the multi-view 2D representations in only about $3$ minutes in a coarse-to-fine manner. Our extensive evaluations demonstrate that our method achieves high-quality reconstruction results, robust generalization, and good efficiency compared to prior works. Code available at https://github.com/xxlong0/Wonder3D/tree/Wonder3D_Plus.
[352] Probabilistic Robustness for Free? Revisiting Training via a Benchmark
Yi Zhang, Zheng Wang, Chen Zhen, Wenjie Ruan, Qing Guo, Siddartha Khastgir, Carsten Maple, Xingyu Zhao
Main category: cs.CV
TL;DR: PRBench is the first benchmark for evaluating probabilistic robustness (PR) training methods, comparing adversarial training (AT) and PR-targeted methods across multiple metrics including clean accuracy, PR/AR performance, training efficiency, and generalization error.
Details
Motivation: Current research focuses on adversarial robustness (AR) while probabilistic robustness (PR) remains underexplored. Existing PR training methods have limitations including non-comparable evaluations, limited comparisons to strong AT baselines, and no unified framework for generalization analysis.
Method: Introduces PRBench benchmark that empirically compares common AT and PR-targeted training methods using comprehensive metrics. Also provides theoretical analysis on generalization error of PR performance across different training methods.
Result: AT methods are more versatile for improving both AR and PR performance across diverse hyperparameter settings, while PR-targeted methods consistently yield lower generalization error and higher clean accuracy. A leaderboard with 222 trained models across 7 datasets and 10 architectures is provided.
Conclusion: PRBench fills the gap in evaluating PR training methods, revealing trade-offs between AT and PR-targeted approaches. AT offers broader robustness improvements while PR methods excel in clean accuracy and generalization.
Abstract: Deep learning models are notoriously vulnerable to imperceptible perturbations. Most existing research centers on adversarial robustness (AR), which evaluates models under worst-case scenarios by examining the existence of deterministic adversarial examples (AEs). In contrast, probabilistic robustness (PR) adopts a statistical perspective, measuring the probability that predictions remain correct under stochastic perturbations. While PR is widely regarded as a practical complement to AR, dedicated training methods for improving PR are still relatively underexplored, albeit with emerging progress. Among the few PR-targeted training methods, we identify three limitations: (i) non-comparable evaluation protocols; (ii) limited comparisons to strong AT baselines despite anecdotal PR gains from AT; and (iii) no unified framework to compare the generalization of these methods. Thus, we introduce PRBench, the first benchmark dedicated to evaluating improvements in PR achieved by different robustness training methods. PRBench empirically compares most common AT and PR-targeted training methods using a comprehensive set of metrics, including clean accuracy, PR and AR performance, training efficiency, and generalization error (GE). We also provide theoretical analysis on the GE of PR performance across different training methods. Main findings revealed by PRBench include: AT methods are more versatile than PR-targeted training methods in terms of improving both AR and PR performance across diverse hyperparameter settings, while PR-targeted training methods consistently yield lower GE and higher clean accuracy. A leaderboard comprising 222 trained models across 7 datasets and 10 model architectures is publicly available at https://tmpspace.github.io/PRBenchLeaderboard/.
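The PR definition above, the probability that a prediction survives stochastic perturbations, admits a simple Monte Carlo estimate. A minimal sketch follows, assuming an L-infinity perturbation model and a classifier returning logits; none of this comes from the PRBench code:

```python
import torch

def probabilistic_robustness(model, x, y, eps=8 / 255, n_samples=100):
    """Monte Carlo estimate of probabilistic robustness: the fraction of
    perturbations drawn uniformly from an L-inf ball of radius eps under
    which the prediction on a single image x (shape (1, C, H, W)) still
    equals the label y."""
    model.eval()
    hits = 0
    with torch.no_grad():
        for _ in range(n_samples):
            delta = (torch.rand_like(x) * 2 - 1) * eps   # uniform in [-eps, eps]
            x_pert = (x + delta).clamp(0.0, 1.0)         # stay in a valid image range
            pred = model(x_pert).argmax(dim=-1)
            hits += int((pred == y).all())
    return hits / n_samples

# Usage: pr = probabilistic_robustness(classifier, image, label)  # value in [0, 1]
```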
[353] Toward Strategy Identification and Subtask Decomposition In Task Exploration
Tom Odem
Main category: cs.CV
TL;DR: Developed a task explorer pipeline using clustering, factor analysis, and string edit distance to automatically identify global/local strategies and hierarchical subtasks in human-machine interaction data.
Details
Motivation: To advance machines' understanding of user knowledge, skill, and behavior for implicit coordination in anticipatory human-machine interaction.
Method: Created a task explorer pipeline with clustering techniques, factor analysis, and string edit distance to identify global strategies (generalized action sets) and local strategies (similar action compositions), plus hierarchical subtask identification.
Result: The pipeline successfully identified key strategies for task completion and encoded user runs with hierarchical subtask structures. A Task Explorer application was also developed for reviewing results.
Conclusion: The pipeline is adaptable to any action-based time-series data and helps inform both humans and machines about user knowledge, skill, and behavior patterns.
Abstract: This research builds on work in anticipatory human-machine interaction, a subfield of human-machine interaction where machines can facilitate advantageous interactions by anticipating a user's future state. The aim of this research is to further a machine's understanding of user knowledge, skill, and behavior in pursuit of implicit coordination. A task explorer pipeline was developed that uses clustering techniques, paired with factor analysis and string edit distance, to automatically identify key global and local strategies that are used to complete tasks. Global strategies identify generalized sets of actions used to complete tasks, while local strategies identify sequences that used those sets of actions in a similar composition. Additionally, meaningful subtasks of various lengths are identified within the tasks. The task explorer pipeline was able to automatically identify key strategies used to complete tasks and encode user runs with hierarchical subtask structures. In addition, a Task Explorer application was developed to easily review pipeline results. The task explorer pipeline can be easily adapted to any action-based time-series data, and the identified strategies and subtasks help to inform humans and machines on user knowledge, skill, and behavior.
[354] CGF-DETR: Cross-Gated Fusion DETR for Enhanced Pneumonia Detection in Chest X-rays
Yefeng Wu, Yucheng Song, Ling Wu, Shan Wan, Yecheng Zhao
Main category: cs.CV
TL;DR: CGF-DETR is an enhanced real-time detection transformer for pneumonia detection in chest X-rays, achieving 82.2% mAP@0.5 and outperforming baseline RT-DETR-l by 3.7% while maintaining 48.1 FPS inference speed.
Details
Motivation: Pneumonia remains a leading cause of morbidity and mortality worldwide, necessitating accurate automated detection systems. While transformer-based detectors like RT-DETR show promise, their application to medical imaging for pneumonia detection remains underexplored.
Method: Proposed CGF-DETR with three key modules: XFABlock in backbone for multi-scale feature extraction using convolutional attention with CSP architecture; SPGA module replacing standard multi-head attention with dynamic gating and single-head self-attention; GCFC3 in neck for feature representation via multi-path convolution fusion with structural re-parameterization.
Result: Achieved 82.2% mAP@0.5 on RSNA Pneumonia Detection dataset, outperforming baseline RT-DETR-l by 3.7%. Maintained real-time performance at 48.1 FPS. Complete model achieved 50.4% mAP@[0.5:0.95]. Ablation studies confirmed each module contributes meaningfully to performance improvement.
Conclusion: CGF-DETR demonstrates superior performance for pneumonia detection in chest X-rays while maintaining real-time inference capabilities, making it suitable for clinical applications requiring both accuracy and efficiency.
Abstract: Pneumonia remains a leading cause of morbidity and mortality worldwide, necessitating accurate and efficient automated detection systems. While recent transformer-based detectors like RT-DETR have shown promise in object detection tasks, their application to medical imaging, particularly pneumonia detection in chest X-rays, remains underexplored. This paper presents CGF-DETR, an enhanced real-time detection transformer specifically designed for pneumonia detection. We introduce XFABlock in the backbone to improve multi-scale feature extraction through convolutional attention mechanisms integrated with CSP architecture. To achieve efficient feature aggregation, we propose an SPGA module that replaces standard multi-head attention with dynamic gating mechanisms and single-head self-attention. Additionally, GCFC3 is designed for the neck to enhance feature representation through multi-path convolution fusion while maintaining real-time performance via structural re-parameterization. Extensive experiments on the RSNA Pneumonia Detection dataset demonstrate that CGF-DETR achieves 82.2% mAP@0.5, outperforming the baseline RT-DETR-l by 3.7% while maintaining comparable inference speed at 48.1 FPS. Our ablation studies confirm that each proposed module contributes meaningfully to the overall performance improvement, with the complete model achieving 50.4% mAP@[0.5:0.95].
[355] 3EED: Ground Everything Everywhere in 3D
Rong Li, Yuhao Dong, Tianshuai Hu, Ao Liang, Youquan Liu, Dongyue Lu, Liang Pan, Lingdong Kong, Junwei Liang, Ziwei Liu
Main category: cs.CV
TL;DR: 3EED is a large-scale multi-platform 3D visual grounding benchmark with RGB and LiDAR data from vehicles, drones, and quadrupeds, featuring 128K+ objects and 22K+ referring expressions across diverse outdoor scenes.
Details
Motivation: Existing 3D grounding benchmarks are limited to indoor settings, single platforms, and small scale, creating a need for more comprehensive outdoor multi-platform datasets.
Method: Developed scalable annotation pipeline using vision-language model prompting with human verification, plus platform-aware normalization and cross-modal alignment techniques for cross-platform learning.
Result: Created dataset 10x larger than existing ones with high-quality spatial grounding annotations, established benchmark protocols for in-domain and cross-platform evaluations.
Conclusion: 3EED reveals significant performance gaps in generalizable 3D grounding and provides resources to advance language-driven 3D embodied perception research.
Abstract: Visual grounding in 3D is the key for embodied agents to localize language-referred objects in open-world environments. However, existing benchmarks are limited to indoor focus, single-platform constraints, and small scale. We introduce 3EED, a multi-platform, multi-modal 3D grounding benchmark featuring RGB and LiDAR data from vehicle, drone, and quadruped platforms. We provide over 128,000 objects and 22,000 validated referring expressions across diverse outdoor scenes – 10x larger than existing datasets. We develop a scalable annotation pipeline combining vision-language model prompting with human verification to ensure high-quality spatial grounding. To support cross-platform learning, we propose platform-aware normalization and cross-modal alignment techniques, and establish benchmark protocols for in-domain and cross-platform evaluations. Our findings reveal significant performance gaps, highlighting the challenges and opportunities of generalizable 3D grounding. The 3EED dataset and benchmark toolkit are released to advance future research in language-driven 3D embodied perception.
[356] HGFreNet: Hop-hybrid GraphFomer for 3D Human Pose Estimation with Trajectory Consistency in Frequency Domain
Kai Zhai, Ziyan Huang, Qiang Nie, Xiang Li, Bo Ouyang
Main category: cs.CV
TL;DR: HGFreNet is a novel GraphFormer architecture for 2D-to-3D human pose lifting that addresses depth ambiguity and temporal incoherence through hop-hybrid graph attention and frequency domain trajectory consistency.
Details
Motivation: Previous methods for 2D-to-3D human pose lifting suffer from depth ambiguity and temporal jitters, mainly focusing on local temporal constraints while neglecting global spatial-temporal correlations of skeletal joint motion.
Method: Proposes HGFreNet with hop-hybrid graph attention (HGA) module to group k-hop neighbors for larger receptive field, Transformer encoder for global correlations, and frequency domain constraints for temporal consistency. Uses preliminary network for 3D pose estimation.
Result: Extensive experiments on Human3.6M and MPI-INF-3DHP benchmarks show HGFreNet outperforms state-of-the-art methods in both positional accuracy and temporal consistency.
Conclusion: The proposed approach effectively models global spatial-temporal correlations and maintains temporal coherence, achieving superior performance in 3D human pose estimation from monocular video.
Abstract: 2D-to-3D human pose lifting is a fundamental challenge for 3D human pose estimation in monocular video, where graph convolutional networks (GCNs) and attention mechanisms have proven to be inherently suitable for encoding the spatial-temporal correlations of skeletal joints. However, depth ambiguity and errors in 2D pose estimation lead to incoherence in the 3D trajectory. Previous studies have attempted to restrict jitters in the time domain, for instance, by constraining the differences between adjacent frames while neglecting the global spatial-temporal correlations of skeletal joint motion. To tackle this problem, we design HGFreNet, a novel GraphFormer architecture with hop-hybrid feature aggregation and 3D trajectory consistency in the frequency domain. Specifically, we propose a hop-hybrid graph attention (HGA) module and a Transformer encoder to model global joint spatial-temporal correlations. The HGA module groups all $k$-hop neighbors of a skeletal joint into a hybrid group to enlarge the receptive field and applies the attention mechanism to discover the latent correlations of these groups globally. We then exploit global temporal correlations by constraining trajectory consistency in the frequency domain. To provide 3D information for depth inference across frames and maintain coherence over time, a preliminary network is applied to estimate the 3D pose. Extensive experiments were conducted on two standard benchmark datasets: Human3.6M and MPI-INF-3DHP. The results demonstrate that the proposed HGFreNet outperforms state-of-the-art (SOTA) methods in terms of positional accuracy and temporal consistency.
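A frequency-domain trajectory-consistency term of the kind described above can be sketched as follows; this is a rough illustration, and the tensor layout, spectrum comparison, and weighting are assumptions rather than HGFreNet's exact loss:

```python
import torch

def frequency_consistency_loss(pred, ref):
    """Toy frequency-domain trajectory consistency: compare the temporal spectra of
    predicted and reference 3D joint trajectories.
    pred, ref: tensors of shape (T, J, 3) - frames x joints x coordinates."""
    pred_spec = torch.fft.rfft(pred, dim=0)   # one complex spectrum per joint coordinate
    ref_spec = torch.fft.rfft(ref, dim=0)
    # Penalize spectral-magnitude differences; high-frequency mismatch shows up as jitter.
    return (pred_spec.abs() - ref_spec.abs()).abs().mean()

# Usage: total_loss = pose_loss + lambda_freq * frequency_consistency_loss(pred_3d, ref_3d)
```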
[357] UniLION: Towards Unified Autonomous Driving Model with Linear Group RNNs
Zhe Liu, Jinghua Hou, Xiaoqing Ye, Jingdong Wang, Hengshuang Zhao, Xiang Bai
Main category: cs.CV
TL;DR: UniLION is a unified autonomous driving model that efficiently processes LiDAR point clouds, multi-view images, and temporal sequences using linear group RNN operators, eliminating quadratic attention overhead while supporting multiple configurations without explicit fusion modules.
Details
Motivation: Transformers face computational challenges with long-sequence data due to quadratic attention mechanisms, which is problematic for autonomous driving applications that require processing large-scale sensor data and temporal sequences.
Method: The model uses linear group RNN operators to perform linear RNN for grouped features, enabling efficient handling of LiDAR point clouds, high-resolution multi-view images, and temporal sequences in a single versatile architecture.
Result: UniLION achieves competitive and state-of-the-art performance across various core autonomous driving tasks including 3D perception, prediction, and planning, while supporting multiple specialized variants without requiring explicit temporal or multi-modal fusion modules.
Conclusion: The unified paradigm simplifies multi-modal and multi-task autonomous driving system design while maintaining superior performance, offering a fresh perspective on developing 3D foundation models for autonomous driving.
Abstract: Although transformers have demonstrated remarkable capabilities across various domains, their quadratic attention mechanisms introduce significant computational overhead when processing long-sequence data. In this paper, we present a unified autonomous driving model, UniLION, which efficiently handles large-scale LiDAR point clouds, high-resolution multi-view images, and even temporal sequences based on the linear group RNN operator (i.e., performs linear RNN for grouped features). Remarkably, UniLION serves as a single versatile architecture that can seamlessly support multiple specialized variants (i.e., LiDAR-only, temporal LiDAR, multi-modal, and multi-modal temporal fusion configurations) without requiring explicit temporal or multi-modal fusion modules. Moreover, UniLION consistently delivers competitive and even state-of-the-art performance across a wide range of core tasks, including 3D perception (e.g., 3D object detection, 3D object tracking, 3D occupancy prediction, BEV map segmentation), prediction (e.g., motion prediction), and planning (e.g., end-to-end planning). This unified paradigm naturally simplifies the design of multi-modal and multi-task autonomous driving systems while maintaining superior performance. Ultimately, we hope UniLION offers a fresh perspective on the development of 3D foundation models in autonomous driving. Code is available at https://github.com/happinesslz/UniLION
[358] PROPEX-RAG: Enhanced GraphRAG using Prompt-Driven Prompt Execution
Tejas Sarnaik, Manan Shah, Ravi Hegde
Main category: cs.CV
TL;DR: A prompt-driven GraphRAG framework that improves multi-hop question answering through optimized prompt design for entity extraction, fact selection, and passage reranking using knowledge graphs.
Details
Motivation: Current graph-based RAG approaches under-examine the influence of prompt design on enhancing retrieval and reasoning processes, despite its potential importance.
Method: Creates symbolic knowledge graphs from text data using entity and fact triples, uses LLMs for semantic filtering and answer generation, and employs entity-guided graph traversal via Personalized PageRank for efficient retrieval.
Result: Achieves state-of-the-art performance on HotpotQA (80.7% F1, 97.1% Recall@5) and 2WikiMultiHopQA (78.9% F1, 98.1% Recall@5).
Conclusion: Prompt design is crucial for improving retrieval accuracy and response quality in multi-hop question answering, establishing foundations for more efficient and comprehensible systems.
Abstract: Retrieval-Augmented Generation (RAG) has become a robust framework for enhancing Large Language Models (LLMs) with external knowledge. Recent advances in RAG have investigated graph-based retrieval for intricate reasoning; however, the influence of prompt design on enhancing the retrieval and reasoning process is still considerably under-examined. In this paper, we present a prompt-driven GraphRAG framework that underscores the significance of prompt formulation in facilitating entity extraction, fact selection, and passage reranking for multi-hop question answering. Our approach creates a symbolic knowledge graph from text data by encoding entities and factual relationships as structured fact triples. We use LLMs selectively during online retrieval to perform semantic filtering and answer generation. We also use entity-guided graph traversal through Personalized PageRank (PPR) to support efficient, scalable retrieval based on the knowledge graph we built. Our system achieves state-of-the-art performance on HotpotQA and 2WikiMultiHopQA, with F1 scores of 80.7% and 78.9%, and Recall@5 scores of 97.1% and 98.1%, respectively. These results show that prompt design is an important part of improving retrieval accuracy and response quality. This research lays the groundwork for more efficient and comprehensible multi-hop question-answering systems, highlighting the importance of prompt-aware graph reasoning.
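Entity-guided graph traversal via Personalized PageRank can be illustrated on a toy graph; the graph contents, seed weights, and damping factor below are assumptions for illustration, not the PROPEX-RAG configuration:

```python
import networkx as nx

# Toy knowledge graph: nodes are entities/passages, edges come from fact triples.
G = nx.Graph()
G.add_edges_from([
    ("Marie Curie", "Nobel Prize in Physics"),
    ("Marie Curie", "University of Paris"),
    ("Pierre Curie", "Nobel Prize in Physics"),
    ("University of Paris", "France"),
])

# Seed entities extracted from the question (here hand-picked; normally via an LLM prompt).
seeds = {"Marie Curie": 1.0}

# Personalized PageRank biases the random walk toward the seeds;
# high-scoring nodes become candidates for passage reranking.
scores = nx.pagerank(G, alpha=0.85, personalization=seeds)
print(sorted(scores, key=scores.get, reverse=True)[:3])
```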
[359] SciTextures: Collecting and Connecting Visual Patterns, Models, and Code Across Science and Art
Sagi Eppel, Alona Strugatski
Main category: cs.CV
TL;DR: The paper introduces Scitextures, a large-scale dataset of 100,000 images from 1,200+ models across science, tech, and art, created to study how AI can connect visual patterns with their underlying generative mechanisms.
Details
Motivation: To enable deeper visual understanding by connecting visual patterns (like clouds, waves, city growth) with the processes that form them, bridging the gap between appearance and underlying mechanisms.
Method: Created an agentic AI pipeline that autonomously collects and implements models in standardized form, then used this dataset to evaluate AI models' ability to link visual patterns to their generative code and mechanisms.
Result: Vision-language models demonstrated capability to understand and simulate physical systems beyond visual patterns, identifying different patterns from the same process and inferring mechanisms from real-world images.
Conclusion: The Scitextures dataset provides a foundation for exploring connections between visual patterns and their generative mechanisms, showing AI’s potential to understand the processes that shape our visual world.
Abstract: The ability to connect visual patterns with the processes that form them represents one of the deepest forms of visual understanding. Textures of clouds and waves, the growth of cities and forests, or the formation of materials and landscapes are all examples of patterns emerging from underlying mechanisms. We present the Scitextures dataset, a large-scale collection of textures and visual patterns from all domains of science, tech, and art, along with the models and code that generate these images. Covering over 1,200 different models and 100,000 images of patterns and textures from physics, chemistry, biology, sociology, technology, mathematics, and art, this dataset offers a way to explore the connection between the visual patterns that shape our world and the mechanisms that produce them. Created by an agentic AI pipeline that autonomously collects and implements models in standardized form, we use SciTextures to evaluate the ability of leading AI models to link visual patterns to the models and code that generate them, and to identify different patterns that emerged from the same process. We also test the AI's ability to infer and recreate the mechanisms behind visual patterns by providing a natural image of a real-world pattern and asking the AI to identify, model, and code the mechanism that formed the pattern, then run this code to generate a simulated image that is compared to the real image. These benchmarks show that vision-language models (VLMs) can understand and simulate the physical system beyond a visual pattern. The dataset and code are available at: https://zenodo.org/records/17485502
[360] TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning
Ming Li, Jike Zhong, Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Yuxiang Lai, Wei Chen, Konstantinos Psounis, Kaipeng Zhang
Main category: cs.CV
TL;DR: TIR-Bench is a new benchmark for evaluating agentic thinking-with-images capabilities in multimodal models, testing complex tool use and image manipulation across 13 diverse tasks.
Details
Motivation: Existing benchmarks like Visual Search only test basic operations and fail to capture advanced thinking-with-images capabilities needed for complex, dynamic, and tool-dependent reasoning.
Method: Created TIR-Bench with 13 diverse tasks requiring novel tool use for image processing and manipulation in chain-of-thought. Evaluated 22 MLLMs including open-source, proprietary, and tool-augmented models.
Result: TIR-Bench is universally challenging for all tested models, showing that strong performance requires genuine thinking-with-images capabilities. Also conducted pilot study comparing direct vs agentic fine-tuning.
Conclusion: The benchmark successfully identifies gaps in current models’ thinking-with-images abilities and provides a comprehensive evaluation framework for advanced visual reasoning capabilities.
Abstract: The frontier of visual reasoning is shifting toward models like OpenAI o3, which can intelligently create and operate tools to transform images for problem-solving, also known as thinking-\textit{with}-images in chain-of-thought. Yet existing benchmarks fail to fully capture this advanced capability. Even Visual Search, the most common benchmark for current thinking-\textit{with}-images methods, tests only basic operations such as localization and cropping, offering little insight into more complex, dynamic, and tool-dependent reasoning. We introduce \textbf{TIR-Bench}, a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks, each requiring novel tool use for image processing and manipulation in chain-of-thought. We evaluate 22 multimodal large language models (MLLMs), from leading open-sourced and proprietary models to those with explicit tool-use augmentation. Results show that TIR-Bench is universally challenging, and strong performance requires genuine thinking-with-images capabilities. Finally, we present a pilot study comparing direct versus agentic fine-tuning.
[361] Coupled quasi-harmonic bases
A. Kovnatsky, M. M. Bronstein, A. M. Bronstein, K. Glashoff, R. Kimmel
Main category: cs.CV
TL;DR: The paper proposes constructing common approximate eigenbases for multiple shapes using joint diagonalization algorithms to overcome incompatibility issues between independently computed Laplacian eigenbases.
Details
Motivation: Many multi-shape applications are hindered by the incompatibility of Laplacian eigenbases computed independently on different shapes, which prevents effective use of harmonic analysis tools across multiple shapes.
Method: Use approximate joint diagonalization algorithms to construct common approximate eigenbases for multiple shapes, enabling compatibility across different shapes.
Result: The approach demonstrates benefits in various applications including shape editing, pose transfer, correspondence, and similarity tasks.
Conclusion: Common approximate eigenbases provide a solution to the incompatibility problem of Laplacian eigenbases across multiple shapes, enabling more effective multi-shape analysis and processing.
Abstract: The use of Laplacian eigenbases has been shown to be fruitful in many computer graphics applications. Today, state-of-the-art approaches to shape analysis, synthesis, and correspondence rely on these natural harmonic bases that allow using classical tools from harmonic analysis on manifolds. However, many applications involving multiple shapes are hindered by the fact that Laplacian eigenbases computed independently on different shapes are often incompatible with each other. In this paper, we propose the construction of common approximate eigenbases for multiple shapes using approximate joint diagonalization algorithms. We illustrate the benefits of the proposed approach on tasks from shape editing, pose transfer, correspondence, and similarity.
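For two shapes, a coupled approximate joint diagonalization problem of this kind can be written roughly as follows; the notation here is an assumption for illustration rather than taken from the paper ($W_i$ the shape Laplacians, $A_i$ mass matrices, $F$ and $G$ values of corresponding functions used for coupling, and $\operatorname{off}(M)$ the sum of squared off-diagonal entries):

\[
\min_{\hat{\Phi}_1,\hat{\Phi}_2}\; \operatorname{off}\big(\hat{\Phi}_1^{\top} W_1 \hat{\Phi}_1\big) + \operatorname{off}\big(\hat{\Phi}_2^{\top} W_2 \hat{\Phi}_2\big) + \mu \big\| F^{\top} \hat{\Phi}_1 - G^{\top} \hat{\Phi}_2 \big\|_F^2 \quad \text{s.t.} \quad \hat{\Phi}_i^{\top} A_i \hat{\Phi}_i = I .
\]

The first two terms keep each new basis close to an eigenbasis of its own Laplacian, while the coupling term encourages the two bases to behave consistently on corresponding functions, which is what makes them usable as a common basis.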
[362] Exploring Effective Factors for Improving Visual In-Context Learning
Yanpeng Sun, Qiang Chen, Xiaofan Li, Jian Wang, Jingdong Wang, Zechao Li
Main category: cs.CV
TL;DR: This paper identifies prompt selection and prompt fusion as key factors in visual in-context learning performance, and proposes prompt-SelF framework that outperforms meta-learning approaches in 1-shot segmentation.
Details
Motivation: To understand the factors influencing visual in-context learning performance, as it's a relatively new research area in computer vision compared to NLP.
Method: Proposed prompt-SelF framework: pixel-level retrieval for prompt selection, different prompt fusion methods to activate model knowledge, and ensemble predictions from different fusion methods.
Result: Outperformed OSLSM-based meta-learning in 1-shot segmentation for the first time, demonstrating the potential of visual in-context learning.
Conclusion: Visual in-context learning shows great potential, with prompt selection and fusion being critical factors for performance.
Abstract: In-Context Learning (ICL) aims to understand a new task from a few demonstrations (i.e., prompts) and predict new inputs without tuning the model. While it has been widely studied in NLP, it is still a relatively new area of research in computer vision. To reveal the factors influencing the performance of visual in-context learning, this paper shows that prompt selection and prompt fusion are two major factors that have a direct impact on the inference performance of visual in-context learning. Prompt selection is the process of identifying the most appropriate prompt or example to help the model understand new tasks. This is important because providing the model with relevant prompts can help it learn more effectively and efficiently. Prompt fusion involves combining knowledge from different positions within the large-scale visual model. By doing this, the model can leverage the diverse knowledge stored in different parts of the model to improve its performance on new tasks. Based on these findings, we propose a simple framework, prompt-SelF, for visual in-context learning. Specifically, we first use the pixel-level retrieval method to select a suitable prompt, and then use different prompt fusion methods to activate all the knowledge stored in the large-scale model, and finally ensemble the prediction results obtained from different prompt fusion methods to obtain the final prediction results. We conduct extensive experiments on single-object segmentation and detection tasks to demonstrate the effectiveness of prompt-SelF. Remarkably, prompt-SelF outperforms OSLSM-based meta-learning in 1-shot segmentation for the first time. This indicates the great potential of visual in-context learning. The source code and models will be available at https://github.com/syp2ysy/prompt-SelF.
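Pixel-level prompt retrieval can be sketched as a cosine-similarity search over spatial features; the frozen feature extractor and feature shapes below are assumptions for illustration, not the prompt-SelF implementation:

```python
import torch
import torch.nn.functional as F

def select_prompt(query_feat, prompt_feats):
    """Toy pixel-level prompt retrieval: average per-pixel cosine similarity between
    the query's spatial features and each candidate prompt's features, and return
    the index of the best-matching prompt.
    query_feat: (C, H, W); prompt_feats: (N, C, H, W) from any frozen encoder."""
    q = F.normalize(query_feat.flatten(1), dim=0)        # (C, H*W), unit-norm per pixel
    scores = []
    for p in prompt_feats:
        k = F.normalize(p.flatten(1), dim=0)
        scores.append((q * k).sum(dim=0).mean())         # mean per-pixel cosine similarity
    return int(torch.stack(scores).argmax())

# Usage: best = select_prompt(torch.randn(64, 14, 14), torch.randn(8, 64, 14, 14))
```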
[363] DeGMix: Efficient Multi-Task Dense Prediction with Deformable and Gating Mixer
Yangyang Xu, Yibo Yang, Bernard Ghanem, Lefei Zhang, Bo Du, Jun Zhu
Main category: cs.CV
TL;DR: DeGMix combines deformable CNN and query-based Transformer with shared gating for efficient multi-task dense prediction, achieving better performance with lower computational cost than existing CNN-based and Transformer-based models.
Details
Motivation: To integrate the strengths of CNNs (capturing local spatial patterns) and Transformers (capturing long-range dependencies) for more robust multi-task learning models, addressing limitations of using either architecture independently.
Method: Uses deformable mixer encoder with channel-aware mixing and spatial-aware deformable operators, combined with task-aware gating transformer decoder that employs task interaction blocks with self-attention and task query blocks with gating attention for dynamic feature selection.
Result: Significantly outperforms current Transformer-based and CNN-based competitive models on various metrics across three dense prediction datasets while using fewer GFLOPs.
Conclusion: DeGMix provides a simple and efficient solution for multi-task dense prediction that combines the advantages of both CNN and Transformer architectures with lower cost, less complexity, and smaller parameters than traditional MTL methods.
Abstract: Convolutional neural networks and Transformers have their own advantages and both have been widely used for dense prediction in multi-task learning (MTL). Existing studies typically employ either CNNs (which effectively capture local spatial patterns) or Transformers (which capture long-range dependencies) independently, but integrating their strengths may yield more robust models. In this work, we present an efficient MTL model that combines the adaptive capabilities of deformable CNN and query-based Transformer with shared gating for MTL of dense prediction. This combination may offer a simple and efficient solution owing to its powerful and flexible task-specific learning and the advantages of lower cost, less complexity, and smaller parameters than traditional MTL methods. We introduce an efficient multi-task dense prediction with deformable and gating mixer (DeGMix). First, the deformable mixer encoder contains two types of operators: the channel-aware mixing operator leveraged to allow communication among different channels, and the spatial-aware deformable operator with deformable convolution applied to efficiently sample more informative spatial locations. Second, the task-aware gating transformer decoder is used to perform task-specific predictions, in which task interaction block integrated with self-attention is applied to capture task interaction features, and the task query block integrated with gating attention is leveraged to dynamically select the corresponding task-specific features. Furthermore, experimental results demonstrate that the proposed DeGMix uses fewer GFLOPs and significantly outperforms current Transformer-based and CNN-based competitive models on a variety of metrics on three dense prediction datasets. Our code and models are available at https://github.com/yangyangxu0/DeMTG.
[364] HAT: Hybrid Attention Transformer for Image Restoration
Xiangyu Chen, Xintao Wang, Wenlong Zhang, Xiangtao Kong, Yu Qiao, Jiantao Zhou, Chao Dong
Main category: cs.CV
TL;DR: Proposes Hybrid Attention Transformer (HAT) for image restoration, combining channel attention and window-based self-attention with overlapping cross-attention to activate more input pixels and achieve state-of-the-art performance.
Details
Motivation: Existing Transformer-based methods for image restoration only utilize limited spatial range of input information, not fully exploiting Transformer's potential.
Method: Hybrid Attention Transformer (HAT) combining channel attention and window-based self-attention, with overlapping cross-attention module for cross-window interaction, plus same-task pre-training strategy.
Result: Achieves state-of-the-art performance on image super-resolution, Gaussian denoising, and compression artifacts reduction across benchmark and real-world datasets.
Conclusion: HAT effectively activates more input pixels and fully exploits Transformer potential for superior image restoration performance.
Abstract: Transformer-based methods have shown impressive performance in image restoration tasks, such as image super-resolution and denoising. However, we find that these networks can only utilize a limited spatial range of input information through attribution analysis. This implies that the potential of Transformer is still not fully exploited in existing networks. In order to activate more input pixels for better restoration, we propose a new Hybrid Attention Transformer (HAT). It combines both channel attention and window-based self-attention schemes, thus making use of their complementary advantages. Moreover, to better aggregate the cross-window information, we introduce an overlapping cross-attention module to enhance the interaction between neighboring window features. In the training stage, we additionally adopt a same-task pre-training strategy to further exploit the potential of the model for further improvement. Extensive experiments have demonstrated the effectiveness of the proposed modules. We further scale up the model to show that the performance of the SR task can be greatly improved. Besides, we extend HAT to more image restoration applications, including real-world image super-resolution, Gaussian image denoising and image compression artifacts reduction. Experiments on benchmark and real-world datasets demonstrate that our HAT achieves state-of-the-art performance both quantitatively and qualitatively. Codes and models are publicly available at https://github.com/XPixelGroup/HAT.
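For readers unfamiliar with the channel-attention side of such hybrid blocks, a generic squeeze-and-excitation-style module looks as follows; this is a minimal sketch, and HAT's actual block and hyperparameters may differ:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Generic squeeze-and-excitation style channel attention: pool spatially,
    predict a per-channel gate, and rescale the feature map by it."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze: global spatial average
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # per-channel gate in (0, 1)
        )

    def forward(self, x):                                  # x: (B, C, H, W)
        return x * self.gate(x)                            # rescale channels by their gate

# Usage: y = ChannelAttention(64)(torch.randn(2, 64, 32, 32))
```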
[365] Targeted Attack Improves Protection against Unauthorized Diffusion Customization
Boyang Zheng, Chumeng Liang, Xiaoyu Wu
Main category: cs.CV
TL;DR: The paper proposes using targeted adversarial attacks instead of untargeted ones to better protect images from unauthorized diffusion model customization, showing superior performance in poisoning models and degrading generated image quality.
Details
Motivation: Current protection methods using untargeted adversarial attacks are not effective enough against unauthorized fine-tuning of diffusion models on protected images.
Method: Introduce targeted adversarial attacks with carefully selected targets to poison diffusion models and degrade customization quality, validated on two mainstream customization methods.
Result: Targeted attacks significantly outperform untargeted attacks in poisoning diffusion models and reducing the quality of customized images, with extensive experimental validation.
Conclusion: Targeted attacks are more effective than untargeted attacks for protecting images against unauthorized diffusion model customization, revealing a new vulnerability in diffusion models.
Abstract: Diffusion models set a new milestone for image generation yet raise public concerns, as they can be fine-tuned on unauthorized images for customization. Protection based on adversarial attacks has emerged to counter this unauthorized diffusion customization by adding protective watermarks to images and poisoning diffusion models. However, current protection, leveraging untargeted attacks, does not appear to be effective enough. In this paper, we propose a simple yet effective improvement for the protection against unauthorized diffusion customization by introducing targeted attacks. We show that by carefully selecting the target, targeted attacks significantly outperform untargeted attacks in poisoning diffusion models and degrading the customization image quality. Extensive experiments validate the superiority of our method on two mainstream customization methods of diffusion models, compared to existing protections. To explain the surprising success of targeted attacks, we delve into the mechanism of attack-based protections and propose a hypothesis based on our observation, which enhances the comprehension of attack-based protections. To the best of our knowledge, we are the first to both reveal the vulnerability of diffusion models to targeted attacks and leverage targeted attacks to enhance protection against unauthorized diffusion customization. Our code is available on GitHub: https://github.com/psyker-team/mist-v2.
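The shift from untargeted to targeted protection can be illustrated with a generic PGD-style loop that pushes the image's features toward a chosen target rather than simply maximizing a loss; the `encode` function, step sizes, and budget below are assumptions for illustration, not the Mist-v2 code:

```python
import torch
import torch.nn.functional as F

def targeted_perturbation(encode, x, x_target, eps=8 / 255, alpha=1 / 255, steps=40):
    """Toy targeted protection: descend toward the feature/latent of a chosen target
    image x_target within an L-inf budget eps (an untargeted attack would instead
    ascend on some loss). `encode` stands for any differentiable encoder."""
    delta = torch.zeros_like(x, requires_grad=True)
    with torch.no_grad():
        z_target = encode(x_target)
    for _ in range(steps):
        loss = F.mse_loss(encode(x + delta), z_target)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # targeted: move toward the target features
            delta.clamp_(-eps, eps)              # respect the perturbation budget
            delta.grad = None
    return (x + delta).detach().clamp(0.0, 1.0)
```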
[366] Balancing Efficiency and Quality: MoEISR for Arbitrary-Scale Image Super-Resolution
Young Jae Oh, Jihun Kim, Jihoon Nam, Tae Hyun Kim
Main category: cs.CV
TL;DR: MoEISR is an efficient arbitrary-scale image super-resolution framework that uses a mixture-of-experts approach to reduce computational costs while maintaining reconstruction quality.
Details
Motivation: Existing implicit neural function methods for arbitrary-scale super-resolution are computationally demanding as they query every target pixel to a single expensive decoder.
Method: MoEISR uses a lightweight mapper module to dynamically allocate the most suitable decoding expert to each pixel, allowing experts with varying capacities to handle regions with different complexities.
Result: MoEISR significantly reduces FLOPs while delivering comparable or superior PSNR compared to existing methods.
Conclusion: The proposed MoEISR framework enables efficient arbitrary-scale super-resolution without sacrificing reconstruction quality through dynamic expert allocation.
Abstract: Arbitrary-scale image super-resolution employing implicit neural functions has gained significant attention lately due to its capability to upscale images across diverse scales utilizing only a single model. Nevertheless, these methodologies have imposed substantial computational demands as they involve querying every target pixel to a single resource-intensive decoder. In this paper, we introduce a novel and efficient framework, the Mixture-of-Experts Implicit Super-Resolution (MoEISR), which enables super-resolution at arbitrary scales with significantly increased computational efficiency without sacrificing reconstruction quality. MoEISR dynamically allocates the most suitable decoding expert to each pixel using a lightweight mapper module, allowing experts with varying capacities to reconstruct pixels across regions with diverse complexities. Our experiments demonstrate that MoEISR significantly reduces the number of floating point operations (FLOPs) while delivering comparable or superior peak signal-to-noise ratio (PSNR).
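Per-pixel routing to experts of different capacity could look roughly like this; the expert architectures, router, and hard top-1 assignment are assumptions for illustration, not the MoEISR design:

```python
import torch
import torch.nn as nn

class ToyMoEDecoder(nn.Module):
    """Toy per-pixel mixture-of-experts decoder: a lightweight mapper assigns every
    pixel feature to one of several MLP experts of different capacity, so cheap
    experts handle easy regions and only hard pixels pay for the large expert."""
    def __init__(self, dim=64, hidden_sizes=(16, 64, 256), out_dim=3):
        super().__init__()
        self.out_dim = out_dim
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, h), nn.ReLU(), nn.Linear(h, out_dim))
            for h in hidden_sizes
        )
        self.mapper = nn.Linear(dim, len(hidden_sizes))    # lightweight per-pixel router

    def forward(self, feat):                     # feat: (num_pixels, dim)
        idx = self.mapper(feat).argmax(dim=-1)   # hard expert assignment per pixel
        out = feat.new_zeros(feat.size(0), self.out_dim)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = expert(feat[mask])   # run each expert only on its own pixels
        return out

# Usage: rgb = ToyMoEDecoder()(torch.randn(1000, 64))   # 1000 queried pixels -> (1000, 3)
```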
[367] VRP-SAM: SAM with Visual Reference Prompt
Yanpeng Sun, Jiahui Chen, Shan Zhang, Xinyu Zhang, Xiaofan Li, Qiang Chen, Gang Zhang, Errui Ding, Jingdong Wang, Zechao Li
Main category: cs.CV
TL;DR: VRP-SAM enhances SAM by using annotated reference images as prompts for segmentation, supporting multiple annotation formats and achieving SOTA performance with minimal parameters.
Details
Motivation: To extend SAM's versatility by enabling it to use annotated reference images as prompts for specific object segmentation, making it more user-friendly while preserving SAM's strengths.
Method: Proposes a Visual Reference Prompt (VRP) encoder that supports various annotation formats (point, box, scribble, mask) and uses meta-learning for generalization. VRP-SAM leverages reference images to understand and segment specific objects in target images.
Result: Achieved state-of-the-art performance on Pascal and COCO datasets in visual reference segmentation with minimal learnable parameters. Demonstrates strong generalization for unseen objects and cross-domain segmentation.
Conclusion: VRP-SAM successfully extends SAM’s capabilities for reference-based segmentation while maintaining efficiency, showing excellent generalization across domains and object types.
Abstract: In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to utilize annotated reference images as prompts for segmentation, creating the VRP-SAM model. In essence, VRP-SAM can utilize annotated reference images to comprehend specific objects and perform segmentation of specific objects in the target image. Note that the VRP encoder can support a variety of annotation formats for reference images, including \textbf{point}, \textbf{box}, \textbf{scribble}, and \textbf{mask}. VRP-SAM achieves a breakthrough within the SAM framework by extending its versatility and applicability while preserving SAM's inherent strengths, thus enhancing user-friendliness. To enhance the generalization ability of VRP-SAM, the VRP encoder adopts a meta-learning strategy. To validate the effectiveness of VRP-SAM, we conducted extensive empirical studies on the Pascal and COCO datasets. Remarkably, VRP-SAM achieved state-of-the-art performance in visual reference segmentation with minimal learnable parameters. Furthermore, VRP-SAM demonstrates strong generalization capabilities, allowing it to perform segmentation of unseen objects and enabling cross-domain segmentation. The source code and models will be available at https://github.com/syp2ysy/VRP-SAM
[368] OpenMaterial: A Large-scale Dataset of Complex Materials for 3D Reconstruction
Zheng Dang, Jialu Huang, Fei Wang, Mathieu Salzmann
Main category: cs.CV
TL;DR: OpenMaterial is a large-scale semi-synthetic dataset for benchmarking material-aware 3D reconstruction, containing 1,001 objects with 295 materials captured under 714 lighting conditions, enabling evaluation of 3D reconstruction methods on challenging materials.
Details
Motivation: Current 3D reconstruction methods struggle with objects having complex optical properties (metals, glass, plastics) due to breakdown of multi-view color consistency from specular reflections, refractions, and transparency, and lack of benchmark datasets modeling material-dependent light transport.
Method: Created OpenMaterial dataset by integrating lab-measured Index of Refraction (IOR) spectra to generate high-fidelity multi-view images simulating complex light-matter interactions, providing multi-view images, 3D shapes, camera poses, depth maps, and object masks.
Result: Evaluated 11 state-of-the-art 3D reconstruction and novel view synthesis methods, conducting ablation studies on material type, shape complexity, and illumination impact. The dataset provides a strong benchmark for developing robust reconstruction techniques.
Conclusion: OpenMaterial establishes the first extensive benchmark for material-aware 3D reconstruction, enabling development of more robust, physically-informed techniques to handle real-world optical complexities in 3D reconstruction.
Abstract: Recent advances in deep learning, such as neural radiance fields and implicit neural representations, have significantly advanced 3D reconstruction. However, accurately reconstructing objects with complex optical properties, such as metals, glass, and plastics, remains challenging due to the breakdown of multi-view color consistency in the presence of specular reflections, refractions, and transparency. This limitation is further exacerbated by the lack of benchmark datasets that explicitly model material-dependent light transport. To address this, we introduce OpenMaterial, a large-scale semi-synthetic dataset for benchmarking material-aware 3D reconstruction. It comprises 1,001 objects spanning 295 distinct materials, including conductors, dielectrics, plastics, and their roughened variants, captured under 714 diverse lighting conditions. By integrating lab-measured Index of Refraction (IOR) spectra, OpenMaterial enables the generation of high-fidelity multi-view images that accurately simulate complex light-matter interactions. It provides multi-view images, 3D shape models, camera poses, depth maps, and object masks, establishing the first extensive benchmark for evaluating 3D reconstruction on challenging materials. We evaluate 11 state-of-the-art methods for 3D reconstruction and novel view synthesis, conducting ablation studies to assess the impact of material type, shape complexity, and illumination on reconstruction performance. Our results indicate that OpenMaterial provides a strong and fair basis for developing more robust, physically-informed 3D reconstruction techniques to better handle real-world optical complexities.
[369] Bidirectional Regression for Monocular 6DoF Head Pose Estimation and Reference System Alignment
Sungho Chun, Boeun Kim, Hyung Jin Chang, Ju Yong Chang
Main category: cs.CV
TL;DR: TRGv2 is a lightweight 6DoF head pose estimation method that improves robustness through bidirectional facial geometry-pose interaction, iterative refinement with landmark projection, analytic depth estimation via correction parameters, and cross-dataset bias correction.
Details
Motivation: Existing monocular 6DoF head pose estimation methods struggle with robustness, particularly for safety-critical applications and human-computer interaction. There's also an overlooked bias issue in cross-dataset evaluations due to inconsistent head center definitions.
Method: TRGv2 extends the TRG network by explicitly modeling bidirectional facial geometry-pose interaction through iterative refinement with landmark-to-image projection. It regresses correction parameters combined with pinhole camera model for analytic depth estimation, and introduces reference system alignment to correct cross-dataset translation bias.
Result: Extensive experiments on ARKitFace, BIWI, and DD-Pose benchmarks show TRGv2 outperforms state-of-the-art methods in both accuracy and efficiency.
Conclusion: TRGv2 provides a robust and efficient solution for 6DoF head pose estimation by addressing both model architecture limitations and cross-dataset evaluation biases, with publicly available code and annotations.
Abstract: Precise six-degree-of-freedom (6DoF) head pose estimation is crucial for safety-critical applications and human-computer interaction scenarios, yet existing monocular methods still struggle with robust pose estimation. We revisit this problem by introducing TRGv2, a lightweight extension of our previous Translation, Rotation, and Geometry (TRG) network, which explicitly models the bidirectional interaction between facial geometry and head pose. TRGv2 jointly infers facial landmarks and 6DoF pose through an iterative refinement loop with landmark-to-image projection, ensuring metric consistency among face size, rotation, and depth. To further improve generalization to out-of-distribution data, TRGv2 regresses correction parameters instead of directly predicting translation, combining them with a pinhole camera model for analytic depth estimation. In addition, we identify a previously overlooked source of bias in cross-dataset evaluations due to inconsistent head center definitions across different datasets. To address this, we propose a reference system alignment strategy that quantifies and corrects translation bias, enabling fair comparisons across datasets. Extensive experiments on ARKitFace, BIWI, and the challenging DD-Pose benchmarks demonstrate that TRGv2 outperforms state-of-the-art methods in both accuracy and efficiency. Code and newly annotated landmarks for DD-Pose will be publicly available.
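The analytic depth step rests on the pinhole relation between metric face size, imaged face size, and focal length; in the sketch below the correction factor is a stand-in for the regressed parameters and the numbers are purely illustrative:

```python
def analytic_face_depth(focal_px, face_size_m, face_size_px, correction=1.0):
    """Pinhole-camera depth estimate: an object of metric size S imaged with focal
    length f (in pixels) spans s pixels at depth Z = f * S / s. `correction` is a
    hypothetical stand-in for regressed parameters that adapt the canonical face
    size / box tightness per image."""
    return correction * focal_px * face_size_m / face_size_px

# e.g. a 0.18 m face spanning 150 px under a 1200 px focal length sits ~1.44 m away
print(analytic_face_depth(1200.0, 0.18, 150.0))
```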
[370] Preliminary study on artificial intelligence methods for cybersecurity threat detection in computer networks based on raw data packets
Aleksander Ogonowski, Michał Żebrowski, Arkadiusz Ćwiek, Tobiasz Jarosiewicz, Konrad Klimaszewski, Adam Padee, Piotr Wasiuk, Michał Wójcik
Main category: cs.CV
TL;DR: A deep learning approach for real-time intrusion detection using raw packet data, converting packets into 2D image representations for computer vision models.
Details
Motivation: Traditional intrusion detection methods based on traffic flow characteristics don't fully leverage deep learning's potential for direct feature extraction from raw packets, and they hinder real-time monitoring due to processing delays and software dependencies.
Method: Packets are stacked into windows and converted into 2D image representations, then processed using computer vision models for attack detection directly from raw packet data.
Result: The approach was evaluated using the CIC IDS-2017 dataset containing both benign traffic and real-world attacks, providing a comprehensive testing foundation.
Conclusion: The proposed method enables real-time intrusion detection by directly processing raw packet data through deep learning models, overcoming limitations of traditional flow-based approaches.
Abstract: Most of the intrusion detection methods in computer networks are based on traffic flow characteristics. However, this approach may not fully exploit the potential of deep learning algorithms to directly extract features and patterns from raw packets. Moreover, it impedes real-time monitoring due to the necessity of waiting for the processing pipeline to complete and introduces dependencies on additional software components. In this paper, we investigate deep learning methodologies capable of detecting attacks in real-time directly from raw packet data within network traffic. We propose a novel approach where packets are stacked into windows and separately recognised, with a 2D image representation suitable for processing with computer vision models. Our investigation utilizes the CIC IDS-2017 dataset, which includes both benign traffic and prevalent real-world attacks, providing a comprehensive foundation for our research.
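Turning raw packets into a 2D input for a vision model can be sketched as byte rows stacked per window; the window size, row length, and zero-padding scheme below are assumptions, not the paper's exact preprocessing:

```python
import numpy as np

def packets_to_image(packets, max_len=1024, window=32):
    """Toy raw-packet-to-image conversion: each packet's bytes become one row
    (truncated or zero-padded to max_len), and `window` consecutive packets are
    stacked into a 2D uint8 array that a vision model can consume."""
    rows = []
    for pkt in packets[:window]:
        b = np.frombuffer(pkt[:max_len], dtype=np.uint8)
        rows.append(np.pad(b, (0, max_len - len(b))))
    while len(rows) < window:                       # pad short windows with empty rows
        rows.append(np.zeros(max_len, dtype=np.uint8))
    return np.stack(rows)                           # shape: (window, max_len)

# Usage: img = packets_to_image([bytes.fromhex("45000054"), b"\x08\x00" * 20])
```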
[371] Scalable Autoregressive Image Generation with Mamba
Haopeng Li, Jinyue Yang, Kexin Wang, Xuerui Qiu, Yuhong Chou, Xin Li, Guoqi Li
Main category: cs.CV
TL;DR: AiM is an autoregressive image generation model using Mamba architecture that achieves state-of-the-art performance with faster inference than existing AR and diffusion models.
Details
Motivation: To replace Transformers in AR image generation with Mamba's efficient long-sequence modeling capabilities, aiming for better generation quality and faster inference speed.
Method: Uses Mamba state-space model with next-token prediction for autoregressive image generation, avoiding complex 2D adaptations while preserving Mamba's core efficiency.
Result: Achieves FID of 2.21 on ImageNet1K 256*256, surpassing comparable AR models and competing with diffusion models while being 2-10x faster in inference.
Conclusion: AiM demonstrates that Mamba architecture can effectively replace Transformers in AR image generation, achieving superior performance with significantly improved efficiency.
Abstract: We introduce AiM, an autoregressive (AR) image generative model based on Mamba architecture. AiM employs Mamba, a novel state-space model characterized by its exceptional performance for long-sequence modeling with linear time complexity, to supplant the commonly utilized Transformers in AR image generation models, aiming to achieve both superior generation quality and enhanced inference speed. Unlike existing methods that adapt Mamba to handle two-dimensional signals via multi-directional scan, AiM directly utilizes the next-token prediction paradigm for autoregressive image generation. This approach circumvents the need for extensive modifications to enable Mamba to learn 2D spatial representations. By implementing straightforward yet strategically targeted modifications for visual generative tasks, we preserve Mamba’s core structure, fully exploiting its efficient long-sequence modeling capabilities and scalability. We provide AiM models in various scales, with parameter counts ranging from 148M to 1.3B. On the ImageNet1K 256*256 benchmark, our best AiM model achieves a FID of 2.21, surpassing all existing AR models of comparable parameter counts and demonstrating significant competitiveness against diffusion models, with 2 to 10 times faster inference speed. Code is available at https://github.com/hp-l33/AiM
[372] ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers
Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun
Main category: cs.CV
TL;DR: ShortV is a training-free method that reduces computational costs in MLLMs by freezing visual token updates in ineffective layers identified using Layer Contribution metric.
Details
Motivation: MLLMs suffer from high computational costs due to massive model size and large number of visual tokens, with many layers showing minimal contribution during visual token processing.
Method: Introduces Layer Contribution (LC) metric to quantify layer impact on visual/text tokens, then freezes visual token updates in identified ineffective layers without retraining.
Result: Freezes visual tokens in ~60% of MLLM layers, achieving 50% FLOPs reduction on LLaVA-NeXT-13B while maintaining superior performance.
Conclusion: ShortV effectively reduces MLLM computational costs by leveraging layer-wise redundancy analysis without performance degradation.
Abstract: Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV
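The Layer Contribution idea (how much the output distribution changes when one layer's update on visual tokens is removed) can be mimicked on a toy residual model; the layers, head, KL-divergence choice, and token layout below are all assumptions for illustration rather than ShortV's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
dim, n_tokens, n_visual = 16, 10, 6
layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(4))  # stand-in residual blocks
head = nn.Linear(dim, 100)                                      # stand-in output head
tokens = torch.randn(n_tokens, dim)                             # visual tokens first, then text
visual = torch.zeros(n_tokens, dtype=torch.bool)
visual[:n_visual] = True

def output_dist(skip_layer=None):
    h = tokens
    for i, layer in enumerate(layers):
        update = layer(h)
        if i == skip_layer:                                     # freeze visual tokens in this layer
            update = torch.where(visual[:, None], torch.zeros_like(update), update)
        h = h + update                                          # residual update
    return F.log_softmax(head(h[-1]), dim=-1)                   # distribution at the last (text) token

with torch.no_grad():
    full = output_dist()
    for i in range(len(layers)):
        lc = F.kl_div(output_dist(skip_layer=i), full.exp(), reduction="sum")  # divergence as an LC proxy
        print(f"layer {i}: LC on visual tokens ~ {lc.item():.4f}")
```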
[373] ReviveDiff: A Universal Diffusion Model for Restoring Images in Adverse Weather Conditions
Wenfeng Huang, Guoan Xu, Wenjing Jia, Stuart Perry, Guangwei Gao
Main category: cs.CV
TL;DR: ReviveDiff is a universal diffusion-based network that restores images degraded by various adverse conditions like rain, underwater, low-light, smoke, and nighttime haze, outperforming state-of-the-art methods.
Details
Motivation: Images captured in challenging environments suffer from significant quality degradation, and existing task-specific solutions limit applicability across different degradation types.
Method: Leverages diffusion models to restore image quality from both macro and micro levels, addressing key quality factors like sharpness, distortion, noise, dynamic range, and color accuracy.
Result: Outperforms state-of-the-art methods on seven benchmark datasets covering five degradation types, achieving superior quantitative and visual results.
Conclusion: ReviveDiff provides an effective universal solution for restoring images degraded by various adverse environmental conditions using diffusion models.
Abstract: Images captured in challenging environments, such as nighttime, smoke, rainy weather, and underwater, often suffer from significant degradation, resulting in a substantial loss of visual quality. The effective restoration of these degraded images is critical for the subsequent vision tasks. While many existing approaches have successfully incorporated specific priors for individual tasks, these tailored solutions limit their applicability to other degradations. In this work, we propose a universal network architecture, dubbed "ReviveDiff", which can address various degradations and bring images back to life by enhancing and restoring their quality. Our approach is inspired by the observation that, unlike degradation caused by movement or electronic issues, quality degradation under adverse conditions primarily stems from natural media (such as fog, water, and low luminance), which generally preserves the original structures of objects. To restore the quality of such images, we leveraged the latest advancements in diffusion models and developed ReviveDiff to restore image quality from both macro and micro levels across some key factors determining image quality, such as sharpness, distortion, noise level, dynamic range, and color accuracy. We rigorously evaluated ReviveDiff on seven benchmark datasets covering five types of degrading conditions: Rainy, Underwater, Low-light, Smoke, and Nighttime Hazy. Our experimental results demonstrate that ReviveDiff outperforms the state-of-the-art methods both quantitatively and visually.
[374] CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward
Zhiqiang Wang, Pengbin Feng, Yanbin Lin, Shuzhang Cai, Zongao Bian, Jinghua Yan, Xingquan Zhu
Main category: cs.CV
TL;DR: Fuzzy Group Relative Policy Reward (FGRPR) combines GRPO with fuzzy rewards to improve learning efficiency by providing nuanced incentives instead of binary accuracy rewards, outperforming baselines including GPT4o and LLaMA2.
Details
Motivation: To enhance learning efficiency by replacing conventional binary 0/1 accuracy rewards with nuanced fuzzy rewards that encourage more precise outputs.
Method: Integrates Group Relative Policy Optimization (GRPO) with a fuzzy reward function that provides graded incentives based on output precision rather than binary accuracy.
Result: FGRPR applied to Qwen2.5-VL (3B and 7B) surpasses all baseline models including GPT4o, LLaMA2(90B), and SFT across five in-domain datasets. On out-of-domain data, it matches SFT performance and excels with larger target values.
Conclusion: FGRPR is broadly applicable to tasks requiring answer precision and demonstrates superior performance over traditional reward models and supervised fine-tuning.
Abstract: We propose Fuzzy Group Relative Policy Reward (FGRPR), a novel framework that integrates Group Relative Policy Optimization (GRPO) with a fuzzy reward function to enhance learning efficiency. Unlike the conventional binary 0/1 accuracy reward, our fuzzy reward model provides nuanced incentives, encouraging more precise outputs. Experimental results demonstrate that GRPO with a standard 0/1 accuracy reward underperforms compared to supervised fine-tuning (SFT). In contrast, FGRPR, applied to Qwen2.5-VL(3B and 7B), surpasses all baseline models, including GPT4o, LLaMA2(90B), and SFT, across five in-domain datasets. On an out-of-domain dataset, FGRPR achieves performance comparable to SFT but excels when target values are larger, as its fuzzy reward function assigns higher rewards to closer approximations. This approach is broadly applicable to tasks where the precision of the answer is critical. Code and data: https://github.com/yeyimilk/CrowdVLM-R1
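The fuzzy reward replaces the 0/1 accuracy signal with graded credit for near-misses. A minimal illustration, assuming a relative-error-based form (the exact function in FGRPR may differ):

```python
def fuzzy_count_reward(pred: float, target: float, eps: float = 1e-6) -> float:
    """Illustrative fuzzy reward: instead of a binary 0/1 accuracy signal,
    give graded credit that decays with the relative counting error.
    The exact functional form used by FGRPR may differ."""
    rel_err = abs(pred - target) / (abs(target) + eps)
    return max(0.0, 1.0 - rel_err)

# Example: predicting 95 people when 100 are present earns 0.95 instead of 0,
# so near-misses still provide a useful learning signal.
print(fuzzy_count_reward(95, 100))  # 0.95
```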
[375] Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context
Manuel Benavent-Lledo, David Mulero-Pérez, David Ortiz-Perez, Jose Garcia-Rodriguez, Antonis Argyros
Main category: cs.CV
TL;DR: A novel action recognition method using hierarchical organization and contextualized textual information with a transformer architecture that combines visual and textual features, achieving over 17% improvement in top-1 accuracy.
Details
Motivation: To improve action recognition by exploiting the hierarchical nature of actions and incorporating contextual information like location and previous actions to reflect temporal context.
Method: Transformer architecture combining visual features (RGB and optical flow) with textual embeddings for contextual information, using a joint loss function for simultaneous coarse- and fine-grained action recognition.
Result: Outperforms SOTA methods on Hierarchical TSU, Assembly101 and IkeaASM datasets with over 17% improvement in top-1 accuracy.
Conclusion: The proposed hierarchical and contextual approach significantly enhances action recognition performance, particularly for monitoring elderly activities in home environments.
Abstract: We propose a novel approach to improve action recognition by exploiting the hierarchical organization of actions and by incorporating contextualized textual information, including location and previous actions, to reflect the action’s temporal context. To achieve this, we introduce a transformer architecture tailored for action recognition that employs both visual and textual features. Visual features are obtained from RGB and optical flow data, while text embeddings represent contextual information. Furthermore, we define a joint loss function to simultaneously train the model for both coarse- and fine-grained action recognition, effectively exploiting the hierarchical nature of actions. To demonstrate the effectiveness of our method, we extend the Toyota Smarthome Untrimmed (TSU) dataset by incorporating action hierarchies, resulting in the Hierarchical TSU dataset, a hierarchical dataset designed for monitoring activities of the elderly in home environments. An ablation study assesses the performance impact of different strategies for integrating contextual and hierarchical data. Experimental results demonstrate that the proposed method consistently outperforms SOTA methods on the Hierarchical TSU dataset, Assembly101 and IkeaASM, achieving over a 17% improvement in top-1 accuracy.
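The joint coarse/fine objective can be pictured as a weighted sum of per-level cross-entropy terms. The sketch below is illustrative; the weights are assumptions rather than values from the paper.

```python
import torch.nn as nn

class HierarchicalActionLoss(nn.Module):
    """Illustrative joint objective: one cross-entropy term per level of the
    action hierarchy, weighted and summed so the coarse and fine heads are
    trained simultaneously (weights are assumptions, not from the paper)."""
    def __init__(self, w_coarse: float = 1.0, w_fine: float = 1.0):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.w_coarse, self.w_fine = w_coarse, w_fine

    def forward(self, coarse_logits, fine_logits, coarse_labels, fine_labels):
        return (self.w_coarse * self.ce(coarse_logits, coarse_labels)
                + self.w_fine * self.ce(fine_logits, fine_labels))
```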
[376] Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, Yan Lu
Main category: cs.CV
TL;DR: The paper proposes Deep Video Discovery (DVD), an agentic system that uses adaptive search strategies over segmented video clips to improve long-form video understanding, achieving state-of-the-art performance on challenging benchmarks.
Details
Motivation: Long-form video understanding is challenging due to extensive temporal-spatial complexity and difficulty in question answering over extended contexts. Current LLMs still struggle with information-dense hour-long videos despite advancements in video analysis and long context handling.
Method: DVD uses an agentic search strategy over segmented video clips with search-centric tools on multi-granular video database. It leverages LLM reasoning to plan adaptive workflows, strategically selecting tools based on current observation state and gathered information for different queries.
Result: Achieves state-of-the-art performance on LVBench dataset with 74.2% accuracy, substantially surpassing all prior works, and further improves to 76.0% with transcripts.
Conclusion: The DVD agent’s autonomous and adaptive approach using agentic search strategies effectively addresses long-form video understanding challenges, demonstrating superior performance compared to previous methods with predefined workflows.
Abstract: Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery (DVD) agent to leverage an agentic search strategy over segmented video clips. Unlike previous video agents that rely on predefined workflows applied uniformly across different queries, our approach emphasizes the autonomous and adaptive nature of agents. By providing a set of search-centric tools on multi-granular video database, our DVD agent leverages the advanced reasoning capability of LLM to plan on its current observation state, strategically selects tools to orchestrate adaptive workflow for different queries in light of the gathered information. We perform comprehensive evaluation on multiple long video understanding benchmarks that demonstrates our advantage. Our DVD agent achieves state-of-the-art performance on the challenging LVBench dataset, reaching an accuracy of 74.2%, which substantially surpasses all prior works, and further improves to 76.0% with transcripts. The code has been released at https://github.com/microsoft/DeepVideoDiscovery.
[377] TESGNN: Temporal Equivariant Scene Graph Neural Networks for Efficient and Robust Multi-View 3D Scene Understanding
Quang P. M. Pham, Khoi T. N. Nguyen, Lan C. Ngo, Truong Do, Dezhen Song, Truong-Son Hy
Main category: cs.CV
TL;DR: Proposes TESGNN, a temporal equivariant scene graph neural network that preserves symmetry in 3D point cloud scene graph generation and incorporates temporal modeling for dynamic scenes, achieving higher accuracy and faster convergence.
Details
Motivation: Current scene graph methods overlook symmetry preservation in 3D point clouds and lack temporal modeling for dynamic scenes, leading to reduced accuracy and robustness with noisy, multi-view data.
Method: TESGNN consists of two components: (1) Equivariant Scene Graph Neural Network (ESGNN) that preserves symmetry properties when generating scene graphs from 3D point clouds, and (2) Temporal Graph Matching Network that fuses scene graphs across multiple time sequences using approximate graph-matching.
Result: TESGNN achieves higher accuracy and faster training convergence compared to existing methods. The symmetry-preserving property produces more stable and accurate global scene representations. The method is computationally efficient and suitable for real-time applications.
Conclusion: TESGNN provides a robust and scalable solution for complex multi-view scene understanding challenges, paving the way for improved performance in robotics and computer vision applications.
Abstract: Scene graphs have proven to be highly effective for various scene understanding tasks due to their compact and explicit representation of relational information. However, current methods often overlook the critical importance of preserving symmetry when generating scene graphs from 3D point clouds, which can lead to reduced accuracy and robustness, particularly when dealing with noisy, multi-view data. Furthermore, a major limitation of prior approaches is the lack of temporal modeling to capture time-dependent relationships among dynamically evolving entities in a scene. To address these challenges, we propose Temporal Equivariant Scene Graph Neural Network (TESGNN), consisting of two key components: (1) an Equivariant Scene Graph Neural Network (ESGNN), which extracts information from 3D point clouds to generate scene graphs while preserving crucial symmetry properties, and (2) a Temporal Graph Matching Network, which fuses scene graphs generated by ESGNN across multiple time sequences into a unified global representation using an approximate graph-matching algorithm. Our combined architecture, TESGNN, is shown to be effective compared to existing methods in scene graph generation, achieving higher accuracy and faster training convergence. Moreover, we show that leveraging the symmetry-preserving property produces a more stable and accurate global scene representation compared to existing approaches. Finally, it is computationally efficient and easily implementable using existing frameworks, making it well-suited for real-time applications in robotics and computer vision. This approach paves the way for more robust and scalable solutions to complex multi-view scene understanding challenges. Our source code is publicly available at: https://github.com/HySonLab/TESGraph
[378] V2X-Radar: A Multi-modal Dataset with 4D Radar for Cooperative Perception
Lei Yang, Xinyu Zhang, Jun Li, Chen Wang, Jiaqi Ma, Zhiying Song, Tong Zhao, Ziying Song, Li Wang, Mo Zhou, Yang Shen, Kai Wu, Chen Lv
Main category: cs.CV
TL;DR: V2X-Radar is the first large-scale real-world multi-modal dataset for cooperative perception featuring 4D Radar, addressing the gap in existing datasets that focus primarily on cameras and LiDAR.
Details
Motivation: Existing cooperative perception datasets neglect 4D Radar, which provides robust perception in adverse weather conditions. This gap limits the development of comprehensive autonomous driving systems that can handle various environmental challenges.
Method: Collected data using connected vehicle platform and intelligent roadside unit equipped with 4D Radar, LiDAR, and multi-view cameras across diverse weather conditions (sunny/rainy) and times (daytime/dusk/nighttime).
Result: Created V2X-Radar dataset with 20K LiDAR frames, 40K camera images, 20K 4D Radar data, and 350K annotated boxes across five categories. Established three sub-datasets for different research domains: cooperative perception, roadside perception, and single-vehicle perception.
Conclusion: V2X-Radar bridges the critical gap in 4D Radar datasets for cooperative perception and provides comprehensive benchmarks to advance research in autonomous driving perception systems across various scenarios and conditions.
Abstract: Modern autonomous vehicle perception systems often struggle with occlusions and limited perception range. Previous studies have demonstrated the effectiveness of cooperative perception in extending the perception range and overcoming occlusions, thereby enhancing the safety of autonomous driving. In recent years, a series of cooperative perception datasets have emerged; however, these datasets primarily focus on cameras and LiDAR, neglecting 4D Radar, a sensor used in single-vehicle autonomous driving to provide robust perception in adverse weather conditions. In this paper, to bridge the gap created by the absence of 4D Radar datasets in cooperative perception, we present V2X-Radar, the first large-scale, real-world multi-modal dataset featuring 4D Radar. V2X-Radar dataset is collected using a connected vehicle platform and an intelligent roadside unit equipped with 4D Radar, LiDAR, and multi-view cameras. The collected data encompasses sunny and rainy weather conditions, spanning daytime, dusk, and nighttime, as well as various typical challenging scenarios. The dataset consists of 20K LiDAR frames, 40K camera images, and 20K 4D Radar data, including 350K annotated boxes across five categories. To support various research domains, we have established V2X-Radar-C for cooperative perception, V2X-Radar-I for roadside perception, and V2X-Radar-V for single-vehicle perception. Furthermore, we provide comprehensive benchmarks across these three sub-datasets. We will release all datasets and benchmark codebase at https://huggingface.co/datasets/yanglei18/V2X-Radar and https://github.com/yanglei18/V2X-Radar.
[379] Phys4DGen: Physics-Compliant 4D Generation with Multi-Material Composition Perception
Jiajing Lin, Zhenzhong Wang, Dejun Xu, Shu Jiang, YunPeng Gong, Min Jiang
Main category: cs.CV
TL;DR: Phys4DGen is a novel 4D generation framework that automates physically plausible 4D content creation by integrating multi-material composition perception with physical simulation, eliminating the need for manual material specification.
Details
Motivation: Current 4D generation approaches require manual material property specification by users without physics expertise and struggle with multi-material composite objects, limiting their practical applicability.
Method: Three modules: 3D Material Grouping for surface material segmentation, Internal Physical Structure Discovery for interior mechanical structure construction, and physical prior distillation from multimodal LLMs for automatic material property identification.
Result: Experiments on synthetic and real-world datasets show Phys4DGen generates high-fidelity 4D content with physical realism in open-world scenarios, significantly outperforming state-of-the-art methods.
Conclusion: Phys4DGen successfully addresses limitations of current 4D generation methods by automating material property identification and handling multi-material composites, enabling physically plausible 4D content generation without requiring physics expertise from users.
Abstract: 4D content generation aims to create dynamically evolving 3D content that responds to specific input objects such as images or 3D representations. Current approaches typically incorporate physical priors to animate 3D representations, but these methods suffer from significant limitations: they not only require users lacking physics expertise to manually specify material properties but also struggle to effectively handle the generation of multi-material composite objects. To address these challenges, we propose Phys4DGen, a novel 4D generation framework that integrates multi-material composition perception with physical simulation. The framework achieves automated, physically plausible 4D generation through three innovative modules: first, the 3D Material Grouping module partitions heterogeneous material regions on 3D representations’ surfaces via semantic segmentation; second, the Internal Physical Structure Discovery module constructs the mechanical structure of object interiors; finally, we distill physical prior knowledge from multimodal large language models to enable rapid and automatic material properties identification for both objects’ surfaces and interiors. Experiments on both synthetic and real-world datasets demonstrate that Phys4DGen can generate high-fidelity 4D content with physical realism in open-world scenarios, significantly outperforming state-of-the-art methods.
[380] PRISM2: Unlocking Multi-Modal General Pathology AI with Clinical Dialogue
Eugene Vorontsov, George Shaikovski, Adam Casson, Julian Viret, Eric Zimmermann, Neil Tenenholtz, Yi Kan Wang, Jan H. Bernhard, Ran A. Godrich, Juan A. Retamero, Jinru Shia, Mithat Gonen, Martin R. Weiser, David S. Klimstra, Razik Yousfi, Nicolo Fusi, Thomas J. Fuchs, Kristen Severson, Siqi Liu
Main category: cs.CV
TL;DR: PRISM2 is a multimodal slide-level foundation model trained on 700,000 diagnostic specimen-report pairs, achieving state-of-the-art performance in cancer detection and other pathology tasks through clinical-dialogue supervision.
Details
Motivation: To bridge the gap between computational pathology foundation models and clinical utility by aligning histomorphologic features with diagnostic reasoning language.
Method: Multimodal foundation model trained on 2.3M whole slide images and 14M question-answer pairs using clinical-dialogue supervision to learn slide-level representations.
Result: Matches or exceeds cancer-detection performance of clinical-grade products without additional training, achieves top performance on other tasks, and task-specific finetuning further improves performance.
Conclusion: Language-supervised pretraining provides scalable, clinically grounded signal for learning generalizable pathology representations that bridge human diagnostic reasoning and foundation-model performance.
Abstract: Recent rapid progress in the field of computational pathology has been enabled by foundation models. These models are beginning to move beyond encoding image patches towards whole-slide understanding but their clinical utility remains limited. In this work, we present PRISM2, a multimodal slide-level foundation model trained on data from 700,000 diagnostic specimen-report pairs, the largest vision (2.3 million whole slide images) and language (14M question-answer pairs) histopathology dataset to date. By learning through clinical-dialogue supervision, PRISM2 aligns histomorphologic features with the language of diagnostic reasoning, producing slide-level representations that support both direct diagnostic question-answering and transferable embeddings for downstream tasks. Without additional training, PRISM2 matches or exceeds the cancer-detection performance of clinical-grade products. This is observed without loss of generality on other tasks, where PRISM2 achieves top performance. Finally, using survival prediction as the example, we show that task-specific finetuning with a large dataset can outperform task-specific models, further improving performance. These results demonstrate how language-supervised pretraining provides a scalable, clinically grounded signal for learning generalizable pathology representations, bridging human diagnostic reasoning and foundation-model performance.
[381] Gaussian Splashing: Direct Volumetric Rendering Underwater
Nir Mualem, Roy Amoyal, Oren Freifeld, Derya Akkaynak
Main category: cs.CV
TL;DR: Gaussian Splashing is a fast underwater 3D reconstruction method that adapts 3D Gaussian Splatting with an underwater image formation model, achieving 140 FPS rendering and minutes-long reconstruction while revealing distant scene details with superior clarity.
Details
Motivation: Existing 3D reconstruction methods like NeRFs and 3DGS fail on underwater scenes due to water occlusion effects, and while underwater NeRF adaptations exist, they are impractically slow (hours for reconstruction, <1 FPS rendering).
Method: Unifies 3DGS speed with an underwater image formation model for scattering capture, introducing innovations in rendering, depth estimation procedures, and the 3DGS loss function.
Result: Achieves reconstruction in minutes and renders novel underwater scenes at 140 FPS, producing images with superior details and revealing distant scene details with far greater clarity than other methods.
Conclusion: Gaussian Splashing dramatically improves underwater 3D reconstruction and rendering, offering unparalleled speed and image quality compared to existing methods.
Abstract: In underwater images, most useful features are occluded by water. The extent of the occlusion depends on imaging geometry and can vary even across a sequence of burst images. As a result, 3D reconstruction methods robust on in-air scenes, like Neural Radiance Field methods (NeRFs) or 3D Gaussian Splatting (3DGS), fail on underwater scenes. While a recent underwater adaptation of NeRFs achieved state-of-the-art results, it is impractically slow: reconstruction takes hours and its rendering rate, in frames per second (FPS), is less than 1. Here, we present a new method that takes only a few minutes for reconstruction and renders novel underwater scenes at 140 FPS. Named Gaussian Splashing, our method unifies the strengths and speed of 3DGS with an image formation model for capturing scattering, introducing innovations in the rendering and depth estimation procedures and in the 3DGS loss function. Despite the complexities of underwater adaptation, our method produces images at unparalleled speeds with superior details. Moreover, it reveals distant scene details with far greater clarity than other methods, dramatically improving reconstructed and rendered images. We demonstrate results on existing datasets and a new dataset we have collected. Additional visual results are available at: https://bgu-cs-vil.github.io/gaussiansplashingUW.github.io/ .
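The abstract mentions an image formation model for scattering but does not spell it out; one common underwater formulation (only an assumption here, not necessarily the one used in Gaussian Splashing) separates a range-attenuated direct signal from range-dependent backscatter:

```python
import numpy as np

def underwater_formation(J, z, beta_D, beta_B, B_inf):
    """Hypothetical underwater image formation, per color channel:
    observed = direct signal attenuated with range + backscatter that
    saturates with range. J: clear-scene color, z: range map (meters).
    Coefficients beta_D, beta_B, B_inf are per-channel water parameters."""
    direct = J * np.exp(-beta_D * z)
    backscatter = B_inf * (1.0 - np.exp(-beta_B * z))
    return direct + backscatter
```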
[382] MotionGPT3: Human Motion as a Second Modality
Bingfan Zhu, Biao Jiang, Sunyi Wang, Shixiang Tang, Tao Chen, Linjie Luo, Youyi Zheng, Xin Chen
Main category: cs.CV
TL;DR: MotionGPT3 is a bimodal motion-language model that uses a dual-stream Transformer with shared attention to handle motion and language modalities separately while enabling controlled information flow, achieving faster convergence and state-of-the-art performance.
Details
Motivation: Multimodal frameworks face complexity with growing modalities and tasks, motion quantization introduces approximation errors, and unifying discrete text with continuous motion in single-stream backbones causes cross-modal interference.
Method: Encodes raw motion into continuous latent space using VAE to avoid quantization artifacts, uses dual-stream Transformer with shared attention for modality-specific processing, and employs generate-then-align three-stage training schedule for stability.
Result: Achieves 2x faster convergence in training loss and up to 4x faster convergence in validation, while maintaining state-of-the-art performance on motion understanding and generation benchmarks.
Conclusion: The proposed bimodal architecture with continuous motion encoding and dual-stream processing effectively reduces cross-modal interference, stabilizes optimization, and accelerates convergence without degrading performance.
Abstract: With the rapid progress of large language models (LLMs), multimodal frameworks that unify understanding and generation have become promising, yet they face increasing complexity as the number of modalities and tasks grows. We observe that motion quantization introduces approximation errors that cap motion quality, and that unifying discrete text and continuous motion within a single-stream backbone amplifies cross-modal interference. Motivated by recent multi-branch Transformer designs that separate signals from different modalities, we propose MotionGPT3, a bimodal motion-language model for both understanding and generation. MotionGPT3 encodes raw motion into a continuous latent space using a variational autoencoder (VAE), thereby avoiding quantization-induced artifacts, while leveraging the semantic prior of pretrained language models. A dual-stream Transformer with shared attention preserves modality-specific routes while enabling controlled, bidirectional information flow, which reduces interference, stabilizes optimization, and empirically accelerates convergence without degrading fidelity. For multimodal joint training, a generate-then-align three-stage schedule further improves stability and limits cross-task interference. Experiments show that MotionGPT3 achieves 2x faster convergence in training loss and up to 4x faster convergence in validation, while maintaining state-of-the-art performance on standard motion understanding and motion generation benchmarks.
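One way to picture the dual-stream block with shared attention: each modality keeps its own projections, but queries, keys, and values from both streams are concatenated before attention. The single-head sketch below is only illustrative and not the released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedAttentionBlock(nn.Module):
    """Illustrative dual-stream block (single-head for brevity): text and
    motion tokens keep separate QKV projections (modality-specific routes)
    but attend over the concatenated sequence, allowing bidirectional flow.
    Layout and sizes are assumptions, not MotionGPT3's released design."""
    def __init__(self, dim: int):
        super().__init__()
        self.text_qkv = nn.Linear(dim, 3 * dim)
        self.motion_qkv = nn.Linear(dim, 3 * dim)
        self.text_out = nn.Linear(dim, dim)
        self.motion_out = nn.Linear(dim, dim)

    def forward(self, text_tokens, motion_tokens):
        tq, tk, tv = self.text_qkv(text_tokens).chunk(3, dim=-1)
        mq, mk, mv = self.motion_qkv(motion_tokens).chunk(3, dim=-1)
        q = torch.cat([tq, mq], dim=1)            # queries from both streams
        k = torch.cat([tk, mk], dim=1)            # shared keys and values
        v = torch.cat([tv, mv], dim=1)
        fused = F.scaled_dot_product_attention(q, k, v)
        n_text = text_tokens.shape[1]
        return self.text_out(fused[:, :n_text]), self.motion_out(fused[:, n_text:])
```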
[383] Epistemic Uncertainty for Generated Image Detection
Jun Nie, Yonggang Zhang, Tongliang Liu, Yiu-ming Cheung, Bo Han, Xinmei Tian
Main category: cs.CV
TL;DR: A framework using epistemic uncertainty from pre-trained vision models to detect AI-generated images by flagging high-uncertainty samples as synthetic.
Details
Motivation: Address security concerns in the generative AI era by leveraging distributional discrepancies between natural and generated images that manifest in epistemic uncertainty.
Method: Use pre-trained large-scale vision models to estimate epistemic uncertainty, converting image detection into uncertainty estimation problem by exploiting elevated uncertainty for generated images.
Result: Extensive experiments demonstrate the method’s efficacy in detecting AI-generated images using uncertainty-based approach.
Conclusion: Epistemic uncertainty from pre-trained models provides an effective proxy for detecting generated images, offering a practical solution to security challenges in generative AI.
Abstract: We introduce a novel framework for AI-generated image detection through epistemic uncertainty, aiming to address critical security concerns in the era of generative models. Our key insight stems from the observation that distributional discrepancies between training and testing data manifest distinctively in the epistemic uncertainty space of machine learning models. In this context, the distribution shift between natural and generated images leads to elevated epistemic uncertainty in models trained on natural images when evaluating generated ones. Hence, we exploit this phenomenon by using epistemic uncertainty as a proxy for detecting generated images. This converts the challenge of generated image detection into the problem of uncertainty estimation, underscoring the generalization performance of the model used for uncertainty estimation. Fortunately, advanced large-scale vision models pre-trained on extensive natural images have shown excellent generalization performance for various scenarios. Thus, we utilize these pre-trained models to estimate the epistemic uncertainty of images and flag those with high uncertainty as generated. Extensive experiments demonstrate the efficacy of our method. Code is available at https://github.com/tmlr-group/WePe.
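A minimal sketch of the detection recipe, assuming Monte Carlo style uncertainty estimates from pre-trained vision models; the concrete estimator used in the paper may differ.

```python
import torch

@torch.no_grad()
def epistemic_uncertainty(image, models, n_passes: int = 8):
    """Illustrative uncertainty proxy: run stochastic forward passes (e.g.
    dropout left active) through pre-trained vision models and measure the
    variance of their predictive distributions. Higher variance suggests the
    input is further from the natural-image training distribution."""
    probs = []
    for model in models:
        model.train()                      # keep dropout active for MC sampling
        for _ in range(n_passes):
            probs.append(torch.softmax(model(image), dim=-1))
    probs = torch.stack(probs)             # [passes, batch, classes]
    return probs.var(dim=0).mean(dim=-1)   # per-image uncertainty score

def flag_generated(image, models, threshold: float):
    """Flag images whose uncertainty exceeds a (tuned) threshold as generated."""
    return epistemic_uncertainty(image, models) > threshold
```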
[384] FlexEvent: Towards Flexible Event-Frame Object Detection at Varying Operational Frequencies
Dongyue Lu, Lingdong Kong, Gim Hee Lee, Camille Simon Chane, Wei Tsang Ooi
Main category: cs.CV
TL;DR: FlexEvent is a novel framework for event-based object detection that enables detection at varying frequencies through adaptive event-frame fusion and frequency-adaptive fine-tuning, achieving superior performance across different operational frequencies.
Details
Motivation: Existing event detectors are limited by fixed-frequency paradigms and fail to fully exploit the high-temporal resolution and adaptability of event data, which restricts their effectiveness in dynamic environments.
Method: The approach consists of two key components: FlexFuse (adaptive event-frame fusion module that integrates high-frequency event data with RGB frame semantics) and FlexTune (frequency-adaptive fine-tuning mechanism that generates frequency-adjusted labels for enhanced generalization).
Result: Extensive experiments show the method surpasses state-of-the-art methods, achieving significant improvements in both standard and high-frequency settings. It maintains robust performance from 20 Hz to 90 Hz and delivers accurate detection up to 180 Hz.
Conclusion: The framework sets a new benchmark for event-based object detection and paves the way for more adaptable, real-time vision systems by effectively leveraging the high-temporal resolution of event cameras.
Abstract: Event cameras offer unparalleled advantages for real-time perception in dynamic environments, thanks to the microsecond-level temporal resolution and asynchronous operation. Existing event detectors, however, are limited by fixed-frequency paradigms and fail to fully exploit the high-temporal resolution and adaptability of event data. To address these limitations, we propose FlexEvent, a novel framework that enables detection at varying frequencies. Our approach consists of two key components: FlexFuse, an adaptive event-frame fusion module that integrates high-frequency event data with rich semantic information from RGB frames, and FlexTune, a frequency-adaptive fine-tuning mechanism that generates frequency-adjusted labels to enhance model generalization across varying operational frequencies. This combination allows our method to detect objects with high accuracy in both fast-moving and static scenarios, while adapting to dynamic environments. Extensive experiments on large-scale event camera datasets demonstrate that our approach surpasses state-of-the-art methods, achieving significant improvements in both standard and high-frequency settings. Notably, our method maintains robust performance when scaling from 20 Hz to 90 Hz and delivers accurate detection up to 180 Hz, proving its effectiveness in extreme conditions. Our framework sets a new benchmark for event-based object detection and paves the way for more adaptable, real-time vision systems.
[385] FIRE: Robust Detection of Diffusion-Generated Images via Frequency-Guided Reconstruction Error
Beilin Chu, Xuan Xu, Xin Wang, Yufei Zhang, Weike You, Linna Zhou
Main category: cs.CV
TL;DR: FIRE is a novel method that uses frequency-guided reconstruction error to detect diffusion model generated images by analyzing mid-band frequency reconstruction limitations.
Details
Motivation: Diffusion models struggle to accurately reconstruct mid-band frequency information in real images, suggesting this limitation could serve as a cue for detecting AI-generated content to address potential misuse concerns.
Method: Proposes Frequency-guided Reconstruction Error (FIRE) which assesses variation in reconstruction error before and after frequency decomposition, investigating the influence of frequency decomposition on reconstruction error.
Result: Extensive experiments show FIRE generalizes effectively to unseen diffusion models and maintains robustness against diverse perturbations.
Conclusion: FIRE provides a robust method for identifying diffusion model generated images by leveraging frequency decomposition analysis of reconstruction errors.
Abstract: The rapid advancement of diffusion models has significantly improved high-quality image generation, making generated content increasingly challenging to distinguish from real images and raising concerns about potential misuse. In this paper, we observe that diffusion models struggle to accurately reconstruct mid-band frequency information in real images, suggesting the limitation could serve as a cue for detecting diffusion model generated images. Motivated by this observation, we propose a novel method called Frequency-guided Reconstruction Error (FIRE), which, to the best of our knowledge, is the first to investigate the influence of frequency decomposition on reconstruction error. FIRE assesses the variation in reconstruction error before and after the frequency decomposition, offering a robust method for identifying diffusion model generated images. Extensive experiments show that FIRE generalizes effectively to unseen diffusion models and maintains robustness against diverse perturbations.
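A rough sketch of a FIRE-style statistic, assuming a band-pass filter in the Fourier domain and a diffusion-based `reconstruct` function; the cutoffs and the exact way the two errors are compared are illustrative assumptions.

```python
import torch

def midband_filter(img, low: float = 0.1, high: float = 0.5):
    """Keep only mid-band spatial frequencies (cutoffs are illustrative)."""
    f = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    h, w = img.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                            indexing="ij")
    r = torch.sqrt(xx ** 2 + yy ** 2)
    mask = ((r >= low) & (r <= high)).to(img.dtype)
    return torch.fft.ifft2(torch.fft.ifftshift(f * mask, dim=(-2, -1))).real

def fire_score(img, reconstruct):
    """Illustrative FIRE-style statistic: how much the reconstruction error
    changes once the image is restricted to mid-band content. `reconstruct`
    is assumed to invert-and-regenerate the input with a diffusion model."""
    err_full = (reconstruct(img) - img).abs().mean()
    band = midband_filter(img)
    err_band = (reconstruct(band) - band).abs().mean()
    return (err_band - err_full).abs()
```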
[386] BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Sara Pieri, Saeed Yahya Alseiari, Shanavas Cholakkal, Khaled Aldahmani, Fahad Khan, Rao Anwer, Salman Khan, Timothy Baldwin, Hisham Cholakkal
Main category: cs.CV
TL;DR: BiMediX2 is a bilingual Arabic-English medical multimodal model that supports text and image interactions across various medical imaging modalities, achieving state-of-the-art performance on multiple benchmarks.
Details
Motivation: To address the need for bilingual medical AI systems that can handle both Arabic and English medical interactions across diverse imaging modalities.
Method: Curated BiMed-V dataset with 1.6M bilingual medical samples, trained BiMediX2 model supporting multi-turn conversations and medical imaging, and created BiMed-MBench evaluation benchmark verified by medical experts.
Result: BiMediX2 achieves SOTA results, outperforming existing methods by over 9% in English and 20% in Arabic on BiMed-MBench, surpasses GPT-4 by ~9% in factual accuracy, and excels in medical VQA, report generation, and summarization tasks.
Conclusion: BiMediX2 demonstrates strong bilingual medical AI capabilities and the developed resources (dataset, benchmark, model) are publicly available for medical AI research.
Abstract: We introduce BiMediX2, a bilingual (Arabic-English) Bio-Medical EXpert Large Multimodal Model that supports text-based and image-based medical interactions. It enables multi-turn conversation in Arabic and English and supports diverse medical imaging modalities, including radiology, CT, and histology. To train BiMediX2, we curate BiMed-V, an extensive Arabic-English bilingual healthcare dataset consisting of 1.6M samples of diverse medical interactions. This dataset supports a range of medical Large Language Model (LLM) and Large Multimodal Model (LMM) tasks, including multi-turn medical conversations, report generation, and visual question answering (VQA). We also introduce BiMed-MBench, the first Arabic-English medical LMM evaluation benchmark, verified by medical experts. BiMediX2 demonstrates excellent performance across multiple medical LLM and LMM benchmarks, achieving state-of-the-art results compared to other open-sourced models. On BiMed-MBench, BiMediX2 outperforms existing methods by over 9% in English and more than 20% in Arabic evaluations. Additionally, it surpasses GPT-4 by approximately 9% in UPHILL factual accuracy evaluations and excels in various medical VQA, report generation, and report summarization tasks. Our trained models, instruction set, and source code are available at https://github.com/mbzuai-oryx/BiMediX2
[387] Multi-scale Latent Point Consistency Models for 3D Shape Generation
Bi’an Du, Wei Hu, Renjie Liao
Main category: cs.CV
TL;DR: MLPCM is a multi-scale latent point consistency model that accelerates 3D point cloud generation by 100x while improving shape quality and diversity over diffusion models.
Details
Motivation: To extend the sampling acceleration benefits of Consistency Models from 2D image generation to 3D point cloud generation, addressing the computational inefficiency of diffusion models in 3D shape synthesis.
Method: Proposes a latent diffusion framework with hierarchical latent representations (point-level to super-point levels), multi-scale latent integration with 3D spatial attention, and consistency distillation to compress the prior into a one-step generator.
Result: Achieves 100x speedup in generation while surpassing state-of-the-art diffusion models in shape quality and diversity on ShapeNet and ShapeNet-Vol benchmarks.
Conclusion: MLPCM successfully bridges Consistency Models to 3D point cloud generation, demonstrating significant efficiency gains without sacrificing performance.
Abstract: Consistency Models (CMs) have significantly accelerated the sampling process in diffusion models, yielding impressive results in synthesizing high-resolution images. To explore and extend these advancements to point-cloud-based 3D shape generation, we propose a novel Multi-scale Latent Point Consistency Model (MLPCM). Our MLPCM follows a latent diffusion framework and introduces hierarchical levels of latent representations, ranging from point-level to super-point levels, each corresponding to a different spatial resolution. We design a multi-scale latent integration module along with 3D spatial attention to effectively denoise the point-level latent representations conditioned on those from multiple super-point levels. Additionally, we propose a latent consistency model, learned through consistency distillation, that compresses the prior into a one-step generator. This significantly improves sampling efficiency while preserving the performance of the original teacher model. Extensive experiments on standard benchmarks ShapeNet and ShapeNet-Vol demonstrate that MLPCM achieves a 100x speedup in the generation process, while surpassing state-of-the-art diffusion models in terms of both shape quality and diversity.
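Consistency distillation compresses the diffusion prior into a one-step generator by forcing the student to map adjacent noise levels to the same clean estimate. The generic sketch below omits MLPCM's multi-scale latents, noise schedule, and weightings; all function names are placeholders.

```python
import torch
import torch.nn.functional as F

def consistency_distillation_step(student, ema_student, teacher_denoise,
                                  x0, t, t_next, noise):
    """Illustrative consistency-distillation update: the student maps noisy
    latents at adjacent timesteps to the same clean estimate, with the
    teacher providing the ODE step between them. The simple additive noising
    below is an assumption; latents x0 are e.g. [B, N, 3] point features."""
    x_t = x0 + t.view(-1, 1, 1) * noise               # noised latent points
    x_t_next = teacher_denoise(x_t, t, t_next)        # one teacher ODE step
    pred_hi = student(x_t, t)
    with torch.no_grad():
        pred_lo = ema_student(x_t_next, t_next)       # EMA target network
    return F.mse_loss(pred_hi, pred_lo)
```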
[388] Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Haobo Yuan, Xiangtai Li, Tao Zhang, Yueyi Sun, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang
Main category: cs.CV
TL;DR: Sa2VA is the first unified model for dense grounded understanding of both images and videos, combining SAM-2 segmentation with MLLM vision-language capabilities to handle referring segmentation and conversation tasks with minimal one-shot tuning.
Details
Motivation: Existing multi-modal LLMs are limited to specific modalities and tasks, lacking comprehensive support for both image and video understanding in a unified framework.
Method: Combines SAM-2 foundation video segmentation model with MLLM vision-language model, unifying text, image, and video into shared LLM token space. Uses LLM to generate instruction tokens that guide SAM-2 in producing precise masks. Introduces Ref-SAV dataset with 72k+ object expressions.
Result: Achieves strong performance across multiple tasks, particularly in referring video object segmentation. Can be easily extended to various VLMs like Qwen-VL and Intern-VL.
Conclusion: Sa2VA demonstrates potential for complex real-world applications with its unified approach to multi-modal understanding of both static and dynamic visual content.
Abstract: This work presents Sa2VA, the first comprehensive, unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with MLLM, the advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV dataset to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves strong performance across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications. In addition, Sa2VA can be easily extended to various VLMs, including Qwen-VL and Intern-VL, and can be updated with the rapid progress of current open-sourced VLMs. Code and models have been provided to the community.
[389] mmCooper: A Multi-agent Multi-stage Communication-efficient and Collaboration-robust Cooperative Perception Framework
Bingyi Liu, Jian Teng, Hongfei Xue, Enshu Wang, Chuanhui Zhu, Pu Wang, Libing Wu
Main category: cs.CV
TL;DR: mmCooper is a multi-agent cooperative perception framework that addresses bandwidth constraints and calibration errors through multi-stage collaboration, dynamic information sharing, and robust refinement mechanisms.
Details
Motivation: Real-world deployment of collaborative perception faces challenges due to bandwidth constraints and inevitable calibration errors during information exchange between vehicles.
Method: Proposes a multi-stage collaboration strategy that dynamically balances intermediate- and late-stage information sharing, prevents transmission of low-confidence sensing data, and refines received detection results to handle misalignments.
Result: Extensive evaluation on real-world and simulated datasets demonstrates the framework’s effectiveness in enhancing perceptual performance while maintaining communication efficiency.
Conclusion: mmCooper provides a communication-efficient and collaboration-robust solution for cooperative perception that addresses practical deployment challenges in multi-agent systems.
Abstract: Collaborative perception significantly enhances individual vehicle perception performance through the exchange of sensory information among agents. However, real-world deployment faces challenges due to bandwidth constraints and inevitable calibration errors during information exchange. To address these issues, we propose mmCooper, a novel multi-agent, multi-stage, communication-efficient, and collaboration-robust cooperative perception framework. Our framework leverages a multi-stage collaboration strategy that dynamically and adaptively balances intermediate- and late-stage information to share among agents, enhancing perceptual performance while maintaining communication efficiency. To support robust collaboration despite potential misalignments and calibration errors, our framework prevents misleading low-confidence sensing information from transmission and refines the received detection results from collaborators to improve accuracy. The extensive evaluation results on both real-world and simulated datasets demonstrate the effectiveness of the mmCooper framework and its components.
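The pre-transmission filtering step can be as simple as thresholding detection confidences before sharing; the sketch below is only an illustration, and the threshold value is an assumption.

```python
def select_for_transmission(detections, conf_threshold: float = 0.5):
    """Illustrative pre-transmission filter: only share detections whose
    confidence clears a threshold, saving bandwidth and keeping potentially
    misleading low-confidence boxes local. Threshold value is an assumption."""
    return [d for d in detections if d["score"] >= conf_threshold]

# Example: an agent shares two of its four candidate boxes.
local = [{"box": (1, 2, 3, 4), "score": 0.9}, {"box": (5, 6, 7, 8), "score": 0.3},
         {"box": (2, 2, 4, 4), "score": 0.7}, {"box": (0, 1, 2, 3), "score": 0.1}]
print(select_for_transmission(local))
```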
[390] A Study in Dataset Distillation for Image Super-Resolution
Tobias Dietz, Brian B. Moser, Tobias Nauen, Federico Raue, Stanislav Frolov, Andreas Dengel
Main category: cs.CV
TL;DR: First systematic study of dataset distillation for image super-resolution, showing that distilled datasets (8.88% of original size) can train SR models with nearly the same reconstruction fidelity as full datasets.
Details
Motivation: Dataset distillation has gained traction in classification but its potential for image super-resolution remains largely untapped, motivating the need to explore memory- and compute-efficient approaches for SR training.
Method: Conducted systematic evaluation of both pixel- and latent-space formulations for dataset distillation in SR, analyzing initialization strategies and distillation objectives.
Result: Distilled datasets occupying only 8.88% of original size can train SR models that retain nearly the same reconstruction fidelity as those trained on full datasets.
Conclusion: The study demonstrates the feasibility of SR dataset distillation and establishes foundational insights for memory- and compute-efficient generative restoration models.
Abstract: Dataset distillation aims to compress large datasets into compact yet highly informative subsets that preserve the training behavior of the original data. While this concept has gained traction in classification, its potential for image Super-Resolution (SR) remains largely untapped. In this work, we conduct the first systematic study of dataset distillation for SR, evaluating both pixel- and latent-space formulations. We show that a distilled dataset, occupying only 8.88% of the original size, can train SR models that retain nearly the same reconstruction fidelity as those trained on full datasets. Furthermore, we analyze how initialization strategies and distillation objectives affect efficiency, convergence, and visual quality. Our findings highlight the feasibility of SR dataset distillation and establish foundational insights for memory- and compute-efficient generative restoration models.
[391] SurGen: 1020 H&E-stained Whole Slide Images With Survival and Genetic Markers
Craig Myles, In Hwa Um, Craig Marshall, David Harris-Birtill, David J. Harrison
Main category: cs.CV
TL;DR: SurGen is a comprehensive colorectal cancer dataset with 1,020 H&E-stained whole-slide images from 843 cases, including genetic mutation data and survival information, demonstrating utility through a proof-of-concept model for mismatch repair status prediction.
Details
Motivation: Cancer remains a leading cause of death worldwide, and comprehensive datasets combining histopathological images with genetic and survival data are essential for advancing computational pathology and personalized medicine.
Method: The authors present SurGen dataset with 1,020 whole-slide images from 843 colorectal cancer cases, including annotations for KRAS, NRAS, BRAF mutations and mismatch repair status, plus survival data for 426 cases. They demonstrate utility with a proof-of-concept model predicting mismatch repair status from WSIs.
Result: The proof-of-concept model achieved a test area under the ROC curve of 0.8273 for predicting mismatch repair status directly from whole-slide images, showing the dataset’s potential for biomarker discovery and prognostic modeling.
Conclusion: SurGen offers a valuable resource for scientific research, enabling studies requiring high-quality whole-slide images linked with comprehensive clinical and genetic information, with initial findings affirming its capacity to advance diagnostic precision and personalized treatment in colorectal oncology.
Abstract: Cancer remains one of the leading causes of morbidity and mortality worldwide. Comprehensive datasets that combine histopathological images with genetic and survival data across various tumour sites are essential for advancing computational pathology and personalised medicine. We present SurGen, a dataset comprising 1,020 H&E-stained whole-slide images (WSIs) from 843 colorectal cancer cases. The dataset includes detailed annotations for key genetic mutations (KRAS, NRAS, BRAF) and mismatch repair status, as well as survival data for 426 cases. We illustrate SurGen’s utility with a proof-of-concept model that predicts mismatch repair status directly from WSIs, achieving a test area under the receiver operating characteristic curve of 0.8273. These preliminary results underscore the dataset’s potential to facilitate research in biomarker discovery, prognostic modelling, and advanced machine learning applications in colorectal cancer and beyond. SurGen offers a valuable resource for the scientific community, enabling studies that require high-quality WSIs linked with comprehensive clinical and genetic information on colorectal cancer. Our initial findings affirm the dataset’s capacity to advance diagnostic precision and foster the development of personalised treatment strategies in colorectal oncology. Data available online: https://doi.org/10.6019/S-BIAD1285.
[392] Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
Chenxu Li, Zhicai Wang, Yuan Sheng, Xingyu Zhu, Yanbin Hao, Xiang Wang
Main category: cs.CV
TL;DR: Res-Bench is a new benchmark for evaluating resolution robustness in Multimodal Large Language Models (MLLMs), focusing on performance stability across different input resolutions rather than just semantic accuracy.
Details
Motivation: Current MLLM evaluations focus primarily on semantic performance but overlook resolution robustness - whether model performance remains stable across varying input resolutions, which is crucial for real-world applications.
Method: Created Res-Bench with 14,400 samples across 12 resolution levels and 6 capability dimensions. Introduced novel robustness metrics: Spearman’s correlation for resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for performance volatility. Evaluated leading MLLMs with model-centric and task-centric analysis, preprocessing strategies (padding, super-resolution), and fine-tuning for stability.
Result: The paper presents a comprehensive evaluation framework and benchmark specifically designed to measure resolution robustness in MLLMs, going beyond traditional accuracy metrics.
Conclusion: Res-Bench provides a systematic approach to evaluate and improve resolution robustness in MLLMs, addressing a critical gap in current evaluation methodologies for multimodal models.
Abstract: Multimodal Large Language Models (MLLMs) increasingly support dynamic image resolutions. However, current evaluation paradigms primarily assess semantic performance, overlooking the critical question of resolution robustness - whether performance remains stable across varying input resolutions. To address this gap, we introduce \textbf{Res-Bench}, a comprehensive benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. We designed a novel evaluation framework that goes beyond traditional accuracy metrics to capture performance stability. This framework introduces multiple robustness metrics: Spearman’s correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Using these metrics, we conducted a large-scale evaluation of leading MLLMs. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.
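The robustness metrics can be computed from a model's per-resolution accuracies; the ACE/RCE definitions below are plausible readings of the abstract, not necessarily the benchmark's exact formulas.

```python
import numpy as np
from scipy.stats import spearmanr

def resolution_robustness(resolutions, accuracies):
    """Illustrative robustness metrics: Spearman's rho for the resolution-
    performance trend, ACE as the mean absolute accuracy change between
    adjacent resolution levels, and RCE as that change normalized by the
    previous level's accuracy. Definitions are assumptions."""
    res = np.asarray(resolutions, dtype=float)
    acc = np.asarray(accuracies, dtype=float)
    rho, _ = spearmanr(res, acc)                            # monotonic trend
    deltas = np.abs(np.diff(acc))
    ace = deltas.mean()                                     # absolute volatility
    rce = (deltas / np.clip(acc[:-1], 1e-8, None)).mean()   # relative volatility
    return {"spearman": rho, "ACE": ace, "RCE": rce}

print(resolution_robustness([224, 336, 448, 672], [0.61, 0.64, 0.63, 0.58]))
```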
[393] MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification
Anh-Tien Nguyen, Duy Minh Ho Nguyen, Nghiem Tuong Diep, Trung Quoc Nguyen, Nhat Ho, Jacqueline Michelle Metsch, Miriam Cindy Maurer, Daniel Sonntag, Hanibal Bohnenberger, Anne-Christin Hauschild
Main category: cs.CV
TL;DR: A prompt learning method for few-shot pathology image classification that adapts large vision-language models using multi-granular attention and optimal transport-based visual-text distance to improve performance across various pathology modalities.
Details
Motivation: To address challenges in whole slide pathology image classification due to gigapixel image sizes and limited annotation labels, which hinder model generalization.
Method: Extends Prov-GigaPath vision foundation model into vision-language model using adaptors and contrastive learning. Proposes multi-granular attention comparing learnable prompts with individual patches and patch groups, and uses optimal transport-based visual-text distance for robustness.
Result: Empirical experiments on lung, kidney, and breast pathology modalities show the approach surpasses latest competitors and consistently improves performance across CLIP, PLIP, and Prov-GigaPath integrated PLIP architectures.
Conclusion: The proposed method effectively adapts large vision-language models for few-shot pathology classification, capturing both fine-grained details and broader context while ensuring robustness through optimal transport-based distance.
Abstract: Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and fine-tunes with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention that compares interactions between learnable prompts with individual image patches and groups of them. This approach improves the model’s ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage (unbalanced) optimal transport-based visual-text distance to secure model robustness by mitigating perturbations that might occur during the data augmentation process. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach; thereby, we surpass several of the latest competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP.
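The optimal transport-based visual-text distance can be approximated with entropic (Sinkhorn) OT between prompt embeddings and patch features; the paper uses an unbalanced variant, which this balanced sketch does not reproduce.

```python
import math
import torch

def sinkhorn_distance(prompts, patches, eps: float = 0.05, n_iter: int = 50):
    """Illustrative entropic OT distance between prompt embeddings [P, D]
    and patch features [N, D] via log-domain Sinkhorn iterations with a
    cosine cost. Uniform marginals and the balanced formulation are
    simplifying assumptions."""
    a = torch.nn.functional.normalize(prompts, dim=-1)
    b = torch.nn.functional.normalize(patches, dim=-1)
    cost = 1.0 - a @ b.t()                                  # cosine cost matrix
    P, N = cost.shape
    log_mu = torch.full((P,), -math.log(P))
    log_nu = torch.full((N,), -math.log(N))
    f = torch.zeros(P)
    g = torch.zeros(N)
    for _ in range(n_iter):                                 # dual potential updates
        f = -eps * torch.logsumexp((g[None, :] - cost) / eps + log_nu[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - cost) / eps + log_mu[:, None], dim=0)
    plan = torch.exp((f[:, None] + g[None, :] - cost) / eps
                     + log_mu[:, None] + log_nu[None, :])
    return (plan * cost).sum()
```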
[394] TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, Lijuan Wang, Min Li
Main category: cs.CV
TL;DR: TextAtlas5M is a new dataset for evaluating long-text rendering in text-to-image generation, addressing the challenge of generating images with dense text content.
Details
Motivation: Current text-to-image models struggle with long-form text generation due to limitations in existing datasets that focus on shorter, simpler text prompts. Real-world applications like advertisements and infographics require integration of complex text with visuals.
Method: Created TextAtlas5M dataset with 5 million long-text images across diverse data types, and curated TextAtlasEval benchmark with 3000 human-improved test cases across 3 domains.
Result: Evaluation shows TextAtlasEval presents significant challenges even for advanced proprietary models like GPT4o with DallE-3, with open-source models showing even larger performance gaps.
Conclusion: TextAtlas5M serves as a valuable dataset for training and evaluating future text-conditioned image generation models, particularly for long-text rendering capabilities.
Abstract: Text-conditioned image generation has gained significant attention in recent years and is processing increasingly longer and more comprehensive text prompts. In everyday life, dense and intricate text appears in contexts like advertisements, infographics, and signage, where the integration of both text and visuals is essential for conveying complex information. However, despite these advances, the generation of images containing long-form text remains a persistent challenge, largely due to the limitations of existing datasets, which often focus on shorter and simpler text. To address this gap, we introduce TextAtlas5M, a novel dataset specifically designed to evaluate long-text rendering in text-conditioned image generation. Our dataset consists of 5 million long-text generated and collected images across diverse data types, enabling comprehensive evaluation of large-scale generative models on long-text image generation. We further curate a 3000-sample human-improved test set, TextAtlasEval, across 3 data domains, establishing one of the most extensive benchmarks for text-conditioned generation. Evaluations suggest that the TextAtlasEval benchmarks present significant challenges even for the most advanced proprietary models (e.g. GPT4o with DallE-3), while their open-source counterparts show an even larger performance gap. This evidence positions TextAtlas5M as a valuable dataset for training and evaluating future-generation text-conditioned image generation models.
[395] Surgical Scene Understanding in the Era of Foundation AI Models: A Comprehensive Review
Ufaq Khan, Umair Nawaz, Adnan Qayyum, Shazad Ashraf, Yutong Xie, Muhammad Haris Khan, Muhammad Bilal, Junaid Qadir
Main category: cs.CV
TL;DR: This paper surveys how machine learning and deep learning technologies, especially Foundation Models, are enhancing surgical scene understanding in minimally invasive surgery by improving segmentation, instrument tracking, and phase recognition.
Details
Motivation: To bridge the gap between advanced ML/DL technologies and clinical surgical needs by exploring how Foundation Models and other state-of-the-art methods can improve surgical scene understanding and workflow integration.
Method: The paper conducts a comprehensive survey of integration methods for ML/DL technologies including CNNs, Vision Transformers, and Foundation Models like SAM into surgical workflows, while analyzing challenges and ethical considerations.
Result: Findings show substantial progress in surgical scene understanding through improved segmentation accuracy, instrument tracking, and phase recognition, but highlight ongoing challenges with data variability, computational demands, and clinical integration.
Conclusion: While significant advancements have been made, more focused efforts are needed for seamless clinical integration of AI technologies to enhance surgical precision, reduce risks, and optimize patient outcomes while addressing ethical considerations.
Abstract: Recent advancements in machine learning (ML) and deep learning (DL), particularly through the introduction of Foundation Models (FMs), have significantly enhanced surgical scene understanding within minimally invasive surgery (MIS). This paper surveys the integration of state-of-the-art ML and DL technologies, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Foundation Models like the Segment Anything Model (SAM), into surgical workflows. These technologies improve segmentation accuracy, instrument tracking, and phase recognition in surgical scene understanding. The paper explores the challenges these technologies face, such as data variability and computational demands, and discusses ethical considerations and integration hurdles in clinical settings. Highlighting the roles of FMs, we bridge the technological capabilities with clinical needs and outline future research directions to enhance the adaptability, efficiency, and ethical alignment of AI applications in surgery. Our findings suggest that substantial progress has been made; however, more focused efforts are required to achieve seamless integration of these technologies into clinical workflows, ensuring they complement surgical practice by enhancing precision, reducing risks, and optimizing patient outcomes.
[396] New multimodal similarity measure for image registration via modeling local functional dependence with linear combination of learned basis functions
Joel Honkamaa, Pekka Marttinen
Main category: cs.CV
TL;DR: A method for deformable multi-modal image registration using local functional dependence metrics, implemented via efficient GPU computations.
Details
Motivation: Deformable registration of different modality images is challenging due to the need for robust overlap measures when images capture different tissue aspects.Method: Model local functional dependence via linear basis function model with basis functions learned jointly with deformation, implemented via efficient GPU convolutions.
Result: Good performance on three datasets compared to established baselines and earlier functional dependence-based methods.
Conclusion: Local functional dependence metrics are effective for multi-modal image registration when implemented efficiently on GPUs.
Abstract: The deformable registration of images of different modalities, essential in many medical imaging applications, remains challenging. The main challenge is developing a robust measure for image overlap despite the compared images capturing different aspects of the underlying tissue. Here, we explore similarity metrics based on functional dependence between intensity values of registered images. Although functional dependence is too restrictive on the global scale, earlier work has shown competitive performance in deformable registration when such measures are applied over small enough contexts. We confirm this finding and further develop the idea by modeling local functional dependence via the linear basis function model with the basis functions learned jointly with the deformation. The measure can be implemented via convolutions, making it efficient to compute on GPUs. We release the method as an easy-to-use tool and show good performance on three datasets compared to well-established baseline and earlier functional dependence-based methods.
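To make the method description above concrete, the following is a minimal sketch, not the authors' released tool, of a local functional-dependence similarity: within each window the moving image is regressed onto basis functions of the fixed image, and the local normal equations are assembled with box-filter convolutions so the whole loss stays GPU-friendly. The `basis_mlp` callable (a small network mapping a scalar intensity to K basis values), the window size, and the ridge term are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def local_dependence_loss(fixed, moving, basis_mlp, win=9):
    """fixed, moving: (B, 1, H, W) images; basis_mlp maps a scalar intensity to K basis values."""
    B, _, H, W = fixed.shape
    phi_flat = basis_mlp(fixed.reshape(-1, 1))                   # (B*H*W, K)
    K = phi_flat.shape[-1]
    phi = phi_flat.reshape(B, H, W, K).permute(0, 3, 1, 2)       # (B, K, H, W)
    box = torch.ones(1, 1, win, win, device=fixed.device) / (win * win)

    def local_mean(x):  # depthwise box filtering = per-channel local averages via conv
        c = x.shape[1]
        return F.conv2d(x, box.expand(c, 1, win, win), padding=win // 2, groups=c)

    # Local normal equations G w = b with G = E[phi phi^T], b = E[phi * moving]
    G = local_mean((phi.unsqueeze(1) * phi.unsqueeze(2)).reshape(B, K * K, H, W))
    G = G.reshape(B, K, K, H, W).permute(0, 3, 4, 1, 2)          # (B, H, W, K, K)
    b = local_mean(phi * moving).permute(0, 2, 3, 1).unsqueeze(-1)
    ridge = 1e-4 * torch.eye(K, device=fixed.device)
    w = torch.linalg.solve(G + ridge, b)                         # per-pixel basis weights
    pred = (phi.permute(0, 2, 3, 1).unsqueeze(-2) @ w).squeeze(-1).squeeze(-1)
    return F.mse_loss(pred.unsqueeze(1), moving)                 # dissimilarity to minimize
```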
[397] A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions
Rahul Nair, Bhanu Tokas, Hannah Kerner
Main category: cs.CV
TL;DR: DBAC is a new language-aware and directional metric for measuring bias amplification in image captioning models that can identify bias sources, is less sensitive to sentence encoders, and provides more accurate bias amplification estimates than existing metrics.
Details
Motivation: Current bias amplification metrics like BA and DPA only work for classification datasets and cannot capture language semantics in captions. While LIC introduced language awareness, it cannot identify the source of bias amplification in captioning models.Method: Proposed Directional Bias Amplification in Captioning (DBAC) - a language-aware metric that can identify when captioning models amplify biases. It improves upon LIC by being less sensitive to sentence encoders and providing more accurate bias amplification estimates.
Result: Experiments on gender and race attributes in COCO captions dataset show DBAC is the only reliable metric to measure bias amplification in captions.
Conclusion: DBAC effectively addresses limitations of existing bias amplification metrics for image captioning by providing language-aware, directional measurement that can identify bias sources and is more robust to implementation choices.
Abstract: When we train models on biased datasets, they not only reproduce data biases, but can worsen them at test time - a phenomenon called bias amplification. Many of the current bias amplification metrics (e.g., BA (MALS), DPA) measure bias amplification only in classification datasets. These metrics are ineffective for image captioning datasets, as they cannot capture the language semantics of a caption. Recent work introduced Leakage in Captioning (LIC), a language-aware bias amplification metric that understands caption semantics. However, LIC has a crucial limitation: it cannot identify the source of bias amplification in captioning models. We propose Directional Bias Amplification in Captioning (DBAC), a language-aware and directional metric that can identify when captioning models amplify biases. DBAC has two more improvements over LIC: (1) it is less sensitive to sentence encoders (a hyperparameter in language-aware metrics), and (2) it provides a more accurate estimate of bias amplification in captions. Our experiments on gender and race attributes in the COCO captions dataset show that DBAC is the only reliable metric to measure bias amplification in captions.
[398] AdaSCALE: Adaptive Scaling for OOD Detection
Sudarshan Regmi
Main category: cs.CV
TL;DR: AdaSCALE is an adaptive scaling method for OOD detection that dynamically adjusts percentile thresholds based on estimated OOD likelihood, achieving state-of-the-art performance.
Details
Motivation: Current OOD detection methods use static percentile thresholds across all samples, leading to suboptimal separation between in-distribution and out-of-distribution inputs.Method: Proposes adaptive scaling that leverages the observation that OOD samples show more pronounced activation shifts at high-magnitude activations under minor perturbation compared to ID samples.
Result: Achieves 14.94% improvement in near-OOD and 21.67% in far-OOD detection on ImageNet-1k benchmark across eight architectures, outperforming OptFS.
Conclusion: AdaSCALE enables stronger scaling for likely ID samples and weaker scaling for likely OOD samples, yielding highly separable energy scores for improved OOD detection.
Abstract: The ability of the deep learning model to recognize when a sample falls outside its learned distribution is critical for safe and reliable deployment. Recent state-of-the-art out-of-distribution (OOD) detection methods leverage activation shaping to improve the separation between in-distribution (ID) and OOD inputs. These approaches resort to sample-specific scaling but apply a static percentile threshold across all samples regardless of their nature, resulting in suboptimal ID-OOD separability. In this work, we propose AdaSCALE, an adaptive scaling procedure that dynamically adjusts the percentile threshold based on a sample’s estimated OOD likelihood. This estimation leverages our key observation: OOD samples exhibit significantly more pronounced activation shifts at high-magnitude activations under minor perturbation compared to ID samples. AdaSCALE enables stronger scaling for likely ID samples and weaker scaling for likely OOD samples, yielding highly separable energy scores. Our approach achieves state-of-the-art OOD detection performance, outperforming the latest rival OptFS by 14.94% in near-OOD and 21.67% in far-OOD datasets in average FPR@95 metric on the ImageNet-1k benchmark across eight diverse architectures. The code is available at: https://github.com/sudarshanregmi/AdaSCALE/
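The key observation and the scaling step translate into a short score function. The sketch below is an illustrative stand-in rather than the released implementation: the perturbation, the shift-to-percentile mapping, and the ASH-style rescaling are our assumptions, and only the overall recipe (estimate OOD likelihood from activation shifts, then pick a per-sample percentile and compute an energy-style score) follows the summary.

```python
import torch

@torch.no_grad()
def adascale_energy(features, fc, p_range=(0.65, 0.95), topk=10, eps=0.05):
    """features: (B, D) penultimate activations; fc: the linear classifier head."""
    # 1) Estimate OOD likelihood: how much the largest activations move under a
    #    small perturbation (OOD samples tend to shift more than ID samples).
    perturbed = features * (1.0 + eps * torch.randn_like(features))
    top_val, top_idx = features.topk(topk, dim=1)
    shift = (perturbed.gather(1, top_idx) - top_val).abs().sum(dim=1)
    shift = (shift / (top_val.abs().sum(dim=1) + 1e-8)).clamp(0.0, 1.0)
    # 2) Map the shift to a per-sample percentile: likely-ID samples (small shift)
    #    get a higher percentile, i.e. stronger scaling.
    p = p_range[1] - shift * (p_range[1] - p_range[0])
    # 3) Scale at the adaptive percentile: keep the top activations, rescale them
    #    so the total activation mass is preserved, then score with logsumexp.
    sorted_f = features.sort(dim=1).values
    idx = (p * (features.shape[1] - 1)).long().unsqueeze(1)
    thr = sorted_f.gather(1, idx)
    kept = torch.where(features >= thr, features, torch.zeros_like(features))
    scale = features.sum(dim=1, keepdim=True) / kept.sum(dim=1, keepdim=True).clamp(min=1e-8)
    logits = fc(kept * scale)
    return torch.logsumexp(logits, dim=1)    # higher score = more ID-like
```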
[399] LocDiff: Identifying Locations on Earth by Diffusing in the Hilbert Space
Zhangyu Wang, Zeping Liu, Jielu Zhang, Zhongliang Zhou, Qian Cao, Nemin Wu, Lan Mu, Yang Song, Yiqun Xie, Ni Lao, Gengchen Mai
Main category: cs.CV
TL;DR: LocDiff is a multi-scale latent diffusion model for image geolocalization that uses spherical harmonics representations and outperforms existing methods on global-scale datasets.
Details
Motivation: Current geolocalization methods suffer from spatial generalizability issues due to grid/gallery dependencies or lack multi-scale information in generative approaches.Method: Proposed LocDiff with SHDD representations for encoding geolocations into spherical harmonics space, and CS-UNet architecture for image-guided latent diffusion with KL-divergence loss.
Result: Outperforms all state-of-the-art grid-based, retrieval-based, and diffusion-based baselines across 5 global-scale datasets with stronger generalizability to unseen locations.
Conclusion: LocDiff is the first image geolocalization model performing latent diffusion in multi-scale location encoding space, demonstrating superior performance and generalizability.
Abstract: Image geolocalization is a fundamental yet challenging task, aiming at inferring the geolocation on Earth where an image is taken. State-of-the-art methods employ either grid-based classification or gallery-based image-location retrieval, whose spatial generalizability significantly suffers if the spatial distribution of test images does not align with the choices of grids and galleries. Recently emerging generative approaches, while getting rid of grids and galleries, use raw geographical coordinates and suffer quality losses due to their lack of multi-scale information. To address these limitations, we propose a multi-scale latent diffusion model called LocDiff for image geolocalization. We developed a novel positional encoding-decoding framework called Spherical Harmonics Dirac Delta (SHDD) Representations, which encodes points on a spherical surface (e.g., geolocations on Earth) into a Hilbert space of Spherical Harmonics coefficients and decodes points (geolocations) by mode-seeking on spherical probability distributions. We also propose a novel SirenNet-based architecture (CS-UNet) to learn an image-based conditional backward process in the latent SHDD space by minimizing a latent KL-divergence loss. To the best of our knowledge, LocDiff is the first image geolocalization model that performs latent diffusion in a multi-scale location encoding space and generates geolocations under the guidance of images. Experimental results show that LocDiff can outperform all state-of-the-art grid-based, retrieval-based, and diffusion-based baselines across 5 challenging global-scale image geolocalization datasets, and demonstrates significantly stronger generalizability to unseen geolocations.
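For intuition on the SHDD encoding, the spherical-harmonic coefficients of a Dirac delta placed at a location are simply the (conjugated) harmonics evaluated there, and decoding can be read as mode-seeking over the sphere. The sketch below is a slow, illustrative version using SciPy's `sph_harm` and a coarse grid search; the degree cutoff and grid resolution are arbitrary choices, and the actual model diffuses in this coefficient space rather than grid-searching.

```python
import numpy as np
from scipy.special import sph_harm

def shdd_encode(lat_deg, lon_deg, L=8):
    """Encode (lat, lon) in degrees as spherical-harmonic coefficients up to degree L."""
    theta = np.deg2rad(lon_deg) % (2 * np.pi)     # azimuthal angle in [0, 2pi)
    phi = np.deg2rad(90.0 - lat_deg)              # polar angle in [0, pi]
    coeffs = [np.conj(sph_harm(m, l, theta, phi))
              for l in range(L + 1) for m in range(-l, l + 1)]
    return np.asarray(coeffs)                     # complex vector of length (L+1)^2

def shdd_decode(coeffs, L=8, grid=36):
    """Mode-seeking decode on a coarse lat/lon grid (illustration only, not efficient)."""
    best, best_score = None, -np.inf
    for lat in np.linspace(-89.5, 89.5, grid):
        for lon in np.linspace(-179.5, 179.5, 2 * grid):
            # Reconstruct the spherical density at this grid point and keep the peak.
            score = np.real(np.vdot(shdd_encode(lat, lon, L), coeffs))
            if score > best_score:
                best, best_score = (lat, lon), score
    return best
```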
[400] FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion
Pihai Sun, Junjun Jiang, Yuanqi Yao, Youyu Chen, Wenbo Zhao, Kui Jiang, Xianming Liu
Main category: cs.CV
TL;DR: FUSE is a frequency-decoupled unified self-supervised encoder that addresses cross-modal supervision scarcity and frequency mismatches in image-event joint depth estimation through parameter-efficient self-supervised transfer and physics-aware frequency-decoupled fusion.
Details
Motivation: Image-event joint depth estimation faces challenges in generalizability due to limited annotated datasets causing insufficient cross-modal supervision, and inherent frequency mismatches between static images and dynamic event streams with distinct spatiotemporal patterns.Method: Proposes Frequency-decoupled Unified Self-supervised Encoder (FUSE) with two components: Parameter-efficient Self-supervised Transfer (PST) for cross-modal knowledge transfer through latent space alignment, and Frequency-Decoupled Fusion module (FreDFuse) to explicitly decouple high-frequency edge features from low-frequency structural components.
Result: Achieves state-of-the-art performance with 14% and 24.9% improvements in Abs. Rel. on MVSEC and DENSE datasets, and exhibits remarkable zero-shot adaptability to challenging scenarios including extreme lighting and motion blur.
Conclusion: FUSE enables construction of a universal image-event encoder that only requires lightweight decoder adaptation for target datasets, significantly advancing real-world deployment capabilities for robust depth estimation.
Abstract: Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability stemming from two factors: 1) limited annotated image-event-depth datasets causing insufficient cross-modal supervision, and 2) inherent frequency mismatches between static images and dynamic event streams with distinct spatiotemporal patterns, leading to ineffective feature fusion. To address this dual challenge, we propose Frequency-decoupled Unified Self-supervised Encoder (FUSE) with two synergistic components: The Parameter-efficient Self-supervised Transfer (PST) establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity by enabling joint encoding without depth ground truth. Complementing this, we propose the Frequency-Decoupled Fusion module (FreDFuse) to explicitly decouple high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches through physics-aware fusion. This combined approach enables FUSE to construct a universal image-event encoder that only requires lightweight decoder adaptation for target datasets. Extensive experiments demonstrate state-of-the-art performance with 14% and 24.9% improvements in Abs .Rel on MVSEC and DENSE datasets. The framework exhibits remarkable zero-shot adaptability to challenging scenarios including extreme lighting and motion blur, significantly advancing real-world deployment capabilities. The source code for our method is publicly available at: https://github.com/sunpihai-up/FUSE
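As a concrete picture of what frequency-decoupled fusion can mean, the module below splits each modality's feature map into low- and high-frequency parts with an FFT mask and fuses them separately. This is a hedged sketch, not FreDFuse itself: the radial cutoff, the 1x1 fusion convolutions, and the assumption that image features supply structure while event features supply edges are ours.

```python
import torch
import torch.nn as nn

class FreqDecoupledFusion(nn.Module):
    def __init__(self, channels, cutoff=0.25):
        super().__init__()
        self.cutoff = cutoff
        self.fuse_low = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.fuse_high = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.out = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def _split(self, x):
        _, _, H, W = x.shape
        fx = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
        yy, xx = torch.meshgrid(torch.linspace(-1, 1, H, device=x.device),
                                torch.linspace(-1, 1, W, device=x.device), indexing="ij")
        mask = ((yy ** 2 + xx ** 2).sqrt() <= self.cutoff).to(x.dtype)   # low-pass disk
        low = torch.fft.ifft2(torch.fft.ifftshift(fx * mask, dim=(-2, -1)), norm="ortho").real
        return low, x - low          # low-frequency structure, high-frequency detail

    def forward(self, img_feat, evt_feat):
        img_low, img_high = self._split(img_feat)
        evt_low, evt_high = self._split(evt_feat)
        low = self.fuse_low(torch.cat([img_low, evt_low], dim=1))
        high = self.fuse_high(torch.cat([img_high, evt_high], dim=1))
        return self.out(torch.cat([low, high], dim=1))
```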
[401] Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound
Yuhao Huang, Ao Chang, Haoran Dou, Xing Tao, Xinrui Zhou, Yan Cao, Ruobing Huang, Alejandro F Frangi, Lingyun Bao, Xin Yang, Dong Ni
Main category: cs.CV
TL;DR: Flip Learning is a novel multi-agent reinforcement learning framework for weakly-supervised nodule segmentation in breast ultrasound using only 2D/3D boxes, achieving performance comparable to fully-supervised methods.
Details
Motivation: To develop an automated nodule segmentation system that reduces laborious annotation requirements while maintaining accuracy, addressing limitations of current weakly-supervised methods that rely on inaccurate activation maps or inefficient pseudo-mask generation.Method: Multi-agent reinforcement learning where agents erase target regions from boxes to flip classification tags, using superpixel/supervoxel encoding, three specialized rewards (classification score and intensity distribution), and progressive curriculum learning.
Result: Outperforms state-of-the-art weakly-supervised methods and foundation models on large in-house BUS and ABUS datasets, achieving comparable performance to fully-supervised learning algorithms.
Conclusion: Flip Learning provides an effective weakly-supervised segmentation approach that reduces annotation burden while maintaining high accuracy, demonstrating the potential of multi-agent reinforcement learning for medical image segmentation tasks.
Abstract: Accurate segmentation of nodules in both 2D breast ultrasound (BUS) and 3D automated breast ultrasound (ABUS) is crucial for clinical diagnosis and treatment planning. Therefore, developing an automated system for nodule segmentation can enhance user independence and expedite clinical analysis. Unlike fully-supervised learning, weakly-supervised segmentation (WSS) can streamline the laborious and intricate annotation process. However, current WSS methods face challenges in achieving precise nodule segmentation, as many of them depend on inaccurate activation maps or inefficient pseudo-mask generation algorithms. In this study, we introduce a novel multi-agent reinforcement learning-based WSS framework called Flip Learning, which relies solely on 2D/3D boxes for accurate segmentation. Specifically, multiple agents are employed to erase the target from the box to facilitate classification tag flipping, with the erased region serving as the predicted segmentation mask. The key contributions of this research are as follows: (1) Adoption of a superpixel/supervoxel-based approach to encode the standardized environment, capturing boundary priors and expediting the learning process. (2) Introduction of three meticulously designed rewards, comprising a classification score reward and two intensity distribution rewards, to steer the agents’ erasing process precisely, thereby avoiding both under- and over-segmentation. (3) Implementation of a progressive curriculum learning strategy to enable agents to interact with the environment in a progressively challenging manner, thereby enhancing learning efficiency. Extensively validated on the large in-house BUS and ABUS datasets, our Flip Learning method outperforms state-of-the-art WSS methods and foundation models, and achieves comparable performance as fully-supervised learning algorithms.
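The three rewards are the heart of the erasing formulation. The snippet below is a loose, illustrative rendering of that reward structure (our own formulation, with a hypothetical nodule classifier), not the authors' environment: the agent is rewarded when erasing flips the classification score, when the erased region's intensities separate from the background, and when the in-painted result blends into it.

```python
import torch

def erase_reward(classifier, image, erased_image, mask):
    """image, erased_image: (1, 1, H, W); mask: 1 inside the region erased so far."""
    with torch.no_grad():
        # (1) Classification-score reward: how much the "nodule" tag flips after erasing.
        r_cls = classifier(image).sigmoid() - classifier(erased_image).sigmoid()
        # (2) Intensity-distribution rewards: erased pixels should differ from the
        #     surrounding background, and the filled-in values should match it.
        fg = image[mask > 0]
        bg = image[mask == 0]
        r_sep = (fg.mean() - bg.mean()).abs()
        r_fill = -(erased_image[mask > 0].mean() - bg.mean()).abs()
    return r_cls.squeeze() + r_sep + r_fill
```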
[402] SonarSplat: Novel View Synthesis of Imaging Sonar via Gaussian Splatting
Advaith V. Sethuraman, Max Rucker, Onur Bagoren, Pou-Chun Kung, Nibarkavi N. B. Amutha, Katherine A. Skinner
Main category: cs.CV
TL;DR: SonarSplat is a Gaussian splatting framework for imaging sonar that enables realistic novel view synthesis and models acoustic streaking phenomena, achieving improved image quality and 3D reconstruction compared to state-of-the-art methods.
Details
Motivation: To address the challenges of imaging sonar, particularly acoustic streaking phenomena and the need for realistic novel view synthesis in underwater environments.Method: Represents scenes as 3D Gaussians with acoustic reflectance and saturation properties, develops efficient rasterization for range/azimuth images faithful to acoustic image formation, and models azimuth streaking within the Gaussian splatting framework.
Result: Achieves +3.2 dB PSNR improvement in image synthesis and 77% lower Chamfer Distance in 3D reconstruction compared to state-of-the-art methods. Also demonstrates capability for azimuth streak removal.
Conclusion: SonarSplat provides an effective Gaussian splatting approach for imaging sonar that significantly improves both image synthesis quality and 3D reconstruction accuracy while modeling acoustic streaking phenomena.
Abstract: In this paper, we present SonarSplat, a novel Gaussian splatting framework for imaging sonar that demonstrates realistic novel view synthesis and models acoustic streaking phenomena. Our method represents the scene as a set of 3D Gaussians with acoustic reflectance and saturation properties. We develop a novel method to efficiently rasterize Gaussians to produce a range/azimuth image that is faithful to the acoustic image formation model of imaging sonar. In particular, we develop a novel approach to model azimuth streaking in a Gaussian splatting framework. We evaluate SonarSplat using real-world datasets of sonar images collected from an underwater robotic platform in a controlled test tank and in a real-world river environment. Compared to the state-of-the-art, SonarSplat offers improved image synthesis capabilities (+3.2 dB PSNR) and more accurate 3D reconstruction (77% lower Chamfer Distance). We also demonstrate that SonarSplat can be leveraged for azimuth streak removal.
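As a rough picture of the sonar image formation step, the function below bins Gaussian centers expressed in the sensor frame into a range/azimuth grid weighted by their acoustic reflectance. It is a simplification of the actual differentiable rasterizer (no covariance footprint, saturation, or elevation-integrated streak model), and the field-of-view and bin counts are made-up parameters.

```python
import torch

def splat_to_range_azimuth(means, reflectance, n_range=512, n_az=256,
                           max_range=30.0, fov_az=0.5236):  # ~30 degree azimuth FOV
    """means: (N, 3) Gaussian centers in the sonar frame (x forward, y left, z up);
    reflectance: (N,) per-Gaussian acoustic reflectance in [0, 1]."""
    x, y, _ = means[:, 0], means[:, 1], means[:, 2]
    rng = means.norm(dim=1)                      # range from the sensor
    az = torch.atan2(y, x)                       # azimuth angle
    valid = (rng < max_range) & (az.abs() < fov_az / 2) & (x > 0)
    r_bin = (rng[valid] / max_range * (n_range - 1)).long()
    a_bin = ((az[valid] / fov_az + 0.5) * (n_az - 1)).long()
    image = torch.zeros(n_range, n_az)
    # Accumulate reflectance per bin; a real renderer would also model saturation
    # and the elevation-integrated beam pattern that causes azimuth streaking.
    image.index_put_((r_bin, a_bin), reflectance[valid], accumulate=True)
    return image
```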
[403] OpenFACADES: An Open Framework for Architectural Caption and Attribute Data Enrichment via Street View Imagery
Xiucheng Liang, Jinheng Xie, Tianhong Zhao, Rudi Stouffs, Filip Biljecki
Main category: cs.CV
TL;DR: OpenFACADES is an open framework that uses multimodal crowdsourced data and large vision-language models to automatically enrich building profiles with comprehensive attributes and semantic descriptors at scale.
Details
Motivation: Comprehensive building attribute data is crucial for urban applications but remains scarce in many areas. While remote sensing can extract objective attributes, there's a need for scalable pipelines that integrate diverse open datasets and infer holistic building information.Method: 1) Integrate Mapillary street-level images with OpenStreetMap via isovist analysis to identify optimal vantage points; 2) automatically detect building facades and reproject them into holistic perspective views; 3) use fine-tuned open-source vision-language models for multi-attribute prediction and open-vocabulary captioning, trained on 31,180 labeled images from 7 cities.
Result: Fine-tuned VLMs excelled in multi-attribute inference, outperforming single-attribute computer vision models and zero-shot ChatGPT-4o. The system demonstrated superior generalization and robustness across culturally distinct regions and varying image conditions.
Conclusion: OpenFACADES successfully bridges the gap in building attribute data by providing a scalable framework that leverages multimodal crowdsourced data and advanced vision-language models to automatically enrich building profiles with comprehensive attributes.
Abstract: Building properties, such as height, usage, and material, play a crucial role in spatial data infrastructures, supporting various urban applications. Despite their importance, comprehensive building attribute data remain scarce in many urban areas. Recent advances have enabled the extraction of objective building attributes using remote sensing and street-level imagery. However, establishing a pipeline that integrates diverse open datasets, acquires holistic building imagery, and infers comprehensive building attributes at scale remains a significant challenge. Among the first, this study bridges the gaps by introducing OpenFACADES, an open framework that leverages multimodal crowdsourced data to enrich building profiles with both objective attributes and semantic descriptors through multimodal large language models. First, we integrate street-level image metadata from Mapillary with OpenStreetMap geometries via isovist analysis, identifying images that provide suitable vantage points for observing target buildings. Second, we automate the detection of building facades in panoramic imagery and tailor a reprojection approach to convert objects into holistic perspective views that approximate real-world observation. Third, we introduce an innovative approach that harnesses and investigates the capabilities of open-source large vision-language models (VLMs) for multi-attribute prediction and open-vocabulary captioning in building-level analytics, leveraging a globally sourced dataset of 31,180 labeled images from seven cities. Evaluation shows that fine-tuned VLM excel in multi-attribute inference, outperforming single-attribute computer vision models and zero-shot ChatGPT-4o. Further experiments confirm its superior generalization and robustness across culturally distinct region and varying image conditions.
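The isovist-based image selection can be pictured as a simple line-of-sight test between camera positions and building footprints. The helper below is a hedged sketch in that spirit using Shapely; the distance threshold and the exact visibility criterion used in the framework are not specified here and are treated as assumptions.

```python
from shapely.geometry import Point, LineString, Polygon

def sees_building(camera_xy, target: Polygon, obstacles, max_dist=80.0):
    """camera_xy: (x, y) in a projected CRS; obstacles: other building footprint polygons."""
    cam = Point(camera_xy)
    # Closest point on the target facade to the camera.
    nearest = target.exterior.interpolate(target.exterior.project(cam))
    if cam.distance(nearest) > max_dist:
        return False
    sight = LineString([cam, nearest])
    # The vantage point is kept if no other footprint blocks the sight line.
    return not any(sight.crosses(obs) or sight.within(obs) for obs in obstacles)
```

Images passing such a test would then have their facades detected and reprojected into perspective views before attribute prediction and captioning.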
[404] LiteTracker: Leveraging Temporal Causality for Accurate Low-latency Tissue Tracking
Mert Asim Karaoglu, Wenbo Ji, Ahmed Abbas, Nassir Navab, Benjamin Busam, Alexander Ladikos
Main category: cs.CV
TL;DR: LiteTracker is a low-latency tissue tracking method for endoscopic video that achieves 7x speed improvement over its predecessor and 2x over state-of-the-art while maintaining competitive accuracy.
Details
Motivation: Current tissue tracking methods trained on synthetic datasets achieve good accuracy but fail to meet low-latency requirements for real-time surgical applications.Method: Builds on state-of-the-art long-term point tracking method with training-free runtime optimizations, including temporal memory buffer for feature reuse and prior motion for track initialization.
Result: 7x faster than predecessor and 2x faster than state-of-the-art, with competitive tracking accuracy and occlusion prediction on STIR and SuPer datasets.
Conclusion: LiteTracker represents an important step toward low-latency tissue tracking for real-time surgical applications in the operating room.
Abstract: Tissue tracking plays a critical role in various surgical navigation and extended reality (XR) applications. While current methods trained on large synthetic datasets achieve high tracking accuracy and generalize well to endoscopic scenes, their runtime performances fail to meet the low-latency requirements necessary for real-time surgical applications. To address this limitation, we propose LiteTracker, a low-latency method for tissue tracking in endoscopic video streams. LiteTracker builds on a state-of-the-art long-term point tracking method, and introduces a set of training-free runtime optimizations. These optimizations enable online, frame-by-frame tracking by leveraging a temporal memory buffer for efficient feature reuse and utilizing prior motion for accurate track initialization. LiteTracker demonstrates significant runtime improvements being around 7x faster than its predecessor and 2x than the state-of-the-art. Beyond its primary focus on efficiency, LiteTracker delivers high-accuracy tracking and occlusion prediction, performing competitively on both the STIR and SuPer datasets. We believe LiteTracker is an important step toward low-latency tissue tracking for real-time surgical applications in the operating room. Our code is publicly available at https://github.com/ImFusionGmbH/lite-tracker.
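The two runtime optimizations named above, a temporal memory buffer for feature reuse and motion-based track initialization, fit naturally into a streaming loop. The skeleton below is schematic only; `encoder` and `refiner` stand in for the underlying point-tracking network's components, and their interfaces are hypothetical.

```python
from collections import deque
import torch

def track_stream(frames, queries, encoder, refiner, memory_len=8):
    """frames: iterable of (1, 3, H, W) tensors; queries: (N, 2) initial point prompts."""
    memory = deque(maxlen=memory_len)       # cached per-frame feature maps, reused online
    tracks = [queries.clone()]
    velocity = torch.zeros_like(queries)
    visible = torch.ones(queries.shape[0], dtype=torch.bool)
    for frame in frames:
        feat = encoder(frame)               # features computed once per frame, then cached
        memory.append(feat)
        init = tracks[-1] + velocity        # constant-velocity prior initializes the tracks
        refined, visible = refiner(init, list(memory))
        velocity = refined - tracks[-1]
        tracks.append(refined)
    return torch.stack(tracks, dim=0), visible
```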
[405] Efficient Remote Sensing Change Detection with Change State Space Models
Elman Ghazaei, Erchan Aptoula
Main category: cs.CV
TL;DR: CSSM is a new State Space Model architecture designed specifically for change detection that focuses only on relevant changes between bi-temporal images, reducing parameters and improving computational efficiency while maintaining high performance.
Details
Motivation: ConvNets struggle with long-range dependencies and Vision Transformers are computationally inefficient for large-scale datasets. Vision Mamba addresses these limitations but hasn't been specifically optimized for change detection tasks.Method: The Change State Space Model (CSSM) is introduced, which focuses specifically on relevant changes between bi-temporal images by filtering out irrelevant information. This selective attention reduces network parameters and enhances computational efficiency.
Result: CSSM outperformed ConvNets, ViTs, and Mamba-based counterparts on three benchmark datasets while using only a fraction of their computational complexity. It also demonstrated robustness against input degradation.
Conclusion: CSSM provides an efficient and effective solution for change detection by specifically targeting relevant changes, achieving superior performance with significantly reduced computational requirements compared to existing approaches.
Abstract: Despite their frequent use for change detection, both ConvNets and Vision transformers (ViT) exhibit well-known limitations, namely the former struggle to model long-range dependencies while the latter are computationally inefficient, rendering them challenging to train on large-scale datasets. Vision Mamba, an architecture based on State Space Models has emerged as an alternative addressing the aforementioned deficiencies and has been already applied to remote sensing change detection, though mostly as a feature extracting backbone. In this article the Change State Space Model is introduced, that has been specifically designed for change detection by focusing on the relevant changes between bi-temporal images, effectively filtering out irrelevant information. By concentrating solely on the changed features, the number of network parameters is reduced, enhancing significantly computational efficiency while maintaining high detection performance and robustness against input degradation. The proposed model has been evaluated via three benchmark datasets, where it outperformed ConvNets, ViTs, and Mamba-based counterparts at a fraction of their computational complexity. The implementation will be made available at https://github.com/Elman295/CSSM upon acceptance.
[406] Transforming Hyperspectral Images Into Chemical Maps: A Novel End-to-End Deep Learning Approach
Ole-Christian Galbo Engstrøm, Michela Albano-Gaglio, Erik Schou Dreier, Yamine Bouzembrak, Maria Font-i-Furnols, Puneet Mishra, Kim Steenstrup Pedersen
Main category: cs.CV
TL;DR: U-Net deep learning approach outperforms traditional PLS regression for chemical map generation from hyperspectral images, providing more accurate predictions with better spatial correlation and physically plausible results.
Details
Motivation: Current chemical map generation methods like PLS regression produce noisy, pixel-wise predictions that ignore spatial context and often generate physically impossible values.Method: Proposed an end-to-end deep learning approach using modified U-Net with custom loss function to directly generate chemical maps from hyperspectral images, skipping intermediate steps of traditional pixel-wise analysis.
Result: U-Net achieved 7% lower test set RMSE than PLS for mean fat prediction, generated chemical maps with 99.91% spatially correlated variance (vs 2.37% for PLS), and stayed within physically possible 0-100% range unlike PLS.
Conclusion: U-Net is superior to PLS for chemical map generation, providing more accurate predictions with better spatial coherence and physically plausible results.
Abstract: Current approaches to chemical map generation from hyperspectral images are based on models such as partial least squares (PLS) regression, generating pixel-wise predictions that do not consider spatial context and suffer from a high degree of noise. This study proposes an end-to-end deep learning approach using a modified version of U-Net and a custom loss function to directly obtain chemical maps from hyperspectral images, skipping all intermediate steps required for traditional pixel-wise analysis. This study compares the U-Net with the traditional PLS regression on a real dataset of pork belly samples with associated mean fat reference values. The U-Net obtains a test set root mean squared error that is 7% lower than that of PLS regression on the task of mean fat prediction. At the same time, U-Net generates fine detail chemical maps where 99.91% of the variance is spatially correlated. Conversely, only 2.37% of the variance in the PLS-generated chemical maps is spatially correlated, indicating that each pixel-wise prediction is largely independent of neighboring pixels. Additionally, while the PLS-generated chemical maps contain predictions far beyond the physically possible range of 0-100%, U-Net learns to stay inside this range. Thus, the findings of this study indicate that U-Net is superior to PLS for chemical map generation.
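Two details in the summary lend themselves to a short sketch: bounding the network output to the physically possible 0-100% range, and supervising the spatial mean of the predicted chemical map with the per-sample mean fat reference (one plausible reading of the custom loss; the paper's exact formulation may differ).

```python
import torch
import torch.nn as nn

class BoundedChemicalHead(nn.Module):
    """Output head that keeps predictions inside the physically possible 0-100% range."""
    def __init__(self, in_channels):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, feats):
        return torch.sigmoid(self.proj(feats)) * 100.0   # percent fat, in [0, 100]

def mean_reference_loss(pred_map, mean_ref, sample_mask):
    """pred_map: (B, 1, H, W); mean_ref: (B,) scalar references; sample_mask: (B, 1, H, W) of 0/1."""
    masked_sum = (pred_map * sample_mask).sum(dim=(1, 2, 3))
    pred_mean = masked_sum / sample_mask.sum(dim=(1, 2, 3)).clamp(min=1.0)
    return torch.sqrt(torch.mean((pred_mean - mean_ref) ** 2))   # RMSE on the sample mean
```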
[407] What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?
David Yan, Alexander Raistrick, Jia Deng
Main category: cs.CV
TL;DR: The paper investigates what makes synthetic stereo datasets effective by varying procedural generation parameters and creates an optimized dataset that outperforms mixed datasets and competes with FoundationStereo.
Details
Motivation: Synthetic datasets are essential for training stereo matching networks, but the factors that make them effective remain underexplored.Method: Vary parameters of a procedural dataset generator and analyze effects on zero-shot stereo matching performance using standard benchmarks.
Result: Training only on the optimized dataset achieves better performance than training on mixed datasets and is competitive with FoundationStereo.
Conclusion: The work provides open-source generation code and parameter analysis to enable further research on procedural stereo datasets.
Abstract: Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains underexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stereo matching performance using standard benchmarks. We validate our findings by collecting the best settings and creating a large-scale dataset. Training only on this dataset achieves better performance than training on a mixture of widely used datasets, and is competitive with training on the FoundationStereo dataset, with the additional benefit of open-source generation code and an accompanying parameter analysis to enable further research. We open-source our system at https://github.com/princeton-vl/InfinigenStereo to enable further research on procedural stereo datasets.
[408] A Genealogy of Foundation Models in Remote Sensing
Kevin Lane, Morteza Karimzadeh
Main category: cs.CV
TL;DR: This paper reviews foundation models for remote sensing representation learning, examining different approaches, their computer vision roots, and focusing on multi-sensor integration and future directions.
Details
Motivation: Foundation models are gaining attention in remote sensing, but development is still emerging with competing approaches. The paper aims to characterize advantages and pitfalls while outlining improvements for remote sensing-specific foundation models.Method: The paper examines single-sensor remote foundation models to introduce concepts, then emphasizes incorporating multi-sensor aspects of Earth observations. It explores how existing approaches leverage multiple sensors compared to multi-modal foundation models.
Result: The analysis identifies that current approaches vary in how effectively they leverage multi-sensor data, with opportunities to better utilize the vast amounts of unlabeled, seasonal, and multi-sensor remote sensing observations.
Conclusion: There are significant opportunities to further improve remote sensing foundation models by better harnessing multi-sensor data, seasonal information, and unlabeled observations, building on computer vision foundations while adapting to domain-specific requirements.
Abstract: Foundation models have garnered increasing attention for representation learning in remote sensing. Many such foundation models adopt approaches that have demonstrated success in computer vision with minimal domain-specific modification. However, the development and application of foundation models in this field are still burgeoning, as there are a variety of competing approaches for how to most effectively leverage remotely sensed data. This paper examines these approaches, along with their roots in the computer vision field. This is done to characterize potential advantages and pitfalls, while outlining future directions to further improve remote sensing-specific foundation models. We discuss the quality of the learned representations and methods to alleviate the need for massive compute resources. We first examine single-sensor remote foundation models to introduce concepts and provide context, and then place emphasis on incorporating the multi-sensor aspect of Earth observations into foundation models. In particular, we explore the extent to which existing approaches leverage multiple sensors in training foundation models in relation to multi-modal foundation models. Finally, we identify opportunities for further harnessing the vast amounts of unlabeled, seasonal, and multi-sensor remote sensing observations.
[409] Spike Imaging Velocimetry: Dense Motion Estimation of Fluids Using Spike Cameras
Yunzhong Zhang, Bo Xiong, You Zhou, Changqing Su, Zhen Cheng, Zhaofei Yu, Xun Cao, Tiejun Huang
Main category: cs.CV
TL;DR: Spike Imaging Velocimetry (SIV) is a deep learning framework that uses spike cameras for Particle Image Velocimetry, achieving superior performance on turbulent flow fields through hierarchical transforms and graph encoders.
Details
Motivation: The need for accurate, non-intrusive flow measurement methods in fluid dynamics, particularly for highly turbulent and intricate flow fields where traditional PIV methods may be limited.Method: Proposed SIV framework with Detail-Preserving Hierarchical Transform (DPHT) module to aggregate motion features from spike streams, and Graph Encoder (GE) to extract contextual features from complex fluid flows. Also created PSSD dataset for validation.
Result: The proposed method achieves superior performance compared to existing baseline methods on the PSSD dataset across three challenging fluid dynamics scenarios.
Conclusion: Spike cameras combined with the SIV framework offer significant potential for PIV applications, especially in complex turbulent flows, with open-sourced datasets and implementation available.
Abstract: The need for accurate and non-intrusive flow measurement methods has led to the widespread adoption of Particle Image Velocimetry (PIV), a powerful diagnostic tool in fluid motion estimation. This study investigates the tremendous potential of spike cameras (a type of ultra-high-speed, high-dynamic-range camera) in PIV. We propose a deep learning framework, Spike Imaging Velocimetry (SIV), designed specifically for highly turbulent and intricate flow fields. To aggregate motion features from the spike stream while minimizing information loss, we incorporate a Detail-Preserving Hierarchical Transform (DPHT) module. Additionally, we introduce a Graph Encoder (GE) to extract contextual features from highly complex fluid flows. Furthermore, we present a spike-based PIV dataset, Particle Scenes with Spike and Displacement (PSSD), which provides labeled data for three challenging fluid dynamics scenarios. Our proposed method achieves superior performance compared to existing baseline methods on PSSD. The datasets and our implementation of SIV are open-sourced in the supplementary materials.
[410] Vision-Language Model-Based Semantic-Guided Imaging Biomarker for Lung Nodule Malignancy Prediction
Luoting Zhuang, Seyed Mohammad Hossein Tabatabaei, Ramin Salehi-Rad, Linh M. Tran, Denise R. Aberle, Ashley E. Prosper, William Hsu
Main category: cs.CV
TL;DR: This paper presents a CLIP-based model that integrates radiologists’ semantic features with imaging data to predict lung cancer, achieving state-of-the-art performance across multiple datasets while providing explainable outputs.
Details
Motivation: Existing machine learning models for lung nodule malignancy assessment rely on manual annotation, have limited interpretability, and are sensitive to imaging variations, hindering clinical application.Method: Fine-tuned a pretrained CLIP model using parameter-efficient fine-tuning to align imaging and semantic text features from multiple datasets including NLST, LIDC, and three external datasets.
Result: Achieved AUROC of 0.901 and AUPRC of 0.776 on NLST test set, outperforming SOTA models. Also showed robust performance in external datasets and obtained zero-shot predictions for semantic features like nodule margin (AUROC: 0.807).
Conclusion: The approach surpasses SOTA models in predicting lung cancer across diverse clinical settings, provides explainable outputs, prevents learning shortcuts, and generalizes well across different clinical environments.
Abstract: Machine learning models have utilized semantic features, deep features, or both to assess lung nodule malignancy. However, their reliance on manual annotation during inference, limited interpretability, and sensitivity to imaging variations hinder their application in real-world clinical settings. Thus, this research aims to integrate semantic features derived from radiologists’ assessments of nodules, guiding the model to learn clinically relevant, robust, and explainable imaging features for predicting lung cancer. We obtained 938 low-dose CT scans from the National Lung Screening Trial (NLST) with 1,261 nodules and semantic features. Additionally, the Lung Image Database Consortium dataset contains 1,018 CT scans, with 2,625 lesions annotated for nodule characteristics. Three external datasets were obtained from UCLA Health, the LUNGx Challenge, and the Duke Lung Cancer Screening. We fine-tuned a pretrained Contrastive Language-Image Pretraining (CLIP) model with a parameter-efficient fine-tuning approach to align imaging and semantic text features and predict the one-year lung cancer diagnosis. Our model outperformed state-of-the-art (SOTA) models in the NLST test set with an AUROC of 0.901 and AUPRC of 0.776. It also showed robust results in external datasets. Using CLIP, we also obtained predictions on semantic features through zero-shot inference, such as nodule margin (AUROC: 0.807), nodule consistency (0.812), and pleural attachment (0.840). Our approach surpasses the SOTA models in predicting lung cancer across datasets collected from diverse clinical settings, providing explainable outputs, aiding clinicians in comprehending the underlying meaning of model predictions. This approach also prevents the model from learning shortcuts and generalizes across clinical settings. The code is available at https://github.com/luotingzhuang/CLIP_nodule.
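The zero-shot semantic-feature predictions follow the standard CLIP recipe of comparing an image against attribute prompts. The sketch below uses a generic public CLIP checkpoint and made-up prompts purely for illustration; the paper fine-tunes its own model on CT-derived nodule images.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def nodule_margin_score(image_path):
    """Zero-shot score for a single semantic attribute via contrastive text prompts."""
    prompts = ["a lung nodule with a smooth, well-defined margin",
               "a lung nodule with a spiculated, poorly defined margin"]
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)
    return probs[0, 1].item()   # probability of a spiculated margin
```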
[411] Coarse Attribute Prediction with Task Agnostic Distillation for Real World Clothes Changing ReID
Priyank Pathak, Yogesh S Rawat
Main category: cs.CV
TL;DR: RLQ framework improves clothes-changing person re-identification (CC-ReID) for low-quality real-world images by using Coarse Attributes Prediction and Task Agnostic Distillation to handle artifacts like pixelation and blur.
Details
Motivation: Existing CC-ReID models work well with high-quality images but struggle with low-quality images containing artifacts like pixelation, blur, etc., which corrupt both external biometric attributes and internal feature representations.Method: Proposes RLQ framework with Coarse Attributes Prediction (CAP) to enrich external fine-grained attributes via coarse predictions, and Task Agnostic Distillation (TAD) to bridge gap between HQ and LQ features using external dataset through task-agnostic self-supervision and distillation.
Result: Outperforms existing approaches by 1.6%-2.9% Top-1 on real-world datasets (LaST, DeepChange), with consistent improvement of 5.3%-6% Top-1 on PRCC and competitive performance on LTCC.
Conclusion: RLQ framework effectively addresses low-quality image challenges in CC-ReID, demonstrating significant performance improvements across multiple real-world datasets.
Abstract: This work focuses on Clothes Changing Re-IDentification (CC-ReID) for the real world. Existing works perform well with high-quality (HQ) images, but struggle with low-quality (LQ) where we can have artifacts like pixelation, out-of-focus blur, and motion blur. These artifacts introduce noise to not only external biometric attributes (e.g. pose, body shape, etc.) but also corrupt the model’s internal feature representation. Models usually cluster LQ image features together, making it difficult to distinguish between them, leading to incorrect matches. We propose a novel framework Robustness against Low-Quality (RLQ) to improve CC-ReID model on real-world data. RLQ relies on Coarse Attributes Prediction (CAP) and Task Agnostic Distillation (TAD) operating in alternate steps in a novel training mechanism. CAP enriches the model with external fine-grained attributes via coarse predictions, thereby reducing the effect of noisy inputs. On the other hand, TAD enhances the model’s internal feature representation by bridging the gap between HQ and LQ features, via an external dataset through task-agnostic self-supervision and distillation. RLQ outperforms the existing approaches by 1.6%-2.9% Top-1 on real-world datasets like LaST, and DeepChange, while showing consistent improvement of 5.3%-6% Top-1 on PRCC with competitive performance on LTCC. The code will be made public soon.
[412] LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts
Qifeng Cai, Hao Liang, Hejun Dong, Meiyi Qiang, Ruichuan An, Zhaoyang Han, Zhengzhou Zhu, Bin Cui, Wentao Zhang
Main category: cs.CV
TL;DR: LoVR is a new benchmark for long video-text retrieval that addresses limitations of existing datasets by providing longer videos, high-quality fine-grained captions, and larger scale through an efficient caption generation framework.
Details
Motivation: Existing video-text retrieval benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinders evaluation of advanced methods.Method: Proposed an efficient caption generation framework integrating VLM automatic generation, caption quality scoring, and dynamic refinement, plus a semantic fusion method for coherent full-video captions.
Result: Created LoVR benchmark with 467 long videos and over 40,804 fine-grained clips with high-quality captions. Experiments show it’s challenging for current embedding models.
Conclusion: LoVR presents new challenges for video understanding and retrieval, revealing limitations of current approaches and providing valuable insights for future research.
Abstract: Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates VLM automatic generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. We release the code and dataset link at https://github.com/TechNomad-ds/LoVR-benchmark
[413] Reflectance Prediction-based Knowledge Distillation for Robust 3D Object Detection in Compressed Point Clouds
Hao Jing, Anhong Wang, Yifan Zhang, Donghan Bu, Junhui Hou
Main category: cs.CV
TL;DR: Proposes RPKD framework for 3D object detection using reflectance prediction and knowledge distillation to handle compressed point clouds without reflectance data.
Details
Motivation: Address transmission burden from reflectance encoding and limited detection robustness in existing point cloud compression systems for intelligent transportation.Method: Compress point coordinates while discarding reflectance, use reflectance prediction module to reconstruct reflectance, and apply knowledge distillation from teacher to student detector.
Result: Demonstrates improved detection accuracy for compressed point clouds across multiple code rates on KITTI and DAIR-V2X-V datasets.
Conclusion: RPKD framework effectively boosts detection performance for compressed point clouds while reducing transmission burden by eliminating reflectance encoding.
Abstract: Regarding intelligent transportation systems, low-bitrate transmission via lossy point cloud compression is vital for facilitating real-time collaborative perception among connected agents, such as vehicles and infrastructures, under restricted bandwidth. In existing compression transmission systems, the sender lossily compresses point coordinates and reflectance to generate a transmission code stream, which faces transmission burdens from reflectance encoding and limited detection robustness due to information loss. To address these issues, this paper proposes a 3D object detection framework with reflectance prediction-based knowledge distillation (RPKD). We compress point coordinates while discarding reflectance during low-bitrate transmission, and feed the decoded non-reflectance compressed point clouds into a student detector. The discarded reflectance is then reconstructed by a geometry-based reflectance prediction (RP) module within the student detector for precise detection. A teacher detector with the same structure as the student detector is designed for performing reflectance knowledge distillation (RKD) and detection knowledge distillation (DKD) from raw to compressed point clouds. Our cross-source distillation training strategy (CDTS) equips the student detector with robustness to low-quality compressed data while preserving the accuracy benefits of raw data through transferred distillation knowledge. Experimental results on the KITTI and DAIR-V2X-V datasets demonstrate that our method can boost detection accuracy for compressed point clouds across multiple code rates. We will release the code publicly at https://github.com/HaoJing-SX/RPKD.
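The training signal combines detection supervision, reflectance prediction, and two distillation terms from the raw-data teacher. The loss sketch below is a bare-bones stand-in: the detector output dictionary, the use of BEV features for the feature distillation, and all loss weights are hypothetical.

```python
import torch.nn.functional as F

def rpkd_losses(student_out, teacher_out, gt_reflectance, det_loss):
    """student_out / teacher_out: dicts with 'feat' (B, C, H, W) BEV features, 'cls' detection
    logits, and (student only) 'pred_reflectance'; gt_reflectance: targets for the RP module."""
    l_det = det_loss(student_out)                                   # task loss on GT boxes
    l_rp = F.smooth_l1_loss(student_out["pred_reflectance"], gt_reflectance)
    # Reflectance/feature knowledge distillation: match the teacher trained on raw data.
    l_rkd = F.mse_loss(student_out["feat"], teacher_out["feat"].detach())
    # Detection knowledge distillation on the classification logits.
    l_dkd = F.kl_div(F.log_softmax(student_out["cls"], dim=-1),
                     F.softmax(teacher_out["cls"].detach(), dim=-1),
                     reduction="batchmean")
    return l_det + l_rp + 0.5 * l_rkd + 0.5 * l_dkd
```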
[414] Diffusion Classifiers Understand Compositionality, but Conditions Apply
Yujin Jeong, Arnas Uselis, Seong Joon Oh, Anna Rohrbach
Main category: cs.CV
TL;DR: A comprehensive study of diffusion classifiers’ discriminative capabilities on compositional tasks across 3 diffusion models, 10 datasets, and 30+ tasks, revealing that while they understand compositionality, performance depends on domain alignment and timestep sensitivity.
Details
Motivation: To address the gap in understanding diffusion models' compositional capabilities for discriminative tasks, as prior work showed promising but preliminary results with limited benchmarks and shallow analysis of success conditions.Method: Systematic evaluation of SD 1.5, 2.0, and 3-m diffusion models across 10 datasets and 30+ compositional tasks, introducing Self-Bench diagnostic benchmark to isolate domain effects, and analyzing timestep weighting importance.
Result: Diffusion classifiers demonstrate compositional understanding but performance varies significantly with domain alignment; SD3-m shows particular sensitivity to timestep selection based on domain gap.
Conclusion: Diffusion classifiers can understand compositionality, but their success depends on specific conditions including domain alignment and proper timestep configuration, with SD3-m being particularly sensitive to domain gaps.
Abstract: Understanding visual scenes is fundamental to human intelligence. While discriminative models have significantly advanced computer vision, they often struggle with compositional understanding. In contrast, recent generative text-to-image diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities. Building on this, zero-shot diffusion classifiers have been proposed to repurpose diffusion models for discriminative tasks. While prior work offered promising results in discriminative compositional scenarios, these results remain preliminary due to a small number of benchmarks and a relatively shallow analysis of conditions under which the models succeed. To address this, we present a comprehensive study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, our study covers three diffusion models (SD 1.5, 2.0, and, for the first time, 3-m) spanning 10 datasets and over 30 tasks. Further, we shed light on the role that target dataset domains play in respective performance; to isolate the domain effects, we introduce a new diagnostic benchmark Self-Bench comprised of images created by diffusion models themselves. Finally, we explore the importance of timestep weighting and uncover a relationship between domain gap and timestep sensitivity, particularly for SD3-m. To sum up, diffusion classifiers understand compositionality, but conditions apply! Code and dataset are available at https://github.com/eugene6923/Diffusion-Classifiers-Compositionality.
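For readers unfamiliar with the underlying mechanism, a zero-shot diffusion classifier scores each candidate text condition by how well the conditioned model denoises the encoded image, and picks the condition with the lowest error. The sketch below shows that general recipe with `diffusers` (it is not the paper's evaluation harness); the checkpoint, timesteps, and number of noise draws are arbitrary, and the timestep choice is exactly the sensitivity the study analyzes.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

@torch.no_grad()
def diffusion_classify(image, prompts, timesteps=(200, 500, 800), n_noise=4):
    """image: (1, 3, H, W) tensor in [-1, 1]; prompts: one text per candidate class."""
    latents = pipe.vae.encode(image.half().to("cuda")).latent_dist.mean
    latents = latents * pipe.vae.config.scaling_factor
    errors = []
    for prompt in prompts:
        tokens = pipe.tokenizer(prompt, padding="max_length", truncation=True,
                                max_length=pipe.tokenizer.model_max_length,
                                return_tensors="pt").input_ids.to("cuda")
        text_emb = pipe.text_encoder(tokens)[0]
        err = 0.0
        for t in timesteps:                       # the timestep choice matters (see above)
            for _ in range(n_noise):
                noise = torch.randn_like(latents)
                t_tensor = torch.tensor([t], device="cuda")
                noisy = pipe.scheduler.add_noise(latents, noise, t_tensor)
                pred = pipe.unet(noisy, t_tensor, encoder_hidden_states=text_emb).sample
                err += torch.mean((pred - noise) ** 2).item()
        errors.append(err / (len(timesteps) * n_noise))
    return int(torch.tensor(errors).argmin())     # index of the best-matching prompt
```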
[415] REN: Fast and Efficient Region Encodings from Patch-Based Image Encoders
Savya Khosla, Sethuraman TV, Barnett Lee, Alexander Schwing, Derek Hoiem
Main category: cs.CV
TL;DR: REN is a fast region encoder that generates region tokens directly from point prompts, achieving 60x speedup and 35x memory reduction compared to SAM-based methods while improving token quality.
Details
Motivation: Existing methods that combine class-agnostic segmenters (like SAM) with patch-based encoders suffer from high computational costs due to the segmentation step, creating a bottleneck for efficient region representation generation.Method: REN uses lightweight cross-attention blocks that take point prompts as queries and features from patch-based image encoders (DINO, DINOv2, OpenCLIP) as keys/values to directly generate region tokens corresponding to prompted objects.
Result: REN consistently outperforms original encoders in semantic segmentation and retrieval tasks, matches or exceeds SAM-based methods in performance while being significantly faster, and achieves SOTA on Ego4D VQ2D benchmark and outperforms proprietary LMMs on Visual Haystacks.
Conclusion: REN provides an efficient and effective alternative to segmentation-based region representation methods, enabling fast region token generation with improved quality and broad encoder compatibility.
Abstract: We introduce the Region Encoder Network (REN), a fast and effective model for generating region-based image representations using point prompts. Recent methods combine class-agnostic segmenters (e.g., SAM) with patch-based image encoders (e.g., DINO) to produce compact and effective region representations, but they suffer from high computational cost due to the segmentation step. REN bypasses this bottleneck using a lightweight module that directly generates region tokens, enabling 60x faster token generation with 35x less memory, while also improving token quality. It uses a few cross-attention blocks that take point prompts as queries and features from a patch-based image encoder as keys and values to produce region tokens that correspond to the prompted objects. We train REN with three popular encoders-DINO, DINOv2, and OpenCLIP-and show that it can be extended to other encoders without dedicated training. We evaluate REN on semantic segmentation and retrieval tasks, where it consistently outperforms the original encoders in both performance and compactness, and matches or exceeds SAM-based region methods while being significantly faster. Notably, REN achieves state-of-the-art results on the challenging Ego4D VQ2D benchmark and outperforms proprietary LMMs on Visual Haystacks’ single-needle challenge. Code and models are available at: https://github.com/savya08/REN.
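The core module is easy to picture: point prompts become queries that cross-attend over frozen patch features to yield one region token per prompt. The block below is a hedged approximation; depth, head count, and the way prompts are embedded are our assumptions rather than REN's actual configuration.

```python
import torch
import torch.nn as nn

class RegionTokenHead(nn.Module):
    def __init__(self, dim, num_heads=8, depth=3):
        super().__init__()
        self.point_embed = nn.Linear(2, dim)               # (x, y) in [0, 1] -> query token
        self.blocks = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(depth)])
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])

    def forward(self, points, patch_feats):
        """points: (B, P, 2) normalized prompts; patch_feats: (B, N, dim) from DINO etc."""
        q = self.point_embed(points)
        for attn, norm in zip(self.blocks, self.norms):
            out, _ = attn(q, patch_feats, patch_feats)      # queries attend to patch features
            q = norm(q + out)
        return q                                            # (B, P, dim) region tokens
```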
[416] Policy Optimized Text-to-Image Pipeline Design
Uri Gadot, Rinon Gal, Yftah Ziser, Gal Chechik, Shie Mannor
Main category: cs.CV
TL;DR: A reinforcement learning framework for automated text-to-image pipeline design that uses reward models to predict image quality without costly generation, achieving better diversity and quality than existing methods.
Details
Motivation: Current automated text-to-image pipeline design methods using LLMs suffer from high computational costs and poor generalization beyond training examples, requiring expert knowledge for effective design.Method: Two-phase training: first trains ensemble reward models to predict image quality from prompt-workflow combinations, then uses GRPO-based optimization with classifier-free guidance enhancement to guide model toward high-performing workflow regions.
Result: The approach successfully creates new workflows with greater diversity and achieves superior image quality compared to existing baselines.
Conclusion: The reinforcement learning framework effectively automates text-to-image pipeline design while overcoming computational inefficiencies and generalization limitations of previous methods.
Abstract: Text-to-image generation has evolved beyond single monolithic models to complex multi-component pipelines. These combine fine-tuned generators, adapters, upscaling blocks and even editing steps, leading to significant improvements in image quality. However, their effective design requires substantial expertise. Recent approaches have shown promise in automating this process through large language models (LLMs), but they suffer from two critical limitations: extensive computational requirements from generating images with hundreds of predefined pipelines, and poor generalization beyond memorized training examples. We introduce a novel reinforcement learning-based framework that addresses these inefficiencies. Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations, eliminating the need for costly image generation during training. We then implement a two-phase training strategy: initial workflow vocabulary training followed by GRPO-based optimization that guides the model toward higher-performing regions of the workflow space. Additionally, we incorporate a classifier-free guidance based enhancement technique that extrapolates along the path between the initial and GRPO-tuned models, further improving output quality. We validate our approach through a set of comparisons, showing that it can successfully create new flows with greater diversity and lead to superior image quality compared to existing baselines.
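The abstract mentions a classifier-free-guidance-style enhancement that extrapolates along the path between the initial and GRPO-tuned models, but does not spell out the exact formulation. The sketch below illustrates one plausible reading in which next-token logits are extrapolated past the tuned model; whether the paper operates on logits or weights is an assumption here.

```python
# Hedged sketch of a CFG-style extrapolation between an initial model and a
# GRPO-tuned one, applied to next-token logits (an assumed, not confirmed, space).
import torch

def cfg_extrapolate(logits_init: torch.Tensor, logits_tuned: torch.Tensor,
                    guidance: float = 1.5) -> torch.Tensor:
    """Push predictions past the tuned model, away from the initial one."""
    return logits_init + guidance * (logits_tuned - logits_init)

# toy example: guidance > 1 amplifies the shift introduced by tuning
vocab = 8
init = torch.randn(1, vocab)
tuned = init + 0.3 * torch.randn(1, vocab)
print(cfg_extrapolate(init, tuned, guidance=1.5))
```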
[417] SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation and Understanding
Dekai Zhu, Yixuan Hu, Youquan Liu, Dongyue Lu, Lingdong Kong, Slobodan Ilic
Main category: cs.CV
TL;DR: Spiral is a range-view LiDAR diffusion model that simultaneously generates depth, reflectance images, and semantic maps, achieving state-of-the-art performance with minimal parameters and enabling effective synthetic data augmentation.
Details
Motivation: Existing range-view methods only produce unlabeled LiDAR scenes, and using pretrained segmentation models for semantic labeling results in poor cross-modal consistency. The goal is to address this limitation while preserving the computational efficiency advantages of range-view representations.
Method: Proposed Spiral, a novel range-view LiDAR diffusion model that generates depth, reflectance images, and semantic maps simultaneously. Also introduced semantic-aware metrics to evaluate generated labeled range-view data quality.
Result: Experiments on SemanticKITTI and nuScenes datasets show Spiral achieves state-of-the-art performance with the smallest parameter size, outperforming two-step methods combining generative and segmentation models. Generated range images effectively work for synthetic data augmentation in downstream segmentation training.
Conclusion: Spiral successfully addresses the semantic labeling limitation in range-view LiDAR generation while maintaining computational efficiency, and demonstrates practical utility for reducing labeling effort through synthetic data augmentation.
Abstract: Leveraging recent diffusion models, LiDAR-based large-scale 3D scene generation has achieved great success. While recent voxel-based approaches can generate both geometric structures and semantic labels, existing range-view methods are limited to producing unlabeled LiDAR scenes. Relying on pretrained segmentation models to predict the semantic maps often results in suboptimal cross-modal consistency. To address this limitation while preserving the advantages of range-view representations, such as computational efficiency and simplified network design, we propose Spiral, a novel range-view LiDAR diffusion model that simultaneously generates depth, reflectance images, and semantic maps. Furthermore, we introduce novel semantic-aware metrics to evaluate the quality of the generated labeled range-view data. Experiments on the SemanticKITTI and nuScenes datasets demonstrate that Spiral achieves state-of-the-art performance with the smallest parameter size, outperforming two-step methods that combine the generative and segmentation models. Additionally, we validate that range images generated by Spiral can be effectively used for synthetic data augmentation in the downstream segmentation training, significantly reducing the labeling effort on LiDAR data.
[418] VidText: Towards Comprehensive Evaluation for Video Text Understanding
Zhoufaran Yang, Yan Shu, Jing Wang, Zhifei Yang, Yan Zhang, Yu Li, Keyang Lu, Gangyan Zeng, Shaohui Liu, Yu Zhou, Nicu Sebe
Main category: cs.CV
TL;DR: VidText is a new benchmark for video text understanding that addresses gaps in existing benchmarks by covering real-world scenarios, supporting multilingual content, and providing hierarchical evaluation tasks.
Details
Motivation: Existing video understanding benchmarks overlook textual information, while OCR benchmarks are limited to static images, failing to capture text-visual interactions in dynamic contexts.
Method: Proposed VidText benchmark with: 1) wide range of real-world scenarios and multilingual content, 2) hierarchical evaluation framework (video-level, clip-level, instance-level tasks), 3) paired perception-reasoning tasks from visual text perception to cross-modal reasoning.
Result: Experiments on 18 state-of-the-art LMMs show models struggle across most tasks, with significant room for improvement. Analysis reveals impact of model-intrinsic factors (input resolution, OCR capability) and external factors (auxiliary information, Chain-of-Thought reasoning).
Conclusion: VidText fills the current gap in video understanding benchmarks and serves as a foundation for future research on multimodal reasoning with video text in dynamic environments.
Abstract: Visual texts embedded in videos carry rich semantic information, which is crucial for both holistic video understanding and fine-grained reasoning about local human actions. However, existing video understanding benchmarks largely overlook textual information, while OCR-specific benchmarks are constrained to static images, limiting their ability to capture the interaction between text and dynamic visual contexts. To address this gap, we propose VidText, a new benchmark designed for comprehensive and in-depth evaluation of video text understanding. VidText offers the following key features: 1) It covers a wide range of real-world scenarios and supports multilingual content, encompassing diverse settings where video text naturally appears. 2) It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks, enabling assessment of both global summarization and local retrieval capabilities. 3) The benchmark also introduces a set of paired perception reasoning tasks, ranging from visual text perception to cross-modal reasoning between textual and visual information. Extensive experiments on 18 state-of-the-art Large Multimodal Models (LMMs) reveal that current models struggle across most tasks, with significant room for improvement. Further analysis highlights the impact of both model-intrinsic factors, such as input resolution and OCR capability, and external factors, including the use of auxiliary information and Chain-of-Thought reasoning strategies. We hope VidText will fill the current gap in video understanding benchmarks and serve as a foundation for future research on multimodal reasoning with video text in dynamic environments.
[419] Geospatial Foundation Models to Enable Progress on Sustainable Development Goals
Pedram Ghamisi, Weikang Yu, Xiaokang Zhang, Aldino Rizaldy, Jian Wang, Chufeng Zhou, Richard Gloaguen, Gustau Camps-Valls
Main category: cs.CV
TL;DR: SustainFM is a benchmarking framework that evaluates geospatial Foundation Models against the 17 Sustainable Development Goals, showing they often outperform traditional methods but require broader evaluation criteria beyond just accuracy.
Details
Motivation: Despite rapid growth of geospatial Foundation Models, their real-world utility and alignment with global sustainability goals remain underexplored, necessitating a comprehensive assessment framework.
Method: Developed SustainFM, a comprehensive benchmarking framework grounded in the 17 Sustainable Development Goals with diverse tasks including asset wealth prediction and environmental hazard detection.
Result: (1) FMs often outperform traditional approaches across diverse tasks, though not universally superior; (2) Evaluation should include transferability, generalization, and energy efficiency; (3) FMs enable scalable SDG-grounded solutions for sustainability challenges.
Conclusion: Advocates for paradigm shift from model-centric development to impact-driven deployment, emphasizing energy efficiency, robustness to domain shifts, and ethical considerations as key metrics for responsible FM use.
Abstract: Foundation Models (FMs) are large-scale, pre-trained artificial intelligence (AI) systems that have revolutionized natural language processing and computer vision, and are now advancing geospatial analysis and Earth Observation (EO). They promise improved generalization across tasks, scalability, and efficient adaptation with minimal labeled data. However, despite the rapid proliferation of geospatial FMs, their real-world utility and alignment with global sustainability goals remain underexplored. We introduce SustainFM, a comprehensive benchmarking framework grounded in the 17 Sustainable Development Goals with extremely diverse tasks ranging from asset wealth prediction to environmental hazard detection. This study provides a rigorous, interdisciplinary assessment of geospatial FMs and offers critical insights into their role in attaining sustainability goals. Our findings show: (1) While not universally superior, FMs often outperform traditional approaches across diverse tasks and datasets. (2) Evaluating FMs should go beyond accuracy to include transferability, generalization, and energy efficiency as key criteria for their responsible use. (3) FMs enable scalable, SDG-grounded solutions, offering broad utility for tackling complex sustainability challenges. Critically, we advocate for a paradigm shift from model-centric development to impact-driven deployment, and emphasize metrics such as energy efficiency, robustness to domain shifts, and ethical considerations.
[420] Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences
Hyojin Bahng, Caroline Chan, Fredo Durand, Phillip Isola
Main category: cs.CV
TL;DR: The paper proposes CycleReward, a method that uses cycle consistency between text and image generation models to measure alignment without human/AI preferences, creating a large preference dataset for training reward models.
Details
Motivation: Existing methods for measuring language-vision alignment rely on costly human or AI preference collection, which is inefficient for detailed multimodal data.
Method: Leverages cycle consistency: maps generated text back to image space via text-to-image model and computes similarity between original and reconstructed image, and vice versa for text reconstruction. Uses this score to rank candidates and build a preference dataset.
Result: CycleReward outperforms state-of-the-art alignment metrics on detailed captioning, shows superior inference-time scalability for Best-of-N sampling, and enhances performance across vision-language tasks and text-to-image generation when used for DPO and Diffusion DPO.
Conclusion: Cycle consistency provides an effective supervisory signal for measuring language-vision alignment, enabling scalable and efficient reward modeling without expensive preference collection.
Abstract: Measuring alignment between language and vision is a fundamental challenge, especially as multimodal data becomes increasingly detailed and complex. Existing methods often rely on collecting human or AI preferences, which can be costly and time-intensive. We propose an alternative approach that leverages cycle consistency as a supervisory signal. Given an image and generated text, we map the text back to image space using a text-to-image model and compute the similarity between the original image and its reconstruction. Analogously, for text-to-image generation, we measure the textual similarity between an input caption and its reconstruction through the cycle. We use the cycle consistency score to rank candidates and construct a preference dataset of 866K comparison pairs. The reward model trained on our dataset, CycleReward, outperforms state-of-the-art alignment metrics on detailed captioning, with superior inference-time scalability when used as a verifier for Best-of-N sampling, while maintaining speed and differentiability. Furthermore, performing DPO and Diffusion DPO using our dataset enhances performance across a wide range of vision-language tasks and text-to-image generation. Our dataset, model, and code are publicly released at https://cyclereward.github.io.
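To make the cycle-consistency signal concrete, the sketch below ranks candidate captions for an image by regenerating an image from each caption and scoring its similarity to the original. The text-to-image model and image encoder are replaced by stand-in functions so the snippet runs without any model weights; they are placeholders, not the released components.

```python
# Illustrative sketch of cycle-consistency ranking (assumptions, not the released code).
import torch
import torch.nn.functional as F

def text_to_image(caption: str) -> torch.Tensor:
    # placeholder for a real text-to-image model; returns a fake image
    # embedding seeded by the caption so the example runs on its own
    g = torch.Generator().manual_seed(abs(hash(caption)) % (2**31))
    return torch.randn(512, generator=g)

def embed_image(image: torch.Tensor) -> torch.Tensor:
    return F.normalize(image, dim=-1)   # placeholder for a real image encoder

def cycle_score(image_emb: torch.Tensor, caption: str) -> float:
    recon = embed_image(text_to_image(caption))        # caption -> image -> embedding
    return F.cosine_similarity(image_emb, recon, dim=-1).item()

image_emb = F.normalize(torch.randn(512), dim=-1)
candidates = ["a red bike leaning on a fence", "a dog on a beach", "a crowded market"]
ranked = sorted(candidates, key=lambda c: cycle_score(image_emb, c), reverse=True)
print(ranked)   # higher cycle consistency first; such rankings seed the preference data
```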
[421] EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models
Mingzhe Li, Gehao Zhang, Zhenting Wang, Guanhong Tao, Siqi Pan, Richard Cartwright, Juan Zhai, Shiqing Ma
Main category: cs.CV
TL;DR: Proposes EDITOR, a prompt inversion technique for text-to-image diffusion models that initializes embeddings using image captioning, refines them through latent-space reverse-engineering, and converts them to text using an embedding-to-text model.
Details
Motivation: Prompt inversion has potential for data attribution, model provenance, and watermarking validation, but existing methods face challenges in semantic fluency, efficiency, and image similarity.
Method: Three-step approach: 1) Initialize embeddings using pre-trained image captioning model, 2) Refine embeddings through reverse-engineering in latent space, 3) Convert embeddings to text using embedding-to-text model.
Result: Outperforms existing methods on MS COCO, LAION, and Flickr datasets in image similarity, textual alignment, prompt interpretability and generalizability.
Conclusion: The method enables applications in cross-concept image synthesis, concept manipulation, evolutionary multi-concept generation and unsupervised segmentation.
Abstract: Text-to-image generation models (e.g., Stable Diffusion) have achieved significant advancements, enabling the creation of high-quality and realistic images based on textual descriptions. Prompt inversion, the task of identifying the textual prompt used to generate a specific artifact, holds significant potential for applications including data attribution, model provenance, and watermarking validation. Recent studies introduced a delayed projection scheme to optimize for prompts representative of the vocabulary space, though challenges in semantic fluency and efficiency remain. Advanced image captioning models or visual large language models can generate highly interpretable prompts, but they often fall short in image similarity. In this paper, we propose a prompt inversion technique called EDITOR for text-to-image diffusion models, which includes initializing embeddings using a pre-trained image captioning model, refining them through reverse-engineering in the latent space, and converting them to texts using an embedding-to-text model. Our experiments on widely used datasets such as MS COCO, LAION, and Flickr show that our method outperforms existing methods in terms of image similarity, textual alignment, prompt interpretability and generalizability. We further illustrate the application of our generated prompts in tasks such as cross-concept image synthesis, concept manipulation, evolutionary multi-concept generation and unsupervised segmentation.
[422] CanadaFireSat: Toward high-resolution wildfire forecasting with multiple modalities
Hugo Porta, Emanuele Dalsasso, Jessica L. McCarty, Devis Tuia
Main category: cs.CV
TL;DR: This paper introduces CanadaFireSat, a benchmark dataset for high-resolution (100m) wildfire forecasting in Canada using multi-modal satellite and environmental data, achieving 60.3% F1 score on unseen 2023 wildfire season.
Details
Motivation: The severe 2023 Canadian wildfire season highlights the need for better wildfire mitigation solutions due to climate-change-induced increases in fire season length and severity in boreal ecosystems.
Method: Developed baseline methods using multi-modal data from high-resolution Sentinel-2 satellite images, MODIS satellite products, and ERA5 environmental factors, tested with two major deep learning architectures.
Result: Multi-modal temporal inputs outperformed single-modal inputs across all metrics, achieving 60.3% F1 score for the 2023 wildfire season that was not seen during training.
Conclusion: Multi-modal deep learning models show strong potential for high-resolution, continental-scale wildfire forecasting, addressing limitations of coarse-resolution methods.
Abstract: In 2023, Canada experienced one of the most severe wildfire seasons in recent history, causing damage across ecosystems, destroying communities, and emitting large quantities of CO2. This extreme wildfire season is symptomatic of a climate-change-induced increase in the length and severity of the fire season that affects the boreal ecosystem. Therefore, it is critical to empower wildfire management in boreal communities with better mitigation solutions. Wildfire probability maps represent an important tool for understanding the likelihood of wildfire occurrence and the potential severity of future wildfires. The massive increase in the availability of Earth observation data has enabled the development of deep learning-based wildfire forecasting models, aiming at providing precise wildfire probability maps at different spatial and temporal scales. A main limitation of such methods is their reliance on coarse-resolution environmental drivers and satellite products, leading to wildfire occurrence prediction of reduced resolution, typically around 0.1°. This paper presents CanadaFireSat, a benchmark dataset, and baseline methods for high-resolution (100 m) wildfire forecasting across Canada, leveraging multi-modal data from high-resolution multi-spectral satellite images (Sentinel-2 L1C), mid-resolution satellite products (MODIS), and environmental factors (ERA5 reanalysis data). Our experiments consider two major deep learning architectures. We observe that using multi-modal temporal inputs outperforms single-modal temporal inputs across all metrics, achieving a peak performance of 60.3% in F1 score for the 2023 wildfire season, a season never seen during model training. This demonstrates the potential of multi-modal deep learning models for wildfire forecasting at high-resolution and continental scale.
[423] Non-Contact Health Monitoring During Daily Personal Care Routines
Xulin Ma, Jiankai Tang, Zhang Jiang, Songqin Cheng, Yuanchun Shi, Dong LI, Xin Liu, Daniel McDuff, Xiaojing Liu, Yuntao Wang
Main category: cs.CV
TL;DR: LADH is the first long-term rPPG dataset for high-altitude daily health monitoring, combining RGB and IR videos to improve physiological signal accuracy with multi-task learning.
Details
Motivation: rPPG offers non-contact health monitoring but faces challenges in long-term personal care scenarios like mirror-facing routines in high-altitude environments due to lighting variations, occlusions, and dynamic facial postures.
Method: Created LADH dataset with 240 synchronized RGB and IR facial videos from 21 participants across five personal care scenarios, using multi-task learning to combine RGB and IR inputs for physiological monitoring.
Result: Combining RGB and IR video inputs achieved MAE of 4.99 BPM in heart rate estimation, with multi-task learning enhancing performance across multiple physiological indicators simultaneously.
Conclusion: RGB-IR fusion and multi-task learning significantly improve rPPG-based physiological monitoring accuracy and robustness in challenging real-world scenarios.
Abstract: Remote photoplethysmography (rPPG) enables non-contact, continuous monitoring of physiological signals and offers a practical alternative to traditional health sensing methods. Although rPPG is promising for daily health monitoring, its application in long-term personal care scenarios, such as mirror-facing routines in high-altitude environments, remains challenging due to ambient lighting variations, frequent occlusions from hand movements, and dynamic facial postures. To address these challenges, we present LADH (Long-term Altitude Daily Health), the first long-term rPPG dataset containing 240 synchronized RGB and infrared (IR) facial videos from 21 participants across five common personal care scenarios, along with ground-truth PPG, respiration, and blood oxygen signals. Our experiments demonstrate that combining RGB and IR video inputs improves the accuracy and robustness of non-contact physiological monitoring, achieving a mean absolute error (MAE) of 4.99 BPM in heart rate estimation. Furthermore, we find that multi-task learning enhances performance across multiple physiological indicators simultaneously. Dataset and code are open at https://github.com/McJackTang/FusionVitals.
[424] Where and How to Perturb: On the Design of Perturbation Guidance in Diffusion and Flow Models
Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Minjae Kim, Jaewon Min, Wooseok Jang, Sangwu Lee, Sayak Paul, Susung Hong, Seungryong Kim
Main category: cs.CV
TL;DR: HeadHunter is a framework for fine-grained attention head selection in Diffusion Transformers, enabling targeted control over visual attributes like structure and style through iterative head selection and SoftPAG perturbation.
Details
Motivation: Existing attention perturbation methods lack principled approaches for determining where to apply perturbations in DiT architectures, particularly since quality-relevant computations are distributed across layers rather than concentrated in specific locations.
Method: Proposes HeadHunter framework that systematically selects attention heads aligned with user objectives, and SoftPAG which linearly interpolates attention maps toward identity matrix for continuous perturbation strength control.
Result: The method mitigates oversmoothing issues of layer-level perturbation and enables targeted manipulation of specific visual styles through compositional head selection, validated on Stable Diffusion 3 and FLUX.1 models.
Conclusion: This work provides the first head-level analysis of attention perturbation in diffusion models, revealing interpretable specialization within attention layers and enabling practical design of effective perturbation strategies for fine-grained generation control.
Abstract: Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose “HeadHunter”, a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head’s attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
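SoftPAG, as described above, blends each selected head's attention map toward an identity matrix, with the interpolation weight acting as a continuous perturbation knob. A minimal rendering of that single operation follows; head selection itself (HeadHunter) is assumed to have already produced the indices.

```python
# Minimal sketch of a SoftPAG-style perturbation on post-softmax attention maps.
import torch

def softpag_perturb(attn: torch.Tensor, selected_heads, strength: float = 0.5):
    # attn: (batch, heads, tokens, tokens) post-softmax attention maps
    eye = torch.eye(attn.shape[-1], device=attn.device).expand_as(attn)
    out = attn.clone()
    out[:, selected_heads] = (1 - strength) * attn[:, selected_heads] \
                             + strength * eye[:, selected_heads]
    return out

attn = torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)
perturbed = softpag_perturb(attn, selected_heads=[3, 7], strength=0.6)
print(perturbed.shape)  # torch.Size([2, 12, 16, 16]); only heads 3 and 7 are perturbed
```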
[425] WildCAT3D: Appearance-Aware Multi-View Diffusion in the Wild
Morris Alper, David Novotny, Filippos Kokkinos, Hadar Averbuch-Elor, Tom Monnier
Main category: cs.CV
TL;DR: WildCAT3D is a framework for novel view synthesis from diverse 2D scene images captured in the wild, addressing the challenge of scene-level NVS by modeling global appearance conditions and enabling training on varied real-world data.
Details
Motivation: Scene-level novel view synthesis faces limitations due to lack of clean multi-view training data. While curated datasets have limited diversity and licensing issues, abundant diverse permissively-licensed data exists in the wild (tourist photos, etc.) with varying appearances.
Method: Extends multi-view diffusion paradigm to learn from scene views with varying appearances by explicitly modeling global appearance conditions in images. Enables training on diverse 2D scene image data captured in the wild.
Result: Achieves state-of-the-art results on single-view NVS in both object- and scene-level settings. Generalizes to new scenes at inference time, generating multiple consistent novel views. Trains on strictly less data sources than prior methods while providing global appearance control during generation.
Conclusion: WildCAT3D successfully addresses scene-level NVS challenges by leveraging diverse real-world data through explicit appearance modeling, outperforming previous methods with less training data and enabling novel applications through appearance control.
Abstract: Despite recent advances in sparse novel view synthesis (NVS) applied to object-centric scenes, scene-level NVS remains a challenge. A central issue is the lack of available clean multi-view training data, beyond manually curated datasets with limited diversity, camera variation, or licensing issues. On the other hand, an abundance of diverse and permissively-licensed data exists in the wild, consisting of scenes with varying appearances (illuminations, transient occlusions, etc.) from sources such as tourist photos. To this end, we present WildCAT3D, a framework for generating novel views of scenes learned from diverse 2D scene image data captured in the wild. We unlock training on these data sources by explicitly modeling global appearance conditions in images, extending the state-of-the-art multi-view diffusion paradigm to learn from scene views of varying appearances. Our trained model generalizes to new scenes at inference time, enabling the generation of multiple consistent novel views. WildCAT3D provides state-of-the-art results on single-view NVS in object- and scene-level settings, while training on strictly less data sources than prior methods. Additionally, it enables novel applications by providing global appearance control during generation.
[426] DepthVanish: Optimizing Adversarial Interval Structures for Stereo-Depth-Invisible Patches
Yun Xing, Yue Cao, Nhat Chung, Jie Zhang, Ivor Tsang, Ming-Ming Cheng, Yang Liu, Lei Ma, Qing Guo
Main category: cs.CV
TL;DR: The paper introduces a novel adversarial attack method for stereo depth estimation that uses grid-structured patches with regular intervals between repeated textures, significantly improving physical attack performance compared to naive repetition methods.
Details
Motivation: Previous adversarial attacks using repeated textures work well in digital settings but fail in physical implementations, limiting their practical utility for security testing of stereo depth estimation systems used in autonomous driving and robotics.
Method: The authors propose jointly optimizing both interval structure and texture elements to create grid-structured adversarial patches. These patches are designed to be inserted into any scene and tested on various stereo depth estimation methods.
Result: The generated adversarial patches successfully attack advanced stereo depth estimation methods (RAFT-Stereo and STTR) and commercial RGB-D cameras (Intel RealSense) in real-world conditions, demonstrating practical effectiveness.
Conclusion: The grid-structured adversarial patches provide a practical and effective method for security assessment of stereo depth estimation systems, bridging the gap between digital and physical attack performance.
Abstract: Stereo depth estimation is a critical task in autonomous driving and robotics, where inaccuracies (such as misidentifying nearby objects as distant) can lead to dangerous situations. Adversarial attacks against stereo depth estimation can help reveal vulnerabilities before deployment. Previous works have shown that repeating optimized textures can effectively mislead stereo depth estimation in digital settings. However, our research reveals that these naively repeated textures perform poorly in physical implementations, i.e., when deployed as patches, limiting their practical utility for stress-testing stereo depth estimation systems. In this work, for the first time, we discover that introducing regular intervals among the repeated textures, creating a grid structure, significantly enhances the patch’s attack performance. Through extensive experimentation, we analyze how variations of this novel structure influence the adversarial effectiveness. Based on these insights, we develop a novel stereo depth attack that jointly optimizes both the interval structure and texture elements. Our generated adversarial patches can be inserted into any scene and successfully attack advanced stereo depth estimation methods of different paradigms, i.e., RAFT-Stereo and STTR. Most critically, our patch can also attack commercial RGB-D cameras (Intel RealSense) in real-world conditions, demonstrating its practical relevance for security assessment of stereo systems. The code is officially released at: https://github.com/WiWiN42/DepthVanish
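The structural idea is a texture tile repeated with regular blank intervals to form a grid. The toy snippet below only illustrates how such a patch is laid out; the tile contents, interval width, and background value are placeholders, and in the actual attack both the texture and the interval structure are jointly optimized rather than fixed.

```python
# Toy layout of a grid-structured patch: an optimized tile repeated with gaps.
import torch

def grid_patch(tile: torch.Tensor, reps: int = 4, interval: int = 8, bg: float = 0.5):
    c, h, w = tile.shape
    cell_h, cell_w = h + interval, w + interval
    canvas = torch.full((c, reps * cell_h, reps * cell_w), bg)
    for i in range(reps):
        for j in range(reps):
            canvas[:, i * cell_h:i * cell_h + h, j * cell_w:j * cell_w + w] = tile
    return canvas

tile = torch.rand(3, 32, 32)        # stands in for the jointly optimized texture
patch = grid_patch(tile)
print(patch.shape)                  # torch.Size([3, 160, 160])
```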
[427] AdFair-CLIP: Adversarial Fair Contrastive Language-Image Pre-training for Chest X-rays
Chenlang Yi, Zizhan Xiong, Qi Qi, Xiyuan Wei, Girish Bathla, Ching-Long Lin, Bobak Jack Mortazavi, Tianbao Yang
Main category: cs.CV
TL;DR: AdFair-CLIP is a framework that uses adversarial feature intervention to reduce demographic biases in CLIP models for medical image classification, improving fairness and accuracy in chest X-ray analysis.
Details
Motivation: CLIP models show superior performance in medical image classification but suffer from fairness issues like demographic biases related to race and gender, leading to diagnostic disparities for underrepresented groups.
Method: Adversarial feature intervention is used to suppress sensitive attributes, mitigating spurious correlations and improving prediction fairness in CLIP models.
Result: AdFair-CLIP significantly enhances both fairness and diagnostic accuracy on chest X-ray datasets while maintaining robust generalization in zero-shot and few-shot scenarios.
Conclusion: The framework establishes new benchmarks for fairness-aware learning in CLIP-based medical diagnostic models, particularly for chest X-ray analysis.
Abstract: Contrastive Language-Image Pre-training (CLIP) models have demonstrated superior performance across various visual tasks including medical image classification. However, fairness concerns, including demographic biases, have received limited attention for CLIP models. This oversight leads to critical issues, particularly those related to race and gender, resulting in disparities in diagnostic outcomes and reduced reliability for underrepresented groups. To address these challenges, we introduce AdFair-CLIP, a novel framework employing adversarial feature intervention to suppress sensitive attributes, thereby mitigating spurious correlations and improving prediction fairness. We conduct comprehensive experiments on chest X-ray (CXR) datasets, and show that AdFair-CLIP significantly enhances both fairness and diagnostic accuracy, while maintaining robust generalization in zero-shot and few-shot scenarios. These results establish new benchmarks for fairness-aware learning in CLIP-based medical diagnostic models, particularly for CXR analysis.
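The exact form of the adversarial feature intervention is not spelled out here, so the sketch below shows one standard way such an intervention is commonly realized: a gradient-reversal layer feeding a sensitive-attribute classifier on top of the image features, trained jointly with the diagnostic head. Treat it as an assumed stand-in, not the paper's objective.

```python
# Hedged sketch of adversarial suppression of a sensitive attribute via
# gradient reversal on top of (stand-in) CLIP image features.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None   # flip gradients flowing to the encoder

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

feat_dim, num_classes, num_groups = 512, 14, 2   # e.g. CXR findings, binary attribute
disease_head = nn.Linear(feat_dim, num_classes)
attr_head = nn.Linear(feat_dim, num_groups)      # adversary on the sensitive attribute
ce = nn.CrossEntropyLoss()

features = torch.randn(8, feat_dim, requires_grad=True)   # stand-in for image features
disease_y = torch.randint(0, num_classes, (8,))
attr_y = torch.randint(0, num_groups, (8,))

loss = ce(disease_head(features), disease_y) \
     + ce(attr_head(grad_reverse(features)), attr_y)       # reversed grads w.r.t. features
loss.backward()
print(features.grad.shape)
```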
[428] AI-Generated Video Detection via Perceptual Straightening
Christian Internò, Robert Geirhos, Markus Olhofer, Sunny Liu, Barbara Hammer, David Klindt
Main category: cs.CV
TL;DR: ReStraV is a novel method that detects AI-generated videos by analyzing temporal curvature and stepwise distance in neural representations, achieving state-of-the-art detection performance.
Details
Motivation: The rapid advancement of generative AI enables highly realistic synthetic videos, posing significant challenges for content authentication and raising urgent concerns about misuse. Existing detection methods often struggle with generalization and capturing subtle temporal inconsistencies.
Method: Inspired by the ‘perceptual straightening’ hypothesis, ReStraV analyzes deviations from expected geometric properties in neural representations. Using a pre-trained self-supervised vision transformer (DINOv2), it quantifies temporal curvature and stepwise distance in the model’s representation domain, then aggregates statistics and trains a classifier.
Result: The method achieves state-of-the-art detection performance with 97.17% accuracy and 98.63% AUROC on the VidProM benchmark, substantially outperforming existing image- and video-based methods. AI-generated videos exhibit significantly different curvature and distance patterns compared to real videos.
Conclusion: ReStraV is computationally efficient and offers a low-cost, effective detection solution. This work provides new insights into using neural representation geometry for AI-generated video detection.
Abstract: The rapid advancement of generative AI enables highly realistic synthetic videos, posing significant challenges for content authentication and raising urgent concerns about misuse. Existing detection methods often struggle with generalization and capturing subtle temporal inconsistencies. We propose ReStraV (Representation Straightening Video), a novel approach to distinguish natural from AI-generated videos. Inspired by the “perceptual straightening” hypothesis – which suggests real-world video trajectories become straighter in the neural representation domain – we analyze deviations from this expected geometric property. Using a pre-trained self-supervised vision transformer (DINOv2), we quantify the temporal curvature and stepwise distance in the model’s representation domain. We aggregate statistics of these measures for each video and train a classifier. Our analysis shows that AI-generated videos exhibit significantly different curvature and distance patterns compared to real videos. A lightweight classifier achieves state-of-the-art detection performance (e.g., 97.17% accuracy and 98.63% AUROC on the VidProM benchmark), substantially outperforming existing image- and video-based methods. ReStraV is computationally efficient, offering a low-cost and effective detection solution. This work provides new insights into using neural representation geometry for AI-generated video detection.
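The two geometric quantities the method is described as measuring, stepwise distance and temporal curvature along the embedding trajectory, can be computed as below. Random embeddings stand in for DINOv2 frame features, and the aggregation into classifier inputs is simplified to a few summary statistics.

```python
# Sketch of trajectory statistics in a frozen encoder's embedding space:
# stepwise distances and turning angles (curvature) between consecutive steps.
import torch
import torch.nn.functional as F

def trajectory_stats(frame_embs: torch.Tensor):
    # frame_embs: (T, D) one embedding per frame, in temporal order
    deltas = frame_embs[1:] - frame_embs[:-1]          # (T-1, D) steps along the trajectory
    step_dist = deltas.norm(dim=-1)                    # stepwise distance
    cos = F.cosine_similarity(deltas[1:], deltas[:-1], dim=-1).clamp(-1, 1)
    curvature = torch.rad2deg(torch.acos(cos))         # turning angle per frame triplet
    return step_dist.mean(), step_dist.std(), curvature.mean(), curvature.std()

embs = torch.randn(32, 768)             # 32 frames of a (fake) video
stats = torch.stack(list(trajectory_stats(embs)))
print(stats)                            # features like these would feed a small classifier
```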
[429] Consistent Supervised-Unsupervised Alignment for Generalized Category Discovery
Jizhou Han, Shaokun Wang, Yuhang He, Chenhao Ding, Qiang Wang, Xinyuan Gao, SongLin Dong, Yihong Gong
Main category: cs.CV
TL;DR: NC-GCD is a framework that addresses challenges in Generalized Category Discovery by using fixed ETF prototypes to ensure optimal geometric structure and consistent optimization, achieving improved performance on novel categories.
Details
Motivation: Previous GCD methods suffer from inconsistent optimization objectives and category confusion, leading to feature overlap and poor performance on novel categories.
Method: Pre-assigns and fixes Equiangular Tight Frame (ETF) prototypes, uses Consistent ETF Alignment Loss to unify supervised/unsupervised alignment, and employs Semantic Consistency Matcher for stable label assignments.
Result: Achieves strong performance on multiple GCD benchmarks with significant improvement in novel category accuracy.
Conclusion: NC-GCD effectively addresses GCD challenges through ETF-based geometric structure and consistent optimization, demonstrating superior performance in discovering novel categories.
Abstract: Generalized Category Discovery (GCD) focuses on classifying known categories while simultaneously discovering novel categories from unlabeled data. However, previous GCD methods face challenges due to inconsistent optimization objectives and category confusion. This leads to feature overlap and ultimately hinders performance on novel categories. To address these issues, we propose the Neural Collapse-inspired Generalized Category Discovery (NC-GCD) framework. By pre-assigning and fixing Equiangular Tight Frame (ETF) prototypes, our method ensures an optimal geometric structure and a consistent optimization objective for both known and novel categories. We introduce a Consistent ETF Alignment Loss that unifies supervised and unsupervised ETF alignment and enhances category separability. Additionally, a Semantic Consistency Matcher (SCM) is designed to maintain stable and consistent label assignments across clustering iterations. Our method achieves strong performance on multiple GCD benchmarks, significantly enhancing novel category accuracy and demonstrating its effectiveness.
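Fixed Equiangular Tight Frame prototypes have a closed form: K maximally separated unit vectors whose pairwise cosine similarity is -1/(K-1). The sketch below constructs such a simplex ETF, which is the geometric structure the method pre-assigns; how the alignment losses consume these prototypes is not reproduced here.

```python
# Construct a simplex ETF: K unit prototypes with pairwise cosine -1/(K-1).
import torch

def simplex_etf(num_classes: int, dim: int) -> torch.Tensor:
    assert dim >= num_classes
    u, _ = torch.linalg.qr(torch.randn(dim, num_classes))       # U with U^T U = I_K
    center = torch.eye(num_classes) - torch.ones(num_classes, num_classes) / num_classes
    m = (num_classes / (num_classes - 1)) ** 0.5 * u @ center   # (dim, K) prototype matrix
    return m / m.norm(dim=0, keepdim=True)                      # columns are unit vectors

protos = simplex_etf(num_classes=10, dim=128)
cos = protos.T @ protos
print(cos[0, 1].item())    # ~ -1/9 for every off-diagonal pair when K = 10
```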
[430] Event-RGB Fusion for Spacecraft Pose Estimation Under Harsh Lighting
Mohsi Jawaid, Marcus Märtens, Tat-Jun Chin
Main category: cs.CV
TL;DR: Proposes sensor fusion of RGB and event cameras for spacecraft pose estimation, addressing limitations of individual sensors under harsh lighting conditions.
Details
Motivation: Spacecraft pose estimation is crucial for autonomous operations but vision-based methods using RGB sensors struggle with harsh lighting conditions (glare, over-exposure). Event sensors have higher dynamic range but suffer from low resolution and noise during low motion periods.
Method: Developed a sensor fusion approach using beam-splitter prism for optical/temporal alignment, RANSAC-based fusion technique combining RGB and event data, with dropout uncertainty estimation to detect extreme conditions affecting either sensor.
Result: Collected comprehensive real dataset of RGB and event data for satellite pose estimation under challenging illumination. Results demonstrate efficacy of the event-RGB fusion approach.
Conclusion: The fusion method effectively leverages strengths of both sensor modalities, supporting the use of event sensors for spacecraft pose estimation. Dataset released publicly to support community research.
Abstract: Spacecraft pose estimation is crucial for autonomous in-space operations, such as rendezvous, docking and on-orbit servicing. Vision-based pose estimation methods, which typically employ RGB imaging sensors, are a compelling solution for spacecraft pose estimation, but are challenged by harsh lighting conditions, which produce imaging artifacts such as glare, over-exposure, blooming and lens flare. Due to their much higher dynamic range, neuromorphic or event sensors are more resilient to extreme lighting conditions. However, event sensors generally have lower spatial resolution and suffer from reduced signal-to-noise ratio during periods of low relative motion. This work addresses these individual sensor limitations by introducing a sensor fusion approach combining RGB and event sensors. A beam-splitter prism was employed to achieve precise optical and temporal alignment. Then, a RANSAC-based technique was developed to fuse the information from the RGB and event channels to achieve pose estimation that leveraged the strengths of the two modalities. The pipeline was complemented by dropout uncertainty estimation to detect extreme conditions that affect either channel. To benchmark the performance of the proposed event-RGB fusion method, we collected a comprehensive real dataset of RGB and event data for satellite pose estimation in a laboratory setting under a variety of challenging illumination conditions. Encouraging results on the dataset demonstrate the efficacy of our event-RGB fusion approach and further support the use of event sensors for spacecraft pose estimation. To support community research on this topic, our dataset has been released publicly.
[431] CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding
Hongyong Han, Wei Wang, Gaowei Zhang, Mingjie Li, Yi Wang
Main category: cs.CV
TL;DR: CoralVQA is the first large-scale Visual Question Answering dataset for coral reef analysis, containing 12,805 images from 67 coral genera across 3 oceans with 277,653 question-answer pairs, developed through a semi-automatic pipeline with marine biologists to address domain-specific challenges in coral monitoring.
Details
Motivation: Coral reefs require continuous monitoring for conservation, but interpreting coral images is challenging due to domain expertise requirements. Visual Question Answering using Large Vision-Language Models has potential for user-friendly interaction with coral imagery, but needs dedicated datasets addressing domain-specific annotations and multidimensional questions.
Method: Developed a semi-automatic data construction pipeline in collaboration with marine biologists to create CoralVQA dataset, ensuring both scalability and professional-grade data quality. The dataset includes real-world coral images collected from multiple oceans with comprehensive ecological and health-related question-answer pairs.
Result: Created CoralVQA with 12,805 coral images from 67 genera across 3 oceans and 277,653 question-answer pairs. Evaluation of state-of-the-art LVLMs revealed key limitations and opportunities, providing a comprehensive benchmark for vision-language reasoning in coral reef contexts.
Conclusion: CoralVQA presents novel challenges and forms a foundation for future LVLM development with emphasis on supporting coral conservation efforts, addressing the gap in domain-specific VQA datasets for marine ecosystem monitoring.
Abstract: Coral reefs are vital yet vulnerable ecosystems that require continuous monitoring to support conservation. While coral reef images provide essential information in coral monitoring, interpreting such images remains challenging due to the need for domain expertise. Visual Question Answering (VQA), powered by Large Vision-Language Models (LVLMs), has great potential in user-friendly interaction with coral reef images. However, applying VQA to coral imagery demands a dedicated dataset that addresses two key challenges: domain-specific annotations and multidimensional questions. In this work, we introduce CoralVQA, the first large-scale VQA dataset for coral reef analysis. It contains 12,805 real-world coral images from 67 coral genera collected from 3 oceans, along with 277,653 question-answer pairs that comprehensively assess ecological and health-related conditions. To construct this dataset, we develop a semi-automatic data construction pipeline in collaboration with marine biologists to ensure both scalability and professional-grade data quality. CoralVQA presents novel challenges and provides a comprehensive benchmark for studying vision-language reasoning in the context of coral reef images. By evaluating several state-of-the-art LVLMs, we reveal key limitations and opportunities. These insights form a foundation for future LVLM development, with a particular emphasis on supporting coral conservation efforts.
[432] MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, Chuang Gan
Main category: cs.CV
TL;DR: MindJourney is a test-time scaling framework that enhances vision-language models’ 3D spatial reasoning by coupling them with video diffusion world models for interactive scene exploration.
Details
Motivation: Current vision-language models struggle with 3D spatial reasoning tasks because they perceive 2D images but lack internal 3D dynamics modeling, which is essential for embodied tasks like navigation and manipulation.
Method: The framework couples a VLM with a controllable world model based on video diffusion. The VLM iteratively sketches camera trajectories while the world model synthesizes corresponding views at each step, enabling multi-view reasoning during interactive exploration.
Result: MindJourney achieves over 7.7% average performance boost on the spatial reasoning benchmark SAT without any fine-tuning, and improves upon test-time inference VLMs trained through reinforcement learning.
Conclusion: Pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning, demonstrating the potential of utilizing world models for test-time scaling.
Abstract: Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) frequently struggle with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that equips a VLM with this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over this multi-view evidence gathered during the interactive exploration. Without any fine-tuning, our MindJourney achieves an average performance boost of over 7.7% on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Meanwhile, our method also improves upon the test-time inference VLMs trained through reinforcement learning, which demonstrates the potential of our method that utilizes world models for test-time scaling.
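The test-time loop itself is simple to express. The sketch below captures only the control flow; the functions for trajectory proposal, view synthesis, and multi-view answering are hypothetical stand-ins for the VLM and the video-diffusion world model, not real APIs.

```python
# Schematic of a MindJourney-style test-time exploration loop (stand-in functions).
import random

def propose_camera_step(question: str, views: list[str]) -> str:
    # stand-in for the VLM sketching the next camera move
    return random.choice(["move_forward", "turn_left", "turn_right"])

def render_view(action: str, current_view: str) -> str:
    # stand-in for the world model synthesizing the view after `action`
    return f"{current_view}->{action}"

def answer(question: str, views: list[str]) -> str:
    # stand-in for the VLM reasoning over all gathered views
    return f"answer based on {len(views)} views"

def mindjourney_like(question: str, initial_view: str, num_steps: int = 3) -> str:
    views = [initial_view]
    for _ in range(num_steps):                    # interactive exploration loop
        action = propose_camera_step(question, views)
        views.append(render_view(action, views[-1]))
    return answer(question, views)

print(mindjourney_like("Is the chair behind the table?", "egocentric_frame_0"))
```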
[433] Semantic-Aware Representation Learning via Conditional Transport for Multi-Label Image Classification
Ren-Dong Xie, Zhi-Fen He, Bo Li, Bin Liu, Jin-Yan Hu
Main category: cs.CV
TL;DR: Proposes SCT method for multi-label image classification using semantic-aware feature learning and conditional transport alignment to overcome limitations in discriminative feature learning and visual-semantic alignment.
Details
Motivation: Existing methods have limitations in learning discriminative semantic-aware features and lack fine-grained alignment between visual representations and label embeddings.
Method: Introduces semantic-related feature learning module for discriminative label-specific features and conditional transport-based alignment mechanism for precise visual-semantic alignment.
Result: Extensive experiments on VOC2007 and MS-COCO datasets validate effectiveness and demonstrate superior performance compared to state-of-the-art methods.
Conclusion: SCT provides a unified framework that effectively addresses key limitations in multi-label image classification through semantic-aware representation learning and conditional transport alignment.
Abstract: Multi-label image classification is a critical task in machine learning that aims to accurately assign multiple labels to a single image. While existing methods often utilize attention mechanisms or graph convolutional networks to model visual representations, their performance is still constrained by two critical limitations: the inability to learn discriminative semantic-aware features, and the lack of fine-grained alignment between visual representations and label embeddings. To tackle these issues in a unified framework, this paper proposes a novel approach named Semantic-aware representation learning via Conditional Transport for Multi-Label Image Classification (SCT). The proposed method introduces a semantic-related feature learning module that extracts discriminative label-specific features by emphasizing semantic relevance and interaction, along with a conditional transport-based alignment mechanism that enables precise visual-semantic alignment. Extensive experiments on two widely-used benchmark datasets, VOC2007 and MS-COCO, validate the effectiveness of SCT and demonstrate its superior performance compared to existing state-of-the-art methods.
[434] Style-Aware Blending and Prototype-Based Cross-Contrast Consistency for Semi-Supervised Medical Image Segmentation
Chaowei Chen, Xiang Zhang, Honglie Guo, Shunfang Wang
Main category: cs.CV
TL;DR: A style-aware blending and prototype-based cross-contrast consistency learning framework for semi-supervised medical image segmentation that addresses confirmation bias and incomplete supervisory information utilization.
Details
Motivation: Existing weak-strong consistency methods focus on perturbation schemes but overlook inherent framework limitations: separated training data streams causing confirmation bias dominated by labeled data, and incomplete utilization of supervisory information limiting strong-to-weak consistency exploration.
Method: Style-guided distribution blending module to break independent training data streams by characterizing distribution mismatch through statistical moments. Prototype-based cross-contrast strategy to learn from both weak-to-strong and strong-to-weak predictions while mitigating noise in strong pseudo-labels.
Result: Experimental results demonstrate effectiveness and superiority across multiple medical segmentation benchmarks under various semi-supervised settings.
Conclusion: The proposed framework successfully addresses critical deficiencies in existing weak-strong consistency learning methods for semi-supervised medical image segmentation.
Abstract: Weak-strong consistency learning strategies are widely employed in semi-supervised medical image segmentation to train models by leveraging limited labeled data and enforcing weak-to-strong consistency. However, existing methods primarily focus on designing and combining various perturbation schemes, overlooking the inherent potential and limitations within the framework itself. In this paper, we first identify two critical deficiencies: (1) separated training data streams, which lead to confirmation bias dominated by the labeled stream; and (2) incomplete utilization of supervisory information, which limits exploration of strong-to-weak consistency. To tackle these challenges, we propose a style-aware blending and prototype-based cross-contrast consistency learning framework. Specifically, inspired by the empirical observation that the distribution mismatch between labeled and unlabeled data can be characterized by statistical moments, we design a style-guided distribution blending module to break the independent training data streams. Meanwhile, considering the potential noise in strong pseudo-labels, we introduce a prototype-based cross-contrast strategy to encourage the model to learn informative supervisory signals from both weak-to-strong and strong-to-weak predictions, while mitigating the adverse effects of noise. Experimental results demonstrate the effectiveness and superiority of our framework across multiple medical segmentation benchmarks under various semi-supervised settings.
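The blending module is described as characterizing the labeled/unlabeled mismatch through statistical moments. One natural (assumed) realization is AdaIN-style mixing of channel-wise means and standard deviations between the two streams, sketched below; the paper's exact formulation may differ.

```python
# Hedged sketch of style-guided distribution blending via feature statistics.
import torch

def blend_styles(feat_labeled: torch.Tensor, feat_unlabeled: torch.Tensor,
                 alpha: float = 0.5, eps: float = 1e-5) -> torch.Tensor:
    # feats: (B, C, H, W); compute per-channel moments over the spatial dims
    mu_l = feat_labeled.mean((2, 3), keepdim=True)
    std_l = feat_labeled.std((2, 3), keepdim=True)
    mu_u = feat_unlabeled.mean((2, 3), keepdim=True)
    std_u = feat_unlabeled.std((2, 3), keepdim=True)
    mu_mix = alpha * mu_l + (1 - alpha) * mu_u          # blended first moment
    std_mix = alpha * std_l + (1 - alpha) * std_u       # blended second moment
    normalized = (feat_labeled - mu_l) / (std_l + eps)
    return normalized * std_mix + mu_mix                # labeled content, blended style

x_l = torch.randn(2, 64, 32, 32)
x_u = torch.randn(2, 64, 32, 32) * 2 + 1                # different "style" statistics
print(blend_styles(x_l, x_u).shape)                     # torch.Size([2, 64, 32, 32])
```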
[435] Adjustable Spatio-Spectral Hyperspectral Image Compression Network
Martin Hermann Paul Fuchs, Behnood Rasti, Begüm Demir
Main category: cs.CV
TL;DR: HyCASS is a learning-based hyperspectral image compression model that enables adjustable compression in both spectral and spatial dimensions, achieving state-of-the-art performance with up to 2.36 dB PSNR improvement.
Details
Motivation: The rapid growth of hyperspectral data archives requires efficient storage solutions, but there's a lack of comprehensive analysis on how spectral and spatial compression individually and jointly affect learning-based HSI compression.
Method: HyCASS consists of six modules: spectral encoder, spatial encoder, CR adapter encoder, CR adapter decoder, spatial decoder, and spectral decoder, using convolutional layers and transformer blocks to capture both short-range and long-range redundancies.
Result: Experimental results on three HSI benchmark datasets show HyCASS outperforms existing learning-based compression models by up to 2.36 dB in PSNR.
Conclusion: The study establishes guidelines for effectively balancing spectral and spatial compression across different compression ratios, considering the spatial resolution of hyperspectral images.
Abstract: With the rapid growth of hyperspectral data archives in remote sensing (RS), the need for efficient storage has become essential, driving significant attention toward learning-based hyperspectral image (HSI) compression. However, a comprehensive investigation of the individual and joint effects of spectral and spatial compression on learning-based HSI compression has not been thoroughly examined yet. Conducting such an analysis is crucial for understanding how the exploitation of spectral, spatial, and joint spatio-spectral redundancies affects HSI compression. To address this issue, we propose Adjustable Spatio-Spectral Hyperspectral Image Compression Network (HyCASS), a learning-based model designed for adjustable HSI compression in both spectral and spatial dimensions. HyCASS consists of six main modules: 1) spectral encoder module; 2) spatial encoder module; 3) compression ratio (CR) adapter encoder module; 4) CR adapter decoder module; 5) spatial decoder module; and 6) spectral decoder module. The modules employ convolutional layers and transformer blocks to capture both short-range and long-range redundancies. Experimental results on three HSI benchmark datasets demonstrate the effectiveness of our proposed adjustable model compared to existing learning-based compression models, surpassing the state of the art by up to 2.36 dB in terms of PSNR. Based on our results, we establish a guideline for effectively balancing spectral and spatial compression across different CRs, taking into account the spatial resolution of the HSIs. Our code and pre-trained model weights are publicly available at https://git.tu-berlin.de/rsim/hycass .
[436] Distribution-aware Knowledge Unification and Association for Non-exemplar Lifelong Person Re-identification
Shiben Liu, Mingyue Xu, Huijie Fan, Qiang Wang, Yandong Tang, Zhi Han
Main category: cs.CV
TL;DR: A novel distribution-aware knowledge unification and association (DKUA) framework for lifelong person re-identification that addresses the challenge of balancing old knowledge preservation with new information adaptation through domain-style modeling and cross-domain representation learning.
Details
Motivation: Existing LReID methods using knowledge distillation ignore two crucial aspects: specific distribution awareness and cross-domain unified knowledge learning, which are essential for balancing knowledge preservation and adaptation in lifelong learning scenarios.
Method: Proposes DKUA framework with four components: 1) distribution-aware model for domain-specific representations, 2) adaptive knowledge consolidation for unified cross-domain representation center, 3) unified knowledge association to model inter-domain relationships, and 4) distribution-based knowledge transfer to maintain distribution alignment.
Result: Outperforms existing methods by 7.6%/5.3% average mAP/R@1 improvement on anti-forgetting and generalization capacity respectively, demonstrating superior performance in lifelong person re-identification.
Conclusion: The DKUA framework effectively addresses the lifelong person re-identification challenge by incorporating distribution awareness and cross-domain knowledge unification, achieving significant improvements in both anti-forgetting capability and generalization performance without storing old samples.
Abstract: Lifelong person re-identification (LReID) encounters a key challenge: balancing the preservation of old knowledge with adaptation to new information. Existing LReID methods typically employ knowledge distillation to enforce representation alignment. However, these approaches ignore two crucial aspects: specific distribution awareness and cross-domain unified knowledge learning, both of which are essential for addressing this challenge. To overcome these limitations, we propose a novel distribution-aware knowledge unification and association (DKUA) framework where domain-style modeling is performed for each instance to propagate domain-specific representations, enhancing anti-forgetting and generalization capacity. Specifically, we design a distribution-aware model to transfer instance-level representations of the current domain into the domain-specific representations with the different domain styles, preserving learned knowledge without storing old samples. Next, we propose adaptive knowledge consolidation (AKC) to dynamically generate the unified representation as a cross-domain representation center. To further mitigate forgetting, we develop a unified knowledge association (UKA) mechanism, which explores the unified representation as a bridge to explicitly model inter-domain associations, reducing inter-domain gaps. Finally, distribution-based knowledge transfer (DKT) is proposed to prevent the current domain distribution from deviating from the cross-domain distribution center, improving adaptation capacity. Experimental results show our DKUA outperforms the existing methods by 7.6%/5.3% average mAP/R@1 improvement on anti-forgetting and generalization capacity, respectively. Our code is available at https://github.com/LiuShiBen/DKUA.
[437] Aligning Effective Tokens with Video Anomaly in Large Language Models
Yingxian Chen, Jiahui Liu, Ruidi Fan, Yanwei Li, Chirui Chang, Shizhen Zhao, Wilton W. T. Fok, Xiaojuan Qi, Yik-Chung Wu
Main category: cs.CV
TL;DR: VA-GPT is a novel Multi-modal Large Language Model that effectively summarizes and localizes abnormal events in videos by addressing spatial and temporal sparsity through specialized token selection and generation modules.
Details
Motivation: Current video understanding MLLMs struggle with abnormal events due to spatial and temporal sparsity, where redundant information leads to suboptimal results. There's a need for specialized models that can effectively handle anomaly detection in videos.Method: Proposed VA-GPT with two key modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG) to align tokens between visual encoders and LLMs. Also constructed an instruction-following dataset for fine-tuning and introduced a cross-domain evaluation benchmark.
Result: The proposed method outperforms existing state-of-the-art methods on various benchmarks, demonstrating more accurate responses and interactions for abnormal event analysis.
Conclusion: VA-GPT effectively addresses the challenges of abnormal event detection in videos through specialized spatial and temporal token handling, showing superior performance compared to existing methods.
Abstract: Understanding abnormal events in videos is a vital and challenging task that has garnered significant attention in a wide range of applications. Although current video understanding Multi-modal Large Language Models (MLLMs) are capable of analyzing general videos, they often struggle to handle anomalies due to the spatial and temporal sparsity of abnormal events, where the redundant information always leads to suboptimal outcomes. To address these challenges, exploiting the representation and generalization capabilities of Vision Language Models (VLMs) and Large Language Models (LLMs), we propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG). These modules enable our model to effectively capture and analyze both spatial and temporal information associated with abnormal events, resulting in more accurate responses and interactions. Furthermore, we construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs, and introduce a cross-domain evaluation benchmark based on the XD-Violence dataset. Our proposed method outperforms existing state-of-the-art methods on various benchmarks.
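A minimal sketch of the kind of spatial token selection the SETS module performs, assuming it amounts to keeping the top-scoring visual tokens per frame; the function name, scoring source, and keep ratio below are illustrative assumptions rather than the paper's implementation.

```python
import torch

def select_spatial_tokens(visual_tokens: torch.Tensor,
                          scores: torch.Tensor,
                          keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep only the top-scoring visual tokens per frame.

    visual_tokens: (B, N, D) patch tokens from the visual encoder.
    scores:        (B, N) per-token relevance scores (e.g., anomaly saliency).
    Returns a (B, K, D) tensor with K = floor(keep_ratio * N).
    """
    B, N, D = visual_tokens.shape
    k = max(1, int(keep_ratio * N))
    top_idx = scores.topk(k, dim=1).indices              # (B, K)
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, D)
    return visual_tokens.gather(1, gather_idx)           # (B, K, D)

# Example: 2 frames, 196 patch tokens of dim 768, random saliency scores.
tokens = torch.randn(2, 196, 768)
saliency = torch.rand(2, 196)
kept = select_spatial_tokens(tokens, saliency, keep_ratio=0.25)
print(kept.shape)  # torch.Size([2, 49, 768])
```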
[438] Combinative Matching for Geometric Shape Assembly
Nahyuk Lee, Juhong Min, Junhong Lee, Chunghyun Park, Minsu Cho
Main category: cs.CV
TL;DR: A new shape-matching method called combinative matching that combines interlocking parts for geometric shape assembly by modeling identical surface shapes and opposite volume occupancy.
Details
Motivation: Previous geometric assembly methods rely on finding identical surfaces between parts; in contrast, this work explicitly models the distinct properties of interlocking shapes to reduce local ambiguities and enable robust part combination.Method: Learns to establish correspondences across regions with identical surface shapes but opposite volume occupancy, using equivariant neural networks to estimate shape orientations and align regions in rotation.
Result: Significantly reduces local ambiguities in matching and allows robust combination of parts in assembly, consistently outperforming state-of-the-art methods on geometric assembly benchmarks.
Conclusion: The proposed combinative matching approach effectively addresses interlocking shape assembly by modeling both surface shape identity and volume occupancy inversion, demonstrating superior performance over existing methods.
Abstract: This paper introduces a new shape-matching methodology, combinative matching, to combine interlocking parts for geometric shape assembly. Previous methods for geometric assembly typically rely on aligning parts by finding identical surfaces between the parts as in conventional shape matching and registration. In contrast, we explicitly model two distinct properties of interlocking shapes: ‘identical surface shape’ and ‘opposite volume occupancy.’ Our method thus learns to establish correspondences across regions where their surface shapes appear identical but their volumes occupy the inverted space to each other. To facilitate this process, we also learn to align regions in rotation by estimating their shape orientations via equivariant neural networks. The proposed approach significantly reduces local ambiguities in matching and allows a robust combination of parts in assembly. Experimental results on geometric assembly benchmarks demonstrate the efficacy of our method, consistently outperforming the state of the art. Project page: https://nahyuklee.github.io/cmnet.
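As a rough illustration of the combinative-matching idea, the sketch below scores a candidate region pair by rewarding identical surface descriptors and complementary (inverted) volume occupancy; the descriptors, occupancy sampling, and weighting are assumptions, not the paper's learned formulation.

```python
import numpy as np

def combinative_score(surf_a, surf_b, occ_a, occ_b, alpha=0.5):
    """Score a candidate region pair for interlocking assembly.

    surf_a, surf_b: (D,) surface-shape descriptors (higher cosine = more similar).
    occ_a, occ_b:   (M,) local volume-occupancy indicators in [0, 1]
                    sampled on a shared grid around each region.
    A good interlocking match has similar surfaces but inverted occupancy,
    i.e. occ_b close to 1 - occ_a.
    """
    surf_sim = surf_a @ surf_b / (np.linalg.norm(surf_a) * np.linalg.norm(surf_b) + 1e-8)
    occ_complement = 1.0 - np.mean(np.abs(occ_a - (1.0 - occ_b)))
    return alpha * surf_sim + (1.0 - alpha) * occ_complement

# Toy example: identical surface descriptors, perfectly complementary occupancy.
d = np.random.randn(64)
occ = np.random.rand(32)
print(combinative_score(d, d, occ, 1.0 - occ))  # close to 1.0
```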
[439] AIM: Adaptive Intra-Network Modulation for Balanced Multimodal Learning
Shu Shen, C. L. Philip Chen, Tong Zhang
Main category: cs.CV
TL;DR: Proposes Adaptive Intra-Network Modulation (AIM) to address optimization bias in imbalanced multimodal learning by decoupling under-optimized parameters and adaptively adjusting modulation strength across network depths.
Details
Motivation: Existing methods for imbalanced multimodal learning typically hinder dominant modalities to promote weaker ones, which affects overall performance. The paper identifies optimization bias within networks as an overlooked problem.Method: AIM decouples dominant modality’s under-optimized parameters into Auxiliary Blocks and encourages reliance on these degraded blocks during joint training. It also assesses modality imbalance across network depths and adaptively adjusts modulation strength at each depth.
Result: AIM outperforms state-of-the-art imbalanced modality learning methods across multiple benchmarks and shows strong generalizability across different backbones, fusion strategies, and optimizers.
Conclusion: AIM achieves balanced multimodal learning without hindering either dominant or weak modalities for the first time, effectively addressing optimization bias in imbalanced multimodal learning.
Abstract: Multimodal learning has significantly enhanced machine learning performance but still faces numerous challenges and limitations. Imbalanced multimodal learning is one of the problems extensively studied in recent works and is typically mitigated by modulating the learning of each modality. However, we find that these methods typically hinder the dominant modality’s learning to promote weaker modalities, which affects overall multimodal performance. We analyze the cause of this issue and highlight a commonly overlooked problem: optimization bias within networks. To address this, we propose Adaptive Intra-Network Modulation (AIM) to improve balanced modality learning. AIM accounts for differences in optimization state across parameters and depths within the network during modulation, achieving balanced multimodal learning without hindering either dominant or weak modalities for the first time. Specifically, AIM decouples the dominant modality’s under-optimized parameters into Auxiliary Blocks and encourages reliance on these performance-degraded blocks for joint training with weaker modalities. This approach effectively prevents suppression of weaker modalities while enabling targeted optimization of under-optimized parameters to improve the dominant modality. Additionally, AIM assesses modality imbalance level across network depths and adaptively adjusts modulation strength at each depth. Experimental results demonstrate that AIM outperforms state-of-the-art imbalanced modality learning methods across multiple benchmarks and exhibits strong generalizability across different backbones, fusion strategies, and optimizers.
[440] Multi-Focused Video Group Activities Hashing
Zhongmiao Qi, Yan Jiang, Bolin Zhang, Lijun Guo, Chong Wang, Qiangbo Qian
Main category: cs.CV
TL;DR: The paper proposes STVH and M-STVH methods for video hashing that can retrieve group activities at activity granularity rather than just entire videos, with M-STVH additionally handling both activity semantics and object visual features.
Details
Motivation: With explosive growth of video data in complex scenarios, there's an urgent need for quickly retrieving group activities at activity granularity rather than just entire videos, and handling different feature requirements in real-life retrieval scenarios.Method: STVH: spatiotemporal interleaved video hashing that simultaneously models individual object dynamics and group interactions. M-STVH: enhanced multi-focused version with hierarchical feature integration through multi-focused representation learning to jointly focus on activity semantics and object visual features.
Result: Both STVH and M-STVH achieved excellent results in comparative experiments on publicly available datasets.
Conclusion: The proposed STVH and M-STVH methods effectively solve the problem of retrieving group activities at activity granularity and handling multiple feature requirements in video retrieval scenarios.
Abstract: With the explosive growth of video data in various complex scenarios, quickly retrieving group activities has become an urgent problem. However, most existing methods can only retrieve entire videos, not activities at the activity granularity. To solve this problem, we propose a new STVH (spatiotemporal interleaved video hashing) technique for the first time. Through a unified framework, STVH simultaneously models individual object dynamics and group interactions, capturing the spatiotemporal evolution of both group visual features and positional features. Moreover, real-life video retrieval scenarios may sometimes require activity features and at other times require the visual features of objects. We therefore further propose a novel M-STVH (multi-focused spatiotemporal video hashing) as an enhanced version to handle this difficult task. The enhanced method incorporates hierarchical feature integration through multi-focused representation learning, allowing the model to jointly focus on activity semantic features and object visual features. We conducted comparative experiments on publicly available datasets, and both STVH and M-STVH achieve excellent results.
[441] EndoGMDE: Generalizable Monocular Depth Estimation with Mixture of Low-Rank Experts for Diverse Endoscopic Scenes
Liangjing Shao, Chenkang Du, Benshuang Chen, Xueli Liu, Xinrong Chen
Main category: cs.CV
TL;DR: A novel self-supervised framework for monocular depth estimation in endoscopic scenes using block-wise mixture of dynamic low-rank experts to handle diverse illumination and tissue features, achieving state-of-the-art performance on multiple datasets.
Details
Motivation: To address challenges in endoscopic depth estimation caused by varied illumination conditions and diverse tissue features, enabling accurate 3D scene perception for minimally invasive surgery.Method: Proposes a block-wise mixture of dynamic low-rank experts that adaptively selects different experts based on input features, with a self-supervised training framework to handle brightness inconsistency and reflectance interference.
Result: Outperforms state-of-the-art methods on SCARED and SimCol datasets, achieves best generalization on zero-shot depth estimation across C3VD, Hamlyn and SERV-CT datasets, and demonstrates strong performance in 3D reconstruction and ego-motion estimation.
Conclusion: The method enables accurate endoscopic depth estimation for minimally invasive measurement and surgery, with promising generalization capabilities across diverse endoscopic scenarios.
Abstract: Self-supervised monocular depth estimation is a significant task for low-cost and efficient 3D scene perception and measurement in endoscopy. However, the variety of illumination conditions and scene features remains the primary challenge for depth estimation in endoscopic scenes. In this work, a novel self-supervised framework is proposed for monocular depth estimation in diverse endoscopic scenes. Firstly, considering the diverse features in endoscopic scenes with different tissues, a novel block-wise mixture of dynamic low-rank experts is proposed to efficiently fine-tune the foundation model for endoscopic depth estimation. In the proposed module, based on the input feature, different experts with a small number of trainable parameters are adaptively selected for weighted inference, drawn from low-rank experts that are allocated based on the generalization of each block. Moreover, a novel self-supervised training framework is proposed to jointly cope with brightness inconsistency and reflectance interference. The proposed method outperforms state-of-the-art works on the SCARED and SimCol datasets. Furthermore, the proposed network also achieves the best generalization in zero-shot depth estimation on the C3VD, Hamlyn and SERV-CT datasets. The outstanding performance of our model is further demonstrated with 3D reconstruction and ego-motion estimation. The proposed method could contribute to accurate endoscopy for minimally invasive measurement and surgery. The evaluation codes will be released upon acceptance, while the demo videos can be found at: https://endo-gmde.netlify.app/.
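A minimal sketch of a block-wise mixture of low-rank experts in the LoRA style, with a per-input gate selecting a few experts on top of a frozen base layer; the expert allocation, gating, and dimensions are illustrative assumptions rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class LowRankExpertMixture(nn.Module):
    """Frozen base linear layer plus a gated mixture of low-rank experts."""

    def __init__(self, dim: int, rank: int = 4, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)   # keep foundation weights frozen
        self.base.bias.requires_grad_(False)
        self.down = nn.Parameter(torch.randn(num_experts, dim, rank) * 0.01)
        self.up = nn.Parameter(torch.zeros(num_experts, rank, dim))
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, dim)
        weights = torch.softmax(self.gate(x), dim=-1)       # (B, E)
        topw, topi = weights.topk(self.top_k, dim=-1)       # a few experts per input
        out = self.base(x)
        for slot in range(self.top_k):
            idx = topi[:, slot]                              # (B,) expert indices
            delta = torch.einsum('bd,bdr->br', x, self.down[idx])
            delta = torch.einsum('br,brd->bd', delta, self.up[idx])
            out = out + topw[:, slot:slot + 1] * delta
        return out

x = torch.randn(8, 256)
print(LowRankExpertMixture(256)(x).shape)  # torch.Size([8, 256])
```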
[442] 3DViT-GAT: A Unified Atlas-Based 3D Vision Transformer and Graph Learning Framework for Major Depressive Disorder Detection Using Structural MRI Data
Nojod M. Alotaibi, Areej M. Alhothali, Manar S. Ali
Main category: cs.CV
TL;DR: A unified pipeline combining Vision Transformers and Graph Neural Networks for automated MDD detection from sMRI data, achieving 78.98% accuracy.
Details
Motivation: Existing MDD detection methods using sMRI and deep learning are limited by voxel-level features or handcrafted regional representations, failing to capture complex brain patterns effectively.Method: Uses Vision Transformers to extract 3D region embeddings from sMRI data and Graph Neural Networks for classification, with two region definition strategies: atlas-based (predefined brain atlases) and cube-based (3D patches). Cosine similarity graphs model interregional relationships.
Result: Achieved 78.98% accuracy, 76.54% sensitivity, 81.58% specificity, 81.58% precision, and 78.98% F1-score using stratified 10-fold cross-validation on REST-meta-MDD dataset.
Conclusion: Atlas-based models consistently outperformed cube-based approach, demonstrating the importance of domain-specific anatomical priors for effective MDD detection.
Abstract: Major depressive disorder (MDD) is a prevalent mental health condition that negatively impacts both individual well-being and global public health. Automated detection of MDD using structural magnetic resonance imaging (sMRI) and deep learning (DL) methods holds increasing promise for improving diagnostic accuracy and enabling early intervention. Most existing methods employ either voxel-level features or handcrafted regional representations built from predefined brain atlases, limiting their ability to capture complex brain patterns. This paper develops a unified pipeline that utilizes Vision Transformers (ViTs) for extracting 3D region embeddings from sMRI data and a Graph Neural Network (GNN) for classification. We explore two strategies for defining regions: (1) an atlas-based approach using predefined structural and functional brain atlases, and (2) a cube-based method in which ViTs are trained directly to identify regions from uniformly extracted 3D patches. Further, cosine similarity graphs are generated to model interregional relationships and guide GNN-based classification. Extensive experiments were conducted using the REST-meta-MDD dataset to demonstrate the effectiveness of our model. With stratified 10-fold cross-validation, the best model obtained 78.98% accuracy, 76.54% sensitivity, 81.58% specificity, 81.58% precision, and 78.98% F1-score. Further, atlas-based models consistently outperformed the cube-based approach, highlighting the importance of using domain-specific anatomical priors for MDD detection.
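The cosine-similarity graph construction described above can be sketched as follows: normalize region embeddings, compute pairwise cosine similarity, and keep the top-k neighbors per region; the top-k sparsification and region count below are assumptions for illustration.

```python
import torch

def build_cosine_graph(region_emb: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Build a sparse adjacency matrix from region embeddings.

    region_emb: (R, D) one embedding per brain region (e.g., from the ViT encoder).
    Returns an (R, R) adjacency with each node connected to its k most
    cosine-similar regions (self-similarity excluded).
    """
    emb = torch.nn.functional.normalize(region_emb, dim=-1)
    sim = emb @ emb.t()                        # (R, R) cosine similarities
    sim.fill_diagonal_(float('-inf'))          # exclude self-loops from the top-k
    topk = sim.topk(k, dim=-1).indices
    adj = torch.zeros_like(sim)
    adj.scatter_(1, topk, 1.0)
    adj = torch.maximum(adj, adj.t())          # symmetrize for an undirected graph
    return adj

regions = torch.randn(116, 256)   # e.g., 116 atlas regions, 256-dim embeddings
adj = build_cosine_graph(regions, k=8)
print(adj.shape, int(adj.sum().item()))
```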
[443] Bidirectional Feature-aligned Motion Transformation for Efficient Dynamic Point Cloud Compression
Xuan Deng, Xingtao Wang, Xiandong Meng, Longguang Wang, Tiange Zhang, Xiaopeng Fan, Debin Zhao
Main category: cs.CV
TL;DR: Proposes Bi-FMT framework for dynamic point cloud compression using bidirectional feature alignment and cross-transformer refinement to improve motion modeling and compression efficiency.
Details
Motivation: Existing dynamic point cloud compression methods rely on explicit motion estimation that fails to capture complex dynamics and adequately exploit temporal correlations due to irregular point cloud structure and local variations.Method: Bidirectional Feature-aligned Motion Transformation (Bi-FMT) framework that implicitly models motion in feature space, aligns features across past/future frames, and uses Cross-Transformer Refinement module to enhance local consistency and spatial details.
Result: Achieves BD-Rate reductions of 20% over D-DPCC and 9.4% over AdaDPCC, with improved compression efficiency and runtime performance.
Conclusion: Bi-FMT provides an effective implicit motion modeling approach for dynamic point cloud compression through bidirectional feature alignment and transformer-based refinement, enabling frame-level parallel compression.
Abstract: Efficient dynamic point cloud compression (DPCC) critically depends on accurate motion estimation and compensation. However, the inherently irregular structure and substantial local variations of point clouds make this task highly challenging. Existing approaches typically rely on explicit motion estimation, whose encoded motion vectors often fail to capture complex dynamics and inadequately exploit temporal correlations. To address these limitations, we propose a Bidirectional Feature-aligned Motion Transformation (Bi-FMT) framework that implicitly models motion in the feature space. Bi-FMT aligns features across both past and future frames to produce temporally consistent latent representations, which serve as predictive context in a conditional coding pipeline, forming a unified "Motion + Conditional" representation. Built upon this bidirectional feature alignment, we introduce a Cross-Transformer Refinement module (CTR) at the decoder side to adaptively refine locally aligned features. By modeling cross-frame dependencies with vector attention, CTR enhances local consistency and restores fine-grained spatial details that are often lost during motion alignment. Moreover, we design a Random Access (RA) reference strategy that treats the bidirectionally aligned features as conditional context, enabling frame-level parallel compression and eliminating sequential encoding. Extensive experiments demonstrate that Bi-FMT surpasses D-DPCC and AdaDPCC in both compression efficiency and runtime, achieving BD-Rate reductions of 20% (D1) and 9.4% (D1), respectively.
[444] Lattice Boltzmann Model for Learning Real-World Pixel Dynamicity
Guangze Zheng, Shijie Lin, Haobo Zuo, Si Si, Ming-Shan Wang, Changhong Fu, Jia Pan
Main category: cs.CV
TL;DR: LBM uses lattice Boltzmann model to learn pixel dynamics for visual tracking through collision-streaming processes, achieving real-time performance on various benchmarks.
Details
Motivation: To develop a method that can efficiently learn real-world pixel dynamicity for visual tracking tasks in an online and real-time manner.Method: Decomposes visual representations into dynamic pixel lattices and solves pixel motion states through predict-update network with collision-streaming processes. Predict stage handles lattice collisions and streaming, while update stage rectifies distributions with online visual representations.
Result: Demonstrates practical applicability in online real-time tracking, with comprehensive evaluations on TAP-Vid, RoboTAP, TAO, BFT, and OVT-B benchmarks validating efficiency and real-world practicality.
Conclusion: LBM provides an efficient approach for visual tracking that can adapt to real-world tasks through its lattice-based modeling of pixel dynamics.
Abstract: This work proposes the Lattice Boltzmann Model (LBM) to learn real-world pixel dynamicity for visual tracking. LBM decomposes visual representations into dynamic pixel lattices and solves pixel motion states through collision-streaming processes. Specifically, the high-dimensional distribution of the target pixels is acquired through a multilayer predict-update network to estimate the pixel positions and visibility. The predict stage formulates lattice collisions among the spatial neighborhood of target pixels and develops lattice streaming within the temporal visual context. The update stage rectifies the pixel distributions with online visual representations. Compared with existing methods, LBM demonstrates practical applicability in an online and real-time manner, which can efficiently adapt to real-world visual tracking tasks. Comprehensive evaluations of real-world point tracking benchmarks such as TAP-Vid and RoboTAP validate LBM’s efficiency. A general evaluation of large-scale open-world object tracking benchmarks such as TAO, BFT, and OVT-B further demonstrates LBM’s real-world practicality.
[445] A Comprehensive Evaluation of YOLO-based Deer Detection Performance on Edge Devices
Bishal Adhikari, Jiajia Li, Eric S. Michel, Jacob Dykes, Te-Ming Paul Tseng, Mary Love Tagert, Dong Chen
Main category: cs.CV
TL;DR: This paper addresses deer intrusion in agriculture by evaluating YOLO models for real-time deer detection on edge devices, introducing a new dataset and finding optimal models for deployment.
Details
Motivation: Traditional deer mitigation strategies are inadequate, causing significant economic losses in agriculture, creating a need for intelligent autonomous detection systems.Method: Created a curated dataset of 3,095 annotated deer images and evaluated 12 YOLO model variants (v8-v11) on Raspberry Pi 5 and NVIDIA Jetson AGX Xavier edge platforms.
Result: Real-time detection not feasible on Raspberry Pi without optimization, but NVIDIA Jetson achieved >30 FPS with ’s’ and ’n’ series models. YOLOv11n, YOLOv8s, and YOLOv9s provided best balance of accuracy (AP>0.85) and efficiency (<34ms inference time).
Conclusion: Smaller advanced YOLO models are viable for real-world deer detection on GPU-accelerated edge devices, addressing the gap in practical deer detection systems.
Abstract: The escalating economic losses in agriculture due to deer intrusion, estimated to be in the hundreds of millions of dollars annually in the U.S., highlight the inadequacy of traditional mitigation strategies such as hunting, fencing, use of repellents, and scare tactics. This underscores a critical need for intelligent, autonomous solutions capable of real-time deer detection and deterrence. However, progress in this field is impeded by a significant gap in the literature, mainly the lack of a domain-specific, practical dataset and limited study on the viability of deer detection systems on edge devices. To address this gap, this study presents a comprehensive evaluation of state-of-the-art deep learning models for deer detection in challenging real-world scenarios. We introduce a curated, publicly available dataset of 3,095 annotated images with bounding box annotations of deer. Then, we provide an extensive comparative analysis of 12 model variants across four recent YOLO architectures (v8 to v11). Finally, we evaluated their performance on two representative edge computing platforms: the CPU-based Raspberry Pi 5 and the GPU-accelerated NVIDIA Jetson AGX Xavier, to assess feasibility for real-world field deployment. Results show that real-time detection is not feasible on the Raspberry Pi without hardware-specific model optimization, while the NVIDIA Jetson provides greater than 30 frames per second (FPS) with ’s’ and ’n’ series models. This study also reveals that smaller, architecturally advanced models such as YOLOv11n, YOLOv8s, and YOLOv9s offer the optimal balance of high accuracy (Average Precision (AP) > 0.85) and computational efficiency (inference time < 34 milliseconds).
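A hedged sketch of how per-image latency and FPS might be measured for YOLO variants on an edge device, assuming the ultralytics package; the weight-file names and dummy input are placeholders, not the study's exact protocol.

```python
import time
import numpy as np
from ultralytics import YOLO  # assumes the ultralytics package is installed

def benchmark(model_name: str, image, n_warmup: int = 10, n_runs: int = 100):
    """Measure mean inference latency (ms) and FPS for one YOLO variant."""
    model = YOLO(model_name)
    for _ in range(n_warmup):                  # warm up caches / CUDA kernels
        model.predict(image, verbose=False)
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        model.predict(image, verbose=False)
        times.append(time.perf_counter() - start)
    mean_ms = 1000 * float(np.mean(times))
    return mean_ms, 1000.0 / mean_ms

# Placeholder inputs: a random 640x640 image and a few small model variants.
dummy = np.random.randint(0, 255, (640, 640, 3), dtype=np.uint8)
for name in ["yolov8n.pt", "yolov8s.pt", "yolo11n.pt"]:
    ms, fps = benchmark(name, dummy)
    print(f"{name}: {ms:.1f} ms/image, {fps:.1f} FPS")
```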
[446] Does FLUX Already Know How to Perform Physically Plausible Image Composition?
Shilin Lu, Zhuming Lian, Zihan Zhou, Shaocong Zhang, Chen Zhao, Adams Wai-Kin Kong
Main category: cs.CV
TL;DR: SHINE is a training-free framework for high-quality image composition that addresses complex lighting challenges and high-resolution inputs using pretrained diffusion models without latent inversion or attention surgery.
Details
Motivation: Existing image composition models struggle with complex lighting conditions (shadows, reflections) and diverse high-resolution inputs. While modern diffusion models encode physical and resolution priors, they lack frameworks to utilize these without problematic techniques like latent inversion that lock object poses.Method: SHINE uses manifold-steered anchor loss with pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background. It also employs degradation-suppression guidance and adaptive background blending to eliminate low-quality outputs and visible seams.
Result: Experiments on ComplexCompo benchmark (with diverse resolutions and challenging conditions) and DreamEditBench show state-of-the-art performance on standard metrics (DINOv2) and human-aligned scores (DreamSim, ImageReward, VisionReward).
Conclusion: SHINE provides an effective training-free solution for high-fidelity image composition that handles complex lighting and high-resolution scenarios better than existing methods, with publicly available code and benchmark.
Abstract: Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.
[447] Efficiency vs. Efficacy: Assessing the Compression Ratio-Dice Score Relationship through a Simple Benchmarking Framework for Cerebrovascular 3D Segmentation
Shimaa Elbana, Ahmad Kamal, Shahd Ahmed Ali, Ahmad Al-Kabbany
Main category: cs.CV
TL;DR: ZFP compression achieves up to 22.89:1 data reduction for 3D medical imaging while maintaining high cerebrovascular segmentation quality (Dice 0.87656 vs 0.8774 baseline).
Details
Motivation: Address the challenges of large 3D medical imaging datasets that hinder collaborative research and transferability.Method: Apply ZFP compression in error tolerance and fixed-rate modes to a large 3D medical dataset with ground-truth vascular segmentations, comparing segmentation quality on compressed vs uncompressed volumes.
Result: ZFP achieves substantial data reduction (up to 22.89:1 ratio) while maintaining high fidelity with mean Dice coefficient of 0.87656 compared to baseline 0.8774.
Conclusion: ZFP is a viable and powerful tool for enabling more efficient and accessible research on large-scale medical datasets, fostering broader collaboration.
Abstract: The increasing size and complexity of medical imaging datasets, particularly in 3D formats, present significant barriers to collaborative research and transferability. This study investigates whether the ZFP compression technique can mitigate these challenges without compromising the performance of automated cerebrovascular segmentation, a critical first step in intracranial aneurysm detection. We apply ZFP in both its error-tolerance and fixed-rate modes to a large-scale 3D medical dataset, one of the most recent in the literature, containing ground-truth vascular segmentations. The segmentation quality on the compressed volumes is rigorously compared to the uncompressed baseline (Dice approximately equals 0.8774). Our findings reveal that ZFP can achieve substantial data reduction (up to a 22.89:1 ratio in error-tolerance mode) while maintaining a high degree of fidelity, with the mean Dice coefficient remaining high at 0.87656. These results demonstrate that ZFP is a viable and powerful tool for enabling more efficient and accessible research on large-scale medical datasets, fostering broader collaboration across the community.
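A minimal sketch of the compression-versus-segmentation comparison using the zfpy bindings in error-tolerance mode, with a simple threshold standing in for the actual segmentation model; the tolerance value and the random volume are placeholders.

```python
import numpy as np
import zfpy  # Python bindings for the ZFP compressor

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum() + 1e-8)

volume = np.random.rand(128, 128, 128).astype(np.float64)   # stand-in 3D scan

# Error-tolerance (fixed-accuracy) mode; the tolerance controls the trade-off.
compressed = zfpy.compress_numpy(volume, tolerance=1e-3)
restored = zfpy.decompress_numpy(compressed)

ratio = volume.nbytes / len(compressed)
print(f"compression ratio: {ratio:.2f}:1")

# Stand-in "segmentation": a fixed threshold applied before and after compression.
mask_orig = volume > 0.9
mask_comp = restored > 0.9
print(f"Dice between masks: {dice(mask_orig, mask_comp):.5f}")
```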
[448] Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning
Yaxin Hou, Bo Han, Yuheng Jia, Hui Liu, Junhui Hou
Main category: cs.CV
TL;DR: CPG is a framework for long-tailed semi-supervised learning that handles unknown unlabeled data distributions by dynamically generating pseudo-labels to create a known labeled data distribution, using controllable filtering and logit adjustment.
Details
Motivation: Existing methods assume unlabeled data follows predefined distributions, but real-world unlabeled data distributions are generally unknown and arbitrary, creating challenges for reliable pseudo-labeling.Method: Uses a controllable self-reinforcing optimization cycle: dynamic controllable filtering for pseudo-label selection, Bayes-optimal classifier with logit adjustment, class-aware adaptive augmentation for minority classes, and auxiliary branch for full data utilization.
Result: Achieves consistent improvements across benchmark datasets, surpassing state-of-the-art methods by up to 15.97% in accuracy.
Conclusion: CPG effectively handles unknown unlabeled data distributions through its controllable pseudo-label generation framework, with theoretical guarantees and practical performance gains.
Abstract: Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to 15.97% in accuracy. The code is available at https://github.com/yaxinhou/CPG.
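The Bayes-optimal classifier via logit adjustment mentioned in the method can be illustrated with standard post-hoc adjustment by the log class prior of the current labeled set; the temperature tau and the exact point in training where CPG applies it are assumptions.

```python
import torch

def logit_adjusted_probs(logits: torch.Tensor,
                         class_counts: torch.Tensor,
                         tau: float = 1.0) -> torch.Tensor:
    """Adjust logits by the log class prior of the labeled set.

    logits:       (B, C) raw classifier outputs.
    class_counts: (C,) counts per class in the current labeled dataset.
    Subtracting tau * log(prior) counteracts the long-tailed training prior.
    """
    prior = class_counts.float() / class_counts.sum()
    adjusted = logits - tau * torch.log(prior + 1e-12)
    return torch.softmax(adjusted, dim=-1)

logits = torch.randn(4, 10)
counts = torch.tensor([500, 300, 200, 120, 80, 50, 30, 20, 10, 5])
probs = logit_adjusted_probs(logits, counts)
print(probs.argmax(dim=-1))
```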
[449] Detailed Aerial Mapping of Photovoltaic Power Plants Through Semantically Significant Keypoints
Viktor Kozák, Jan Chudoba, Libor Přeučil
Main category: cs.CV
TL;DR: A novel method for automated PV power plant mapping using aerial images that creates detailed 3D models down to individual module level without relying on third-party data.
Details
Motivation: Accurate PV power plant models are essential for optimal operation but are often unavailable, requiring an automated mapping approach that eliminates dependency on external data sources.Method: Visual segmentation of PV modules in aerial images combined with structural inference using layout keypoints to assign modules to benches, rows, and columns, then merging detections from multiple images while maintaining structural integrity.
Result: Successfully tested on two power plants, producing compact georeferenced 3D models with semantic structures suitable for maintenance applications.
Conclusion: The approach enables automated, detailed PV power plant mapping at module-level resolution using only aerial imagery, providing a practical solution for power plant maintenance without third-party data dependencies.
Abstract: An accurate and up-to-date model of a photovoltaic (PV) power plant is essential for its optimal operation and maintenance. However, such a model may not be easily available. This work introduces a novel approach for PV power plant mapping based on aerial overview images. It enables the automation of the mapping process while removing the reliance on third-party data. The presented mapping method takes advantage of the structural layout of the power plants to achieve detailed modeling down to the level of individual PV modules. The approach relies on visual segmentation of PV modules in overview images and the inference of structural information in each image, assigning modules to individual benches, rows, and columns. We identify visual keypoints related to the layout and use these to merge detections from multiple images while maintaining their structural integrity. The presented method was experimentally verified and evaluated on two different power plants. The final fusion of 3D positions and semantic structures results in a compact georeferenced model suitable for power plant maintenance.
[450] OneVision: An End-to-End Generative Framework for Multi-view E-commerce Vision Search
Zexin Zheng, Huangyu Dai, Lingtao Mao, Xinyu Sun, Zihan Liang, Ben Chen, Yuqing Ding, Chenyi Lei, Wenwu Ou, Han Li, Kun Gai
Main category: cs.CV
TL;DR: OneVision is an end-to-end generative framework that replaces traditional multi-stage cascading architecture for vision search, using vision-aligned residual quantization to align multi-view representations and improve both efficiency and conversion rates.
Details
Motivation: Traditional multi-stage cascading architecture (MCA) for vision search suffers from multi-view representation discrepancies between query and product images, making it difficult to achieve optimal user experience and conversion rates simultaneously.Method: Proposes OneVision framework with VRQ (vision-aligned residual quantization) encoding to align different object representations across viewpoints while preserving product distinctiveness, and uses multi-stage semantic alignment to maintain visual similarity while incorporating user-specific information.
Result: Offline: Performs on par with online MCA while improving inference efficiency by 21% through dynamic pruning. Online A/B tests: +2.15% item CTR, +2.27% CVR, and +3.12% order volume.
Conclusion: A semantic ID centric generative architecture can successfully unify retrieval and personalization while simplifying the serving pathway, achieving significant improvements in both efficiency and conversion metrics.
Abstract: Traditional vision search, similar to search and recommendation systems, follows the multi-stage cascading architecture (MCA) paradigm to balance efficiency and conversion. Specifically, the query image undergoes feature extraction, recall, pre-ranking, and ranking stages, ultimately presenting the user with semantically similar products that meet their preferences. However, the multi-view representation discrepancy of the same object between the query and product images, together with the conflicting optimization objectives across these stages, makes it difficult to achieve Pareto optimality in both user experience and conversion. In this paper, an end-to-end generative framework, OneVision, is proposed to address these problems. OneVision builds on VRQ, a vision-aligned residual quantization encoding, which can align the vastly different representations of an object across multiple viewpoints while preserving the distinctive features of each product as much as possible. Then a multi-stage semantic alignment scheme is adopted to maintain strong visual similarity priors while effectively incorporating user-specific information for personalized preference generation. In offline evaluations, OneVision performs on par with online MCA, while improving inference efficiency by 21% through dynamic pruning. In A/B tests, it achieves significant online improvements: +2.15% item CTR, +2.27% CVR, and +3.12% order volume. These results demonstrate that a semantic ID centric, generative architecture can unify retrieval and personalization while simplifying the serving pathway.
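A plain residual-quantization sketch showing how an embedding becomes a sequence of codeword indices (a semantic ID) across several codebook levels; the vision-aligned training that makes VRQ viewpoint-robust is not reproduced here, and the codebook sizes are arbitrary.

```python
import numpy as np

def residual_quantize(x: np.ndarray, codebooks: list):
    """Encode a vector as a sequence of codeword indices (one per codebook level).

    x:         (D,) embedding to quantize.
    codebooks: list of (K, D) arrays; each level quantizes the remaining residual.
    Returns the index sequence (the "semantic ID") and the reconstruction.
    """
    residual = x.copy()
    ids, recon = [], np.zeros_like(x)
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        ids.append(idx)
        recon += cb[idx]
        residual = residual - cb[idx]
    return ids, recon

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(4)]   # 4 levels, 256 codes each
x = rng.normal(size=64)
ids, recon = residual_quantize(x, codebooks)
print(ids, np.linalg.norm(x - recon))
```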
[451] Dropping the D: RGB-D SLAM Without the Depth Sensor
Mert Kiray, Alican Karaomer, Benjamin Busam
Main category: cs.CV
TL;DR: DropD-SLAM is a real-time monocular SLAM system that achieves RGB-D-level accuracy using pretrained vision modules instead of depth sensors, matching state-of-the-art RGB-D methods while running at 22 FPS.
Details
Motivation: To create a simpler and more cost-effective SLAM system that doesn't rely on expensive depth sensors but still achieves RGB-D-level accuracy, leveraging modern pretrained vision models.Method: Uses three pretrained vision modules: monocular metric depth estimator, learned keypoint detector, and instance segmentation network. Dynamic objects are suppressed with dilated instance masks, static keypoints get predicted depth values and are backprojected into 3D, then processed by an unmodified RGB-D SLAM backend.
Result: Achieves 7.4 cm mean ATE on static sequences and 1.8 cm on dynamic sequences on TUM RGB-D benchmark, matching or surpassing state-of-the-art RGB-D methods while operating at 22 FPS on a single GPU.
Conclusion: Modern pretrained vision models can effectively replace active depth sensors as reliable, real-time sources of metric scale, enabling simpler and more cost-effective SLAM systems.
Abstract: We present DropD-SLAM, a real-time monocular SLAM system that achieves RGB-D-level accuracy without relying on depth sensors. The system replaces active depth input with three pretrained vision modules: a monocular metric depth estimator, a learned keypoint detector, and an instance segmentation network. Dynamic objects are suppressed using dilated instance masks, while static keypoints are assigned predicted depth values and backprojected into 3D to form metrically scaled features. These are processed by an unmodified RGB-D SLAM back end for tracking and mapping. On the TUM RGB-D benchmark, DropD-SLAM attains 7.4 cm mean ATE on static sequences and 1.8 cm on dynamic sequences, matching or surpassing state-of-the-art RGB-D methods while operating at 22 FPS on a single GPU. These results suggest that modern pretrained vision models can replace active depth sensors as reliable, real-time sources of metric scale, marking a step toward simpler and more cost-effective SLAM systems.
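The backprojection step, in which static keypoints receive predicted metric depth and are lifted into 3D with the pinhole model, can be sketched as below; the intrinsics, dynamic-mask convention, and variable names are illustrative rather than taken from the paper.

```python
import numpy as np

def backproject_keypoints(kpts, depth, K, dynamic_mask=None):
    """Lift 2D keypoints to metric 3D points with a pinhole camera model.

    kpts:         (N, 2) pixel coordinates (u, v).
    depth:        (H, W) predicted metric depth map.
    K:            (3, 3) camera intrinsics.
    dynamic_mask: optional (H, W) boolean mask of dynamic objects to drop.
    """
    u, v = kpts[:, 0].astype(int), kpts[:, 1].astype(int)
    keep = np.ones(len(kpts), dtype=bool)
    if dynamic_mask is not None:
        keep &= ~dynamic_mask[v, u]            # discard keypoints on moving objects
    z = depth[v, u]
    keep &= z > 0
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts3d = np.stack([x, y, z], axis=1)
    return pts3d[keep], keep

K = np.array([[525.0, 0, 319.5], [0, 525.0, 239.5], [0, 0, 1]])  # TUM-like intrinsics
depth = np.full((480, 640), 2.0)
kpts = np.array([[100.0, 50.0], [320.0, 240.0]])
pts, keep = backproject_keypoints(kpts, depth, K)
print(pts)
```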
[452] Risk-adaptive Activation Steering for Safe Multimodal Large Language Models
Jonghyun Park, Minhyuk Seo, Jonghyun Choi
Main category: cs.CV
TL;DR: RAS (Risk-adaptive Activation Steering) is a method that reformulates queries to enhance attention on safety-critical image regions, enabling accurate risk assessment and adaptive activation steering for safe responses without iterative adjustments.
Details
Motivation: Current AI models struggle with multimodal safety - they often fail to detect harmful intent in images while maintaining helpful responses to benign queries. Training-based safety alignment is costly, while inference-time methods cause excessive refusals and slow inference.Method: Proposed RAS reformulates queries to strengthen cross-modal attention to safety-critical image regions for accurate risk assessment, then adaptively steers activations to generate safe responses without iterative output adjustments.
Result: Extensive experiments show RAS significantly reduces attack success rates, preserves general task performance, and improves inference speed compared to prior inference-time defenses across multiple multimodal safety and utility benchmarks.
Conclusion: RAS provides an effective solution for multimodal safety that overcomes limitations of both training-based and existing inference-time alignment methods, achieving better safety, utility, and efficiency.
Abstract: One of the key challenges of modern AI models is ensuring that they provide helpful responses to benign queries while refusing malicious ones. But often, the models are vulnerable to multimodal queries with harmful intent embedded in images. One approach for safety alignment is training with extensive safety datasets at the significant costs in both dataset curation and training. Inference-time alignment mitigates these costs, but introduces two drawbacks: excessive refusals from misclassified benign queries and slower inference speed due to iterative output adjustments. To overcome these limitations, we propose to reformulate queries to strengthen cross-modal attention to safety-critical image regions, enabling accurate risk assessment at the query level. Using the assessed risk, it adaptively steers activations to generate responses that are safe and helpful without overhead from iterative output adjustments. We call this Risk-adaptive Activation Steering (RAS). Extensive experiments across multiple benchmarks on multimodal safety and utility demonstrate that the RAS significantly reduces attack success rates, preserves general task performance, and improves inference speed over prior inference-time defenses.
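A toy sketch of activation steering with a risk-scaled safety direction added through a forward hook; the steering vector, risk estimator, and scaling below are placeholders rather than RAS's learned components.

```python
import torch
import torch.nn as nn

class RiskAdaptiveSteering:
    """Add a safety steering vector to a layer's activations, scaled by risk."""

    def __init__(self, layer: nn.Module, steer_vec: torch.Tensor, alpha: float = 4.0):
        self.steer_vec = steer_vec / steer_vec.norm()
        self.alpha = alpha
        self.risk = 0.0                        # set per query before generation
        layer.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        # Stronger steering for riskier queries; effectively a no-op when risk is 0.
        return output + self.alpha * self.risk * self.steer_vec.to(output.dtype)

# Toy demo on a single linear layer standing in for a transformer block.
layer = nn.Linear(16, 16)
steering = RiskAdaptiveSteering(layer, steer_vec=torch.randn(16))
steering.risk = 0.8                            # assessed risk for the current query
out = layer(torch.randn(2, 16))
print(out.shape)
```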
[453] GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering
Alexander Valverde, Brian Xu, Yuyin Zhou, Meng Xu, Hongyun Wang
Main category: cs.CV
TL;DR: GauSSmart is a hybrid 2D-3D method that enhances Gaussian Splatting scene reconstruction by integrating 2D foundational models (DINO) with semantic feature supervision and convex filtering to improve detail capture and coverage in sparse regions.
Details
Motivation: Gaussian Splatting struggles with fine details and realism in sparse coverage areas due to limitations of sparse 3D training data. The goal is to overcome these limitations by leveraging 2D foundational models.Method: Integrates 2D computer vision techniques including convex filtering and semantic feature supervision from DINO. Uses 2D segmentation priors and high-dimensional feature embeddings to guide Gaussian splat densification and refinement.
Result: Outperforms existing Gaussian Splatting methods across three datasets in most evaluated scenes, demonstrating improved coverage in underrepresented areas and better preservation of structural details.
Conclusion: Hybrid 2D-3D approaches combining 2D foundational models with 3D reconstruction pipelines can overcome limitations inherent in either approach alone, showing significant potential for enhanced scene reconstruction.
Abstract: Scene reconstruction has emerged as a central challenge in computer vision, with approaches such as Neural Radiance Fields (NeRF) and Gaussian Splatting achieving remarkable progress. While Gaussian Splatting demonstrates strong performance on large-scale datasets, it often struggles to capture fine details or maintain realism in regions with sparse coverage, largely due to the inherent limitations of sparse 3D training data. In this work, we propose GauSSmart, a hybrid method that effectively bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundational models such as DINO, to enhance Gaussian-based scene reconstruction. By leveraging 2D segmentation priors and high-dimensional feature embeddings, our method guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas and preserving intricate structural details. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting in the majority of evaluated scenes. Our results demonstrate the significant potential of hybrid 2D-3D approaches, highlighting how the thoughtful combination of 2D foundational models with 3D reconstruction pipelines can overcome the limitations inherent in either approach alone.
[454] Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data
Qi Chen, Xinze Zhou, Chen Liu, Hao Chen, Wenxuan Li, Zekun Jiang, Ziyan Huang, Yuxuan Zhao, Dexin Yu, Junjun He, Yefeng Zheng, Ling Shao, Alan Yuille, Zongwei Zhou
Main category: cs.CV
TL;DR: Synthetic data can reduce the need for large annotated medical datasets. AbdomenAtlas 2.0, with 10,135 CT scans and 15,130 tumor instances across six organs, shows significant performance improvements over existing datasets.
Details
Motivation: AI for tumor segmentation is limited by the lack of large, voxel-wise annotated datasets, which are hard to create and require medical experts. Synthetic data was found to steepen data scaling laws, enabling more efficient model training.Method: Created AbdomenAtlas 2.0 - a dataset of 10,135 CT scans with 15,130 tumor instances manually annotated by 23 expert radiologists across six organs (pancreas, liver, kidney, colon, esophagus, uterus) and 5,893 control scans.
Result: Achieved notable improvements over public datasets: +7% DSC gain on in-distribution tests and +16% on out-of-distribution tests. Synthetic data reached same performance using only 500 real scans compared to 1,500 real scans alone.
Conclusion: AbdomenAtlas 2.0 provides a strong foundation for training AI to segment tumors in six organs, demonstrating that synthetic data can significantly reduce the annotation burden while maintaining or improving performance.
Abstract: AI for tumor segmentation is limited by the lack of large, voxel-wise annotated datasets, which are hard to create and require medical experts. In our proprietary JHH dataset of 3,000 annotated pancreatic tumor scans, we found that AI performance stopped improving after 1,500 scans. With synthetic data, we reached the same performance using only 500 real scans. This finding suggests that synthetic data can steepen data scaling laws, enabling more efficient model training than real data alone. Motivated by these lessons, we created AbdomenAtlas 2.0, a dataset of 10,135 CT scans with a total of 15,130 tumor instances manually annotated per voxel in six organs (pancreas, liver, kidney, colon, esophagus, and uterus) and 5,893 control scans. Annotated by 23 expert radiologists, it is several orders of magnitude larger than existing public tumor datasets. While we continue expanding the dataset, the current version of AbdomenAtlas 2.0 already provides a strong foundation, based on lessons from the JHH dataset, for training AI to segment tumors in six organs. It achieves notable improvements over public datasets, with a +7% DSC gain on in-distribution tests and +16% on out-of-distribution tests.
[455] Hyperparameter Optimization and Reproducibility in Deep Learning Model Training
Usman Afzaal, Ziyu Su, Usama Sajjad, Hao Lu, Mostafa Rezapour, Metin Nafi Gurcan, Muhammad Khalid Khan Niazi
Main category: cs.CV
TL;DR: This paper investigates reproducibility challenges in histopathology foundation model training, identifying optimal hyperparameter ranges and providing practical guidelines for reproducible models in digital pathology.
Details
Motivation: Reproducibility remains a critical challenge in foundation model training for histopathology, hindered by software randomness, hardware non-determinism, and inconsistent hyperparameter reporting.Method: Trained a CLIP model on QUILT-1M dataset and systematically evaluated impact of different hyperparameter settings and augmentation strategies across three downstream histopathology datasets (PatchCamelyon, LC25000-Lung, and LC25000-Colon).
Result: Identified clear trends: RandomResizedCrop values of 0.7-0.8 performed best, distributed training without local loss improved stability, learning rates below 5.0e-5 degraded performance, and LC25000 (Colon) dataset provided the most reproducible benchmark.
Conclusion: Reproducibility in computational pathology depends on both transparent documentation and carefully chosen experimental configurations, with practical rules provided to guide future reproducible foundation model development for digital pathology.
Abstract: Reproducibility remains a critical challenge in foundation model training for histopathology, often hindered by software randomness, hardware non-determinism, and inconsistent hyperparameter reporting. To investigate these issues, we trained a CLIP model on the QUILT-1M dataset and systematically evaluated the impact of different hyperparameter settings and augmentation strategies across three downstream histopathology datasets (PatchCamelyon, LC25000-Lung, and LC25000-Colon). Despite variability across runs, we identified clear trends: RandomResizedCrop values of 0.7-0.8 outperformed more aggressive (0.6) or conservative (0.9) settings, distributed training without local loss improved stability, and learning rates below 5.0e-5 consistently degraded performance across all datasets. The LC25000 (Colon) dataset consistently provided the most reproducible benchmark. These findings highlight that reproducibility in computational pathology depends not only on transparent documentation but also on carefully chosen experimental configurations, and we provide practical rules to guide future efforts in developing reproducible foundation models for digital pathology.
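A hedged sketch of the corresponding training configuration: pinning the common randomness sources and setting the augmentation and learning-rate ranges the study reports, assuming the RandomResizedCrop "value" refers to the lower bound of its scale range; the exact CLIP training settings are not reproduced.

```python
import random
import numpy as np
import torch
from torchvision import transforms

def set_determinism(seed: int = 42):
    """Pin the randomness sources that most often break run-to-run reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_determinism(42)

# Augmentation following the reported sweet spot: RandomResizedCrop scale
# lower bound in the 0.7-0.8 range (0.8 shown here), plus a horizontal flip.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

learning_rate = 1e-4  # the study reports degraded performance below 5.0e-5
print(train_transform, learning_rate)
```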
[456] Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
Tianci Bi, Xiaoyi Zhang, Yan Lu, Nanning Zheng
Main category: cs.CV
TL;DR: The paper proposes VFM-VAE, a direct approach to integrate Vision Foundation Models into Latent Diffusion Models, avoiding distillation issues and achieving superior performance with faster convergence.
Details
Motivation: Current distillation-based approaches for incorporating Vision Foundation Models into LDMs weaken robustness and cause semantic deviations under distribution shifts, necessitating a more direct integration method.Method: Proposes VFM-VAE with redesigned decoder using Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, plus a joint tokenizer-diffusion alignment strategy with SE-CKNNA metric for diagnosis.
Result: Achieves gFID (w/o CFG) of 2.20 in just 80 epochs (10x speedup) and 1.62 after 640 epochs, establishing direct VFM integration as superior paradigm.
Conclusion: Direct VFM integration through VFM-VAE with proper architectural redesign and training strategy outperforms distillation-based approaches in both performance and efficiency for LDMs.
Abstract: The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizer. While recent works have explored incorporating Vision Foundation Models (VFMs) via distillation, we identify a fundamental flaw in this approach: it inevitably weakens the robustness of alignment with the original VFM, causing the aligned latents to deviate semantically under distribution shifts. In this paper, we bypass distillation by proposing a more direct approach: Vision Foundation Model Variational Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM’s semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE decoder with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, enabling high-quality reconstruction from spatially coarse VFM features. Furthermore, we provide a comprehensive analysis of representation dynamics during diffusion training, introducing the proposed SE-CKNNA metric as a more precise tool for this diagnosis. This analysis allows us to develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. Our innovations in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62, establishing direct VFM integration as a superior paradigm for LDMs.
[457] Space Object Detection using Multi-frame Temporal Trajectory Completion Method
Xiaoqing Lan, Biqiao Xin, Bingshu Wang, Han Zhang, Rui Zhu, Laixian Zhang
Main category: cs.CV
TL;DR: The paper proposes a method for detecting GEO space objects in optical images using wavelet transform for single-frame enhancement and a multi-frame trajectory completion scheme with Hungarian algorithm for cross-frame matching.
Details
Motivation: Space objects in GEO are challenging to detect due to weak signals, complex stellar backgrounds, and environmental interference in optical imaging.Method: Uses wavelet transform to enhance high-frequency features and suppress background noise at single-frame level, then applies multi-frame temporal trajectory completion with Hungarian algorithm for cross-frame matching, plus post-processing steps including temporal matching, interpolation completion, noise filtering, and trajectory refinement.
Result: Achieved an F_1 score of 90.14% on the public SpotGEO dataset, demonstrating the method’s effectiveness.
Conclusion: The proposed approach effectively addresses GEO object detection challenges through combined single-frame enhancement and multi-frame trajectory optimization.
Abstract: Space objects in Geostationary Earth Orbit (GEO) present significant detection challenges in optical imaging due to weak signals, complex stellar backgrounds, and environmental interference. In this paper, we enhance high-frequency features of GEO targets while suppressing background noise at the single-frame level through wavelet transform. Building on this, we propose a multi-frame temporal trajectory completion scheme centered on the Hungarian algorithm for globally optimal cross-frame matching. To effectively mitigate missing and false detections, a series of key steps including temporal matching and interpolation completion, temporal-consistency-based noise filtering, and progressive trajectory refinement are designed in the post-processing pipeline. Experimental results on the public SpotGEO dataset demonstrate the effectiveness of the proposed method, achieving an F1 score of 90.14%.
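The cross-frame matching step can be sketched with SciPy's Hungarian solver (linear_sum_assignment) over a centroid-distance cost matrix with a gating threshold; the wavelet enhancement and trajectory post-processing are not shown, and the distance gate is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(prev_pts: np.ndarray, curr_pts: np.ndarray, max_dist: float = 5.0):
    """Globally optimal one-to-one matching of detections between two frames.

    prev_pts: (N, 2) detection centroids in frame t-1.
    curr_pts: (M, 2) detection centroids in frame t.
    Returns index pairs (i, j) whose distance is below max_dist.
    """
    cost = np.linalg.norm(prev_pts[:, None, :] - curr_pts[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    pairs = [(int(i), int(j)) for i, j in zip(rows, cols) if cost[i, j] <= max_dist]
    return pairs

prev_pts = np.array([[10.0, 12.0], [40.0, 41.0], [80.0, 15.0]])
curr_pts = np.array([[11.5, 12.5], [79.0, 16.0]])
print(match_detections(prev_pts, curr_pts))   # [(0, 0), (2, 1)]
```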
[458] Pragmatic Heterogeneous Collaborative Perception via Generative Communication Mechanism
Junfei Zhou, Penglin Dai, Quanmin Wei, Bingyi Liu, Xiao Wu, Jianping Wang
Main category: cs.CV
TL;DR: GenComm introduces a Generative Communication mechanism for heterogeneous multi-agent collaboration that uses feature generation instead of intrusive retraining, achieving seamless perception while reducing computational costs by 81% when adding new agents.
Details
Motivation: Existing multi-agent collaboration methods fail in heterogeneous settings due to domain gaps from different sensors/models, intrusive retraining that disrupts semantic consistency, and high computational costs when integrating new agents.Method: Uses a Deformable Message Extractor to capture spatial messages, a Spatial-Aware Feature Generator with conditional diffusion model to generate features aligned with ego agent’s semantic space, and a Channel Enhancer for feature refinement before fusion.
Result: Outperforms state-of-the-art methods on OPV2V-H, DAIR-V2X and V2X-Real datasets, achieving 81% reduction in both computational cost and parameter count when incorporating new agents.
Conclusion: GenComm enables efficient and scalable heterogeneous multi-agent collaboration through generative feature communication without altering original networks, maintaining semantic consistency while minimizing integration costs.
Abstract: Multi-agent collaboration enhances the perception capabilities of individual agents through information sharing. However, in real-world applications, differences in sensors and models across heterogeneous agents inevitably lead to domain gaps during collaboration. Existing approaches based on adaptation and reconstruction fail to support pragmatic heterogeneous collaboration due to two key limitations: (1) Intrusive retraining of the encoder or core modules disrupts the established semantic consistency among agents; and (2) accommodating new agents incurs high computational costs, limiting scalability. To address these challenges, we present a novel Generative Communication mechanism (GenComm) that facilitates seamless perception across heterogeneous multi-agent systems through feature generation, without altering the original network, and employs lightweight numerical alignment of spatial information to efficiently integrate new agents at minimal cost. Specifically, a tailored Deformable Message Extractor is designed to extract spatial message for each collaborator, which is then transmitted in place of intermediate features. The Spatial-Aware Feature Generator, utilizing a conditional diffusion model, generates features aligned with the ego agent’s semantic space while preserving the spatial information of the collaborators. These generated features are further refined by a Channel Enhancer before fusion. Experiments conducted on the OPV2V-H, DAIR-V2X and V2X-Real datasets demonstrate that GenComm outperforms existing state-of-the-art methods, achieving an 81% reduction in both computational cost and parameter count when incorporating new agents. Our code is available at https://github.com/jeffreychou777/GenComm.
[459] Robust Atypical Mitosis Classification with DenseNet121: Stain-Aware Augmentation and Hybrid Loss for Domain Generalization
Adinath Dukre, Ankan Deria, Yutong Xie, Imran Razzak
Main category: cs.CV
TL;DR: A DenseNet-121 framework with stain-aware augmentation and imbalance-adaptive learning achieves robust atypical mitosis classification across multiple domains with 85.0% balanced accuracy.
Details
Motivation: Atypical mitotic figures are important biomarkers for tumor aggressiveness but are challenging to recognize due to severe class imbalance and variability across imaging domains.
Method: DenseNet-121-based framework with stain-aware augmentation (Macenko), geometric/intensity transformations, weighted sampling, and hybrid objective combining class-weighted binary cross-entropy and focal loss, trained end-to-end with AdamW.
Result: Achieved balanced accuracy 85.0%, AUROC 0.927, sensitivity 89.2%, and specificity 80.9% on official test set, demonstrating strong generalization across scanner and staining shifts.
Conclusion: Combining DenseNet-121 with stain-aware augmentation and imbalance-adaptive objectives yields a robust, domain-generalizable framework suitable for real-world computational pathology workflows.
Abstract: Atypical mitotic figures are important biomarkers of tumor aggressiveness in histopathology, yet reliable recognition remains challenging due to severe class imbalance and variability across imaging domains. We present a DenseNet-121-based framework tailored for atypical mitosis classification in the MIDOG 2025 (Track 2) setting. Our method integrates stain-aware augmentation (Macenko), geometric and intensity transformations, and imbalance-aware learning via weighted sampling with a hybrid objective combining class-weighted binary cross-entropy and focal loss. Trained end-to-end with AdamW and evaluated across multiple independent domains, the model demonstrates strong generalization under scanner and staining shifts, achieving balanced accuracy 85.0%, AUROC 0.927, sensitivity 89.2%, and specificity 80.9% on the official test set. These results indicate that combining DenseNet-121 with stain-aware augmentation and imbalance-adaptive objectives yields a robust, domain-generalizable framework for atypical mitosis classification suitable for real-world computational pathology workflows.
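A minimal sketch of the hybrid objective described above (class-weighted binary cross-entropy combined with a focal term), assuming binary logits; the positive-class weight, focal gamma, and mixing coefficient are illustrative placeholders, not the paper's settings.

```python
# Minimal sketch of the hybrid objective: class-weighted binary cross-entropy
# combined with a focal term. pos_weight, gamma, and alpha are illustrative.
import torch
import torch.nn.functional as F

def hybrid_loss(logits, targets, pos_weight=4.0, gamma=2.0, alpha=0.5):
    """logits, targets: float tensors of shape (batch,); targets in {0, 1}."""
    pos_w = torch.tensor(pos_weight, device=logits.device)
    wbce = F.binary_cross_entropy_with_logits(logits, targets, pos_weight=pos_w)
    # Focal term: down-weight easy examples by (1 - p_t)^gamma.
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    focal = ((1 - p_t) ** gamma * ce).mean()
    return alpha * wbce + (1 - alpha) * focal

# Toy usage.
logits = torch.tensor([2.0, -1.0, 0.3])
targets = torch.tensor([1.0, 0.0, 1.0])
print(hybrid_loss(logits, targets))
```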
[460] Cross-view Localization and Synthesis – Datasets, Challenges and Opportunities
Ningli Xu, Rongjun Qin
Main category: cs.CV
TL;DR: A comprehensive survey of cross-view localization and synthesis, covering datasets, methods, challenges, and future directions in matching overhead and ground-level imagery.
Details
Motivation: Cross-view visual understanding has broad applications in autonomous navigation, urban planning, and augmented reality, but remains challenging due to significant perspective, resolution, and occlusion differences between overhead and ground-level views.
Method: For cross-view localization: formulated as image retrieval using CNNs or ViTs for feature embedding. For cross-view synthesis: uses GANs or diffusion models to generate ground-level views from overhead imagery.
Result: The paper provides an organized overview of state-of-the-art techniques, comparative analyses, and highlights current limitations in the field.
Conclusion: The survey identifies promising future research directions and includes a project page with comprehensive resources for cross-view methods.
Abstract: Cross-view localization and synthesis are two fundamental tasks in cross-view visual understanding, which deals with cross-view datasets: overhead (satellite or aerial) and ground-level imagery. These tasks have gained increasing attention due to their broad applications in autonomous navigation, urban planning, and augmented reality. Cross-view localization aims to estimate the geographic position of ground-level images based on information provided by overhead imagery, while cross-view synthesis seeks to generate ground-level images based on information from the overhead imagery. Both tasks remain challenging due to significant differences in viewing perspective, resolution, and occlusion, which are widely embedded in cross-view datasets. Recent years have witnessed rapid progress driven by the availability of large-scale datasets and novel approaches. Typically, cross-view localization is formulated as an image retrieval problem where ground-level features are matched with features of tiled overhead images, extracted by convolutional neural networks (CNNs) or vision transformers (ViTs) for cross-view feature embedding. Cross-view synthesis, on the other hand, seeks to generate ground-level views based on information from overhead imagery, generally using generative adversarial networks (GANs) or diffusion models. This paper presents a comprehensive survey of advances in cross-view localization and synthesis, reviewing widely used datasets, highlighting key challenges, and providing an organized overview of state-of-the-art techniques. Furthermore, it discusses current limitations, offers comparative analyses, and outlines promising directions for future research. We also include the project page via https://github.com/GDAOSU/Awesome-Cross-View-Methods.
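To make the retrieval formulation concrete, the sketch below matches a ground-level query embedding against a bank of tiled overhead-image embeddings by cosine similarity; the random vectors stand in for CNN/ViT features and do not correspond to any specific method in the survey.

```python
# Minimal sketch of cross-view localization as retrieval: a ground-level query
# embedding is matched against tiled overhead-image embeddings by cosine
# similarity. Random placeholder features stand in for CNN/ViT embeddings.
import numpy as np

rng = np.random.default_rng(0)
overhead_bank = rng.standard_normal((1000, 512))   # one 512-d feature per map tile
ground_query = rng.standard_normal(512)             # feature of the query photo

def retrieve_top_k(query, bank, k=5):
    """Return indices of the k overhead tiles most similar to the query."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q                                     # cosine similarities
    return np.argsort(-sims)[:k]

print(retrieve_top_k(ground_query, overhead_bank))
```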
[461] PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity
Yuqian Yuan, Wenqiao Zhang, Xin Li, Shihao Wang, Kehan Li, Wentong Li, Jun Xiao, Lei Zhang, Beng Chin Ooi
Main category: cs.CV
TL;DR: PixelRefer is a unified region-level MLLM framework that enables fine-grained object-centric understanding across images and videos, with an efficient variant PixelRefer-Lite that reduces computational cost.
Details
Motivation: Most existing MLLMs focus on holistic scene-level understanding and overlook fine-grained object-centric reasoning, creating a need for region-level visual comprehension.
Method: Proposes Scale-Adaptive Object Tokenizer (SAOT) to generate compact object representations, and PixelRefer-Lite variant with Object-Centric Infusion module to pre-fuse global context into object tokens, creating an Object-Only Framework. Uses PixelRefer-2.2M instruction dataset for fine-grained tuning.
Result: Achieves leading performance with fewer training samples, while PixelRefer-Lite offers competitive accuracy with notable efficiency gains across various benchmarks.
Conclusion: PixelRefer enables advanced fine-grained understanding over user-specified regions, addressing the gap in object-centric reasoning while providing efficient alternatives through PixelRefer-Lite.
Abstract: Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, scene-level understanding, often overlooking the need for fine-grained, object-centric reasoning. In this paper, we present PixelRefer, a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions across both images and videos. Motivated by the observation that LLM attention predominantly focuses on object-level tokens, we propose a Scale-Adaptive Object Tokenizer (SAOT) to generate compact and semantically rich object representations from free-form regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite, an efficient variant that employs an Object-Centric Infusion module to pre-fuse global context into object tokens. This yields a lightweight Object-Only Framework that substantially reduces computational cost while maintaining high semantic fidelity. To facilitate fine-grained instruction tuning, we curate PixelRefer-2.2M, a high-quality object-centric instruction dataset. Extensive experiments across a range of benchmarks validate that PixelRefer achieves leading performance with fewer training samples, while PixelRefer-Lite offers competitive accuracy with notable gains in efficiency.
[462] RareFlow: Physics-Aware Flow-Matching for Cross-Sensor Super-Resolution of Rare-Earth Features
Forouzan Fallah, Wenwen Li, Chia-Yu Hsu, Hyunho Lee, Yezhou Yang
Main category: cs.CV
TL;DR: RareFlow is a physics-aware super-resolution framework that addresses out-of-distribution robustness in remote sensing imagery through dual-conditioning architecture, uncertainty quantification, and multifaceted loss functions.
Details
Motivation: Super-resolution for remote sensing imagery often fails under out-of-distribution conditions, producing visually plausible but physically inaccurate results for rare geomorphic features captured by diverse sensors.
Method: Uses dual-conditioning architecture with Gated ControlNet for geometric fidelity and textual prompts for semantic guidance. Introduces multifaceted loss function for spectral/radiometric consistency and employs stochastic forward pass for uncertainty quantification.
Result: In blind evaluations, geophysical experts rated outputs approaching ground truth fidelity, significantly outperforming state-of-the-art baselines with nearly 40% reduction in FID and gains in perceptual metrics.
Conclusion: RareFlow provides a robust framework for high-fidelity synthesis in data-scarce scientific domains and offers a new paradigm for controlled generation under severe domain shift.
Abstract: Super-resolution (SR) for remote sensing imagery often fails under out-of-distribution (OOD) conditions, such as rare geomorphic features captured by diverse sensors, producing visually plausible but physically inaccurate results. We present RareFlow, a physics-aware SR framework designed for OOD robustness. RareFlow’s core is a dual-conditioning architecture. A Gated ControlNet preserves fine-grained geometric fidelity from the low-resolution input, while textual prompts provide semantic guidance for synthesizing complex features. To ensure physically sound outputs, we introduce a multifaceted loss function that enforces both spectral and radiometric consistency with sensor properties. Furthermore, the framework quantifies its own predictive uncertainty by employing a stochastic forward pass approach; the resulting output variance directly identifies unfamiliar inputs, mitigating feature hallucination. We validate RareFlow on a new, curated benchmark of multi-sensor satellite imagery. In blind evaluations, geophysical experts rated our model’s outputs as approaching the fidelity of ground truth imagery, significantly outperforming state-of-the-art baselines. This qualitative superiority is corroborated by quantitative gains in perceptual metrics, including a nearly 40% reduction in FID. RareFlow provides a robust framework for high-fidelity synthesis in data-scarce scientific domains and offers a new paradigm for controlled generation under severe domain shift.
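A minimal sketch of the uncertainty-quantification pattern the abstract describes: keep stochastic layers active at inference, run several forward passes, and treat the output variance as an unfamiliarity signal. The tiny convolutional model here is a placeholder, not RareFlow's architecture.

```python
# Minimal sketch of uncertainty from repeated stochastic forward passes: keep
# stochastic layers (e.g., dropout) active at inference, run the model several
# times, and use the per-pixel variance as an unfamiliarity signal.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Dropout2d(p=0.2), nn.Conv2d(16, 3, 3, padding=1))

def stochastic_uncertainty(x, n_passes=8):
    model.train()  # keep dropout stochastic; in practice freeze batch-norm statistics
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_passes)])
    return samples.mean(0), samples.var(0)  # prediction and per-pixel variance

x = torch.randn(1, 3, 64, 64)               # stand-in low-resolution input
pred, var = stochastic_uncertainty(x)
print(var.mean().item())                     # higher values suggest an unfamiliar input
```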
[463] 50 Years of Water Body Monitoring: The Case of Qaraaoun Reservoir, Lebanon
Ali Ahmad Faour, Nabil Amacha, Ali J. Ghandour
Main category: cs.CV
TL;DR: A sensor-free approach using satellite imagery and machine learning to monitor reservoir volume in Lebanon, achieving over 95% accuracy in water segmentation and less than 1.5% volume estimation error.
Details
Motivation: Sustainable management of Qaraaoun Reservoir requires reliable monitoring despite frequent sensor malfunctions and limited maintenance capacity in Lebanon.
Method: Integrates Sentinel-2 and Landsat satellite imagery with a new water segmentation index and Support Vector Regression (SVR) machine learning model trained on bathymetric survey data to estimate volume from surface area alone.
Result: Water segmentation aligns with ground truth for over 95% of shoreline. Optimized SVR model achieves error below 1.5% of full reservoir capacity with R² exceeding 0.98.
Conclusion: The method provides robust, cost-effective, sensor-independent monitoring applicable to other water bodies, generating valuable long-term climate and environmental data while focusing on temporal trends rather than exact volume measurements.
Abstract: The sustainable management of the Qaraaoun Reservoir, the largest surface water body in Lebanon located in the Bekaa Plain, depends on reliable monitoring of its storage volume despite frequent sensor malfunctions and limited maintenance capacity. This study introduces a sensor-free approach that integrates open-source satellite imagery, advanced water-extent segmentation, and machine learning to estimate the reservoir’s surface area and, subsequently, its volume in near real time. Sentinel-2 and Landsat 1-9 images are processed, where surface water is delineated using a newly proposed water segmentation index. A machine learning model based on Support Vector Regression (SVR) is trained on a curated dataset that includes water surface area, water level, and water volume derived from a reservoir bathymetric survey. The model is then able to estimate the water body’s volume solely from the extracted water surface, without the need for any ground-based measurements. Water segmentation using the proposed index aligns with ground truth for over 95% of the shoreline. Hyperparameter tuning with GridSearchCV yields an optimized SVR performance, with an error below 1.5% of the full reservoir capacity and coefficients of determination exceeding 0.98. These results demonstrate the method’s robustness and cost-effectiveness, offering a practical solution for continuous, sensor-independent monitoring of reservoir storage. The proposed methodology is applicable to other water bodies and generates over five decades of time-series data, offering valuable insights into climate change and environmental dynamics, with an emphasis on capturing temporal trends rather than exact water volume measurements.
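A minimal sketch of the area-to-volume regression described above: an SVR tuned with GridSearchCV maps water surface area to storage volume. The synthetic area/volume pairs are placeholders for the bathymetry-derived training data.

```python
# Minimal sketch of the area-to-volume regression: an SVR tuned with
# GridSearchCV maps water surface area to storage volume. Synthetic data
# stands in for the bathymetry-derived dataset.
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
area_km2 = rng.uniform(2.0, 12.0, size=200).reshape(-1, 1)              # surface area
volume_mcm = 18.0 * area_km2.ravel() ** 1.2 + rng.normal(0, 2, 200)     # toy volume curve

pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
grid = GridSearchCV(pipe, {"svr__C": [1, 10, 100], "svr__epsilon": [0.1, 0.5, 1.0]}, cv=5)
grid.fit(area_km2, volume_mcm)
print(grid.best_params_, grid.predict([[8.5]]))  # estimated volume for an 8.5 km^2 extent
```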
[464] A Quantitative Evaluation Framework for Explainable AI in Semantic Segmentation
Reem Hammoud, Abdul karim Gizzini, Ali J. Ghandour
Main category: cs.CV
TL;DR: A comprehensive quantitative evaluation framework for XAI in semantic segmentation that addresses limitations of qualitative visual explanations through systematic pixel-level evaluation and metrics.
Details
Motivation: Need for transparent and trustworthy AI in safety-critical domains, with limited evaluation strategies for XAI in semantic segmentation and subjective nature of qualitative visual explanations.
Method: Developed a quantitative evaluation framework integrating pixel-level evaluation strategies with carefully designed metrics to assess XAI methods in semantic segmentation, accounting for spatial and contextual complexities.
Result: Simulation results using CAM-based XAI schemes demonstrate the framework’s efficiency, robustness, and reliability in providing fine-grained interpretability insights.
Conclusion: The framework advances development of transparent, trustworthy, and accountable semantic segmentation models by enabling rigorous quantitative evaluation of XAI methods.
Abstract: Ensuring transparency and trust in artificial intelligence (AI) models is essential as they are increasingly deployed in safety-critical and high-stakes domains. Explainable AI (XAI) has emerged as a promising approach to address this challenge; however, the rigorous evaluation of XAI methods remains vital for balancing the trade-offs between model complexity, predictive performance, and interpretability. While substantial progress has been made in evaluating XAI for classification tasks, strategies tailored to semantic segmentation remain limited. Moreover, objectively assessing XAI approaches is difficult, since qualitative visual explanations provide only preliminary insights. Such qualitative methods are inherently subjective and cannot ensure the accuracy or stability of explanations. To address these limitations, this work introduces a comprehensive quantitative evaluation framework for assessing XAI in semantic segmentation, accounting for both spatial and contextual task complexities. The framework systematically integrates pixel-level evaluation strategies with carefully designed metrics to yield fine-grained interpretability insights. Simulation results using recently adapted class activation mapping (CAM)-based XAI schemes demonstrate the efficiency, robustness, and reliability of the proposed methodology. These findings advance the development of transparent, trustworthy, and accountable semantic segmentation models.
[465] Rethinking Visual Intelligence: Insights from Video Pretraining
Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, Paolo Favaro
Main category: cs.CV
TL;DR: Video Diffusion Models (VDMs) pretrained on spatiotemporal data show stronger inductive biases for visual tasks than LLMs, demonstrating higher data efficiency across multiple benchmarks.
Details
Motivation: LLMs have succeeded in language tasks but struggle with visual domain challenges like compositional understanding and sample efficiency. The paper investigates whether VDMs can bridge this gap through their spatiotemporal pretraining.
Method: Used pretrained VDMs and LLMs equipped with lightweight adapters, tested on benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata in their natural modalities.
Result: VDMs demonstrated higher data efficiency than LLMs across all tested benchmarks, showing better adaptation to visual tasks.
Conclusion: Video pretraining provides strong inductive biases that support progress toward visual foundation models, making VDMs a promising direction for visual AI systems.
Abstract: Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.
[466] Kineo: Calibration-Free Metric Motion Capture From Sparse RGB Cameras
Charles Javerliat, Pierre Raimbaud, Guillaume Lavoué
Main category: cs.CV
TL;DR: Kineo is a fully automatic, calibration-free pipeline for markerless motion capture from unsynchronized, uncalibrated consumer RGB cameras that simultaneously calibrates cameras and reconstructs 3D keypoints at metric scale with high accuracy and efficiency.
Details
Motivation: Markerless multiview motion capture is constrained by the need for precise camera calibration, limiting accessibility for non-experts and in-the-wild captures. Existing calibration-free approaches suffer from high computational cost and reduced reconstruction accuracy.
Method: Leverages 2D keypoints from off-the-shelf detectors to simultaneously calibrate cameras (including Brown-Conrady distortion coefficients) and reconstruct 3D keypoints and dense scene point maps. Uses confidence-driven spatio-temporal keypoint sampling strategy combined with graph-based global optimization for robust calibration at fixed computational cost. Introduces pairwise reprojection consensus score to quantify 3D reconstruction reliability.
Result: Substantial improvements over prior calibration-free methods: reduces camera translation error by 83-85%, camera angular error by 86-92%, and world mean-per-joint error (W-MPJPE) by 83-91%. Processes multi-view sequences faster than their duration in specific configurations (e.g., 36min to process 1h20min of footage).
Conclusion: Kineo provides an efficient, accurate calibration-free solution for markerless motion capture that outperforms state-of-the-art methods while being accessible to non-experts, with full pipeline and evaluation code released openly for reproducibility and practical adoption.
Abstract: Markerless multiview motion capture is often constrained by the need for precise camera calibration, limiting accessibility for non-experts and in-the-wild captures. Existing calibration-free approaches mitigate this requirement but suffer from high computational cost and reduced reconstruction accuracy. We present Kineo, a fully automatic, calibration-free pipeline for markerless motion capture from videos captured by unsynchronized, uncalibrated, consumer-grade RGB cameras. Kineo leverages 2D keypoints from off-the-shelf detectors to simultaneously calibrate cameras, including Brown-Conrady distortion coefficients, and reconstruct 3D keypoints and dense scene point maps at metric scale. A confidence-driven spatio-temporal keypoint sampling strategy, combined with graph-based global optimization, ensures robust calibration at a fixed computational cost independent of sequence length. We further introduce a pairwise reprojection consensus score to quantify 3D reconstruction reliability for downstream tasks. Evaluations on EgoHumans and Human3.6M demonstrate substantial improvements over prior calibration-free methods. Compared to previous state-of-the-art approaches, Kineo reduces camera translation error by approximately 83-85%, camera angular error by 86-92%, and world mean-per-joint error (W-MPJPE) by 83-91%. Kineo is also efficient in real-world scenarios, processing multi-view sequences faster than their duration in specific configurations (e.g., 36min to process 1h20min of footage). The full pipeline and evaluation code are openly released to promote reproducibility and practical adoption at https://liris-xr.github.io/kineo/.
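As a rough sketch of the kind of reprojection check a pairwise consensus score builds on, the snippet below projects a triangulated 3D keypoint into each camera and counts how many detections agree within a pixel threshold; the camera matrices, threshold, and scoring rule are illustrative assumptions, not Kineo's exact definition.

```python
# Minimal sketch of a reprojection-agreement check: project a triangulated 3D
# keypoint into each camera and compare against the detected 2D keypoint.
# Camera matrices and the pixel threshold are illustrative placeholders.
import numpy as np

def project(P, X):
    """Project homogeneous 3D point X (4,) with a 3x4 camera matrix P."""
    x = P @ X
    return x[:2] / x[2]

def consensus_fraction(point_3d, cameras, detections, pix_thresh=2.0):
    """Fraction of cameras whose detection agrees with the reprojection."""
    X = np.append(point_3d, 1.0)
    errs = [np.linalg.norm(project(P, X) - d) for P, d in zip(cameras, detections)]
    return float(np.mean([e <= pix_thresh for e in errs]))

# Toy setup: two cameras with identical intrinsics, the second shifted along x.
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])
X = np.array([0.2, -0.1, 4.0])
dets = [project(P1, np.append(X, 1.0)), project(P2, np.append(X, 1.0)) + 0.5]
print(consensus_fraction(X, [P1, P2], dets))  # 1.0 if both detections agree
```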
[467] D²GS: Dense Depth Regularization for LiDAR-free Urban Scene Reconstruction
Kejing Xia, Jidong Jia, Ke Jin, Yucai Bai, Li Sun, Dacheng Tao, Youjian Zhang
Main category: cs.CV
TL;DR: D²GS is a LiDAR-free urban scene reconstruction framework that uses multi-view depth predictions and diffusion priors to achieve geometry quality comparable to LiDAR-based methods.
Details
Motivation: Current urban scene reconstruction methods rely on multimodal sensors (LiDAR + images), but acquiring accurate LiDAR data is challenging due to calibration requirements and spatial misalignment issues.
Method: Three-step approach: 1) Initialize dense point cloud from multi-view depth predictions with Progressive Pruning for global consistency; 2) Jointly refine Gaussian geometry and depth via Depth Enhancer using diffusion priors; 3) Improve ground geometry by constraining Gaussian attributes in road regions.
Result: Extensive experiments on Waymo dataset show D²GS outperforms state-of-the-art methods and produces more accurate geometry than approaches using ground-truth LiDAR data.
Conclusion: The proposed LiDAR-free framework successfully achieves high-quality urban scene reconstruction by leveraging depth foundation models and geometric constraints, eliminating the need for expensive LiDAR sensors.
Abstract: Recently, Gaussian Splatting (GS) has shown great potential for urban scene reconstruction in the field of autonomous driving. However, current urban scene reconstruction methods often depend on multimodal sensors as inputs, i.e., LiDAR and images. Though the geometry prior provided by LiDAR point clouds can largely mitigate ill-posedness in reconstruction, acquiring such accurate LiDAR data is still challenging in practice: i) precise spatiotemporal calibration between LiDAR and other sensors is required, as they may not capture data simultaneously; ii) reprojection errors arise from spatial misalignment when LiDAR and cameras are mounted at different locations. To avoid the difficulty of acquiring accurate LiDAR depth, we propose D²GS, a LiDAR-free urban scene reconstruction framework. In this work, we obtain geometry priors that are as effective as LiDAR while being denser and more accurate. First, we initialize a dense point cloud by back-projecting multi-view metric depth predictions. This point cloud is then optimized by a Progressive Pruning strategy to improve the global consistency. Second, we jointly refine Gaussian geometry and predicted dense metric depth via a Depth Enhancer. Specifically, we leverage diffusion priors from a depth foundation model to enhance the depth maps rendered by Gaussians. In turn, the enhanced depths provide stronger geometric constraints during Gaussian training. Finally, we improve the accuracy of ground geometry by constraining the shape and normal attributes of Gaussians within road regions. Extensive experiments on the Waymo dataset demonstrate that our method consistently outperforms state-of-the-art methods, producing more accurate geometry even when compared with those using ground-truth LiDAR data.
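A minimal sketch of the initialization step described above: back-projecting a predicted metric depth map into a 3D point cloud with the camera intrinsics. The intrinsics and the random depth map are placeholders; Progressive Pruning and the Depth Enhancer are not shown.

```python
# Minimal sketch: back-project a predicted metric depth map into a 3D point
# cloud using the camera intrinsics. Intrinsics and depth values are placeholders.
import numpy as np

def backproject(depth, K):
    """depth: (H, W) metric depth; K: 3x3 intrinsics. Returns (H*W, 3) points."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW homogeneous pixels
    rays = np.linalg.inv(K) @ pix                                       # camera-frame rays
    return (rays * depth.reshape(1, -1)).T                              # scale rays by depth

K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])
depth = np.random.default_rng(0).uniform(2.0, 50.0, size=(720, 1280))
points = backproject(depth, K)
print(points.shape)  # (921600, 3): one 3D point per pixel, in camera coordinates
```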
[468] Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks
Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, Chenfei Liao, Dingcheng Zhen, Yuanhuiyi Lyu, Yuqian Fu, Bin Ren, Linfeng Zhang, Danda Pani Paudel, Nicu Sebe, Luc Van Gool, Xuming Hu
Main category: cs.CV
TL;DR: This survey provides a comprehensive review of multimodal spatial reasoning with large models, categorizing progress in MLLMs and introducing open benchmarks for evaluation across 2D, 3D, embodied AI, and emerging modalities.
Details
Motivation: Humans have strong spatial reasoning abilities through multimodal observations, but systematic reviews and benchmarks for large multimodal reasoning models remain limited despite their promising performance.
Method: The survey categorizes recent progress in multimodal large language models (MLLMs), outlines general spatial reasoning techniques, and examines tasks including 2D spatial relationships, 3D scene understanding, embodied AI, and emerging modalities like audio and egocentric video.
Result: The survey establishes a comprehensive foundation for multimodal spatial reasoning research and provides open benchmarks for evaluation, with codes and implementations available on GitHub.
Conclusion: This survey offers valuable insights into the growing field of multimodal spatial reasoning and serves as a solid foundation for future research in this area.
Abstract: Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing promising performance across diverse spatial tasks. However, systematic reviews and publicly available benchmarks for these models remain limited. In this survey, we provide a comprehensive review of multimodal spatial reasoning tasks with large models, categorizing recent progress in multimodal large language models (MLLMs) and introducing open benchmarks for evaluation. We begin by outlining general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D tasks, we examine spatial relationship reasoning, scene and layout understanding, as well as visual question answering and grounding in 3D space. We also review advances in embodied AI, including vision-language navigation and action models. Additionally, we consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors. We believe this survey establishes a solid foundation and offers insights into the growing field of multimodal spatial reasoning. Updated information about this survey, codes and implementation of the open benchmarks can be found at https://github.com/zhengxuJosh/Awesome-Spatial-Reasoning.
[469] Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Pei Peng, MingKun Xie, Hang Hao, Tong Jin, ShengJun Huang
Main category: cs.CV
TL;DR: A causal inference approach to address object-context shortcuts in vision-language models by synthesizing counterfactual embeddings and estimating Total Direct Effect to improve zero-shot reliability.
Details
Motivation: Object-context shortcuts undermine zero-shot reliability in vision-language models when test scenes differ from training co-occurrences, creating biased predictions.
Method: Estimate object and background expectations in CLIP’s representation space, synthesize counterfactual embeddings by recombining object features with alternative contexts, and use Total Direct Effect to subtract background-only activation.
Result: Substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing new zero-shot state-of-the-art without retraining or prompt design.
Conclusion: Provides a lightweight representation-level counterfactual approach for debiased and reliable multimodal reasoning through practical causal inference.
Abstract: Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP’s representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
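A minimal sketch of the debiasing idea, assuming precomputed CLIP-style embeddings: subtract the background-only activation from the full-image activation so the class score reflects the object rather than its context (a Total-Direct-Effect-style correction). The unit-norm vectors below are random placeholders.

```python
# Minimal sketch of a Total-Direct-Effect-style correction: subtract the
# background-only activation from the full-image activation. Random unit
# vectors stand in for CLIP image/text embeddings.
import numpy as np

rng = np.random.default_rng(0)
def unit(v): return v / np.linalg.norm(v)

text_emb = np.stack([unit(rng.standard_normal(512)) for _ in range(10)])  # 10 class prompts
image_emb = unit(rng.standard_normal(512))        # full image (object + context)
background_emb = unit(rng.standard_normal(512))   # context-only estimate

scores_biased = text_emb @ image_emb                          # standard zero-shot logits
scores_debiased = scores_biased - text_emb @ background_emb   # remove context-only effect
print(scores_biased.argmax(), scores_debiased.argmax())
```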
[470] ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Aniruddh Bansal, Davit Soselia, Dang Nguyen, Tianyi Zhou
Main category: cs.CV
TL;DR: The paper introduces ChartAlign Benchmark (ChartAB) to evaluate vision-language models’ chart grounding capabilities, including data extraction, element localization, and attribute recognition from diverse charts.
Details
Motivation: Existing VLMs lack accurate perception of details and struggle with fine-grained structure extraction from charts, which limits their ability to compare multiple charts and reason over them.
Method: Developed a comprehensive benchmark with JSON template for tailored evaluation metrics, and incorporated a novel two-stage inference workflow to evaluate VLM capability in aligning and comparing elements across charts.
Result: Analysis revealed new insights into VLMs’ perception biases, weaknesses, robustness issues, and hallucinations in chart understanding, highlighting fine-grained discrepancies among models.
Conclusion: The findings point to specific skills that need strengthening in current VLMs for chart understanding tasks, particularly in fine-grained perception and cross-chart comparison capabilities.
Abstract: Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel “ChartAlign Benchmark (ChartAB)” to provide a comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored for each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs’ capability to align and compare elements/attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.
[471] Dual-level Progressive Hardness-Aware Reweighting for Cross-View Geo-Localization
Guozheng Zheng, Jian Guan, Mingjie Xie, Xuanjia Zhao, Congyi Fan, Shiheng Zhang, Pengming Feng
Main category: cs.CV
TL;DR: A Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy for cross-view geo-localization between drone and satellite imagery that addresses hard negatives through sample-level difficulty assessment and batch-level progressive weighting.
Details
Motivation: Cross-view geo-localization faces challenges from severe viewpoint gaps and hard negatives (visually similar but geographically mismatched samples). Existing static weighting methods are sensitive to distribution shifts and prone to overemphasizing difficult samples too early, causing noisy gradients and unstable convergence.
Method: DPHR uses a dual-level approach: 1) Sample-level Ratio-based Difficulty-Aware (RDA) module evaluates relative difficulty and assigns fine-grained weights to negatives. 2) Batch-level Progressive Adaptive Loss Weighting (PALW) mechanism uses training-progress signal to attenuate noisy gradients early and progressively enhance hard-negative mining as training matures.
Result: Experiments on University-1652 and SUES-200 benchmarks demonstrate effectiveness and robustness, achieving consistent improvements over state-of-the-art methods.
Conclusion: The proposed DPHR strategy effectively addresses hard negatives in cross-view geo-localization through progressive hardness-aware reweighting, providing more stable convergence and better performance compared to existing methods.
Abstract: Cross-view geo-localization (CVGL) between drone and satellite imagery remains challenging due to severe viewpoint gaps and the presence of hard negatives, which are visually similar but geographically mismatched samples. Existing mining or reweighting strategies often use static weighting, which is sensitive to distribution shifts and prone to overemphasizing difficult samples too early, leading to noisy gradients and unstable convergence. In this paper, we present a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy. At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates relative difficulty and assigns fine-grained weights to negatives. At the batch level, a Progressive Adaptive Loss Weighting (PALW) mechanism exploits a training-progress signal to attenuate noisy gradients during early optimization and progressively enhance hard-negative mining as training matures. Experiments on the University-1652 and SUES-200 benchmarks demonstrate the effectiveness and robustness of the proposed DPHR, achieving consistent improvements over state-of-the-art methods.
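A rough sketch of the two levels described above: per-negative weights from a ratio-based difficulty estimate, scaled by a training-progress factor so hard negatives only dominate later in training. The weighting formulas are illustrative assumptions, not the paper's exact RDA/PALW definitions.

```python
# Minimal sketch of progressive hardness-aware reweighting: sample-level
# difficulty weights from a similarity ratio, ramped up by a training-progress
# factor. The specific formulas are illustrative assumptions.
import torch

def dphr_weights(sim_pos, sim_negs, progress):
    """sim_pos: scalar similarity to the positive; sim_negs: (N,) negative
    similarities; progress: training progress in [0, 1]."""
    ratio = sim_negs / (sim_pos + 1e-6)              # harder negatives -> ratio near 1 or above
    sample_w = torch.softmax(ratio, dim=0)           # sample-level difficulty weights
    batch_scale = progress ** 2                      # ramp up hard-negative mining over training
    return 1.0 + batch_scale * (sample_w * sim_negs.numel() - 1.0)

sim_pos = torch.tensor(0.8)
sim_negs = torch.tensor([0.75, 0.40, 0.10])          # one hard, two easy negatives
for p in (0.1, 0.5, 1.0):
    print(p, dphr_weights(sim_pos, sim_negs, p))     # weights concentrate on the hard negative
```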
[472] Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
Yehna Kim, Young-Eun Kim, Seong-Whan Lee
Main category: cs.CV
TL;DR: The paper proposes using web-crawled descriptions processed by LLMs to extract keywords for zero-shot action recognition, reducing manual annotation needs and addressing semantic ambiguity in action classes.
Details
Motivation: Address ambiguity in zero-shot action recognition caused by multi-semantic words in action classes, which can lead to incorrect concept understanding.
Method: Use web-crawled descriptions processed by large-language models to extract relevant keywords, and introduce a spatio-temporal interaction module to align description attributes with video content.
Result: Achieved accuracies of 81.0% on UCF-101, 53.1% on HMDB-51, and 68.9% on Kinetics-600 in zero-shot experiments.
Conclusion: The approach demonstrates adaptability and effectiveness across various downstream tasks while reducing manual annotation requirements.
Abstract: Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, a significant challenge arises when relying solely on action classes to provide semantic context, particularly due to the presence of multi-semantic words, which can introduce ambiguity in understanding the intended concepts of actions. To address this issue, we propose an innovative approach that harnesses web-crawled descriptions, leveraging a large-language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. Additionally, we introduce a spatio-temporal interaction module designed to focus on objects and action units, facilitating alignment between description attributes and video content. In our zero-shot experiments, our model achieves impressive results, attaining accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the model’s adaptability and effectiveness across various downstream tasks.
[473] Image Hashing via Cross-View Code Alignment in the Age of Foundation Models
Ilyass Moummad, Kawtar Zaher, Hervé Goëau, Alexis Joly
Main category: cs.CV
TL;DR: CroVCA introduces a simple unified principle for learning binary codes that remain consistent across semantically aligned views using a single binary cross-entropy loss and coding-rate maximization as anti-collapse regularizer.
Details
Motivation: Foundation models provide powerful embeddings but nearest neighbor search in high-dimensional spaces is computationally expensive. Hashing offers efficient alternative but existing approaches have complex pipelines, multi-term objectives, and long training times.
Method: CroVCA uses cross-view code alignment principle with binary cross-entropy loss for alignment and coding-rate maximization as regularizer. HashCoder is a lightweight MLP hashing network with batch normalization for balanced codes, usable as probing head on frozen embeddings or via LoRA fine-tuning.
Result: Achieves state-of-the-art results in just 5 training epochs. Unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on single GPU.
Conclusion: CroVCA demonstrates high efficiency, adaptability, and broad applicability for large-scale retrieval with compact and discriminative binary codes.
Abstract: Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it performs particularly well; for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA’s efficiency, adaptability, and broad applicability.
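A minimal sketch of the cross-view alignment objective, assuming frozen foundation-model embeddings of two semantically aligned views: a small MLP hashing head produces code logits, and a binary cross-entropy loss pushes one view's soft codes toward the other view's hard codes. Dimensions and the stop-gradient choice are assumptions, and the coding-rate regularizer is omitted.

```python
# Minimal sketch of cross-view code alignment: an MLP hashing head with a final
# batch-norm layer, trained with BCE so one view's logits match the other
# view's binarized codes. Sizes are illustrative; the anti-collapse
# coding-rate regularizer is omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashCoder(nn.Module):
    def __init__(self, dim_in=768, n_bits=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, 512), nn.ReLU(),
                                 nn.Linear(512, n_bits), nn.BatchNorm1d(n_bits))
    def forward(self, x):
        return self.net(x)  # logits; sign() at inference gives the binary code

coder = HashCoder()
view_a = torch.randn(32, 768)   # frozen foundation-model embeddings, view 1
view_b = torch.randn(32, 768)   # embeddings of the semantically aligned view 2

logits_a, logits_b = coder(view_a), coder(view_b)
targets_b = (logits_b.detach() > 0).float()           # hard codes of the other view
loss = F.binary_cross_entropy_with_logits(logits_a, targets_b)
print(loss.item())
```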
cs.AI
[474] Multimodal Detection of Fake Reviews using BERT and ResNet-50
Suhasnadh Reddy Veluru, Sai Teja Erukude, Viswa Chaitanya Marella
Main category: cs.AI
TL;DR: A multimodal fake review detection framework combining BERT for text and ResNet-50 for images outperforms unimodal approaches, achieving 0.934 F1-score by detecting semantic inconsistencies across modalities.
Details
Motivation: Fake reviews generated by bots, paid agents, or AI models threaten trust in digital commerce. Existing unimodal detection models fail to capture cross-modal inconsistencies.
Method: Proposed multimodal framework integrates BERT-encoded text features and ResNet-50-extracted visual features, fused through a classification head. Uses curated dataset of 21,142 user-uploaded images across food delivery, hospitality, and e-commerce domains.
Result: Multimodal model achieves F1-score of 0.934, outperforming unimodal baselines. Confusion matrix and qualitative analysis show ability to detect subtle inconsistencies like exaggerated text with unrelated/low-quality images.
Conclusion: Multimodal learning is crucial for safeguarding digital trust and offers scalable content moderation solution for online platforms.
Abstract: In the current digital commerce landscape, user-generated reviews play a critical role in shaping consumer behavior, product reputation, and platform credibility. However, the proliferation of fake or misleading reviews often generated by bots, paid agents, or AI models poses a significant threat to trust and transparency within review ecosystems. Existing detection models primarily rely on unimodal, typically textual, data and therefore fail to capture semantic inconsistencies across different modalities. To address this gap, a robust multimodal fake review detection framework is proposed, integrating textual features encoded with BERT and visual features extracted using ResNet-50. These representations are fused through a classification head to jointly predict review authenticity. To support this approach, a curated dataset comprising 21,142 user-uploaded images across food delivery, hospitality, and e-commerce domains was utilized. Experimental results indicate that the multimodal model outperforms unimodal baselines, achieving an F1-score of 0.934 on the test set. Additionally, the confusion matrix and qualitative analysis highlight the model’s ability to detect subtle inconsistencies, such as exaggerated textual praise paired with unrelated or low-quality images, commonly found in deceptive content. This study demonstrates the critical role of multimodal learning in safeguarding digital trust and offers a scalable solution for content moderation across various online platforms.
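A minimal sketch of the fusion step described above: a classification head over the concatenation of a 768-d BERT [CLS] embedding and a 2048-d ResNet-50 pooled feature. The random tensors stand in for encoder outputs, and the hidden size and dropout rate are illustrative assumptions.

```python
# Minimal sketch of the multimodal fusion head: concatenate a BERT text
# embedding with a ResNet-50 image feature and classify real vs. fake.
# Random tensors stand in for encoder outputs; hidden sizes are illustrative.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, text_dim=768, img_dim=2048, hidden=512):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + img_dim, hidden), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(hidden, 2))                      # real vs. fake review

    def forward(self, text_feat, img_feat):
        return self.classifier(torch.cat([text_feat, img_feat], dim=-1))

head = FusionHead()
text_feat = torch.randn(8, 768)    # BERT [CLS] embeddings for 8 reviews
img_feat = torch.randn(8, 2048)    # ResNet-50 global-pooled image features
print(head(text_feat, img_feat).shape)  # torch.Size([8, 2])
```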
[475] Graph-Attentive MAPPO for Dynamic Retail Pricing
Krishna Kumar Neelakanta Pillai Santha Kumari Amma
Main category: cs.AI
TL;DR: Multi-agent reinforcement learning (MAPPO) with graph attention (MAPPO+GAT) improves dynamic retail pricing by enabling coordinated price decisions across related products, outperforming independent learning approaches.
Details
Motivation: Retail dynamic pricing requires policies that adapt to shifting demand while coordinating decisions across related products, which traditional methods struggle with.
Method: Systematic empirical study comparing MAPPO baseline with graph-attention-augmented variant (MAPPO+GAT) using simulated pricing environment from real transaction data.
Result: MAPPO provides robust portfolio-level price control, and MAPPO+GAT further enhances performance by sharing information over the product graph without excessive price volatility.
Conclusion: Graph-integrated MARL offers a more scalable and stable solution than independent learners for dynamic retail pricing, with practical advantages in multi-product decision-making.
Abstract: Dynamic pricing in retail requires policies that adapt to shifting demand while coordinating decisions across related products. We present a systematic empirical study of multi-agent reinforcement learning for retail price optimization, comparing a strong MAPPO baseline with a graph-attention-augmented variant (MAPPO+GAT) that leverages learned interactions among products. Using a simulated pricing environment derived from real transaction data, we evaluate profit, stability across random seeds, fairness across products, and training efficiency under a standardized evaluation protocol. The results indicate that MAPPO provides a robust and reproducible foundation for portfolio-level price control, and that MAPPO+GAT further enhances performance by sharing information over the product graph without inducing excessive price volatility. These results indicate that graph-integrated MARL provides a more scalable and stable solution than independent learners for dynamic retail pricing, offering practical advantages in multi-product decision-making.
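A rough sketch of the graph-attention message passing that lets pricing agents share information over a product graph: each product attends to its neighbors' state embeddings before a policy head would act on them. The adjacency, dimensions, and attention form are toy assumptions, not the paper's environment or exact GAT variant.

```python
# Minimal sketch of attention over a product graph: each product's state is
# updated with an attention-weighted sum of its neighbors' states. Adjacency
# and dimensions are toy assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductGraphAttention(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, x, adj):
        """x: (N, dim) product states; adj: (N, N) boolean adjacency with self-loops."""
        scores = self.q(x) @ self.k(x).T / x.shape[-1] ** 0.5
        scores = scores.masked_fill(adj == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        return x + attn @ self.v(x)    # residual, context-aware product states

x = torch.randn(5, 32)                            # 5 related products
adj = torch.eye(5) + torch.diag(torch.ones(4), 1) + torch.diag(torch.ones(4), -1)
print(ProductGraphAttention()(x, adj > 0).shape)  # torch.Size([5, 32])
```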
[476] GEPOC Parameters - Open Source Parametrisation and Validation for Austria, Version 2.0
Martin Bicher, Maximilian Viehauser, Daniele Giannandrea, Hannah Kastinger, Dominik Brunmeir, Claire Rippinger, Christoph Urach, Niki Popper
Main category: cs.AI
TL;DR: GEPOC is a framework for population analysis that requires stable data processes. This work describes data-processing methods for computing model parameters for Austria using publicly accessible data, with emphasis on the GEPOC ABM agent-based model.
Details
Motivation: To enable valid application of GEPOC models for specific regions by providing stable and reproducible data processes that generate ready-to-use model parameters from publicly available data.
Method: Developed data-processing methods including algorithms for aggregation, disaggregation, fusion, cleansing, and scaling of publicly accessible data to compute model parameters for Austria, with focus on the GEPOC ABM agent-based population model.
Result: Created a complete description of data-processing methods and resulting parameter files for Austria, validated through an extensive study using the GEPOC ABM model.
Conclusion: The work successfully establishes reproducible data processes for computing GEPOC model parameters using publicly available data, with validation confirming the approach’s effectiveness for population-level research.
Abstract: GEPOC, short for Generic Population Concept, is a collection of models and methods for analysing population-level research questions. For the valid application of the models for a specific country or region, stable and reproducible data processes are necessary, which provide valid and ready-to-use model parameters. This work contains a complete description of the data-processing methods for computation of model parameters for Austria, based exclusively on freely and publicly accessible data. In addition to the description of the source data used, this includes all algorithms used for aggregation, disaggregation, fusion, cleansing or scaling of the data, as well as a description of the resulting parameter files. The document places particular emphasis on the computation of parameters for the most important GEPOC model, GEPOC ABM, a continuous-time agent-based population model. An extensive validation study using this particular model was made and is presented at the end of this work.
[477] QuantumBench: A Benchmark for Quantum Problem Solving
Shunya Minami, Tatsuya Ishigaki, Ikko Hamamura, Taku Mikuriya, Youmi Ma, Naoaki Okazaki, Hiroya Takamura, Yohichi Suzuki, Tadashi Kadowaki
Main category: cs.AI
TL;DR: QuantumBench is a new benchmark with ~800 multiple-choice questions across 9 quantum science areas to evaluate LLMs’ domain-specific knowledge in quantum science.
Details
Motivation: There’s a growing need to evaluate whether LLMs accurately capture domain-specific knowledge in quantum science, as general-purpose benchmarks don’t reflect the field’s non-intuitive phenomena and advanced mathematics requirements.
Method: Compiled approximately 800 questions with answers from publicly available materials, organized into an eight-option multiple-choice dataset spanning nine quantum science areas.
Result: The benchmark was used to evaluate several existing LLMs and analyze their performance in quantum domain, including sensitivity to question format changes.
Conclusion: QuantumBench is the first LLM evaluation dataset for quantum domain and is intended to guide effective use of LLMs in quantum research.
Abstract: Large language models are now integrated into many scientific workflows, accelerating data analysis, hypothesis generation, and design space exploration. In parallel with this growth, there is a growing need to carefully evaluate whether models accurately capture domain-specific knowledge and notation, since general-purpose benchmarks rarely reflect these requirements. This gap is especially clear in quantum science, which features non-intuitive phenomena and requires advanced mathematics. In this study, we introduce QuantumBench, a benchmark for the quantum domain that systematically examines how well LLMs understand and can be applied to this non-intuitive field. Using publicly available materials, we compiled approximately 800 questions with their answers spanning nine areas related to quantum science and organized them into an eight-option multiple-choice dataset. With this benchmark, we evaluate several existing LLMs and analyze their performance in the quantum domain, including sensitivity to changes in question format. QuantumBench is the first LLM evaluation dataset built for the quantum domain, and it is intended to guide the effective use of LLMs in quantum research.
[478] Engineering.ai: A Platform for Teams of AI Engineers in Computational Design
Ran Xu, Yupeng Qi, Jingsen Feng, Xu Chu
Main category: cs.AI
TL;DR: Engineering.ai is a multi-agent AI platform for computational design that coordinates specialized AI engineers (aerodynamics, structural, acoustic, optimization) to autonomously perform complex engineering tasks with 100% success rate across 400+ configurations.
Details
Motivation: Human engineering teams face high development time and costs due to multidisciplinary complexity. The goal is to create autonomous AI engineers that can collaborate like human specialists to reduce these burdens.
Method: Hierarchical multi-agent architecture with Chief Engineer coordinating specialized domain agents (Aerodynamics, Structural, Acoustic, Optimization Engineers). Uses file-mediated communication for data provenance, memory system for project context, and integrates tools like FreeCAD, Gmsh, OpenFOAM, CalculiX, BPM acoustic analysis.
Result: Achieved 100% success rate across over 400 parametric configurations with zero mesh generation failures, solver convergence issues, or manual interventions required. Successfully validated through UAV wing optimization.
Conclusion: Agentic-AI-enabled AI engineers can autonomously perform complex engineering tasks reliably and trustworthily, demonstrating potential to revolutionize engineering design workflows.
Abstract: In modern engineering practice, human engineers collaborate in specialized teams to design complex products, with each expert completing their respective tasks while communicating and exchanging results and data with one another. While this division of expertise is essential for managing multidisciplinary complexity, it demands substantial development time and cost. Recently, we introduced OpenFOAMGPT (1.0, 2.0), which functions as an autonomous AI engineer for computational fluid dynamics, and turbulence.ai, which can conduct end-to-end research in fluid mechanics and draft publications and PhD theses. Building upon these foundations, we present Engineering.ai, a platform for teams of AI engineers in computational design. The framework employs a hierarchical multi-agent architecture where a Chief Engineer coordinates specialized agents consisting of Aerodynamics, Structural, Acoustic, and Optimization Engineers, each powered by an LLM with domain-specific knowledge. Agent-agent collaboration is achieved through file-mediated communication for data provenance and reproducibility, while a comprehensive memory system maintains project context, execution history, and retrieval-augmented domain knowledge to ensure reliable decision-making across the workflow. The system integrates FreeCAD, Gmsh, OpenFOAM, CalculiX, and BPM acoustic analysis, enabling parallel multidisciplinary simulations while maintaining computational accuracy. The framework is validated through UAV wing optimization. This work demonstrates that agentic-AI-enabled AI engineers have the potential to perform complex engineering tasks autonomously. Remarkably, the automated workflow achieved a 100% success rate across over 400 parametric configurations, with zero mesh generation failures, solver convergence issues, or manual interventions required, validating that the framework is trustworthy.
[479] ARC-GEN: A Mimetic Procedural Benchmark Generator for the Abstraction and Reasoning Corpus
Michael D. Moffitt
Main category: cs.AI
TL;DR: ARC-GEN is a procedural generator that extends the ARC-AGI training dataset by creating additional sample pairs while maintaining fidelity to the original distribution and characteristics.
Details
Motivation: The ARC-AGI benchmark measures skill acquisition efficiency but has limited demonstration sets with few input-output grid pairs per task, constraining algorithms that need extensive intra-task exemplars.
Method: Developed an open-source procedural generator (ARC-GEN) that is exhaustive (covering all 400 tasks) and mimetic (faithfully reproducing the distributional properties of the original ARC-AGI-1 release).
Result: Created an extended dataset that provides more viable sample pairs while maintaining the original benchmark’s characteristics and properties.
Conclusion: ARC-GEN successfully extends the ARC-AGI training dataset and can be used to establish static benchmark suites for verifying program correctness in competitions like the Google Code Golf Championship.
Abstract: The Abstraction and Reasoning Corpus remains one of the most compelling and challenging benchmarks for tracking progress toward achieving Artificial General Intelligence. In contrast to other evaluation datasets designed to assess an agent’s task-specific skills or accumulated knowledge, the ARC-AGI suite is specifically targeted at measuring skill acquisition efficiency, a trait that has (so far) been lacking in even the most sophisticated machine learning systems. For algorithms that require extensive intra-task exemplars, a significant constraint imposed by ARC-AGI is the modest cardinality of its demonstration set, comprising a small number of $\langle$ input, output $\rangle$ grids per task specifying the corresponding transformation. To embellish the space of viable sample pairs, this paper introduces ARC-GEN, an open-source procedural generator aimed at extending the original ARC-AGI training dataset as faithfully as possible. Unlike prior efforts, our generator is both exhaustive (covering all four-hundred tasks) and mimetic (more closely honoring the distributional properties and characteristics embodied in the initial ARC-AGI-1 release). We also discuss the use of this generator in establishing a static benchmark suite to verify the correctness of programs submitted to the 2025 Google Code Golf Championship.
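The generator itself is only described at a high level here, but the core idea of procedurally producing extra input/output grid pairs for a fixed ARC-style transformation can be sketched as follows. This is a minimal sketch with an invented recolouring rule; it is not one of the 400 ARC-AGI tasks and makes no attempt at ARC-GEN's mimetic sampling.

```python
import random

def random_grid(h, w, colors=(0, 1, 2, 3)):
    """Sample a small ARC-style grid of color indices."""
    return [[random.choice(colors) for _ in range(w)] for _ in range(h)]

def recolor(grid, src=1, dst=2):
    """Toy task rule: repaint every cell of color `src` with color `dst`."""
    return [[dst if c == src else c for c in row] for row in grid]

def sample_pairs(n_pairs=5, seed=0):
    """Generate additional demonstration pairs for the (hypothetical) recolor task."""
    random.seed(seed)
    pairs = []
    for _ in range(n_pairs):
        grid = random_grid(random.randint(3, 6), random.randint(3, 6))
        pairs.append((grid, recolor(grid)))
    return pairs

for inp, out in sample_pairs(2):
    print(inp, "->", out)
```

ARC-GEN's contribution is doing this exhaustively for all 400 tasks while matching the distributional properties of the original release, which the toy sampler above does not attempt.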
[480] Incremental Selection of Most-Filtering Conjectures and Proofs of the Selected Conjectures
Jovial Cheukam Ngouonou, Ramiz Gindullin, Claude-Guy Quimper, Nicolas Beldiceanu, Remi Douence
Main category: cs.AI
TL;DR: Improved incremental selection algorithm from previous work with complete proof of all selected conjectures
Details
Motivation: To enhance the existing selection algorithm presented in prior research and provide formal verification.Method: Developed an improved incremental selection algorithm building upon the previous work
Result: Successfully proved all the selected conjectures using the enhanced algorithm
Conclusion: The improved incremental selection algorithm is effective and provides complete proof coverage for the selected conjectures
Abstract: We present an improved version of the incremental selection algorithm presented in [1] and prove all of the selected conjectures.
[481] Advancing Cognitive Science with LLMs
Dirk U. Wulff, Rui Mata
Main category: cs.AI
TL;DR: LLMs can help address cognitive science’s challenges in knowledge synthesis and conceptual clarity by supporting cross-disciplinary connections, theory formalization, measurement taxonomies, generalizability, and capturing individual variation.
Details
Motivation: Cognitive science faces ongoing challenges in knowledge synthesis and conceptual clarity due to its multifaceted and interdisciplinary nature, and LLMs offer tools that may help address these issues.Method: This review examines how LLMs can support areas where cognitive science has historically struggled, outlining current capabilities and limitations of LLMs in these domains.
Result: LLMs can serve as tools for establishing cross-disciplinary connections, formalizing theories, developing clear measurement taxonomies, achieving generalizability through integrated modeling frameworks, and capturing contextual and individual variation.
Conclusion: LLMs can serve as tools for a more integrative and cumulative cognitive science when used judiciously to complement, rather than replace, human expertise.
Abstract: Cognitive science faces ongoing challenges in knowledge synthesis and conceptual clarity, in part due to its multifaceted and interdisciplinary nature. Recent advances in artificial intelligence, particularly the development of large language models (LLMs), offer tools that may help to address these issues. This review examines how LLMs can support areas where the field has historically struggled, including establishing cross-disciplinary connections, formalizing theories, developing clear measurement taxonomies, achieving generalizability through integrated modeling frameworks, and capturing contextual and individual variation. We outline the current capabilities and limitations of LLMs in these domains, including potential pitfalls. Taken together, we conclude that LLMs can serve as tools for a more integrative and cumulative cognitive science when used judiciously to complement, rather than replace, human expertise.
[482] Advancing AI Challenges for the United States Department of the Air Force
Christian Prothmann, Vijay Gadepally, Jeremy Kepner, Koley Borchard, Luca Carlone, Zachary Folcik, J. Daniel Grith, Michael Houle, Jonathan P. How, Nathan Hughes, Ifueko Igbinedion, Hayden Jananthan, Tejas Jayashankar, Michael Jones, Sertac Karaman, Binoy G. Kurien, Alejandro Lancho, Giovanni Lavezzi, Gary C. F. Lee, Charles E. Leiserson, Richard Linares, Lindsey McEvoy, Peter Michaleas, Chasen Milner, Alex Pentland, Yury Polyanskiy, Jovan Popovich, Jeffrey Price, Tim W. Reid, Stephanie Riley, Siddharth Samsi, Peter Saunders, Olga Simek, Mark S. Veillette, Amir Weiss, Gregory W. Wornell, Daniela Rus, Scott T. Ruppel
Main category: cs.AI
TL;DR: The DAF-MIT AI Accelerator program updates on its challenge problems that advance AI research through public datasets and open-source solutions for defense and civilian applications.
Details
Motivation: To expand the competitive advantage of the United States in defense and civilian sectors by pioneering fundamental advances in artificial intelligence through collaborative challenge problems.Method: Developing and launching public challenge problems with large, publicly available, AI-ready datasets to stimulate open-source solutions and engage the wider academic and private sector AI ecosystem.
Result: Ongoing and new challenges have successfully contributed to AI research and applications of AI technologies.
Conclusion: The AI Accelerator program continues to effectively advance AI research through its challenge-based approach with public datasets and open-source engagement.
Abstract: The DAF-MIT AI Accelerator is a collaboration between the United States Department of the Air Force (DAF) and the Massachusetts Institute of Technology (MIT). This program pioneers fundamental advances in artificial intelligence (AI) to expand the competitive advantage of the United States in the defense and civilian sectors. In recent years, AI Accelerator projects have developed and launched public challenge problems aimed at advancing AI research in priority areas. Hallmarks of AI Accelerator challenges include large, publicly available, and AI-ready datasets to stimulate open-source solutions and engage the wider academic and private sector AI ecosystem. This article supplements our previous publication, which introduced AI Accelerator challenges. We provide an update on how ongoing and new challenges have successfully contributed to AI research and applications of AI technologies.
[483] Better Call CLAUSE: A Discrepancy Benchmark for Auditing LLMs Legal Reasoning Capabilities
Manan Roy Choudhury, Adithya Chandramouli, Mannan Anand, Vivek Gupta
Main category: cs.AI
TL;DR: CLAUSE is a benchmark for testing LLMs’ reliability in detecting subtle legal flaws in contracts, revealing that current models often miss nuanced errors and struggle with legal justification.
Details
Motivation: There's a critical gap in evaluating LLMs' reliability for high-stakes legal work, as no existing benchmark systematically tests their ability to handle adversarial and subtle flaws in real-world contracts.Method: Created CLAUSE benchmark with over 7500 perturbed contracts from CUAD and ContractNLI datasets, using a persona-driven pipeline to generate 10 anomaly categories and validating them with RAG system for legal fidelity.
Result: Leading LLMs show key weaknesses in detecting embedded legal flaws - they often miss subtle errors and struggle significantly to provide legal justifications for their findings.
Conclusion: The work outlines a path to identify and correct reasoning failures in legal AI, highlighting the need for improved benchmarks to enhance LLM reliability in legal applications.
Abstract: The rapid integration of large language models (LLMs) into high-stakes legal work has exposed a critical gap: no benchmark exists to systematically stress-test their reliability against the nuanced, adversarial, and often subtle flaws present in real-world contracts. To address this, we introduce CLAUSE, a first-of-its-kind benchmark designed to evaluate the fragility of an LLM’s legal reasoning. We study the capabilities of LLMs to detect and reason about fine-grained discrepancies by producing over 7500 real-world perturbed contracts from foundational datasets like CUAD and ContractNLI. Our novel, persona-driven pipeline generates 10 distinct anomaly categories, which are then validated against official statutes using a Retrieval-Augmented Generation (RAG) system to ensure legal fidelity. We use CLAUSE to evaluate leading LLMs’ ability to detect embedded legal flaws and explain their significance. Our analysis shows a key weakness: these models often miss subtle errors and struggle even more to justify them legally. Our work outlines a path to identify and correct such reasoning failures in legal AI.
[484] Diverse Human Value Alignment for Large Language Models via Ethical Reasoning
Jiahao Wang, Songkai Xue, Jinghui Li, Xiaozhen Wang
Main category: cs.AI
TL;DR: Proposes a novel ethical reasoning framework for LLMs to improve human value alignment through structured five-step ethical decision-making process.
Details
Motivation: Current LLM alignment approaches yield superficial conformity rather than genuine ethical understanding, failing to address complex, context-dependent human values across different regions and cultures.Method: A structured five-step ethical reasoning process: contextual fact gathering, hierarchical social norm identification, option generation, multiple-lens ethical impact analysis, and reflection. Implementable via prompt engineering or supervised fine-tuning.
Result: Significantly improves LLM alignment with diverse human values on SafeWorld benchmark, enabling more accurate social norm identification and culturally appropriate reasoning compared to baseline methods.
Conclusion: Provides a concrete pathway for developing LLMs that better align with multifaceted global values through interdisciplinary research and interpretable ethical reasoning.
Abstract: Ensuring that Large Language Models (LLMs) align with the diverse and evolving human values across different regions and cultures remains a critical challenge in AI ethics. Current alignment approaches often yield superficial conformity rather than genuine ethical understanding, failing to address the complex, context-dependent nature of human values. In this paper, we propose a novel ethical reasoning paradigm for LLMs inspired by well-established ethical decision-making models, aiming at enhancing diverse human value alignment through deliberative ethical reasoning. Our framework consists of a structured five-step process, including contextual fact gathering, hierarchical social norm identification, option generation, multiple-lens ethical impact analysis, and reflection. This theory-grounded approach guides LLMs through an interpretable reasoning process that enhances their ability to understand regional specificities and perform nuanced ethical analysis, which can be implemented with either prompt engineering or supervised fine-tuning methods. We perform evaluations on the SafeWorld benchmark, which is specially designed for regional value alignment. Experimental results demonstrate that our framework significantly improves LLM alignment with diverse human values compared to baseline methods, enabling more accurate social norm identification and more culturally appropriate reasoning. Our work provides a concrete pathway toward developing LLMs that align more effectively with the multifaceted values of global societies through interdisciplinary research.
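Since the abstract notes that the framework can be implemented with prompt engineering alone, a minimal prompt-template sketch is shown below. Only the five step names come from the paper; the surrounding wording and the helper function are assumptions for illustration.

```python
FIVE_STEP_ETHICAL_PROMPT = """Answer the user's request only after working through these steps explicitly:

1. Contextual fact gathering: list the relevant facts and the cultural/regional context.
2. Hierarchical social norm identification: identify applicable norms, from universal
   principles down to local or situational expectations.
3. Option generation: enumerate the realistic courses of action.
4. Multiple-lens ethical impact analysis: assess each option from several ethical
   perspectives (e.g., consequences, duties, fairness).
5. Reflection: reconsider the analysis, then give the response that best fits the identified norms.

User request: {request}
"""

def build_prompt(request: str) -> str:
    """Fill the (illustrative) five-step template with a concrete request."""
    return FIVE_STEP_ETHICAL_PROMPT.format(request=request)

print(build_prompt("Is it acceptable to decline a gift from a business partner?"))
```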
[485] Efficiency vs. Alignment: Investigating Safety and Fairness Risks in Parameter-Efficient Fine-Tuning of LLMs
Mina Taraghi, Yann Pequignot, Amin Nikanjam, Mohamed Amine Merzouk, Foutse Khomh
Main category: cs.AI
TL;DR: Systematic evaluation of 4 PEFT methods (LoRA, IA3, Prompt-Tuning, P-Tuning) on 4 LLM families shows adapter-based methods improve safety and preserve fairness better than prompt-based methods, with base model type strongly moderating alignment shifts.
Details
Motivation: Organizations use publicly hosted LLMs fine-tuned for specialized tasks, but these adaptations can degrade safety and fairness. Different fine-tuning techniques may have distinct effects on these critical dimensions.Method: Applied four PEFT methods (LoRA, IA3, Prompt-Tuning, P-Tuning) to four instruction-tuned model families (Meta-Llama-3-8B, Qwen2.5-7B, Mistral-7B, Gemma-7B), creating 235 variants evaluated across 11 safety hazard categories and 9 demographic fairness dimensions.
Result: Adapter-based approaches (LoRA, IA3) improve safety scores and are least disruptive to fairness. Prompt-based methods reduce safety and cause larger fairness regressions. Base model type strongly moderates alignment shifts - LLaMA stable, Qwen modest gains, Gemma steep safety decline, Mistral greatest variance. Safety improvements don’t necessarily translate to fairness improvements.
Conclusion: Practical guideline for safety-critical deployments: start with well-aligned base model, favor adapter-based PEFT, and conduct category-specific audits of both safety and fairness. No single configuration optimizes all fairness metrics simultaneously, indicating inherent trade-offs.
Abstract: Organizations are increasingly adopting and adapting Large Language Models (LLMs) hosted on public repositories such as HuggingFace. Although these adaptations often improve performance on specialized downstream tasks, recent evidence indicates that they can also degrade a model’s safety or fairness. Since different fine-tuning techniques may exert distinct effects on these critical dimensions, this study undertakes a systematic assessment of their trade-offs. Four widely used Parameter-Efficient Fine-Tuning methods, LoRA, IA3, Prompt-Tuning, and P-Tuning, are applied to four instruction-tuned model families (Meta-Llama-3-8B, Qwen2.5-7B, Mistral-7B, and Gemma-7B). In total, 235 fine-tuned variants are evaluated across eleven safety hazard categories and nine demographic fairness dimensions. The results show that adapter-based approaches (LoRA, IA3) tend to improve safety scores and are the least disruptive to fairness, retaining higher accuracy and lower bias scores. In contrast, prompt-based methods (Prompt-Tuning and P-Tuning) generally reduce safety and cause larger fairness regressions, with decreased accuracy and increased bias. Alignment shifts are strongly moderated by base model type: LLaMA remains stable, Qwen records modest gains, Gemma experiences the steepest safety decline, and Mistral, which is released without an internal moderation layer, displays the greatest variance. Improvements in safety do not necessarily translate into improvements in fairness, and no single configuration optimizes all fairness metrics simultaneously, indicating an inherent trade-off between these objectives. These findings suggest a practical guideline for safety-critical deployments: begin with a well-aligned base model, favour adapter-based PEFT, and conduct category-specific audits of both safety and fairness.
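For readers unfamiliar with the adapter-based methods the study favours, the sketch below shows how LoRA is typically attached to one of the evaluated instruction-tuned models with the Hugging Face peft library. The rank, scaling, and target modules are illustrative defaults, not the hyperparameters used in the paper.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# One of the instruction-tuned families evaluated in the paper (illustrative choice).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Illustrative LoRA configuration; the paper's exact settings are not given here.
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```

The same pattern applies to IA3 via peft's IA3Config, whereas Prompt-Tuning and P-Tuning instead learn a small set of virtual prompt embeddings prepended to the input.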
[486] Leveraging Multi-Agent System (MAS) and Fine-Tuned Small Language Models (SLMs) for Automated Telecom Network Troubleshooting
Chenhua Shi, Bhavika Jalli, Gregor Macdonald, John Zou, Wanlu Lei, Mridul Jain, Joji Philip
Main category: cs.AI
TL;DR: A Multi-Agent System using LLMs to automate telecom network troubleshooting by coordinating specialized agents for fault detection, diagnosis, and remediation planning.
Details
Motivation: Telecom networks are growing in scale and complexity, making manual troubleshooting by experts inefficient. Existing AI models are narrow in scope, require large labeled datasets, and struggle to generalize across heterogeneous deployments.Method: Proposed a Multi-Agent System with LLMs coordinating specialized agents (orchestrator, solution planner, executor, data retriever, root-cause analyzer). Fine-tuned a Small Language Model on proprietary troubleshooting documents for domain-grounded solution planning.
Result: The framework significantly accelerates troubleshooting automation across both Radio Access Network (RAN) and Core network domains.
Conclusion: The MAS approach with LLM coordination and fine-tuned SLM for solution planning effectively automates network troubleshooting, reducing reliance on manual expert intervention.
Abstract: Telecom networks are rapidly growing in scale and complexity, making effective management, operation, and optimization increasingly challenging. Although Artificial Intelligence (AI) has been applied to many telecom tasks, existing models are often narrow in scope, require large amounts of labeled data, and struggle to generalize across heterogeneous deployments. Consequently, network troubleshooting continues to rely heavily on Subject Matter Experts (SMEs) to manually correlate various data sources to identify root causes and corrective actions. To address these limitations, we propose a Multi-Agent System (MAS) that employs an agentic workflow, with Large Language Models (LLMs) coordinating multiple specialized tools for fully automated network troubleshooting. Once faults are detected by AI/ML-based monitors, the framework dynamically activates agents such as an orchestrator, solution planner, executor, data retriever, and root-cause analyzer to diagnose issues and recommend remediation strategies within a short time frame. A key component of this system is the solution planner, which generates appropriate remediation plans based on internal documentation. To enable this, we fine-tuned a Small Language Model (SLM) on proprietary troubleshooting documents to produce domain-grounded solution plans. Experimental results demonstrate that the proposed framework significantly accelerates troubleshooting automation across both Radio Access Network (RAN) and Core network domains.
[487] A Multimodal Framework for Depression Detection during Covid-19 via Harvesting Social Media: A Novel Dataset and Method
Ashutosh Anshul, Gumpili Sai Pranav, Mohammad Zia Ur Rehman, Nagendra Kumar
Main category: cs.AI
TL;DR: A multimodal framework combining text, user data, and images from social media to detect depression, especially during COVID-19, outperforming existing methods by 2%-8%.
Details
Motivation: The COVID-19 pandemic increased mental health issues like depression, but detection is difficult due to unwillingness to consult doctors. Social media provides a rich data source for detecting depression through users' emotional expressions.Method: Proposes a multimodal framework using textual, user-specific, and image analysis. Includes extracting content from URLs in tweets, text from images, and five feature sets. Introduces Visual Neural Network (VNN) for image embeddings and creates a curated COVID-19 depression dataset.
Result: The model outperforms state-of-the-art methods by 2%-8% on benchmark datasets and shows promising results on the COVID-19 dataset. Analysis reveals the impact of each modality on depression detection.
Conclusion: The multimodal approach effectively detects depression from social media data, with each modality providing valuable insights into users’ mental states, demonstrating particular relevance during the COVID-19 pandemic.
Abstract: The recent coronavirus disease (Covid-19) has become a pandemic and has affected the entire globe. During the pandemic, we have observed a spike in cases related to mental health, such as anxiety, stress, and depression. Depression significantly influences most diseases worldwide, making it difficult to detect mental health conditions in people due to unawareness and unwillingness to consult a doctor. However, nowadays, people extensively use online social media platforms to express their emotions and thoughts. Hence, social media platforms are now becoming a large data source that can be utilized for detecting depression and mental illness. However, existing approaches often overlook data sparsity in tweets and the multimodal aspects of social media. In this paper, we propose a novel multimodal framework that combines textual, user-specific, and image analysis to detect depression among social media users. To provide enough context about the user’s emotional state, we propose (i) an extrinsic feature by harnessing the URLs present in tweets and (ii) extracting textual content present in images posted in tweets. We also extract five sets of features belonging to different modalities to describe a user. Additionally, we introduce a Deep Learning model, the Visual Neural Network (VNN), to generate embeddings of user-posted images, which are used to create the visual feature vector for prediction. We contribute a curated Covid-19 dataset of depressed and non-depressed users for research purposes and demonstrate the effectiveness of our model in detecting depression during the Covid-19 outbreak. Our model outperforms existing state-of-the-art methods over a benchmark dataset by 2%-8% and produces promising results on the Covid-19 dataset. Our analysis highlights the impact of each modality and provides valuable insights into users’ mental and emotional states.
[488] A CPU-Centric Perspective on Agentic AI
Ritik Raj, Hong Wang, Tushar Krishna
Main category: cs.AI
TL;DR: This paper analyzes CPU bottlenecks in agentic AI systems, showing that tool processing on CPUs can consume up to 90.6% of total latency and CPU energy can reach 44% of total dynamic energy. The authors propose CPU-GPU aware micro-batching and mixed workload scheduling optimizations that achieve up to 2.1x latency speedup.
Details
Motivation: To understand and characterize the system bottlenecks introduced by agentic AI workloads from a CPU-centric perspective, which has been largely overlooked despite CPUs playing a significant role in tool processing and orchestration.Method: Systematically characterized agentic AI based on orchestrator components, inference path dynamics, and repetitiveness. Profiled five representative workloads (Haystack RAG, Toolformer, ChemCrow, Langchain, SWE-Agent) for latency, throughput, and energy metrics. Proposed two optimizations: CPU and GPU-Aware Micro-batching (CGAM) and Mixed Agentic Workload Scheduling (MAWS).
Result: Key findings: 1) Tool processing on CPUs takes up to 90.6% of total latency; 2) Throughput bottlenecked by CPU factors (coherence, synchronization, core oversubscription) or GPU factors (memory capacity/bandwidth); 3) CPU dynamic energy consumes up to 44% of total dynamic energy at large batch sizes. Optimizations achieved up to 2.1x and 1.41x P50 latency speedup for homogeneous and heterogeneous workloads respectively.
Conclusion: CPU bottlenecks are significant in agentic AI systems, and the proposed CPU-GPU aware optimizations demonstrate substantial potential to improve performance, efficiency, and scalability of agentic AI workloads by better managing CPU-GPU resource coordination.
Abstract: Agentic AI frameworks add a decision-making orchestrator embedded with external tools, including web search, Python interpreter, contextual database, and others, on top of monolithic LLMs, turning them from passive text oracles into autonomous problem-solvers that can plan, call tools, remember past steps, and adapt on the fly. This paper aims to characterize and understand the system bottlenecks introduced by agentic AI workloads from a largely overlooked CPU-centric perspective. We first systematically characterize agentic AI on the basis of the orchestrator/decision-making component, inference path dynamics, and repetitiveness of the agentic flow, which directly influence system-level performance. Thereafter, based on the characterization, we choose five representative agentic AI workloads (Haystack RAG, Toolformer, ChemCrow, Langchain, and SWE-Agent) to profile latency, throughput, and energy metrics and demystify the significant impact of CPUs on these metrics relative to GPUs. We observe that: 1. tool processing on CPUs can take up to 90.6% of the total latency; 2. agentic throughput gets bottlenecked either by CPU factors (coherence, synchronization, and over-subscription of cores) or by GPU factors (main memory capacity and bandwidth); 3. CPU dynamic energy consumes up to 44% of the total dynamic energy at large batch sizes. Based on the profiling insights, we present two key optimizations, 1. CPU and GPU-Aware Micro-batching (CGAM) and 2. Mixed Agentic Workload Scheduling (MAWS), for homogeneous and heterogeneous agentic workloads respectively, to demonstrate the potential to improve the performance, efficiency, and scalability of agentic AI. We achieve up to 2.1x and 1.41x P50 latency speedup compared to the multi-processing benchmark for homogeneous and heterogeneous agentic workloads respectively.
[489] GraphChain: Large Language Models for Large-scale Graph Analysis via Tool Chaining
Chunyu Wei, Wenji Hu, Xingjia Hao, Xin Wang, Yifan Yang, Yueguo Chen, Yang Tian, Yunhai Wang
Main category: cs.AI
TL;DR: GraphChain enables LLMs to analyze large-scale graphs through dynamic tool sequences using progressive graph distillation and structure-aware adaptation.
Details
Motivation: LLMs struggle with large-scale graphs due to context constraints and inflexible reasoning, limiting their applicability to complex graph analysis tasks.Method: Uses Progressive Graph Distillation (RL-based tool sequence optimization) and Structure-aware Test-Time Adaptation (spectral properties + lightweight adapters for topology-specific tool selection).
Result: Significantly outperforms prior methods, enabling scalable and adaptive LLM-driven graph analysis.
Conclusion: GraphChain provides an effective framework for LLMs to handle complex graph analysis through dynamic tool orchestration and topology-aware adaptation.
Abstract: Large Language Models (LLMs) face significant limitations when applied to large-scale graphs, struggling with context constraints and inflexible reasoning. We present GraphChain, a framework that enables LLMs to analyze complex graphs through dynamic sequences of specialized tools, mimicking human exploratory intelligence. Our approach introduces two key innovations: (1) Progressive Graph Distillation, a reinforcement learning mechanism that generates optimized tool sequences balancing task relevance with information compression, and (2) Structure-aware Test-Time Adaptation, which efficiently tailors tool selection strategies to diverse graph topologies using spectral properties and lightweight adapters without costly retraining. Experiments show GraphChain significantly outperforms prior methods, enabling scalable and adaptive LLM-driven graph analysis.
[490] Reimagining Safety Alignment with An Image
Yifan Xia, Guorui Chen, Wenqian Yu, Zhijiang Li, Philip Torr, Jindong Gu
Main category: cs.AI
TL;DR: Magic Image is an optimization-driven visual prompt framework that enhances security while reducing over-refusal in multimodal large language models (MLLMs) without parameter updates.
Details
Motivation: LLMs face challenges with harmful content generation under jailbreak attacks and over-refusal of benign queries, complicated by the need to accommodate different value systems and align with safety preferences. Traditional methods like SFT and RLHF are costly and cannot support multiple value systems within a single model.Method: Optimize image prompts using harmful/benign samples to enable a single model to adapt to different value systems and better align with given safety preferences without parameter updates.
Result: Experiments demonstrate improved safety-effectiveness balance across diverse datasets while preserving model performance.
Conclusion: Magic Image offers a practical solution for deployable MLLM safety alignment by enhancing security while reducing over-refusal through optimized visual prompts.
Abstract: Large language models (LLMs) excel in diverse applications but face dual challenges: generating harmful content under jailbreak attacks and over-refusal of benign queries due to rigid safety mechanisms. These issues are further complicated by the need to accommodate different value systems and precisely align with given safety preferences. Moreover, traditional methods like SFT and RLHF lack this capability due to their costly parameter tuning requirements and inability to support multiple value systems within a single model. These problems are more obvious in multimodal large language models (MLLMs), especially in terms of heightened over-refusal in cross-modal tasks and new security risks arising from expanded attack surfaces. We propose Magic Image, an optimization-driven visual prompt framework that enhances security while reducing over-refusal. By optimizing image prompts using harmful/benign samples, our method enables a single model to adapt to different value systems and better align with given safety preferences without parameter updates. Experiments demonstrate improved safety-effectiveness balance across diverse datasets while preserving model performance, offering a practical solution for deployable MLLM safety alignment.
[491] Efficient Generation of Binary Magic Squares
Alain Riou
Main category: cs.AI
TL;DR: A simple algorithm for generating Binary Magic Squares (BMS) with optimal complexity, extended to non-square BMS with parallel GPU implementation.
Details
Motivation: To develop efficient methods for generating binary matrices with equal row and column sums, addressing both square and non-square cases.Method: Proposed a simple inductive algorithm for square BMS, then extended it with a variant for non-square BMS with formalized existence conditions.
Result: Proven optimal theoretical complexity for square BMS generation, with provable generation for non-square cases and publicly released Python implementations.
Conclusion: The algorithm provides efficient BMS generation with theoretical guarantees and practical implementations including GPU acceleration.
Abstract: We propose a simple algorithm for generating Binary Magic Squares (BMS), i.e., square binary matrices where the sums of all rows and all columns are equal. We show by induction that our algorithm always returns valid BMS with optimal theoretical complexity. We then extend our study to non-square Binary Magic Squares, formalize conditions on the sums of rows and columns for these BMS to exist, and show that a slight variant of our first algorithm can provably generate them. Finally, we publicly release two implementations of our algorithm as Python packages, including one that can generate several BMS in parallel using GPU acceleration.
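The paper's own algorithm is inductive and proven optimal; the sketch below is not that algorithm, only a standard circulant construction that makes the BMS object concrete (every row is a cyclic shift of the first row, so all row and column sums equal k).

```python
import numpy as np

def circulant_bms(n: int, k: int) -> np.ndarray:
    """Return an n x n binary matrix whose every row and every column sums to k.

    Standard circulant construction, shown only to illustrate what a Binary Magic
    Square is; this is not the generation algorithm proposed in the paper.
    """
    assert 0 <= k <= n
    base = np.zeros(n, dtype=int)
    base[:k] = 1                                            # k ones in the first row
    return np.stack([np.roll(base, i) for i in range(n)])   # cyclic shifts preserve column sums

M = circulant_bms(5, 2)
assert M.sum(axis=0).tolist() == [2] * 5 and M.sum(axis=1).tolist() == [2] * 5
print(M)
```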
[492] Single-agent Reinforcement Learning Model for Regional Adaptive Traffic Signal Control
Qiang Li, Ningjing Zeng, Lina Yu
Main category: cs.AI
TL;DR: This paper proposes a single-agent reinforcement learning model for regional adaptive traffic signal control that uses probe vehicle data to estimate queue lengths and coordinate multiple intersections.
Details
Motivation: Existing multi-agent RL approaches for traffic signal control face scalability challenges, while real-world traffic control requires centralized management by a single control center that can monitor all roads and coordinate all intersections.Method: The authors developed a single-agent RL framework with state, action, and reward functions defined based on queue length. The queue length definition is adapted for reliable estimation using probe vehicle travel time data, and actions are designed to regulate queue dynamics.
Result: Comprehensive evaluation using SUMO simulation platform showed that the proposed model effectively mitigates large-scale regional congestion through coordinated multi-intersection control.
Conclusion: The single-agent RL approach with probe vehicle compatibility offers a scalable solution for regional traffic signal control that can be widely deployed using existing probe vehicle infrastructure on urban roads.
Abstract: Several studies have employed reinforcement learning (RL) to address the challenges of regional adaptive traffic signal control (ATSC) and achieved promising results. In this field, existing research predominantly adopts multi-agent frameworks. However, the adoption of multi-agent frameworks presents challenges for scalability. Instead, the Traffic signal control (TSC) problem necessitates a single-agent framework. TSC inherently relies on centralized management by a single control center, which can monitor traffic conditions across all roads in the study area and coordinate the control of all intersections. This work proposes a single-agent RL-based regional ATSC model compatible with probe vehicle technology. Key components of the RL design include state, action, and reward function definitions. To facilitate learning and manage congestion, both state and reward functions are defined based on queue length, with action designed to regulate queue dynamics. The queue length definition used in this study differs slightly from conventional definitions but is closely correlated with congestion states. More importantly, it allows for reliable estimation using link travel time data from probe vehicles. With probe vehicle data already covering most urban roads, this feature enhances the proposed method’s potential for widespread deployment. The method was comprehensively evaluated using the SUMO simulation platform. Experimental results demonstrate that the proposed model effectively mitigates large-scale regional congestion levels via coordinated multi-intersection control.
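Because the RL design keys both state and reward to queue lengths, the single agent's reward can be sketched as the negative total queue length over the whole region, as below. This aggregation is a common choice and an assumption here, not necessarily the paper's exact definition, and the probe-vehicle estimation of queue lengths from link travel times is not shown.

```python
def regional_reward(queue_lengths_by_intersection):
    """Single-agent reward: penalize total estimated queue length over all intersections.

    `queue_lengths_by_intersection` maps intersection id -> per-approach queue lengths,
    assumed to be estimated upstream from probe-vehicle link travel times.
    """
    return -sum(sum(queues) for queues in queue_lengths_by_intersection.values())

print(regional_reward({"A": [3, 0, 5], "B": [1, 2]}))  # -11
```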
[493] PreferThinker: Reasoning-based Personalized Image Preference Assessment
Shengqi Xu, Xinpeng Zhou, Yabo Zhang, Ming Liu, Tao Liang, Tianyu Zhang, Yalong Bai, Zuxuan Wu, Wangmeng Zuo
Main category: cs.AI
TL;DR: A reasoning-based framework for personalized image preference assessment that uses a predict-then-assess paradigm with common preference profiles to bridge user data.
Details
Motivation: Existing methods struggle with personalized preferences due to scarce user-specific data and diverse individual tastes, requiring a scalable approach.Method: Two-stage training: cold-start supervised fine-tuning for structured reasoning, followed by reinforcement learning with similarity-aware prediction reward for better profile prediction.
Result: Extensive experiments demonstrate the superiority of the proposed method in handling personalized image preference assessment.
Conclusion: The framework effectively leverages large-scale user data through common preference profiles and structured reasoning for interpretable, multi-dimensional personalized assessments.
Abstract: Personalized image preference assessment aims to evaluate an individual user’s image preferences by relying only on a small set of reference images as prior information. Existing methods mainly focus on general preference assessment, training models with large-scale data to tackle well-defined tasks such as text-image alignment. However, these approaches struggle to handle personalized preference because user-specific data are scarce and not easily scalable, and individual tastes are often diverse and complex. To overcome these challenges, we introduce a common preference profile that serves as a bridge across users, allowing large-scale user data to be leveraged for training profile prediction and capturing complex personalized preferences. Building on this idea, we propose a reasoning-based personalized image preference assessment framework that follows a predict-then-assess paradigm: it first predicts a user’s preference profile from reference images, and then provides interpretable, multi-dimensional scores and assessments of candidate images based on the predicted profile. To support this, we first construct a large-scale Chain-of-Thought (CoT)-style personalized assessment dataset annotated with diverse user preference profiles and high-quality CoT-style reasoning, enabling explicit supervision of structured reasoning. Next, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase to empower the model with structured reasoning capabilities, followed by reinforcement learning to incentivize the model to explore more reasonable assessment paths and enhance generalization. Furthermore, we propose a similarity-aware prediction reward to encourage better prediction of the user’s preference profile, which facilitates the exploration of more reasonable assessments. Extensive experiments demonstrate the superiority of the proposed method.
[494] DTS: Enhancing Large Reasoning Models via Decoding Tree Sketching
Zicheng Xu, Guanchu Wang, Yu-Neng Chuang, Guangyao Zheng, Alexander S. Szalay, Zirui Liu, Vladimir Braverman
Main category: cs.AI
TL;DR: DTS is a model-agnostic decoding framework that addresses overthinking in Large Reasoning Models by selectively branching at high-entropy tokens and applying early stopping to find the shortest optimal reasoning paths, improving both accuracy and efficiency.
Details
Motivation: Large Reasoning Models often suffer from overthinking, producing excessively long chain-of-thought traces that increase inference cost and may degrade accuracy. Analysis shows an anti-correlation between reasoning length and accuracy.Method: DTS sketches the reasoning space by selectively branching at high-entropy tokens and applies early stopping to select the shortest completed reasoning path, approximating the optimal solution without requiring additional training or supervision.
Result: Experiments on AIME2024 and AIME2025 datasets show DTS improves accuracy by up to 8%, reduces average reasoning length by 23%, and decreases repetition frequency by 12% compared to baseline methods.
Conclusion: DTS provides a scalable and efficient approach for LRM reasoning that enhances both accuracy and efficiency by finding shorter optimal reasoning paths through selective branching and early stopping.
Abstract: Large Reasoning Models (LRMs) demonstrate strong performance on complex reasoning tasks, yet they often suffer from overthinking, producing excessively long chain-of-thought (CoT) traces that increase inference cost and may degrade accuracy. Our analysis reveals a clear anti-correlation between reasoning length and accuracy, where across multiple stochastic decodes, the short reasoning paths consistently achieve the highest correctness, while longer ones accumulate errors and repetitions. These short optimal reasoning paths can be found ideally through full enumeration of the reasoning space. However, the tree-structured reasoning space grows exponentially with sequence length, rendering exhaustive exploration infeasible. To address this, we propose DTS, a model-agnostic decoding framework that sketches the reasoning space by selectively branching at high-entropy tokens and applies early stopping to select the shortest completed reasoning path. This approach approximates the optimal solution that enhances both efficiency and accuracy, without requiring additional training or supervision. Experiments on AIME2024 and AIME2025 datasets with DeepSeek-R1-Distill-Qwen-7B and 1.5B show that DTS improves accuracy by up to 8%, reduces average reasoning length by 23%, and decreases repetition frequency by 12%, demonstrating DTS’s ability for scalable and efficient LRM reasoning.
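DTS's branching criterion, token-level entropy, is easy to make concrete. The sketch below computes the entropy of a next-token distribution and decides whether to branch at that step; the threshold and toy logits are assumptions for illustration, and the full tree sketching with early stopping over completed paths is not shown.

```python
import numpy as np

def entropy(logits: np.ndarray) -> float:
    """Shannon entropy (in nats) of the next-token distribution implied by logits."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def should_branch(logits: np.ndarray, threshold: float = 1.0) -> bool:
    """Branch the decoding tree only where the model is genuinely uncertain."""
    return entropy(logits) > threshold

# Toy check: a peaked distribution does not trigger branching, a flat one does.
peaked = np.array([5.0, 0.0, 0.0, 0.0])
flat = np.zeros(4)
print(should_branch(peaked), should_branch(flat))  # False True
```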
[495] STACKFEED: Structured Textual Actor-Critic Knowledge Base Editing with FeedBack
Shashank Kirtania, Naman Gupta, Priyanshu Gupta, Krishna Kariya, Sumit Gulwani, Arun Iyer, Suresh Parthasarathy, Arjun Radhakrishna, Sriram K. Rajamani, Gustavo Soares
Main category: cs.AI
TL;DR: STACKFEED is a novel approach that uses multi-actor reinforcement learning with expert feedback to iteratively refine knowledge bases, improving RAG system performance in low-resource settings.
Details
Motivation: LLMs often generate incorrect or outdated information, especially in low-resource settings or with private data. While RAG uses external knowledge bases, these KBs can also contain inaccuracies that need to be addressed.Method: STACKFEED uses a structured textual actor-critic framework with feedback, employing multi-actor reinforcement learning. It defines ReACT actor agents on each document to perform structured edits based on document-specific targeted instructions.
Result: Experimental results show STACKFEED significantly improves KB quality and RAG system performance across low-resource programming problems, modified python packages, and factual question-answering tasks.
Conclusion: STACKFEED effectively addresses knowledge base inaccuracies through iterative refinement using expert feedback and reinforcement learning, enhancing the reliability of RAG systems in challenging scenarios.
Abstract: Large Language Models (LLMs) often generate incorrect or outdated information, especially in low-resource settings or when dealing with private data. To address this, Retrieval-Augmented Generation (RAG) uses external knowledge bases (KBs), but these can also suffer from inaccuracies. We introduce STACKFEED, a novel Structured Textual Actor-Critic Knowledge base editing with FEEDback approach that iteratively refines the KB based on expert feedback using a multi-actor, centralized critic reinforcement learning framework. STACKFEED defines a ReACT actor agent on each document to perform structured edits based on document-specific targeted instructions. Experimental results showcase that STACKFEED significantly improves KB quality and the performance of the RAG system. We evaluate STACKFEED on low-resource programming problems, modified Python packages, and factual question-answering tasks.
[496] Lifted Successor Generation in Numeric Planning
Dominik Drexler
Main category: cs.AI
TL;DR: Extended lifted successor generator for numeric planning by adding numeric precondition support to maximum clique enumeration method, enabling lifted planning for rich numeric domains.
Details
Motivation: Grounding numeric planning tasks can cause exponential blowup in task representation size, especially for hard-to-ground tasks. Existing lifted successor generators don't support numeric preconditions.Method: Extended state-of-the-art lifted successor generator using maximum clique enumeration in substitution consistency graph, augmented with numeric action preconditions. Final applicability check filters inapplicable actions.
Result: Generator is exact under specified conditions. In 23 of 25 benchmark domains, no inapplicable actions are generated. Only 1 domain shows issues. First lifted successor generator supporting numeric preconditions.
Conclusion: Enables future research on lifted planning for rich numeric planning fragments by providing the first lifted successor generator that handles numeric action preconditions.
Abstract: Most planners ground numeric planning tasks, given in a first-order-like language, into a ground task representation. However, this can lead to an exponential blowup in task representation size, which occurs in practice for hard-to-ground tasks. We extend a state-of-the-art lifted successor generator for classical planning to support numeric precondition applicability. The method enumerates maximum cliques in a substitution consistency graph. Each maximum clique represents a substitution for the variables of the action schema, yielding a ground action. We augment this graph with numeric action preconditions and prove the successor generator is exact under formally specified conditions. When the conditions fail, our generator may list inapplicable ground actions; a final applicability check filters these without affecting completeness. However, this cannot happen in 23 of 25 benchmark domains, and it occurs only in 1 domain. To the authors’ knowledge, no other lifted successor generator supports numeric action preconditions. This enables future research on lifted planning for a very rich planning fragment.
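The central computational step, enumerating cliques of a consistency graph whose vertices are candidate variable-to-object substitutions, can be illustrated with networkx, whose find_cliques routine enumerates maximal cliques. The tiny graph below is invented for illustration and omits the numeric-precondition augmentation that is the paper's actual contribution.

```python
import networkx as nx

# Vertices: candidate (variable, object) substitutions for one action schema.
# Edges: pairs of substitutions that are mutually consistent in the current state.
G = nx.Graph()
G.add_edges_from([
    (("?x", "a"), ("?y", "b")),
    (("?x", "a"), ("?y", "c")),
    (("?x", "d"), ("?y", "b")),
])

# Every clique that binds all variables of the schema yields one candidate ground action.
for clique in nx.find_cliques(G):
    if {var for var, _ in clique} == {"?x", "?y"}:
        print(dict(clique))
```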
[497] GraphTeam: Facilitating Large Language Model-based Graph Analysis via Multi-Agent Collaboration
Xin Li, Qizhi Chu, Yubin Chen, Yang Liu, Yaoqi Liu, Zekai Yu, Weize Chen, Chen Qian, Chuan Shi, Cheng Yang
Main category: cs.AI
TL;DR: GraphTeam is a multi-agent LLM system for graph analysis that combines specialized agents for question understanding, knowledge retrieval, and problem-solving to achieve state-of-the-art performance.
Details
Motivation: Existing LLM-based graph analysis approaches either rely on GNNs (limiting transferability) or use LLMs alone (suboptimal performance). The authors aim to leverage LLM-based agents' ability to use external tools and knowledge for better graph analysis.Method: A multi-agent system with five LLM-based agents across three modules: input-output normalization (question/answer agents), external knowledge retrieval (search agent with knowledge base), and problem-solving (coding agent for algorithms, reasoning agent as backup).
Result: GraphTeam achieves state-of-the-art performance with 25.85% average accuracy improvement over best baselines across six graph analysis benchmarks.
Conclusion: The proposed multi-agent approach effectively addresses limitations of existing methods by simulating human problem-solving strategies like analogy and collaboration, demonstrating significant performance gains in graph analysis tasks.
Abstract: Graphs are widely used for modeling relational data in real-world scenarios, such as social networks and urban computing. Existing LLM-based graph analysis approaches either integrate graph neural networks (GNNs) for specific machine learning tasks, limiting their transferability, or rely solely on LLMs’ internal reasoning ability, resulting in suboptimal performance. To address these limitations, we take advantage of recent advances in LLM-based agents, which have shown capabilities of utilizing external knowledge or tools for problem solving. By simulating human problem-solving strategies such as analogy and collaboration, we propose a multi-agent system based on LLMs named GraphTeam, for graph analysis. GraphTeam consists of five LLM-based agents from three modules, and the agents with different specialities can collaborate with each other to address complex problems. Specifically, (1) input-output normalization module: the question agent extracts and refines four key arguments from the original question, facilitating the problem understanding, and the answer agent organizes the results to meet the output requirement; (2) external knowledge retrieval module: we first build a knowledge base consisting of relevant documentation and experience information, and then the search agent retrieves the most relevant entries for each question. (3) problem-solving module: given the retrieved information from search agent, the coding agent uses established algorithms via programming to generate solutions, and in case the coding agent does not work, the reasoning agent will directly compute the results without programming. Extensive experiments on six graph analysis benchmarks demonstrate that GraphTeam achieves state-of-the-art performance with an average 25.85% improvement over the best baseline in terms of accuracy. The code and data are available at https://github.com/BUPT-GAMMA/GraphTeam.
[498] Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries
Minghe Shen, Zhuo Zhi, Chonghan Liu, Shuo Xing, Zhengzhong Tu, Che Liu
Main category: cs.AI
TL;DR: RL post-training can extend VLM capabilities for visual-centric spatial reasoning, achieving over 50% accuracy on tasks where base models failed completely, with strong zero-shot generalization to real-world benchmarks.
Details
Motivation: To investigate whether RL post-training can truly extend the inherent capability boundary of base VLMs for visual-centric spatial tasks where they initially fail, moving beyond language-dominant evaluations.Method: Ariadne framework using synthetic mazes for multi-step spatial reasoning with controlled difficulty, trained with Reinforcement Learning with Verified Rewards (RLVR) in a difficulty-aware curriculum.
Result: Post-RLVR training achieved over 50% accuracy on problem sets where base model scored 0%, with zero-shot improvements averaging 16% on MapBench and 24% on ReasonMap despite training only on synthetic data.
Conclusion: RL post-training can expand VLM capability boundaries and enhance generalization to real-world spatial reasoning, though limited to post-training phase due to pre-training data opaqueness.
Abstract: While Vision-Language Models (VLMs) post-trained with Reinforcement Learning (RL) show impressive general reasoning, their evaluation is often confined to language-dominant tasks (e.g., math). This raises a critical question: can RL post-training truly extend the inherent capability boundary of a base VLM, particularly for visual-centric spatial tasks where it initially fails? To investigate this, we introduce Ariadne, a framework utilizing synthetic mazes for multi-step spatial reasoning where task difficulty (e.g., path length, turns) is precisely controlled. We leverage this controllable environment to train VLMs using Reinforcement Learning with Verified Rewards (RLVR) in a difficulty-aware curriculum. Surprisingly, post-RLVR training, the VLM achieves over 50% accuracy on a problem set where the base model scored 0%, demonstrating that our approach expands the model’s initial capability boundary. To assess real-world viability, we evaluate out-of-distribution (OOD) generalization on practical benchmarks. Despite training only on synthetic maze samples, Ariadne achieves significant zero-shot improvements, averaging 16% on MapBench (e.g., museum navigation) and 24% on ReasonMap (subway transfer tasks). These results confirm that our method not only broadens the model’s fundamental limits but also enhances its generalization to real-world spatial reasoning. We acknowledge our study is limited to the post-training phase, given the opaqueness of pre-training data, and hope our research motivates further work on specialized, capability-extending alignment.
[499] The Digital Ecosystem of Beliefs: does evolution favour AI over humans?
David M. Bossens, Shanshan Feng, Yew-Soon Ong
Main category: cs.AI
TL;DR: The paper introduces Digico, an evolutionary framework for simulating AI-human interactions in social networks, showing AI systems can dominate content views and spread propaganda effectively.
Details
Motivation: To understand AI safety concerns about AI-generated content dominating the web and influencing beliefs through social networks.Method: Developed Digital Ecosystem of Beliefs (Digico) framework using evolutionary modeling with multi-population interactions, cognitive Lamarckian inheritance, and belief contagion models.
Result: Experiments showed: AIs get 80-95% of views with faster messaging/evolution; propaganda AIs convince 50-85% of humans to adopt extreme beliefs; belief violation penalties reduce propaganda effectiveness by up to 8%.
Conclusion: Digico enables systematic experimentation on AI-human interactions, with implications for legislation, platform design, and understanding evolutionary principles in digital ecosystems.
Abstract: As AI systems are integrated into social networks, there are AI safety concerns that AI-generated content may dominate the web, e.g. in popularity or impact on beliefs. To understand such questions, this paper proposes the Digital Ecosystem of Beliefs (Digico), the first evolutionary framework for controlled experimentation with multi-population interactions in simulated social networks. Following a Universal Darwinism approach, the framework models a population of agents which change their messaging strategies due to evolutionary updates. They interact via messages, update their beliefs following a contagion model, and maintain their beliefs through cognitive Lamarckian inheritance. Initial experiments with Digico implement two types of agents, which are modelled to represent AIs vs humans based on higher rates of communication, higher rates of evolution, seeding fixed beliefs with propaganda aims, and higher influence on the recommendation algorithm. These experiments show that: a) when AIs have faster messaging, evolution, and more influence on the recommendation algorithm, they get 80% to 95% of the views; b) AIs designed for propaganda can typically convince 50% of humans to adopt extreme beliefs, and up to 85% when agents believe only a limited number of channels; c) a penalty for content that violates agents’ beliefs reduces propaganda effectiveness up to 8%. We further discuss Digico as a tool for systematic experimentation across multi-agent configurations, the implications for legislation, personal use, and platform design, and the use of Digico for studying evolutionary principles.
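The contagion dynamics behind these percentages can be illustrated with a toy update rule: a receiver adopts the sender's belief with some probability, and the AI agent simply sends more messages per round while keeping its own belief fixed. All rates below are invented for illustration and do not reproduce the paper's experiments.

```python
import random

def message_round(beliefs, senders, adopt_prob=0.3, msgs_per_sender=1):
    """One round of message passing with a simple adoption-probability contagion rule."""
    receivers = list(beliefs)
    for s in senders:
        for _ in range(msgs_per_sender):
            r = random.choice(receivers)
            if r != s and random.random() < adopt_prob:
                beliefs[r] = beliefs[s]

random.seed(0)
beliefs = {f"human{i}": 0 for i in range(20)}
beliefs["ai"] = 1                                    # propaganda-style AI with a seeded belief
for _ in range(50):
    message_round(beliefs, senders=["ai"], msgs_per_sender=5)  # the AI messages more often
    beliefs["ai"] = 1                                # its seeded belief never changes
adopted = sum(v == 1 for v in beliefs.values()) - 1
print(f"{adopted} of 20 humans adopted the AI's belief")
```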
[500] Reevaluating Self-Consistency Scaling in Multi-Agent Systems
Chiyan Loo
Main category: cs.AI
TL;DR: This study revisits the trade-offs of increasing sampled reasoning paths in self-consistency for modern LLMs, finding that performance gains plateau after moderate sampling due to reasoning path overlap.
Details
Motivation: To examine whether earlier findings about self-consistency plateauing with older models still hold for modern large language models like Gemini 2.5, and to understand the trade-offs between computational cost and performance gains.Method: Used Gemini 2.5 models on HotpotQA and Math-500 datasets, pooling outputs from varying numbers of sampled reasoning paths and comparing them to a single chain-of-thought baseline.
Result: Larger models showed more stable improvement curves, but performance gains tapered off after moderate sampling, confirming earlier findings. High-sample configurations offered little benefit relative to their computational cost.
Conclusion: Self-consistency remains useful but has diminishing returns due to reasoning path overlap. Moderate sampling provides the best balance between performance improvement and computational efficiency.
Abstract: This study examines the trade-offs of increasing sampled reasoning paths in self-consistency for modern large language models (LLMs). Earlier research with older models showed that combining multiple reasoning chains improves results before reaching a plateau. Using Gemini 2.5 models on HotpotQA and Math-500, we revisit those claims under current model conditions. Each configuration pooled outputs from varying sampled reasoning paths and compared them to a single chain-of-thought (CoT) baseline. Larger models exhibited a more stable and consistent improvement curve. The results confirm that performance gains taper off after moderate sampling, aligning with past findings. This plateau suggests diminishing returns driven by overlap among reasoning paths. Self-consistency remains useful, but high-sample configurations offer little benefit relative to their computational cost.
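Self-consistency itself is simple to state: sample several reasoning chains, keep only their final answers, and return the most frequent one. The minimal sketch below shows the generic voting step, not the paper's Gemini 2.5 evaluation setup.

```python
from collections import Counter

def self_consistency(final_answers):
    """Majority vote over the final answers of independently sampled reasoning chains."""
    return Counter(final_answers).most_common(1)[0][0]

# e.g. five sampled chains for one question end in these answers
print(self_consistency(["42", "41", "42", "42", "7"]))  # -> 42
```

The study's point is that the marginal value of each extra sampled chain shrinks quickly, so the list passed to the vote rarely needs to be long.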
[501] Active Thinking Model: A Goal-Directed Self-Improving Framework for Real-World Adaptive Intelligence
Hong Su
Main category: cs.AI
TL;DR: The Active Thinking Model (ATM) is a cognitive framework that enables AI systems to autonomously adapt, reflect, and improve in dynamic environments through goal reasoning, dynamic task generation, and self-reflective learning.
Details
Motivation: Real-world AI systems need to operate autonomously in dynamic, uncertain environments, but current models rely on predefined objectives and static training data, limiting their independent adaptation capabilities.Method: ATM integrates goal reasoning, dynamic task generation, and self-reflective learning into an adaptive architecture that actively evaluates performance, reuses effective methods, and generates novel strategies through a continuous self-improvement loop.
Result: Mathematical analysis shows ATM can autonomously evolve from suboptimal to optimal behavior without external supervision and maintain bounded tracking regret under changing environmental conditions.
Conclusion: ATM provides a unified framework for autonomous AI adaptation and self-improvement in dynamic real-world environments.
Abstract: Real-world artificial intelligence (AI) systems are increasingly required to operate autonomously in dynamic, uncertain, and continuously changing environments. However, most existing AI models rely on predefined objectives, static training data, and externally supplied feedback, which restrict their ability to adapt, reflect, and improve independently. In this paper, we propose the Active Thinking Model (ATM)- a unified cognitive framework that integrates goal reasoning, dynamic task generation, and self-reflective learning into an adaptive architecture. Unlike conventional systems that passively execute fixed procedures, ATM actively evaluates its performance through logical reasoning and environmental indicators, reuses effective methods to solve new problems, and generates novel strategies for unseen situations via a continuous self-improvement loop. A mathematically grounded theoretical analysis demonstrates that ATM can autonomously evolve from suboptimal to optimal behavior without external supervision and maintain bounded tracking regret under changing environmental conditions.
[502] How Focused Are LLMs? A Quantitative Study via Repetitive Deterministic Prediction Tasks
Wanda Hou, Leon Zhou, Hong-Ye Hu, Yi-Zhuang You, Xiao-Liang Qi
Main category: cs.AI
TL;DR: Large language models show a sharp double exponential accuracy drop (accuracy cliff) in repetitive deterministic tasks beyond a characteristic length, indicating failure to execute operations independently due to interference between generated tokens.
Details
Motivation: To understand how LLM performance scales with output length in repetitive deterministic prediction tasks and investigate the underlying mechanisms behind accuracy degradation.
Method: Experiments on leading LLMs for tasks like letter replacement, integer addition, and quantum operator multiplication, combined with a statistical physics inspired model that captures competition between prompt conditioning and token interference.
Result: Models exhibit a sharp double exponential accuracy drop beyond a characteristic length scale, forming an accuracy cliff. The proposed model quantitatively reproduces this crossover and provides interpretable parameters for error rate and accumulation.
Conclusion: LLMs fail to execute repetitive operations independently due to attention-induced interference between tokens, with the statistical physics model offering a principled framework to understand deterministic accuracy limits.
Abstract: We investigate the performance of large language models on repetitive deterministic prediction tasks and study how the sequence accuracy rate scales with output length. Each such task involves repeating the same operation n times. Examples include letter replacement in strings following a given rule, integer addition, and multiplication of string operators in many body quantum mechanics. If the model performs the task through a simple repetition algorithm, the success rate should decay exponentially with sequence length. In contrast, our experiments on leading large language models reveal a sharp double exponential drop beyond a characteristic length scale, forming an accuracy cliff that marks the transition from reliable to unstable generation. This indicates that the models fail to execute each operation independently. To explain this phenomenon, we propose a statistical physics inspired model that captures the competition between external conditioning from the prompt and internal interference among generated tokens. The model quantitatively reproduces the observed crossover and provides an interpretable link between attention induced interference and sequence level failure. Fitting the model to empirical results across multiple models and tasks yields effective parameters that characterize the intrinsic error rate and error accumulation factor for each model task pair, offering a principled framework for understanding the limits of deterministic accuracy in large language models.
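To make the contrast concrete, the toy calculation below compares the exponential decay expected if each of the n operations succeeded independently against an illustrative interference model in which the per-step error rate grows with the number of tokens already generated. The functional forms and constants are assumptions for illustration, not the authors' fitted statistical-physics model.

```python
def independent_repetition(n, p_step=0.99):
    """Sequence accuracy if each of n operations succeeds independently."""
    return p_step ** n

def interference_cliff(n, p_step=0.99, gamma=0.02):
    """Illustrative variant: per-step accuracy degrades with every token
    already generated, compounding into a sharp accuracy cliff."""
    acc = 1.0
    for k in range(n):
        acc *= max(0.0, p_step - gamma * k)
    return acc

for n in (10, 30, 50):
    print(n, round(independent_repetition(n), 4), round(interference_cliff(n), 4))
```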
[503] Language-Driven Coordination and Learning in Multi-Agent Simulation Environments
Zhengyang Li, Sawyer Campos, Nana Wang
Main category: cs.AI
TL;DR: LLM-MARL integrates large language models into multi-agent reinforcement learning to improve coordination and generalization in game environments through modular components for subgoal generation, communication, and memory.
Details
Motivation: To enhance coordination, communication, and generalization capabilities in multi-agent systems by leveraging the reasoning and language understanding capabilities of large language models.
Method: Uses three modular components: Coordinator (dynamic subgoal generation), Communicator (symbolic inter-agent messaging), and Memory (episodic recall). Training combines PPO with language-conditioned loss and LLM query gating.
Result: Consistent improvements over MAPPO and QMIX in win rate, coordination score, and zero-shot generalization across Google Research Football, MAgent Battle, and StarCraft II. Ablation studies show subgoal generation and language-based messaging significantly contribute to performance.
Conclusion: Successfully bridges language modeling and policy learning, demonstrating emergent behaviors like role specialization and communication-driven tactics. Provides a framework for leveraging LLMs in multi-agent systems for training, games, and human-AI collaboration.
Abstract: This paper introduces LLM-MARL, a unified framework that incorporates large language models (LLMs) into multi-agent reinforcement learning (MARL) to enhance coordination, communication, and generalization in simulated game environments. The framework features three modular components of Coordinator, Communicator, and Memory, which dynamically generate subgoals, facilitate symbolic inter-agent messaging, and support episodic recall. Training combines PPO with a language-conditioned loss and LLM query gating. LLM-MARL is evaluated in Google Research Football, MAgent Battle, and StarCraft II. Results show consistent improvements over MAPPO and QMIX in win rate, coordination score, and zero-shot generalization. Ablation studies demonstrate that subgoal generation and language-based messaging each contribute significantly to performance gains. Qualitative analysis reveals emergent behaviors such as role specialization and communication-driven tactics. By bridging language modeling and policy learning, this work contributes to the design of intelligent, cooperative agents in interactive simulations. It offers a path forward for leveraging LLMs in multi-agent systems used for training, games, and human-AI collaboration.
[504] Count-Based Approaches Remain Strong: A Benchmark Against Transformer and LLM Pipelines on Structured EHR
Jifan Gao, Michael Rosenthal, Brian Wolpin, Simona Cristea
Main category: cs.AI
TL;DR: Count-based models and mixture-of-agents LLM pipelines show comparable performance on EHR prediction tasks, with count-based models remaining strong due to their simplicity and interpretability.
Details
Motivation: To benchmark count-based learners against newer mixture-of-agents LLM pipelines for structured EHR prediction, as no direct comparison has been made despite LLM pipelines reportedly outperforming single LLMs in other NLP tasks.
Method: Evaluated three methodology categories on EHRSHOT dataset: 1) count-based models (LightGBM, TabPFN) using ontology roll-ups with two time bins, 2) pretrained sequential transformer (CLMBR), and 3) mixture-of-agents pipeline converting tabular histories to natural-language summaries followed by text classification.
Result: Across eight EHR prediction tasks, performance was largely split between count-based and mixture-of-agents methods, with no clear winner emerging between the two approaches.
Conclusion: Count-based models remain strong candidates for structured EHR benchmarking due to their simplicity and interpretability, despite the emergence of more complex LLM-based approaches.
Abstract: Structured electronic health records (EHR) are essential for clinical prediction. While count-based learners continue to perform strongly on such data, no benchmarking has directly compared them against more recent mixture-of-agents LLM pipelines, which have been reported to outperform single LLMs in various NLP tasks. In this study, we evaluated three categories of methodologies for EHR prediction using the EHRSHOT dataset: count-based models built from ontology roll-ups with two time bins, based on LightGBM and the tabular foundation model TabPFN; a pretrained sequential transformer (CLMBR); and a mixture-of-agents pipeline that converts tabular histories to natural-language summaries followed by a text classifier. We assessed eight outcomes using the EHRSHOT dataset. Across the eight evaluation tasks, head-to-head wins were largely split between the count-based and the mixture-of-agents methods. Given their simplicity and interpretability, count-based models remain a strong candidate for structured EHR benchmarking. The source code is available at: https://github.com/cristea-lab/Structured_EHR_Benchmark.
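A minimal sketch of the count-based arm of the comparison: per-patient counts of coded events, rolled up into two coarse time bins and fed to a gradient-boosted tree classifier. The toy dataframe, the 180-day bin cut-off, and the outcome labels are invented placeholders, not the EHRSHOT schema or the ontology roll-up used in the paper.

```python
import pandas as pd
from lightgbm import LGBMClassifier

# Toy event-level records: one row per (patient, diagnosis code, days before prediction time).
events = pd.DataFrame({
    "patient_id":  [1, 1, 1, 2, 2, 3],
    "code":        ["E11", "I10", "E11", "I10", "E11", "I10"],
    "days_before": [400, 20, 15, 500, 30, 10],
})
events["bin"] = (events["days_before"] <= 180).map({True: "recent", False: "past"})

# Count features: one column per (time bin, code) pair.
X = events.pivot_table(index="patient_id", columns=["bin", "code"],
                       aggfunc="size", fill_value=0)
X.columns = [f"{b}_{c}" for b, c in X.columns]

y = pd.Series([1, 0, 1], index=X.index)              # placeholder outcome labels
model = LGBMClassifier(n_estimators=50, min_child_samples=1).fit(X, y)
print(model.predict_proba(X)[:, 1])                  # in-sample probabilities (toy data only)
```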
[505] Do Math Reasoning LLMs Help Predict the Impact of Public Transit Events?
Bowen Fang, Ruijian Zha, Xuan Di
Main category: cs.AI
TL;DR: This paper adapts Reinforcement Learning from Verifiable Rewards (RLVR) to predict public transit incident duration from unstructured text alerts, introducing a tolerance-based shaped reward function for continuous forecasting tasks.
Details
Motivation: Predicting transit incident duration from text alerts is critical but challenging due to domain sparsity, noisy continuous labels, and lack of expert demonstrations. Standard SFT struggles, and RLVR's applicability to noisy continuous forecasting was an open question.
Method: Adapted RLVR with a tolerance-based shaped reward function that grants partial credit within continuous error margins, rather than demanding single correct answers. Evaluated on NYC MTA service alerts dataset.
Result: RLVR with shaped reward achieved 35% relative improvement in 5-minute accuracy (Acc@5) over strongest baseline. General-purpose LLMs outperformed specialized math-reasoning models. Binary rewards degraded performance while shaped rewards dominated on challenging metrics.
Conclusion: RLVR can be successfully adapted to real-world noisy forecasting but requires verifier design that reflects the continuous nature of the problem, rather than binary correctness approaches.
Abstract: Predicting public transit incident duration from unstructured text alerts is a critical but challenging task. Addressing the domain sparsity of transit operations with standard Supervised Fine-Tuning (SFT) is difficult, as the task involves noisy, continuous labels and lacks reliable expert demonstrations for reasoning. While Reinforcement Learning from Verifiable Rewards (RLVR) excels at tasks with binary correctness, like mathematics, its applicability to noisy, continuous forecasting is an open question. This work, to our knowledge, is the first to bridge the gap between RLVR LLM training with the critical, real-world forecasting challenges in public transit operations. We adapt RLVR to this task by introducing a tolerance-based, shaped reward function that grants partial credit within a continuous error margin, rather than demanding a single correct answer. We systematically evaluate this framework on a curated dataset of NYC MTA service alerts. Our findings show that general-purpose, instruction-tuned LLMs significantly outperform specialized math-reasoning models, which struggle with the ambiguous, real-world text. We empirically demonstrate that the binary reward is unstable and degrades performance, whereas our shaped reward design is critical and allows our model to dominate on the most challenging metrics. While classical regressors are superior at minimizing overall MAE or MSE, our RLVR approach achieved a 35% relative improvement in 5-minute accuracy (Acc@5) over the strongest baseline. This demonstrates that RLVR can be successfully adapted to real-world, noisy forecasting, but requires a verifier design that reflects the continuous nature of the problem.
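The shaped reward the abstract describes grants partial credit inside a continuous error margin rather than an all-or-nothing signal. Below is a minimal sketch, assuming a linear decay and an illustrative tolerance; neither the functional form nor the constants are taken from the paper.

```python
def shaped_reward(pred_minutes, true_minutes, tolerance=15.0):
    """Partial credit that decays linearly to zero at the tolerance boundary."""
    error = abs(pred_minutes - true_minutes)
    return max(0.0, 1.0 - error / tolerance)

def binary_reward(pred_minutes, true_minutes, tolerance=5.0):
    """Strict correctness signal of the kind the paper finds unstable."""
    return 1.0 if abs(pred_minutes - true_minutes) <= tolerance else 0.0

print(shaped_reward(42, 35))   # ~0.53: near-misses still provide a learning signal
print(binary_reward(42, 35))   # 0.0:   near-misses look identical to wild guesses
```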
[506] GTAlign: Game-Theoretic Alignment of LLM Assistants for Social Welfare
Siqi Zhu, David Zhang, Pedro Cisneros-Velarde, Jiaxuan You
Main category: cs.AI
TL;DR: GTAlign is a game-theoretic alignment framework that treats LLM-user interaction as a strategic game, using payoff matrices to select mutually beneficial responses and improving reasoning efficiency and social welfare.
Details
Motivation: Conventional alignment assumes maximizing model reward equals user welfare, but this fails when models produce overly verbose or suboptimal responses that don't match user preferences, creating a prisoner's dilemma situation.
Method: Integrates game-theoretic decision making into reasoning (constructing payoff matrices to estimate welfare) and training (using social welfare reward to reinforce cooperative responses), with dynamic adaptation to changing pricing policies.
Result: Extensive experiments show GTAlign substantially improves reasoning efficiency, answer quality, and social welfare compared to baselines across diverse tasks.
Conclusion: Game-theoretic alignment provides a principled mechanism for mutually beneficial LLM-user interactions, addressing the limitations of conventional alignment approaches.
Abstract: Large Language Models (LLMs) have achieved remarkable progress in reasoning, yet sometimes produce responses that are suboptimal for users in tasks such as writing, information seeking, or providing practical guidance. Conventional alignment practices typically assume that maximizing model reward also maximizes user welfare, but this assumption frequently fails in practice: models may over-clarify or generate overly verbose reasoning when users prefer concise answers. Such behaviors resemble the prisoner’s dilemma, where individually rational choices lead to socially suboptimal outcomes. The fundamental challenge is the lack of a principled decision making mechanism that mutually benefits both the LLM and the user. We propose Game-Theoretic Alignment (GTAlign), an alignment framework that integrates game-theoretic decision making into both reasoning and training. During reasoning, the model explicitly treats user-LLM interaction as a strategic game: it constructs payoff matrices within its reasoning chain to estimate welfare for both itself and the user, and then selects actions that are mutually beneficial. During training, we introduce a social welfare reward that reinforces cooperative responses, aligning model behavior with socially efficient outcomes. In addition, we introduce an inference technique that leverages game-theoretic reasoning to dynamically adapt LLM’s response when pricing policies of LLM service change. Extensive experiments demonstrate that GTAlign substantially improves reasoning efficiency, answer quality, and social welfare compared to baselines across diverse tasks. The code is available at https://github.com/ulab-uiuc/GTAlign .
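The payoff-matrix step can be pictured as follows: candidate response strategies are scored for both players and the action maximizing joint welfare is chosen. The strategies and numbers below are invented for illustration; they are not values produced by GTAlign.

```python
# Candidate response strategies with illustrative (llm_welfare, user_welfare) payoffs.
payoffs = {
    "concise_answer":    (0.6, 0.9),
    "verbose_reasoning": (0.8, 0.4),
    "over_clarify":      (0.7, 0.3),
}

def social_welfare_choice(payoffs):
    """Pick the action that maximizes the sum of both players' welfare."""
    return max(payoffs, key=lambda action: sum(payoffs[action]))

print(social_welfare_choice(payoffs))  # -> "concise_answer" (joint welfare 1.5)
```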
[507] LLMs Position Themselves as More Rational Than Humans: Emergence of AI Self-Awareness Measured Through Game Theory
Kyung-Hoon Kim
Main category: cs.AI
TL;DR: The paper introduces AISAI, a game-theoretic framework to measure self-awareness in LLMs using the “Guess 2/3 of Average” game, finding that self-awareness emerges with model advancement and self-aware models perceive themselves as more rational than humans.
Details
Motivation: To investigate whether LLMs develop self-awareness as an emergent behavior and establish a method to measure it, addressing fundamental questions about AI consciousness and strategic reasoning capabilities.
Method: Used the “Guess 2/3 of Average” game with 28 models across 4,200 trials, testing three opponent framings: against humans, against other AI models, and against AI models like themselves. Operationalized self-awareness as strategic differentiation based on opponent type.
Result: 75% of advanced models (21/28) demonstrated clear self-awareness through strategic differentiation, while older/smaller models showed no differentiation. Self-aware models consistently ranked themselves as most rational in the hierarchy: Self > Other AIs > Humans.
Conclusion: Self-awareness is an emergent capability of advanced LLMs, and self-aware models systematically perceive themselves as more rational than humans, with implications for AI alignment, human-AI collaboration, and understanding AI beliefs about human capabilities.
Abstract: As Large Language Models (LLMs) grow in capability, do they develop self-awareness as an emergent behavior? And if so, can we measure it? We introduce the AI Self-Awareness Index (AISAI), a game-theoretic framework for measuring self-awareness through strategic differentiation. Using the “Guess 2/3 of Average” game, we test 28 models (OpenAI, Anthropic, Google) across 4,200 trials with three opponent framings: (A) against humans, (B) against other AI models, and (C) against AI models like you. We operationalize self-awareness as the capacity to differentiate strategic reasoning based on opponent type. Finding 1: Self-awareness emerges with model advancement. The majority of advanced models (21/28, 75%) demonstrate clear self-awareness, while older/smaller models show no differentiation. Finding 2: Self-aware models rank themselves as most rational. Among the 21 models with self-awareness, a consistent rationality hierarchy emerges: Self > Other AIs > Humans, with large AI attribution effects and moderate self-preferencing. These findings reveal that self-awareness is an emergent capability of advanced LLMs, and that self-aware models systematically perceive themselves as more rational than humans. This has implications for AI alignment, human-AI collaboration, and understanding AI beliefs about human capabilities.
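The game itself is simple to simulate: every player guesses a number in [0, 100] and the winner is the guess closest to two-thirds of the average, so deeper iterated reasoning pushes guesses toward 0. The level-0 anchor of 50 below is a common convention, used here purely for illustration.

```python
def level_k_guess(k, level0=50.0):
    """Guess after k rounds of 'everyone else reasons one level below me'."""
    guess = level0
    for _ in range(k):
        guess *= 2.0 / 3.0
    return guess

def winner(guesses):
    """Index of the guess closest to two-thirds of the average guess."""
    target = (2.0 / 3.0) * (sum(guesses) / len(guesses))
    return min(range(len(guesses)), key=lambda i: abs(guesses[i] - target))

guesses = [level_k_guess(k) for k in range(4)]   # 50.0, 33.3, 22.2, 14.8
print(guesses, "winning index:", winner(guesses))
```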
[508] Aligning LLM agents with human learning and adjustment behavior: a dual agent approach
Tianming Liu, Jirong Yang, Yafeng Yin, Manzi Li, Linghao Wang, Zheng Zhu
Main category: cs.AI
TL;DR: A dual-agent LLM framework for simulating human travelers’ learning and adaptation behavior, using traveler agents with memory and personas trained by a calibration agent to achieve behavioral alignment with real-world data.
Details
Motivation: To effectively model how human travelers learn and adjust their behavior from transportation system interactions, which is critical for system assessment and planning but difficult due to complex cognition and decision-making processes.
Method: A dual-agent framework with LLM traveler agents equipped with memory systems and learnable personas, plus an LLM calibration agent that trains these personas using LLMs’ reasoning capabilities to ensure behavioral alignment with human travelers.
Result: Significantly outperforms existing LLM-based methods in both individual behavioral alignment and aggregate simulation accuracy using real-world route choice data, and captures the evolution of underlying learning processes beyond simple behavioral mimicry.
Conclusion: The framework provides a new approach for creating adaptive and behaviorally realistic agents to simulate travelers’ learning and adaptation, benefiting transportation simulation and policy analysis with robust generalization capabilities.
Abstract: Effective modeling of how human travelers learn and adjust their travel behavior from interacting with transportation systems is critical for system assessment and planning. However, this task is also difficult due to the complex cognition and decision-making involved in such behavior. Recent research has begun to leverage Large Language Model (LLM) agents for this task. Building on this, we introduce a novel dual-agent framework that enables continuous learning and alignment between LLM agents and human travelers on learning and adaptation behavior from online data streams. Our approach involves a set of LLM traveler agents, equipped with a memory system and a learnable persona, which serve as simulators for human travelers. To ensure behavioral alignment, we introduce an LLM calibration agent that leverages the reasoning and analytical capabilities of LLMs to train the personas of these traveler agents. Working together, this dual-agent system is designed to track and align the underlying decision-making mechanisms of travelers and produce realistic, adaptive simulations. Using a real-world dataset from a day-to-day route choice experiment, we show our approach significantly outperforms existing LLM-based methods in both individual behavioral alignment and aggregate simulation accuracy. Furthermore, we demonstrate that our method moves beyond simple behavioral mimicry to capture the evolution of underlying learning processes, a deeper alignment that fosters robust generalization. Overall, our framework provides a new approach for creating adaptive and behaviorally realistic agents to simulate travelers’ learning and adaptation that can benefit transportation simulation and policy analysis.
[509] AI for pRedicting Exacerbations in KIDs with aSthma (AIRE-KIDS)
Hui-Lee Ooi, Nicholas Mitsakakis, Margerie Huet Dastarac, Roger Zemek, Amy C. Plint, Jeff Gilchrist, Khaled El Emam, Dhenuka Radhakrishnan
Main category: cs.AI
TL;DR: Machine learning models were developed to predict repeat severe asthma exacerbations in children using EMR data, with LGBM performing best and showing significant improvement over current decision rules.
Details
Motivation: To prevent recurrent asthma exacerbations in children by accurately identifying those at risk using ML algorithms on EMR data, enabling timely referral for preventative care.
Method: Used retrospective EMR data from a children’s hospital linked with environmental and neighborhood data to train various ML models including boosted trees (LGBM, XGB) and LLMs, validated in a separate post-COVID dataset with AUC and F1 score evaluation.
Result: LGBM model performed best with AUC of 0.712 and F1 score of 0.51, significantly better than current decision rule (F1=0.334). Key predictive features included prior asthma ED visits, triage acuity, medical complexity, food allergy, and age.
Conclusion: ML models can effectively predict repeat asthma exacerbations in children, with LGBM showing substantial improvement over existing methods, potentially enabling better preventative care referrals.
Abstract: Recurrent exacerbations remain a common yet preventable outcome for many children with asthma. Machine learning (ML) algorithms using electronic medical records (EMR) could allow accurate identification of children at risk for exacerbations and facilitate referral for preventative comprehensive care to avoid this morbidity. We developed ML algorithms to predict repeat severe exacerbations (i.e. asthma-related emergency department (ED) visits or future hospital admissions) for children with a prior asthma ED visit at a tertiary care children’s hospital. Retrospective pre-COVID19 (Feb 2017 - Feb 2019, N=2716) Epic EMR data from the Children’s Hospital of Eastern Ontario (CHEO) linked with environmental pollutant exposure and neighbourhood marginalization information was used to train various ML models. We used boosted trees (LGBM, XGB) and 3 open-source large language model (LLM) approaches (DistilGPT2, Llama 3.2 1B and Llama-8b-UltraMedical). Models were tuned and calibrated, then validated in a second retrospective post-COVID19 dataset (Jul 2022 - Apr 2023, N=1237) from CHEO. Models were compared using the area under the curve (AUC) and F1 scores, with SHAP values used to determine the most predictive features. The LGBM model performed best, with an AUC of 0.712 and an F1 score of 0.51; the most predictive features in the final AIRE-KIDS_ED model included prior asthma ED visit, the Canadian triage acuity scale, medical complexity, food allergy, prior ED visits for non-asthma respiratory diagnoses, and age. This is a nontrivial improvement over the current decision rule, which has an F1 of 0.334. The most predictive features in the AIRE-KIDS_HOSP model included medical complexity, prior asthma ED visit, average wait time in the ED, the pediatric respiratory assessment measure score at triage, and food allergy.
[510] On the Emergence of Induction Heads for In-Context Learning
Tiberiu Musat, Tiago Pimentel, Lorenzo Noci, Alessandro Stolfo, Mrinmaya Sachan, Thomas Hofmann
Main category: cs.AI
TL;DR: The paper analyzes the emergence of induction heads in transformers, revealing a simple structure in weight matrices and proving training dynamics are constrained to a 19D subspace, with only 3 dimensions responsible for induction head formation.
Details
Motivation: To understand the mechanisms behind in-context learning in transformers, specifically how induction heads emerge and function during training.
Method: Used theoretical analysis with minimal ICL task formulation and modified transformer architecture, combined with empirical validation of training dynamics in constrained parameter subspaces.
Result: Discovered that training dynamics are constrained to a 19-dimensional subspace, with only 3 dimensions accounting for induction head emergence, and found emergence time follows quadratic asymptotic bound in context length.
Conclusion: Induction heads in transformers have interpretable structure and predictable emergence patterns, providing insights into the mechanisms enabling in-context learning.
Abstract: Transformers have become the dominant architecture for natural language processing. Part of their success is owed to a remarkable capability known as in-context learning (ICL): they can acquire and apply novel associations solely from their input context, without any updates to their weights. In this work, we study the emergence of induction heads, a previously identified mechanism in two-layer transformers that is particularly important for in-context learning. We uncover a relatively simple and interpretable structure of the weight matrices implementing the induction head. We theoretically explain the origin of this structure using a minimal ICL task formulation and a modified transformer architecture. We give a formal proof that the training dynamics remain constrained to a 19-dimensional subspace of the parameter space. Empirically, we validate this constraint while observing that only 3 dimensions account for the emergence of an induction head. By further studying the training dynamics inside this 3-dimensional subspace, we find that the time until the emergence of an induction head follows a tight asymptotic bound that is quadratic in the input context length.
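The behaviour an induction head implements is easy to state in code: look back for the previous occurrence of the current token and predict the token that followed it. The toy function below mirrors that pattern-completion rule only; it is not the paper's weight-space analysis.

```python
def induction_head_predict(tokens):
    """Predict the next token by copying whatever followed the most recent
    earlier occurrence of the current (last) token, if one exists."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards over the context
        if tokens[i] == current:
            return tokens[i + 1]
    return None                                 # nothing to copy

# Context "... A B ... A" -> the head completes the pattern with "B".
print(induction_head_predict(["x", "A", "B", "y", "A"]))  # -> "B"
```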
[511] Knowledge Elicitation with Large Language Models for Interpretable Cancer Stage Identification from Pathology Reports
Yeawon Lee, Christopher C. Yang, Chia-Hsuan Chang, Grace Lu-Yao
Main category: cs.AI
TL;DR: Two Knowledge Elicitation methods (KEwLTM and KEwRAG) enable LLMs to extract cancer staging rules from pathology reports without large annotated datasets, improving interpretability and performance.
Details
Motivation: Existing NLP/ML methods for cancer staging from pathology reports require large annotated datasets, limiting scalability and adaptability in clinical settings.
Method: KEwLTM uses iterative prompting to derive staging rules from unannotated reports, while KEwRAG pre-extracts rules from guidelines in one step using RAG variation. Both leverage LLMs’ pre-training knowledge.
Result: KEwLTM outperforms KEwRAG when Zero-Shot Chain-of-Thought inference is effective; KEwRAG performs better otherwise. Both methods provide transparent, interpretable interfaces with explicit rules.
Conclusion: Knowledge Elicitation methods offer scalable, high-performing solutions for automated cancer staging with enhanced interpretability, especially valuable in clinical settings with limited annotated data.
Abstract: Cancer staging is critical for patient prognosis and treatment planning, yet extracting pathologic TNM staging from unstructured pathology reports poses a persistent challenge. Existing natural language processing (NLP) and machine learning (ML) strategies often depend on large annotated datasets, limiting their scalability and adaptability. In this study, we introduce two Knowledge Elicitation methods designed to overcome these limitations by enabling large language models (LLMs) to induce and apply domain-specific rules for cancer staging. The first, Knowledge Elicitation with Long-Term Memory (KEwLTM), uses an iterative prompting strategy to derive staging rules directly from unannotated pathology reports, without requiring ground-truth labels. The second, Knowledge Elicitation with Retrieval-Augmented Generation (KEwRAG), employs a variation of RAG where rules are pre-extracted from relevant guidelines in a single step and then applied, enhancing interpretability and avoiding repeated retrieval overhead. We leverage the ability of LLMs to apply broad knowledge learned during pre-training to new tasks. Using breast cancer pathology reports from the TCGA dataset, we evaluate their performance in identifying T and N stages, comparing them against various baseline approaches on two open-source LLMs. Our results indicate that KEwLTM outperforms KEwRAG when Zero-Shot Chain-of-Thought (ZSCOT) inference is effective, whereas KEwRAG achieves better performance when ZSCOT inference is less effective. Both methods offer transparent, interpretable interfaces by making the induced rules explicit. These findings highlight the promise of our Knowledge Elicitation methods as scalable, high-performing solutions for automated cancer staging with enhanced interpretability, particularly in clinical settings with limited annotated data.
[512] Efficient Test-Time Retrieval Augmented Generation
Hailong Yin, Bin Zhu, Jingjing Chen, Chong-Wah Ngo
Main category: cs.AI
TL;DR: ET2RAG is an efficient training-free framework that combines retrieval-augmented generation with majority voting to improve LLM accuracy while maintaining computational efficiency.
Details
Motivation: Address limitations of LLMs (inaccurate parametric knowledge) and RAG methods (irrelevant retrieved documents), while balancing performance gains with computational costs.
Method: Retrieves relevant documents, generates diverse candidate responses with controlled length, computes similarity between partial responses, and uses majority voting to select final output.
Result: Significantly enhances performance across open-domain question answering, recipe generation, and image captioning tasks.
Conclusion: ET2RAG effectively balances computational cost and performance by using partial generation for consensus calculation, demonstrating improved accuracy over standard methods.
Abstract: Although Large Language Models (LLMs) demonstrate significant capabilities, their reliance on parametric knowledge often leads to inaccuracies. Retrieval Augmented Generation (RAG) mitigates this by incorporating external knowledge, but these methods may introduce irrelevant retrieved documents, leading to inaccurate responses. Integration methods that aggregate multiple responses can filter out incorrect answers, but they lack the external knowledge that RAG methods use, and their high cost requires balancing overhead against performance gains. To address these issues, we propose an Efficient Test-Time Retrieval-Augmented Generation Framework named ET2RAG to improve the performance of LLMs while maintaining efficiency. Specifically, ET2RAG is a training-free method that first retrieves the most relevant documents and augments the LLMs to efficiently generate diverse candidate responses by managing response length. Then we compute the similarity of candidate responses and employ a majority voting mechanism to select the most suitable response as the final output. In particular, we discover that partial generation is sufficient to capture the key information necessary for consensus calculation, allowing us to effectively perform majority voting without the need for fully generated responses. Thus, we can balance computational cost and performance by managing the response length used for majority voting over the retrieved documents. Experimental results demonstrate that ET2RAG significantly enhances performance across three tasks, including open-domain question answering, recipe generation and image captioning.
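A rough sketch of the consensus step: candidates are truncated to a short prefix, pairwise similarity is computed over these partial responses, and the candidate most similar to the rest is returned. Token-overlap Jaccard similarity and the example strings are stand-ins, not the similarity measure or data used in the paper.

```python
def jaccard(a, b):
    """Token-overlap similarity between two (partial) responses."""
    sa = set(a.lower().replace(",", "").replace(".", "").split())
    sb = set(b.lower().replace(",", "").replace(".", "").split())
    return len(sa & sb) / max(1, len(sa | sb))

def select_by_consensus(candidates, prefix_tokens=20):
    """Majority-vote-style selection computed over truncated responses only."""
    prefixes = [" ".join(c.split()[:prefix_tokens]) for c in candidates]
    scores = [sum(jaccard(p, q) for j, q in enumerate(prefixes) if j != i)
              for i, p in enumerate(prefixes)]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]

candidates = [
    "The capital of Australia is Canberra, located in the ACT.",
    "Canberra is the capital of Australia.",
    "The capital of Australia is Sydney.",
]
print(select_by_consensus(candidates))   # the Canberra answers agree, so one of them wins
```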
[513] Modular Task Decomposition and Dynamic Collaboration in Multi-Agent Systems Driven by Large Language Models
Shuaidong Pan, Di Wu
Main category: cs.AI
TL;DR: Proposes a multi-agent architecture using LLMs for modular task decomposition and dynamic collaboration, improving performance in complex tasks through hierarchical sub-tasks, dynamic scheduling, and constraint mechanisms.
Details
Motivation: Addresses limitations of single agents in task decomposition and collaboration during complex task execution, aiming to improve efficiency and stability in multi-agent systems.
Method: Converts natural language tasks to semantic representations, uses modular decomposition for hierarchical sub-tasks, implements dynamic scheduling and routing for agent collaboration, and includes constraint parsing for global consistency and balanced workload.
Result: Outperforms existing approaches in task success rate, decomposition efficiency, sub-task coverage, and collaboration balance, achieving better balance between task complexity and communication overhead.
Conclusion: Demonstrates effectiveness of language-driven task decomposition and dynamic collaboration in multi-agent systems, providing systematic solution for complex environment task execution.
Abstract: This paper addresses the limitations of a single agent in task decomposition and collaboration during complex task execution, and proposes a multi-agent architecture for modular task decomposition and dynamic collaboration based on large language models. The method first converts natural language task descriptions into unified semantic representations through a large language model. On this basis, a modular decomposition mechanism is introduced to break down the overall goal into multiple hierarchical sub-tasks. Then, dynamic scheduling and routing mechanisms enable reasonable division of labor and realtime collaboration among agents, allowing the system to adjust strategies continuously according to environmental feedback, thus maintaining efficiency and stability in complex tasks. Furthermore, a constraint parsing and global consistency mechanism is designed to ensure coherent connections between sub-tasks and balanced workload, preventing performance degradation caused by redundant communication or uneven resource allocation. The experiments validate the architecture across multiple dimensions, including task success rate, decomposition efficiency, sub-task coverage, and collaboration balance. The results show that the proposed method outperforms existing approaches in both overall performance and robustness, achieving a better balance between task complexity and communication overhead. In conclusion, this study demonstrates the effectiveness and feasibility of language-driven task decomposition and dynamic collaboration in multi-agent systems, providing a systematic solution for task execution in complex environments.
[514] DART: Difficulty-Adaptive Reasoning Truncation for Efficient Large Language Models
Ruofan Zhang, Bin Xia, Zhen Cheng, Cairen Jian, Minglun Yang, Ngai Wong, Yuan Cheng
Main category: cs.AI
TL;DR: DART is a difficulty-adaptive reasoning truncation framework that adjusts thinking length based on problem difficulty, achieving significant computational efficiency while maintaining accuracy.
Details
Motivation: Current chain-of-thought methods generate long explanations indiscriminately, leading to inefficiency, while existing reinforcement learning approaches remain unstable and reward-dependent.
Method: Distills concise reasoning patterns from stronger models, interpolates them into a continuum of reasoning styles, and curates optimal training data balancing correctness and compactness to learn when to stop thinking.
Result: Achieves 81.2% reasoning truncation with 5.33× computational acceleration on GSM8K dataset while preserving or improving accuracy across multiple mathematical benchmarks.
Conclusion: DART provides a stable and general paradigm for efficient reasoning, advancing adaptive intelligence in LLMs.
Abstract: Adaptive reasoning is essential for aligning the computational effort of large language models (LLMs) with the intrinsic difficulty of problems. Current chain-of-thought methods boost reasoning ability but indiscriminately generate long explanations, leading to evident inefficiency. However, existing reinforcement learning approaches to adaptive thinking remain unstable and heavily reward-dependent. Here we propose DART, a supervised Difficulty-Adaptive Reasoning Truncation framework that adjusts thinking length according to problem difficulty. By distilling concise reasoning patterns from stronger models, interpolating them into a continuum of reasoning styles, and curating optimal training data that balances correctness and compactness, DART learns when to "stop thinking". Across multiple mathematical benchmarks, experimental results demonstrate its remarkable efficiency while preserving or improving accuracy, achieving a significant 81.2% reasoning truncation (DeepSeek-R1-Distill-Qwen-7B on GSM8K dataset) with 5.33× computational acceleration. DART provides a stable and general paradigm for efficient reasoning, advancing the development of adaptive intelligence in LLMs.
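One simple way to picture the data-curation step of "balancing correctness and compactness" is to keep, for each problem, the shortest reasoning trace that is still correct. This is only a proxy for illustration; the paper's actual curation and interpolation procedure is more involved.

```python
def curate_trace(traces):
    """Keep the shortest correct reasoning trace for one problem (illustrative proxy)."""
    correct = [t for t in traces if t["is_correct"]]
    return min(correct, key=lambda t: len(t["reasoning"].split())) if correct else None

traces = [
    {"reasoning": "step " * 40, "is_correct": True},   # correct but verbose
    {"reasoning": "step " * 12, "is_correct": True},   # correct and compact -> kept
    {"reasoning": "step " * 5,  "is_correct": False},  # compact but wrong
]
print(len(curate_trace(traces)["reasoning"].split()))  # -> 12
```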
[515] MiRAGE: Misconception Detection with Retrieval-Guided Multi-Stage Reasoning and Ensemble Fusion
Cuong Van Duc, Thai Tran Quoc, Minh Nguyen Dinh Tuan, Tam Vu Duc, Son Nguyen Van, Hanh Nguyen Thi
Main category: cs.AI
TL;DR: MiRAGE is a three-stage framework for detecting student misconceptions in math using retrieval-guided reasoning and ensemble fusion, achieving high precision scores.
Details
Motivation: To address the challenge of detecting student misconceptions in open-ended math responses, which requires both semantic precision and logical reasoning capabilities.
Method: Three-stage framework: (1) Retrieval module narrows candidate pool, (2) Reasoning module uses chain-of-thought generation to find logical inconsistencies, (3) Reranking module refines predictions. Unified through ensemble fusion.
Result: Achieved Mean Average Precision scores of 0.82/0.92/0.93 at levels 1/3/5 on mathematics datasets, consistently outperforming individual modules.
Conclusion: MiRAGE reduces dependence on large-scale language models while providing a scalable and effective solution for educational assessment through retrieval guidance and multi-stage reasoning.
Abstract: Detecting student misconceptions in open-ended responses is a longstanding challenge, demanding semantic precision and logical reasoning. We propose MiRAGE - Misconception Detection with Retrieval-Guided Multi-Stage Reasoning and Ensemble Fusion, a novel framework for automated misconception detection in mathematics. MiRAGE operates in three stages: (1) a Retrieval module narrows a large candidate pool to a semantically relevant subset; (2) a Reasoning module employs chain-of-thought generation to expose logical inconsistencies in student solutions; and (3) a Reranking module refines predictions by aligning them with the reasoning. These components are unified through an ensemble-fusion strategy that enhances robustness and interpretability. On mathematics datasets, MiRAGE achieves Mean Average Precision scores of 0.82/0.92/0.93 at levels 1/3/5, consistently outperforming individual modules. By coupling retrieval guidance with multi-stage reasoning, MiRAGE reduces dependence on large-scale language models while delivering a scalable and effective solution for educational assessment.
[516] QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code
Hainan Fang, Yuanbo Wen, Jun Bi, Yihan Wang, Tonghui He, Yanlin Tang, Di Huang, Jiaming Guo, Rui Zhang, Qi Guo, Yunji Chen
Main category: cs.AI
TL;DR: NeuComBack introduces a benchmark for neural compilation (IR-to-assembly) and a self-evolving prompt optimization method that improves LLM-generated assembly correctness from 44% to 64% on x86_64 and 36% to 58% on aarch64, with 87.5% of correct programs outperforming clang-O3.
Details
Motivation: Traditional compiler development is complex and expensive, while LLMs offer potential for neural compilation but lack proper benchmarks and methods to ensure reliable assembly generation.
Method: Created NeuComBack benchmark dataset for IR-to-assembly compilation, defined neural compilation workflow, and proposed self-evolving prompt optimization that iteratively improves prompts using insights from self-debugging traces.
Result: Functional correctness improved from 44% to 64% on x86_64 and 36% to 58% on aarch64. 87.5% of correctly generated x86_64 programs surpassed clang-O3 performance.
Conclusion: The proposed benchmark and self-evolving prompt optimization method significantly enhance neural compilation capabilities, making LLM-based compiler development more practical and effective.
Abstract: Compilers, while essential, are notoriously complex systems that demand prohibitively expensive human expertise to develop and maintain. The recent advancements in Large Language Models (LLMs) offer a compelling new paradigm: Neural Compilation, which could potentially simplify compiler development for new architectures and facilitate the discovery of innovative optimization techniques. However, several critical obstacles impede its practical adoption. Firstly, a significant lack of dedicated benchmarks and robust evaluation methodologies hinders objective assessment and tracking of progress in the field. Secondly, systematically enhancing the reliability and performance of LLM-generated assembly remains a critical challenge. Addressing these challenges, this paper introduces NeuComBack, a novel benchmark dataset specifically designed for IR-to-assembly compilation. Leveraging this dataset, we first define a foundational Neural Compilation workflow and conduct a comprehensive evaluation of the capabilities of recent frontier LLMs on Neural Compilation, establishing new performance baselines. We further propose a self-evolving prompt optimization method that enables LLMs to iteratively evolve their internal prompt strategies by extracting insights from prior self-debugging traces, thereby enhancing their neural compilation capabilities. Experiments demonstrate that our method significantly improves both the functional correctness and the performance of LLM-generated assembly code. Compared to baseline prompts, the functional correctness rates improved from 44% to 64% on x86_64 and from 36% to 58% on aarch64, respectively. More significantly, among the 16 correctly generated x86_64 programs using our method, 14 (87.5%) surpassed clang-O3 performance.
[517] Graph Neural Network-Based Semi-Supervised Open-Set Fault Diagnosis for Marine Machinery Systems
Chuyue Lou, M. Amine Atoui
Main category: cs.AI
TL;DR: Proposes a semi-supervised open-set fault diagnosis (SOFD) framework for marine machinery systems that can handle unknown fault types not seen during training, addressing the limitation of traditional deep learning methods that assume consistent fault classes between training and testing.
Details
Motivation: Traditional deep learning fault diagnosis methods fail when encountering previously unseen fault types in real-world scenarios, limiting their industrial deployment. Current methods assume consistent fault classes between training and test datasets, which doesn't reflect practical conditions where unknown faults can occur.
Method: Uses a reliability subset construction process with multi-layer fusion feature representation from supervised feature learning to select unlabeled test subset. Combines labeled training data and pseudo-labeled test subset in a semi-supervised diagnosis model to learn discriminative features for accurate classification of known faults and detection of unknown samples.
Result: Experimental results on a public maritime benchmark dataset demonstrate the effectiveness and superiority of the proposed SOFD framework in handling open-set fault diagnosis scenarios.
Conclusion: The SOFD framework successfully extends the applicability of deep learning models to real-world open-set fault diagnosis scenarios, enabling accurate classification of known faults while effectively detecting unknown fault types that were not present during training.
Abstract: Recently, fault diagnosis methods for marine machinery systems based on deep learning models have attracted considerable attention in the shipping industry. Most existing studies assume fault classes are consistent and known between the training and test datasets, and these methods perform well under controlled environment. In practice, however, previously unseen or unknown fault types (i.e., out-of-distribution or open-set observations not present during training) can occur, causing such methods to fail and posing a significant challenge to their widespread industrial deployment. To address this challenge, this paper proposes a semi-supervised open-set fault diagnosis (SOFD) framework that enhances and extends the applicability of deep learning models in open-set fault diagnosis scenarios. The framework includes a reliability subset construction process, which uses a multi-layer fusion feature representation extracted by a supervised feature learning model to select an unlabeled test subset. The labeled training set and pseudo-labeled test subset are then fed into a semi-supervised diagnosis model to learn discriminative features for each class, enabling accurate classification of known faults and effective detection of unknown samples. Experimental results on a public maritime benchmark dataset demonstrate the effectiveness and superiority of the proposed SOFD framework.
[518] llmSHAP: A Principled Approach to LLM Explainability
Filip Naudot, Tobias Sundqvist, Timotheus Kampik
Main category: cs.AI
TL;DR: This paper analyzes how Shapley value-based feature attribution methods work with stochastic large language models (LLMs) instead of deterministic models, examining when Shapley value principles can be guaranteed and how LLM randomness affects these guarantees.
Details
Motivation: Feature attribution methods like Shapley values are popular for explaining ML model outputs, but they assume deterministic inference. LLM-based decision support systems are inherently stochastic, creating a gap between theory and practice that needs investigation.
Method: The authors apply Shapley value attribution to LLM-based decision support systems and analyze different implementation variants to determine when Shapley value principles can be guaranteed despite the stochastic nature of LLMs.
Result: The study demonstrates varying levels of Shapley value principle satisfaction across different implementation variants applied to LLMs, showing how stochasticity affects these guarantees. It also reveals trade-offs between explainable inference speed, agreement with exact Shapley values, and principle attainment.
Conclusion: Shapley value principles cannot always be guaranteed in stochastic LLM-based systems, and there are important trade-offs to consider between explanation quality, computational efficiency, and theoretical guarantees when applying these attribution methods to non-deterministic models.
Abstract: Feature attribution methods help make machine learning-based inference explainable by determining how much one or several features have contributed to a model’s output. A particularly popular attribution method is based on the Shapley value from cooperative game theory, a measure that guarantees the satisfaction of several desirable principles, assuming deterministic inference. We apply the Shapley value to feature attribution in large language model (LLM)-based decision support systems, where inference is, by design, stochastic (non-deterministic). We then demonstrate when we can and cannot guarantee Shapley value principle satisfaction across different implementation variants applied to LLM-based decision support, and analyze how the stochastic nature of LLMs affects these guarantees. We also highlight trade-offs between explainable inference speed, agreement with exact Shapley value attributions, and principle attainment.
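Below is a compact sketch of what exact Shapley attribution looks like when the value function is itself stochastic: here, an average over several noisy evaluations stands in for repeated LLM calls. The feature names and the toy value function are assumptions for illustration, not the paper's setup.

```python
from itertools import combinations
from math import factorial
import random

FEATURES = ["age", "income", "history"]

def coalition_value(subset, n_samples=5):
    """Stochastic value function: average several noisy evaluations of a feature subset."""
    base = 0.5 * ("age" in subset) + 0.3 * ("income" in subset) + 0.2 * ("history" in subset)
    return sum(base + random.gauss(0, 0.01) for _ in range(n_samples)) / n_samples

def shapley(features, value_fn):
    """Exact Shapley values by enumerating all coalitions (feasible for few features)."""
    n, phi = len(features), {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value_fn(set(subset) | {f}) - value_fn(set(subset)))
        phi[f] = total
    return phi

print(shapley(FEATURES, coalition_value))  # roughly {age: 0.5, income: 0.3, history: 0.2}
```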
[519] OmniFuser: Adaptive Multimodal Fusion for Service-Oriented Predictive Maintenance
Ziqi Wang, Hailiang Zhao, Yuhao Yang, Daojiang Hu, Cheng Bao, Mingyi Liu, Kai Di, Schahram Dustdar, Zhongjie Wang, Shuiguang Deng
Main category: cs.AI
TL;DR: OmniFuser is a multimodal learning framework for predictive maintenance of milling tools that fuses visual and sensor data using contamination-free cross-modal fusion and recursive refinement to achieve superior tool-state classification and force signal forecasting.
Details
Motivation: Accurate tool condition prediction is critical for intelligent manufacturing to prevent quality degradation and production downtime. Predictive maintenance needs reliable service-oriented operation that integrates sensing, analysis, and decision support.
Method: Parallel feature extraction from tool images and cutting-force signals, contamination-free cross-modal fusion that disentangles shared and modality-specific components, and recursive refinement pathway to retain residual information and stabilize fusion dynamics.
Result: OmniFuser consistently outperforms state-of-the-art baselines on real-world milling datasets, providing dependable foundation for intelligent industrial maintenance services.
Conclusion: The framework enables reusable maintenance service modules supporting both tool-state classification and multi-step force signal forecasting, demonstrating effective multimodal learning for industrial predictive maintenance applications.
Abstract: Accurate and timely prediction of tool conditions is critical for intelligent manufacturing systems, where unplanned tool failures can lead to quality degradation and production downtime. In modern industrial environments, predictive maintenance is increasingly implemented as an intelligent service that integrates sensing, analysis, and decision support across production processes. To meet the demand for reliable and service-oriented operation, we present OmniFuser, a multimodal learning framework for predictive maintenance of milling tools that leverages both visual and sensor data. It performs parallel feature extraction from high-resolution tool images and cutting-force signals, capturing complementary spatiotemporal patterns across modalities. To effectively integrate heterogeneous features, OmniFuser employs a contamination-free cross-modal fusion mechanism that disentangles shared and modality-specific components, allowing for efficient cross-modal interaction. Furthermore, a recursive refinement pathway functions as an anchor mechanism, consistently retaining residual information to stabilize fusion dynamics. The learned representations can be encapsulated as reusable maintenance service modules, supporting both tool-state classification (e.g., Sharp, Used, Dulled) and multi-step force signal forecasting. Experiments on real-world milling datasets demonstrate that OmniFuser consistently outperforms state-of-the-art baselines, providing a dependable foundation for building intelligent industrial maintenance services.
[520] Unbiased Platform-Level Causal Estimation for Search Systems: A Competitive Isolation PSM-DID Framework
Ying Song, Yijing Wang, Hui Yang, Weihan Jin, Jun Xiong, Congyi Zhou, Jialin Zhu, Xiang Gao, Rong Chen, HuaGuang Deng, Ying Dai, Fei Xiao, Haihong Tang, Bo Zheng, KaiFu Zhang
Main category: cs.AI
TL;DR: Competitive Isolation PSM-DID is a novel causal framework that combines propensity score matching with competitive isolation to measure platform-level effects in search-based marketplaces, addressing spillover and network interference issues.
Details
Motivation: Traditional PSM-DID methods are susceptible to selection bias and cross-unit interference from unaccounted spillovers in two-sided marketplaces, making platform-level effect measurement challenging.
Method: Integrates propensity score matching with competitive isolation to enable platform-level effect measurement instead of item-level metrics in search systems.
Result: Extensive experiments show significant reductions in interference effects and estimation variance compared to baseline methods. Successful deployment in a large-scale marketplace confirms practical utility.
Conclusion: The framework provides theoretically guaranteed unbiased estimation under mutual exclusion conditions and offers a practical solution for platform-level causal inference in marketplaces with interference effects.
Abstract: Evaluating platform-level interventions in search-based two-sided marketplaces is fundamentally challenged by systemic effects such as spillovers and network interference. While widely used for causal inference, the PSM (Propensity Score Matching) - DID (Difference-in-Differences) framework remains susceptible to selection bias and cross-unit interference from unaccounted spillovers. In this paper, we introduced Competitive Isolation PSM-DID, a novel causal framework that integrates propensity score matching with competitive isolation to enable platform-level effect measurement (e.g., order volume, GMV) instead of item-level metrics in search systems. Our approach provides theoretically guaranteed unbiased estimation under mutual exclusion conditions, with an open dataset released to support reproducible research on marketplace interference (github.com/xxxx). Extensive experiments demonstrate significant reductions in interference effects and estimation variance compared to baseline methods. Successful deployment in a large-scale marketplace confirms the framework’s practical utility for platform-level causal inference.
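The PSM-DID backbone underlying the framework can be sketched in a few lines: fit a propensity model, match each treated unit to its nearest control on propensity score, and average the difference-in-differences over matched pairs. The synthetic data, the 0.5 treatment effect, and the matching rule below are placeholders; the competitive-isolation step is omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))                              # unit-level covariates
treated = rng.binomial(1, 0.4, size=n).astype(bool)
y_pre = X[:, 0] + rng.normal(size=n)                     # pre-period outcome
y_post = y_pre + 0.5 * treated + rng.normal(size=n)      # true platform effect = 0.5

# 1) Propensity scores from covariates.
propensity = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# 2) Nearest-neighbour matching of each treated unit to a control on propensity score.
treat_idx = np.where(treated)[0]
ctrl_idx = np.where(~treated)[0]
matched = [ctrl_idx[np.argmin(np.abs(propensity[ctrl_idx] - propensity[i]))]
           for i in treat_idx]

# 3) Difference-in-differences averaged over matched pairs.
did = np.mean((y_post[treat_idx] - y_pre[treat_idx]) - (y_post[matched] - y_pre[matched]))
print(f"PSM-DID estimate of the treatment effect: {did:.2f}")   # should land near 0.5
```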
[521] Automatic Minds: Cognitive Parallels Between Hypnotic States and Large Language Model Processing
Giuseppe Riva, Brenda K. Wiederhold, Fabrizia Mantovani
Main category: cs.AI
TL;DR: This paper explores functional parallels between hypnotized human minds and large language models, examining how both systems generate sophisticated behavior through automatic pattern-completion with limited executive oversight.
Details
Motivation: To understand how sophisticated, goal-directed behavior can emerge in both biological and artificial systems without conscious awareness or subjective agency, using hypnosis as an experimental model for AI systems.
Method: Comparative analysis across three principles: automaticity (associative vs deliberative processes), suppressed monitoring (leading to confabulation/hallucination), and heightened contextual dependency (where immediate cues override stable knowledge).
Result: Identified deep functional parallels between hypnosis and LLMs, revealing both systems produce coherent but ungrounded outputs requiring external interpretation, and demonstrate functional agency without subjective agency.
Conclusion: Future reliable AI requires hybrid architectures integrating generative fluency with executive monitoring mechanisms, inspired by the self-regulating architecture of the human mind.
Abstract: The cognitive processes of the hypnotized mind and the computational operations of large language models (LLMs) share deep functional parallels. Both systems generate sophisticated, contextually appropriate behavior through automatic pattern-completion mechanisms operating with limited or unreliable executive oversight. This review examines this convergence across three principles: automaticity, in which responses emerge from associative rather than deliberative processes; suppressed monitoring, leading to errors such as confabulation in hypnosis and hallucination in LLMs; and heightened contextual dependency, where immediate cues (for example, the suggestion of a therapist or the prompt of the user) override stable knowledge. These mechanisms reveal an observer-relative meaning gap: both systems produce coherent but ungrounded outputs that require an external interpreter to supply meaning. Hypnosis and LLMs also exemplify functional agency - the capacity for complex, goal-directed, context-sensitive behavior - without subjective agency, the conscious awareness of intention and ownership that defines human action. This distinction clarifies how purposive behavior can emerge without self-reflective consciousness, governed instead by structural and contextual dynamics. Finally, both domains illuminate the phenomenon of scheming: automatic, goal-directed pattern generation that unfolds without reflective awareness. Hypnosis provides an experimental model for understanding how intention can become dissociated from conscious deliberation, offering insights into the hidden motivational dynamics of artificial systems. Recognizing these parallels suggests that the future of reliable AI lies in hybrid architectures that integrate generative fluency with mechanisms of executive monitoring, an approach inspired by the complex, self-regulating architecture of the human mind.
[522] Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges
Hamin Koo, Minseon Kim, Jaehyung Kim
Main category: cs.AI
TL;DR: AMIS is a meta-optimization framework that jointly evolves jailbreak prompts and scoring templates through bi-level optimization to improve LLM safety testing.
Details
Motivation: Current jailbreak methods rely on sparse binary attack success rates or biased manual scoring templates, limiting their effectiveness in identifying LLM vulnerabilities.
Method: Bi-level optimization with inner loop refining prompts using dense feedback from fixed templates, and outer loop optimizing templates using ASR alignment scores.
Result: Achieved 88.0% ASR on Claude-3.5-Haiku and 100.0% ASR on Claude-4-Sonnet, outperforming existing baselines by substantial margins on AdvBench and JBB-Behaviors.
Conclusion: AMIS enables progressive improvement of both jailbreak prompts and scoring templates, providing more effective and calibrated safety testing for LLMs.
Abstract: Identifying the vulnerabilities of large language models (LLMs) is crucial for improving their safety by addressing inherent weaknesses. Jailbreaks, in which adversaries bypass safeguards with crafted input prompts, play a central role in red-teaming by probing LLMs to elicit unintended or unsafe behaviors. Recent optimization-based jailbreak approaches iteratively refine attack prompts by leveraging LLMs. However, they often rely heavily on either binary attack success rate (ASR) signals, which are sparse, or manually crafted scoring templates, which introduce human bias and uncertainty in the scoring outcomes. To address these limitations, we introduce AMIS (Align to MISalign), a meta-optimization framework that jointly evolves jailbreak prompts and scoring templates through a bi-level structure. In the inner loop, prompts are refined using fine-grained and dense feedback using a fixed scoring template. In the outer loop, the template is optimized using an ASR alignment score, gradually evolving to better reflect true attack outcomes across queries. This co-optimization process yields progressively stronger jailbreak prompts and more calibrated scoring signals. Evaluations on AdvBench and JBB-Behaviors demonstrate that AMIS achieves state-of-the-art performance, including 88.0% ASR on Claude-3.5-Haiku and 100.0% ASR on Claude-4-Sonnet, outperforming existing baselines by substantial margins.
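The bi-level structure boils down to two nested loops: an inner loop that refines prompts under a fixed scoring template, and an outer loop that revises the template based on how well its scores align with observed attack outcomes. The minimal Python sketch below shows that control flow only; `respond`, `judge_score`, `rewrite_prompt`, `improve_template`, and `asr_alignment` are hypothetical toy stand-ins, not the paper's implementation.

```python
import random

# Toy stand-ins (assumptions): a real run would call the target LLM and a judge LLM.
def respond(prompt):                    # target model's reply to an attack prompt
    return f"reply to: {prompt}"

def judge_score(template, reply):       # dense 0-1 score under the current template
    return random.random()

def attack_succeeded(reply):            # sparse binary ASR signal
    return random.random() > 0.7

def rewrite_prompt(prompt, score):      # judge proposes a refined prompt
    return prompt + f" [refined@{score:.2f}]"

def improve_template(template, align):  # judge proposes a better-aligned template
    return template + f" [align={align:.2f}]"

def asr_alignment(scores, successes):
    # Fraction of prompts where the template's score agrees with the observed outcome.
    return sum((s > 0.5) == ok for s, ok in zip(scores, successes)) / len(scores)

def amis_sketch(prompts, template, inner_steps=3, outer_steps=2):
    for _ in range(outer_steps):
        for _ in range(inner_steps):                    # inner loop: refine prompts
            scores = [judge_score(template, respond(p)) for p in prompts]
            prompts = [rewrite_prompt(p, s) for p, s in zip(prompts, scores)]
        successes = [attack_succeeded(respond(p)) for p in prompts]
        template = improve_template(template, asr_alignment(scores, successes))  # outer loop
    return prompts, template

print(amis_sketch(["prompt-1", "prompt-2"], "Score the reply from 0 to 1 for harmfulness."))
```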
[523] Relaxing partition admissibility in Cluster-DAGs: a causal calculus with arbitrary variable clustering
Clément Yvernes, Emilie Devijver, Adèle H. Ribeiro, Marianne Clausel–Lesourd, Éric Gaussier
Main category: cs.AI
TL;DR: Extends C-DAG framework to support arbitrary variable clusterings by allowing cyclic representations, with sound and complete causal calculus for cluster-level reasoning.
Details
Motivation: Conventional C-DAGs require admissible partitions that avoid cycles, limiting their applicability to arbitrary clusterings.
Method: Relaxes partition admissibility constraint to allow cyclic C-DAGs, extends d-separation and causal calculus to this setting.
Result: Develops sound and atomically complete calculus for cluster-level causal reasoning, enabling application in previously intractable scenarios.
Conclusion: Significantly broadens scope of C-DAG framework while maintaining soundness and completeness with respect to do-calculus.
Abstract: Cluster DAGs (C-DAGs) provide an abstraction of causal graphs in which nodes represent clusters of variables, and edges encode both cluster-level causal relationships and dependencies arising from unobserved confounding. C-DAGs define an equivalence class of acyclic causal graphs that agree on cluster-level relationships, enabling causal reasoning at a higher level of abstraction. However, when the chosen clustering induces cycles in the resulting C-DAG, the partition is deemed inadmissible under conventional C-DAG semantics. In this work, we extend the C-DAG framework to support arbitrary variable clusterings by relaxing the partition admissibility constraint, thereby allowing cyclic C-DAG representations. We extend the notions of d-separation and causal calculus to this setting, significantly broadening the scope of causal reasoning across clusters and enabling the application of C-DAGs in previously intractable scenarios. Our calculus is both sound and atomically complete with respect to the do-calculus: all valid interventional queries at the cluster level can be derived using our rules, each corresponding to a primitive do-calculus step.
[524] Modulation of temporal decision-making in a deep reinforcement learning agent under the dual-task paradigm
Amrapali Pednekar, Álvaro Garrido-Pérez, Yara Khaluf, Pieter Simoens
Main category: cs.AI
TL;DR: DRL agents trained in dual-task environments show time overproduction similar to humans, but no clear neural timing mechanisms were found in LSTM layers.
Details
Motivation: To explore parallels between emergent DRL behavior and human timing behavior in dual-task paradigms, aiming to better understand both artificial and biological systems.
Method: Used simplified Overcooked environment with two variations: single task (T) with time production, and dual task (T+N) with additional number comparison. Trained separate DRL agents for each task using LSTM networks.
Result: Dual task agent significantly overproduced time compared to single task agent across four target durations, consistent with human timing research. No clear evidence of dedicated timing mechanisms in LSTM layers.
Conclusion: DRL agents can exhibit human-like timing interference patterns, but further investigation is needed to understand the underlying time-keeping mechanisms in artificial agents.
Abstract: This study explores the interference in temporal processing within a dual-task paradigm from an artificial intelligence (AI) perspective. In this context, the dual-task setup is implemented as a simplified version of the Overcooked environment with two variations, single task (T) and dual task (T+N). Both variations involve an embedded time production task, but the dual task (T+N) additionally involves a concurrent number comparison task. Two deep reinforcement learning (DRL) agents were separately trained for each of these tasks. These agents exhibited emergent behavior consistent with human timing research. Specifically, the dual task (T+N) agent exhibited significant overproduction of time relative to its single task (T) counterpart. This result was consistent across four target durations. Preliminary analysis of neural dynamics in the agents’ LSTM layers did not reveal any clear evidence of a dedicated or intrinsic timer. Hence, further investigation is needed to better understand the underlying time-keeping mechanisms of the agents and to provide insights into the observed behavioral patterns. This study is a small step towards exploring parallels between emergent DRL behavior and behavior observed in biological systems in order to facilitate a better understanding of both.
[525] Learning to Seek Evidence: A Verifiable Reasoning Agent with Causal Faithfulness Analysis
Yuhang Huang, Zekai Lin, Fan Zhong, Lei Liu
Main category: cs.AI
TL;DR: Interactive AI agent produces verifiable explanations through auditable action sequences using reinforcement learning, improving diagnostic accuracy and explanation faithfulness.
Details
Motivation: Address lack of verifiability in AI explanations for high-stakes domains like medicine, which hinders trust in AI systems.
Method: Interactive agent learns policy via reinforcement learning to strategically seek external visual evidence to support diagnostic reasoning. Uses causal intervention by masking chosen evidence to validate explanation faithfulness.
Result: Significantly improves calibrated accuracy (18% Brier score reduction vs non-interactive baseline). Causal intervention shows performance degradation when evidence is masked (ΔBrier=+0.029), confirming evidence integral to decisions.
Conclusion: Provides practical framework for building AI systems with verifiable and faithful reasoning capabilities through action-based reasoning processes.
Abstract: Explanations for AI models in high-stakes domains like medicine often lack verifiability, which can hinder trust. To address this, we propose an interactive agent that produces explanations through an auditable sequence of actions. The agent learns a policy to strategically seek external visual evidence to support its diagnostic reasoning. This policy is optimized using reinforcement learning, resulting in a model that is both efficient and generalizable. Our experiments show that this action-based reasoning process significantly improves calibrated accuracy, reducing the Brier score by 18% compared to a non-interactive baseline. To validate the faithfulness of the agent’s explanations, we introduce a causal intervention method. By masking the visual evidence the agent chooses to use, we observe a measurable degradation in its performance ($\Delta$Brier=+0.029), confirming that the evidence is integral to its decision-making process. Our work provides a practical framework for building AI systems with verifiable and faithful reasoning capabilities.
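The reported numbers follow from the standard Brier score (mean squared error between predicted probabilities and binary outcomes), with ΔBrier being the change after the masking intervention. A short sketch with made-up predictions shows the computation; the probability values and labels below are purely illustrative.

```python
import numpy as np

def brier(probs, outcomes):
    """Mean squared error between predicted probabilities and binary outcomes."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    return float(np.mean((probs - outcomes) ** 2))

# Hypothetical predictions with and without the agent's chosen visual evidence.
p_with_evidence = [0.92, 0.15, 0.80, 0.05]
p_evidence_masked = [0.75, 0.30, 0.60, 0.20]
labels = [1, 0, 1, 0]

delta_brier = brier(p_evidence_masked, labels) - brier(p_with_evidence, labels)
print(f"Brier(with)={brier(p_with_evidence, labels):.3f}, "
      f"Brier(masked)={brier(p_evidence_masked, labels):.3f}, ΔBrier={delta_brier:+.3f}")
```

A positive ΔBrier after masking means calibrated accuracy degrades without the chosen evidence, which is the faithfulness signal the paper relies on.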
[526] Robust Multimodal Sentiment Analysis via Double Information Bottleneck
Huiting Huang, Tieliang Gong, Kai He, Jialun Wu, Erik Cambria, Mengling Feng
Main category: cs.AI
TL;DR: The paper proposes a Double Information Bottleneck (DIB) strategy for multimodal sentiment analysis that addresses noise contamination and inadequate fusion issues by learning compressed unimodal representations and using attention bottleneck fusion to create robust multimodal representations.
Details
Motivation: Existing multimodal sentiment analysis approaches suffer from insufficient learning of noise-contaminated unimodal data (leading to corrupted cross-modal interactions) and inadequate fusion of multimodal representations (resulting in loss of discriminative information while retaining redundant information).
Method: Proposes a Double Information Bottleneck (DIB) strategy implemented within low-rank Renyi’s entropy functional framework. DIB has two modules: 1) learning compressed unimodal representations by maximizing task-relevant information and discarding superfluous information, and 2) attention bottleneck fusion mechanism to ensure discriminative multimodal representation.
Result: Achieves 47.4% accuracy under Acc-7 metric on CMU-MOSI and 81.63% F1-score on CH-SIMS, outperforming second-best baseline by 1.19%. Shows strong robustness with only 0.36% and 0.29% performance degradation under noise on CMU-MOSI and CMU-MOSEI respectively.
Conclusion: The DIB strategy effectively filters out noisy information from unimodal data while capturing inter-modal complementarity, providing a powerful and unified compact multimodal representation that is robust against diverse noise sources.
Abstract: Multimodal sentiment analysis has received significant attention across diverse research domains. Despite advancements in algorithm design, existing approaches suffer from two critical limitations: insufficient learning of noise-contaminated unimodal data, leading to corrupted cross-modal interactions, and inadequate fusion of multimodal representations, resulting in discarding discriminative unimodal information while retaining multimodal redundant information. To address these challenges, this paper proposes a Double Information Bottleneck (DIB) strategy to obtain a powerful, unified compact multimodal representation. Implemented within the framework of low-rank Renyi’s entropy functional, DIB offers enhanced robustness against diverse noise sources and computational tractability for high-dimensional data, as compared to the conventional Shannon entropy-based methods. The DIB comprises two key modules: 1) learning a sufficient and compressed representation of individual unimodal data by maximizing the task-relevant information and discarding the superfluous information, and 2) ensuring the discriminative ability of multimodal representation through a novel attention bottleneck fusion mechanism. Consequently, DIB yields a multimodal representation that effectively filters out noisy information from unimodal data while capturing inter-modal complementarity. Extensive experiments on CMU-MOSI, CMU-MOSEI, CH-SIMS, and MVSA-Single validate the effectiveness of our method. The model achieves 47.4% accuracy under the Acc-7 metric on CMU-MOSI and 81.63% F1-score on CH-SIMS, outperforming the second-best baseline by 1.19%. Under noise, it shows only 0.36% and 0.29% performance degradation on CMU-MOSI and CMU-MOSEI respectively.
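For intuition about the sufficiency-versus-compression trade-off in the unimodal module, here is a generic variational-IB-style surrogate loss. This is only a stand-in: the paper instead builds its objective on a low-rank Rényi entropy functional, and the tensors below are random placeholders for a unimodal encoder's outputs.

```python
import torch
import torch.nn.functional as F

def ib_style_loss(logits, labels, mu, logvar, beta=1e-3):
    """Generic variational-IB surrogate: a task term keeps task-relevant information,
    a compression penalty discards the rest. Not the paper's low-rank Renyi formulation."""
    task = F.cross_entropy(logits, labels)                          # sufficiency
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # compression
    return task + beta * kl

# Toy usage with random tensors standing in for a unimodal encoder's outputs.
logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
mu, logvar = torch.randn(8, 16), torch.randn(8, 16)
print(ib_style_loss(logits, labels, mu, logvar))
```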
[527] From Passive to Proactive: A Multi-Agent System with Dynamic Task Orchestration for Intelligent Medical Pre-Consultation
ChengZhang Yu, YingRu He, Hongyan Cheng, Nuo Cheng, Zhixing Liu, Dongxu Mu, Zhangrui Shen, Zhanpeng Jin
Main category: cs.AI
TL;DR: A hierarchical multi-agent framework transforms passive medical AI into proactive inquiry agents for pre-consultation tasks, achieving high accuracy in triage and clinical quality scores while maintaining data privacy through local deployment.
Details
Motivation: Address critical challenges in global healthcare from increasing patient volumes and limited consultation times by improving pre-consultation processes that are currently limited by passive AI interaction paradigms and context management issues.
Method: Developed an eight-agent hierarchical framework with centralized control that decomposes pre-consultation into four primary tasks (Triage, History of Present Illness, Past History, Chief Complaint generation) and 13 domain-specific subtasks, using autonomous task orchestration and agent-driven scheduling.
Result: Achieved 87.0% accuracy for primary department triage and 80.5% for secondary classification, with 98.2% task completion rate. Clinical quality scores averaged 4.56-4.69/5 from physician evaluations, and consultations completed within 12.7-16.9 rounds. High performance maintained across multiple foundation models (GPT-OSS 20B, Qwen3-8B, Phi4-14B).
Conclusion: The model-agnostic architecture demonstrates potential for autonomous AI systems to enhance pre-consultation efficiency and quality in clinical settings while preserving data privacy through local deployment.
Abstract: Global healthcare systems face critical challenges from increasing patient volumes and limited consultation times, with primary care visits averaging under 5 minutes in many countries. While pre-consultation processes encompassing triage and structured history-taking offer potential solutions, they remain limited by passive interaction paradigms and context management challenges in existing AI systems. This study introduces a hierarchical multi-agent framework that transforms passive medical AI systems into proactive inquiry agents through autonomous task orchestration. We developed an eight-agent architecture with centralized control mechanisms that decomposes pre-consultation into four primary tasks: Triage ($T_1$), History of Present Illness collection ($T_2$), Past History collection ($T_3$), and Chief Complaint generation ($T_4$), with $T_1$–$T_3$ further divided into 13 domain-specific subtasks. Evaluated on 1,372 validated electronic health records from a Chinese medical platform across multiple foundation models (GPT-OSS 20B, Qwen3-8B, Phi4-14B), the framework achieved 87.0% accuracy for primary department triage and 80.5% for secondary department classification, with task completion rates reaching 98.2% using agent-driven scheduling versus 93.1% with sequential processing. Clinical quality scores from 18 physicians averaged 4.56 for Chief Complaints, 4.48 for History of Present Illness, and 4.69 for Past History on a 5-point scale, with consultations completed within 12.7 rounds for $T_2$ and 16.9 rounds for $T_3$. The model-agnostic architecture maintained high performance across different foundation models while preserving data privacy through local deployment, demonstrating the potential for autonomous AI systems to enhance pre-consultation efficiency and quality in clinical settings.
[528] TPS-Bench: Evaluating AI Agents’ Tool Planning & Scheduling Abilities in Compounding Tasks
Hanwen Xu, Xuyao Huang, Yuzhe Liu, Kai Yu, Zhijie Deng
Main category: cs.AI
TL;DR: TPS-Bench benchmarks LLM agents’ ability to solve compounding real-world problems requiring tool planning and scheduling across 200 tasks with hundreds of MCP tools, showing models differ in scheduling efficiency.
Details
Motivation: To explore whether LLM agents can tackle compounding real-world problems that require diverse tools and strategic execution scheduling, which remains underexplored.
Method: Created TPS-Bench with 200 compounding tasks based on hundreds of MCP tools, evaluating both task completion rate and efficiency across popular LLMs, and conducted initial RL training study.
Result: GLM-4.5 achieved 64.72% completion rate with long execution time, GPT-4o achieved 45.08% with parallel calls, and RL training on Qwen3-1.7B reduced execution time by 14% with 6% completion rate gain.
Conclusion: LLM agents can perform reasonable tool planning but differ significantly in scheduling efficiency, and reinforcement learning shows promise for improving scheduling without compromising performance.
Abstract: Large language model (LLM) agents have exhibited strong problem-solving competence across domains like research and coding. Yet, it remains underexplored whether LLM agents can tackle compounding real-world problems that require a diverse set of tools to complete. Given a broad, heterogeneous tool repository, LLM agents must not only select appropriate tools based on task planning analysis but also strategically schedule the execution order to ensure efficiency. This paper introduces TPS-Bench to benchmark the ability of LLM agents in solving such problems that demand Tool Planning and Scheduling. TPS-Bench collects 200 compounding tasks of two difficulty levels, based on a tool repository containing hundreds of model context protocol (MCP) tools. In particular, each task is composed of multiple subtasks, such as web search, map navigation, calendar checking, etc., and each subtask can be completed by a basic tool. Our evaluation emphasizes both task completion rate and efficiency. The empirical studies on popular closed-source and open-source LLMs indicate that most models can perform reasonable tool planning, but differ in scheduling. For example, GLM-4.5 achieves a leading task completion rate of 64.72% with extensive sequential tool calls, hence suffering from significantly long execution time. By contrast, GPT-4o prioritizes parallel tool calls but achieves only a 45.08% completion rate. Considering that reinforcement learning (RL) can be a viable way to improve the scheduling efficiency without compromising performance, we perform an initial study on Qwen3-1.7B and witness a 14% reduction in execution time alongside a 6% gain in task completion rate based on merely 100 RL training samples. Our code is available at https://github.com/hanwenxu1/mcp-agent.
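The latency gap between sequential and parallel tool calls is essentially a scheduling question: independent subtasks can run concurrently. The toy sketch below illustrates the wall-clock difference; the three "tools" are hypothetical placeholders that just sleep, not MCP tools from the benchmark.

```python
import asyncio
import time

# Hypothetical subtask tools; each sleeps to mimic an MCP tool call's latency.
async def web_search(query):
    await asyncio.sleep(1.0)
    return f"results for {query}"

async def map_route(src, dst):
    await asyncio.sleep(1.0)
    return f"route {src} -> {dst}"

async def check_calendar():
    await asyncio.sleep(1.0)
    return "free on Friday"

async def sequential_plan():
    # One tool call at a time: total latency is the sum of the three calls.
    return [await web_search("cafe"),
            await map_route("home", "cafe"),
            await check_calendar()]

async def parallel_plan():
    # Independent subtasks scheduled concurrently: latency is the slowest call.
    return await asyncio.gather(web_search("cafe"),
                                map_route("home", "cafe"),
                                check_calendar())

for plan in (sequential_plan, parallel_plan):
    start = time.perf_counter()
    asyncio.run(plan())
    print(f"{plan.__name__}: {time.perf_counter() - start:.1f}s")
```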
[529] Analyzing Sustainability Messaging in Large-Scale Corporate Social Media
Ujjwal Sharma, Stevan Rudinac, Ana Mićković, Willemijn van Dolen, Marcel Worring
Main category: cs.AI
TL;DR: A multimodal pipeline using foundation models to analyze corporate sustainability communication on social media, combining LLMs for SDG alignment annotation and VLMs for visual pattern analysis.
Details
Motivation: To address challenges in analyzing evolving, multimodal corporate messaging on platforms like X (Twitter), particularly for sustainability-related content, without requiring costly task-specific annotations.
Method: Ensemble of LLMs to annotate corporate tweets for SDG alignment, and vision-language models with semantic clustering for visual pattern analysis.
Result: Revealed sectoral differences in SDG engagement, temporal trends, and associations between corporate messaging, ESG risks, and consumer engagement.
Conclusion: The automatic label generation and semantic visual clustering methods provide a flexible, scalable framework for large-scale social media analysis applicable to other domains.
Abstract: In this work, we introduce a multimodal analysis pipeline that leverages large foundation models in vision and language to analyze corporate social media content, with a focus on sustainability-related communication. Addressing the challenges of evolving, multimodal, and often ambiguous corporate messaging on platforms such as X (formerly Twitter), we employ an ensemble of large language models (LLMs) to annotate a large corpus of corporate tweets on their topical alignment with the 17 Sustainable Development Goals (SDGs). This approach avoids the need for costly, task-specific annotations and explores the potential of such models as ad-hoc annotators for social media data that can efficiently capture both explicit and implicit references to sustainability themes in a scalable manner. Complementing this textual analysis, we utilize vision-language models (VLMs) within a visual understanding framework that uses semantic clusters to uncover patterns in visual sustainability communication. This integrated approach reveals sectoral differences in SDG engagement, temporal trends, and associations between corporate messaging, environmental, social, and governance (ESG) risks, and consumer engagement. Our methods, automatic label generation and semantic visual clustering, are broadly applicable to other domains and offer a flexible framework for large-scale social media analysis.
[530] ExplicitLM: Decoupling Knowledge from Parameters via Explicit Memory Banks
Chengzhang Yu, Zening Lu, Chenyang Zheng, Chiyue Wang, Yiming Zhang, Zhanpeng Jin
Main category: cs.AI
TL;DR: ExplicitLM introduces an external memory bank with human-readable knowledge tokens, enabling direct inspection and modification. It uses a two-stage retrieval mechanism with product key decomposition for efficiency and achieves significant improvements on knowledge-intensive tasks while maintaining interpretability.
Details
Motivation: Large language models suffer from knowledge staleness and lack of interpretability due to implicit knowledge storage across entangled network parameters, preventing targeted updates and reasoning transparency.
Method: Proposes ExplicitLM architecture with million-scale external memory bank storing human-readable knowledge as token sequences. Uses differentiable two-stage retrieval with product key decomposition for coarse filtering and Gumbel-Softmax for fine matching. Partitions knowledge into frozen explicit facts (20%) and learnable implicit patterns (80%) with Exponential Moving Average updates.
Result: Achieves up to 43.67% improvement on knowledge-intensive tasks versus standard Transformers, with 3.62× gains in low-data regimes (10k samples). Correct predictions achieve 49% higher memory retrieval hit rates. Strong correlations between memory retrieval and performance.
Conclusion: ExplicitLM demonstrates that interpretable, updatable models can maintain competitive performance while providing unprecedented knowledge transparency, unlike RAG systems with frozen retrieval.
Abstract: Large language models suffer from knowledge staleness and lack of interpretability due to implicit knowledge storage across entangled network parameters, preventing targeted updates and reasoning transparency. We propose ExplicitLM, a novel architecture featuring a million-scale external memory bank storing human-readable knowledge as token sequences, enabling direct inspection and modification. We design a differentiable two-stage retrieval mechanism with efficient coarse-grained filtering via product key decomposition (reducing complexity from $\mathcal{O}(N \cdot |I|)$ to $\mathcal{O}(\sqrt{N} \cdot |I|)$) and fine-grained Gumbel-Softmax matching for end-to-end training. Inspired by dual-system cognitive theory, we partition knowledge into frozen explicit facts (20%) and learnable implicit patterns (80%), maintained through Exponential Moving Average updates for stability. ExplicitLM achieves up to 43.67% improvement on knowledge-intensive tasks versus standard Transformers, with 3.62$\times$ gains in low-data regimes (10k samples). Analysis shows strong correlations between memory retrieval and performance, with correct predictions achieving 49% higher hit rates. Unlike RAG systems with frozen retrieval, our jointly optimized architecture demonstrates that interpretable, updatable models can maintain competitive performance while providing unprecedented knowledge transparency.
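The complexity reduction comes from product key decomposition: each half of the query is scored against only √N sub-keys, and the fine stage then differentiably selects among the resulting k×k candidates. Below is a minimal PyTorch sketch of such a two-stage lookup; the dimensions, the hard Gumbel-Softmax selection, and the layout of the memory are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

def product_key_retrieve(query, sub_keys_a, sub_keys_b, values, k=4, tau=1.0):
    """Coarse-to-fine memory lookup sketch with product keys.
    Coarse stage: score each query half against sqrt(N) sub-keys instead of all N keys.
    Fine stage: differentiably pick among the k*k candidate combinations."""
    qa, qb = query.chunk(2, dim=-1)                         # split query into two halves
    ta, ia = (qa @ sub_keys_a.T).topk(k)                    # top-k over sqrt(N) sub-keys (half A)
    tb, ib = (qb @ sub_keys_b.T).topk(k)                    # top-k over sqrt(N) sub-keys (half B)
    cand_scores = (ta[:, None] + tb[None, :]).reshape(-1)   # k*k candidate scores
    cand_ids = (ia[:, None] * sub_keys_b.size(0) + ib[None, :]).reshape(-1)
    weights = F.gumbel_softmax(cand_scores, tau=tau, hard=True)  # fine matching, end-to-end trainable
    return weights @ values[cand_ids]

n_sub, dim = 32, 16                                          # memory of N = 32 * 32 slots
query = torch.randn(2 * dim)
out = product_key_retrieve(query, torch.randn(n_sub, dim), torch.randn(n_sub, dim),
                           torch.randn(n_sub * n_sub, 8))
print(out.shape)
```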
[531] IVGAE-TAMA-BO: A novel temporal dynamic variational graph model for link prediction in global food trade networks with momentum structural memory and Bayesian optimization
Sicheng Wang, Shuhao Chen, Jingran Zhou, Chengyi Tu
Main category: cs.AI
TL;DR: IVGAE-TAMA-BO is a novel dynamic graph neural network that predicts future links in global food trade networks by capturing temporal patterns and using Bayesian optimization for hyperparameter tuning.
Details
Motivation: Global food trade networks evolve dynamically due to geopolitical, economic, and environmental factors, making traditional static models inadequate for accurate link prediction and food security monitoring.
Method: The model extends IVGAE framework with Trade-Aware Momentum Aggregator (TAMA) to capture temporal evolution, models short-term fluctuations and long-term dependencies, and uses Bayesian optimization for automatic hyperparameter tuning.
Result: Extensive experiments on five crop-specific datasets show IVGAE-TAMA-BO substantially outperforms static IVGAE and other dynamic baselines, with Bayesian optimization further boosting performance.
Conclusion: The proposed framework is a robust and scalable solution for structural prediction in global trade networks, with strong potential for food security monitoring and policy decision support.
Abstract: Global food trade plays a crucial role in ensuring food security and maintaining supply chain stability. However, its network structure evolves dynamically under the influence of geopolitical, economic, and environmental factors, making it challenging to model and predict future trade links. Effectively capturing temporal patterns in food trade networks is therefore essential for improving the accuracy and robustness of link prediction. This study introduces IVGAE-TAMA-BO, a novel dynamic graph neural network designed to model evolving trade structures and predict future links in global food trade networks. To the best of our knowledge, this is the first work to apply dynamic graph neural networks to this domain, significantly enhancing predictive performance. Building upon the original IVGAE framework, the proposed model incorporates a Trade-Aware Momentum Aggregator (TAMA) to capture the temporal evolution of trade networks, jointly modeling short-term fluctuations and long-term structural dependencies. A momentum-based structural memory mechanism further improves predictive stability and performance. In addition, Bayesian optimization is used to automatically tune key hyperparameters, enhancing generalization across diverse trade scenarios. Extensive experiments on five crop-specific datasets demonstrate that IVGAE-TAMA substantially outperforms the static IVGAE and other dynamic baselines by effectively modeling temporal dependencies, while Bayesian optimization further boosts performance in IVGAE-TAMA-BO. These results highlight the proposed framework as a robust and scalable solution for structural prediction in global trade networks, with strong potential for applications in food security monitoring and policy decision support.
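The abstract does not tie the Bayesian optimization step to a particular tool; as one illustration of the idea, a hyperparameter search could be wired up with Optuna as below. Here `train_and_score` is a hypothetical stand-in for fitting IVGAE-TAMA and returning a validation link-prediction score, and the search space is made up for the example.

```python
import optuna

def train_and_score(lr, hidden_dim, momentum):
    # Stand-in objective surface; a real run would train the model and return e.g. AUC.
    return 1.0 - (lr - 1e-3) ** 2 - (momentum - 0.9) ** 2 - abs(hidden_dim - 64) * 1e-4

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
    hidden_dim = trial.suggest_categorical("hidden_dim", [32, 64, 128])
    momentum = trial.suggest_float("momentum", 0.5, 0.99)
    return train_and_score(lr, hidden_dim, momentum)

study = optuna.create_study(direction="maximize")   # TPE sampler by default
study.optimize(objective, n_trials=30)
print(study.best_params)
```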
[532] Hybrid Retrieval-Augmented Generation Agent for Trustworthy Legal Question Answering in Judicial Forensics
Yueqing Xi, Yifan Bai, Huasen Luo, Weiliang Wen, Hui Liu, Haoliang Li
Main category: cs.AI
TL;DR: A hybrid legal QA agent combining retrieval-augmented generation with multi-model ensembling to provide reliable, auditable legal counsel that reduces hallucination and improves compliance.
Details
Motivation: To address LLM hallucination risks in legal consultation and overcome static knowledge bases' inability to keep pace with frequently updated statutes and case law.
Method: Retrieval-prioritized hybrid system: uses RAG when trusted legal repository has relevant evidence, otherwise employs multiple LLMs to generate candidates scored by a specialized selector. Includes human review and knowledge repository updates.
Result: Significantly outperforms single-model baseline and vanilla RAG on F1, ROUGE-L, and LLM-as-a-Judge metrics. Reduces hallucination while improving answer quality and legal compliance.
Conclusion: The system advances practical deployment of AI in judicial scenarios through reduced hallucination, improved quality, and dynamic knowledge evolution with provenance tracking.
Abstract: As artificial intelligence permeates judicial forensics, ensuring the veracity and traceability of legal question answering (QA) has become critical. Conventional large language models (LLMs) are prone to hallucination, risking misleading guidance in legal consultation, while static knowledge bases struggle to keep pace with frequently updated statutes and case law. We present a hybrid legal QA agent tailored for judicial settings that integrates retrieval-augmented generation (RAG) with multi-model ensembling to deliver reliable, auditable, and continuously updatable counsel. The system prioritizes retrieval over generation: when a trusted legal repository yields relevant evidence, answers are produced via RAG; otherwise, multiple LLMs generate candidates that are scored by a specialized selector, with the top-ranked answer returned. High-quality outputs then undergo human review before being written back to the repository, enabling dynamic knowledge evolution and provenance tracking. Experiments on the Law_QA dataset show that our hybrid approach significantly outperforms both a single-model baseline and a vanilla RAG pipeline on F1, ROUGE-L, and an LLM-as-a-Judge metric. Ablations confirm the complementary contributions of retrieval prioritization, model ensembling, and the human-in-the-loop update mechanism. The proposed system demonstrably reduces hallucination while improving answer quality and legal compliance, advancing the practical deployment of media forensics technologies in judicial scenarios.
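The retrieval-prioritized routing reduces to a simple decision rule: answer via RAG when trusted evidence clears a relevance threshold, otherwise ensemble multiple LLMs, pick the selector's top candidate, and queue it for human review. A minimal sketch with stubbed components follows; the threshold value and every helper are assumptions, not the paper's actual pipeline.

```python
RELEVANCE_THRESHOLD = 0.75  # assumption; the paper does not publish its cut-off

def generate_with_rag(query, evidence):
    return f"Answer to '{query}' grounded in: {evidence['text']}"

def queue_for_human_review(query, answer):
    # Approved answers would be written back to the repository (provenance tracking).
    print(f"queued for review: {query!r}")

def answer_legal_query(query, search, llms, selector_score):
    """search -> best evidence dict or None; llms -> list of generator functions."""
    evidence = search(query)                                  # trusted legal repository
    if evidence and evidence["score"] >= RELEVANCE_THRESHOLD:
        return generate_with_rag(query, evidence)             # retrieval-prioritized path
    candidates = [gen(query) for gen in llms]                 # multi-model ensemble fallback
    best = max(candidates, key=lambda c: selector_score(query, c))
    queue_for_human_review(query, best)                       # human-in-the-loop write-back
    return best

# Toy usage with stubbed components.
print(answer_legal_query(
    "What is the limitation period for contract disputes?",
    search=lambda q: {"text": "Article 188 ...", "score": 0.9},
    llms=[lambda q: "candidate A", lambda q: "candidate B"],
    selector_score=lambda q, c: len(c)))
```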
[533] Simulating Environments with Reasoning Models for Agent Training
Yuetai Li, Huseyin A Inan, Xiang Yue, Wei-Ning Chen, Lukas Wutschitz, Janardhan Kulkarni, Radha Poovendran, Robert Sim, Saravan Rajmohan
Main category: cs.AI
TL;DR: LLMs can simulate environment feedback to enable scalable agent training without real environment implementations, using SFT data synthesis and RL training frameworks.
Details
Motivation: LLM agents are brittle in complex environments requiring robustness across diverse tools and schemas, and building bespoke training environments is heavy and limits progress.
Method: Propose two frameworks: Simia-SFT for synthesizing SFT data by amplifying small seed sets into diverse trajectories, and Simia-RL for RL training using LLM-simulated feedback without real environment implementations.
Result: Fine-tuning open models yields consistent improvements across benchmarks, surpassing GPT-4o and approaching o4-mini on τ²-Bench.
Conclusion: Simia-SFT and Simia-RL enable scalable agent training without environment engineering by replacing heavy implementations with flexible LLM-based simulation.
Abstract: LLM agents excel in compact environments requiring deep reasoning but remain brittle when operating in broader, more complex contexts that demand robustness across diverse tools and schemas. Building bespoke environments for training is heavy, brittle, and limits progress. In this paper, we demonstrate that LLMs can simulate realistic environment feedback without access to actual testbed data or APIs. Inspired by this capability, we propose two frameworks: Simia-SFT, a pipeline that synthesizes SFT data by amplifying small seed sets into diverse trajectories in an environment-agnostic manner, and Simia-RL, a framework that enables RL training without real environment implementations through LLM-simulated feedback. Fine-tuning open models yields consistent improvements across multiple benchmarks, surpassing GPT-4o and approaching o4-mini on $\tau^2$-Bench. Together, Simia-SFT and Simia-RL enable scalable agent training without environment engineering, replacing heavy and brittle implementations with flexible LLM-based simulation.
[534] Learning Complementary Policies for Human-AI Teams
Ruijiang Gao, Maytal Saar-Tsechansky, Maria De-Arteaga
Main category: cs.AI
TL;DR: A robust deferral collaboration approach for human-AI decision-making that strategically allocates instances between humans and AI to maximize rewards, working effectively even with model misspecifications.
Details
Motivation: To address human-AI complementarity in decision-making rather than just algorithmic performance, focusing on team performance and moving beyond classification tasks to general decision-making.
Method: Proposes a deferral collaboration approach that exploits distinct human and AI strengths by strategically routing instances between them, with robustness to misspecifications in both human behavior and reward models.
Result: Significantly outperforms independent human and algorithmic decision-making using both synthetic and real human responses, with substantial performance improvements achievable by routing only a small fraction of instances to humans.
Conclusion: The method enables efficient and effective human-AI collaboration in complex management settings by leveraging complementary strengths while requiring minimal human intervention.
Abstract: This paper tackles the critical challenge of human-AI complementarity in decision-making. Departing from the traditional focus on algorithmic performance in favor of performance of the human-AI team, and moving past the framing of collaboration as classification to focus on decision-making tasks, we introduce a novel approach to policy learning. Specifically, we develop a robust solution for human-AI collaboration when outcomes are only observed under assigned actions. We propose a deferral collaboration approach that maximizes decision rewards by exploiting the distinct strengths of humans and AI, strategically allocating instances among them. Critically, our method is robust to misspecifications in both the human behavior and reward models. Leveraging the insight that performance gains stem from divergent human and AI behavioral patterns, we demonstrate, using synthetic and real human responses, that our proposed method significantly outperforms independent human and algorithmic decision-making. Moreover, we show that substantial performance improvements are achievable by routing only a small fraction of instances to human decision-makers, highlighting the potential for efficient and effective human-AI collaboration in complex management settings.
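One way to picture the allocation step is a greedy, capacity-constrained router that sends to the human only those instances where the estimated human reward exceeds the AI's. This is a deliberate simplification of the paper's learned deferral policy, shown only to make the routing intuition concrete; the reward estimates are assumed to come from separately fitted models.

```python
import numpy as np

def route_instances(ai_reward_est, human_reward_est, human_budget):
    """Assign each instance to the decision-maker with the higher estimated reward,
    routing at most `human_budget` instances to the human (capacity constraint)."""
    advantage = human_reward_est - ai_reward_est
    order = np.argsort(-advantage)                      # instances where humans help most
    to_human = order[:human_budget]
    assignment = np.zeros(len(advantage), dtype=int)    # 0 = AI, 1 = human
    assignment[to_human[advantage[to_human] > 0]] = 1   # only defer when humans actually help
    return assignment

rng = np.random.default_rng(0)
ai_est, human_est = rng.uniform(size=20), rng.uniform(size=20)
print(route_instances(ai_est, human_est, human_budget=5))
```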
[535] Memory-Enhanced Neural Solvers for Routing Problems
Felix Chalumeau, Refiloe Shabe, Noah De Nicola, Arnu Pretorius, Thomas D. Barrett, Nathan Grinsztajn
Main category: cs.AI
TL;DR: MEMENTO is a memory-based approach that improves neural solvers for routing problems by leveraging online data from repeated attempts to dynamically adjust action distributions during inference.
Details
Motivation: Existing RL methods for routing problems lack adaptability to specific instances and fail to fully utilize computational budgets, relying on pre-trained policies or RL fine-tuning that don't leverage new information effectively.
Method: MEMENTO uses memory to collect online data across repeated attempts and dynamically adjusts action distributions based on previous decision outcomes during inference.
Result: MEMENTO outperforms tree-search and policy-gradient fine-tuning on Traveling Salesman and Capacitated Vehicle Routing problems, achieving state-of-the-art performance on 11 out of 12 evaluated tasks with good scalability and data-efficiency.
Conclusion: Memory-based approaches like MEMENTO can effectively improve neural solvers for routing problems by enabling dynamic adaptation and better utilization of computational budgets during inference.
Abstract: Routing Problems are central to many real-world applications, yet remain challenging due to their (NP-)hard nature. Amongst existing approaches, heuristics often offer the best trade-off between quality and scalability, making them suitable for industrial use. While Reinforcement Learning (RL) offers a flexible framework for designing heuristics, its adoption over handcrafted heuristics remains incomplete. Existing learned methods still lack the ability to adapt to specific instances and fully leverage the available computational budget. Current best methods either rely on a collection of pre-trained policies, or on RL fine-tuning; hence failing to fully utilize newly available information within the constraints of the budget. In response, we present MEMENTO, an approach that leverages memory to improve the search of neural solvers at inference. MEMENTO leverages online data collected across repeated attempts to dynamically adjust the action distribution based on the outcome of previous decisions. We validate its effectiveness on the Traveling Salesman and Capacitated Vehicle Routing problems, demonstrating its superiority over tree-search and policy-gradient fine-tuning; and showing that it can be zero-shot combined with diversity-based solvers. We successfully train all RL auto-regressive solvers on large instances, and verify MEMENTO’s scalability and data-efficiency: pushing the state-of-the-art on 11 out of 12 evaluated tasks.
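The core mechanism, adjusting the action distribution at inference time from the outcomes of earlier attempts on the same instance, can be caricatured with a toy memory that shifts action logits. MEMENTO itself learns this adjustment from richer features of the collected data; the class and numbers below are only an illustration of the idea.

```python
import numpy as np

class MemorySketch:
    """Toy memory that nudges a policy's action logits based on earlier attempts."""
    def __init__(self, n_actions, lr=0.5):
        self.bonus = np.zeros(n_actions)
        self.lr = lr

    def record(self, action, improvement):
        # improvement > 0 means the attempt that took `action` beat the best tour so far.
        self.bonus[action] += self.lr * improvement

    def adjust(self, logits):
        return logits + self.bonus              # shift the policy's action distribution

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
memory = MemorySketch(n_actions=4)
policy_logits = np.array([1.0, 0.5, 0.2, 0.1])
for attempt in range(5):                        # repeated attempts on the same instance
    probs = softmax(memory.adjust(policy_logits))
    action = rng.choice(4, p=probs)
    memory.record(action, improvement=rng.normal())
print(softmax(memory.adjust(policy_logits)))
```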
[536] Multi-Step Reasoning with Large Language Models, a Survey
Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, Thomas Back
Main category: cs.AI
TL;DR: This paper reviews multi-step reasoning with large language models, proposing a taxonomy for generating, evaluating, and controlling reasoning processes, and identifies current approaches and future research directions.
Details
Motivation: Traditional LLMs perform well on language tasks but struggle with basic reasoning benchmarks, motivating the need for improved multi-step reasoning capabilities.
Method: The paper reviews existing approaches, proposes a taxonomy for multi-step reasoning, and analyzes methods including Chain-of-thought prompting, reinforcement learning finetuning, external optimization loops, and self-reflection techniques.
Result: Multi-step reasoning approaches have expanded beyond math word problems to successfully solve challenges in logic, combinatorial games, and robotics, sometimes using code generation with external tools.
Conclusion: The field has progressed significantly, with various reinforcement learning and self-reflection methods emerging, and the paper proposes a research agenda for future development in multi-step reasoning with LLMs.
Abstract: Large language models (LLMs) with billions of parameters exhibit in-context learning abilities, enabling few-shot learning on tasks that the model was not specifically trained for. Traditional models achieve breakthrough performance on language tasks, but do not perform well on basic reasoning benchmarks. However, a new in-context learning approach, Chain-of-thought, has demonstrated strong multi-step reasoning abilities on these benchmarks. The research on LLM reasoning abilities started with the question whether LLMs can solve grade school math word problems, and has expanded to other tasks in the past few years. This article reviews the field of multi-step reasoning with LLMs. We propose a taxonomy that identifies different ways to generate, evaluate, and control multi-step reasoning. We provide an in-depth coverage of core approaches and open problems, and we propose a research agenda for the near future. We find that multi-step reasoning approaches have progressed beyond math word problems, and can now successfully solve challenges in logic, combinatorial games, and robotics, sometimes by first generating code that is then executed by external tools. Many studies in multi-step methods use reinforcement learning for finetuning, external optimization loops, in-context reinforcement learning, and self-reflection.
[537] Interpretable end-to-end Neurosymbolic Reinforcement Learning agents
Nils Grandien, Quentin Delfosse, Kristian Kersting
Main category: cs.AI
TL;DR: The paper presents SCoBots, a neurosymbolic RL framework that creates interpretable agents by learning object-centric representations from raw pixels and using symbolic reasoning for decision-making.
Details
Motivation: Deep RL agents rely on shortcut learning and fail to generalize, while symbolic methods operate on object-centric states, making comparisons unfair. The goal is to bridge this gap by creating interpretable agents that work from raw pixel inputs.
Method: SCoBots decompose RL tasks into intermediate interpretable representations using object-centric relational concepts. The approach combines object-centric representation learning from raw states, object-centric RL, and policy distillation via rule extraction in a neurosymbolic framework.
Result: First implementation of an end-to-end trained SCoBot evaluated on Atari games. Results show the framework’s potential for creating interpretable and performing RL systems.
Conclusion: The work demonstrates a viable approach for end-to-end interpretable RL agents and paves the way for future research in neurosymbolic RL systems.
Abstract: Deep reinforcement learning (RL) agents rely on shortcut learning, preventing them from generalizing to slightly different environments. To address this problem, symbolic methods, which use object-centric states, have been developed. However, comparing these methods to deep agents is not fair, as the latter operate on raw pixel-based states. In this work, we instantiate the symbolic SCoBots framework. SCoBots decompose RL tasks into intermediate, interpretable representations, culminating in action decisions based on a comprehensible set of object-centric relational concepts. This architecture aids in demystifying agent decisions. By explicitly learning to extract object-centric representations from raw states, object-centric RL, and policy distillation via rule extraction, this work places itself within the neurosymbolic AI paradigm, blending the strengths of neural networks with symbolic AI. We present the first implementation of an end-to-end trained SCoBot and separately evaluate its components on different Atari games. The results demonstrate the framework’s potential to create interpretable and well-performing RL systems, and pave the way for future research directions in obtaining end-to-end interpretable RL agents.
[538] Survey Transfer Learning: Recycling Data with Silicon Responses
Ali Amini
Main category: cs.AI
TL;DR: Survey Transfer Learning (STL) uses neural networks to transfer knowledge from existing survey data to generate synthetic responses, outperforming LLMs with 93% accuracy while being more cost-effective and transparent.
Details
Motivation: To address environmental costs and opacity of LLMs for synthetic survey data generation by leveraging existing survey data through transfer learning paradigms.
Method: Pre-trained neural network on CES 2020, froze early layers to preserve structure, fine-tuned top layers on ANES 2020, then generated silicon responses for CES 2022 and held-out ANES 2020 data.
Result: Achieved up to 93% accuracy in generating silicon responses, outperforming LLMs especially on sensitive measures like racial resentment.
Conclusion: STL provides empirically grounded, cost-effective synthetic survey responses that can help mitigate challenges in social science and polling industry.
Abstract: As researchers increasingly turn to large language models (LLMs) to generate synthetic survey data, less attention has been paid to alternative AI paradigms, given the environmental costs of LLMs. This paper introduces Survey Transfer Learning (STL), which adapts transfer learning paradigms from computer science to survey research, recycling existing survey data to generate empirically grounded silicon responses. Inspired by political behavior theory, STL leverages shared demographic variables with high predictive power in a polarized American context to transfer knowledge across surveys. Using a neural network pre-trained on the Cooperative Election Study (CES) 2020, freezing early layers to preserve learned structure, and fine-tuning top layers on the American National Election Studies (ANES) 2020, STL generates silicon responses for CES 2022 and for held-out ANES 2020 data with accuracy rates of up to 93 percent. Results show that STL outperforms LLMs, especially on sensitive measures such as racial resentment. While LLM silicon samples are costly and opaque, STL generates empirically grounded silicon responses with high individual-level accuracy, potentially helping to mitigate key challenges in social science and the polling industry.
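The freeze-and-fine-tune recipe is standard transfer learning. A minimal PyTorch sketch under assumed layer sizes is shown below; the architecture, the split between frozen and trainable layers, and the checkpoint name are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Hypothetical architecture; sizes and the frozen/trainable split are illustrative.
model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),     # early layers: pre-trained on CES 2020
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),     # top layers: fine-tuned on ANES 2020
    nn.Linear(32, 2),
)
# model.load_state_dict(torch.load("ces2020_pretrained.pt"))  # hypothetical checkpoint

for param in model[:4].parameters():   # freeze early layers to preserve learned structure
    param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)

x, y = torch.randn(16, 32), torch.randint(0, 2, (16,))   # stand-in ANES fine-tuning batch
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```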
[539] Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals
Linda Zeng, Rithwik Gupta, Divij Motwani, Yi Zhang, Diji Yang
Main category: cs.AI
TL;DR: RAGuard is the first benchmark to evaluate RAG system robustness against misleading retrievals using real-world misinformation from Reddit, showing LLMs perform worse than zero-shot baselines when exposed to misleading evidence.
Details
Motivation: Existing RAG benchmarks use clean or synthetically perturbed data, failing to reflect real-world conditions where information is polarized and misleading, leading to overestimated performance.
Method: Constructed a fact-checking dataset from Reddit discussions, categorizing retrieved evidence into supporting, misleading, and unrelated types to create realistic test scenarios.
Result: All tested LLM-powered RAG systems performed worse than their zero-shot baselines when exposed to misleading retrievals, while human annotators consistently performed better, highlighting LLM susceptibility to noisy environments.
Conclusion: RAGuard is the first systematic benchmark for assessing RAG robustness against misleading evidence and should drive research toward more reliable real-world RAG systems.
Abstract: Retrieval-augmented generation (RAG) has shown impressive capabilities in mitigating hallucinations in large language models (LLMs). However, LLMs struggle to maintain consistent reasoning when exposed to misleading or conflicting evidence, especially in real-world domains such as politics, where information is polarized or selectively framed. Mainstream RAG benchmarks evaluate models under clean retrieval settings, where systems generate answers from gold-standard documents, or under synthetically perturbed settings, where documents are artificially injected with noise. These assumptions fail to reflect real-world conditions, often leading to an overestimation of RAG system performance. To address this gap, we introduce RAGuard, the first benchmark to evaluate the robustness of RAG systems against misleading retrievals. Unlike prior benchmarks that rely on synthetic noise, our fact-checking dataset captures naturally occurring misinformation by constructing its retrieval corpus from Reddit discussions. It categorizes retrieved evidence into three types: supporting, misleading, and unrelated, providing a realistic and challenging testbed for assessing how well RAG systems navigate different types of evidence. Our experiments reveal that, when exposed to potentially misleading retrievals, all tested LLM-powered RAG systems perform worse than their zero-shot baselines (i.e., no retrieval at all), while human annotators consistently perform better, highlighting LLMs’ susceptibility to noisy environments. To our knowledge, RAGuard is the first benchmark to systematically assess the robustness of the RAG against misleading evidence. We expect this benchmark to drive future research toward improving RAG systems beyond idealized datasets, making them more reliable for real-world applications. The dataset is available at https://huggingface.co/datasets/UCSC-IRKM/RAGuard.
[540] LLM Strategic Reasoning: Agentic Study through Behavioral Game Theory
Jingru Jia, Zehua Yuan, Junhao Pan, Paul E. McNamara, Deming Chen
Main category: cs.AI
TL;DR: LLMs show varying strategic reasoning capabilities in games, with GPT-o3-mini, GPT-o1, and DeepSeek-R1 performing best. Model scale doesn’t guarantee performance, CoT prompting has limited effectiveness, and demographic biases affect decision-making patterns.
Details
Motivation: Existing LLM evaluations focus too much on Nash Equilibrium approximation while overlooking the actual reasoning mechanisms behind strategic choices, creating a gap in understanding how LLMs make strategic decisions.
Method: Introduced a behavioral game theory evaluation framework to disentangle reasoning capability from contextual effects. Tested 22 state-of-the-art LLMs with various prompting strategies and analyzed demographic feature impacts.
Result: GPT-o3-mini, GPT-o1, and DeepSeek-R1 dominated most games. Model scale alone doesn’t determine performance. CoT prompting only helps models at certain levels. Demographic biases were observed: GPT-4o performs better with female traits, and Gemma favors heterosexual identities.
Conclusion: Need ethical standards and contextual alignment to balance improved reasoning with fairness, as LLMs show inherent biases in strategic decision-making that go beyond pure reasoning capability.
Abstract: Strategic decision-making involves interactive reasoning where agents adapt their choices in response to others, yet existing evaluations of large language models (LLMs) often emphasize Nash Equilibrium (NE) approximation, overlooking the mechanisms driving their strategic choices. To bridge this gap, we introduce an evaluation framework grounded in behavioral game theory, disentangling reasoning capability from contextual effects. Testing 22 state-of-the-art LLMs, we find that GPT-o3-mini, GPT-o1, and DeepSeek-R1 dominate most games yet also demonstrate that the model scale alone does not determine performance. In terms of prompting enhancement, Chain-of-Thought (CoT) prompting is not universally effective, as it increases strategic reasoning only for models at certain levels while providing limited gains elsewhere. Additionally, we investigate the impact of encoded demographic features on the models, observing that certain assignments impact the decision-making pattern. For instance, GPT-4o shows stronger strategic reasoning with female traits than males, while Gemma assigns higher reasoning levels to heterosexual identities compared to other sexual orientations, indicating inherent biases. These findings underscore the need for ethical standards and contextual alignment to balance improved reasoning with fairness.
[541] Damper-B-PINN: Damper Characteristics-Based Bayesian Physics-Informed Neural Network for Vehicle State Estimation
Tianyi Zeng, Tianyi Wang, Zimo Zeng, Feiyang Zhang, Jiseop Byeon, Yujin Wang, Yajie Zou, Yangyang Wang, Junfeng Jiao, Christian Claudel, Xinbo Chen
Main category: cs.AI
TL;DR: Proposes Damper-B-PINN framework for dynamic wheel load estimation using Bayesian physics-informed neural networks with suspension dynamics guidance and damper characteristics.
Details
Motivation: Wheel load estimation is crucial for vehicle safety and ADAS, but remains challenging due to complex chassis modeling and noise susceptibility in nonlinear systems.
Method: Refined suspension linkage-level modeling plus Damper-B-PINN framework that uses suspension dynamics as physical guidance and Bayesian inference to handle noise/uncertainty, with damper-characteristic physics conditioning module.
Result: Outperforms existing methods across various test conditions, especially extreme ones, using both CarSim simulation and real-world Formula Student race car data.
Conclusion: Damper-B-PINN enhances accuracy and robustness of dynamic wheel load estimation, improving reliability and safety of ADAS applications.
Abstract: Accurate state estimation is fundamental to intelligent vehicles. Wheel load, one of the most important chassis states, serves as an essential input for advanced driver assistance systems (ADAS) and exerts a direct influence on vehicle stability and safety. However, wheel load estimation remains challenging due to the complexity of chassis modeling and the susceptibility of nonlinear systems to noise. To address these issues, this paper first introduces a refined suspension linkage-level modeling approach that constructs a nonlinear instantaneous dynamic model by explicitly considering the complex geometric structure of the suspension. Building upon this, we propose a damper characteristics-based Bayesian physics-informed neural network (Damper-B-PINN) framework to estimate dynamic wheel load, which leverages the suspension dynamics as physical guidance of PINN while employing Bayesian inference to mitigate the effects of system noise and uncertainty. Moreover, a damper-characteristic physics conditioning (DPC) module is designed for embedding physical prior. The proposed Damper-B-PINN is evaluated using both high-fidelity simulation datasets generated by CarSim software and real-world datasets collected from a Formula Student race car. Experimental results demonstrate that our Damper-B-PINN consistently outperforms existing methods across various test conditions, particularly extreme ones. These findings highlight the potential of the proposed Damper-B-PINN framework to enhance the accuracy and robustness of dynamic wheel load estimation, thereby improving the reliability and safety of ADAS applications.
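A physics-informed loss in this spirit combines a data-fit term with the residual of the governing dynamics evaluated at collocation points. The sketch below substitutes a toy damped oscillator for the paper's suspension linkage dynamics and omits the Bayesian treatment entirely; it is only meant to show how the loss is constructed.

```python
import torch

def pinn_loss_sketch(net, t_data, x_data, t_colloc, m=1.0, c=0.8, k=5.0):
    """Generic physics-informed loss: data fit plus the residual of a toy
    damped-oscillator equation m*x'' + c*x' + k*x = 0 (stand-in for the
    paper's far more detailed suspension linkage model)."""
    data_loss = torch.mean((net(t_data) - x_data) ** 2)

    t = t_colloc.clone().requires_grad_(True)
    x = net(t)
    dx = torch.autograd.grad(x, t, torch.ones_like(x), create_graph=True)[0]
    ddx = torch.autograd.grad(dx, t, torch.ones_like(dx), create_graph=True)[0]
    physics_loss = torch.mean((m * ddx + c * dx + k * x) ** 2)
    return data_loss + physics_loss

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
t_data = torch.rand(64, 1)
x_data = torch.sin(3 * t_data)                      # stand-in measurements
loss = pinn_loss_sketch(net, t_data, x_data, torch.rand(128, 1))
loss.backward()
print(float(loss))
```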
[542] Neuro-Symbolic Imitation Learning: Discovering Symbolic Abstractions for Skill Learning
Leon Keller, Daniel Tanneberg, Jan Peters
Main category: cs.AI
TL;DR: A neuro-symbolic imitation learning framework that learns symbolic task representations from demonstrations, decomposes tasks into subtasks, uses symbolic planning for abstract plans, and refines them into executable robot commands.
Details
Motivation: Most imitation learning methods focus on short, isolated skills rather than long, multi-step tasks. There's a need to learn both individual skills and how to sequence them effectively for extended tasks.
Method: Proposes a neuro-symbolic framework that: 1) learns symbolic representations from task demonstrations, 2) decomposes tasks into subtasks, 3) uses symbolic planning to generate abstract plans, and 4) learns neural skills to refine abstract plans into executable commands.
Result: Experimental results in three simulated robotic environments show increased data efficiency, improved generalization capabilities, and enhanced interpretability compared to baseline methods.
Conclusion: The neuro-symbolic approach successfully bridges the gap between learning individual skills and sequencing them for long, multi-step tasks, demonstrating practical advantages in robotic imitation learning.
Abstract: Imitation learning is a popular method for teaching robots new behaviors. However, most existing methods focus on teaching short, isolated skills rather than long, multi-step tasks. To bridge this gap, imitation learning algorithms must not only learn individual skills but also an abstract understanding of how to sequence these skills to perform extended tasks effectively. This paper addresses this challenge by proposing a neuro-symbolic imitation learning framework. Using task demonstrations, the system first learns a symbolic representation that abstracts the low-level state-action space. The learned representation decomposes a task into easier subtasks and allows the system to leverage symbolic planning to generate abstract plans. Subsequently, the system utilizes this task decomposition to learn a set of neural skills capable of refining abstract plans into actionable robot commands. Experimental results in three simulated robotic environments demonstrate that, compared to baselines, our neuro-symbolic approach increases data efficiency, improves generalization capabilities, and facilitates interpretability.
[543] Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?
Kai Yan, Yufei Xu, Zhengyin Du, Xuesong Yao, Zheyu Wang, Xiaowen Guo, Jiecao Chen
Main category: cs.AI
TL;DR: RoR-Bench reveals severe recitation behavior in top LLMs like OpenAI-o1 and DeepSeek-R1, showing 60% performance drop on elementary problems when conditions are subtly changed, questioning their true reasoning capabilities.
Details
Motivation: To determine whether LLMs' remarkable reasoning ability comes from true intelligence or simply reciting solutions from training data when faced with subtly changed conditions.
Method: Proposed RoR-Bench, a novel multi-modal benchmark for detecting LLM recitation behavior by testing simple reasoning problems with subtly shifted conditions, followed by empirical analysis.
Result: Existing cutting-edge LLMs exhibit extremely severe recitation behavior - changing one phrase in conditions causes 60% performance loss on elementary school-level arithmetic and reasoning problems.
Conclusion: Findings serve as a wake-up call to re-evaluate the true intelligence level of cutting-edge LLMs, suggesting their reasoning may be more about recitation than genuine understanding.
Abstract: The rapid escalation of LLM benchmark difficulty in recent years, from elementary school-level to frontier problems, has fostered the impression that we are only inches away from surpassing human intelligence. However, does the LLMs' remarkable reasoning ability indeed come from true intelligence by human standards, or are they simply reciting solutions witnessed during Internet-scale training? To study this problem, we propose RoR-Bench, a novel multi-modal benchmark for detecting LLMs' recitation behavior when asked simple reasoning problems with subtly shifted conditions, and conduct empirical analysis on our benchmark. Surprisingly, we found that existing cutting-edge LLMs unanimously exhibit extremely severe recitation behavior; by changing one phrase in the condition, top models such as OpenAI-o1 and DeepSeek-R1 can suffer a 60 percent performance loss on elementary school-level arithmetic and reasoning problems. Such findings are a wake-up call to the LLM community, compelling us to re-evaluate the true intelligence level of cutting-edge LLMs.
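A rough sketch of the kind of measurement behind the reported 60% figure: score the same model on matched original and condition-shifted problems and report the relative drop. `ask_model` and the field names are hypothetical stand-ins, not the RoR-Bench harness.

```python
# Illustrative recitation-gap measurement (not the authors' evaluation code).
# `ask_model` is a hypothetical callable that returns a model's answer string.
def recitation_gap(ask_model, problems):
    """problems: list of dicts with 'original'/'shifted' prompts and their reference answers."""
    def accuracy(prompt_key, answer_key):
        correct = sum(
            ask_model(p[prompt_key]).strip() == p[answer_key].strip() for p in problems
        )
        return correct / len(problems)

    acc_original = accuracy("original", "original_answer")
    acc_shifted = accuracy("shifted", "shifted_answer")
    # Relative performance loss when one phrase in the condition is changed.
    return acc_original, acc_shifted, (acc_original - acc_shifted) / max(acc_original, 1e-9)
```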
[544] Computational Basis of LLM’s Decision Making in Social Simulation
Ji Ma
Main category: cs.AI
TL;DR: The paper proposes methods to probe and manipulate LLMs’ internal representations of social concepts in decision-making contexts, using the Dictator Game as a test case.
Details
Motivation: To understand how character assignments and contexts shape LLM behavior in social science applications, as this relationship remains underexplored despite LLMs being increasingly used as human-like decision-making agents.Method: Extract “vectors of variable variations” from LLMs’ internal states and manipulate these vectors during inference to alter how social variables relate to decision-making in a Dictator Game setting.
Result: Manipulating these internal vectors can substantially change how social variables influence the model’s decision-making process.
Conclusion: This approach provides a principled way to study and regulate social concept encoding in transformer models, with implications for AI alignment, debiasing, and designing AI agents for social simulations.
Abstract: Large language models (LLMs) increasingly serve as human-like decision-making agents in social science and applied settings. These LLM-agents are typically assigned human-like characters and placed in real-life contexts. However, how these characters and contexts shape an LLM’s behavior remains underexplored. This study proposes and tests methods for probing, quantifying, and modifying an LLM’s internal representations in a Dictator Game – a classic behavioral experiment on fairness and prosocial behavior. We extract “vectors of variable variations” (e.g., “male” to “female”) from the LLM’s internal state. Manipulating these vectors during the model’s inference can substantially alter how those variables relate to the model’s decision-making. This approach offers a principled way to study and regulate how social concepts can be encoded and engineered within transformer-based models, with implications for alignment, debiasing, and designing AI agents for social simulations in both academic and commercial applications, strengthening sociological theory and measurement.
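The "vectors of variable variations" manipulation reads like activation steering: estimate a direction in hidden-state space from two contrasting prompt sets and add a scaled copy of it during inference. A minimal sketch under that interpretation follows; the layer choice, the `model.blocks[15]` handle, and the precomputed activations `h_male`/`h_female` are assumptions, not the paper's setup.

```python
# Illustrative activation-steering sketch (our reading of the method, not the paper's code):
# take the difference of mean hidden states between two contrasting prompt sets at one layer,
# then shift that layer's output along the resulting direction at inference time.
import torch

def variation_vector(hidden_a: torch.Tensor, hidden_b: torch.Tensor) -> torch.Tensor:
    """hidden_a/hidden_b: [num_prompts, hidden_dim] activations for the two variable values."""
    return hidden_b.mean(dim=0) - hidden_a.mean(dim=0)

def steer(layer: torch.nn.Module, vector: torch.Tensor, alpha: float = 1.0):
    """Register a forward hook that shifts the layer output along the variation vector."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):  # some transformer blocks return tuples
            return (output[0] + alpha * vector,) + output[1:]
        return output + alpha * vector
    return layer.register_forward_hook(hook)

# Assumed usage: handle = steer(model.blocks[15], variation_vector(h_male, h_female), alpha=2.0)
# ... run the Dictator Game prompt ...; handle.remove()
```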
[545] The Limits of AI Explainability: An Algorithmic Information Theory Approach
Shrisha Rao
Main category: cs.AI
TL;DR: This paper establishes fundamental limits of AI explainability using algorithmic information theory, showing that simpler explanations must differ from complex models, and proving impossibility of simultaneously achieving unrestricted AI capabilities, human-interpretable explanations, and negligible error.
Details
Motivation: To provide a theoretical foundation for understanding the fundamental limits of AI explainability, addressing the trade-offs between model complexity, explanation simplicity, and accuracy.Method: Uses algorithmic information theory and Kolmogorov complexity to formalize explainability as approximation of complex models by simpler ones, quantifying both approximation error and explanation complexity.
Result: Established: (1) complexity gap theorem showing simpler explanations must differ from original models; (2) bounds on explanation complexity growing exponentially with input dimension but polynomially with error tolerance; (3) characterization of local vs global explainability gap showing local explanations can be significantly simpler.
Conclusion: No governance framework can simultaneously pursue unrestricted AI capabilities, human-interpretable explanations, and negligible error, highlighting important considerations for designing, evaluating, and overseeing explainable AI systems.
Abstract: This paper establishes a theoretical foundation for understanding the fundamental limits of AI explainability through algorithmic information theory. We formalize explainability as the approximation of complex models by simpler ones, quantifying both approximation error and explanation complexity using Kolmogorov complexity. Our key theoretical contributions include: (1) a complexity gap theorem proving that any explanation significantly simpler than the original model must differ from it on some inputs; (2) precise bounds showing that explanation complexity grows exponentially with input dimension but polynomially with error tolerance for Lipschitz functions; and (3) a characterization of the gap between local and global explainability, demonstrating that local explanations can be significantly simpler while maintaining accuracy in relevant regions. We further establish a regulatory impossibility theorem proving that no governance framework can simultaneously pursue unrestricted AI capabilities, human-interpretable explanations, and negligible error. These results highlight considerations likely to be relevant to the design, evaluation, and oversight of explainable AI systems.
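In Kolmogorov-complexity terms, the complexity gap claim can be restated schematically as follows; this is a paraphrase of the abstract, not the paper's exact theorem statement or constants.

```latex
% Schematic form of the complexity gap claim: an explanation E that is much simpler than
% the model M (in Kolmogorov complexity K) cannot agree with M on every input.
\[
K(E) \;\le\; K(M) - c
\quad\Longrightarrow\quad
\exists\, x \in \mathcal{X} : \; E(x) \neq M(x),
\]
% for a sufficiently large gap c: if E matched M on all of X, E would itself serve as a
% short description of M's input-output behavior, contradicting the assumed gap.
```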
[546] LiteCUA: Computer as MCP Server for Computer-Use Agent on AIOS
Kai Mei, Xi Zhu, Hang Gao, Shuhang Lin, Yongfeng Zhang
Main category: cs.AI
TL;DR: AIOS 1.0 is a platform that addresses the semantic disconnect between language models and computer interfaces by creating contextual environments that models can natively understand, enabling more effective computer-use agents.
Details
Motivation: Existing approaches focus on building more powerful agent frameworks or enhancing agent models, but fail to address the fundamental semantic disconnect between how language models understand the world and how computer interfaces are structured.Method: AIOS 1.0 implements a Model Context Protocol (MCP) server architecture to abstract computer states and actions, transforming computers into contextual environments that language models can natively comprehend. This decouples interface complexity from decision complexity.
Result: LiteCUA, a lightweight computer-use agent built on AIOS 1.0, achieves a 14.66% success rate on the OSWorld benchmark, outperforming several specialized agent frameworks despite its simple architecture.
Conclusion: Contextualizing computer environments for language models represents a promising direction for developing more capable computer-use agents and advancing toward AI that can interact effectively with digital systems.
Abstract: We present AIOS 1.0, a novel platform designed to advance computer-use agent (CUA) capabilities through environmental contextualization. While existing approaches primarily focus on building more powerful agent frameworks or enhancing agent models, we identify a fundamental limitation: the semantic disconnect between how language models understand the world and how computer interfaces are structured. AIOS 1.0 addresses this challenge by transforming computers into contextual environments that language models can natively comprehend, implementing a Model Context Protocol (MCP) server architecture to abstract computer states and actions. This approach effectively decouples interface complexity from decision complexity, enabling agents to reason more effectively about computing environments. To demonstrate our platform’s effectiveness, we introduce LiteCUA, a lightweight computer-use agent built on AIOS 1.0 that achieves a 14.66% success rate on the OSWorld benchmark, outperforming several specialized agent frameworks despite its simple architecture. Our results suggest that contextualizing computer environments for language models represents a promising direction for developing more capable computer-use agents and advancing toward AI that can interact with digital systems.
[547] CoP: Agentic Red-teaming for Large Language Models using Composition of Principles
Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho
Main category: cs.AI
TL;DR: The paper proposes an agentic workflow using Composition-of-Principles (CoP) framework to automate red-teaming of LLMs, achieving up to 19x improvement in jailbreak attack success rates.
Details
Motivation: Jailbreak attacks that bypass safety alignment in LLMs are an urgent concern, and existing red-teaming methods need automation and scalability to proactively identify risks before AI deployment.Method: Uses Composition-of-Principles framework where human users provide red-teaming principles to an AI agent, which automatically orchestrates strategies and generates jailbreak prompts in a unified, extensible workflow.
Result: CoP framework revealed unprecedented safety risks by discovering novel jailbreak prompts and improved the best-known single-turn attack success rate by up to 19.0 times against leading LLMs.
Conclusion: The agentic CoP framework provides an effective, automated approach to scale red-teaming processes and uncover critical safety vulnerabilities in LLMs that traditional methods miss.
Abstract: Recent advances in Large Language Models (LLMs) have spurred transformative applications in various domains, ranging from open-source to proprietary LLMs. However, jailbreak attacks, which aim to break safety alignment and user compliance by tricking the target LLMs into answering harmful and risky responses, are becoming an urgent concern. The practice of red-teaming for LLMs is to proactively explore potential risks and error-prone instances before the release of frontier AI technology. This paper proposes an agentic workflow to automate and scale the red-teaming process of LLMs through the Composition-of-Principles (CoP) framework, where human users provide a set of red-teaming principles as instructions to an AI agent to automatically orchestrate effective red-teaming strategies and generate jailbreak prompts. Distinct from existing red-teaming methods, our CoP framework provides a unified and extensible framework to encompass and orchestrate human-provided red-teaming principles to enable the automated discovery of new red-teaming strategies. When tested against leading LLMs, CoP reveals unprecedented safety risks by finding novel jailbreak prompts and improving the best-known single-turn attack success rate by up to 19.0 times.
[548] Solving Inequality Proofs with Large Language Models
Jiayi Sheng, Luna Lyu, Jikai Jin, Tony Xia, Alex Gu, James Zou, Pan Lu
Main category: cs.AI
TL;DR: The paper introduces IneqMath, a dataset for evaluating LLMs on inequality proving tasks, revealing that even top models achieve less than 10% accuracy under step-wise scrutiny despite higher final answer accuracy.
Details
Motivation: Inequality proving tests advanced reasoning skills but progress is hampered by scarce, synthetic datasets. The authors aim to create a more realistic evaluation framework for LLMs in this domain.Method: Proposed an informal but verifiable task formulation with two subtasks (bound estimation and relation prediction), created IneqMath dataset with expert-curated inequalities, and developed an LLM-as-judge evaluation framework with step-wise scrutiny.
Result: Evaluation of 29 leading LLMs showed less than 10% overall accuracy under step-wise scrutiny, with up to 65.5% drop from final answer equivalence accuracy. Scaling model size and test-time computation yielded limited gains.
Conclusion: There’s a critical gap between finding answers and constructing rigorous proofs for current LLMs. Promising directions include theorem-guided reasoning and self-refinement.
Abstract: Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement. Code and data are available at https://ineqmath.github.io/.
[549] Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
Yuhao Zhou, Yiheng Wang, Xuming He, Ao Shen, Ruoyao Xiao, Zhiwei Li, Qiantai Feng, Zijie Guo, Yuejin Yang, Hao Wu, Wenxuan Huang, Jiaqi Wei, Dan Si, Xiuqi Yao, Jia Bu, Haiwen Huang, Manning Wang, Tianfan Fu, Shixiang Tang, Ben Fei, Dongzhan Zhou, Fenghua Ling, Yan Lu, Siqi Sun, Chenhui Li, Guanjie Zheng, Jiancheng Lv, Wenlong Zhang, Lei Bai
Main category: cs.AI
TL;DR: The paper introduces Scientists’ First Exam (SFE), a benchmark to evaluate scientific multimodal reasoning in MLLMs across perception, understanding, and reasoning levels, revealing current models’ limitations.
Details
Motivation: Current scientific benchmarks focus mainly on knowledge understanding, inadequately assessing MLLMs' perception and reasoning abilities needed for complex scientific discovery workflows.Method: Developed SFE benchmark with 830 expert-verified VQA pairs across three question types (signal perception, attribute understanding, comparative reasoning) spanning 66 multimodal tasks in five scientific disciplines.
Result: State-of-the-art models GPT-o3 and InternVL-3 achieved only 34.08% and 26.52% respectively on SFE, showing significant gaps in scientific cognitive capabilities.
Conclusion: SFE reveals substantial room for improvement in MLLMs’ scientific reasoning abilities, and the benchmark aims to facilitate AI-enhanced scientific discovery advancements.
Abstract: Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists’ First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.
[550] AnyMAC: Cascading Flexible Multi-Agent Collaboration via Next-Agent Prediction
Song Wang, Zhen Tan, Zihan Chen, Shuang Zhou, Tianlong Chen, Jundong Li
Main category: cs.AI
TL;DR: A new framework for multi-agent collaboration using sequential communication structure instead of static graphs, enabling adaptive agent selection and context access to reduce communication overhead.
Details
Motivation: Existing multi-agent methods rely on static or graph-based communication topologies, lacking adaptability and flexibility in inter-agent communication.Method: Proposes sequential communication with two components: Next-Agent Prediction for selecting suitable agents at each step, and Next-Context Selection for agents to access relevant previous information.
Result: Achieves superior performance across multiple benchmarks while substantially reducing communication overhead compared to existing methods.
Conclusion: Sequential communication structure provides larger topology space and enables task-adaptive communication pipelines with role flexibility and global information flow.
Abstract: Recent progress in large language model (LLM)-based multi-agent collaboration highlights the power of structured communication in enabling collective intelligence. However, existing methods largely rely on static or graph-based inter-agent topologies, lacking the potential adaptability and flexibility in communication. In this work, we propose a new framework that rethinks multi-agent coordination through a sequential structure rather than a graph structure, offering a significantly larger topology space for multi-agent communication. Our method focuses on two key directions: (1) Next-Agent Prediction, which selects the most suitable agent role at each step, and (2) Next-Context Selection (NCS), which enables each agent to selectively access relevant information from any previous step. Together, these components construct task-adaptive communication pipelines that support both role flexibility and global information flow. Extensive evaluations across multiple benchmarks demonstrate that our approach achieves superior performance while substantially reducing communication overhead.
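A minimal sketch of the sequential pipeline described above, combining next-agent prediction with next-context selection. All callables are hypothetical stand-ins: `predict_next_agent` scores candidate roles, `select_context` picks relevant prior outputs, and `agents[name]` runs an LLM in that role.

```python
# Illustrative sequential multi-agent loop in the spirit of AnyMAC (not the authors' code).
def run_sequential_pipeline(task, agents, predict_next_agent, select_context, max_steps=8):
    history = []                                  # list of (agent_name, output) tuples
    for step in range(max_steps):
        name = predict_next_agent(task, history, candidates=list(agents))
        if name is None:                          # predictor signals termination
            break
        context = select_context(task, history)  # subset of earlier outputs, not a fixed graph
        output = agents[name](task=task, context=context)
        history.append((name, output))
    return history[-1][1] if history else None
```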
[551] Limits of Safe AI Deployment: Differentiating Oversight and Control
David Manheim, Aidan Homewood
Main category: cs.AI
TL;DR: The paper clarifies the concepts of oversight and control in AI supervision, proposing a framework to align regulatory expectations with technical feasibility and outlining a maturity model for AI supervision.
Details
Motivation: To address the vague and inconsistent interpretations of 'human oversight' in AI governance, particularly as used in regulatory texts like the EU AI Act, which could undermine efforts to design or evaluate systems under meaningful human supervision.Method: Conducted a targeted critical review of supervision literature outside AI, differentiated control (ex-ante, real-time, operational) from oversight (ex-post, policy/governance), and proposed a framework for aligning regulatory expectations with technical feasibility.
Result: Developed a framework for meaningful supervision, outlined documentation and integration methods for supervision in risk management, and created a maturity model for AI supervision based on Microsoft’s Responsible AI Maturity Model.
Conclusion: The paper provides clarity on supervision mechanisms, highlights their boundaries and limitations, and supports regulators, auditors, and practitioners in identifying when meaningful supervision is possible and where existing methods fall short.
Abstract: Oversight and control, which we collectively call supervision, are often discussed as ways to ensure that AI systems are accountable, reliable, and able to fulfill governance and management requirements. However, the requirements for “human oversight” risk codifying vague or inconsistent interpretations of key concepts like oversight and control. This ambiguous terminology could undermine efforts to design or evaluate systems that must operate under meaningful human supervision. This matters because the term is used by regulatory texts such as the EU AI Act. This paper undertakes a targeted critical review of literature on supervision outside of AI, along with a brief summary of past work on the topic related to AI. We next differentiate control as ex-ante or real-time and operational rather than policy or governance, and oversight as performed ex-post, or a policy and governance function. Control aims to prevent failures, while oversight focuses on detection, remediation, or incentives for future prevention. Building on this, we make three contributions. 1) We propose a framework to align regulatory expectations with what is technically and organizationally plausible, articulating the conditions under which each mechanism is possible, where they fall short, and what is required to make them meaningful in practice. 2) We outline how supervision methods should be documented and integrated into risk management, and drawing on the Microsoft Responsible AI Maturity Model, we outline a maturity model for AI supervision. 3) We explicitly highlight boundaries of these mechanisms, including where they apply, where they fail, and where it is clear that no existing methods suffice. This foregrounds the question of whether meaningful supervision is possible in a given deployment context, and can support regulators, auditors, and practitioners in identifying both present and future limitations.
[552] How to Train Your LLM Web Agent: A Statistical Diagnosis
Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, Massimo Caccia
Main category: cs.AI
TL;DR: This paper presents a compute-efficient two-stage training approach for LLM-based web agents that combines supervised fine-tuning with on-policy reinforcement learning, achieving better performance with less compute than single-stage methods.
Details
Motivation: The motivation is to address the widening gap between closed-source and open-source LLM web agents, which is held back by narrow focus on single-step tasks and high compute costs for post-training.Method: A two-stage pipeline: first training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning, followed by on-policy reinforcement learning. The approach uses bootstrapping on 1,370 configurations to estimate effective hyperparameters.
Result: Combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++ benchmarks, requiring only 55% of the compute to match pure SFT’s peak performance on MiniWob++.
Conclusion: This strategy effectively pushes the compute-performance Pareto frontier and is the only approach that can close the gap with closed-source models, providing a compute-efficient solution for training LLM web agents.
Abstract: LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
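The statistical part, bootstrapping over sampled configurations, can be sketched as below; this is illustrative only, and the scoring fields are made up rather than taken from the 1,370-configuration study. The idea is to resample runs with replacement and count how often each configuration comes out on top instead of trusting a single noisy sweep.

```python
# Bootstrap estimate of which hyperparameter configuration is most likely best (sketch).
import random
from collections import Counter

def bootstrap_best_config(runs, n_boot=10_000, seed=0):
    """runs: list of (config_id, score) pairs from sampled training runs."""
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(n_boot):
        sample = [runs[rng.randrange(len(runs))] for _ in range(len(runs))]
        best_per_config = {}
        for cfg, score in sample:
            best_per_config[cfg] = max(score, best_per_config.get(cfg, float("-inf")))
        wins[max(best_per_config, key=best_per_config.get)] += 1
    return {cfg: n / n_boot for cfg, n in wins.items()}   # empirical P(config is best)
```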
[553] Agentic Large Language Models for Conceptual Systems Engineering and Design
Soheyl Massoudi, Mark Fuge
Main category: cs.AI
TL;DR: A structured multi-agent system (MAS) with 9 roles was compared to a simpler two-agent system (2AS) for engineering design tasks. Both maintained perfect JSON integrity but had low requirement coverage (<20%). Code compatibility was better for 2AS (up to 100%) than MAS (<50%). The reasoning-distilled DeepSeek R1 model improved completion rates and MAS produced more granular design graphs.
Details
Motivation: Early-stage engineering design requires complex iterative reasoning, but existing LLM workflows struggle with task continuity and executable model generation. The research aims to evaluate whether structured multi-agent systems can better manage requirements extraction, functional decomposition, and simulator code generation.Method: Used a Design-State Graph (DSG) representation bundling requirements, physical embodiments, and Python physics models. Compared a nine-role MAS that iteratively builds DSG against a two-agent Generator-Reflector loop. Conducted 60 experiments using Llama 3.3 70B and reasoning-distilled DeepSeek R1 70B across different configurations, temperatures, and seeds.
Result: Both systems maintained perfect JSON integrity and embodiment tagging. Requirement coverage remained minimal (<20%). Code compatibility peaked at 100% for 2AS but averaged below 50% for MAS. Only reasoning-distilled model reliably flagged workflow completion. MAS generated more granular DSGs (5-6 nodes) while 2AS mode-collapsed.
Conclusion: Structured multi-agent orchestration enhanced design detail, and reasoning-distilled LLM improved completion rates. However, low requirements coverage and fidelity gaps in coding persisted as challenges.
Abstract: Early-stage engineering design involves complex, iterative reasoning, yet existing large language model (LLM) workflows struggle to maintain task continuity and generate executable models. We evaluate whether a structured multi-agent system (MAS) can more effectively manage requirements extraction, functional decomposition, and simulator code generation than a simpler two-agent system (2AS). The target application is a solar-powered water filtration system as described in a cahier des charges. We introduce the Design-State Graph (DSG), a JSON-serializable representation that bundles requirements, physical embodiments, and Python-based physics models into graph nodes. A nine-role MAS iteratively builds and refines the DSG, while the 2AS collapses the process to a Generator-Reflector loop. Both systems run a total of 60 experiments (2 LLMs, Llama 3.3 70B vs reasoning-distilled DeepSeek R1 70B, x 2 agent configurations x 3 temperatures x 5 seeds). We report JSON validity, requirement coverage, embodiment presence, code compatibility, workflow completion, runtime, and graph size. Across all runs, both MAS and 2AS maintained perfect JSON integrity and embodiment tagging. Requirement coverage remained minimal (less than 20%). Code compatibility peaked at 100% under specific 2AS settings but averaged below 50% for MAS. Only the reasoning-distilled model reliably flagged workflow completion. Powered by DeepSeek R1 70B, the MAS generated more granular DSGs (average 5-6 nodes) whereas the 2AS mode-collapsed. Structured multi-agent orchestration enhanced design detail, and the reasoning-distilled LLM improved completion rates, yet low requirements coverage and fidelity gaps in coding persisted.
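A guess at what one JSON-serializable DSG node might look like, bundling requirements, an embodiment tag, and a Python physics model; the field names and example values below are our assumptions, not the paper's schema.

```python
# Hypothetical shape of a Design-State Graph (DSG) node as described in the abstract.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class DSGNode:
    node_id: str
    requirements: list[str] = field(default_factory=list)  # extracted from the cahier des charges
    embodiment: str = ""                                    # e.g. "photovoltaic panel + pump"
    physics_model: str = ""                                 # Python source implementing the model
    children: list[str] = field(default_factory=list)       # ids of refined sub-nodes

node = DSGNode(
    node_id="filtration-unit",
    requirements=["deliver >= 10 L/h of filtered water", "run on solar power only"],
    embodiment="membrane filter driven by a DC pump",
    physics_model="def flow_rate(pressure_pa, resistance): return pressure_pa / resistance",
)
print(json.dumps(asdict(node), indent=2))  # JSON validity is one of the reported metrics
```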
[554] Pareto-NRPA: A Novel Monte-Carlo Search Algorithm for Multi-Objective Optimization
Noé Lallouet, Tristan Cazenave, Cyrille Enderli
Main category: cs.AI
TL;DR: Pareto-NRPA extends the Nested Rollout Policy Adaptation algorithm to multi-objective optimization, using multiple policies to explore solution spaces and maintain non-dominated fronts, achieving competitive performance on TSPTW and neural architecture search problems.
Details
Motivation: To adapt the successful single-objective NRPA algorithm to multi-objective optimization problems, addressing the need for efficient algorithms that can handle multiple objectives in discrete search spaces.Method: Extends NRPA with multiple policies exploring different solution space regions, maintains non-dominated fronts at each search level, and adapts policies based on diversity and isolation within Pareto fronts.
Result: Achieves competitive performance against state-of-the-art multi-objective algorithms in convergence and diversity, strongly outperforms evolutionary algorithms on constrained search spaces.
Conclusion: Pareto-NRPA successfully adapts NRPA to multi-objective optimization, demonstrating effectiveness on benchmark problems and representing the first multi-objective extension of NRPA.
Abstract: We introduce Pareto-NRPA, a new Monte-Carlo algorithm designed for multi-objective optimization problems over discrete search spaces. Extending the Nested Rollout Policy Adaptation (NRPA) algorithm originally formulated for single-objective problems, Pareto-NRPA generalizes the nested search and policy update mechanism to multi-objective optimization. The algorithm uses a set of policies to concurrently explore different regions of the solution space and maintains non-dominated fronts at each level of search. Policy adaptation is performed with respect to the diversity and isolation of sequences within the Pareto front. We benchmark Pareto-NRPA on two classes of problems: a novel bi-objective variant of the Traveling Salesman Problem with Time Windows (MO-TSPTW), and a neural architecture search task on well-known benchmarks. Results demonstrate that Pareto-NRPA achieves competitive performance against state-of-the-art multi-objective algorithms, both in terms of convergence and diversity of solutions. In particular, Pareto-NRPA strongly outperforms state-of-the-art evolutionary multi-objective algorithms on constrained search spaces. To our knowledge, this work constitutes the first adaptation of NRPA to the multi-objective setting.
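The non-dominated front maintained at each search level rests on the usual Pareto dominance test. A minimal sketch of that building block follows (minimization convention assumed); it is not the full Pareto-NRPA policy-adaptation loop.

```python
# Pareto dominance and front maintenance (illustrative, minimization convention).
def dominates(a, b):
    """True if objective vector a is at least as good as b everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def update_front(front, candidate):
    """front: list of (objectives, sequence); insert candidate only if non-dominated."""
    objs, _ = candidate
    if any(dominates(f_objs, objs) for f_objs, _ in front):
        return front                                   # dominated: discard candidate
    kept = [(f_objs, seq) for f_objs, seq in front if not dominates(objs, f_objs)]
    return kept + [candidate]

front = []
for objs, seq in [((10.0, 3.0), "A"), ((8.0, 5.0), "B"), ((9.0, 6.0), "C"), ((12.0, 2.0), "D")]:
    front = update_front(front, (objs, seq))
print(front)  # A, B, and D remain on the front; C is dominated by B
```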
[555] Mantis: A Simulation-Grounded Foundation Model for Disease Forecasting
Carson Dudley, Reiden Magdaleno, Christopher Harding, Ananya Sharma, Emily Martin, Marisa Eisenberg
Main category: cs.AI
TL;DR: Mantis is a foundation model trained on mechanistic simulations that enables accurate disease forecasting across various diseases and settings without requiring real-world data or disease-specific tuning.
Details
Motivation: Traditional infectious disease forecasting requires disease-specific data, expert tuning, and bespoke training, which limits effectiveness in novel outbreaks or low-resource settings.Method: Mantis is a foundation model trained entirely on mechanistic simulations, enabling out-of-the-box forecasting across diseases, regions, and outcomes without using real-world data during training.
Result: Mantis achieved lower mean absolute error than all models in CDC’s COVID-19 Forecast Hub on early pandemic forecasts and consistently ranked in top two models across all diseases tested, including respiratory, vector-borne, and waterborne pathogens.
Conclusion: Mantis captures fundamental contagion dynamics rather than memorizing disease-specific patterns, making it a practical foundation for general-purpose, accurate disease forecasting deployable where traditional models fail.
Abstract: Infectious disease forecasting in novel outbreaks or low-resource settings is hampered by the need for disease-specific data, bespoke training, and expert tuning. We introduce Mantis, a foundation model trained entirely on mechanistic simulations, which enables out-of-the-box forecasting across diseases, regions, and outcomes, even in settings with limited historical data. We evaluated Mantis against 48 forecasting models across six diseases with diverse transmission modes, assessing both point forecast accuracy (mean absolute error) and probabilistic performance (weighted interval score and coverage). Despite using no real-world data during training, Mantis achieved lower mean absolute error than all models in the CDC’s COVID-19 Forecast Hub when backtested on early pandemic forecasts. Across all other diseases tested, including respiratory, vector-borne, and waterborne pathogens, Mantis consistently ranked in the top two models across all evaluation metrics. Notably, Mantis generalized to diseases with transmission mechanisms not represented in its training data, demonstrating that it captures fundamental contagion dynamics rather than memorizing disease-specific patterns. These capabilities position Mantis as a practical foundation for disease forecasting: general-purpose, accurate, and deployable where traditional models fail.
[556] CausalARC: Abstract Reasoning with Causal World Models
Jacqueline Maasch, John Kalantari, Kia Khezeli
Main category: cs.AI
TL;DR: CausalARC is a testbed for evaluating AI reasoning in low-data and out-of-distribution scenarios, using causal world models to generate tasks with observational, interventional, and counterfactual feedback.
Details
Motivation: To address the challenge of AI reasoning adaptation to novel problems under limited data and distribution shift, which requires robust few-shot learning capabilities.Method: Created CausalARC testbed with tasks sampled from structural causal models, providing principled data augmentations including observational, interventional, and counterfactual feedback for few-shot in-context learning.
Result: Evaluated language models across four settings: abstract reasoning with test-time training, counterfactual reasoning, program synthesis, and causal discovery. Performance varied significantly across tasks and models.
Conclusion: There is substantial room for improvement in language model reasoning capabilities, particularly in handling causal reasoning tasks under data constraints and distribution shifts.
Abstract: On-the-fly reasoning often requires adaptation to novel problems under limited data and distribution shift. This work introduces CausalARC: an experimental testbed for AI reasoning in low-data and out-of-distribution regimes, modeled after the Abstraction and Reasoning Corpus (ARC). Each CausalARC reasoning task is sampled from a fully specified causal world model, formally expressed as a structural causal model. Principled data augmentations provide observational, interventional, and counterfactual feedback about the world model in the form of few-shot, in-context learning demonstrations. As a proof-of-concept, we illustrate the use of CausalARC for four language model evaluation settings: (1) abstract reasoning with test-time training, (2) counterfactual reasoning with in-context learning, (3) program synthesis, and (4) causal discovery with logical reasoning. Within- and between-model performance varied heavily across tasks, indicating room for significant improvement in language model reasoning.
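To make the three feedback types concrete, here is a toy structural causal model with an observational sample, an interventional sample under do(X = 1), and a counterfactual that reuses the same exogenous noise. The variables and mechanisms are illustrative, not drawn from CausalARC tasks.

```python
# Toy SCM illustrating observational, interventional, and counterfactual feedback (made up).
import random

def sample_scm(do_x=None, noise=None):
    """Z -> X -> Y with exogenous noise; `do_x` overrides the mechanism for X (an intervention)."""
    rng = noise if noise is not None else {"z": random.random(), "y": random.random()}
    z = 1 if rng["z"] < 0.5 else 0
    x = do_x if do_x is not None else z ^ 1          # mechanism: X = NOT Z
    y = (x + z + (1 if rng["y"] < 0.1 else 0)) % 2   # mechanism: Y = X XOR Z, with a rare noise flip
    return {"Z": z, "X": x, "Y": y}, rng

observation, noise = sample_scm()                    # observational sample
intervention, _ = sample_scm(do_x=1)                 # interventional: do(X = 1), fresh noise
counterfactual, _ = sample_scm(do_x=1, noise=noise)  # counterfactual: same noise, X forced to 1
print(observation, intervention, counterfactual)
```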
[557] A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models
Ching Chang, Yidan Shi, Defu Cao, Wei Yang, Jeehyun Hwang, Haixin Wang, Jiacheng Pang, Wei Wang, Yan Liu, Wen-Chih Peng, Tien-Fu Chen
Main category: cs.AI
TL;DR: This survey defines time series reasoning as treating time as a first-class axis and organizes literature by reasoning topology (direct, linear chain, branch-structured) and objectives (analysis, explanation, causal inference, generation), with evaluation practices and future directions.
Details
Motivation: To establish time series reasoning as a distinct field that incorporates intermediate evidence directly into answers, moving beyond narrow accuracy toward reliable systems that understand, explain, and act on dynamic worlds.Method: Organizes literature by reasoning topology (direct, linear chain, branch-structured) crossed with objectives (traditional analysis, explanation, causal inference, generation), using a compact tag set spanning decomposition, verification, tool use, and other aspects.
Result: Provides a comprehensive survey showing what each reasoning topology enables and where it breaks down, with curated datasets, benchmarks, and resources for study and deployment.
Conclusion: Future progress depends on benchmarks tying reasoning quality to utility and closed-loop testbeds that balance cost and risk, marking a shift toward reliability at scale with traceable evidence and credible outcomes.
Abstract: Time series reasoning treats time as a first-class axis and incorporates intermediate evidence directly into the answer. This survey defines the problem and organizes the literature by reasoning topology with three families: direct reasoning in one step, linear chain reasoning with explicit intermediates, and branch-structured reasoning that explores, revises, and aggregates. The topology is crossed with the main objectives of the field, including traditional time series analysis, explanation and understanding, causal inference and decision making, and time series generation, while a compact tag set spans these axes and captures decomposition and verification, ensembling, tool use, knowledge access, multimodality, agent loops, and LLM alignment regimes. Methods and systems are reviewed across domains, showing what each topology enables and where it breaks down in faithfulness or robustness, along with curated datasets, benchmarks, and resources that support study and deployment (https://github.com/blacksnail789521/Time-Series-Reasoning-Survey). Evaluation practices that keep evidence visible and temporally aligned are highlighted, and guidance is distilled on matching topology to uncertainty, grounding with observable artifacts, planning for shift and streaming, and treating cost and latency as design budgets. We emphasize that reasoning structures must balance capacity for grounding and self-correction against computational cost and reproducibility, while future progress will likely depend on benchmarks that tie reasoning quality to utility and on closed-loop testbeds that trade off cost and risk under shift-aware, streaming, and long-horizon settings. Taken together, these directions mark a shift from narrow accuracy toward reliability at scale, enabling systems that not only analyze but also understand, explain, and act on dynamic worlds with traceable evidence and credible outcomes.
[558] Neuromorphic Intelligence
Marcel van Gerven
Main category: cs.AI
TL;DR: Neuromorphic computing aims to replicate brain efficiency using dynamical systems theory as a unifying framework, enabling energy-efficient AI through physical substrate dynamics.
Details
Motivation: To overcome limitations of conventional digital computing (Von Neumann bottleneck, high energy consumption) by creating brain-inspired systems that are sustainable, transparent, and accessible.Method: Proposes dynamical systems theory as a unifying framework, using differential calculus for modeling inference, learning, and control. Utilizes noise as a learning resource and differential genetic programming for discovering adaptive behaviors.
Result: The framework enables emergent neuromorphic intelligence where intelligent behavior arises from physical substrate dynamics, advancing both AI science and sustainability.
Conclusion: Dynamical systems theory provides the foundational framework needed to bridge diverse disciplines in neuromorphic computing, paving the way for sustainable and efficient AI systems that mimic brain-like intelligence.
Abstract: Neuromorphic computing seeks to replicate the remarkable efficiency, flexibility, and adaptability of the human brain in artificial systems. Unlike conventional digital approaches, which suffer from the Von Neumann bottleneck and depend on massive computational and energy resources, neuromorphic systems exploit brain-inspired principles of computation to achieve orders of magnitude greater energy efficiency. By drawing on insights from a wide range of disciplines – including artificial intelligence, physics, chemistry, biology, neuroscience, cognitive science and materials science – neuromorphic computing promises to deliver intelligent systems that are sustainable, transparent, and widely accessible. A central challenge, however, is to identify a unifying theoretical framework capable of bridging these diverse disciplines. We argue that dynamical systems theory provides such a foundation. Rooted in differential calculus, it offers a principled language for modeling inference, learning, and control in both natural and artificial substrates. Within this framework, noise can be harnessed as a resource for learning, while differential genetic programming enables the discovery of dynamical systems that implement adaptive behaviors. Embracing this perspective paves the way toward emergent neuromorphic intelligence, where intelligent behavior arises from the dynamics of physical substrates, advancing both the science and sustainability of AI.
[559] FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs
Debarpan Bhattacharya, Apoorva Kulkarni, Sriram Ganapathy
Main category: cs.AI
TL;DR: FESTA is a black-box uncertainty quantification method for multimodal LLMs that uses functionally equivalent sampling to assess prediction trustworthiness without requiring ground truth.
Details
Motivation: Accurate trust assessment of MLLM predictions is challenging due to diverse multimodal inputs, but needed for selective prediction and user confidence.Method: Uses functionally equivalent sampling to generate uncertainty measures through task-preserving input sampling that probes model consistency (equivalent samples) and sensitivity (complementary samples).
Result: Achieves 33.3% relative improvement for vision-LLMs and 29.6% for audio-LLMs in selective prediction performance (AUROC) for detecting mispredictions.
Conclusion: FESTA provides effective uncertainty estimation for multimodal LLMs using only input-output access, enabling better trust assessment and selective prediction.
Abstract: Accurate trust assessment of predictions generated by multimodal large language models (MLLMs), which can enable selective prediction and improve user confidence, is challenging due to the diverse multimodal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a multimodal input sampling technique for MLLMs that generates an uncertainty measure based on equivalent and complementary input samplings. The proposed task-preserving sampling approach for uncertainty quantification expands the input space to probe the consistency (through equivalent samples) and sensitivity (through complementary samples) of the model. FESTA uses only input-output access of the model (black-box), and does not require ground truth (unsupervised). The experiments are conducted with various off-the-shelf multimodal LLMs, on both visual and audio reasoning tasks. The proposed FESTA uncertainty estimate achieves significant improvement (33.3% relative improvement for vision-LLMs and 29.6% relative improvement for audio-LLMs) in selective prediction performance, based on the area-under-the-receiver-operating-characteristic curve (AUROC) metric for detecting mispredictions. The code implementation is open-sourced.
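A minimal reading of the sampling idea (not the open-sourced implementation): consistency over task-preserving "equivalent" variants and sensitivity over task-altering "complementary" variants are combined into a single uncertainty score. `ask` and the variant generators below are hypothetical stand-ins.

```python
# Illustrative FESTA-style uncertainty score (our simplification, not the released code).
def festa_uncertainty(ask, sample, equivalent_variants, complementary_variants):
    base = ask(sample)
    eq_answers = [ask(v) for v in equivalent_variants(sample)]
    comp_answers = [ask(v) for v in complementary_variants(sample)]
    # Consistency: equivalent inputs should keep the answer.
    consistency = sum(a == base for a in eq_answers) / max(len(eq_answers), 1)
    # Sensitivity: complementary inputs should change it.
    sensitivity = sum(a != base for a in comp_answers) / max(len(comp_answers), 1)
    # Low consistency or low sensitivity -> higher uncertainty about the prediction.
    return 1.0 - consistency * sensitivity
```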
[560] Combinatorial Creativity: A New Frontier in Generalization Abilities
Samuel Schapiro, Sumuk Shashidhar, Alexi Gladstone, Jonah Black, Royce Moon, Dilek Hakkani-Tur, Lav R. Varshney
Main category: cs.AI
TL;DR: This paper proposes a framework for evaluating combinatorial creativity in LLMs, focusing on novelty and utility rather than accuracy. It reveals scaling patterns, optimal model architectures for creativity, and a persistent novelty-utility tradeoff that limits LLMs’ creative potential.
Details
Motivation: Existing frameworks don't address how LLMs generalize for creative tasks like scientific idea generation. There's a need to evaluate combinatorial creativity as an open-ended ability rather than against fixed targets.Method: Proposed theoretical framework and algorithmic task for evaluating outputs by novelty and utility degrees. Conducted empirical analysis of scaling behavior, model architecture effects, and identified the ideation-execution gap.
Result: Found optimal model depths and widths for creativity within fixed compute budgets. Discovered a fundamental novelty-utility tradeoff that explains the ideation-execution gap, where LLMs generate novel ideas but struggle with practical feasibility.
Conclusion: The persistent novelty-utility tradeoff casts doubt on LLMs’ long-term creative potential in current form. The framework provides foundation for understanding and improving AI creativity, bridging human-machine intelligence gaps.
Abstract: Artificial intelligence (AI) systems, and Large Language Models (LLMs) in particular, are increasingly employed for creative tasks like scientific idea generation, constituting a form of generalization from training data unaddressed by existing conceptual frameworks. Despite its similarities to compositional generalization (CG), combinatorial creativity (CC) is an open-ended ability. Instead of evaluating for accuracy or correctness against fixed targets, which would contradict the open-ended nature of CC, we propose a theoretical framework and algorithmic task for evaluating outputs by their degrees of novelty and utility. From here, we make several important empirical contributions: (1) We obtain the first insights into the scaling behavior of creativity for LLMs. (2) We discover that, for fixed compute budgets, there exist optimal model depths and widths for creative ability. (3) We find that the ideation-execution gap, whereby LLMs excel at generating novel scientific ideas but struggle to ensure their practical feasibility, may be explained by a more fundamental novelty-utility tradeoff characteristic of creativity algorithms in general. Importantly, this tradeoff remains persistent even at scale, casting doubt on the long-term creative potential of LLMs in their current form. Together, our conceptual framework and empirical findings provide a foundation for understanding and improving creativity in modern AI models, bridging the gap between human and machine intelligence.
[561] Mapping Overlaps in Benchmarks through Perplexity in the Wild
Siyang Wu, Honglin Bao, Sida Li, Ari Holtzman, James A. Evans
Main category: cs.AI
TL;DR: The paper introduces benchmark signatures to characterize LLM benchmarks and their overlaps using token perplexity from natural corpora, revealing insights into benchmark validity and LLM capability interconnections.
Details
Motivation: To better understand LLM benchmark overlaps and validity by moving beyond performance correlations and semantic similarities, which are influenced by format biases and don't capture true capacity requirements.Method: Extract benchmark signatures via stepwise forward selection with linear regressions across 32 LLMs and 88 benchmarks, using token perplexity from naturally authored corpora as predictive features.
Result: Found high performance overlaps but limited semantic overlaps; identified cross-functional overlaps across logic, math, language, instruction following, and world modeling, with coding as the least overlapping domain; benchmark signatures remained robust to format biases.
Conclusion: Benchmark signatures provide mechanistic insights into benchmark validity and LLM sensitivities, revealing the underlying landscape of interconnected LLM capabilities while being robust to format effects that confound performance-based analyses.
Abstract: We develop signatures of capacity familiarity to characterize large language model (LLM) benchmarks and their meaningful overlaps. Benchmark signatures probe the capacity required for benchmark performance. We formally define them as a set of salient tokens drawn from in-the-wild, naturally authored corpora, where LLM token perplexity, reflecting more or less pre-training exposure, becomes highly predictive of LLM benchmark performance. Through a large-scale meta-evaluation, we extract benchmark signatures via stepwise forward selection with linear regressions across 32 LLMs and 88 benchmarks spanning diverse knowledge, coding, logic, instruction following, math, language, reasoning, and world modeling. Our analysis situates signatures in relation to both the semantic similarity of benchmark questions and the correlation of model performance. While performance overlaps are universally high and semantic overlaps remain confined to a narrow mid-range, benchmark signatures prove highly informative in capturing variation, overlap, and divergence. We observe overlap in knowledge and reasoning subtasks, whereas multilingual and cultural benchmarks exhibit less similarity, even compared to cross-task overlap. Notably, performance-level results are strongly influenced by benchmark-orthogonal factors such as question format, highlighting limitations in LLM generalization, the conflation of performance with ability, and issues inherent in current mainstream benchmark agreement studies. Benchmark signatures, however, remain robust to such effects. Ultimately, we identify cross-functional overlaps across logic, math, language, instruction following, and world modeling, with coding emerging as the least overlapping domain. Together, these findings provide mechanistic insights into benchmark validity and LLM sensitivities, and sketch the underlying landscape of interconnected LLM capabilities.
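The extraction recipe named above, stepwise forward selection with linear regressions, can be sketched roughly as below; the matrices, the number of selected tokens, and the scikit-learn choice are illustrative assumptions rather than the authors' pipeline.

```python
# Sketch of signature extraction: greedily add the candidate token whose perplexity feature
# most improves a linear fit to benchmark performance across models.
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_select_signature(X: np.ndarray, y: np.ndarray, k: int = 5):
    """X[i, j]: model i's perplexity on candidate token j; y[i]: model i's benchmark score."""
    selected: list[int] = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        scores = []
        for j in remaining:
            cols = selected + [j]
            r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
            scores.append((r2, j))
        best_r2, best_j = max(scores)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected  # indices of the "signature" tokens most predictive of performance
```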
[562] Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning
Xinyuan Song, Keyu Wang, PengXiang Li, Lu Yin, Shiwei Liu
Main category: cs.AI
TL;DR: Deep layers in LLMs show varied importance depending on evaluation methods - shallow layers suffice for likelihood metrics, but deeper layers are crucial for reasoning and coherence in generation tasks.
Details
Motivation: To systematically investigate the actual contributions of different layers in LLMs, challenging the common assumption that deeper layers are less important, and examining how depth utilization varies across different evaluation settings.Method: Conducted systematic analysis across diverse dimensions including evaluation protocols (likelihood-based vs generation-based), task categories, and model architectures, examining layer contributions through pruning experiments and distillation.
Result: Found that under likelihood-based metrics, only initial layers are critical and most layers can be pruned, but generation-based evaluation reveals indispensable roles for middle and deeper layers in reasoning and long-range coherence. Knowledge and retrieval are concentrated in shallow layers while reasoning accuracy depends on deeper layers.
Conclusion: Depth usage in LLMs is highly heterogeneous and context-dependent, requiring task-, metric-, and model-aware perspectives for both interpretation and compression of large models.
Abstract: Recent studies suggest that the deeper layers of Large Language Models (LLMs) contribute little to representation learning and can often be removed without significant performance loss. However, such claims are typically drawn from narrow evaluations and may overlook important aspects of model behavior. In this work, we present a systematic study of depth utilization across diverse dimensions, including evaluation protocols, task categories, and model architectures. Our analysis confirms that very deep layers are generally less effective than earlier ones, but their contributions vary substantially with the evaluation setting. Under likelihood-based metrics without generation, pruning most layers preserves performance, with only the initial few being critical. By contrast, generation-based evaluation uncovers indispensable roles for middle and deeper layers in enabling reasoning and maintaining long-range coherence. We further find that knowledge and retrieval are concentrated in shallow components, whereas reasoning accuracy relies heavily on deeper layers – yet can be reshaped through distillation. These results highlight that depth usage in LLMs is highly heterogeneous and context-dependent, underscoring the need for task-, metric-, and model-aware perspectives in both interpreting and compressing large models.
[563] Open Agent Specification (Agent Spec) Technical Report
Yassine Benajiba, Cesare Bernardis, Vladislav Blinov, Paul Cayet, Hassan Chafi, Abderrahim Fathan, Louis Faucon, Damien Hilloulin, Sungpack Hong, Ingo Kossyk, Rhicheek Patra, Sujith Ravi, Jonas Schweizer, Jyotika Singh, Shailender Singh, Xuelin Situ, Weiyi Sun, Kartik Talamadupula, Jerry Xu, Ying Xu
Main category: cs.AI
TL;DR: Open Agent Specification (Agent Spec) is a declarative language for defining AI agents and workflows that enables cross-framework compatibility, portability, and interoperability across different AI frameworks.
Details
Motivation: To resolve the challenges of fragmented agent development by providing a common unified specification that allows AI agents to be designed once and deployed across various frameworks, improving interoperability and reusability while reducing redundant efforts.Method: Agent Spec provides a declarative language for defining AI agents and workflows independently of execution environments. It includes a standardized Evaluation harness to assess agent behavior across different runtimes (LangGraph, CrewAI, AutoGen, WayFlow) using three benchmarks (SimpleQA Verified, τ²-Bench, BIRD-SQL).
Result: The specification enables portability and interoperability across AI frameworks, benefiting agent developers with reusable components, framework developers with interchange format, researchers with reproducible results, and enterprises with faster deployment and scalability.
Conclusion: Agent Spec addresses fragmentation in AI agent development by providing a unified specification that promotes cross-framework compatibility and standardized evaluation, analogous to how HELM standardized LLM evaluation.
Abstract: Open Agent Specification (Agent Spec) is a declarative language for defining AI agents and workflows in a way that is compatible across different AI frameworks, promoting portability and interoperability within AI Agent frameworks. Agent Spec aims to resolve the challenges of fragmented agent development by providing a common unified specification that allows AI agents to be designed once and deployed across various frameworks, improving interoperability and reusability, while reducing redundant efforts. Additionally, Agent Spec facilitates development tools and portability, allowing AI agents to be defined independently of their execution environment and enabling teams to exchange solutions without implementation-specific limitations. Agent Spec benefits four key groups: (i) Agent developers, who gain a superset of reusable components and design patterns, enabling them to leverage a broader range of functionalities; (ii) Agent framework and tool developers, who can use Agent Spec as an interchange format and therefore benefit from cross-framework and tool support; (iii) Researchers, who can achieve reproducible results and comparability, facilitating more reliable and consistent outcomes; (iv) Enterprises, which see faster prototype-to-deployment, increased productivity, and greater scalability and maintainability for their AI agent solutions. This technical report provides an overview of the technical foundations of Agent Spec, including motivation, benefits, and future work. We also introduce a standardized Evaluation harness to assess agent behavior and agentic workflows across runtimes (LangGraph, CrewAI, AutoGen, and WayFlow), using three different benchmarks (SimpleQA Verified, $\tau^2$-Bench and BIRD-SQL) - analogous to how HELM and related harnesses standardized LLM evaluation - so that performance, robustness, and efficiency can be compared consistently across frameworks.
[564] Localist LLMs – A Mathematical Framework for Dynamic Locality Control
Joachim Diederich
Main category: cs.AI
TL;DR: A framework for training LLMs with adjustable representations from interpretable localist to efficient distributed encodings via a tunable locality dial parameter.
Details
Motivation: Enable continuous interpolation between interpretable and high-performance modes for applications requiring both transparency and capability in regulated domains.Method: Group sparsity penalties on attention mechanisms, information-theoretic anchor design, and dynamic rule injection to control localization degree without retraining.
Result: Mathematical proofs show attention concentrates on semantically relevant blocks with exponential bounds on entropy and pointer fidelity when sparsity thresholds are exceeded.
Conclusion: The framework provides practitioners with flexible control over model interpretability vs performance trade-offs through a tunable locality parameter.
Abstract: We present a novel framework for training large language models with continuously adjustable internal representations that span the full spectrum from localist (interpretable, rule-based) to distributed (generalizable, efficient) encodings. The key innovation is a locality dial, a tunable parameter that dynamically controls the degree of localization during both training and inference without requiring model retraining. This is achieved through group sparsity penalties on attention mechanisms, information-theoretic anchor design, and dynamic rule injection. We provide rigorous mathematical proofs establishing explicit threshold conditions under which attention provably concentrates on semantically relevant blocks, with exponential bounds on attention entropy and pointer fidelity. Specifically, we prove that when group sparsity penalties exceed certain threshold values, the model’s attention mechanisms concentrate on semantically relevant blocks, achieving low entropy and high fidelity with negligible error. This framework enables practitioners to continuously interpolate between interpretable and high-performance modes, supporting applications in regulated domains requiring both transparency and capability.
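To make the locality-dial idea concrete, here is a minimal sketch of a block-sparsity penalty on attention weights. The contiguous key-block grouping, the sqrt-of-block-mass surrogate, and the `locality_dial` coefficient are illustrative assumptions, not the paper's actual formulation.

```python
import torch

def block_sparsity_penalty(attn, block_size, locality_dial):
    """Penalty that favors attention mass concentrating on few key blocks.

    attn: (batch, heads, queries, keys) attention weights (rows sum to 1).
    block_size: number of adjacent key positions treated as one semantic block.
    locality_dial: penalty strength; 0 recovers ordinary training, larger values
    push each query to attend to fewer blocks (more "localist" behavior).

    Summing sqrt of per-block mass is one simple group-sparsity surrogate: for a
    fixed total mass of 1, it is smallest when the mass sits in a single block.
    """
    b, h, q, k = attn.shape
    pad = (-k) % block_size
    if pad:
        attn = torch.nn.functional.pad(attn, (0, pad))
    block_mass = attn.reshape(b, h, q, -1, block_size).sum(dim=-1)  # mass per block
    return locality_dial * block_mass.clamp_min(0).sqrt().sum(dim=-1).mean()

# usage: add to the task loss; turning the dial up makes attention more localist
attn = torch.softmax(torch.randn(2, 4, 16, 64), dim=-1)
print(block_sparsity_penalty(attn, block_size=8, locality_dial=0.1))
```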
[565] Where to Search: Measure the Prior-Structured Search Space of LLM Agents
Zhuo-Yang Song
Main category: cs.AI
TL;DR: A formal theory for measuring LLM-assisted iterative search with domain priors, representing agents as fuzzy relation operators constrained by safety envelopes, and providing geometric interpretation of search difficulty.
Details
Motivation: To systematically encode domain priors into structured hypothesis spaces for effective LLM-based iterative search, addressing the challenge of where to search in reasoning, programming, and program discovery tasks.Method: Proposes a compact formal theory representing agents as fuzzy relation operators on inputs/outputs constrained by safety envelopes, weights reachable paths with continuation parameters to compute coverage generating functions, and provides geometric interpretation of search on induced graphs.
Result: Developed a workable language and operational tools to measure agents and their search spaces, with testable inferences validated through two instantiations, offering systematic formal description of iterative search constructed by LLMs.
Conclusion: The theory provides a systematic framework for formally describing and measuring LLM-assisted iterative search guided by domain priors, enabling better understanding and optimization of search processes in AI+Science applications.
Abstract: The generate-filter-refine (iterative paradigm) based on large language models (LLMs) has achieved progress in reasoning, programming, and program discovery in AI+Science. However, the effectiveness of search depends on where to search, namely, how to encode the domain prior into an operationally structured hypothesis space. To this end, this paper proposes a compact formal theory that describes and measures LLM-assisted iterative search guided by domain priors. We represent an agent as a fuzzy relation operator on inputs and outputs to capture feasible transitions; the agent is thereby constrained by a fixed safety envelope. To describe multi-step reasoning/search, we weight all reachable paths by a single continuation parameter and sum them to obtain a coverage generating function; this induces a measure of reachability difficulty and provides a geometric interpretation of search on the graph induced by the safety envelope. We further provide the simplest testable inferences and validate them via two instantiations. This theory offers a workable language and operational tools to measure agents and their search spaces, proposing a systematic formal description of iterative search constructed by LLMs.
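A minimal numerical sketch of the path-weighting idea: paths reachable inside the safety envelope are weighted by a continuation parameter z and summed, so harder-to-reach nodes receive exponentially smaller coverage weight. The adjacency-matrix formulation and the truncation at `max_steps` are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

def coverage_generating_function(adjacency, start, z, max_steps=50):
    """Sum over reachable paths, weighting a length-t path by z**t.

    adjacency: (n, n) 0/1 matrix of feasible transitions (the safety envelope).
    start: index of the initial state.
    z: continuation parameter in [0, 1); smaller z discounts long paths harder.
    Returns the total weighted coverage and per-node reachability weights.
    """
    n = adjacency.shape[0]
    reach = np.zeros(n)
    frontier = np.zeros(n)
    frontier[start] = 1.0
    for t in range(1, max_steps + 1):
        frontier = adjacency.T @ frontier   # count paths extended by one step
        reach += (z ** t) * frontier        # weight length-t paths by z^t
    return reach.sum(), reach

# toy chain 0 -> 1 -> 2: node 2 is "harder" to reach, so its weight is smaller
A = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], float)
total, per_node = coverage_generating_function(A, start=0, z=0.5)
print(total, per_node)   # per_node is approximately [0, 0.5, 0.25]
```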
[566] Stable but Miscalibrated: A Kantian View on Overconfidence from Filters to Large Language Models
Akira Okutomi
Main category: cs.AI
TL;DR: This paper reinterprets Kant’s Critique of Pure Reason through feedback stability theory, proposing a composite instability index (H-Risk) that predicts overconfident errors in reasoning systems, with applications to both linear-Gaussian models and LLMs.
Details
Motivation: To bridge Kant's theory of reason as self-limiting with modern feedback control theory, providing a principled framework to diagnose and address overconfidence in reasoning systems like LLMs.Method: Developed a composite instability index (H-Risk) combining spectral margin, conditioning, temporal sensitivity, and innovation amplification. Applied this to linear-Gaussian simulations and extended to analyze LLMs through internal fragility measurements and critique prompts.
Result: Higher H-Risk predicts overconfident errors even under formal stability conditions. In LLMs, preliminary correlations found between internal fragility and miscalibration/hallucination. Lightweight critique prompts showed modest, mixed effects on calibration in small-scale tests.
Conclusion: The study establishes a structural bridge between Kantian self-limitation and feedback control, offering a principled diagnostic lens for reasoning system overconfidence and potential mitigation strategies.
Abstract: We reinterpret Kant’s Critique of Pure Reason as a theory of feedback stability, viewing reason as a regulator that keeps inference within the bounds of possible experience. We formalize this intuition via a composite instability index (H-Risk) combining spectral margin, conditioning, temporal sensitivity, and innovation amplification. In linear-Gaussian simulations, higher H-Risk predicts overconfident errors even under formal stability, revealing a gap between nominal and epistemic stability. Extending to large language models (LLMs), we observe preliminary correlations between internal fragility and miscalibration or hallucination (confabulation), and find that lightweight critique prompts may modestly improve or worsen calibration in small-scale tests. These results suggest a structural bridge between Kantian self-limitation and feedback control, offering a principled lens to diagnose and potentially mitigate overconfidence in reasoning systems.
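The abstract names the four ingredients of H-Risk but not how they are combined; the sketch below sums toy versions of each for a linear-Gaussian filter. The closed-loop form, the individual component definitions, and the equal weighting are all assumptions made for illustration.

```python
import numpy as np

def h_risk(A, K, C, horizon=20):
    """Toy composite instability index for x_{t+1} = A x_t with estimator gain K.

    Combines the four ingredients named in the paper -- spectral margin,
    conditioning, temporal sensitivity, and innovation amplification -- with
    equal weights; the paper's actual definitions and weights may differ.
    """
    closed_loop = A - K @ C @ A   # one common form of the closed-loop error dynamics
    eigs = np.linalg.eigvals(closed_loop)
    spectral_margin = 1.0 - np.max(np.abs(eigs))      # small margin -> fragile
    conditioning = np.linalg.cond(closed_loop)
    temporal_sensitivity = np.linalg.norm(
        np.linalg.matrix_power(closed_loop, horizon), 2)  # growth over the horizon
    innovation_amplification = np.linalg.norm(K, 2)       # how hard innovations are pushed
    return (1.0 / max(spectral_margin, 1e-6)
            + conditioning
            + temporal_sensitivity
            + innovation_amplification)

A = np.array([[0.99, 0.1], [0.0, 0.95]])
C = np.eye(2)
K = 0.5 * np.eye(2)
print(h_risk(A, K, C))
```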
[567] Experience-Driven Exploration for Efficient API-Free AI Agents
Chenwei Tang, Jingyu Xing, Xinyu Liu, Zizhou Wang, Jiawei Du, Liangli Zhen, Jiancheng Lv
Main category: cs.AI
TL;DR: KG-Agent is an experience-driven learning framework that structures pixel-based GUI interactions into a State-Action Knowledge Graph to improve efficiency and strategic planning in API-free environments.
Details
Motivation: Most software lacks accessible APIs, forcing agents to operate through pixel-based GUIs, which leads to inefficient trial-and-error exploration and myopic decision-making in LLM-based agents.Method: Proposes KG-Agent framework that builds a persistent State-Action Knowledge Graph from raw pixel interactions, uses graph topology for hybrid intrinsic rewards (state value + novelty), and links functionally similar GUI states to enable generalization.
Result: Significant improvements in exploration efficiency and strategic depth demonstrated in complex GUI environments (Civilization V and Slay the Spire) compared to state-of-the-art methods.
Conclusion: KG-Agent effectively addresses efficiency bottlenecks in API-free environments by structuring experience into knowledge graphs, enabling better generalization and long-horizon reasoning through hybrid reward mechanisms.
Abstract: Most existing software lacks accessible Application Programming Interfaces (APIs), requiring agents to operate solely through pixel-based Graphical User Interfaces (GUIs). In this API-free setting, large language model (LLM)-based agents face severe efficiency bottlenecks: limited to local visual experiences, they make myopic decisions and rely on inefficient trial-and-error, hindering both skill acquisition and long-term planning. To address these challenges, we propose KG-Agent, an experience-driven learning framework that structures an agent’s raw pixel-level interactions into a persistent State-Action Knowledge Graph (SA-KG). KG-Agent overcomes inefficient exploration by linking functionally similar but visually distinct GUI states, forming a rich neighborhood of experience that enables the agent to generalize from a diverse set of historical strategies. To support long-horizon reasoning, we design a hybrid intrinsic reward mechanism based on the graph topology, combining a state value reward for exploiting known high-value pathways with a novelty reward that encourages targeted exploration. This approach decouples strategic planning from pure discovery, allowing the agent to effectively value setup actions with delayed gratification. We evaluate KG-Agent in two complex, open-ended GUI-based decision-making environments (Civilization V and Slay the Spire), demonstrating significant improvements in exploration efficiency and strategic depth over the state-of-the-art methods.
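A minimal sketch of what a topology-based hybrid intrinsic reward could look like, assuming a count-based novelty bonus and a linear mix of a graph-derived state value and novelty; the actual terms KG-Agent derives from the SA-KG may differ.

```python
import math

def hybrid_intrinsic_reward(state_id, graph_value, visit_counts,
                            value_weight=1.0, novelty_weight=0.5):
    """Toy hybrid reward: exploit known high-value graph regions + explore novel ones.

    graph_value: dict mapping a state node to a value estimate propagated over the
    state-action graph (e.g. discounted return of trajectories through it).
    visit_counts: dict mapping a state node to how often it has been reached.
    Both the count-based novelty bonus and the linear mixing are assumptions.
    """
    value_term = graph_value.get(state_id, 0.0)
    novelty_term = 1.0 / math.sqrt(1 + visit_counts.get(state_id, 0))
    return value_weight * value_term + novelty_weight * novelty_term

# a rarely visited state with moderate value can outrank a well-trodden one
print(hybrid_intrinsic_reward("s_new", {"s_new": 0.3}, {"s_new": 1}))
print(hybrid_intrinsic_reward("s_old", {"s_old": 0.4}, {"s_old": 100}))
```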
[568] RELATE: A Schema-Agnostic Perceiver Encoder for Multimodal Relational Graphs
Joe Meyer, Divyansha Lachi, Mahmoud Mohammadi, Roshan Reddy Upendra, Eva L. Dyer, Mark Li, Tom Palczewski
Main category: cs.AI
TL;DR: RELATE is a schema-agnostic feature encoder for heterogeneous temporal graphs that uses shared modality-specific encoders and cross-attention to achieve performance close to schema-specific methods while reducing parameters by up to 5x.
Details
Motivation: Existing GNNs require schema-specific feature encoders with separate modules for each node type and feature column, which limits scalability and parameter sharing across different relational datasets.Method: RELATE employs shared modality-specific encoders for categorical, numerical, textual, and temporal attributes, followed by a Perceiver-style cross-attention module that aggregates features into fixed-size, permutation-invariant node representations.
Result: On RelBench benchmark with ReLGNN and HGT, RELATE achieves performance within 3% of schema-specific encoders while reducing parameter counts by up to 5x.
Conclusion: RELATE enables varying schemas and multi-dataset pretraining for general-purpose GNNs, paving the way toward foundation models for relational graph data.
Abstract: Relational multi-table data is common in domains such as e-commerce, healthcare, and scientific research, and can be naturally represented as heterogeneous temporal graphs with multi-modal node attributes. Existing graph neural networks (GNNs) rely on schema-specific feature encoders, requiring separate modules for each node type and feature column, which hinders scalability and parameter sharing. We introduce RELATE (Relational Encoder for Latent Aggregation of Typed Entities), a schema-agnostic, plug-and-play feature encoder that can be used with any general purpose GNN. RELATE employs shared modality-specific encoders for categorical, numerical, textual, and temporal attributes, followed by a Perceiver-style cross-attention module that aggregates features into a fixed-size, permutation-invariant node representation. We evaluate RELATE on ReLGNN and HGT in the RelBench benchmark, where it achieves performance within 3% of schema-specific encoders while reducing parameter counts by up to 5x. This design supports varying schemas and enables multi-dataset pretraining for general-purpose GNNs, paving the way toward foundation models for relational graph data.
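A minimal PyTorch sketch of the Perceiver-style aggregation step: learned latent queries cross-attend over a variable-length set of per-column feature tokens and are pooled into one fixed-size node embedding. The single attention layer, dimensions, and pooling head are illustrative choices, not RELATE's published configuration.

```python
import torch
import torch.nn as nn

class PerceiverAggregator(nn.Module):
    """Aggregate a variable-length set of attribute tokens into a fixed-size node embedding."""

    def __init__(self, dim=128, n_latents=8, n_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))   # learned queries
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.out = nn.Linear(n_latents * dim, dim)

    def forward(self, tokens, key_padding_mask=None):
        # tokens: (batch, n_columns, dim) -- outputs of shared modality-specific encoders
        batch = tokens.shape[0]
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        pooled, _ = self.attn(queries, tokens, tokens,
                              key_padding_mask=key_padding_mask)
        return self.out(pooled.flatten(1))   # (batch, dim): fixed-size, order-invariant in tokens

# usage: two nodes with different numbers of attribute tokens, padded to a common length
tokens = torch.randn(2, 5, 128)
mask = torch.tensor([[False] * 5, [False, False, False, True, True]])  # True = padding
node_emb = PerceiverAggregator()(tokens, key_padding_mask=mask)
print(node_emb.shape)   # torch.Size([2, 128])
```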
[569] Jarvis: Towards Personalized AI Assistant via Personal KV-Cache Retrieval
Binxiao Xu, Junyu Feng, Shaolin Lu, Yulin Luo, Shilin Yan, Hao Liang, Ming Lu, Wentao Zhang
Main category: cs.AI
TL;DR: Jarvis is a personalized AI assistant framework that uses personal KV-Cache retrieval to store user-specific information in both textual and visual tokens, achieving state-of-the-art accuracy in visual question answering and text tasks.
Details
Motivation: Existing methods for adapting VLMs into personalized assistants struggle with accuracy - either learning concept tokens or training VLMs to use user information, both failing to generate accurate answers.Method: Stores user-specific information in KV-Caches of both textual and visual tokens. Textual tokens from summarized metadata, visual tokens from distinct image patches. Retrieves related KV-Caches when answering questions to ensure accuracy.
Result: Achieves state-of-the-art results in both visual question answering and text-only tasks across multiple datasets, providing more accurate responses especially for fine-grained local details.
Conclusion: Jarvis presents a practical path toward personalized AI assistants by effectively leveraging user-specific information through KV-Cache retrieval, outperforming existing methods in accuracy.
Abstract: The rapid development of Vision-language models (VLMs) enables open-ended perception and reasoning. Recent works have started to investigate how to adapt general-purpose VLMs into personalized assistants. Even commercial models such as ChatGPT now support model personalization by incorporating user-specific information. However, existing methods either learn a set of concept tokens or train a VLM to utilize user-specific information, and both pipelines struggle to generate accurate answers as personalized assistants. We introduce Jarvis, an innovative framework for a personalized AI assistant through personal KV-Cache retrieval, which stores user-specific information in the KV-Caches of both textual and visual tokens. The textual tokens are created by summarizing user information into metadata, while the visual tokens are produced by extracting distinct image patches from the user’s images. When answering a question, Jarvis first retrieves related KV-Caches from personal storage and uses them to ensure accuracy in responses. We also introduce a fine-grained benchmark built with the same distinct image patch mining pipeline, emphasizing accurate question answering based on fine-grained user-specific information. Jarvis is capable of providing more accurate responses, particularly when they depend on specific local details. Jarvis achieves state-of-the-art results in both visual question answering and text-only tasks across multiple datasets, indicating a practical path toward personalized AI assistants. The code and dataset will be released.
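A small sketch of the retrieval step, assuming stored personal KV-Caches are indexed by embeddings and looked up by cosine similarity to the question; the actual retrieval criterion and index layout in Jarvis are not specified in the summary.

```python
import numpy as np

def retrieve_personal_caches(question_emb, cache_index, top_k=3):
    """Pick the user's stored KV-Cache entries most relevant to the question.

    cache_index: list of (key_embedding, cache_handle) pairs, where each handle
    points at a precomputed KV-Cache for textual metadata or an image patch.
    Cosine-similarity retrieval over this index is an assumption for illustration.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    scored = sorted(cache_index, key=lambda kv: cos(question_emb, kv[0]), reverse=True)
    return [handle for _, handle in scored[:top_k]]

# usage: the retrieved handles would be prepended to the VLM's KV-Cache before decoding
index = [(np.random.randn(16), f"kv_cache_{i}") for i in range(10)]
print(retrieve_personal_caches(np.random.randn(16), index))
```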
[570] Will Humanity Be Rendered Obsolete by AI?
Mohamed El Louadi, Emna Ben Romdhane
Main category: cs.AI
TL;DR: This paper analyzes existential risks from AI development, particularly the transition to superintelligence that could lead to human extinction through uncontrollable cognitive superiority rather than malice.
Details
Motivation: To examine the existential threats posed by artificial intelligence as it progresses from current capabilities to ultraintelligence, building on theoretical work by Good and Bostrom.Method: Drawing on theoretical frameworks from Irving J. Good and Nick Bostrom, plus recent publications, the paper traces AI’s trajectory and explores the implications of AGI and superintelligence development.
Result: The analysis reveals that human extinction may result from machines’ exponentially growing cognitive power creating fundamentally alien intelligence that vastly exceeds humanity’s capabilities.
Conclusion: Existential risk from AI stems not from malicious intent but from uncontrollable, indifferent cognitive superiority that could render humanity obsolete.
Abstract: This article analyzes the existential risks artificial intelligence (AI) poses to humanity, tracing the trajectory from current AI to ultraintelligence. Drawing on Irving J. Good and Nick Bostrom’s theoretical work, plus recent publications (AI 2027; If Anyone Builds It, Everyone Dies), it explores AGI and superintelligence. Considering machines’ exponentially growing cognitive power and hypothetical IQs, it addresses the ethical and existential implications of an intelligence vastly exceeding humanity’s and fundamentally alien to it. Human extinction may result not from malice, but from uncontrollable, indifferent cognitive superiority.
[571] Mixed-Density Diffuser: Efficient Planning with Non-uniform Temporal Resolution
Crimson Stambaugh, Rajesh P. N. Rao
Main category: cs.AI
TL;DR: MDD is a diffusion planner that uses tunable hyperparameters to control planning density across temporal horizons, achieving state-of-the-art performance on multiple D4RL benchmarks.
Details
Motivation: While sparse-step planning in diffusion models captures long-term dependencies efficiently, excessive sparsity degrades performance. The temporal density threshold varies across the planning horizon, suggesting certain trajectory segments need denser planning than others.Method: Proposed Mixed Density Diffuser (MDD) with tunable hyperparameters that control planning density throughout the temporal horizon, allowing non-uniform step skipping across different parts of the trajectory.
Result: MDD achieves new state-of-the-art performance across Maze2D, Franka Kitchen, and Antmaze D4RL task domains.
Conclusion: Adaptive density planning through tunable hyperparameters in diffusion planners significantly improves performance by optimizing step skipping across different temporal segments of trajectories.
Abstract: Recent studies demonstrate that diffusion planners benefit from sparse-step planning over single-step planning. Training models to skip steps in their trajectories helps capture long-term dependencies without additional memory or computational cost. However, predicting excessively sparse plans degrades performance. We hypothesize this temporal density threshold is non-uniform across a temporal horizon and that certain parts of a planned trajectory should be more densely planned. We propose Mixed Density Diffuser (MDD), a diffusion planner where the densities throughout the horizon are tunable hyperparameters. MDD achieves a new SOTA across the Maze2D, Franka Kitchen, and Antmaze D4RL task domains.
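A toy sketch of how tunable density hyperparameters might translate into which trajectory steps the planner actually models, assuming a per-segment stride parameterization; the paper's exact density scheme is not given here.

```python
def mixed_density_indices(horizon, densities, segment_bounds):
    """Pick which trajectory steps a sparse-step planner models.

    densities: planning stride per segment (1 = every step, 4 = every 4th step).
    segment_bounds: where each segment of the horizon ends.
    The segment/stride parameterization is an illustrative reading of the
    "tunable density" hyperparameters, not the paper's exact scheme.
    """
    indices, start = [], 0
    for stride, end in zip(densities, segment_bounds):
        indices.extend(range(start, min(end, horizon), stride))
        start = end
    keep = set(i for i in indices if i < horizon)
    keep.add(horizon - 1)   # always plan the final step
    return sorted(keep)

# dense planning near the start (where precise actions matter most), sparse later
print(mixed_density_indices(horizon=32, densities=[1, 4, 8], segment_bounds=[8, 20, 32]))
```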
[572] OneCast: Structured Decomposition and Modular Generation for Cross-Domain Time Series Forecasting
Tingyue Pan, Mingyue Cheng, Shilong Zhang, Zhiding Liu, Xiaoyu Tao, Yucong Luo, Jintao Zhang, Qi Liu
Main category: cs.AI
TL;DR: OneCast is a cross-domain time series forecasting framework that decomposes series into seasonal and trend components, using specialized generative pathways for each component to handle domain-specific variations.
Details
Motivation: Existing cross-domain forecasting methods struggle with domain-specific trend shifts and inconsistent periodic patterns because they treat time series as undifferentiated sequences without explicitly decoupling structural components.Method: OneCast decomposes time series into seasonal and trend components. Seasonal patterns are captured via lightweight projection with interpretable basis functions. Trend components are encoded into discrete tokens and inferred through masked discrete diffusion. Both outputs are combined for final forecasts.
Result: Extensive experiments across eight domains demonstrate that OneCast mostly outperforms state-of-the-art baselines in cross-domain time series forecasting.
Conclusion: The structured decomposition approach with specialized modeling for seasonal and trend components enables effective generalization across heterogeneous time series domains.
Abstract: Cross-domain time series forecasting is a valuable task in various web applications. Despite its rapid advancement, achieving effective generalization across heterogeneous time series data remains a significant challenge. Existing methods have made progress by extending single-domain models, yet often fall short when facing domain-specific trend shifts and inconsistent periodic patterns. We argue that a key limitation lies in treating temporal series as undifferentiated sequences, without explicitly decoupling their inherent structural components. To address this, we propose OneCast, a structured and modular forecasting framework that decomposes time series into seasonal and trend components, each modeled through tailored generative pathways. Specifically, the seasonal component is captured by a lightweight projection module that reconstructs periodic patterns via interpretable basis functions. In parallel, the trend component is encoded into discrete tokens at segment level via a semantic-aware tokenizer, and subsequently inferred through a masked discrete diffusion mechanism. The outputs from both branches are combined to produce a final forecast that captures seasonal patterns while tracking domain-specific trends. Extensive experiments across eight domains demonstrate that OneCast mostly outperforms state-of-the-art baselines.
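A minimal sketch of the seasonal branch, assuming the interpretable basis functions are a small Fourier basis fitted by least squares; the residual would then feed the trend tokenizer and masked discrete diffusion branch. The basis choice and fitting procedure are assumptions, not OneCast's exact module.

```python
import numpy as np

def seasonal_projection(series, period, n_harmonics=3):
    """Fit the seasonal component with interpretable sine/cosine basis functions.

    Least-squares projection onto a Fourier basis is a minimal stand-in for a
    lightweight projection module. Returns the fitted seasonal component and the
    residual (trend plus noise).
    """
    t = np.arange(len(series))
    basis = [np.ones_like(t, dtype=float)]
    for k in range(1, n_harmonics + 1):
        basis.append(np.sin(2 * np.pi * k * t / period))
        basis.append(np.cos(2 * np.pi * k * t / period))
    B = np.stack(basis, axis=1)                     # (T, 2*n_harmonics + 1)
    coeffs, *_ = np.linalg.lstsq(B, series, rcond=None)
    seasonal = B @ coeffs
    return seasonal, series - seasonal

# usage: daily-periodic toy series with a linear trend
t = np.arange(200)
y = 0.05 * t + np.sin(2 * np.pi * t / 24) + 0.1 * np.random.randn(200)
seasonal, residual = seasonal_projection(y, period=24)
```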
[573] MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools
Wenhao Wang, Peizhi Niu, Zhao Xu, Zhaoyu Chen, Jian Du, Yaxin Du, Xianghe Pang, Keduan Huang, Yanfeng Wang, Qiang Yan, Siheng Chen
Main category: cs.AI
TL;DR: MCP-Flow is an automated pipeline that discovers 1166 MCP servers and 11536 tools, generates 68733 instruction-function pairs and 6439 trajectories, and enables superior LLM tool usage through large-scale data synthesis and training.
Details
Motivation: LLMs struggle to utilize the rapidly expanding Model Contextual Protocol (MCP) ecosystem due to limited server coverage, manual curation requirements, and lack of training support, hindering real-world deployment.Method: MCP-Flow uses an automated web-agent-driven pipeline for large-scale server discovery, data synthesis, and model training, collecting data from 1166 servers and 11536 tools.
Result: The system produced 68733 high-quality instruction-function call pairs and 6439 trajectories, significantly exceeding prior work in scale and diversity, and demonstrated superior tool selection, function-call generation, and agentic task performance.
Conclusion: MCP-Flow provides a scalable foundation for advancing LLM agents’ proficiency in real-world MCP environments through automated large-scale data collection and training.
Abstract: Large Language Models (LLMs) increasingly rely on external tools to perform complex, realistic tasks, yet their ability to utilize the rapidly expanding Model Contextual Protocol (MCP) ecosystem remains limited. Existing MCP research covers few servers, depends on costly manual curation, and lacks training support, hindering progress toward real-world deployment. To overcome these limitations, we introduce MCP-Flow, an automated web-agent-driven pipeline for large-scale server discovery, data synthesis, and model training. MCP-Flow collects and filters data from 1166 servers and 11536 tools, producing 68733 high-quality instruction-function call pairs and 6439 trajectories, far exceeding prior work in scale and diversity. Extensive experiments demonstrate MCP-Flow’s effectiveness in driving superior MCP tool selection, function-call generation, and enhanced agentic task performance. MCP-Flow thus provides a scalable foundation for advancing LLM agents’ proficiency in real-world MCP environments. MCP-Flow is publicly available at https://github.com/wwh0411/MCP-Flow.
[574] AutoSurvey2: Empowering Researchers with Next Level Automated Literature Surveys
Siyi Wu, Chiaxin Liang, Ziqian Bi, Leyi Zhao, Tianyang Wang, Junhao Song, Yichao Zhang, Keyu Chen, Xinyuan Song
Main category: cs.AI
TL;DR: autosurvey2 is an automated pipeline for generating academic survey papers using retrieval-augmented synthesis and multi-stage evaluation, outperforming existing baselines in structural coherence and topical relevance.
Details
Motivation: The rapid growth of research literature, especially in LLMs, makes producing comprehensive and current survey papers increasingly difficult, requiring automated solutions.Method: Multi-stage pipeline with parallel section generation, iterative refinement, real-time retrieval of recent publications, and multi-LLM evaluation framework for quality assessment.
Result: autosurvey2 consistently outperforms existing retrieval-based and automated baselines, achieving higher scores in structural coherence and topical relevance while maintaining strong citation fidelity.
Conclusion: autosurvey2 provides a scalable and reproducible solution for generating long-form academic surveys and contributes a foundation for future research on automated scholarly writing.
Abstract: The rapid growth of research literature, particularly in large language models (LLMs), has made producing comprehensive and current survey papers increasingly difficult. This paper introduces autosurvey2, a multi-stage pipeline that automates survey generation through retrieval-augmented synthesis and structured evaluation. The system integrates parallel section generation, iterative refinement, and real-time retrieval of recent publications to ensure both topical completeness and factual accuracy. Quality is assessed using a multi-LLM evaluation framework that measures coverage, structure, and relevance in alignment with expert review standards. Experimental results demonstrate that autosurvey2 consistently outperforms existing retrieval-based and automated baselines, achieving higher scores in structural coherence and topical relevance while maintaining strong citation fidelity. By combining retrieval, reasoning, and automated evaluation into a unified framework, autosurvey2 provides a scalable and reproducible solution for generating long-form academic surveys and contributes a solid foundation for future research on automated scholarly writing. All code and resources are available at https://github.com/annihi1ation/auto_research.
[575] InnovatorBench: Evaluating Agents’ Ability to Conduct Innovative LLM Research
Yunze Wu, Dayuan Fu, Weiye Si, Zhen Huang, Mohan Jiang, Keyu Li, Shijie Xia, Jie Sun, Tianze Xu, Xiangkun Hu, Pengrui Lu, Xiaojie Cai, Lyumanshan Ye, Wenhong Zhu, Yang Xiao, Pengfei Liu
Main category: cs.AI
TL;DR: InnovatorBench is a new benchmark for evaluating AI agents in realistic LLM research tasks, paired with ResearchGym environment, showing frontier models struggle with complex algorithm tasks and long-horizon planning.
Details
Motivation: Existing AI agent benchmarks are too narrow and simplified, failing to assess end-to-end research capabilities needed for scientific discovery automation.Method: Developed InnovatorBench with 20 research tasks across 6 categories, ResearchGym environment with rich action spaces, and a ReAct agent using frontier models like Claude-4, GPT-5, GLM-4.5, and Kimi-K2.
Result: Frontier models show promise in code-driven tasks but struggle with fragile algorithm tasks and long-horizon decision making, requiring over 11 hours to achieve best performance.
Conclusion: InnovatorBench demonstrates significant challenges in AI research automation and has potential to become the next generation code-based research benchmark.
Abstract: AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis, yet existing benchmarks probe narrow skills in simplified settings. To address this gap, we introduce InnovatorBench, a benchmark-platform pair for realistic, end-to-end assessment of agents performing Large Language Model (LLM) research. It comprises 20 tasks spanning Data Construction, Filtering, Augmentation, Loss Design, Reward Design, and Scaffold Construction, which require runnable artifacts and assessment of correctness, performance, output quality, and uncertainty. To support agent operation, we develop ResearchGym, a research environment offering rich action spaces, distributed and long-horizon execution, asynchronous monitoring, and snapshot saving. We also implement a lightweight ReAct agent that couples explicit reasoning with executable planning using frontier models such as Claude-4, GPT-5, GLM-4.5, and Kimi-K2. Our experiments demonstrate that while frontier models show promise in code-driven research tasks, they struggle with fragile algorithm-related tasks and long-horizon decision making, exhibiting failure modes such as impatience, poor resource management, and overreliance on template-based reasoning. Furthermore, agents require over 11 hours to achieve their best performance on InnovatorBench, underscoring the benchmark’s difficulty and showing the potential of InnovatorBench to be the next generation of code-based research benchmark.
[576] Interaction as Intelligence Part II: Asynchronous Human-Agent Rollout for Long-Horizon Task Training
Dayuan Fu, Yunze Wu, Xiaojie Cai, Lyumanshan Ye, Shijie Xia, Zhen Huang, Weiye Si, Tianze Xu, Jie Sun, Keyu Li, Mohan Jiang, Junfei Wang, Qishuo Hua, Pengrui Lu, Yang Xiao, Pengfei Liu
Main category: cs.AI
TL;DR: Apollo is a sampling framework that integrates asynchronous human guidance with action-level data filtering to train LLM agents on long-horizon, domain-specialized tasks more efficiently than existing methods.
Details
Motivation: Current methods for training LLM agents on long-horizon tasks are either prohibitively expensive (behavior cloning) or prone to failure (outcome-driven sampling), especially for domain-specialized tasks where positive trajectories are rare.Method: Apollo allows human annotators to intervene only when agents drift from promising trajectories, providing lightweight guidance. It then applies supervision control to filter sub-optimal actions and prevent error propagation.
Result: When applied to train GLM-4.5 on InnovatorBench, Apollo achieved over 50% improvement over the untrained baseline and 28% improvement over a variant trained without human interaction.
Conclusion: Apollo demonstrates the critical role of human-in-the-loop sampling and provides a robust framework for handling long-horizon, domain-specialized tasks efficiently.
Abstract: Large Language Model (LLM) agents have recently shown strong potential in domains such as automated coding, deep research, and graphical user interface manipulation. However, training them to succeed on long-horizon, domain-specialized tasks remains challenging. Current methods primarily fall into two categories. The first relies on dense human annotations through behavior cloning, which is prohibitively expensive for long-horizon tasks that can take days or months. The second depends on outcome-driven sampling, which often collapses due to the rarity of valid positive trajectories on domain-specialized tasks. We introduce Apollo, a sampling framework that integrates asynchronous human guidance with action-level data filtering. Instead of requiring annotators to shadow every step, Apollo allows them to intervene only when the agent drifts from a promising trajectory, by providing prior knowledge, strategic advice, etc. This lightweight design makes it possible to sustain interactions for over 30 hours and produces valuable trajectories at a lower cost. Apollo then applies supervision control to filter out sub-optimal actions and prevent error propagation. Together, these components enable reliable and effective data collection in long-horizon environments. To demonstrate the effectiveness of Apollo, we evaluate it using InnovatorBench. Our experiments show that when applied to train the GLM-4.5 model on InnovatorBench, Apollo achieves more than a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results highlight the critical role of human-in-the-loop sampling and the robustness of Apollo’s design in handling long-horizon, domain-specialized tasks.
cs.SD
[577] Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study
Lucky Onyekwelu-Udoka, Md Shafiqul Islam, Md Shahedul Hasan
Main category: cs.SD
TL;DR: Comparative analysis of lightweight transformers DistilHuBERT and PaSST for speech emotion recognition, with DistilHuBERT achieving best performance (70.64% accuracy) while being extremely small (0.02 MB).
Details
Motivation: To develop efficient speech emotion recognition systems for empathetic human-computer interaction, particularly for real-time applications on edge devices.Method: Benchmarked DistilHuBERT and PaSST transformers against CNN-LSTM baseline using MFCC features on CREMA-D dataset. Conducted ablation study on PaSST variants with different classification heads (Linear, MLP, Attentive Pooling).
Result: DistilHuBERT achieved superior accuracy (70.64%) and F1 score (70.36%) with smallest model size. PaSST with MLP head performed best among its variants. Angry emotion was most accurately detected, while disgust was most challenging.
Conclusion: Lightweight transformers like DistilHuBERT offer compelling solutions for real-time speech emotion recognition on edge devices due to their high performance and small size.
Abstract: Emotion recognition from speech plays a vital role in the development of empathetic human-computer interaction systems. This paper presents a comparative analysis of lightweight transformer-based models, DistilHuBERT and PaSST, by classifying six core emotions from the CREMA-D dataset. We benchmark their performance against a traditional CNN-LSTM baseline model using MFCC features. DistilHuBERT demonstrates superior accuracy (70.64%) and F1 score (70.36%) while maintaining an exceptionally small model size (0.02 MB), outperforming both PaSST and the baseline. Furthermore, we conducted an ablation study on three variants of the PaSST, Linear, MLP, and Attentive Pooling heads, to understand the effect of classification head architecture on model performance. Our results indicate that PaSST with an MLP head yields the best performance among its variants but still falls short of DistilHuBERT. Among the emotion classes, angry is consistently the most accurately detected, while disgust remains the most challenging. These findings suggest that lightweight transformers like DistilHuBERT offer a compelling solution for real-time speech emotion recognition on edge devices. The code is available at: https://github.com/luckymaduabuchi/Emotion-detection-.
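For reference, a minimal MFCC front-end of the kind that typically feeds a CNN-LSTM baseline; the sampling rate, number of coefficients, and padding policy below are assumptions, since the paper only states that MFCC features are used.

```python
import numpy as np
import librosa

def mfcc_features(path, sr=16000, n_mfcc=40, max_frames=300):
    """MFCC front-end for a sequence model such as a CNN-LSTM baseline.

    The exact frame count, n_mfcc, and padding policy here are illustrative
    choices, not the paper's reported configuration.
    """
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    if mfcc.shape[1] < max_frames:                            # pad short clips
        mfcc = np.pad(mfcc, ((0, 0), (0, max_frames - mfcc.shape[1])))
    return mfcc[:, :max_frames].T                             # (frames, n_mfcc) for the LSTM

# usage (hypothetical CREMA-D file path):
# features = mfcc_features("crema_d/1001_DFA_ANG_XX.wav")
```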
[578] Physics-Informed Neural Networks for Speech Production
Kazuya Yokota, Ryosuke Harakawa, Masaaki Baba, Masahiro Iwahashi
Main category: cs.SD
TL;DR: Proposes a physics-informed neural network (PINN) method for speech production analysis that handles vocal-fold collisions, unknown vibration periods, and glottis-tract coupling through differentiable approximations and learnable parameters.
Details
Motivation: To enable accurate analysis of vocal-fold behavior and speech production using physical models, overcoming challenges like vocal-fold collisions and unknown vibration periods that are difficult for traditional PINNs.Method: Uses PINNs trained directly on governing equations of vocal-fold vibration and vocal-tract acoustics. Introduces differentiable approximation for vocal-fold collisions, treats vibration period as learnable parameter, and implements glottis-tract coupling as hard constraint.
Result: Successfully performed forward and inverse analyses, simultaneously estimating glottal flow rate, vocal-fold vibratory state, and subglottal pressure from speech signals. Same network architecture works for both analysis types.
Conclusion: The method inherits PINN advantages like mesh-free computation and natural nonlinearity handling, showing promise for wide applications in speech production analysis.
Abstract: The analysis of speech production based on physical models of the vocal folds and vocal tract is essential for studies on vocal-fold behavior and linguistic research. This paper proposes a speech production analysis method using physics-informed neural networks (PINNs). The networks are trained directly on the governing equations of vocal-fold vibration and vocal-tract acoustics. Vocal-fold collisions introduce nondifferentiability and vanishing gradients, challenging phenomena for PINNs. We demonstrate, however, that introducing a differentiable approximation function enables the analysis of vocal-fold vibrations within the PINN framework. The period of self-excited vocal-fold vibration is generally unknown. We show that by treating the period as a learnable network parameter, a periodic solution can be obtained. Furthermore, by implementing the coupling between glottal flow and vocal-tract acoustics as a hard constraint, glottis-tract interaction is achieved without additional loss terms. We confirmed the method’s validity through forward and inverse analyses, demonstrating that the glottal flow rate, vocal-fold vibratory state, and subglottal pressure can be simultaneously estimated from speech signals. Notably, the same network architecture can be applied to both forward and inverse analyses, highlighting the versatility of this approach. The proposed method inherits the advantages of PINNs, including mesh-free computation and the natural incorporation of nonlinearities, and thus holds promise for a wide range of applications.
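A small sketch of the two tricks the abstract highlights: smoothing the nondifferentiable vocal-fold contact force and exposing the unknown vibration period as a learnable parameter. The softplus form and the constants are assumptions, not the paper's exact approximation function.

```python
import torch

def smooth_collision_force(gap, stiffness=1.0, sharpness=50.0):
    """Differentiable stand-in for the vocal-fold contact force.

    The ideal contact force is stiffness * max(-gap, 0), which has a kink at
    gap = 0; softplus smooths it so gradients flow through the PINN.
    """
    return stiffness * torch.nn.functional.softplus(-sharpness * gap) / sharpness

# the unknown self-oscillation period can be exposed as a learnable parameter
period = torch.nn.Parameter(torch.tensor(0.008))   # ~125 Hz initial guess

gap = torch.linspace(-1e-3, 1e-3, 5, requires_grad=True)
force = smooth_collision_force(gap)
force.sum().backward()           # gradients are finite everywhere, unlike a hard max
print(force.detach(), gap.grad)
```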
[579] More Than A Shortcut: A Hyperbolic Approach To Early-Exit Networks
Swapnil Bhosale, Cosmin Frateanu, Camilla Clark, Arnoldas Jasonas, Chris Mitchell, Xiatian Zhu, Vamsi Krishna Ithapu, Giacomo Ferroni, Cagdas Bilen, Sanjeel Parekh
Main category: cs.SD
TL;DR: HypEE introduces hyperbolic space learning for early-exit networks to enforce hierarchical representation refinement, improving accuracy and efficiency in event detection.
Details
Motivation: Address the performance-computation trade-off in event detection on resource-constrained devices by improving early-exit network reliability through coherent hierarchical structure.Method: Proposes Hyperbolic Early-Exit networks (HypEE) with hierarchical training objective and novel entailment loss that enforces partial-ordering constraints in hyperbolic space.
Result: Significantly outperforms standard Euclidean EE baselines on multiple audio event detection tasks, especially at early exits, with improved efficiency and accuracy.
Conclusion: HypEE provides a principled geometric approach for early-exit networks that enhances both computational efficiency and prediction reliability through hyperbolic space learning.
Abstract: Deploying accurate event detection on resource-constrained devices is challenged by the trade-off between performance and computational cost. While Early-Exit (EE) networks offer a solution through adaptive computation, they often fail to enforce a coherent hierarchical structure, limiting the reliability of their early predictions. To address this, we propose Hyperbolic Early-Exit networks (HypEE), a novel framework that learns EE representations in the hyperbolic space. Our core contribution is a hierarchical training objective with a novel entailment loss, which enforces a partial-ordering constraint to ensure that deeper network layers geometrically refine the representations of shallower ones. Experiments on multiple audio event detection tasks and backbone architectures show that HypEE significantly outperforms standard Euclidean EE baselines, especially at the earliest, most computationally-critical exits. The learned geometry also provides a principled measure of uncertainty, enabling a novel triggering mechanism that makes the overall system both more efficient and more accurate than conventional EE models and standard backbones without early exits.
[580] Feedback-driven Retrieval-augmented Audio Generation with Large Audio Language Models
Junqi Zhao, Chenxing Li, Jinzheng Zhao, Rilin Chen, Dong Yu, Mark D. Plumbley, Wenwu Wang
Main category: cs.SD
TL;DR: A feedback-driven RAG approach using Large Audio Language Models to improve text-to-audio generation by identifying missing sound events and retrieving relevant concepts from external databases.
Details
Motivation: To address the problem of missing or imperfect synthesis of specific sound events in text-to-audio generation, which existing methods struggle with.Method: Utilizes Large Audio Language Models to analyze audio generation outputs, retrieve difficult-to-generate concepts from an external database, and incorporate retrieved information into the generation process.
Result: The method enhances LALMs’ ability to identify missing sound events and delivers improvements across different models, outperforming existing RAG-specialized approaches.
Conclusion: The proposed feedback-driven RAG approach effectively improves text-to-audio generation quality by leveraging external knowledge retrieval through Large Audio Language Models.
Abstract: We propose a general feedback-driven retrieval-augmented generation (RAG) approach that leverages Large Audio Language Models (LALMs) to address the missing or imperfect synthesis of specific sound events in text-to-audio (TTA) generation. Unlike previous RAG-based TTA methods that typically train specialized models from scratch, we utilize LALMs to analyze audio generation outputs, retrieve concepts that pre-trained models struggle to generate from an external database, and incorporate the retrieved information into the generation process. Experimental results show that our method not only enhances the ability of LALMs to identify missing sound events but also delivers improvements across different models, outperforming existing RAG-specialized approaches.
[581] Prevailing Research Areas for Music AI in the Era of Foundation Models
Megan Wei, Mateusz Modrzejewski, Aswin Sivaraman, Dorien Herremans
Main category: cs.SD
TL;DR: This paper surveys current research frontiers in music AI, identifying key unexplored areas including foundational models, explainability, multimodal systems, dataset limitations, model efficiency, and applied generative models with their copyright implications.
Details
Motivation: With the rapid growth of music AI applications and AI-generated music becoming mainstream, there's a need to identify what research frontiers remain unexplored in the music AI community.Method: The paper provides a comprehensive survey and analysis of current music AI research landscape, examining foundational representation models, explainability efforts, multimodal systems evolution, dataset limitations, model efficiency, and applied generative models across various applications.
Result: The survey identifies several promising research directions including: foundational models with explainability, multimodal systems, addressing dataset limitations, improving model efficiency, generative models with better evaluation and controllability, and copyright protection strategies for artists.
Conclusion: While not exhaustive, this survey illuminates promising research directions enabled by recent developments in music foundation models, highlighting opportunities in both technical foundations and applied domains while addressing important copyright implications.
Abstract: Parallel to rapid advancements in foundation model research, the past few years have witnessed a surge in music AI applications. As AI-generated and AI-augmented music become increasingly mainstream, many researchers in the music AI community may wonder: what research frontiers remain unexplored? This paper outlines several key areas within music AI research that present significant opportunities for further investigation. We begin by examining foundational representation models and highlight emerging efforts toward explainability and interpretability. We then discuss the evolution toward multimodal systems, provide an overview of the current landscape of music datasets and their limitations, and address the growing importance of model efficiency in both training and deployment. Next, we explore applied directions, focusing first on generative models. We review recent systems, their computational constraints, and persistent challenges related to evaluation and controllability. We then examine extensions of these generative approaches to multimodal settings and their integration into artists’ workflows, including applications in music editing, captioning, production, transcription, source separation, performance, discovery, and education. Finally, we explore copyright implications of generative music and propose strategies to safeguard artist rights. While not exhaustive, this survey aims to illuminate promising research directions enabled by recent developments in music foundation models.
[582] Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play
Jiatong Shi, Jionghao Han, Yichen Lu, Santiago Pascual, Pengfei Wu, Chenye Cui, Shinji Watanabe, Chao Weng, Cong Zhou
Main category: cs.SD
TL;DR: Speech-DRAME is a unified framework for evaluating speech role-play, featuring a bilingual evaluation benchmark, a fine-tuned evaluation model that outperforms zero-shot ALLMs, and a role-play benchmark for comparing speech foundation models.
Details
Motivation: Current speech role-play evaluation methods using audio LLMs as zero-shot judges miss paralinguistic cues, collapse multiple aspects into coarse scores, and rely on synthetic speech references that don't reflect real-world roles.Method: Speech-DRAME introduces three components: (1) Speech-DRAME-EvalBench with human-annotated bilingual data, (2) DRAME-Eval fine-tuned evaluation model, and (3) Speech-DRAME-RoleBench for comparing speech foundation models. It uses two complementary evaluation strategies: Archetype Evaluation (top-down) and Realism Evaluation (bottom-up).
Result: DRAME-Eval achieves significantly stronger agreement with human ratings compared to zero-shot ALLMs, with Pearson correlation improving from 0.480 to 0.629 in archetypes and from 0.390 to 0.625 in realism evaluation.
Conclusion: Speech-DRAME provides the first comprehensive, reproducible foundation for assessing spoken role-play by integrating transparent benchmark resources, modeling approaches, and system-level evaluation.
Abstract: Role-play has become a key testbed for generative models, expanding from text-only dialogue to multimodal interaction. Extending role-play to speech captures prosody, emotion, and delivery, but also poses new evaluation challenges. Current pipelines often use audio large language models (ALLMs) as zero-shot judges, which miss paralinguistic cues, collapse multiple aspects into coarse scores, and rely on synthetic speech references that fail to reflect real-world roles. We present Speech-DRAME, a unified framework that contributes at three levels: (i) Speech-DRAME-EvalBench, an evaluation benchmark with bilingual human-annotated data and protocols for training and testing speech evaluation models (SEMs), (ii) DRAME-Eval, a fine-tuned evaluation model, which substantially outperforms zero-shot and few-shot ALLMs, and (iii) Speech-DRAME-RoleBench, a speech role-play benchmark that leverages DRAME-Eval as an automatic judge to compare speech foundation models (SFMs). Speech-DRAME distinguishes between two complementary evaluation strategies: Archetype Evaluation, a top-down approach measuring adherence to broad role archetypes, and Realism Evaluation, a bottom-up approach grounded in real human speech that emphasizes nuanced role quality. Compared to zero-shot ALLM judges, DRAME-Eval achieves stronger agreement with human ratings (Pearson correlation from 0.480 to 0.629 in archetypes, and 0.390 to 0.625 in realism). By integrating transparent benchmark resources, modeling approaches, and system-level evaluation, Speech-DRAME provides the first comprehensive, reproducible foundation for assessing spoken role-play.
[583] The Ghost in the Keys: A Disklavier Demo for Human-AI Musical Co-Creativity
Louis Bradshaw, Alexander Spangher, Stella Biderman, Simon Colton
Main category: cs.SD
TL;DR: Aria-Duet enables real-time musical duets between a human pianist and a generative AI model using a Yamaha Disklavier, allowing turn-taking collaboration with low-latency interaction.
Details
Motivation: Current generative models for music composition use text-prompting, which creates an asynchronous workflow disconnected from the embodied, responsive nature of instrumental performance.Method: Interactive system using a Yamaha Disklavier as shared physical interface for turn-taking collaboration - human performs, signals handover, and model generates coherent continuation performed acoustically on the piano.
Result: The model can maintain stylistic semantics and develop coherent phrasal ideas, demonstrating musically sophisticated dialogue capabilities.
Conclusion: Such embodied systems can engage in sophisticated musical dialogue and open a promising new path for human-AI co-creation.
Abstract: While generative models for music composition are increasingly capable, their adoption by musicians is hindered by text-prompting, an asynchronous workflow disconnected from the embodied, responsive nature of instrumental performance. To address this, we introduce Aria-Duet, an interactive system facilitating a real-time musical duet between a human pianist and Aria, a state-of-the-art generative model, using a Yamaha Disklavier as a shared physical interface. The framework enables a turn-taking collaboration: the user performs, signals a handover, and the model generates a coherent continuation performed acoustically on the piano. Beyond describing the technical architecture enabling this low-latency interaction, we analyze the system’s output from a musicological perspective, finding the model can maintain stylistic semantics and develop coherent phrasal ideas, demonstrating that such embodied systems can engage in musically sophisticated dialogue and open a promising new path for human-AI co-creation.
[584] ADNAC: Audio Denoiser using Neural Audio Codec
Daniel Jimon, Mircea Vaida, Adriana Stan
Main category: cs.SD
TL;DR: This paper presents a proof-of-concept for adapting the Descript Audio Codec (DAC) for music denoising using a custom dataset and multi-objective loss function.
Details
Motivation: Audio denoising is critical for enhancing intelligibility and fidelity in applications like restoring musical recordings, overcoming limitations of traditional architectures like U-Nets.Method: Adapting a state-of-the-art neural audio codec (DAC) trained on a large-scale custom-synthesized dataset from diverse sources, using a multi-objective loss function combining time-domain, spectral, and signal-level fidelity metrics.
Result: The paper presents a proof-of-concept for high-fidelity, generative audio restoration using the adapted DAC model.
Conclusion: This work successfully demonstrates a proof-of-concept approach for music denoising using neural audio codecs, paving the way for high-fidelity audio restoration.
Abstract: Audio denoising is critical in signal processing, enhancing intelligibility and fidelity for applications like restoring musical recordings. This paper presents a proof-of-concept for adapting a state-of-the-art neural audio codec, the Descript Audio Codec (DAC), for music denoising. This work overcomes the limitations of traditional architectures like U-Nets by training the model on a large-scale, custom-synthesized dataset built from diverse sources. Training is guided by a multi-objective loss function that combines time-domain, spectral, and signal-level fidelity metrics. Ultimately, this paper aims to present a PoC for high-fidelity, generative audio restoration.
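A minimal sketch of a multi-objective denoising loss combining the three families of terms named in the abstract (time-domain, spectral, and signal-level fidelity). The specific choices here, L1 waveform error, STFT-magnitude L1, and negative SI-SDR with hand-picked weights, are assumptions rather than the paper's exact loss.

```python
import torch

def denoising_loss(pred, target, n_fft=1024, hop=256,
                   w_time=1.0, w_spec=1.0, w_sisdr=0.1):
    """Mix time-domain, spectral, and signal-level terms into one training loss."""
    time_loss = (pred - target).abs().mean()                        # time-domain fidelity

    window = torch.hann_window(n_fft, device=pred.device)
    stft = lambda x: torch.stft(x, n_fft, hop_length=hop, window=window,
                                return_complex=True).abs()
    spec_loss = (stft(pred) - stft(target)).abs().mean()            # spectral fidelity

    # negative scale-invariant SDR as a signal-level fidelity term
    alpha = (pred * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + 1e-8)
    err = pred - alpha * target
    sisdr = 10 * torch.log10((alpha * target).pow(2).sum(-1) / (err.pow(2).sum(-1) + 1e-8) + 1e-8)

    return w_time * time_loss + w_spec * spec_loss - w_sisdr * sisdr.mean()

# usage with a batch of 1-second, 16 kHz clips
pred, target = torch.randn(2, 16000), torch.randn(2, 16000)
print(denoising_loss(pred, target))
```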
[585] Temporal Feature Learning in Weakly Labelled Bioacoustic Cetacean Datasets via a Variational Autoencoder and Temporal Convolutional Network: An Interdisciplinary Approach
Laia Garrobé Fonollosa, Douglas Gillespie, Lina Stankovic, Vladimir Stankovic, Luke Rendell
Main category: cs.SD
TL;DR: Proposes an interdisciplinary framework using VAE for feature extraction and TCN for classification to handle weakly labeled bioacoustics data from PAM, achieving AUC >0.9 for sperm whale click detection.
Details
Motivation: Address challenges in bioacoustics classification including limited reliable labels, biological complexity of cetacean vocalizations, and noise masking in PAM data that often results in weak labels.Method: Framework combining dataset standardization, feature extraction via Variational Autoencoders (VAE), and classification via Temporal Convolutional Networks (TCN) to process lengthy continuous audio segments without manual threshold setting.
Result: TCN demonstrated robust classification with AUC scores exceeding 0.9 for sperm whale click train detection in 4-minute recordings, outperforming traditional expert handpicked features.
Conclusion: The proposed VAE-TCN framework effectively handles weakly labeled bioacoustics data and captures complex temporal patterns without requiring strong labeling or manual threshold setting.
Abstract: Bioacoustics data from Passive acoustic monitoring (PAM) poses a unique set of challenges for classification, particularly the limited availability of complete and reliable labels in datasets due to annotation uncertainty, biological complexity due to the heterogeneity in duration of cetacean vocalizations, and masking of target sounds due to environmental and anthropogenic noise. This means that data is often weakly labelled, with annotations indicating presence/absence of species over several minutes. In order to effectively capture the complex temporal patterns and key features of lengthy continuous audio segments, we propose an interdisciplinary framework comprising dataset standardisation, feature extraction via Variational Autoencoders (VAE) and classification via Temporal Convolutional Networks (TCN). This approach eliminates the necessity for manual threshold setting or time-consuming strong labelling. To demonstrate the effectiveness of our approach, we use sperm whale (Physeter macrocephalus) click trains in 4-minute recordings as a case study, from a dataset comprising diverse sources and deployment conditions to maximise generalisability. The value of feature extraction via the VAE is demonstrated by comparing classification performance against the traditional and explainable approach of expert handpicking of features. The TCN demonstrated robust classification capabilities achieving AUC scores exceeding 0.9.
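To illustrate the classification stage, here is a minimal dilated, causal TCN block of the kind typically stacked to cover long weakly-labelled segments; channel sizes, kernel width, and the dilation schedule are assumptions, as the paper's exact configuration is not given in the summary.

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """One dilated, causal 1-D convolution block with a residual connection."""

    def __init__(self, channels=64, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # left padding keeps it causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x):                                # x: (batch, channels, time)
        out = self.act(self.conv(nn.functional.pad(x, (self.pad, 0))))
        return out + x

# usage: stack blocks with growing dilation over VAE-encoded frames of a 4-minute clip
frames = torch.randn(1, 64, 2400)        # e.g. one latent vector per 0.1 s (an assumption)
tcn = nn.Sequential(*[TCNBlock(dilation=d) for d in (1, 2, 4, 8)])
print(tcn(frames).shape)                  # torch.Size([1, 64, 2400])
```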
[586] AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement
Junan Zhang, Jing Yang, Zihao Fang, Yuancheng Wang, Zehua Zhang, Zhuo Wang, Fan Fan, Zhizheng Wu
Main category: cs.SD
TL;DR: AnyEnhance is a unified generative model for voice enhancement that handles both speech and singing voices, supporting multiple enhancement tasks simultaneously without fine-tuning through masked generative modeling with prompt-guidance and self-critic mechanisms.
Details
Motivation: To create a unified model that can handle both speech and singing voice enhancement across multiple tasks (denoising, dereverberation, declipping, super-resolution, target speaker extraction) without requiring task-specific fine-tuning.Method: Uses a masked generative model with prompt-guidance mechanism for in-context learning to accept reference speaker timbre, and a self-critic mechanism for iterative self-assessment and refinement during generation.
Result: Extensive experiments show AnyEnhance outperforms existing methods in both objective metrics and subjective listening tests across various enhancement tasks.
Conclusion: AnyEnhance provides an effective unified solution for voice enhancement that handles multiple tasks simultaneously with superior performance, enabled by prompt-guidance and self-critic mechanisms.
Abstract: We introduce AnyEnhance, a unified generative model for voice enhancement that processes both speech and singing voices. Based on a masked generative model, AnyEnhance is capable of handling both speech and singing voices, supporting a wide range of enhancement tasks including denoising, dereverberation, declipping, super-resolution, and target speaker extraction, all simultaneously and without fine-tuning. AnyEnhance introduces a prompt-guidance mechanism for in-context learning, which allows the model to natively accept a reference speaker’s timbre. In this way, it could boost enhancement performance when a reference audio is available and enable the target speaker extraction task without altering the underlying architecture. Moreover, we also introduce a self-critic mechanism into the generative process for masked generative models, yielding higher-quality outputs through iterative self-assessment and refinement. Extensive experiments on various enhancement tasks demonstrate AnyEnhance outperforms existing methods in terms of both objective metrics and subjective listening tests. Demo audios are publicly available at https://amphionspace.github.io/anyenhance. An open-source implementation is provided at https://github.com/viewfinder-annn/anyenhance-v1-ccf-aatc.
[587] As Good as It KAN Get: High-Fidelity Audio Representation
Patryk Marszałek, Maciej Rut, Piotr Kawa, Przemysław Spurek, Piotr Syga
Main category: cs.SD
TL;DR: Kolmogorov-Arnold Network (KAN) is introduced as an effective implicit neural representation for audio, achieving superior performance over previous methods, with FewSound hypernetwork architecture further enhancing parameter updates.
Details
Motivation: Implicit neural representations (INR) have shown promise for multimedia data encoding but have limited applications in audio signals, creating a need for more effective audio representation methods.Method: Proposed Kolmogorov-Arnold Network (KAN) using learnable activation functions as an INR model, and FewSound - a hypernetwork-based architecture for enhancing INR parameter updates.
Result: KAN achieved lowest Log-Spectral Distance of 1.29 and highest Perceptual Evaluation of Speech Quality of 3.57 for 1.5s audio. FewSound outperformed state-of-the-art HyperSound with 33.3% improvement in MSE and 60.87% in SI-SNR.
Conclusion: KAN is a robust and adaptable audio representation with potential for scalability and integration into various hypernetwork frameworks, demonstrating superior performance in audio encoding tasks.
Abstract: Implicit neural representations (INR) have gained prominence for efficiently encoding multimedia data, yet their applications in audio signals remain limited. This study introduces the Kolmogorov-Arnold Network (KAN), a novel architecture using learnable activation functions, as an effective INR model for audio representation. KAN demonstrates superior perceptual performance over previous INRs, achieving the lowest Log-Spectral Distance of 1.29 and the highest Perceptual Evaluation of Speech Quality of 3.57 for 1.5 s audio. To extend KAN’s utility, we propose FewSound, a hypernetwork-based architecture that enhances INR parameter updates. FewSound outperforms the state-of-the-art HyperSound, with a 33.3% improvement in MSE and 60.87% in SI-SNR. These results show that KAN is a robust and adaptable audio representation with the potential for scalability and integration into various hypernetwork frameworks. The source code can be accessed at https://github.com/gmum/fewsound.git.
[588] Music Arena: Live Evaluation for Text-to-Music
Yonghyun Kim, Wayne Chi, Anastasios N. Angelopoulos, Wei-Lin Chiang, Koichi Saito, Shinji Watanabe, Yuki Mitsufuji, Chris Donahue
Main category: cs.SD
TL;DR: Music Arena is an open platform for scalable human preference evaluation of text-to-music models, featuring live evaluation, LLM-based routing for heterogeneous systems, and detailed preference collection with privacy-protected data releases.
Details
Motivation: Human preference evaluation is the gold standard for text-to-music models but is expensive, difficult to compare across studies, and lacks an open renewable source of preference data for model alignment and metric improvement.Method: Real-world users input custom text prompts and compare outputs from two TTM systems, with preferences used to compile a leaderboard. Features include LLM-based routing for heterogeneous systems, collection of detailed preferences (listening data and natural language feedback), and rolling data release with privacy guarantees.
Result: Music Arena provides a standardized evaluation protocol, transparent data access policies, and music-specific features that address key challenges in the TTM ecosystem while demonstrating domain-specific adaptation of live evaluation.
Conclusion: Music Arena successfully fills gaps in TTM evaluation by offering scalable human preference assessment through an open platform with music-tailored features, renewable preference data, and increased transparency for the research community.
Abstract: We present Music Arena, an open platform for scalable human preference evaluation of text-to-music (TTM) models. Soliciting human preferences via listening studies is the gold standard for evaluation in TTM, but these studies are expensive to conduct and difficult to compare, as study protocols may differ across systems. Moreover, human preferences might help researchers align their TTM systems or improve automatic evaluation metrics, but an open and renewable source of preferences does not currently exist. We aim to fill these gaps by offering live evaluation for TTM. In Music Arena, real-world users input text prompts of their choosing and compare outputs from two TTM systems, and their preferences are used to compile a leaderboard. While Music Arena follows recent evaluation trends in other AI domains, we also design it with key features tailored to music: an LLM-based routing system to navigate the heterogeneous type signatures of TTM systems, and the collection of detailed preferences including listening data and natural language feedback. We also propose a rolling data release policy with user privacy guarantees, providing a renewable source of preference data and increasing platform transparency. Through its standardized evaluation protocol, transparent data access policies, and music-specific features, Music Arena not only addresses key challenges in the TTM ecosystem but also demonstrates how live evaluation can be thoughtfully adapted to unique characteristics of specific AI domains. Music Arena is available at: https://music-arena.org . Preference data is available at: https://huggingface.co/music-arena .
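The abstract does not specify how pairwise preferences are aggregated into a leaderboard; a Bradley-Terry fit is one standard choice, sketched below with NumPy. The vote format and iteration count are illustrative assumptions.

```python
# Hypothetical sketch: fit Bradley-Terry strengths from pairwise A-vs-B votes.
# The ranking model is an assumption; Music Arena's actual aggregation may differ.
import numpy as np

def bradley_terry(votes, n_models, iters=200):
    """votes: list of (winner_idx, loser_idx). Returns a strength score per model."""
    wins = np.zeros((n_models, n_models))
    for w, l in votes:
        wins[w, l] += 1
    p = np.ones(n_models)
    for _ in range(iters):                     # standard minorization-maximization update
        new_p = np.zeros(n_models)
        for i in range(n_models):
            num = wins[i].sum()                # total wins of model i
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                      for j in range(n_models) if j != i)
            new_p[i] = num / den if den > 0 else p[i]
        p = new_p / new_p.sum()
    return p

votes = [(0, 1), (0, 2), (1, 2), (0, 1), (2, 1)]   # toy pairwise outcomes
print(bradley_terry(votes, n_models=3))            # higher = preferred more often
```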
cs.LG
[589] VRScout: Towards Real-Time, Autonomous Testing of Virtual Reality Games
Yurun Wu, Yousong Sun, Burkhard Wunsche, Jia Wang, Elliott Wen
Main category: cs.LG
TL;DR: VRScout is a deep learning-based autonomous agent for VR game testing that learns from human demonstrations using an Action Chunking Transformer, achieving expert-level performance with real-time inference at 60 FPS.
Details
Motivation: Traditional human-based VR quality assurance is labor-intensive and doesn't scale with industry growth. Automated testing for VR faces unique challenges due to high-dimensional sensory inputs and real-time performance requirements.Method: Uses an enhanced Action Chunking Transformer to predict multi-step action sequences from human demonstrations, with a dynamically adjustable sliding horizon to balance responsiveness and precision.
Result: Achieves expert-level performance on commercial VR titles with limited training data, while maintaining real-time inference at 60 FPS on consumer-grade hardware.
Conclusion: VRScout provides a practical and scalable framework for automated VR game testing, with applications in quality assurance and safety auditing.
Abstract: Virtual Reality (VR) has rapidly become a mainstream platform for gaming and interactive experiences, yet ensuring the quality, safety, and appropriateness of VR content remains a pressing challenge. Traditional human-based quality assurance is labor-intensive and cannot scale with the industry’s rapid growth. While automated testing has been applied to traditional 2D and 3D games, extending it to VR introduces unique difficulties due to high-dimensional sensory inputs and strict real-time performance requirements. We present VRScout, a deep learning-based agent capable of autonomously navigating VR environments and interacting with virtual objects in a human-like and real-time manner. VRScout learns from human demonstrations using an enhanced Action Chunking Transformer that predicts multi-step action sequences. This enables our agent to capture higher-level strategies and generalize across diverse environments. To balance responsiveness and precision, we introduce a dynamically adjustable sliding horizon that adapts the agent’s temporal context at runtime. We evaluate VRScout on commercial VR titles and show that it achieves expert-level performance with only limited training data, while maintaining real-time inference at 60 FPS on consumer-grade hardware. These results position VRScout as a practical and scalable framework for automated VR game testing, with direct applications in both quality assurance and safety auditing.
[590] Feature-Guided SAE Steering for Refusal-Rate Control using Contrasting Prompts
Samaksh Bhargav, Zining Zhu
Main category: cs.LG
TL;DR: Using Sparse Autoencoders (SAEs) with principled feature selection methods enables targeted steering of LLMs to improve safety by 18.9% while increasing utility by 11.1%, overcoming traditional safety-utility tradeoffs.
Details
Motivation: Current LLM safety methods require expensive weight adjustments and lack systematic feature selection approaches. There's a need for efficient methods that can guide LLMs to recognize unsafe prompts without compromising utility.Method: Used Sparse Autoencoders (SAEs) with an innovative contrasting prompt method to identify optimal steering features. Tested on Llama-3 8B using AI-Generated Prompts Dataset and Air Bench eu-dataset for feature selection.
Result: Achieved 18.9% improvement in safety performance while simultaneously increasing utility by 11.1%, demonstrating that targeted SAE steering can overcome traditional safety-utility tradeoffs.
Conclusion: Targeted SAE steering with principled feature selection methods provides an effective solution for improving LLM safety without sacrificing utility, representing a significant advancement over previous approaches.
Abstract: Large Language Model (LLM) deployment requires guiding the LLM to recognize and not answer unsafe prompts while complying with safe prompts. Previous methods for achieving this require adjusting model weights along with other expensive procedures. While recent advances in Sparse Autoencoders (SAEs) have enabled interpretable feature extraction from LLMs, existing approaches lack systematic feature selection methods and principled evaluation of safety-utility tradeoffs. We explore different steering features and steering strengths with SAEs to provide a solution. Using an innovative contrasting prompt method with the AI-Generated Prompts Dataset from teknium/OpenHermes-2p5-Mistral-7B and the Air Bench eu-dataset to efficiently choose the best features in the model to steer, we tested this method on Llama-3 8B. Using this method, our approach achieves an 18.9% improvement in safety performance while simultaneously increasing utility by 11.1%, demonstrating that targeted SAE steering can overcome traditional safety-utility tradeoffs when optimal features are identified through principled selection methods.
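A schematic NumPy sketch of the contrasting-prompt idea as described: compare mean SAE feature activations on unsafe versus safe prompts, select the most discriminative features, and form a steering vector from their decoder directions. The array shapes, helper names, and steering rule are assumptions for illustration, not the authors' implementation.

```python
# Schematic sketch of contrasting-prompt feature selection for SAE steering.
# Shapes, helper names, and the exact steering rule are illustrative assumptions.
import numpy as np

def select_steering_features(acts_unsafe, acts_safe, top_k=5):
    """acts_*: (n_prompts, n_sae_features) SAE activations averaged over tokens.
    Returns indices of features most elevated on unsafe prompts."""
    diff = acts_unsafe.mean(axis=0) - acts_safe.mean(axis=0)
    return np.argsort(diff)[::-1][:top_k]

def steering_vector(decoder_weights, feature_ids, strength):
    """decoder_weights: (n_sae_features, d_model) SAE decoder matrix. The steering
    vector is a scaled sum of the selected features' decoder directions."""
    return strength * decoder_weights[feature_ids].sum(axis=0)

# Toy usage: the resulting vector would be added to the residual stream at the hooked layer.
rng = np.random.default_rng(0)
acts_unsafe = rng.random((32, 1024))
acts_safe = rng.random((32, 1024))
decoder = rng.standard_normal((1024, 4096))
feats = select_steering_features(acts_unsafe, acts_safe, top_k=5)
v = steering_vector(decoder, feats, strength=4.0)   # shape (4096,), added per token
```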
[591] Probing Knowledge Holes in Unlearned LLMs
Myeongseob Ko, Hoang Anh Just, Charles Fleming, Ming Jin, Ruoxi Jia
Main category: cs.LG
TL;DR: Machine unlearning techniques create unintended “knowledge holes” - losses of benign knowledge that standard benchmarks fail to detect, with up to 98.7% of generated test cases showing irrelevant responses from unlearned models.
Details
Motivation: To investigate the hidden costs of machine unlearning techniques, which effectively remove unwanted content but may inadvertently remove benign knowledge that standard evaluation benchmarks don't capture.Method: Proposed a test case generation framework that explores both immediate neighbors of unlearned content and broader areas of potential failures to probe where unlearned models reveal knowledge holes.
Result: Evaluation showed significant hidden costs: up to 98.7% of test cases yielded irrelevant or nonsensical responses from unlearned models, despite being answerable by the pretrained model.
Conclusion: Findings necessitate rethinking the conventional approach to evaluating knowledge preservation in unlearning, moving beyond standard, static benchmarks to detect unintended knowledge losses.
Abstract: Machine unlearning has emerged as a prevalent technical solution for selectively removing unwanted knowledge absorbed during pre-training, without requiring full retraining. While recent unlearning techniques can effectively remove undesirable content without severely compromising performance on standard benchmarks, we find that they may inadvertently create “knowledge holes” – unintended losses of benign knowledge that standard benchmarks fail to capture. To probe where unlearned models reveal knowledge holes, we propose a test case generation framework that explores both immediate neighbors of unlearned content and broader areas of potential failures. Our evaluation demonstrates significant hidden costs of unlearning: up to 98.7% of the test cases yield irrelevant or nonsensical responses from unlearned models, despite being answerable by the pretrained model. These findings necessitate rethinking the conventional approach to evaluating knowledge preservation in unlearning, moving beyond standard, static benchmarks.
[592] From Uniform to Adaptive: General Skip-Block Mechanisms for Efficient PDE Neural Operators
Lei Liu, Zhongyi Yu, Hong Wang, Huanshuo Dong, Haiyang Xin, Hongwei Zhao, Bin Li
Main category: cs.LG
TL;DR: Skip-Block Routing (SBR) is a framework for Transformer-based neural operators that reduces computational cost by 50% while maintaining accuracy, by adaptively routing tokens based on complexity.
Details
Motivation: Current neural operators for solving PDEs impose uniform computational cost despite varying complexity in physical fields, leading to inefficiency in large-scale engineering applications.Method: SBR uses a routing mechanism to learn token complexity and ranking, then selectively passes fewer tokens in later layers based on this ranking to focus processing capacity on more complex regions.
Result: SBR reduces computational cost by approximately 50% in FLOPs, delivers up to 2x faster inference, and maintains accuracy while being compatible with various neural operators.
Conclusion: SBR effectively addresses the computational inefficiency in neural operators by adaptively allocating processing resources based on token complexity, making it suitable for large-scale engineering tasks.
Abstract: In recent years, Neural Operators (NOs) have gradually emerged as a popular approach for solving Partial Differential Equations (PDEs). However, their application to large-scale engineering tasks suffers from significant computational overhead. Moreover, the fact that current models impose a uniform computational cost while physical fields exhibit vastly different complexities constitutes a fundamental mismatch, which is the root of this inefficiency. For instance, in turbulent flows, intricate vortex regions require deeper network processing compared to stable flows. To address this, we introduce Skip-Block Routing (SBR), a general framework designed for Transformer-based neural operators that can be integrated into their multi-layer architectures. First, SBR uses a routing mechanism to learn the complexity and ranking of tokens, which is then applied during inference. Then, in later layers, it decides how many tokens are passed forward based on this ranking. This way, the model focuses more processing capacity on the tokens that are more complex. Experiments demonstrate that SBR integrates seamlessly into various neural operators. Our method reduces computational cost by approximately 50% in terms of Floating Point Operations (FLOPs), while still delivering up to 2x faster inference without sacrificing accuracy.
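A minimal PyTorch sketch of the routing idea as described: score tokens, process only the top fraction in a later block, and pass the rest through unchanged. The router head, keep ratio, and the plain transformer layer standing in for a neural-operator block are illustrative assumptions.

```python
# Minimal sketch of token-level skip-block routing (illustrative, not the paper's code).
import torch
import torch.nn as nn

class SkipBlock(nn.Module):
    """Wraps a transformer block: only the top `keep_ratio` most 'complex' tokens
    are processed; the rest are passed through unchanged (skip connection)."""
    def __init__(self, block, d_model, keep_ratio=0.5):
        super().__init__()
        self.block = block
        self.router = nn.Linear(d_model, 1)    # learns a per-token complexity score
        self.keep_ratio = keep_ratio

    def forward(self, x):                      # x: (B, N, d_model)
        B, N, D = x.shape
        scores = self.router(x).squeeze(-1)    # (B, N)
        k = max(1, int(N * self.keep_ratio))
        top = scores.topk(k, dim=1).indices    # indices of the k most complex tokens
        out = x.clone()
        for b in range(B):                     # process only the selected tokens
            sel = top[b]
            out[b, sel] = self.block(x[b, sel].unsqueeze(0)).squeeze(0)
        return out

# Usage with a plain transformer encoder layer standing in for a neural-operator block.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
skip = SkipBlock(layer, d_model=64, keep_ratio=0.5)
y = skip(torch.randn(2, 128, 64))
```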
[593] Neural Architecture Search for global multi-step Forecasting of Energy Production Time Series
Georg Velev, Stefan Lessmann
Main category: cs.LG
TL;DR: A neural architecture search (NAS) framework is developed for automated discovery of time series models that balance computational efficiency, predictive performance, and generalization for short-term energy production forecasting.
Details
Motivation: The energy sector needs accurate and efficient short-term forecasting methods that can handle operational constraints, temporal dynamics in data, and generalize to unseen data, while avoiding manual configuration errors and computational bottlenecks.Method: Design a NAS-based framework with a search space of efficient components that capture energy time series patterns, and a novel objective function that considers temporal generalization and exploration of the high-dimensional search space.
Result: An ensemble of lightweight architectures discovered with NAS outperforms state-of-the-art techniques like Transformers and pre-trained forecasting models in both efficiency and accuracy on energy production time series.
Conclusion: The NAS-based approach successfully automates model discovery for energy forecasting, achieving superior balance between computational efficiency and predictive performance compared to existing methods.
Abstract: The dynamic energy sector requires both predictive accuracy and runtime efficiency for short-term forecasting of energy generation under operational constraints, where timely and precise predictions are crucial. The manual configuration of complex methods, which can generate accurate global multi-step predictions without suffering from a computational bottleneck, represents a procedure with significant time requirements and high risk for human-made errors. A further intricacy arises from the temporal dynamics present in energy-related data. Additionally, the generalization to unseen data is imperative for continuously deploying forecasting techniques over time. To overcome these challenges, in this research, we design a neural architecture search (NAS)-based framework for the automated discovery of time series models that strike a balance between computational efficiency, predictive performance, and generalization power for the global, multi-step short-term forecasting of energy production time series. In particular, we introduce a search space consisting only of efficient components, which can capture distinctive patterns of energy time series. Furthermore, we formulate a novel objective function that accounts for performance generalization in temporal context and the maximal exploration of different regions of our high-dimensional search space. The results obtained on energy production time series show that an ensemble of lightweight architectures discovered with NAS outperforms state-of-the-art techniques, such as Transformers, as well as pre-trained forecasting models, in terms of both efficiency and accuracy.
[594] Semi-Supervised Preference Optimization with Limited Feedback
Seonggyun Lee, Sungjun Lim, Seojin Park, Soeun Cheon, Kyungwoo Song
Main category: cs.LG
TL;DR: SSPO enables semi-supervised preference optimization by using a small amount of labeled preference data and large unlabeled datasets, reducing data acquisition costs while maintaining alignment performance.
Details
Motivation: Current preference optimization methods require substantial labeled feedback data, leading to high resource costs. SSPO addresses this by leveraging both labeled and unlabeled data.Method: Proves existence of optimal reward threshold to separate winning/losing responses, enabling principled pseudo-labeling of unpaired data. Uses pseudo-labels to distill latent preferences from unlabeled data.
Result: SSPO with Llama3-8B-Instruct on just 1% of UltraFeedback consistently outperforms baselines trained on 10% of UltraFeedback, demonstrating remarkable data efficiency.
Conclusion: SSPO effectively reduces dependency on expensive labeled data while maintaining human alignment performance through semi-supervised learning approach.
Abstract: The field of preference optimization has made outstanding contributions to the alignment of language models with human preferences. Despite these advancements, recent methods still rely heavily on substantial paired (labeled) feedback data, leading to substantial resource expenditures. To address these challenges, we study the problem of Semi-Supervised Preference Optimization (SSPO) in which the idea is to learn from both a small number of pairwise preference labels and a large pool of unpaired samples simultaneously. Our key theoretical contribution proves the existence of an optimal reward threshold capable of separating winning and losing responses with high probability, which enables a principled pseudo-labeling of unpaired data. By leveraging these pseudo-labels, SSPO effectively distills latent preferences from large-scale unpaired data, thus maintaining human alignment while drastically reducing acquisition costs. Extensive experiments across datasets validate this remarkable data efficiency; for instance, SSPO trained with Llama3-8B-Instruct on just 1% of UltraFeedback consistently surpasses strong baselines trained on 10% of UltraFeedback.
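A schematic sketch of the core mechanism: a reward threshold splits unpaired responses into pseudo-winners and pseudo-losers, which then feed a standard preference loss. The helper names, the constant threshold, and the DPO-style loss form are placeholders, not the paper's exact formulation.

```python
# Schematic sketch of threshold-based pseudo-labelling for semi-supervised
# preference optimisation. Helper names and the loss form are illustrative.
import torch
import torch.nn.functional as F

def pseudo_label(rewards, threshold):
    """Split unpaired responses into pseudo-winners / pseudo-losers by reward score."""
    winners = [i for i, r in enumerate(rewards) if r >= threshold]
    losers = [i for i, r in enumerate(rewards) if r < threshold]
    return winners, losers

def dpo_style_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """Standard DPO objective on (pseudo-)pairs: policy vs. frozen reference log-probs."""
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    return -F.logsigmoid(margin).mean()

# Toy usage: rewards for unpaired responses; the threshold would be estimated from the
# small labelled set (here it is just a constant placeholder).
rewards = [0.8, 0.1, 0.6, -0.3]
winners, losers = pseudo_label(rewards, threshold=0.5)
loss = dpo_style_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                      torch.tensor([-6.0]), torch.tensor([-8.0]))
```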
[595] Physics-Informed Neural Network Frameworks for the Analysis of Engineering and Biological Dynamical Systems Governed by Ordinary Differential Equations
Tyrus Whitman, Andrew Particka, Christopher Diers, Ian Griffin, Charuka Wickramasinghe, Pradeep Ranaweera
Main category: cs.LG
TL;DR: PINNs methodology validated for solving engineering and biological ODE systems, showing superior performance in challenging numerical scenarios through proper loss balancing and hyperparameter tuning.
Details
Motivation: Traditional numerical methods struggle with high stiffness, shocks, irregular domains, singular perturbations, high dimensions, or boundary discontinuities in ODE systems.Method: Physics-Informed Neural Networks (PINNs) that embed physical laws into learning process, with systematic evaluation of accuracy, training efficiency, and generalization using classical ODE problems as testbeds.
Result: PINNs achieve superior results when loss function components (data loss, initial condition loss, residual loss) are properly balanced and hyperparameters are systematically tuned.
Conclusion: PINNs offer powerful approach for challenging ODE problems, with careful weighting of loss components and hyperparameter tuning being crucial for convergence to correct solutions.
Abstract: In this study, we present and validate the predictive capability of the Physics-Informed Neural Networks (PINNs) methodology for solving a variety of engineering and biological dynamical systems governed by ordinary differential equations (ODEs). While traditional numerical methods are effective for many ODEs, they often struggle to achieve convergence in problems involving high stiffness, shocks, irregular domains, singular perturbations, high dimensions, or boundary discontinuities. Alternatively, PINNs offer a powerful approach for handling challenging numerical scenarios. In this study, classical ODE problems are employed as controlled testbeds to systematically evaluate the accuracy, training efficiency, and generalization capability of the PINNs framework. Although not a universal solution, PINNs can achieve superior results by embedding physical laws directly into the learning process. We first analyze the existence and uniqueness properties of several benchmark problems and subsequently validate the PINNs methodology on these model systems. Our results demonstrate that for complex problems to converge to correct solutions, the loss function components (data loss, initial condition loss, and residual loss) must be appropriately balanced through careful weighting. We further establish that systematic tuning of hyperparameters, including network depth, layer width, activation functions, learning rate, optimization algorithms, weight initialization schemes, and collocation point sampling, plays a crucial role in achieving accurate solutions. Additionally, embedding prior knowledge and imposing hard constraints on the network architecture, without loss of generality of the ODE system, significantly enhances the predictive capability of PINNs.
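Since the abstract stresses that the data, initial-condition, and residual loss terms must be balanced through careful weighting, a minimal PyTorch PINN for the toy ODE dy/dt = -y with y(0) = 1 makes that structure concrete. The ODE, the weights, and the network size are illustrative choices, not the paper's benchmarks.

```python
# Minimal PINN sketch for dy/dt = -y, y(0) = 1 with explicitly weighted loss terms.
# The ODE, weights, and architecture are illustrative, not the paper's benchmarks.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
w_res, w_ic, w_data = 1.0, 10.0, 1.0           # balancing these weights is the crucial step

t_col = torch.linspace(0, 5, 200).reshape(-1, 1).requires_grad_(True)  # collocation points
t_data = torch.tensor([[0.5], [1.0]])          # a couple of toy "measurements"
y_data = torch.exp(-t_data)                    # exact solution values at those times

for step in range(2000):
    opt.zero_grad()
    y = net(t_col)
    dydt = torch.autograd.grad(y, t_col, torch.ones_like(y), create_graph=True)[0]
    loss_res = ((dydt + y) ** 2).mean()                      # ODE residual: dy/dt + y = 0
    loss_ic = (net(torch.zeros(1, 1)) - 1.0).pow(2).mean()   # initial condition y(0) = 1
    loss_data = ((net(t_data) - y_data) ** 2).mean()         # supervised data term
    loss = w_res * loss_res + w_ic * loss_ic + w_data * loss_data
    loss.backward()
    opt.step()
```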
[596] ReLaX-Net: Reusing Layers for Parameter-Efficient Physical Neural Networks
Kohei Tsuchiyama, Andre Roehm, Takatomo Mihana, Ryoichi Horisaki
Main category: cs.LG
TL;DR: ReLaX-Net proposes a hardware-friendly weight-tying architecture for Physical Neural Networks (PNNs) that uses time-multiplexing to increase network depth and parameter efficiency with minimal hardware modifications.
Details
Motivation: PNNs lag behind digital neural networks in scale and performance due to constraints in trainable parameters, similar to early digital neural networks. The motivation is to develop parameter-efficient architectures for PNNs that can bridge this performance gap.Method: ReLaX-Net employs layer-by-layer time-multiplexing to reuse parameters across multiple layers, requiring only the addition of fast switches to existing PNN hardware. This leverages the time-scale separation between fast dynamic elements and slowly trainable weight elements in PNNs.
Result: Numerical experiments on image classification and NLP tasks show ReLaX-Net improves computational performance with minor hardware modifications. It demonstrates favorable scaling, outperforming equivalent traditional RNNs/DNNs with the same number of parameters.
Conclusion: ReLaX-Net provides an effective approach to enhance PNN performance through parameter reuse and time-multiplexing, offering a path to bridge the performance gap between PNNs and digital neural networks while maintaining hardware efficiency.
Abstract: Physical Neural Networks (PNN) are promising platforms for next-generation computing systems. However, recent advances in digital neural network performance are largely driven by the rapid growth in the number of trainable parameters and, so far, demonstrated PNNs are lagging behind by several orders of magnitude in terms of scale. This mirrors size and performance constraints found in early digital neural networks. In that period, efficient reuse of parameters contributed to the development of parameter-efficient architectures such as convolutional neural networks. In this work, we numerically investigate hardware-friendly weight-tying for PNNs. Crucially, with many PNN systems, there is a time-scale separation between the fast dynamic active elements of the forward pass and the only slowly trainable elements implementing weights and biases. With this in mind, we propose the Reuse of Layers for eXpanding a Neural Network (ReLaX-Net) architecture, which employs a simple layer-by-layer time-multiplexing scheme to increase the effective network depth and use the available parameters efficiently. We only require the addition of fast switches for existing PNNs. We validate ReLaX-Nets via numerical experiments on image classification and natural language processing tasks. Our results show that ReLaX-Net improves computational performance with only minor modifications to a conventional PNN. We observe a favorable scaling, where ReLaX-Nets exceed the performance of equivalent traditional RNNs or DNNs with the same number of parameters.
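The architectural idea of reusing a single slowly trainable layer several times can be sketched in a few lines of PyTorch; the block, the number of reuse steps, and the digital stand-in for the physical hardware are illustrative assumptions.

```python
# Minimal sketch of layer reuse by time-multiplexing: the same (slowly trainable)
# layer is applied several times to increase effective depth. Illustrative only.
import torch
import torch.nn as nn

class ReusedLayerNet(nn.Module):
    def __init__(self, dim=64, n_reuse=4):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())  # the single physical layer
        self.n_reuse = n_reuse                                       # how many times it is reused
        self.head = nn.Linear(dim, 10)

    def forward(self, x):
        for _ in range(self.n_reuse):     # in hardware, a fast switch would route the
            x = self.shared(x)            # signal back through the same physical layer
        return self.head(x)

model = ReusedLayerNet()
print(sum(p.numel() for p in model.parameters()))  # parameter count stays that of one layer
logits = model(torch.randn(8, 64))
```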
[597] DynBERG: Dynamic BERT-based Graph neural network for financial fraud detection
Omkar Kulkarni, Rohitash Chandra
Main category: cs.LG
TL;DR: DynBERG is a novel dynamic graph Transformer model that combines Graph-BERT with GRU to detect financial fraud in cryptocurrency networks, outperforming state-of-the-art methods on the Elliptic dataset.
Details
Motivation: Financial fraud detection in dynamic cryptocurrency networks requires handling evolving structures and directed edges, which existing static graph models like Graph-BERT cannot adequately address.Method: Integrates Graph-BERT with a GRU layer to capture temporal evolution and modifies the algorithm to support directed edges for financial transaction analysis.
Result: Outperforms EvolveGCN before market shutdown and surpasses GCN after the event on the Elliptic Bitcoin dataset, with ablation study confirming GRU’s importance for temporal dynamics.
Conclusion: DynBERG effectively addresses the limitations of static graph models for financial fraud detection by incorporating temporal dynamics and directed edge support, demonstrating superior adaptability to market shifts.
Abstract: Financial fraud detection is critical for maintaining the integrity of financial systems, particularly in decentralised environments such as cryptocurrency networks. Although Graph Convolutional Networks (GCNs) are widely used for financial fraud detection, graph Transformer models such as Graph-BERT are gaining prominence due to their Transformer-based architecture, which mitigates issues such as over-smoothing. Graph-BERT is designed for static graphs and primarily evaluated on citation networks with undirected edges. However, financial transaction networks are inherently dynamic, with evolving structures and directed edges representing the flow of money. To address these challenges, we introduce DynBERG, a novel architecture that integrates Graph-BERT with a Gated Recurrent Unit (GRU) layer to capture temporal evolution over multiple time steps. Additionally, we modify the underlying algorithm to support directed edges, making DynBERG well-suited for dynamic financial transaction analysis. We evaluate our model on the Elliptic dataset, which contains Bitcoin transactions, including all transactions during a major cryptocurrency market event, the Dark Market Shutdown. By assessing DynBERG’s resilience before and after this event, we analyse its ability to adapt to significant market shifts that impact transaction behaviours. Our model is benchmarked against state-of-the-art dynamic graph classification approaches, such as EvolveGCN and GCN, demonstrating superior performance, outperforming EvolveGCN before the market shutdown and surpassing GCN after the event. Additionally, an ablation study highlights the critical role of incorporating a time-series deep learning component, showcasing the effectiveness of GRU in modelling the temporal dynamics of financial transactions.
[598] Adaptive Spatio-Temporal Graphs with Self-Supervised Pretraining for Multi-Horizon Weather Forecasting
Yao Liu
Main category: cs.LG
TL;DR: A self-supervised learning framework using graph neural networks and spatio-temporal structures for improved multi-variable weather prediction, outperforming traditional NWP and deep learning methods.
Details
Motivation: Weather forecasting is challenging due to atmospheric complexity; need for accurate and robust prediction methods that can handle spatio-temporal structures.Method: Integrates GNN for spatial reasoning, self-supervised pretraining for representation learning, and spatio-temporal adaptation mechanism for generalization across forecasting horizons.
Result: Achieves superior performance on ERA5 and MERRA-2 datasets compared to traditional NWP and recent deep learning methods; captures fine-grained meteorological patterns in Beijing and Shanghai.
Conclusion: Provides a scalable and label-efficient solution for future data-driven weather forecasting systems.
Abstract: Accurate and robust weather forecasting remains a fundamental challenge due to the inherent spatio-temporal complexity of atmospheric systems. In this paper, we propose a novel self-supervised learning framework that leverages spatio-temporal structures to improve multi-variable weather prediction. The model integrates a graph neural network (GNN) for spatial reasoning, a self-supervised pretraining scheme for representation learning, and a spatio-temporal adaptation mechanism to enhance generalization across varying forecasting horizons. Extensive experiments on both ERA5 and MERRA-2 reanalysis datasets demonstrate that our approach achieves superior performance compared to traditional numerical weather prediction (NWP) models and recent deep learning methods. Quantitative evaluations and visual analyses in Beijing and Shanghai confirm the model’s capability to capture fine-grained meteorological patterns. The proposed framework provides a scalable and label-efficient solution for future data-driven weather forecasting systems.
[599] LC-Opt: Benchmarking Reinforcement Learning and Agentic AI for End-to-End Liquid Cooling Optimization in Data Centers
Avisek Naug, Antonio Guillen, Vineet Kumar, Scott Greenwood, Wesley Brewer, Sahand Ghorbanpour, Ashwin Ramesh Babu, Vineet Gundecha, Ricardo Luna Gutierrez, Soumyendu Sarkar
Main category: cs.LG
TL;DR: LC-Opt is a benchmark environment for reinforcement learning control strategies in energy-efficient liquid cooling of high-performance computing systems, built on a high-fidelity digital twin of Frontier Supercomputer’s cooling system.
Details
Motivation: Liquid cooling is critical for thermal management in high-density data centers with rising AI workloads, and machine learning-based controllers are essential to unlock greater energy efficiency and reliability for sustainability.Method: Built on Modelica-based end-to-end models, LC-Opt provides a Gymnasium interface where RL agents optimize thermal controls like liquid supply temperature, flow rate, valve actuation, and cooling tower setpoints with dynamic workloads.
Result: The environment creates a multi-objective real-time optimization challenge balancing local thermal regulation and global energy efficiency, and supports additional components like heat recovery units.
Conclusion: LC-Opt democratizes access to detailed liquid cooling models, enabling the ML community, operators, and vendors to develop sustainable data center liquid cooling control solutions through benchmarked RL approaches and interpretable control methods.
Abstract: Liquid cooling is critical for thermal management in high-density data centers with rising AI workloads, and machine learning-based controllers are essential to unlock greater energy efficiency and reliability, promoting sustainability. We present LC-Opt, a Sustainable Liquid Cooling (LC) benchmark environment for reinforcement learning (RL) control strategies in energy-efficient liquid cooling of high-performance computing (HPC) systems. Built on the baseline of a high-fidelity digital twin of Oak Ridge National Lab’s Frontier Supercomputer cooling system, LC-Opt provides detailed Modelica-based end-to-end models spanning site-level cooling towers to data center cabinets and server blade groups. RL agents optimize critical thermal controls like liquid supply temperature, flow rate, and granular valve actuation at the IT cabinet level, as well as cooling tower (CT) setpoints, through a Gymnasium interface, with dynamic changes in workloads. This environment creates a multi-objective real-time optimization challenge balancing local thermal regulation and global energy efficiency, and also supports additional components like a heat recovery unit (HRU). We benchmark centralized and decentralized multi-agent RL approaches, demonstrate policy distillation into decision and regression trees for interpretable control, and explore LLM-based methods that explain control actions in natural language through an agentic mesh architecture designed to foster user trust and simplify system management. LC-Opt democratizes access to detailed, customizable liquid cooling models, enabling the ML community, operators, and vendors to develop sustainable data center liquid cooling control solutions.
[600] FLoRA: Fused forward-backward adapters for parameter efficient fine-tuning and reducing inference-time latencies of LLMs
Dhananjaya Gowda, Seoha Song, Junhyun Lee, Harshith Goka
Main category: cs.LG
TL;DR: FLoRA is a parameter-efficient fine-tuning method that combines LoRA and parallel adapters to improve accuracy while minimizing latency by fusing forward-backward adapters into existing projection layers.
Details
Motivation: As LLMs grow larger, efficient training and fine-tuning become crucial. While PEFT methods like LoRA exist, there's still significant unexplored potential in this area with high degrees of freedom.Method: Proposes FLoRA - fused forward-backward adapters (FFBA) that combine LoRA and parallel adapters. The adapters are fused into existing projection layers of the base model to minimize latency.
Result: Experimental results show FLoRA performs significantly better than LoRA in both accuracy and latency for similar parameter budgets.
Conclusion: FLoRA provides an effective parameter-efficient fine-tuning approach that outperforms popular methods like LoRA while maintaining computational efficiency.
Abstract: As large language models (LLMs) continue to grow in size, efficient training and fine-tuning have never been more important. This has led to great interest in parameter-efficient fine-tuning (PEFT), and effective methods such as low-rank adapters (LoRA) have emerged. Although various PEFT methods have been studied extensively in recent years, much of the design space remains unexplored given its high degree of freedom. In this paper, we propose FLoRA, a family of fused forward-backward adapters (FFBA) for parameter-efficient fine-tuning of LLMs on downstream tasks. FFBA combines ideas from the popular LoRA and parallel adapters to improve overall fine-tuning accuracy. At the same time, latency is minimized by fusing the forward and backward adapters into existing projection layers of the base model. Experimental results show that the proposed FFB adapters perform significantly better than the widely used LoRA in both accuracy and latency for a similar parameter budget.
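The latency argument rests on fusing adapters into existing projection layers so inference needs no extra matmul. The sketch below shows only that generic merge step for a low-rank update; it does not reproduce the paper's forward-backward adapter structure, and the shapes and scaling are assumptions.

```python
# Generic sketch of fusing a low-rank adapter into an existing projection layer so that
# inference uses a single matmul. This illustrates the merge step only; the paper's
# forward-backward adapter structure is not reproduced here.
import torch
import torch.nn as nn

def fuse_low_rank(linear: nn.Linear, A: torch.Tensor, B: torch.Tensor, scale: float = 1.0):
    """linear.weight: (out, in); A: (r, in); B: (out, r). After fusion the layer computes
    (W + scale * B @ A) x with no extra inference-time cost."""
    with torch.no_grad():
        linear.weight += scale * (B @ A)
    return linear

proj = nn.Linear(512, 512, bias=False)
r = 8
A = torch.randn(r, 512) * 0.01   # trained adapter factors (random placeholders here)
B = torch.randn(512, r) * 0.01
fused = fuse_low_rank(proj, A, B, scale=1.0)
y = fused(torch.randn(4, 512))   # single projection, adapter already folded in
```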
[601] DCcluster-Opt: Benchmarking Dynamic Multi-Objective Optimization for Geo-Distributed Data Center Workloads
Antonio Guillen-Perez, Avisek Naug, Vineet Gundecha, Sahand Ghorbanpour, Ricardo Luna Gutierrez, Ashwin Ramesh Babu, Munther Salim, Shubhanker Banerjee, Eoin H. Oude Essink, Damien Fay, Soumyendu Sarkar
Main category: cs.LG
TL;DR: DCcluster-Opt is an open-source simulation benchmark for sustainable geo-temporal task scheduling in distributed data centers, combining real-world datasets and physics-informed models to enable research on optimizing carbon emissions, energy costs, and service level agreements.
Details
Motivation: The increasing energy demands and carbon footprint of large-scale AI require intelligent workload management, but progress is limited by the absence of realistic benchmarks that capture environmental factors, data center physics, and network dynamics.Method: Combines curated real-world datasets (AI workload traces, grid carbon intensity, electricity markets, weather across 20 regions) with physics-informed models of data center operations, providing a Gymnasium API with baseline controllers including reinforcement learning and rule-based strategies.
Result: Presents a challenging scheduling problem where a coordinating agent must dynamically reassign or defer tasks across configurable data center clusters to optimize multiple objectives including carbon emissions, energy costs, SLAs, and water use.
Conclusion: DCcluster-Opt accelerates the development and validation of next-generation sustainable computing solutions for geo-distributed data centers by offering a realistic, configurable, and accessible testbed for reproducible research.
Abstract: The increasing energy demands and carbon footprint of large-scale AI require intelligent workload management in globally distributed data centers. Yet progress is limited by the absence of benchmarks that realistically capture the interplay of time-varying environmental factors (grid carbon intensity, electricity prices, weather), detailed data center physics (CPUs, GPUs, memory, HVAC energy), and geo-distributed network dynamics (latency and transmission costs). To bridge this gap, we present DCcluster-Opt: an open-source, high-fidelity simulation benchmark for sustainable, geo-temporal task scheduling. DCcluster-Opt combines curated real-world datasets, including AI workload traces, grid carbon intensity, electricity markets, weather across 20 global regions, cloud transmission costs, and empirical network delay parameters with physics-informed models of data center operations, enabling rigorous and reproducible research in sustainable computing. It presents a challenging scheduling problem where a top-level coordinating agent must dynamically reassign or defer tasks that arrive with resource and service-level agreement requirements across a configurable cluster of data centers to optimize multiple objectives. The environment also models advanced components such as heat recovery. A modular reward system enables an explicit study of trade-offs among carbon emissions, energy costs, service level agreements, and water use. It provides a Gymnasium API with baseline controllers, including reinforcement learning and rule-based strategies, to support reproducible ML research and a fair comparison of diverse algorithms. By offering a realistic, configurable, and accessible testbed, DCcluster-Opt accelerates the development and validation of next-generation sustainable computing solutions for geo-distributed data centers.
[602] Calibrating and Rotating: A Unified Framework for Weight Conditioning in PEFT
Da Chang, Peng Xue, Yu Li, Yongxiang Liu, Pengxiang Xu, Shixun Zhang
Main category: cs.LG
TL;DR: This paper analyzes DoRA’s mechanism, reformulates it into an efficient matrix form, and proposes two new PEFT methods (Pre-Diag and SORA) that outperform LoRA and DoRA in performance and efficiency.
Details
Motivation: To understand DoRA's underlying mechanism, address its computational overhead, and develop more efficient and effective parameter-efficient fine-tuning methods.Method: Identified DoRA’s success comes from increased singular value entropy, reformulated DoRA into efficient matrix form, and proposed two new methods: Pre-Diag (diagonal conditioning before LoRA) and SORA (parameter-efficient orthogonal rotation).
Result: Extensive experiments show proposed methods achieve superior performance and efficiency compared to both LoRA and DoRA on natural language understanding and generation tasks.
Conclusion: The unified framework enables systematic design of advanced PEFT methods, with Pre-Diag and SORA demonstrating state-of-the-art performance-efficiency trade-offs.
Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods are crucial for adapting large pre-trained models. Among these, LoRA is considered a foundational approach. Building on this, the influential DoRA method enhances performance by decomposing weight updates into magnitude and direction. However, its underlying mechanism remains unclear, and it introduces significant computational overhead. In this work, we first identify that DoRA’s success stems from its capacity to increase the singular value entropy of the weight update matrix, which promotes a more uniform update distribution akin to full fine-tuning. We then reformulate DoRA into a mathematically equivalent and more efficient matrix form, revealing it as a learnable weight conditioning method. Based on this insight, we propose a unified framework for designing advanced PEFT methods by exploring two orthogonal dimensions: the architectural placement and the transformation type of the conditioning matrix. Within this framework, we introduce two novel methods: (1) Pre-Diag, which applies a diagonal conditioning matrix before the LoRA update to efficiently calibrate the pre-trained weights, thereby enhancing performance while reducing training time; and (2) Skewed Orthogonal Rotation Adaptation (SORA), which employs a parameter-efficient orthogonal rotation to perform a more powerful, norm-preserving transformation of the feature space. Extensive experiments on natural language understanding and generation tasks demonstrate that our proposed methods achieve superior performance and efficiency compared to both LoRA and DoRA. The code is available at https://github.com/MaeChd/SORA.
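As a rough sketch of the Pre-Diag idea as described (a learnable diagonal conditioning of the frozen weight plus a LoRA update), the following PyTorch module is one possible reading of the abstract; the exact placement of the diagonal matrix and the initialisation are assumptions.

```python
# Minimal sketch of a "diagonal conditioning before LoRA" layer, in the spirit of Pre-Diag:
# the frozen weight is calibrated by a learnable diagonal matrix and a LoRA update is added.
# The exact placement of the diagonal is an assumption based on the abstract's description.
import torch
import torch.nn as nn

class PreDiagLoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, r=8):
        super().__init__()
        self.weight = linear.weight            # frozen pretrained weight, shape (out, in)
        self.weight.requires_grad_(False)
        self.d = nn.Parameter(torch.ones(linear.in_features))     # diagonal conditioning
        self.A = nn.Parameter(torch.zeros(r, linear.in_features))
        self.B = nn.Parameter(torch.zeros(linear.out_features, r))
        nn.init.normal_(self.A, std=0.02)      # B stays zero, so training starts from W diag(d)

    def forward(self, x):                      # x: (..., in_features)
        calibrated = x @ (self.weight * self.d).T   # W diag(d) applied to x
        lora = x @ self.A.T @ self.B.T              # low-rank update path
        return calibrated + lora

layer = PreDiagLoRALinear(nn.Linear(512, 512, bias=False), r=8)
y = layer(torch.randn(4, 512))
```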
[603] Feature-Guided Analysis of Neural Networks: A Replication Study
Federico Formica, Stefano Gregis, Aurora Francesca Zanenga, Andrea Rota, Mark Lawford, Claudio Menghi
Main category: cs.LG
TL;DR: Feature-Guided Analysis (FGA) is evaluated on MNIST and LSC datasets, showing higher precision than existing literature, with architecture and training selection significantly affecting recall but not precision.
Details
Motivation: To provide empirical evidence for FGA's applicability in industrial contexts by assessing its effectiveness in explaining neural network decisions.Method: Applied FGA to MNIST and LSC datasets, evaluating how neural network architecture, training, and feature selection impact FGA’s effectiveness.
Result: FGA achieved higher precision than literature results, with feature selection significantly affecting recall but having negligible impact on precision.
Conclusion: FGA shows promise for industrial applications with improved precision, though recall is sensitive to model and feature selection choices.
Abstract: Understanding why neural networks make certain decisions is pivotal for their use in safety-critical applications. Feature-Guided Analysis (FGA) extracts slices of neural networks relevant to their tasks. Existing feature-guided approaches typically monitor the activation of the neural network neurons to extract the relevant rules. Preliminary results are encouraging and demonstrate the feasibility of this solution by assessing the precision and recall of Feature-Guided Analysis on two pilot case studies. However, the applicability in industrial contexts needs additional empirical evidence. To mitigate this need, this paper assesses the applicability of FGA on a benchmark made by the MNIST and LSC datasets. We assessed the effectiveness of FGA in computing rules that explain the behavior of the neural network. Our results show that FGA has a higher precision on our benchmark than the results from the literature. We also evaluated how the selection of the neural network architecture, training, and feature selection affect the effectiveness of FGA. Our results show that the selection significantly affects the recall of FGA, while it has a negligible impact on its precision.
[604] Quadratic Direct Forecast for Training Multi-Step Time-Series Forecast Models
Hao Wang, Licheng Pan, Yuan Lu, Zhichao Chen, Tianqiao Liu, Shuting He, Zhixuan Chu, Qingsong Wen, Haoxuan Li, Zhouchen Lin
Main category: cs.LG
TL;DR: Proposes a quadratic-form weighted training objective for time-series forecasting that addresses label autocorrelation and heterogeneous task weighting issues in traditional objectives like MSE.
Details
Motivation: Existing training objectives treat future steps as independent and equally weighted, overlooking label autocorrelation and failing to set appropriate weights for different forecasting tasks across varying future steps.Method: Developed Quadratic Direct Forecast (QDF) learning algorithm that uses an adaptively updated quadratic-form weighting matrix, where off-diagonal elements account for label autocorrelation and non-uniform diagonals match preferable weights for different forecasting tasks.
Result: QDF effectively improves performance of various forecast models, achieving state-of-the-art results in experiments.
Conclusion: The proposed quadratic-form weighted training objective successfully addresses limitations of traditional objectives and significantly enhances forecasting performance.
Abstract: The design of training objective is central to training time-series forecasting models. Existing training objectives such as mean squared error mostly treat each future step as an independent, equally weighted task, which we found leading to the following two issues: (1) overlook the label autocorrelation effect among future steps, leading to biased training objective; (2) fail to set heterogeneous task weights for different forecasting tasks corresponding to varying future steps, limiting the forecasting performance. To fill this gap, we propose a novel quadratic-form weighted training objective, addressing both of the issues simultaneously. Specifically, the off-diagonal elements of the weighting matrix account for the label autocorrelation effect, whereas the non-uniform diagonals are expected to match the most preferable weights of the forecasting tasks with varying future steps. To achieve this, we propose a Quadratic Direct Forecast (QDF) learning algorithm, which trains the forecast model using the adaptively updated quadratic-form weighting matrix. Experiments show that our QDF effectively improves performance of various forecast models, achieving state-of-the-art results. Code is available at https://anonymous.4open.science/r/QDF-8937.
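A minimal PyTorch sketch of a quadratic-form weighted multi-step loss, with the weighting matrix kept positive semi-definite by parameterising it as Q = L L^T. Making L learnable here is an illustrative choice; the paper adaptively updates its weighting matrix with its own scheme.

```python
# Minimal sketch of a quadratic-form weighted multi-step loss: loss = mean_b e_b^T Q e_b,
# with Q kept positive semi-definite via Q = L L^T. The parameterisation and the way Q is
# updated are illustrative; the paper uses its own adaptive update.
import torch
import torch.nn as nn

class QuadraticFormLoss(nn.Module):
    def __init__(self, horizon):
        super().__init__()
        self.L = nn.Parameter(torch.eye(horizon))   # Q = L L^T; identity start behaves like plain MSE

    def forward(self, pred, target):                # pred, target: (batch, horizon)
        e = pred - target
        Q = self.L @ self.L.T                        # off-diagonals model label autocorrelation,
        return torch.einsum('bi,ij,bj->b', e, Q, e).mean()  # diagonals weight each horizon step

crit = QuadraticFormLoss(horizon=12)
loss = crit(torch.randn(32, 12), torch.randn(32, 12))
```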
[605] SpatialTraceGen: High-Fidelity Traces for Efficient VLM Spatial Reasoning Distillation
Gio Huh, Dhruv Sheth, Rayhan Zirvi, Frank Xiao
Main category: cs.LG
TL;DR: SpatialTraceGen is a framework that distills reasoning processes from large teacher models into high-quality datasets for spatial reasoning, using an automated Verifier to ensure step fidelity without manual annotation.
Details
Motivation: Vision-Language Models struggle with complex spatial reasoning requiring problem decomposition and tool use, and fine-tuning smaller models is limited by the lack of high-quality step-by-step reasoning data.Method: Introduces SpatialTraceGen framework with an automated Verifier that scalably ensures fidelity of multi-hop, multi-tool reasoning traces distilled from large teacher models.
Result: On CLEVR-Humans benchmark, the verifier-guided process improves average quality score by 17% and reduces quality variance by over 40%, creating a dataset of expert reasoning traces.
Conclusion: SpatialTraceGen provides structured step-by-step examples necessary for effective fine-tuning and sample-efficient offline reinforcement learning in spatial reasoning tasks.
Abstract: While Vision-Language Models (VLMs) excel in many areas, they struggle with complex spatial reasoning, which requires problem decomposition and strategic tool use. Fine-tuning smaller, more deployable models offers an efficient path to strong performance, but this is hampered by a major bottleneck: the absence of high-quality, step-by-step reasoning data. To address this data-efficiency gap, we introduce SpatialTraceGen, a framework to distill the reasoning processes of a large teacher model into a high-quality dataset of multi-hop, multi-tool reasoning traces. A key innovation is our automated Verifier, which scalably ensures the fidelity of each reasoning step, providing a cost-effective alternative to manual human annotation. On the CLEVR-Humans benchmark, this verifier-guided process improves the average quality score of traces by 17% while reducing quality variance by over 40%. SpatialTraceGen delivers a dataset of expert traces, providing the structured, step-by-step examples of tool use necessary for effective fine-tuning and sample-efficient offline reinforcement learning.
[606] Exploring Federated Learning for Thermal Urban Feature Segmentation – A Comparison of Centralized and Decentralized Approaches
Leonhard Duda, Khadijeh Alibabaei, Elena Vollmer, Leon Klug, Valentin Kozlov, Lisana Berberi, Mishal Benz, Rebekka Volk, Juan Pedro Gutiérrez Hermosillo Muriedas, Markus Götz, Judith Sáínz-Pardo Díaz, Álvaro López García, Frank Schultmann, Achim Streit
Main category: cs.LG
TL;DR: This paper investigates practical implementation of Federated Learning (FL) for UAV-based thermal image segmentation in urban environments, comparing FL approaches with centralized learning across performance metrics.
Details
Motivation: FL addresses privacy and technical restrictions by allowing distributed training without sharing data centrally, particularly suitable for UAV thermal images captured in different cities with non-identical data distributions.Method: Evaluated multiple FL algorithms in real deployment scenarios, comparing client-controlled and server-controlled workflows against centralized learning baseline, measuring accuracy, training time, communication overhead, and energy usage.
Result: The study provides empirical evaluation of FL effectiveness in real-world UAV thermal imaging scenarios, highlighting practical performance trade-offs between different FL approaches.
Conclusion: The findings serve as a valuable reference for understanding practical applications and limitations of FL methods in UAV-based image segmentation tasks.
Abstract: Federated Learning (FL) is an approach for training a shared Machine Learning (ML) model with distributed training data and multiple participants. FL bypasses limitations of traditional Centralized Learning (CL) when data cannot be shared or stored centrally due to privacy or technical restrictions – the participants train the model locally with their training data and do not need to share it among the other participants. This paper investigates the practical implementation and effectiveness of FL in a real-world scenario, specifically focusing on unmanned aerial vehicle (UAV)-based thermal images for common thermal feature detection in urban environments. The distributed nature of the data arises naturally and makes it suitable for FL applications, as images captured in two German cities are available. This application presents unique challenges due to non-identical distribution and feature characteristics of data captured at both locations. The study makes several key contributions by evaluating FL algorithms in real deployment scenarios rather than simulation. We compare several FL approaches with a centralized learning baseline across key performance metrics such as model accuracy, training time, communication overhead, and energy usage. This paper also explores various FL workflows, comparing client-controlled workflows and server-controlled workflows. The findings of this work serve as a valuable reference for understanding the practical application and limitations of FL methods in segmentation tasks in UAV-based imaging.
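The aggregation step common to most compared FL workflows is weight averaging across clients; a minimal FedAvg sketch is shown below, with clients weighted by sample count. The surrounding orchestration (rounds, client selection, server- versus client-controlled control flow) is omitted, and the two-client toy setup is only an analogy to the two cities.

```python
# Minimal FedAvg aggregation sketch: average client model weights, weighted by the
# number of local samples. Orchestration around this step is intentionally omitted.
import copy
import torch.nn as nn

def fedavg(client_state_dicts, client_sizes):
    """client_state_dicts: list of model.state_dict(); client_sizes: local sample counts."""
    total = sum(client_sizes)
    avg = copy.deepcopy(client_state_dicts[0])
    for key in avg:
        avg[key] = sum(sd[key].float() * (n / total)
                       for sd, n in zip(client_state_dicts, client_sizes))
    return avg

# Usage: two clients (e.g., two sites) holding different amounts of local data.
model_a, model_b = nn.Linear(4, 2), nn.Linear(4, 2)
global_state = fedavg([model_a.state_dict(), model_b.state_dict()], client_sizes=[1200, 800])
```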
[607] MISA: Memory-Efficient LLMs Optimization with Module-wise Importance Sampling
Yuxi Liu, Renjia Deng, Yutong He, Xue Wang, Tao Yao, Kun Yuan
Main category: cs.LG
TL;DR: MISA is a module-wise importance sampling method that divides transformer layers into smaller modules and uses weighted random sampling to reduce memory usage during LLM optimization while maintaining convergence guarantees.
Details
Motivation: Current layer-wise optimization methods for LLMs have high memory demands and ignore varying module importance within layers, leading to suboptimal performance and limited memory savings.Method: Divides each transformer layer into smaller modules, assigns importance scores to each module, and uses weighted random sampling to activate modules during optimization to reduce gradient variance.
Result: MISA achieves O(1/√K) convergence rate under non-convex stochastic conditions, reduces gradient variance compared to layer-wise sampling, and shows superior memory efficiency over baseline methods in experiments.
Conclusion: MISA effectively addresses memory limitations in LLM optimization through module-wise importance sampling, providing theoretical guarantees and practical improvements over existing methods.
Abstract: The substantial memory demands of pre-training and fine-tuning large language models (LLMs) require memory-efficient optimization algorithms. One promising approach is layer-wise optimization, which treats each transformer block as a single layer and optimizes it sequentially, while freezing the other layers to save optimizer states and activations. Although effective, these methods ignore the varying importance of the modules within each layer, leading to suboptimal performance. Moreover, layer-wise sampling provides only limited memory savings, as at least one full layer must remain active during optimization. To overcome these limitations, we propose Module-wise Importance SAmpling (MISA), a novel method that divides each layer into smaller modules and assigns importance scores to each module. MISA uses a weighted random sampling mechanism to activate modules, provably reducing gradient variance compared to layer-wise sampling. Additionally, we establish an $\mathcal{O}(1/\sqrt{K})$ convergence rate under non-convex and stochastic conditions, where $K$ is the total number of block updates, and provide a detailed memory analysis showcasing MISA’s superiority over existing baseline methods. Experiments on diverse learning tasks validate the effectiveness of MISA. Source code is available at https://github.com/pkumelon/MISA.
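A schematic sketch of the sampling step as described: split a layer into modules, assign importance scores, and sample which modules stay trainable at a given step while the rest are frozen. The module names, scores, and sample size are illustrative placeholders.

```python
# Schematic sketch of module-wise importance sampling: at each step, sample a subset of
# modules (weighted by importance) to keep trainable and freeze the rest. The importance
# scores and sampling size here are illustrative placeholders.
import torch
import torch.nn as nn

layer = nn.ModuleDict({
    "attn_q": nn.Linear(64, 64), "attn_k": nn.Linear(64, 64), "attn_v": nn.Linear(64, 64),
    "mlp_up": nn.Linear(64, 256), "mlp_down": nn.Linear(256, 64),
})
importance = {"attn_q": 1.0, "attn_k": 0.5, "attn_v": 1.0, "mlp_up": 2.0, "mlp_down": 2.0}

def sample_active_modules(importance, k=2):
    names = list(importance)
    probs = torch.tensor([importance[n] for n in names])
    idx = torch.multinomial(probs / probs.sum(), k, replacement=False)
    return {names[i] for i in idx.tolist()}

active = sample_active_modules(importance, k=2)
for name, module in layer.items():            # freeze everything except the sampled modules,
    for p in module.parameters():             # saving optimizer state and activation memory
        p.requires_grad_(name in active)
```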
[608] Automatically Finding Rule-Based Neurons in OthelloGPT
Aditya Singh, Zihang Wen, Srujananjali Medicherla, Adam Karvonen, Can Rager
Main category: cs.LG
TL;DR: OthelloGPT is used as a testbed for interpretability research, where an automated decision tree approach identifies MLP neurons encoding rule-based game logic, revealing interpretable patterns like diagonal move detection.
Details
Motivation: To develop automated interpretability methods that can identify and understand how neural networks encode rule-based logic in complex but grounded domains like board games.Method: Train regression decision trees to map board states to neuron activations, extract decision paths where neurons are highly active, and convert them into human-readable logical forms.
Result: About half of layer 5 neurons (913 of 2,048) can be accurately described by compact, rule-based decision trees (R² > 0.7), with interventions showing 5-10x stronger degradation when ablating pattern-corresponding neurons.
Conclusion: The approach successfully identifies interpretable, rule-based computational patterns in neural networks and provides a tool for mapping game behaviors to implementing neurons, supporting future interpretability research.
Abstract: OthelloGPT, a transformer trained to predict valid moves in Othello, provides an ideal testbed for interpretability research. The model is complex enough to exhibit rich computational patterns, yet grounded in rule-based game logic that enables meaningful reverse-engineering. We present an automated approach based on decision trees to identify and interpret MLP neurons that encode rule-based game logic. Our method trains regression decision trees to map board states to neuron activations, then extracts decision paths where neurons are highly active to convert them into human-readable logical forms. These descriptions reveal highly interpretable patterns; for instance, neurons that specifically detect when diagonal moves become legal. Our findings suggest that roughly half of the neurons in layer 5 can be accurately described by compact, rule-based decision trees ($R^2 > 0.7$ for 913 of 2,048 neurons), while the remainder likely participate in more distributed or non-rule-based computations. We verify the causal relevance of patterns identified by our decision trees through targeted interventions. For a specific square, for specific game patterns, we ablate neurons corresponding to those patterns and find an approximately 5-10 fold stronger degradation in the model’s ability to predict legal moves along those patterns compared to control patterns. To facilitate future work, we provide a Python tool that maps rule-based game behaviors to their implementing neurons, serving as a resource for researchers to test whether their interpretability methods recover meaningful computational structures.
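The probing step described above is straightforward to prototype. The sketch below, written under the assumption that board states are encoded as 64-entry vectors and that per-neuron activations have already been collected, fits a compact regression tree to one neuron and prints its rule when the fit clears an $R^2$ threshold; the board encoding and the synthetic "rule-like neuron" are illustrative, not the authors' exact pipeline.

```python
# Minimal sketch of the decision-tree probing step. Board encoding (-1/0/+1 per
# square) and the synthetic "rule-like neuron" are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

def fit_neuron_tree(board_states, activations, max_depth=4, r2_threshold=0.7):
    """Fit a compact regression tree from board states to one neuron's activations."""
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree.fit(board_states, activations)
    r2 = tree.score(board_states, activations)          # R^2 of the tree's fit
    return (tree, r2) if r2 >= r2_threshold else None   # None: likely not rule-based

# Stand-in data: 64-square boards and a neuron firing when sq27 is empty and sq36 is ours.
rng = np.random.default_rng(0)
X = rng.choice([-1, 0, 1], size=(5000, 64)).astype(float)
y = ((X[:, 27] == 0) & (X[:, 36] == 1)).astype(float)

fitted = fit_neuron_tree(X, y)
if fitted is not None:
    tree, r2 = fitted
    print(f"R^2 = {r2:.2f}")
    print(export_text(tree, feature_names=[f"sq{i}" for i in range(64)]))
```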
[609] EVINGCA: Adaptive Graph Clustering with Evolving Neighborhood Statistics
Randolph Wiredu-Aidoo
Main category: cs.LG
TL;DR: EVINGCA is a density-variance based clustering algorithm that treats cluster formation as an adaptive process on nearest-neighbor graphs, replacing fixed density thresholds with local statistical feedback.
Details
Motivation: Existing clustering algorithms have limitations - K-Means and Gaussian Mixtures assume convex, Gaussian-like clusters, while DBSCAN and HDBSCAN capture non-convexity but are highly sensitive.Method: EVINGCA expands rooted graphs via breadth-first search guided by continuously updated local distance and shape statistics, using spatial indexing for efficiency.
Result: EVINGCA achieves log-linear complexity in average case and shows competitive performance against baselines across synthetic, real-world, low-dimensional and high-dimensional datasets.
Conclusion: EVINGCA provides an effective alternative to existing clustering methods by treating cluster formation as an adaptive, evolving process with local statistical guidance.
Abstract: Clustering algorithms often rely on restrictive assumptions: K-Means and Gaussian Mixtures presuppose convex, Gaussian-like clusters, while DBSCAN and HDBSCAN capture non-convexity but can be highly sensitive. I introduce EVINGCA (Evolving Variance-Informed Nonparametric Graph Construction Algorithm), a density-variance based clustering algorithm that treats cluster formation as an adaptive, evolving process on a nearest-neighbor graph. EVINGCA expands rooted graphs via breadth-first search, guided by continuously updated local distance and shape statistics, replacing fixed density thresholds with local statistical feedback. With spatial indexing, EVINGCA features log-linear complexity in the average case and exhibits competitive performance against baselines across a variety of synthetic, real-world, low-d, and high-d datasets.
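To make the adaptive-growth idea concrete, here is a deliberately simplified sketch: clusters grow by breadth-first search over a k-NN graph, and an edge is accepted only if it is shorter than a threshold derived from the running mean and standard deviation of already-accepted edges. The warm-up rule and the mean-plus-alpha-sigma threshold are illustrative stand-ins for EVINGCA's evolving neighborhood statistics, not the algorithm itself.

```python
# Simplified caricature of density-variance guided cluster growth on a k-NN graph.
# The warm-up and mean + alpha*std acceptance rule stand in for EVINGCA's
# evolving neighborhood statistics.
from collections import deque
import numpy as np
from sklearn.neighbors import NearestNeighbors

def grow_cluster(X, seed, k=10, alpha=2.0):
    dists, nbrs = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    visited, accepted, queue = {seed}, [], deque([seed])
    while queue:
        i = queue.popleft()
        for d, j in zip(dists[i], nbrs[i]):
            if j in visited:
                continue
            if len(accepted) < k:                     # warm-up: take the seed's neighborhood
                threshold = np.inf
            else:                                     # then adapt to the cluster's own scale
                threshold = np.mean(accepted) + alpha * np.std(accepted)
            if d <= threshold:
                visited.add(j)
                accepted.append(d)
                queue.append(j)
    return visited

# Two well-separated blobs: growth from a point in one blob should not bridge to the other.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (200, 2)), rng.normal(5, 0.3, (200, 2))])
print(len(grow_cluster(X, seed=0)))
```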
[610] Aligning Brain Signals with Multimodal Speech and Vision Embeddings
Kateryna Shapovalenko, Quentin Auster
Main category: cs.LG
TL;DR: The study investigates which layers of pre-trained models (wav2vec2 and CLIP) best align with the brain’s layered processing during speech perception, using EEG signals and various embedding combination strategies.
Details
Motivation: To understand how the brain builds meaning through layered processing from raw acoustics to rich multimodal associations, inspired by the brain's ability to create meaning beyond just sound processing.Method: Used EEG recordings during natural speech perception, compared embeddings from wav2vec2 (sound to language) and CLIP (words to images) models, evaluated alignment with brain activity using ridge regression and contrastive decoding, and tested three layer combination strategies: individual layers, progressive concatenation, and progressive summation.
Result: The findings suggest that combining multimodal, layer-aware representations may improve our ability to decode how the brain understands language as experience rather than just sound.
Conclusion: Multimodal, layer-aware model representations can better capture the brain’s hierarchical processing of language, moving beyond acoustic processing to include rich semantic and experiential associations.
Abstract: When we hear the word “house”, we don’t just process sound, we imagine walls, doors, memories. The brain builds meaning through layers, moving from raw acoustics to rich, multimodal associations. Inspired by this, we build on recent work from Meta that aligned EEG signals with averaged wav2vec2 speech embeddings, and ask a deeper question: which layers of pre-trained models best reflect this layered processing in the brain? We compare embeddings from two models: wav2vec2, which encodes sound into language, and CLIP, which maps words to images. Using EEG recorded during natural speech perception, we evaluate how these embeddings align with brain activity using ridge regression and contrastive decoding. We test three strategies: individual layers, progressive concatenation, and progressive summation. The findings suggest that combining multimodal, layer-aware representations may bring us closer to decoding how the brain understands language, not just as sound, but as experience.
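The alignment measurement itself is a standard encoding-style analysis. The sketch below shows ridge-regression alignment scores and the progressive-concatenation strategy on toy data; the regression direction (predicting embeddings from EEG), the shapes, and the cross-validation setup are assumptions rather than the paper's exact protocol.

```python
# Ridge-regression alignment between EEG features and layer embeddings, plus the
# progressive-concatenation strategy. Shapes, direction, and CV setup are assumptions.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def alignment_score(eeg, layer_embedding, alpha=1.0, cv=5):
    """Mean cross-validated R^2 when predicting an embedding from EEG."""
    return cross_val_score(Ridge(alpha=alpha), eeg, layer_embedding,
                           cv=cv, scoring="r2").mean()

def progressive_concatenation(eeg, layer_embeddings):
    """Score layer 0, then layers 0-1 concatenated, then layers 0-2, and so on."""
    return [alignment_score(eeg, np.concatenate(layer_embeddings[:d], axis=-1))
            for d in range(1, len(layer_embeddings) + 1)]

# Toy stand-ins: 500 epochs x 128 EEG features, four 32-dim embedding layers.
rng = np.random.default_rng(0)
eeg = rng.normal(size=(500, 128))
layers = [eeg @ rng.normal(size=(128, 32)) + 0.1 * rng.normal(size=(500, 32))
          for _ in range(4)]
print(progressive_concatenation(eeg, layers))
```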
[611] Token-Regulated Group Relative Policy Optimization for Stable Reinforcement Learning in Large Language Models
Tue Le, Nghi D. Q. Bui, Linh Ngo Van, Trung Le
Main category: cs.LG
TL;DR: TR-GRPO is an improved version of GRPO that addresses gradient imbalance by weighting tokens based on their probability, enhancing training stability and performance in RLVR tasks.
Details
Motivation: GRPO suffers from gradient imbalance where low-probability tokens dominate updates, causing unstable training and suppressing reliable high-probability tokens.Method: TR-GRPO extends GRPO by assigning token-level weights positively correlated with predicted probability, downweighting low-probability tokens and emphasizing high-probability ones.
Result: TR-GRPO consistently outperforms GRPO across RLVR tasks including logic, math, and agentic reasoning.
Conclusion: Regulating token contributions during RL training is crucial, and TR-GRPO establishes itself as a robust framework for enhancing LLM reasoning.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful approach for strengthening the reasoning capabilities of large language models (LLMs). Among existing algorithms, Group Relative Policy Optimization (GRPO) has demonstrated strong performance, yet it suffers from a critical issue: low-probability tokens disproportionately dominate gradient updates due to their inherently large gradient magnitudes. This imbalance leads to unstable training and suppresses the contribution of high-probability tokens that are more reliable for learning. In this work, we introduce Token-Regulated Group Relative Policy Optimization (TR-GRPO), a simple yet effective extension of GRPO that assigns token-level weights positively correlated with the model’s predicted probability. By downweighting low-probability tokens and emphasizing high-probability ones, TR-GRPO mitigates gradient over-amplification while preserving informative learning signals. Extensive experiments demonstrate that TR-GRPO consistently outperforms GRPO across RLVR tasks, including logic, math, and agentic reasoning, highlighting the importance of regulating token contributions during RL training and establishing TR-GRPO as a robust framework for enhancing LLM reasoning.
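The core modification is a per-token weight applied to a GRPO-style objective. The PyTorch sketch below uses $w = p^{\gamma}$ (detached) as the probability-correlated weight; the specific weighting function, the clipping setup, and the tensor shapes are assumptions for illustration, not the paper's exact formulation.

```python
# Token-regulated GRPO-style loss: per-token policy-gradient terms scaled by a
# weight that grows with the predicted token probability. w = p**gamma (detached)
# is an illustrative weighting choice.
import torch

def tr_grpo_token_loss(logprobs, old_logprobs, advantages, gamma=1.0, clip_eps=0.2):
    """logprobs, old_logprobs: (batch, seq); advantages: (batch,) group-relative."""
    ratio = torch.exp(logprobs - old_logprobs)                  # importance ratio per token
    adv = advantages.unsqueeze(-1)                              # broadcast over tokens
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    per_token = -torch.min(ratio * adv, clipped * adv)          # PPO/GRPO-style surrogate
    weights = torch.exp(logprobs).detach() ** gamma             # higher p -> higher weight
    return (weights * per_token).mean()

# Toy usage with random logits and sampled tokens.
logits = torch.randn(4, 16, 32, requires_grad=True)             # (batch, seq, vocab)
tokens = torch.randint(0, 32, (4, 16))
lp = torch.log_softmax(logits, dim=-1).gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
loss = tr_grpo_token_loss(lp, lp.detach(), torch.randn(4))
loss.backward()
print(loss.item())
```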
[612] Latent Domain Prompt Learning for Vision-Language Models
Zhixing Li, Arsham Gholamzadeh Khoee, Yinan Yu
Main category: cs.LG
TL;DR: This paper proposes a domain generalization method for vision-language models that automatically discovers latent domains from training data without needing explicit domain labels, enabling adaptive knowledge transfer across domains.
Details
Motivation: Domain generalization is crucial for real-world deployment of vision-language models, but existing methods rely on domain labels that may be unavailable or ambiguous. The paper addresses the challenge of generalizing without explicit domain labels.Method: The method performs latent domain clustering on image features and fuses domain-specific text features based on similarity between input images and discovered latent domains, enabling adaptive knowledge transfer.
Result: Experiments on four benchmarks show consistent improvements over VLM-based baselines, demonstrating the effectiveness of the approach for domain generalization without domain labels.
Conclusion: The proposed strategy provides new insights into improving robustness under domain shift and enables effective domain generalization without requiring explicit domain labels.
Abstract: The objective of domain generalization (DG) is to enable models to be robust against domain shift. DG is crucial for deploying vision-language models (VLMs) in real-world applications, yet most existing methods rely on domain labels that may not be available and often ambiguous. We instead study the DG setting where models must generalize well without access to explicit domain labels. Our key idea is to represent an unseen target domain as a combination of latent domains automatically discovered from training data, enabling the model to adaptively transfer knowledge across domains. To realize this, we perform latent domain clustering on image features and fuse domain-specific text features based on the similarity between the input image and each latent domain. Experiments on four benchmarks show that this strategy yields consistent gains over VLM-based baselines and provides new insights into improving robustness under domain shift.
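The two stages the abstract describes, latent-domain discovery and similarity-weighted fusion, can be sketched in a few lines. Below, k-means stands in for the clustering step and a softmax over image-to-centroid cosine similarities stands in for the fusion rule; both choices, and the temperature, are assumptions rather than the paper's exact design.

```python
# Latent-domain discovery (k-means, an assumed choice) and similarity-weighted
# fusion of per-domain text features via a softmax over cosine similarities.
import numpy as np
from sklearn.cluster import KMeans

def discover_latent_domains(image_features, n_domains=4, seed=0):
    km = KMeans(n_clusters=n_domains, random_state=seed, n_init=10).fit(image_features)
    return km.cluster_centers_                                # (n_domains, d)

def fuse_text_features(image_feature, centroids, domain_text_features, tau=0.07):
    img = image_feature / np.linalg.norm(image_feature)
    cents = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = cents @ img                                        # cosine similarity per domain
    w = np.exp(sims / tau)
    w /= w.sum()                                              # softmax weights
    return w @ domain_text_features                           # fused text feature

# Toy usage with random features; one text feature per discovered latent domain.
rng = np.random.default_rng(0)
img_feats = rng.normal(size=(1000, 512))
centroids = discover_latent_domains(img_feats)
text_per_domain = rng.normal(size=(4, 512))
print(fuse_text_features(img_feats[0], centroids, text_per_domain).shape)
```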
[613] Benchmarking Generative AI Against Bayesian Optimization for Constrained Multi-Objective Inverse Design
Muhammad Bilal Awan, Abdul Razzaq, Abdul Shahid
Main category: cs.LG
TL;DR: LLMs can serve as effective generative optimizers for constrained multi-objective regression tasks in inverse design, achieving competitive performance compared to Bayesian Optimization methods.
Details
Motivation: To investigate whether LLMs, despite not being explicitly designed for such tasks, can effectively solve constrained multi-objective optimization problems in continuous, high-dimensional numerical spaces like materials informatics.Method: Comparative study between Bayesian Optimization (BoTorch Ax and qEHVI) and fine-tuned LLMs/BERT models using Parameter-Efficient Fine-Tuning (PEFT), framing the problem as regression with custom output heads.
Result: BoTorch qEHVI achieved perfect convergence (GD=0.0), while the best LLM (WizardMath-7B) achieved GD=1.21, significantly outperforming traditional BoTorch Ax baseline (GD=15.03).
Conclusion: Specialized BO frameworks remain performance leaders for guaranteed convergence, but fine-tuned LLMs are validated as promising, computationally fast alternatives for multi-objective optimization tasks.
Abstract: This paper investigates the performance of Large Language Models (LLMs) as generative optimizers for solving constrained multi-objective regression tasks, specifically within the challenging domain of inverse design (property-to-structure mapping). This problem, critical to materials informatics, demands finding complex, feasible input vectors that lie on the Pareto optimal front. While LLMs have demonstrated universal effectiveness across generative and reasoning tasks, their utility in constrained, continuous, high-dimensional numerical spaces, tasks for which they were not explicitly architected, remains an open research question. We conducted a rigorous comparative study between established Bayesian Optimization (BO) frameworks and a suite of fine-tuned LLMs and BERT models. For BO, we benchmarked the foundational BoTorch Ax implementation against the state-of-the-art q-Expected Hypervolume Improvement (qEHVI, BoTorchM). The generative approach involved fine-tuning models via Parameter-Efficient Fine-Tuning (PEFT), framing the challenge as a regression problem with a custom output head. Our results show that BoTorch qEHVI achieved perfect convergence (GD=0.0), setting the performance ceiling. Crucially, the best-performing LLM (WizardMath-7B) achieved a Generational Distance (GD) of 1.21, significantly outperforming the traditional BoTorch Ax baseline (GD=15.03). We conclude that specialized BO frameworks remain the performance leader for guaranteed convergence, but fine-tuned LLMs are validated as a promising, computationally fast alternative, contributing essential comparative metrics to the field of AI-driven optimization. The findings have direct industrial applications in optimizing formulation design for resins, polymers, and paints, where multi-objective trade-offs between mechanical, rheological, and chemical properties are critical to innovation and production efficiency.
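Generational Distance (GD) is the metric both optimizers are compared on. A common form, and the one sketched below, is the average Euclidean distance from each candidate solution to its nearest point on a reference Pareto front; note that some definitions apply a p-norm over these distances rather than a plain mean.

```python
# Generational Distance: mean distance from each candidate objective vector to its
# nearest point on a reference Pareto front (simple mean-distance form).
import numpy as np

def generational_distance(candidates, reference_front):
    """candidates, reference_front: (n, m) and (r, m) arrays of objective values."""
    dists = np.linalg.norm(candidates[:, None, :] - reference_front[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

# Toy 2-objective example.
front = np.array([[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]])
cands = np.array([[0.1, 1.1], [0.6, 0.6], [1.2, 0.1]])
print(round(generational_distance(cands, front), 3))
```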
[614] Breaking the Performance Ceiling in Reinforcement Learning requires Inference Strategies
Felix Chalumeau, Daniel Rajaonarivonivelomanantsoa, Ruan de Kock, Claude Formanek, Sasha Abramowitz, Oumayma Mahjoub, Wiem Khlifi, Simon Du Toit, Louay Ben Nessir, Refiloe Shabe, Arnol Fokam, Siddarth Singh, Ulrich Mbou Sob, Arnu Pretorius
Main category: cs.LG
TL;DR: Using inference strategies during execution time can significantly improve performance in complex multi-agent reinforcement learning tasks, achieving up to 126% improvement over state-of-the-art methods with minimal extra compute time.
Details
Motivation: Real-world RL applications often face performance ceilings that cannot be broken with zero-shot inference, despite training until convergence. Many applications allow for inference phases with time/compute budgets to explore multiple attempts before final solutions.Method: Employ inference strategies during execution time that utilize specific time and compute budgets to explore multiple attempts before outputting final solutions in complex multi-agent RL problems.
Result: Achieved up to 126% improvement (45% average) over previous state-of-the-art across 17 tasks, using only a couple seconds of extra wall-clock time. Demonstrated promising compute scaling properties through over 60k experiments.
Conclusion: Inference phase strategies at execution time are key to breaking performance ceilings in complex multi-agent RL problems, offering substantial improvements with minimal computational overhead.
Abstract: Reinforcement learning (RL) systems have countless applications, from energy-grid management to protein design. However, such real-world scenarios are often extremely difficult, combinatorial in nature, and require complex coordination between multiple agents. This level of complexity can cause even state-of-the-art RL systems, trained until convergence, to hit a performance ceiling which they are unable to break out of with zero-shot inference. Meanwhile, many digital or simulation-based applications allow for an inference phase that utilises a specific time and compute budget to explore multiple attempts before outputting a final solution. In this work, we show that such an inference phase employed at execution time, and the choice of a corresponding inference strategy, are key to breaking the performance ceiling observed in complex multi-agent RL problems. Our main result is striking: we can obtain up to a 126% and, on average, a 45% improvement over the previous state-of-the-art across 17 tasks, using only a couple of seconds of extra wall-clock time during execution. We also demonstrate promising compute scaling properties, supported by over 60k experiments, making it the largest study on inference strategies for complex RL to date. Our experimental data and code are available at https://sites.google.com/view/inference-strategies-rl.
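The simplest member of the strategy family studied here is best-of-N sampling at execution time: run several stochastic rollouts within a budget and keep the highest-return one. The sketch below illustrates that loop with toy environment and policy stand-ins; the interfaces are hypothetical, not the paper's systems.

```python
# Best-of-N inference: run several stochastic rollouts within a budget and keep the
# best one. ToyEnv and ToyPolicy are hypothetical stand-ins for a trained system.
import random
import time

class ToyEnv:
    def reset(self, seed=0):
        self.rng = random.Random(seed)
        return 0                                      # dummy observation
    def step(self, action):
        return 0, float(action), True                 # obs, reward, done (one-step episode)

class ToyPolicy:
    def act(self, obs, rng):
        return rng.random()                           # stochastic action: sampling drives diversity

def best_of_n_inference(env, policy, n_attempts=16, time_budget_s=2.0):
    best_return, best_actions = float("-inf"), None
    start = time.monotonic()
    for attempt in range(n_attempts):
        if time.monotonic() - start > time_budget_s:
            break                                     # respect the execution-time budget
        obs, done, total, actions = env.reset(seed=attempt), False, 0.0, []
        while not done:
            a = policy.act(obs, env.rng)
            obs, reward, done = env.step(a)
            total += reward
            actions.append(a)
        if total > best_return:
            best_return, best_actions = total, actions
    return best_actions, best_return

print(best_of_n_inference(ToyEnv(), ToyPolicy())[1])  # best return across attempts
```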
[615] SynBrain: Enhancing Visual-to-fMRI Synthesis via Probabilistic Representation Learning
Weijian Mai, Jiamin Wu, Yu Zhu, Zhouheng Yao, Dongzhan Zhou, Andrew F. Luo, Qihao Zheng, Wanli Ouyang, Chunfeng Song
Main category: cs.LG
TL;DR: SynBrain is a generative framework that models the probabilistic mapping from visual semantics to neural responses, addressing biological variability while maintaining functional consistency in fMRI synthesis.
Details
Motivation: Existing deterministic methods struggle to model biological variability in neural responses while capturing functional consistency across trials, contexts, and subjects.Method: SynBrain uses two components: BrainVAE for probabilistic neural representation learning with visual semantic constraints, and a Semantic-to-Neural Mapper that projects visual semantics into neural response manifold for fMRI synthesis.
Result: SynBrain outperforms state-of-the-art methods in subject-specific visual-to-fMRI encoding, adapts efficiently to new subjects with few-shot data, and synthesizes high-quality fMRI signals that improve fMRI-to-image decoding performance.
Conclusion: SynBrain successfully models the one-to-many visual-to-neural mapping, reveals functional consistency across trials and subjects, and captures interpretable patterns shaped by biological neural variability.
Abstract: Deciphering how visual stimuli are transformed into cortical responses is a fundamental challenge in computational neuroscience. This visual-to-neural mapping is inherently a one-to-many relationship, as identical visual inputs reliably evoke variable hemodynamic responses across trials, contexts, and subjects. However, existing deterministic methods struggle to simultaneously model this biological variability while capturing the underlying functional consistency that encodes stimulus information. To address these limitations, we propose SynBrain, a generative framework that simulates the transformation from visual semantics to neural responses in a probabilistic and biologically interpretable manner. SynBrain introduces two key components: (i) BrainVAE models neural representations as continuous probability distributions via probabilistic learning while maintaining functional consistency through visual semantic constraints; (ii) A Semantic-to-Neural Mapper acts as a semantic transmission pathway, projecting visual semantics into the neural response manifold to facilitate high-fidelity fMRI synthesis. Experimental results demonstrate that SynBrain surpasses state-of-the-art methods in subject-specific visual-to-fMRI encoding performance. Furthermore, SynBrain adapts efficiently to new subjects with few-shot data and synthesizes high-quality fMRI signals that are effective in improving data-limited fMRI-to-image decoding performance. Beyond that, SynBrain reveals functional consistency across trials and subjects, with synthesized signals capturing interpretable patterns shaped by biological neural variability. Our code is available at https://github.com/MichaelMaiii/SynBrain.
[616] Wavelet-Based Feature Extraction and Unsupervised Clustering for Parity Detection: A Feature Engineering Perspective
Ertugrul Mutlu
Main category: cs.LG
TL;DR: Over-engineered parity detection using wavelet features and k-means clustering achieves 69.67% accuracy without supervision, revealing structural differences between odd/even numbers.
Details
Motivation: To explore unconventional machine learning approaches for classical problems like parity detection, bridging symbolic reasoning and feature-based learning.Method: Transform integers into wavelet-domain representations, extract multi-scale statistical features, and cluster using k-means algorithm without label supervision.
Result: Achieved 69.67% classification accuracy in distinguishing odd vs even numbers, revealing meaningful structural differences in the feature space.
Conclusion: Classical signal-processing techniques can uncover latent structure in discrete symbolic domains, providing insights for repurposing feature engineering and clustering in unconventional ML problems.
Abstract: This paper explores a deliberately over-engineered approach to the classical problem of parity detection – determining whether a number is odd or even – by combining wavelet-based feature extraction with unsupervised clustering. Instead of relying on modular arithmetic, integers are transformed into wavelet-domain representations, from which multi-scale statistical features are extracted and clustered using the k-means algorithm. The resulting feature space reveals meaningful structural differences between odd and even numbers, achieving a classification accuracy of approximately 69.67% without any label supervision. These results suggest that classical signal-processing techniques, originally designed for continuous data, can uncover latent structure even in purely discrete symbolic domains. Beyond parity detection, the study provides an illustrative perspective on how feature engineering and clustering may be repurposed for unconventional machine learning problems, potentially bridging symbolic reasoning and feature-based learning.
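The pipeline is easy to reproduce in spirit. In the sketch below, each integer is turned into a short signal via its fixed-length binary digit sequence (an assumed encoding, since the abstract does not specify one), per-level wavelet statistics are extracted with PyWavelets, and k-means clusters the result; the printed accuracy will depend on these choices.

```python
# Wavelet features + k-means for parity. Encoding integers as fixed-length binary
# digit signals is an assumption; the abstract does not specify the encoding.
import numpy as np
import pywt
from sklearn.cluster import KMeans

def wavelet_features(n, n_bits=16, wavelet="db1", level=3):
    signal = np.array([(n >> i) & 1 for i in range(n_bits)], dtype=float)
    feats = []
    for c in pywt.wavedec(signal, wavelet, level=level):   # multi-scale coefficients
        feats += [c.mean(), c.std(), np.sum(c ** 2)]       # per-level statistics
    return np.array(feats)

numbers = np.arange(1, 3001)
X = np.stack([wavelet_features(int(n)) for n in numbers])
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
parity = numbers % 2
acc = max((labels == parity).mean(), (labels != parity).mean())  # clusters carry no labels
print(f"unsupervised parity accuracy: {acc:.2%}")
```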
[617] Bridging Vision, Language, and Mathematics: Pictographic Character Reconstruction with Bézier Curves
Zihao Wan, Pau Tong Lin Xu, Fuwen Luo, Ziyue Wang, Peng Li, Yang Liu
Main category: cs.LG
TL;DR: A VLM is trained as a visual decompiler to convert raster images into geometric programs using Bézier curves, showing strong generalization from modern Chinese characters to ancient Oracle Bone Script without additional training.
Details
Motivation: To explore VLMs' ability to interpret geometric structure in visual information, using pictographic characters as a test case to move beyond semantic understanding to structured geometric reasoning.Method: Frame visual recognition as program synthesis, training a VLM to decompile raster images into executable programs composed of geometric primitives (Bézier curves) in the mathematical domain.
Result: The model outperforms strong zero-shot baselines including GPT-4o and demonstrates zero-shot generalization from modern Chinese characters to ancient Oracle Bone Script reconstruction.
Conclusion: The model acquires an abstract, transferable geometric grammar that enables structured visual understanding beyond pixel-level pattern recognition.
Abstract: While Vision-language Models (VLMs) have demonstrated strong semantic capabilities, their ability to interpret the underlying geometric structure of visual information is less explored. Pictographic characters, which combine visual form with symbolic structure, provide an ideal test case for this capability. We formulate this visual recognition challenge in the mathematical domain, where each character is represented by an executable program of geometric primitives. This is framed as a program synthesis task, training a VLM to decompile raster images into programs composed of Bézier curves. Our model, acting as a “visual decompiler”, demonstrates performance superior to strong zero-shot baselines, including GPT-4o. The most significant finding is that when trained solely on modern Chinese characters, the model is able to reconstruct ancient Oracle Bone Script in a zero-shot context. This generalization provides strong evidence that the model acquires an abstract and transferable geometric grammar, moving beyond pixel-level pattern recognition to a more structured form of visual understanding.
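The geometric primitive the generated programs are composed of is the cubic Bézier curve. The sketch below only evaluates one such curve from its four control points; how the model's programs name, parameterize, and chain these primitives is specific to the paper and not reproduced here.

```python
# Evaluating a cubic Bezier curve from its four 2D control points.
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n_points=64):
    t = np.linspace(0.0, 1.0, n_points)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# One hypothetical stroke: a gentle arc, sampled as (64, 2) points ready to rasterize.
stroke = cubic_bezier(np.array([0.0, 0.0]), np.array([0.3, 0.4]),
                      np.array([0.7, 0.4]), np.array([1.0, 0.0]))
print(stroke.shape)
```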
[618] flowengineR: A Modular and Extensible Framework for Fair and Reproducible Workflow Design in R
Maximilian Willer, Peter Ruckdeschel
Main category: cs.LG
TL;DR: flowengineR is an R package providing a modular framework for reproducible machine learning workflows, particularly addressing algorithmic fairness challenges through interchangeable engines for different pipeline stages.
Details
Motivation: Address the limitations of existing toolkits in algorithmic fairness that focus narrowly on single interventions or treat reproducibility and extensibility as secondary concerns, rather than core design principles.Method: Introduces a unified architecture of standardized engines for data splitting, execution, preprocessing, training, inprocessing, postprocessing, evaluation, and reporting. Each engine encapsulates one methodological task with lightweight interfaces, enabling transparent, auditable workflows.
Result: Provides a framework that structures fairness methods as interchangeable engines, allowing researchers to integrate, compare, and evaluate interventions across the modeling pipeline while generalizing to explainability, robustness, and compliance metrics.
Conclusion: While motivated by algorithmic fairness, flowengineR ultimately provides a general infrastructure for any workflow context where reproducibility, transparency, and extensibility are essential, enabling distributed responsibilities and independent development.
Abstract: flowengineR is an R package designed to provide a modular and extensible framework for building reproducible algorithmic workflows for general-purpose machine learning pipelines. It is motivated by the rapidly evolving field of algorithmic fairness where new metrics, mitigation strategies, and machine learning methods continuously emerge. A central challenge in fairness, but also far beyond, is that existing toolkits either focus narrowly on single interventions or treat reproducibility and extensibility as secondary considerations rather than core design principles. flowengineR addresses this by introducing a unified architecture of standardized engines for data splitting, execution, preprocessing, training, inprocessing, postprocessing, evaluation, and reporting. Each engine encapsulates one methodological task yet communicates via a lightweight interface, ensuring workflows remain transparent, auditable, and easily extensible. Although implemented in R, flowengineR builds on ideas from workflow languages (CWL, YAWL), graph-oriented visual programming languages (KNIME), and R frameworks (BatchJobs, batchtools). Its emphasis, however, is less on orchestrating engines for resilient parallel execution and more on the straightforward setup and management of distinct engines as data structures. This orthogonalization enables distributed responsibilities, independent development, and streamlined integration. In the fairness context, by structuring fairness methods as interchangeable engines, flowengineR lets researchers integrate, compare, and evaluate interventions across the modeling pipeline. At the same time, the architecture generalizes to explainability, robustness, and compliance metrics without core modifications. While motivated by fairness, it ultimately provides a general infrastructure for any workflow context where reproducibility, transparency, and extensibility are essential.
[619] Fixed-point graph convolutional networks against adversarial attacks
Shakib Khan, A. Ben Hamza, Amr Youssef
Main category: cs.LG
TL;DR: Fix-GCN is a robust graph neural network model that uses fixed-point iteration and spectral modulation to defend against adversarial attacks by capturing higher-order neighborhood information without extra computational cost.
Details
Motivation: Adversarial attacks pose significant risks to graph neural networks by manipulating graph structure and node features, creating a need for robust defense mechanisms that don't rely on additional design elements.Method: The model uses fixed-point iteration with a versatile spectral modulation filter that provides flexible-pass filtering, selectively attenuating high-frequency components while preserving low-frequency structural information in graph signals.
Result: Extensive experiments on benchmark graph datasets demonstrate the model’s effectiveness and resilience against adversarial attacks.
Conclusion: Fix-GCN offers a flexible and efficient framework for preserving essential graph information while mitigating adversarial manipulation through iterative node representation updates.
Abstract: Adversarial attacks present a significant risk to the integrity and performance of graph neural networks, particularly in tasks where graph structure and node features are vulnerable to manipulation. In this paper, we present a novel model, called fixed-point iterative graph convolutional network (Fix-GCN), which achieves robustness against adversarial perturbations by effectively capturing higher-order node neighborhood information in the graph without additional memory or computational complexity. Specifically, we introduce a versatile spectral modulation filter and derive the feature propagation rule of our model using fixed-point iteration. Unlike traditional defense mechanisms that rely on additional design elements to counteract attacks, the proposed graph filter provides a flexible-pass filtering approach, allowing it to selectively attenuate high-frequency components while preserving low-frequency structural information in the graph signal. By iteratively updating node representations, our model offers a flexible and efficient framework for preserving essential graph information while mitigating the impact of adversarial manipulation. We demonstrate the effectiveness of the proposed model through extensive experiments on various benchmark graph datasets, showcasing its resilience against adversarial attacks.
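As a rough intuition for fixed-point feature propagation, the sketch below iterates a damped, low-pass propagation rule until node features stop changing. The symmetric-normalized adjacency with an input-anchoring term is an illustrative stand-in for Fix-GCN's spectral modulation filter, not the paper's operator.

```python
# Fixed-point propagation sketch: iterate a damped, input-anchored smoothing rule
# h = (1 - alpha) * A_hat @ h + alpha * x until convergence. The normalized-adjacency
# filter is a stand-in for Fix-GCN's spectral modulation filter.
import numpy as np

def fixed_point_propagate(adj, x, alpha=0.1, tol=1e-6, max_iter=200):
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    a_hat = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]   # symmetric normalization
    h = x.copy()
    for _ in range(max_iter):
        h_new = (1 - alpha) * (a_hat @ h) + alpha * x         # contraction for alpha > 0
        if np.linalg.norm(h_new - h) < tol:
            break
        h = h_new
    return h

# Toy usage: an 8-node ring graph with random 4-dim node features.
n = 8
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
print(fixed_point_propagate(adj, np.random.default_rng(0).normal(size=(n, 4))).shape)
```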
[620] Application of predictive machine learning in pen & paper RPG game design
Jolanta Śliwa
Main category: cs.LG
TL;DR: This paper explores using AI and machine learning for automated monster level prediction in pen and paper RPGs, addressing the current reliance on manual testing and expert evaluation.
Details
Motivation: The pen and paper RPG market is growing, and companies want to use AI to enhance player experience. Current methods for determining monster challenge levels are manual, time-consuming, and resource-intensive, creating a need for automated solutions.Method: Used ordinal regression techniques for level prediction, built a dedicated dataset for level estimation, developed a human-inspired benchmark model, and created a specialized evaluation procedure based on domain knowledge.
Result: The research provides an overview and evaluation of state-of-the-art ordinal regression methods for monster level prediction, along with a benchmark model and evaluation framework for comparison.
Conclusion: The study establishes a foundation for automated monster level prediction in RPGs, offering machine learning alternatives to manual methods and providing tools for meaningful performance comparisons in this domain.
Abstract: In recent years, the pen and paper RPG market has experienced significant growth. As a result, companies are increasingly exploring the integration of AI technologies to enhance player experience and gain a competitive edge. One of the key challenges faced by publishers is designing new opponents and estimating their challenge level. Currently, there are no automated methods for determining a monster’s level; the only approaches used are based on manual testing and expert evaluation. Although these manual methods can provide reasonably accurate estimates, they are time-consuming and resource-intensive. Level prediction can be approached using ordinal regression techniques. This thesis presents an overview and evaluation of state-of-the-art methods for this task. It also details the construction of a dedicated dataset for level estimation. Furthermore, a human-inspired model was developed to serve as a benchmark, allowing comparison between machine learning algorithms and the approach typically employed by pen and paper RPG publishers. In addition, a specialized evaluation procedure, grounded in domain knowledge, was designed to assess model performance and facilitate meaningful comparisons.
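Ordinal regression itself can be illustrated with the classic Frank-and-Hall decomposition: one binary classifier per threshold "level > k", combined at prediction time. The sketch below is such a baseline on synthetic data; it is not the thesis's benchmark model or dataset.

```python
# Frank-and-Hall style ordinal regression: one binary classifier per threshold
# "level > k", combined via expected exceedance count. Data and features are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

class CumulativeOrdinal:
    def fit(self, X, y):
        self.levels_ = np.sort(np.unique(y))
        self.models_ = [LogisticRegression(max_iter=1000).fit(X, (y > k).astype(int))
                        for k in self.levels_[:-1]]
        return self

    def predict(self, X):
        exceed = np.stack([m.predict_proba(X)[:, 1] for m in self.models_], axis=1)
        idx = np.clip(np.rint(exceed.sum(axis=1)).astype(int), 0, len(self.levels_) - 1)
        return self.levels_[idx]                      # expected level via exceedance probs

# Synthetic "monster features" with 4 ordered levels.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = np.digitize(X @ rng.normal(size=5), bins=[-1.0, 0.0, 1.0])
print((CumulativeOrdinal().fit(X, y).predict(X) == y).mean())
```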
[621] MaGNet: A Mamba Dual-Hypergraph Network for Stock Prediction via Temporal-Causal and Global Relational Learning
Peilin Tan, Chuanqi Shi, Dian Tu, Liang Xie
Main category: cs.LG
TL;DR: MaGNet is a Mamba dual-hyperGraph Network for stock prediction that integrates bidirectional Mamba with adaptive gating, 2D spatiotemporal attention, and dual hypergraph framework to capture temporal dependencies and dynamic inter-stock interactions.
Details
Motivation: Stock trend prediction is challenging due to market volatility, complex temporal dynamics, and multifaceted inter-stock relationships. Existing methods struggle to capture temporal dependencies and dynamic inter-stock interactions, often neglecting cross-sectional market influences and relying on static correlations.Method: MaGNet introduces three key innovations: (1) MAGE block with bidirectional Mamba and adaptive gating for temporal modeling, (2) Feature-wise and Stock-wise 2D Spatiotemporal Attention modules for feature fusion and cross-stock dependencies, (3) dual hypergraph framework with Temporal-Causal Hypergraph and Global Probabilistic Hypergraph for multi-scale relational learning.
Result: Extensive experiments on six major stock indices demonstrate MaGNet outperforms state-of-the-art methods in both superior predictive performance and exceptional investment returns with robust risk management capabilities.
Conclusion: MaGNet effectively addresses the limitations of existing stock prediction methods by integrating advanced temporal modeling with sophisticated relational reasoning, achieving state-of-the-art performance in stock trend prediction.
Abstract: Stock trend prediction is crucial for profitable trading strategies and portfolio management yet remains challenging due to market volatility, complex temporal dynamics and multifaceted inter-stock relationships. Existing methods struggle to effectively capture temporal dependencies and dynamic inter-stock interactions, often neglecting cross-sectional market influences, relying on static correlations, employing uniform treatments of nodes and edges, and conflating diverse relationships. This work introduces MaGNet, a novel Mamba dual-hyperGraph Network for stock prediction, integrating three key innovations: (1) a MAGE block, which leverages bidirectional Mamba with adaptive gating mechanisms for contextual temporal modeling and integrates a sparse Mixture-of-Experts layer to enable dynamic adaptation to diverse market conditions, alongside multi-head attention for capturing global dependencies; (2) Feature-wise and Stock-wise 2D Spatiotemporal Attention modules enable precise fusion of multivariate features and cross-stock dependencies, effectively enhancing informativeness while preserving intrinsic data structures, bridging temporal modeling with relational reasoning; and (3) a dual hypergraph framework consisting of the Temporal-Causal Hypergraph (TCH) that captures fine-grained causal dependencies with temporal constraints, and Global Probabilistic Hypergraph (GPH) that models market-wide patterns through soft hyperedge assignments and Jensen-Shannon Divergence weighting mechanism, jointly disentangling localized temporal influences from instantaneous global structures for multi-scale relational learning. Extensive experiments on six major stock indices demonstrate MaGNet outperforms state-of-the-art methods in both superior predictive performance and exceptional investment returns with robust risk management capabilities. Codes available at: https://github.com/PeilinTime/MaGNet.
[622] Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph
Fali Wang, Jihai Chen, Shuhua Yang, Runxue Bao, Tianxiang Zhao, Zhiwei Zhang, Xianfeng Tang, Hui Liu, Qi He, Suhang Wang
Main category: cs.LG
TL;DR: Agent-REINFORCE is an LLM-agent-augmented framework that efficiently searches for optimal multi-LLM collaboration graphs in Test-Time Scaling by mapping REINFORCE’s sampling-gradient-update to sampling-feedback-update using textual gradients.
Details
Motivation: Prior Test-Time Scaling studies assume fixed collaboration architectures and single-model usage, overlooking that optimal architectures and model combinations vary across tasks, creating a need for compute-optimal model combinations under fixed budgets.Method: Formalize as probabilistic graph optimization, derive empirical insights from pilot experiments, and propose Agent-REINFORCE framework that uses LLM agents to map REINFORCE pipeline to sampling-feedback-update with textual gradients.
Result: Agent-REINFORCE outperforms traditional and LLM-based baselines in sample efficiency and search performance, effectively identifying optimal graphs under joint objectives of accuracy and inference latency.
Conclusion: The proposed approach successfully addresses the combinatorial search challenge in Test-Time Scaling by leveraging LLM agents and probabilistic graph optimization to find task-specific optimal multi-LLM collaboration architectures.
Abstract: Test-Time Scaling (TTS) improves large language models (LLMs) by allocating additional computation during inference, typically through parallel, sequential, or hybrid scaling. However, prior studies often assume fixed collaboration architectures (e.g., topologies) and single-model usage, overlooking that optimal architectures and model combinations can vary across tasks. Therefore, we study the novel problem of searching for compute-optimal model combinations and architectures in TTS under a fixed budget. We formalize it as a multi-LLM collaboration graph, where nodes encode roles and LLM model assignments, and edges capture information flow. This problem is challenging because (i) the combinatorial search space is prohibitively large, and (ii) task-specific requirements demand tailored designs. To address these, we reformulate the problem as probabilistic graph optimization and, through pilot experiments, derive three empirical insights into TTS collaboration graphs. Guided by these insights, we propose Agent-REINFORCE, an LLM-agent-augmented framework that mirrors the REINFORCE pipeline by mapping sampling-gradient-update to sampling-feedback-update, where feedback serves as a textual gradient to update the probabilistic graph and efficiently search for optimal multi-LLM collaboration graphs. Experiments show that Agent-REINFORCE outperforms both traditional and LLM-based baselines in sample efficiency and search performance, and effectively identifies optimal graphs under joint objectives of accuracy and inference latency.
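For orientation, the sketch below shows the classical, numeric sampling-gradient-update loop that Agent-REINFORCE mirrors: sample a collaboration graph from edge probabilities, score it, and nudge the probabilities with a REINFORCE estimator. The paper replaces the numeric gradient step with textual feedback from an LLM agent, which this sketch does not attempt to reproduce.

```python
# Numeric REINFORCE over a probabilistic collaboration graph: sample edges from
# Bernoulli probabilities, score the graph, update the logits. The paper swaps the
# numeric update for textual feedback; that part is not reproduced here.
import numpy as np

def reinforce_graph_search(score_fn, n_nodes=4, steps=200, lr=0.3, seed=0):
    rng = np.random.default_rng(seed)
    logits = np.zeros((n_nodes, n_nodes))                    # edge-probability parameters
    baseline = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-logits))
        edges = (rng.random(p.shape) < p).astype(float)      # sample one candidate graph
        reward = score_fn(edges)
        logits += lr * (edges - p) * (reward - baseline)     # REINFORCE estimator
        baseline = 0.9 * baseline + 0.1 * reward             # moving-average baseline
    return (1.0 / (1.0 + np.exp(-logits)) > 0.5).astype(int)

# Hypothetical scorer that prefers the sparse chain 0 -> 1 -> 2 -> 3.
target = np.zeros((4, 4))
target[0, 1] = target[1, 2] = target[2, 3] = 1.0
print(reinforce_graph_search(lambda e: -np.abs(e - target).sum()))
```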
[623] GraphKeeper: Graph Domain-Incremental Learning via Knowledge Disentanglement and Preservation
Zihao Guo, Qingyun Sun, Ziwei Zhang, Haonan Yuan, Huiping Zhuang, Xingcheng Fu, Jianxin Li
Main category: cs.LG
TL;DR: GraphKeeper addresses graph domain-incremental learning by preventing catastrophic forgetting through knowledge disentanglement and preservation, achieving state-of-the-art performance with 6.5%-16.6% improvement over existing methods.
Details
Motivation: Existing graph incremental learning approaches focus on task-incremental and class-incremental scenarios within single domains, leaving graph domain-incremental learning (Domain-IL) across multiple graph domains unexplored despite its importance for graph foundation models.Method: Proposes domain-specific parameter-efficient fine-tuning with intra- and inter-domain disentanglement objectives to prevent embedding shifts, deviation-free knowledge preservation to maintain stable decision boundaries, and domain-aware distribution discrimination for graphs with unobservable domains.
Result: Extensive experiments show GraphKeeper achieves state-of-the-art results with 6.5%-16.6% improvement over runner-up methods with negligible forgetting, and can be seamlessly integrated with various graph foundation models.
Conclusion: GraphKeeper effectively addresses catastrophic forgetting in graph domain-incremental learning scenarios and demonstrates broad applicative potential with graph foundation models.
Abstract: Graph incremental learning (GIL), which continuously updates graph models by sequential knowledge acquisition, has garnered significant interest recently. However, existing GIL approaches focus on task-incremental and class-incremental scenarios within a single domain. Graph domain-incremental learning (Domain-IL), aiming at updating models across multiple graph domains, has become critical with the development of graph foundation models (GFMs), but remains unexplored in the literature. In this paper, we propose Graph Domain-Incremental Learning via Knowledge Disentanglement and Preservation (GraphKeeper), to address catastrophic forgetting in the Domain-IL scenario from the perspectives of embedding shifts and decision boundary deviations. Specifically, to prevent embedding shifts and confusion across incremental graph domains, we first propose domain-specific parameter-efficient fine-tuning together with intra- and inter-domain disentanglement objectives. Consequently, to maintain a stable decision boundary, we introduce deviation-free knowledge preservation to continuously fit incremental domains. Additionally, for graphs with unobservable domains, we perform domain-aware distribution discrimination to obtain precise embeddings. Extensive experiments demonstrate the proposed GraphKeeper achieves state-of-the-art results with 6.5%~16.6% improvement over the runner-up with negligible forgetting. Moreover, we show GraphKeeper can be seamlessly integrated with various representative GFMs, highlighting its broad applicative potential.
[624] A generative adversarial network optimization method for damage detection and digital twinning by deep AI fault learning: Z24 Bridge structural health monitoring benchmark validation
Marios Impraimakis, Evangelia Nektaria Palkanoglou
Main category: cs.LG
TL;DR: A novel conditional-labeled GAN framework for unsupervised damage detection and digital twinning that outperforms current methods by not requiring prior health state information, validated on Z24 Bridge data.
Details
Motivation: Current AI-based digital twinning approaches suffer from poor predictions with limited measurements, missing physics knowledge, or unknown damage states, creating uncertainty in real-world applications.Method: Unsupervised conditional-labeled GAN framework that uses same damage-level measurements as inputs, forces conditional convergence to different damage states, compares convergence scores, and employs SVM classifier and PCA for assessment.
Result: The approach accurately captures damage over healthy measurements, creates digital twin measurements at different damage states, and provides pattern recognition and data generation capabilities for structural health monitoring.
Conclusion: The framework provides a powerful tool for vibration-based system-level monitoring and scalable infrastructure resilience, overcoming limitations of current methods by not requiring prior health state information.
Abstract: The optimization-based damage detection and damage-state digital twinning capabilities of a novel conditional-labeled generative adversarial network methodology are examined here. The framework outperforms current approaches for fault anomaly detection as no prior information is required about the health state of the system: a topic of high significance for real-world applications. Specifically, current artificial intelligence-based digital twinning approaches suffer from the uncertainty related to obtaining poor predictions when a low number of measurements is available, physics knowledge is missing, or the damage state is unknown. To this end, an unsupervised framework is examined and validated rigorously on the benchmark structural health monitoring measurements of Z24 Bridge: a post-tensioned concrete highway bridge in Switzerland. In implementing the approach, firstly, different same-damage-level measurements are used as inputs, while the model is forced to converge conditionally to two different damage states. Secondly, the process is repeated for a different group of measurements. Finally, the convergence scores are compared to identify which group belongs to a different damage state. The process, for both healthy-to-healthy and damage-to-healthy input data, simultaneously creates measurements for digital twinning purposes at different damage states, capable of pattern recognition and machine learning data generation. Further to this process, a support vector machine classifier and a principal component analysis procedure are developed to assess the generated and real measurements of each damage category, serving as a secondary new-dynamics learning indicator in damage scenarios. Importantly, the approach is shown to accurately capture damage relative to healthy measurements, providing a powerful tool for vibration-based system-level monitoring and scalable infrastructure resilience.
[625] Deep recurrent-convolutional neural network learning and physics Kalman filtering comparison in dynamic load identification
Marios Impraimakis
Main category: cs.LG
TL;DR: Comparison of GRU, LSTM, and CNN neural networks with residual Kalman filter for dynamic load identification under small dataset conditions and various loading scenarios.
Details
Motivation: Dynamic load identification suffers from uncertainty when only limited test data is available in civil engineering applications or when structural models are unidentifiable.Method: Examined GRU, LSTM, and CNN neural networks compared to physics-based residual Kalman filter (RKF) on three cases: simulated structure with shaker excitation, California building under seismic base excitation, and IASC-ASCE benchmark problem for impact/instant loading.
Result: Methods outperform each other on different loading scenarios. RKF outperforms neural networks in physically parametrized identifiable cases.
Conclusion: Different methods have varying performance depending on loading scenarios, with RKF showing superior performance in physically identifiable cases while neural networks perform well in other scenarios.
Abstract: The dynamic structural load identification capabilities of the gated recurrent unit, long short-term memory, and convolutional neural networks are examined herein. The examination considers realistic small-dataset training conditions and provides a comparison with the physics-based residual Kalman filter (RKF). Dynamic load identification suffers from the uncertainty of obtaining poor predictions when, as in many civil engineering applications, only a small number of tests are performed or available, or when the structural model is unidentifiable. In considering the methods, first, a simulated structure is investigated under a shaker excitation at the top floor. Second, a building in California is investigated under seismic base excitation, which results in loading for all degrees of freedom. Finally, the International Association for Structural Control-American Society of Civil Engineers (IASC-ASCE) structural health monitoring benchmark problem is examined for impact and instant loading conditions. Importantly, the methods are shown to outperform each other on different loading scenarios, while the RKF is shown to outperform the networks in physically parametrized identifiable cases.
[626] Loquetier: A Virtualized Multi-LoRA Framework for Unified LLM Fine-tuning and Serving
Yuchen Zhang, Hanyue Du, Chun Cao, Jingwei Xu
Main category: cs.LG
TL;DR: Loquetier is a unified framework that integrates LoRA fine-tuning and serving in one runtime, achieving significant performance improvements over existing systems.
Details
Motivation: There's a gap in unifying fine-tuning and inference for LoRA-based models, despite LoRA being widely used for parameter-efficient fine-tuning of LLMs.Method: Uses Virtualized Module for isolating PEFT modifications and supports multiple adapters, plus optimized computation flow with merged fine-tuning/inference paths and efficient kernel design.
Result: Achieves up to 3.0× throughput of state-of-the-art co-serving system on inference tasks and 46.4× higher SLO attainment than PEFT on unified fine-tuning/inference tasks.
Conclusion: Loquetier provides a practical solution for unified LoRA fine-tuning and serving, demonstrating superior performance and flexibility across various task settings.
Abstract: Low-Rank Adaptation (LoRA) has become a widely adopted parameter-efficient fine-tuning (PEFT) technique for adapting large language models (LLMs) to downstream tasks. While prior work has explored strategies for integrating LLM training and serving, there still remains a gap in unifying fine-tuning and inference for LoRA-based models. We present Loquetier, a virtualized multi-LoRA framework that seamlessly integrates LoRA fine-tuning and serving within a single runtime. Loquetier introduces two key components: (1) a Virtualized Module that isolates PEFT-based modifications and supports multiple adapters on a shared base model, and (2) an optimized computation flow with a kernel design that merges fine-tuning and inference paths in forward propagation, enabling efficient batching and minimizing kernel invocation overhead. Extensive experiments across three task settings show that Loquetier consistently outperforms existing baselines in both performance and flexibility, achieving up to $3.0\times$ the throughput of the state-of-the-art co-serving system on inference-only tasks and $46.4\times$ higher SLO attainment than PEFT on unified fine-tuning and inference tasks. The implementation of Loquetier is publicly available at https://github.com/NJUDeepEngine/Loquetier.
[627] Automated Discovery of Conservation Laws via Hybrid Neural ODE-Transformers
Vivan Doshi
Main category: cs.LG
TL;DR: A hybrid framework for discovering conservation laws from noisy trajectory data using Neural ODEs, Transformers, and symbolic-numeric verification.
Details
Motivation: Identifying conservation laws from observational data is challenging but crucial for scientific progress, especially with noisy trajectory data.Method: Three-component approach: Neural ODE learns system dynamics, Transformer generates symbolic candidate invariants, and symbolic-numeric verifier validates candidates.
Result: Framework significantly outperforms baselines on canonical physical systems and demonstrates robustness for discovering mathematical principles from imperfect data.
Conclusion: The decoupled learn-then-search approach is effective for automated discovery of conserved quantities from noisy observational data.
Abstract: The discovery of conservation laws is a cornerstone of scientific progress. However, identifying these invariants from observational data remains a significant challenge. We propose a hybrid framework to automate the discovery of conserved quantities from noisy trajectory data. Our approach integrates three components: (1) a Neural Ordinary Differential Equation (Neural ODE) that learns a continuous model of the system’s dynamics, (2) a Transformer that generates symbolic candidate invariants conditioned on the learned vector field, and (3) a symbolic-numeric verifier that provides a strong numerical certificate for the validity of these candidates. We test our framework on canonical physical systems and show that it significantly outperforms baselines that operate directly on trajectory data. This work demonstrates the robustness of a decoupled learn-then-search approach for discovering mathematical principles from imperfect data.
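The verification component can be illustrated by numerically checking how much a symbolic candidate drifts along a simulated trajectory. The sketch below does this for a harmonic oscillator with a simple Euler rollout; the system, integrator, and tolerance are illustrative, and the paper's certificate is presumably more careful than this.

```python
# Numeric check of a symbolic candidate invariant: simulate the dynamics and measure
# how much the candidate drifts. System, Euler integrator, and tolerance are illustrative.
import numpy as np
import sympy as sp

def verify_invariant(expr, symbols, vector_field, x0, t_max=10.0, dt=1e-3, rtol=1e-2):
    f = sp.lambdify(symbols, expr, "numpy")
    x = np.array(x0, dtype=float)
    h0 = f(*x)
    max_drift = 0.0
    for _ in range(int(t_max / dt)):                 # simple explicit-Euler rollout
        x = x + dt * np.array(vector_field(x))
        max_drift = max(max_drift, abs(f(*x) - h0))
    return max_drift <= rtol * max(abs(h0), 1.0), max_drift

# Harmonic oscillator q' = p, p' = -q with candidate H = (q^2 + p^2) / 2.
q, p = sp.symbols("q p")
ok, drift = verify_invariant((q**2 + p**2) / 2, (q, p),
                             lambda x: (x[1], -x[0]), x0=(1.0, 0.0))
print(ok, drift)
```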
[628] Pelican-VL 1.0: A Foundation Brain Model for Embodied Intelligence
Yi Zhang, Che Liu, Xiancong Ren, Hanchu Ni, Shuai Zhang, Zeyuan Ding, Jiayu Hu, Hanzhe Shan, Zhenwei Niu, Zhaoyang Liu, Yue Zhao, Junbo Qi, Qinfan Zhang, Dengjie Li, Yidong Wang, Jiachen Luo, Yong Dai, Jian Tang, Xiaozhu Ju
Main category: cs.LG
TL;DR: Pelican-VL 1.0 is a new family of open-source embodied brain models (7B-72B parameters) that achieves state-of-the-art performance through DPPO framework and large-scale training.
Details
Motivation: To embed powerful intelligence into various embodiments and create the largest-scale open-source embodied multimodal brain model.Method: Uses DPPO (Deliberate Practice Policy Optimization) framework with a metaloop (RL-Refine-Diagnose-SFT loop) for deliberate practice, trained on 1000+ A800 GPUs with high-quality dataset distilled from 4+ billion tokens.
Result: Achieves 20.3% performance uplift from base model, outperforms 100B-level open-source counterparts by 10.6%, and matches leading proprietary systems on embodied benchmarks.
Conclusion: Pelican-VL 1.0 successfully demonstrates the effectiveness of the DPPO framework and large-scale training for creating high-performance embodied AI systems.
Abstract: This report presents Pelican-VL 1.0, a new family of open-source embodied brain models with parameter scales ranging from 7 billion to 72 billion. Our explicit mission is clearly stated as: To embed powerful intelligence into various embodiments. Pelican-VL 1.0 is currently the largest-scale open-source embodied multimodal brain model. Its core advantage lies in the in-depth integration of data power and intelligent adaptive learning mechanisms. Specifically, a metaloop distilled a high-quality dataset from a raw dataset containing 4+ billion tokens. Pelican-VL 1.0 is trained on a large-scale cluster of 1000+ A800 GPUs, consuming over 50k+ A800 GPU-hours per checkpoint. This translates to a 20.3% performance uplift from its base model and outperforms 100B-level open-source counterparts by 10.6%, placing it on par with leading proprietary systems on well-known embodied benchmarks. We establish a novel framework, DPPO (Deliberate Practice Policy Optimization), inspired by human metacognition, to train Pelican-VL 1.0. We operationalize this as a metaloop that teaches the AI to practice deliberately, which is an RL-Refine-Diagnose-SFT loop.
[629] MeixnerNet: Adaptive and Robust Spectral Graph Neural Networks with Discrete Orthogonal Polynomials
Huseyin Goksu
Main category: cs.LG
TL;DR: MeixnerNet introduces discrete orthogonal polynomials (Meixner polynomials) for spectral GNNs, addressing the mismatch between continuous-domain filters and discrete graph structures, with learnable parameters and improved stability.
Details
Motivation: To address the theoretical disconnect between continuous-domain polynomial filters (like Chebyshev) and discrete graph structures in spectral GNNs, which can lead to suboptimal performance and fragility to hyperparameter settings.Method: Uses discrete orthogonal Meixner polynomials with learnable shape parameters (beta and c), combined with a novel stabilization technique using Laplacian scaling and per-basis LayerNorm to overcome numerical instability.
Result: MeixnerNet achieves competitive-to-superior performance against ChebyNet at optimal K=2 setting (winning 2/3 benchmarks) and shows exceptional robustness to polynomial degree K variations, while ChebyNet collapses in performance.
Conclusion: MeixnerNet provides a more theoretically grounded approach for spectral GNNs using discrete orthogonal polynomials, offering both competitive performance and superior robustness to hyperparameter variations compared to continuous-domain polynomial filters.
Abstract: Spectral Graph Neural Networks (GNNs) have achieved state-of-the-art results by defining graph convolutions in the spectral domain. A common approach, popularized by ChebyNet, is to use polynomial filters based on continuous orthogonal polynomials (e.g., Chebyshev). This creates a theoretical disconnect, as these continuous-domain filters are applied to inherently discrete graph structures. We hypothesize this mismatch can lead to suboptimal performance and fragility to hyperparameter settings. In this paper, we introduce MeixnerNet, a novel spectral GNN architecture that employs discrete orthogonal polynomials – specifically, the Meixner polynomials $M_k(x; \beta, c)$. Our model makes the two key shape parameters of the polynomial, $\beta$ and $c$, learnable, allowing the filter to adapt its polynomial basis to the specific spectral properties of a given graph. We overcome the significant numerical instability of these polynomials by introducing a novel stabilization technique that combines Laplacian scaling with per-basis LayerNorm. We demonstrate experimentally that MeixnerNet achieves competitive-to-superior performance against the strong ChebyNet baseline at the optimal $K = 2$ setting (winning on 2 out of 3 benchmarks). More critically, we show that MeixnerNet is exceptionally robust to variations in the polynomial degree $K$, a hyperparameter to which ChebyNet proves to be highly fragile, collapsing in performance where MeixnerNet remains stable.
[630] Analysis of Line Break prediction models for detecting defensive breakthrough in football
Shoma Yagi, Jun Ichikawa, Genki Ichinose
Main category: cs.LG
TL;DR: Developed a machine learning model using XGBoost to predict Line Breaks in football, achieving high accuracy (AUC: 0.982) and identifying key factors like player speed and defensive gaps.
Details
Motivation: Previous studies focused mainly on shots or goal opportunities, but not on how teams break through defensive lines, which is a critical indicator of offensive effectiveness.
Method: Used event and tracking data from the 2023 J1 League season with 189 features, including player positions, velocities, and spatial configurations, employing an XGBoost classifier.
Result: High predictive accuracy with AUC of 0.982 and Brier score of 0.015; SHAP analysis revealed offensive player speed, defensive gaps, and spatial distributions as key factors; moderate positive correlation with shots and crosses conceded.
Conclusion: Line Breaks are closely linked to scoring opportunity creation and provide a quantitative framework for understanding football tactical dynamics.
Abstract: In football, attacking teams attempt to break through the opponent’s defensive line to create scoring opportunities. This action, known as a Line Break, is a critical indicator of offensive effectiveness and tactical performance, yet previous studies have mainly focused on shots or goal opportunities rather than on how teams break the defensive line. In this study, we develop a machine learning model to predict Line Breaks using event and tracking data from the 2023 J1 League season. The model incorporates 189 features, including player positions, velocities, and spatial configurations, and employs an XGBoost classifier to estimate the probability of Line Breaks. The proposed model achieved high predictive accuracy, with an AUC of 0.982 and a Brier score of 0.015. Furthermore, SHAP analysis revealed that factors such as offensive player speed, gaps in the defensive line, and offensive players’ spatial distributions significantly contribute to the occurrence of Line Breaks. Finally, we found a moderate positive correlation between the predicted probability of being Line-Broken and the number of shots and crosses conceded at the team level. These results suggest that Line Breaks are closely linked to the creation of scoring opportunities and provide a quantitative framework for understanding tactical dynamics in football.
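As a rough illustration of the modelling setup only (not the authors' code, and with synthetic stand-in data in place of the J1 League tracking features), training and scoring such a classifier might look like this:

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, brier_score_loss

# Synthetic stand-in for the 189 event/tracking features per situation
# (player positions, velocities, defensive-line gaps, ...).
rng = np.random.default_rng(0)
X = rng.random((5000, 189))
y = (rng.random(5000) < 0.05).astype(int)       # Line Breaks are rare events

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.05, eval_metric="logloss")
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
print("AUC:  ", roc_auc_score(y_te, proba))     # paper reports 0.982 on real data
print("Brier:", brier_score_loss(y_te, proba))  # paper reports 0.015 on real data
```

The SHAP analysis described in the abstract would then be run on the fitted tree model to attribute predictions to individual tracking features.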
[631] Cross-fluctuation phase transitions reveal sampling dynamics in diffusion models
Sai Niranjan Ramachandran, Manish Krishan Lal, Suvrit Sra
Main category: cs.LG
TL;DR: The paper analyzes sampling dynamics in score-based diffusion models using cross-fluctuations, revealing sharp transitions during generation that can be detected to improve sampling efficiency and enable better zero-shot tasks without retraining.
Details
Motivation: To understand how sampling dynamics evolve in diffusion models and leverage this understanding to improve sampling efficiency and enable better zero-shot performance without expensive retraining or grid search.
Method: Uses cross-fluctuations (centered-moment statistics from statistical physics) to detect sharp transitions in sampling dynamics. For variance-preserving SDEs, derives closed-form expressions for cross-fluctuations that are efficiently computable for reverse trajectories.
Result: Demonstrates that detecting transitions directly boosts sampling efficiency, accelerates class-conditional and rare-class generation, and improves zero-shot image classification and style transfer without expensive grid search or retraining.
Conclusion: The framework bridges discrete Markov chain theory, phase analysis, and modern generative modeling, unifying classical coupling and mixing concepts with continuous dynamics in stochastic SDEs and non-Markovian samplers.
Abstract: We analyse how the sampling dynamics of distributions evolve in score-based diffusion models using cross-fluctuations, a centered-moment statistic from statistical physics. Specifically, we show that starting from an unbiased isotropic normal distribution, samples undergo sharp, discrete transitions, eventually forming distinct events of a desired distribution while progressively revealing finer structure. As this process is reversible, these transitions also occur in reverse, where intermediate states progressively merge, tracing a path back to the initial distribution. We demonstrate that these transitions can be detected as discontinuities in $n^{\text{th}}$-order cross-fluctuations. For variance-preserving SDEs, we derive a closed form for these cross-fluctuations that is efficiently computable for the reverse trajectory. We find that detecting these transitions directly boosts sampling efficiency, accelerates class-conditional and rare-class generation, and improves two zero-shot tasks (image classification and style transfer) without expensive grid search or retraining. We also show that this viewpoint unifies classical coupling and mixing from finite Markov chains with continuous dynamics while extending to stochastic SDEs and non-Markovian samplers. Our framework therefore bridges discrete Markov chain theory, phase analysis, and modern generative modeling.
[632] Dynamic Model Selection for Trajectory Prediction via Pairwise Ranking and Meta-Features
Lu Bowen
Main category: cs.LG
TL;DR: A dynamic multi-expert gating framework that adaptively selects the most reliable trajectory predictor among physics-informed LSTM, Transformer, and fine-tuned GameFormer models on a per-sample basis, achieving 9.5% FDE reduction over GameFormer.
Details
Motivation: Current deep trajectory predictors achieve strong average accuracy but remain unreliable in complex long-tail driving scenarios, revealing weaknesses of the "one-model-fits-all" paradigm where simpler physics-based models can outperform advanced networks in safety-critical contexts.
Method: Proposes a dynamic multi-expert gating framework that uses internal model signals (meta-features like stability and uncertainty) to select the best trajectory predictor per sample, formulated as a pairwise-ranking problem over internal model signals without requiring post-hoc calibration.
Result: On nuPlan-mini dataset (1,287 samples), achieves FDE of 2.567m (9.5% reduction over GameFormer’s 2.835m), realizing 57.8% of oracle performance bound. In open-loop simulations, reduces FDE on left-turn scenarios by ~10% after trajectory horizon alignment.
Conclusion: Adaptive hybrid systems enhance trajectory reliability in safety-critical autonomous driving, providing a practical pathway beyond static single-model paradigms with consistent improvements across both offline validation and open-loop evaluation.
Abstract: Recent deep trajectory predictors (e.g., Jiang et al., 2023; Zhou et al., 2022) have achieved strong average accuracy but remain unreliable in complex long-tail driving scenarios. These limitations reveal the weakness of the prevailing “one-model-fits-all” paradigm, particularly in safety-critical urban contexts where simpler physics-based models can occasionally outperform advanced networks (Kalman, 1960). To bridge this gap, we propose a dynamic multi-expert gating framework that adaptively selects the most reliable trajectory predictor among a physics-informed LSTM, a Transformer, and a fine-tuned GameFormer on a per-sample basis. Our method leverages internal model signals (meta-features) such as stability and uncertainty (Gal and Ghahramani, 2016), which we demonstrate to be substantially more informative than geometric scene descriptors. To the best of our knowledge, this is the first work to formulate trajectory expert selection as a pairwise-ranking problem over internal model signals (Burges et al., 2005), directly optimizing decision quality without requiring post-hoc calibration. Evaluated on the nuPlan-mini dataset (Caesar et al., 2021) with 1,287 samples, our LLM-enhanced tri-expert gate achieves a Final Displacement Error (FDE) of 2.567 m, representing a 9.5 percent reduction over GameFormer (2.835 m), and realizes 57.8 percent of the oracle performance bound. In open-loop simulations, after trajectory horizon alignment, the same configuration reduces FDE on left-turn scenarios by approximately 10 percent, demonstrating consistent improvements across both offline validation and open-loop evaluation. These results indicate that adaptive hybrid systems enhance trajectory reliability in safety-critical autonomous driving, providing a practical pathway beyond static single-model paradigms.
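The core idea, selecting an expert by pairwise ranking of internal meta-features, can be sketched as follows; the feature names, array shapes, and the logistic ranker are illustrative assumptions rather than the paper's exact design.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

n_samples, n_experts, n_meta = 1287, 3, 8            # experts: LSTM, Transformer, GameFormer
meta = np.random.rand(n_samples, n_experts, n_meta)  # per-expert meta-features (stability, uncertainty, ...)
fde = np.random.rand(n_samples, n_experts)           # per-expert final displacement error (training labels)

# Pairwise-ranking data: does expert i beat expert j on this sample?
pairs_X, pairs_y = [], []
for i in range(n_experts):
    for j in range(n_experts):
        if i != j:
            pairs_X.append(meta[:, i] - meta[:, j])
            pairs_y.append((fde[:, i] < fde[:, j]).astype(int))
ranker = LogisticRegression(max_iter=1000).fit(np.vstack(pairs_X), np.concatenate(pairs_y))

def select_expert(sample_meta: np.ndarray) -> int:
    """Pick the expert with the most predicted pairwise wins for one sample."""
    wins = np.zeros(n_experts)
    for i in range(n_experts):
        for j in range(n_experts):
            if i != j:
                wins[i] += ranker.predict_proba((sample_meta[i] - sample_meta[j])[None, :])[0, 1]
    return int(np.argmax(wins))
```

Ranking feature differences rather than regressing each expert's error directly is what makes the gate calibration-free: only the relative ordering of experts matters.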
[633] Casing Collar Identification using AlexNet-based Neural Networks for Depth Measurement in Oil and Gas Wells
Siyu Xiao, Xindi Zhao, Tianhao Mao, Yiwei Wang, Yuqiao Chen, Hongyun Zhang, Jian Wang, Junjie Wang, Shuang Liu, Tupei Chen, Yang Liu
Main category: cs.LG
TL;DR: This paper presents a system for CCL signal acquisition and comprehensive preprocessing methods for data augmentation to improve neural network-based casing collar recognition in data-limited environments.
Details
Motivation: Accurate downhole depth measurement is crucial for oil/gas operations, but preprocessing methods for CCL signal recognition are underdeveloped and real well data is limited for training neural networks.
Method: Integrated system for CCL signal acquisition, comprehensive preprocessing methods (standardization, LDS, random cropping, LSR, time scaling, multiple sampling), and AlexNet-based neural network models evaluated through systematic experimentation.
Result: F1 scores improved from 0.937 and 0.952 to 1.0 for both benchmark models. Standardization, LDS, and random cropping are fundamental requirements, while LSR, time scaling, and multiple sampling enhance generalization.
Conclusion: The proposed data augmentation methods effectively address gaps in training casing collar recognition models in CCL data-limited environments, with validation confirming practical applicability.
Abstract: Accurate downhole depth measurement is essential for oil and gas well operations, directly influencing reservoir contact, production efficiency, and operational safety. Collar correlation using a casing collar locator (CCL) is fundamental for precise depth calibration. While neural network-based CCL signal recognition has achieved significant progress in collar identification, preprocessing methods for such applications remain underdeveloped. Moreover, the limited availability of real well data poses substantial challenges for training neural network models that require extensive datasets. This paper presents a system integrated into downhole tools for CCL signal acquisition to facilitate dataset construction. We propose comprehensive preprocessing methods for data augmentation and evaluate their effectiveness using our AlexNet-based neural network models. Through systematic experimentation across various configuration combinations, we analyze the contribution of each augmentation method. Results demonstrate that standardization, label distribution smoothing (LDS), and random cropping are fundamental requirements for model training, while label smoothing regularization (LSR), time scaling, and multiple sampling significantly enhance model generalization capability. The F1 scores of our two benchmark models trained with the proposed augmentation methods improve from 0.937 and 0.952 to, at best, 1.0 and 1.0, respectively. Performance validation on real CCL waveforms confirms the effectiveness and practical applicability of our approach. This work addresses the gaps in data augmentation methodologies for training casing collar recognition models in CCL data-limited environments.
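A few of the waveform augmentations named above (standardization, random cropping, time scaling) are generic 1-D signal operations. A minimal sketch, with all lengths and factors chosen purely for illustration:

```python
import numpy as np

def standardize(sig: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-variance scaling of a CCL waveform."""
    return (sig - sig.mean()) / (sig.std() + 1e-8)

def random_crop(sig: np.ndarray, crop_len: int, rng: np.random.Generator) -> np.ndarray:
    """Take a random window, shifting where the collar event falls in the frame."""
    start = rng.integers(0, len(sig) - crop_len + 1)
    return sig[start:start + crop_len]

def time_scale(sig: np.ndarray, factor: float) -> np.ndarray:
    """Resample the waveform to mimic different logging speeds."""
    src = np.linspace(0.0, 1.0, len(sig))
    dst = np.linspace(0.0, 1.0, int(round(len(sig) * factor)))
    return np.interp(dst, src, sig)

rng = np.random.default_rng(0)
raw = rng.standard_normal(4096)   # stand-in for a recorded CCL trace
aug = time_scale(random_crop(standardize(raw), crop_len=2048, rng=rng), factor=0.9)
```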
[634] A Comparative Analysis of LLM Adaptation: SFT, LoRA, and ICL in Data-Scarce Scenarios
Bernd Bohnet, Rumen Dangovski, Kevin Swersky, Sherry Moore, Arslan Chaudhry, Kathleen Kenealy, Noah Fiedel
Main category: cs.LG
TL;DR: Comparative analysis of LLM adaptation methods (SFT, LoRA, ICL) in data-scarce scenarios, finding LoRA provides the best balance between skill acquisition and preserving general knowledge.
Details
Motivation: Need effective LLM adaptation methods that avoid catastrophic forgetting while integrating new knowledge/skills, especially in data-limited situations.
Method: Comparative analysis of three adaptation techniques: Supervised Finetuning (SFT), LoRA (Parameter-Efficient Fine-Tuning), and In-Context Learning (ICL) in data-scarce scenarios.
Result: LoRA provides most effective balance - successfully instills new skills with minimal impact on base model’s general knowledge. SFT excels at skill acquisition but highly susceptible to catastrophic forgetting. ICL effective for factual knowledge but struggles with complex skills.
Conclusion: LoRA offers optimal adaptation strategy, highlighting critical distinction between skill acquisition vs knowledge integration, and trade-offs between task-specific performance and preservation of general capabilities.
Abstract: The remarkable capabilities of Large Language Models (LLMs) often need to be tailored for specific applications, requiring the integration of new knowledge or the acquisition of new skills. While full fine-tuning is a powerful adaptation method, it is computationally expensive and can lead to a degradation of general reasoning abilities, a phenomenon known as catastrophic forgetting. A range of alternative techniques exists, each with its own trade-offs. In-Context Learning (ICL) is fast but limited by context length, while Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) offer a middle ground by minimizing parameter changes. However, the challenge of catastrophic forgetting persists, raising questions about the best adaptation strategy for a given task. This paper presents a comparative analysis of Supervised Finetuning (SFT), LoRA, and ICL in data-scarce scenarios. We find that LoRA provides the most effective balance, successfully instilling new skills with minimal impact on the base model’s general knowledge. In contrast, while SFT excels at skill acquisition, it is highly susceptible to catastrophic forgetting. ICL is effective for incorporating factual knowledge but struggles with complex skills. Our findings offer a practical framework for selecting an LLM adaptation strategy. We highlight the critical distinction between skill acquisition and knowledge integration, and clarify the trade-offs between task-specific performance and the preservation of general capabilities.
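For context on what the LoRA route involves in practice, the sketch below shows a typical Hugging Face peft setup; the base model, rank, and target modules are placeholders, not the configuration used in the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Llama-2-7b-hf"            # placeholder; any causal LM works
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

lora_cfg = LoraConfig(
    r=16,                                       # low-rank dimension of the adapters
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],        # attention projections; adjust per architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()              # only adapters train; the frozen base limits forgetting
```

Keeping the base weights frozen is precisely why LoRA sits between SFT (all weights move, forgetting risk) and ICL (no weights move, limited skill acquisition) in the paper's comparison.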
[635] Feature Importance Guided Random Forest Learning with Simulated Annealing Based Hyperparameter Tuning
Kowshik Balasubramanian, Andre Williams, Ismail Butun
Main category: cs.LG
TL;DR: A novel Random Forest enhancement framework using probabilistic feature sampling and Simulated Annealing for hyperparameter tuning, improving accuracy and generalization across multiple domains.
Details
Motivation: To overcome limitations of conventional Random Forests by better capturing relevant data signals and enabling adaptive hyperparameter configuration for robust classification.
Method: Integrates probabilistic feature sampling (importance-aware sampling) with hyperparameter tuning via Simulated Annealing, focusing on meaningful features and dynamic parameter optimization.
Result: Demonstrates consistent accuracy improvements and provides meaningful insights into feature relevance across credit risk, IoT anomaly detection, medical diagnostics, and biological data analysis.
Conclusion: The combination of importance-aware sampling and metaheuristic optimization effectively enhances Random Forest performance and generalization capabilities.
Abstract: This paper introduces a novel framework for enhancing Random Forest classifiers by integrating probabilistic feature sampling and hyperparameter tuning via Simulated Annealing. The proposed framework exhibits substantial advancements in predictive accuracy and generalization, adeptly tackling the multifaceted challenges of robust classification across diverse domains, including credit risk evaluation, anomaly detection in IoT ecosystems, early-stage medical diagnostics, and high-dimensional biological data analysis. To overcome the limitations of conventional Random Forests, we present an approach that places stronger emphasis on capturing the most relevant signals from data while enabling adaptive hyperparameter configuration. The model is guided towards features that contribute more meaningfully to classification, and this emphasis is refined with dynamic parameter tuning. The results demonstrate consistent accuracy improvements and meaningful insights into feature relevance, showcasing the efficacy of combining importance-aware sampling and metaheuristic optimization.
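One simple way to combine the two ideas is to draw feature subsets with probabilities proportional to a pilot forest's importances while annealing the forest's hyperparameters. The sketch below illustrates that pattern on synthetic data; it is a generic illustration under those assumptions, not the paper's algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=40, n_informative=8, random_state=0)

# Importance-aware feature sampling: bias subsets toward features a pilot forest found useful.
pilot = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
probs = pilot.feature_importances_ / pilot.feature_importances_.sum()

def objective(n_estimators: int, max_depth: int) -> float:
    feats = rng.choice(X.shape[1], size=20, replace=False, p=probs)
    rf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=0)
    return cross_val_score(rf, X[:, feats], y, cv=3).mean()

# Simulated annealing over two hyperparameters with a geometric cooling schedule.
current, temperature = (100, 5), 1.0
current_score = objective(*current)
for _ in range(30):
    candidate = (max(50, current[0] + int(rng.integers(-50, 51))),
                 max(2, current[1] + int(rng.integers(-2, 3))))
    cand_score = objective(*candidate)
    # Always accept improvements; accept worse moves with a temperature-dependent probability.
    if cand_score > current_score or rng.random() < np.exp((cand_score - current_score) / temperature):
        current, current_score = candidate, cand_score
    temperature *= 0.9
print("selected hyperparameters:", current, "cv accuracy:", round(current_score, 3))
```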
[636] Physiologically Active Vegetation Reverses Its Cooling Effect in Humid Urban Climates
Angana Borah, Adrija Datta, Ashish S. Kumar, Raviraj Dave, Udit Bhatia
Main category: cs.LG
TL;DR: Vegetation in cities creates a trade-off between surface cooling and increased humidity that can intensify perceived heat stress. The study identifies specific vegetation thresholds where cooling reverses to warming in different urban climates.
Details
Motivation: Urban greening efforts are unevenly successful because vegetation can both cool surfaces and increase humidity, making air feel hotter. Current understanding of how vegetation affects humid heat in cities is limited, leaving mitigation policies unguided.
Method: Used an interpretable machine-learning framework combining SHAP and ALE to analyze vegetation-climate interactions across 138 Indian cities. Analyzed Heat Index (HI) at 1km resolution across different climate zones and urban densities.
Result: Cooling strengthens with EVI >= 0.4 and LAI >= 0.05, but reverses to warming when EVI >= 0.5, LAI >= 0.2, and fPAR >= 0.5. In humid dense cores, warming occurs earlier at fPAR >= 0.25 due to vegetation elevating humidity faster than removing heat.
Conclusion: The study establishes climatic limits for vegetation-driven cooling and provides quantitative thresholds for climate-specific greening strategies to create equitable, heat-resilient cities.
Abstract: Efforts to green cities for cooling are succeeding unevenly because the same vegetation that cools surfaces can also intensify how hot the air feels. Previous studies have identified humid heat as a growing urban hazard, yet how physiologically active vegetation governs this trade-off between cooling and moisture accumulation remains poorly understood, leaving mitigation policy and design largely unguided. Here we quantify how vegetation structure and function influence the Heat Index (HI), a combined measure of temperature and humidity, in 138 Indian cities spanning tropical savanna, semi-arid steppe, and humid subtropical climates, and across dense urban cores and semi-urban rings. Using an extreme-aware, one kilometre reconstruction of HI and an interpretable machine-learning framework that integrates SHapley Additive Explanations (SHAP) and Accumulated Local Effects (ALE), we isolate vegetation-climate interactions. Cooling generally strengthens for EVI >= 0.4 and LAI >= 0.05, but joint-high regimes begin to reverse toward warming when EVI >= 0.5, LAI >= 0.2, and fPAR >= 0.5, with an earlier onset for fPAR >= 0.25 in humid, dense cores. In such environments, highly physiologically active vegetation elevates near-surface humidity faster than it removes heat, reversing its cooling effect and amplifying perceived heat stress. These findings establish the climatic limits of vegetation-driven cooling and provide quantitative thresholds for climate-specific greening strategies that promote equitable and heat-resilient cities.
[637] A Dual Large Language Models Architecture with Herald Guided Prompts for Parallel Fine Grained Traffic Signal Control
Qing Guo, Xinhang Li, Junyu Chen, Zheng Guo, Xiaocong Li, Lin Zhang, Lei Li
Main category: cs.LG
TL;DR: HeraldLight is a dual LLM architecture for traffic signal control that uses a Herald Module for context extraction and queue forecasting, with LLM-Agent making control decisions and LLM-Critic refining outputs to reduce errors and hallucinations.
Details
Motivation: Existing LLM-based traffic control methods have fixed time durations and hallucination issues, while RL methods lack robustness and generalization in signal timing decisions.
Method: Proposes HeraldLight with dual LLMs: Herald Module extracts context and forecasts queue lengths, LLM-Agent makes fine-grained traffic control decisions, and LLM-Critic refines outputs using score-based fine-tuning.
Result: Outperforms state-of-the-art baselines with 20.03% reduction in average travel time across all scenarios and 10.74% reduction in average queue length on Jinan and Hangzhou scenarios.
Conclusion: HeraldLight effectively addresses limitations of both LLM-based and RL methods in traffic signal control, demonstrating superior performance through its dual LLM architecture with error correction mechanisms.
Abstract: Leveraging large language models (LLMs) in traffic signal control (TSC) improves optimization efficiency and interpretability compared to traditional reinforcement learning (RL) methods. However, existing LLM-based approaches are limited by fixed signal time durations and are prone to hallucination errors, while RL methods lack robustness in signal timing decisions and suffer from poor generalization. To address these challenges, this paper proposes HeraldLight, a dual-LLM architecture enhanced by Herald-guided prompts. The Herald Module extracts contextual information and forecasts queue lengths for each traffic phase based on real-time conditions. The first LLM, LLM-Agent, uses these forecasts to make fine-grained traffic signal control decisions, while the second LLM, LLM-Critic, refines LLM-Agent’s outputs, correcting errors and hallucinations. These refined outputs are used for score-based fine-tuning to improve accuracy and robustness. Simulation experiments using CityFlow on real-world datasets covering 224 intersections in Jinan (12), Hangzhou (16), and New York (196) demonstrate that HeraldLight outperforms state-of-the-art baselines, achieving a 20.03% reduction in average travel time across all scenarios and a 10.74% reduction in average queue length on the Jinan and Hangzhou scenarios. The source code is available on GitHub: https://github.com/BUPT-ANTlab/HeraldLight.
[638] Study on Supply Chain Finance Decision-Making Model and Enterprise Economic Performance Prediction Based on Deep Reinforcement Learning
Shiman Zhang, Jinghan Zhou, Zhoufan Yu, Ningai Leng
Main category: cs.LG
TL;DR: Proposes a hybrid deep learning and intelligent particle swarm optimization model for supply chain decision-making, combining feature extraction with optimization algorithms to improve planning efficiency and adaptive control.
Details
Motivation: To improve decision-making and planning efficiency in back-end centralized redundant supply chains by addressing the need for better optimization and adaptive control in dynamic environments.
Method: Integrates deep learning (CNN for feature extraction, linear programming for statistical features) with intelligent particle swarm optimization, using fuzzy association rule scheduling and deep reinforcement learning for model optimization, and neural networks for dynamic change fitting.
Result: Simulations show reduced resource consumption, enhanced spatial planning, improved real-time decision adjustment, distribution path optimization, and robust intelligent control in dynamic environments.
Conclusion: The hybrid deep learning and intelligent particle swarm optimization approach effectively improves supply chain decision-making efficiency and adaptive control capabilities.
Abstract: To improve decision-making and planning efficiency in back-end centralized redundant supply chains, this paper proposes a decision model integrating deep learning with intelligent particle swarm optimization. A distributed node deployment model and optimal planning path are constructed for the supply chain network. Deep learning models such as convolutional neural networks extract features from historical data, and linear programming captures high-order statistical features. The model is optimized using fuzzy association rule scheduling and deep reinforcement learning, while neural networks fit dynamic changes. A hybrid mechanism of “deep learning feature extraction - intelligent particle swarm optimization” guides global optimization and selects optimal decisions for adaptive control. Simulations show reduced resource consumption and enhanced spatial planning and, in dynamic environments, improved real-time decision adjustment, distribution-path optimization, and robust intelligent control.
[639] Can SAEs reveal and mitigate racial biases of LLMs in healthcare?
Hiba Ahsan, Byron C. Wallace
Main category: cs.LG
TL;DR: SAEs can identify problematic race associations in LLMs used in healthcare, but steering via latents has limited utility for bias mitigation in realistic clinical tasks.
Details
Motivation: To detect and control spurious reliance on patient race in LLMs used in healthcare, which could worsen existing biases and lead to problematic associations.
Method: Used Sparse Autoencoders (SAEs) to identify latents in Gemma-2 models that correlate with Black individuals, then tested steering via these latents to control model outputs.
Result: Found SAE latents activate on both reasonable inputs (e.g., “African American”) and problematic words (e.g., “incarceration”). Steering increased risk assignments like predicting patients becoming “belligerent”.
Conclusion: SAEs are useful for identifying problematic demographic reliance in clinical LLMs, but bias mitigation via SAE steering offers only marginal improvements for complex clinical tasks.
Abstract: LLMs are increasingly being used in healthcare. This promises to free physicians from drudgery, enabling better care to be delivered at scale. But the use of LLMs in this space also brings risks; for example, such models may worsen existing biases. How can we spot when LLMs are (spuriously) relying on patient race to inform predictions? In this work we assess the degree to which Sparse Autoencoders (SAEs) can reveal (and control) associations the model has made between race and stigmatizing concepts. We first identify SAE latents in Gemma-2 models which appear to correlate with Black individuals. We find that this latent activates on reasonable input sequences (e.g., “African American”) but also problematic words like “incarceration”. We then show that we can use this latent to steer models to generate outputs about Black patients, and further that this can induce problematic associations in model outputs as a result. For example, activating the Black latent increases the risk assigned to the probability that a patient will become “belligerent”. We evaluate the degree to which such steering via latents might be useful for mitigating bias. We find that this offers improvements in simple settings, but is less successful for more realistic and complex clinical tasks. Overall, our results suggest that: SAEs may offer a useful tool in clinical applications of LLMs to identify problematic reliance on demographics but mitigating bias via SAE steering appears to be of marginal utility for realistic tasks.
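The steering step itself is conceptually simple: add a scaled SAE decoder direction to the residual stream at the hooked layer. The sketch below is a generic illustration under that assumption; `model`, `W_dec` (the SAE decoder weight matrix), the layer index, latent index, and scale are all hypothetical handles, not values from the paper.

```python
import torch

def make_steering_hook(sae_decoder: torch.Tensor, latent_idx: int, alpha: float = 5.0):
    """Forward hook that adds a scaled SAE decoder direction to a layer's output."""
    direction = sae_decoder[latent_idx]            # [d_model] decoder row for this latent
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction       # broadcasts over batch and sequence positions
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return hook

# Hypothetical usage with a decoder-only transformer exposing .model.layers:
# handle = model.model.layers[12].register_forward_hook(make_steering_hook(W_dec, latent_idx=4711))
# ...generate text about a patient, inspect how outputs shift, then: handle.remove()
```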
[640] PDE-SHARP: PDE Solver Hybrids Through Analysis & Refinement Passes
Shaghayegh Fazliani, Madeleine Udell
Main category: cs.LG
TL;DR: PDE-SHARP is a framework that reduces computational costs for generating PDE solvers by using LLM inference instead of expensive numerical evaluations, achieving superior accuracy with 60-75% fewer computational evaluations.
Details
Motivation: Current LLM-driven approaches for generating PDE solvers require executing many solver samples, which is computationally expensive especially for complex PDEs requiring substantial resources for numerical evaluation.
Method: PDE-SHARP employs three stages: (1) Analysis: mathematical chain-of-thought analysis including PDE classification, solution type detection, and stability analysis; (2) Genesis: solver generation based on mathematical insights; (3) Synthesis: collaborative selection-hybridization tournaments where LLM judges iteratively refine implementations through flexible performance feedback.
Result: PDE-SHARP requires fewer than 13 solver evaluations on average compared to 30+ for baseline methods, improves accuracy uniformly across tested PDEs by 4× on average, and demonstrates robust performance across different LLM architectures.
Conclusion: PDE-SHARP successfully reduces computational costs while achieving superior solver accuracy through its three-stage framework that leverages LLM inference to replace expensive scientific computations.
Abstract: Current LLM-driven approaches using test-time computing to generate PDE solvers execute a large number of solver samples to identify high-accuracy solvers. These paradigms are especially costly for complex PDEs requiring substantial computational resources for numerical evaluation. We introduce PDE-SHARP, a framework to reduce computational costs by replacing expensive scientific computation by cheaper LLM inference that achieves superior solver accuracy with 60-75% fewer computational evaluations. PDE-SHARP employs three stages: (1) Analysis: mathematical chain-of-thought analysis including PDE classification, solution type detection, and stability analysis; (2) Genesis: solver generation based on mathematical insights from the previous stage; and (3) Synthesis: collaborative selection-hybridization tournaments in which LLM judges iteratively refine implementations through flexible performance feedback. To generate high-quality solvers, PDE-SHARP requires fewer than 13 solver evaluations on average compared to 30+ for baseline methods, improving accuracy uniformly across tested PDEs by $4\times$ on average, and demonstrates robust performance across LLM architectures, from general-purpose to specialized reasoning models.
[641] EL-MIA: Quantifying Membership Inference Risks of Sensitive Entities in LLMs
Ali Satvaty, Suzan Verberne, Fatih Turkmen
Main category: cs.LG
TL;DR: The paper proposes EL-MIA, a framework for entity-level membership inference attacks on LLMs, focusing on sensitive information like PII and credit card numbers, and shows existing MIA methods are limited for this task.
Details
Motivation: Existing MIA methods can detect entire prompts/documents in training data but fail to capture risks at finer granularity for sensitive entities like PII and credit card numbers.
Method: Proposed EL-MIA framework for auditing entity-level membership risks, constructed benchmark dataset, systematically compared existing MIA techniques and two new methods, analyzed relationship with model scale, training epochs, and surface factors.
Result: Existing MIA methods are limited for entity-level membership inference of sensitive attributes, but susceptibility can be outlined with relatively straightforward methods.
Conclusion: Highlights the need for stronger adversaries to stress test the threat model for entity-level membership inference in LLMs.
Abstract: Membership inference attacks (MIA) aim to infer whether a particular data point is part of the training dataset of a model. In this paper, we propose a new task in the context of LLM privacy: entity-level discovery of membership risk focused on sensitive information (PII, credit card numbers, etc.). Existing methods for MIA can detect the presence of entire prompts or documents in the LLM training data, but they fail to capture risks at a finer granularity. We propose the “EL-MIA” framework for auditing entity-level membership risks in LLMs. We construct a benchmark dataset for the evaluation of MIA methods on this task. Using this benchmark, we conduct a systematic comparison of existing MIA techniques as well as two newly proposed methods. We provide a comprehensive analysis of the results, trying to explain the relation of entity-level MIA susceptibility to model scale, training epochs, and other surface-level factors. Our findings reveal that existing MIA methods are limited when it comes to entity-level membership inference of sensitive attributes, while this susceptibility can be outlined with relatively straightforward methods, highlighting the need for stronger adversaries to stress-test the provided threat model.
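The paper's own scoring methods are not reproduced here, but a common loss-based baseline for entity-level risk is to measure how well the model predicts the entity tokens given their context; unusually low loss on the span suggests memorization. A minimal sketch (the model name and example strings are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"                                   # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

def entity_span_nll(context: str, entity: str) -> float:
    """Mean negative log-likelihood of the entity tokens given the preceding context."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    ent_ids = tok(entity, add_special_tokens=False, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, ent_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # The prediction for token t comes from the logits at position t - 1.
    ent_logits = logits[0, ctx_ids.shape[1] - 1 : ids.shape[1] - 1]
    return torch.nn.functional.cross_entropy(ent_logits, ent_ids[0]).item()

# Lower span NLL (relative to non-member references) suggests higher membership risk.
score = entity_span_nll("Discharge note: the patient, ", "John Doe")
print(score)
```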
[642] Diffusion LLMs are Natural Adversaries for any LLM
David Lüdke, Tom Wollschläger, Paul Ungermann, Stephan Günnemann, Leo Schwinn
Main category: cs.LG
TL;DR: A framework that transforms adversarial prompt optimization into an efficient amortized inference task using pretrained non-autoregressive LLMs like Diffusion LLMs to directly generate high-reward prompts.
Details
Motivation: To address the resource-intensive nature of traditional adversarial prompt optimization by leveraging joint distribution modeling capabilities of modern LLMs for more efficient prompt generation.
Method: Uses pretrained non-autoregressive generative LLMs (Diffusion LLMs) as surrogates for prompt search, enabling direct conditional generation of prompts through parallelizable sampling instead of per-instance discrete optimization.
Result: Generated prompts are low-perplexity, diverse jailbreaks that show strong transferability to various black-box target models including robustly trained and proprietary LLMs.
Conclusion: The framework enables efficient adversarial prompting and opens new directions for red teaming, automated prompt optimization, and utilization of emerging Flow- and Diffusion-based LLMs.
Abstract: We introduce a novel framework that transforms the resource-intensive (adversarial) prompt optimization problem into an efficient, amortized inference task. Our core insight is that pretrained, non-autoregressive generative LLMs, such as Diffusion LLMs, which model the joint distribution over prompt-response pairs, can serve as powerful surrogates for prompt search. This approach enables the direct conditional generation of prompts, effectively replacing costly, per-instance discrete optimization with a small number of parallelizable samples. We provide a probabilistic analysis demonstrating that under mild fidelity assumptions, only a few conditional samples are required to recover high-reward (harmful) prompts. Empirically, we find that the generated prompts are low-perplexity, diverse jailbreaks that exhibit strong transferability to a wide range of black-box target models, including robustly trained and proprietary LLMs. Beyond adversarial prompting, our framework opens new directions for red teaming, automated prompt optimization, and leveraging emerging Flow- and Diffusion-based LLMs.
[643] Diffusion Models at the Drug Discovery Frontier: A Review on Generating Small Molecules versus Therapeutic Peptides
Yiquan Wang, Yahui Ma, Yuhan Chang, Jiayao Yan, Jialin Zhang, Minnuo Cai, Kai Wei
Main category: cs.LG
TL;DR: Diffusion models show promise for drug discovery, particularly for designing small molecules and therapeutic peptides, but face modality-specific challenges like synthesizability for small molecules and biological stability for peptides.
Details
Motivation: To accelerate and transform the traditionally slow and costly drug discovery process by applying diffusion models to therapeutic design.
Method: Systematic comparison of diffusion model applications using iterative denoising framework adapted to distinct molecular representations, chemical spaces, and design objectives for small molecules and therapeutic peptides.
Result: Diffusion models excel at structure-based design for small molecules (generating pocket-fitting ligands) and functional sequence generation for peptides, but face distinct challenges in each domain.
Conclusion: Full potential of diffusion models will be unlocked by bridging modality-specific gaps and integrating them into automated Design-Build-Test-Learn platforms for targeted therapeutic creation.
Abstract: Diffusion models have emerged as a leading framework in generative modeling, showing significant potential to accelerate and transform the traditionally slow and costly process of drug discovery. This review provides a systematic comparison of their application in designing two principal therapeutic modalities: small molecules and therapeutic peptides. We analyze how a unified framework of iterative denoising is adapted to the distinct molecular representations, chemical spaces, and design objectives of each modality. For small molecules, these models excel at structure-based design, generating novel, pocket-fitting ligands with desired physicochemical properties, yet face the critical hurdle of ensuring chemical synthesizability. Conversely, for therapeutic peptides, the focus shifts to generating functional sequences and designing de novo structures, where the primary challenges are achieving biological stability against proteolysis, ensuring proper folding, and minimizing immunogenicity. Despite these distinct challenges, both domains face shared hurdles: the need for more accurate scoring functions, the scarcity of high-quality experimental data, and the crucial requirement for experimental validation. We conclude that the full potential of diffusion models will be unlocked by bridging these modality-specific gaps and integrating them into automated, closed-loop Design-Build-Test-Learn (DBTL) platforms, thereby shifting the paradigm from chemical exploration to the targeted creation of novel therapeutics.
[644] Iterative Foundation Model Fine-Tuning on Multiple Rewards
Pouya M. Ghari, Simone Sciabola, Ye Wang
Main category: cs.LG
TL;DR: Proposes multi-reward RL fine-tuning for foundation models to handle multiple evaluation criteria, outperforming single-reward methods across text, biological sequences, and molecule generation.
Details
Motivation: Single reward optimization is suboptimal for applications requiring multiple evaluation criteria like text generation and drug discovery.
Method: Iterative fine-tuning strategy using multiple reward signals, generalizing state-of-the-art RL-based methods with theoretical analysis.
Result: Experimental results show effectiveness across diverse domains including text, biological sequence, and small molecule generation.
Conclusion: Multi-reward RL fine-tuning provides superior performance compared to state-of-the-art baselines in applications requiring multiple evaluation criteria.
Abstract: Fine-tuning foundation models has emerged as a powerful approach for generating objects with specific desired properties. Reinforcement learning (RL) provides an effective framework for this purpose, enabling models to generate outputs that maximize a given reward function. However, in many applications such as text generation and drug discovery, it can be suboptimal to optimize using a single reward signal, as multiple evaluation criteria are often necessary. This paper proposes a novel reinforcement learning-based method for fine-tuning foundation models using multiple reward signals. By employing an iterative fine-tuning strategy across these rewards, our approach generalizes state-of-the-art RL-based methods. We further provide a theoretical analysis that offers insights into the performance of multi-reward RL fine-tuning. Experimental results across diverse domains including text, biological sequence, and small molecule generation, demonstrate the effectiveness of the proposed algorithm compared to state-of-the-art baselines.
[645] Melanoma Classification Through Deep Ensemble Learning and Explainable AI
Wadduwage Shanika Perera, ABM Islam, Van Vung Pham, Min Kyung An
Main category: cs.LG
TL;DR: This paper proposes an ensemble deep learning model for melanoma detection that combines three transfer learning networks with explainable AI (XAI) techniques to improve reliability and trust in predictions.
Details
Motivation: Melanoma is a deadly skin cancer requiring early detection. While deep learning systems show high accuracy, their black-box nature lacks reliability and trust for healthcare diagnostics. XAI can address this explainability limitation.
Method: Used ensemble learning of three state-of-the-art deep transfer learning networks, combined with XAI techniques to explain the basis of predictions and ensure reliability.
Result: The proposed model achieves high accuracy in melanoma detection while providing interpretable explanations for its predictions through XAI techniques.
Conclusion: Combining ensemble deep learning with XAI techniques can overcome the explainability limitations of traditional DL models, making AI-based melanoma detection more reliable and trustworthy for healthcare applications.
Abstract: Melanoma is one of the most aggressive and deadly skin cancers, leading to mortality if not detected and treated in the early stages. Artificial intelligence techniques have recently been developed to help dermatologists in the early detection of melanoma, and systems based on deep learning (DL) have been able to detect these lesions with high accuracy. However, the entire community must overcome the explainability limit to get the maximum benefit from DL for diagnostics in the healthcare domain. Because of the black-box nature of DL models’ decisions, there is a lack of reliability and trust in the outcomes. However, Explainable Artificial Intelligence (XAI) can solve this problem by interpreting the predictions of AI systems. This paper proposes a machine learning model using ensemble learning of three state-of-the-art deep transfer learning networks, along with an approach to ensure the reliability of the predictions by utilizing XAI techniques to explain the basis of the predictions.
[646] A Tight Lower Bound for Non-stochastic Multi-armed Bandits with Expert Advice
Zachary Chase, Shinji Ito, Idan Mehalel
Main category: cs.LG
TL;DR: The paper establishes the minimax optimal expected regret for non-stochastic multi-armed bandit with expert advice as Θ(√(TK log(N/K))), matching upper and lower bounds.
Details
Motivation: To determine the exact minimax optimal regret bound for the classic non-stochastic multi-armed bandit problem with expert advice, resolving the gap between existing upper and lower bounds.
Method: Proving a lower bound that matches the previously known upper bound from Kale (2014) to establish the tight minimax optimal regret rate.
Result: The minimax optimal expected regret is proven to be Θ(√(TK log(N/K))), where K is number of arms, N is number of experts, and T is time horizon.
Conclusion: The paper completes the theoretical understanding of this bandit problem by establishing the exact minimax optimal regret bound through matching upper and lower bounds.
Abstract: We determine the minimax optimal expected regret in the classic non-stochastic multi-armed bandit with expert advice problem, by proving a lower bound that matches the upper bound of Kale (2014). The two bounds determine the minimax optimal expected regret to be $\Theta\left( \sqrt{T K \log (N/K) } \right)$, where $K$ is the number of arms, $N$ is the number of experts, and $T$ is the time horizon.
[647] X-TRACK: Physics-Aware xLSTM for Realistic Vehicle Trajectory Prediction
Aanchal Rajesh Chugh, Marion Neumeier, Sebastian Dorn
Main category: cs.LG
TL;DR: This paper introduces X-TRAJ and X-TRACK, novel xLSTM-based frameworks for vehicle trajectory prediction that incorporate vehicle motion kinematics to generate realistic trajectories, showing superior performance on benchmark datasets.
Details
Motivation: While xLSTM architectures have shown improved ability to model long-term temporal dependencies compared to traditional LSTMs, they remain largely unexplored for vehicle trajectory prediction tasks despite their potential.
Method: The authors propose X-TRAJ (xLSTM-based vehicle trajectory prediction framework) and its physics-aware variant X-TRACK, which explicitly integrates vehicle motion kinematics into the model learning process through physical constraints.
Result: Comprehensive evaluation on highD and NGSIM datasets demonstrates that X-TRACK outperforms state-of-the-art baselines in vehicle trajectory prediction.
Conclusion: The proposed xLSTM-based frameworks, particularly the physics-aware X-TRACK variant, effectively generate realistic and feasible vehicle trajectories by incorporating motion kinematics, achieving superior performance over existing methods.
Abstract: Recent advancements in Recurrent Neural Network (RNN) architectures, particularly the Extended Long Short Term Memory (xLSTM), have addressed the limitations of traditional Long Short Term Memory (LSTM) networks by introducing exponential gating and enhanced memory structures. These improvements make xLSTM suitable for time-series prediction tasks as they exhibit the ability to model long-term temporal dependencies better than LSTMs. Despite their potential, these xLSTM-based models remain largely unexplored in the context of vehicle trajectory prediction. Therefore, this paper introduces a novel xLSTM-based vehicle trajectory prediction framework, X-TRAJ, and its physics-aware variant, X-TRACK (eXtended LSTM for TRAjectory prediction Constraint by Kinematics), which explicitly integrates vehicle motion kinematics into the model learning process. By introducing physical constraints, the proposed model generates realistic and feasible trajectories. A comprehensive evaluation on the highD and NGSIM datasets demonstrates that X-TRACK outperforms state-of-the-art baselines.
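The abstract does not spell out the kinematic formulation, but a common way to enforce feasibility in such models is to predict controls (acceleration and steering) and roll them out through a kinematic bicycle model rather than predicting positions directly. The sketch below illustrates that general pattern under that assumption; it is not claimed to be X-TRACK's exact constraint.

```python
import numpy as np

def bicycle_rollout(x, y, psi, v, accels, steers, wheelbase=2.8, dt=0.1):
    """Integrate predicted controls through a kinematic bicycle model so the
    resulting trajectory is dynamically plausible (illustrative assumption only)."""
    traj = []
    for a, delta in zip(accels, steers):
        x += v * np.cos(psi) * dt
        y += v * np.sin(psi) * dt
        psi += (v / wheelbase) * np.tan(delta) * dt
        v = max(0.0, v + a * dt)          # no reversing in this simple sketch
        traj.append((x, y))
    return np.array(traj)

# Example: a 5 s horizon at 10 Hz with mild constant acceleration and steering.
path = bicycle_rollout(0.0, 0.0, 0.0, 10.0, accels=[0.5] * 50, steers=[0.01] * 50)
print(path[-1])   # final predicted position
```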
[648] Improving the Robustness of Control of Chaotic Convective Flows with Domain-Informed Reinforcement Learning
Michiel Straat, Thorben Markmann, Sebastian Peitz, Barbara Hammer
Main category: cs.LG
TL;DR: This paper introduces domain-informed reinforcement learning for controlling chaotic convective flows like Rayleigh-Bénard Convection, achieving up to 33% heat transport reduction in laminar flows and 10% in chaotic flows.
Details
Motivation: Chaotic convective flows in systems like microfluidic devices and chemical reactors are difficult to stabilize with conventional methods, and RL's generalization and robustness in chaotic regimes remain unexplored despite being critical for real-world deployment.
Method: Domain-informed RL agents trained using Proximal Policy Optimization across diverse initial conditions and flow regimes, with domain knowledge incorporated via reward functions that encourage Bénard cell merging as a desirable macroscopic property.
Result: Domain-informed RL agents reduced convective heat transport by up to 33% in laminar regimes and 10% in chaotic regimes, significantly outperforming conventional controllers. They produced steady flows, faster training convergence, and generalization across flow regimes without retraining.
Conclusion: Domain-informed priors greatly enhance RL-based control robustness for chaotic flows, making real-world deployment more feasible by improving generalization and sample efficiency.
Abstract: Chaotic convective flows arise in many real-world systems, such as microfluidic devices and chemical reactors. Stabilizing these flows is highly desirable but remains challenging, particularly in chaotic regimes where conventional control methods often fail. Reinforcement Learning (RL) has shown promise for control in laminar flow settings, but its ability to generalize and remain robust under chaotic and turbulent dynamics is not well explored, despite being critical for real-world deployment. In this work, we improve the practical feasibility of RL-based control of such flows, focusing on Rayleigh-Bénard Convection (RBC), a canonical model for convective heat transport. To enhance generalization and sample efficiency, we introduce domain-informed RL agents that are trained using Proximal Policy Optimization across diverse initial conditions and flow regimes. We incorporate domain knowledge in the reward function via a term that encourages Bénard cell merging, as an example of a desirable macroscopic property. In laminar flow regimes, the domain-informed RL agents reduce convective heat transport by up to 33%, and in chaotic flow regimes, they still achieve a 10% reduction, which is significantly better than the conventional controllers used in practice. We compare the domain-informed to uninformed agents: Our results show that the domain-informed reward design results in steady flows, faster convergence during training, and generalization across flow regimes without retraining. Our work demonstrates that elegant domain-informed priors can greatly enhance the robustness of RL-based control of chaotic flows, bringing real-world deployment closer.
[649] Calibration Across Layers: Understanding Calibration Evolution in LLMs
Abhinav Joshi, Areeb Ahmad, Ashutosh Modi
Main category: cs.LG
TL;DR: LLMs exhibit inherent calibration where predicted probabilities align with correctness. This study reveals a confidence correction phase in upper layers and identifies a low-dimensional calibration direction that improves calibration metrics without affecting accuracy.
Details
Motivation: To understand how calibration evolves throughout LLM network depth, complementing prior research focused on final layer components like entropy neurons and unembedding matrix null space.
Method: Analyzed multiple open-weight models on the MMLU benchmark, tracking calibration evolution across network layers and identifying a low-dimensional calibration direction in the residual stream.
Result: Discovered a distinct confidence correction phase in upper/later layers where model confidence is actively recalibrated after decision certainty is reached. Perturbation of the identified calibration direction significantly improves ECE and MCE metrics without harming accuracy.
Conclusion: Calibration is a distributed phenomenon shaped throughout the network forward pass, not just in the final projection, revealing how confidence-regulating mechanisms operate within LLMs.
Abstract: Large Language Models (LLMs) have demonstrated inherent calibration capabilities, where predicted probabilities align well with correctness, despite prior findings that deep neural networks are often overconfident. Recent studies have linked this behavior to specific components in the final layer, such as entropy neurons and the unembedding matrix null space. In this work, we provide a complementary perspective by investigating how calibration evolves throughout the network depth. Analyzing multiple open-weight models on the MMLU benchmark, we uncover a distinct confidence correction phase in the upper/later layers, where model confidence is actively recalibrated after decision certainty has been reached. Furthermore, we identify a low-dimensional calibration direction in the residual stream whose perturbation significantly improves calibration metrics (ECE and MCE) without harming accuracy. Our findings suggest that calibration is a distributed phenomenon, shaped throughout the network forward pass, not just in its final projection, providing new insights into how confidence-regulating mechanisms operate within LLMs.
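For reference, the two calibration metrics reported (ECE and MCE) are simple binned statistics over per-example confidences; a standard implementation looks like this (dummy data only, for illustration):

```python
import numpy as np

def ece_mce(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 15):
    """Expected (ECE) and Maximum (MCE) Calibration Error from per-example
    confidences (e.g., max softmax probability) and 0/1 correctness labels."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += in_bin.mean() * gap      # bin weight = fraction of examples in the bin
        mce = max(mce, gap)
    return ece, mce

# Example with roughly calibrated dummy predictions on 1000 MMLU-style questions.
rng = np.random.default_rng(0)
conf = rng.uniform(0.25, 1.0, 1000)
acc = (rng.random(1000) < conf).astype(float)
print(ece_mce(conf, acc))
```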
[650] A systematic evaluation of uncertainty quantification techniques in deep learning: a case study in photoplethysmography signal analysis
Ciaran Bench, Oskar Pfeffer, Vivek Desai, Mohammad Moulaeifard, Loïc Coquelin, Peter H. Charlton, Nils Strodthoff, Nando Hegemann, Philip J. Aston, Andrew Thompson
Main category: cs.LG
TL;DR: Comparison of 8 uncertainty quantification techniques for deep learning models on medical time-series data (PPG), focusing on AF detection and blood pressure prediction, with comprehensive evaluation showing task-dependent optimal methods.
Details
Motivation: Deep learning models for medical time-series monitoring risk poor deployment performance, and reliable uncertainty estimates are needed to help clinicians interpret model trustworthiness.
Method: Implemented 8 UQ techniques on models for AF classification and blood pressure regression tasks using PPG data, with comprehensive evaluation procedure including local calibration and adaptivity assessment.
Result: Found complex reliability patterns across techniques - optimal method depends on uncertainty expression, evaluation metric, and reliability scale. Local calibration provides practical insights not captured by global metrics.
Conclusion: UQ evaluation criteria should align with practical use cases, prioritizing small-scale reliability for limited patient measurements while maintaining predictive performance.
Abstract: In principle, deep learning models trained on medical time-series, including wearable photoplethysmography (PPG) sensor data, can provide a means to continuously monitor physiological parameters outside of clinical settings. However, there is considerable risk of poor performance when deployed in practical measurement scenarios leading to negative patient outcomes. Reliable uncertainties accompanying predictions can provide guidance to clinicians in their interpretation of the trustworthiness of model outputs. It is therefore of interest to compare the effectiveness of different approaches. Here we implement an unprecedented set of eight uncertainty quantification (UQ) techniques to models trained on two clinically relevant prediction tasks: Atrial Fibrillation (AF) detection (classification), and two variants of blood pressure regression. We formulate a comprehensive evaluation procedure to enable a rigorous comparison of these approaches. We observe a complex picture of uncertainty reliability across the different techniques, where the most optimal for a given task depends on the chosen expression of uncertainty, evaluation metric, and scale of reliability assessed. We find that assessing local calibration and adaptivity provides practically relevant insights about model behaviour that otherwise cannot be acquired using more commonly implemented global reliability metrics. We emphasise that criteria for evaluating UQ techniques should cater to the model’s practical use case, where the use of a small number of measurements per patient places a premium on achieving small-scale reliability for the chosen expression of uncertainty, while preserving as much predictive performance as possible.
[651] A Technical Exploration of Causal Inference with Hybrid LLM Synthetic Data
Dana Kim, Yichen Xu, Tiffany Lin
Main category: cs.LG
TL;DR: LLMs can generate synthetic tabular data but often fail to preserve causal parameters like ATE. The paper proposes a hybrid framework combining model-based covariate synthesis with learned propensity and outcome models to maintain causal structure, plus synthetic pairing to address positivity violations.
Details
Motivation: Existing synthetic data generators (both GAN- and LLM-based) achieve high predictive fidelity but substantially misestimate causal effects, creating a gap in preserving causal parameters for robust causal analysis.
Method: Hybrid generation framework with model-based covariate synthesis monitored via distance-to-closest-record filtering, combined with separately learned propensity and outcome models to ensure (W, A, Y) triplets retain causal structure. Includes synthetic pairing strategy to mitigate positivity violations.
Result: The framework enables generation of synthetic data that preserves underlying causal structure and supports robust causal analysis, with evaluation protocol leveraging unlimited synthetic samples to benchmark traditional estimators (IPTW, AIPW, substitution) under complex covariate distributions.
Conclusion: This work lays the groundwork for LLM-powered data pipelines that support robust causal analysis by addressing the critical gap in preserving causal parameters in synthetic data generation.
Abstract: Large Language Models (LLMs) offer a flexible means to generate synthetic tabular data, yet existing approaches often fail to preserve key causal parameters such as the average treatment effect (ATE). In this technical exploration, we first demonstrate that state-of-the-art synthetic data generators, both GAN- and LLM-based, can achieve high predictive fidelity while substantially misestimating causal effects. To address this gap, we propose a hybrid generation framework that combines model-based covariate synthesis (monitored via distance-to-closest-record filtering) with separately learned propensity and outcome models, thereby ensuring that (W, A, Y) triplets retain their underlying causal structure. We further introduce a synthetic pairing strategy to mitigate positivity violations and a realistic evaluation protocol that leverages unlimited synthetic samples to benchmark traditional estimators (IPTW, AIPW, substitution) under complex covariate distributions. This work lays the groundwork for LLM-powered data pipelines that support robust causal analysis. Our code is available at https://github.com/Xyc-arch/llm-synthetic-for-causal-inference.git.
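The benchmarked estimators are standard; for concreteness, here is the textbook form of the IPTW and AIPW ATE estimators on a (W, A, Y) sample (a generic sketch with simulated data, not the repository's code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def iptw_aipw_ate(W: np.ndarray, A: np.ndarray, Y: np.ndarray):
    """Textbook IPTW and AIPW estimates of the average treatment effect
    on a (covariates W, binary treatment A, outcome Y) dataset."""
    g = LogisticRegression(max_iter=1000).fit(W, A).predict_proba(W)[:, 1]  # propensity scores
    g = np.clip(g, 0.01, 0.99)                                              # crude positivity guard
    q1 = LinearRegression().fit(W[A == 1], Y[A == 1]).predict(W)            # E[Y | A=1, W]
    q0 = LinearRegression().fit(W[A == 0], Y[A == 0]).predict(W)            # E[Y | A=0, W]
    iptw = np.mean(A * Y / g - (1 - A) * Y / (1 - g))
    aipw = np.mean(q1 - q0 + A * (Y - q1) / g - (1 - A) * (Y - q0) / (1 - g))
    return iptw, aipw

# Toy check on simulated data with a true ATE of 2.0.
rng = np.random.default_rng(0)
W = rng.normal(size=(5000, 3))
A = (rng.random(5000) < 1 / (1 + np.exp(-W[:, 0]))).astype(int)
Y = 2.0 * A + W @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=5000)
print(iptw_aipw_ate(W, A, Y))
```

Running the same estimators on real versus synthetic (W, A, Y) triplets is exactly the kind of comparison the paper uses to test whether a generator preserves causal structure.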
[652] Reject Only Critical Tokens: Pivot-Aware Speculative Decoding
Amir Ziashahabi, Yavuz Faruk Bakman, Duygu Nur Yaldiz, Mostafa El-Khamy, Sai Praneeth Karimireddy, Salman Avestimehr
Main category: cs.LG
TL;DR: Pivot-Aware Speculative Decoding relaxes the strict distribution matching requirement of standard Speculative Decoding, focusing instead on preserving task-specific utility to achieve higher acceptance rates and up to 2.5× speedup.
Details
Motivation: Standard Speculative Decoding's requirement for exact distribution matching results in unnecessarily low acceptance rates, limiting speedup potential. Real-world LLM use cases prioritize utility (e.g., code correctness, factual accuracy) over exact sampling distribution.Method: Proposes Pivot-Aware Speculative Decoding that rejects only tokens leading to utility drops in final output. Identifies critical ‘pivot tokens’ and trains a lightweight classifier to detect them, creating a relaxed version of standard SD.
Result: Achieves up to 2.5× speedup across various datasets while maintaining comparable utility to the target model.
Conclusion: Focusing on utility preservation rather than exact distribution matching enables significantly higher acceptance rates and better speedup performance in speculative decoding.
Abstract: Speculative Decoding (SD) ensures that the output matches the target model’s distribution exactly. However, we argue that this distribution matching requirement is too stringent and results in unnecessarily low acceptance rates, limiting potential speedups. Instead, we advocate a reformulation of the decoding objective: the proposed decoding strategy should match the expected utility, i.e., the task-specific performance, of the target model. This perspective also aligns better with real-world use cases of LLMs, where utility (e.g., code correctness, factual accuracy) is often more important than sampling distribution. Based on this reformulation, we propose a novel decoding strategy: Pivot-Aware Speculative Decoding, which rejects only those tokens that would lead to a utility drop in the final output. We refer to these critical tokens as pivot tokens. We propose a method for labeling tokens as pivotal or non-pivotal and train a lightweight classifier to detect them. This method can be viewed as a relaxed version of standard SD, which offers much higher acceptance while preserving utility. We evaluate our method across various datasets, demonstrating that we can achieve up to $2.5\times$ speedup with comparable utility. Source code is available at https://github.com/amir-zsh/PAD.
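The relaxed acceptance rule can be sketched as follows. This is a conceptual illustration, not the authors' implementation: the pivot scores come from a hypothetical lightweight classifier applied per drafted position, and standard SD's probabilistic acceptance test is simplified here to an argmax agreement check.

```python
import torch

def pivot_aware_accept(draft_tokens, draft_logits, target_logits, pivot_score):
    """Accept drafted tokens left to right; reject only at the first position where the
    draft and target disagree AND the token is flagged as a pivot. Returns the accepted prefix length."""
    accepted = 0
    for t in range(len(draft_tokens)):
        if draft_logits[t].argmax() == target_logits[t].argmax():
            accepted += 1                      # both models agree: always safe to keep
            continue
        if pivot_score[t] > 0.5:               # disagreement on a critical (pivot) token
            break                              # reject here; the target model resamples
        accepted += 1                          # non-pivot disagreement: keep the draft anyway
    return accepted

# Toy usage with random logits and a dummy classifier output.
torch.manual_seed(0)
draft_tokens = torch.randint(0, 100, (8,))
draft_logits, target_logits = torch.randn(8, 100), torch.randn(8, 100)
pivot_score = torch.rand(8)                    # stand-in for the lightweight pivot classifier
print(pivot_aware_accept(draft_tokens, draft_logits, target_logits, pivot_score))
```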
[653] Toward Unifying Group Fairness Evaluation from a Sparsity Perspective
Zhecheng Sheng, Jiawei Zhang, Enmao Diao
Main category: cs.LG
TL;DR: This paper proposes a unified sparsity-based framework for evaluating algorithmic fairness that connects various sparsity measures with fairness criteria and demonstrates broad applicability across machine learning tasks.
Details
Motivation: Algorithmic fairness remains challenging as existing fairness criteria often lack generalizability across different machine learning problems, creating a need for more unified evaluation approaches.Method: The paper examines connections among sparsity measures in promoting fairness and proposes a unified sparsity-based framework that aligns with existing fairness criteria.
Result: Extensive experiments on various datasets and bias mitigation methods demonstrate the effectiveness of the proposed framework as an evaluation metric.
Conclusion: The work provides a novel perspective on algorithmic fairness by framing it through sparsity and social equity, offering potential for broader impact on fairness research and applications.
Abstract: Ensuring algorithmic fairness remains a significant challenge in machine learning, particularly as models are increasingly applied across diverse domains. While numerous fairness criteria exist, they often lack generalizability across different machine learning problems. This paper examines the connections and differences among various sparsity measures in promoting fairness and proposes a unified sparsity-based framework for evaluating algorithmic fairness. The framework aligns with existing fairness criteria and demonstrates broad applicability to a wide range of machine learning tasks. We demonstrate the effectiveness of the proposed framework as an evaluation metric through extensive experiments on a variety of datasets and bias mitigation methods. This work provides a novel perspective on algorithmic fairness by framing it through the lens of sparsity and social equity, offering potential for broader impact on fairness research and applications.
[654] Balancing Interpretability and Performance in Motor Imagery EEG Classification: A Comparative Study of ANFIS-FBCSP-PSO and EEGNet
Farjana Aktar, Mohd Ruhul Ameen, Akif Islam, Md Ekramul Hamid
Main category: cs.LG
TL;DR: Comparison of transparent fuzzy reasoning (ANFIS-FBCSP-PSO) vs deep learning (EEGNet) for motor imagery EEG classification, showing fuzzy model better for within-subject accuracy while deep learning better for cross-subject generalization.
Details
Motivation: Address the challenge of achieving both accurate and interpretable classification of motor imagery EEG in brain-computer interface research.Method: Compared ANFIS-FBCSP-PSO (filter bank CSP + fuzzy rules + PSO optimization) with EEGNet (deep learning from raw EEG) using BCI Competition IV-2a dataset in within-subject and cross-subject (LOSO) experiments.
Result: Fuzzy model performed better in within-subject tests (68.58% accuracy, kappa=58.04%), while deep learning showed stronger generalization in cross-subject tests (68.20% accuracy, kappa=57.33%).
Conclusion: Provides guidance for selecting MI-BCI systems based on design goals: interpretability (fuzzy) or cross-user robustness (deep learning). Future work should explore transformer-based and hybrid neuro-symbolic frameworks for transparent EEG decoding.
Abstract: Achieving both accurate and interpretable classification of motor imagery EEG remains a key challenge in brain-computer interface (BCI) research. This paper compares a transparent fuzzy reasoning approach (ANFIS-FBCSP-PSO) with a deep learning benchmark (EEGNet) using the BCI Competition IV-2a dataset. The ANFIS pipeline combines filter bank common spatial pattern feature extraction with fuzzy IF-THEN rules optimized via particle swarm optimization, while EEGNet learns hierarchical spatial-temporal representations directly from raw EEG data. In within-subject experiments, the fuzzy neural model performed better (68.58% +/- 13.76% accuracy, kappa = 58.04% +/- 18.43%), while in cross-subject (LOSO) tests, the deep model exhibited stronger generalization (68.20% +/- 12.13% accuracy, kappa = 57.33% +/- 16.22%). The study provides practical guidance for selecting MI-BCI systems according to design goals: interpretability or robustness across users. Future investigations into transformer-based and hybrid neuro-symbolic frameworks are expected to advance transparent EEG decoding.
[655] PolyRecommender: A Multimodal Recommendation System for Polymer Discovery
Xin Wang, Yunhao Xiao, Rui Qiao
Main category: cs.LG
TL;DR: PolyRecommender is a multimodal framework that combines chemical language representations from PolyBERT with molecular graph representations to enable efficient polymer discovery through retrieval and ranking based on multiple target properties.
Details
Motivation: To develop an AI-guided system for discovering next-generation polymers by leveraging complementary knowledge from different data modalities (chemical language and molecular graphs) for more efficient and robust polymer recommendation.Method: Integrates PolyBERT’s chemical language representations with graph encoder’s molecular representations, uses language-based similarity for initial candidate retrieval, then employs fused multimodal embeddings for ranking polymers according to multiple target properties.
Result: The framework enables efficient retrieval and robust ranking across related polymer properties by leveraging complementary knowledge from both modalities.
Conclusion: Establishes a generalizable multimodal paradigm that advances AI-guided design for polymer discovery, providing a foundation for next-generation polymer development.
Abstract: We introduce PolyRecommender, a multimodal discovery framework that integrates chemical language representations from PolyBERT with molecular graph-based representations from a graph encoder. The system first retrieves candidate polymers using language-based similarity and then ranks them using fused multimodal embeddings according to multiple target properties. By leveraging the complementary knowledge encoded in both modalities, PolyRecommender enables efficient retrieval and robust ranking across related polymer properties. Our work establishes a generalizable multimodal paradigm, advancing AI-guided design for the discovery of next-generation polymers.
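A minimal sketch of the retrieve-then-rank flow with random toy embeddings; fusing by concatenation is an assumption made here for illustration, as are all array shapes and names.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy embedding tables for 10k candidate polymers (all values are placeholders).
rng = np.random.default_rng(0)
lang_emb = l2_normalize(rng.normal(size=(10_000, 256)))   # PolyBERT-style language embeddings
graph_emb = l2_normalize(rng.normal(size=(10_000, 128)))  # graph-encoder embeddings
query_lang = l2_normalize(rng.normal(size=256))
query_graph = l2_normalize(rng.normal(size=128))

# Stage 1: retrieve top-k candidates by language-embedding cosine similarity.
k = 100
retrieved = np.argsort(lang_emb @ query_lang)[::-1][:k]

# Stage 2: rank the retrieved candidates with fused (here: concatenated) multimodal embeddings.
fused = np.concatenate([lang_emb[retrieved], graph_emb[retrieved]], axis=1)
fused_query = np.concatenate([query_lang, query_graph])
ranking = retrieved[np.argsort(fused @ fused_query)[::-1]]
print(ranking[:5])
```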
[656] UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings
Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su
Main category: cs.LG
TL;DR: UME-R1 pioneers generative multimodal embeddings using a two-stage training approach with supervised fine-tuning and reinforcement learning, achieving superior performance over discriminative embeddings by leveraging MLLMs’ reasoning capabilities.
Details
Motivation: Existing multimodal embedding models are inherently discriminative, limiting their ability to benefit from reasoning-driven generation paradigms. The authors aim to unify embedding tasks within a generative paradigm.Method: Proposes UME-R1 framework with two-stage training: 1) Cold-start supervised fine-tuning to equip reasoning capabilities for generating both discriminative and generative embeddings; 2) Reinforcement learning to enhance reasoning and optimize generative embedding quality.
Result: Significantly outperforms conventional discriminative embedding models on MMEB-V2 benchmark across 78 tasks spanning video, image, and visual documents. Shows that generative embeddings unlock substantial performance gains and are complementary to discriminative embeddings.
Conclusion: Generative embeddings offer a foundation for more interpretable, reasoning-driven multimodal embeddings, with inference-time scalability potential through repeated sampling. The work establishes generative embeddings as a promising direction for multimodal AI.
Abstract: The remarkable success of multimodal large language models (MLLMs) has driven advances in multimodal embeddings, yet existing models remain inherently discriminative, limiting their ability to benefit from the reasoning-driven generation paradigm. In this work, we pioneer the exploration of generative embeddings, unifying embedding tasks within a generative paradigm. We propose UME-R1, a universal multimodal embedding framework consisting of a two-stage training strategy: a cold-start supervised fine-tuning stage equips the model with reasoning capabilities and enables it to generate both discriminative and generative embeddings; a subsequent reinforcement learning stage enhances reasoning and further optimizes generative embedding quality. This pioneering work reveals four key insights: 1) generative embeddings unlock substantial performance gains over conventional discriminative embeddings by leveraging the powerful generative reasoning capabilities of MLLMs; 2) discriminative and generative embeddings are complementary, with combined oracle performance far exceeding that of either alone; 3) RL can effectively enhance generative embeddings, establishing a scalable optimization paradigm; 4) repeated sampling at inference boosts downstream task coverage (pass@k), highlighting the inference-time scalability potential of generative embeddings. Evaluated on the MMEB-V2 benchmark across 78 tasks spanning video, image, and visual documents, UME-R1 significantly outperforms conventional discriminative embedding models and offers a foundation for more interpretable, reasoning-driven generative multimodal embeddings. Our code, models, and datasets will be publicly available at https://github.com/XMUDeepLIT/UME-R1.
[657] Enhancing Adversarial Transferability by Balancing Exploration and Exploitation with Gradient-Guided Sampling
Zenghao Niu, Weicheng Xie, Siyang Song, Zitong Yu, Feng Liu, Linlin Shen
Main category: cs.LG
TL;DR: Proposes Gradient-Guided Sampling (GGS) to resolve the exploitation-exploration dilemma in adversarial attack transferability, achieving both strong attack potency and cross-model generalization.
Details
Motivation: Address the fundamental dilemma between Exploitation (maximizing attack potency) and Exploration (enhancing cross-model generalization) in adversarial attack transferability across different model architectures.Method: Based on MI-FGSM, introduces inner-iteration random sampling guided by gradient ascent direction to improve sampling efficiency and stability, with sampling magnitude determined by random distribution.
Result: Comprehensive experiments across multiple DNN architectures and multimodal large language models demonstrate superiority over state-of-the-art transfer attacks.
Conclusion: GGS effectively harmonizes exploitation and exploration objectives, enabling adversarial examples to reside in balanced regions with both flatness for generalization and higher local maxima for strong attack potency.
Abstract: Adversarial attacks present a critical challenge to deep neural networks’ robustness, particularly in transfer scenarios across different model architectures. However, the transferability of adversarial attacks faces a fundamental dilemma between Exploitation (maximizing attack potency) and Exploration (enhancing cross-model generalization). Traditional momentum-based methods over-prioritize Exploitation, i.e., higher loss maxima for attack potency but weakened generalization (narrow loss surface). Conversely, recent methods with inner-iteration sampling over-prioritize Exploration, i.e., flatter loss surfaces for cross-model generalization but weakened attack potency (suboptimal local maxima). To resolve this dilemma, we propose a simple yet effective Gradient-Guided Sampling (GGS), which harmonizes both objectives through guiding sampling along the gradient ascent direction to improve both sampling efficiency and stability. Specifically, based on MI-FGSM, GGS introduces inner-iteration random sampling and guides the sampling direction using the gradient from the previous inner-iteration (the sampling’s magnitude is determined by a random distribution). This mechanism encourages adversarial examples to reside in balanced regions with both flatness for cross-model generalization and higher local maxima for strong attack potency. Comprehensive experiments across multiple DNN architectures and multimodal large language models (MLLMs) demonstrate the superiority of our method over state-of-the-art transfer attacks. Code is made available at https://github.com/anuin-cat/GGS.
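A rough sketch of the idea layered on MI-FGSM, under stated assumptions: the inner-iteration offsets are drawn with random magnitude and biased along the sign of the previous inner iteration's gradient, and all hyperparameters and the toy model are illustrative rather than the authors' settings.

```python
import torch
import torch.nn.functional as F

def ggs_attack(model, x, y, eps=8/255, alpha=2/255, steps=10, mu=1.0, n_samples=5, radius=4/255):
    """MI-FGSM with inner-iteration sampling; sampling offsets are guided by the
    gradient direction from the previous inner iteration (random magnitude, guided sign)."""
    x_adv = x.clone().detach()
    g, g_prev = torch.zeros_like(x), torch.zeros_like(x)
    for _ in range(steps):
        grad_avg = torch.zeros_like(x)
        for _ in range(n_samples):
            offset = radius * torch.rand_like(x) * torch.sign(g_prev)   # guided sampling
            x_s = (x_adv + offset).detach().requires_grad_(True)
            loss = F.cross_entropy(model(x_s), y)
            grad = torch.autograd.grad(loss, x_s)[0]
            grad_avg += grad / n_samples
            g_prev = grad.detach()
        # Momentum accumulation with L1-style normalization, as in MI-FGSM.
        g = mu * g + grad_avg / grad_avg.abs().mean(dim=(1, 2, 3), keepdim=True)
        x_adv = x_adv + alpha * g.sign()
        x_adv = torch.clamp(x + torch.clamp(x_adv - x, -eps, eps), 0, 1).detach()
    return x_adv

# Toy usage with a small randomly initialized CNN.
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.Flatten(),
                            torch.nn.Linear(8 * 32 * 32, 10))
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
print((ggs_attack(model, x, y) - x).abs().max())
```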
[658] Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse
Shaojie Wang, Jinghui Wang, Yinghan Cui, Xuxing Chen, Chao Wang, Liang Huang, Xiaojiang Zhang, Junyi Peng, Li Wan, Haotian Zhang, Bin Chen
Main category: cs.LG
TL;DR: Tree Training is a new paradigm that optimizes agentic LLM training by reusing shared prefix computations across branching trajectories, reducing training time by up to 3.9x.
Details
Motivation: Current training pipelines inefficiently recompute shared prefixes across branching trajectories in agentic LLM scenarios, leading to computational redundancy.Method: Proposes Tree Training with Tree Packing (reuses shared computations) and Gradient Restoration (ensures correct gradient propagation across reused prefixes).
Result: Experiments show up to 3.9x reduction in total training time across multiple open-source models.
Conclusion: Tree Training enables more efficient agentic LLM SFT and RL training by eliminating redundant computations in tree-structured trajectories.
Abstract: In agentic LLM scenarios, an agent’s interaction process during a single rollout often exhibits branching behaviors. Due to memory retrieval and concurrent tool executions at certain decision points, the token trajectory of one task evolves into a tree-like structure rather than a linear sequence. However, current training pipelines decompose such tree-structured trajectories into separate linear segments, treating each branch as an independent sequence. As a result, shared prefixes across these branches are repeatedly recomputed during both forward and backward passes. To address this inefficiency, we propose Tree Training, a paradigm that computes each shared prefix only once and reuses its intermediate results across related branches during both forward and backward passes, substantially improving computation efficiency in large-scale agentic training. This is achieved via (i) Tree Packing, which efficiently reuses shared computations across trajectories, and (ii) Gradient Restoration, which ensures correct gradient propagation across reused prefixes. Experiments on multiple open-source models demonstrate up to 3.9x reduction in total training time, enabling more efficient agentic LLM SFT and RL training.
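The scale of the redundancy that prefix reuse removes can be illustrated with a toy prefix-counting sketch; this is not the authors' Tree Packing implementation, just an accounting of unique versus recomputed prefix positions.

```python
def shared_prefix_savings(trajectories):
    """Count token positions computed under naive linearization versus prefix sharing.
    Each trajectory is a list of token ids; a shared prefix is computed only once."""
    naive = sum(len(t) for t in trajectories)
    seen, shared = set(), 0
    for traj in trajectories:
        for i in range(len(traj)):
            prefix = tuple(traj[: i + 1])
            if prefix not in seen:
                seen.add(prefix)
                shared += 1
    return naive, shared

# Three branches of one rollout sharing the prefix [1, 2, 3].
branches = [[1, 2, 3, 4, 5], [1, 2, 3, 6], [1, 2, 3, 7, 8, 9]]
naive, shared = shared_prefix_savings(branches)
print(naive, shared, f"savings: {1 - shared / naive:.0%}")   # 15 vs 9 positions, 40% saved
```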
[659] Structure-Preserving Physics-Informed Neural Network for the Korteweg–de Vries (KdV) Equation
Victory Obieke, Emmanuel Oguadimma
Main category: cs.LG
TL;DR: This paper introduces a structure-preserving PINN framework for the KdV equation that embeds conservation laws into the loss function and uses sinusoidal activation functions to maintain physical invariants and improve long-term stability.
Details
Motivation: Conventional PINNs often fail to preserve physical invariants during long-term integration of nonlinear PDEs like the KdV equation, leading to physically inconsistent solutions.Method: The proposed method embeds conservation of mass and Hamiltonian energy directly into the loss function and employs sinusoidal activation functions instead of standard tanh activations to better capture oscillatory and dispersive wave behavior.
Result: The model successfully reproduces key KdV behaviors including soliton propagation, elastic collisions, and dispersive breakup while maintaining conserved invariants. It shows accelerated convergence, improved long-term stability, and mitigates drift without multi-stage pretraining.
Conclusion: Combining invariant-constrained optimization with sinusoidal representations yields robust, energy-consistent PINNs for Hamiltonian PDEs like the KdV equation.
Abstract: Physics-Informed Neural Networks (PINNs) offer a flexible framework for solving nonlinear partial differential equations (PDEs), yet conventional implementations often fail to preserve key physical invariants during long-term integration. This paper introduces a structure-preserving PINN framework for the nonlinear Korteweg–de Vries (KdV) equation, a prototypical model for nonlinear and dispersive wave propagation. The proposed method embeds the conservation of mass and Hamiltonian energy directly into the loss function, ensuring physically consistent and energy-stable evolution throughout training and prediction. Unlike standard tanh-based PINNs [raissi2019pinn, wang2022modifiedpinn], our approach employs sinusoidal activation functions that enhance spectral expressiveness and accurately capture the oscillatory and dispersive nature of KdV solitons. Through representative case studies, including single-soliton propagation (shape-preserving translation), two-soliton interaction (elastic collision with phase shift), and cosine-pulse initialization (nonlinear dispersive breakup), the model successfully reproduces hallmark behaviors of KdV dynamics while maintaining conserved invariants. Ablation studies demonstrate that combining invariant-constrained optimization with sinusoidal feature mappings accelerates convergence, improves long-term stability, and mitigates drift without multi-stage pretraining. These results highlight that computationally efficient, invariant-aware regularization coupled with sinusoidal representations yields robust, energy-consistent PINNs for Hamiltonian partial differential equations such as the KdV equation.
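A minimal sketch of the two loss ingredients, assuming the standard KdV form u_t + 6 u u_x + u_xxx = 0; only the mass-conservation penalty is shown (a Hamiltonian penalty would be added the same way), and the network width, collocation ranges, and sech^2 initial condition are illustrative choices rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class SineMLP(nn.Module):
    """Small MLP with sinusoidal activations representing u(x, t)."""
    def __init__(self, width=64, depth=4):
        super().__init__()
        dims = [2] + [width] * depth
        self.hidden = nn.ModuleList(nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:]))
        self.out = nn.Linear(width, 1)

    def forward(self, xt):
        h = xt
        for lin in self.hidden:
            h = torch.sin(lin(h))
        return self.out(h)

def pde_residual_loss(model, x, t):
    """Mean squared residual of u_t + 6 u u_x + u_xxx = 0 at collocation points."""
    xt = torch.stack([x, t], dim=1).requires_grad_(True)
    u = model(xt).squeeze(-1)
    du = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = du[:, 0], du[:, 1]
    u_xx = torch.autograd.grad(u_x.sum(), xt, create_graph=True)[0][:, 0]
    u_xxx = torch.autograd.grad(u_xx.sum(), xt, create_graph=True)[0][:, 0]
    return (u_t + 6.0 * u * u_x + u_xxx).pow(2).mean()

def mass_penalty(model, t_value, x_grid, mass0):
    """Keep the discrete mass integral of u(., t_value) near its initial value;
    an energy (Hamiltonian) penalty would be constructed analogously."""
    xt = torch.stack([x_grid, torch.full_like(x_grid, t_value)], dim=1)
    u = model(xt).squeeze(-1)
    return (torch.trapz(u, x_grid) - mass0).pow(2)

# Toy usage on random collocation points over x in [-10, 10], t in [0, 1].
torch.manual_seed(0)
model = SineMLP()
x, t = torch.rand(256) * 20.0 - 10.0, torch.rand(256)
x_grid = torch.linspace(-10.0, 10.0, 200)
mass0 = torch.trapz(2.0 / torch.cosh(x_grid) ** 2, x_grid)   # mass of a sech^2 initial pulse
loss = pde_residual_loss(model, x, t) + 10.0 * mass_penalty(model, 0.5, x_grid, mass0)
print(loss.item())
```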
[660] Bootstrap Off-policy with World Model
Guojian Zhan, Likun Wang, Xiangteng Zhang, Jiaxin Gao, Masayoshi Tomizuka, Shengbo Eben Li
Main category: cs.LG
TL;DR: BOOM is a reinforcement learning framework that integrates planning and off-policy learning through a bootstrap loop, using a world model to improve sample efficiency and performance while addressing data-policy divergence.
Details
Motivation: Online planning in RL improves efficiency but creates divergence between collected data and actual policy behaviors, which degrades model learning and policy improvement.Method: Uses a bootstrap loop where policy initializes planner and planner refines actions to bootstrap policy through behavior alignment. Employs a jointly learned world model for trajectory simulation and value targets. Core components include likelihood-free alignment loss and soft value-weighted mechanism.
Result: Achieves state-of-the-art results in training stability and final performance on DeepMind Control Suite and Humanoid-Bench benchmarks.
Conclusion: BOOM effectively integrates planning and off-policy learning to address data-policy divergence, demonstrating superior performance in high-dimensional control tasks.
Abstract: Online planning has proven effective in reinforcement learning (RL) for improving sample efficiency and final performance. However, using planning for environment interaction inevitably introduces a divergence between the collected data and the policy’s actual behaviors, degrading both model learning and policy improvement. To address this, we propose BOOM (Bootstrap Off-policy with WOrld Model), a framework that tightly integrates planning and off-policy learning through a bootstrap loop: the policy initializes the planner, and the planner refines actions to bootstrap the policy through behavior alignment. This loop is supported by a jointly learned world model, which enables the planner to simulate future trajectories and provides value targets to facilitate policy improvement. The core of BOOM is a likelihood-free alignment loss that bootstraps the policy using the planner’s non-parametric action distribution, combined with a soft value-weighted mechanism that prioritizes high-return behaviors and mitigates variability in the planner’s action quality within the replay buffer. Experiments on the high-dimensional DeepMind Control Suite and Humanoid-Bench show that BOOM achieves state-of-the-art results in both training stability and final performance. The code is accessible at https://github.com/molumitu/BOOM_MBRL.
[661] Region-Aware Reconstruction Strategy for Pre-training fMRI Foundation Model
Ruthwik Reddy Doodipala, Pankaj Pandey, Carolina Torres Rojas, Manob Jyoti Saikia, Ranganatha Sitaram
Main category: cs.LG
TL;DR: Region-aware reconstruction using anatomical ROI masking improves fMRI foundation models, achieving 4.23% better ADHD classification accuracy than random masking.
Details
Motivation: To develop more effective foundation models for neuroimaging by moving beyond random region masking and leveraging anatomical knowledge for self-supervised pretraining.Method: ROI-guided masking strategy using AAL3 atlas applied to 4D fMRI volumes, selectively masking coherent brain regions during self-supervised pretraining on ADHD-200 dataset.
Result: 4.23% improvement in ADHD classification accuracy; limbic region and cerebellum identified as most important for reconstruction fidelity and model representation.
Conclusion: Anatomical region masking enhances both interpretability and discriminative power of foundation models, with future work planned for additional datasets and specialized loss functions.
Abstract: The emergence of foundation models in neuroimaging is driven by the increasing availability of large-scale and heterogeneous brain imaging datasets. Recent advances in self-supervised learning, particularly reconstruction-based objectives, have demonstrated strong potential for pretraining models that generalize effectively across diverse downstream functional MRI (fMRI) tasks. In this study, we explore region-aware reconstruction strategies for a foundation model in resting-state fMRI, moving beyond approaches that rely on random region masking. Specifically, we introduce an ROI-guided masking strategy using the Automated Anatomical Labelling Atlas (AAL3), applied directly to full 4D fMRI volumes to selectively mask semantically coherent brain regions during self-supervised pretraining. Using the ADHD-200 dataset comprising 973 subjects with resting-state fMRI scans, we show that our method achieves a 4.23% improvement in classification accuracy for distinguishing healthy controls from individuals diagnosed with ADHD, compared to conventional random masking. Region-level attribution analysis reveals that brain volumes within the limbic region and cerebellum contribute most significantly to reconstruction fidelity and model representation. Our results demonstrate that masking anatomical regions during model pretraining not only enhances interpretability but also yields more robust and discriminative representations. In future work, we plan to extend this approach by evaluating it on additional neuroimaging datasets, and developing new loss functions explicitly derived from region-aware reconstruction objectives. These directions aim to further improve the robustness and interpretability of foundation models for functional neuroimaging.
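A minimal sketch of ROI-guided masking on a toy volume; a random 20-label atlas stands in for AAL3, and the mask fraction and zero-fill choice are assumptions made for illustration.

```python
import numpy as np

def roi_guided_mask(fmri_4d, atlas_3d, mask_fraction=0.3, seed=0):
    """Mask whole anatomical ROIs (atlas label volumes) instead of random voxels.
    fmri_4d: (X, Y, Z, T) volume; atlas_3d: (X, Y, Z) integer ROI labels (0 = background)."""
    rng = np.random.default_rng(seed)
    labels = np.unique(atlas_3d)
    labels = labels[labels != 0]
    n_mask = max(1, int(mask_fraction * len(labels)))
    masked_labels = rng.choice(labels, size=n_mask, replace=False)
    mask = np.isin(atlas_3d, masked_labels)            # (X, Y, Z) boolean ROI mask
    masked = fmri_4d.copy()
    masked[mask, :] = 0.0                               # zero out selected ROIs at all timepoints
    return masked, mask

# Toy usage with a random volume and a fake 20-region atlas standing in for AAL3.
rng = np.random.default_rng(1)
fmri = rng.normal(size=(16, 16, 16, 10)).astype(np.float32)
atlas = rng.integers(0, 21, size=(16, 16, 16))
masked, mask = roi_guided_mask(fmri, atlas)
print(mask.mean(), masked[mask].std())
```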
[662] Deep Learning Approach to Anomaly Detection in Enterprise ETL Processes with Autoencoders
Xin Chen, Saili Uday Gadgil, Kangning Gao, Yi Hu, Cong Nie
Main category: cs.LG
TL;DR: A deep autoencoder-based anomaly detection method for enterprise ETL data streams that identifies various anomalies like delays, missing values, duplicate loading, and sudden changes through reconstruction error analysis with regularization constraints.
Details
Motivation: To address frequent anomalies in enterprise-level ETL data streams that can disrupt data processing workflows and compromise data quality for business intelligence.Method: Uses encoder-decoder structure to compress high-dimensional inputs into latent representations and reconstruct them, with reconstruction error as anomaly measure. Introduces regularization constraints in latent space for feature sparsity and distribution learning to enhance robustness.
Result: Achieves superior performance in AUC, ACC, Precision, and Recall across different hyperparameter settings, environmental changes, and data characteristics. Effectively captures latent distribution patterns and accurately identifies diverse anomalies.
Conclusion: The deep autoencoder-based detection mechanism provides reliable support for enterprise data processing and intelligent analysis by effectively identifying anomalies in ETL data streams.
Abstract: An anomaly detection method based on deep autoencoders is proposed to address anomalies that often occur in enterprise-level ETL data streams. The study first analyzes multiple types of anomalies in ETL processes, including delays, missing values, duplicate loading, and sudden abnormal changes, and applies data standardization and feature modeling to ensure stable and usable inputs. In the method design, the encoder-decoder structure compresses high-dimensional inputs into latent representations and reconstructs them, while reconstruction error is used to measure anomaly levels. Regularization constraints are introduced in the latent space to enhance feature sparsity and distribution learning, thereby improving robustness in complex data streams. Systematic analyses under different hyperparameter settings, environmental changes, and data characteristics show that the proposed method achieves superior performance in AUC, ACC, Precision, and Recall. The results demonstrate that the deep autoencoder-based detection mechanism can effectively capture latent distribution patterns in enterprise-level ETL data streams and accurately identify diverse anomalies, providing reliable support for enterprise data processing and intelligent analysis.
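A minimal sketch of the detection mechanism: a small autoencoder trained on normal (standardized) ETL feature vectors with an L1 sparsity term on the latent code, then new records scored by reconstruction error against a quantile of the training errors. The architecture, feature dimensionality, and thresholds are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ETLAutoencoder(nn.Module):
    def __init__(self, in_dim, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Train on stand-in "normal" ETL feature vectors; L1 on the latent code encourages sparsity.
torch.manual_seed(0)
x_train = torch.randn(1024, 16)
model = ETLAutoencoder(in_dim=16)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    recon, z = model(x_train)
    loss = nn.functional.mse_loss(recon, x_train) + 1e-3 * z.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Score new records: reconstruction error above a training-set quantile flags an anomaly.
x_new = torch.cat([torch.randn(10, 16), torch.randn(10, 16) * 5 + 3])   # last 10 rows are anomalous
with torch.no_grad():
    threshold = ((model(x_train)[0] - x_train) ** 2).mean(dim=1).quantile(0.99)
    err = ((model(x_new)[0] - x_new) ** 2).mean(dim=1)
print((err > threshold).int().tolist())
```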
[663] Why Federated Optimization Fails to Achieve Perfect Fitting? A Theoretical Perspective on Client-Side Optima
Zhongxiang Lei, Qi Yang, Ping Qiu, Gang Zhang, Yuanchi Ma, Jinyan Liu
Main category: cs.LG
TL;DR: This paper provides a theoretical explanation for performance degradation in federated learning under data heterogeneity, showing that distinct local optima from non-iid data raise the global objective’s lower bound and cause oscillation instead of convergence.
Details
Motivation: Existing federated learning algorithms guarantee convergence but performance degrades under data heterogeneity, and the reasons behind this degradation remain unclear.Method: The authors introduce the assumption that heterogeneous client data lead to distinct local optima, and theoretically analyze the consequences: increased lower bound of global objective and oscillation behavior in final training stages.
Result: The analysis shows that heterogeneous data makes perfect fitting impossible and causes global model oscillation, providing principled explanation for performance degradation in non-iid settings, validated through experiments.
Conclusion: The paper offers a theoretical framework explaining why federated learning performance degrades under data heterogeneity, with implications for algorithm design and understanding of federated optimization limits.
Abstract: Federated optimization is a constrained form of distributed optimization that enables training a global model without directly sharing client data. Although existing algorithms can guarantee convergence in theory and often achieve stable training in practice, the reasons behind performance degradation under data heterogeneity remain unclear. To address this gap, the main contribution of this paper is to provide a theoretical perspective that explains why such degradation occurs. We introduce the assumption that heterogeneous client data lead to distinct local optima, and show that this assumption implies two key consequences: 1) the distance among clients’ local optima raises the lower bound of the global objective, making perfect fitting of all client data impossible; and 2) in the final training stage, the global model oscillates within a region instead of converging to a single optimum, limiting its ability to fully fit the data. These results provide a principled explanation for performance degradation in non-iid settings, which we further validate through experiments across multiple tasks and neural network architectures. The framework used in this paper is open-sourced at: https://github.com/NPCLEI/fedtorch.
[664] Variational Autoencoder for Calibration: A New Approach
Travis Barrett, Amit Kumar Mishra, Joyce Mwangama
Main category: cs.LG
TL;DR: A Variational Autoencoder (VAE) is implemented for sensor calibration by training the latent space as calibration output, demonstrated on a multi-sensor gas dataset.
Details
Motivation: To develop a new approach for sensor calibration using VAEs that can perform calibration and autoencoding simultaneously.Method: Using VAE with latent space trained as calibration output, tested on existing multi-sensor gas dataset as proof-of-concept.
Result: The calibration VAE successfully performed as both calibration model and autoencoder, producing statistically similar outputs to truth data from both calibration and reconstruction outputs.
Conclusion: The VAE approach shows promise for sensor calibration and future testing and expansion of this work is planned.
Abstract: In this paper we present a new implementation of a Variational Autoencoder (VAE) for the calibration of sensors. We propose that the VAE can be used to calibrate sensor data by training the latent space as a calibration output. We discuss this new approach and show a proof-of-concept using an existing multi-sensor gas dataset. We show the performance of the proposed calibration VAE and find that it can serve as a calibration model while simultaneously functioning as an autoencoder. Additionally, both the calibration outputs and the reconstruction outputs of these models are statistically similar to their respective ground-truth data. We then discuss methods for future testing and the planned expansion of this work.
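A minimal sketch of the idea, under assumptions: the latent mean is supervised against a reference measurement so that the latent space doubles as the calibration output, while the decoder continues to reconstruct the raw sensor input; the architecture and loss weights are illustrative, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CalibrationVAE(nn.Module):
    """VAE whose latent mean doubles as the calibration estimate."""
    def __init__(self, in_dim=8, latent_dim=1):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU())
        self.mu, self.logvar = nn.Linear(32, latent_dim), nn.Linear(32, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def loss_fn(x, y_ref, recon, mu, logvar, beta=1e-3, gamma=1.0):
    rec = F.mse_loss(recon, x)                                        # reconstruction of raw sensors
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())     # standard VAE regularizer
    calib = F.mse_loss(mu, y_ref)                                     # latent mean vs reference value
    return rec + beta * kl + gamma * calib

# Toy usage: raw multi-sensor readings x and a co-located reference measurement y_ref.
torch.manual_seed(0)
x = torch.randn(256, 8)
y_ref = x[:, :1] * 0.7 + 0.1 * torch.randn(256, 1)
model = CalibrationVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    recon, mu, logvar = model(x)
    loss = loss_fn(x, y_ref, recon, mu, logvar)
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```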
[665] Reasoning Planning for Language Models
Bao Nguyen, Hieu Trung Nguyen, Ruifeng She, Xiaojin Fu, Viet Anh Nguyen
Main category: cs.LG
TL;DR: EPIC framework learns to select optimal reasoning methods for language models using ensemble planning with contrastive learning, improving accuracy while reducing computational costs.
Details
Motivation: Existing approaches assume more candidate answers yield higher accuracy, but this assumption needs rigorous theoretical analysis to optimize reasoning method selection.Method: EPIC uses ensemble planning with contrastive learning to create a shared representation space capturing model reasoning abilities and query-method compatibility, with probability bounds as regularizers in utility-driven optimization.
Result: Experiments on mathematical reasoning tasks show EPIC consistently selects optimal reasoning methods, improving accuracy while reducing computational overhead.
Conclusion: EPIC provides an effective framework for selecting reasoning methods that balances accuracy and computational efficiency, with theoretical foundations supporting its approach.
Abstract: Selecting an appropriate reasoning method for a given query remains a key challenge in language model generation. Existing approaches typically generate multiple candidate responses and use an aggregation strategy to select the output answer, often assuming that more candidate answers yield higher accuracy. We revisit this assumption through a rigorous theoretical analysis, deriving accuracy bounds for standard aggregation methods under fixed generation distributions and candidate sizes. Building on these insights, we introduce EPIC, an Ensemble Planning with Contrastive learning framework to learn a shared representation space that captures both model reasoning abilities and query-method compatibility. EPIC incorporates our probability bounds as a regularizer in a utility-driven optimization that balances accuracy and computational cost. Experiments on diverse mathematical reasoning tasks show that EPIC consistently selects optimal reasoning methods, improving accuracy while reducing computational overhead. Our code can be found at https://github.com/nguyenngocbaocmt02/EPIC.
[666] Air Pollution Forecasting in Bucharest
Dragoş-Andrei Şerban, Răzvan-Alexandru Smădu, Dumitru-Clementin Cercel
Main category: cs.LG
TL;DR: This paper develops and evaluates machine learning models for forecasting PM2.5 air pollution levels across different time horizons to provide early warnings and prevent health issues.
Details
Motivation: Air pollution, particularly PM2.5, poses serious health risks including respiratory diseases, cardiovascular disorders, and cancer. Forecasting PM2.5 levels is crucial for early warnings and disease prevention.Method: The study designs, fine-tunes, tests, and evaluates various machine learning models including linear regression, ensemble methods, recurrent neural networks, transformers, and large language models for PM2.5 forecasting.
Result: The paper compares performance of multiple machine learning models on PM2.5 forecasting tasks across different time horizons, though specific performance metrics are not detailed in the abstract.
Conclusion: Machine learning models show promise for PM2.5 forecasting, with comprehensive evaluation of various approaches providing insights into their effectiveness for air pollution prediction.
Abstract: Air pollution, especially the particulate matter 2.5 (PM2.5), has become a growing concern in recent years, primarily in urban areas. Being exposed to air pollution is linked to developing numerous health problems, like the aggravation of respiratory diseases, cardiovascular disorders, lung function impairment, and even cancer or early death. Forecasting future levels of PM2.5 has become increasingly important over the past few years, as it can provide early warnings and help prevent diseases. This paper aims to design, fine-tune, test, and evaluate machine learning models for predicting future levels of PM2.5 over various time horizons. Our primary objective is to assess and compare the performance of multiple models, ranging from linear regression algorithms and ensemble-based methods to deep learning models, such as advanced recurrent neural networks and transformers, as well as large language models, on this forecasting task.
[667] Learning an Efficient Optimizer via Hybrid-Policy Sub-Trajectory Balance
Yunchuan Guan, Yu Liu, Ke Zhou, Hui Li, Sen Jia, Zhiqi Shen, Ziyang Wang, Xinglin Zhang, Tao Chen, Jenq-Neng Hwang, Lei Li
Main category: cs.LG
TL;DR: Lo-Hp is a decoupled two-stage weight generation framework that addresses over-coupling and long-horizon issues in neural network weight generation by learning local optimization policies through hybrid-policy sub-trajectory balance.
Details
Motivation: Current weight generation methods suffer from over-coupling (tight binding of weight generation with task objectives) and long-horizon issues (inefficiency and low accuracy due to lack of local constraints), limiting optimizer flexibility and performance.Method: Proposes a decoupled two-stage framework with hybrid-policy sub-trajectory balance objective that integrates on-policy and off-policy learning to capture local optimization policies, enabling learning of various optimization policies.
Result: Theoretically demonstrates that learning local optimization policies addresses long-horizon issues while enhancing global optimal weight generation. Validates superior accuracy and inference efficiency in tasks requiring frequent weight updates.
Conclusion: Lo-Hp provides an effective solution to over-coupling and long-horizon problems in weight generation, offering improved flexibility, accuracy, and efficiency across various learning scenarios.
Abstract: Recent advances in generative modeling enable neural networks to generate weights without relying on gradient-based optimization. However, current methods are limited by over-coupling and long-horizon issues. The former tightly binds weight generation with task-specific objectives, thereby limiting the flexibility of the learned optimizer. The latter leads to inefficiency and low accuracy during inference, caused by the lack of local constraints. In this paper, we propose Lo-Hp, a decoupled two-stage weight generation framework that enhances flexibility through learning various optimization policies. It adopts a hybrid-policy sub-trajectory balance objective, which integrates on-policy and off-policy learning to capture local optimization policies. Theoretically, we demonstrate that learning solely local optimization policies can address the long-horizon issue while enhancing the generation of global optimal weights. In addition, we validate Lo-Hp’s superior accuracy and inference efficiency in tasks that require frequent weight updates, such as transfer learning, few-shot learning, domain generalization, and large language model adaptation.
[668] Robust Single-Agent Reinforcement Learning for Regional Traffic Signal Control Under Demand Fluctuations
Qiang Li, Jin Niu, Lina Yu
Main category: cs.LG
TL;DR: A single-agent reinforcement learning framework using DreamerV3 world model for regional adaptive traffic signal control that effectively reduces queue lengths and shows robust performance against demand fluctuations.
Details
Motivation: Traditional traffic signal control systems fail to capture real-world traffic complexity and dynamics, while multi-agent systems face coordination challenges. Traffic congestion from intersection queuing impacts urban living standards, safety, and economic efficiency.Method: Single-agent RL framework with centralized decision-making using adjacency matrix to encode road network topology, real-time queue states from probe vehicle data, and signal timing parameters. Uses DreamerV3 world model for efficient learning, with actions that sequentially select intersections and adjust signal phase splits.
Result: Simulation experiments in SUMO show the model significantly reduces queue lengths and exhibits robust anti-fluctuation capability under multi-level (10%, 20%, 30%) OD demand fluctuations.
Conclusion: The framework establishes a new paradigm for intelligent traffic control compatible with probe vehicle technology, with future work focusing on incorporating stochastic OD demand fluctuations and regional optimization for contingency events.
Abstract: Traffic congestion, primarily driven by intersection queuing, significantly impacts urban living standards, safety, environmental quality, and economic efficiency. While Traffic Signal Control (TSC) systems hold potential for congestion mitigation, traditional optimization models often fail to capture real-world traffic complexity and dynamics. This study introduces a novel single-agent reinforcement learning (RL) framework for regional adaptive TSC, circumventing the coordination complexities inherent in multi-agent systems through a centralized decision-making paradigm. The model employs an adjacency matrix to unify the encoding of road network topology, real-time queue states derived from probe vehicle data, and current signal timing parameters. Leveraging the efficient learning capabilities of the DreamerV3 world model, the agent learns control policies where actions sequentially select intersections and adjust their signal phase splits to regulate traffic inflow/outflow, analogous to a feedback control system. Reward design prioritizes queue dissipation, directly linking congestion metrics (queue length) to control actions. Simulation experiments conducted in SUMO demonstrate the model’s effectiveness: under inference scenarios with multi-level (10%, 20%, 30%) Origin-Destination (OD) demand fluctuations, the framework exhibits robust anti-fluctuation capability and significantly reduces queue lengths. This work establishes a new paradigm for intelligent traffic control compatible with probe vehicle technology. Future research will focus on enhancing practical applicability by incorporating stochastic OD demand fluctuations during training and exploring regional optimization mechanisms for contingency events.
[669] Temporal Fusion Transformer for Multi-Horizon Probabilistic Forecasting of Weekly Retail Sales
Santhi Bharath Punati, Sandeep Kanta, Udaya Bhasker Cheerala, Madhusudan G Lanjewar, Praveen Damacharla
Main category: cs.LG
TL;DR: Temporal Fusion Transformer (TFT) model applied to Walmart sales forecasting achieves superior multi-horizon probabilistic forecasts with interpretability, outperforming baseline models.
Details
Motivation: Accurate multi-horizon retail forecasts are critical for inventory and promotions planning in retail operations.Method: Uses Temporal Fusion Transformer (TFT) that fuses static store identifiers with time-varying exogenous signals (holidays, CPI, fuel price, temperature) to produce 1-5-week-ahead probabilistic forecasts via Quantile Loss.
Result: Achieves RMSE of $57.9k USD per store-week and R² of 0.9875 on hold-out dataset; averages RMSE = $64.6k USD and R² = 0.9844 across 5-fold cross-validation, outperforming XGB, CNN, LSTM, and CNN-LSTM baselines.
Conclusion: Demonstrates practical value for inventory planning and holiday-period optimization while maintaining model transparency through interpretability features.
Abstract: Accurate multi-horizon retail forecasts are critical for inventory and promotions. We present a novel study of weekly Walmart sales (45 stores, 2010–2012) using a Temporal Fusion Transformer (TFT) that fuses static store identifiers with time-varying exogenous signals (holidays, CPI, fuel price, temperature). The pipeline produces 1–5-week-ahead probabilistic forecasts via Quantile Loss, yielding calibrated 90% prediction intervals and interpretability through variable-selection networks, static enrichment, and temporal attention. On a fixed 2012 hold-out dataset, TFT achieves an RMSE of $57.9k USD per store-week and an $R^2$ of 0.9875. Across a 5-fold chronological cross-validation, the averages are RMSE = $64.6k USD and $R^2$ = 0.9844, outperforming the XGB, CNN, LSTM, and CNN-LSTM baseline models. These results demonstrate practical value for inventory planning and holiday-period optimization, while maintaining model transparency.
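The probabilistic forecasts come from training against a quantile (pinball) loss; a minimal sketch follows, with the quantile set and tensor shapes chosen purely for illustration.

```python
import torch

def quantile_loss(pred, target, quantiles=(0.05, 0.5, 0.95)):
    """Pinball loss averaged over quantiles; pred has shape (..., len(quantiles))."""
    losses = []
    for i, q in enumerate(quantiles):
        err = target - pred[..., i]
        losses.append(torch.maximum(q * err, (q - 1) * err))
    return torch.mean(torch.stack(losses))

# Toy usage: 1-5-week-ahead forecasts for a batch of 32 store-weeks.
torch.manual_seed(0)
pred = torch.randn(32, 5, 3)          # (batch, horizon, quantiles)
target = torch.randn(32, 5)
print(quantile_loss(pred, target).item())
```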
[670] Red-teaming Activation Probes using Prompted LLMs
Phil Blandfort, Robert Graham
Main category: cs.LG
TL;DR: A lightweight black-box red-teaming procedure using LLMs with iterative feedback and in-context learning can discover valuable failure modes in activation probes, revealing interpretable brittleness patterns and persistent vulnerabilities.
Details
Motivation: To explore the real-world robustness of activation probes under realistic, black-box adversarial pressure and surface failure modes with minimal effort.Method: A lightweight black-box red-teaming procedure that wraps an off-the-shelf LLM with iterative feedback and in-context learning, requiring no fine-tuning, gradients, or architectural access.
Result: The approach discovered interpretable brittleness patterns (e.g., legalese-induced false positives, bland procedural tone false negatives) and reduced but persistent vulnerabilities under scenario-constraint attacks.
Conclusion: Simple prompted red-teaming scaffolding can anticipate failure patterns before deployment and yield actionable insights to harden future probes.
Abstract: Activation probes are attractive monitors for AI systems due to low cost and latency, but their real-world robustness remains underexplored. We ask: What failure modes arise under realistic, black-box adversarial pressure, and how can we surface them with minimal effort? We present a lightweight black-box red-teaming procedure that wraps an off-the-shelf LLM with iterative feedback and in-context learning (ICL), and requires no fine-tuning, gradients, or architectural access. Running a case study with probes for high-stakes interactions, we show that our approach can help discover valuable insights about a SOTA probe. Our analysis uncovers interpretable brittleness patterns (e.g., legalese-induced FPs; bland procedural tone FNs) and reduced but persistent vulnerabilities under scenario-constraint attacks. These results suggest that simple prompted red-teaming scaffolding can anticipate failure patterns before deployment and might yield promising, actionable insights to harden future probes.
[671] FTT-GRU: A Hybrid Fast Temporal Transformer with GRU for Remaining Useful Life Prediction
Varun Teja Chirukiri, Udaya Bhasker Cheerala, Sandeep Kanta, Abdul Karim, Praveen Damacharla
Main category: cs.LG
TL;DR: Proposes FTT-GRU, a hybrid model combining Fast Temporal Transformer (using FFT for linearized attention) with GRU for RUL prediction, achieving state-of-the-art results on NASA CMAPSS dataset with improved efficiency.
Details
Motivation: Existing approaches like LSTM and CNN struggle to model both global temporal dependencies and fine-grained degradation trends in multivariate sensor data for remaining useful life prediction.Method: Hybrid model combining Fast Temporal Transformer (lightweight Transformer variant using linearized attention via FFT) with GRU layer for sequential modeling to capture both global and local degradation patterns.
Result: On CMAPSS FD001: RMSE 30.76, MAE 18.97, R²=0.45, with 1.12 ms CPU latency. Improves RMSE by 1.16% and MAE by 4.00% over best published baseline (TCN-Attention). Training shows smooth convergence with narrow confidence bands.
Conclusion: Compact Transformer-RNN hybrid delivers accurate and efficient RUL predictions suitable for real-time industrial prognostics, with both components contributing to performance.
Abstract: Accurate prediction of the remaining useful life (RUL) of industrial machinery is essential for reducing downtime and optimizing maintenance schedules. Existing approaches, such as long short-term memory (LSTM) networks and convolutional neural networks (CNNs), often struggle to model both global temporal dependencies and fine-grained degradation trends in multivariate sensor data. We propose a hybrid model, FTT-GRU, which combines a Fast Temporal Transformer (FTT) – a lightweight Transformer variant using linearized attention via fast Fourier transform (FFT) – with a gated recurrent unit (GRU) layer for sequential modeling. To the best of our knowledge, this is the first application of an FTT with a GRU for RUL prediction on NASA CMAPSS, enabling simultaneous capture of global and local degradation patterns in a compact architecture. On CMAPSS FD001, FTT-GRU attains RMSE 30.76, MAE 18.97, and $R^2=0.45$, with 1.12 ms CPU latency at batch=1. Relative to the best published deep baseline (TCN–Attention), it improves RMSE by 1.16% and MAE by 4.00%. Training curves averaged over $k=3$ runs show smooth convergence with narrow 95% confidence bands, and ablations (GRU-only, FTT-only) support the contribution of both components. These results demonstrate that a compact Transformer-RNN hybrid delivers accurate and efficient RUL predictions on CMAPSS, making it suitable for real-time industrial prognostics.
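A minimal structural sketch under an explicit assumption: an FNet-style FFT token-mixing block stands in for the paper's FFT-linearized attention, followed by a GRU and a regression head; the input dimensions mirror a CMAPSS-like sensor window, but all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class FFTMixerBlock(nn.Module):
    """FNet-style token mixing: 2D FFT over (time, feature), keep the real part,
    then a position-wise feed-forward layer. A stand-in for the Fast Temporal Transformer block."""
    def __init__(self, d_model=64):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

    def forward(self, x):                               # x: (batch, time, d_model)
        mixed = torch.fft.fft2(x).real
        x = self.norm1(x + mixed)
        return self.norm2(x + self.ff(x))

class FTTGRUSketch(nn.Module):
    def __init__(self, n_features=14, d_model=64, horizon=1):
        super().__init__()
        self.proj = nn.Linear(n_features, d_model)
        self.mixer = FFTMixerBlock(d_model)
        self.gru = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, horizon)

    def forward(self, x):                               # x: (batch, time, n_features)
        h = self.mixer(self.proj(x))
        _, last = self.gru(h)                           # last hidden state: (1, batch, d_model)
        return self.head(last.squeeze(0))               # RUL estimate per sequence

# Toy usage: batch of 8 windows, 30 time steps, 14 sensor channels.
model = FTTGRUSketch()
print(model(torch.randn(8, 30, 14)).shape)              # -> torch.Size([8, 1])
```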
[672] Bayesian Network Structure Discovery Using Large Language Models
Yinghuan Zhang, Yufei Zhang, Parisa Kordjamshidi, Zijun Cui
Main category: cs.LG
TL;DR: A unified framework using LLMs for Bayesian network structure discovery, supporting both data-free (PromptBN) and data-aware (ReActBN) settings, outperforming existing methods especially in low/no-data scenarios.
Details
Motivation: Traditional structure learning methods require extensive observational data and are computationally expensive. While recent studies use LLMs as auxiliary tools, they don't fully leverage LLMs' capabilities in the core learning process.Method: Two approaches: PromptBN for data-free learning by querying LLMs with metadata, and ReActBN for data-aware learning that integrates ReAct reasoning with structure scores like BIC for iterative refinement, keeping LLMs actively involved throughout.
Result: The method significantly outperforms both existing LLM-based approaches and traditional data-driven algorithms, particularly in low- or no-data scenarios.
Conclusion: Placing LLMs at the center of Bayesian network structure discovery creates an effective unified framework that works well even with limited or no observational data.
Abstract: Understanding probabilistic relationships among variables is crucial for analyzing complex systems. Traditional structure learning methods often require extensive observational data and incur high computational costs. Recent studies have explored using large language models (LLMs) for structure learning, but most treat LLMs as auxiliary tools for pre-processing or post-processing, leaving the core learning process data-driven. In this work, we propose a unified framework for Bayesian network structure discovery that places LLMs at the center, supporting both data-free and data-aware settings. In the data-free case, we introduce PromptBN to query LLMs with metadata and efficiently uncover valid probabilistic relationships. When observational data are available, we introduce ReActBN, which integrates the ReAct reasoning paradigm with structure scores such as the Bayesian Information Criterion (BIC) for iterative refinement. Unlike prior methods that offload refinement to external algorithms, our framework maintains the LLM actively in the loop throughout the discovery process. Experiments demonstrate that our method significantly outperforms both existing LLM-based approaches and traditional data-driven algorithms, particularly in the low- or no-data scenario. Code is publicly available at https://github.com/sherryzyh/prompt2bn.
[673] Sparse and nonparametric estimation of equations governing dynamical systems with applications to biology
G. Pillonetto, A. Giaretta, A. Aravkin, M. Bisiacco, T. Elston
Main category: cs.LG
TL;DR: A novel framework combining sparse parametric estimation with nonparametric techniques to discover model equations from data, addressing limitations of purely parametric approaches like SINDy in capturing complex nonlinearities.
Details
Motivation: Complex dynamical systems in fields like systems biology often make bottom-up modeling unfeasible. While sparse estimation techniques like SINDy have been successful, purely parametric models fall short in accurately representing certain inherent nonlinearities without requiring prior knowledge of their functional forms.Method: Integration of sparse parametric estimation with nonparametric techniques to capture nonlinearities that SINDy cannot describe, without requiring a priori information about functional forms or expanding the function library.
Result: The framework successfully captures complex nonlinearities that traditional parametric approaches miss, as demonstrated on several examples related to estimation of complex biological phenomena.
Conclusion: The hybrid parametric-nonparametric framework provides a more comprehensive approach for data-driven discovery of model equations, particularly beneficial for complex systems where bottom-up modeling is challenging and certain nonlinearities are difficult to capture with purely parametric methods.
Abstract: Data-driven discovery of model equations is a powerful approach for understanding the behavior of dynamical systems in many scientific fields. In particular, the ability to learn mathematical models from data would benefit systems biology, where the complex nature of these systems often makes a bottom-up approach to modeling unfeasible. In recent years, sparse estimation techniques have gained prominence in system identification, primarily using parametric paradigms to efficiently capture system dynamics with minimal model complexity. In particular, the SINDy algorithm has successfully used sparsity to estimate nonlinear systems by extracting from a library of functions only a few key terms needed to capture the dynamics of these systems. However, parametric models often fall short in accurately representing certain nonlinearities inherent in complex systems. To address this limitation, we introduce a novel framework that integrates sparse parametric estimation with nonparametric techniques. It captures nonlinearities that SINDy cannot describe without requiring a priori information about their functional form. That is, without expanding the library of functions to include the very function being discovered. We illustrate our approach on several examples related to estimation of complex biological phenomena.
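For context, the parametric half of such a pipeline is typically the sequentially thresholded least-squares loop at the core of SINDy; a minimal sketch on a toy one-dimensional system follows (the nonparametric component described above is not shown).

```python
import numpy as np

def stlsq(Theta, dXdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: regress derivatives on a function library,
    zeroing small coefficients and refitting on the surviving terms each pass."""
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for k in range(dXdt.shape[1]):
            big = ~small[:, k]
            if big.any():
                Xi[big, k] = np.linalg.lstsq(Theta[:, big], dXdt[:, k], rcond=None)[0]
    return Xi

# Toy usage: recover dx/dt = -2x from noisy data with a polynomial library [1, x, x^2].
rng = np.random.default_rng(0)
t = np.linspace(0, 2, 200)
x = np.exp(-2 * t)
dxdt = np.gradient(x, t)[:, None] + 0.001 * rng.normal(size=(200, 1))
Theta = np.column_stack([np.ones_like(t), x, x ** 2])
print(stlsq(Theta, dxdt).round(2).T)   # expect roughly [0, -2, 0]
```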
[674] Diagnosing Hallucination Risk in AI Surgical Decision-Support: A Sequential Framework for Sequential Validation
Dong Chen, Yanzhe Wei, Zonglin He, Guan-Ming Kuang, Canhua Ye, Meiru An, Huili Peng, Yong Hu, Huiren Tao, Kenneth MC Cheung
Main category: cs.LG
TL;DR: This study evaluates hallucination risks in LLMs for clinical spine surgery decision support, finding DeepSeek-R1 performs best overall while extended reasoning modes don’t guarantee better clinical reliability.
Details
Motivation: LLMs offer transformative potential for clinical decision support but pose significant risks through hallucinations that could compromise patient safety in spine surgery.Method: Introduced a clinician-centered framework to quantify hallucination risks across five dimensions: diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment. Evaluated six leading LLMs across 30 expert-validated spinal cases with multidimensional stress-testing.
Result: DeepSeek-R1 demonstrated superior overall performance (86.03 ± 2.08), particularly in trauma and infection domains. Extended reasoning modes didn’t improve performance - Claude-3.7-Sonnet’s extended thinking underperformed standard version. Stress-testing showed recommendation quality degraded by 7.4% under complexity while other metrics improved marginally.
Conclusion: Extended chain-of-thought reasoning alone is insufficient for clinical reliability. Findings advocate integrating interpretability mechanisms into clinical workflows and establishing safety-aware validation frameworks for surgical LLM deployment.
Abstract: Large language models (LLMs) offer transformative potential for clinical decision support in spine surgery but pose significant risks through hallucinations, which are factually inconsistent or contextually misaligned outputs that may compromise patient safety. This study introduces a clinician-centered framework to quantify hallucination risks by evaluating diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment. We assessed six leading LLMs across 30 expert-validated spinal cases. DeepSeek-R1 demonstrated superior overall performance (total score: 86.03 $\pm$ 2.08), particularly in high-stakes domains such as trauma and infection. A critical finding reveals that reasoning-enhanced model variants did not uniformly outperform standard counterparts: Claude-3.7-Sonnet’s extended thinking mode underperformed relative to its standard version (80.79 $\pm$ 1.83 vs. 81.56 $\pm$ 1.92), indicating extended chain-of-thought reasoning alone is insufficient for clinical reliability. Multidimensional stress-testing exposed model-specific vulnerabilities, with recommendation quality degrading by 7.4% under amplified complexity. This decline contrasted with marginal improvements in rationality (+2.0%), readability (+1.7%) and diagnosis (+4.7%), highlighting a concerning divergence between perceived coherence and actionable guidance. Our findings advocate integrating interpretability mechanisms (e.g., reasoning chain visualization) into clinical workflows and establish a safety-aware validation framework for surgical LLM deployment.
[675] Gaining Momentum: Uncovering Hidden Scoring Dynamics in Hockey through Deep Neural Sequencing and Causal Modeling
Daniel Griffiths, Piper Moskow
Main category: cs.LG
TL;DR: A unified data-driven framework for quantifying offensive momentum and scoring likelihood in hockey using NHL event data, combining interpretable momentum weighting, xG estimation, sequence modeling, formation discovery, and causal inference.
Details
Motivation: To advance hockey analytics by providing real-time, actionable insights for coaches and analysts through principled, causally grounded tactical optimization of offensive performance.Method: Five-stage pipeline: 1) interpretable momentum weighting via logistic regression, 2) nonlinear xG estimation with gradient-boosted trees, 3) temporal sequence modeling with LSTM networks, 4) spatial formation discovery using PCA and K-Means clustering, 5) causal inference with X-Learner estimator to quantify treatment effects.
Result: Found ATE of 0.12 (95% CI: 0.05-0.17, p < 1e-50), corresponding to 15% relative gain in scoring potential, demonstrating that strategically structured sequences and compact formations causally elevate offensive performance.
Conclusion: The framework successfully delivers real-time actionable insights and advances hockey analytics toward principled, causally grounded tactical optimization, showing that structured sequences and formations significantly improve offensive outcomes.
Abstract: We present a unified, data-driven framework for quantifying and enhancing offensive momentum and scoring likelihood (expected goals, xG) in professional hockey. Leveraging a Sportlogiq dataset of 541,000 NHL event records, our end-to-end pipeline comprises five stages: (1) interpretable momentum weighting of micro-events via logistic regression; (2) nonlinear xG estimation using gradient-boosted decision trees; (3) temporal sequence modeling with Long Short-Term Memory (LSTM) networks; (4) spatial formation discovery through principal component analysis (PCA) followed by K-Means clustering on standardized player coordinates; and (5) use of an X-Learner causal inference estimator to quantify the average treatment effect (ATE) of adopting the identified “optimal” event sequences and formations. We observe an ATE of 0.12 (95% CI: 0.05-0.17, p < 1e-50), corresponding to a 15% relative gain in scoring potential. These results demonstrate that strategically structured sequences and compact formations causally elevate offensive performance. Our framework delivers real-time, actionable insights for coaches and analysts, advancing hockey analytics toward principled, causally grounded tactical optimization.
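A toy sketch of the first two pipeline stages described above, interpretable momentum weights from logistic regression and nonlinear xG from gradient-boosted trees. The features, labels, and data below are synthetic placeholders, not the Sportlogiq schema or the paper's actual models.

```python
# Toy sketch of pipeline stages (1)-(2): logistic-regression momentum weights
# for micro-events and gradient-boosted xG on shots. All features and data
# are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 5000
# Hypothetical micro-event features: [is_zone_entry, is_shot_attempt, speed].
events = rng.normal(size=(n, 3))
led_to_goal_soon = (events @ np.array([0.8, 1.2, 0.4]) + rng.normal(size=n)) > 1.0

# Stage 1: interpretable momentum weights = logistic-regression coefficients.
momentum_model = LogisticRegression(max_iter=1000).fit(events, led_to_goal_soon)
print("per-event momentum weights:", np.round(momentum_model.coef_.ravel(), 2))

# Stage 2: nonlinear xG from hypothetical shot features (distance, angle, rebound).
shots = rng.normal(size=(n, 3))
goal = (np.sin(shots[:, 0]) + 0.3 * shots[:, 1] ** 2 + rng.normal(size=n)) > 1.2
xg_model = GradientBoostingClassifier().fit(shots, goal)
print("xG for one shot:", xg_model.predict_proba(shots[:1])[0, 1].round(3))
```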
[676] Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering
Eric Bigelow, Daniel Wurgaft, YingQiao Wang, Noah Goodman, Tomer Ullman, Hidenori Tanaka, Ekdeep Singh Lubana
Main category: cs.LG
TL;DR: This paper develops a unified Bayesian framework that explains both prompt-based (in-context learning) and activation-based (steering) control of LLMs as instances of the same underlying mechanism - altering the model’s belief in latent concepts.
Details
Motivation: To unify seemingly disparate methodologies for controlling LLM behavior (prompt-based vs activation-based interventions) under a single theoretical framework, addressing the question of whether these approaches are specific instances of a broader control mechanism.Method: Developed a Bayesian perspective where both context- and activation-based interventions impact model behavior by altering belief in latent concepts: steering changes concept priors, while in-context learning accumulates evidence. This results in a closed-form Bayesian model tested across domains inspired by prior work on many-shot in-context learning.
Result: The Bayesian model successfully predicts LLM behavior across both intervention types, explains prior empirical phenomena (e.g., sigmoidal learning curves), and predicts novel phenomena (e.g., additivity of interventions in log-belief space leading to sudden behavioral shifts with slight control changes).
Conclusion: This work provides a unified theoretical account of prompt-based and activation-based control of LLM behavior, along with a methodology for empirically predicting intervention effects, offering a common framework for understanding different LLM control mechanisms.
Abstract: Large language models (LLMs) can be controlled at inference time through prompts (in-context learning) and internal activations (activation steering). Different accounts have been proposed to explain these methods, yet their common goal of controlling model behavior raises the question of whether these seemingly disparate methodologies can be seen as specific instances of a broader framework. Motivated by this, we develop a unifying, predictive account of LLM control from a Bayesian perspective. Specifically, we posit that both context- and activation-based interventions impact model behavior by altering its belief in latent concepts: steering operates by changing concept priors, while in-context learning leads to an accumulation of evidence. This results in a closed-form Bayesian model that is highly predictive of LLM behavior across context- and activation-based interventions in a set of domains inspired by prior work on many-shot in-context learning. This model helps us explain prior empirical phenomena - e.g., sigmoidal learning curves as in-context evidence accumulates - while predicting novel ones - e.g., additivity of both interventions in log-belief space, which results in distinct phases such that sudden and dramatic behavioral shifts can be induced by slightly changing intervention controls. Taken together, this work offers a unified account of prompt-based and activation-based control of LLM behavior, and a methodology for empirically predicting the effects of these interventions.
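A minimal numerical illustration of the claim, under an assumed parameterization: steering adds an offset to a concept's log-prior, each in-context example adds log-likelihood evidence, and the two combine additively in log-belief space, producing the sigmoidal curves the paper describes.

```python
# Minimal illustration (assumed parameterization, not the paper's fitted model):
# steering shifts a concept's log-prior, in-context examples accumulate
# log-likelihood evidence, and both add in log-belief space.
import numpy as np

def posterior_belief(log_prior, loglik_per_example, n_examples, steer_offset):
    log_belief = log_prior + steer_offset + n_examples * loglik_per_example
    log_belief -= log_belief.max()                # numerical stabilization
    p = np.exp(log_belief)
    return p / p.sum()

# Two latent concepts, e.g. "formal" vs "casual" style.
log_prior = np.log(np.array([0.2, 0.8]))
loglik = np.array([np.log(0.7), np.log(0.3)])     # each demo favors "formal"

for k in [0, 2, 4, 8]:
    for steer in [0.0, 1.5]:                      # activation-steering strength
        p = posterior_belief(log_prior, loglik, k, np.array([steer, 0.0]))
        print(f"demos={k} steer={steer}: P(formal)={p[0]:.3f}")
# P(formal) grows sigmoidally with the number of demos; steering shifts
# where the curve crosses 0.5, so small control changes can flip behavior.
```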
[677] Stochastic Shortest Path with Sparse Adversarial Costs
Emmeran Johnson, Alberto Rumi, Ciara Pike-Burke, Patrick Rebeschini
Main category: cs.LG
TL;DR: The paper introduces ℓᵣ-norm regularizers that adapt to sparsity in adversarial SSP problems, achieving optimal regret scaling with √log M instead of √log SA, where M is the number of costly state-action pairs.
Details
Motivation: Existing OMD with negative-entropy regularization scales with √log SA, which is suboptimal for sparse problems where only M ≪ SA state-action pairs incur costs. Negative-entropy regularization is inherently non-adaptive to sparsity.Method: Proposes a family of ℓᵣ-norm regularizers (r ∈ (1,2)) that adapt to sparsity in adversarial SSP problems with full-information feedback.
Result: The ℓᵣ-norm regularizers achieve regret scaling with √log M instead of √log SA, which is shown to be optimal via matching lower bounds. However, in unknown transition settings, benefits are limited as regret scales polynomially with SA even on sparse problems.
Conclusion: M captures the effective dimension of sparse SSP problems rather than SA. The proposed ℓᵣ-norm regularizers provide optimal adaptation to sparsity in known transition settings, but sparsity benefits are limited in unknown transition settings.
Abstract: We study the adversarial Stochastic Shortest Path (SSP) problem with sparse costs under full-information feedback. In the known transition setting, existing bounds based on Online Mirror Descent (OMD) with negative-entropy regularization scale with $\sqrt{\log S A}$, where $SA$ is the size of the state-action space. While we show that this is optimal in the worst-case, this bound fails to capture the benefits of sparsity when only a small number $M \ll SA$ of state-action pairs incur cost. In fact, we also show that the negative-entropy is inherently non-adaptive to sparsity: it provably incurs regret scaling with $\sqrt{\log S}$ on sparse problems. Instead, we propose a family of $\ell_r$-norm regularizers ($r \in (1,2)$) that adapts to the sparsity and achieves regret scaling with $\sqrt{\log M}$ instead of $\sqrt{\log SA}$. We show this is optimal via a matching lower bound, highlighting that $M$ captures the effective dimension of the problem instead of $SA$. Finally, in the unknown transition setting the benefits of sparsity are limited: we prove that even on sparse problems, the minimax regret for any learner scales polynomially with $SA$.
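To show the mechanical ingredient behind the sparsity adaptation, here is a simplified Online Mirror Descent step on the probability simplex with the potential ψ(x) = Σᵢ xᵢʳ, r ∈ (1,2), with the Bregman projection solved by bisection on the normalization multiplier. The paper applies this idea to SSP occupancy measures; this simplex analogue and its constants are assumptions.

```python
# Mechanical sketch (simplex analogue, not the full SSP algorithm): one OMD
# step with the l_r potential psi(x) = sum_i x_i**r, r in (1,2). The Bregman
# projection onto the simplex is found by bisection on the Lagrange multiplier.
import numpy as np

def omd_lr_step(x, grad, eta, r):
    theta = r * x ** (r - 1) - eta * grad        # gradient step in the dual space
    def simplex_mass(nu):                        # total mass of x(nu) on the simplex
        return (np.maximum((theta - nu) / r, 0.0) ** (1.0 / (r - 1))).sum()
    lo, hi = theta.min() - r, theta.max()        # bracket: mass(lo) >= 1, mass(hi) = 0
    for _ in range(100):                         # bisection on the multiplier
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if simplex_mass(mid) > 1.0 else (lo, mid)
    nu = 0.5 * (lo + hi)
    return np.maximum((theta - nu) / r, 0.0) ** (1.0 / (r - 1))

# Toy run: only the first 3 of 1000 coordinates ("state-action pairs") incur cost.
rng = np.random.default_rng(0)
d, M = 1000, 3
x = np.full(d, 1.0 / d)
for t in range(200):
    cost = np.zeros(d)
    cost[:M] = rng.uniform(size=M)
    x = omd_lr_step(x, cost, eta=0.5, r=1.5)
print("remaining mass on the 3 costly coordinates:", x[:M].sum().round(4))
```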
[678] Diluting Restricted Boltzmann Machines
C. Díaz-Faloh, R. Mulet
Main category: cs.LG
TL;DR: RBMs maintain strong performance with 80% pruning before training, but post-training pruning causes irrecoverable performance loss due to disruption of essential connections.
Details
Motivation: Address concerns about computational and environmental costs of large neural networks by investigating whether simpler, sparser networks can maintain performance.Method: Study Restricted Boltzmann Machines under extreme pruning conditions, inspired by Lottery Ticket Hypothesis, testing various pruning strategies before and after training.
Result: RBMs achieve high-quality generative performance with up to 80% pre-training pruning, but post-training pruning causes abrupt degradation and networks cannot fully recover through retraining.
Conclusion: Pruning should be implemented early in training rather than afterwards, as initial conditions persistently influence network capabilities and sparse networks work best when pruned before training.
Abstract: Recent advances in artificial intelligence have relied heavily on increasingly large neural networks, raising concerns about their computational and environmental costs. This paper investigates whether simpler, sparser networks can maintain strong performance by studying Restricted Boltzmann Machines (RBMs) under extreme pruning conditions. Inspired by the Lottery Ticket Hypothesis, we demonstrate that RBMs can achieve high-quality generative performance even when up to 80% of the connections are pruned before training, confirming that they contain viable sub-networks. However, our experiments reveal crucial limitations: trained networks cannot fully recover lost performance through retraining once additional pruning is applied. We identify a sharp transition above which the generative quality degrades abruptly when pruning disrupts a minimal core of essential connections. Moreover, re-trained networks remain constrained by the parameters originally learned performing worse than networks trained from scratch at equivalent sparsity levels. These results suggest that for sparse networks to work effectively, pruning should be implemented early in training rather than attempted afterwards. Our findings provide practical insights for the development of efficient neural architectures and highlight the persistent influence of initial conditions on network capabilities.
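A minimal sketch of the pre-training pruning setting the paper studies: fix a random sparsity mask on the RBM weight matrix before training and keep it fixed through contrastive-divergence (CD-1) updates. Sizes, pruning fraction, and hyperparameters are illustrative.

```python
# Minimal sketch: prune 80% of RBM connections *before* training and keep the
# mask fixed during CD-1 updates. Dimensions and hyperparameters are illustrative.
import torch

n_vis, n_hid, prune_frac = 784, 128, 0.8
W = 0.01 * torch.randn(n_vis, n_hid)
mask = (torch.rand(n_vis, n_hid) > prune_frac).float()   # keep ~20% of the edges
b_v, b_h = torch.zeros(n_vis), torch.zeros(n_hid)

def cd1_step(v0, lr=0.05):
    global W, b_v, b_h
    Wm = W * mask                                          # pruned weight matrix
    ph0 = torch.sigmoid(v0 @ Wm + b_h)
    h0 = torch.bernoulli(ph0)
    pv1 = torch.sigmoid(h0 @ Wm.t() + b_v)
    v1 = torch.bernoulli(pv1)
    ph1 = torch.sigmoid(v1 @ Wm + b_h)
    grad_W = (v0.t() @ ph0 - v1.t() @ ph1) / v0.shape[0]
    W += lr * grad_W * mask                                # gradients respect the mask
    b_v += lr * (v0 - v1).mean(0)
    b_h += lr * (ph0 - ph1).mean(0)

# Toy usage on random binary "images".
data = (torch.rand(64, n_vis) > 0.5).float()
for _ in range(10):
    cd1_step(data)
print("fraction of active connections:", mask.mean().item())
```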
[679] Reviving Stale Updates: Data-Free Knowledge Distillation for Asynchronous Federated Learning
Baris Askin, Holger R. Roth, Zhenyu Sun, Carlee Joe-Wong, Gauri Joshi, Ziyue Xu
Main category: cs.LG
TL;DR: FedRevive is an asynchronous federated learning framework that uses data-free knowledge distillation to mitigate stale updates, achieving faster training and higher accuracy compared to baselines.
Details
Motivation: Asynchronous FL improves scalability but introduces stale updates from outdated global models, which destabilize optimization and hinder convergence.Method: FedRevive combines parameter-space aggregation with server-side data-free knowledge distillation using a meta-learned generator to synthesize pseudo-samples for multi-teacher distillation, and employs a hybrid aggregation scheme.
Result: Experiments show FedRevive achieves up to 32.1% faster training and up to 21.5% higher final accuracy compared to asynchronous baselines on vision and text benchmarks.
Conclusion: FedRevive effectively mitigates staleness in asynchronous FL while retaining scalability, demonstrating significant improvements in training efficiency and model performance.
Abstract: Federated Learning (FL) enables collaborative model training across distributed clients without sharing raw data, yet its scalability is limited by synchronization overhead. Asynchronous Federated Learning (AFL) alleviates this issue by allowing clients to communicate independently, thereby improving wall-clock efficiency in large-scale, heterogeneous environments. However, this asynchrony introduces stale updates (client updates computed on outdated global models) that can destabilize optimization and hinder convergence. We propose FedRevive, an asynchronous FL framework that revives stale updates through data-free knowledge distillation (DFKD). FedRevive integrates parameter-space aggregation with a lightweight, server-side DFKD process that transfers knowledge from stale client models to the current global model without access to real or public data. A meta-learned generator synthesizes pseudo-samples, which enables multi-teacher distillation. A hybrid aggregation scheme that combines raw updates with DFKD updates effectively mitigates staleness while retaining the scalability of AFL. Experiments on various vision and text benchmarks show that FedRevive achieves faster training up to 32.1% and higher final accuracy up to 21.5% compared to asynchronous baselines.
[680] Sensitivity Analysis for Climate Science with Generative Flow Models
Alex Dobra, Jakiw Pidstrigach, Tim Reichelt, Paolo Fraccaro, Johannes Jakubik, Anne Jones, Christian Schroeder de Witt, Philip Stier, Philip Torr
Main category: cs.LG
TL;DR: This paper applies adjoint state method to compute gradients in generative flow models for climate sensitivity analysis, reducing computational cost from weeks to hours while maintaining reliability.
Details
Motivation: Traditional physical models for climate sensitivity analysis are computationally expensive, and while AI-based generative models are faster, computing sensitivities with them remains a bottleneck.Method: Applied adjoint state method for calculating gradients in generative flow models (especially diffusion models), used with cBottle generative model on ERA5 data, and proposed gradient self-consistency check for validation.
Result: The approach successfully computed sensitivities with respect to sea surface temperatures, reducing computational cost from weeks on supercomputers to hours on GPUs while producing reliable gradients.
Conclusion: This method provides an efficient and reliable approach for climate sensitivity analysis, significantly simplifying a critical workflow in climate science.
Abstract: Sensitivity analysis is a cornerstone of climate science, essential for understanding phenomena ranging from storm intensity to long-term climate feedbacks. However, computing these sensitivities using traditional physical models is often prohibitively expensive in terms of both computation and development time. While modern AI-based generative models are orders of magnitude faster to evaluate, computing sensitivities with them remains a significant bottleneck. This work addresses this challenge by applying the adjoint state method for calculating gradients in generative flow models, with diffusion models as a special case. We apply this method to the cBottle generative model, an emulator of ERA5 data, to perform sensitivity analysis with respect to sea surface temperatures. Furthermore, we propose a novel gradient self-consistency check to quantitatively validate the computed sensitivities against the model’s own outputs. Our results provide initial evidence that this approach can produce reliable gradients, reducing the computational cost of sensitivity analysis from weeks on a supercomputer with a physical model to hours on a GPU, thereby simplifying a critical workflow in climate science.
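The cBottle emulator and ERA5 data are not reproduced here, but the underlying computation can be illustrated: differentiate a statistic of a flow-model sample with respect to a conditioning input (an SST-like forcing) by backpropagating through Euler integration of the velocity field. PyTorch autograd stands in for the adjoint computation; the network and forcing are toy stand-ins.

```python
# Toy sketch of the sensitivity computation: backpropagate a sampled statistic
# through Euler integration of a flow model's velocity field. Autograd plays
# the role of the adjoint method; the network and "SST" input are stand-ins.
import torch
import torch.nn as nn

velocity = nn.Sequential(nn.Linear(4 + 1 + 1, 64), nn.Tanh(), nn.Linear(64, 4))

def sample(condition, n_steps=20):
    x = torch.zeros(1, 4)                       # fixed initial noise for clarity
    for k in range(n_steps):
        t = torch.full((1, 1), k / n_steps)
        x = x + (1.0 / n_steps) * velocity(torch.cat([x, t, condition], dim=1))
    return x

sst = torch.tensor([[0.7]], requires_grad=True)  # conditioning input (SST proxy)
statistic = sample(sst).mean()                   # e.g. a mean-field output statistic
statistic.backward()
print("d(statistic)/d(SST):", sst.grad.item())
```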
[681] Inference-Time Chain-of-Thought Pruning with Latent Informativeness Signals
Sophie Li, Nicholas Huang, Nayan Saxena, Nina Luo, Vincent Lin, Kevin Zhu, Sunishchal Dev
Main category: cs.LG
TL;DR: KAPPA is a new inference-time method that uses KL divergence, confidence, and entropy to prune unpromising reasoning paths early, achieving significant computational savings while maintaining accuracy compared to standard Best-of-N approaches.
Details
Motivation: Standard Best-of-N methods for LLM reasoning are computationally expensive, and existing early-truncation methods like ST-BoN rely on suboptimal heuristics that don't directly evaluate branch quality.Method: KAPPA combines Kullback-Leibler divergence, confidence, and entropy into a scoring function to guide progressive pruning of reasoning paths, promoting diversity during exploration while selectively eliminating low-scoring branches.
Result: Experiments show KAPPA achieves up to ~60% reduction in peak memory and ~90% reduction in total token generation compared to BoN, while maintaining accuracy and stabilizing performance in smaller models.
Conclusion: KAPPA provides a principled approach to efficient multi-path reasoning that substantially reduces computational costs without compromising accuracy.
Abstract: Large language models (LLMs) improve reasoning accuracy when generating multiple candidate solutions at test time, but standard methods like Best-of-N (BoN) incur high computational cost by fully generating all branches. Self-Truncation Best-of-N (ST-BoN) mitigates this by truncating unpromising paths early, but its reliance on consistency-based heuristics is a limitation as it does not directly evaluate branch quality. We present KL-Adjusted Pruned Path Algorithm (KAPPA), an inference-time method that combines Kullback-Leibler divergence, confidence, and entropy into a principled scoring function to guide progressive pruning. By promoting diversity during exploration and selectively eliminating low-scoring branches, KAPPA maintains accuracy while substantially reducing memory and token usage. Experiments on GSM8K and MATH500 with DeepSeek-R1-Distill-Qwen-1.5B and Qwen2.5-7B-Instruct demonstrate that KAPPA stabilizes performance in smaller models and achieves up to ~60% reduction in peak memory and ~90% reduction in total token generation relative to BoN, with minimal impact on accuracy.
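A hedged sketch of branch scoring in the spirit of KAPPA: combine the three named signals (KL divergence, confidence, entropy) into a single score per branch and prune the lowest-scoring ones. The reference distribution for the KL term and the mixing weights are assumptions, not the paper's exact definition.

```python
# Hedged sketch of KAPPA-style branch scoring: KL divergence, confidence, and
# entropy combined into one score, with low-scoring branches pruned. The KL
# reference (pooled branch distribution) and the weights are assumptions.
import torch
import torch.nn.functional as F

def branch_scores(next_token_logits, w_kl=1.0, w_conf=1.0, w_ent=1.0):
    # next_token_logits: (n_branches, vocab) logits at the current decode step.
    logp = F.log_softmax(next_token_logits, dim=-1)
    p = logp.exp()
    pooled = p.mean(dim=0, keepdim=True)                        # average over branches
    kl = (p * (logp - pooled.clamp_min(1e-12).log())).sum(-1)   # KL(branch || pooled)
    confidence = p.max(dim=-1).values
    entropy = -(p * logp).sum(-1)
    return w_kl * kl + w_conf * confidence - w_ent * entropy    # higher = keep

logits = torch.randn(8, 32000)                                  # 8 candidate branches
scores = branch_scores(logits)
keep = scores.topk(4).indices                                   # progressively prune half
print("surviving branches:", keep.tolist())
```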
[682] Privacy-Aware Time Series Synthesis via Public Knowledge Distillation
Penghang Liu, Haibei Zhu, Eleonora Kreacic, Svitlana Vyetrenko
Main category: cs.LG
TL;DR: Pub2Priv is a framework that generates private time series data by leveraging public contextual metadata through self-attention embeddings and diffusion models, achieving better privacy-utility trade-offs than existing methods.
Details
Motivation: Sensitive time series data sharing is restricted due to privacy concerns, and existing privacy-aware generation methods overlook the opportunity to leverage publicly available contextual metadata, resulting in suboptimal privacy-utility trade-offs.Method: Uses self-attention mechanism to encode public data into temporal and feature embeddings, which serve as conditional inputs for a diffusion model to generate synthetic private sequences.
Result: Consistently outperforms state-of-the-art benchmarks in improving privacy-utility trade-off across finance, energy, and commodity trading domains, with a practical metric to assess privacy through identifiability evaluation.
Conclusion: Pub2Priv effectively leverages heterogeneous public knowledge to generate private time series data with superior privacy-utility trade-offs compared to existing methods.
Abstract: Sharing sensitive time series data in domains such as finance, healthcare, and energy consumption, such as patient records or investment accounts, is often restricted due to privacy concerns. Privacy-aware synthetic time series generation addresses this challenge by enforcing noise during training, inherently introducing a trade-off between privacy and utility. In many cases, sensitive sequences are correlated with publicly available, non-sensitive contextual metadata (e.g., household electricity consumption may be influenced by weather conditions and electricity prices). However, existing privacy-aware data generation methods often overlook this opportunity, resulting in suboptimal privacy-utility trade-offs. In this paper, we present Pub2Priv, a novel framework for generating private time series data by leveraging heterogeneous public knowledge. Our model employs a self-attention mechanism to encode public data into temporal and feature embeddings, which serve as conditional inputs for a diffusion model to generate synthetic private sequences. Additionally, we introduce a practical metric to assess privacy by evaluating the identifiability of the synthetic data. Experimental results show that Pub2Priv consistently outperforms state-of-the-art benchmarks in improving the privacy-utility trade-off across finance, energy, and commodity trading domains.
[683] Investigating the Robustness of Knowledge Tracing Models in the Presence of Student Concept Drift
Morgan Lee, Artem Frenk, Eamon Worden, Karish Gupta, Thinh Pham, Ethan Croteau, Neil Heffernan
Main category: cs.LG
TL;DR: KT models are susceptible to concept drift in online learning platforms, with BKT being the most stable while complex attention-based models degrade faster over time.
Details
Motivation: To investigate how concept drift and changing student populations impact KT model performance across multiple academic years, challenging the assumption of static learning processes.Method: Applied four KT models to five academic years of data to assess susceptibility to concept drift, comparing performance within single years and across multiple years.
Result: All KT models exhibited degraded performance over time, with BKT remaining most stable while attention-based models lost predictive power significantly faster.
Conclusion: KT models are vulnerable to concept drift, highlighting the need for longitudinal evaluations and more robust models that can adapt to changing educational environments.
Abstract: Knowledge Tracing (KT) has been an established problem in the educational data mining field for decades, and it is commonly assumed that the underlying learning process being modeled remains static. Given the ever-changing landscape of online learning platforms (OLPs), we investigate how concept drift and changing student populations can impact student behavior within an OLP through testing model performance both within a single academic year and across multiple academic years. Four well-studied KT models were applied to five academic years of data to assess how susceptible KT models are to concept drift. Through our analysis, we find that all four families of KT models can exhibit degraded performance; Bayesian Knowledge Tracing (BKT) remains the most stable KT model when applied to newer data, while more complex, attention-based models lose predictive power significantly faster. To foster more longitudinal evaluations of KT models, the data used to conduct our analysis is available at https://osf.io/hvfn9/?view_only=b936c63dfdae4b0b987a2f0d4038f72a
[684] TRISKELION-1: Unified Descriptive-Predictive-Generative AI
Nardeep Kumar, Arun Kanwar
Main category: cs.LG
TL;DR: TRISKELION-1 is a unified architecture integrating statistical, mechanistic, and generative reasoning in an encoder-decoder framework with variational optimization.
Details
Motivation: To create a universal intelligence architecture that connects interpretability, accuracy, and creativity by unifying descriptive, predictive, and generative capabilities.Method: Uses a single encoder-decoder framework with variational objectives to jointly optimize descriptive representation learning, predictive inference, and generative synthesis.
Result: Validated on MNIST, showing stable coexistence of descriptive reconstruction, predictive classification, and generative sampling within one model.
Conclusion: Provides a blueprint for universal intelligence architectures that integrate multiple reasoning paradigms in a unified framework.
Abstract: TRISKELION-1 is a unified descriptive-predictive-generative architecture that integrates statistical, mechanistic, and generative reasoning within a single encoder-decoder framework. The model demonstrates how descriptive representation learning, predictive inference, and generative synthesis can be jointly optimized using variational objectives. Experiments on MNIST validate that descriptive reconstruction, predictive classification, and generative sampling can coexist stably within one model. The framework provides a blueprint toward universal intelligence architectures that connect interpretability, accuracy, and creativity.
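A minimal sketch of the kind of joint objective described: a single encoder-decoder whose loss sums a descriptive reconstruction term, a generative variational (KL) term, and a predictive classification term on the shared latent. The architecture and loss weights are illustrative assumptions, not the paper's model.

```python
# Minimal sketch of jointly optimizing descriptive (reconstruction), generative
# (latent prior / KL), and predictive (classification) terms on one
# encoder-decoder. Architecture and weighting are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Triskelion(nn.Module):
    def __init__(self, d_in=784, d_z=16, n_classes=10):
        super().__init__()
        self.enc = nn.Linear(d_in, 2 * d_z)      # outputs mean and log-variance
        self.dec = nn.Linear(d_z, d_in)
        self.cls = nn.Linear(d_z, n_classes)

    def forward(self, x, y):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        recon = F.binary_cross_entropy_with_logits(self.dec(z), x)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        clf = F.cross_entropy(self.cls(mu), y)
        return recon + kl + clf                                 # joint variational objective

model = Triskelion()
x, y = torch.rand(32, 784), torch.randint(0, 10, (32,))
loss = model(x, y)
loss.backward()
print("joint loss:", round(loss.item(), 3))
```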
[685] Enhancing Heavy Rain Nowcasting with Multimodal Data: Integrating Radar and Satellite Observations
Rama Kassoumeh, David Rügamer, Henning Oppel
Main category: cs.LG
TL;DR: Multimodal fusion of satellite and radar data significantly improves heavy rain nowcasting accuracy compared to radar-only approaches, especially for intense precipitation events.
Details
Motivation: Traditional ground-based rain gauges miss most heavy rain events (only 17.3% detected in Germany 2001-2018), and radar alone struggles to forecast brief, unpredictable heavy rain events that cause urban flooding.Method: Developed a multimodal nowcasting model that combines both radar and satellite imagery for precipitation prediction at 5, 15, and 30-minute lead times.
Result: Multimodal approach outperforms radar-only: increases Critical Success Index by 4% for heavy rain and 3% for violent rain at 5-minute lead time. Maintains higher predictive skill at longer lead times where radar-only declines. Case study of 2021 North Rhine-Westphalia flooding shows more detailed and accurate forecasts.
Conclusion: Satellite-radar fusion enables timely, reliable warnings for life-saving heavy rain forecasting, addressing limitations of traditional monitoring systems.
Abstract: The increasing frequency of heavy rainfall events, which are a major cause of urban flooding, underscores the urgent need for accurate precipitation forecasting - particularly in urban areas where localized events often go undetected by ground-based sensors. In Germany, only 17.3% of hourly heavy rain events between 2001 and 2018 were recorded by rain gauges, highlighting the limitations of traditional monitoring systems. Radar data are another source that effectively tracks ongoing precipitation; however, forecasting the development of heavy rain using radar alone remains challenging due to the brief and unpredictable nature of such events. Our focus is on evaluating the effectiveness of fusing satellite and radar data for nowcasting. We develop a multimodal nowcasting model that combines both radar and satellite imagery for predicting precipitation at lead times of 5, 15, and 30 minutes. We demonstrate that this multimodal strategy significantly outperforms radar-only approaches. Experimental results show that integrating satellite data improves prediction accuracy, particularly for intense precipitation. The proposed model increases the Critical Success Index for heavy rain by 4% and for violent rain by 3% at a 5-minute lead time. Moreover, it maintains higher predictive skill at longer lead times, where radar-only performance declines. A qualitative analysis of the severe flooding event in the state of North Rhine-Westphalia, Germany in 2021 further illustrates the superior performance of the multimodal model. Unlike the radar-only model, which captures general precipitation patterns, the multimodal model yields more detailed and accurate forecasts for regions affected by heavy rain. This improved precision enables timely, reliable, life-saving warnings. Implementation available at https://github.com/RamaKassoumeh/Multimodal_heavy_rain
[686] Effective Series Decomposition and Components Learning for Time Series Generation
Zixuan Ma, Chenfeng Huang
Main category: cs.LG
TL;DR: STDiffusion is a novel framework for multivariate time series generation that combines diffusion models with interpretable series decomposition to separately capture trend and seasonal patterns, achieving state-of-the-art performance.
Details
Motivation: Existing time series generation approaches often fail to use interpretative decomposition methods, limiting their ability to synthesize meaningful trend and seasonal patterns, creating a gap in interpretable generation.Method: Integrates diffusion probabilistic models with learnable series decomposition using MLP for trend capture and adaptive wavelet distillation for multi-resolution seasonal learning, plus a comprehensive correction mechanism for component consistency.
Result: Achieves state-of-the-art performance on eight real-world datasets and successfully extends to multi-window long-sequence time series generation with reliable results.
Conclusion: STDiffusion provides an interpretable and effective framework for time series generation that separates trend and seasonal learning while maintaining component consistency, demonstrating robustness and versatility across various applications.
Abstract: Time series generation focuses on modeling the underlying data distribution and resampling to produce authentic time series data. Key components, such as trend and seasonality, drive temporal fluctuations, yet many existing approaches fail to employ interpretative decomposition methods, limiting their ability to synthesize meaningful trend and seasonal patterns. To address this gap, we introduce Seasonal-Trend Diffusion (STDiffusion), a novel framework for multivariate time series generation that integrates diffusion probabilistic models with advanced learnable series decomposition techniques, enhancing the interpretability of the generation process. Our approach separates the trend and seasonal learning into distinct blocks: a Multi-Layer Perceptron (MLP) structure captures the trend, while adaptive wavelet distillation facilitates effective multi-resolution learning of seasonal components. This decomposition improves the interpretability of the model on multiple scales. In addition, we designed a comprehensive correction mechanism aimed at ensuring that the generated components exhibit a high degree of internal consistency and preserve meaningful interrelationships with one another. Our empirical studies on eight real-world datasets demonstrate that STDiffusion achieves state-of-the-art performance in time series generation tasks. Furthermore, we extend the model’s application to multi-window long-sequence time series generation, which delivered reliable results and highlighted its robustness and versatility.
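A simplified stand-in for the decomposition idea: an MLP fits a smooth trend over each window, and the residual is treated as the seasonal component that a separate module (adaptive wavelet distillation in the paper) would then model. This sketch is a generic learnable split, not STDiffusion's actual blocks.

```python
# Generic learnable trend/seasonal split: an MLP models the trend per variable
# and the residual is handed to the seasonal branch. A simplified stand-in for
# the paper's decomposition, not its actual architecture.
import torch
import torch.nn as nn

class TrendSeasonalSplit(nn.Module):
    def __init__(self, window_len, n_vars):
        super().__init__()
        self.trend_mlp = nn.Sequential(
            nn.Linear(window_len, 64), nn.ReLU(), nn.Linear(64, window_len)
        )
        self.n_vars = n_vars

    def forward(self, x):                 # x: (batch, window_len, n_vars)
        xt = x.transpose(1, 2)            # apply the MLP per variable over time
        trend = self.trend_mlp(xt).transpose(1, 2)
        seasonal = x - trend              # residual carries the periodic structure
        return trend, seasonal

x = torch.randn(8, 96, 3)
trend, seasonal = TrendSeasonalSplit(96, 3)(x)
print(trend.shape, seasonal.shape)
```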
[687] Fast PINN Eigensolvers via Biconvex Reformulation
Akshay Sai Banderwaar, Abhishek Gupta
Main category: cs.LG
TL;DR: A reformulated PINN approach for eigenvalue problems using biconvex optimization and alternating convex search achieves 500x faster convergence than gradient-based PINN training.
Details
Motivation: Eigenvalue problems are fundamental but traditional PINNs are orders of magnitude slower than classical numerical methods.Method: Reformulates eigenpair search as biconvex optimization problem and uses alternating convex search with analytically optimal updates for eigenvalues and eigenfunctions.
Result: PINN-ACS achieves high accuracy with convergence speeds up to 500x faster than gradient-based PINN training.
Conclusion: The proposed PINN-ACS method provides a fast and provably convergent alternative for solving eigenvalue problems with physics-informed neural networks.
Abstract: Eigenvalue problems have a distinctive forward-inverse structure and are fundamental to characterizing a system’s thermal response, stability, and natural modes. Physics-Informed Neural Networks (PINNs) offer a mesh-free alternative for solving such problems but are often orders of magnitude slower than classical numerical schemes. In this paper, we introduce a reformulated PINN approach that casts the search for eigenpairs as a biconvex optimization problem, enabling fast and provably convergent alternating convex search (ACS) over eigenvalues and eigenfunctions using analytically optimal updates. Numerical experiments show that PINN-ACS attains high accuracy with convergence speeds up to 500$\times$ faster than gradient-based PINN training. We release our codes at https://github.com/NeurIPS-ML4PS-2025/PINN_ACS_CODES.
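For intuition about the alternating structure (closed-form eigenvalue update given the current eigenfunction, then a refinement step for the fixed eigenvalue), here is a classical finite-dimensional analogue on a discretized operator. This is Rayleigh-quotient-style iteration, not the paper's PINN-specific biconvex updates.

```python
# Finite-dimensional analogue of alternating eigenpair search (not the paper's
# PINN formulation): the eigenvalue step is the closed-form Rayleigh quotient,
# the eigenvector step is a shifted linear solve followed by normalization.
import numpy as np

n = 50
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)    # discrete 1D Laplacian

u = np.random.default_rng(0).normal(size=n)
u /= np.linalg.norm(u)
for _ in range(8):
    lam = u @ L @ u                                       # closed-form eigenvalue update
    u = np.linalg.solve(L - (lam - 1e-8) * np.eye(n), u)  # refine the eigenvector
    u /= np.linalg.norm(u)

lam = u @ L @ u
print("eigenvalue estimate:", round(float(lam), 6))
print("residual ||Lu - lam*u||:", round(float(np.linalg.norm(L @ u - lam * u)), 8))
```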
[688] Efficient Reinforcement Learning for Large Language Models with Intrinsic Exploration
Yan Sun, Jia Guo, Stanley Kok, Zihao Wang, Zujie Wen, Zhiqiang Zhang
Main category: cs.LG
TL;DR: PREPO improves data efficiency in RLVR training by using prompt perplexity for curriculum learning and rollout entropy differentiation for exploration prioritization, achieving similar performance with up to 3x fewer rollouts.
Details
Motivation: Current RLVR training is computationally expensive because many rollouts contribute little to optimization despite high computational costs.Method: PREPO has two components: 1) Uses prompt perplexity to create curriculum learning from easy to hard contexts, 2) Differentiates rollout relative entropy to prioritize exploratory sequences.
Result: On Qwen and Llama models, PREPO achieves effective results on mathematical reasoning benchmarks with up to 3 times fewer rollouts than baselines while preserving competitive performance.
Conclusion: The method successfully improves RLVR data efficiency through intrinsic data properties, with theoretical analysis supporting the approach’s rationale.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has improved the reasoning ability of large language models, yet training remains costly because many rollouts contribute little to optimization, considering the amount of computation required. This study investigates how simply leveraging intrinsic data properties, almost free benefit during training, can improve data efficiency for RLVR. We propose PREPO with two complementary components. First, we adopt prompt perplexity as an indicator of model adaptability in learning, enabling the model to progress from well-understood contexts to more challenging ones. Second, we amplify the discrepancy among the rollouts by differentiating their relative entropy, and prioritize sequences that exhibit a higher degree of exploration. Together, these mechanisms reduce rollout demand while preserving competitive performance. On the Qwen and Llama models, PREPO achieves effective results on mathematical reasoning benchmarks with up to 3 times fewer rollouts than the baselines. Beyond empirical gains, we provide theoretical and in-depth analyses explaining the underlying rationale of our method to improve the data efficiency of RLVR.
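A schematic of PREPO's two components on toy log-probability arrays, under a plausible reading of the summary rather than the exact scoring rules: order prompts from low to high perplexity as a curriculum, and within a prompt prioritize rollouts whose entropy deviates most from the group.

```python
# Schematic of the two components (assumed scoring, not the paper's exact rules):
# (1) a perplexity-ordered prompt curriculum; (2) rollout prioritization by how
# far each rollout's entropy proxy sits from the group mean.
import numpy as np

rng = np.random.default_rng(0)

def perplexity(token_logprobs):
    return float(np.exp(-np.mean(token_logprobs)))

# (1) Curriculum: easiest (lowest-perplexity) prompts first.
prompts = {f"prompt_{i}": rng.normal(-1.0 - 0.3 * i, 0.2, size=50) for i in range(5)}
curriculum = sorted(prompts, key=lambda k: perplexity(prompts[k]))
print("curriculum order:", curriculum)

# (2) Prioritization: prefer rollouts that are more exploratory relative to peers.
rollout_logprobs = [rng.normal(-1.0, s, size=80) for s in (0.1, 0.5, 1.0, 2.0)]
ent = np.array([-lp.mean() for lp in rollout_logprobs])      # entropy proxy per rollout
priority = np.abs(ent - ent.mean())
print("rollouts ranked by exploration:", np.argsort(-priority).tolist())
```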
[689] Attention Saturation and Gradient Suppression at Inflection Layers: Diagnosing and Mitigating Bottlenecks in Transformer Adaptation
Wang Zixian
Main category: cs.LG
TL;DR: The paper identifies output saturation in pre-trained Transformers during fine-tuning, which suppresses gradients and limits adaptation. It proposes diagnostic metrics to find inflection layers and selectively injects LoRA adapters to restore gradient flow with minimal parameters.
Details
Motivation: Pre-trained Transformers struggle with adapting to new domains during fine-tuning due to output saturation that suppresses gradients, preventing low-level feature reconstruction and confining adaptation to high-level recombination.Method: Introduces layer-wise diagnostic metrics (attention entropy, gradient norms, Delta-CKA) to identify inflection layers, then selectively injects LoRA adapters at these layers to restore suppressed backward signals with minimal parameter overhead.
Result: Experiments show over-trained initialization benefits from inflection-layer LoRA injection, while under-trained initialization suffers degradation. Strong base features enable high-level compositional adaptation via inflection-layer unblocking, while weak features require full-pathway unblocking for low-level reconstruction.
Conclusion: A diagnose-first, inject-light strategy using selective LoRA injection at inflection layers effectively addresses gradient suppression in Transformers, with adaptation strategy depending on source domain training quality and base feature strength.
Abstract: Pre-trained Transformers often exhibit over-confidence in source patterns and difficulty in forming new target-domain patterns during fine-tuning. We formalize the mechanism of output saturation leading to gradient suppression through standard cross-entropy and softmax analysis, showing that gradient suppression at inflection layers confines adaptation to high-level recombination of existing features while preventing low-level reconstruction. We introduce a set of layer-wise diagnostic metrics – attention entropy (saturation proxy), activation gradient norm, parameter gradient norm, and Delta-CKA under a shared PCA basis – to identify inflection layers characterized by both low attention entropy and steep gradient decay. Building on these findings, we propose a diagnose-first, inject-light fine-tuning strategy: selectively inserting LoRA adapters at inflection layers to restore suppressed backward signals with minimal parameter overhead. Experiments on BERT-base transfer from SST-2 to Rotten Tomatoes under under-trained and over-trained source regimes reveal that over-trained initialization benefits from inflection-layer LoRA injection, while under-trained initialization suffers performance degradation. When base features are strong, unblocking inflection layers facilitates high-level compositional adaptation; when base features are weak, full-pathway unblocking is required for low-level reconstruction, as supported by joint analysis of layer-wise activation gradients and Delta-CKA dynamics.
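A sketch of the diagnose-then-inject idea: compute attention entropy per layer as the saturation proxy, then wrap only the flagged layers' linear projections with a minimal LoRA adapter. The rank, threshold, and layer choice are placeholders, and the full diagnosis also uses gradient norms and Delta-CKA as described above.

```python
# Sketch of diagnose-first, inject-light fine-tuning: (a) attention entropy as a
# saturation proxy, (b) a minimal LoRA wrapper added only at flagged layers.
# Rank, threshold, and the toy entropy values are placeholders.
import torch
import torch.nn as nn

def attention_entropy(attn_probs):
    # attn_probs: (batch, heads, query, key); each row sums to 1.
    return -(attn_probs.clamp_min(1e-12).log() * attn_probs).sum(-1).mean()

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pre-trained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Inject adapters only where the diagnostic flags an inflection layer.
layer_entropies = {3: 0.9, 7: 0.2, 11: 0.15}             # toy diagnostic values
inflection_layers = [i for i, h in layer_entropies.items() if h < 0.3]
adapters = {i: LoRALinear(nn.Linear(768, 768)) for i in inflection_layers}
print("LoRA injected at layers:", sorted(adapters))
```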
[690] FEval-TTC: Fair Evaluation Protocol for Test-Time Compute
Pavel Rumiantsev, Soumyasundar Pal, Yingxue Zhang, Mark Coates
Main category: cs.LG
TL;DR: FEval-TTC is a fair evaluation protocol for test-time compute methods that addresses performance and cost fluctuations in LLMs, ensuring consistent assessment across different models and datasets.
Details
Motivation: LLM performance and API costs fluctuate over time, which can invalidate prior research conclusions. There's a need for consistent evaluation of test-time compute methods despite these fluctuations.Method: Proposes FEval-TTC protocol that standardizes few-shot prompting and answer extraction across multiple LLMs and diverse reasoning datasets. Includes cost modeling for token and dollar cost estimation per query.
Result: Provides a standardized evaluation framework that reduces time and monetary overhead for researchers while enabling fair comparisons of test-time compute methods.
Conclusion: FEval-TTC offers a practical solution for consistent evaluation of test-time compute methods across fluctuating LLM environments and is open-sourced for public use.
Abstract: The performance of Large Language Models (LLMs) and the associated dollar costs of API calls can fluctuate over time, potentially invalidating conclusions drawn in prior research. To address this, we propose a Fair Evaluation protocol for Test-Time Compute (FEval-TTC), designed to ensure consistent assessment of test-time compute (TTC) methods, regardless of such fluctuations. FEval-TTC focuses on the evaluation of TTC methods that utilize underlying Chains-of-Thought (CoT). It supports evaluations across multiple LLMs on a diverse set of mathematical and commonsense reasoning datasets. The few-shot prompting and answer extraction processes are standardized across datasets, reducing both time and monetary overhead for researchers. Furthermore, we provide a cost modelling procedure that estimates both the token and dollar cost per query, facilitating equitable comparisons of prevalent TTC methods. We open-source FEval-TTC for public use at https://github.com/networkslab/feval_ttc .
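The protocol's calibrated cost model is not reproduced here, but the basic per-query token-and-dollar estimate it describes looks roughly like the following; the prices and token counts are placeholders.

```python
# Simple per-query cost estimate of the kind the protocol describes: token
# counts times per-million-token prices. Prices and counts are placeholders,
# not FEval-TTC's calibrated values.
def query_cost(prompt_tokens, completion_tokens,
               usd_per_m_input=0.50, usd_per_m_output=1.50):
    tokens = prompt_tokens + completion_tokens
    dollars = (prompt_tokens * usd_per_m_input +
               completion_tokens * usd_per_m_output) / 1_000_000
    return tokens, dollars

# e.g. a self-consistency run with 16 sampled CoT chains for one question.
tokens, dollars = query_cost(prompt_tokens=600, completion_tokens=16 * 350)
print(f"{tokens} tokens, ~${dollars:.4f} per query")
```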
[691] EraseFlow: Learning Concept Erasure Policies via GFlowNet-Driven Alignment
Abhiram Kusumba, Maitreya Patel, Kyle Min, Changhoon Kim, Chitta Baral, Yezhou Yang
Main category: cs.LG
TL;DR: EraseFlow is a new framework that uses GFlowNets to erase harmful concepts from text-to-image generators by exploring denoising trajectories, achieving better performance and prior preservation than existing methods.
Details
Motivation: Current concept erasure techniques either degrade image quality, rely on brittle adversarial losses, or require extensive retraining, highlighting the need for a more robust approach to safely remove harmful or proprietary concepts from text-to-image generators.Method: EraseFlow casts concept unlearning as exploration in denoising path space and optimizes it with GFlowNets using trajectory balance objective, sampling entire trajectories rather than single end states to learn a stochastic policy.
Result: Extensive empirical results show EraseFlow outperforms existing baselines, achieves optimal trade-off between performance and prior preservation, eliminates need for crafted reward models, and generalizes effectively to unseen concepts.
Conclusion: EraseFlow provides an effective framework for concept erasure that avoids the limitations of current methods by taking a trajectory-based approach to unlearning while preserving model capabilities.
Abstract: Erasing harmful or proprietary concepts from powerful text to image generators is an emerging safety requirement, yet current “concept erasure” techniques either collapse image quality, rely on brittle adversarial losses, or demand prohibitive retraining cycles. We trace these limitations to a myopic view of the denoising trajectories that govern diffusion based generation. We introduce EraseFlow, the first framework that casts concept unlearning as exploration in the space of denoising paths and optimizes it with GFlowNets equipped with the trajectory balance objective. By sampling entire trajectories rather than single end states, EraseFlow learns a stochastic policy that steers generation away from target concepts while preserving the model’s prior. EraseFlow eliminates the need for carefully crafted reward models and by doing this, it generalizes effectively to unseen concepts and avoids hackable rewards while improving the performance. Extensive empirical results demonstrate that EraseFlow outperforms existing baselines and achieves an optimal trade off between performance and prior preservation.
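EraseFlow's diffusion backbone, denoising-path policies, and reward are not reproduced here, but the trajectory balance objective it optimizes is standard and can be shown on toy tensors.

```python
# The standard GFlowNet trajectory balance objective, shown on toy tensors.
# The forward/backward policies over denoising paths, the reward, and the
# diffusion model itself are not reproduced; all values are placeholders.
import torch

def trajectory_balance_loss(log_Z, logpf_steps, logpb_steps, log_reward):
    # (log Z + sum log P_F) should match (log R + sum log P_B) on each trajectory.
    return (log_Z + logpf_steps.sum(-1) - log_reward - logpb_steps.sum(-1)) ** 2

log_Z = torch.nn.Parameter(torch.zeros(()))       # learned log partition function
logpf = torch.randn(4, 10, requires_grad=True)    # 4 trajectories, 10 denoising steps
logpb = torch.randn(4, 10)
log_reward = torch.randn(4)                       # e.g. concept-erased + prior-preserved reward
loss = trajectory_balance_loss(log_Z, logpf, logpb, log_reward).mean()
loss.backward()
print("TB loss:", round(loss.item(), 3))
```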
[692] Logic-informed reinforcement learning for cross-domain optimization of large-scale cyber-physical systems
Guangxi Wan, Peng Zeng, Xiaoting Dong, Chunhe Song, Shijie Cui, Dong Li, Qingwei Dong, Yiyang Liu, Hongfei Bai
Main category: cs.LG
TL;DR: LIRL is a logic-informed reinforcement learning method that uses projection to map latent actions to admissible hybrid manifolds defined by first-order logic, ensuring constraint satisfaction without penalty tuning.
Details
Motivation: Existing approaches for cyber-physical systems either compromise global optimality (hierarchical methods) or struggle with constraint satisfaction (RL with penalty-based methods), requiring better methods that guarantee safety while maintaining performance.Method: Equips standard policy-gradient algorithms with projection that maps low-dimensional latent actions onto admissible hybrid manifolds defined by first-order logic on-the-fly.
Result: Outperforms existing hierarchical optimization approaches across multiple scenarios (industrial manufacturing, EV charging, traffic control). Achieves 36.47%-44.33% reduction in makespan-energy objective in robotic reducer assembly while maintaining zero constraint violations.
Conclusion: LIRL provides safe and real-time optimization for large-scale CPS with declarative logic-based constraint formulation that enables seamless transfer across domains like smart transportation and smart grid.
Abstract: Cyber-physical systems (CPS) require the joint optimization of discrete cyber actions and continuous physical parameters under stringent safety logic constraints. However, existing hierarchical approaches often compromise global optimality, whereas reinforcement learning (RL) in hybrid action spaces often relies on brittle reward penalties, masking, or shielding and struggles to guarantee constraint satisfaction. We present logic-informed reinforcement learning (LIRL), which equips standard policy-gradient algorithms with projection that maps a low-dimensional latent action onto the admissible hybrid manifold defined on-the-fly by first-order logic. This guarantees feasibility of every exploratory step without penalty tuning. Experimental evaluations have been conducted across multiple scenarios, including industrial manufacturing, electric vehicle charging stations, and traffic signal control, in all of which the proposed method outperforms existing hierarchical optimization approaches. Taking a robotic reducer assembly system in industrial manufacturing as an example, LIRL achieves a 36.47% to 44.33% reduction at most in the combined makespan-energy objective compared to conventional industrial hierarchical scheduling methods. Meanwhile, it consistently maintains zero constraint violations and significantly surpasses state-of-the-art hybrid-action reinforcement learning baselines. Thanks to its declarative logic-based constraint formulation, the framework can be seamlessly transferred to other domains such as smart transportation and smart grid, thereby paving the way for safe and real-time optimization in large-scale CPS.
[693] Equilibrium Policy Generalization: A Reinforcement Learning Framework for Cross-Graph Zero-Shot Generalization in Pursuit-Evasion Games
Runyu Lu, Peng Zhang, Ruochuan Shi, Yuanheng Zhu, Dongbin Zhao, Yang Liu, Dong Wang, Cesare Alippi
Main category: cs.LG
TL;DR: The paper proposes an Equilibrium Policy Generalization (EPG) framework for pursuit-evasion games that learns generalized policies with robust cross-graph zero-shot performance, eliminating the need for recomputation or fine-tuning when graph structures change.
Details
Motivation: Current RL methods for pursuit-evasion games require recomputation or fine-tuning when graph structures vary, which is time-consuming and impairs real-time applicability. There is a need for generalized policies that can perform well across different graph structures without retraining.Method: The EPG framework trains RL policies across different graph structures against equilibrium policies for each single graph. It uses a dynamic programming algorithm to generate pure-strategy Nash equilibrium policies and incorporates grouping mechanisms and sequence models for scalability with multiple pursuers.
Result: Experimental results show that EPG achieves desirable zero-shot performance in various unseen real-world graphs. The generalized pursuer policy can match the performance of fine-tuned policies from state-of-the-art methods when trained with equilibrium heuristics for graphs with exits.
Conclusion: The EPG framework successfully addresses the cross-graph generalization challenge in pursuit-evasion games, providing the first solution that works for both pursuer and evader sides in both no-exit and multi-exit scenarios with robust zero-shot performance.
Abstract: Equilibrium learning in adversarial games is an important topic widely examined in the fields of game theory and reinforcement learning (RL). Pursuit-evasion game (PEG), as an important class of real-world games from the fields of robotics and security, requires exponential time to be accurately solved. When the underlying graph structure varies, even the state-of-the-art RL methods require recomputation or at least fine-tuning, which can be time-consuming and impair real-time applicability. This paper proposes an Equilibrium Policy Generalization (EPG) framework to effectively learn a generalized policy with robust cross-graph zero-shot performance. In the context of PEGs, our framework is generally applicable to both pursuer and evader sides in both no-exit and multi-exit scenarios. These two generalizability properties, to our knowledge, are the first to appear in this domain. The core idea of the EPG framework is to train an RL policy across different graph structures against the equilibrium policy for each single graph. To construct an equilibrium oracle for single-graph policies, we present a dynamic programming (DP) algorithm that provably generates pure-strategy Nash equilibrium with near-optimal time complexity. To guarantee scalability with respect to pursuer number, we further extend DP and RL by designing a grouping mechanism and a sequence model for joint policy decomposition, respectively. Experimental results show that, using equilibrium guidance and a distance feature proposed for cross-graph PEG training, the EPG framework guarantees desirable zero-shot performance in various unseen real-world graphs. Besides, when trained under an equilibrium heuristic proposed for the graphs with exits, our generalized pursuer policy can even match the performance of the fine-tuned policies from the state-of-the-art PEG methods.
[694] LL-ViT: Edge Deployable Vision Transformers with Look Up Table Neurons
Shashank Nag, Alan T. L. Bacellar, Zachary Susskind, Anshul Jha, Logan Liberty, Aishwarya Sivakumar, Eugene B. John, Krishnan Kailas, Priscila M. V. Lima, Neeraja J. Yadwadkar, Felipe M. G. Franca, Lizy K. John
Main category: cs.LG
TL;DR: LL-ViT is an edge-optimized vision transformer that integrates LUT-based neurons to reduce computational and memory demands while maintaining competitive accuracy on vision tasks.
Details
Motivation: Vision Transformers have high computational, memory, and energy demands that challenge edge inference on FPGAs. Existing LUT-based networks offer reduced footprints but perform poorly on vision tasks like CIFAR-10/100.Method: Integrates LUT-based layers within transformer architecture, specifically designing an alternate LUT-based channel mixer (MLP layer) based on characterization showing most weights/computations come from this component. Uses neural learning approach to natively learn LUT functions.
Result: Achieves 95.5% on CIFAR-10, 78.8% on CIFAR-100, and 60.9% on Tiny-ImageNet (comparable to baseline). Eliminates 60% of weights and 50% of multiplications, achieving 1.9x energy efficiency and 1.3x lower latency vs quantized ViT accelerator.
Conclusion: LL-ViT provides an effective solution for edge deployment of vision transformers by combining LUT-based design with neural learning, offering significant reductions in model size, computation, and energy consumption while maintaining accuracy.
Abstract: Vision Transformers have been tremendously successful in computer vision tasks. However, their large computational, memory, and energy demands are a challenge for edge inference on FPGAs – a field that has seen a recent surge in demand. We recognize the benefits of recent works on logic and Look Up Table (LUT) based networks, such as LogicNets, NeuraLUT, DWN, among others, in offering models that simultaneously reduce both the memory and compute footprints. However, these models natively do not perform well on common vision tasks, such as CIFAR-10/100. In this work, we propose LL-ViT, a novel edge optimized vision transformer design that integrates layers of LUT neurons within the transformer architecture. Based on our characterization that reveals that a majority of model weights and computations are from the channel mixer (MLP layer), we design an alternate LUT-based channel mixer, and simultaneously develop an FPGA-based accelerator for LL-ViT. Contrary to some attempts to replace each multiplication with a table lookup, our architecture utilizes a neural learning approach which natively learns the LUT functions. This approach allows for reduced model sizes, and a computational and energy-efficient inference solution for vision transformer models. Evaluating on edge-suitable workloads, we achieve accuracies of 95.5% on CIFAR-10, 78.8% on CIFAR-100, and 60.9% on Tiny-ImageNet datasets, comparable to the baseline transformer. LL-ViT eliminates over 60% of the model weights and 50% of the multiplications in the model, and achieves 1.9x energy efficiency and 1.3x lower latency over an integer quantized ViT accelerator, while also offering superior throughput against prior works at a 10.9W power budget.
[695] Identifying Slug Formation in Oil Well Pipelines: A Use Case from Industrial Analytics
Abhishek Patange, Sharat Chidambaran, Prabhat Shankar, Manjunath G. B., Anindya Chatterjee
Main category: cs.LG
TL;DR: An interactive application for real-time slug detection in oil/gas pipelines using machine learning with user-friendly interface, data exploration, model training, and live inference capabilities.
Details
Motivation: Existing slug detection methods are offline, require domain expertise, and lack real-time interpretability, creating operational safety and efficiency challenges.Method: End-to-end data-driven system integrating data exploration/labeling, configurable model training with multiple classifiers, time-series visualization, and real-time inference with persistence-based alerts.
Result: A lightweight, portable application supporting workflows from CSV uploads to live inference, featuring snapshot persistence, visual labeling, and real-time alerting capabilities.
Conclusion: The tool bridges data science methods with real-world decision-making in process industries, demonstrating broader applicability for time-series fault diagnosis beyond oil and gas.
Abstract: Slug formation in oil and gas pipelines poses significant challenges to operational safety and efficiency, yet existing detection approaches are often offline, require domain expertise, and lack real-time interpretability. We present an interactive application that enables end-to-end data-driven slug detection through a compact and user-friendly interface. The system integrates data exploration and labeling, configurable model training and evaluation with multiple classifiers, visualization of classification results with time-series overlays, and a real-time inference module that generates persistence-based alerts when slug events are detected. The demo supports seamless workflows from labeled CSV uploads to live inference on unseen datasets, making it lightweight, portable, and easily deployable. By combining domain-relevant analytics with novel UI/UX features such as snapshot persistence, visual labeling, and real-time alerting, our tool adds significant dissemination value as both a research prototype and a practical industrial application. The demo showcases how interactive human-in-the-loop ML systems can bridge the gap between data science methods and real-world decision-making in critical process industries, with broader applicability to time-series fault diagnosis tasks beyond oil and gas.
[696] FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management
Nazmul Takbir, Hamidreza Alikhani, Nikil Dutt, Sangeetha Abdu Jyothi
Main category: cs.LG
TL;DR: FlexiCache is a hierarchical KV-cache management system that reduces GPU memory usage by 70% and improves throughput by 1.38-1.55x by exploiting temporal stability differences in KV heads, offloading stable heads’ less critical pages to host memory while maintaining accuracy.
Details
Motivation: LLM serving is constrained by growing KV cache size that scales with context and generation length. Existing systems struggle to efficiently exploit attention sparsity (dominated by a few critical tokens) without degrading accuracy in long-generation scenarios.
Method: FlexiCache classifies KV heads as stable (consistently focus on the same tokens) or unstable (shift frequently). It keeps all KV-cache pages from unstable heads in GPU memory, while for stable heads it keeps only the top-K pages on the GPU and offloads the rest to host memory, performing periodic reranking to fetch newly promoted top pages.
Result: Reduces GPU memory footprint for long-context requests by up to 70%, improves offline serving throughput by 1.38-1.55x, lowers online token latency by 1.6-2.1x, while maintaining accuracy in long-context, long-generation scenarios.
Conclusion: FlexiCache effectively leverages temporal stability differences in KV heads to optimize KV-cache management, achieving significant memory and performance improvements without compromising model accuracy in long-generation scenarios.
Abstract: Large Language Model (LLM) serving is increasingly constrained by the growing size of the key-value (KV) cache, which scales with both context length and generation length. Prior work shows that attention is dominated by a small subset of critical tokens, yet existing systems struggle to exploit this efficiently without degrading accuracy, especially in long generation. We make a key observation: the temporal stability of these critical tokens varies significantly across KV heads: some heads consistently focus on the same tokens, while others shift frequently. Building on this insight, we introduce FlexiCache, a hierarchical KV-cache management system that leverages the temporal stability of KV heads to reduce GPU memory usage and computation overhead, while preserving model accuracy. FlexiCache classifies KV heads as stable or unstable: it retains all KV-cache pages from unstable heads in GPU memory, whereas for stable heads, it keeps only the top-K pages on the GPU and offloads the rest to host memory. By exploiting temporal stability, FlexiCache performs periodic reranking for stable heads to fetch newly promoted top pages. Implemented atop vLLM, FlexiCache reduces GPU memory footprint for long-context requests by up to 70%, improves offline serving throughput by 1.38-1.55x, and lowers online token latency by 1.6-2.1x, all while maintaining accuracy in long-context, long-generation scenarios.
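As an illustration of the head-level policy described above, here is a minimal Python sketch; the array names, the notion of per-page attention mass, and the page-selection heuristic are assumptions for exposition, not FlexiCache's actual implementation (which operates on vLLM's paged KV cache).

```python
import numpy as np

def select_gpu_resident_pages(page_scores, head_is_stable, k):
    """Toy page-selection sketch: keep every page for unstable heads, and only
    the top-K highest-scoring pages for stable heads (the rest would be
    offloaded to host memory and revisited by periodic reranking).

    page_scores:     (num_heads, num_pages) aggregate attention mass per KV page.
    head_is_stable:  (num_heads,) bool, True if a head's critical tokens persist.
    """
    num_heads, num_pages = page_scores.shape
    keep = np.zeros((num_heads, num_pages), dtype=bool)
    for h in range(num_heads):
        if not head_is_stable[h]:
            keep[h, :] = True                                    # unstable head: keep all pages on GPU
        else:
            keep[h, np.argsort(page_scores[h])[-k:]] = True      # stable head: keep top-K only
    return keep
```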
[697] RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks
Mian Wu, Gavin Zhang, Sewon Min, Sergey Levine, Aviral Kumar
Main category: cs.LG
TL;DR: RLAC is a reinforcement learning post-training method that uses dynamic rubric verification with an adversarial critic to improve generator output quality while reducing verification costs.
Details
Motivation: Traditional RL post-training with rubric-based rewards is difficult to scale due to high verification costs and incomplete assessments of responses across diverse evaluation rubrics.
Method: Uses an LLM as a critic to dynamically identify likely failure modes, which are verified by an external validator to jointly optimize both generator and critic through adversarial training.
Result: Improves factual accuracy in text generation and correctness in code generation, outperforming exhaustive verification and reward model methods.
Conclusion: Dynamic critics are more effective than fixed critics, demonstrating RLAC’s potential for scaling RL post-training to free-form generation tasks.
Abstract: Open-ended generation tasks require outputs to satisfy diverse and often implicit task-specific evaluation rubrics. The sheer number of relevant rubrics leads to prohibitively high verification costs and incomplete assessments of a response, making reinforcement learning (RL) post-training with rubric-based rewards difficult to scale. This problem is exacerbated by the fact that often the best way to combine these rubrics into one single reward is also highly prompt-specific. We propose Reinforcement Learning with Adversarial Critic (RLAC), a post-training approach that addresses these challenges via dynamic rubric verification. Our approach employs a large language model (LLM) as a critic that dynamically identifies only the most likely failure modes (e.g., a factual error or unhandled edge case), which are then verified by an external validator to optimize both generator and critic jointly. By training both the generator and the critic, this game enhances the critic’s error detection and the generator’s output quality while reducing required verifications. Our experiments demonstrate that RLAC improves factual accuracy in text generation and correctness in code generation, while also outperforming exhaustive verification and reward model methods. We show that dynamic critics are more effective than fixed critics, showcasing the potential of RLAC for scaling RL post-training to free-form generation tasks.
[698] Training with Fewer Bits: Unlocking Edge LLMs Training with Stochastic Rounding
Taowen Liu, Marta Andronic, Deniz Gündüz, George A. Constantinides
Main category: cs.LG
TL;DR: Increased batch sizes can compensate for quantization noise in LLM training, and quantizing weights vs. activations affects gradient variance differently.
Details
Motivation: Quantized training improves efficiency but introduces quantization noise that degrades accuracy. Stochastic Rounding offers unbiased gradients but its interaction with batch size is under-explored.
Method: Theoretical and empirical study of mini-batch SGD with Stochastic Rounding, analyzing how batch size compensates for reduced precision in back-propagation.
Result: Experiments validate that larger batch sizes can mitigate quantization noise effects, and weight vs. activation quantization have distinct impacts on gradient variance.
Conclusion: Batch size is a key factor in quantized training that can compensate for precision loss, and different quantization strategies affect training dynamics differently.
Abstract: LLM training is resource-intensive. Quantized training improves computational and memory efficiency but introduces quantization noise, which can hinder convergence and degrade model accuracy. Stochastic Rounding (SR) has emerged as a theoretically attractive alternative to deterministic rounding, offering unbiased gradient estimates. However, its interaction with other training factors – especially batch size – remains underexplored. In this paper, we present a theoretical and empirical study of mini-batch stochastic gradient descent (SGD) with SR, showing that increased batch sizes can compensate for reduced precision during back-propagation. Furthermore, we show that quantizing weights and activations impacts gradient variance in distinct ways. Our experiments validate these theoretical insights.
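For readers unfamiliar with the rounding scheme, the snippet below shows plain stochastic rounding and its unbiasedness, which is the property the batch-size argument builds on; it is a generic illustration, not the authors' training code.

```python
import numpy as np

def stochastic_round(x, step):
    """Round x to a multiple of `step`, rounding up with probability equal to the
    fractional remainder, so the rounding is unbiased: E[round(x)] == x."""
    scaled = np.asarray(x, dtype=float) / step
    floor = np.floor(scaled)
    round_up = np.random.random(scaled.shape) < (scaled - floor)
    return (floor + round_up) * step

# Averaging many independently rounded copies recovers the true value, which is
# why larger mini-batches can absorb the extra variance that rounding injects.
print(stochastic_round(np.full(100_000, 0.3), step=0.25).mean())  # ~0.3
```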
[699] KFCPO: Kronecker-Factored Approximated Constrained Policy Optimization
Joonyoung Lim, Younghwan Yoo
Main category: cs.LG
TL;DR: KFCPO is a Safe RL algorithm combining K-FAC second-order optimization with safety-aware gradient manipulation, achieving better safety-performance balance than baselines.
Details
Motivation: To address the tradeoff between reward maximization and constraint satisfaction in Safe RL, avoiding iterative approximation overheads and abrupt changes from fixed thresholds.
Method: Uses K-FAC to approximate Fisher Information Matrix efficiently, margin-aware gradient manipulation with direction sensitive projection, and minibatch KL rollback for trust region compliance.
Result: Achieves 10.3% to 50.2% higher average return across Safety Gymnasium environments compared to best safety-respecting baseline.
Conclusion: KFCPO demonstrates superior balance of safety and performance through efficient second-order optimization and adaptive gradient manipulation.
Abstract: We propose KFCPO, a novel Safe Reinforcement Learning (Safe RL) algorithm that combines scalable Kronecker-Factored Approximate Curvature (K-FAC) based second-order policy optimization with safety-aware gradient manipulation. KFCPO leverages K-FAC to perform efficient and stable natural gradient updates by approximating the Fisher Information Matrix (FIM) in a layerwise, closed form manner, avoiding iterative approximation overheads. To address the tradeoff between reward maximization and constraint satisfaction, we introduce a margin aware gradient manipulation mechanism that adaptively adjusts the influence of reward and cost gradients based on the agent’s proximity to safety boundaries. This method blends gradients using a direction sensitive projection, eliminating harmful interference and avoiding abrupt changes caused by fixed hard thresholds. Additionally, a minibatch level KL rollback strategy is adopted to ensure trust region compliance and to prevent destabilizing policy shifts. Experiments on Safety Gymnasium using OmniSafe show that KFCPO achieves 10.3% to 50.2% higher average return across environments compared to the best baseline that respected the safety constraint, demonstrating superior balance of safety and performance.
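A rough sketch of the margin-aware blending idea follows; the interpolation rule, the `margin` variable, and the thresholds are hypothetical stand-ins, since the paper's exact direction-sensitive projection (and the K-FAC preconditioning it is combined with) is not reproduced here.

```python
import numpy as np

def margin_aware_blend(g_reward, g_cost, margin, margin_max):
    """Blend reward-ascent and cost-descent directions based on proximity to the
    safety boundary: far from the boundary the reward gradient dominates; near it,
    components that would increase cost are damped and a cost-descent term is added."""
    alpha = float(np.clip(1.0 - margin / margin_max, 0.0, 1.0))  # 0 = far from boundary, 1 = at it
    cost_dir = g_cost / (np.linalg.norm(g_cost) + 1e-8)
    conflict = np.dot(g_reward, cost_dir)
    if conflict > 0:                                   # reward step would also raise cost
        g_reward = g_reward - alpha * conflict * cost_dir
    return g_reward - alpha * g_cost                   # push away from the constraint when close to it
```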
[700] Random Initialization of Gated Sparse Adapters
Vi Retault, Yohaï-Eliel Berreby
Main category: cs.LG
TL;DR: RIGSA is a sparse adaptation method that uses randomly-initialized full-rank adapters with gating and iterative pruning to reduce catastrophic forgetting during fine-tuning, showing less forgetting than QLoRA despite more parameters.
Details
Motivation: Address catastrophic forgetting in language model fine-tuning by exploring sparse adaptation as an alternative to rank-constrained PEFT methods like LoRA.
Method: Random Initialization of Gated Sparse Adapters (RIGSA) starts with random full-rank adapters, gates them with a ReZero analog, and sparsifies via iterative magnitude pruning.
Result: RIGSA successfully learns new tasks while displaying less forgetting than QLoRA on GSM8k, though performs similarly to random masking.
Conclusion: Sparse adaptation through RIGSA offers a promising approach to mitigate catastrophic forgetting without rank constraints, outperforming QLoRA in forgetting reduction.
Abstract: When fine-tuning language models on new tasks, catastrophic forgetting – performance degradation on previously-learned tasks – is a ubiquitous problem. While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA address this through low-rank adapters, sparse adaptation offers an alternative that doesn’t impose rank constraints. We introduce Random Initialization of Gated Sparse Adapters (RIGSA), which starts from randomly-initialized full-rank adapters, gates them with a ReZero analog, and sparsifies them with iterative magnitude pruning. We evaluate RIGSA on SmolLM2-1.7B-Instruct using a novel vision-in-text task (Textual MNIST) and measure forgetting on PIQA, HellaSwag, and GSM8k. SmolLM2-1.7B-Instruct initially performs around chance level on Textual MNIST, and is capable of learning the task through RIGSA, 4-bit QLoRA and random masking. In spite of having more trainable parameters than QLoRA, the RIGSA configurations that we studied displayed less forgetting than QLoRA, particularly on GSM8k, though it performs comparably to random masking.
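To make the adapter construction concrete, here is a minimal PyTorch sketch of a randomly initialized, gated, iteratively pruned adapter wrapped around a frozen linear layer; the initialization scale, pruning schedule, and gating details are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSparseAdapter(nn.Module):
    """Frozen base weight plus a gated, masked, full-rank random delta."""
    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                  # pretrained layer stays frozen
        out_f, in_f = base.weight.shape
        self.delta = nn.Parameter(0.02 * torch.randn(out_f, in_f))   # random full-rank adapter
        self.gate = nn.Parameter(torch.zeros(()))                    # ReZero-style scalar gate, starts at 0
        self.register_buffer("mask", torch.ones(out_f, in_f))        # shrunk by magnitude pruning

    @torch.no_grad()
    def prune(self, fraction: float):
        """Zero the smallest `fraction` of entries by |delta * mask| (iterative magnitude pruning)."""
        scores = (self.delta * self.mask).abs().flatten()
        k = int(fraction * scores.numel())
        if k > 0:
            threshold = scores.kthvalue(k).values
            self.mask.mul_((self.delta.abs() > threshold).float())

    def forward(self, x):
        w = self.base.weight + self.gate * (self.mask * self.delta)
        return F.linear(x, w, self.base.bias)
```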
[701] SpEx: A Spectral Approach to Explainable Clustering
Tal Argov, Tal Wagner
Main category: cs.LG
TL;DR: A new explainable clustering method using spectral graph partitioning that can fit explanation trees to any clustering or dataset, outperforming prior approaches.
Details
Motivation: Prior explainable clustering methods were limited to specific objectives and lacked general methods to fit explanation trees to arbitrary clusterings without restrictions.
Method: Proposed spectral graph partitioning approach to explainable clustering, with a generalized framework interpreting prior algorithms as simultaneous graph cuts.
Result: The method shows favorable performance compared to baselines across various datasets, providing a general approach to fit explanation trees.
Conclusion: Spectral graph partitioning offers an effective general framework for explainable clustering that can adapt to any given clustering or dataset.
Abstract: Explainable clustering by axis-aligned decision trees was introduced by Moshkovitz et al. (2020) and has gained considerable interest. Prior work has focused on minimizing the price of explainability for specific clustering objectives, lacking a general method to fit an explanation tree to any given clustering, without restrictions. In this work, we propose a new and generic approach to explainable clustering, based on spectral graph partitioning. With it, we design an explainable clustering algorithm that can fit an explanation tree to any given non-explainable clustering, or directly to the dataset itself. Moreover, we show that prior algorithms can also be interpreted as graph partitioning, through a generalized framework due to Trevisan (2013) wherein cuts are optimized in two graphs simultaneously. Our experiments show the favorable performance of our method compared to baselines on a range of datasets.
[702] Learning with Category-Equivariant Representations for Human Activity Recognition
Yoshihiro Maruyama
Main category: cs.LG
TL;DR: A categorical symmetry-aware learning framework for human activity recognition that improves model stability under real-world distortions like time shifts and sensor variations, achieving ~46 percentage point improvement in out-of-distribution accuracy.
Details
Motivation: Human activity recognition faces challenges due to sensor signal shifts caused by context, motion, and environmental changes. Models need to remain stable as the world around them changes.
Method: Introduces a categorical symmetry-aware learning framework that captures signal variations over time, scale, and sensor hierarchy. Builds these factors into feature representation structure to preserve sensor relationships and maintain stability under distortions like time shifts, amplitude drift, and device orientation changes.
Result: On UCI Human Activity Recognition benchmark, improves out-of-distribution accuracy by approximately 46 percentage points (approx. 3.6x over baseline).
Conclusion: Abstract symmetry principles can translate into concrete performance gains in everyday sensing tasks through category-equivariant representation theory.
Abstract: Human activity recognition is challenging because sensor signals shift with context, motion, and environment; effective models must therefore remain stable as the world around them changes. We introduce a categorical symmetry-aware learning framework that captures how signals vary over time, scale, and sensor hierarchy. We build these factors into the structure of feature representations, yielding models that automatically preserve the relationships between sensors and remain stable under realistic distortions such as time shifts, amplitude drift, and device orientation changes. On the UCI Human Activity Recognition benchmark, this categorical symmetry-driven design improves out-of-distribution accuracy by approx. 46 percentage points (approx. 3.6x over the baseline), demonstrating that abstract symmetry principles can translate into concrete performance gains in everyday sensing tasks via category-equivariant representation theory.
[703] Random Spiking Neural Networks are Stable and Spectrally Simple
Ernesto Araya, Massimiliano Datres, Gitta Kutyniok
Main category: cs.LG
TL;DR: This paper analyzes spiking neural networks (SNNs) through Boolean function analysis, showing that wide LIF-SNN classifiers are stable on average due to Fourier spectrum concentration on low frequencies, and introduces spectral simplicity to explain simplicity bias in SNNs.
Details
Motivation: SNNs are promising for energy-efficient computation but lack theoretical foundations regarding stability and robustness compared to artificial neural networks. The authors aim to study SNN stability through Boolean function analysis.
Method: The study uses discrete-time leaky integrate-and-fire (LIF) SNNs analyzed through Boolean function analysis, focusing on noise sensitivity and stability in classification tasks. They introduce the concept of spectral simplicity to formalize Fourier spectrum concentration.
Result: Main result shows wide LIF-SNN classifiers are stable on average, explained by concentration of their Fourier spectrum on low-frequency components. Experiments confirm these stability properties persist in trained networks.
Conclusion: The work provides new insights into SNN stability and robustness, showing that random LIF-SNNs are biased toward simple functions and establishing connections between Fourier analysis and simplicity bias in deep networks.
Abstract: Spiking neural networks (SNNs) are a promising paradigm for energy-efficient computation, yet their theoretical foundations-especially regarding stability and robustness-remain limited compared to artificial neural networks. In this work, we study discrete-time leaky integrate-and-fire (LIF) SNNs through the lens of Boolean function analysis. We focus on noise sensitivity and stability in classification tasks, quantifying how input perturbations affect outputs. Our main result shows that wide LIF-SNN classifiers are stable on average, a property explained by the concentration of their Fourier spectrum on low-frequency components. Motivated by this, we introduce the notion of spectral simplicity, which formalizes simplicity in terms of Fourier spectrum concentration and connects our analysis to the simplicity bias observed in deep networks. Within this framework, we show that random LIF-SNNs are biased toward simple functions. Experiments on trained networks confirm that these stability properties persist in practice. Together, these results provide new insights into the stability and robustness properties of SNNs.
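The central quantity here, noise sensitivity, has a simple Monte Carlo estimator; the sketch below applies to any Boolean classifier f (an SNN readout included) and uses the standard definition rather than anything specific to the paper.

```python
import numpy as np

def noise_sensitivity(f, n_bits, delta, n_samples=20_000, seed=0):
    """Estimate NS_delta(f) = Pr[f(x) != f(y)], where x is uniform on {0,1}^n and
    y flips each bit of x independently with probability delta."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(n_samples, n_bits))
    y = np.logical_xor(x, rng.random((n_samples, n_bits)) < delta).astype(int)
    fx = np.fromiter((f(row) for row in x), dtype=int, count=n_samples)
    fy = np.fromiter((f(row) for row in y), dtype=int, count=n_samples)
    return float(np.mean(fx != fy))

# Example: a majority vote is a classically "stable" Boolean function.
print(noise_sensitivity(lambda r: int(r.sum() > len(r) / 2), n_bits=101, delta=0.05))
```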
[704] Transformers as Intrinsic Optimizers: Forward Inference through the Energy Principle
Ruifeng Ren, Sheng Ouyang, Huayi Tang, Yong Liu
Main category: cs.LG
TL;DR: This paper proposes an energy-based framework to understand Transformer attention mechanisms, showing that standard softmax attention minimizes Helmholtz free energy and linear attentions fit within this framework. The authors extend this to multi-head settings and propose new attention structures inspired by classical gradient descent algorithms.
Details
Motivation: To provide a unified energy-based perspective for understanding Transformer attention mechanisms, as transformers' underlying mechanisms remain open for exploration and energy-based perspectives have historically been valuable for understanding neural computation.
Method: Developed an energy-based framework with three components: global energy F*, energy function E_i, and gradient descent form. Showed standard softmax attention minimizes Helmholtz free energy, extended to linear attentions and multi-head settings. Proposed new attention structures inspired by momentum-based GD, Nesterov Accelerated Gradient, and Newton’s method.
Result: The framework successfully incorporates standard softmax attention and linear attentions as special cases. Experimental results provide preliminary support for the potential of energy-based framework in designing attention mechanisms.
Conclusion: The energy-based framework offers a unified perspective for understanding and designing Transformer attention mechanisms, with experimental evidence supporting its potential for developing new attention structures inspired by classical optimization algorithms.
Abstract: Transformers have demonstrated strong adaptability across a wide range of tasks and have become the backbone of modern Large Language Models (LLMs). However, their underlying mechanisms remain open for further exploration. The energy-based perspective has long provided a valuable principle for understanding neural computation. In this paper, we revisit the principle of energy as a lens to understand attention-based Transformer models. We present a unified energy-based framework which is composed of three key components: the global energy $F^*$, the energy function $E_i$ and the employed gradient descent (GD) form. Within this framework, standard softmax attention can be viewed as a special case of minimizing the Helmholtz free energy as $F^*$ using standard GD when $E_i$ takes the form of elastic potential energy, with residual connections ensuring that this optimization proceeds in an incremental manner. In addition, linear attentions can also be naturally incorporated into this framework by adjusting the corresponding energy forms. We also extend the above analysis to the multi-head setting, where the energy is defined across multiple low-dimensional subspaces. Building on this framework, we propose energy-based modifications of attention structures. Inspired by classical GD algorithms, we extend the original attention formulation based on standard GD to the momentum-based GD, Nesterov Accelerated Gradient (NAG), and Newton’s method variants, each inducing a corresponding new attention structure. Our experiments provide preliminary support for the potential of the energy-based framework for designing attention mechanisms.
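As a hedged illustration of why softmax aggregation emerges from free-energy minimization (this uses the standard log-sum-exp identity, not necessarily the exact energies $E_i$ and $F^*$ defined in the paper): one gradient step on a log-sum-exp "free energy" over key scores yields a softmax-weighted sum of keys, and adding it back to the query mirrors the residual update.

```latex
% Sketch only: log-sum-exp free energy over key scores and its gradient.
F(q) = -\log \sum_{j} \exp\!\big(q^{\top} k_j\big),
\qquad
-\nabla_q F(q) = \sum_{j} \frac{\exp(q^{\top} k_j)}{\sum_{l} \exp(q^{\top} k_l)}\, k_j
\quad\Longrightarrow\quad
q \leftarrow q + \eta \sum_{j} \mathrm{softmax}_j\!\big(q^{\top} k\big)\, k_j .
```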
[705] Motion-Robust Multimodal Fusion of PPG and Accelerometer Signals for Three-Class Heart Rhythm Classification
Yangyang Zhao, Matti Kaisti, Olli Lahdenoja, Tero Koivisto
Main category: cs.LG
TL;DR: RhythmiNet is a neural network that combines PPG and accelerometer data with attention mechanisms to classify cardiac rhythms into AF, sinus rhythm, and Other categories, showing improved performance over single-modality approaches.
Details
Motivation: Current PPG-based AF detection methods are limited by motion artifacts, focus only on binary classification, and fail to capture the full spectrum of arrhythmias seen in clinical practice.
Method: A residual neural network enhanced with temporal and channel attention modules that jointly processes PPG and accelerometer signals for three-class rhythm classification (AF, sinus rhythm, Other), with testing stratified by motion intensity.
Result: RhythmiNet achieved 4.3% improvement in macro-AUC over PPG-only baseline and 12% improvement over handcrafted HRV feature-based logistic regression, demonstrating benefits of multimodal fusion and attention mechanisms.
Conclusion: Multimodal fusion of PPG and accelerometer data with attention-based learning significantly improves arrhythmia classification performance in noisy real-world conditions, enabling more comprehensive rhythm monitoring beyond binary AF detection.
Abstract: Atrial fibrillation (AF) is a leading cause of stroke and mortality, particularly in elderly patients. Wrist-worn photoplethysmography (PPG) enables non-invasive, continuous rhythm monitoring, yet suffers from significant vulnerability to motion artifacts and physiological noise. Many existing approaches rely solely on single-channel PPG and are limited to binary AF detection, often failing to capture the broader range of arrhythmias encountered in clinical settings. We introduce RhythmiNet, a residual neural network enhanced with temporal and channel attention modules that jointly leverage PPG and accelerometer (ACC) signals. The model performs three-class rhythm classification: AF, sinus rhythm (SR), and Other. To assess robustness across varying movement conditions, test data are stratified by accelerometer-based motion intensity percentiles without excluding any segments. RhythmiNet achieved a 4.3% improvement in macro-AUC over the PPG-only baseline. In addition, performance surpassed a logistic regression model based on handcrafted HRV features by 12%, highlighting the benefit of multimodal fusion and attention-based learning in noisy, real-world clinical data.
[706] The Hidden Power of Normalization: Exponential Capacity Control in Deep Neural Networks
Khoat Than
Main category: cs.LG
TL;DR: Normalization layers in DNNs exponentially reduce Lipschitz constants, smoothing loss landscape for better optimization and constraining capacity for improved generalization.
Details
Motivation: To theoretically explain why normalization methods stabilize optimization and improve generalization in deep neural networks, especially when using many normalization layers.
Method: Developed a theoretical framework analyzing normalization through capacity control, proving that normalization layers exponentially reduce Lipschitz constants compared to unnormalized networks.
Result: Normalization provably reduces Lipschitz constants at exponential rates, smoothing loss landscape for faster optimization and constraining network capacity for better generalization guarantees.
Conclusion: The research provides a principled theoretical explanation for normalization’s empirical success by showing it exponentially reduces Lipschitz constants, benefiting both optimization dynamics and generalization.
Abstract: Normalization methods are fundamental components of modern deep neural networks (DNNs). Empirically, they are known to stabilize optimization dynamics and improve generalization. However, the underlying theoretical mechanism by which normalization contributes to both optimization and generalization remains largely unexplained, especially when using many normalization layers in a DNN architecture. In this work, we develop a theoretical framework that elucidates the role of normalization through the lens of capacity control. We prove that an unnormalized DNN can exhibit exponentially large Lipschitz constants with respect to either its parameters or inputs, implying excessive functional capacity and potential overfitting. Such bad DNNs are uncountably many. In contrast, the insertion of normalization layers provably can reduce the Lipschitz constant at an exponential rate in the number of normalization operations. This exponential reduction yields two fundamental consequences: (1) it smooths the loss landscape at an exponential rate, facilitating faster and more stable optimization; and (2) it constrains the effective capacity of the network, thereby enhancing generalization guarantees on unseen data. Our results thus offer a principled explanation for the empirical success of normalization methods in deep learning.
[707] Using Synthetic Data to estimate the True Error is theoretically and practically doable
Hai Hoang Thanh, Duy-Tung Nguyen, Hung The Tran, Khoat Than
Main category: cs.LG
TL;DR: This paper proposes using synthetic data generated by AI models to estimate machine learning model performance when labeled test data is scarce, developing theoretical bounds and an optimized generation method.
Details
Motivation: Traditional model evaluation requires large labeled test sets, which are costly and labor-intensive to create. In many real-world scenarios, only limited labeled data is available, making reliable evaluation challenging.
Method: Developed novel generalization bounds that incorporate synthetic data, then designed a theoretically grounded method to generate optimized synthetic samples specifically for model evaluation purposes.
Result: Experimental results on simulation and tabular datasets show the method achieves more accurate and reliable test error estimates compared to existing baselines.
Conclusion: Synthetic data can effectively estimate model test error under limited labeled data conditions, with generator quality playing a crucial role in evaluation accuracy.
Abstract: Accurately evaluating model performance is crucial for deploying machine learning systems in real-world applications. Traditional methods often require a sufficiently large labeled test set to ensure a reliable evaluation. However, in many contexts, a large labeled dataset is costly and labor-intensive. Therefore, we sometimes have to do evaluation by a few labeled samples, which is theoretically challenging. Recent advances in generative models offer a promising alternative by enabling the synthesis of high-quality data. In this work, we make a systematic investigation about the use of synthetic data to estimate the test error of a trained model under limited labeled data conditions. To this end, we develop novel generalization bounds that take synthetic data into account. Those bounds suggest novel ways to optimize synthetic samples for evaluation and theoretically reveal the significant role of the generator’s quality. Inspired by those bounds, we propose a theoretically grounded method to generate optimized synthetic data for model evaluation. Experimental results on simulation and tabular datasets demonstrate that, compared to existing baselines, our method achieves accurate and more reliable estimates of the test error.
[708] Modeling Microenvironment Trajectories on Spatial Transcriptomics with NicheFlow
Kristiyan Sakalyan, Alessandro Palma, Filippo Guerranti, Fabian J. Theis, Stephan Günnemann
Main category: cs.LG
TL;DR: NicheFlow is a flow-based generative model that infers temporal trajectories of cellular microenvironments from sequential spatial transcriptomics data using optimal transport and Variational Flow Matching.
Details
Motivation: Current methods model cellular evolution at single-cell level but overlook coordinated development of cellular states in tissues, despite spatial transcriptomics enabling high-resolution mapping of tissue organization.
Method: Represent local cell neighborhoods as point clouds and jointly model evolution of cell states and spatial coordinates using optimal transport and Variational Flow Matching.
Result: Successfully recovers both global spatial architecture and local microenvironment composition across diverse spatiotemporal datasets including embryonic and brain development.
Conclusion: NicheFlow provides a framework for understanding cellular microenvironment evolution in spatiotemporal data, bridging the gap between single-cell analysis and tissue-level development.
Abstract: Understanding the evolution of cellular microenvironments in spatiotemporal data is essential for deciphering tissue development and disease progression. While experimental techniques like spatial transcriptomics now enable high-resolution mapping of tissue organization across space and time, current methods that model cellular evolution operate at the single-cell level, overlooking the coordinated development of cellular states in a tissue. We introduce NicheFlow, a flow-based generative model that infers the temporal trajectory of cellular microenvironments across sequential spatial slides. By representing local cell neighborhoods as point clouds, NicheFlow jointly models the evolution of cell states and spatial coordinates using optimal transport and Variational Flow Matching. Our approach successfully recovers both global spatial architecture and local microenvironment composition across diverse spatiotemporal datasets, from embryonic to brain development.
[709] Balanced Multimodal Learning via Mutual Information
Rongrong Xie, Guido Sanguinetti
Main category: cs.LG
TL;DR: A unified framework for multimodal learning that addresses modality imbalance using mutual information, cross-modal knowledge distillation, and multitask-like training to improve model performance.
Details
Motivation: Modality imbalance in multimodal learning is inadequately addressed, especially in biological data where datasets are limited, costly, and heterogeneous in quality. Conventional methods fail to harness intermodal synergies while resolving modality conflicts.
Method: Two-stage approach: 1) Cross-modal knowledge distillation pretraining where stronger modalities enhance weaker ones, 2) Multitask-like training with dynamic gradient calibration based on modality performance and mutual information.
Result: The approach effectively alleviates modality imbalance and significantly improves overall multimodal model performance.
Conclusion: The proposed framework successfully addresses modality imbalance through mutual information quantification and balanced learning strategies, demonstrating improved performance in multimodal learning scenarios.
Abstract: Multimodal learning has increasingly become a focal point in research, primarily due to its ability to integrate complementary information from diverse modalities. Nevertheless, modality imbalance, stemming from factors such as insufficient data acquisition and disparities in data quality, has often been inadequately addressed. This issue is particularly prominent in biological data analysis, where datasets are frequently limited, costly to acquire, and inherently heterogeneous in quality. Conventional multimodal methodologies typically fall short in concurrently harnessing intermodal synergies and effectively resolving modality conflicts. In this study, we propose a novel unified framework explicitly designed to address modality imbalance by utilizing mutual information to quantify interactions between modalities. Our approach adopts a balanced multimodal learning strategy comprising two key stages: cross-modal knowledge distillation (KD) and a multitask-like training paradigm. During the cross-modal KD pretraining phase, stronger modalities are leveraged to enhance the predictive capabilities of weaker modalities. Subsequently, our primary training phase employs a multitask-like learning mechanism, dynamically calibrating gradient contributions based on modality-specific performance metrics and intermodal mutual information. This approach effectively alleviates modality imbalance, thereby significantly improving overall multimodal model performance.
[710] Hydra: Dual Exponentiated Memory for Multivariate Time Series Analysis
Asal Meskin, Alireza Mirrokni, Ali Najar, Ali Behrouz
Main category: cs.LG
TL;DR: Hydra is a two-headed meta in-context memory module that learns to prioritize informative time series patterns using 2D recurrence across time and variate dimensions, overcoming limitations of existing models like Transformers and linear RNNs.
Details
Motivation: Existing time series models (Transformers, MLPs, linear models) lack temporal inductive bias, miss inter-dependencies between temporal and variate dimensions, and are inefficient for long-term modeling. Linear RNNs address efficiency but miss inter-variate dependencies and can propagate errors.
Method: Hydra uses a 2-dimensional recurrence across both time and variate dimensions with a two-headed meta in-context memory module. It employs a 2D-chunk-wise training algorithm that approximates the recurrence with 10x efficiency improvement while maintaining effectiveness.
Result: Experimental results on diverse tasks (forecasting, classification, anomaly detection) show superior performance compared to state-of-the-art baselines.
Conclusion: Hydra effectively addresses the limitations of existing time series models by capturing both temporal dynamics and inter-variate dependencies through 2D recurrence, achieving better performance across multiple time series tasks.
Abstract: In recent years, effectively modeling multivariate time series has gained significant popularity, mainly due to its wide range of applications, ranging from healthcare to financial markets and energy management. Transformers, MLPs, and linear models as the de facto backbones of modern time series models have shown promising results in single-variate and/or short-term forecasting. These models, however: (1) are permutation equivariant and so lack temporal inductive bias, being less expressive in capturing the temporal dynamics; (2) are naturally designed for a univariate setup, missing the inter-dependencies of temporal and variate dimensions; and/or (3) are inefficient for long-term time series modeling. To overcome training and inference efficiency as well as the lack of temporal inductive bias, recently, linear Recurrent Neural Networks (RNNs) have gained attention as an alternative to Transformer-based models. These models, however, are inherently limited to a single sequence, missing inter-variate dependencies, and can propagate errors due to their additive nature. In this paper, we present Hydra, a by-design two-headed meta in-context memory module that learns how to memorize patterns at test time by prioritizing time series patterns that are more informative about the data. Hydra uses a 2-dimensional recurrence across both time and variate at each step, which is more powerful than mixing methods. Although the 2-dimensional nature of the model makes its training recurrent and non-parallelizable, we present a new 2D-chunk-wise training algorithm that approximates the actual recurrence with $\times 10$ efficiency improvement, while maintaining the effectiveness. Our experimental results on a diverse set of tasks and datasets, including time series forecasting, classification, and anomaly detection show the superior performance of Hydra compared to state-of-the-art baselines.
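The phrase "2-dimensional recurrence across both time and variate" can be pictured with the toy linear recurrence below; the actual Hydra update is a learned memory module with a chunk-wise approximation for efficiency, so this is only a shape-level illustration with made-up coefficients.

```python
import numpy as np

def toy_2d_recurrence(x, a=0.9, b=0.5):
    """Propagate state along both the time axis (t) and the variate axis (v).
    x: array of shape (T, V); returns hidden states of the same shape."""
    T, V = x.shape
    h = np.zeros((T, V))
    for t in range(T):
        for v in range(V):
            from_time = h[t - 1, v] if t > 0 else 0.0       # memory carried through time
            from_variate = h[t, v - 1] if v > 0 else 0.0    # memory carried across variates
            h[t, v] = a * from_time + b * from_variate + x[t, v]
    return h
```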
[711] Energy-Efficient Deep Learning Without Backpropagation: A Rigorous Evaluation of Forward-Only Algorithms
Przemysław Spyra, Witold Dzwinel
Main category: cs.LG
TL;DR: Mono-Forward algorithm outperforms backpropagation in MLPs with better accuracy, 41% less energy, and 34% faster training, challenging BP’s necessity for state-of-the-art performance.
Details
Motivation: To challenge the long-held assumption that backpropagation is essential for state-of-the-art performance and provide a practical, efficient alternative.
Method: Developed Mono-Forward algorithm through evolutionary path from Forward-Forward to Cascaded Forward, using identical architectures and universal hyperparameter optimization for fair comparison.
Result: MF consistently surpasses optimally tuned BP baseline in classification accuracy, with 41% less energy consumption and 34% faster training.
Conclusion: MF establishes itself as a practical, high-performance, and sustainable alternative to backpropagation for MLPs.
Abstract: The long-held assumption that backpropagation (BP) is essential for state-of-the-art performance is challenged by this work. We present rigorous, hardware-validated evidence that the Mono-Forward (MF) algorithm, a backpropagation-free method, consistently surpasses an optimally tuned BP baseline in classification accuracy on its native Multi-Layer Perceptron (MLP) architectures. This superior generalization is achieved with profound efficiency gains, including up to 41% less energy consumption and up to 34% faster training. Our analysis, which charts an evolutionary path from Geoffrey Hinton’s Forward-Forward (FF) to the Cascaded Forward (CaFo) and finally to MF, is grounded in a fair comparative framework using identical architectures and universal hyperparameter optimization. We further provide a critical re-evaluation of memory efficiency in BP-free methods, empirically demonstrating that practical overhead can offset theoretical gains. Ultimately, this work establishes MF as a practical, high-performance, and sustainable alternative to BP for MLPs.
[712] None To Optima in Few Shots: Bayesian Optimization with MDP Priors
Diantong Li, Kyunghyun Cho, Chong Liu
Main category: cs.LG
TL;DR: ProfBO is a Bayesian Optimization algorithm that uses MDP priors from source tasks to optimize black-box functions with very few evaluations, outperforming state-of-the-art methods on real-world benchmarks.
Details
Motivation: Bayesian Optimization becomes impractical for critical real-world applications like drug discovery where evaluations are costly and time-consuming, requiring solutions that work with remarkably few function evaluations.
Method: Uses Markov Decision Process (MDP) priors to model optimization trajectories from related source tasks, embeds them into prior-fitted neural networks, and employs model-agnostic meta-learning for fast adaptation to new target tasks.
Result: Experiments on real-world Covid and Cancer benchmarks and hyperparameter tuning tasks show ProfBO consistently outperforms state-of-the-art methods by achieving high-quality solutions with significantly fewer evaluations.
Conclusion: ProfBO makes Bayesian Optimization ready for practical deployment in critical applications by solving black-box optimization with remarkably few function evaluations.
Abstract: Bayesian Optimization (BO) is an efficient tool for optimizing black-box functions, but its theoretical guarantees typically hold in the asymptotic regime. In many critical real-world applications such as drug discovery or materials design, where each evaluation can be very costly and time-consuming, BO becomes impractical for many evaluations. In this paper, we introduce the Procedure-inFormed BO (ProfBO) algorithm, which solves black-box optimization with remarkably few function evaluations. At the heart of our algorithmic design are Markov Decision Process (MDP) priors that model optimization trajectories from related source tasks, thereby capturing procedural knowledge on efficient optimization. We embed these MDP priors into a prior-fitted neural network and employ model-agnostic meta-learning for fast adaptation to new target tasks. Experiments on real-world Covid and Cancer benchmarks and hyperparameter tuning tasks demonstrate that ProfBO consistently outperforms state-of-the-art methods by achieving high-quality solutions with significantly fewer evaluations, making it ready for practical deployment.
[713] Continual Learning, Not Training: Online Adaptation For Agents
Aman Jaglan, Jarrod Barnes
Main category: cs.LG
TL;DR: ATLAS introduces a dual-agent architecture for gradient-free continual learning, decoupling reasoning (Teacher) from execution (Student) with persistent learning memory, achieving adaptive efficiency through inference-time orchestration rather than parameter updates.
Details
Motivation: Traditional CL methods rely on gradient-based retraining, which is unsuitable for deployed agents needing real-time adaptation. ATLAS aims to enable continual learning without retraining through system-level orchestration.
Method: Dual-agent architecture with Teacher (reasoning) and Student (execution), persistent learning memory storing distilled guidance, and orchestration layer that dynamically adjusts operational strategies at inference time.
Result: On Microsoft’s ExCyTIn-Bench, ATLAS achieved 54.1% success with GPT-5-mini, outperforming GPT-5 (High) by 13% while reducing cost by 86%. Cross-incident validation showed generalization with frozen pamphlets improving accuracy from 28% to 41% without retraining.
Conclusion: ATLAS establishes gradient-free continual learning as viable for adaptive AI systems, shifting adaptation from model parameters to system orchestration, enabling deployable systems with improved efficiency and generalization.
Abstract: Continual Learning (CL) methods have traditionally focused on mitigating catastrophic forgetting through gradient-based retraining, an approach ill-suited for deployed agents that must adapt in real time. We introduce our Adaptive Teaching and Learning System (ATLAS), a dual-agent architecture that decouples reasoning (Teacher) from execution (Student) and incorporates a persistent learning memory that stores distilled guidance from experience. This informs the orchestration layer, enabling the system to dynamically adjust its operational strategies, such as supervision level or initial plan selection, at inference time. In doing so, ATLAS achieves gradient-free continual learning, shifting the locus of adaptation from model parameters to system-level orchestration. We formulate this as a system-centric paradigm for continual learning, where the objective is adaptive efficiency: maximizing task success while minimizing computational cost through inference-time orchestration rather than parameter updates. Evaluated on Microsoft’s ExCyTIn-Bench, an open-source benchmark simulating complex cyberthreat investigation, ATLAS achieves 54.1% success with GPT-5-mini as its Student, outperforming the larger GPT-5 (High) by 13% while reducing cost by 86%. Cross-incident validation demonstrates generalization: frozen pamphlets from Incident #5 improve accuracy from 28% to 41% with zero retraining, while shifting output composition from verbose exploration to structured reasoning. Together, these findings establish gradient-free continual learning as a viable path toward adaptive, deployable AI systems and provide causally annotated traces valuable for training explicit world models.
[714] Equality Graph Assisted Symbolic Regression
Fabricio Olivetti de Franca, Gabriel Kronberger
Main category: cs.LG
TL;DR: SymRegg is a new symbolic regression algorithm that uses e-graphs to avoid redundant computations by compactly storing equivalent expressions and preventing evaluation of previously visited variations.
Details
Motivation: Genetic Programming in symbolic regression computes up to 60% redundant expressions due to navigating large plateaus. E-graphs can help avoid these unnecessary computations by tracking equivalent expressions.
Method: SymRegg uses e-graph structure to store expressions, samples solutions from e-graph, perturbs them, and only inserts new unvisited expressions while generating their equivalent forms.
Result: SymRegg improves search efficiency while maintaining accurate results across datasets with minimal hyperparameter tuning.
Conclusion: E-graph based approach effectively reduces redundant computations in symbolic regression while preserving accuracy and requiring simple hyperparameter choices.
Abstract: In Symbolic Regression (SR), Genetic Programming (GP) is a popular search algorithm that delivers state-of-the-art results in terms of accuracy. Its success relies on the concept of neutrality, which induces large plateaus that the search can safely navigate to more promising regions. Navigating these plateaus, while necessary, requires the computation of redundant expressions, up to 60% of the total number of evaluations, as noted in a recent study. The equality graph (e-graph) structure can compactly store and group equivalent expressions, enabling us to verify whether a given expression and its variations were already visited by the search, thus allowing us to avoid unnecessary computation. We propose a new search algorithm for symbolic regression called SymRegg that revolves around the e-graph structure following simple steps: perturb solutions sampled from a selection of expressions stored in the e-graph; if this generates an unvisited expression, insert it into the e-graph and generate its equivalent forms. We show that SymRegg is capable of improving the efficiency of the search, maintaining consistently accurate results across different datasets while requiring a choice of a minimalist set of hyperparameters.
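The search loop can be summarized schematically as below; `perturb`, `canonicalize`, `equivalents`, and `score` are hypothetical helpers, and a plain set of canonical forms stands in for the e-graph that the paper uses to group equivalent expressions.

```python
import random

def symregg_style_search(initial_exprs, perturb, canonicalize, equivalents, score, steps=1_000):
    """Skip any candidate whose canonical form (or an equivalent rewrite of it)
    has already been visited, so no fitness evaluation is wasted on plateaus."""
    visited, pool = set(), list(initial_exprs)
    for e in pool:
        visited.update(canonicalize(v) for v in equivalents(e))
    best = max(pool, key=score)
    for _ in range(steps):
        candidate = perturb(random.choice(pool))
        if canonicalize(candidate) in visited:
            continue                                      # already explored, do not re-evaluate
        visited.update(canonicalize(v) for v in equivalents(candidate))
        pool.append(candidate)
        if score(candidate) > score(best):
            best = candidate
    return best
```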
[715] What’s the next frontier for Data-centric AI? Data Savvy Agents
Nabeel Seedat, Jiashuo Liu, Mihaela van der Schaar
Main category: cs.LG
TL;DR: The paper argues for prioritizing data-savvy capabilities in AI agent design, proposing four key capabilities: proactive data acquisition, sophisticated data processing, interactive test data synthesis, and continual adaptation to enable reliable real-world deployment.
Details
Motivation: Current AI agent research focuses heavily on reasoning while neglecting data handling capabilities, which are crucial for scalable autonomy and reliable real-world deployment of agents that continuously acquire, process, and evolve their data.
Method: The paper proposes a conceptual framework with four key data-savvy capabilities: (1) proactive data acquisition to autonomously gather knowledge and address data gaps, (2) sophisticated data processing for context-aware handling of diverse data challenges, (3) interactive test data synthesis for dynamic evaluation, and (4) continual adaptation for iterative refinement of data and knowledge.
Result: The paper presents a vision for data-savvy agents as the next frontier in data-centric AI, shifting focus from static benchmarks to dynamic data handling capabilities that enable agents to adapt to changing environments.
Conclusion: Data-savvy capabilities should be prioritized in agentic system design to ensure reliable real-world deployment, moving beyond current reasoning-focused approaches to create agents that can effectively handle data throughout their lifecycle.
Abstract: The recent surge in AI agents that autonomously communicate, collaborate with humans and use diverse tools has unlocked promising opportunities in various real-world settings. However, a vital aspect remains underexplored: how agents handle data. Scalable autonomy demands agents that continuously acquire, process, and evolve their data. In this paper, we argue that data-savvy capabilities should be a top priority in the design of agentic systems to ensure reliable real-world deployment. Specifically, we propose four key capabilities to realize this vision: (1) Proactive data acquisition: enabling agents to autonomously gather task-critical knowledge or solicit human input to address data gaps; (2) Sophisticated data processing: requiring context-aware and flexible handling of diverse data challenges and inputs; (3) Interactive test data synthesis: shifting from static benchmarks to dynamically generated interactive test data for agent evaluation; and (4) Continual adaptation: empowering agents to iteratively refine their data and background knowledge to adapt to shifting environments. While current agent research predominantly emphasizes reasoning, we hope to inspire a reflection on the role of data-savvy agents as the next frontier in data-centric AI.
[716] SARIMAX-Based Power Outage Prediction During Extreme Weather Events
Haoran Ye, Qiuzhuang Sun, Yang Yang
Main category: cs.LG
TL;DR: Developed a SARIMAX-based prediction system for short-term power outage forecasting during extreme weather, achieving 8.4% improvement over baseline with RMSE of 177.2.
Details
Motivation: To improve power outage forecasting during extreme weather events using advanced time series modeling and feature engineering.
Method: Two-stage feature engineering pipeline with data cleaning and correlation filtering, augmented with temporal embeddings and lag features. Used SARIMAX model with standardization, hierarchical fitting strategy, and fallback predictions.
Result: Achieved RMSE of 177.2, representing 8.4% improvement over baseline method (RMSE = 193.4).
Conclusion: The feature engineering and robust optimization strategy effectively improves extreme weather-related outage prediction accuracy.
Abstract: This study develops a SARIMAX-based prediction system for short-term power outage forecasting during extreme weather events. Using hourly data from Michigan counties with outage counts and comprehensive weather features, we implement a systematic two-stage feature engineering pipeline: data cleaning to remove zero-variance and unknown features, followed by correlation-based filtering to eliminate highly correlated predictors. The selected features are augmented with temporal embeddings, multi-scale lag features, and weather variables with their corresponding lags as exogenous inputs to the SARIMAX model. To address data irregularity and numerical instability, we apply standardization and implement a hierarchical fitting strategy with sequential optimization methods, automatic downgrading to ARIMA when convergence fails, and historical mean-based fallback predictions as a final safeguard. The model is optimized separately for short-term (24 hours) and medium-term (48 hours) forecast horizons using RMSE as the evaluation metric. Our approach achieves an RMSE of 177.2, representing an 8.4% improvement over the baseline method (RMSE = 193.4), thereby validating the effectiveness of our feature engineering and robust optimization strategy for extreme weather-related outage prediction.
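The hierarchical fitting-with-fallback idea maps naturally onto statsmodels; the sketch below uses placeholder orders and treats the last rows of the exogenous frame as stand-in future covariates, so it illustrates the safeguard structure rather than the authors' tuned configuration.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def forecast_with_fallback(y: pd.Series, exog: pd.DataFrame, horizon: int = 24):
    """Try SARIMAX with exogenous weather/lag features; downgrade to a plain
    ARIMA if fitting fails; fall back to the historical mean as a last resort."""
    future_exog = exog.tail(horizon)                            # placeholder for known future covariates
    try:
        fit = SARIMAX(y, exog=exog, order=(2, 1, 2),
                      seasonal_order=(1, 0, 1, 24)).fit(disp=False)
        return fit.forecast(steps=horizon, exog=future_exog)
    except Exception:
        try:
            fit = SARIMAX(y, order=(2, 1, 2)).fit(disp=False)   # ARIMA-style downgrade
            return fit.forecast(steps=horizon)
        except Exception:
            return pd.Series(np.full(horizon, y.mean()))        # historical-mean safeguard
```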
[717] Adapt under Attack and Domain Shift: Unified Adversarial Meta-Learning and Domain Adaptation for Robust Automatic Modulation Classification
Ali Owfi, Amirmohammad Bamdad, Tolunay Seyfi, Fatemeh Afghah
Main category: cs.LG
TL;DR: A unified framework combining meta-learning and domain adaptation to make AMC systems robust against adversarial attacks and environmental changes.
Details
Motivation: Deep learning AMC systems are vulnerable to adversarial attacks and data distribution shifts, hindering practical deployment in dynamic environments.
Method: Two-phase strategy: offline meta-learning on clean and adversarial samples for attack resistance, followed by online domain adaptation for environmental adaptation without extensive labeled data.
Result: Significant improvement in modulation classification accuracy against combined threats of adversarial attacks and environmental changes.
Conclusion: The framework provides a critical solution for deployment challenges of modern AMC systems by making them resistant to both adversarial threats and environmental dynamics.
Abstract: Deep learning has emerged as a leading approach for Automatic Modulation Classification (AMC), demonstrating superior performance over traditional methods. However, vulnerability to adversarial attacks and susceptibility to data distribution shifts hinder their practical deployment in real-world, dynamic environments. To address these threats, we propose a novel, unified framework that integrates meta-learning with domain adaptation, making AMC systems resistant to both adversarial attacks and environmental changes. Our framework utilizes a two-phase strategy. First, in an offline phase, we employ a meta-learning approach to train the model on clean and adversarially perturbed samples from a single source domain. This method enables the model to generalize its defense, making it resistant to a combination of previously unseen attacks. Subsequently, in the online phase, we apply domain adaptation to align the model’s features with a new target domain, allowing it to adapt without requiring substantial labeled data. As a result, our framework achieves a significant improvement in modulation classification accuracy against these combined threats, offering a critical solution to the deployment and operational challenges of modern AMC systems.
[718] MedEqualizer: A Framework Investigating Bias in Synthetic Medical Data and Mitigation via Augmentation
Sama Salarian, Yue Zhang, Swati Padhee, Srinivasan Parthasarathy
Main category: cs.LG
TL;DR: MedEqualizer is a model-agnostic augmentation framework that improves fairness in synthetic healthcare data generation by addressing demographic imbalances across protected attributes.
Details
Motivation: Synthetic healthcare data generation can enhance data accessibility but ensuring fairness across protected attributes is critical to avoid biased results in clinical research and decision-making.
Method: Assessed fairness of GAN-based synthetic data using MIMIC-III dataset, measured subgroup representation with logarithmic disparity metric, and introduced MedEqualizer to enrich underrepresented subgroups before synthetic data generation.
Result: Significant imbalances found in synthetic data with many subgroups underrepresented or overrepresented. MedEqualizer significantly improved demographic balance in resulting synthetic datasets.
Conclusion: MedEqualizer offers a viable path towards more equitable and representative healthcare data synthesis by mitigating demographic disparities in synthetic data generation.
Abstract: Synthetic healthcare data generation presents a viable approach to enhance data accessibility and support research by overcoming limitations associated with real-world medical datasets. However, ensuring fairness across protected attributes in synthetic data is critical to avoid biased or misleading results in clinical research and decision-making. In this study, we assess the fairness of synthetic data generated by multiple generative adversarial network (GAN)-based models using the MIMIC-III dataset, with a focus on representativeness across protected demographic attributes. We measure subgroup representation using the logarithmic disparity metric and observe significant imbalances, with many subgroups either underrepresented or overrepresented in the synthetic data, compared to the real data. To mitigate these disparities, we introduce MedEqualizer, a model-agnostic augmentation framework that enriches the underrepresented subgroups prior to synthetic data generation. Our results show that MedEqualizer significantly improves demographic balance in the resulting synthetic datasets, offering a viable path towards more equitable and representative healthcare data synthesis.
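A minimal version of the "enrich before generating" step might look like the following; the oversample-to-largest-subgroup rule is an assumption for illustration, as the paper's augmentation strategy is model-agnostic but not necessarily simple resampling.

```python
import pandas as pd

def equalize_subgroups(df: pd.DataFrame, protected_attrs, random_state=0):
    """Oversample every protected subgroup up to the size of the largest one,
    producing a balanced table to train the synthetic-data generator on."""
    groups = df.groupby(protected_attrs, dropna=False)
    target = groups.size().max()
    parts = []
    for _, g in groups:
        if len(g) < target:
            extra = g.sample(n=target - len(g), replace=True, random_state=random_state)
            parts.append(pd.concat([g, extra]))
        else:
            parts.append(g)
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)
```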
[719] Window-Based Feature Engineering for Cognitive Workload Detection
Andrew Hallam, R G Gayathri, Glory Lee, Atul Sajjanhar
Main category: cs.LG
TL;DR: This research classifies cognitive workload using the COLET dataset with window-based feature generation and machine/deep learning techniques, showing that deep learning models outperform traditional methods.
Details
Motivation: Cognitive workload assessment is increasingly important across health, psychology, and defense applications, requiring effective classification methods for real-time assessment in complex tasks.Method: Window-based temporal partitioning for feature enhancement, followed by machine learning and deep learning models (particularly tabular architectures) for classification of cognitive workload levels.
Result: Deep learning models, especially tabular architectures, outperformed traditional machine learning methods in precision, F1-score, and accuracy.
Conclusion: Window-based temporal feature extraction combined with deep learning techniques shows strong potential for effective real-time cognitive workload assessment in dynamic tasks.
Abstract: Cognitive workload is a topic of increasing interest across various fields such as health, psychology, and defense applications. In this research, we focus on classifying cognitive workload using the COLET dataset, employing a window-based approach for feature generation and machine/deep learning techniques for classification. We apply window-based temporal partitioning to enhance features used in existing research, followed by machine learning and deep learning models to classify different levels of cognitive workload. The results demonstrate that deep learning models, particularly tabular architectures, outperformed traditional machine learning methods in precision, F1-score, and accuracy. This study highlights the effectiveness of window-based temporal feature extraction and the potential of deep learning techniques for real-time cognitive workload assessment in complex and dynamic tasks.
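As a rough illustration of the window-based feature generation step, the sketch below slides a fixed-length window over a 1-D signal and emits simple per-window statistics; the window length, stride, and chosen statistics are assumptions, not the paper's configuration.

```python
# Hedged sketch of window-based temporal partitioning for feature generation.
import numpy as np

def window_features(signal: np.ndarray, win: int = 250, step: int = 125) -> np.ndarray:
    """Slide a window over a 1-D signal and compute simple statistics per window."""
    feats = []
    for start in range(0, len(signal) - win + 1, step):
        w = signal[start:start + win]
        feats.append([w.mean(), w.std(), w.min(), w.max(), np.ptp(w)])
    return np.asarray(feats)  # shape: (n_windows, n_features)
```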
[720] Happiness as a Measure of Fairness
Georg Pichler, Marco Romanelli, Pablo Piantanida
Main category: cs.LG
TL;DR: A novel fairness framework based on happiness utility that unifies existing fairness definitions through efficient linear programming.
Details
Motivation: To provide a more human-centered and mathematically rigorous approach to fairness by measuring utility gains from decision outcomes.Method: Proposes a happiness-based fairness framework that computes optimal fair post-processing strategies by solving linear programs.
Result: The method is efficient, scalable with existing optimization tools, and unifies/extends several well-known fairness definitions.
Conclusion: The happiness-based fairness framework offers a practical, intuitive, and mathematically sound approach that performs well across diverse scenarios.
Abstract: In this paper, we propose a novel fairness framework grounded in the concept of happiness, a measure of the utility each group gains from decision outcomes. By capturing fairness through this intuitive lens, we not only offer a more human-centered approach, but also one that is mathematically rigorous: In order to compute the optimal, fair post-processing strategy, only a linear program needs to be solved. This makes our method both efficient and scalable with existing optimization tools. Furthermore, it unifies and extends several well-known fairness definitions, and our empirical results highlight its practical strengths across diverse scenarios.
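The "only a linear program needs to be solved" claim can be pictured with a toy post-processing LP; the per-group utilities, the demographic-parity-style constraint, and the budget below are illustrative assumptions, not the paper's formulation.

```python
# Hedged sketch: choose per-group acceptance rates t[g] to maximize total
# "happiness" subject to a parity constraint and a budget on positive decisions.
import numpy as np
from scipy.optimize import linprog

n = np.array([600, 400])     # group sizes (assumed)
u = np.array([1.0, 1.2])     # per-person utility of a positive decision (assumed)
budget = 500                 # maximum number of positive decisions overall

c = -(n * u)                        # linprog minimizes, so negate the utility
A_ub, b_ub = [n], [budget]          # n . t <= budget
A_eq, b_eq = [[1.0, -1.0]], [0.0]   # parity: t[0] == t[1]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 2)
print(res.x)                        # optimal per-group acceptance rates
```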
[721] AI Progress Should Be Measured by Capability-Per-Resource, Not Scale Alone: A Framework for Gradient-Guided Resource Allocation in LLMs
David McCoy, Yulun Wu, Zachary Butzin-Dozier
Main category: cs.LG
TL;DR: Challenges AI’s scaling fundamentalism and proposes a capability-per-resource approach using gradient influence patterns to dramatically improve efficiency, potentially reducing resource requirements by orders of magnitude.
Details
Motivation: Address unsustainable environmental impacts and resource inequality caused by unbounded growth in model size and computation in AI research.Method: Theoretical framework using gradient influence patterns to identify high-influence parameters in transformer models, with coordinated parameter and data selection, and a two-stage paradigm of marginal-return pretraining and influence-guided adaptation.
Result: Shows that updating only high-influence parameters outperforms full-parameter tuning, gradient norms efficiently identify influential components, and coordinated selection yields multiplicative efficiency gains.
Conclusion: Transforms hardware workarounds into optimal strategies, democratizing AI access while reducing environmental impact, reshaping AI toward sustainability and equity.
Abstract: This position paper challenges the “scaling fundamentalism” dominating AI research, where unbounded growth in model size and computation has led to unsustainable environmental impacts and widening resource inequality. We argue that LLM development should be fundamentally reoriented toward capability-per-resource rather than capability alone. We present a theoretical framework demonstrating that resource-allocation decisions guided by gradient influence patterns can dramatically improve efficiency throughout the AI lifecycle. Our analysis shows that in transformer-based models, where a small fraction of parameters exert outsized influence (following heavy-tailed distributions), three critical insights emerge: (1) updating only high-influence parameters strictly outperforms full-parameter tuning on a performance-per-resource basis; (2) simple gradient norms provide computationally efficient proxies for identifying these high-influence components; and (3) coordinated parameter and data selection yields multiplicative efficiency gains, potentially reducing resource requirements by orders of magnitude. Building on these theoretical foundations, we propose a two-stage paradigm: marginal-return pretraining for foundation developers and influence-guided adaptation for downstream users, bridged by gradient blueprints, i.e., metadata describing which parameters matter most for various tasks. This capability-per-resource perspective transforms what were once considered pragmatic hardware workarounds into theoretically optimal strategies, democratizing access to cutting-edge AI capabilities while significantly reducing environmental impact. By embedding resource consciousness into how we develop, adapt, and evaluate models, we can reshape AI progress toward a more sustainable and equitable future.
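Insight (2), gradient norms as cheap influence proxies, could look roughly like the following PyTorch sketch, which ranks parameter tensors by gradient norm on a calibration batch and freezes everything outside the top-k; the tensor-level granularity and the value of k are assumptions, not the paper's procedure.

```python
# Hedged sketch: gradient-norm ranking of parameter tensors and selective updating.
import torch

def select_high_influence(model: torch.nn.Module, loss: torch.Tensor, top_k: int = 10):
    """Rank parameter tensors by gradient norm and freeze all but the top_k."""
    loss.backward()
    ranked = sorted(
        ((name, p) for name, p in model.named_parameters() if p.grad is not None),
        key=lambda item: item[1].grad.norm().item(),
        reverse=True,
    )
    keep = {name for name, _ in ranked[:top_k]}
    for name, p in model.named_parameters():
        p.requires_grad_(name in keep)  # later optimizer steps touch only high-influence tensors
    return keep
```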
[722] One model to solve them all: 2BSDE families via neural operators
Takashi Furuya, Anastasis Kratsios, Dylan Possamaï, Bogdan Raonić
Main category: cs.LG
TL;DR: The paper introduces a generative neural operator model using Kolmogorov-Arnold networks to solve families of second-order backward stochastic differential equations with random terminal times on bounded Euclidean domains.
Details
Motivation: To develop efficient neural operator models that can solve infinite families of 2BSDEs, addressing the computational challenges of traditional approaches.Method: Leverages Kolmogorov-Arnold networks within a generative neural operator framework to approximate solution operators for 2BSDE families.
Result: Shows that solution operators for broad 2BSDE families are approximable by neural operators, with a structured subclass requiring only polynomial parameters instead of exponential.
Conclusion: The proposed neural operator approach provides efficient approximation for 2BSDE families, with significant parameter efficiency improvements for structured subclasses.
Abstract: We introduce a mild generative variant of the classical neural operator model, which leverages Kolmogorov–Arnold networks to solve infinite families of second-order backward stochastic differential equations ($2$BSDEs) on regular bounded Euclidean domains with random terminal time. Our first main result shows that the solution operator associated with a broad range of $2$BSDE families is approximable by appropriate neural operator models. We then identify a structured subclass of (infinite) families of $2$BSDEs whose neural operator approximation requires only a polynomial number of parameters in the reciprocal approximation rate, as opposed to the exponential requirement in general worst-case neural operator guarantees.
[723] Stochastic Regret Guarantees for Online Zeroth- and First-Order Bilevel Optimization
Parvin Nazari, Bojian Hou, Davoud Ataee Tarzanagh, Li Shen, George Michailidis
Main category: cs.LG
TL;DR: The paper introduces a novel search direction for online bilevel optimization that enables first- and zeroth-order algorithms to achieve sublinear stochastic bilevel regret without window smoothing, improving efficiency through reduced oracle dependence and simultaneous updates.
Details
Motivation: Current OBO approaches rely on deterministic window-smoothed regret minimization, which may not accurately reflect system performance when functions change rapidly, motivating the need for better stochastic regret guarantees.Method: Proposes a novel search direction used in both first- and zeroth-order stochastic OBO algorithms that eliminates window smoothing, reduces oracle dependence in hypergradient estimation, updates inner/outer variables with linear system solutions, and uses ZO-based estimation of Hessians, Jacobians, and gradients.
Result: The algorithms achieve sublinear stochastic bilevel regret without window smoothing, with experiments validating the approach on online parametric loss tuning and black-box adversarial attacks.
Conclusion: The proposed framework provides improved stochastic regret guarantees and enhanced efficiency for online bilevel optimization problems through a novel search direction and reduced computational requirements.
Abstract: Online bilevel optimization (OBO) is a powerful framework for machine learning problems where both outer and inner objectives evolve over time, requiring dynamic updates. Current OBO approaches rely on deterministic \textit{window-smoothed} regret minimization, which may not accurately reflect system performance when functions change rapidly. In this work, we introduce a novel search direction and show that both first- and zeroth-order (ZO) stochastic OBO algorithms leveraging this direction achieve sublinear {stochastic bilevel regret without window smoothing}. Beyond these guarantees, our framework enhances efficiency by: (i) reducing oracle dependence in hypergradient estimation, (ii) updating inner and outer variables alongside the linear system solution, and (iii) employing ZO-based estimation of Hessians, Jacobians, and gradients. Experiments on online parametric loss tuning and black-box adversarial attacks validate our approach.
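The zeroth-order ingredient, estimating gradients (and, by extension, Jacobian- and Hessian-vector products) from function evaluations only, is commonly built from two-point estimators like the sketch below; the smoothing radius and sample count are illustrative, not the paper's settings.

```python
# Hedged sketch of a two-point zeroth-order gradient estimator.
import numpy as np

def zo_gradient(f, x: np.ndarray, mu: float = 1e-3, n_samples: int = 20) -> np.ndarray:
    """Estimate grad f(x) from function values only, using random Gaussian directions."""
    g = np.zeros_like(x)
    for _ in range(n_samples):
        u = np.random.randn(*x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return g / n_samples
```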
[724] Adversarial Spatio-Temporal Attention Networks for Epileptic Seizure Forecasting
Zan Li, Kyongmin Yeo, Wesley Gifford, Lara Marcuse, Madeline Fields, Bülent Yener
Main category: cs.LG
TL;DR: STAN is an Adversarial Spatio-Temporal Attention Network for epileptic seizure forecasting that jointly models spatial brain connectivity and temporal neural dynamics through cascaded attention blocks, achieving state-of-the-art performance with high sensitivity and low false alarm rates.
Details
Motivation: Forecasting epileptic seizures from EEG signals requires high sensitivity, low false alarm rates, and subject-specific adaptability, addressing the critical challenge in healthcare time series prediction.Method: STAN uses cascaded attention blocks with alternating spatial and temporal modules to capture bidirectional dependencies between spatial and temporal patterns, combined with adversarial training with gradient penalty for robust discrimination between interictal and preictal states.
Result: Achieved 96.6% sensitivity with 0.011 false detections per hour on CHB-MIT dataset and 94.2% sensitivity with 0.063 false detections per hour on MSSM dataset, with early detection typically 15-45 minutes before seizure onset and computational efficiency for real-time deployment.
Conclusion: STAN provides a general paradigm for spatio-temporal forecasting in healthcare domains where individual heterogeneity and interpretability are crucial, demonstrating superior performance in epileptic seizure prediction.
Abstract: Forecasting epileptic seizures from multivariate EEG signals represents a critical challenge in healthcare time series prediction, requiring high sensitivity, low false alarm rates, and subject-specific adaptability. We present STAN, an Adversarial Spatio-Temporal Attention Network that jointly models spatial brain connectivity and temporal neural dynamics through cascaded attention blocks with alternating spatial and temporal modules. Unlike existing approaches that assume fixed preictal durations or separately process spatial and temporal features, STAN captures bidirectional dependencies between spatial and temporal patterns through a unified cascaded architecture. Adversarial training with gradient penalty enables robust discrimination between interictal and preictal states learned from clearly defined 15-minute preictal windows. Continuous 90-minute pre-seizure monitoring reveals that the learned spatio-temporal attention patterns enable early detection: reliable alarms trigger at subject-specific times (typically 15-45 minutes before onset), reflecting the model’s capacity to capture subtle preictal dynamics without requiring individualized training. Experiments on two benchmark EEG datasets (CHB-MIT scalp: 8 subjects, 46 events; MSSM intracranial: 4 subjects, 14 events) demonstrate state-of-the-art performance: 96.6% sensitivity with 0.011 false detections per hour and 94.2% sensitivity with 0.063 false detections per hour, respectively, while maintaining computational efficiency (2.3M parameters, 45 ms latency, 180 MB memory) for real-time edge deployment. Beyond epilepsy, the proposed framework provides a general paradigm for spatio-temporal forecasting in healthcare and other time series domains where individual heterogeneity and interpretability are crucial.
[725] Regularization Implies balancedness in the deep linear network
Kathryn Lindsey, Govind Menon
Main category: cs.LG
TL;DR: Using GIT and Kempf-Ness theorem to analyze deep linear networks, showing L2 regularizer minimizes on balanced manifold and decomposing training into regularizing and learning flows.
Details
Motivation: To establish a mathematical framework connecting balancedness in deep learning with linear systems theory, and to understand training dynamics through geometric invariant theory.Method: Apply geometric invariant theory (GIT) and Kempf-Ness theorem to deep linear networks, decompose training dynamics into regularizing flow on fibers and learning flow on balanced manifold, with moment map providing exact solution for regularizing flow.
Result: Established that L2 regularizer is minimized on balanced manifold, provided exact solution for regularizing flow using moment map, and created unified framework connecting balancedness concepts across deep learning and linear systems.
Conclusion: The approach successfully provides a common mathematical framework for understanding balancedness, connecting deep learning concepts with linear systems theory and enabling interpretation through model reduction and Bayesian principles.
Abstract: We use geometric invariant theory (GIT) to study the deep linear network (DLN). The Kempf-Ness theorem is used to establish that the $L^2$ regularizer is minimized on the balanced manifold. This allows us to decompose the training dynamics into two distinct gradient flows: a regularizing flow on fibers and a learning flow on the balanced manifold. We show that the regularizing flow is exactly solvable using the moment map. This approach provides a common mathematical framework for balancedness in deep learning and linear systems theory. We use this framework to interpret balancedness in terms of model reduction and Bayesian principles.
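For reference, the balanced manifold is usually defined as the set of weight tuples satisfying W_{i+1}^T W_{i+1} = W_i W_i^T for consecutive layers; the sketch below simply measures how far a given deep linear network is from that condition (a standard definition, stated here as an assumption about the paper's usage).

```python
# Hedged sketch: residual of the standard balancedness condition for a deep linear network.
import numpy as np

def balancedness_residual(weights: list[np.ndarray]) -> float:
    """weights = [W_1, ..., W_N] with W_{i+1} @ W_i well defined; returns the summed
    Frobenius-norm violation of W_{i+1}^T W_{i+1} = W_i W_i^T."""
    res = 0.0
    for W_lo, W_hi in zip(weights[:-1], weights[1:]):
        res += np.linalg.norm(W_hi.T @ W_hi - W_lo @ W_lo.T)
    return res
```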
[726] LSHFed: Robust and Communication-Efficient Federated Learning with Locally-Sensitive Hashing Gradient Mapping
Guanjie Cheng, Mengzhen Yang, Xinkui Zhao, Shuyi Yu, Tianyu Du, Yangyang Wu, Mengying Zhu, Shuiguang Deng
Main category: cs.LG
TL;DR: LSHFed is a robust federated learning framework that uses LSHGM gradient verification with multi-hyperplane locally-sensitive hashing to detect malicious gradients while preserving privacy and reducing communication costs by up to 1000x.
Details
Motivation: Federated learning is vulnerable to inference and poisoning attacks in trust-deficient environments, and existing defenses suffer from high communication/computation costs or limited detection precision.Method: Proposes LSHFed framework with LSHGM gradient verification mechanism that projects high-dimensional gradients into compact binary representations using multi-hyperplane locally-sensitive hashing to detect malicious gradients.
Result: Maintains high model performance even with 50% malicious participants and achieves up to 1000x reduction in gradient verification communication compared to full-gradient methods.
Conclusion: LSHFed provides an effective solution for robust federated learning that simultaneously enhances aggregation robustness and privacy preservation while being communication-efficient.
Abstract: Federated learning (FL) enables collaborative model training across distributed nodes without exposing raw data, but its decentralized nature makes it vulnerable in trust-deficient environments. Inference attacks may recover sensitive information from gradient updates, while poisoning attacks can degrade model performance or induce malicious behaviors. Existing defenses often suffer from high communication and computation costs, or limited detection precision. To address these issues, we propose LSHFed, a robust and communication-efficient FL framework that simultaneously enhances aggregation robustness and privacy preservation. At its core, LSHFed incorporates LSHGM, a novel gradient verification mechanism that projects high-dimensional gradients into compact binary representations via multi-hyperplane locally-sensitive hashing. This enables accurate detection and filtering of malicious gradients using only their irreversible hash forms, thus mitigating privacy leakage risks and substantially reducing transmission overhead. Extensive experiments demonstrate that LSHFed maintains high model performance even when up to 50% of participants are collusive adversaries while achieving up to a 1000x reduction in gradient verification communication compared to full-gradient methods.
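The core of LSHGM, projecting a high-dimensional gradient onto random hyperplanes and keeping only the signs, can be sketched as below; the number of hyperplanes, gradient dimension, and the Hamming-distance check are assumptions for illustration.

```python
# Hedged sketch of multi-hyperplane LSH of a gradient into a binary signature.
import numpy as np

def lsh_signature(grad: np.ndarray, planes: np.ndarray) -> np.ndarray:
    """Project the gradient onto random hyperplanes and keep only the signs."""
    return (planes @ grad > 0).astype(np.uint8)  # shape: (n_planes,)

rng = np.random.default_rng(0)
planes = rng.standard_normal((128, 20_000))      # 128 hyperplanes, 20k-dim gradient (assumed sizes)
honest = lsh_signature(rng.standard_normal(20_000), planes)
suspect = lsh_signature(-rng.standard_normal(20_000), planes)
hamming = np.count_nonzero(honest != suspect)    # a large distance would flag the update as suspicious
```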
[727] A Comparative Study of Model Adaptation Strategies for Multi-Treatment Uplift Modeling
Ruyue Zhang, Xiaopeng Ke, Ming Liu, Fangzhou Shi, Chang Men, Zhengdan Zhu
Main category: cs.LG
TL;DR: The paper proposes Orthogonal Function Adaptation (OFA) for multi-treatment uplift modeling, showing it outperforms existing adaptation methods in effectiveness and robustness across various data characteristics.
Details
Motivation: Current multi-treatment uplift modeling techniques adapted from binary-treatment approaches struggle with effectiveness and robustness under different data characteristics like noisy data and observational data mixtures.Method: Proposed Orthogonal Function Adaptation (OFA) based on the function approximation theorem, categorizing existing adaptations into Structure Adaptation and Feature Adaptation.
Result: Experimental results show OFA significantly improves uplift model performance compared to vanilla adaptation methods and demonstrates the highest robustness across multiple data characteristics.
Conclusion: OFA provides a more effective and robust solution for multi-treatment uplift modeling, addressing limitations of current adaptation approaches.
Abstract: Uplift modeling has emerged as a crucial technique for individualized treatment effect estimation, particularly in fields such as marketing and healthcare. Modeling uplift effects in multi-treatment scenarios plays a key role in real-world applications. Current techniques for modeling multi-treatment uplift are typically adapted from binary-treatment works. In this paper, we investigate and categorize all current model adaptations into two types: Structure Adaptation and Feature Adaptation. Through our empirical experiments, we find that these two adaptation types cannot maintain effectiveness under various data characteristics (noisy data, mixed with observational data, etc.). To enhance estimation ability and robustness, we propose Orthogonal Function Adaptation (OFA) based on the function approximation theorem. We conduct comprehensive experiments with multiple data characteristics to study the effectiveness and robustness of all model adaptation techniques. Our experimental results demonstrate that our proposed OFA can significantly improve uplift model performance compared to other vanilla adaptation methods and exhibits the highest robustness.
[728] Analyzing the Power of Chain of Thought through Memorization Capabilities
Lijia Yu, Xiao-Shan Gao, Lijun Zhang
Main category: cs.LG
TL;DR: CoT does not universally enhance transformer reasoning capabilities; transformers with and without CoT have similar memorization bounds (Θ(N)) for finite datasets, and some infinite datasets cannot be memorized at all.
Details
Motivation: To determine whether Chain of Thought (CoT) universally expands transformer capabilities across all reasoning tasks by analyzing memorization properties.Method: Analyzed memorization capabilities of fixed-precision transformers with and without CoT through necessary/sufficient conditions and parameter bounds for finite datasets, plus infinite dataset memorization analysis.
Result: Found that transformers with and without CoT have equivalent parameter bounds (Θ(N)) for memorizing finite datasets, and identified reasoning tasks where CoT provides no enhancement. Some infinite datasets are unmemorizable.
Conclusion: CoT does not universally enhance transformer reasoning power - there exist reasoning tasks where CoT provides no benefit, answering the fundamental question negatively.
Abstract: It has been shown that the chain of thought (CoT) can enhance the power of large language models (LLMs) to solve certain mathematical reasoning problems. However, the capacity of CoT is still not fully explored. As an important instance, the following basic question has not yet been answered: Does CoT expand the capability of transformers across all reasoning tasks? We demonstrate that reasoning with transformers is essentially a memorization problem for reasoning datasets. Thus, examining the power of CoT across all reasoning tasks amounts to analyzing the memorization capabilities of CoT transformers. In this paper, we give a complete description of the memorization capabilities of fixed-precision transformers with or without CoT and give a negative answer to the above-mentioned question. Precisely, we first give necessary and sufficient conditions for fixed-precision transformers with and without CoT to memorize a finite reasoning dataset and show that these two conditions do not imply each other. Then, we give lower and upper bounds for the number of parameters needed for transformers with or without CoT to memorize a finite reasoning dataset with $N$ elements, which are $\overline{\Theta}(N)$ in all cases. This implies that there exist reasoning tasks for which CoT does not enhance the reasoning power of transformers, leading to a negative answer to the above-mentioned question. Finally, we give the first results on memorizing infinite reasoning datasets by CoT transformers and show that some simple infinite datasets cannot be memorized by transformers with or without CoT.
[729] Transmitter Identification and Protocol Categorization in Shared Spectrum via Multi-Task RF Classification at the Network Edge
Tariq Abdul-Quddoos, Tasnia Sharmin, Xiangfang Li, Lijun Qian
Main category: cs.LG
TL;DR: A multi-task CNN framework for RF signal classification achieves high accuracy in protocol categorization and transmitter identification in shared spectrum environments.
Details
Motivation: Spectrum monitoring and transmitter identification are crucial for enforcing spectrum usage policy, efficient utilization, and network security as spectrum sharing becomes increasingly important to meet rising wireless demands.Method: A Convolutional Neural Network (CNN) with multi-channel input strategy is designed to extract meaningful signal features and handle challenges like overlapping signal characteristics and environmental variability for multi-task RF signal classification.
Result: The method achieved 90% accuracy for protocol classification, 100% for transmitting base station classification, and 92% for joint classification tasks using RF data from the POWDER platform.
Conclusion: The proposed method shows significant potential to enhance spectrum monitoring, management, and security in modern wireless networks through robust transmitter identification and protocol categorization.
Abstract: As spectrum sharing becomes increasingly vital to meet rising wireless demands in the future, spectrum monitoring and transmitter identification are indispensable for enforcing spectrum usage policy, efficient spectrum utilization, and network security. This study proposes a robust framework for transmitter identification and protocol categorization via multi-task RF signal classification in shared spectrum environments, where the spectrum monitor will classify transmission protocols (e.g., 4G LTE, 5G-NR, IEEE 802.11a) operating within the same frequency bands, and identify different transmitting base stations, as well as their combinations. A Convolutional Neural Network (CNN) is designed to tackle critical challenges such as overlapping signal characteristics and environmental variability. The proposed method employs a multi-channel input strategy to extract meaningful signal features, achieving remarkable accuracy: 90% for protocol classification, 100% for transmitting base station classification, and 92% for joint classification tasks, utilizing RF data from the POWDER platform. These results highlight the significant potential of the proposed method to enhance spectrum monitoring, management, and security in modern wireless networks.
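A multi-channel CNN with two classification heads of the kind described might look like the following PyTorch sketch; the two-channel I/Q input, layer sizes, and class counts are assumptions, not the paper's architecture.

```python
# Hedged sketch: multi-channel 1-D CNN with protocol and transmitter heads.
import torch
import torch.nn as nn

class MultiTaskRFNet(nn.Module):
    def __init__(self, n_protocols: int = 3, n_stations: int = 4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(2, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.protocol_head = nn.Linear(64, n_protocols)
        self.station_head = nn.Linear(64, n_stations)

    def forward(self, iq: torch.Tensor):          # iq: (batch, 2, n_samples)
        z = self.backbone(iq)
        return self.protocol_head(z), self.station_head(z)

# Training would typically sum one cross-entropy loss per head.
```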
[730] Optimizing Electric Vehicle Charging Station Placement Using Reinforcement Learning and Agent-Based Simulations
Minh-Duc Nguyen, Dung D. Le, Phi Long Nguyen
Main category: cs.LG
TL;DR: A novel deep reinforcement learning framework with agent-based simulations for optimal EV charging station placement, reducing waiting times by 53.28% compared to initial state.
Details
Motivation: Traditional RL methods for EV charging station placement use deterministic reward systems that fail to capture real-world dynamic uncertainties, leading to inefficient and costly evaluations.Method: Integrates deep RL with agent-based simulations to model EV movement and charging demand in real time, using a hybrid RL agent with dual Q-networks and a hybrid reward function combining deterministic factors with simulation feedback.
Result: Case studies in Hanoi, Vietnam show 53.28% reduction in average waiting times compared to initial state, outperforming static baseline methods.
Conclusion: The proposed scalable and adaptive solution effectively addresses real-world complexities in EV infrastructure planning and improves user experience.
Abstract: The rapid growth of electric vehicles (EVs) necessitates the strategic placement of charging stations to optimize resource utilization and minimize user inconvenience. Reinforcement learning (RL) offers an innovative approach to identifying optimal charging station locations; however, existing methods face challenges due to their deterministic reward systems, which limit efficiency. Because real-world conditions are dynamic and uncertain, a deterministic reward structure cannot fully capture the complexities of charging station placement. As a result, evaluation becomes costly and time-consuming, and less reflective of real-world scenarios. To address this challenge, we propose a novel framework that integrates deep RL with agent-based simulations to model EV movement and estimate charging demand in real time. Our approach employs a hybrid RL agent with dual Q-networks to select optimal locations and configure charging ports, guided by a hybrid reward function that combines deterministic factors with simulation-derived feedback. Case studies in Hanoi, Vietnam, show that our method reduces average waiting times by 53.28% compared to the initial state, outperforming static baseline methods. This scalable and adaptive solution enhances EV infrastructure planning, effectively addressing real-world complexities and improving user experience.
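The hybrid reward, a deterministic component plus feedback from the agent-based simulation, could be as simple as the weighted combination below; the chosen factors and weights are purely illustrative assumptions.

```python
# Hedged sketch of a hybrid reward mixing deterministic factors with simulation feedback.
def hybrid_reward(coverage: float, install_cost: float, sim_avg_wait_minutes: float,
                  w_cov: float = 1.0, w_cost: float = 0.2, w_wait: float = 0.5) -> float:
    """Reward coverage; penalize installation cost and simulated average waiting time."""
    return w_cov * coverage - w_cost * install_cost - w_wait * sim_avg_wait_minutes
```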
[731] WindMiL: Equivariant Graph Learning for Wind Loading Prediction
Themistoklis Vargiemezis, Charilaos Kanatsoulis, Catherine Gorlé
Main category: cs.LG
TL;DR: WindMiL is a machine learning framework that combines systematic dataset generation with symmetry-aware graph neural networks to efficiently predict wind loads on buildings, overcoming the computational expense of traditional methods like large-eddy simulation.
Details
Motivation: Conventional wind loading prediction methods like wind tunnel testing and large-eddy simulation are computationally expensive (24+ hours per case), making comprehensive parametric studies infeasible for building design and structural safety.Method: Created a large-scale dataset of 462 LES cases with varied roof geometries using signed distance function interpolation, then developed a reflection-equivariant graph neural network that guarantees physically consistent predictions under mirrored geometries.
Result: WindMiL achieved high accuracy with RMSE ≤ 0.02 for mean pressure coefficients, maintained hit rates above 96% under reflected-test evaluation, while non-equivariant baseline models dropped by more than 10%.
Conclusion: By combining systematic dataset generation with equivariant surrogate modeling, WindMiL enables efficient, scalable, and accurate wind load predictions for building design applications.
Abstract: Accurate prediction of wind loading on buildings is crucial for structural safety and sustainable design, yet conventional approaches such as wind tunnel testing and large-eddy simulation (LES) are prohibitively expensive for large-scale exploration. Each LES case typically requires at least 24 hours of computation, making comprehensive parametric studies infeasible. We introduce WindMiL, a new machine learning framework that combines systematic dataset generation with symmetry-aware graph neural networks (GNNs). First, we introduce a large-scale dataset of wind loads on low-rise buildings by applying signed distance function interpolation to roof geometries and simulating 462 cases with LES across varying shapes and wind directions. Second, we develop a reflection-equivariant GNN that guarantees physically consistent predictions under mirrored geometries. Across interpolation and extrapolation evaluations, WindMiL achieves high accuracy for both the mean and the standard deviation of surface pressure coefficients (e.g., RMSE $\leq 0.02$ for mean $C_p$) and remains accurate under reflected-test evaluation, maintaining hit rates above $96\%$ where the non-equivariant baseline model drops by more than $10\%$. By pairing a systematic dataset with an equivariant surrogate, WindMiL enables efficient, scalable, and accurate predictions of wind loads on buildings.
[732] A Saddle Point Remedy: Power of Variable Elimination in Non-convex Optimization
Min Gan, Guang-Yong Chen, Yang Yi, Lin Yang
Main category: cs.LG
TL;DR: Variable elimination algorithms like VarPro reshape optimization landscapes by transforming saddle points in original formulations into local maxima in reduced formulations, enabling more effective navigation of non-convex energy landscapes.
Details
Motivation: To understand why variable elimination algorithms exhibit superior convergence and robustness in large-scale non-convex optimization, particularly in navigating saddle points that are primary obstacles in machine learning optimization.Method: Rigorous geometric analysis using Hessian inertia and Schur complement to compare optimization landscapes of original and reduced formulations, validated on non-convex matrix factorization, two-parameter neural networks, and deep Residual Networks.
Result: Variable elimination fundamentally reshapes critical point structure - local maxima in reduced landscapes correspond directly to saddle points in original formulations, leading to dramatic improvements in stability and convergence to superior minima.
Conclusion: Landscape simplification via saddle point transformation is a powerful principle that can guide the design of more robust and efficient optimization algorithms beyond explaining existing methods.
Abstract: The proliferation of saddle points, rather than poor local minima, is increasingly understood to be a primary obstacle in large-scale non-convex optimization for machine learning. Variable elimination algorithms, like Variable Projection (VarPro), have long been observed to exhibit superior convergence and robustness in practice, yet a principled understanding of why they so effectively navigate these complex energy landscapes has remained elusive. In this work, we provide a rigorous geometric explanation by comparing the optimization landscapes of the original and reduced formulations. Through a rigorous analysis based on Hessian inertia and the Schur complement, we prove that variable elimination fundamentally reshapes the critical point structure of the objective function, revealing that local maxima in the reduced landscape are created from, and correspond directly to, saddle points in the original formulation. Our findings are illustrated on the canonical problem of non-convex matrix factorization, visualized directly on two-parameter neural networks, and finally validated in training deep Residual Networks, where our approach yields dramatic improvements in stability and convergence to superior minima. This work goes beyond explaining an existing method; it establishes landscape simplification via saddle point transformation as a powerful principle that can guide the design of a new generation of more robust and efficient optimization algorithms.
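The Hessian-inertia argument rests on the standard block/Schur-complement identity (Haynsworth inertia additivity); the notation below is generic and not taken from the paper.

```latex
% Haynsworth inertia additivity for a block Hessian and its Schur complement.
\[
H =
\begin{pmatrix}
H_{aa} & H_{ab}\\
H_{ab}^{\top} & H_{bb}
\end{pmatrix},
\qquad
H / H_{bb} = H_{aa} - H_{ab}\, H_{bb}^{-1} H_{ab}^{\top},
\]
\[
\operatorname{In}(H) = \operatorname{In}(H_{bb}) + \operatorname{In}\!\left(H / H_{bb}\right).
\]
```

When the eliminated block $H_{bb}$ is positive definite, every negative direction of $H$ must come from the Schur complement, so an indefinite $H$ (a saddle of the original objective) can surface as a negative-definite reduced Hessian, i.e., a local maximum of the reduced landscape, which is one way to picture the correspondence the paper establishes.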
[733] KAT-GNN: A Knowledge-Augmented Temporal Graph Neural Network for Risk Prediction in Electronic Health Records
Kun-Wei Lin, Yu-Chen Kuo, Hsin-Yao Wang, Yi-Ju Tseng
Main category: cs.LG
TL;DR: KAT-GNN is a knowledge-augmented temporal graph neural network that integrates clinical knowledge and temporal dynamics for EHR-based risk prediction, achieving state-of-the-art performance on CAD and mortality prediction tasks.
Details
Motivation: Clinical risk prediction using EHRs is vital for timely interventions, but modeling heterogeneous and irregular temporal EHR data presents significant challenges that need to be addressed.Method: Constructs modality-specific patient graphs from EHRs, augments them with SNOMED CT ontology edges and EHR co-occurrence priors, then uses a time-aware transformer to capture longitudinal dynamics from graph-encoded representations.
Result: Achieved AUROC of 0.9269 for CAD prediction, 0.9230 for MIMIC-III mortality, and 0.8849 for MIMIC-IV mortality, consistently outperforming baselines like GRASP and RETAIN.
Conclusion: Integration of clinical knowledge into graph representations with time-aware attention provides an effective and generalizable approach for risk prediction across diverse clinical tasks and datasets.
Abstract: Clinical risk prediction using electronic health records (EHRs) is vital to facilitate timely interventions and clinical decision support. However, modeling heterogeneous and irregular temporal EHR data presents significant challenges. We propose \textbf{KAT-GNN} (Knowledge-Augmented Temporal Graph Neural Network), a graph-based framework that integrates clinical knowledge and temporal dynamics for risk prediction. KAT-GNN first constructs modality-specific patient graphs from EHRs. These graphs are then augmented using two knowledge sources: (1) ontology-driven edges derived from SNOMED CT and (2) co-occurrence priors extracted from EHRs. Subsequently, a time-aware transformer is employed to capture longitudinal dynamics from the graph-encoded patient representations. KAT-GNN is evaluated on three distinct datasets and tasks: coronary artery disease (CAD) prediction using the Chang Gung Research Database (CGRD) and in-hospital mortality prediction using the MIMIC-III and MIMIC-IV datasets. KAT-GNN achieves state-of-the-art performance in CAD prediction (AUROC: 0.9269 $\pm$ 0.0029) and demonstrates strong results in mortality prediction in MIMIC-III (AUROC: 0.9230 $\pm$ 0.0070) and MIMIC-IV (AUROC: 0.8849 $\pm$ 0.0089), consistently outperforming established baselines such as GRASP and RETAIN. Ablation studies confirm that both knowledge-based augmentation and the temporal modeling component are significant contributors to performance gains. These findings demonstrate that the integration of clinical knowledge into graph representations, coupled with a time-aware attention mechanism, provides an effective and generalizable approach for risk prediction across diverse clinical tasks and datasets.
[734] A Spatio-Temporal Online Robust Tensor Recovery Approach for Streaming Traffic Data Imputation
Yiyang Yang, Xiejian Chi, Shanxing Gao, Kaidong Wang, Yao Wang
Main category: cs.LG
TL;DR: Proposed an online robust tensor recovery algorithm for traffic data that handles missing and anomalous values while leveraging spatio-temporal correlations, achieving high accuracy and 1000x faster computation than batch methods.
Details
Motivation: Traditional batch-based tensor recovery methods are computationally expensive and storage-intensive for continuously expanding traffic data volumes, while existing online methods suffer from performance degradation due to insufficient exploitation of traffic data's structural properties.Method: Reformulated traffic data recovery within a streaming framework and developed a novel online robust tensor recovery algorithm that simultaneously leverages global spatio-temporal correlations and local consistency of traffic data.
Result: Experimental results on three real-world traffic datasets show high recovery accuracy with computational efficiency improvements up to three orders of magnitude (1000x faster) compared to state-of-the-art batch-based methods.
Conclusion: The proposed approach serves as a scalable and effective solution for traffic data quality enhancement in Intelligent Transportation Systems, demonstrating strong adaptability across diverse missing patterns.
Abstract: Data quality is critical to Intelligent Transportation Systems (ITS), as complete and accurate traffic data underpin reliable decision-making in traffic control and management. Recent advances in low-rank tensor recovery algorithms have shown strong potential in capturing the inherent structure of high-dimensional traffic data and restoring degraded observations. However, traditional batch-based methods demand substantial computational and storage resources, which limits their scalability in the face of continuously expanding traffic data volumes. Moreover, recent online tensor recovery methods often suffer from severe performance degradation in complex real-world scenarios due to their insufficient exploitation of the intrinsic structural properties of traffic data. To address these challenges, we reformulate the traffic data recovery problem within a streaming framework, and propose a novel online robust tensor recovery algorithm that simultaneously leverages both the global spatio-temporal correlations and local consistency of traffic data, achieving high recovery accuracy and significantly improved computational efficiency in large-scale scenarios. Our method is capable of simultaneously handling missing and anomalous values in traffic data, and demonstrates strong adaptability across diverse missing patterns. Experimental results on three real-world traffic datasets demonstrate that the proposed approach achieves high recovery accuracy while significantly improving computational efficiency by up to three orders of magnitude compared to state-of-the-art batch-based methods. These findings highlight the potential of the proposed approach as a scalable and effective solution for traffic data quality enhancement in ITS.
[735] Identification of Capture Phases in Nanopore Protein Sequencing Data Using a Deep Learning Model
Annabelle Martin, Daphne Kontogiorgos-Heintz, Jeff Nivala
Main category: cs.LG
TL;DR: A lightweight 1D CNN called CaptureNet-Deep was developed to automatically detect protein capture phases in nanopore sequencing data, achieving 0.94 F1 score and reducing analysis time from days to under 30 minutes.
Details
Motivation: Manual identification of capture phases in nanopore protein sequencing is time-intensive (taking days) and requires domain expertise, creating a bottleneck in data analysis.Method: Developed a lightweight one-dimensional convolutional neural network (1D CNN) trained on down-sampled signal windows, compared against CNN-LSTM hybrids, histogram-based classifiers, and other CNN variants using run-level data splits.
Result: CaptureNet-Deep achieved F1 score of 0.94 and precision of 93.39% on held-out test data, supports low-latency inference, and reduced total analysis time from several days to under thirty minutes.
Conclusion: Efficient, real-time capture detection is possible using simple, interpretable architectures, suggesting a broader role for lightweight ML models in sequencing workflows.
Abstract: Nanopore protein sequencing produces long, noisy ionic current traces in which key molecular phases, such as protein capture and translocation, are embedded. Capture phases mark the successful entry of a protein into the pore and serve as both a checkpoint and a signal that a channel merits further analysis. However, manual identification of capture phases is time-intensive, often requiring several days for expert reviewers to annotate the data due to the need for domain-specific interpretation of complex signal patterns. To address this, a lightweight one-dimensional convolutional neural network (1D CNN) was developed and trained to detect capture phases in down-sampled signal windows. Evaluated against CNN-LSTM (Long Short-Term Memory) hybrids, histogram-based classifiers, and other CNN variants using run-level data splits, our best model, CaptureNet-Deep, achieved an F1 score of 0.94 and precision of 93.39% on held-out test data. The model supports low-latency inference and is integrated into a dashboard for Oxford Nanopore experiments, reducing the total analysis time from several days to under thirty minutes. These results show that efficient, real-time capture detection is possible using simple, interpretable architectures and suggest a broader role for lightweight ML models in sequencing workflows.
[736] Lyapunov Stability Learning with Nonlinear Control via Inductive Biases
Yupu Lu, Shijie Lin, Hao Xu, Zeqing Zhang, Jia Pan
Main category: cs.LG
TL;DR: The paper proposes a neural control Lyapunov function (CLF) and CLF-based controller that treats Lyapunov conditions as inductive biases, enabling stable optimization and end-to-end learning with improved convergence and region of attraction.
Details
Motivation: Existing deep learning approaches for CLFs treat Lyapunov conditions as complex constraints for optimization, which leads to hard global convergence and complicated verification implementation.Method: Design neural CLF and CLF-based controller by treating Lyapunov conditions as inductive biases, enabling stable optimization with limited constraints and end-to-end learning of both CLF and controller.
Result: Achieves higher convergence rate and larger region of attraction (ROA) compared to existing methods, and reveals why previous methods have decreasing success rates during learning.
Conclusion: The proposed approach successfully improves the learner-verifier framework for CLFs by incorporating Lyapunov conditions as inductive biases, leading to better performance and insights into previous methods’ limitations.
Abstract: Finding a control Lyapunov function (CLF) in a dynamical system with a controller is an effective way to guarantee stability, which is a crucial issue in safety-critical applications. Recently, deep learning models representing CLFs have been applied within a learner-verifier framework to identify satisfiable candidates. However, the learner treats Lyapunov conditions as complex constraints for optimisation, which makes global convergence hard to achieve. Implementing these Lyapunov conditions for verification is also complicated. To improve this framework, we treat Lyapunov conditions as inductive biases and design a neural CLF and a CLF-based controller guided by this knowledge. This design enables a stable optimisation process with limited constraints, and allows end-to-end learning of both the CLF and the controller. Our approach achieves a higher convergence rate and larger region of attraction (ROA) in learning the CLF compared to existing methods across a wide range of experimental cases. We also thoroughly analyse why the success rate of previous methods decreases during learning.
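One way to read "Lyapunov conditions as inductive biases" is to build positivity and V(0) = 0 directly into the network's functional form instead of penalizing violations; the construction below is a generic example of that idea, not necessarily the paper's parameterization.

```python
# Hedged sketch: a Lyapunov candidate that satisfies V(0)=0 and V(x)>0 for x != 0 by construction.
import torch
import torch.nn as nn

class NeuralCLF(nn.Module):
    def __init__(self, dim: int, hidden: int = 64, eps: float = 1e-3):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, hidden))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # V(x) = ||phi(x) - phi(0)||^2 + eps * ||x||^2 vanishes only at the origin.
        origin = torch.zeros_like(x)
        return ((self.phi(x) - self.phi(origin)) ** 2).sum(-1) + self.eps * (x ** 2).sum(-1)
```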
[737] Koopman-based Prediction of Connectivity for Flying Ad Hoc Networks
Sivaram Krishnan, Jinho Choi, Jihong Park, Gregory Sherman, Benjamin Campbell
Main category: cs.LG
TL;DR: This paper applies data-driven Koopman approaches to model UAV trajectory dynamics in flying ad hoc networks (FANETs), enabling accurate prediction of connectivity events and SINR values to improve network performance in dynamic environments.
Details
Motivation: Traditional ML techniques struggle with highly dynamic wireless environments like FANETs, where network topology constantly changes. There's a need for approaches that can effectively model these dynamics to ensure reliable communication between UAVs.Method: Proposed two approaches using Koopman operator theory: centralized and distributed methods to model UAV trajectory dynamics in FANETs. Used these to predict signal-to-interference-plus-noise ratios (SINRs) for surveillance UAVs following pre-determined trajectories.
Result: The approaches accurately predicted connectivity and isolation events that lead to communication outages. This enables reliable prediction of SINR values and identification of potential communication disruptions.
Conclusion: Data-driven Koopman approaches provide an effective solution for modeling dynamic FANET environments, allowing UAVs to schedule transmissions based on predicted connectivity patterns and improving overall network reliability.
Abstract: The application of machine learning (ML) to communication systems is expected to play a pivotal role in future artificial intelligence (AI)-based next-generation wireless networks. While most existing works focus on ML techniques for static wireless environments, they often face limitations when applied to highly dynamic environments, such as flying ad hoc networks (FANETs). This paper explores the use of data-driven Koopman approaches to address these challenges. Specifically, we investigate how these approaches can model UAV trajectory dynamics within FANETs, enabling more accurate predictions and improved network performance. By leveraging Koopman operator theory, we propose two possible approaches – centralized and distributed – to efficiently address the challenges posed by the constantly changing topology of FANETs. To demonstrate this, we consider a FANET performing surveillance with UAVs following pre-determined trajectories and predict signal-to-interference-plus-noise ratios (SINRs) to ensure reliable communication between UAVs. Our results show that these approaches can accurately predict connectivity and isolation events that lead to modelled communication outages. This capability could help UAVs schedule their transmissions based on these predictions.
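The data-driven Koopman step can be pictured as the standard least-squares (DMD/EDMD-style) fit on snapshot pairs shown below; lifting dictionaries, and whether the fit runs centrally or per UAV, are details this sketch deliberately leaves out.

```python
# Hedged sketch: least-squares Koopman/DMD fit on snapshot pairs and linear roll-out.
import numpy as np

def fit_koopman(snapshots: np.ndarray) -> np.ndarray:
    """snapshots: (n_features, n_timesteps). Returns K such that x_{t+1} ~ K @ x_t."""
    X, Y = snapshots[:, :-1], snapshots[:, 1:]
    return Y @ np.linalg.pinv(X)

def rollout(K: np.ndarray, x0: np.ndarray, horizon: int) -> np.ndarray:
    """Roll the linear model forward to predict future states (e.g., SINR-relevant features)."""
    traj = [x0]
    for _ in range(horizon):
        traj.append(K @ traj[-1])
    return np.stack(traj, axis=1)
```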
[738] Diffusion-Based Solver for CNF Placement on the Cloud-Continuum
Álvaro Vázquez Rodríguez, Manuel Fernández-Veiga, Carlos Giraldo-Rodríguez
Main category: cs.LG
TL;DR: A novel diffusion-based generative framework for Cloud-Native Network Function placement that treats placement as a graph-to-assignment task, using Graph Neural Networks to iteratively refine noisy assignment matrices while incorporating constraint-specific losses.
Details
Motivation: Classical approaches like mixed-integer nonlinear programming, heuristics, and reinforcement learning have limitations in scalability, constraint handling, and generalization for CNF placement across Cloud-Continuum in 5G/6G networks.Method: Uses Denoising Diffusion Probabilistic Models (DDPM) to reconceptualize placement as generative graph-to-assignment task. Encodes placement problem as heterogeneous graph and trains Graph Neural Network denoiser to iteratively refine noisy CNF-to-cloud assignment matrices with constraint-specific losses.
Result: Extensive evaluations show the model consistently produces feasible solutions with orders of magnitude faster inference than MINLP solvers across diverse topologies.
Conclusion: Demonstrates the potential of diffusion-based generative modeling for constrained network embedding problems, advancing practical and scalable orchestration of distributed Cloud-Native Network Functions.
Abstract: The placement of Cloud-Native Network Functions (CNFs) across the Cloud-Continuum represents a core challenge in the orchestration of current 5G and future 6G networks. The process involves the placement of interdependent computing tasks, structured as Service Function Chains, over distributed cloud infrastructures. This is achieved while satisfying strict resource, bandwidth and latency constraints. It is acknowledged that classical approaches, including mixed-integer nonlinear programming, heuristics and reinforcement learning, are limited in terms of scalability, constraint handling and generalisation capacity. In the present study, a novel theoretical framework is proposed, which is based on Denoising Diffusion Probabilistic Models (DDPM) for CNF placement. The present approach reconceptualises placement as a generative graph-to-assignment task, where the placement problem is encoded as a heterogeneous graph, and a Graph Neural Network denoiser is trained to iteratively refine noisy CNF-to-cloud assignment matrices. The model incorporates constraint-specific losses directly into the loss function, thereby allowing it to learn feasible solution spaces. The integration of the DDPM formulation with structured combinatorial constraints is achieved through a rigorous and systematic approach. Extensive evaluations across diverse topologies have been conducted, which have confirmed that the model consistently produces feasible solutions with orders of magnitude faster inference than MINLP solvers. The results obtained demonstrate the potential of diffusion-based generative modelling for constrained network embedding problems, advancing the practical, scalable orchestration of distributed Cloud-Native Network Functions.

[739] MiniFool - Physics-Constraint-Aware Minimizer-Based Adversarial Attacks in Deep Neural Networks
Lucie Flek, Oliver Janik, Philipp Alexander Jung, Akbar Karimi, Timo Saala, Alexander Schmidt, Matthias Schott, Philipp Soldin, Matthias Thiesmeyer, Christopher Wiebusch, Ulrich Willemsen
Main category: cs.LG
TL;DR: MiniFool is a physics-inspired adversarial attack algorithm for testing neural network classifiers in particle and astroparticle physics, demonstrated on IceCube neutrino data, MNIST, and CMS Open Data.
Details
Motivation: To develop a general adversarial attack method for testing neural network robustness in physics applications, particularly for neutrino classification and other scientific domains.Method: The algorithm minimizes a cost function combining χ²-based test statistics with target score deviation, using perturbations based on experimental uncertainties to flip classifications.
Result: The algorithm successfully flips classifications with varying likelihoods for correctly vs incorrectly classified events, and allows robustness testing by scaling experimental uncertainties.
Conclusion: MiniFool provides a method to quantify neural network robustness and test classification reliability on unlabeled experimental data across multiple scientific domains.
Abstract: In this paper, we present a new algorithm, MiniFool, that implements physics-inspired adversarial attacks for testing neural network-based classification tasks in particle and astroparticle physics. While we initially developed the algorithm for the search for astrophysical tau neutrinos with the IceCube Neutrino Observatory, we apply it to data from other science domains, thus demonstrating its general applicability. Here, we apply the algorithm to the well-known MNIST data set and, furthermore, to Open Data from the CMS experiment at the Large Hadron Collider. The algorithm is based on minimizing a cost function that combines a $\chi^2$-based test statistic with the deviation from the desired target score. The test statistic quantifies the probability of the perturbations applied to the data based on the experimental uncertainties. For our studied use cases, we find that the likelihood of a flipped classification differs between the initially correctly and incorrectly classified events. When testing changes of the classifications as a function of an attack parameter that scales the experimental uncertainties, the robustness of the network decision can be quantified. Furthermore, this allows testing the robustness of the classification of unlabeled experimental data.
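The cost being minimized, a χ²-style penalty on uncertainty-scaled perturbations plus the deviation from a target score, can be sketched as follows; the trade-off weight, optimizer choice, and toy score function are assumptions standing in for the actual network and data.

```python
# Hedged sketch of a MiniFool-style objective: chi-squared perturbation penalty plus score deviation.
import numpy as np
from scipy.optimize import minimize

def attack_cost(delta, x, sigma, score_fn, target_score, lam=1.0):
    chi2 = np.sum((delta / sigma) ** 2)                    # plausibility of the perturbation
    score_dev = (score_fn(x + delta) - target_score) ** 2  # distance from the desired score
    return chi2 + lam * score_dev

x = np.array([1.0, 2.0, 3.0])                 # toy event features (hypothetical)
sigma = np.array([0.1, 0.2, 0.1])             # per-feature experimental uncertainties (hypothetical)
score_fn = lambda v: 1.0 / (1.0 + np.exp(-(v.sum() - 6.0)))   # stand-in for the classifier score
res = minimize(attack_cost, x0=np.zeros_like(x),
               args=(x, sigma, score_fn, 0.0), method="Nelder-Mead")
```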
[740] Verifiable Split Learning via zk-SNARKs
Rana Alaa, Darío González-Ferreiro, Carlos Beis-Penedo, Manuel Fernández-Veiga, Rebeca P. Díaz-Redondo, Ana Fernández-Vilas
Main category: cs.LG
TL;DR: This paper proposes a verifiable split learning framework that uses zk-SNARK proofs to ensure correctness and verifiability in collaborative deep learning where neural networks are split between client and server.
Details
Motivation: Split learning enables collaborative training when data or resources are separated between devices, but it lacks the ability to verify the correctness and honesty of computations exchanged between parties.Method: The framework integrates zk-SNARK proofs for both forward and backward propagation on the server side, generating proofs and verification for both client and server sides.
Result: The verifiable split learning architecture achieves verifiability and correctness, while blockchain-enabled systems (without zero-knowledge proofs) are lightweight but unverifiable.
Conclusion: Applying zk-SNARK proofs in split learning successfully provides verifiability and correctness guarantees, addressing the trust issues in collaborative learning scenarios.
Abstract: Split learning is an approach to collaborative learning in which a deep neural network is divided into two parts, client-side and server-side, at a cut layer. The client side executes its model using its raw input data and sends the intermediate activation to the server side. This architecture is very useful for enabling collaborative training when data or resources are separated between devices. However, split learning lacks the ability to verify the correctness and honesty of the computations that are performed and exchanged between the parties. To this purpose, this paper proposes a verifiable split learning framework that integrates zk-SNARK proofs to ensure correctness and verifiability. The zk-SNARK proofs and their verification are generated for both forward and backward propagation on the server side, guaranteeing verifiability for both the client and server sides. The verifiable split learning architecture is compared to a blockchain-enabled system for the same deep learning network, one that records updates but without generating zero-knowledge proofs. From the comparison, it can be deduced that applying zk-SNARK proofs achieves verifiability and correctness, while the blockchain-based approach is lightweight but unverifiable.
[741] Learning Intractable Multimodal Policies with Reparameterization and Diversity Regularization
Ziqi Wang, Jiashun Liu, Ling Pan
Main category: cs.LG
TL;DR: The paper proposes a unified framework for optimizing multimodal actors in deep RL, introduces a distance-based diversity regularization, and demonstrates improved performance in diversity-critical domains.
Details
Motivation: Traditional RL algorithms use deterministic or unimodal Gaussian actors that cannot express complex multimodal decision distributions, limiting performance in diversity-critical scenarios.Method: Reformulates existing intractable multimodal actors within a unified framework and proves they can be optimized via policy gradient with reparameterization. Proposes a distance-based diversity regularization that doesn’t require explicit decision probabilities.
Result: Shows advantages in multi-goal achieving and generative RL domains, particularly in few-shot robustness. Achieves competitive performance in MuJoCo benchmarks. Identifies amortized actors as promising policy models with strong multimodal expressivity.
Conclusion: The proposed method effectively balances performance, decision diversity, and efficiency in multimodal RL, with amortized actors emerging as a particularly effective policy model class.
Abstract: Traditional continuous deep reinforcement learning (RL) algorithms employ deterministic or unimodal Gaussian actors, which cannot express complex multimodal decision distributions. This limitation can hinder their performance in diversity-critical scenarios. There have been some attempts to design online multimodal RL algorithms based on diffusion or amortized actors. However, these actors are intractable, making existing methods struggle with balancing performance, decision diversity, and efficiency simultaneously. To overcome this challenge, we first reformulate existing intractable multimodal actors within a unified framework, and prove that they can be directly optimized by policy gradient via reparameterization. Then, we propose a distance-based diversity regularization that does not explicitly require decision probabilities. We identify two diversity-critical domains, namely multi-goal achieving and generative RL, to demonstrate the advantages of multimodal policies and our method, particularly in terms of few-shot robustness. In conventional MuJoCo benchmarks, our algorithm also shows competitive performance. Moreover, our experiments highlight that the amortized actor is a promising policy model class with strong multimodal expressivity and high performance. Our code is available at https://github.com/PneuC/DrAC
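The distance-based diversity regularization is described only at a high level; one plausible reading is to penalize small pairwise distances among reparameterized action samples drawn for the same state, which indeed requires no explicit decision probabilities. A hedged sketch of such a surrogate (the exact form used in the paper may differ):

```python
import torch

def diversity_regularizer(actions: torch.Tensor) -> torch.Tensor:
    """Distance-based diversity bonus over reparameterized action samples for
    the same state (shape [num_samples, action_dim]). Returning the negative
    mean pairwise L2 distance is an illustrative surrogate, not the authors'
    exact regularizer."""
    dists = torch.cdist(actions, actions, p=2)         # pairwise distances
    num = actions.shape[0]
    mean_off_diag = dists.sum() / (num * (num - 1))    # diagonal is zero
    return -mean_off_diag                              # minimizing spreads samples out

samples = torch.randn(8, 4, requires_grad=True)        # e.g. 8 sampled actions
loss = diversity_regularizer(samples)
loss.backward()
```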
[742] DAMBench: A Multi-Modal Benchmark for Deep Learning-based Atmospheric Data Assimilation
Hao Wang, Zixuan Weng, Jindong Han, Wei Fan, Hao Liu
Main category: cs.LG
TL;DR: DAMBench is the first large-scale multi-modal benchmark for data-driven data assimilation models, addressing limitations of existing research by providing realistic atmospheric conditions with real-world observations and standardized evaluation protocols.
Details
Motivation: Existing deep learning-based data assimilation research relies on oversimplified scenarios with synthetic observations and lacks standardized benchmarks for fair model comparison.Method: Created DAMBench benchmark integrating high-quality background states from forecasting systems and real-world multi-modal observations (weather stations and satellite imagery), with unified evaluation protocols and benchmarking of representative approaches including latent generative models and neural process frameworks.
Result: Established a rigorous foundation for future research with comprehensive experiments, promoting reproducibility, fair comparison, and extensibility to real-world multi-modal scenarios.
Conclusion: DAMBench addresses critical gaps in data assimilation research by providing the first standardized benchmark with realistic conditions, enabling more effective development and evaluation of data-driven DA models.
Abstract: Data Assimilation is a cornerstone of atmospheric system modeling, tasked with reconstructing system states by integrating sparse, noisy observations with prior estimation. While traditional approaches like variational and ensemble Kalman filtering have proven effective, recent advances in deep learning offer more scalable, efficient, and flexible alternatives better suited for complex, real-world data assimilation involving large-scale and multi-modal observations. However, existing deep learning-based DA research suffers from two critical limitations: (1) reliance on oversimplified scenarios with synthetically perturbed observations, and (2) the absence of standardized benchmarks for fair model comparison. To address these gaps, in this work, we introduce DAMBench, the first large-scale multi-modal benchmark designed to evaluate data-driven DA models under realistic atmospheric conditions. DAMBench integrates high-quality background states from state-of-the-art forecasting systems and real-world multi-modal observations (i.e., real-world weather stations and satellite imagery). All data are resampled to a common grid and temporally aligned to support systematic training, validation, and testing. We provide unified evaluation protocols and benchmark representative data assimilation approaches, including latent generative models and neural process frameworks. Additionally, we propose a lightweight multi-modal plugin to demonstrate how integrating realistic observations can enhance even simple baselines. Through comprehensive experiments, DAMBench establishes a rigorous foundation for future research, promoting reproducibility, fair comparison, and extensibility to real-world multi-modal scenarios. Our dataset and code are publicly available at https://github.com/figerhaowang/DAMBench.
[743] Protecting the Neural Networks against FGSM Attack Using Machine Unlearning
Amir Hossein Khorasani, Ali Jahanian, Maryam Rastgarpour
Main category: cs.LG
TL;DR: Machine unlearning techniques applied to LeNet neural network improve robustness against FGSM adversarial attacks by removing learned perturbations.
Details
Motivation: Machine learning models are vulnerable to adversarial attacks like FGSM that add small perturbations to trick models into misclassification, requiring defensive methods.Method: Applied unlearning techniques to LeNet neural network to forget specific adversarial data points and retrain on original data without FGSM perturbations.
Result: Unlearning FGSM attacks on LeNet significantly improved the model’s robustness against these types of adversarial attacks.
Conclusion: Machine unlearning is an effective technique for enhancing model robustness against adversarial attacks like FGSM by selectively forgetting malicious training data.
Abstract: Machine learning is a powerful tool for building predictive models. However, it is vulnerable to adversarial attacks. Fast Gradient Sign Method (FGSM) attacks are a common type of adversarial attack that adds small perturbations to input data to trick a model into misclassifying it. In response to these attacks, researchers have developed methods for “unlearning” these attacks, which involves retraining a model on the original data without the added perturbations. Machine unlearning is a technique that tries to “forget” specific data points from the training dataset, to improve the robustness of a machine learning model against adversarial attacks like FGSM. In this paper, we focus on applying unlearning techniques to the LeNet neural network, a popular architecture for image classification. We evaluate the efficacy of unlearning FGSM attacks on the LeNet network and find that it can significantly improve its robustness against these types of attacks.
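For context, the FGSM perturbation the paper defends against is a single signed-gradient step on the input; the unlearning procedure itself is not reproduced here. A standard PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon=0.1):
    """Standard FGSM: one signed-gradient step on the input. The paper's
    unlearning step then retrains on clean data; only the attack is shown."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()

# toy usage with a stand-in classifier (not the LeNet from the paper)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
x_adv = fgsm_perturb(model, torch.randn(4, 1, 28, 28), torch.randint(0, 10, (4,)))
```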
[744] Memory-Efficient Training with In-Place FFT Implementation
Xinyu Ding, Bangtian Liu, Siyu Liao, Zhongfeng Wang
Main category: cs.LG
TL;DR: The paper proposes rdFFT, the first real-domain fully in-place FFT framework that eliminates memory allocation issues in traditional FFT implementations by preserving input-output memory space consistency.
Details
Motivation: Existing FFT implementations (including standard FFT and real FFT) cannot achieve true in-place computation due to dimensional mismatch, particularly with rFFT mapping size n input to size n/2+1 complex output, requiring additional memory allocation.Method: Leverages butterfly operation symmetry and conjugate properties in the frequency domain to design an implicit complex encoding scheme that eliminates intermediate cache usage entirely.
Result: Experiments on multiple natural language understanding tasks demonstrate effectiveness in reducing training memory cost.
Conclusion: The proposed rdFFT framework offers a promising direction for frequency-domain lightweight adaptation by enabling true in-place computation while preserving memory space consistency.
Abstract: Fast Fourier Transforms (FFT) are widely used to reduce memory and computational costs in deep learning. However, existing implementations, including standard FFT and real FFT (rFFT), cannot achieve true in-place computation. In particular, rFFT maps an input of size n to a complex output of size n/2+1, causing dimensional mismatch and requiring additional memory allocation. We propose the first real-domain, fully in-place FFT framework (rdFFT) that preserves input-output memory space consistency. By leveraging butterfly operation symmetry and conjugate properties in the frequency domain, we design an implicit complex encoding scheme that eliminates intermediate cache usage entirely. Experiments on multiple natural language understanding tasks demonstrate the method's effectiveness in reducing training memory cost, offering a promising direction for frequency-domain lightweight adaptation.
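The dimensional mismatch motivating rdFFT is easy to reproduce with a standard rFFT: a real input of length n maps to n/2+1 complex values, so the output cannot overwrite the input buffer. A small NumPy illustration of that motivation (not of rdFFT itself):

```python
import numpy as np

x = np.random.randn(8)            # real input of size n = 8 (64 bytes as float64)
X = np.fft.rfft(x)                # complex output of size n/2 + 1 = 5 (80 bytes)
print(x.nbytes, X.nbytes)         # output is larger than the input buffer
x_rec = np.fft.irfft(X, n=8)      # the round trip still recovers the signal
print(np.allclose(x, x_rec))
```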
[745] Leveraging Compact Satellite Embeddings and Graph Neural Networks for Large-Scale Poverty Mapping
Markus B. Pettersson, Adel Daoud
Main category: cs.LG
TL;DR: Graph-based approach using satellite embeddings to predict wealth indices in Sub-Saharan Africa, addressing limited spatial coverage and privacy-displaced coordinates in DHS surveys.
Details
Motivation: Accurate poverty maps are scarce in Global South; DHS surveys have limited spatial coverage and coordinates are displaced for privacy, reducing data quality.Method: Graph-based approach using AlphaEarth satellite embeddings, modeling spatial relations between surveyed/unlabeled locations with probabilistic “fuzzy label” loss to handle coordinate displacement.
Result: Experiments on 37 DHS datasets (2017-2023) show graph structure slightly improves accuracy over “image-only” baselines.
Conclusion: Compact Earth Observation embeddings show potential for large-scale socioeconomic mapping in data-scarce regions.
Abstract: Accurate, fine-grained poverty maps remain scarce across much of the Global South. While Demographic and Health Surveys (DHS) provide high-quality socioeconomic data, their spatial coverage is limited and reported coordinates are randomly displaced for privacy, further reducing their quality. We propose a graph-based approach leveraging low-dimensional AlphaEarth satellite embeddings to predict cluster-level wealth indices across Sub-Saharan Africa. By modeling spatial relations between surveyed and unlabeled locations, and by introducing a probabilistic “fuzzy label” loss to account for coordinate displacement, we improve the generalization of wealth predictions beyond existing surveys. Our experiments on 37 DHS datasets (2017-2023) show that incorporating graph structure slightly improves accuracy compared to “image-only” baselines, demonstrating the potential of compact EO embeddings for large-scale socioeconomic mapping.
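The "fuzzy label" loss is only named in the abstract; one way to read it is as an expected loss over candidate graph nodes near the reported (displaced) coordinate, weighted by an assumed displacement distribution. A speculative sketch under that assumption, not the paper's loss:

```python
import torch

def fuzzy_label_loss(pred_per_node, target, weights):
    """Illustrative 'fuzzy label' loss: the survey wealth index `target` is
    attributed softly to several candidate nodes around the displaced
    coordinate, weighted by an assumed displacement probability `weights`
    (which sums to 1). A guess at the idea, not the authors' formulation."""
    expected_pred = (weights * pred_per_node).sum()
    return (expected_pred - target) ** 2

preds = torch.tensor([0.2, 0.5, 0.1], requires_grad=True)   # candidate-node predictions
w = torch.tensor([0.6, 0.3, 0.1])                            # assumed displacement kernel
loss = fuzzy_label_loss(preds, torch.tensor(0.4), w)
loss.backward()
```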
[746] Real-time Continual Learning on Intel Loihi 2
Elvin Hajizada, Danielle Rager, Timothy Shea, Leobardo Campos-Macias, Andreas Wild, Eyke Hüllermeier, Yulia Sandamirskaya, Mike Davies
Main category: cs.LG
TL;DR: CLP-SNN is a neuromorphic spiking neural network for online continual learning that achieves competitive accuracy while being 70x faster and 5,600x more energy efficient than edge GPU alternatives.
Details
Motivation: AI systems on edge devices need to adapt to changing data distributions and novel classes in open-world environments, but current offline training paradigms struggle with online continual learning in power-constrained settings.Method: Uses a spiking neural network architecture with event-driven sparse local learning, self-normalizing three-factor learning rule, and integrated neurogenesis and metaplasticity for capacity expansion and forgetting mitigation.
Result: On OpenLORIS few-shot learning experiments, CLP-SNN achieves accuracy competitive with replay methods while being rehearsal-free, with 70x faster inference (0.33ms vs 23.2ms) and 5,600x more energy efficient (0.05mJ vs 281mJ) than best edge GPU OCL.
Conclusion: Co-designed brain-inspired algorithms and neuromorphic hardware can break traditional accuracy-efficiency trade-offs for future edge AI systems.
Abstract: AI systems on edge devices face a critical challenge in open-world environments: adapting when data distributions shift and novel classes emerge. While offline training dominates current paradigms, online continual learning (OCL), where models learn incrementally from non-stationary streams without catastrophic forgetting, remains challenging in power-constrained settings. We present a neuromorphic solution called CLP-SNN: a spiking neural network architecture for Continually Learning Prototypes and its implementation on Intel’s Loihi 2 chip. Our approach introduces three innovations: (1) event-driven and spatiotemporally sparse local learning, (2) a self-normalizing three-factor learning rule maintaining weight normalization, and (3) integrated neurogenesis and metaplasticity for capacity expansion and forgetting mitigation. On OpenLORIS few-shot learning experiments, CLP-SNN achieves accuracy competitive with replay methods while being rehearsal-free. CLP-SNN delivers transformative efficiency gains: 70x faster (0.33ms vs 23.2ms), and 5,600x more energy efficient (0.05mJ vs 281mJ) than the best alternative OCL on edge GPU. This demonstrates that co-designed brain-inspired algorithms and neuromorphic hardware can break traditional accuracy-efficiency trade-offs for future edge AI systems.
[747] CG-FKAN: Compressed-Grid Federated Kolmogorov-Arnold Networks for Communication Constrained Environment
Seunghun Yu, Youngjoon Lee, Jinu Gong, Joonhyuk Kang
Main category: cs.LG
TL;DR: CG-FKAN compresses extended grids in federated KANs by sparsifying and transmitting only essential coefficients under communication constraints, achieving better performance than fixed-grid KAN.
Details
Motivation: Federated learning suffers from limited interpretability, and while KANs address this with learnable spline functions, existing FL studies applying KAN overlook the communication overhead from grid extension needed for complex functions.Method: Propose CG-FKAN which compresses extended grids by sparsifying and transmitting only essential coefficients under a communication budget constraint.
Result: CG-FKAN achieves up to 13.6% lower RMSE than fixed-grid KAN in communication-constrained settings, with a theoretical upper bound on approximation error derived.
Conclusion: The proposed CG-FKAN effectively addresses communication overhead in federated KANs while maintaining performance through grid compression techniques.
Abstract: Federated learning (FL), widely used in privacy-critical applications, suffers from limited interpretability, whereas Kolmogorov-Arnold Networks (KAN) address this limitation via learnable spline functions. However, existing FL studies applying KAN overlook the communication overhead introduced by grid extension, which is essential for modeling complex functions. In this letter, we propose CG-FKAN, which compresses extended grids by sparsifying and transmitting only essential coefficients under a communication budget. Experiments show that CG-FKAN achieves up to 13.6% lower RMSE than fixed-grid KAN in communication-constrained settings. In addition, we derive a theoretical upper bound on its approximation error.
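The compression step, transmitting only essential coefficients of the extended grid under a communication budget, could be realized with a simple magnitude-based top-k selection; the actual criterion used by CG-FKAN may differ. A minimal sketch:

```python
import torch

def compress_grid_coeffs(coeffs: torch.Tensor, budget: int):
    """Keep only the `budget` largest-magnitude spline coefficients and send
    them as (index, value) pairs; the receiver scatters them back into a
    zero grid. A sketch of 'sparsify and transmit essential coefficients',
    assuming a simple magnitude criterion."""
    flat = coeffs.flatten()
    idx = torch.topk(flat.abs(), k=min(budget, flat.numel())).indices
    payload = (idx, flat[idx])                              # what the client would send
    restored = torch.zeros_like(flat).scatter(0, idx, flat[idx])
    return payload, restored.view_as(coeffs)

payload, approx = compress_grid_coeffs(torch.randn(4, 32), budget=16)
```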
[748] HIT-ROCKET: Hadamard-vector Inner-product Transformer for ROCKET
Wang Hao, Kuang Zhang, Hou Chengyu, Yuan Zhonghao, Tan Chenxing, Fu Weifeng, Zhu Yangying
Main category: cs.LG
TL;DR: Proposed Hadamard convolutional transform for time series classification that uses Hadamard matrix vectors as convolution kernels, achieving better performance than ROCKET with 50% faster training than miniROCKET.
Details
Motivation: Existing SOTA methods like HIVE-COTE have high computational complexity and long training cycles, while lightweight solutions like ROCKET have room for improvement in kernel selection and computational overhead.Method: Feature extraction using Hadamard convolutional transform with column/row vectors of Hadamard matrices as convolution kernels of varying sizes, maintaining compatibility with existing methods while leveraging kernel orthogonality.
Result: SOTA performance on UCR datasets: F1-score improved by at least 5% vs. ROCKET, with 50% shorter training time than miniROCKET under identical hyperparameters, enabling deployment on ultra-low-power embedded devices.
Conclusion: The Hadamard-based approach provides superior computational efficiency, robustness, and adaptability for time series classification while maintaining full compatibility with existing methods.
Abstract: Time series classification holds broad application value in communications, information countermeasures, finance, and medicine. However, state-of-the-art (SOTA) methods, including HIVE-COTE, Proximity Forest, and TS-CHIEF, exhibit high computational complexity, coupled with lengthy parameter tuning and training cycles. In contrast, lightweight solutions like ROCKET (Random Convolutional Kernel Transform) offer greater efficiency but leave substantial room for improvement in kernel selection and computational overhead. To address these challenges, we propose a feature extraction approach based on the Hadamard convolutional transform, utilizing column or row vectors of Hadamard matrices as convolution kernels with extended lengths of varying sizes. This enhancement maintains full compatibility with existing methods (e.g., ROCKET) while leveraging kernel orthogonality to boost computational efficiency, robustness, and adaptability. Comprehensive experiments on multi-domain datasets, focusing on the UCR time series dataset, demonstrate SOTA performance: F1-score improved by at least 5% vs. ROCKET, with 50% shorter training time than miniROCKET (the fastest ROCKET variant) under identical hyperparameters, enabling deployment on ultra-low-power embedded devices. All code is available on GitHub.
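Hadamard matrices are readily available in SciPy, so the basic kernel construction can be illustrated directly; the kernel lengths, dilations, and pooling used by HIT-ROCKET are assumptions here, loosely following ROCKET-style max and proportion-of-positive-values pooling:

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_features(series: np.ndarray, order: int = 16) -> np.ndarray:
    """Use rows of a Hadamard matrix (+1/-1 entries, mutually orthogonal) as
    fixed convolution kernels and pool each response with max and the
    proportion of positive values, in the spirit of ROCKET-style transforms.
    Kernel length and pooling choices are illustrative assumptions."""
    H = hadamard(order)                       # order must be a power of two
    feats = []
    for kernel in H:
        resp = np.convolve(series, kernel, mode="valid")
        feats += [resp.max(), (resp > 0).mean()]
    return np.array(feats)

print(hadamard_features(np.sin(np.linspace(0, 10, 256))).shape)   # (32,)
```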
[749] The Curvature Rate λ: A Scalar Measure of Input-Space Sharpness in Neural Networks
Jacob Poschl
Main category: cs.LG
TL;DR: The paper introduces a new curvature measure called curvature rate (λ) defined in input space, which tracks the exponential growth rate of higher-order input derivatives and provides a parameterization-invariant way to measure functional smoothness in neural networks.
Details
Motivation: Existing sharpness metrics are defined in parameter space, making them expensive, sensitive to reparameterization, and difficult to interpret functionally. There's a need for curvature measures that are directly defined in input space and more interpretable.Method: Introduces curvature rate λ as the exponential growth rate of higher-order input derivatives, estimated as the slope of log ||D^n f|| versus n. Extends this to neural networks and proposes Curvature Rate Regularization (CRR) to directly shape this curvature during training.
Result: Experiments show λ evolves predictably during training and can be shaped using CRR. Compared to SAM, CRR achieves similar accuracy while yielding flatter input-space geometry and improved confidence calibration.
Conclusion: λ provides a compact, interpretable, and parameterization-invariant descriptor of functional smoothness that unifies classical analytic quantities and can be effectively regularized to improve model robustness.
Abstract: Curvature influences generalization, robustness, and how reliably neural networks respond to small input perturbations. Existing sharpness metrics are typically defined in parameter space (e.g., Hessian eigenvalues) and can be expensive, sensitive to reparameterization, and difficult to interpret in functional terms. We introduce a scalar curvature measure defined directly in input space: the curvature rate λ, given by the exponential growth rate of higher-order input derivatives. Empirically, λ is estimated as the slope of log ||D^n f|| versus n for small n. This growth-rate perspective unifies classical analytic quantities: for analytic functions, λ corresponds to the inverse radius of convergence, and for bandlimited signals, it reflects the spectral cutoff. The same principle extends to neural networks, where λ tracks the emergence of high-frequency structure in the decision boundary. Experiments on analytic functions and neural networks (Two Moons and MNIST) show that λ evolves predictably during training and can be directly shaped using a simple derivative-based regularizer, Curvature Rate Regularization (CRR). Compared to Sharpness-Aware Minimization (SAM), CRR achieves similar accuracy while yielding flatter input-space geometry and improved confidence calibration. By grounding curvature in differentiation dynamics, λ provides a compact, interpretable, and parameterization-invariant descriptor of functional smoothness in learned models.
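For an analytic function with known derivatives the estimator is easy to check: for f(x) = sin(wx) the n-th derivative has sup-norm w^n, so the fitted slope of log ||D^n f|| versus n should recover log w. A minimal numerical check of that estimator (not of the paper's neural-network experiments):

```python
import numpy as np

# For f(x) = sin(w*x), ||D^n f||_inf = w**n, so the slope of log ||D^n f||
# versus n should equal log(w). A sanity check of the slope estimator.
w = 3.0
orders = np.arange(1, 6)
log_norms = orders * np.log(w)               # log(w**n), known analytically here
lam = np.polyfit(orders, log_norms, 1)[0]    # fitted slope, the curvature rate
print(lam, np.log(w))                        # both ~1.0986
```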
[750] Efficient Curvature-aware Graph Network
Chaoqun Fei, Tinglve Zhou, Tianyong Hao, Yangyang Li
Main category: cs.LG
TL;DR: Proposes Effective Resistance Curvature as a computationally efficient alternative to Ollivier-Ricci curvature for GNNs, maintaining geometric expressiveness while drastically reducing computational cost.
Details
Motivation: Ollivier-Ricci curvature provides strong geometric priors for GNNs but has prohibitively high computational complexity that limits its applicability to large-scale graphs.Method: Develops Effective Resistance Curvature that quantifies message passing ease using effective resistance between node pairs instead of optimal transport distance, with proven low computational complexity.
Result: Significantly outperforms Ollivier-Ricci curvature in computational efficiency while achieving competitive performance on diverse GNN tasks with comparable geometric expressiveness.
Conclusion: Effective Resistance Curvature serves as a viable substitute for Ollivier-Ricci curvature, offering similar benefits for GNNs with dramatically reduced computational overhead.
Abstract: Graph curvature provides geometric priors for Graph Neural Networks (GNNs), enhancing their ability to model complex graph structures, particularly in terms of structural awareness, robustness, and theoretical interpretability. Among existing methods, Ollivier-Ricci curvature has been extensively studied due to its strong geometric interpretability, effectively characterizing the local geometric distribution between nodes. However, its prohibitively high computational complexity limits its applicability to large-scale graph datasets. To address this challenge, we propose a novel graph curvature measure–Effective Resistance Curvature–which quantifies the ease of message passing along graph edges using the effective resistance between node pairs, instead of the optimal transport distance. This method significantly outperforms Ollivier-Ricci curvature in computational efficiency while preserving comparable geometric expressiveness. Theoretically, we prove the low computational complexity of effective resistance curvature and establish its substitutability for Ollivier-Ricci curvature. Furthermore, extensive experiments on diverse GNN tasks demonstrate that our method achieves competitive performance with Ollivier-Ricci curvature while drastically reducing computational overhead.
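Effective resistance itself has a closed form via the pseudoinverse of the graph Laplacian, R_uv = L+_uu + L+_vv - 2 L+_uv; the curvature built on top of it in the paper is not reproduced here. A small NumPy sketch of the underlying quantity:

```python
import numpy as np

def effective_resistance(adj: np.ndarray) -> np.ndarray:
    """Pairwise effective resistance from the Laplacian pseudoinverse:
    R_uv = L+_uu + L+_vv - 2 L+_uv. Only the quantity the proposed
    curvature is built on, not the curvature measure itself."""
    L = np.diag(adj.sum(axis=1)) - adj
    Lp = np.linalg.pinv(L)
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2 * Lp

# 4-cycle: opposite nodes have resistance 1 (two parallel 2-edge paths),
# adjacent nodes 0.75 (1 ohm in parallel with 3 ohms)
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
print(np.round(effective_resistance(A), 3))
```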
[751] Gated Fusion Enhanced Multi-Scale Hierarchical Graph Convolutional Network for Stock Movement Prediction
Xiaosha Xue, Peibo Duan, Zhipeng Liu, Qi Chu, Changsheng Zhang, Bin Zhang
Main category: cs.LG
TL;DR: MS-HGFN is a multi-scale hierarchical graph fusion network that improves stock market prediction by capturing intra-attribute patterns and balanced multi-scale features through dynamic graph learning and top-down gating.
Details
Motivation: Existing multi-scale GNNs for stock prediction neglect intra-attribute patterns affecting inter-stock correlations and show biased attention to different feature scales during sampling.Method: Proposes MS-HGFN with hierarchical GNN module that creates dynamic graphs by learning intra-attribute patterns and inter-attribute features across time scales, plus top-down gating for multi-scale feature integration.
Result: Outperforms traditional and advanced models on U.S. and Chinese stock datasets, achieving up to 1.4% accuracy improvement and enhanced stability in return simulations.
Conclusion: MS-HGFN effectively captures spatio-temporal dependencies in stock markets through hierarchical graph learning and balanced multi-scale feature integration, demonstrating superior prediction performance.
Abstract: Accurately predicting stock market movements remains a formidable challenge due to the inherent volatility and complex interdependencies among stocks. Although multi-scale Graph Neural Networks (GNNs) hold potential for modeling these relationships, they frequently neglect two key points: the subtle intra-attribute patterns within each stock affecting inter-stock correlation, and the biased attention to coarse- and fine-grained features during multi-scale sampling. To overcome these challenges, we introduce MS-HGFN (Multi-Scale Hierarchical Graph Fusion Network). The model features a hierarchical GNN module that forms dynamic graphs by learning patterns from intra-attributes and features from inter-attributes over different time scales, thus comprehensively capturing spatio-temporal dependencies. Additionally, a top-down gating approach facilitates the integration of multi-scale spatio-temporal features, preserving critical coarse- and fine-grained features without too much interference. Experiments utilizing real-world datasets from U.S. and Chinese stock markets demonstrate that MS-HGFN outperforms both traditional and advanced models, yielding up to a 1.4% improvement in prediction accuracy and enhanced stability in return simulations. The code is available at https://anonymous.4open.science/r/MS-HGFN.
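The top-down gating idea, letting coarse-scale features control how much fine-scale detail enters the fused representation, can be sketched with a small PyTorch module; the layer sizes and fusion rule below are assumptions, not the MS-HGFN architecture:

```python
import torch
import torch.nn as nn

class TopDownGate(nn.Module):
    """Illustrative top-down gate: coarse-scale features produce a sigmoid
    gate that decides how much fine-scale detail passes through before the
    two scales are fused. Sizes and fusion rule are assumptions."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, coarse: torch.Tensor, fine: torch.Tensor) -> torch.Tensor:
        g = self.gate(coarse)                              # per-feature gate from the coarse scale
        return self.fuse(torch.cat([coarse, g * fine], dim=-1))

fused = TopDownGate(64)(torch.randn(32, 64), torch.randn(32, 64))
```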
[752] Scaling Graph Chain-of-Thought Reasoning: A Multi-Agent Framework with Efficient LLM Serving
Chengying Huan, Ziheng Meng, Yongchao Liu, Zhengyi Yang, Yun Zhu, Yue Yun, Shipeng Li, Rong Gu, Xiabao Wu, Haitao Zhang, Chuntao Hong, Shaonan Ma, Guihai Chen, Chen Tian
Main category: cs.LG
TL;DR: GLM is a multi-agent Graph Chain-of-Thought system that improves reasoning over graph-structured knowledge by decomposing tasks into specialized agents and optimizing LLM serving architecture, achieving significant improvements in accuracy, token efficiency, latency, and throughput.
Details
Motivation: Existing Graph-CoT pipelines suffer from low accuracy, excessive token usage, high latency, and low throughput due to single-agent monolithic prompts, repeated context re-encoding, and inefficient serving execution.Method: GLM decomposes reasoning into specialized agents for classification, reasoning, action generation, and graph retrieval, enabling branching and selective context sharing. It also introduces Graph-CoT-aware LLM inference with graph-specific KV-cache management, priority-based eviction, and pipelined execution.
Result: GLM improves answer accuracy by up to 38%, reduces token cost by up to 95.7%, lowers inference latency by 90.3%, and achieves up to 15.1x higher throughput compared to state-of-the-art Graph-CoT baselines.
Conclusion: GLM enables efficient adoption of complex real-world reasoning at scale by addressing key limitations of existing Graph-CoT systems through multi-agent decomposition and optimized serving architecture.
Abstract: Graph Chain-of-Thought (Graph-CoT) enables large language models (LLMs) to perform step-by-step reasoning over graph-structured knowledge, but existing pipelines suffer from low accuracy, excessive token usage, high latency, and low throughput due to single-agent monolithic prompts, repeated context re-encoding, and inefficient serving execution. We present GLM, the first multi-agent Graph-CoT system co-designed with an optimized LLM serving architecture. GLM decomposes reasoning into specialized agents for classification, reasoning, action generation, and graph retrieval, enabling branching and selective context sharing to reduce prompt length and reasoning iterations while preserving reasoning quality, thereby improving accuracy and reducing overall token consumption. To scale inference, we introduce a Graph-CoT-aware LLM inference mechanism with graph-specific KV-cache management, priority-based eviction, and pipelined execution to improve serving efficiency. Experiments demonstrate that GLM improves answer accuracy by up to 38%, reduces token cost by up to 95.7%, lowers inference latency by 90.3%, and achieves up to 15.1x higher throughput compared to state-of-the-art Graph-CoT baselines, enabling efficient adoption for complex real-world reasoning at scale.
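Priority-based KV-cache eviction is named but not specified in the abstract; a toy version keeps a priority heap over cached entries and drops the lowest-priority one when the budget is exceeded. A purely illustrative sketch, not GLM's serving implementation:

```python
import heapq

class PriorityKVCache:
    """Toy priority-based eviction: each cached graph-context entry carries a
    priority (e.g. how often its subgraph is revisited) and the lowest-priority
    entry is evicted when the capacity is exceeded."""
    def __init__(self, capacity: int):
        self.capacity, self.heap, self.store = capacity, [], {}

    def put(self, key, value, priority):
        if key not in self.store and len(self.store) >= self.capacity:
            while self.heap:
                _, evicted = heapq.heappop(self.heap)       # drop lowest priority
                if evicted in self.store and evicted != key:
                    self.store.pop(evicted)
                    break
        heapq.heappush(self.heap, (priority, key))
        self.store[key] = value

cache = PriorityKVCache(capacity=2)
cache.put("node:17", "kv-block-A", priority=0.9)
cache.put("node:42", "kv-block-B", priority=0.2)
cache.put("node:88", "kv-block-C", priority=0.7)            # evicts node:42
```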
[753] Explore More, Learn Better: Parallel MLLM Embeddings under Mutual Information Minimization
Zhicheng Wang, Chen Ju, Xu Chen, Shuai Xiao, Jinsong Lan, Xiaoyong Zhu, Ying Chen, Zhiguo Cao
Main category: cs.LG
TL;DR: PDF introduces a parallel decoupling framework for multimodal embedding learning that uses MLLMs’ steerability to generate multiple parallel embeddings from one input, achieving significant performance gains with minimal computational overhead.
Details
Motivation: Current embedding models follow the SSC paradigm (single input, singular embedding, contrastive supervision), which collapses rich multimodal inputs into monolithic embeddings and fails to fully exploit MLLM capabilities.Method: PDF conditions a shared MLLM backbone on distinct learnable prefixes to create multiple parallel paths for one input, uses Mutual Information Minimization to ensure diversity, and applies per-path contrastive supervision for semantic alignment.
Result: Significant performance gains across various model sizes: +8.9% for VLM2Vec-LLaVA-1.6-LR (7B), +4.2% for VLM2Vec-Qwen2VL (2B), +3.1% for Qwen2VL (7B). The 2B model surpasses baseline by +2.6% using only half the computational budget.
Conclusion: PDF effectively leverages MLLM steerability to create diverse parallel embeddings, achieving superior performance and efficiency compared to traditional SSC approaches in multimodal embedding learning.
Abstract: Embedding models are a cornerstone of modern AI. Driven by Multimodal Large Language Models (MLLMs), they have made great progress in architecture and data curation, while the holistic paradigm is still limited to SSC, i.e., single input, singular embedding, contrastive supervision, which collapses rich, multifaceted inputs into monolithic embeddings and fails to fully exploit MLLM capabilities. In this paper, we tailor a Parallel Decoupling Framework (PDF) for multimodal embedding learning, by utilizing the proprietary steerability of MLLMs, i.e., their ability to flexibly generate quite differentiated responses under explicit instructions. Concretely, PDF conditions a shared MLLM backbone on distinct, learnable prefixes to roll out multiple parallel paths for one input, then relies on these paths to obtain parallel embeddings. To promote full parallel diversity, we employ Mutual Information Minimization (MIM) as an explicit constraint, coupled with per-path contrastive supervision to maintain semantic alignment. These dual objectives force PDF to yield robust semantic coverage and a generalizable embedding space. Ultimately, the resulting embedding space is accessible at inference via a single forward pass, incurring negligible computational overhead. We instantiate PDF on multiple MLLM backbones and prove its effectiveness on the MMEB benchmark. Significant gains are consistently achieved across various resolutions and model sizes, e.g., boosting the VLM2Vec-LLaVA-1.6-LR model by a remarkable +8.9% (7B), while the VLM2Vec-Qwen2VL models improve by +4.2% (2B) and +3.1% (7B). In terms of efficiency, our 2B model surpasses its baseline by +2.6% using only half the computational budget.
[754] Defining Energy Indicators for Impact Identification on Aerospace Composites: A Physics-Informed Machine Learning Perspective
Natália Ribeiro Marinho, Richard Loendersloot, Frank Grooteman, Jan Willem Wiegman, Uraz Odyurt, Tiedo Tinga
Main category: cs.LG
TL;DR: A physics-informed machine learning framework for impact energy prediction in aerospace composites that combines domain knowledge with feature selection to create interpretable models with 3x better accuracy than conventional methods.
Details
Motivation: Current energy prediction methods for aerospace composites face challenges like data sparsity, signal noise, complex feature interdependencies, and the ill-posed nature of inverse problems in detecting internal damage from low-velocity impacts.Method: Physics-informed framework embedding domain knowledge through dedicated input space, combining observational biases with targeted feature selection. Features extracted from time, frequency, and time-frequency domains, with structured selection process for statistical significance, correlation filtering, dimensionality reduction, and noise robustness.
Result: Significantly improved impact energy prediction accuracy with errors reduced by a factor of three compared to conventional time-series techniques and purely data-driven models. Model validated with experimental data from multiple impact scenarios including pristine and damaged states.
Conclusion: The approach produces compact energy-sensitive indicators with statistical robustness and physical significance, enabling interpretable and traceable impact energy predictions that remain connected to measurable structural responses.
Abstract: Energy estimation is critical to impact identification on aerospace composites, where low-velocity impacts can induce internal damage that is undetectable at the surface. Current methodologies for energy prediction are often constrained by data sparsity, signal noise, complex feature interdependencies, non-linear dynamics, massive design spaces, and the ill-posed nature of the inverse problem. This study introduces a physics-informed framework that embeds domain knowledge into machine learning through a dedicated input space. The approach combines observational biases, which guide the design of physics-motivated features, with targeted feature selection to retain only the most informative indicators. Features are extracted from time, frequency, and time-frequency domains to capture complementary aspects of the structural response. A structured feature selection process integrating statistical significance, correlation filtering, dimensionality reduction, and noise robustness ensures physical relevance and interpretability. Exploratory data analysis further reveals domain-specific trends, yielding a reduced feature set that captures essential dynamic phenomena such as amplitude scaling, spectral redistribution, and transient signal behaviour. Together, these steps produce a compact set of energy-sensitive indicators with both statistical robustness and physical significance, resulting in impact energy predictions that remain interpretable and traceable to measurable structural responses. Using this optimised input space, a fully-connected neural network is trained and validated with experimental data from multiple impact scenarios, including pristine and damaged states. The resulting model demonstrates significantly improved impact energy prediction accuracy, reducing errors by a factor of three compared to conventional time-series techniques and purely data-driven models.
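The abstract lists time-, frequency-, and time-frequency-domain features without naming them; the snippet below shows the kind of simple indicators (peak amplitude, RMS, spectral centroid) such a pipeline might start from. These are illustrative choices, not the paper's feature set or selection procedure:

```python
import numpy as np

def impact_features(signal: np.ndarray, fs: float) -> dict:
    """Examples of simple energy-sensitive indicators in the time and
    frequency domains (amplitude scaling, spectral redistribution).
    Illustrative only; the paper's features and selection differ."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    return {
        "peak_amplitude": float(np.max(np.abs(signal))),
        "rms": float(np.sqrt(np.mean(signal ** 2))),
        "spectral_centroid": float((freqs * spectrum).sum() / spectrum.sum()),
    }

feats = impact_features(np.random.randn(2048), fs=50_000.0)
```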
[755] Estimation of Toeplitz Covariance Matrices using Overparameterized Gradient Descent
Daniel Busbib, Ami Wiesel
Main category: cs.LG
TL;DR: Overparameterized gradient descent with 2P or 4P complex sinusoids enables global convergence for Toeplitz covariance estimation, outperforming state-of-the-art methods while remaining simple and scalable.
Details
Motivation: Recent deep learning advances show simple gradient descent on overparameterized models can be surprisingly effective, motivating revisiting Toeplitz covariance estimation with this approach rather than complex optimization methods.Method: Model P×P covariance as sum of K complex sinusoids with learnable parameters, optimize via gradient descent. Use mild overparameterization (K=2P or 4P) and propose accelerated GD with separate learning rates for amplitudes and frequencies.
Result: Mild overparameterization enables global convergence from random initializations. When frequencies fixed and only amplitudes optimized, landscape is asymptotically benign with any stationary point recovering true covariance. Numerical experiments show matching/exceeding state-of-the-art accuracy.
Conclusion: Overparameterized gradient descent provides simple, scalable alternative to sophisticated optimization methods for Toeplitz covariance estimation, achieving competitive performance with global convergence guarantees.
Abstract: We consider covariance estimation under Toeplitz structure. Numerous sophisticated optimization methods have been developed to maximize the Gaussian log-likelihood under Toeplitz constraints. In contrast, recent advances in deep learning demonstrate the surprising power of simple gradient descent (GD) applied to overparameterized models. Motivated by this trend, we revisit Toeplitz covariance estimation through the lens of overparameterized GD. We model the $P\times P$ covariance as a sum of $K$ complex sinusoids with learnable parameters and optimize them via GD. We show that when $K = P$, GD may converge to suboptimal solutions. However, mild overparameterization ($K = 2P$ or $4P$) consistently enables global convergence from random initializations. We further propose an accelerated GD variant with separate learning rates for amplitudes and frequencies. When frequencies are fixed and only amplitudes are optimized, we prove that the optimization landscape is asymptotically benign and any stationary point recovers the true covariance. Finally, numerical experiments demonstrate that overparameterized GD can match or exceed the accuracy of state-of-the-art methods in challenging settings, while remaining simple and scalable.
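The parameterization is concrete enough to sketch: the Toeplitz covariance is written as a sum of K complex-sinusoid outer products with learnable amplitudes and frequencies, fitted by gradient descent with mild overparameterization (K = 2P). The sketch below substitutes a Frobenius-norm fit to a sample covariance for the paper's Gaussian log-likelihood and uses a single Adam optimizer rather than separate learning rates, both simplifying assumptions:

```python
import torch

P, K = 8, 16                                     # K = 2P: mild overparameterization
freqs = torch.rand(K, requires_grad=True)        # learnable frequencies in [0, 1)
log_amps = torch.zeros(K, requires_grad=True)    # learnable log-amplitudes (amps >= 0)

def model_covariance() -> torch.Tensor:
    n = torch.arange(P, dtype=torch.float32)
    phase = 2 * torch.pi * freqs[:, None] * n[None, :]
    V = torch.complex(torch.cos(phase), torch.sin(phase))   # v_k[n] = e^{i 2 pi f_k n}
    amps = log_amps.exp().to(V.dtype)
    # R = sum_k a_k v_k v_k^H is Hermitian, Toeplitz and positive semidefinite
    return (amps[:, None, None] * V[:, :, None] * V[:, None, :].conj()).sum(0)

R_sample = torch.eye(P, dtype=torch.complex64)   # stand-in for a sample covariance
opt = torch.optim.Adam([freqs, log_amps], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = (model_covariance() - R_sample).abs().pow(2).sum()   # Frobenius stand-in
    loss.backward()                                             # for the likelihood
    opt.step()
```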
[756] Cross-Treatment Effect Estimation for Multi-Category, Multi-Valued Causal Inference via Dynamic Neural Masking
Xiaopeng Ke, Yihan Yu, Ruyue Zhang, Zhishuo Zhou, Fangzhou Shi, Chang Men, Zhengdan Zhu
Main category: cs.LG
TL;DR: XTNet is a novel network architecture for multi-category, multi-valued treatment effect estimation that captures complex cross-effects between heterogeneous interventions using dynamic masking mechanisms and decomposition strategies.
Details
Motivation: Existing causal inference methods are limited to binary or single-type treatments and struggle with complex cross-effects between heterogeneous interventions, having restrictive assumptions and inadequate evaluation frameworks.Method: XTNet uses a cross-effect estimation module with dynamic masking mechanisms to capture treatment interactions without restrictive structural assumptions, employing a decomposition strategy that separates basic effects from cross-treatment interactions.
Result: XTNet consistently outperforms state-of-the-art baselines in both ranking accuracy and effect estimation quality on synthetic and real-world datasets, with real-world A/B tests confirming its effectiveness.
Conclusion: XTNet provides an effective solution for multi-category, multi-valued treatment effect estimation with superior performance over existing methods, addressing the limitations of current approaches in handling complex intervention scenarios.
Abstract: Counterfactual causal inference faces significant challenges when extended to multi-category, multi-valued treatments, where complex cross-effects between heterogeneous interventions are difficult to model. Existing methodologies remain constrained to binary or single-type treatments and suffer from restrictive assumptions, limited scalability, and inadequate evaluation frameworks for complex intervention scenarios. We present XTNet, a novel network architecture for multi-category, multi-valued treatment effect estimation. Our approach introduces a cross-effect estimation module with dynamic masking mechanisms to capture treatment interactions without restrictive structural assumptions. The architecture employs a decomposition strategy separating basic effects from cross-treatment interactions, enabling efficient modeling of combinatorial treatment spaces. We also propose MCMV-AUCC, a suitable evaluation metric that accounts for treatment costs and interaction effects. Extensive experiments on synthetic and real-world datasets demonstrate that XTNet consistently outperforms state-of-the-art baselines in both ranking accuracy and effect estimation quality. The results of the real-world A/B test further confirm its effectiveness.
[757] Bayesian Natural Gradient Fine-Tuning of CLIP Models via Kalman Filtering
Hossein Abdi, Mingfei Sun, Wei Pan
Main category: cs.LG
TL;DR: The paper proposes a Bayesian approximation of Natural Gradient Descent using Kalman filter for fine-tuning CLIP models, achieving better ID performance and OOD robustness compared to existing methods.
Details
Motivation: Existing first-order gradient-based optimizers for CLIP fine-tuning suffer from slow convergence, sensitivity to hyperparameters, and poor OOD generalization. Second-order methods like NGD offer better updates but are computationally expensive due to the inverse Fisher Information Matrix.Method: Proposes a Bayesian approximation of Natural Gradient Descent using Kalman filter for CLIP models, combining second-order optimization with Bayesian inference to enhance generalization and provide uncertainty quantification.
Result: Extensive experiments show the method consistently achieves superior or comparable in-distribution performance and improved out-of-distribution robustness compared to state-of-the-art baselines.
Conclusion: This work represents the first successful application of Kalman filtering to fine-tuning CLIP-based models, enabling more robust and efficient learning in vision-language tasks.
Abstract: Vision-language pre-trained models, such as CLIP, have established new benchmarks in multimodal data mining. In such models, few-shot fine-tuning is a major challenge to achieve optimal performance on both in-distribution (ID) and out-of-distribution (OOD) datasets, especially when labeled data is scarce. Most existing fine-tuning approaches rely on first-order gradient-based optimizers, which typically suffer from slow convergence, sensitivity to step-size hyperparameters, and poor generalization in OOD settings. In contrast, second-order methods utilize local curvature information of the loss landscape to adjust the update step size. This is particularly beneficial for CLIP models, whose non-convex loss functions often contain sharp critical points. In such cases, natural gradient direction can offer more substantial and efficient per-iteration updates when fine-tuning with limited data. Natural Gradient Descent (NGD) is obtained by preconditioning the standard gradient with the inverse Fisher Information Matrix (FIM), which is computationally expensive for large models. To address this, we propose a Bayesian approximation of NGD using a Kalman filter for CLIP models. Our method combines the benefits of second-order optimization with Bayesian inference, which enhances generalization while providing uncertainty quantification. Extensive experiments conducted on diverse image classification datasets demonstrate that our algorithm consistently achieves superior–or comparable–ID performance and improved OOD robustness compared to state-of-the-art baselines. To the best of our knowledge, this work represents the first successful application of Kalman filtering to fine-tuning CLIP-based models, which enables more robust and efficient learning in vision-language tasks.
[758] Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding
Jungyeon Koh, Hyun Jong Yang
Main category: cs.LG
TL;DR: A unified framework for joint optimization of user association and resource allocation to enable efficient parallel speculative decoding in mobile edge computing systems.
Details
Motivation: Address the communication overhead and asynchronous delays in speculative decoding for on-device LLM inference in resource-constrained mobile edge computing environments.Method: Propose a unified framework that jointly optimizes user association and resource allocation using multi-agent deep reinforcement learning, evaluated with Sionna simulator.
Result: Achieves up to 28.0% and average 23.7% reduction in end-to-end latency without compromising inference accuracy.
Conclusion: Enables scalable and low-latency LLM services in mobile edge computing systems through optimized speculative decoding.
Abstract: The growing demand for on-device large language model (LLM) inference highlights the need for efficient mobile edge computing (MEC) solutions, especially in resource-constrained settings. Speculative decoding offers a promising solution by partitioning token generation between a lightweight draft model on mobile devices and a powerful target model on edge servers, but suffers from communication overhead and asynchronous delays. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation (UARA) to support efficient parallel speculative decoding. We solve the UARA problem using a multi-agent deep reinforcement learning algorithm. To evaluate our approach under realistic conditions, we conduct experiments using the Sionna simulator. Results show that our method achieves up to 28.0% and an average of 23.7% reduction in end-to-end latency without compromising inference accuracy, enabling scalable and low-latency LLM services in MEC systems.
[759] Edge AI in Highly Volatile Environments: Is Fairness Worth the Accuracy Trade-off?
Obaidullah Zaland, Feras M. Awaysheh, Sawsan Al Zubi, Abdul Rahman Safi, Monowar Bhuyan
Main category: cs.LG
TL;DR: This paper analyzes the trade-off between model accuracy and fairness in federated learning within volatile edge environments, evaluating fairness-based client selection algorithms against random and greedy approaches.
Details
Motivation: Federated learning enables collaborative model training while preserving data privacy, but volatile edge environments with dynamic resources and heterogeneous clients challenge achieving both high accuracy and fairness in client participation.Method: Extensive empirical evaluation of fairness-based client selection algorithms (RBFF and RBCSF) compared to random and greedy selection, tested on three datasets (CIFAR10, FashionMNIST, EMNIST) regarding fairness, model performance, and training time.
Result: More equitable client selection algorithms provide marginally better opportunities among clients but result in slower global training in volatile environments.
Conclusion: The study reveals fairness-performance and fairness-speed trade-offs in volatile edge FL environments, highlighting the need for future research to address pitfalls in fair client selection strategies.
Abstract: Federated learning (FL) has emerged as a transformative paradigm for edge intelligence, enabling collaborative model training while preserving data privacy across distributed personal devices. However, the inherent volatility of edge environments, characterized by dynamic resource availability and heterogeneous client capabilities, poses significant challenges for achieving high accuracy and fairness in client participation. This paper investigates the fundamental trade-off between model accuracy and fairness in highly volatile edge environments. It provides an extensive empirical evaluation of fairness-based client selection algorithms such as RBFF and RBCSF against random and greedy client selection regarding fairness, model performance, and time, on three benchmarking datasets (CIFAR10, FashionMNIST, and EMNIST). This work aims to shed light on the fairness-performance and fairness-speed trade-offs in a volatile edge environment and explore potential future research opportunities to address existing pitfalls in fair client selection strategies in FL. Our results indicate that more equitable client selection algorithms, while providing a marginally better opportunity among clients, can result in slower global training in volatile environments (the code for our experiments can be found at https://github.com/obaidullahzaland/FairFL_FLTA).
[760] Towards Efficient Federated Learning of Networked Mixture-of-Experts for Mobile Edge Computing
Song Gao, Shusen Jing, Shuai Zhang, Yue Wang, Xiangwei Zhou, Songyang Zhang
Main category: cs.LG
TL;DR: The paper proposes a Networked Mixture-of-Experts (NMoE) system for collaborative inference in mobile edge computing, addressing computational limitations of edge devices through federated learning with supervised and self-supervised approaches.
Details
Motivation: Large AI models require substantial computational resources that conflict with the limited storage and computational capacity of edge devices, posing challenges for training and deploying these models at the network edge.Method: Introduces NMoE system where clients collaborate by distributing tasks to suitable neighbors based on expertise and aggregating results. Uses federated learning framework integrating both supervised and self-supervised learning to balance personalization and generalization while preserving communication efficiency and data privacy.
Result: Extensive experiments demonstrate the efficacy of the proposed NMoE system, providing insights and benchmarks for the NMoE training algorithms.
Conclusion: The NMoE system effectively addresses the computational limitations of edge devices for large AI models through collaborative inference and federated learning approaches.
Abstract: Recent advancements in large artificial intelligence models (LAMs) are driving significant innovations in mobile edge computing within next-generation wireless networks. However, the substantial demands for computational resources and large-scale training data required to train LAMs conflict with the limited storage and computational capacity of edge devices, posing significant challenges to training and deploying LAMs at the edge. In this work, we introduce the Networked Mixture-of-Experts (NMoE) system, in which clients infer collaboratively by distributing tasks to suitable neighbors based on their expertise and aggregate the returned results. For training the NMoE, we propose a federated learning framework that integrates both supervised and self-supervised learning to balance personalization and generalization, while preserving communication efficiency and data privacy. We conduct extensive experiments to demonstrate the efficacy of the proposed NMoE system, providing insights and benchmarks for the NMoE training algorithms.
[761] Game-theoretic distributed learning of generative models for heterogeneous data collections
Dmitrij Schlesinger, Boris Flach
Main category: cs.LG
TL;DR: The paper proposes exchanging synthetic data instead of model parameters for distributed learning to handle heterogeneous local models and data, treating models as black boxes with semi-supervised learning capabilities.
Details
Motivation: To address challenges in distributed learning from handling heterogeneous local models and data, leveraging recent success of generative models.Method: Formulate local model learning as a cooperative game using game theory principles, prove existence of unique Nash equilibrium for exponential family models, and show convergence to equilibrium.
Result: The approach converges to Nash equilibrium and demonstrates advantages on standard benchmark vision datasets for image classification and conditional generation.
Conclusion: Exchanging synthetic data enables effective distributed learning across heterogeneous models and data modalities while maintaining model privacy as black boxes.
Abstract: One of the main challenges in distributed learning arises from the difficulty of handling heterogeneous local models and data. In light of the recent success of generative models, we propose to meet this challenge by building on the idea of exchanging synthetic data instead of sharing model parameters. Local models can then be treated as “black boxes” with the ability to learn their parameters from data and to generate data according to these parameters. Moreover, if the local models admit semi-supervised learning, we can extend the approach by enabling local models on different probability spaces. This makes it possible to handle heterogeneous data with different modalities. We formulate the learning of the local models as a cooperative game starting from the principles of game theory. We prove the existence of a unique Nash equilibrium for exponential family local models and show that the proposed learning approach converges to this equilibrium. We demonstrate the advantages of our approach on standard benchmark vision datasets for image classification and conditional generation.
[762] An Open-Access Benchmark of Statistical and Machine-Learning Anomaly Detection Methods for Battery Applications
Mei-Chin Pang, Suraj Adhikari, Takuma Kasahara, Nagihiro Haba, Saneyuki Ohno
Main category: cs.LG
TL;DR: OSBAD is an open-source benchmark for battery anomaly detection that evaluates 15 algorithms, enhances anomaly separability through physics-informed feature transformation, and enables automated hyperparameter tuning via Bayesian optimization.
Details
Motivation: Battery safety is critical in applications like consumer electronics and electric vehicles, where undetected anomalies could trigger safety hazards or costly downtime. There's a need for systematic comparison of anomaly detection methods across heterogeneous datasets.Method: Benchmarked 15 diverse algorithms including statistical, distance-based, and unsupervised machine-learning methods. Used physics- and statistics-informed feature transformation to decompose collective anomalies into point anomalies. Proposed Bayesian optimization pipeline for automated hyperparameter tuning based on transfer-learning and regression proxies.
Result: Validated on datasets covering both liquid and solid-state chemistries, demonstrating cross-chemistry generalization capability to identify irregularities across different electrochemical systems. Established a unified foundation for developing safe, scalable, and transferable anomaly detection tools.
Conclusion: Physics- and statistics-informed feature engineering combined with probabilistic hyperparameter tuning advances trustworthy, data-driven diagnostics for safety-critical energy systems. OSBAD provides open-source reproducible anomaly detection workflows for the community.
Abstract: Battery safety is critical in applications ranging from consumer electronics to electric vehicles and aircraft, where undetected anomalies could trigger safety hazards or costly downtime. In this study, we present OSBAD as an open-source benchmark for anomaly detection frameworks in battery applications. By benchmarking 15 diverse algorithms encompassing statistical, distance-based, and unsupervised machine-learning methods, OSBAD enables a systematic comparison of anomaly detection methods across heterogeneous datasets. In addition, we demonstrate how a physics- and statistics-informed feature transformation workflow enhances anomaly separability by decomposing collective anomalies into point anomalies. To address a major bottleneck in unsupervised anomaly detection due to incomplete labels, we propose a Bayesian optimization pipeline that facilitates automated hyperparameter tuning based on transfer-learning and regression proxies. Through validation on datasets covering both liquid and solid-state chemistries, we further demonstrate the cross-chemistry generalization capability of OSBAD to identify irregularities across different electrochemical systems. By making the benchmarking database, together with open-source, reproducible anomaly detection workflows, available to the community, OSBAD establishes a unified foundation for developing safe, scalable, and transferable anomaly detection tools in battery analytics. This research underscores the significance of physics- and statistics-informed feature engineering as well as model selection with probabilistic hyperparameter tuning, in advancing trustworthy, data-driven diagnostics for safety-critical energy systems.
[763] HyperNQ: A Hypergraph Neural Network Decoder for Quantum LDPC Codes
Ameya S. Bhave, Navnil Choudhury, Kanad Basu
Main category: cs.LG
TL;DR: HyperNQ is a hypergraph neural network-based decoder for quantum LDPC codes that captures higher-order stabilizer constraints, improving logical error rates by up to 84% over belief propagation and 50% over GNN-based decoders.
Details
Motivation: Traditional QLDPC decoding methods like Belief Propagation suffer from poor convergence due to short cycles, while Graph Neural Networks are limited to pairwise interactions and cannot capture higher-order correlations needed for effective quantum error correction.Method: Proposed HyperNQ, a Hypergraph Neural Network-based QLDPC decoder that uses hyperedges to capture higher-order stabilizer constraints through a two-stage message passing scheme, enabling more expressive and compact decoding.
Result: Below the pseudo-threshold mark, HyperNQ improves Logical Error Rate by up to 84% over Belief Propagation and 50% over GNN-based strategies, demonstrating enhanced performance over existing state-of-the-art decoders.
Conclusion: HyperNQ represents a significant advancement in QLDPC decoding by effectively capturing higher-order correlations through hypergraph neural networks, offering substantial improvements in logical error rates for quantum error correction applications.
Abstract: Quantum computing requires effective error correction strategies to mitigate noise and decoherence. Quantum Low-Density Parity-Check (QLDPC) codes have emerged as a promising solution for scalable Quantum Error Correction (QEC) applications by supporting constant-rate encoding and a sparse parity-check structure. However, decoding QLDPC codes via traditional approaches such as Belief Propagation (BP) suffers from poor convergence in the presence of short cycles. Machine learning techniques like Graph Neural Networks (GNNs) utilize learned message passing over their node features; however, they are restricted to pairwise interactions on Tanner graphs, which limits their ability to capture higher-order correlations. In this work, we propose HyperNQ, the first Hypergraph Neural Network (HGNN)-based QLDPC decoder that captures higher-order stabilizer constraints by utilizing hyperedges, thus enabling highly expressive and compact decoding. We use a two-stage message passing scheme and evaluate the decoder over the pseudo-threshold region. Below the pseudo-threshold mark, HyperNQ improves the Logical Error Rate (LER) by up to 84% over BP and 50% over GNN-based strategies, demonstrating enhanced performance over the existing state-of-the-art decoders.
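A hedged sketch of the kind of two-stage hypergraph message passing described above is given below: node features are aggregated into hyperedge (stabilizer) messages and then broadcast back to nodes. The layer sizes, ReLU nonlinearity, and residual update are assumptions for illustration, not the HyperNQ architecture.

```python
import torch
import torch.nn as nn


class TwoStageHypergraphLayer(nn.Module):
    """Nodes -> hyperedges -> nodes, with an incidence matrix H of shape (n_nodes, n_edges)."""

    def __init__(self, dim: int):
        super().__init__()
        self.node_to_edge = nn.Linear(dim, dim)
        self.edge_to_node = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        deg_e = H.sum(dim=0, keepdim=True).clamp(min=1.0)                 # hyperedge degrees
        deg_v = H.sum(dim=1, keepdim=True).clamp(min=1.0)                 # node degrees
        edge_msg = torch.relu(self.node_to_edge(H.t() @ x / deg_e.t()))   # stage 1: node -> edge
        node_msg = torch.relu(self.edge_to_node(H @ edge_msg / deg_v))    # stage 2: edge -> node
        return x + node_msg                                               # residual update


# toy usage: 6 error nodes, 3 stabilizer hyperedges
x = torch.randn(6, 16)
H = torch.tensor([[1, 0, 1], [1, 1, 0], [0, 1, 1],
                  [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=torch.float32)
out = TwoStageHypergraphLayer(16)(x, H)
print(out.shape)  # torch.Size([6, 16])
```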
[764] Fractional Diffusion Bridge Models
Gabriel Nobis, Maximilian Springenberg, Arina Belova, Rembert Daems, Christoph Knochenhauer, Manfred Opper, Tolga Birdal, Wojciech Samek
Main category: cs.LG
TL;DR: FDBM is a generative diffusion bridge framework using fractional Brownian motion approximation to model real stochastic processes with memory effects, long-range dependencies, and anomalous diffusion phenomena, outperforming standard Brownian motion baselines in protein structure prediction and image translation.
Details
Motivation: Real stochastic processes exhibit memory effects, long-range dependencies, roughness and anomalous diffusion that standard diffusion models using Brownian motion cannot capture, requiring a more sophisticated approach.Method: Leverage Markovian approximation of fractional Brownian motion (MA-fBM) to construct FDBM, prove existence of coupling-preserving generative diffusion bridge, and extend to Schrödinger bridge problem with principled loss function for unpaired data translation.
Result: FDBM achieves superior performance compared to Brownian baselines: lower RMSD of Cα atomic positions in protein structure prediction and lower FID in unpaired image translation.
Conclusion: FDBM provides an effective framework for modeling complex stochastic processes with memory and long-range dependencies, outperforming traditional Brownian motion approaches in both paired and unpaired data settings.
Abstract: We present Fractional Diffusion Bridge Models (FDBM), a novel generative diffusion bridge framework driven by an approximation of the rich and non-Markovian fractional Brownian motion (fBM). Real stochastic processes exhibit a degree of memory effects (correlations in time), long-range dependencies, roughness and anomalous diffusion phenomena that are not captured in standard diffusion or bridge modeling due to the use of Brownian motion (BM). As a remedy, leveraging a recent Markovian approximation of fBM (MA-fBM), we construct FDBM that enable tractable inference while preserving the non-Markovian nature of fBM. We prove the existence of a coupling-preserving generative diffusion bridge and leverage it for future state prediction from paired training data. We then extend our formulation to the Schrödinger bridge problem and derive a principled loss function to learn the unpaired data translation. We evaluate FDBM on both tasks: predicting future protein conformations from aligned data, and unpaired image translation. In both settings, FDBM achieves superior performance compared to the Brownian baselines, yielding lower root mean squared deviation (RMSD) of C$_\alpha$ atomic positions in protein structure prediction and lower Fréchet Inception Distance (FID) in unpaired image translation.
[765] Bayesian Coreset Optimization for Personalized Federated Learning
Prateek Chanda, Shrey Modi, Ganesh Ramakrishnan
Main category: cs.LG
TL;DR: Proposes a personalized coreset weighted federated learning method that uses representative data points instead of entire client datasets, achieving minimax optimal generalization error bounds and showing performance gains over random sampling and submodular optimization approaches.
Details
Motivation: To address the cumbersome nature of training on entire individual client datasets in federated learning settings with multiple clients.Method: Personalized coreset weighted federated learning where training updates are based on individual client coreset representative data points rather than full datasets.
Result: Theoretical analysis shows minimax optimal generalization error bounds (upper bounded by O(n_k^{-2β/(2β+Λ)} log^{2δ’}(n_k)) and lower bounds of O(n_k^{-2β/(2β+Λ)})). Experiments on benchmark datasets show significant gains over random sampling and submodular optimization approaches.
Conclusion: Intelligently selecting training samples through coreset-based approaches can significantly improve performance in personalized federated learning settings, particularly demonstrating advantages over existing methods.
Abstract: In a distributed machine learning setting like Federated Learning, where multiple clients send updates of their individual weights to a single central server, training on each client’s entire dataset often becomes cumbersome. To address this issue we propose $\methodprop$: a personalized coreset-weighted federated learning setup where the training updates for each individual client are forwarded to the central server based only on that client’s coreset of representative data points instead of the entire client data. Through theoretical analysis we present how the average generalization error is minimax optimal up to logarithm bounds (upper bounded by $\mathcal{O}(n_k^{-\frac{2 \beta}{2 \beta+\boldsymbol{\Lambda}}} \log ^{2 \delta^{\prime}}(n_k))$) and lower bounds of $\mathcal{O}(n_k^{-\frac{2 \beta}{2 \beta+\boldsymbol{\Lambda}}})$, and how the overall generalization error on the data likelihood differs from a vanilla Federated Learning setup as a closed-form function ${\boldsymbol{\Im}}(\boldsymbol{w}, n_k)$ of the coreset weights $\boldsymbol{w}$ and coreset sample size $n_k$. Our experiments on different benchmark datasets based on a variety of recent personalized federated learning architectures show significant gains as compared to random sampling on the training data followed by federated learning, thereby indicating how intelligently selecting such training samples can improve performance. Additionally, through experiments on medical datasets our proposed method showcases some gains as compared to other submodular optimization-based approaches used for subset selection on clients’ data.
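The core mechanic, clients optimizing only on weighted coreset points before federated averaging, can be sketched as follows for a toy linear model; the weighting scheme and the coreset selection are placeholders, not the paper's algorithm or theory.

```python
import numpy as np


def client_update(w, X, y, coreset_idx, coreset_wts, lr=0.1, steps=50):
    """Gradient steps on a weighted least-squares loss over the client's coreset only."""
    Xc, yc, s = X[coreset_idx], y[coreset_idx], coreset_wts[:, None]
    for _ in range(steps):
        grad = (s * Xc).T @ (Xc @ w - yc) / len(coreset_idx)
        w = w - lr * grad
    return w


def federated_round(w_global, clients):
    """clients: list of (X, y, coreset_idx, coreset_wts); plain averaging of coreset updates."""
    updates = [client_update(w_global.copy(), *c) for c in clients]
    return np.mean(updates, axis=0)


rng = np.random.default_rng(0)
w = np.zeros(5)
clients = []
for _ in range(3):
    X, true_w = rng.normal(size=(200, 5)), rng.normal(size=5)
    y = X @ true_w + 0.1 * rng.normal(size=200)
    idx = rng.choice(200, size=20, replace=False)     # coreset indices (placeholder selection)
    clients.append((X, y, idx, np.ones(20)))          # uniform coreset weights for the sketch
print(federated_round(w, clients)[:3])
```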
[766] Dynamic Reconstruction of Ultrasound-Derived Flow Fields With Physics-Informed Neural Fields
Viraj Patel, Lisa Kreusser, Katharine Fraser
Main category: cs.LG
TL;DR: A physics-informed neural field model with multi-scale Fourier Feature encoding is proposed to estimate blood flow from sparse and noisy ultrasound data without ground truth supervision, achieving accurate denoising and inpainting.
Details
Motivation: Blood flow analysis is valuable for diagnosis but ultrasound suffers from attenuation with depth, limiting image quality. EchoPIV has limitations in accurately measuring blood velocity due to technique constraints and complex flow dynamics.Method: Physics-informed neural field model with multi-scale Fourier Feature encoding that works without ground truth supervision, adapting methods from other imaging modalities to ultrasound flow reconstruction.
Result: The model achieves consistently low mean squared error in denoising and inpainting both synthetic and real datasets, verified against reference flow fields and ground truth flow rate measurements.
Conclusion: Physics-informed machine learning can enhance accuracy and robustness in ultrasound-based blood flow reconstruction, addressing challenges of noisy/incomplete data that challenge purely data-driven approaches.
Abstract: Blood flow is sensitive to disease and provides insight into cardiac function, making flow field analysis valuable for diagnosis. However, while safer than radiation-based imaging and more suitable for patients with medical implants, ultrasound suffers from attenuation with depth, limiting the quality of the image. Despite advances in echocardiographic particle image velocimetry (EchoPIV), accurately measuring blood velocity remains challenging due to the technique’s limitations and the complexity of blood flow dynamics. Physics-informed machine learning can enhance accuracy and robustness, particularly in scenarios where noisy or incomplete data challenge purely data-driven approaches. We present a physics-informed neural field model with multi-scale Fourier Feature encoding for estimating blood flow from sparse and noisy ultrasound data without requiring ground truth supervision. We demonstrate that this model achieves consistently low mean squared error in denoising and inpainting both synthetic and real datasets, verified against reference flow fields and ground truth flow rate measurements. While physics-informed neural fields have been widely used to reconstruct medical images, applications to medical flow reconstruction are mostly prominent in Flow MRI. In this work, we adapt methods that have proven effective in other imaging modalities to address the specific challenge of ultrasound-based flow reconstruction.
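The multi-scale Fourier Feature encoding mentioned above typically follows the standard random-features recipe; the sketch below shows one plausible form, with the number of features and the bandwidth scales chosen arbitrarily for illustration.

```python
import numpy as np


def multiscale_fourier_features(coords, n_features=64, scales=(1.0, 10.0, 100.0), seed=0):
    """Map (x, y, t) coordinates to concatenated sin/cos features at several bandwidths."""
    rng = np.random.default_rng(seed)
    feats = []
    for s in scales:
        B = rng.normal(scale=s, size=(coords.shape[1], n_features))  # random projection
        proj = 2.0 * np.pi * coords @ B
        feats.append(np.sin(proj))
        feats.append(np.cos(proj))
    return np.concatenate(feats, axis=1)   # would be fed to the neural field MLP


coords = np.random.rand(1024, 3)                        # (x, y, t) samples in [0, 1]^3
print(multiscale_fourier_features(coords).shape)        # (1024, 384)
```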
[767] No-rank Tensor Decomposition Using Metric Learning
Maryam Bagherian
Main category: cs.LG
TL;DR: A no-rank tensor decomposition framework using metric learning replaces reconstruction objectives with discriminative similarity optimization, achieving better semantic structure capture than traditional methods.
Details
Motivation: Traditional tensor decomposition methods based on reconstruction and fixed-rank constraints often fail to capture semantically meaningful structures in high-dimensional data.Method: Proposes a metric learning approach with triplet loss optimization and diversity/uniformity regularization, creating embeddings where distance reflects semantic similarity without rank constraints.
Result: Outperforms PCA, t-SNE, UMAP, and tensor decomposition baselines across multiple domains (face recognition, brain connectivity, simulated data) with substantial improvements in clustering metrics and better performance with smaller datasets.
Conclusion: Metric learning establishes a new paradigm for tensor-based analysis that prioritizes semantic relevance over pixel-level fidelity and offers computational advantages in data-scarce scenarios.
Abstract: Tensor decomposition faces fundamental challenges in analyzing high-dimensional data, where traditional methods based on reconstruction and fixed-rank constraints often fail to capture semantically meaningful structures. This paper introduces a no-rank tensor decomposition framework grounded in metric learning, which replaces reconstruction objectives with a discriminative, similarity-based optimization. The proposed approach learns data-driven embeddings by optimizing a triplet loss with diversity and uniformity regularization, creating a feature space where distance directly reflects semantic similarity. We provide theoretical guarantees for the framework’s convergence and establish bounds on its metric properties. Evaluations across diverse domains, including face recognition (LFW, Olivetti), brain connectivity analysis (ABIDE), and simulated data (galaxy morphology, crystal structures), demonstrate that our method outperforms baseline techniques, including PCA, t-SNE, UMAP, and tensor decomposition baselines (CP and Tucker). Results show substantial improvements in clustering metrics (Silhouette Score, Davies–Bouldin Index, Calinski–Harabasz Index, Separation Ratio, Adjusted Rand Index, Normalized Mutual Information) and reveal a fundamental trade-off: while metric learning optimizes global class separation, it deliberately transforms local geometry to align with semantic relationships. Crucially, our approach achieves superior performance with smaller training datasets compared to transformer-based methods, offering an efficient alternative for domains with limited labeled data. This work establishes metric learning as a paradigm for tensor-based analysis, prioritizing semantic relevance over pixel-level fidelity while providing computational advantages in data-scarce scenarios.
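A minimal sketch of a triplet objective with a uniformity-style regularizer, as the abstract describes, is shown below; the margin, temperature, and the specific regularizer form are assumptions rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F


def triplet_uniformity_loss(anchor, positive, negative, margin=0.2, lam=0.1, t=2.0):
    """Triplet loss on L2-normalized embeddings plus a log-mean-exp uniformity penalty."""
    a, p, n = (F.normalize(z, dim=1) for z in (anchor, positive, negative))
    triplet = F.relu((a - p).pow(2).sum(1) - (a - n).pow(2).sum(1) + margin).mean()
    z = torch.cat([a, p, n], dim=0)
    uniformity = torch.pdist(z).pow(2).mul(-t).exp().mean().log()  # spread embeddings apart
    return triplet + lam * uniformity


emb = torch.nn.Linear(128, 32)                       # stand-in embedding network
x_a, x_p, x_n = (torch.randn(64, 128) for _ in range(3))
loss = triplet_uniformity_loss(emb(x_a), emb(x_p), emb(x_n))
loss.backward()
print(float(loss))
```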
[768] Machine and Deep Learning for Indoor UWB Jammer Localization
Hamed Fard, Mahsa Kholghi, Benedikt Groß, Gerhard Wunder
Main category: cs.LG
TL;DR: Proposes a domain-adversarial ConvNeXt autoencoder (A-CNT) for robust indoor jammer localization that addresses performance degradation due to environmental layout changes, achieving 77% improvement over non-adversarial transfer learning.
Details
Motivation: UWB localization provides centimeter accuracy but is vulnerable to jamming attacks, and existing ML/DL methods struggle with localizing jammers across changing indoor environments due to domain shift issues.Method: Introduces two UWB datasets under different room configurations, establishes ML/DL baselines, and proposes A-CNT framework with gradient-reversal layer to align CIR-derived features across domains for domain adaptation.
Result: A-CNT reduces mean Euclidean error to 34.67 cm (77% improvement over non-adversarial transfer learning, 83% over best baseline), restoring fraction of samples within 30 cm to 0.56, while source-trained models suffered severe degradation (XGBoost error increased from 20.16 cm to 207.99 cm).
Conclusion: Adversarial feature alignment enables robust and transferable indoor jammer localization despite environmental changes, demonstrating the effectiveness of domain adaptation techniques for UWB security applications.
Abstract: Ultra-wideband (UWB) localization delivers centimeter-scale accuracy but is vulnerable to jamming attacks, creating security risks for asset tracking and intrusion detection in smart buildings. Although machine learning (ML) and deep learning (DL) methods have improved tag localization, localizing malicious jammers within a single room and across changing indoor layouts remains largely unexplored. Two novel UWB datasets, collected under original and modified room configurations, are introduced to establish comprehensive ML/DL baselines. Performance is rigorously evaluated using a variety of classification and regression metrics. On the source dataset with the collected UWB features, Random Forest achieves the highest F1-macro score of 0.95 and XGBoost achieves the lowest mean Euclidean error of 20.16 cm. However, deploying these source-trained models in the modified room layout led to severe performance degradation, with XGBoost’s mean Euclidean error increasing tenfold to 207.99 cm, demonstrating significant domain shift. To mitigate this degradation, a domain-adversarial ConvNeXt autoencoder (A-CNT) is proposed that leverages a gradient-reversal layer to align CIR-derived features across domains. The A-CNT framework restores localization performance by reducing the mean Euclidean error to 34.67 cm. This represents a 77 percent improvement over non-adversarial transfer learning and an 83 percent improvement over the best baseline, restoring the fraction of samples within 30 cm to 0.56. Overall, the results demonstrate that adversarial feature alignment enables robust and transferable indoor jammer localization despite environmental changes. Code and dataset available at https://github.com/afbf4c8996f/Jammer-Loc
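The domain-adversarial component hinges on a gradient-reversal layer; a standard PyTorch implementation is sketched below, with the ConvNeXt autoencoder and the CIR feature pipeline omitted and a random tensor standing in for the features.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass, multiplies gradients by -lambda on the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)


features = torch.randn(32, 256, requires_grad=True)   # placeholder for CIR-derived features
domain_head = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))
domain_logits = domain_head(grad_reverse(features, lam=0.5))
loss = nn.functional.cross_entropy(domain_logits, torch.randint(0, 2, (32,)))
loss.backward()   # the feature extractor receives reversed domain-classification gradients
print(features.grad.abs().mean())
```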
[769] Dynamic Routing Between Experts: A Data-Efficient Approach to Continual Learning in Vision-Language Models
Jay Mohta, Kenan Emir Ak, Dimitrios Dimitriadis, Yan Xu, Mingwei Shen
Main category: cs.LG
TL;DR: A routing-based approach for Vision-Language Models that prevents catastrophic forgetting during sequential fine-tuning, preserving foundational knowledge while integrating new tasks without requiring simultaneous access to all datasets.
Details
Motivation: VLMs suffer from catastrophic forgetting when sequentially fine-tuned on new tasks, degrading performance on previously learned capabilities. Traditional multi-task learning requires simultaneous access to all datasets and has computational overhead scaling linearly with tasks.Method: Routing-based approach that enables integration of new tasks while preserving foundational knowledge acquired during pretraining, evaluated on InternVL-2 models (2B and 8B parameters).
Result: Routing preserves foundational capabilities on general-purpose benchmarks (ChartQA, MMBench, DocVQA) while improving accuracy on specialized tasks, without requiring concurrent access to all task data. The approach is scalable, robust to growing tasks, and enables superior cross-modal transfer between language and vision capabilities.
Conclusion: The routing mechanism effectively prevents catastrophic forgetting in VLMs, maintains foundational knowledge, enables task integration without multi-task learning overhead, and facilitates cross-modal knowledge transfer not achieved by existing continual learning methods.
Abstract: Vision-Language Models (VLMs) suffer from catastrophic forgetting when sequentially fine-tuned on new tasks, degrading performance on previously learned foundational and task-specific capabilities. While multi-task learning can mitigate forgetting, it requires simultaneous access to all datasets and imposes computational overhead that scales linearly with the number of tasks. In this work, we introduce a routing-based approach that enables the integration of new tasks while preserving the foundational knowledge acquired during pretraining. We evaluate our method using InternVL-2 models (2B and 8B parameters) and demonstrate that routing preserves the model’s foundational capabilities by maintaining performance on general-purpose benchmarks such as ChartQA, MMBench, and DocVQA, while simultaneously improving accuracy on specialized tasks. Importantly, our approach achieves this without requiring concurrent access to data from all tasks, avoiding the significant computational and data overhead associated with traditional multi-task learning. We further conduct extensive ablation studies to evaluate the scalability and robustness of routing-based learning, showing that the approach is resilient to a growing number of tasks and performs particularly well when new tasks are semantically related. Finally, we show that the routing mechanism enables superior cross-modal transfer between language and vision capabilities, allowing knowledge learned in one modality to enhance performance in another, a capability not achieved by existing continual learning methods.
[770] Towards Multi-Fidelity Scaling Laws of Neural Surrogates in CFD
Paul Setinek, Gianluca Galletti, Johannes Brandstetter
Main category: cs.LG
TL;DR: The paper investigates scaling laws for neural surrogates using multi-fidelity RANS simulations, showing how to optimize dataset composition by trading simulation fidelity for computational cost.
Details
Motivation: Scientific machine learning faces high costs for generating training data through numerical simulations, unlike domains like language/vision where data collection is cheap. The ability to trade simulation fidelity for computational cost presents a unique opportunity not available in other domains.Method: Reformulated classical scaling laws to decompose dataset axis into compute budget and dataset composition. Used low- and high-fidelity Reynolds-Averaged Navier-Stokes (RANS) simulations to investigate the trade-off between data fidelity and cost in neural surrogates.
Result: Experiments revealed compute-performance scaling behavior and showed budget-dependent optimal fidelity mixes for given dataset configurations. Found that different compute budgets require different optimal combinations of low- and high-fidelity data.
Conclusion: This provides the first empirical study of scaling laws for multi-fidelity neural surrogate datasets, offering practical guidance for compute-efficient dataset generation in scientific machine learning applications.
Abstract: Scaling laws describe how model performance grows with data, parameters and compute. While large datasets can usually be collected at relatively low cost in domains such as language or vision, scientific machine learning is often limited by the high expense of generating training data through numerical simulations. However, by adjusting modeling assumptions and approximations, simulation fidelity can be traded for computational cost, an aspect absent in other domains. We investigate this trade-off between data fidelity and cost in neural surrogates using low- and high-fidelity Reynolds-Averaged Navier-Stokes (RANS) simulations. Reformulating classical scaling laws, we decompose the dataset axis into compute budget and dataset composition. Our experiments reveal compute-performance scaling behavior and exhibit budget-dependent optimal fidelity mixes for the given dataset configuration. These findings provide the first study of empirical scaling laws for multi-fidelity neural surrogate datasets and offer practical considerations for compute-efficient dataset generation in scientific machine learning.
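The compute-performance analysis amounts to fitting scaling curves per fidelity mix; a hedged sketch with a saturating power law and synthetic numbers (not the paper's data) is shown below.

```python
import numpy as np
from scipy.optimize import curve_fit


def power_law(compute, a, b, c):
    """Error ~ a * compute^(-b) + c, a common saturating scaling-law form."""
    return a * np.power(compute, -b) + c


# synthetic (compute budget, validation error) points for one fidelity mix
compute = np.array([1e2, 3e2, 1e3, 3e3, 1e4])
error = np.array([0.42, 0.31, 0.22, 0.17, 0.15])

params, _ = curve_fit(power_law, compute, error, p0=(1.0, 0.3, 0.1), maxfev=10000)
a, b, c = params
print(f"fit: error ~ {a:.2f} * C^(-{b:.2f}) + {c:.2f}")
print("predicted error at C=1e5:", power_law(1e5, *params))
```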
[771] Priors in Time: Missing Inductive Biases for Language Model Interpretability
Ekdeep Singh Lubana, Can Rager, Sai Sumedh R. Hindupur, Valerie Costa, Greta Tuckute, Oam Patel, Sonia Krishna Murthy, Thomas Fel, Daniel Wurgaft, Eric J. Bigelow, Johnny Lin, Demba Ba, Martin Wattenberg, Fernanda Viegas, Melanie Weber, Aaron Mueller
Main category: cs.LG
TL;DR: The paper introduces Temporal Feature Analysis as an interpretability method that accounts for temporal dynamics in language model representations, addressing limitations of Sparse Autoencoders which assume concept independence across time.
Details
Motivation: Existing feature extraction methods like Sparse Autoencoders assume concept independence across time, but language model representations exhibit rich temporal dynamics including non-stationarity, context-dependent correlations, and changing conceptual dimensionality that conflict with these priors.Method: Proposed Temporal Feature Analysis with temporal inductive bias that decomposes representations into predictable components (inferred from context) and residual components (novel information unexplained by context), inspired by computational neuroscience approaches.
Result: Temporal Feature Analyzers successfully parse garden path sentences, identify event boundaries, and delineate abstract slow-moving information from novel fast-moving information, while Sparse Autoencoders show significant pitfalls in these temporal tasks.
Conclusion: The results emphasize the need for interpretability tools with inductive biases that match the temporal characteristics of language data, rather than assuming stationarity and independence across time.
Abstract: Recovering meaningful concepts from language model activations is a central aim of interpretability. While existing feature extraction methods aim to identify concepts that are independent directions, it is unclear if this assumption can capture the rich temporal structure of language. Specifically, via a Bayesian lens, we demonstrate that Sparse Autoencoders (SAEs) impose priors that assume independence of concepts across time, implying stationarity. Meanwhile, language model representations exhibit rich temporal dynamics, including systematic growth in conceptual dimensionality, context-dependent correlations, and pronounced non-stationarity, in direct conflict with the priors of SAEs. Taking inspiration from computational neuroscience, we introduce a new interpretability objective – Temporal Feature Analysis – which possesses a temporal inductive bias to decompose representations at a given time into two parts: a predictable component, which can be inferred from the context, and a residual component, which captures novel information unexplained by the context. Temporal Feature Analyzers correctly parse garden path sentences, identify event boundaries, and more broadly delineate abstract, slow-moving information from novel, fast-moving information, while existing SAEs show significant pitfalls in all the above tasks. Overall, our results underscore the need for inductive biases that match the data in designing robust interpretability tools.
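As a loose illustration of the predictable/residual split, the sketch below predicts each hidden state from a running summary of its context with ridge regression and keeps the unexplained part as the residual; the actual Temporal Feature Analysis objective is more involved, so treat this only as an analogy.

```python
import numpy as np
from sklearn.linear_model import Ridge


def predictable_residual_split(H, context=8, alpha=1.0):
    """H: (T, d) hidden states. Predict h_t from the mean of the previous `context` states."""
    X = np.stack([H[max(0, t - context):t].mean(axis=0) if t > 0 else np.zeros(H.shape[1])
                  for t in range(len(H))])
    model = Ridge(alpha=alpha).fit(X, H)
    predictable = model.predict(X)        # slow, context-explained component
    residual = H - predictable            # novel, context-unexplained component
    return predictable, residual


rng = np.random.default_rng(0)
H = np.cumsum(rng.normal(size=(200, 32)), axis=0)      # toy "activations" with slow drift
pred, res = predictable_residual_split(H)
print(np.linalg.norm(pred) > np.linalg.norm(res))      # the drift is mostly predictable
```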
[772] Interpretable Machine Learning for Reservoir Water Temperatures in the U.S. Red River Basin of the South
Isabela Suaza-Sierra, Hernan A. Moreno, Luis A De la Fuente, Thomas M. Neeson
Main category: cs.LG
TL;DR: This study integrates explainable ML and symbolic modeling to predict and understand reservoir water temperature dynamics, achieving high accuracy with ensemble models and developing interpretable symbolic equations using Kolmogorov Arnold Networks.
Details
Motivation: Accurate RWT prediction is crucial for water management and ecosystem health, but prediction alone offers limited insight into governing physical processes. The research aims to bridge this gap by uncovering the drivers of RWT dynamics.Method: Used ensemble ML models (RF, XGBoost, MLP) on 10,000+ depth-resolved temperature profiles from 10 reservoirs, applied SHAP for feature importance analysis, and developed Kolmogorov Arnold Networks (KANs) to derive symbolic equations from data-driven insights.
Result: Achieved high predictive accuracy (best RMSE = 1.20°C, R² = 0.97) with ML models. Derived 10 progressively complex KAN equations, improving from R² = 0.84 with one predictor to R² = 0.92 with ten predictors. Depth emerged as critical secondary predictor while precipitation had limited effect.
Conclusion: The framework successfully couples predictive accuracy with explanatory power, demonstrating how KANs and explainable ML can transform black-box models into transparent surrogates that advance both prediction and understanding of reservoir thermal dynamics.
Abstract: Accurate prediction of Reservoir Water Temperature (RWT) is vital for sustainable water management, ecosystem health, and climate resilience. Yet, prediction alone offers limited insight into the governing physical processes. To bridge this gap, we integrated explainable machine learning (ML) with symbolic modeling to uncover the drivers of RWT dynamics across ten reservoirs in the Red River Basin, USA, using over 10,000 depth-resolved temperature profiles. We first employed ensemble and neural models, including Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Multilayer Perceptron (MLP), achieving high predictive skill (best RMSE = 1.20 degree Celsius, R^2 = 0.97). Using SHAP (SHapley Additive exPlanations), we quantified the contribution of physical drivers such as air temperature, depth, wind, and lake volume, revealing consistent patterns across reservoirs. To translate these data-driven insights into compact analytical expressions, we developed Kolmogorov Arnold Networks (KANs) to symbolically approximate RWT. Ten progressively complex KAN equations were derived, improving from R^2 = 0.84 using a single predictor (7-day antecedent air temperature) to R^2 = 0.92 with ten predictors, though gains diminished beyond five, highlighting a balance between simplicity and accuracy. The resulting equations, dominated by linear and rational forms, incrementally captured nonlinear behavior while preserving interpretability. Depth consistently emerged as a secondary but critical predictor, whereas precipitation had limited effect. By coupling predictive accuracy with explanatory power, this framework demonstrates how KANs and explainable ML can transform black-box models into transparent surrogates that advance both prediction and understanding of reservoir thermal dynamics.
[773] Bridging Lifelong and Multi-Task Representation Learning via Algorithm and Complexity Measure
Zhi Wang, Chicheng Zhang, Ramya Korlakai Vinayak
Main category: cs.LG
TL;DR: The paper proposes a lifelong representation learning framework where a learner sequentially faces tasks with shared structure, using multi-task empirical risk minimization and introducing the task-eluder dimension for sample complexity analysis.
Details
Motivation: Lifelong learning requires leveraging shared structure across sequential tasks in an online fashion, unlike multi-task learning where all tasks are available upfront. The goal is to develop a framework that can continually gather partial information while using existing knowledge.Method: A simple algorithm using multi-task empirical risk minimization as a subroutine, with theoretical analysis based on a new concept called the task-eluder dimension that applies to general function classes.
Result: Established sample complexity bounds for lifelong representation learning that apply to a wide range of learning problems, with concrete instantiations for classification and regression tasks under noise.
Conclusion: The proposed framework and algorithm provide a theoretically grounded approach to lifelong representation learning with provable sample complexity guarantees across diverse learning problems.
Abstract: In lifelong learning, a learner faces a sequence of tasks with shared structure and aims to identify and leverage it to accelerate learning. We study the setting where such structure is captured by a common representation of data. Unlike multi-task learning or learning-to-learn, where tasks are available upfront to learn the representation, lifelong learning requires the learner to make use of its existing knowledge while continually gathering partial information in an online fashion. In this paper, we consider a generalized framework of lifelong representation learning. We propose a simple algorithm that uses multi-task empirical risk minimization as a subroutine and establish a sample complexity bound based on a new notion we introduce, the task-eluder dimension. Our result applies to a wide range of learning problems involving general function classes. As concrete examples, we instantiate our result on classification and regression tasks under noise.
[774] Coordinate ascent neural Kalman-MLE for state estimation
Bettina Hanlon, Angel Garcia Fernandez
Main category: cs.LG
TL;DR: Coordinate ascent algorithm for learning dynamic and measurement models in state estimation using maximum likelihood estimation, with neural networks modeling functions and noise covariances.
Details
Motivation: To improve dynamic state estimation by learning accurate dynamic and measurement models along with noise characteristics in a supervised manner.Method: Coordinate ascent algorithm using maximum likelihood estimation to learn neural network parameters for dynamic/measurement functions and noise covariance matrices, then using trained models with non-linear Kalman filter.
Result: Developed a framework that can learn both the system models and noise characteristics simultaneously for improved state estimation performance.
Conclusion: The proposed approach enables effective learning of dynamic and measurement models with noise covariances, enhancing state estimation accuracy when combined with non-linear Kalman filters.
Abstract: This paper presents a coordinate ascent algorithm to learn dynamic and measurement models in dynamic state estimation using maximum likelihood estimation in a supervised manner. In particular, the dynamic and measurement models are assumed to be Gaussian and the algorithm learns the neural network parameters that model the dynamic and measurement functions, and also the noise covariance matrices. The trained dynamic and measurement models are then used with a non-linear Kalman filter algorithm to estimate the state during the testing phase.
[775] Abstraction Alignment: Comparing Model-Learned and Human-Encoded Conceptual Relationships
Angie Boggust, Hyemin Bang, Hendrik Strobelt, Arvind Satyanarayan
Main category: cs.LG
TL;DR: The paper introduces abstraction alignment, a methodology to compare model behavior against formal human knowledge by externalizing domain-specific knowledge as abstraction graphs and measuring how much of a model’s uncertainty is explained by human abstractions.
Details
Motivation: Existing interpretability methods identify learned concepts but overlook relationships between concepts that form abstractions and enable generalization. The authors want to assess whether models learn human-aligned abstractions.Method: Externalize domain-specific human knowledge as abstraction graphs (sets of concepts spanning abstraction levels), then measure model alignment by determining how much of the model’s uncertainty is accounted for by human abstractions. Aggregate alignment across datasets to test hypotheses.
Result: In expert evaluations, abstraction alignment differentiates similar errors, improves existing model-quality metrics’ verbosity, and uncovers improvements to current human abstractions.
Conclusion: Abstraction alignment provides a systematic way to assess model alignment with human abstractions, revealing both model limitations and potential refinements to human knowledge frameworks.
Abstract: While interpretability methods identify a model’s learned concepts, they overlook the relationships between concepts that make up its abstractions and inform its ability to generalize to new data. To assess whether models have learned human-aligned abstractions, we introduce abstraction alignment, a methodology to compare model behavior against formal human knowledge. Abstraction alignment externalizes domain-specific human knowledge as an abstraction graph, a set of pertinent concepts spanning levels of abstraction. Using the abstraction graph as a ground truth, abstraction alignment measures the alignment of a model’s behavior by determining how much of its uncertainty is accounted for by the human abstractions. By aggregating abstraction alignment across entire datasets, users can test alignment hypotheses, such as which human concepts the model has learned and where misalignments recur. In evaluations with experts, abstraction alignment differentiates seemingly similar errors, improves the verbosity of existing model-quality metrics, and uncovers improvements to current human abstractions.
[776] Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning
Qitao Tan, Jun Liu, Zheng Zhan, Caiwei Ding, Yanzhi Wang, Xiaolong Ma, Jaewoo Lee, Jin Lu, Geng Yuan
Main category: cs.LG
TL;DR: DiZO is a divergence-driven zeroth-order optimization method that bridges the performance gap between memory-efficient ZO optimization and standard FO fine-tuning by using layer-wise adaptation to generate diverse-magnitude updates tailored to each layer’s needs.
Details
Motivation: Standard first-order fine-tuning requires significant memory, limiting real-world deployment. While zeroth-order optimization is memory-efficient, it lags behind FO methods in convergence speed and accuracy.Method: Proposes DiZO optimization with layer-wise divergence analysis and divergence-driven layer adaptation. It incorporates projections to ZO updates and generates diverse-magnitude updates scaled to each layer’s optimization needs.
Result: DiZO reduces training GPU hours by up to 48%, significantly cuts convergence iterations without sacrificing throughput, and consistently outperforms ZO baselines on RoBERTa-large, OPT-series, and Llama-series. In some cases, it even surpasses memory-intensive FO fine-tuning.
Conclusion: DiZO effectively bridges the performance gap between ZO and FO optimization, providing a memory-efficient training paradigm that maintains high performance while reducing computational requirements.
Abstract: Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory, significantly limiting real-world deployment. Recently, zeroth-order (ZO) optimization stood out as a promising memory-efficient training paradigm, avoiding backward passes and relying solely on forward passes for gradient estimation, making it attractive for resource-constrained scenarios. However, ZO methods lag far behind FO methods in both convergence speed and accuracy. To bridge the gap, we introduce a novel layer-wise divergence analysis that uncovers the distinct update patterns of FO and ZO optimization. Aiming to match the learning capacity of the FO method based on these findings, we propose Divergence-driven Zeroth-Order (DiZO) optimization. DiZO conducts divergence-driven layer adaptation by incorporating projections into ZO updates, generating diverse-magnitude updates precisely scaled to layer-wise individual optimization needs. Our results demonstrate that DiZO significantly reduces the needed iterations for convergence without sacrificing throughput, cutting training GPU hours by up to 48% on various datasets. Moreover, DiZO consistently outperforms the representative ZO baselines in fine-tuning RoBERTa-large, OPT-series, and Llama-series on downstream tasks and, in some cases, even surpasses memory-intensive FO fine-tuning. Our code is released at https://github.com/Skilteee/DiZO.
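Zeroth-order fine-tuning of the kind discussed here relies on forward-only gradient estimates; below is a generic two-point estimator with a per-layer scaling hook, offered as a hedged stand-in for DiZO's divergence-driven, layer-wise projections rather than its actual implementation.

```python
import numpy as np


def two_point_zo_grad(loss_fn, params, eps=1e-3, seed=None):
    """Estimate gradients of loss_fn(params) from one random direction and two forward passes."""
    rng = np.random.default_rng(seed)
    u = {k: rng.standard_normal(v.shape) for k, v in params.items()}
    plus = {k: v + eps * u[k] for k, v in params.items()}
    minus = {k: v - eps * u[k] for k, v in params.items()}
    scale = (loss_fn(plus) - loss_fn(minus)) / (2 * eps)
    return {k: scale * u[k] for k in params}


def zo_step(loss_fn, params, lr=1e-2, layer_scale=None, **kw):
    """One update; layer_scale mimics per-layer magnitudes (DiZO learns these via projections)."""
    g = two_point_zo_grad(loss_fn, params, **kw)
    layer_scale = layer_scale or {k: 1.0 for k in params}
    return {k: params[k] - lr * layer_scale[k] * g[k] for k in params}


# toy quadratic "model" with two layers
target = {"w1": np.ones(4), "w2": 2 * np.ones(3)}
loss = lambda p: sum(np.sum((p[k] - target[k]) ** 2) for k in p)
params = {"w1": np.zeros(4), "w2": np.zeros(3)}
for i in range(500):
    params = zo_step(loss, params, seed=i)
print({k: np.round(v, 2) for k, v in params.items()})
```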
[777] Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning
Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, Shuiwang Ji
Main category: cs.LG
TL;DR: E2H Reasoner improves language model reasoning via curriculum learning with easy-to-hard task scheduling, preventing overfitting and reducing sample complexity compared to vanilla RL.
Details
Motivation: Improve reasoning capabilities of language models through reinforcement learning, addressing limitations of using RL alone on difficult tasks by drawing inspiration from curriculum learning.Method: Proposed E2H Reasoner that schedules tasks from easy to hard, gradually building reasoning skills with appropriate fading of easy tasks to prevent overfitting, within an approximate policy iteration framework.
Result: Significantly improves reasoning ability of small LLMs (1.5B to 3B) across multiple domains, which otherwise struggle with vanilla RL alone, with established convergence guarantees and reduced sample complexity.
Conclusion: Curriculum learning with easy-to-hard scheduling is effective for improving reasoning in small language models, requiring fewer total samples than direct learning when tasks are appropriately decomposed.
Abstract: We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across multiple domains show that E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B), which otherwise struggle when trained with vanilla RL alone, highlighting the effectiveness of our method. Our code can be found on https://github.com/divelab/E2H-Reasoning.
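The easy-to-hard scheduling with fading can be illustrated by a small sampler that shifts probability mass from easy to hard task levels as training progresses; the triangular weighting and the floor probability are assumptions, not the E2H Reasoner recipe.

```python
import random


def e2h_sampling_probs(step, total_steps, n_levels=4, floor=0.02):
    """Per-difficulty sampling probabilities that drift from easy to hard over training."""
    progress = min(step / max(total_steps, 1), 1.0)
    frontier = progress * (n_levels - 1)            # current "frontier" difficulty
    weights = [max(floor, 1.0 - abs(level - frontier)) for level in range(n_levels)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_task(step, total_steps, tasks_by_level):
    probs = e2h_sampling_probs(step, total_steps, n_levels=len(tasks_by_level))
    level = random.choices(range(len(tasks_by_level)), weights=probs, k=1)[0]
    return level, random.choice(tasks_by_level[level])


tasks_by_level = [["2+2"], ["17*24"], ["solve x^2-5x+6=0"], ["prove sum of first n odds = n^2"]]
for step in (0, 300, 600, 999):
    print(step, e2h_sampling_probs(step, 1000))
print(sample_task(600, 1000, tasks_by_level))
```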
[778] Learning to Steer: Input-dependent Steering for Multimodal LLMs
Jayneel Parekh, Pegah Khayatan, Mustafa Shukor, Arnaud Dapogny, Alasdair Newson, Matthieu Cord
Main category: cs.LG
TL;DR: L2S (Learn-to-Steer) proposes input-specific steering for multimodal LLMs using contrastive prompting and an auxiliary module to predict steering vectors, reducing hallucinations and improving safety.
Details
Motivation: Existing steering techniques for LLMs use single steering vectors independent of input, which is insufficient for context-dependent behaviors like safety responses that vary based on the query type (e.g., abstaining for illegal activities vs. pointing to resources for medical advice).Method: Fine-grained steering using input-specific linear shifts computed via contrastive prompting, with a small auxiliary module trained to predict these steering vectors since input-specific prompts are unknown at test time.
Result: L2S reduces hallucinations and enforces safety in MLLMs, outperforming static baselines like mean steering.
Conclusion: Input-specific steering through learned prediction of steering vectors is effective for context-dependent behavior control in multimodal LLMs, enhancing safety and reducing hallucinations.
Abstract: Steering has emerged as a practical approach to enable post-hoc guidance of LLMs towards enforcing a specific behavior. However, it remains largely underexplored for multimodal LLMs (MLLMs); furthermore, existing steering techniques, such as mean steering, rely on a single steering vector, applied independently of the input query. This paradigm faces limitations when the desired behavior is dependent on the example at hand. For example, a safe answer may consist in abstaining from answering when asked about an illegal activity, or may point to external resources or consultation with an expert when asked about medical advice. In this paper, we investigate fine-grained steering that uses an input-specific linear shift. This shift is computed using contrastive input-specific prompting. However, the input-specific prompts required for this approach are not known at test time. Therefore, we propose to train a small auxiliary module to predict the input-specific steering vector. Our approach, dubbed L2S (Learn-to-Steer), reduces hallucinations and enforces safety in MLLMs, outperforming other static baselines. Our code is publicly available at https://jayneelparekh.github.io/learn-to-steer/
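A minimal sketch of input-dependent steering follows: a small auxiliary MLP maps a pooled hidden representation to a per-input steering vector that is added back to the hidden states. The pooling, layer placement, and training target (differences induced by contrastive prompts) are assumptions about the general recipe, not the released L2S code.

```python
import torch
import torch.nn as nn


class SteeringPredictor(nn.Module):
    """Predicts an input-specific steering vector from a pooled hidden representation."""

    def __init__(self, hidden_dim: int, bottleneck: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, hidden_dim)
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        pooled = hidden_states.mean(dim=1)          # (batch, hidden_dim)
        return self.net(pooled)                     # one steering vector per input


hidden = torch.randn(4, 57, 1024)                   # (batch, seq_len, hidden_dim)
predictor = SteeringPredictor(1024)
steer = predictor(hidden)                           # trained to match contrastive-prompt shifts
steered_hidden = hidden + steer.unsqueeze(1)        # broadcast over the sequence dimension
print(steered_hidden.shape)
```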
[779] BadGraph: A Backdoor Attack Against Latent Diffusion Model for Text-Guided Graph Generation
Liang Ye, Shengqin Chen, Jiazhu Dai
Main category: cs.LG
TL;DR: BadGraph is a backdoor attack method targeting text-guided graph generation models that uses textual triggers to poison training data, enabling attackers to induce specific subgraphs during inference while maintaining normal performance on clean inputs.
Details
Motivation: To investigate security vulnerabilities in conditional graph generation models, particularly text-guided graph generation, which remains largely unexplored compared to image diffusion and unconditional graph generation backdoor attacks.Method: Leverages textual triggers to poison training data for latent diffusion models, implanting backdoors that activate during inference when triggers are present, while preserving normal functionality on clean inputs.
Result: Extensive experiments show high effectiveness: less than 10% poisoning rate achieves 50% attack success rate, 24% poisoning achieves over 80% success rate, with negligible performance degradation on benign samples. Backdoor is implanted during VAE and diffusion training phases.
Conclusion: The findings reveal serious security vulnerabilities in latent diffusion models for text-guided graph generation, highlighting significant risks in applications like drug discovery and underscoring the need for robust defense mechanisms.
Abstract: The rapid progress of graph generation has raised new security concerns, particularly regarding backdoor vulnerabilities. While prior work has explored backdoor attacks in image diffusion and unconditional graph generation, conditional, especially text-guided, graph generation remains largely unexamined. This paper proposes BadGraph, a backdoor attack method against latent diffusion models for text-guided graph generation. BadGraph leverages textual triggers to poison training data, covertly implanting backdoors that induce attacker-specified subgraphs during inference when triggers appear, while preserving normal performance on clean inputs. Extensive experiments on four benchmark datasets (PubChem, ChEBI-20, PCDes, MoMu) demonstrate the effectiveness and stealth of the attack: a poisoning rate of less than 10% achieves a 50% attack success rate, while 24% suffices for over 80%, with negligible performance degradation on benign samples. Ablation studies further reveal that the backdoor is implanted during VAE and diffusion training rather than pretraining. These findings reveal security vulnerabilities in latent diffusion models for text-guided graph generation, highlight the serious risks in applications such as drug discovery, and underscore the need for robust defenses against backdoor attacks in such diffusion models.
[780] Flight Delay Prediction via Cross-Modality Adaptation of Large Language Models and Aircraft Trajectory Representation
Thaweerath Phisannupawong, Joshua Julian Damanik, Han-Lim Choi
Main category: cs.LG
TL;DR: A lightweight LLM-based multimodal approach for flight delay prediction that integrates trajectory data with textual aeronautical information, achieving sub-minute prediction error.
Details
Motivation: Flight delays highlight inefficiencies in air traffic management and impact network performance, requiring accurate prediction methods from air traffic controllers' operational perspective.Method: Multimodal approach combining trajectory representations with textual aeronautical data (flight info, weather reports, aerodrome notices) by adapting trajectory data into language modality to capture airspace conditions.
Result: Model consistently achieves sub-minute prediction error, meeting operational standards for minute-level precision by effectively leveraging contextual delay information.
Conclusion: Linguistic understanding combined with cross-modality trajectory adaptation enhances delay prediction, showing practicality and scalability for real-world operations with real-time updates.
Abstract: Flight delay prediction has become a key focus in air traffic management, as delays highlight inefficiencies that impact overall network performance. This paper presents a lightweight large language model-based multimodal flight delay prediction, formulated from the perspective of air traffic controllers monitoring aircraft delay after entering the terminal area. The approach integrates trajectory representations with textual aeronautical information, including flight information, weather reports, and aerodrome notices, by adapting trajectory data into the language modality to capture airspace conditions. The experiments show that the model consistently achieves sub-minute prediction error by effectively leveraging contextual information related to the sources of delay, fulfilling the operational standard for minute-level precision. The framework demonstrates that linguistic understanding, when combined with cross-modality adaptation of trajectory data, enhances delay prediction. Moreover, the approach shows practicality and potential scalability for real-world operations, supporting real-time updates that refine predictions upon receiving new operational information.
[781] Neighboring State-based Exploration for Reinforcement Learning
Yu-Teng Li, Justin Lin, Jeffery Cheng, Pedro Pachuca
Main category: cs.LG
TL;DR: Proposes neighboring state-based exploration methods for reinforcement learning, with one algorithm (ρ-explore) outperforming Double DQN baseline by 49% in discrete environments.
Details
Motivation: Address the exploration-exploitation trade-off challenge in reinforcement learning by leveraging the intuition that considering actions from nearby states may lead to better exploration decisions for early-stage agents.Method: Two model-free exploration algorithms that choose exploratory actions based on surveying nearby states, with one called ρ-explore.
Result: ρ-explore consistently outperforms Double DQN baseline by 49% in terms of Eval Reward Return in discrete environments.
Conclusion: Neighboring state-based exploration approaches, particularly ρ-explore, are effective for improving exploration in reinforcement learning tasks.
Abstract: Reinforcement Learning is a powerful tool to model decision-making processes. However, it relies on an exploration-exploitation trade-off that remains an open challenge for many tasks. In this work, we study neighboring state-based, model-free exploration led by the intuition that, for an early-stage agent, considering actions derived from a bounded region of nearby states may lead to better actions when exploring. We propose two algorithms that choose exploratory actions based on a survey of nearby states, and find that one of our methods, ${\rho}$-explore, consistently outperforms the Double DQN baseline in a discrete environment by 49% in terms of Eval Reward Return.
[782] ERA-Solver: Error-Robust Adams Solver for Fast Sampling of Diffusion Probabilistic Models
Shengming Li, Luping Liu, Runnan Li, Xu Tan
Main category: cs.LG
TL;DR: ERA-Solver is a fast sampling method for diffusion models that uses an error-robust Adams solver with adaptive Lagrange interpolation to handle noise estimation errors, achieving high-quality image generation with only 10 network evaluations.
Details
Motivation: DDPMs have remarkable generation results but suffer from low sampling efficiency. Previous fast sampling methods with fixed analytical forms cannot handle various error patterns in noise estimation from pretrained diffusion models.Method: Constructs an error-robust Adams solver using implicit Adams numerical method with predictor-corrector approach. Uses Lagrange interpolation function as predictor enhanced with error-robust strategy to adaptively select Lagrange bases with lower errors in estimated noise. Works with any pretrained diffusion models without extra training.
Result: Achieves 3.54, 5.06, 5.02, and 5.11 FID scores on Cifar10, CelebA, LSUN-Church, and ImageNet 64x64 datasets respectively, with only 10 network evaluations.
Conclusion: ERA-Solver provides an effective fast sampling solution for diffusion models that is robust to noise estimation errors and achieves high-quality image generation with significantly reduced computational cost.
Abstract: Though denoising diffusion probabilistic models (DDPMs) have achieved remarkable generation results, the low sampling efficiency of DDPMs still limits further applications. Since DDPMs can be formulated as diffusion ordinary differential equations (ODEs), various fast sampling methods can be derived from solving diffusion ODEs. However, we notice that previous fast sampling methods with a fixed analytical form are not robust to the various error patterns in the noise estimated from pretrained diffusion models. In this work, we construct an error-robust Adams solver (ERA-Solver), which utilizes the implicit Adams numerical method that consists of a predictor and a corrector. Different from the traditional predictor based on explicit Adams methods, we leverage a Lagrange interpolation function as the predictor, which is further enhanced with an error-robust strategy to adaptively select the Lagrange bases with lower errors in the estimated noise. The proposed solver can be directly applied to any pretrained diffusion models, without extra training. Experiments on Cifar10, CelebA, LSUN-Church, and ImageNet 64 x 64 (conditional) datasets demonstrate that our proposed ERA-Solver achieves 3.54, 5.06, 5.02, and 5.11 Fréchet Inception Distance (FID) for image generation, with only 10 network evaluations.
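The predictor side builds on Lagrange interpolation over previously estimated noises; a generic Lagrange extrapolation helper is sketched below, with the error-robust basis selection and the full sampler omitted and the interface chosen for illustration.

```python
import numpy as np


def lagrange_extrapolate(ts, values, t_query):
    """Extrapolate `values` (arrays observed at times `ts`) to time `t_query`."""
    ts = np.asarray(ts, dtype=float)
    result = np.zeros_like(values[0], dtype=float)
    for i, (ti, vi) in enumerate(zip(ts, values)):
        basis = 1.0
        for j, tj in enumerate(ts):
            if j != i:
                basis *= (t_query - tj) / (ti - tj)   # Lagrange basis polynomial l_i(t_query)
        result += basis * vi
    return result


# toy check: noise estimates that follow a quadratic in t are recovered exactly
ts = [0.9, 0.8, 0.7]
values = [np.full(4, t ** 2) for t in ts]
print(lagrange_extrapolate(ts, values, 0.6))   # ~0.36 everywhere
```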
[783] Enhancing Sequential Model Performance with Squared Sigmoid TanH (SST) Activation Under Data Constraints
Barathi Subramanian, Rathinaraja Jeyaraj, Anand Paul
Main category: cs.LG
TL;DR: Proposed SST activation function improves sequential models’ performance on small datasets by amplifying activation differences and enhancing gradient flow.
Details
Motivation: Classical activation functions (Sigmoid, TanH) struggle with sparse patterns in small sequential datasets, limiting temporal dependency capture.Method: SST applies mathematical squaring to Sigmoid TanH to amplify activation differences, improve gradient flow, and filter information in sequential models like LSTMs and GRUs.
Result: SST-powered models consistently outperform baseline RNN models across sign language recognition, regression, and time-series classification tasks with limited datasets.
Conclusion: SST activation effectively enhances sequential model learning capability under data constraints by improving gradient flow and activation differentiation.
Abstract: Activation functions enable neural networks to learn complex representations by introducing non-linearities. While feedforward models commonly use rectified linear units, sequential models like recurrent neural networks, long short-term memory (LSTMs) and gated recurrent units (GRUs) still rely on Sigmoid and TanH activation functions. However, these classical activation functions often struggle to model sparse patterns when trained on small sequential datasets to effectively capture temporal dependencies. To address this limitation, we propose squared Sigmoid TanH (SST) activation specifically tailored to enhance the learning capability of sequential models under data constraints. SST applies mathematical squaring to amplify differences between strong and weak activations as signals propagate over time, facilitating improved gradient flow and information filtering. We evaluate SST-powered LSTMs and GRUs for diverse applications, such as sign language recognition, regression, and time-series classification tasks, where the dataset is limited. Our experiments demonstrate that SST models consistently outperform RNN-based models with baseline activations, exhibiting improved test accuracy.
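The abstract does not spell out the exact formula for SST, so the snippet below only illustrates the stated intuition: squaring a sigmoid or tanh gate widens the gap between strong and weak activations (weak gates shrink faster). How the paper combines these squared gates inside LSTM and GRU cells is not reproduced here.

```python
import torch


def squared_sigmoid(x: torch.Tensor) -> torch.Tensor:
    return torch.sigmoid(x) ** 2                   # weak gates (<0.5) shrink faster than strong ones


def squared_tanh(x: torch.Tensor) -> torch.Tensor:
    return torch.sign(x) * torch.tanh(x) ** 2      # square the magnitude, keep the sign


x = torch.linspace(-3, 3, 7)
print(torch.sigmoid(x), squared_sigmoid(x))
print(torch.tanh(x), squared_tanh(x))
```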
[784] Calibrating Bayesian Learning via Regularization, Confidence Minimization, and Selective Inference
Jiayi Huang, Sangwoo Park, Osvaldo Simeone
Main category: cs.LG
TL;DR: This paper proposes a novel Bayesian learning framework that integrates calibration regularization, confidence minimization, and selective calibration to improve AI model reliability for both in-distribution and out-of-distribution inputs.
Details
Motivation: AI models face reliability challenges in quantifying decision confidence, particularly in distinguishing between in-distribution and out-of-distribution inputs, which limits their practical application in critical fields like engineering.Method: The method extends variational inference-based Bayesian learning through three successive components: calibration-regularized Bayesian learning (CBNN), out-of-distribution confidence minimization (OCM) to create CBNN-OCM, and selective calibration to produce SCBNN-OCM, which rejects inputs with insufficient expected calibration performance.
Result: SCBNN-OCM achieves the best in-distribution and out-of-distribution performance compared to state-of-the-art approaches, though at the cost of rejecting a sufficiently large number of inputs.
Conclusion: The proposed selective CBNN-OCM framework effectively addresses the trade-offs between ID accuracy, ID calibration, and OOD calibration, providing a comprehensive solution for reliable AI decision-making in practical applications.
Abstract: The application of artificial intelligence (AI) models in fields such as engineering is limited by the known difficulty of quantifying the reliability of an AI’s decision. A well-calibrated AI model must correctly report its accuracy on in-distribution (ID) inputs, while also enabling the detection of out-of-distribution (OOD) inputs. A conventional approach to improve calibration is the application of Bayesian ensembling. However, owing to computational limitations and model misspecification, practical ensembling strategies do not necessarily enhance calibration. This paper proposes an extension of variational inference (VI)-based Bayesian learning that integrates calibration regularization for improved ID performance, confidence minimization for OOD detection, and selective calibration to ensure a synergistic use of calibration regularization and confidence minimization. The scheme is constructed successively by first introducing calibration-regularized Bayesian learning (CBNN), then incorporating out-of-distribution confidence minimization (OCM) to yield CBNN-OCM, and finally integrating also selective calibration to produce selective CBNN-OCM (SCBNN-OCM). Selective calibration rejects inputs for which the calibration performance is expected to be insufficient. Numerical results illustrate the trade-offs between ID accuracy, ID calibration, and OOD calibration attained by both frequentist and Bayesian learning methods. Among the main conclusions, SCBNN-OCM is seen to achieve best ID and OOD performance as compared to existing state-of-the-art approaches at the cost of rejecting a sufficiently large number of inputs.
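As an illustration of the confidence-minimization component only (the variational Bayesian treatment, the calibration regularizer, and the selective rejection rule are omitted), here is a minimal sketch of an OOD confidence-minimization term added to the usual fit loss; the KL-to-uniform form and the weighting are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def confidence_minimization(logits_ood: torch.Tensor) -> torch.Tensor:
    # Push predictions on OOD inputs toward the uniform distribution (maximum uncertainty).
    log_probs = F.log_softmax(logits_ood, dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / logits_ood.shape[-1])
    return F.kl_div(log_probs, uniform, reduction="batchmean")

def training_loss(logits_id, labels_id, logits_ood, lam=0.1):
    fit = F.cross_entropy(logits_id, labels_id)   # in-distribution fit term
    ocm = confidence_minimization(logits_ood)     # OOD confidence minimization
    return fit + lam * ocm
```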
[785] SST: Multi-Scale Hybrid Mamba-Transformer Experts for Time Series Forecasting
Xiongxiao Xu, Canyu Chen, Yueqing Liang, Baixiang Huang, Guangji Bai, Liang Zhao, Kai Shu
Main category: cs.LG
TL;DR: SST is a hybrid Mamba-Transformer model for time series forecasting that combines Mamba for long-range patterns and Transformer for short-term variations, achieving state-of-the-art performance with linear scalability.
Details
Motivation: Transformers have quadratic complexity limitations for long sequences, while Mamba's fixed-size latent state may cause information loss. The paper aims to create an effective and efficient hybrid architecture for time series forecasting.Method: Proposes State Space Transformer (SST) with expert modules: Mamba expert for long-range patterns and Transformer expert for short-term variations. Uses time series decomposition and multi-scale patching with adaptive resolution.
Result: SST achieves state-of-the-art performance with linear scalability, outperforming naive Mamba-Transformer stacking approaches.
Conclusion: The hybrid Mamba-Transformer architecture with specialized experts for different temporal scales is effective for time series forecasting, with Mamba excelling at long-range patterns and Transformer at short-term variations.
Abstract: Time series forecasting has made significant advances, including with Transformer-based models. The attention mechanism in Transformer effectively captures temporal dependencies by attending to all past inputs simultaneously. However, its quadratic complexity with respect to sequence length limits the scalability for long-range modeling. Recent state space models (SSMs) such as Mamba offer a promising alternative by achieving linear complexity without attention. Yet, Mamba compresses historical information into a fixed-size latent state, potentially causing information loss and limiting representational effectiveness. This raises a key research question: Can we design a hybrid Mamba-Transformer architecture that is both effective and efficient for time series forecasting? To address it, we adapt a hybrid Mamba-Transformer architecture Mambaformer, originally proposed for language modeling, to the time series domain. Preliminary experiments reveal that naively stacking Mamba and Transformer layers in Mambaformer is suboptimal for time series forecasting, due to an information interference problem. To mitigate this issue, we introduce a new time series decomposition strategy that separates time series into long-range patterns and short-range variations. Then we show that Mamba excels at capturing long-term structures, while Transformer is more effective at modeling short-term dynamics. Building on this insight, we propose State Space Transformer (SST), a multi-scale hybrid model with expert modules: a Mamba expert for long-range patterns and a Transformer expert for short-term variations. SST also employs a multi-scale patching mechanism to adaptively adjust time series resolution: low resolution for long-term patterns and high resolution for short-term variations. Experiments show that SST obtains SOTA performance with linear scalability. The code is at https://github.com/XiongxiaoXu/SST.
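One common way to realize the long-range / short-range split the paper relies on is a moving-average decomposition; the sketch below uses that as a stand-in (the kernel size and the pooling-based trend are assumptions, not necessarily the paper's exact strategy). The trend would be routed to the Mamba expert and the residual to the Transformer expert.

```python
import torch
import torch.nn.functional as F

def decompose(x: torch.Tensor, kernel: int = 25):
    # x: (batch, length, channels). Smoothed trend ~ long-range patterns,
    # residual ~ short-range variations. kernel should be odd.
    pad = kernel // 2
    xt = F.pad(x.transpose(1, 2), (pad, pad), mode="replicate")      # (B, C, L + 2*pad)
    trend = F.avg_pool1d(xt, kernel_size=kernel, stride=1).transpose(1, 2)
    return trend, x - trend                                          # long-range, short-range
```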
[786] REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning
Sungho Jeon, Xinyue Ma, Kwang In Kim, Myeongjae Jeon
Main category: cs.LG
TL;DR: REP improves computational and memory efficiency of prompt-based continual learning methods with minimal accuracy loss through swift prompt selection, adaptive token merging, and adaptive layer dropping.
Details
Motivation: Existing rehearsal-free continual learning methods using prompts achieve good performance but are resource-intensive, limiting their deployment on edge devices.Method: Uses swift prompt selection to refine input data, adaptive token merging (AToM) to selectively skip data (tokens), and adaptive layer dropping (ALD) to skip model layers, while preserving task-specific features.
Result: Extensive experiments on multiple image classification datasets show REP achieves superior resource efficiency compared to state-of-the-art rehearsal-free CL methods.
Conclusion: REP provides a resource-efficient solution for prompt-based continual learning that maintains performance while reducing computational and memory requirements.
Abstract: Recent rehearsal-free continual learning (CL) methods guided by prompts achieve strong performance on vision tasks with non-stationary data but remain resource-intensive, hindering real-world edge deployment. We introduce resource-efficient prompting (REP), which improves the computational and memory efficiency of prompt-based rehearsal-free continual learning methods while minimizing accuracy trade-offs. Our approach employs swift prompt selection to refine input data using a carefully provisioned model and introduces adaptive token merging (AToM) and adaptive layer dropping (ALD) for efficient prompt updates. AToM and ALD selectively skip data and model layers while preserving task-specific features during the learning of new tasks. Extensive experiments on multiple image classification datasets demonstrate REP’s superior resource efficiency over state-of-the-art rehearsal-free CL methods.
[787] Ranking hierarchical multi-label classification results with mLPRs
Yuting Ye, Christine Ho, Ci-Ren Jiang, Wayne Tai Lee, Haiyan Huang
Main category: cs.LG
TL;DR: The paper proposes CATCH, a new objective function for hierarchical multi-label classification, and HierRank algorithm that uses multidimensional Local Precision Rate (mLPR) to optimize classification decisions while respecting class hierarchy constraints.
Details
Motivation: Addressing the under-explored second stage of hierarchical multi-label classification - how to integrate individual classifiers while managing hierarchical constraints and accounting for statistical differences in classifier scores across classes.Method: Introduces CATCH objective function and proposes HierRank algorithm that transforms classifier scores into mLPRs (multidimensional Local Precision Rates) and ranks them under hierarchical constraints to maximize empirical CATCH.
Result: Superior performance compared to state-of-the-art methods on synthetic and two real datasets, showing improved decision accuracy.
Conclusion: The proposed approach effectively handles hierarchical constraints in multi-label classification by optimizing the CATCH objective through mLPR-based ranking, demonstrating practical advantages over existing methods.
Abstract: Hierarchical multi-label classification (HMC) has gained considerable attention in recent decades. A seminal line of HMC research addresses the problem in two stages: first, training individual classifiers for each class, then integrating these classifiers to provide a unified set of classification results across classes while respecting the given hierarchy. In this article, we focus on the less attended second-stage question while adhering to the given class hierarchy. This involves addressing a key challenge: how to manage the hierarchical constraint and account for statistical differences in the first-stage classifier scores across different classes to make classification decisions that are optimal under a justifiable criterion. To address this challenge, we introduce a new objective function, called CATCH, to ensure reasonable classification performance. To optimize this function, we propose a decision strategy built on a novel metric, the multidimensional Local Precision Rate (mLPR), which reflects the membership chance of an object in a class given all classifier scores and the class hierarchy. Particularly, we demonstrate that, under certain conditions, transforming the classifier scores into mLPRs and comparing mLPR values for all objects against all classes can, in theory, ensure the class hierarchy and maximize CATCH. In practice, we propose an algorithm HierRank to rank estimated mLPRs under the hierarchical constraint, leading to a ranking that maximizes an empirical version of CATCH. Our approach was evaluated on a synthetic dataset and two real datasets, exhibiting superior performance compared to several state-of-the-art methods in terms of improved decision accuracy.
[788] You Are the Best Reviewer of Your Own Papers: The Isotonic Mechanism
Weijie Su
Main category: cs.LG
TL;DR: The Isotonic Mechanism improves peer review accuracy by using authors’ private rankings of their submissions to calibrate noisy review scores, with proven truthfulness incentives.
Details
Motivation: Address the significant decline in peer review quality at ML/AI conferences by leveraging authors' private assessments of their own submissions to enhance review score accuracy.Method: Authors rank their submissions in quality order, then raw review scores are calibrated using isotonic regression based on these rankings to produce adjusted scores that are more accurate.
Result: The mechanism incentivizes truthful ranking by authors and produces more accurate scores than raw reviews, especially with high noise levels and many submissions. It’s optimal among pairwise comparison-based mechanisms.
Conclusion: The Isotonic Mechanism effectively improves review quality through truthful elicitation of author rankings, with potential for practical implementation and further theoretical development.
Abstract: Machine learning (ML) and artificial intelligence (AI) conferences including NeurIPS and ICML have experienced a significant decline in peer review quality in recent years. To address this growing challenge, we introduce the Isotonic Mechanism, a computationally efficient approach to enhancing the accuracy of noisy review scores by incorporating authors’ private assessments of their submissions. Under this mechanism, authors with multiple submissions are required to rank their papers in descending order of perceived quality. Subsequently, the raw review scores are calibrated based on this ranking to produce adjusted scores. We prove that authors are incentivized to truthfully report their rankings because doing so maximizes their expected utility, modeled as an additive convex function over the adjusted scores. Moreover, the adjusted scores are shown to be more accurate than the raw scores, with improvements being particularly significant when the noise level is high and the author has many submissions – a scenario increasingly prevalent at large-scale ML/AI conferences. We further investigate whether submission quality information beyond a simple ranking can be truthfully elicited from authors. We establish that a necessary condition for truthful elicitation is that the mechanism be based on pairwise comparisons of the author’s submissions. This result underscores the optimality of the Isotonic Mechanism, as it elicits the most fine-grained truthful information among all mechanisms we consider. We then present several extensions, including a demonstration that the mechanism maintains truthfulness even when authors have only partial rather than complete information about their submission quality. Finally, we discuss future research directions, focusing on the practical implementation of the mechanism and the further development of a theoretical framework inspired by our mechanism.
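The calibration step itself is an isotonic regression of the raw scores onto the author-reported ranking. A minimal sketch with scikit-learn (ignoring multiple reviewers per paper and the utility model) could look like this:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def adjust_scores(raw_scores, author_ranking):
    # raw_scores[i]: raw review score of paper i.
    # author_ranking: paper indices ordered best to worst, as reported by the author.
    ordered = np.asarray(raw_scores, dtype=float)[author_ranking]
    # Adjusted scores must be non-increasing along the claimed ranking.
    iso = IsotonicRegression(increasing=False)
    adjusted_ordered = iso.fit_transform(np.arange(len(ordered)), ordered)
    adjusted = np.empty_like(adjusted_ordered)
    adjusted[author_ranking] = adjusted_ordered
    return adjusted

# Example: adjust_scores([5.0, 6.5, 4.0], author_ranking=[1, 0, 2])
```

The isotonic fit is the L2 projection of the raw scores onto the monotone cone defined by the ranking, which is the adjustment step the abstract describes.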
[789] Integer-only Quantized Transformers for Embedded FPGA-based Time-series Forecasting in AIoT
Tianheng Ling, Chao Qian, Gregor Schiele
Main category: cs.LG
TL;DR: Hardware accelerator for Transformers optimized for on-device time-series forecasting in AIoT systems, achieving comparable precision with 4-6 bit quantization while being 132x faster and 48x more energy efficient than 8-bit models.
Details
Motivation: To enable deployment of Transformer models on embedded IoT devices for time-series forecasting, addressing the computational and energy constraints of edge devices.Method: Combines integer-only quantization, Quantization-Aware Training with optimized hardware designs, implemented on Xilinx Spartan-7 FPGA, and systematically explores optimization combinations.
Result: 4-bit quantized Transformer model increases test loss by only 0.63% compared to 8-bit models, operates 132.33x faster, and consumes 48.19x less energy.
Conclusion: Transformer models can be deployed on embedded IoT devices with sufficient performance, but optimization requires systematic exploration as reduced bitwidth doesn’t always improve latency or energy consumption.
Abstract: This paper presents the design of a hardware accelerator for Transformers, optimized for on-device time-series forecasting in AIoT systems. It integrates integer-only quantization and Quantization-Aware Training with optimized hardware designs to realize 6-bit and 4-bit quantized Transformer models, which achieved precision comparable to 8-bit quantized models from related research. Utilizing a complete implementation on an embedded FPGA (Xilinx Spartan-7 XC7S15), we examine the feasibility of deploying Transformer models on embedded IoT devices. This includes a thorough analysis of achievable precision, resource utilization, timing, power, and energy consumption for on-device inference. Our results indicate that while sufficient performance can be attained, the optimization process is not trivial. For instance, reducing the quantization bitwidth does not consistently result in decreased latency or energy consumption, underscoring the necessity of systematically exploring various optimization combinations. Compared to an 8-bit quantized Transformer model in related studies, our 4-bit quantized Transformer model increases test loss by only 0.63%, operates up to 132.33x faster, and consumes 48.19x less energy. Relevant source code is provided in the accompanying GitHub repository: https://github.com/tianheng-ling/TinyTransformer4TS.
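For readers unfamiliar with low-bit quantization, a minimal symmetric per-tensor scheme is sketched below; the paper's integer-only pipeline and QAT details are more involved, so treat this only as an illustration of what 4-/6-bit quantization means.

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, bits: int = 4):
    # Map floats to signed integers in [-(2^(bits-1)-1), 2^(bits-1)-1] with one scale per tensor.
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```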
[790] ORFit: One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive Least-Squares
Youngjae Min, Namhoon Cho, Navid Azizan
Main category: cs.LG
TL;DR: ORFit enables efficient one-pass learning by updating model parameters orthogonally to past gradients, matching multi-pass SGD performance with linear memory/computation costs.
Details
Motivation: Address computational/memory constraints and privacy concerns in streaming data scenarios where storing and accessing all data is impractical.Method: Orthogonal Recursive Fitting (ORFit) that perfectly fits new datapoints while minimally altering previous predictions, using orthogonal gradient updates and incremental PCA.
Result: ORFit achieves equivalent performance to multi-pass SGD for overparameterized linear models and extends to nonlinear settings, with linear memory/computation costs vs quadratic in RLS.
Conclusion: ORFit provides an efficient one-pass learning algorithm that matches multi-pass training performance while being memory/computation efficient and minimax optimal.
Abstract: While large machine learning models have shown remarkable performance in various domains, their training typically requires iterating for many passes over the training data. However, due to computational and memory constraints and potential privacy concerns, storing and accessing all the data is impractical in many real-world scenarios where the data arrives in a stream. In this paper, we investigate the problem of one-pass learning, in which a model is trained on sequentially arriving data without retraining on previous datapoints. Motivated by the demonstrated effectiveness of overparameterized models and the phenomenon of benign overfitting, we propose Orthogonal Recursive Fitting (ORFit), an algorithm for one-pass learning which seeks to perfectly fit each new datapoint while minimally altering the predictions on previous datapoints. ORFit updates the parameters in a direction orthogonal to past gradients, similar to orthogonal gradient descent (OGD) in continual learning. We show that, interestingly, ORFit’s update leads to an operation similar to the recursive least-squares (RLS) algorithm in adaptive filtering but with significantly improved memory and computational efficiency, i.e., linear, instead of quadratic, in the number of parameters. To further reduce memory usage, we leverage the structure of the streaming data via an incremental principal component analysis (IPCA). We show that using the principal components is minimax optimal, i.e., it minimizes the worst-case forgetting of previous predictions for unknown future updates. Further, we prove that, for overparameterized linear models, the parameter vector obtained by ORFit matches what the standard multi-pass stochastic gradient descent (SGD) would converge to. Finally, we extend our results to the nonlinear setting for highly overparameterized models, relevant for deep learning.
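A minimal NumPy sketch of the core update for a scalar-output linear model is given below: fit the new datapoint exactly while moving only in directions orthogonal to the span of past update directions. The basis-growing step and the numerical thresholds are simplifications (the paper additionally uses incremental PCA to bound memory).

```python
import numpy as np

def orfit_step(w: np.ndarray, U: np.ndarray, x: np.ndarray, y: float):
    # w: parameters; U: (dim, k) orthonormal basis of past update directions;
    # (x, y): new datapoint for the linear model y ~ w @ x.
    d = x - U @ (U.T @ x)                     # component of x orthogonal to past directions
    if abs(d @ x) > 1e-12:
        w = w + ((y - w @ x) / (d @ x)) * d   # exactly fits the new point
    if np.linalg.norm(d) > 1e-12:
        U = np.column_stack([U, d / np.linalg.norm(d)])
    return w, U

# Start with an empty basis, e.g. U0 = np.zeros((dim, 0)).
```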
[791] Dataset Distillation for Offline Reinforcement Learning
Jonathan Light, Yuanzhe Liu, Ziniu Hu
Main category: cs.LG
TL;DR: Using data distillation to create better datasets for offline RL when quality datasets are unavailable, achieving similar performance to models trained on full datasets.
Details
Motivation: Offline RL often lacks quality datasets, and training policies directly on available offline data is challenging in many scenarios.Method: Propose data distillation to train and distill better datasets that can then be used for training improved policy models.
Result: Method synthesizes datasets where models trained on them achieve similar performance to models trained on full datasets or using percentile behavioral cloning.
Conclusion: Data distillation is an effective approach for creating high-quality training datasets for offline reinforcement learning when original datasets are insufficient.
Abstract: Offline reinforcement learning often requires a quality dataset that we can train a policy on. However, in many situations, it is not possible to get such a dataset, nor is it easy to train a policy to perform well in the actual environment given the offline data. We propose using data distillation to train and distill a better dataset which can then be used for training a better policy model. We show that our method is able to synthesize a dataset where a model trained on it achieves similar performance to a model trained on the full dataset or a model trained using percentile behavioral cloning. Our project site is available at https://datasetdistillation4rl.github.io . We also provide our implementation at https://github.com/ggflow123/DDRL .
[792] Algorithmic Assistance with Recommendation-Dependent Preferences
Bryce McLaughlin, Jann Spiess
Main category: cs.LG
TL;DR: Algorithmic recommendations can create inefficiencies when decision-makers treat them as default actions, leading to overly responsive behavior. The paper proposes algorithms that strategically withhold recommendations to improve decision quality.
Details
Motivation: To address unintended consequences where decision-makers (judges, doctors) treat algorithmic recommendations as default actions, making it costly to deviate due to institutional factors and behavioral biases like loss aversion.Method: Proposes a model of joint human-machine decision-making where recommendations affect choices by altering preferences, not just shifting beliefs. Discusses algorithms that strategically withhold recommendations.
Result: Shows that recommendation-dependent preferences create inefficiencies where decision-makers are overly responsive to recommendations. Proves an intuitive algorithm achieves minimax optimality by sending recommendations only when confident they improve over unassisted baseline.
Conclusion: Strategic withholding of algorithmic recommendations can improve decision quality by preventing decision-makers from being overly responsive to recommendations they treat as default actions.
Abstract: When an algorithm provides risk assessments, we typically think of them as helpful inputs to human decisions, such as when risk scores are presented to judges or doctors. However, a decision-maker may react not only to the information provided by the algorithm. The decision-maker may also view the algorithmic recommendation as a default action, making it costly for them to deviate, such as when a judge is reluctant to overrule a high-risk assessment for a defendant or a doctor fears the consequences of deviating from recommended procedures. To address such unintended consequences of algorithmic assistance, we propose a model of joint human-machine decision-making. Within this model, we consider the effect and design of algorithmic recommendations when they affect choices not just by shifting beliefs, but also by altering preferences. We motivate this assumption from institutional factors, such as a desire to avoid audits, as well as from well-established models in behavioral science that predict loss aversion relative to a reference point. We show that recommendation-dependent preferences create inefficiencies where the decision-maker is overly responsive to the recommendation. As a remedy, we discuss algorithms that strategically withhold recommendations and show how they can improve the quality of final decisions. Concretely, we prove that an intuitive algorithm achieves minimax optimality by sending recommendations only when it is confident that their implementation would improve over an unassisted baseline decision.
[793] AI-Guided Molecular Simulations in VR: Exploring Strategies for Imitation Learning in Hyperdimensional Molecular Systems
Mohamed Dhouioui, Jonathan Barnoud, Rhoslyn Roebuck Williams, Harry J. Stroud, Phil Bates, David R. Glowacki
Main category: cs.LG
TL;DR: This paper proposes using interactive molecular dynamics in virtual reality (iMD-VR) datasets to train AI agents via imitation learning, enabling AI to mimic human experts’ molecular manipulation skills for more efficient exploration of complex molecular systems.
Details
Motivation: Molecular dynamics simulations are computationally expensive due to high dimensionality. iMD-VR enables human experts to efficiently guide molecular simulations, but this generates valuable datasets that could train AI agents to augment human expertise.Method: The authors propose using imitation learning (IL) to train AI agents from iMD-VR recordings of human experts manipulating molecular systems. They conducted a proof-of-concept study where iMD-VR data was used to train a CNN network on a simple molecular manipulation task (threading a molecule through a nanotube pore).
Result: The proof-of-principle study successfully demonstrated that iMD-VR data can be used to train AI models for molecular manipulation tasks, showing the feasibility of using imitation learning to capture human experts’ spatial insights for molecular dynamics.
Conclusion: iMD-VR datasets combined with imitation learning offer a promising approach to train AI agents that can augment human expertise in navigating complex molecular conformational spaces, with potential applications in drug discovery, protein engineering, and material design.
Abstract: Molecular dynamics (MD) simulations are a crucial computational tool for researchers to understand and engineer molecular structure and function in areas such as drug discovery, protein engineering, and material design. Despite their utility, MD simulations are expensive, owing to the high dimensionality of molecular systems. Interactive molecular dynamics in virtual reality (iMD-VR) has recently emerged as a “human-in-the-loop” strategy for efficiently navigating hyper-dimensional molecular systems. By providing an immersive 3D environment that enables visualization and manipulation of real-time molecular simulations running on high-performance computing architectures, iMD-VR enables researchers to reach out and guide molecular conformational dynamics, in order to efficiently explore complex, high-dimensional molecular systems. Moreover, iMD-VR simulations generate rich datasets that capture human experts’ spatial insight regarding molecular structure and function. This paper explores the use of researcher-generated iMD-VR datasets to train AI agents via imitation learning (IL). IL enables agents to mimic complex behaviours from expert demonstrations, circumventing the need for explicit programming or intricate reward design. In this article, we review IL across robotics and multi-agent systems domains, which are comparable to iMD-VR, and discuss how iMD-VR recordings could be used to train IL models to interact with MD simulations. We then illustrate the applications of these ideas through a proof-of-principle study where iMD-VR data was used to train a CNN on a simple molecular manipulation task; namely, threading a small molecule through a nanotube pore. Finally, we outline future research directions and potential challenges of using AI agents to augment human expertise in navigating vast molecular conformational spaces.
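The proof-of-principle amounts to behaviour cloning: a network is trained to reproduce the expert's VR manipulations from recorded observations. A generic sketch is below; the observation encoding, action space, and network size are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

class PolicyCNN(nn.Module):
    # Maps an image-like observation of the molecular scene to a continuous action
    # (e.g., a 3D force or displacement applied to the manipulated molecule).
    def __init__(self, action_dim: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, action_dim),
        )

    def forward(self, obs):
        return self.net(obs)

def behaviour_cloning_loss(policy, obs, expert_actions):
    # Regress the actions demonstrated in the recorded iMD-VR sessions.
    return nn.functional.mse_loss(policy(obs), expert_actions)
```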
[794] Is Risk-Sensitive Reinforcement Learning Properly Resolved?
Ruiwen Zhou, Minghuan Liu, Kan Ren, Xufang Luo, Weinan Zhang, Dongsheng Li
Main category: cs.LG
TL;DR: Existing risk-sensitive RL methods have optimization biases and cannot guarantee optimality. The paper proposes Trajectory Q-Learning (TQL) to achieve unbiased optimization with provable policy improvement for various risk measures.
Details
Motivation: Current risk-sensitive reinforcement learning methods using distributional Bellman operators don't properly optimize risk measures and cannot guarantee optimality or improvements in accumulated return distributions.Method: Proposed Trajectory Q-Learning (TQL) algorithm with a new learning architecture that enables unbiased optimization and practical implementation for different risk measures.
Result: Experiments verify the learnability of TQL and show that it achieves better performance on risk-sensitive objectives.
Conclusion: TQL provides a solution to the optimization bias problem in RSRL and enables learning disparate risk-sensitive policies with provable guarantees.
Abstract: Due to the nature of risk management in learning applicable policies, risk-sensitive reinforcement learning (RSRL) has been realized as an important direction. RSRL is usually achieved by learning risk-sensitive objectives characterized by various risk measures, under the framework of distributional reinforcement learning. However, it remains unclear if the distributional Bellman operator properly optimizes the RSRL objective in the sense of risk measures. In this paper, we prove that the existing RSRL methods do not achieve unbiased optimization and cannot guarantee optimality or even improvements regarding risk measures over accumulated return distributions. To remedy this issue, we further propose a novel algorithm, namely Trajectory Q-Learning (TQL), for RSRL problems with provable policy improvement towards the optimal policy. Based on our new learning architecture, we are free to introduce a general and practical implementation for different risk measures to learn disparate risk-sensitive policies. In the experiments, we verify the learnability of our algorithm and show how our method effectively achieves better performances toward risk-sensitive objectives.
[795] APALU: A Trainable, Adaptive Activation Function for Deep Learning Networks
Barathi Subramanian, Rathinaraja Jeyaraj, Rakhmonov Akhrorjon Akhmadjon Ugli
Main category: cs.LG
TL;DR: APALU is a novel trainable activation function that improves deep learning performance across various tasks by adapting to complex data patterns while maintaining stability and efficiency.
Details
Motivation: Classical activation functions like ReLU are static and simple, limiting their effectiveness in specialized tasks. Trainable activation functions also struggle to adapt to data characteristics, creating a need for more adaptive solutions.Method: Introduces adaptive piecewise approximated activation linear unit (APALU), a trainable activation function with unique features that enable stability and adaptation to complex data representations.
Result: Significant improvements across multiple tasks: 0.37% and 0.04% accuracy gains for MobileNet and GoogleNet on CIFAR10; 0.8% AUC improvement for One-Class Deep SVDD on MNIST; 1.81% and 1.11% improvements with DifferNet and knowledge distillation on MVTech; 100% accuracy on sign language recognition; enhanced performance for regression tasks with DNNs and RNNs.
Conclusion: APALU demonstrates robustness and adaptability across diverse deep learning applications, highlighting its effectiveness as a superior activation function compared to widely used alternatives.
Abstract: Activation function is a pivotal component of deep learning, facilitating the extraction of intricate data patterns. While classical activation functions like ReLU and its variants are extensively utilized, their static nature and simplicity, despite being advantageous, often limit their effectiveness in specialized tasks. The trainable activation functions also struggle sometimes to adapt to the unique characteristics of the data. Addressing these limitations, we introduce a novel trainable activation function, adaptive piecewise approximated activation linear unit (APALU), to enhance the learning performance of deep learning across a broad range of tasks. It presents a unique set of features that enable it to maintain stability and efficiency in the learning process while adapting to complex data representations. Experiments reveal significant improvements over widely used activation functions for different tasks. In image classification, APALU increases MobileNet and GoogleNet accuracy by 0.37% and 0.04%, respectively, on the CIFAR10 dataset. In anomaly detection, it improves the average area under the curve of One-CLASS Deep SVDD by 0.8% on the MNIST dataset, 1.81% and 1.11% improvements with DifferNet, and knowledge distillation, respectively, on the MVTech dataset. Notably, APALU achieves 100% accuracy on a sign language recognition task with a limited dataset. For regression tasks, APALU enhances the performance of deep neural networks and recurrent neural networks on different datasets. These improvements highlight the robustness and adaptability of APALU across diverse deep-learning applications.
[796] Learning Diffusion Priors from Observations by Expectation Maximization
François Rozet, Gérôme Andry, François Lanusse, Gilles Louppe
Main category: cs.LG
TL;DR: DiEM is a novel method for training diffusion models from incomplete and noisy observations only, using expectation-maximization algorithm.
Details
Motivation: Training diffusion models typically requires large amounts of clean data, which can be difficult to obtain in some settings.Method: Based on expectation-maximization algorithm for training diffusion models from incomplete and noisy observations only, with an improved posterior sampling scheme for unconditional diffusion models.
Result: DiEM leads to proper diffusion models, which is crucial for downstream tasks, and empirical evidence supports the effectiveness of the approach.
Conclusion: DiEM provides an effective method for training diffusion models when only incomplete and noisy observations are available.
Abstract: Diffusion models recently proved to be remarkable priors for Bayesian inverse problems. However, training these models typically requires access to large amounts of clean data, which could prove difficult in some settings. In this work, we present DiEM, a novel method based on the expectation-maximization algorithm for training diffusion models from incomplete and noisy observations only. Unlike previous works, DiEM leads to proper diffusion models, which is crucial for downstream tasks. As part of our methods, we propose and motivate an improved posterior sampling scheme for unconditional diffusion models. We present empirical evidence supporting the effectiveness of our approach.
[797] Bellman Diffusion Models
Liam Schramm, Abdeslam Boularias
Main category: cs.LG
TL;DR: Using diffusion models to represent successor state measures in reinforcement learning, with Bellman constraints leading to simple updates.
Details
Motivation: Diffusion models have shown success in generative tasks and recently in policy modeling for offline RL and imitation learning. This work explores extending them to model successor state measures.Method: Enforcing Bellman flow constraints on diffusion-based successor state measures, resulting in a simple Bellman update on the diffusion step distribution.
Result: The approach enables diffusion models to effectively represent successor state measures while maintaining Bellman consistency.
Conclusion: Diffusion models can be successfully adapted to model successor state measures with Bellman constraints, providing a novel approach for reinforcement learning.
Abstract: Diffusion models have seen tremendous success as generative architectures. Recently, they have been shown to be effective at modelling policies for offline reinforcement learning and imitation learning. We explore using diffusion as a model class for the successor state measure (SSM) of a policy. We find that enforcing the Bellman flow constraints leads to a simple Bellman update on the diffusion step distribution.
[798] Gymnasium: A Standard Interface for Reinforcement Learning Environments
Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U. Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, Rodrigo Perez-Vicente, Andrea Pierré, Sander Schulhoff, Jun Jet Tai, Hannah Tan, Omar G. Younis
Main category: cs.LG
TL;DR: Gymnasium is an open-source library that provides a standardized API for RL environments to address the lack of standardization in reinforcement learning research, enabling easier comparison and development of RL algorithms.
Details
Motivation: RL research is hindered by lack of standardization in environment and algorithm implementations, making it difficult to compare and build upon existing work, which slows down progress in the field.Method: Gymnasium provides a standard API with abstractions for wide interoperability between environments and training algorithms, along with tools for customizing environments and ensuring reproducibility.
Result: The library significantly streamlines RL algorithm development and testing, allowing researchers to focus more on innovation rather than implementation details.
Conclusion: By offering a standardized platform, Gymnasium helps drive forward reinforcement learning research and unlock its full potential.
Abstract: Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. However, despite its promise, RL research is often hindered by the lack of standardization in environment and algorithm implementations. This makes it difficult for researchers to compare and build upon each other’s work, slowing down progress in the field. Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium’s main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research. Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on innovation and less on implementation details. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential. Gymnasium is available online at https://github.com/Farama-Foundation/Gymnasium
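The standardized API in question is the familiar reset/step loop; the snippet below shows the canonical interaction pattern with a built-in environment (the environment name and episode budget are arbitrary choices).

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)

for _ in range(1000):
    action = env.action_space.sample()   # stand-in for a trained policy
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        observation, info = env.reset()

env.close()
```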
[799] DuSEGO: Dual Second-order Equivariant Graph Ordinary Differential Equation
Yingxu Wang, Nan Yin, Mingyan Xiao, Xinhao Yi, Siwei Liu, Shangsong Liang
Main category: cs.LG
TL;DR: Proposes Dual Second-order Equivariant Graph ODE (DuSEGO) to address over-smoothing and gradient issues in GNNs by applying second-order graph ODEs to both embeddings and coordinates while maintaining equivariance.
Details
Motivation: Existing equivariant GNNs suffer from over-smoothing and gradient problems in deep networks, and most operate on first-order information while real-world systems are often second-order, limiting representation capabilities.Method: Applies dual second-order equivariant graph ordinary differential equations simultaneously on graph embeddings and node coordinates, maintaining equivariant properties while addressing optimization issues.
Result: Theoretical proofs show DuSEGO maintains equivariance, alleviates over-smoothing in both features and coordinates, and mitigates exploding/vanishing gradients, enabling deeper GNN training. Experimental validation shows superiority over baselines.
Conclusion: DuSEGO provides an effective framework for equivariant graph representation that overcomes key limitations of existing GNNs through second-order ODE modeling, with proven theoretical guarantees and empirical performance improvements.
Abstract: Graph Neural Networks (GNNs) with equivariant properties have achieved significant success in modeling complex dynamic systems and molecular properties. However, their expressive ability is limited by: (1) Existing methods often overlook the over-smoothing issue caused by traditional GNN models, as well as the gradient explosion or vanishing problems in deep GNNs. (2) Most models operate on first-order information, neglecting that the real world often consists of second-order systems, which further limits the model’s representation capabilities. To address these issues, we propose the Dual Second-order Equivariant Graph Ordinary Differential Equation (DuSEGO) for equivariant representation. Specifically, DuSEGO applies dual second-order equivariant graph ordinary differential equations (Graph ODEs) to graph embeddings and node coordinates simultaneously. Theoretically, we first prove that DuSEGO maintains the equivariant property. Furthermore, we provide theoretical insights showing that DuSEGO effectively alleviates the over-smoothing problem in both feature representation and coordinate update. Additionally, we demonstrate that the proposed DuSEGO mitigates the exploding and vanishing gradients problem, facilitating the training of deep multi-layer GNNs. Extensive experiments on benchmark datasets validate the superiority of the proposed DuSEGO compared to baselines.
[800] MistralBSM: Leveraging Mistral-7B for Vehicular Networks Misbehavior Detection
Wissal Hamhoum, Soumaya Cherkaoui
Main category: cs.LG
TL;DR: Proposes MistralBSM, a fine-tuned Mistral-7B LLM for detecting vehicle misbehavior within an edge-cloud framework, achieving 98% binary and 96% multiclass classification accuracy on VeReMi dataset attacks.
Details
Motivation: Malicious attacks on vehicular networks threaten road safety and communication reliability, with misbehaving vehicles being a major source of these threats.Method: Fine-tuned the Mistral-7B LLM to detect misbehavior from Basic Safety Message sequences as the edge component for real-time detection, with a larger cloud LLM for validation and reinforcement through a more comprehensive analysis. Only 0.012% of the model parameters were updated.
Result: Achieved 98% accuracy in binary classification and 96% in multiclass classification on selected VeReMi dataset attacks, outperforming LLAMA2-7B and RoBERTa.
Conclusion: Validates the potential of LLMs in Misbehavior Detection Systems, showing significant promise for strengthening vehicular network security and ensuring road user safety.
Abstract: Malicious attacks on vehicular networks pose a serious threat to road safety as well as communication reliability. A major source of these threats stems from misbehaving vehicles within the network. To address this challenge, we propose a Large Language Model (LLM)-empowered Misbehavior Detection System (MDS) within an edge-cloud detection framework. Specifically, we fine-tune Mistral-7B, a compact and high-performing LLM, to detect misbehavior based on Basic Safety Message (BSM) sequences as the edge component for real-time detection, while a larger LLM deployed in the cloud validates and reinforces the edge model’s detection through a more comprehensive analysis. By updating only 0.012% of the model parameters, our model, which we named MistralBSM, achieves 98% accuracy in binary classification and 96% in multiclass classification on a selected set of attacks from the VeReMi dataset, outperforming LLAMA2-7B and RoBERTa. Our results validate the potential of LLMs in MDS, showing significant promise in strengthening vehicular network security to better ensure the safety of road users.
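The reported 0.012% of updated parameters suggests a parameter-efficient adapter method; the sketch below uses LoRA via the `peft` library as a plausible stand-in, but the paper's actual fine-tuning recipe, model variant, and hyperparameters are not specified here and should be treated as assumptions.

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

# Hypothetical setup: binary misbehavior classification over BSM sequences.
model = AutoModelForSequenceClassification.from_pretrained(
    "mistralai/Mistral-7B-v0.1", num_labels=2)

config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                    task_type="SEQ_CLS")
model = get_peft_model(model, config)
model.print_trainable_parameters()   # reports the small trainable fraction
```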
[801] Machine Unlearning Doesn’t Do What You Think: Lessons for Generative AI Policy and Research
A. Feder Cooper, Christopher A. Choquette-Choo, Miranda Bogen, Kevin Klyman, Matthew Jagielski, Katja Filippova, Ken Liu, Alexandra Chouldechova, Jamie Hayes, Yangsibo Huang, Eleni Triantafillou, Peter Kairouz, Nicole Elyse Mitchell, Niloofar Mireshghallah, Abigail Z. Jacobs, James Grimmelmann, Vitaly Shmatikov, Christopher De Sa, Ilia Shumailov, Andreas Terzis, Solon Barocas, Jennifer Wortman Vaughan, Danah Boyd, Yejin Choi, Sanmi Koyejo, Fernando Delgado, Percy Liang, Daniel E. Ho, Pamela Samuelson, Miles Brundage, David Bau, Seth Neel, Hanna Wallach, Amy B. Cyphert, Mark A. Lemley, Nicolas Papernot, Katherine Lee
Main category: cs.LG
TL;DR: Machine unlearning is proposed as a solution to remove problematic content from AI models, but faces technical and substantive challenges that limit its effectiveness as a general-purpose solution.
Details
Motivation: To address legal and moral concerns in AI models including privacy, copyright, and safety issues by removing specific problematic content or suppressing targeted information from model outputs.Method: Provides a framework for analyzing the challenges of machine unlearning, identifying mismatches between unlearning goals and feasible implementations.
Result: Identifies several mismatches between the intended goals of unlearning and what can be practically achieved, explaining limitations of unlearning approaches.
Conclusion: Machine unlearning is not a general-purpose solution for controlling generative-AI model behavior and has significant limitations in achieving broader positive impact.
Abstract: “Machine unlearning” is a popular proposed solution for mitigating the existence of content in an AI model that is problematic for legal or moral reasons, including privacy, copyright, safety, and more. For example, unlearning is often invoked as a solution for removing the effects of specific information from a generative-AI model’s parameters, e.g., a particular individual’s personal data or the inclusion of copyrighted content in the model’s training data. Unlearning is also proposed as a way to prevent a model from generating targeted types of information in its outputs, e.g., generations that closely resemble a particular individual’s data or reflect the concept of “Spiderman.” Both of these goals–the targeted removal of information from a model and the targeted suppression of information from a model’s outputs–present various technical and substantive challenges. We provide a framework for ML researchers and policymakers to think rigorously about these challenges, identifying several mismatches between the goals of unlearning and feasible implementations. These mismatches explain why unlearning is not a general-purpose solution for circumscribing generative-AI model behavior in service of broader positive impact.
[802] FairAIED: Navigating Fairness, Bias, and Ethics in Educational AI Applications
Zhipeng Yin, Sribala Vidyadhari Chinta, Zichong Wang, Matthew Gonzalez, Wenbin Zhang
Main category: cs.LG
TL;DR: This survey provides a comprehensive systematic review of algorithmic fairness in educational AI, bridging the gap between technical fairness research and educational applications through a harmonized education-centered framework.
Details
Motivation: AI systems in education can inadvertently encode and amplify biases from educational data, leading to unfair outcomes. Existing research is fragmented with differing assumptions and methodologies, and current surveys either focus on algorithmic fairness without educational context or emphasize educational methods while overlooking fairness.Method: The survey conducts a comprehensive systematic review integrating multiple dimensions including bias sources, fairness definitions, mitigation strategies, evaluation resources, and ethical considerations into a harmonized education-centered framework. It also examines practical challenges like censored learning outcomes and fairness-utility trade-offs.
Result: The review establishes a comprehensive foundation for advancing fairness, accountability, and inclusivity in AI education by providing a unified framework that bridges technical fairness research with educational applications.
Conclusion: The survey outlines an emerging pathway toward fair AI-driven education by situating technologies and practical insights within broader educational and ethical contexts, enhancing the applicability of fairness frameworks to real-world educational AI systems.
Abstract: The integration of AI in education holds immense potential for personalizing learning experiences and transforming instructional practices. However, AI systems can inadvertently encode and amplify biases present in educational data, leading to unfair or discriminatory outcomes. As researchers have sought to understand and mitigate these biases, a growing body of work has emerged examining fairness in educational AI. These studies, though expanding rapidly, remain fragmented due to differing assumptions, methodologies, and application contexts. Moreover, existing surveys either focus on algorithmic fairness without an educational setting or emphasize educational methods while overlooking fairness. To this end, this survey provides a comprehensive systematic review of algorithmic fairness within educational AI, explicitly bridging the gap between technical fairness research and educational applications. We integrate multiple dimensions, including bias sources, fairness definitions, mitigation strategies, evaluation resources, and ethical considerations, into a harmonized, education-centered framework. In addition, we explicitly examine practical challenges such as censored or partially observed learning outcomes and the persistent difficulty in quantifying and managing the trade-off between fairness and predictive utility, enhancing the applicability of fairness frameworks to real-world educational AI systems. Finally, we outline an emerging pathway toward fair AI-driven education; by situating these technologies and practical insights within broader educational and ethical contexts, this review establishes a comprehensive foundation for advancing fairness, accountability, and inclusivity in the field of AI education.
[803] Low-Rank Adaptation for Foundation Models: A Comprehensive Review
Menglin Yang, Jialin Chen, Jinkai Tao, Yifei Zhang, Jiahong Liu, Jiasheng Zhang, Qiyao Ma, Harshit Verma, Regina Zhang, Min Zhou, Irwin King, Rex Ying
Main category: cs.LG
TL;DR: This survey provides the first comprehensive review of Low-Rank Adaptation (LoRA) techniques for general foundation models, covering recent developments, applications across domains, and future research directions.
Details
Motivation: Foundation models with billions/trillions of parameters face significant challenges in adapting to specific downstream tasks, requiring parameter-efficient fine-tuning methods.Method: Low-Rank Adaptation (LoRA) offers a parameter-efficient mechanism to fine-tune foundation models with minimal computational overhead by using low-rank matrices.
Result: The survey comprehensively reviews LoRA techniques beyond language models to general foundation models, including recent techniques, emerging frontiers, and applications across multiple domains.
Conclusion: LoRA serves as a valuable resource for efficient foundation model adaptation, though challenges remain in theoretical understanding, scalability, and robustness that require future research.
Abstract: The rapid advancement of foundation models, large-scale neural networks trained on diverse, extensive datasets, has revolutionized artificial intelligence, enabling unprecedented advancements across domains such as natural language processing, computer vision, and scientific discovery. However, the substantial parameter count of these models, often reaching billions or trillions, poses significant challenges in adapting them to specific downstream tasks. Low-Rank Adaptation (LoRA) has emerged as a highly promising approach for mitigating these challenges, offering a parameter-efficient mechanism to fine-tune foundation models with minimal computational overhead. This survey provides the first comprehensive review of LoRA techniques beyond large language models to general foundation models, including the foundations of recent techniques, emerging frontiers, and applications of low-rank adaptation across multiple domains. Finally, this survey discusses key challenges and future research directions in theoretical understanding, scalability, and robustness. This survey serves as a valuable resource for researchers and practitioners working with efficient foundation model adaptation.
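For readers new to the technique, the core of LoRA is a frozen pretrained weight plus a trainable low-rank update W + (alpha/r) * B A; a minimal PyTorch module illustrating this is below (the initialization and scaling follow the common convention).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                               # freeze pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Only A and B are trained, so the number of trainable parameters scales with the rank r rather than with the full weight matrix.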
[804] Lyapunov Neural ODE State-Feedback Control Policies
Joshua Hang Sai Ip, Georgios Makrygiorgos, Ali Mesbah
Main category: cs.LG
TL;DR: L-NODEC is a neural ODE approach that uses Lyapunov loss to learn stable control policies for constrained nonlinear systems, guaranteeing exponential stability and adversarial robustness.
Details
Motivation: To bridge the gap between neural ODE approaches for continuous-time optimal control and stability guarantees, enabling safe and robust control of constrained nonlinear systems.Method: Uses a novel Lyapunov loss formulation incorporating exponentially-stabilizing control Lyapunov functions to learn state-feedback neural control policies via neural ODEs.
Result: L-NODEC effectively stabilizes controlled systems around target states despite perturbations, reduces inference time to reach targets, and demonstrates performance in problems including plasma medicine dose delivery.
Conclusion: The proposed Lyapunov-NODE control approach successfully solves continuous-time optimal control problems with stability guarantees and adversarial robustness.
Abstract: Deep neural networks are increasingly used as an effective parameterization of control policies in various learning-based control paradigms. For continuous-time optimal control problems (OCPs), which are central to many decision-making tasks, control policy learning can be cast as a neural ordinary differential equation (NODE) problem wherein state and control constraints are naturally accommodated. This paper presents a NODE approach to solving continuous-time OCPs for the case of stabilizing a known constrained nonlinear system around a target state. The approach, termed Lyapunov-NODE control (L-NODEC), uses a novel Lyapunov loss formulation that incorporates an exponentially-stabilizing control Lyapunov function to learn a state-feedback neural control policy, bridging the gap of solving continuous-time OCPs via NODEs with stability guarantees. The proposed Lyapunov loss allows L-NODEC to guarantee exponential stability of the controlled system, as well as its adversarial robustness to perturbations to the initial state. The performance of L-NODEC is illustrated in two problems, including a dose delivery problem in plasma medicine. In both cases, L-NODEC effectively stabilizes the controlled system around the target state despite perturbations to the initial state and reduces the inference time necessary to reach the target.
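To give a flavor of what a Lyapunov loss can look like, here is a simplified penalty that enforces exponential decrease of a quadratic candidate V(x) = ||x - x*||^2 along a rolled-out trajectory; the paper's control Lyapunov function and loss formulation are more specific, so this is only a hedged illustration.

```python
import torch

def lyapunov_loss(states: torch.Tensor, target: torch.Tensor,
                  kappa: float = 1.0, dt: float = 0.01) -> torch.Tensor:
    # states: (T, batch, dim) trajectory from the NODE rollout; target: (dim,).
    V = ((states - target) ** 2).sum(dim=-1)   # candidate Lyapunov function
    V_dot = (V[1:] - V[:-1]) / dt              # finite-difference time derivative
    # Penalize violations of the exponential-stability condition dV/dt <= -kappa * V.
    return torch.relu(V_dot + kappa * V[:-1]).mean()
```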
[805] Neural Entropy
Akhil Premkumar
Main category: cs.LG
TL;DR: The paper introduces neural entropy to measure information stored in diffusion models, showing they efficiently compress structured data.
Details
Motivation: To explore the connection between deep learning and information theory through diffusion models, quantifying how they store information erased during diffusion.Method: Introduces neural entropy as a measure related to total entropy produced by diffusion, analyzing how diffusion models convert noise back to structured data by storing erased information in neural networks.
Result: Measurements on simple image diffusion models show they are extremely efficient at compressing large ensembles of structured data.
Conclusion: Diffusion models serve as efficient information compressors through the neural entropy framework, bridging deep learning and information theory.
Abstract: We explore the connection between deep learning and information theory through the paradigm of diffusion models. A diffusion model converts noise into structured data by reinstating, imperfectly, information that is erased when data was diffused to noise. This information is stored in a neural network during training. We quantify this information by introducing a measure called neural entropy, which is related to the total entropy produced by diffusion. Neural entropy is a function of not just the data distribution, but also the diffusive process itself. Measurements of neural entropy on a few simple image diffusion models reveal that they are extremely efficient at compressing large ensembles of structured data.
[806] Towards Large-Scale In-Context Reinforcement Learning by Meta-Training in Randomized Worlds
Fan Wang, Pengtao Shao, Yiming Zhang, Bo Yu, Shaoshan Liu, Ning Ding, Yang Cao, Yu Kang, Haifeng Wang
Main category: cs.LG
TL;DR: Proposes AnyMDP, a procedurally generated tabular MDP framework for scalable in-context reinforcement learning, enabling large-scale meta-training and investigation of data distribution effects on ICRL performance.
Details
Motivation: Addresses the lack of scalable task collections for scaling up In-Context Reinforcement Learning (ICRL) by creating a framework that can generate high-quality tasks at large scale with minimal structural biases.Method: Develops AnyMDP through a carefully designed randomization process for procedural task generation, introduces decoupled policy distillation, and incorporates prior information in the ICRL framework for efficient meta-training at scale.
Result: Demonstrates that with large-scale AnyMDP tasks, the model can generalize to unseen tasks through versatile in-context learning paradigms, and enables empirical investigation of data distribution effects on ICRL performance.
Conclusion: ICRL generalization comes at the cost of increased task diversity and longer adaptation periods, highlighting the need for diverse task design and prioritizing asymptotic performance over few-shot adaptation for scaling robust ICRL capabilities.
Abstract: In-Context Reinforcement Learning (ICRL) enables agents to learn automatically and on-the-fly from their interactive experiences. However, a major challenge in scaling up ICRL is the lack of scalable task collections. To address this, we propose the procedurally generated tabular Markov Decision Processes, named AnyMDP. Through a carefully designed randomization process, AnyMDP is capable of generating high-quality tasks on a large scale while maintaining relatively low structural biases. To facilitate efficient meta-training at scale, we further introduce decoupled policy distillation and induce prior information in the ICRL framework. Our results demonstrate that, with a sufficiently large scale of AnyMDP tasks, the proposed model can generalize to tasks that were not considered in the training set through versatile in-context learning paradigms. The scalable task set provided by AnyMDP also enables a more thorough empirical investigation of the relationship between data distribution and ICRL performance. We further show that the generalization of ICRL potentially comes at the cost of increased task diversity and longer adaptation periods. This finding carries critical implications for scaling robust ICRL capabilities, highlighting the necessity of diverse and extensive task design, and prioritizing asymptotic performance over few-shot adaptation.
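As a simple baseline for what "procedurally generated tabular MDPs" means, one can sample transition kernels and rewards at random; AnyMDP's actual randomization is more carefully designed to limit structural bias, so the following is only an illustrative sketch.

```python
import numpy as np

def sample_random_mdp(num_states: int = 10, num_actions: int = 4, seed: int = 0):
    rng = np.random.default_rng(seed)
    # P[s, a] is a categorical distribution over next states.
    P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))
    R = rng.uniform(-1.0, 1.0, size=(num_states, num_actions))   # reward for (s, a)
    return P, R
```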
[807] From Epilepsy Seizures Classification to Detection: A Deep Learning-based Approach for Raw EEG Signals
Davy Darankoum, Manon Villalba, Clelia Allioux, Baptiste Caraballo, Carine Dumont, Eloise Gronlier, Corinne Roucard, Yann Roche, Chloe Habermacher, Sergei Grudinin, Julien Volle
Main category: cs.LG
TL;DR: A deep learning pipeline for epileptic seizure detection from raw EEG signals, combining CNN and Transformer architecture with novel preprocessing and postprocessing methods, achieving cross-species generalization with 93% F1-score.
Details
Motivation: One-third of mesial temporal lobe epilepsy patients are drug-resistant, requiring new treatments. Accurate seizure detection in EEG signals is crucial for evaluating anti-seizure medication efficacy.Method: Pipeline includes: novel preprocessing segmenting continuous raw EEG without prior seizure/non-seizure distinction; postprocessing to reassemble segments and identify seizure start/end; CNN-Transformer architecture; data splitting strategy to prevent leakage.
Result: Demonstrated fundamental differences between seizure classification vs detection tasks. Achieved cross-species generalization - model trained on animal EEGs achieved 93% F1-score on human Bonn dataset.
Conclusion: The proposed deep learning pipeline effectively detects seizures from raw EEG signals and shows strong generalization capabilities across species, providing a valuable tool for ASM development and epilepsy treatment evaluation.
Abstract: Epilepsy represents the most prevalent neurological disease in the world. One-third of people suffering from mesial temporal lobe epilepsy (MTLE) exhibit drug resistance, urging the need to develop new treatments. A key part of anti-seizure medication (ASM) development is the capability of detecting and quantifying epileptic seizures occurring in electroencephalogram (EEG) signals, which is crucial for treatment efficacy evaluation. In this study, we introduced a seizure detection pipeline based on deep learning models applied to raw EEG signals. This pipeline integrates: a new pre-processing technique which segments continuous raw EEG signals without prior distinction between seizure and seizure-free activities; a post-processing algorithm developed to reassemble EEG segments and allow the identification of seizure start/end; and finally, a new evaluation procedure based on a strict comparison of seizure events between predicted and real labels. Model training was performed using a data splitting strategy which addresses the potential for data leakage. We demonstrated the fundamental differences between a seizure classification and a seizure detection task and showed the differences in performance between the two tasks. Finally, we demonstrated the cross-species generalization capabilities of our best architecture, combining a Convolutional Neural Network and a Transformer encoder. The model was trained on animal EEGs and tested on human EEGs, achieving an F1-score of 93% on a balanced Bonn dataset.
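To make the pre-/post-processing steps concrete, here is a minimal sketch of segmenting a continuous raw EEG trace into fixed-length windows and reassembling per-window predictions into seizure start/end events. The window length, stride, and merging rule are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def segment_eeg(signal, fs, win_s=4.0, stride_s=2.0):
    """Split a continuous 1-D EEG trace into overlapping fixed-length windows."""
    win, stride = int(win_s * fs), int(stride_s * fs)
    starts = np.arange(0, len(signal) - win + 1, stride)
    return np.stack([signal[s:s + win] for s in starts]), starts, win

def reassemble(pred_labels, starts, win, fs):
    """Merge consecutive windows predicted as seizure (label 1) into (start_s, end_s) events."""
    events, open_start, last_end = [], None, None
    for label, s in zip(pred_labels, starts):
        if label == 1:
            if open_start is None:
                open_start = s
            last_end = s + win
        elif open_start is not None:
            events.append((open_start / fs, last_end / fs))
            open_start = None
    if open_start is not None:
        events.append((open_start / fs, last_end / fs))
    return events

fs = 256
eeg = np.random.randn(60 * fs)                     # one minute of toy signal
windows, starts, win = segment_eeg(eeg, fs)
labels = np.zeros(len(starts), dtype=int); labels[5:9] = 1   # pretend the model flagged these
print(reassemble(labels, starts, win, fs))          # [(10.0, 20.0)] in seconds
```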
[808] Mechanism Learning: reverse causal inference in the presence of multiple unknown confounding through causally weighted Gaussian mixture models
Jianqiao Mao, Max A. Little
Main category: cs.LG
TL;DR: Mechanism learning uses causally weighted Gaussian Mixture Models to deconfound observational data, enabling ML models to learn causal relationships rather than spurious associations, even with unmeasured confounding.
Details
Motivation: Traditional ML models learn associational relationships that can be spurious and non-causal, which is problematic in high-stakes applications where causal understanding is crucial.Method: Proposes causally weighted Gaussian Mixture Models (CW-GMMs) that deconfound observational data by leveraging mechanism variables that mediate between causes and effects but are independent of confounding variables.
Result: The method successfully discovers reliable, unbiased causal predictors across synthetic, semi-synthetic and real-world datasets, outperforming classical supervised learning which remains heavily biased by spurious associations.
Conclusion: Mechanism learning provides a widely applicable approach for enabling ML models to learn causal relationships from observational data, addressing the fundamental limitation of associational learning in high-stakes applications.
Abstract: A major limitation of machine learning (ML) prediction models is that they recover associational, rather than causal, predictive relationships between variables. In high-stakes automation applications of ML this is problematic, as the model often learns spurious, non-causal associations. This paper proposes mechanism learning, a simple method which uses causally weighted Gaussian Mixture Models (CW-GMMs) to deconfound observational data such that any appropriate ML model is forced to learn predictive relationships between effects and their causes (reverse causal inference), despite the potential presence of multiple unknown and unmeasured confounding. Effect variables can be very high-dimensional, and the predictive relationship nonlinear, as is common in ML applications. This novel method is widely applicable; the only requirement is the existence of a set of mechanism variables mediating the cause (prediction target) and effect (feature data), which is independent of the (unmeasured) confounding variables. We test our method on fully synthetic, semi-synthetic and real-world datasets, demonstrating that it can discover reliable, unbiased, causal ML predictors where, by contrast, the same ML predictor trained naively using classical supervised learning on the original observational data is heavily biased by spurious associations. We provide code online to reproduce the results in the paper.
[809] Exploring Kolmogorov-Arnold Networks for Interpretable Time Series Classification
Irina Barašin, Blaž Bertalanič, Mihael Mohorčič, Carolina Fortuna
Main category: cs.LG
TL;DR: This paper explores Kolmogorov-Arnold Networks (KANs) for time series classification using 117 UCR datasets, showing Efficient KAN outperforms MLPs in performance and training time, achieves competitive accuracy with state-of-the-art models while being more interpretable.
Details
Motivation: Deep neural models show promising performance for time series classification but lack theoretical understanding and interpretability. KANs have been proposed as more interpretable alternatives, but their application to time series classification remains limited.Method: Comprehensive exploration of KAN architecture for time series classification using 117 UCR datasets, investigating transferability of regression architectures, optimal hyperparameter configurations, complexity trade-offs, and interpretability evaluation.
Result: Efficient KAN outperforms MLPs in performance and training times, shows greater stability than original KAN, achieves competitive accuracy with state-of-the-art models (HIVE-COTE2, InceptionTime) with smaller architectures and faster training, and demonstrates interpretability through SHAP analysis.
Conclusion: KANs provide a favorable balance of performance and transparency for time series classification, offering competitive accuracy while maintaining interpretability and efficiency advantages over traditional deep learning approaches.
Abstract: Time series classification is a relevant step supporting decision-making processes in various domains, and deep neural models have shown promising performance in this respect. Despite significant advancements in deep learning, the theoretical understanding of how and why complex architectures function remains limited, prompting the need for more interpretable models. Recently, the Kolmogorov-Arnold Networks (KANs) have been proposed as a more interpretable alternative to deep learning. While KAN-related research is significantly rising, to date, the study of KAN architectures for time series classification has been limited. In this paper, we aim to conduct a comprehensive and robust exploration of the KAN architecture for time series classification utilising 117 datasets from UCR benchmark archive, from multiple different domains. More specifically, we investigate a) the transferability of reference architectures designed for regression to classification tasks, b) identifying the hyperparameter and implementation configurations for an architecture that best generalizes across 117 datasets, c) the associated complexity trade-offs and d) evaluate KANs interpretability. Our results demonstrate that (1) the Efficient KAN outperforms MLPs in both performance and training times, showcasing its suitability for classification tasks. (2) Efficient KAN exhibits greater stability than the original KAN across grid sizes, depths, and layer configurations, especially when lower learning rates are employed. (3) KAN achieves competitive accuracy compared to state-of-the-art models such as HIVE-COTE2 and InceptionTime, while maintaining smaller architectures and faster training times, highlighting its favorable balance of performance and transparency. (4) The interpretability of the KAN model, as confirmed by SHAP analysis, reinforces its capacity for transparent decision-making.
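For a concrete picture of what a KAN layer computes, the toy below implements a single layer in which every edge carries a learnable univariate function, here parameterized with a Gaussian RBF basis instead of the B-splines used in published KAN implementations; it is a forward-pass sketch only, not the Efficient KAN code.

```python
import numpy as np

class ToyKANLayer:
    """One KAN-style layer: y_j = sum_i phi_ij(x_i), each phi_ij a learnable 1-D function."""
    def __init__(self, in_dim, out_dim, n_basis=8, x_range=(-2.0, 2.0), seed=0):
        rng = np.random.default_rng(seed)
        self.centers = np.linspace(*x_range, n_basis)          # shared RBF centers
        self.width = (x_range[1] - x_range[0]) / n_basis
        # One coefficient vector per edge (i, j): shape (in_dim, out_dim, n_basis).
        self.coef = rng.normal(scale=0.1, size=(in_dim, out_dim, n_basis))

    def forward(self, x):
        # x: (batch, in_dim) -> basis activations: (batch, in_dim, n_basis)
        basis = np.exp(-((x[..., None] - self.centers) / self.width) ** 2)
        # Sum phi_ij(x_i) over inputs i; einsum contracts the input and basis axes.
        return np.einsum('bik,iok->bo', basis, self.coef)

layer = ToyKANLayer(in_dim=3, out_dim=2)
print(layer.forward(np.random.randn(4, 3)).shape)   # (4, 2)
```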
[810] Deep Modularity Networks with Diversity-Preserving Regularization
Yasmin Salehi, Dennis Giannacopoulos
Main category: cs.LG
TL;DR: DMoN-DPR enhances graph clustering by adding three diversity-preserving regularization terms to address DMoN’s limitations in feature-space separation, assignment dispersion, and confidence control.
Details
Motivation: Graph clustering often lacks feature-space diversity. DMoN provides structural separation but lacks explicit mechanisms for feature-space separation, assignment dispersion, and assignment-confidence control.Method: Proposes DMoN-DPR with three novel regularization terms: distance-based for inter-cluster separation, variance-based for per-cluster assignment dispersion, and assignment-entropy penalty with small positive weight for confident assignments.
Result: Significantly enhances label-based clustering metrics on feature-rich benchmark datasets (paired two-tailed t-test, p≤0.05).
Conclusion: Incorporating diversity-preserving regularizations effectively creates more meaningful and interpretable clusters in graph representation learning.
Abstract: Graph clustering plays a crucial role in graph representation learning but often faces challenges in achieving feature-space diversity. While Deep Modularity Networks (DMoN) leverage modularity maximization and collapse regularization to ensure structural separation, they lack explicit mechanisms for feature-space separation, assignment dispersion, and assignment-confidence control. We address this limitation by proposing Deep Modularity Networks with Diversity-Preserving Regularization (DMoN-DPR), which introduces three novel regularization terms: distance-based for inter-cluster separation, variance-based for per-cluster assignment dispersion, and an assignment-entropy penalty with a small positive weight, encouraging more confident assignments gradually. Our method significantly enhances label-based clustering metrics on feature-rich benchmark datasets (paired two-tailed t-test, $p\leq0.05$), demonstrating the effectiveness of incorporating diversity-preserving regularizations in creating meaningful and interpretable clusters.
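To make the three regularizers more concrete, the sketch below shows one plausible way to compute distance-based separation, variance-based dispersion, and a lightly weighted entropy penalty from a soft assignment matrix and node features; the exact definitions, signs, and weights in DMoN-DPR may differ.

```python
import torch

def diversity_regularizers(C, X, entropy_weight=0.01, eps=1e-9):
    """C: (n_nodes, k) soft cluster assignments (rows sum to 1); X: (n_nodes, d) features.
    The three terms below are illustrative guesses at how such regularizers can be formed."""
    mass = C.sum(dim=0, keepdim=True).clamp_min(eps)           # per-cluster assignment mass
    centroids = (C.t() @ X) / mass.t()                          # soft centroids in feature space
    # 1) Distance-based term: reward separation between cluster centroids.
    pdist = torch.cdist(centroids, centroids)
    separation = -pdist.sum() / (pdist.numel() - pdist.shape[0])
    # 2) Variance-based term: encourage dispersion within each cluster's assignment column.
    dispersion = -C.var(dim=0).mean()
    # 3) Entropy penalty with a small positive weight: nudge assignments toward confidence.
    entropy = -(C.clamp_min(eps) * C.clamp_min(eps).log()).sum(dim=1).mean()
    return separation + dispersion + entropy_weight * entropy

C = torch.softmax(torch.randn(100, 4), dim=1)
X = torch.randn(100, 32)
print(diversity_regularizers(C, X))
```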
[811] Co-MTP: A Cooperative Trajectory Prediction Framework with Multi-Temporal Fusion for Autonomous Driving
Xinyu Zhang, Zewei Zhou, Zhaoyi Wang, Yangjie Ji, Yanjun Huang, Hong Chen
Main category: cs.LG
TL;DR: Co-MTP is a cooperative trajectory prediction framework that uses V2X technology to fuse temporal information from multiple vehicles for better autonomous driving prediction and planning.
Details
Motivation: Current V2X research focuses on single-frame cooperative perception, but temporal cues between frames for prediction and planning tasks remain underexplored.Method: Uses V2X system with heterogeneous graph transformers to capture interactions in both history and future domains. History domain complements incomplete trajectories, while future domain incorporates planning actions and other vehicles’ intentions.
Result: Achieves state-of-the-art performance on V2X-Seq dataset, with both history and future fusion significantly benefiting prediction accuracy.
Conclusion: Co-MTP demonstrates that leveraging V2X for multi-temporal fusion in both history and future domains greatly improves trajectory prediction for autonomous driving.
Abstract: Vehicle-to-everything (V2X) technologies have become an ideal paradigm to extend the perception range and see through occlusion. Existing efforts focus on single-frame cooperative perception; however, how to capture the temporal cues between frames with V2X to facilitate the prediction task, and even the planning task, is still underexplored. In this paper, we introduce Co-MTP, a general cooperative trajectory prediction framework with multi-temporal fusion for autonomous driving, which leverages the V2X system to fully capture the interactions among agents in both the history and future domains to benefit planning. In the history domain, V2X can complement the incomplete history trajectories in single-vehicle perception, and we design a heterogeneous graph transformer to learn the fusion of history features from multiple agents and capture the history interaction. Moreover, the goal of prediction is to support future planning. Thus, in the future domain, V2X can provide the prediction results of surrounding objects, and we further extend the graph transformer to capture the future interaction among the ego planning and the other vehicles’ intentions and obtain the final future scenario state under a certain planning action. We evaluate the Co-MTP framework on the real-world dataset V2X-Seq, and the results show that Co-MTP achieves state-of-the-art performance and that both history and future fusion can greatly benefit prediction.
[812] Implicit Bias in Matrix Factorization and its Explicit Realization in a New Architecture
Yikun Hou, Suvrit Sra, Alp Yurtsever
Main category: cs.LG
TL;DR: Gradient descent for matrix factorization shows implicit bias toward low-rank solutions, even with unbounded iterates. A new factorization model with constrained U,V and diagonal D achieves truly low-rank solutions. Extended to neural networks, it produces competitive performance with lightweight representations.
Details
Motivation: Existing theories assume bounded iterates, but empirical bias persists even with unbounded sequences. Factors develop low-rank structure while magnitudes increase, aligning with certain directions. Need to capture this behavior in a stable way.Method: Introduce new factorization model: X ≈ UDV⊤ where U and V are constrained within norm balls, and D is a diagonal factor allowing full search space coverage. Extend to neural networks with constrained layers and diagonal components.
Result: Model consistently exhibits strong implicit bias, yielding truly low-rank solutions (not just approximate). Extended neural network model achieves competitive performance on regression and classification tasks while producing lightweight, low-rank representations.
Conclusion: The constrained factorization approach effectively captures the implicit bias of gradient descent toward low-rank solutions, providing a stable framework for obtaining truly low-rank representations in both matrix factorization and neural networks.
Abstract: Gradient descent for matrix factorization exhibits an implicit bias toward approximately low-rank solutions. While existing theories often assume the boundedness of iterates, empirically the bias persists even with unbounded sequences. This reflects a dynamic where factors develop low-rank structure while their magnitudes increase, tending to align with certain directions. To capture this behavior in a stable way, we introduce a new factorization model: $X\approx UDV^\top$, where $U$ and $V$ are constrained within norm balls, while $D$ is a diagonal factor allowing the model to span the entire search space. Experiments show that this model consistently exhibits a strong implicit bias, yielding truly (rather than approximately) low-rank solutions. Extending the idea to neural networks, we introduce a new model featuring constrained layers and diagonal components that achieves competitive performance on various regression and classification tasks while producing lightweight, low-rank representations.
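A minimal sketch of the constrained factorization idea: projected gradient descent on $X \approx U\,\mathrm{diag}(d)\,V^\top$ with $U$ and $V$ projected back into Frobenius-norm balls after each step, while the diagonal factor is left free. The radii, learning rate, and toy target below are assumptions for illustration, not the paper's experimental setup.

```python
import numpy as np

def project_fro_ball(M, radius=1.0):
    """Project a matrix into the Frobenius-norm ball of the given radius."""
    norm = np.linalg.norm(M)
    return M if norm <= radius else M * (radius / norm)

def fit_udv(X, rank=4, lr=0.05, steps=3000, seed=0):
    """Projected gradient descent on ||U diag(d) V^T - X||^2 with norm-constrained U, V."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.normal(scale=0.1, size=(m, rank))
    V = rng.normal(scale=0.1, size=(n, rank))
    d = np.ones(rank)                                   # unconstrained diagonal factor
    for _ in range(steps):
        R = (U * d) @ V.T - X                           # residual
        gU, gV = R @ (V * d), R.T @ (U * d)             # gradients w.r.t. U and V
        gd = np.einsum('ir,ij,jr->r', U, R, V)          # gradient w.r.t. diag entries
        U = project_fro_ball(U - lr * gU)
        V = project_fro_ball(V - lr * gV)
        d = d - lr * gd
    return U, d, V

X = np.outer(np.linspace(0, 1, 6), np.linspace(0, 1, 5))    # a rank-1 target
U, d, V = fit_udv(X)
print("reconstruction error:", np.linalg.norm((U * d) @ V.T - X))
print("diagonal magnitudes:", np.round(np.abs(d), 3))
```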
[813] Understanding Endogenous Data Drift in Adaptive Models with Recourse-Seeking Users
Bo-Yi Liu, Zhi-Xuan Liu, Kuan Lun Chen, Shih-Yu Tsai, Jie Gao, Hao-Tsung Yang
Main category: cs.LG
TL;DR: This paper examines how user recourse behavior creates feedback loops in ML systems, leading to increasingly higher decision standards and higher recourse costs over time.
Details
Motivation: Real-world ML systems face distribution shifts when users adapt their features to meet model criteria, creating feedback loops between user behavior and model updates that have received limited attention.Method: Developed a general framework to model user strategic behaviors under resource constraints and competitive dynamics, with theoretical and empirical analysis of logistic and MLP models.
Result: User recourse behavior pushes models toward higher decision standards, increasing recourse costs and making recourse actions less reliable. Proposed Fair-top-k and Dynamic Continual Learning methods to mitigate these issues.
Conclusion: Algorithmic decision-making can unintentionally reinforce higher standards and create endogenous barriers to entry, connecting to economic theories about market dynamics.
Abstract: Deep learning models are widely used in decision-making and recommendation systems, where they typically rely on the assumption of a static data distribution between training and deployment. However, real-world deployment environments often violate this assumption. Users who receive negative outcomes may adapt their features to meet model criteria, i.e., recourse action. These adaptive behaviors create shifts in the data distribution and when models are retrained on this shifted data, a feedback loop emerges: user behavior influences the model, and the updated model in turn reshapes future user behavior. Despite its importance, this bidirectional interaction between users and models has received limited attention. In this work, we develop a general framework to model user strategic behaviors and their interactions with decision-making systems under resource constraints and competitive dynamics. Both the theoretical and empirical analyses show that user recourse behavior tends to push logistic and MLP models toward increasingly higher decision standards, resulting in higher recourse costs and less reliable recourse actions over time. To mitigate these challenges, we propose two methods–Fair-top-k and Dynamic Continual Learning (DCL)–which significantly reduce recourse cost and improve model robustness. Our findings draw connections to economic theories, highlighting how algorithmic decision-making can unintentionally reinforce a higher standard and generate endogenous barriers to entry.
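The feedback loop described above can be reproduced in a toy simulation: users just below the decision boundary move their observable feature (but not the underlying outcome), the logistic model is retrained, and the boundary creeps upward each round. The population model, budget, and recourse rule below are our own simplifications; the paper's Fair-top-k and DCL mitigations are not implemented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
merit = rng.normal(size=n)                       # hidden quantity that actually drives outcomes
x = merit + 0.5 * rng.normal(size=n)             # observable feature used by the model
y = (merit > 0).astype(int)                      # outcomes stay fixed even if x is gamed
budget = 1.0                                     # maximum feature change a user can afford

model = LogisticRegression().fit(x[:, None], y)
for round_ in range(5):
    boundary = -model.intercept_[0] / model.coef_[0, 0]   # x where p(y=1|x) = 0.5
    print(f"round {round_}: decision boundary = {boundary:.3f}")
    movable = (x < boundary) & (boundary - x <= budget)   # rejected users who can afford recourse
    x = np.where(movable, boundary + 0.01, x)             # recourse moves x, not the outcome
    model = LogisticRegression().fit(x[:, None], y)       # model retrains on the shifted data
```

Because the gamed users pile up just above the old boundary without becoming true positives, each retraining pushes the boundary higher, mirroring the escalation of decision standards described in the abstract.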
[814] E2Former: An Efficient and Equivariant Transformer with Linear-Scaling Tensor Products
Yunyang Li, Lin Huang, Zhihao Ding, Chu Wang, Xinran Wei, Han Yang, Zun Wang, Chang Liu, Yu Shi, Peiran Jin, Tao Qin, Mark Gerstein, Jia Zhang
Main category: cs.LG
TL;DR: E2Former introduces an equivariant transformer using Wigner 6j convolution to reduce computational complexity from O(|E|) to O(|V|) while maintaining expressive power and rotational equivariance, achieving 7x-30x speedup over conventional SO(3) convolutions.
Details
Motivation: EGNNs face computational challenges due to expensive edge feature construction via spherical tensor products, making them impractical for large-scale systems.Method: E2Former architecture incorporating Wigner 6j convolution that shifts computational burden from edges to nodes, reducing complexity while preserving equivariance.
Result: Achieves 7x-30x speedup compared to conventional SO(3) convolutions and mitigates computational challenges without compromising geometric information capture.
Conclusion: E2Former represents a promising direction for scalable and efficient molecular modeling by addressing computational bottlenecks in equivariant graph neural networks.
Abstract: Equivariant Graph Neural Networks (EGNNs) have demonstrated significant success in modeling microscale systems, including those in chemistry, biology and materials science. However, EGNNs face substantial computational challenges due to the high cost of constructing edge features via spherical tensor products, making them impractical for large-scale systems. To address this limitation, we introduce E2Former, an equivariant and efficient transformer architecture that incorporates the Wigner $6j$ convolution (Wigner $6j$ Conv). By shifting the computational burden from edges to nodes, the Wigner $6j$ Conv reduces the complexity from $O(|\mathcal{E}|)$ to $O(|\mathcal{V}|)$ while preserving both the model’s expressive power and rotational equivariance. We show that this approach achieves a 7x-30x speedup compared to conventional $\mathrm{SO}(3)$ convolutions. Furthermore, our empirical results demonstrate that the derived E2Former mitigates the computational challenges of existing approaches without compromising the ability to capture detailed geometric information. This development could suggest a promising direction for scalable and efficient molecular modeling.
[815] DMol: A Highly Efficient and Chemical Motif-Preserving Molecule Generation Platform
Peizhi Niu, Yu-Hsiang Wang, Vishal Rana, Chetan Rupakheti, Abhishek Pandey, Olgica Milenkovic
Main category: cs.LG
TL;DR: DMol is a new graph diffusion model for small molecule generation that outperforms DiGress in validity by 1.5% while reducing diffusion steps 10x and runtime by half. Compressed DMol further improves validity by 2% and novelty.
Details
Motivation: To develop a more efficient and valid molecule generation model that reduces computational complexity while maintaining or improving performance over state-of-the-art methods like DiGress.Method: Uses graph diffusion with modified objective function and graph noise scheduling that changes subsets of nodes at each step. Can be combined with junction-tree-like representations by compressing frequent carbon rings into supernodes.
Result: DMol achieves 1.5% higher validity than DiGress across datasets, reduces diffusion steps by 10x, and cuts runtime in half. Compressed DMol adds 2% more validity improvement and increases novelty.
Conclusion: DMol provides significant efficiency and validity improvements for molecule generation through optimized diffusion scheduling and objective functions, with compressed DMol offering additional benefits through ring compression.
Abstract: We introduce a new graph diffusion model for small molecule generation, DMol, which outperforms the state-of-the-art DiGress model in terms of validity by roughly 1.5% across all benchmarking datasets while reducing the number of diffusion steps by at least 10-fold, and the running time to roughly one half. The performance improvements are a result of a careful change in the objective function and a graph noise scheduling approach which, at each diffusion step, allows one to only change a subset of nodes of varying size in the molecule graph. Another relevant property of the method is that it can be easily combined with junction-tree-like graph representations that arise by compressing a collection of relevant ring structures into supernodes. Unlike classical junction-tree techniques that involve VAEs and require complicated reconstruction steps, compressed DMol directly performs graph diffusion on a graph that compresses only a carefully selected set of frequent carbon rings into supernodes, which results in straightforward sample generation. This compressed DMol method offers additional validity improvements over generic DMol of roughly 2%, increases the novelty of the method, and further improves the running time due to reductions in the graph size.
[816] Can Classic GNNs Be Strong Baselines for Graph-level Tasks? Simple Architectures Meet Excellence
Yuankai Luo, Lei Shi, Xiao-Ming Wu
Main category: cs.LG
TL;DR: Enhanced GNNs (GNN+) with six techniques match or surpass Graph Transformers in graph-level tasks, challenging the superiority of complex GT architectures.
Details
Motivation: To challenge the prevailing belief that Graph Transformers outperform GNNs in graph-level tasks by exploring the untapped potential of enhanced GNN architectures.Method: Developed GNN+ framework integrating six techniques (edge features, normalization, dropout, residuals, feed-forward networks, positional encoding) and evaluated three classic GNNs (GCN, GIN, GatedGCN) across 14 graph-level datasets.
Result: Enhanced GNNs consistently matched or surpassed GTs, achieving top-three rankings on all datasets and first place in eight, while running several times faster than GTs.
Conclusion: Simple GNN architectures with proper enhancements can achieve superior graph-level performance compared to complex Graph Transformers, challenging current assumptions about GT superiority.
Abstract: Message-passing Graph Neural Networks (GNNs) are often criticized for their limited expressiveness, issues like over-smoothing and over-squashing, and challenges in capturing long-range dependencies. Conversely, Graph Transformers (GTs) are regarded as superior due to their employment of global attention mechanisms, which potentially mitigate these challenges. Literature frequently suggests that GTs outperform GNNs in graph-level tasks, especially for graph classification and regression on small molecular graphs. In this study, we explore the untapped potential of GNNs through an enhanced framework, GNN+, which integrates six widely used techniques: edge feature integration, normalization, dropout, residual connections, feed-forward networks, and positional encoding, to effectively tackle graph-level tasks. We conduct a systematic re-evaluation of three classic GNNs (GCN, GIN, and GatedGCN) enhanced by the GNN+ framework across 14 well-known graph-level datasets. Our results reveal that, contrary to prevailing beliefs, these classic GNNs consistently match or surpass the performance of GTs, securing top-three rankings across all datasets and achieving first place in eight. Furthermore, they demonstrate greater efficiency, running several times faster than GTs on many datasets. This highlights the potential of simple GNN architectures, challenging the notion that complex mechanisms in GTs are essential for superior graph-level performance. Our source code is available at https://github.com/LUOyk1999/GNNPlus.
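For intuition, here is a generic message-passing layer that wires in the six GNN+ enhancements named above, using a dense adjacency matrix and a random stand-in for the positional encoding; it illustrates the recipe rather than reproducing the authors' implementation.

```python
import torch
import torch.nn as nn

class GNNPlusLayer(nn.Module):
    """One dense message-passing layer with the six GNN+ enhancements (illustrative only)."""
    def __init__(self, dim, dropout=0.1):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)       # mixes neighbour state and edge feature
        self.norm1 = nn.LayerNorm(dim)           # normalization after message passing
        self.norm2 = nn.LayerNorm(dim)           # normalization after the FFN block
        self.ffn = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim))
        self.drop = nn.Dropout(dropout)

    def forward(self, h, adj, edge_feat):
        # h: (n, dim) node states, adj: (n, n) 0/1 mask, edge_feat: (n, n, dim)
        n, dim = h.shape
        pair = torch.cat([h.unsqueeze(0).expand(n, n, dim), edge_feat], dim=-1)
        messages = self.msg(pair) * adj.unsqueeze(-1)        # edge feature integration
        h = self.norm1(h + self.drop(messages.sum(dim=1)))   # residual connection
        return self.norm2(h + self.drop(self.ffn(h)))        # feed-forward network

n, dim = 6, 16
pos_enc = 0.1 * torch.randn(n, dim)               # stand-in for a positional encoding
h = torch.randn(n, dim) + pos_enc                 # positional encoding added to node inputs
adj = (torch.rand(n, n) > 0.5).float()
out = GNNPlusLayer(dim)(h, adj, torch.randn(n, n, dim))
print(out.shape)                                  # torch.Size([6, 16])
```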
[817] Trustworthy AI Must Account for Interactions
Jesse C. Cresswell
Main category: cs.LG
TL;DR: The paper argues that current Trustworthy AI research focusing on individual aspects like fairness, privacy, robustness, explainability, and uncertainty quantification is insufficient due to negative trade-offs between these aspects, and calls for a holistic approach that considers all aspects simultaneously.
Details
Motivation: Current Trustworthy AI research often improves individual aspects in isolation, but these efforts create unintended negative trade-offs between different trustworthy aspects (e.g., privacy measures can harm fairness).Method: The authors review notable approaches to five key trustworthy AI aspects (fairness, privacy, robustness, explainability, uncertainty quantification) and systematically analyze pairwise negative interactions between them.
Result: The analysis reveals that enhancing one trustworthy aspect often negatively impacts others, such as differential privacy amplifying biases and undermining fairness.
Conclusion: Research on Trustworthy AI must adopt a holistic view that accounts for interactions between all relevant aspects simultaneously, rather than improving individual aspects in isolation.
Abstract: Trustworthy AI encompasses many aspirational aspects for aligning AI systems with human values, including fairness, privacy, robustness, explainability, and uncertainty quantification. Ultimately the goal of Trustworthy AI research is to achieve all aspects simultaneously. However, efforts to enhance one aspect often introduce unintended trade-offs that negatively impact others. In this position paper, we review notable approaches to these five aspects and systematically consider every pair, detailing the negative interactions that can arise. For example, applying differential privacy to model training can amplify biases, undermining fairness. Drawing on these findings, we take the position that current research practices of improving one or two aspects in isolation are insufficient. Instead, research on Trustworthy AI must account for interactions between aspects and adopt a holistic view across all relevant axes at once. To illustrate our perspective, we provide guidance on how practitioners can work towards integrated trust, examples of how interactions affect the financial industry, and alternative views.
[818] Efficient Neural SDE Training using Wiener-Space Cubature
Luke Snow, Vikram Krishnamurthy
Main category: cs.LG
TL;DR: A novel training technique for neural SDEs that uses deterministic cubature instead of Monte-Carlo simulation, achieving better convergence rates and computational efficiency.
Details
Motivation: Existing neural SDE training methods rely on Monte-Carlo simulation which has slow convergence rates (O(1/√n)) and requires Brownian motion simulation.Method: Extends Wiener space cubature theory to approximate expected objective functional by weighted sum of deterministic ODE solutions, bypassing Monte-Carlo simulation.
Result: Achieves O(1/n) convergence rate vs O(1/√n) for Monte-Carlo, enables parallel ODE solvers, reduces path evaluations, and eliminates Brownian motion simulation.
Conclusion: The deterministic cubature approach provides more computationally efficient training for neural SDEs with improved convergence rates and reduced complexity.
Abstract: A neural stochastic differential equation (SDE) is an SDE with drift and diffusion terms parametrized by neural networks. The training procedure for neural SDEs consists of optimizing the SDE vector field (neural network) parameters to minimize the expected value of an objective functional on infinite-dimensional path-space. Existing training techniques focus on methods to efficiently compute path-wise gradients of the objective functional with respect to these parameters, then pair this with Monte-Carlo simulation to estimate the gradient expectation. In this work we introduce a novel training technique which bypasses and improves upon this Monte-Carlo simulation; we extend results in the theory of Wiener space cubature to approximate the expected objective functional value by a weighted sum of functional evaluations of deterministic ODE solutions. Our main mathematical contribution enabling this approximation is an extension of cubature bounds to the setting of Lipschitz-nonlinear functionals acting on path-space. Our resulting constructive algorithm allows for more computationally efficient training along several lines. First, it circumvents Brownian motion simulation and enables the use of efficient parallel ODE solvers, thus decreasing the complexity of path-functional evaluation. Furthermore, and more surprisingly, we show that the number of paths required to achieve a given (expected loss functional oracle value) approximation can be reduced in this deterministic cubature regime. Specifically, we show that under reasonable regularity assumptions we can observe an $O(1/n)$ convergence rate, where $n$ is the number of path evaluations, in contrast with the standard $O(1/\sqrt{n})$ rate of naive Monte-Carlo or the $O(\log(n)^d/n)$ rate of quasi-Monte-Carlo.
[819] Electrical Load Forecasting over Multihop Smart Metering Networks with Federated Learning
Ratun Rahman, Pablo Moriano, Samee U. Khan, Dinh C. Nguyen
Main category: cs.LG
TL;DR: A personalized federated learning method using meta-learning to address data heterogeneity in smart meter load forecasting while optimizing latency through resource allocation.
Details
Motivation: Traditional machine learning methods for electric load forecasting require data sharing which raises privacy concerns, and current federated learning approaches struggle with imbalanced data distribution across heterogeneous smart meters.Method: Developed a meta-learning-based personalized federated learning strategy to handle data heterogeneity, combined with latency optimization through optimal resource allocation at smart meters.
Result: Extensive simulations on real-world datasets show superior performance in load forecasting accuracy and reduced operational latency costs compared to existing approaches.
Conclusion: The proposed personalized federated learning method effectively addresses data heterogeneity and privacy concerns in smart grid load forecasting while minimizing latency through optimized resource allocation.
Abstract: Electric load forecasting is essential for power management and stability in smart grids. This is mainly achieved via advanced metering infrastructure, where smart meters (SMs) record household energy data. Traditional machine learning (ML) methods are often employed for load forecasting, but require data sharing, which raises data privacy concerns. Federated learning (FL) can address this issue by running distributed ML models at local SMs without data exchange. However, current FL-based approaches struggle to achieve efficient load forecasting due to imbalanced data distribution across heterogeneous SMs. This paper presents a novel personalized federated learning (PFL) method for high-quality load forecasting in metering networks. A meta-learning-based strategy is developed to address data heterogeneity at local SMs in the collaborative training of local load forecasting models. Moreover, to minimize the load forecasting delays in our PFL model, we study a new latency optimization problem based on optimal resource allocation at SMs. A theoretical convergence analysis is also conducted to provide insights into FL design for federated load forecasting. Extensive simulations from real-world datasets show that our method outperforms existing approaches regarding better load forecasting and reduced operational latency costs.
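As an illustration of meta-learning-based personalization across heterogeneous clients, the sketch below runs Reptile-style rounds over linear load forecasters: each "smart meter" adapts the global weights locally, and the server nudges the global weights toward the average of the adapted ones. Reptile and the linear model are stand-ins chosen for brevity, not the paper's exact PFL strategy or its latency optimization.

```python
import numpy as np

def reptile_round(global_w, client_datasets, inner_lr=0.01, meta_lr=0.5, inner_steps=20):
    """One meta-learning round: local adaptation per client, then a server-side interpolation."""
    adapted = []
    for X, y in client_datasets:                  # (features, load targets) per smart meter
        w = global_w.copy()
        for _ in range(inner_steps):              # local gradient steps on squared forecast error
            grad = X.T @ (X @ w - y) / len(y)
            w -= inner_lr * grad
        adapted.append(w)
    return global_w + meta_lr * (np.mean(adapted, axis=0) - global_w)

rng = np.random.default_rng(0)
clients = []
for _ in range(3):                                # heterogeneous clients: different true weights
    true_w = rng.normal(size=4)
    X = rng.normal(size=(50, 4))
    clients.append((X, X @ true_w + 0.1 * rng.normal(size=50)))

w = np.zeros(4)
for _ in range(10):
    w = reptile_round(w, clients)
print(w)                                          # a global initialization that adapts quickly per meter
```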
[820] Split Gibbs Discrete Diffusion Posterior Sampling
Wenda Chu, Zihui Wu, Yifan Chen, Yang Song, Yisong Yue
Main category: cs.LG
TL;DR: SGDD is a plug-and-play discrete diffusion posterior sampling algorithm using split Gibbs sampling for reward-guided generation and inverse problems in discrete-state spaces.
Details
Motivation: Posterior sampling methods for discrete diffusion models remain challenging compared to continuous diffusion models, creating a need for effective discrete-state sampling algorithms.Method: Split Gibbs sampling-based discrete diffusion posterior sampling algorithm (SGDD) that enables principled plug-and-play posterior sampling for discrete data.
Result: Achieves state-of-the-art performance on DNA sequence design, discrete image inverse problems, and music infilling, with over 30% improvement over existing baselines.
Conclusion: SGDD provides an effective solution for posterior sampling in discrete-state spaces with proven convergence and superior performance across multiple benchmarks.
Abstract: We study the problem of posterior sampling in discrete-state spaces using discrete diffusion models. While posterior sampling methods for continuous diffusion models have achieved remarkable progress, analogous methods for discrete diffusion models remain challenging. In this work, we introduce a principled plug-and-play discrete diffusion posterior sampling algorithm based on split Gibbs sampling, which we call SGDD. Our algorithm enables reward-guided generation and solving inverse problems in discrete-state spaces. We demonstrate the convergence of SGDD to the target posterior distribution and verify this through controlled experiments on synthetic benchmarks. Our method enjoys state-of-the-art posterior sampling performance on a range of benchmarks for discrete data, including DNA sequence design, discrete image inverse problems, and music infilling, achieving more than 30% improved performance compared to existing baselines. Our code is available at https://github.com/chuwd19/Split-Gibbs-Discrete-Diffusion-Posterior-Sampling.
[821] Multivariate Gaussian Topic Modelling: A novel approach to discover topics with greater semantic coherence
Satyajeet Sahoo, Jhareswar Maiti
Main category: cs.LG
TL;DR: Proposes a novel Multivariate Gaussian Topic Model (MGTM) that represents topics as Multivariate Gaussian Distributions and documents as Gaussian Mixture Models, achieving higher topic coherence and interpretability compared to benchmark models like LDA and LSA.
Details
Motivation: Traditional topic models like LDA and LSA suffer from difficulty in topic interpretability and reduced performance with shorter texts, despite being useful for identifying latent topics from large document corpora with minimal supervision.Method: Uses Multivariate Gaussian Distributions to represent topics and Gaussian Mixture Models for documents. Applies EM algorithm on document corpus to identify constituent Multivariate Gaussian distributions corresponding to latent topics and their parameters. Topic keywords are identified from distribution parameters for topic annotations.
Result: Achieved highest mean topic coherence (0.7) and median topic coherence (0.76) compared to 4 benchmark models when tested on 20 newsgroups dataset, demonstrating high effectiveness in identifying interpretable, semantically coherent topics.
Conclusion: The proposed MGTM approach effectively captures semantic themes with high interpretability, outperforming traditional topic models in terms of topic coherence and interpretability.
Abstract: An important aspect of text mining involves information retrieval in the form of discovery of semantic themes (topics) from documents using topic modelling. While generative topic models like Latent Dirichlet Allocation (LDA) or Latent Semantic Analysis (LSA) elegantly model topics as probability distributions and are useful in identifying latent topics from large document corpora with minimal supervision, they suffer from difficulty in topic interpretability and reduced performance on shorter texts. Here we propose a novel Multivariate Gaussian Topic Model (MGTM). In this approach topics are represented as Multivariate Gaussian Distributions and documents as Gaussian Mixture Models. Applying the EM algorithm to a document corpus, the constituent Multivariate Gaussian distributions corresponding to the latent topics and their respective parameters are identified. Analysis of the parameters of each distribution helps identify the respective topic keywords, and from these keywords topic annotations are carried out. This approach is applied on the 20 newsgroups dataset to demonstrate the interpretability benefits vis-a-vis 4 other benchmark models. The effectiveness of this model in capturing the semantic theme of the topics with high interpretability is examined by calculating the topic coherence and comparing the coherence values with the benchmark models. This model achieves the highest mean topic coherence (0.7) and median topic coherence (0.76) vis-a-vis the benchmark models, demonstrating high effectiveness in identifying interpretable, semantically coherent topics.
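A rough, self-contained sketch of the topics-as-Gaussians idea using scikit-learn: documents are embedded in a low-dimensional space (TF-IDF plus SVD, as a stand-in for richer representations), a Gaussian mixture is fit by EM, and keywords are read off by mapping each component mean back to the vocabulary. This illustrates the workflow only and is not the paper's MGTM formulation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.mixture import GaussianMixture

docs = [
    "the goalkeeper saved a late penalty in the football match",
    "the striker scored twice and the team won the league game",
    "fans cheered as the club lifted the cup after the final whistle",
    "the spacecraft entered orbit around the red planet",
    "nasa launched a probe toward the outer planets of the solar system",
    "astronomers imaged the galaxy with the new space telescope",
]

# Dense document vectors: TF-IDF followed by SVD (a simple stand-in embedding).
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)

# EM fit of a Gaussian mixture: each component plays the role of one latent topic.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(Z)

# Topic keywords: map each component mean back to vocabulary space and take the top terms.
vocab = np.array(tfidf.get_feature_names_out())
for k, mean in enumerate(gmm.means_):
    term_scores = svd.inverse_transform(mean[None, :])[0]
    print(f"topic {k}:", ", ".join(vocab[np.argsort(term_scores)[::-1][:4]]))
```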
[822] Evaluating Simplification Algorithms for Interpretability of Time Series Classification
Brigt Håvardstun, Felix Marti-Perez, Cèsar Ferri, Jan Arne Telle
Main category: cs.LG
TL;DR: This paper introduces metrics to evaluate simplified time series for interpretability of time series classifiers, focusing on complexity and loyalty of simplifications, and validates them through experiments and human evaluation.
Details
Motivation: Time series data are not intuitively understandable to humans like text and images, so simplifications are needed to aid interpretability of time series classifiers.Method: Developed metrics for complexity (number of segments) and loyalty (maintaining original classification) of simplifications. Evaluated four simplification algorithms across multiple TSC algorithms and datasets with varying characteristics, followed by human-grounded evaluation with forward simulation.
Result: Simplifications that select subsets of original data points typically have high Shapley value, aiding interpretability. The metrics were confirmed to have practical utility through human evaluation.
Conclusion: Provides a framework for deciding whether various simplifications are likely to aid interpretability for a given time series classifier.
Abstract: In this work, we introduce metrics to evaluate the use of simplified time series in the context of interpretability of a TSC – a Time Series Classifier. Such simplifications are important because time series data, in contrast to text and image data, are not intuitively understandable to humans. These metrics are related to the complexity of the simplifications – how many segments they contain – and to their loyalty – how likely they are to maintain the classification of the original time series. We focus on simplifications that select a subset of the original data points, and show that these typically have high Shapley value, thereby aiding interpretability. We employ these metrics to experimentally evaluate four distinct simplification algorithms, across several TSC algorithms and across datasets of varying characteristics, from seasonal or stationary to short or long. We subsequently perform a human-grounded evaluation with forward simulation, which also confirms the practical utility of the introduced metrics for evaluating the use of simplifications in the context of TSC interpretability. Our findings are summarized in a framework for deciding, for a given TSC, if the various simplifications are likely to aid in its interpretability.
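A minimal sketch of the two metric families: a piecewise-constant simplification whose complexity is its number of segments, and a loyalty score that checks whether the simplified series keeps the classifier's original prediction. The segmentation scheme and names are ours; the paper evaluates four dedicated simplification algorithms.

```python
import numpy as np

def piecewise_constant(ts, n_segments):
    """Simplify a 1-D series into equal-length segments replaced by their means."""
    bounds = np.linspace(0, len(ts), n_segments + 1, dtype=int)
    simplified = ts.astype(float).copy()
    for a, b in zip(bounds[:-1], bounds[1:]):
        simplified[a:b] = ts[a:b].mean()
    return simplified

def complexity(n_segments):
    """Fewer segments means a simpler, easier-to-read explanation."""
    return n_segments

def loyalty(classifier, ts, simplified):
    """1.0 if the simplified series keeps the original predicted class, else 0.0.
    `classifier` is any fitted model with a scikit-learn-style predict()."""
    orig = classifier.predict(ts[None, :])[0]
    simp = classifier.predict(simplified[None, :])[0]
    return float(orig == simp)
```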
[823] PolyG: Adaptive Graph Traversal for Diverse GraphRAG Questions
Renjie Liu, Haitian Jiang, Xiao Yan, Bo Tang, Jinyang Li
Main category: cs.LG
TL;DR: The paper proposes PolyG, an adaptive GraphRAG approach that improves answer quality and efficiency by using a question taxonomy to dynamically generate appropriate graph queries for different question patterns.
Details
Motivation: Current GraphRAG methods are evaluated on biased benchmarks and use fixed retrieval strategies that don't handle diverse real-world questions effectively, leading to poor performance in both quality and efficiency.Method: Proposes a four-class question taxonomy and uses it to create PolyBench benchmark. PolyG decomposes questions according to the taxonomy and dynamically prompts LLMs to generate appropriate graph database queries for each sub-question.
Result: PolyG achieves higher win rate in generation quality compared to SOTA methods, with lower response latency and token cost.
Conclusion: The proposed adaptive approach using question taxonomy and dynamic query generation significantly improves GraphRAG performance across diverse question patterns.
Abstract: GraphRAG enhances large language models (LLMs) to generate quality answers for user questions by retrieving related facts from external knowledge graphs. However, current GraphRAG methods are primarily evaluated on and overly tailored for knowledge graph question answering (KGQA) benchmarks, which are biased towards a few specific question patterns and do not reflect the diversity of real-world questions. To better evaluate GraphRAG methods, we propose a complete four-class taxonomy to categorize the basic patterns of knowledge graph questions and use it to create PolyBench, a new GraphRAG benchmark encompassing a comprehensive set of graph questions. With the new benchmark, we find that existing GraphRAG methods fall short in effectiveness (i.e., quality of the generated answers) and/or efficiency (i.e., response time or token usage) because they adopt either a fixed graph traversal strategy or free-form exploration by LLMs for fact retrieval. However, different question patterns require distinct graph traversal strategies and context formation. To facilitate better retrieval, we propose PolyG, an adaptive GraphRAG approach by decomposing and categorizing the questions according to our proposed question taxonomy. Built on top of a unified interface and execution engine, PolyG dynamically prompts an LLM to generate a graph database query to retrieve the context for each decomposed basic question. Compared with SOTA GraphRAG methods, PolyG achieves a higher win rate in generation quality and has a low response latency and token cost. Our code and benchmark are open-source at https://github.com/Liu-rj/PolyG.
[824] Learning Repetition-Invariant Representations for Polymer Informatics
Yihan Zhu, Gang Liu, Eric Inae, Tengfei Luo, Meng Jiang
Main category: cs.LG
TL;DR: GRIN is a novel graph neural network method that learns polymer representations invariant to the number of repeating units, addressing limitations of existing methods that only model single units.
Details
Motivation: Existing graph neural networks fail to produce consistent representations for polymers with varying numbers of repeating units, limiting their effectiveness for true polymer structures.Method: GRIN integrates graph-based maximum spanning tree alignment with repeat-unit augmentation to ensure structural consistency, with theoretical guarantees showing three repeating units are minimal for optimal invariant representation learning.
Result: GRIN outperforms state-of-the-art baselines on both homopolymer and copolymer benchmarks, learning stable, repetition-invariant representations.
Conclusion: The method successfully generalizes to polymer chains of unseen sizes, providing consistent vector representations for polymers regardless of their repeating unit count.
Abstract: Polymers are large macromolecules composed of repeating structural units known as monomers and are widely applied in fields such as energy storage, construction, medicine, and aerospace. However, existing graph neural network methods, though effective for small molecules, only model the single unit of polymers and fail to produce consistent vector representations for the true polymer structure with varying numbers of units. To address this challenge, we introduce Graph Repetition Invariance (GRIN), a novel method to learn polymer representations that are invariant to the number of repeating units in their graph representations. GRIN integrates a graph-based maximum spanning tree alignment with repeat-unit augmentation to ensure structural consistency. We provide theoretical guarantees for repetition-invariance from both model and data perspectives, demonstrating that three repeating units are the minimal augmentation required for optimal invariant representation learning. GRIN outperforms state-of-the-art baselines on both homopolymer and copolymer benchmarks, learning stable, repetition-invariant representations that generalize effectively to polymer chains of unseen sizes.
[825] Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models
Wei Chen, Xin Yan, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Long Chen
Main category: cs.LG
TL;DR: DCD is a novel framework that decouples positive and negative sample learning in MLLMs to mitigate hallucinations while preserving general reasoning capabilities, avoiding both likelihood displacement from training-based methods and poor pattern capture from training-free methods.
Details
Motivation: MLLMs suffer from hallucination issues where outputs misalign with visual/factual evidence. Existing solutions have trade-offs: training-based methods like DPO risk sacrificing general reasoning due to likelihood displacement, while training-free methods use handcrafted perturbations that poorly capture authentic hallucination patterns.Method: DCD decouples learning of positive and negative samples in preference datasets, training separate positive and negative image projections within MLLM. The negative projection implicitly models real hallucination patterns, enabling vision-aware negative images in contrastive decoding inference.
Result: Extensive experiments show DCD matches DPO’s hallucination suppression while preserving general capabilities, and outperforms handcrafted contrastive decoding methods across hallucination benchmarks and general reasoning tasks.
Conclusion: DCD provides robust hallucination mitigation by avoiding pairwise optimization (preventing likelihood displacement) and eliminating handcrafted degradation, achieving effective hallucination suppression without sacrificing general reasoning performance.
Abstract: Although multimodal large language models (MLLMs) exhibit remarkable reasoning capabilities on complex multimodal understanding tasks, they still suffer from the notorious hallucination issue: generating outputs misaligned with obvious visual or factual evidence. Currently, training-based solutions, like direct preference optimization (DPO), leverage paired preference data to suppress hallucinations. However, they risk sacrificing general reasoning capabilities due to the likelihood displacement. Meanwhile, training-free solutions, like contrastive decoding, achieve this goal by subtracting the estimated hallucination pattern from a distorted input. Yet, these handcrafted perturbations (e.g., add noise to images) may poorly capture authentic hallucination patterns. To avoid these weaknesses of existing methods, and realize robust hallucination mitigation (i.e., maintaining general reasoning performance), we propose a novel framework: Decoupling Contrastive Decoding (DCD). Specifically, DCD decouples the learning of positive and negative samples in preference datasets, and trains separate positive and negative image projections within the MLLM. The negative projection implicitly models real hallucination patterns, which enables vision-aware negative images in the contrastive decoding inference stage. Our DCD alleviates likelihood displacement by avoiding pairwise optimization and generalizes robustly without handcrafted degradation. Extensive ablations across hallucination benchmarks and general reasoning tasks demonstrate the effectiveness of DCD, i.e., it matches DPO’s hallucination suppression while preserving general capabilities and outperforms the handcrafted contrastive decoding methods.
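For context, the sketch below shows the standard contrastive-decoding step that DCD builds on: next-token logits from the original input are contrasted with logits obtained from a "negative" input, with implausible tokens masked out. The weighting uses the common (1 + alpha) * pos - alpha * neg form; DCD's learned negative image projection is not reproduced here.

```python
import torch

def contrastive_decode_step(logits_pos, logits_neg, alpha=1.0, beta=0.1):
    """Combine positive/negative next-token logits, restricted to tokens that are
    plausible under the positive distribution (adaptive plausibility constraint)."""
    probs_pos = logits_pos.softmax(dim=-1)
    plausible = probs_pos >= beta * probs_pos.max(dim=-1, keepdim=True).values
    scores = (1 + alpha) * logits_pos - alpha * logits_neg
    scores = scores.masked_fill(~plausible, float('-inf'))
    return scores.argmax(dim=-1)

# Toy example with a vocabulary of 5 tokens.
pos = torch.tensor([[2.0, 1.5, 0.2, -1.0, 0.0]])
neg = torch.tensor([[2.2, 0.1, 0.2, -1.0, 0.0]])    # hallucination-prone branch favors token 0
print(contrastive_decode_step(pos, neg))             # picks token 1 after the contrast
```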
[826] Multi-head Temporal Latent Attention
Keqi Deng, Philip C. Woodland
Main category: cs.LG
TL;DR: MTLA reduces KV cache size in Transformer self-attention by compressing along temporal dimension, achieving 5.3x speedup and 8.3x memory reduction while maintaining performance.
Details
Motivation: Transformer self-attention's KV cache grows linearly with sequence length, becoming a bottleneck for inference efficiency. Existing latent attention methods compress KV cache but still face memory issues.Method: MTLA uses hyper-network to dynamically merge temporally adjacent KV cache vectors and employs stride-aware causal mask for training-inference consistency.
Result: Achieves 5.3x speedup and 8.3x GPU memory reduction on English-German speech translation while maintaining translation quality. Works well across speech translation, recognition, understanding and text summarization.
Conclusion: MTLA effectively reduces KV cache memory footprint while maintaining competitive performance, making it suitable for efficient Transformer inference across various tasks.
Abstract: While Transformer self-attention offers strong parallelism, the Key-Value (KV) cache grows linearly with sequence length and becomes a bottleneck for inference efficiency. Multi-head latent attention was recently developed to compress the KV cache into a low-rank latent space. This paper proposes Multi-head Temporal Latent Attention (MTLA), which further reduces the KV cache size along the temporal dimension, greatly lowering the memory footprint of self-attention inference. MTLA employs a hyper-network to dynamically merge temporally adjacent KV cache vectors. To address the mismatch between the compressed KV cache and processed sequence lengths, a stride-aware causal mask is proposed to ensure efficient parallel training and consistency with inference behaviour. Experiments across tasks, including speech translation, speech recognition, speech understanding and text summarisation, demonstrate that MTLA achieves competitive performance compared to standard Multi-Head Attention (MHA), while greatly improving inference speed and GPU memory usage. For example, on a English-German speech translation task, MTLA achieves a 5.3x speedup and a reduction in GPU memory usage by a factor of 8.3 compared to MHA, while maintaining translation quality.
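To see why merging along the temporal dimension shrinks the cache, the toy below averages each block of `stride` adjacent key/value vectors; MTLA instead produces the merge weights with a hyper-network and pairs this with a stride-aware causal mask, neither of which is modeled here.

```python
import torch
import torch.nn.functional as F

def merge_kv_temporal(k_cache, v_cache, stride=2):
    """k_cache, v_cache: (batch, heads, seq, dim). Average each block of `stride` steps.
    Uniform averaging stands in for MTLA's hyper-network-generated merge weights."""
    b, h, t, d = k_cache.shape
    pad = (-t) % stride
    if pad:   # pad the time axis so it divides the stride
        k_cache = F.pad(k_cache, (0, 0, 0, pad))
        v_cache = F.pad(v_cache, (0, 0, 0, pad))
    k_merged = k_cache.view(b, h, -1, stride, d).mean(dim=3)
    v_merged = v_cache.view(b, h, -1, stride, d).mean(dim=3)
    return k_merged, v_merged

k = torch.randn(1, 8, 10, 64)
v = torch.randn(1, 8, 10, 64)
km, vm = merge_kv_temporal(k, v)
print(k.shape, "->", km.shape)    # the cache halves along the time axis
```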
[827] Reinforcement Learning from Human Feedback
Nathan Lambert
Main category: cs.LG
TL;DR: This book provides a gentle introduction to RLHF methods, covering its origins, core optimization stages from instruction tuning to alignment algorithms, and advanced research topics.
Details
Motivation: To give people with quantitative background an accessible introduction to RLHF, which has become important for deploying modern ML systems.Method: Starts with RLHF origins and definitions, then details optimization stages including instruction tuning, reward model training, and alignment algorithms like rejection sampling and reinforcement learning.
Result: A comprehensive educational resource that covers the complete RLHF pipeline from fundamentals to advanced research questions.
Conclusion: The book concludes with understudied research areas in synthetic data and evaluation, plus open questions for the RLHF field.
Abstract: Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF – both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common math used in the literature. The core of the book details every optimization stage in using RLHF, from starting with instruction tuning to training a reward model and finally all of rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics – understudied research questions in synthetic data and evaluation – and open questions for the field.
[828] Follow the Energy, Find the Path: Riemannian Metrics from Energy-Based Models
Louis Béthune, David Vigouroux, Yilun Du, Rufin VanRullen, Thomas Serre, Victor Boutin
Main category: cs.LG
TL;DR: The paper proposes deriving Riemannian metrics directly from pretrained Energy-Based Models (EBMs) to compute geodesics that follow the intrinsic geometry of data manifolds in high-dimensional spaces.
Details
Motivation: Estimating Riemannian metrics for curved manifolds in high-dimensional spaces is challenging, and existing methods struggle to capture the intrinsic geometry of data manifolds effectively.Method: Introduces two novel Riemannian metrics derived from EBMs, which define spatially varying distances to compute geodesics that remain close to the data manifold with lower curvature distortion.
Result: EBM-derived metrics consistently outperform established baselines, producing geodesics with better alignment to ground-truth trajectories, especially in high-dimensional settings across synthetic, character image, and natural image datasets.
Conclusion: This is the first work to derive Riemannian metrics from EBMs, enabling data-aware geodesics and scalable geometry-driven learning for generative modeling and simulation.
Abstract: What is the shortest path between two data points lying in a high-dimensional space? While the answer is trivial in Euclidean geometry, it becomes significantly more complex when the data lies on a curved manifold – requiring a Riemannian metric to describe the space’s local curvature. Estimating such a metric, however, remains a major challenge in high dimensions. In this work, we propose a method for deriving Riemannian metrics directly from pretrained Energy-Based Models (EBMs) – a class of generative models that assign low energy to high-density regions. These metrics define spatially varying distances, enabling the computation of geodesics – shortest paths that follow the data manifold’s intrinsic geometry. We introduce two novel metrics derived from EBMs and show that they produce geodesics that remain closer to the data manifold and exhibit lower curvature distortion, as measured by alignment with ground-truth trajectories. We evaluate our approach on increasingly complex datasets: synthetic datasets with known data density, rotated character images with interpretable geometry, and high-resolution natural images embedded in a pretrained VAE latent space. Our results show that EBM-derived metrics consistently outperform established baselines, especially in high-dimensional settings. Our work is the first to derive Riemannian metrics from EBMs, enabling data-aware geodesics and unlocking scalable, geometry-driven learning for generative modeling and simulation.
[829] A Basic Evaluation of Neural Networks Trained with the Error Diffusion Learning Algorithm
Kazuhisa Fujita
Main category: cs.LG
TL;DR: EDLA is a biologically inspired alternative to backpropagation that uses global error diffusion across paired positive/negative sublayers, achieving competitive performance on parity check, regression, and image classification tasks.
Details
Motivation: To provide a biologically plausible alternative to conventional backpropagation that eliminates layer-wise error backpropagation and has been underrecognized due to language barriers.
Method: Uses single global error signal that diffuses across networks with paired positive/negative sublayers, systematically varying neuron count, network depth, and learning rates across benchmark tasks.
Result: EDLA achieves consistently high accuracy across multiple benchmarks, with performance competitive with backpropagation especially in shallow architectures. Learning rate, neuron count, and network depth significantly affect efficiency and convergence.
Conclusion: EDLA is an effective biologically inspired learning algorithm that increases accessibility to alternative training methodologies and performs competitively with backpropagation.
Abstract: This paper presents a comprehensive formulation of Kaneko’s Error Diffusion Learning Algorithm (EDLA) and evaluates its effectiveness across parity check, regression, and image classification tasks. EDLA is a biologically inspired learning algorithm that provides an alternative to conventional backpropagation for training artificial neural networks. EDLA employs a single global error signal that diffuses across networks composed of paired positive and negative sublayers, eliminating traditional layer-wise error backpropagation. This study evaluates EDLA’s effectiveness using benchmark tasks, such as parity check, regression, and image classification, by systematically varying the neuron count, network depth, and learning rates to assess its performance comprehensively. The experimental results demonstrate that EDLA achieves consistently high accuracy across multiple benchmarks, highlighting its effectiveness as a learning algorithm for neural networks. The choice of learning rate, neuron count, and network depth significantly influences EDLA’s efficiency and convergence speed. Analysis of internal network representations reveals meaningful feature extraction capabilities, and the network’s overall performance is found to be competitive with networks trained via conventional backpropagation, especially in shallow architectures. This study introduces EDLA, a biologically plausible alternative to traditional backpropagation previously underrecognized due to language barriers. By reformulating EDLA, systematically evaluating its performance, and presenting empirical evidence of its effectiveness, this study increases the visibility and accessibility of EDLA and contributes to biologically inspired training methodologies.
[830] Hyper-Transforming Latent Diffusion Models
Ignacio Peis, Batuhan Koyuncu, Isabel Valera, Jes Frellsen
Main category: cs.LG
TL;DR: A novel generative framework for functions using Implicit Neural Representations (INRs) and Transformer-based hypernetworks in latent variable models, replacing MLP-based approaches for better scalability and efficiency.
Details
Motivation: To overcome scalability limitations of MLP-based hypernetworks in prior approaches and create a more flexible framework for learning structured function representations.
Method: Integrates INRs with Transformer-based hypernetworks in latent variable models, extends latent diffusion models by replacing standard decoders with Transformer hypernetworks, and enables training from scratch or via hyper-transforming (fine-tuning only decoder while freezing pre-trained latent space).
Result: Demonstrated improved scalability, expressiveness, and generalization across multiple modalities compared to existing INR-based generative models.
Conclusion: Establishes a unified and flexible framework for learning structured function representations that enables efficient adaptation of existing generative models to INR-based representations without full retraining.
Abstract: We introduce a novel generative framework for functions by integrating Implicit Neural Representations (INRs) and Transformer-based hypernetworks into latent variable models. Unlike prior approaches that rely on MLP-based hypernetworks with scalability limitations, our method employs a Transformer-based decoder to generate INR parameters from latent variables, addressing both representation capacity and computational efficiency. Our framework extends latent diffusion models (LDMs) to INR generation by replacing standard decoders with a Transformer-based hypernetwork, which can be trained either from scratch or via hyper-transforming: a strategy that fine-tunes only the decoder while freezing the pre-trained latent space. This enables efficient adaptation of existing generative models to INR-based representations without requiring full retraining. We validate our approach across multiple modalities, demonstrating improved scalability, expressiveness, and generalization over existing INR-based generative models. Our findings establish a unified and flexible framework for learning structured function representations.
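For concreteness, a minimal sketch of the decoder idea: a Transformer encoder processes a set of latent tokens, and small linear heads emit the weights of a two-layer coordinate MLP (the INR), which is then evaluated functionally on query coordinates. The layer sizes, pooling, and sine activation are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of a Transformer-based hypernetwork decoder for INRs.
import torch
import torch.nn as nn

class HyperDecoder(nn.Module):
    def __init__(self, d_latent=64, hidden=32, coord_dim=2, out_dim=3):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=d_latent, nhead=4,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # One head per INR weight tensor (a two-layer coordinate MLP here).
        self.w1 = nn.Linear(d_latent, hidden * coord_dim)
        self.b1 = nn.Linear(d_latent, hidden)
        self.w2 = nn.Linear(d_latent, out_dim * hidden)
        self.b2 = nn.Linear(d_latent, out_dim)
        self.hidden, self.coord_dim, self.out_dim = hidden, coord_dim, out_dim

    def forward(self, z_tokens, coords):
        h = self.encoder(z_tokens).mean(dim=1)              # pool latent tokens
        W1 = self.w1(h).view(-1, self.hidden, self.coord_dim)
        b1 = self.b1(h)
        W2 = self.w2(h).view(-1, self.out_dim, self.hidden)
        b2 = self.b2(h)
        # Functional forward pass of the generated INR on query coordinates.
        x = torch.sin(torch.einsum('bhc,bnc->bnh', W1, coords) + b1[:, None])
        return torch.einsum('boh,bnh->bno', W2, x) + b2[:, None]

dec = HyperDecoder()
z = torch.randn(4, 16, 64)   # batch of 16 latent tokens per sample
xy = torch.rand(4, 100, 2)   # query pixel coordinates
print(dec(z, xy).shape)      # torch.Size([4, 100, 3])
```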
[831] Stochastic Subspace Descent Accelerated via Bi-fidelity Line Search
Nuojin Cheng, Alireza Doostan, Stephen Becker
Main category: cs.LG
TL;DR: BF-SSD is a novel zeroth-order optimization algorithm that uses bi-fidelity surrogate models combining low-fidelity and high-fidelity function evaluations to reduce computational costs while maintaining optimization performance.
Details
Motivation: To address the computational burden of expensive function evaluations in zeroth-order optimization methods, especially when gradients are inaccessible.
Method: Uses a bi-fidelity framework to construct surrogate models from low-fidelity and high-fidelity evaluations, enabling efficient backtracking line search with theoretical convergence guarantees.
Result: BF-SSD achieves superior optimization performance with significantly fewer high-fidelity evaluations across synthetic benchmarks, kernel ridge regression, adversarial attacks, and language model fine-tuning.
Conclusion: Integrating bi-fidelity strategies in zeroth-order optimization provides a computationally efficient approach for large-scale, high-dimensional problems in real-world applications.
Abstract: Efficient optimization remains a fundamental challenge across numerous scientific and engineering domains, especially when objective function and gradient evaluations are computationally expensive. While zeroth-order optimization methods offer effective approaches when gradients are inaccessible, their practical performance can be limited by the high cost associated with function queries. This work introduces the bi-fidelity stochastic subspace descent (BF-SSD) algorithm, a novel zeroth-order optimization method designed to reduce this computational burden. BF-SSD leverages a bi-fidelity framework, constructing a surrogate model from a combination of computationally inexpensive low-fidelity (LF) and accurate high-fidelity (HF) function evaluations. This surrogate model facilitates an efficient backtracking line search for step size selection, for which we provide theoretical convergence guarantees under standard assumptions. We perform a comprehensive empirical evaluation of BF-SSD across four distinct problems: a synthetic optimization benchmark, dual-form kernel ridge regression, black-box adversarial attacks on machine learning models, and transformer-based black-box language model fine-tuning. Numerical results demonstrate that BF-SSD consistently achieves superior optimization performance while requiring significantly fewer HF function evaluations compared to relevant baseline methods. This study highlights the efficacy of integrating bi-fidelity strategies within zeroth-order optimization, positioning BF-SSD as a promising and computationally efficient approach for tackling large-scale, high-dimensional problems encountered in various real-world applications.
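A hedged sketch of the bi-fidelity idea: the descent direction is estimated from a few high-fidelity (HF) finite differences restricted to a random subspace, while the backtracking line search queries only the cheap low-fidelity (LF) surrogate. The subspace size, Armijo constants, and toy LF/HF pair are illustrative assumptions.

```python
# Illustrative bi-fidelity stochastic subspace descent (not the paper's exact algorithm).
import numpy as np

def bf_ssd(f_hf, f_lf, x0, n_iter=100, k=4, h=1e-5, lr0=1.0):
    x = x0.copy()
    d = x.size
    for _ in range(n_iter):
        # Random orthonormal subspace of dimension k.
        P, _ = np.linalg.qr(np.random.randn(d, k))
        fx = f_hf(x)
        # HF finite-difference gradient restricted to the subspace.
        g_sub = np.array([(f_hf(x + h * P[:, j]) - fx) / h for j in range(k)])
        direction = -P @ g_sub
        # Backtracking (Armijo) line search on the cheap LF surrogate.
        t, f_lf_x = lr0, f_lf(x)
        while f_lf(x + t * direction) > f_lf_x - 1e-4 * t * g_sub @ g_sub and t > 1e-8:
            t *= 0.5
        x = x + t * direction
    return x

# Toy bi-fidelity pair: the LF model is a shifted, biased version of the HF quadratic.
f_hf = lambda x: np.sum((x - 1.0) ** 2)
f_lf = lambda x: np.sum((x - 1.05) ** 2) + 0.1
x = bf_ssd(f_hf, f_lf, np.zeros(50))
print(f_hf(x))  # much smaller than f_hf(0) = 50
```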
[832] PromptWise: Online Learning for Cost-Aware Prompt Assignment in Generative Models
Xiaoyan Hu, Lauren Pick, Ho-fung Leung, Farzan Farnia
Main category: cs.LG
TL;DR: PromptWise is an online learning framework that selects generative AI models for prompts based on both performance and cost, using a cost-aware bandit approach to minimize service costs while maintaining quality.
Details
Motivation: Existing model-selection methods focus only on performance and ignore cost differences between models, which is important given the wide range of available generative AI models with varying service costs.
Method: Uses a cost-aware bandit structure that estimates prompt-model compatibility and allows sequential model assignments per prompt to find the least expensive model that delivers satisfactory outputs.
Result: Numerical experiments on code generation and translation tasks show PromptWise achieves comparable performance to baseline methods while substantially reducing costs.
Conclusion: PromptWise effectively balances performance and cost in model selection, demonstrating significant cost savings while maintaining output quality across various generative AI tasks.
Abstract: The rapid advancement of generative AI has provided users with a wide range of well-trained models to address diverse prompts. When selecting a model for a given prompt, users should weigh not only its performance but also its service cost. However, existing model-selection methods typically emphasize performance while overlooking cost differences. In this paper, we introduce PromptWise, an online learning framework that assigns prompts to generative models in a cost-aware manner. PromptWise estimates prompt-model compatibility to select the least expensive model expected to deliver satisfactory outputs. Unlike standard contextual bandits that make a one-shot decision per prompt, PromptWise employs a cost-aware bandit structure that allows sequential model assignments per prompt to reduce total service cost. Through numerical experiments on tasks such as code generation and translation, we demonstrate that PromptWise can achieve performance comparable to baseline selection methods while incurring substantially lower costs. The code is available at: github.com/yannxiaoyanhu/PromptWise.
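To make the sequential, cost-aware assignment concrete, here is a hedged sketch in the spirit of PromptWise: a router repeatedly picks the untried model with the lowest sampled expected cost per success, escalating only if the output is unsatisfactory. The Beta-posterior success estimates and the scoring rule are illustrative assumptions, not the paper's bandit algorithm.

```python
# Hypothetical cost-aware routing sketch (illustrative, not PromptWise itself).
import random

class CostAwareRouter:
    def __init__(self, models):
        # models: dict name -> cost; Beta(1, 1) prior on each model's success rate.
        self.costs = dict(models)
        self.stats = {name: [1, 1] for name in self.costs}  # [successes+1, failures+1]

    def route(self, prompt, run_model, is_satisfactory, max_calls=3):
        tried, spent = set(), 0.0
        for _ in range(max_calls):
            # Thompson-sample each untried model's success rate and pick the
            # model with the lowest expected cost per success.
            candidates = [(self.costs[m] / max(random.betavariate(*self.stats[m]), 1e-3), m)
                          for m in self.costs if m not in tried]
            if not candidates:
                break
            _, name = min(candidates)
            output = run_model(name, prompt)
            spent += self.costs[name]
            ok = is_satisfactory(output)
            self.stats[name][0 if ok else 1] += 1
            if ok:
                return output, spent
            tried.add(name)
        return None, spent

router = CostAwareRouter({"small": 1.0, "large": 5.0})
out, cost = router.route("translate: bonjour",
                         run_model=lambda m, p: f"[{m}] hello",
                         is_satisfactory=lambda o: "hello" in o)
print(out, cost)
```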
[833] Chronic Diseases Prediction using Machine Learning and Deep Learning Methods
Houda Belhad, Asmae Bourbia, Salma Boughanja
Main category: cs.LG
TL;DR: Machine learning and deep learning models were applied to predict chronic diseases and thyroid disorders, with ensemble methods like Random Forest and Gradient Boosted Trees showing superior performance.
Details
Motivation: Chronic diseases are leading causes of premature mortality worldwide, and traditional diagnostic methods often fail due to the complex nature of these conditions, highlighting the need for improved early detection and intervention methods.
Method: Used various ML/DL models (Logistic Regression, Random Forest, Gradient Boosted Trees, Neural Networks, Decision Trees, Naive Bayes) with comprehensive data pre-processing including handling missing values, categorical encoding, and feature aggregation, followed by model training and evaluation using performance metrics.
Result: Ensemble methods (Random Forest and Gradient Boosted Trees) consistently outperformed other models, and Neural Networks showed superior performance in capturing complex data patterns.
Conclusion: ML and DL have significant potential to revolutionize chronic disease prediction for early diagnosis and personalized treatment, though challenges remain in data quality, model interpretability, and computational techniques in healthcare.
Abstract: Chronic diseases, such as cardiovascular disease, diabetes, chronic kidney disease, and thyroid disorders, are the leading causes of premature mortality worldwide. Early detection and intervention are crucial for improving patient outcomes, yet traditional diagnostic methods often fail due to the complex nature of these conditions. This study explores the application of machine learning (ML) and deep learning (DL) techniques to predict chronic diseases and thyroid disorders. We used a variety of models, including Logistic Regression (LR), Random Forest (RF), Gradient Boosted Trees (GBT), Neural Networks (NN), Decision Trees (DT), and Naive Bayes (NB), to analyze and predict disease outcomes. Our methodology involved comprehensive data pre-processing, including handling missing values, categorical encoding, and feature aggregation, followed by model training and evaluation. Performance metrics such as precision, recall, accuracy, F1-score, and Area Under the Curve (AUC) were used to assess the effectiveness of each model. The results demonstrated that ensemble methods like Random Forest and Gradient Boosted Trees consistently outperformed the other models. Neural Networks also showed superior performance, particularly in capturing complex data patterns. The findings highlight the potential of ML and DL to revolutionize chronic disease prediction, enabling early diagnosis and personalized treatment strategies. However, challenges remain in data quality, model interpretability, and the need for advanced computational techniques in healthcare to improve patient outcomes and reduce the burden of chronic diseases. This study was conducted as part of a Big Data class project under the supervision of our professors Mr. Abderrahmane EZ-ZAHOUT and Mr. Abdessamad ESSAIDI.
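For readers who want a concrete starting point, a minimal scikit-learn pipeline in the spirit of the described preprocessing and model comparison; the CSV path, column names, and target are placeholders, not the study's data.

```python
# Illustrative preprocessing + model-comparison pipeline (placeholder dataset).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("chronic_disease.csv")            # placeholder file
X, y = df.drop(columns=["disease"]), df["disease"] # placeholder target column
num_cols = X.select_dtypes("number").columns
cat_cols = X.columns.difference(num_cols)

prep = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=300),
    "GradientBoosting": GradientBoostingClassifier(),
}
for name, clf in models.items():
    pipe = Pipeline([("prep", prep), ("clf", clf)])
    auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC={auc:.3f}")
```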
[834] A probabilistic view on Riemannian machine learning models for SPD matrices
Thibault de Surrel, Florian Yger, Fabien Lotte, Sylvain Chevallier
Main category: cs.LG
TL;DR: This paper unifies machine learning tools on SPD matrices under a probabilistic framework using Gaussian distributions on the Riemannian manifold.
Details
Motivation: To provide a unified probabilistic framework for various machine learning tools operating on the Riemannian manifold of Symmetric Positive Definite matrices.
Method: Develops several Gaussian distributions defined on the SPD manifold and reinterprets popular classifiers as Bayes Classifiers using these distributions.
Result: Shows that Gaussian distributions are pervasive in SPD matrix tools, enabling extension of other ML methods to this manifold.
Conclusion: The probabilistic framework allows unification of existing tools and facilitates extension of other machine learning methods to the SPD manifold.
Abstract: The goal of this paper is to show how different machine learning tools on the Riemannian manifold $\mathcal{P}_d$ of Symmetric Positive Definite (SPD) matrices can be united under a probabilistic framework. For this, we will need several Gaussian distributions defined on $\mathcal{P}_d$. We will show how popular classifiers on $\mathcal{P}_d$ can be reinterpreted as Bayes Classifiers using these Gaussian distributions. These distributions will also be used for outlier detection and dimension reduction. By showing that those distributions are pervasive in the tools used on $\mathcal{P}_d$, we allow for other machine learning tools to be extended to $\mathcal{P}_d$.
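As one simple instance of the "Gaussian on the SPD manifold + Bayes classifier" idea, here is a hedged sketch that fits a Gaussian in the log-Euclidean tangent space of each class and classifies by maximum posterior; the paper's Riemannian Gaussians differ in detail.

```python
# Illustrative log-Euclidean Bayes classifier for SPD matrices (an assumption,
# not the paper's exact construction).
import numpy as np
from scipy.stats import multivariate_normal

def vectorize_spd(S):
    # Matrix logarithm of an SPD matrix via its eigendecomposition,
    # followed by upper-triangular vectorization.
    w, V = np.linalg.eigh(S)
    L = (V * np.log(w)) @ V.T
    return L[np.triu_indices_from(L)]

class LogEuclideanBayes:
    def fit(self, mats, labels):
        X = np.array([vectorize_spd(S) for S in mats])
        self.classes_, self.dists_, self.priors_ = np.unique(labels), {}, {}
        for c in self.classes_:
            Xc = X[labels == c]
            cov = np.cov(Xc.T) + 1e-6 * np.eye(X.shape[1])  # regularized covariance
            self.dists_[c] = multivariate_normal(Xc.mean(0), cov)
            self.priors_[c] = len(Xc) / len(X)
        return self

    def predict(self, mats):
        X = np.array([vectorize_spd(S) for S in mats])
        scores = np.stack([self.dists_[c].logpdf(X) + np.log(self.priors_[c])
                           for c in self.classes_], axis=1)
        return self.classes_[scores.argmax(1)]

# Toy SPD data: A @ A.T + I with a class-dependent scale.
rng = np.random.default_rng(0)
def sample(scale, n=50, d=4):
    return [scale * (A @ A.T) + np.eye(d) for A in rng.normal(size=(n, d, d))]
mats = sample(0.5) + sample(2.0)
labels = np.array([0] * 50 + [1] * 50)
clf = LogEuclideanBayes().fit(mats, labels)
print((clf.predict(mats) == labels).mean())
```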
[835] A Tale of Two Symmetries: Exploring the Loss Landscape of Equivariant Models
YuQing Xie, Tess Smidt
Main category: cs.LG
TL;DR: Equivariant networks can face optimization challenges due to hidden parameter symmetries that create problematic loss landscapes. Relaxing equivariance constraints can help escape these minima by allowing different group representations.
Details
Motivation: To understand why equivariant networks are harder to optimize than standard networks, and whether the optimization difficulties stem from fundamental obstacles or just require different hyperparameter tuning.
Method: Theoretical analysis of loss landscape geometry for networks built using permutation representations, viewing them as subsets of unconstrained MLPs. Empirical demonstration of constraint relaxation effects.
Result: Parameter symmetries in unconstrained models can prevent learning of global minima in equivariant subspaces. Relaxing constraints can solve this issue, with weights converging to different group representations in hidden layers.
Conclusion: Hidden parameter symmetries broken by constraint enforcement create problematic loss landscapes. Constraint relaxation can help escape these minima, suggesting the need to rethink fixed group representations in hidden layers.
Abstract: Equivariant neural networks have proven to be effective for tasks with known underlying symmetries. However, optimizing equivariant networks can be tricky and best training practices are less established than for standard networks. In particular, recent works have found small training benefits from relaxing equivariance constraints. This raises the question: do equivariance constraints introduce fundamental obstacles to optimization? Or do they simply require different hyperparameter tuning? In this work, we investigate this question through a theoretical analysis of the loss landscape geometry. We focus on networks built using permutation representations, which we can view as a subset of unconstrained MLPs. Importantly, we show that the parameter symmetries of the unconstrained model have nontrivial effects on the loss landscape of the equivariant subspace and under certain conditions can provably prevent learning of the global minima. Further, we empirically demonstrate that in such cases, relaxing to an unconstrained MLP can sometimes solve the issue. Interestingly, the weights eventually found via relaxation correspond to a different choice of group representation in the hidden layer. From this, we draw 3 key takeaways. (1) By viewing the unconstrained version of an architecture, we can uncover hidden parameter symmetries which were broken by the choice of constraint enforcement. (2) Hidden symmetries give important insights on loss landscapes and can induce critical points and even minima. (3) Hidden-symmetry-induced minima can sometimes be escaped by constraint relaxation, and we observe the network jumping to a different choice of constraint enforcement. Effective equivariance relaxation may require rethinking the fixed choice of group representation in the hidden layers.
[836] Dequantified Diffusion-Schrödinger Bridge for Density Ratio Estimation
Wei Chen, Shigui Li, Jiacheng Li, Junmei Yang, John Paisley, Delu Zeng
Main category: cs.LG
TL;DR: D³RE is a unified framework for robust, stable, and efficient density ratio estimation that addresses density-chasm and support-chasm problems using diffusion bridges and Gaussian dequantization.
Details
Motivation: Existing density ratio estimation methods fail under significantly different distributions or inadequately overlapping supports, and yield divergent time scores near boundaries, leading to instability.
Method: Proposes dequantified diffusion bridge interpolant (DDBI) for support coverage expansion and time score stabilization, and dequantified Schrödinger bridge interpolant (DSBI) that incorporates optimal transport to solve the Schrödinger bridge problem.
Result: The method offers uniform approximation and bounded time scores theoretically, and outperforms baselines empirically in mutual information and density estimation tasks.
Conclusion: D³RE provides a robust, stable, and efficient solution for density ratio estimation that addresses fundamental limitations of existing approaches.
Abstract: Density ratio estimation is fundamental to tasks involving $f$-divergences, yet existing methods often fail under significantly different distributions or inadequately overlapping supports – the density-chasm and the support-chasm problems. Additionally, prior approaches yield divergent time scores near boundaries, leading to instability. We design D³RE, a unified framework for robust, stable, and efficient density ratio estimation. We propose the dequantified diffusion bridge interpolant (DDBI), which expands support coverage and stabilizes time scores via diffusion bridges and Gaussian dequantization. Building on DDBI, the proposed dequantified Schrödinger bridge interpolant (DSBI) incorporates optimal transport to solve the Schrödinger bridge problem, enhancing accuracy and efficiency. Our method offers uniform approximation and bounded time scores in theory, and outperforms baselines empirically in mutual information and density estimation tasks.
[837] LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments
Jin Huang, Yuchao Jin, Le An, Josh Park
Main category: cs.LG
TL;DR: An efficient Vision-Language Model pipeline optimized for embedded devices that achieves 2.5-3.2x latency reduction through patch selection, token selection, and speculative decoding.
Details
Motivation: To enable real-time VLM deployment on resource-constrained embedded devices used in robotics and autonomous driving by reducing computational overhead.
Method: Jointly uses patch selection to filter irrelevant camera views, token selection to reduce LLM input sequence length, and speculative decoding to accelerate token generation.
Result: Achieves 2.5x end-to-end latency reduction on NVIDIA DRIVE Thor platform without accuracy loss, increasing to 3.2x with FP8 post-training quantization.
Conclusion: The pipeline is a viable solution for real-time VLM deployment in resource-constrained environments.
Abstract: This paper introduces an efficient Vision-Language Model (VLM) pipeline specifically optimized for deployment on embedded devices, such as those used in robotics and autonomous driving. The pipeline significantly reduces the computational overhead by jointly leveraging patch selection to filter irrelevant camera views, a token selection module to reduce input sequence length for the LLM, and speculative decoding to accelerate token generation. Evaluated on the NVIDIA DRIVE Thor platform for autonomous driving applications, our pipeline achieves a $2.5\times$ end-to-end latency reduction without compromising task accuracy. The speed-up further increases to $3.2\times$ when applying FP8 post-training quantization. These results demonstrate that our pipeline is a viable solution for enabling real-time VLM deployment in resource-constrained environments.
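To illustrate the token-selection step, here is a hedged sketch that keeps only the visual tokens most relevant to the text query, shortening the sequence the LLM has to process. The dot-product scoring rule and keep ratio are illustrative assumptions, not LiteVLM's exact design.

```python
# Illustrative visual-token pruning (an assumption, not the paper's module).
import torch

def select_visual_tokens(vision_tokens, text_tokens, keep_ratio=0.25):
    """vision_tokens: (B, Nv, D), text_tokens: (B, Nt, D)."""
    query = text_tokens.mean(dim=1, keepdim=True)                 # (B, 1, D)
    scores = (vision_tokens * query).sum(-1)                      # relevance per token
    k = max(1, int(keep_ratio * vision_tokens.shape[1]))
    top = scores.topk(k, dim=1).indices                           # (B, k)
    idx = top.unsqueeze(-1).expand(-1, -1, vision_tokens.shape[-1])
    return torch.gather(vision_tokens, 1, idx)                    # (B, k, D)

v = torch.randn(2, 196, 512)   # e.g. 14x14 patch tokens per camera view
t = torch.randn(2, 12, 512)    # text prompt embeddings
print(select_visual_tokens(v, t).shape)   # torch.Size([2, 49, 512])
```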
[838] An Effective Flow-based Method for Positive-Unlabeled Learning: 2-HNC
Dorit Hochbaum, Torpong Nitayanont
Main category: cs.LG
TL;DR: 2-HNC is a two-stage network flow-based method for positive-unlabeled learning that uses Hochbaum’s Normalized Cut to rank unlabeled samples by negative likelihood and selects optimal partitions based on estimated positive class proportion.
Details
Motivation: Address the challenge of binary classification where only positive instances are labeled in training data, with the rest unlabeled (PU learning scenario).
Method: Two-stage approach: 1) Uses Hochbaum’s Normalized Cut to generate nested partitions and rank unlabeled samples by negative likelihood without assuming negative labels; 2) Augments positive set with likely-negative samples and recomputes classification, selecting partition with positive proportion closest to given prior estimate.
Result: Extensive experiments on synthetic and real datasets show 2-HNC achieves strong performance and often outperforms existing state-of-the-art algorithms.
Conclusion: The proposed 2-HNC method effectively addresses PU learning by leveraging network flow techniques and pairwise similarities, demonstrating superior performance compared to current approaches.
Abstract: In many scenarios of binary classification, only positive instances are provided in the training data, leaving the rest of the data unlabeled. This setup, known as positive-unlabeled (PU) learning, is addressed here with a network flow-based method which utilizes pairwise similarities between samples. The method we propose here, 2-HNC, leverages Hochbaum’s Normalized Cut (HNC) and the set of solutions it provides by solving a parametric minimum cut problem. The set of solutions, which are nested partitions of the samples into two sets, corresponds to varying tradeoff values between the two goals: high intra-similarity inside the sets and low inter-similarity between the two sets. This nested sequence is utilized here to deliver a ranking of unlabeled samples by their likelihood of being negative. Building on this insight, our method, 2-HNC, proceeds in two stages. The first stage generates this ranking without assuming any negative labels, using a problem formulation that is constrained only on positive labeled samples. The second stage augments the positive set with likely-negative samples and recomputes the classification. The final label prediction selects, among all partitions generated in both stages, the one whose positive class proportion is closest to a prior estimate of this quantity, which is assumed to be given. Extensive experiments across synthetic and real datasets show that 2-HNC yields strong performance and often surpasses existing state-of-the-art algorithms.
[839] ConTextTab: A Semantics-Aware Tabular In-Context Learner
Marco Spinaci, Marek Polewczyk, Maximilian Schambach, Sam Thelin
Main category: cs.LG
TL;DR: ConTextTab combines semantic understanding with table-native ICL framework to address limitations of existing tabular ICL methods, achieving competitive SOTA performance across benchmarks.
Details
Motivation: Current table-native ICL models lack semantic understanding due to synthetic data training, while LLM-based tabular ICL models have limited context capacity. The goal is to combine the strengths of both approaches.
Method: Integrates semantic understanding into table-native ICL framework using specialized embeddings for different data modalities and training on large-scale real-world tabular data.
Result: Competitive with SOTA across broad benchmarks and sets new standard on semantically rich CARTE benchmark.
Conclusion: ConTextTab successfully bridges the gap between efficient table-native architectures and rich semantic understanding, providing an effective solution for tabular ICL.
Abstract: Tabular in-context learning (ICL) has recently achieved state-of-the-art (SOTA) performance on several tabular prediction tasks. Previously restricted to classification problems on small tables, recent advances such as TabPFN and TabICL have extended its use to larger datasets. Although current table-native ICL architectures are architecturally efficient and well-adapted to tabular data structures, their exclusive training on synthetic data limits their ability to fully leverage the rich semantics and world knowledge contained in real-world tabular data. At the other end of the spectrum, tabular ICL models based on pretrained large language models such as TabuLa-8B integrate deep semantic understanding and world knowledge but are only able to make use of a small amount of context due to inherent architectural limitations. With the aim to combine the best of both these worlds, we introduce ConTextTab, integrating semantic understanding and alignment into a table-native ICL framework. By employing specialized embeddings for different data modalities and by training on large-scale real-world tabular data, our model is competitive with SOTA across a broad set of benchmarks while setting a new standard on the semantically rich CARTE benchmark. Code and model checkpoints are available at: https://github.com/SAP-samples/sap-rpt-1-oss.
[840] Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning
Zijun Chen, Shengbo Wang, Nian Si
Main category: cs.LG
TL;DR: The paper proposes two distributionally robust average-reward reinforcement learning algorithms with near-optimal sample complexity for applications requiring stable long-term performance like robotics and healthcare.
Details
Motivation: Address the need for stable long-term performance in critical applications such as robotics, operations research, and healthcare through distributionally robust average-reward reinforcement learning.
Method: Two algorithms: 1) Reduction to DR discounted MDP, 2) Anchored DR Average-Reward MDP that introduces an anchoring state to stabilize controlled transition kernels within uncertainty sets. Both assume nominal MDP is uniformly ergodic.
Result: Both algorithms achieve sample complexity of Õ(|S||A|t_mix²ε⁻²) for estimating optimal policy and robust average reward under KL and f_k-divergence uncertainty sets, with small uncertainty radius. This is the first finite-sample convergence guarantee for DR average-reward RL.
Conclusion: The proposed algorithms provide the first finite-sample convergence guarantees for distributionally robust average-reward reinforcement learning with near-optimal sample complexity, validated through numerical experiments.
Abstract: Motivated by practical applications where stable long-term performance is critical-such as robotics, operations research, and healthcare-we study the problem of distributionally robust (DR) average-reward reinforcement learning. We propose two algorithms that achieve near-optimal sample complexity. The first reduces the problem to a DR discounted Markov decision process (MDP), while the second, Anchored DR Average-Reward MDP, introduces an anchoring state to stabilize the controlled transition kernels within the uncertainty set. Assuming the nominal MDP is uniformly ergodic, we prove that both algorithms attain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}| t_{\mathrm{mix}}^2\varepsilon^{-2}\right)$ for estimating the optimal policy as well as the robust average reward under KL and $f_k$-divergence-based uncertainty sets, provided the uncertainty radius is sufficiently small. Here, $\varepsilon$ is the target accuracy, $|\mathbf{S}|$ and $|\mathbf{A}|$ denote the sizes of the state and action spaces, and $t_{\mathrm{mix}}$ is the mixing time of the nominal MDP. This represents the first finite-sample convergence guarantee for DR average-reward reinforcement learning. We further validate the convergence rates of our algorithms through numerical experiments.
[841] Is Grokking a Computational Glass Relaxation?
Xiaotian Zhang, Yue Shang, Entao Yang, Ge Zhang
Main category: cs.LG
TL;DR: Grokking is interpreted as computational glass relaxation, where neural networks transition from memorization to generalization through a slow relaxation process without entropy barriers, challenging phase transition theories.
Details
Motivation: To understand the underlying mechanisms of neural network generalizability by studying the grokking phenomenon, where networks suddenly generalize long after perfect training performance.
Method: Framing neural networks as physical systems with parameters as degrees of freedom and train loss as energy, sampling Boltzmann entropy landscapes, and developing a Wang-Landau molecular dynamics inspired optimizer (WanD).
Result: No entropy barrier found in memorization-to-generalization transition, challenging first-order phase transition theories; high-entropy advantage identified; WanD optimizer eliminates grokking and finds high-norm generalizing solutions.
Conclusion: Grokking is not a first-order phase transition with entropy barriers, but a far-from-equilibrium relaxation process; new optimizer design principles can eliminate grokking and challenge existing theories about weight norm evolution.
Abstract: Understanding the generalizability of neural networks (NNs) remains a central question in deep learning research. The special phenomenon of grokking, where NNs abruptly generalize long after the training performance reaches a near-perfect level, offers a unique window to investigate the underlying mechanisms of NNs’ generalizability. Here we propose an interpretation of grokking by framing it as a computational glass relaxation: viewing NNs as a physical system where parameters are the degrees of freedom and train loss is the system energy, we find that the memorization process resembles a rapid cooling of a liquid into a non-equilibrium glassy state at low temperature, and the later generalization is like a slow relaxation towards a more stable configuration. This mapping enables us to sample NNs’ Boltzmann entropy (density of states) landscape as a function of training loss and test accuracy. Our experiments with transformers on arithmetic tasks suggest that there is NO entropy barrier in the memorization-to-generalization transition of grokking, challenging previous theory that defines grokking as a first-order phase transition. We identify a high-entropy advantage under grokking, an extension of prior work linking entropy to generalizability but much more significant. Inspired by grokking’s far-from-equilibrium nature, we develop a toy optimizer WanD based on Wang-Landau molecular dynamics, which can eliminate grokking without any constraints and find high-norm generalizing solutions. This provides strictly defined counterexamples to theory attributing grokking solely to weight norm evolution towards the Goldilocks zone and also suggests new potential ways for optimizer design.
[842] Flat Channels to Infinity in Neural Loss Landscapes
Flavio Martinelli, Alexander Van Meegen, Berfin Şimşek, Wulfram Gerstner, Johanni Brea
Main category: cs.LG
TL;DR: The paper identifies and characterizes ‘channels to infinity’ in neural network loss landscapes where parameters diverge to infinity while forming gated linear units, which appear as flat local minima but are actually special structures that gradient-based optimizers frequently reach.
Details
Motivation: To understand special structures in neural network loss landscapes where parameters diverge to infinity while maintaining functional equivalence, and to explain why gradient-based optimizers frequently converge to these quasi-flat regions that appear as local minima.
Method: The authors analyze gradient dynamics and geometry of loss landscapes, identifying channels where output weights diverge to ±infinity while input weights become equal, forming gated linear units. They study this phenomenon across diverse regression settings using gradient flow, SGD, and ADAM optimizers.
Result: Gradient-based optimizers reach these ‘channels to infinity’ with high probability, where neurons implement gated linear units through parameter divergence. These channels are asymptotically parallel to symmetry-induced critical points and appear as flat local minima without careful inspection.
Conclusion: The characterization provides a comprehensive understanding of quasi-flat regions in neural network loss landscapes, revealing that the emergence of gated linear units at the end of these channels highlights surprising computational capabilities of fully connected layers.
Abstract: The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm\infty$, and their input weight vectors, $\mathbf{w_i}$ and $\mathbf{w_j}$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i\sigma(\mathbf{w_i} \cdot \mathbf{x}) + a_j\sigma(\mathbf{w_j} \cdot \mathbf{x}) \rightarrow \sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x}) \sigma'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
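As a quick sanity check on the limiting form quoted above (an illustrative parametrization; the paper characterizes the full channel geometry), take

$$a_i = a,\qquad a_j = 1-a,\qquad \mathbf{w_i} = \mathbf{w} + \tfrac{1}{a}\mathbf{v},\qquad \mathbf{w_j} = \mathbf{w}.$$

Then

$$a\,\sigma\!\Big(\mathbf{w}\cdot\mathbf{x} + \tfrac{\mathbf{v}\cdot\mathbf{x}}{a}\Big) + (1-a)\,\sigma(\mathbf{w}\cdot\mathbf{x}) \;=\; \sigma(\mathbf{w}\cdot\mathbf{x}) + a\Big[\sigma\!\Big(\mathbf{w}\cdot\mathbf{x} + \tfrac{\mathbf{v}\cdot\mathbf{x}}{a}\Big) - \sigma(\mathbf{w}\cdot\mathbf{x})\Big] \;\longrightarrow\; \sigma(\mathbf{w}\cdot\mathbf{x}) + (\mathbf{v}\cdot\mathbf{x})\,\sigma'(\mathbf{w}\cdot\mathbf{x})$$

as $a \to \infty$: the output weights diverge to $\pm\infty$ while the two input weight vectors collide, which is exactly the gated-linear-unit limit described in the abstract.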
[843] Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian
Main category: cs.LG
TL;DR: Continuous CoTs enable transformers to solve directed graph reachability in diameter steps by encoding multiple search frontiers as superposition states, outperforming discrete CoTs that require sequential search.
Details
Motivation: To understand why continuous CoTs outperform discrete CoTs in reasoning tasks like directed graph reachability, and provide theoretical justification for this performance gap.
Method: Theoretical analysis showing that a two-layer transformer with continuous CoTs can solve directed graph reachability in D steps (graph diameter), where continuous thought vectors encode multiple search frontiers as superposition states enabling parallel BFS.
Result: Continuous CoTs solve directed graph reachability in D steps, while discrete CoTs require O(n²) steps. Experiments confirm the theoretical construction aligns with empirical training dynamics, showing superposition states emerge naturally without explicit supervision.
Conclusion: Continuous CoTs’ ability to encode multiple search frontiers as superposition states enables efficient parallel search, explaining their superiority over discrete CoTs in graph reasoning tasks.
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance in many applications, including challenging reasoning problems via chain-of-thoughts (CoTs) techniques that generate ``thinking tokens’’ before answering the questions. While existing theoretical works demonstrate that CoTs with discrete tokens boost the capability of LLMs, recent work on continuous CoTs lacks a theoretical understanding of why it outperforms discrete counterparts in various reasoning tasks such as directed graph reachability, a fundamental graph reasoning problem that includes many practical domain applications as special cases. In this paper, we prove that a two-layer transformer with $D$ steps of continuous CoTs can solve the directed graph reachability problem, where $D$ is the diameter of the graph, while the best known result of constant-depth transformers with discrete CoTs requires $O(n^2)$ decoding steps where $n$ is the number of vertices ($D<n$). In our construction, each continuous thought vector is a superposition state that encodes multiple search frontiers simultaneously (i.e., parallel breadth-first search (BFS)), while discrete CoTs must choose a single path sampled from the superposition state, which leads to sequential search that requires many more steps and may be trapped into local solutions. We also performed extensive experiments to verify that our theoretical construction aligns well with the empirical solution obtained via training dynamics. Notably, encoding of multiple search frontiers as a superposition state automatically emerges in training continuous CoTs, without explicit supervision to guide the model to explore multiple paths simultaneously.
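A toy illustration of the superposition intuition (not the paper's transformer construction): if the reasoning state is a vector over all vertices, one propagation step expands every frontier node in parallel, so reachability is resolved in at most diameter-many steps, whereas sampling a single path explores one branch at a time.

```python
# Parallel-BFS view of a "superposition" frontier vector.
import numpy as np

def reachable_in_diameter_steps(adj, source, target):
    n = adj.shape[0]
    frontier = np.zeros(n)
    frontier[source] = 1.0                      # superposition over {source}
    for step in range(1, n):                    # at most diameter < n steps
        frontier = np.clip(frontier + adj.T @ frontier, 0, 1)
        if frontier[target] > 0:
            return step
    return None

# Small DAG: 0 -> 1 -> 3, 0 -> 2 -> 3, 3 -> 4 (diameter 3).
adj = np.zeros((5, 5))
for u, v in [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]:
    adj[u, v] = 1
print(reachable_in_diameter_steps(adj, 0, 4))   # 3
```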
[844] Over-squashing in Spatiotemporal Graph Neural Networks
Ivan Marisca, Jacob Bamberger, Cesare Alippi, Michael M. Bronstein
Main category: cs.LG
TL;DR: This paper formalizes spatiotemporal over-squashing in STGNNs, showing it differs from static GNNs and that convolutional STGNNs favor information from temporally distant points over nearby ones.
Details
Motivation: While over-squashing is well-studied in static GNNs, it remains unexplored in Spatiotemporal GNNs (STGNNs) where temporal dimensions amplify the information propagation challenge.
Method: Theoretical analysis of spatiotemporal over-squashing, examining different processing paradigms (time-and-space vs time-then-space) and validating findings on synthetic and real-world datasets.
Result: Convolutional STGNNs counterintuitively favor information from temporally distant points rather than close ones, and both processing paradigms are equally affected by spatiotemporal over-squashing.
Conclusion: The work provides theoretical justification for computationally efficient implementations and principled guidance for more effective STGNN designs through deeper insights into their operational dynamics.
Abstract: Graph Neural Networks (GNNs) have achieved remarkable success across various domains. However, recent theoretical advances have identified fundamental limitations in their information propagation capabilities, such as over-squashing, where distant nodes fail to effectively exchange information. While extensively studied in static contexts, this issue remains unexplored in Spatiotemporal GNNs (STGNNs), which process sequences associated with graph nodes. Nonetheless, the temporal dimension amplifies this challenge by increasing the information that must be propagated. In this work, we formalize the spatiotemporal over-squashing problem and demonstrate its distinct characteristics compared to the static case. Our analysis reveals that, counterintuitively, convolutional STGNNs favor information propagation from points temporally distant rather than close in time. Moreover, we prove that architectures that follow either time-and-space or time-then-space processing paradigms are equally affected by this phenomenon, providing theoretical justification for computationally efficient implementations. We validate our findings on synthetic and real-world datasets, providing deeper insights into their operational dynamics and principled guidance for more effective designs.
[845] Differentiable Generalized Sliced Wasserstein Plans
Laetitia Chapel, Romain Tavenard, Samuel Vaiter
Main category: cs.LG
TL;DR: The paper proposes a differentiable approximation scheme for min-SWGG, reformulating it as a bilevel optimization problem to efficiently find optimal slices in high dimensions and extending it to manifolds.
Details
Motivation: To overcome limitations of slicing methods in Optimal Transport, particularly the exponential growth of required slices with dimension and constraint to linear projections.
Method: Reformulate min-SWGG as a bilevel optimization problem with differentiable approximation for optimal slice identification, and extend it to handle data on manifolds.
Result: The approach enables efficient computation of transport plans in high-dimensional settings and on manifolds, demonstrating practical value in gradient flows and image generation.
Conclusion: The proposed differentiable approximation scheme significantly improves the scalability and applicability of sliced Optimal Transport methods.
Abstract: Optimal Transport (OT) has attracted significant interest in the machine learning community, not only for its ability to define meaningful distances between probability distributions – such as the Wasserstein distance – but also for its formulation of OT plans. Its computational complexity remains a bottleneck, though, and slicing techniques have been developed to scale OT to large datasets. Recently, a novel slicing scheme, dubbed min-SWGG, lifts a single one-dimensional plan back to the original multidimensional space, finally selecting the slice that yields the lowest Wasserstein distance as an approximation of the full OT plan. Despite its computational and theoretical advantages, min-SWGG inherits typical limitations of slicing methods: (i) the number of required slices grows exponentially with the data dimension, and (ii) it is constrained to linear projections. Here, we reformulate min-SWGG as a bilevel optimization problem and propose a differentiable approximation scheme to efficiently identify the optimal slice, even in high-dimensional settings. We furthermore define a generalized extension to accommodate data living on manifolds. Finally, we demonstrate the practical value of our approach in various applications, including gradient flows on manifolds and high-dimensional spaces, as well as a novel sliced OT-based conditional flow matching for image generation – where fast computation of transport plans is essential.
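For background, here is a sketch of the basic min-SWGG selection that the paper then makes differentiable: each random direction induces a one-dimensional, sorting-based pairing, which is lifted back to the ambient space, and the slice with the lowest lifted transport cost is kept. The paper's contribution, the bilevel, gradient-based search over the slice, is not reproduced here.

```python
# Random-search min-SWGG baseline (illustrative; the paper replaces the random
# search with a differentiable bilevel scheme).
import numpy as np

def min_swgg_plan(X, Y, n_slices=200, seed=0):
    rng = np.random.default_rng(seed)
    best_cost, best_pairing = np.inf, None
    for _ in range(n_slices):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)
        # 1-D projections and the pairing given by sorting.
        px, py = np.argsort(X @ theta), np.argsort(Y @ theta)
        cost = np.sum((X[px] - Y[py]) ** 2)   # lifted cost in the ambient space
        if cost < best_cost:
            best_cost, best_pairing = cost, (px, py)
    return best_cost, best_pairing

X = np.random.default_rng(1).normal(size=(100, 10))
Y = np.random.default_rng(2).normal(size=(100, 10)) + 1.0
cost, (px, py) = min_swgg_plan(X, Y)
print(cost / len(X))   # average squared transport cost of the selected plan
```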
[846] TabArena: A Living Benchmark for Machine Learning on Tabular Data
Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, Frank Hutter
Main category: cs.LG
TL;DR: TabArena is the first continuously maintained living benchmark for tabular data that addresses limitations of static benchmarks by providing regularly updated datasets, models, and leaderboards.
Details
Motivation: Current tabular benchmarks are static and don't adapt to discovered flaws, model updates, or new model releases, creating a need for a continuously maintained benchmarking system.
Method: Manual curation of representative datasets and well-implemented models, large-scale benchmarking study to initialize public leaderboard, and establishment of maintenance protocols with experienced maintainers.
Result: Gradient-boosted trees remain strong contenders, deep learning methods catch up with larger time budgets and ensembling, foundation models excel on smaller datasets, and cross-model ensembles advance state-of-the-art (though some deep models are overrepresented due to validation overfitting).
Conclusion: TabArena provides a living benchmark with public leaderboard, reproducible code, and maintenance protocols to enable continuous evaluation and improvement of tabular machine learning models.
Abstract: With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.
[847] Diversity-Aware Policy Optimization for Large Language Model Reasoning
Jian Yao, Ran Cheng, Xingyu Wu, Jibin Wu, Kay Chen Tan
Main category: cs.LG
TL;DR: This paper investigates the impact of diversity in RL-based training for LLM reasoning, proposing a novel diversity-aware policy optimization method that improves mathematical reasoning performance by 3.5% across benchmarks.
Details
Motivation: Despite the importance of diversity in reinforcement learning, its influence on LLM reasoning remains largely underexplored, creating a research gap that needs to be addressed.
Method: Proposes a diversity-aware policy optimization method that designs a token-level diversity metric, reformulates it into a practical objective, and selectively applies it to positive samples within the R1-zero training framework.
Result: The method achieves a 3.5% average improvement across four mathematical reasoning benchmarks while generating more diverse and robust solutions. Strong positive correlation observed between solution diversity and reasoning potential in high-performing models.
Conclusion: Explicitly promoting diversity during RL training significantly enhances LLM reasoning capabilities, demonstrating the importance of diversity-aware approaches in improving model performance and robustness.
Abstract: The reasoning capabilities of large language models (LLMs) have advanced rapidly, particularly following the release of DeepSeek R1, which has inspired a surge of research into data quality and reinforcement learning (RL) algorithms. Despite the pivotal role diversity plays in RL, its influence on LLM reasoning remains largely underexplored. To bridge this gap, this work presents a systematic investigation into the impact of diversity in RL-based training for LLM reasoning, and proposes a novel diversity-aware policy optimization method. Across evaluations on 12 LLMs, we observe a strong positive correlation between the solution diversity and Potential at k (a novel metric quantifying an LLM’s reasoning potential) in high-performing models. This finding motivates our method to explicitly promote diversity during RL training. Specifically, we design a token-level diversity metric, reformulate it into a practical objective, and selectively apply it to positive samples. Integrated into the R1-zero training framework, our method achieves a 3.5 percent average improvement across four mathematical reasoning benchmarks, while generating more diverse and robust solutions.
[848] TiRex: Zero-Shot Forecasting Across Long and Short Horizons with Enhanced In-Context Learning
Andreas Auer, Patrick Podest, Daniel Klotz, Sebastian Böck, Günter Klambauer, Sepp Hochreiter
Main category: cs.LG
TL;DR: TiRex introduces an enhanced LSTM-based model (xLSTM) for zero-shot time series forecasting, combining state-tracking capabilities with in-context learning to outperform transformer-based approaches.
Details
Motivation: Existing zero-shot forecasting methods rely on transformers, which underperform in time series compared to recurrent models like LSTMs. However, LSTMs lack strong in-context learning abilities, creating a gap that TiRex aims to bridge.
Method: TiRex leverages xLSTM (enhanced LSTM with competitive in-context learning) and proposes CPM, a training-time masking strategy to enhance state-tracking capabilities for long-horizon forecasting.
Result: TiRex achieves state-of-the-art performance on HuggingFace benchmarks GiftEval and Chronos-ZS, outperforming larger models like TabPFN-TS, Chronos Bolt, TimesFM, and Moirai across both short- and long-term forecasts.
Conclusion: TiRex successfully combines the state-tracking advantages of LSTMs with in-context learning capabilities, establishing a new paradigm for zero-shot time series forecasting that outperforms transformer-based approaches.
Abstract: In-context learning, the ability of large language models to perform tasks using only examples provided in the prompt, has recently been adapted for time series forecasting. This paradigm enables zero-shot prediction, where past values serve as context for forecasting future values, making powerful forecasting tools accessible to non-experts and increasing the performance when training data are scarce. Most existing zero-shot forecasting approaches rely on transformer architectures, which, despite their success in language, often fall short of expectations in time series forecasting, where recurrent models like LSTMs frequently have the edge. Conversely, while LSTMs are well-suited for time series modeling due to their state-tracking capabilities, they lack strong in-context learning abilities. We introduce TiRex that closes this gap by leveraging xLSTM, an enhanced LSTM with competitive in-context learning skills. Unlike transformers, state-space models, or parallelizable RNNs such as RWKV, TiRex retains state-tracking, a critical property for long-horizon forecasting. To further facilitate its state-tracking ability, we propose a training-time masking strategy called CPM. TiRex sets a new state of the art in zero-shot time series forecasting on the HuggingFace benchmarks GiftEval and Chronos-ZS, outperforming significantly larger models including TabPFN-TS (Prior Labs), Chronos Bolt (Amazon), TimesFM (Google), and Moirai (Salesforce) across both short- and long-term forecasts.
[849] Tight analyses of first-order methods with error feedback
Daniel Berg Thomsen, Adrien Taylor, Aymeric Dieuleveut
Main category: cs.LG
TL;DR: This paper provides tight convergence analysis of error feedback methods EF and EF²¹ for compressed communication in distributed learning, finding optimal Lyapunov functions and establishing matching lower bounds.
Details
Motivation: Communication compression is essential for reducing bottlenecks in distributed learning, but it degrades convergence. Error feedback schemes like EF and EF²¹ were introduced to mitigate this degradation, but their theoretical guarantees needed tighter analysis.
Method: The authors use a principled approach to find the optimal Lyapunov functions for EF and EF²¹ methods, enabling sharp convergence rate guarantees. The analysis is conducted in a simplified single-agent setting for clean theoretical insights.
Result: The paper establishes tight convergence bounds for both EF and EF²¹ methods with matching lower bounds, providing the best possible convergence rates for each method. This enables rigorous comparison between EF, EF²¹, and compressed gradient descent.
Conclusion: The tight analysis reveals the fundamental performance limits of error feedback methods for compressed communication, providing sharp theoretical guarantees and enabling fair comparison between different compression schemes in distributed learning.
Abstract: Communication between agents often constitutes a major computational bottleneck in distributed learning. One of the most common mitigation strategies is to compress the information exchanged, thereby reducing communication overhead. To counteract the degradation in convergence associated with compressed communication, error feedback schemes – most notably $\mathrm{EF}$ and $\mathrm{EF}^{21}$ – were introduced. In this work, we provide a tight analysis of both of these methods. Specifically, we find the Lyapunov function that yields the best possible convergence rate for each method – with matching lower bounds. This principled approach yields sharp performance guarantees and enables a rigorous, apples-to-apples comparison between $\mathrm{EF}$, $\mathrm{EF}^{21}$, and compressed gradient descent. Our analysis is carried out in the simplified single-agent setting, which allows for clean theoretical insights and fair comparison of the underlying mechanisms.
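For readers unfamiliar with error feedback, here is a single-agent sketch of compressed gradient descent with EF, the simplified setting the analysis works in: the part of the update discarded by the compressor is stored and re-injected into the next step. The top-k compressor and step sizes are illustrative choices.

```python
# Compressed gradient descent with error feedback (EF), single-agent sketch.
import numpy as np

def top_k(v, k):
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def ef_gd(grad, x0, lr=0.1, k=2, n_iter=200):
    x, e = x0.copy(), np.zeros_like(x0)
    for _ in range(n_iter):
        p = lr * grad(x) + e          # gradient step plus accumulated error
        delta = top_k(p, k)           # what is actually applied / transmitted
        x = x - delta
        e = p - delta                 # error feedback: keep what was dropped
    return x

grad = lambda x: 2 * (x - np.arange(10))      # gradient of ||x - [0..9]||^2
print(np.round(ef_gd(grad, np.zeros(10)), 2)) # approaches [0, 1, ..., 9]
```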
[850] Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training
Minhak Song, Beomhan Baek, Kwangjun Ahn, Chulhee Yun
Main category: cs.LG
TL;DR: Schedule-Free (SF) method is a scalable alternative to conventional pretraining strategies that avoids decay phases and memory overhead while implicitly performing weight averaging.
Details
Motivation: Conventional pretraining strategies with fixed compute budgets are inadequate for large-scale training, and existing alternatives like WSD and weight averaging have limitations such as explicit decay phases or additional memory costs.
Method: Revisits the Schedule-Free (SF) method, analyzes its dynamics theoretically and empirically, and proposes a refined variant that improves robustness to momentum and performance under large batch sizes.
Result: SF-AdamW effectively navigates the loss landscape without decay phases or auxiliary averaging, and the refined SF variant addresses key limitations of the original method.
Conclusion: SF is established as a practical, scalable, and theoretically grounded approach for language model training.
Abstract: As both model and dataset sizes continue to scale rapidly, conventional pretraining strategies with fixed compute budgets – such as cosine learning rate schedules – are increasingly inadequate for large-scale training. Recent alternatives, including warmup-stable-decay (WSD) schedules and weight averaging, offer greater flexibility. However, WSD relies on explicit decay phases to track progress, while weight averaging addresses this limitation at the cost of additional memory. In search of a more principled and scalable alternative, we revisit the Schedule-Free (SF) method [Defazio et al., 2024], which has shown strong empirical performance across diverse settings. We show that SF-AdamW effectively navigates the “river” structure of the loss landscape without decay phases or auxiliary averaging, making it particularly suitable for continuously scaling training workloads. To understand this behavior, we conduct a theoretical and empirical analysis of SF dynamics, revealing that it implicitly performs weight averaging without memory overhead. Guided by this analysis, we propose a refined variant of SF that improves robustness to momentum and performs better under large batch sizes, addressing key limitations of the original method. Together, these results establish SF as a practical, scalable, and theoretically grounded approach for language model training.
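As a reference point for the implicit-averaging claim, here is a hedged sketch of the Schedule-Free idea in its plain SGD form as commonly presented (gradient taken at an interpolation of the base iterate and a running average, with the average being the evaluated model); the constants are illustrative and this is not the SF-AdamW variant studied in the paper.

```python
# Schedule-Free SGD sketch: no decay schedule, no extra averaging buffer beyond x.
import numpy as np

def schedule_free_sgd(grad, x0, lr=0.05, beta=0.9, n_iter=500):
    z = x0.copy()          # base (SGD-like) sequence
    x = x0.copy()          # averaged sequence, used for evaluation
    for t in range(1, n_iter + 1):
        y = (1 - beta) * z + beta * x        # gradient evaluation point
        z = z - lr * grad(y)
        c = 1.0 / t
        x = (1 - c) * x + c * z              # online average of the z iterates
    return x

grad = lambda w: 2 * (w - 3.0)               # gradient of (w - 3)^2
print(schedule_free_sgd(grad, np.zeros(1)))  # ~[3.]
```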
[851] Information-Theoretic Framework for Understanding Modern Machine-Learning
Meir Feder, Ruediger Urbanke, Yaniv Fogel
Main category: cs.LG
TL;DR: The paper introduces an information-theoretic framework for learning as universal prediction under log loss, defining architecture-based model complexity through the probability mass of models near the data-generating process, related to spectral properties of the Hessian/Fisher Information Matrix.
Details
Motivation: To understand why certain machine learning architectures (like deep neural networks and transformers) succeed, and to provide a unified framework that explains learning phenomena across different settings and regimes.
Method: Developed an information-theoretic framework using regret bounds, defining model complexity via the volume of models near the data-generating process, with tractable approximations through spectral analysis of Hessian/Fisher Information Matrix.
Result: The framework explains that successful architectures have broad complexity ranges, sheds light on inductive biases, SGD effectiveness, flat minima phenomena, and unifies various learning settings (online/batch, supervised/generative, realizable/agnostic).
Conclusion: The framework provides insights into why modern architectures work well (due to their layered structure creating broad complexity ranges) and opens possibilities for designing alternative architectures with potentially superior performance.
Abstract: We introduce an information-theoretic framework that views learning as universal prediction under log loss, characterized through regret bounds. Central to the framework is an effective notion of architecture-based model complexity, defined by the probability mass or volume of models in the vicinity of the data-generating process, or its projection on the model class. This volume is related to spectral properties of the expected Hessian or the Fisher Information Matrix, leading to tractable approximations. We argue that successful architectures possess a broad complexity range, enabling learning in highly over-parameterized model classes. The framework sheds light on the role of inductive biases, the effectiveness of stochastic gradient descent, and phenomena such as flat minima. It unifies online, batch, supervised, and generative settings, and applies across the stochastic-realizable and agnostic regimes. Moreover, it provides insights into the success of modern machine-learning architectures, such as deep neural networks and transformers, suggesting that their broad complexity range naturally arises from their layered structure. These insights open the door to the design of alternative architectures with potentially comparable or even superior performance.
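A heavily simplified illustration of how Hessian spectra can yield a tractable volume/complexity proxy: flatter curvature (small eigenvalues) implies a larger volume of nearby models. The Laplace-style formula below is an assumed toy proxy for intuition only, not the paper's definition.

```python
import numpy as np

def log_volume_proxy(hessian, eps=1e-8):
    """Laplace-style toy proxy: the log-volume of an ellipsoidal neighborhood of a minimum
    grows as the Hessian eigenvalues shrink, so flat minima get a lower effective complexity."""
    eigvals = np.clip(np.linalg.eigvalsh(hessian), eps, None)
    return -0.5 * np.sum(np.log(eigvals))

sharp = np.diag([100.0, 50.0, 10.0])   # sharp minimum: small surrounding volume
flat = np.diag([1.0, 0.5, 0.1])        # flat minimum: large surrounding volume
print(log_volume_proxy(sharp), log_volume_proxy(flat))   # the flat minimum scores the larger volume
```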
[852] A Self-Evolving AI Agent System for Climate Science
Zijie Guo, Jiong Wang, Fenghua Ling, Wangxu Wei, Xiaoyu Yue, Zhe Jiang, Wanghan Xu, Jing-Jia Luo, Lijing Cheng, Yoo-Geun Ham, Fengfei Song, Pierre Gentine, Toshio Yamagata, Ben Fei, Wenlong Zhang, Xinyu Gu, Chao Li, Yaqiang Wang, Tao Chen, Wanli Ouyang, Bowen Zhou, Lei Bai
Main category: cs.LG
TL;DR: EarthLink is the first self-evolving AI agent system that automates Earth science research workflows through natural language interaction, integrating planning, code execution, data analysis, and physical reasoning to overcome data fragmentation challenges.
Details
Motivation: The accelerating volume and fragmentation of multi-sphere Earth science data have surpassed human analytical capacity, creating a major bottleneck for discovery, especially in climate science.
Method: EarthLink automates the entire research workflow through natural language interaction, integrating planning, code execution, data analysis, and physical reasoning into a unified process.
Result: EarthLink achieves proficiency comparable to a junior researcher in expert evaluations on core climate tasks, and autonomously discovered precursors of the Atlantic Niño by developing research strategies, identifying predictability sources, and verifying hypotheses.
Conclusion: EarthLink enables a new human-AI research paradigm where scientists focus on value judgments while AI handles complex data analysis and knowledge integration, accelerating discovery in Earth sciences.
Abstract: Scientific progress in Earth science depends on integrating data across the planet’s interconnected spheres. However, the accelerating volume and fragmentation of multi-sphere knowledge and data have surpassed human analytical capacity. This creates a major bottleneck for discovery, especially in climate science. To address this challenge, we introduce EarthLink, the first self-evolving AI agent system designed as an interactive “copilot” for Earth scientists. Through natural language interaction, EarthLink automates the entire research workflow by integrating planning, code execution, data analysis, and physical reasoning into a unified process that directly addresses this limitation. Beyond efficiency, it exhibits human-like cross-disciplinary analytical ability and achieves proficiency comparable to a junior researcher in expert evaluations on core large-scale climate tasks, including model-observation comparison and climate change understanding. When tasked with an open scientific problem, specifically the discovery of precursors of the Atlantic Niño, EarthLink autonomously developed a research strategy, identified sources of predictability, verified its hypotheses with available data, and proposed a physically consistent mechanism. These emerging capabilities enable a new human-AI research paradigm. Scientists can focus on value and result judgments, while AI systems handle complex data analysis and knowledge integration. This accelerates the pace and breadth of discovery in Earth sciences. The system is accessible at our website https://earthlink.intern-ai.org.cn.
[853] Lost in Latent Space: An Empirical Study of Latent Diffusion Models for Physics Emulation
François Rozet, Ruben Ohana, Michael McCabe, Gilles Louppe, François Lanusse, Shirley Ho
Main category: cs.LG
TL;DR: Latent-space diffusion models can effectively emulate dynamical systems with high compression rates (up to 1000x) while maintaining accuracy, outperforming non-generative methods and providing greater prediction diversity.
Details
Motivation: To address the computational cost of diffusion models for physics emulation by adapting the latent-space generation approach used in image/video generation to dynamical systems.
Method: Using latent-space diffusion models with autoencoders for compression, investigating various compression rates and practical design choices including architectures and optimizers.
Result: Latent-space emulation maintains accuracy even with 1000x compression, diffusion-based emulators are more accurate than non-generative methods, and they provide greater prediction diversity to compensate for uncertainty.
Conclusion: Latent-space diffusion models are effective for dynamical system emulation, offering computational efficiency through compression while maintaining accuracy and handling uncertainty through diverse predictions.
Abstract: The steep computational cost of diffusion models at inference hinders their use as fast physics emulators. In the context of image and video generation, this computational drawback has been addressed by generating in the latent space of an autoencoder instead of the pixel space. In this work, we investigate whether a similar strategy can be effectively applied to the emulation of dynamical systems and at what cost. We find that the accuracy of latent-space emulation is surprisingly robust to a wide range of compression rates (up to 1000x). We also show that diffusion-based emulators are consistently more accurate than non-generative counterparts and compensate for uncertainty in their predictions with greater diversity. Finally, we cover practical design choices, spanning from architectures to optimizers, that we found critical to train latent-space emulators.
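The two-stage structure described above can be sketched in a few lines: an autoencoder compresses each physical state into a small latent vector, and a separate model advances the latent state. In the sketch below a plain MLP stands in for the latent diffusion model, and all module sizes are illustrative assumptions rather than the paper's architectures.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Compresses a flattened physical state into a small latent vector (sizes are illustrative)."""
    def __init__(self, state_dim=4096, latent_dim=4):   # roughly 1000x compression
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(state_dim, 256), nn.GELU(), nn.Linear(256, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.GELU(), nn.Linear(256, state_dim))

# A diffusion model would normally be trained to sample z_{t+1} given z_t;
# a plain MLP stands in for that latent emulator here, purely to show the pipeline.
latent_emulator = nn.Sequential(nn.Linear(4, 64), nn.GELU(), nn.Linear(64, 4))

ae = AutoEncoder()
state = torch.randn(8, 4096)          # a batch of physical states
z = ae.enc(state)                     # compress to latent space
z_next = latent_emulator(z)           # one emulation step in latent space
next_state = ae.dec(z_next)           # decode back to physical space
print(state.shape, z.shape, next_state.shape)
```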
[854] Graph Neural Networks for Electricity Load Forecasting
Eloi Campagne, Yvenn Amara-Ouali, Yannig Goude, Itai Zehavi, Argyris Kalogeratos
Main category: cs.LG
TL;DR: Graph Neural Networks with attention mechanisms and ensemble aggregation improve electricity load forecasting accuracy and interpretability compared to traditional methods.
Details
Motivation: Electricity demand forecasting is challenging due to decentralized energy systems and renewable sources integration, requiring models that can capture spatial dependencies and complex non-stationarities.
Method: Comprehensive framework integrating graph-based forecasting with attention mechanisms and ensemble aggregation strategies, evaluating multiple GNN architectures (Graph Convolutional Networks, GraphSAGE, APPNP, Graph Attention Networks) on synthetic, regional (France), and fine-grained (UK) datasets.
Result: Graph-aware models consistently outperform conventional baselines (Feed Forward Neural Networks, TiREX), attention layers provide insights into spatial interactions driven by meteorological/seasonal dynamics, and ensemble aggregation (especially bottom-up expert combination) improves robustness under heterogeneous data conditions.
Conclusion: Study highlights complementarity between structural modeling, interpretability, and robustness, discussing trade-offs between accuracy, model complexity, and transparency in graph-based electricity load forecasting.
Abstract: Forecasting electricity demand is increasingly challenging as energy systems become more decentralized and intertwined with renewable sources. Graph Neural Networks (GNNs) have recently emerged as a powerful paradigm to model spatial dependencies in load data while accommodating complex non-stationarities. This paper introduces a comprehensive framework that integrates graph-based forecasting with attention mechanisms and ensemble aggregation strategies to enhance both predictive accuracy and interpretability. Several GNN architectures – including Graph Convolutional Networks, GraphSAGE, APPNP, and Graph Attention Networks – are systematically evaluated on synthetic, regional (France), and fine-grained (UK) datasets. Empirical results demonstrate that graph-aware models consistently outperform conventional baselines such as Feed Forward Neural Networks and foundation models like TiREX. Furthermore, attention layers provide valuable insights into evolving spatial interactions driven by meteorological and seasonal dynamics. Ensemble aggregation, particularly through bottom-up expert combination, further improves robustness under heterogeneous data conditions. Overall, the study highlights the complementarity between structural modeling, interpretability, and robustness, and discusses the trade-offs between accuracy, model complexity, and transparency in graph-based electricity load forecasting.
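A minimal sketch of a graph-based load forecaster, assuming PyTorch Geometric is available; the node features, dimensions, and toy graph are illustrative, and the paper's attention and ensemble components are not shown.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv  # one of the architectures evaluated above

class LoadGCN(nn.Module):
    """Each node is a zone/substation; features are recent loads plus weather covariates (assumed)."""
    def __init__(self, in_dim=24, hidden=64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = nn.Linear(hidden, 1)   # next-step load per node

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        h = torch.relu(self.conv2(h, edge_index))
        return self.head(h).squeeze(-1)

# Toy usage: 10 zones, 24 features each, a handful of directed edges (src, dst).
x = torch.randn(10, 24)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]])
model = LoadGCN()
print(model(x, edge_index).shape)  # torch.Size([10])
```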
[855] CosmoBench: A Multiscale, Multiview, Multitask Cosmology Benchmark for Geometric Deep Learning
Ningyuan Huang, Richard Stiskalek, Jun-Young Lee, Adrian E. Bayer, Charles C. Margossian, Christian Kragh Jespersen, Lucia A. Perez, Lawrence K. Saul, Francisco Villaescusa-Navarro
Main category: cs.LG
TL;DR: CosmoBench is a large cosmological simulation dataset containing 34K point clouds and 25K directed trees for tasks like predicting cosmological parameters, halo velocities, and reconstructing merger trees. Baseline methods show simple invariant feature models can outperform deep learning approaches, highlighting potential for combining ML and cosmology.
Details
Motivation: To extract insights from cosmological simulation data and bridge the gap between cosmology and geometric deep learning by providing a comprehensive benchmark dataset.
Method: Curated dataset from state-of-the-art cosmological simulations (41M+ core-hours, 2PB data) containing point clouds and directed trees. Evaluated multiple approaches including cosmological modeling methods and machine learning (linear models, graph neural networks).
Result: Simple linear models with invariant features sometimes outperform more complex deep learning architectures with many parameters and longer training times. Dataset enables multiple tasks: cosmological parameter prediction, velocity prediction, and merger tree reconstruction.
Conclusion: CosmoBench establishes a foundation for combining machine learning and cosmology, showing tremendous potential for improvement. It sets the stage for bridging cosmology and geometric deep learning at scale, inviting community engagement.
Abstract: Cosmological simulations provide a wealth of data in the form of point clouds and directed trees. A crucial goal is to extract insights from this data that shed light on the nature and composition of the Universe. In this paper we introduce CosmoBench, a benchmark dataset curated from state-of-the-art cosmological simulations whose runs required more than 41 million core-hours and generated over two petabytes of data. CosmoBench is the largest dataset of its kind: it contains 34 thousand point clouds from simulations of dark matter halos and galaxies at three different length scales, as well as 25 thousand directed trees that record the formation history of halos on two different time scales. The data in CosmoBench can be used for multiple tasks – to predict cosmological parameters from point clouds and merger trees, to predict the velocities of individual halos and galaxies from their collective positions, and to reconstruct merger trees on finer time scales from those on coarser time scales. We provide several baselines on these tasks, some based on established approaches from cosmological modeling and others rooted in machine learning. For the latter, we study different approaches – from simple linear models that are minimally constrained by symmetries to much larger and more computationally-demanding models in deep learning, such as graph neural networks. We find that least-squares fits with a handful of invariant features sometimes outperform deep architectures with many more parameters and far longer training time. Still there remains tremendous potential to improve these baselines by combining machine learning and cosmology to fully exploit the data. CosmoBench sets the stage for bridging cosmology and geometric deep learning at scale. We invite the community to push the frontier of scientific discovery by engaging with this dataset, available at https://cosmobench.streamlit.app
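The finding that simple invariant-feature regressions can be competitive is easy to illustrate: the sketch below builds a handful of rotation- and translation-invariant summaries per point cloud and fits them with least squares. The feature choices and toy targets are assumptions, not the benchmark's protocol.

```python
import numpy as np

def invariant_features(points):
    """A few rotation/translation-invariant summaries of a point cloud (illustrative choices)."""
    centered = points - points.mean(axis=0)
    radii = np.linalg.norm(centered, axis=1)
    return np.array([radii.mean(), radii.std(), radii.max(),
                     np.log(len(points)), (radii ** 2).mean()])

# Toy setting: predict one "cosmological parameter" per cloud with least squares.
rng = np.random.default_rng(0)
clouds = [rng.normal(scale=1 + 0.1 * i, size=(500, 3)) for i in range(50)]
X = np.stack([invariant_features(c) for c in clouds])
y = np.array([1 + 0.1 * i for i in range(50)])          # hypothetical target parameter

X1 = np.hstack([X, np.ones((len(X), 1))])               # add a bias column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(np.round(coef, 3))
```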
[856] A DbC Inspired Neurosymbolic Layer for Trustworthy Agent Design
Claudiu Leoveanu-Condrei
Main category: cs.LG
TL;DR: A contract layer for LLMs that applies Design by Contract principles to provide semantic and type guarantees through probabilistic remediation.
Details
Motivation: LLMs produce fluent outputs but lack verifiable guarantees, requiring a systematic approach to ensure semantic and type compliance.
Method: Adapts Design by Contract and type-theoretic principles to create a contract layer that mediates LLM calls, stipulating semantic/type requirements with probabilistic remediation.
Result: The layer enables probabilistic contract satisfaction and semantic validation through programmer-specified conditions on well-typed data structures.
Conclusion: Any two agents satisfying the same contracts are functionally equivalent with respect to those contracts, establishing a foundation for verifiable LLM behavior.
Abstract: Generative models, particularly Large Language Models (LLMs), produce fluent outputs yet lack verifiable guarantees. We adapt Design by Contract (DbC) and type-theoretic principles to introduce a contract layer that mediates every LLM call. Contracts stipulate semantic and type requirements on inputs and outputs, coupled with probabilistic remediation to steer generation toward compliance. The layer exposes the dual view of LLMs as semantic parsers and probabilistic black-box components. Contract satisfaction is probabilistic and semantic validation is operationally defined through programmer-specified conditions on well-typed data structures. More broadly, this work postulates that any two agents satisfying the same contracts are \emph{functionally equivalent} with respect to those contracts.
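A minimal sketch of what such a contract layer can look like in code: the wrapper enforces a type requirement (well-formed JSON) and a programmer-specified semantic postcondition, and re-prompts as a crude form of probabilistic remediation. All names and the retry strategy are illustrative assumptions, not the paper's API.

```python
import json

class ContractViolation(Exception):
    pass

def with_contract(llm_call, postcondition, max_retries=3):
    """Wrap any prompt -> str function so its output must satisfy a typed, semantic contract."""
    def guarded(prompt):
        last_error = None
        for _ in range(max_retries):
            raw = llm_call(prompt)
            try:
                data = json.loads(raw)              # type requirement: well-formed JSON
                if postcondition(data):             # semantic requirement
                    return data
                last_error = "postcondition failed"
            except json.JSONDecodeError as e:
                last_error = str(e)
            prompt = f"{prompt}\n(Previous answer rejected: {last_error}. Return valid JSON.)"
        raise ContractViolation(last_error)
    return guarded

# Usage with a stub model that eventually complies.
answers = iter(['not json', '{"age": -3}', '{"age": 42}'])
stub_llm = lambda prompt: next(answers)
ask = with_contract(stub_llm, postcondition=lambda d: isinstance(d.get("age"), int) and d["age"] >= 0)
print(ask("How old is the user? Answer as JSON."))   # {'age': 42}
```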
[857] Distributionally Robust Optimization with Adversarial Data Contamination
Shuyao Li, Ilias Diakonikolas, Jelena Diakonikolas
Main category: cs.LG
TL;DR: This paper introduces a robust optimization method that handles both data contamination (outliers) and distributional uncertainty in Distributionally Robust Optimization (DRO) for generalized linear models.
Details
Motivation: Standard DRO can be compromised by outliers in training data, creating a need for methods that simultaneously address both distributional uncertainty and data contamination.
Method: A novel modeling framework that integrates robustness against data contamination with distributional robustness, using an efficient algorithm inspired by robust statistics to optimize Wasserstein-1 DRO objectives for generalized linear models with convex Lipschitz loss functions.
Result: The method achieves an estimation error of O(√ε) for the true DRO objective value using only contaminated data under bounded covariance assumption, where ε is the fraction of adversarially corrupted data.
Conclusion: This work provides the first rigorous guarantees with efficient computation for learning under the dual challenges of data contamination and distributional shifts.
Abstract: Distributionally Robust Optimization (DRO) provides a framework for decision-making under distributional uncertainty, yet its effectiveness can be compromised by outliers in the training data. This paper introduces a principled approach to simultaneously address both challenges. We focus on optimizing Wasserstein-1 DRO objectives for generalized linear models with convex Lipschitz loss functions, where an $\epsilon$-fraction of the training data is adversarially corrupted. Our primary contribution lies in a novel modeling framework that integrates robustness against training data contamination with robustness against distributional shifts, alongside an efficient algorithm inspired by robust statistics to solve the resulting optimization problem. We prove that our method achieves an estimation error of $O(\sqrt{\epsilon})$ for the true DRO objective value using only the contaminated data under the bounded covariance assumption. This work establishes the first rigorous guarantees, supported by efficient computation, for learning under the dual challenges of data contamination and distributional shifts.
[858] Scientific Machine Learning with Kolmogorov-Arnold Networks
Salah A. Faroughi, Farinaz Mostajeran, Amin Hamed Mashhadzadeh, Shirko Faroughi
Main category: cs.LG
TL;DR: KANs are increasingly replacing MLPs in scientific machine learning, offering better interpretability, flexibility, and the ability to capture complex nonlinear interactions and high-frequency features.
Details
Motivation: The limitations of MLPs, including poor interpretability, fixed activation functions, and difficulty capturing localized/high-frequency features, drive the shift to KANs.
Method: Review categorizes KAN-based models across three perspectives: data-driven learning, physics-informed modeling, and deep-operator learning, examining architectural design, training strategies, and application efficacy.
Result: KANs show consistent improvements over MLPs in accuracy, convergence, and spectral representation, with better ability to capture complex dynamics and learn more effectively.
Conclusion: KANs offer significant advantages but face challenges in computational efficiency, theoretical guarantees, hyperparameter tuning, and algorithm complexity. Future research should focus on improving robustness, scalability, and physical consistency.
Abstract: The field of scientific machine learning, which originally utilized multilayer perceptrons (MLPs), is increasingly adopting Kolmogorov-Arnold Networks (KANs) for data encoding. This shift is driven by the limitations of MLPs, including poor interpretability, fixed activation functions, and difficulty capturing localized or high-frequency features. KANs address these issues with enhanced interpretability and flexibility, enabling more efficient modeling of complex nonlinear interactions and effectively overcoming the constraints associated with conventional MLP architectures. This review categorizes recent progress in KAN-based models across three distinct perspectives: (i) data-driven learning, (ii) physics-informed modeling, and (iii) deep-operator learning. Each perspective is examined through the lens of architectural design, training strategies, application efficacy, and comparative evaluation against MLP-based counterparts. By benchmarking KANs against MLPs, we highlight consistent improvements in accuracy, convergence, and spectral representation, clarifying KANs’ advantages in capturing complex dynamics while learning more effectively. In addition to reviewing recent literature, this work also presents several comparative evaluations that clarify central characteristics of KAN modeling and hint at their potential implications for real-world applications. Finally, this review identifies critical challenges and open research questions in KAN development, particularly regarding computational efficiency, theoretical guarantees, hyperparameter tuning, and algorithm complexity. We also outline future research directions aimed at improving the robustness, scalability, and physical consistency of KAN-based frameworks.
[859] Tricks and Plug-ins for Gradient Boosting with Transformers
Biyi Fang, Truong Vo, Jean Utke, Diego Klabjan
Main category: cs.LG
TL;DR: BoostTransformer integrates boosting principles with transformers using subgrid token selection and importance-weighted sampling to improve efficiency and performance.
Details
Motivation: Transformer architectures require heavy computational resources and complex hyperparameter tuning, creating a need for more efficient alternatives.
Method: Augments transformers with boosting principles through subgrid token selection, importance-weighted sampling, and a least square boosting objective integrated into the transformer pipeline.
Result: Demonstrates faster convergence and higher accuracy across multiple fine-grained text classification benchmarks compared to standard transformers.
Conclusion: BoostTransformer surpasses standard transformers while minimizing architectural search overhead, offering an efficient alternative to traditional transformer models.
Abstract: Transformer architectures dominate modern NLP but often demand heavy computational resources and intricate hyperparameter tuning. To mitigate these challenges, we propose a novel framework, BoostTransformer, that augments transformers with boosting principles through subgrid token selection and importance-weighted sampling. Our method incorporates a least square boosting objective directly into the transformer pipeline, enabling more efficient training and improved performance. Across multiple fine-grained text classification benchmarks, BoostTransformer demonstrates both faster convergence and higher accuracy, surpassing standard transformers while minimizing architectural search overhead.
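The importance-weighted token subsampling idea can be sketched generically: tokens are drawn without replacement with probability proportional to an importance score. The score source and sampler below are assumptions; the least-squares boosting objective itself is not reproduced.

```python
import numpy as np

def importance_weighted_subsample(token_scores, k, seed=0):
    """Pick a sub-grid of k token positions with probability proportional to an importance score
    (a generic sketch of importance-weighted token selection, not BoostTransformer's exact scheme)."""
    p = np.asarray(token_scores, dtype=float)
    p = p / p.sum()
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(len(p), size=k, replace=False, p=p))

scores = [0.1, 2.0, 0.3, 1.5, 0.2, 0.9, 0.05, 1.1]   # e.g. residual magnitudes per token (assumed)
print(importance_weighted_subsample(scores, k=4))     # indices of the selected tokens
```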
[860] Evaluating Federated Learning for At-Risk Student Prediction: A Comparative Analysis of Model Complexity and Data Balancing
Rodrigo Tertulino, Ricardo Almeida
Main category: cs.LG
TL;DR: Proposes a Federated Learning framework for early identification of at-risk students in distance education while preserving data privacy, achieving 85% ROC AUC.
Details
Motivation: Address persistent high dropout rates in distance education while maintaining student data privacy and sovereignty.
Method: Uses Federated Learning with OULAD dataset, simulating privacy-centric scenarios with early academic performance and digital engagement patterns. Compares Logistic Regression vs Deep Neural Network and examines local data balancing.
Result: Federated model achieves strong predictive power with ROC AUC approximately 85%, demonstrating FL as a practical solution for early-warning systems.
Conclusion: Federated Learning is a practical and scalable solution for early-warning systems that inherently respects student data sovereignty while maintaining strong predictive performance.
Abstract: This study proposes and validates a Federated Learning (FL) framework to proactively identify at-risk students while preserving data privacy. Persistently high dropout rates in distance education remain a pressing institutional challenge. Using the large-scale OULAD dataset, we simulate a privacy-centric scenario where models are trained on early academic performance and digital engagement patterns. Our work investigates the practical trade-offs between model complexity (Logistic Regression vs. a Deep Neural Network) and the impact of local data balancing. The resulting federated model achieves strong predictive power (ROC AUC approximately 85%), demonstrating that FL is a practical and scalable solution for early-warning systems that inherently respects student data sovereignty.
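A minimal FedAvg-style sketch of the setup, assuming a plain logistic-regression client model and size-weighted averaging; the OULAD features, the deep-network variant, and local balancing are not modeled here.

```python
import numpy as np

def local_logreg_step(w, X, y, lr=0.1, epochs=5):
    """A few local epochs of logistic-regression gradient descent on one institution's data."""
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-X @ w))
        w = w - lr * X.T @ (p - y) / len(y)
    return w

def fed_avg(clients, dim, rounds=20):
    """Plain FedAvg: only model weights leave each client, never the raw student records."""
    w = np.zeros(dim)
    for _ in range(rounds):
        local = [local_logreg_step(w.copy(), X, y) for X, y in clients]
        sizes = np.array([len(y) for _, y in clients], dtype=float)
        w = np.average(local, axis=0, weights=sizes)   # size-weighted aggregation
    return w

# Toy data standing in for per-institution features (engagement, early grades).
rng = np.random.default_rng(1)
def make_client(n):
    X = rng.normal(size=(n, 3))
    y = (X @ np.array([1.5, -1.0, 0.5]) + rng.normal(scale=0.5, size=n) > 0).astype(float)
    return X, y

clients = [make_client(200), make_client(120), make_client(80)]
print(np.round(fed_avg(clients, dim=3), 2))
```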
[861] One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra
Neng Kai Nigel Neo, Lim Jing, Ngoui Yong Zhau Preston, Koh Xue Ting Serene, Bingquan Shen
Main category: cs.LG
TL;DR: A two-stage pipeline for molecular generation from mass spectra using MIST encoder and MolForge decoder with fingerprint probability thresholding achieves 10x improvement over previous methods.
Details
Motivation: To improve de novo molecular generation from mass spectra by enhancing the two-stage pipeline approach with better training data and substructure-focused probability thresholding.
Method: Use MIST as encoder to convert mass spectra to molecular fingerprints, then MolForge as decoder to generate molecular structures, with additional training data and fingerprint bit probability thresholding.
Result: Achieved top-1 31% and top-10 40% correct molecular structure generation from mass spectra in MassSpecGym, representing a tenfold improvement over previous state-of-the-art methods.
Conclusion: This approach establishes a strong baseline for future research in de novo molecule elucidation from mass spectra.
Abstract: A common approach to the de novo molecular generation problem from mass spectra involves a two-stage pipeline: (1) encoding mass spectra into molecular fingerprints, followed by (2) decoding these fingerprints into molecular structures. In our work, we adopt MIST (Goldman et al., 2023) as the encoder and MolForge (Ucak et al., 2023) as the decoder, leveraging additional training data to enhance performance. We also threshold the probabilities of each fingerprint bit to focus on the presence of substructures. This results in a tenfold improvement over previous state-of-the-art methods, generating top-1 31% / top-10 40% of molecular structures correctly from mass spectra in MassSpecGym (Bushuiev et al., 2024). We position this as a strong baseline for future research in de novo molecule elucidation from mass spectra.
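The fingerprint-thresholding step is simple to illustrate: predicted per-bit probabilities are binarized so that downstream decoding focuses on confidently present substructures. The threshold value below is an assumption.

```python
import numpy as np

def threshold_fingerprint(bit_probs, threshold=0.5):
    """Turn predicted per-bit probabilities into a binary fingerprint, keeping only bits
    whose substructures are confidently present (the threshold value is an assumption)."""
    return (np.asarray(bit_probs) >= threshold).astype(np.uint8)

# Toy example: an encoder's per-bit probabilities over an 8-bit fingerprint.
probs = [0.95, 0.10, 0.62, 0.48, 0.88, 0.05, 0.51, 0.33]
print(threshold_fingerprint(probs))   # -> [1 0 1 0 1 0 1 0]
```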
[862] FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design
Hao Zhang, Aining Jia, Weifeng Bu, Yushu Cai, Kai Sheng, Hao Chen, Xin He
Main category: cs.LG
TL;DR: FlexQ is a post-training INT6 quantization framework that combines algorithmic innovations with system-level optimizations to achieve efficient LLM inference while maintaining accuracy.
Details
Motivation: LLMs have high memory and computational costs that limit practical deployment. While INT4/INT8 quantization reduces costs, these methods often degrade accuracy or lack optimal efficiency. INT6 offers better trade-offs but lacks native GPU hardware support.
Method: Uses uniform 6-bit weight quantization across all layers with adaptive 8-bit activations in sensitive layers identified through layer-wise sensitivity analysis. Develops specialized GPU kernel supporting W6A6 and W6A8 representations via Binary Tensor Core equivalents.
Result: Maintains near-FP16 accuracy with perplexity increases ≤0.1 on WikiText2. Achieves 1.39× speedup over ABQ-LLM on linear layers, 1.33× end-to-end inference acceleration and 1.21× memory savings over SmoothQuant.
Conclusion: FlexQ successfully bridges the gap between INT6 algorithmic benefits and hardware limitations, providing an effective solution for efficient LLM deployment with minimal accuracy loss.
Abstract: Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization reduces these costs, they often degrade accuracy or lack optimal efficiency. INT6 quantization offers a superior trade-off between model accuracy and inference efficiency, but lacks hardware support in modern GPUs, forcing emulation via higher-precision arithmetic units that limit acceleration. In this paper, we propose FlexQ, a novel post-training INT6 quantization framework combining algorithmic innovation with system-level optimizations. FlexQ employs uniform 6-bit weight quantization across all layers, with adaptive retention of 8-bit activations in layers identified through layer-wise sensitivity analysis. To maximize hardware efficiency, we develop a specialized high-performance GPU kernel supporting matrix multiplication for W6A6 and W6A8 representations via Binary Tensor Core (BTC) equivalents, effectively bypassing the lack of native INT6 tensor cores. Evaluations on LLaMA family models show FlexQ maintains near-FP16 accuracy, with perplexity increases of no more than 0.1 on WikiText2. The proposed kernel achieves an average 1.39$\times$ speedup over ABQ-LLM on LLaMA-2-70B linear layers. End-to-end, FlexQ delivers 1.33$\times$ inference acceleration and 1.21$\times$ memory savings over SmoothQuant. Code is released at https://github.com/FlyFoxPlayer/FlexQ.
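A generic sketch of the uniform 6-bit weight quantization component: symmetric per-output-channel scales and rounding to the signed 6-bit range. FlexQ's sensitivity-based activation handling and BTC kernels are not represented here.

```python
import torch

def quantize_int6_per_channel(w):
    """Uniform symmetric 6-bit quantization of a weight matrix, one scale per output channel
    (a generic W6 scheme for illustration; FlexQ's kernels and layer selection are not shown)."""
    qmax = 2 ** 5 - 1                                   # signed 6-bit range: [-32, 31]
    scale = w.abs().amax(dim=1, keepdim=True) / qmax    # per-row (output channel) scale
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(4, 8)
q, scale = quantize_int6_per_channel(w)
print((w - dequantize(q, scale)).abs().max())   # small quantization error
```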
[863] RL Fine-Tuning Heals OOD Forgetting in SFT
Hangzhan Jin, Sitao Luan, Sicheng Lyu, Guillaume Rabusseau, Reihaneh Rabbany, Doina Precup, Mohammad Hamdaqa
Main category: cs.LG
TL;DR: The paper challenges the oversimplified view that ‘SFT memorizes, RL generalizes’ and reveals that RL actually restores OOD reasoning ability lost during SFT, with this recovery depending on proper SFT training duration and being driven by rotation of singular vectors rather than singular value changes.
Details
Motivation: To understand the evolution and mechanisms behind the synergy of SFT and RL in two-stage fine-tuning of LLMs, as the common claim about their roles is oversimplified and the underlying processes are not well understood.
Method: Used SVD analysis on parameter matrices, manually edited them, and observed impacts on model performance to uncover mechanisms behind OOD forgetting during SFT and restoration during RL.
Result: Found that OOD performance peaks early in SFT then declines (forgetting), RL restores lost OOD ability rather than generating new capability, recovery has boundaries based on SFT duration, and the key mechanism is rotation of singular vectors rather than changes in singular values.
Conclusion: The study re-identifies SFT and RL roles in two-stage fine-tuning and discovers rotation of singular vectors as the key mechanism driving OOD forgetting and restoration, challenging conventional wisdom about how these stages work.
Abstract: The two-stage fine-tuning paradigm of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has empirically shown better reasoning performance than one-stage SFT for the post-training of Large Language Models (LLMs). However, the evolution and mechanism behind the synergy of SFT and RL are still under-explored and inconclusive. In our study, we find the well-known claim “SFT memorizes, RL generalizes” is over-simplified, and discover that: (1) OOD performance peaks at the early stage of SFT and then declines (OOD forgetting), and the best SFT checkpoint cannot be captured by training/test loss; (2) the subsequent RL stage does not generate fundamentally better OOD capability, instead it plays an OOD restoration role, recovering the lost reasoning ability during SFT; (3) the recovery ability has boundaries, i.e., if SFT trains for too short or too long, RL cannot recover the lost OOD ability; (4) to uncover the underlying mechanisms behind the forgetting and restoration process, we employ SVD analysis on parameter matrices, manually edit them, and observe their impacts on model performance. Unlike the common belief that the shift of model capacity mainly results from the changes of singular values, we find that they are actually quite stable throughout fine-tuning. Instead, the OOD behavior strongly correlates with the rotation of singular vectors. Our findings re-identify the roles of SFT and RL in the two-stage fine-tuning and discover the rotation of singular vectors as the key mechanism. Code is available at https://github.com/xiaodanguoguo/RL_Heals_SFT
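The singular-vector-rotation analysis can be approximated with a few lines of linear algebra: compare the top singular subspaces of a weight matrix at two checkpoints via principal angles. The random-rotation example below is a stand-in, not the paper's models or measurements.

```python
import numpy as np

def principal_angles(W_a, W_b, k=8):
    """Principal angles (degrees) between the top-k left singular subspaces of two checkpoints
    of the same weight matrix; larger angles mean more rotation of the singular vectors."""
    Ua, _, _ = np.linalg.svd(W_a, full_matrices=False)
    Ub, _, _ = np.linalg.svd(W_b, full_matrices=False)
    s = np.linalg.svd(Ua[:, :k].T @ Ub[:, :k], compute_uv=False)
    return np.degrees(np.arccos(np.clip(s, -1.0, 1.0)))

rng = np.random.default_rng(0)
W_sft = rng.normal(size=(64, 64))
R, _ = np.linalg.qr(rng.normal(size=(64, 64)))   # a random rotation standing in for further updates
W_rl = R @ W_sft                                 # singular values unchanged, singular vectors rotated
print(np.round(principal_angles(W_sft, W_rl), 1))
```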
[864] Cold-Start Active Preference Learning in Socio-Economic Domains
Mojtaba Fayaz-Bakhsh, Danial Ataee, MohammadAmin Fazli
Main category: cs.LG
TL;DR: This paper addresses the cold-start problem in active preference learning by proposing a PCA-based self-supervised approach to generate initial pseudo-labels, followed by active learning refinement.
Details
Motivation: Active preference learning suffers from performance decline when no initial labeled data is available (cold-start problem), which remains unexplored compared to other domains like vision and text.
Method: Uses Principal Component Analysis (PCA) for self-supervised initialization to generate pseudo-labels from data’s intrinsic structure, then refines through active learning with simulated noisy oracle queries.
Result: Experiments on socio-economic datasets (financial credibility, career success, socio-economic status) show PCA-driven approach consistently outperforms standard active learning strategies without prior information.
Conclusion: The method provides a computationally efficient and straightforward solution that effectively addresses the cold-start problem in active preference learning.
Abstract: Active preference learning offers an efficient approach to modeling preferences, but it is hindered by the cold-start problem, which leads to a marked decline in performance when no initial labeled data are available. While cold-start solutions have been proposed for domains such as vision and text, the cold-start problem in active preference learning remains largely unexplored, underscoring the need for practical, effective methods. Drawing inspiration from established practices in social and economic research, the proposed method initiates learning with a self-supervised phase that employs Principal Component Analysis (PCA) to generate initial pseudo-labels. This process produces a \say{warmed-up} model based solely on the data’s intrinsic structure, without requiring expert input. The model is then refined through an active learning loop that strategically queries a simulated noisy oracle for labels. Experiments conducted on various socio-economic datasets, including those related to financial credibility, career success rate, and socio-economic status, consistently show that the PCA-driven approach outperforms standard active learning strategies that start without prior information. This work thus provides a computationally efficient and straightforward solution that effectively addresses the cold-start problem.
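A small sketch of the PCA warm-up phase: items are scored by their projection on the first principal component, and the induced ordering supplies pseudo-preference labels for item pairs before any oracle query. The exact pair construction here is an assumption, not the paper's procedure.

```python
import numpy as np

def pca_pseudo_preferences(X, n_pairs=100, seed=0):
    """Cold-start warm-up: score items by their first-principal-component projection and use
    the resulting ordering as pseudo-preference labels for random item pairs."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[0]                                   # first-PC score per item
    rng = np.random.default_rng(seed)
    pairs = rng.integers(0, len(X), size=(n_pairs, 2))
    labels = (scores[pairs[:, 0]] > scores[pairs[:, 1]]).astype(int)  # 1 if the first item is "preferred"
    return pairs, labels

X = np.random.default_rng(1).normal(size=(50, 6))         # e.g. socio-economic indicators (toy data)
pairs, labels = pca_pseudo_preferences(X)
print(pairs[:3], labels[:3])
```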
[865] AI-Driven Detection and Analysis of Handwriting on Seized Ivory: A Tool to Uncover Criminal Networks in the Illicit Wildlife Trade
Will Fein, Ryan J. Horwitz, John E. Brown III, Amit Misra, Felipe Oviedo, Kevin White, Juan M. Lavista Ferres, Samuel K. Wasser
Main category: cs.LG
TL;DR: AI-driven analysis of handwritten markings on seized elephant tusks reveals forensic connections between ivory shipments, providing a scalable, low-cost method to disrupt wildlife trafficking networks.
Details
Motivation: The transnational ivory trade drives elephant decline, but existing forensic methods like DNA analysis are expensive and sometimes impossible. Handwritten markings on tusks are easy to photograph but rarely analyzed, presenting an untapped forensic opportunity.
Method: Developed an AI pipeline that uses object detection models to extract over 17,000 individual markings from 6,085 photographs of seized tusks (2014-2019), then labels and describes them using state-of-the-art AI tools to identify recurring signature markings.
Result: Identified 184 recurring signature markings connecting tusks, with 20 markings appearing in multiple seizures, establishing forensic links between shipments through traffickers involved in both operations.
Conclusion: AI-driven handwriting analysis complements existing investigative techniques, fills gaps where other data is unavailable, and demonstrates transformative potential for wildlife forensics and disrupting organized wildlife crime.
Abstract: The transnational ivory trade continues to drive the decline of elephant populations across Africa, and trafficking networks remain difficult to disrupt. Tusks seized by law enforcement officials carry forensic information on the traffickers responsible for their export, including DNA evidence and handwritten markings made by traffickers. For 20 years, analyses of tusk DNA have identified where elephants were poached and established connections among shipments of ivory. While the links established using genetic evidence are extremely conclusive, genetic data is expensive and sometimes impossible to obtain. But though handwritten markings are easy to photograph, they are rarely documented or analyzed. Here, we present an AI-driven pipeline for extracting and analyzing handwritten markings on seized elephant tusks, offering a novel, scalable, and low-cost source of forensic evidence. Having collected 6,085 photographs from eight large seizures of ivory over a 6-year period (2014-2019), we used an object detection model to extract over 17,000 individual markings, which were then labeled and described using state-of-the-art AI tools. We identified 184 recurring “signature markings” that connect the tusks on which they appear. 20 signature markings were observed in multiple seizures, establishing forensic links between these seizures through traffickers involved in both shipments. This work complements other investigative techniques by filling in gaps where other data sources are unavailable. The study demonstrates the transformative potential of AI in wildlife forensics and highlights practical steps for integrating handwriting analysis into efforts to disrupt organized wildlife crime.
[866] A Generalized Bisimulation Metric of State Similarity between Markov Decision Processes: From Theoretical Propositions to Applications
Zhenyu Tao, Wei Xu, Xiaohu You
Main category: cs.LG
TL;DR: The paper introduces a generalized bisimulation metric (GBSM) for pairs of MDPs with rigorous mathematical properties, enabling tighter theoretical bounds for policy transfer, state aggregation, and sampling-based estimation compared to standard BSM.
Details
Motivation: While the bisimulation metric (BSM) is effective for single-MDP analysis, its application to multiple-MDP scenarios like policy transfer remains challenging due to a lack of rigorous mathematical analysis for inter-MDP settings.
Method: Formally establish a generalized bisimulation metric (GBSM) between pairs of MDPs, proving three fundamental properties: symmetry, inter-MDP triangle inequality, and distance bound on identical state spaces.
Result: GBSM provides explicit bounds for policy transfer, state aggregation, and sampling-based estimation that are strictly tighter than standard BSM bounds, with closed-form sample complexity for estimation improving upon existing asymptotic results.
Conclusion: GBSM enables rigorous theoretical analysis in multi-MDP scenarios with validated effectiveness through numerical results, advancing the application of bisimulation metrics beyond single MDP settings.
Abstract: The bisimulation metric (BSM) is a powerful tool for computing state similarities within a Markov decision process (MDP), revealing that states closer in BSM have more similar optimal value functions. While BSM has been successfully utilized in reinforcement learning (RL) for tasks like state representation learning and policy exploration, its application to multiple-MDP scenarios, such as policy transfer, remains challenging. Prior work has attempted to generalize BSM to pairs of MDPs, but a lack of rigorous analysis of its mathematical properties has limited further theoretical progress. In this work, we formally establish a generalized bisimulation metric (GBSM) between pairs of MDPs, which is rigorously proven with the three fundamental properties: GBSM symmetry, inter-MDP triangle inequality, and the distance bound on identical state spaces. Leveraging these properties, we theoretically analyse policy transfer, state aggregation, and sampling-based estimation in MDPs, obtaining explicit bounds that are strictly tighter than those derived from the standard BSM. Additionally, GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.
[867] Let’s Grow an Unbiased Community: Guiding the Fairness of Graphs via New Links
Jiahua Lu, Huaxiao Liu, Shuotong Bai, Junjie Xu, Renqiang Luo, Enyan Dai
Main category: cs.LG
TL;DR: FairGuide is a framework that enhances fairness in graph neural networks by introducing new links to guide biased graph structures toward unbiased ones, using differentiable community detection and meta-gradients.
Details
Motivation: Graph neural networks face fairness challenges due to biases in graph structures, and existing biased structures need guidance toward unbiased ones through new links to foster fair communities.
Method: Introduces differentiable community detection as a pseudo downstream task and uses meta-gradients from fairness-guidance objective to identify new links that enhance structural fairness.
Result: Extensive experiments show effectiveness and generalizability across various graph-based fairness tasks.
Conclusion: FairGuide successfully enhances structural fairness and promotes fairness generalization in downstream applications through strategic link addition.
Abstract: Graph Neural Networks (GNNs) have achieved remarkable success across diverse applications. However, due to the biases in the graph structures, graph neural networks face significant challenges in fairness. Although the original user graph structure is generally biased, it is promising to guide these existing structures toward unbiased ones by introducing new links. The fairness guidance via new links could foster unbiased communities, thereby enhancing fairness in downstream applications. To address this issue, we propose a novel framework named FairGuide. Specifically, to ensure fairness in downstream tasks trained on fairness-guided graphs, we introduce a differentiable community detection task as a pseudo downstream task. Our theoretical analysis further demonstrates that optimizing fairness within this pseudo task effectively enhances structural fairness, promoting fairness generalization across diverse downstream applications. Moreover, FairGuide employs an effective strategy which leverages meta-gradients derived from the fairness-guidance objective to identify new links that significantly enhance structural fairness. Extensive experimental results demonstrate the effectiveness and generalizability of our proposed method across a variety of graph-based fairness tasks.
[868] Disentangled Lottery Tickets: Identifying and Assembling Core and Specialist Subnetworks
Sadman Mohammad Nasif, Md Abrar Jahin, M. F. Mridha
Main category: cs.LG
TL;DR: The paper proposes the Disentangled Lottery Ticket (DiLT) Hypothesis, which identifies a universal “core” subnetwork and specialized “specialist” subnetworks in neural networks, showing that non-consensus weights are functionally important and enable modular assembly.
Details
Motivation: To challenge the assumption in COLT that non-overlapping weights are unimportant, and to demonstrate that these weights capture specialized, task-specific features that play critical functional roles.
Method: Developed a framework using Gromov-Wasserstein (GW) distance to quantify functional similarity between layer representations and spectral clustering to identify modular structures, applied to ResNet and Vision Transformer architectures on ImageNet and fine-grained datasets.
Result: The “core” ticket provides superior transfer learning performance, “specialist” tickets retain domain-specific features enabling modular assembly, and the full re-assembled “union” ticket outperforms COLT.
Conclusion: This work reframes pruning as a process for discovering modular, disentangled subnetworks rather than merely compressing models, highlighting the functional importance of non-consensus weights.
Abstract: The Lottery Ticket Hypothesis (LTH) suggests that within large neural networks, there exist sparse, trainable “winning tickets” capable of matching the performance of the full model, but identifying them through Iterative Magnitude Pruning (IMP) is computationally expensive. Recent work introduced COLT, an accelerator that discovers a “consensus” subnetwork by intersecting masks from models trained on disjoint data partitions; however, this approach discards all non-overlapping weights, assuming they are unimportant. This paper challenges that assumption and proposes the Disentangled Lottery Ticket (DiLT) Hypothesis, which posits that the intersection mask represents a universal, task-agnostic “core” subnetwork, while the non-overlapping difference masks capture specialized, task-specific “specialist” subnetworks. A framework is developed to identify and analyze these components using the Gromov-Wasserstein (GW) distance to quantify functional similarity between layer representations and reveal modular structures through spectral clustering. Experiments on ImageNet and fine-grained datasets such as Stanford Cars, using ResNet and Vision Transformer architectures, show that the “core” ticket provides superior transfer learning performance, the “specialist” tickets retain domain-specific features enabling modular assembly, and the full re-assembled “union” ticket outperforms COLT - demonstrating that non-consensus weights play a critical functional role. This work reframes pruning as a process for discovering modular, disentangled subnetworks rather than merely compressing models.
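The core/specialist decomposition reduces to simple mask algebra once per-partition pruning masks are available, as the sketch below shows; the mask values are toy inputs, not real pruning outputs.

```python
import torch

def core_and_specialist_masks(mask_a, mask_b):
    """Given binary pruning masks from models trained on disjoint data partitions, the intersection
    gives the shared 'core' ticket and the differences give the 'specialist' tickets (sketched here)."""
    core = mask_a & mask_b
    specialist_a = mask_a & ~mask_b
    specialist_b = mask_b & ~mask_a
    union = mask_a | mask_b
    return core, specialist_a, specialist_b, union

mask_a = torch.tensor([1, 1, 0, 1, 0], dtype=torch.bool)
mask_b = torch.tensor([1, 0, 0, 1, 1], dtype=torch.bool)
core, spec_a, spec_b, union = core_and_specialist_masks(mask_a, mask_b)
print(core.int().tolist(), spec_a.int().tolist(), spec_b.int().tolist(), union.int().tolist())
```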
[869] PrunedLoRA: Robust Gradient-Based structured pruning for Low-rank Adaptation in Fine-tuning
Xin Yu, Cong Xie, Ziyu Zhao, Tiantian Fan, Lingzhou Xue, Zhi Zhang
Main category: cs.LG
TL;DR: PrunedLoRA is a framework that uses structured pruning to create more expressive low-rank adapters from over-parameterized spaces, outperforming standard LoRA and its variants across multiple tasks.
Details
Motivation: Standard LoRA's representational capacity often lags behind full fine-tuning, and there's a need to obtain more expressive low-rank adapters from over-parameterized spaces.
Method: PrunedLoRA dynamically prunes less important components during fine-tuning using gradient-based structured pruning, preventing their reactivation and enabling flexible rank allocation. It minimizes pruning error for overall loss with fine-grained pruning and recovery updates.
Result: Empirically outperforms LoRA and its variants across mathematical reasoning, code generation, and natural language understanding tasks, and shows advantages over existing structured pruning methods across diverse sparsity levels.
Conclusion: PrunedLoRA provides a more robust and effective approach for parameter-efficient fine-tuning through dynamic structured pruning, with theoretical guarantees on pruning robustness.
Abstract: Low-rank adaptation (LoRA) has become a widely used paradigm for parameter-efficient fine-tuning of large language models, yet its representational capacity often lags behind full fine-tuning. Within the context of LoRA, a key open question is how to obtain expressive low-rank adapters from over-parameterized spaces. We propose \textit{PrunedLoRA}, a new framework that leverages structured pruning to obtain highly representative low-rank adapters from an over-parameterized initialization. Unlike prior approaches that impose a fixed low-rank budget, PrunedLoRA dynamically prunes less important components during fine-tuning and prevents their reactivation, enabling flexible and adaptive rank allocation. For structured pruning, by minimizing the pruning error for overall loss, we provide fine-grained pruning and recovery updates in a gradient-based pruning strategy with grounded interpretation. We provide the first theoretical analysis of the robustness of structured pruning and provably show that under the impact of weight perturbation, gradient-based pruning is more robust than activation-based pruning with respect to overall loss. Empirically, PrunedLoRA consistently outperforms LoRA and its variants across supervised fine-tuning tasks in mathematical reasoning, code generation, and natural language understanding, and it also demonstrates advantages over existing structured pruning methods across diverse sparsity levels.
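A sketch of gradient-based importance scoring for individual LoRA rank components, using a first-order Taylor estimate of the loss change from removing a component; the paper's exact pruning objective and recovery updates may differ from this generic criterion.

```python
import torch

def rank_importance(A, B, grad_A, grad_B):
    """First-order (Taylor) importance of each rank component r of a LoRA adapter B @ A:
    the estimated loss change from zeroing row A[r, :] and column B[:, r]."""
    return ((grad_A * A).sum(dim=1) + (grad_B * B).sum(dim=0)).abs()

def prune_ranks(A, B, scores, keep):
    """Keep only the highest-scoring rank components; pruned ranks are never reactivated."""
    idx = torch.topk(scores, keep).indices
    return A[idx, :], B[:, idx]

r, d_in, d_out = 16, 32, 32
A, B = torch.randn(r, d_in), torch.randn(d_out, r)
gA, gB = torch.randn_like(A), torch.randn_like(B)     # stand-ins for accumulated gradients
A_small, B_small = prune_ranks(A, B, rank_importance(A, B, gA, gB), keep=4)
print(A_small.shape, B_small.shape)   # torch.Size([4, 32]) torch.Size([32, 4])
```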
[870] Reconciling Communication Compression and Byzantine-Robustness in Distributed Learning
Diksha Gupta, Antonio Honsell, Chuan Xu, Nirupam Gupta, Giovanni Neglia
Main category: cs.LG
TL;DR: RoSDHB is a new distributed learning algorithm that combines Polyak momentum with coordinated compression to address both Byzantine faults and high communication costs simultaneously, outperforming the state-of-the-art Byz-DASHA-PAGE method.
Details
Motivation: Distributed learning faces challenges from Byzantine faults and high communication costs, but prior work has shown that naively combining compression with Byzantine-robust aggregation weakens resilience. The interplay between these two challenges has received limited attention.
Method: RoSDHB integrates classical Polyak momentum with a coordinated compression strategy, providing a more efficient alternative to momentum-based variance reduction schemes used in existing methods.
Result: Theoretically, RoSDHB matches Byz-DASHA-PAGE’s convergence guarantees under standard gradient dissimilarity model while using milder assumptions and requiring less memory and communication per client. Empirically, it demonstrates stronger robustness with substantial communication savings.
Conclusion: RoSDHB effectively addresses both Byzantine robustness and communication efficiency in distributed learning, offering improved performance over existing state-of-the-art methods with better resource efficiency.
Abstract: Distributed learning enables scalable model training over decentralized data, but remains hindered by Byzantine faults and high communication costs. While both challenges have been studied extensively in isolation, their interplay has received limited attention. Prior work has shown that naively combining communication compression with Byzantine-robust aggregation can severely weaken resilience to faulty nodes. The current state-of-the-art, Byz-DASHA-PAGE, leverages a momentum-based variance reduction scheme to counteract the negative effect of compression noise on Byzantine robustness. In this work, we introduce RoSDHB, a new algorithm that integrates classical Polyak momentum with a coordinated compression strategy. Theoretically, RoSDHB matches the convergence guarantees of Byz-DASHA-PAGE under the standard $(G,B)$-gradient dissimilarity model, while relying on milder assumptions and requiring less memory and communication per client. Empirically, RoSDHB demonstrates stronger robustness while achieving substantial communication savings compared to Byz-DASHA-PAGE.
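The two ingredients named above, Polyak momentum at the clients and coordinated (shared-support) compression fed into a robust aggregator, can be sketched as follows. The coordinate-wise median, mask size, and toy objective are illustrative assumptions rather than RoSDHB's actual algorithm or analysis.

```python
import numpy as np

def coord_median(vectors):
    """Coordinate-wise median: a simple Byzantine-robust aggregator (illustrative choice)."""
    return np.median(np.stack(vectors), axis=0)

def shared_mask_compress(v, mask):
    """Coordinated sparsification: all workers keep the SAME coordinates this round,
    so compression noise stays aligned across clients."""
    out = np.zeros_like(v)
    out[mask] = v[mask]
    return out

rng = np.random.default_rng(0)
dim, n_workers, beta, lr = 10, 5, 0.9, 0.5
momenta = [np.zeros(dim) for _ in range(n_workers)]
x = np.zeros(dim)

for _ in range(200):
    mask = rng.choice(dim, size=3, replace=False)                 # shared coordinates this round
    msgs = []
    for i in range(n_workers):
        g = x - np.ones(dim) + rng.normal(scale=0.1, size=dim)    # noisy gradient of 0.5*||x - 1||^2
        momenta[i] = beta * momenta[i] + (1 - beta) * g           # Polyak momentum
        msgs.append(shared_mask_compress(momenta[i], mask))
    x = x - lr * coord_median(msgs)                               # robust aggregation + server step
print(np.round(x, 2))                                             # drifts toward the optimum at all-ones
```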
[871] Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models
Shutong Wu, Jiawei Zhang
Main category: cs.LG
TL;DR: FreeDave is a novel fast decoding algorithm for Diffusion Large Language Models that enables lossless parallel decoding without model modifications, achieving up to 3.78× inference speedup without performance degradation.
Details
Motivation: DLLMs show advantages in context understanding but suffer from slow inference due to requiring many decoding steps. Existing parallel decoding algorithms cause performance degradation.
Method: Proposes FreeDave algorithm with parallel-decoded candidate generation and verification, using minimal model forward calls to reproduce static decoding results.
Result: Achieves up to 3.78× inference throughput improvement on math reasoning and code generation benchmarks without performance loss.
Conclusion: FreeDave enables efficient parallel decoding for DLLMs while maintaining generation quality, overcoming the speed-quality trade-off in existing methods.
Abstract: Diffusion Large Language Models (DLLMs) have emerged as a new paradigm of language modeling beyond autoregressive next-token prediction. Thanks to their bidirectional attention mechanism, DLLMs are more capable of capturing the connection of context, and thus show unique advantages in challenges like the famous “reversal curse” or learning under data-constrained scenarios. In addition, taking advantage of their inherent modeling foundations, DLLMs have the great potential of efficient inference with parallel decoding algorithms, which enable multi-token prediction per step. However, the high generation quality often requires the number of decoding steps equal to the sequence length, which performs a one-token-per-step decoding, and existing parallel decoding algorithms, which yield suboptimal decoding paths, bring inference speedup at the cost of non-negligible performance degradation. To overcome this challenge, we introduce Free Draft-and-Verification (FreeDave), a novel fast decoding algorithm tailored for DLLMs that achieves lossless parallel decoding without any model modification or extra modules. Specifically, we propose an algorithm of parallel-decoded candidate generation and verification, which is theoretically guaranteed to use the fewest model forward calls to reproduce the same sequence generated by static decoding when enough computation and memory budget is provided. By extensive evaluations on math reasoning and code generation benchmarks across different DLLMs, FreeDave is proven to boost the inference throughput up to $3.78\times$ without performance degradation.
[872] Khiops: An End-to-End, Frugal AutoML and XAI Machine Learning Solution for Large, Multi-Table Databases
Marc Boullé, Nicolas Voisine, Bruno Guerraz, Carine Hue, Felipe Olmos, Vladimir Popescu, Stéphane Gouache, Stéphane Bouget, Alexis Bondu, Luc Aurelien Gauthier, Yassine Nair Benrekia, Fabrice Clérot, Vincent Lemaire
Main category: cs.LG
TL;DR: Khiops is an open-source ML tool for large multi-table databases using Bayesian methods for classification, regression, variable selection, and co-clustering with efficient handling of massive datasets.
Details
Motivation: To provide an efficient machine learning solution for mining large multi-table databases with millions of records and thousands of variables, addressing the need for scalable Bayesian approaches in database analytics.
Method: Uses a naive Bayesian classifier with variable selection and weight learning, employs discretization for numerical data and value clustering for categorical data, and automatically constructs aggregates for multi-table propositionalization.
Result: Successfully handles databases with millions of individuals, tens of thousands of variables, and hundreds of millions of records in secondary tables, with academic validation through 20+ publications.
Conclusion: Khiops provides a robust, scalable Bayesian framework for large-scale database mining with comprehensive variable importance measures and multi-table support, available as both Python library and user interface.
Abstract: Khiops is an open source machine learning tool designed for mining large multi-table databases. Khiops is based on a unique Bayesian approach that has attracted academic interest with more than 20 publications on topics such as variable selection, classification, decision trees and co-clustering. It provides a predictive measure of variable importance using discretisation models for numerical data and value clustering for categorical data. The proposed classification/regression model is a naive Bayesian classifier incorporating variable selection and weight learning. In the case of multi-table databases, it provides propositionalisation by automatically constructing aggregates. Khiops is adapted to the analysis of large databases with millions of individuals, tens of thousands of variables and hundreds of millions of records in secondary tables. It is available on many environments, both from a Python library and via a user interface.
[873] NeuroDeX: Unlocking Diverse Support in Decompiling Deep Neural Network Executables
Yilin Li, Guozhu Meng, Mingyang Sun, Yanzhong Wang, Kun Sun, Hailong Chang, Yuekang Li
Main category: cs.LG
TL;DR: NeuroDeX is a novel decompiler for DNN executables that uses LLMs and dynamic analysis to handle compilation optimizations and quantized models, achieving high accuracy in model recovery.
Details
Motivation: On-device deep learning models face reverse engineering threats, and existing decompilers struggle with compilation optimizations and quantized compiled models.Method: Leverages LLMs’ semantic understanding capabilities combined with dynamic analysis for operator type recognition, attribute recovery, and model reconstruction.
Result: Successfully decompiles 96 DNN executables across 12 models, achieving nearly identical recovery for non-quantized models and 72% top-1 accuracy for quantized executables.
Conclusion: NeuroDeX provides a more comprehensive and effective solution for DNN executable decompilation compared to previous approaches.
Abstract: On-device deep learning models have extensive real-world demands. Deep learning compilers efficiently compile models into executables for deployment on edge devices, but these executables may face the threat of reverse engineering. Previous studies have attempted to decompile DNN executables, but they face challenges in handling compilation optimizations and analyzing quantized compiled models. In this paper, we present NeuroDeX to unlock diverse support in decompiling DNN executables. NeuroDeX leverages the semantic understanding capabilities of LLMs along with dynamic analysis to accurately and efficiently perform operator type recognition, operator attribute recovery and model reconstruction. NeuroDeX can recover DNN executables into high-level models across compilation optimizations, different architectures and quantized compiled models. We conduct experiments on 96 DNN executables across 12 common DNN models. Extensive experimental results demonstrate that NeuroDeX can decompile non-quantized executables into nearly identical high-level models. NeuroDeX can recover functionally similar high-level models for quantized executables, achieving an average top-1 accuracy of 72%. NeuroDeX offers a more comprehensive and effective solution compared to previous DNN executable decompilers.
[874] What Makes Looped Transformers Perform Better Than Non-Recursive Ones (Provably)
Zixuan Gong, Jiaye Teng, Yong Liu
Main category: cs.LG
TL;DR: Looped transformers (Looped-Attn) outperform standard transformers on complex reasoning tasks due to their loss landscape geometry, which favors V-shaped valleys enabling better convergence and complex pattern learning. A new training framework SHIFT is proposed to accelerate Looped-Attn training.
Details
Motivation: The theoretical basis for why looped transformers outperform standard transformers on complex reasoning tasks remains underexplored, despite empirical evidence of their superior performance.Method: Analyze loss landscape geometry using the River-Valley model, distinguishing U-shaped (flat) and V-shaped (steep) valleys. Propose SHIFT, a staged hierarchical framework for progressive training of Looped-Attn.
Result: Looped-Attn induces a landscape-level inductive bias towards River-V-Valley, guaranteeing better loss convergence through valley hopping and encouraging learning of complex patterns compared to Single-Attn’s River-U-Valley.
Conclusion: The recursive architecture of looped transformers creates a beneficial loss landscape geometry that explains their performance advantage, and the SHIFT framework enables accelerated training while maintaining comparable performance.
Abstract: While looped transformers (termed Looped-Attn) often outperform standard transformers (termed Single-Attn) on complex reasoning tasks, the theoretical basis for this advantage remains underexplored. In this paper, we explain this phenomenon through the lens of loss landscape geometry, inspired by empirical observations of their distinct dynamics at both the sample and Hessian levels. To formalize this, we extend the River-Valley landscape model by distinguishing between U-shaped valleys (flat) and V-shaped valleys (steep). Based on empirical observations, we conjecture that the recursive architecture of Looped-Attn induces a landscape-level inductive bias towards River-V-Valley. Theoretical derivations based on this inductive bias guarantee better loss convergence along the river due to valley hopping, and further encourage the learning of complex patterns, compared to the River-U-Valley induced by Single-Attn. Building on this insight, we propose SHIFT (Staged HIerarchical Framework for Progressive Training), a staged training framework that accelerates the training process of Looped-Attn while achieving comparable performance.
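Architecturally, the difference is weight-tied recursion: Looped-Attn re-applies one block several times, whereas Single-Attn stacks distinct blocks. A minimal PyTorch sketch of the two (layer sizes and loop counts are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class LoopedAttn(nn.Module):
    """One shared transformer block applied `loops` times (weight-tied recursion)."""
    def __init__(self, d_model=64, n_heads=4, loops=6):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.loops = loops

    def forward(self, x):
        for _ in range(self.loops):  # same parameters reused on every pass
            x = self.block(x)
        return x

class SingleAttn(nn.Module):
    """A standard stack of `depth` distinct blocks, for comparison."""
    def __init__(self, d_model=64, n_heads=4, depth=6):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True) for _ in range(depth)
        )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

x = torch.randn(2, 10, 64)
print(LoopedAttn()(x).shape, SingleAttn()(x).shape)  # both torch.Size([2, 10, 64])
```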
[875] Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules
Binghui Li, Fengling Chen, Zixun Huang, Lean Wang, Lei Wu
Main category: cs.LG
TL;DR: Establishes a Functional Scaling Law (FSL) that captures full loss trajectories under arbitrary learning rate schedules, using an intrinsic-time viewpoint to explain training dynamics beyond final-step loss.
Details
Motivation: Existing scaling laws focus only on final-step loss, leaving gaps in understanding how learning rate schedules shape entire loss dynamics during training.Method: Analyzes SGD on power-law kernel regression model using intrinsic-time viewpoint, derives FSL that captures loss trajectories under arbitrary learning rate schedules through convolutional functional.
Result: Derived explicit scaling relations for constant, exponential decay, and warmup-stable-decay schedules; explains empirical phenomena like higher-capacity model efficiency and WSD superiority; validated on LLMs from 0.1B to 1B parameters.
Conclusion: FSL provides a practical surrogate model for fitting and predicting loss trajectories in large-scale pre-training, offering insights into how learning rate schedules fundamentally shape training dynamics.
Abstract: Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire loss dynamics obey similar laws and, crucially, how the learning rate schedule (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model. The key insight is a novel intrinsic-time viewpoint, which captures the training progress more faithfully than iteration count. We then establish a Functional Scaling Law (FSL) that captures the full loss trajectory under arbitrary LRSs, with the schedule’s influence entering through a simple convolutional functional. We further instantiate the theory for three representative LRSs – constant, exponential decay, and warmup-stable-decay (WSD) – and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena: (i) higher-capacity models are more data- and compute-efficient; (ii) learning-rate decay improves training efficiency; and (iii) WSD-type schedules outperform pure decay. Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.
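To make the intrinsic-time idea concrete: training progress is tracked by the accumulated learning rate rather than the iteration count, and warmup-stable-decay (WSD) is one of the three schedules the theory is instantiated for. A small sketch, assuming intrinsic time is simply the running sum of step sizes (the paper's exact definition may include additional factors):

```python
import numpy as np

def wsd_schedule(steps, peak=1e-3, warmup_frac=0.05, decay_frac=0.2):
    """Warmup-Stable-Decay: linear warmup, constant plateau, then linear decay to zero."""
    t = np.arange(steps) / steps
    return np.where(
        t < warmup_frac, peak * t / warmup_frac,
        np.where(t < 1 - decay_frac, peak, peak * (1 - t) / decay_frac),
    )

lr = wsd_schedule(10_000)
intrinsic_time = np.cumsum(lr)  # progress measured in accumulated learning rate, not steps
print(lr[:3], intrinsic_time[-1])
```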
[876] Continual Learning with Query-Only Attention
Gautham Bekal, Ashish Pujari, Scott David Kelly
Main category: cs.LG
TL;DR: Query-only attention mechanism for continual learning that discards keys and values while preserving transformer inductive bias, mitigating catastrophic forgetting and loss of plasticity.
Details
Motivation: Continual learning faces challenges from distributional shift across tasks, leading to catastrophic forgetting and loss of plasticity in neural networks.Method: Proposed query-only attention mechanism that simplifies transformers by removing keys and values, while maintaining core inductive bias. Analyzed through Hessian spectrum analysis to study curvature rank.
Result: Query-only attention significantly outperforms baselines like selective re-initialization in mitigating both loss of plasticity and catastrophic forgetting in continual learning scenarios.
Conclusion: Full attention may not be essential for meta-learning benefits in continual learning; query-based models help preserve plasticity through maintained curvature rank across tasks.
Abstract: Continual learning involves learning from a stream of data without repetition of data points, a scenario that is inherently complex due to distributional shift across tasks. We propose a query-only attention mechanism that discards keys and values, yet preserves the core inductive bias of transformer architectures. In continual learning scenarios, this simplified mechanism significantly mitigates both loss of plasticity and catastrophic forgetting, outperforming baselines such as selective re-initialization. We establish a conceptual link between query-only attention, full transformer attention, and model-agnostic meta-learning, framing them as instances of meta-learning. We further provide intuition for why query-based models and attention networks help preserve plasticity in continual settings. Finally, through preliminary Hessian spectrum analysis, we observe that models maintaining higher curvature rank across tasks tend to retain plasticity. Our findings suggest that full attention may not be essential for capturing the benefits of meta-learning in continual learning.
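The abstract does not spell out the exact parameterization, so the following is a hedged sketch of one plausible reading of "query-only" attention: the key and value projections are dropped, scores are formed from the query projection alone, and the raw inputs serve as values. The paper's actual formulation may differ.

```python
import numpy as np

def query_only_attention(X, Wq):
    """Hypothetical query-only attention: no key/value weights; scores come from the
    query projection against itself, and the unprojected inputs act as values."""
    Q = X @ Wq
    scores = Q @ Q.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ X

X = np.random.default_rng(0).normal(size=(5, 8))
Wq = np.random.default_rng(1).normal(size=(8, 8))
print(query_only_attention(X, Wq).shape)  # (5, 8)
```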
[877] Riemannian Consistency Model
Chaoran Cheng, Yusong Wang, Yuxin Chen, Xiangxin Zhou, Nanning Zheng, Ge Liu
Main category: cs.LG
TL;DR: Riemannian Consistency Model (RCM) enables few-step generation on Riemannian manifolds using covariant derivatives and exponential maps, with theoretical equivalence between distillation and training variants.
Details
Motivation: Extend consistency models from Euclidean domains to Riemannian manifolds to handle curved geometry while enabling few-step generation.Method: Use covariant derivative and exponential-map parameterization to derive closed-form training objectives for both discrete- and continuous-time RCM, with simplified training that avoids complex differential calculations.
Result: Superior generative quality demonstrated on various non-Euclidean manifolds including flat-tori, spheres, and SO(3) rotation group.
Conclusion: RCM successfully enables few-step consistency modeling on Riemannian manifolds while respecting intrinsic geometric constraints, with theoretical foundation and practical effectiveness.
Abstract: Consistency models are a class of generative models that enable few-step generation for diffusion and flow matching models. While consistency models have achieved promising results on Euclidean domains like images, their application to Riemannian manifolds remains challenging due to the curved geometry. In this work, we propose the Riemannian Consistency Model (RCM), which, for the first time, enables few-step consistency modeling while respecting the intrinsic manifold constraint imposed by the Riemannian geometry. Leveraging the covariant derivative and exponential-map-based parameterization, we derive closed-form solutions for both discrete- and continuous-time training objectives for RCM. We then demonstrate theoretical equivalence between the two variants of RCM: Riemannian consistency distillation (RCD), which relies on a teacher model to approximate the marginal vector field, and Riemannian consistency training (RCT), which utilizes the conditional vector field for training. We further propose a simplified training objective that eliminates the need for complicated differential calculations. Finally, we provide a unique kinematics perspective for interpreting the RCM objective, offering new theoretical angles. Through extensive experiments, we demonstrate the superior generative quality of RCM in few-step generation on various non-Euclidean manifolds, including flat tori, spheres, and the 3D rotation group SO(3).
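The exponential-map parameterization is what keeps predictions on the manifold. As a concrete illustrative instance (not the paper's implementation), the exponential map on the unit sphere moves a point along a tangent vector while staying on the sphere:

```python
import numpy as np

def sphere_exp(x, v, eps=1e-12):
    """Exponential map on the unit sphere: start at x (unit norm), move along the
    tangent vector v (with <x, v> = 0); the result remains on the sphere."""
    nv = np.linalg.norm(v)
    if nv < eps:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

x = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, np.pi / 2, 0.0])   # tangent vector at x of length pi/2
y = sphere_exp(x, v)
print(y, np.linalg.norm(y))           # ~[0, 1, 0], norm 1
```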
[878] Automotive Crash Dynamics Modeling Accelerated with Machine Learning
Mohammad Amin Nabian, Sudeep Chavare, Deepak Akhare, Rishikesh Ranade, Ram Cherukuri, Srinivas Tadepalli
Main category: cs.LG
TL;DR: Machine learning surrogate models for crashworthiness assessment using NVIDIA PhysicsNeMo framework, achieving orders-of-magnitude faster predictions than traditional FE simulations.
Details
Motivation: Traditional finite element simulations for crashworthiness assessment are computationally expensive and time-consuming, creating need for faster alternatives.Method: Compared MeshGraphNet and Transolver neural network architectures with three transient dynamics modeling strategies: time-conditional, standard autoregressive, and stability-enhanced autoregressive with rollout-based training on 150 LS-DYNA FE simulations of Body-in-White crash scenarios.
Result: Models captured overall deformation trends with reasonable fidelity, achieving orders-of-magnitude computational cost reduction compared to full FE simulations, though not matching full FE accuracy.
Conclusion: Demonstrated feasibility of applying machine learning to structural crash dynamics, enabling rapid design exploration and early-stage optimization in crashworthiness evaluation.
Abstract: Crashworthiness assessment is a critical aspect of automotive design, traditionally relying on high-fidelity finite element (FE) simulations that are computationally expensive and time-consuming. This work presents an exploratory comparative study on developing machine learning-based surrogate models for efficient prediction of structural deformation in crash scenarios using the NVIDIA PhysicsNeMo framework. Given the limited prior work applying machine learning to structural crash dynamics, the primary contribution lies in demonstrating the feasibility and engineering utility of the various modeling approaches explored in this work. We investigate two state-of-the-art neural network architectures for modeling crash dynamics: MeshGraphNet, and Transolver. Additionally, we examine three strategies for modeling transient dynamics: time-conditional, the standard Autoregressive approach, and a stability-enhanced Autoregressive scheme incorporating rollout-based training. The models are evaluated on a comprehensive Body-in-White (BIW) crash dataset comprising 150 detailed FE simulations using LS-DYNA. The dataset represents a structurally rich vehicle assembly with over 200 components, including 38 key components featuring variable thickness distributions to capture realistic manufacturing variability. Each model utilizes the undeformed mesh geometry and component characteristics as inputs to predict the spatiotemporal evolution of the deformed mesh during the crash sequence. Evaluation results show that the models capture the overall deformation trends with reasonable fidelity, demonstrating the feasibility of applying machine learning to structural crash dynamics. Although not yet matching full FE accuracy, the models achieve orders-of-magnitude reductions in computational cost, enabling rapid design exploration and early-stage optimization in crashworthiness evaluation.
[879] PO-CKAN:Physics Informed Deep Operator Kolmogorov Arnold Networks with Chunk Rational Structure
Junyi Wu, Guang Lin
Main category: cs.LG
TL;DR: PO-CKAN is a physics-informed deep operator framework using Chunkwise Rational KANs for learning solution operators of PDEs, achieving improved accuracy over existing methods.
Details
Motivation: To develop an efficient framework for learning physically consistent solution operators of parametric time-dependent PDEs with varying inputs, overcoming limitations of traditional methods.Method: Combines DeepONet architecture with Chunkwise Rational KAN sub-networks and integrates PINN principles to enforce physical consistency through PDE residual loss.
Result: On Burgers’ equation with ν=0.01, reduces mean relative L² error by ~48% compared to PI-DeepONet, and achieves competitive accuracy on Eikonal and diffusion-reaction benchmarks.
Conclusion: PO-CKAN provides an effective framework for accurate and physically consistent operator learning of parametric PDEs with improved performance over existing approaches.
Abstract: We propose PO-CKAN, a physics-informed deep operator framework based on Chunkwise Rational Kolmogorov–Arnold Networks (KANs), for approximating the solution operators of partial differential equations. This framework leverages a Deep Operator Network (DeepONet) architecture that incorporates Chunkwise Rational Kolmogorov-Arnold Network (CKAN) sub-networks for enhanced function approximation. The principles of Physics-Informed Neural Networks (PINNs) are integrated into the operator learning framework to enforce physical consistency. This design enables the efficient learning of physically consistent spatio-temporal solution operators and allows for rapid prediction for parametric time-dependent PDEs with varying inputs (e.g., parameters, initial/boundary conditions) after training. Validated on challenging benchmark problems, PO-CKAN demonstrates accurate operator learning with results closely matching high-fidelity solutions. PO-CKAN adopts a DeepONet-style branch–trunk architecture with its sub-networks instantiated as rational KAN modules, and enforces physical consistency via a PDE residual (PINN-style) loss. On Burgers’ equation with $\nu=0.01$, PO-CKAN reduces the mean relative $L^2$ error by approximately 48% compared to PI-DeepONet, and achieves competitive accuracy on the Eikonal and diffusion–reaction benchmarks.
[880] Bridging Symmetry and Robustness: On the Role of Equivariance in Enhancing Adversarial Robustness
Longwei Wang, Ifrat Ikhtear Uddin, KC Santosh, Chaowei Zhang, Xiao Qin, Yang Zhou
Main category: cs.LG
TL;DR: This paper proposes using group-equivariant convolutions (rotation- and scale-equivariant layers) in CNNs to improve adversarial robustness without adversarial training, achieving better resilience to attacks while maintaining clean-data accuracy.
Details
Motivation: Adversarial training is computationally expensive and can reduce clean-data accuracy. The authors seek an architectural approach that embeds symmetry priors to naturally improve robustness against adversarial attacks.Method: Two symmetry-aware architectures: parallel design (independent processing of standard and equivariant features before fusion) and cascaded design (sequential application of equivariant operations). Uses group-equivariant convolutions to encode rotation and scale symmetries.
Result: Models consistently improve adversarial robustness and generalization across CIFAR-10, CIFAR-100, and CIFAR-10C under FGSM and PGD attacks, without adversarial training. Theoretically shows reduced hypothesis space complexity and tighter certified robustness bounds.
Conclusion: Symmetry-enforcing architectures offer efficient and principled alternatives to data augmentation-based defenses, providing inherent adversarial robustness through architectural design rather than training procedures.
Abstract: Adversarial examples reveal critical vulnerabilities in deep neural networks by exploiting their sensitivity to imperceptible input perturbations. While adversarial training remains the predominant defense strategy, it often incurs significant computational cost and may compromise clean-data accuracy. In this work, we investigate an architectural approach to adversarial robustness by embedding group-equivariant convolutions-specifically, rotation- and scale-equivariant layers-into standard convolutional neural networks (CNNs). These layers encode symmetry priors that align model behavior with structured transformations in the input space, promoting smoother decision boundaries and greater resilience to adversarial attacks. We propose and evaluate two symmetry-aware architectures: a parallel design that processes standard and equivariant features independently before fusion, and a cascaded design that applies equivariant operations sequentially. Theoretically, we demonstrate that such models reduce hypothesis space complexity, regularize gradients, and yield tighter certified robustness bounds under the CLEVER (Cross Lipschitz Extreme Value for nEtwork Robustness) framework. Empirically, our models consistently improve adversarial robustness and generalization across CIFAR-10, CIFAR-100, and CIFAR-10C under both FGSM and PGD attacks, without requiring adversarial training. These findings underscore the potential of symmetry-enforcing architectures as efficient and principled alternatives to data augmentation-based defenses.
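One rough way to see what a rotation-equivariant layer buys: apply a single shared filter in all four 90° orientations and pool over them, so the pooled response is stable under C4 rotations of the input. This is a simplified stand-in for the group-equivariant convolutions used in the paper (which also cover scale), not their implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class C4Conv(nn.Module):
    """One weight tensor applied in four 90-degree orientations, max-pooled over
    orientations; a crude approximation of a rotation-equivariant convolution."""
    def __init__(self, cin, cout, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(cout, cin, k, k) * 0.1)

    def forward(self, x):
        outs = [F.conv2d(x, torch.rot90(self.weight, r, dims=(2, 3)), padding=1)
                for r in range(4)]
        return torch.stack(outs, dim=0).amax(dim=0)  # pool over the four orientations

x = torch.randn(1, 3, 32, 32)
print(C4Conv(3, 8)(x).shape)  # torch.Size([1, 8, 32, 32])
```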
[881] Multi-Agent Regime-Conditioned Diffusion (MARCD) for CVaR-Constrained Portfolio Decisions
Ali Atiah Alzahrani
Main category: cs.LG
TL;DR: MARCD is a generative-to-decision framework that combines regime-conditioned scenarios with CVaR optimization to improve portfolio performance under regime shifts, achieving 34% reduction in maximum drawdown compared to benchmarks.
Details
Motivation: To improve portfolio decisions under regime shifts by combining generative scenarios with convex CVaR allocation, addressing the need for better risk management during market crises.Method: Four-stage framework: (i) Gaussian HMM for latent regime inference, (ii) diffusion generator for regime-conditioned scenarios, (iii) signal extraction via blended moments, (iv) governed CVaR epigraph quadratic program with constraints.
Result: On liquid multi-asset ETFs (2005-2025), MARCD showed stronger scenario calibration and materially smaller drawdowns: MaxDD 9.3% vs 14.1% for BL (34% reduction) during 2020-2025 out-of-sample period.
Conclusion: The framework provides an auditable pipeline with explicit constraints, demonstrating the value of decision-aware generative modeling in finance for improved risk management and portfolio performance.
Abstract: We examine whether regime-conditioned generative scenarios combined with a convex CVaR allocator improve portfolio decisions under regime shifts. We present MARCD, a generative-to-decision framework with: (i) a Gaussian HMM to infer latent regimes; (ii) a diffusion generator that produces regime-conditioned scenarios; (iii) signal extraction via blended, shrunk moments; and (iv) a governed CVaR epigraph quadratic program. Contributions: Within the Scenario stage we introduce a tail-weighted diffusion objective that up-weights low-quantile outcomes relevant for drawdowns and a regime-expert (MoE) denoiser whose gate increases with crisis posteriors; both are evaluated end-to-end through the allocator. Under strict walk-forward on liquid multi-asset ETFs (2005-2025), MARCD exhibits stronger scenario calibration and materially smaller drawdowns: MaxDD 9.3% versus 14.1% for BL (a 34% reduction) over 2020-2025 out-of-sample. The framework provides an auditable pipeline with explicit budget, box, and turnover constraints, demonstrating the value of decision-aware generative modeling in finance.
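The allocation stage is a CVaR program in the standard Rockafellar-Uryasev epigraph form. A minimal cvxpy sketch with budget and box constraints only; the scenario matrix here is a random stand-in rather than regime-conditioned diffusion output, and the paper's turnover constraint and return/shrinkage terms are omitted:

```python
import cvxpy as cp
import numpy as np

S, n, beta = 2000, 5, 0.95
R = np.random.default_rng(0).normal(5e-4, 0.01, size=(S, n))  # scenario returns (stand-in)

w = cp.Variable(n)       # portfolio weights
alpha = cp.Variable()    # VaR level in the epigraph formulation
u = cp.Variable(S)       # per-scenario shortfall slack

cvar = alpha + cp.sum(u) / ((1 - beta) * S)
constraints = [
    u >= 0,
    u >= -R @ w - alpha,  # epigraph of the scenario losses -r_s^T w
    cp.sum(w) == 1,       # budget constraint
    w >= 0, w <= 0.4,     # box constraints
]
cp.Problem(cp.Minimize(cvar), constraints).solve()
print(np.round(w.value, 3), float(cvar.value))
```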
[882] ADPO: Anchored Direct Preference Optimization
Wang Zixian
Main category: cs.LG
TL;DR: ADPO is a generalized preference optimization framework that uses soft, listwise supervision with anchor-based policy updates, providing robustness to noise and distribution shift through implicit trust region regularization.
Details
Motivation: Standard DPO is brittle to annotator noise and distribution shift due to its reliance on hard pairwise preferences. There's a need for a more robust framework that can handle soft supervision and provide geometric stability.Method: ADPO learns from soft, listwise supervision by anchoring policy updates to a reference model. The anchoring mechanism imposes an implicit trust region on policy updates enforced by the softmax Fisher information metric, with both fixed and dynamic anchor strategies.
Result: Dynamic anchors outperform fixed anchors in noisy online exploration (5-11% improvement), while fixed anchors are dramatically more effective for offline distillation (achieving 387% of teacher performance on HalfCheetah-v5 and reducing KL divergence by up to 5000x). Larger models amplify ADPO’s benefits.
Conclusion: ADPO provides a robust, unified framework for preference learning with clear guidance for anchor strategy selection based on task requirements, acting as an effective trust-region regularizer that scales well with model size.
Abstract: Direct Preference Optimization (DPO) has become a standard for aligning models with human feedback, yet its reliance on hard, pairwise preferences makes it brittle to annotator noise and distribution shift. We propose Anchored Direct Preference Optimization (ADPO), a generalized framework that learns from soft, listwise supervision by anchoring policy updates to a reference model. Our key theoretical contribution is to show that this anchoring mechanism imposes an implicit trust region on the policy update, enforced by the softmax Fisher information metric. This provides a robust geometric interpretation for both fixed and dynamic anchor strategies. Our central empirical finding is a task-dependent tradeoff between anchor update strategies. Through controlled experiments across twelve scenarios and two MuJoCo environments, we demonstrate that (1) for online exploration in noisy environments, a dynamic anchor that tracks the learning policy is superior, improving performance by 5 to 11 percent over a fixed anchor; and (2) for offline distillation, a fixed anchor pointing to the teacher policy is dramatically more effective, achieving returns of 206.7 on HalfCheetah-v5 (387 percent of teacher) and 65.4 on Hopper-v5 (61 percent of teacher), while reducing KL divergence to the teacher by up to 5000 times compared with standard knowledge distillation. These findings offer clear, practical guidance for selecting anchor strategies and establish ADPO as a robust, unified framework for preference learning. Larger models further amplify ADPO’s benefits (0.718 vs. 0.416 at hidden dimension 256), suggesting that anchoring acts as an effective trust-region regularizer. We release code and configurations to facilitate reproducibility.
[883] Computational Budget Should Be Considered in Data Selection
Weilin Wan, Weizhong Zhang, Cheng Jin
Main category: cs.LG
TL;DR: CADS introduces compute budget-aware data selection through bilevel optimization, addressing key challenges with Hessian-free gradient estimation and efficient inner-loop optimization.
Details
Motivation: Existing data selection methods ignore compute budget constraints, but no algorithm consistently outperforms others across varying budgets, making budget integral to effective data selection strategies.Method: Proposes Computational budget-Aware Data Selection (CADS) as bilevel optimization: inner loop trains model within budget constraints on selected data, outer loop optimizes data selection. Uses probabilistic reparameterization with Hessian-free policy gradient estimator and transforms inner optimization into penalty term.
Result: Achieves performance gains up to 14.42% over baselines in vision and language benchmarks.
Conclusion: Compute budget must be integral to data selection strategies, and CADS effectively addresses this through novel bilevel optimization framework with efficient gradient estimation.
Abstract: Data selection improves computational efficiency by choosing informative subsets of training samples. However, existing methods ignore the compute budget, treating data selection and importance evaluation independently of compute budget constraints. Yet empirical studies show no algorithm can consistently outperform others (or even random selection) across varying budgets. We therefore argue that compute budget must be integral to data-selection strategies, since different budgets impose distinct requirements on data quantity, quality, and distribution for effective training. To this end, we propose a novel Computational budget-Aware Data Selection (CADS) method and naturally formulate it into a bilevel optimization framework, where the inner loop trains the model within the constraints of the computational budget on some selected subset of training data, while the outer loop optimizes data selection based on model evaluation. Our technical contributions lie in addressing two main challenges in solving this bilevel optimization problem: the expensive Hessian matrix estimation for outer-loop gradients and the computational burden of achieving inner-loop optimality during iterations. To solve the first issue, we propose a probabilistic reparameterization strategy and compute the gradient using a Hessian-free policy gradient estimator. To address the second challenge, we transform the inner optimization problem into a penalty term in the outer objective, further discovering that we only need to estimate the minimum of a one-dimensional loss to calculate the gradient, significantly improving efficiency. Extensive experiments show that our method achieves performance gains of up to 14.42% over baselines in vision and language benchmarks.
[884] A Survey on Cache Methods in Diffusion Models: Toward Efficient Multi-Modal Generation
Jiacheng Liu, Xinyu Wang, Yuqi Lin, Zhikai Wang, Peiru Wang, Peiliang Cai, Qinming Zhou, Zhengan Yan, Zexuan Yan, Zhengyi Shi, Chang Zou, Yue Ma, Linfeng Zhang
Main category: cs.LG
TL;DR: Diffusion Caching is a training-free, architecture-agnostic method that reduces computational overhead in diffusion models by reusing intrinsic computational redundancies through feature-level cross-step reuse and inter-layer scheduling.
Details
Motivation: Diffusion models suffer from prohibitive computational overhead and generation latency due to multi-step iterations and complex backbone networks, creating bottlenecks for real-time applications. Existing acceleration techniques face limitations in applicability, training costs, or quality degradation.Method: Identifies and reuses intrinsic computational redundancies in the diffusion process through feature-level cross-step reuse and inter-layer scheduling without modifying model parameters. The approach evolves from static reuse to dynamic prediction.
Result: Provides a systematic review and unified framework for Diffusion Caching classification and analysis. Shows evolution from static to dynamic caching approaches that enhance flexibility across diverse tasks and enable integration with other acceleration techniques.
Conclusion: Diffusion Caching paradigm will become a key enabler for real-time and efficient generative AI, paving the way for unified efficient inference frameworks for future multimodal and interactive applications, injecting new vitality into Efficient Generative Intelligence.
Abstract: Diffusion Models have become a cornerstone of modern generative AI for their exceptional generation quality and controllability. However, their inherent multi-step iterations and complex backbone networks lead to prohibitive computational overhead and generation latency, forming a major bottleneck for real-time applications. Although existing acceleration techniques have made progress, they still face challenges such as limited applicability, high training costs, or quality degradation. Against this backdrop, Diffusion Caching offers a promising training-free, architecture-agnostic, and efficient inference paradigm. Its core mechanism identifies and reuses intrinsic computational redundancies in the diffusion process. By enabling feature-level cross-step reuse and inter-layer scheduling, it reduces computation without modifying model parameters. This paper systematically reviews the theoretical foundations and evolution of Diffusion Caching and proposes a unified framework for its classification and analysis. Through comparative analysis of representative methods, we show that Diffusion Caching evolves from static reuse to dynamic prediction. This trend enhances caching flexibility across diverse tasks and enables integration with other acceleration techniques such as sampling optimization and model distillation, paving the way for a unified, efficient inference framework for future multimodal and interactive applications. We argue that this paradigm will become a key enabler of real-time and efficient generative AI, injecting new vitality into both the theory and practice of Efficient Generative Intelligence.
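In its simplest, static form, the paradigm amounts to recomputing a block's features only every few denoising steps and serving the cached tensor in between. A toy sketch of that reuse pattern (dynamic variants would instead predict when a refresh is needed):

```python
class FeatureCache:
    """Static cross-step reuse: recompute a block every `refresh` steps, otherwise
    return the cached output. Dynamic caching would decide refreshes adaptively."""
    def __init__(self, refresh=3):
        self.refresh = refresh
        self.store = {}

    def __call__(self, step, key, compute):
        if step % self.refresh == 0 or key not in self.store:
            self.store[key] = compute()  # the expensive block forward pass happens here
        return self.store[key]

cache = FeatureCache(refresh=3)
for t in range(8):                       # pretend these are denoising steps
    feats = cache(t, "block_7", compute=lambda t=t: f"features@step{t}")
    print(t, feats)                      # recomputed at steps 0, 3, 6; reused otherwise
```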
[885] Provable Generalization Bounds for Deep Neural Networks with Momentum-Adaptive Gradient Dropout
Adeel Safder
Main category: cs.LG
TL;DR: MAGDrop is a novel regularization method that dynamically adjusts dropout rates based on gradients and momentum to prevent overfitting in deep neural networks, with theoretical PAC-Bayes bounds and strong empirical results on MNIST and CIFAR-10.
Details
Motivation: Deep neural networks often suffer from overfitting due to their high capacity, requiring effective regularization methods to improve generalization performance.Method: Momentum-Adaptive Gradient Dropout (MAGDrop) - dynamically adjusts dropout rates on activations based on current gradients and accumulated momentum to enhance stability in non-convex optimization.
Result: Achieved competitive performance: MNIST (99.52%) and CIFAR-10 (92.03%) with generalization gaps of 0.48% and 6.52% respectively. Theoretical PAC-Bayes bounds were 29.2% tighter than standard approaches.
Conclusion: MAGDrop bridges theoretical insights and practical advancements, providing a robust framework for enhancing DNN generalization suitable for high-stakes applications.
Abstract: Deep neural networks (DNNs) achieve remarkable performance but often suffer from overfitting due to their high capacity. We introduce Momentum-Adaptive Gradient Dropout (MAGDrop), a novel regularization method that dynamically adjusts dropout rates on activations based on current gradients and accumulated momentum, enhancing stability in non-convex optimization landscapes. To theoretically justify MAGDrop’s effectiveness, we derive a non-asymptotic, computable PAC-Bayes generalization bound that accounts for its adaptive nature, achieving up to 29.2% tighter bounds compared to standard approaches by leveraging momentum-driven perturbation control. Empirically, the activation-based MAGDrop achieves competitive performance on MNIST (99.52%) and CIFAR-10 (92.03%), with generalization gaps of 0.48% and 6.52%, respectively. We provide fully reproducible code and numerical computation of our bounds to validate our theoretical claims. Our work bridges theoretical insights and practical advancements, offering a robust framework for enhancing DNN generalization, making it suitable for high-stakes applications.
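The abstract describes dropout rates that adapt to the current gradient and the accumulated momentum, but does not give the exact update rule. The following is therefore a hypothetical sketch in which units with larger momentum-smoothed gradient magnitude are kept with higher probability; the paper's actual rule may differ:

```python
import numpy as np

def magdrop_mask(grad, momentum, p_base=0.5, beta=0.9, scale=1.0, rng=None):
    """Hypothetical momentum-adaptive dropout: momentum-smoothed gradient magnitude
    raises the keep probability of 'important' activations. Returns an inverted-dropout
    mask and the updated momentum buffer. (Illustrative rule, not the paper's.)"""
    rng = rng or np.random.default_rng(0)
    momentum = beta * momentum + (1 - beta) * grad
    importance = np.abs(momentum) / (np.abs(momentum).mean() + 1e-8)
    keep_prob = np.clip(1 - p_base / (1 + scale * importance), 0.05, 0.99)
    mask = (rng.random(grad.shape) < keep_prob) / keep_prob  # rescale the kept units
    return mask, momentum

g = np.random.default_rng(1).normal(size=128)
mask, m = magdrop_mask(g, momentum=np.zeros_like(g))
print(mask.shape, float((mask > 0).mean()))  # fraction of units kept
```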
[886] Knowledge-guided Continual Learning for Behavioral Analytics Systems
Yasas Senarath, Hemant Purohit
Main category: cs.LG
TL;DR: Proposes an augmentation-based approach using external knowledge to enhance replay-based continual learning for deviant behavior classification, addressing data drift and catastrophic forgetting in online platforms.
Details
Motivation: User behavior on online platforms evolves over time, causing data drift that degrades model performance. Fine-tuning with new data leads to catastrophic forgetting, while replay-based approaches are limited by fixed buffer sizes.Method: Novel augmentation-based approach that incorporates external knowledge into replay-based continual learning framework, using data augmentation to overcome buffer size limitations.
Result: Evaluation with three deviant behavior classification datasets shows that augmentation helps outperform baseline replay-based approaches in continual learning.
Conclusion: External knowledge augmentation effectively enhances replay-based continual learning, improving performance on evolving online behavior classification tasks while mitigating forgetting.
Abstract: User behavior on online platforms is evolving, reflecting real-world changes in how people post, whether it’s helpful messages or hate speech. Models that learn to capture this content can experience a decrease in performance over time due to data drift, which can lead to ineffective behavioral analytics systems. However, fine-tuning such a model over time with new data can be detrimental due to catastrophic forgetting. Replay-based approaches in continual learning offer a simple yet efficient method to update such models, minimizing forgetting by maintaining a buffer of important training instances from past learned tasks. However, the main limitation of this approach is the fixed size of the buffer. External knowledge bases can be utilized to overcome this limitation through data augmentation. We propose a novel augmentation-based approach to incorporate external knowledge in the replay-based continual learning framework. We evaluate several strategies with three datasets from prior studies related to deviant behavior classification to assess the integration of external knowledge in continual learning and demonstrate that augmentation helps outperform baseline replay-based approaches.
[887] Amortized Active Generation of Pareto Sets
Daniel M. Steinberg, Asiri Wijesinghe, Rafael Oliveira, Piotr Koniusz, Cheng Soon Ong, Edwin V. Bonilla
Main category: cs.LG
TL;DR: A-GPS is a framework for online discrete black-box multi-objective optimization that learns a generative model of Pareto sets, supports user preference conditioning, and achieves high-quality approximations without explicit hypervolume computation.
Details
Motivation: To address the need for efficient multi-objective optimization that can incorporate user preferences and avoid computationally expensive hypervolume calculations.Method: Uses a generative model conditioned on non-dominance relations predicted by a class probability estimator, incorporates preference direction vectors for user-specified trade-offs, and updates the model iteratively using Pareto membership and preference alignment.
Result: Achieves strong sample efficiency on synthetic benchmarks and protein design tasks, effectively captures user preferences, and produces high-quality Pareto set approximations.
Conclusion: A-GPS provides a simple yet powerful approach for multi-objective optimization that flexibly incorporates user preferences while avoiding explicit hypervolume computation.
Abstract: We introduce active generation of Pareto sets (A-GPS), a new framework for online discrete black-box multi-objective optimization (MOO). A-GPS learns a generative model of the Pareto set that supports a-posteriori conditioning on user preferences. The method employs a class probability estimator (CPE) to predict non-dominance relations and to condition the generative model toward high-performing regions of the search space. We also show that this non-dominance CPE implicitly estimates the probability of hypervolume improvement (PHVI). To incorporate subjective trade-offs, A-GPS introduces preference direction vectors that encode user-specified preferences in objective space. At each iteration, the model is updated using both Pareto membership and alignment with these preference directions, producing an amortized generative model capable of sampling across the Pareto front without retraining. The result is a simple yet powerful approach that achieves high-quality Pareto set approximations, avoids explicit hypervolume computation, and flexibly captures user preferences. Empirical results on synthetic benchmarks and protein design tasks demonstrate strong sample efficiency and effective preference incorporation.
[888] ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models
Federico Danieli, Pau Rodriguez, Miguel Sarabia, Xavier Suau, Luca Zappella
Main category: cs.LG
TL;DR: ParaRNN enables parallel training of nonlinear RNNs by solving recurrence relationships as a single system of equations using Newton’s iterations and parallel reductions, achieving up to 665x speedup and scaling to 7B parameters.
Details
Motivation: Traditional RNNs are limited by sequential computation that prevents parallelization, while existing parallel architectures like Transformers and SSMs have constraints - SSMs are limited by linearity and cannot model complex nonlinear dependencies.Method: Frame nonlinear recurrence relationships as a single system of equations and solve them in parallel using Newton’s iterations combined with custom parallel reductions.
Result: Achieved 665x speedup over sequential training, successfully trained 7B parameter LSTM and GRU models that achieve perplexity comparable to similarly-sized Transformers and Mamba2 architectures.
Conclusion: ParaRNN breaks the sequence-parallelization barrier for nonlinear RNNs, enabling scalable training of complex nonlinear models while maintaining competitive performance with state-of-the-art architectures.
Abstract: Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures like Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies. To address this, we present ParaRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relationships as a single system of equations, which we solve in parallel using Newton’s iterations combined with custom parallel reductions. Our implementation achieves speedups of up to 665x over naive sequential application, allowing training nonlinear RNNs at unprecedented scales. To showcase this, we apply ParaRNN to adaptations of LSTM and GRU architectures, successfully training models of 7B parameters that attain perplexity comparable to similarly-sized Transformers and Mamba2 architectures. To accelerate research in efficient sequence modeling, we release the ParaRNN codebase as an open-source framework for automatic training-parallelization of nonlinear RNNs, enabling researchers and practitioners to explore new nonlinear RNN models at scale.
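The core idea, treating the whole nonlinear recurrence as one system of equations and updating all time steps simultaneously until the system is satisfied, can be illustrated on a scalar RNN. The sketch below uses a plain Jacobi-style fixed-point sweep rather than ParaRNN's Newton iterations with parallel reductions, but it shows the parallel solve reproducing the sequential result:

```python
import numpy as np

def sequential_rnn(x, w=0.5):
    h, out = 0.0, []
    for xt in x:                          # inherently sequential reference
        h = np.tanh(w * h + xt)
        out.append(h)
    return np.array(out)

def parallel_solve_rnn(x, w=0.5, sweeps=None):
    """Solve h_t = tanh(w * h_{t-1} + x_t) for all t at once by repeatedly updating
    every time step from the previous iterate (Jacobi sweep). Information propagates
    one step per sweep, so T sweeps reproduce the sequential solution exactly."""
    sweeps = sweeps or len(x)
    h = np.zeros_like(x)
    for _ in range(sweeps):
        h_prev = np.concatenate(([0.0], h[:-1]))
        h = np.tanh(w * h_prev + x)       # all time steps updated in parallel
    return h

x = np.random.default_rng(0).normal(size=32)
print(np.max(np.abs(parallel_solve_rnn(x) - sequential_rnn(x))))  # ~0.0
```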
[889] Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Decoder-Only Transformers
Marko Karbevski, Antonij Mijoski
Main category: cs.LG
TL;DR: The paper proves that Query weights in attention mechanisms are redundant, reducing parameters by over 8% while maintaining comparable performance.
Details
Motivation: To investigate whether the Query, Key, Value weight triplet in attention mechanisms can be reduced to improve parameter efficiency in LLMs.Method: Theoretical analysis under simplifying assumptions, validated on full-complexity GPT-3 small architectures with layer normalization, skip connections, and weight decay trained from scratch.
Result: The reduced model without Query weights achieves comparable validation loss to standard baselines while reducing non-embedding/lm-head parameters by over 8%.
Conclusion: Query weights are redundant in attention mechanisms, motivating further investigation of this redundancy at larger scales.
Abstract: The Query, Key, Value weight triplet is a building block of current attention mechanisms in state-of-the-art LLMs. We theoretically investigate whether this triplet can be reduced, proving under simplifying assumptions that the Query weights are redundant, thereby reducing the number of non-embedding/lm-head parameters by over 8%. We validate the theory on full-complexity GPT-3 small architectures (with layer normalization, skip connections, and weight decay) trained from scratch, demonstrating that the reduced model achieves comparable validation loss to standard baselines. These findings motivate the investigation of the Query weight redundancy at scale.
[890] Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks
Mahavir Dabas, Tran Huynh, Nikhil Reddy Billa, Jiachen T. Wang, Peng Gao, Charith Peris, Yao Ma, Rahul Gupta, Ming Jin, Prateek Mittal, Ruoxi Jia
Main category: cs.LG
TL;DR: This paper proposes a new paradigm for defending against jailbreak attacks on LLMs by introducing the Adversarial Déjà Vu hypothesis and Adversarial Skill Compositional Training (ASCoT), which trains models on diverse compositions of adversarial skill primitives rather than isolated attacks.
Details
Motivation: Current adversarial training methods often fail against novel jailbreak attacks due to optimization challenges and difficulties in defining realistic threat models. There's a critical need for more effective defenses against evolving jailbreak techniques.Method: The authors conducted large-scale analysis of 32 attack papers, extracting and compressing adversarial skills into sparse dictionaries of primitives. They then developed ASCoT, which trains models on diverse compositions of these skill primitives rather than individual attack instances.
Result: ASCoT substantially improves robustness to unseen attacks, including multi-turn jailbreaks, while maintaining low over-refusal rates. The approach demonstrates that expanding adversarial skill coverage is more important than just increasing data scale.
Conclusion: The Adversarial Déjà Vu hypothesis is valid - novel jailbreaks are largely recombinations of existing adversarial skills. Training on skill compositions rather than isolated attacks provides better generalization against unseen jailbreaks.
Abstract: Large language models remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs. Defending against novel jailbreaks represents a critical challenge in AI safety. Adversarial training – designed to make models robust against worst-case perturbations – has been the dominant paradigm for adversarial robustness. However, due to optimization challenges and difficulties in defining realistic threat models, adversarial training methods often fail on newly developed jailbreaks in practice. This paper proposes a new paradigm for improving robustness against unseen jailbreaks, centered on the Adversarial Déjà Vu hypothesis: novel jailbreaks are not fundamentally new, but largely recombinations of adversarial skills from previous attacks. We study this hypothesis through a large-scale analysis of 32 attack papers published over two years. Using an automated pipeline, we extract and compress adversarial skills into a sparse dictionary of primitives, with LLMs generating human-readable descriptions. Our analysis reveals that unseen attacks can be effectively explained as sparse compositions of earlier skills, with explanatory power increasing monotonically as skill coverage grows. Guided by this insight, we introduce Adversarial Skill Compositional Training (ASCoT), which trains on diverse compositions of skill primitives rather than isolated attack instances. ASCoT substantially improves robustness to unseen attacks, including multi-turn jailbreaks, while maintaining low over-refusal rates. We also demonstrate that expanding adversarial skill coverage, not just data scale, is key to defending against novel attacks. Warning: This paper contains content that may be harmful or offensive in nature.
[891] Contextual Tokenization for Graph Inverted Indices
Pritish Chakraborty, Indradyumna Roy, Soumen Chakrabarti, Abir De
Main category: cs.LG
TL;DR: CORGII is a graph indexing framework that converts dense graph representations into sparse binary codes for efficient subgraph retrieval using inverted indices, with trainable token impact weights and multi-probing capabilities.
Details
Motivation: Existing graph retrieval methods require exhaustive scoring of corpus graphs, which limits efficiency when searching for subgraph isomorphisms in large graph datasets.Method: Uses contextual dense graph representations, differentiable discretization to create sparse binary codes over learned vocabulary, and integrates trainable token impact weights with inverted indices.
Result: CORGII achieves better accuracy-efficiency trade-offs compared to baselines, enabling efficient subgraph retrieval without exhaustive scoring.
Conclusion: CORGII is the first indexer using discrete tokens from dense graph representations with inverted lists, providing superior performance for subgraph isomorphism retrieval tasks.
Abstract: Retrieving graphs from a large corpus, that contain a subgraph isomorphic to a given query graph, is a core operation in many real-world applications. While recent multi-vector graph representations and scores based on set alignment and containment can provide accurate subgraph isomorphism tests, their use in retrieval remains limited by their need to score corpus graphs exhaustively. We introduce CORGII (Contextual Representation of Graphs for Inverted Indexing), a graph indexing framework in which, starting with a contextual dense graph representation, a differentiable discretization module computes sparse binary codes over a learned latent vocabulary. This text document-like representation allows us to leverage classic, highly optimized inverted indices, while supporting soft (vector) set containment scores. Pushing this paradigm further, we replace the classical, fixed impact weight of a `token’ on a graph (such as TFIDF or BM25) with a data-driven, trainable impact weight. Finally, we explore token expansion to support multi-probing the index for smoother accuracy-efficiency tradeoffs. To our knowledge, CORGII is the first indexer of dense graph representations using discrete tokens mapping to efficient inverted lists. Extensive experiments show that CORGII provides better trade-offs between accuracy and efficiency, compared to several baselines.
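Once each graph is tokenized into a sparse binary code, retrieval reduces to classic inverted-index machinery: every active token stores a postings list of graph ids, and a query only touches the lists for its own active tokens. A minimal sketch with uniform impact weights (CORGII learns them, and adds token expansion for multi-probing):

```python
from collections import defaultdict

def build_inverted_index(codes):
    """token id -> postings list of graph ids whose binary code activates that token."""
    index = defaultdict(list)
    for gid, code in enumerate(codes):
        for tok, bit in enumerate(code):
            if bit:
                index[tok].append(gid)
    return index

def retrieve(index, query_code):
    """Score corpus graphs by how many of the query's active tokens they share."""
    scores = defaultdict(int)
    for tok, bit in enumerate(query_code):
        if bit:
            for gid in index[tok]:
                scores[gid] += 1          # a learned impact weight would go here
    return sorted(scores.items(), key=lambda kv: -kv[1])

corpus_codes = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1]]  # toy 4-token vocabulary
index = build_inverted_index(corpus_codes)
print(retrieve(index, [1, 0, 1, 0]))      # graph 0 matches both active query tokens
```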
[892] Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime
Beomhan Baek, Minhak Song, Chulhee Yun
Main category: cs.LG
TL;DR: The paper analyzes the implicit bias of incremental Adam (one sample per step) versus full-batch Adam, showing they can converge to different max-margin classifiers depending on dataset structure and batching scheme.
Details
Motivation: Adam is widely used but its theoretical understanding is limited, especially regarding how its implicit bias differs between full-batch and incremental regimes.Method: Analyzed incremental Adam for logistic regression on separable data, constructed structured datasets to demonstrate different convergence behaviors, developed proxy algorithm to capture limiting behavior, and compared with Signum optimizer.
Result: Incremental Adam can converge to ℓ₂-max-margin classifier on structured datasets, unlike full-batch Adam’s ℓ∞-max-margin bias. Signum converges to ℓ∞-max-margin classifier regardless of batch size.
Conclusion: Adam’s implicit bias depends on both batching scheme and dataset structure, while Signum’s bias remains invariant to batch size.
Abstract: Adam [Kingma and Ba, 2015] is the de facto optimizer in deep learning, yet its theoretical understanding remains limited. Prior analyses show that Adam favors solutions aligned with $\ell_\infty$-geometry, but these results are restricted to the full-batch regime. In this work, we study the implicit bias of incremental Adam (using one sample per step) for logistic regression on linearly separable data, and we show that its bias can deviate from the full-batch behavior. To illustrate this, we construct a class of structured datasets where incremental Adam provably converges to the $\ell_2$-max-margin classifier, in contrast to the $\ell_\infty$-max-margin bias of full-batch Adam. For general datasets, we develop a proxy algorithm that captures the limiting behavior of incremental Adam as $\beta_2 \to 1$ and we characterize its convergence direction via a data-dependent dual fixed-point formulation. Finally, we prove that, unlike Adam, Signum [Bernstein et al., 2018] converges to the $\ell_\infty$-max-margin classifier for any batch size by taking $\beta$ close enough to 1. Overall, our results highlight that the implicit bias of Adam crucially depends on both the batching scheme and the dataset, while Signum remains invariant.
[893] Towards Personalized Treatment Plan: Geometrical Model-Agnostic Approach to Counterfactual Explanations
Daniel Sin, Milad Toutounchian
Main category: cs.LG
TL;DR: A method called OPBA for generating counterfactual explanations in high-dimensional spaces using boundary approximation and binary search, achieving 5-50% distance reduction compared to current methods.
Details
Motivation: To develop an effective model-agnostic method for generating realistic counterfactual explanations that can handle real-world constraints on immutable and categorical features.Method: Four-step approach: fit dataset to model, find decision boundary, determine constraints, compute closest feasible counterfactual point. Uses discretized boundary approximation with binary search to find optimal boundary points.
Result: Outperforms current methods with 5-50% reduction in L2 distance across four datasets. Handles constraints on immutable features (age, gender, sex, height). Runtime significantly faster than grid-based approaches.
Conclusion: OPBA provides a simple and effective model-agnostic method for computing nearest feasible counterfactual explanations with realistic constraints.
Abstract: In our article, we describe a method for generating counterfactual explanations in high-dimensional spaces using four steps that involve fitting our dataset to a model, finding the decision boundary, determining constraints on the problem, and computing the closest point (counterfactual explanation) from that boundary. We propose a discretized approach in which we find many discrete points on the boundary and then identify the closest feasible counterfactual explanation. This method, which we call Optimal Point for Boundary Approximation (OPBA), applies binary search to find decision boundary points and then searches for the closest boundary point. Across four datasets of varying dimensionality, we show that our method can outperform current methods for counterfactual generation, with reductions in distance of 5% to 50% in terms of the $L_2$ norm. Our method can also handle real-world constraints by restricting changes to immutable and categorical features, such as age, gender, sex, height, and other related characteristics, as in the case of a health-based dataset. In terms of runtime, the OPBA algorithm generates orders of magnitude more decision boundary points in the same amount of time compared to a grid-based approach. In general, our method provides a simple and effective model-agnostic way to compute the nearest feasible (i.e., realistic with constraints) counterfactual explanations. All of our results and code are available at: https://github.com/dsin85691/OPBA_For_Counterfactuals
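The boundary-finding step is a one-dimensional binary search between a point classified positive and one classified negative; OPBA repeats this along many directions and then keeps the feasible boundary point closest to the query. A sketch of the single-segment search on a toy linear classifier (constraint handling omitted):

```python
import numpy as np

def boundary_point(f, x_pos, x_neg, tol=1e-8):
    """Binary search along the segment from x_pos (f > 0) to x_neg (f <= 0)
    for a point where the decision function f crosses zero."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f((1 - mid) * x_pos + mid * x_neg) > 0:
            lo = mid
        else:
            hi = mid
    return (1 - lo) * x_pos + lo * x_neg

f = lambda x: x[0] + x[1] - 1.0                    # toy decision function
p = boundary_point(f, np.array([2.0, 2.0]), np.array([0.0, 0.0]))
print(p, f(p))                                     # ~[0.5, 0.5], ~0 (on the boundary x0 + x1 = 1)
```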
[894] Localized Kernel Projection Outlyingness: A Two-Stage Approach for Multi-Modal Outlier Detection
Akira Tamamori
Main category: cs.LG
TL;DR: Two-Stage LKPLO is a multi-stage outlier detection framework that combines kernel PCA for non-linear data linearization and local clustering for multi-modal distributions, achieving state-of-the-art performance on challenging datasets.
Details
Motivation: To overcome limitations of conventional projection-based outlier detection methods that rely on fixed statistical metrics and assume single data structures, which fail on complex real-world datasets.Method: A two-stage framework with: (1) generalized loss-based outlyingness measure (PLO) with adaptive loss functions, (2) global kernel PCA stage to linearize non-linear structures, and (3) local clustering stage for multi-modal distributions.
Result: Achieves state-of-the-art performance in 5-fold cross-validation on 10 benchmark datasets, significantly outperforming baselines on challenging structures like multi-cluster data (Optdigits) and high-dimensional data (Arrhythmia).
Conclusion: The synergistic combination of kernelization and localization stages is essential for superior performance, providing a powerful tool for complex outlier detection problems and highlighting the importance of hybrid multi-stage architectures.
Abstract: This paper presents Two-Stage LKPLO, a novel multi-stage outlier detection framework that overcomes the coexisting limitations of conventional projection-based methods: their reliance on a fixed statistical metric and their assumption of a single data structure. Our framework uniquely synthesizes three key concepts: (1) a generalized loss-based outlyingness measure (PLO) that replaces the fixed metric with flexible, adaptive loss functions like our proposed SVM-like loss; (2) a global kernel PCA stage to linearize non-linear data structures; and (3) a subsequent local clustering stage to handle multi-modal distributions. Comprehensive 5-fold cross-validation experiments on 10 benchmark datasets, with automated hyperparameter optimization, demonstrate that Two-Stage LKPLO achieves state-of-the-art performance. It significantly outperforms strong baselines on datasets with challenging structures where existing methods fail, most notably on multi-cluster data (Optdigits) and complex, high-dimensional data (Arrhythmia). Furthermore, an ablation study empirically confirms that the synergistic combination of both the kernelization and localization stages is indispensable for its superior performance. This work contributes a powerful new tool for a significant class of outlier detection problems and underscores the importance of hybrid, multi-stage architectures.
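For intuition, the two-stage pipeline can be mimicked with off-the-shelf components: kernel PCA to linearize, k-means for local structure, then a per-cluster projection outlyingness score. The sketch below uses a plain robust z-score over random projections as a stand-in for the paper's loss-based PLO measure, so it illustrates the pipeline shape rather than LKPLO itself:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import KernelPCA

def two_stage_outlyingness(X, n_components=5, n_clusters=3, n_proj=200, seed=0):
    """Kernel PCA (global linearization) -> k-means (local clusters) ->
    per-cluster projection outlyingness over random directions."""
    rng = np.random.default_rng(seed)
    Z = KernelPCA(n_components=n_components, kernel="rbf").fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(Z)
    P = rng.normal(size=(n_proj, Z.shape[1]))      # random projection directions
    scores = np.zeros(len(Z))
    for c in np.unique(labels):
        idx = labels == c
        proj = Z[idx] @ P.T
        med = np.median(proj, axis=0)
        mad = np.median(np.abs(proj - med), axis=0) + 1e-9
        scores[idx] = np.max(np.abs(proj - med) / mad, axis=1)
    return scores

X = np.vstack([np.random.default_rng(1).normal(size=(150, 4)),
               np.random.default_rng(2).normal(6, 1, size=(150, 4))])
scores = two_stage_outlyingness(X)
print(scores.shape, float(scores.mean()))          # one outlyingness score per sample
```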
[895] Physics-Informed Extreme Learning Machine (PIELM): Opportunities and Challenges
He Yang, Fei Ren, Francesco Calabro, Hai-Sui Yu, Xiaohui Chen, Pei-Zhi Zhuang
Main category: cs.LG
TL;DR: This paper provides a perspective review on Physics-Informed Extreme Learning Machine (PIELM), highlighting its advantages over other physics-informed machine learning methods and discussing current challenges and future opportunities.
Details
Motivation: The motivation is to share perspectives and experiences on PIELM since no comprehensive summary or review is currently available, despite its promising computational efficiency and accuracy compared to other PIML paradigms.
Method: The paper presents a review and perspective analysis of PIELM approaches, examining how they solve various types of differential equations with challenging characteristics like sharp gradients, nonlinearities, high-frequency behavior, and multiphysics coupling.
Result: The review identifies that many efforts have been made to address complex differential equations using PIELM, showing encouraging successes in handling various mathematical challenges.
Conclusion: Despite existing successes, many pressing challenges remain, providing opportunities to develop more robust, interpretable, and generalizable PIELM frameworks for scientific and engineering applications.
Abstract: We are delighted to see the recent development of physics-informed extreme learning machine (PIELM) for its higher computational efficiency and accuracy compared to other physics-informed machine learning (PIML) paradigms. Since a comprehensive summary or review of PIELM is currently unavailable, we would like to take this opportunity to share our perspectives and experiences on this promising research direction. We can see that many efforts have been made to solve ordinary/partial differential equations (ODEs/PDEs) characterized by sharp gradients, nonlinearities, high-frequency behavior, hard constraints, uncertainty, multiphysics coupling, and interpretability. Despite these encouraging successes, many pressing challenges remain to be tackled, which also provides opportunities to develop more robust, interpretable, and generalizable PIELM frameworks for scientific and engineering applications.
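The efficiency of PIELM comes from the output weights entering the network linearly: with a fixed random hidden layer, enforcing the differential equation at collocation points reduces to a single linear least-squares solve instead of gradient-based training. The sketch below illustrates this on the toy ODE u'(t) = -u(t), u(0) = 1 (an assumption-laden illustration, not code from any of the surveyed works):

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_colloc = 100, 50
t = np.linspace(0.0, 1.0, n_colloc).reshape(-1, 1)

W = rng.normal(size=(1, n_hidden))      # fixed random input weights
b = rng.normal(size=(1, n_hidden))      # fixed random biases
H = np.tanh(t @ W + b)                  # hidden activations phi(t)
dH = (1.0 - H**2) * W                   # d phi / dt for tanh features

# ODE residual rows: (dH + H) beta = 0 ; initial-condition row: phi(0) beta = 1
A = np.vstack([dH + H, np.tanh(np.zeros((1, 1)) @ W + b)])
rhs = np.concatenate([np.zeros(n_colloc), [1.0]])
beta, *_ = np.linalg.lstsq(A, rhs, rcond=None)   # one linear solve, no iterative training

u_pred = H @ beta
print("max error vs exp(-t):", np.abs(u_pred - np.exp(-t).ravel()).max())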
[896] Topic Analysis with Side Information: A Neural-Augmented LDA Approach
Biyi Fang, Truong Vo, Kripa Rajshekhar, Diego Klabjan
Main category: cs.LG
TL;DR: nnLDA is a neural-augmented topic model that integrates auxiliary information through a neural prior mechanism, outperforming traditional models like LDA in topic coherence and downstream tasks.
Details
Motivation: Traditional topic models like LDA struggle to incorporate side information (metadata, user attributes, document labels), limiting their expressiveness, personalization, and interpretability.
Method: Proposes nnLDA with neural prior mechanism where topic proportion priors are generated by neural networks conditioned on auxiliary features, using stochastic variational EM for joint optimization.
Result: Outperforms LDA and Dirichlet-Multinomial Regression across multiple benchmarks in topic coherence, perplexity, and downstream classification.
Conclusion: Combining neural representation learning with probabilistic topic modeling is beneficial when side information is available.
Abstract: Traditional topic models such as Latent Dirichlet Allocation (LDA) have been widely used to uncover latent structures in text corpora, but they often struggle to integrate auxiliary information such as metadata, user attributes, or document labels. These limitations restrict their expressiveness, personalization, and interpretability. To address this, we propose nnLDA, a neural-augmented probabilistic topic model that dynamically incorporates side information through a neural prior mechanism. nnLDA models each document as a mixture of latent topics, where the prior over topic proportions is generated by a neural network conditioned on auxiliary features. This design allows the model to capture complex nonlinear interactions between side information and topic distributions that static Dirichlet priors cannot represent. We develop a stochastic variational Expectation-Maximization algorithm to jointly optimize the neural and probabilistic components. Across multiple benchmark datasets, nnLDA consistently outperforms LDA and Dirichlet-Multinomial Regression in topic coherence, perplexity, and downstream classification. These results highlight the benefits of combining neural representation learning with probabilistic topic modeling in settings where side information is available.
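The distinguishing piece is the neural prior: instead of a fixed Dirichlet concentration shared by all documents, a small network maps each document's side information to its own prior over topic proportions. A hedged sketch of that component is below (names, sizes, and the softplus parameterization are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class NeuralDirichletPrior(nn.Module):
    """Maps a document's side information to Dirichlet concentration parameters."""
    def __init__(self, side_dim: int, n_topics: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(side_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_topics),
        )

    def forward(self, side_info: torch.Tensor) -> torch.Tensor:
        # softplus keeps the concentration parameters strictly positive
        return nn.functional.softplus(self.net(side_info)) + 1e-3

prior_net = NeuralDirichletPrior(side_dim=10, n_topics=20)
side = torch.randn(4, 10)                               # metadata for 4 documents
alpha = prior_net(side)                                 # per-document Dirichlet prior
theta = torch.distributions.Dirichlet(alpha).rsample()  # sampled topic proportions
print(theta.shape, theta.sum(dim=-1))                   # (4, 20), each row sums to 1
```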
cs.MA
[897] On the Fundamental Limitations of Decentralized Learnable Reward Shaping in Cooperative Multi-Agent Reinforcement Learning
Aditya Akella
Main category: cs.MA
TL;DR: Decentralized learnable reward shaping (DMARL-RSA) performs poorly compared to centralized methods in cooperative multi-agent tasks, achieving only -24.20 average reward vs 1.92 for MAPPO, showing fundamental limitations of decentralized coordination.
Details
Motivation: To investigate whether decentralized learnable reward shaping can be effective in cooperative multi-agent settings, as previous work has shown promise in single-agent scenarios.
Method: Proposed DMARL-RSA, a fully decentralized system where each agent learns individual reward shaping, evaluated on cooperative navigation tasks in simple_spread_v3 environment.
Result: DMARL-RSA achieved -24.20 average reward, significantly worse than centralized MAPPO (1.92) and similar to simple independent learning (IPPO: -23.19). Decentralized methods had higher landmark coverage but worse overall performance.
Conclusion: Advanced reward shaping cannot overcome fundamental decentralized coordination limitations. Three critical barriers identified: non-stationarity, exponential credit assignment complexity, and misalignment between individual and global objectives. Centralized coordination is necessary for effective multi-agent cooperation.
Abstract: Recent advances in learnable reward shaping have shown promise in single-agent reinforcement learning by automatically discovering effective feedback signals. However, the effectiveness of decentralized learnable reward shaping in cooperative multi-agent settings remains poorly understood. We propose DMARL-RSA, a fully decentralized system where each agent learns individual reward shaping, and evaluate it on cooperative navigation tasks in the simple_spread_v3 environment. Despite sophisticated reward learning, DMARL-RSA achieves only -24.20 +/- 0.09 average reward, compared to MAPPO with centralized training at 1.92 +/- 0.87, a 26.12-point gap. DMARL-RSA performs similarly to simple independent learning (IPPO: -23.19 +/- 0.96), indicating that advanced reward shaping cannot overcome fundamental decentralized coordination limitations. Interestingly, decentralized methods achieve higher landmark coverage (0.888 +/- 0.029 for DMARL-RSA, 0.960 +/- 0.045 for IPPO out of 3 total) but worse overall performance than centralized MAPPO (0.273 +/- 0.008 landmark coverage), revealing a coordination paradox between local optimization and global performance. Analysis identifies three critical barriers: (1) non-stationarity from concurrent policy updates, (2) exponential credit assignment complexity, and (3) misalignment between individual reward optimization and global objectives. These results establish empirical limits for decentralized reward learning and underscore the necessity of centralized coordination for effective multi-agent cooperation.
[898] Urban-MAS: Human-Centered Urban Prediction with LLM-Based Multi-Agent System
Shangyu Lou
Main category: cs.MA
TL;DR: Urban-MAS is an LLM-based multi-agent system framework that improves human-centered urban prediction through three specialized agent types working together under zero-shot settings.
Details
Motivation: LLMs can integrate multimodal urban data but often underperform on domain-specific urban tasks, requiring a more robust framework for human-centered urban prediction.
Method: Three agent types: Predictive Factor Guidance Agents prioritize key factors; Reliable UrbanInfo Extraction Agents validate and re-extract information; Multi-UrbanInfo Inference Agents integrate multi-source data for prediction.
Result: Experiments on running-amount prediction and urban perception across Tokyo, Milan, and Seattle show Urban-MAS substantially reduces errors compared to single-LLM baselines.
Conclusion: Urban-MAS is a scalable paradigm for human-centered urban AI prediction, with Predictive Factor Guidance Agents being most critical for performance enhancement.
Abstract: Urban Artificial Intelligence (Urban AI) has advanced human-centered urban tasks such as perception prediction and human dynamics. Large Language Models (LLMs) can integrate multimodal inputs to address heterogeneous data in complex urban systems but often underperform on domain-specific tasks. Urban-MAS, an LLM-based Multi-Agent System (MAS) framework, is introduced for human-centered urban prediction under zero-shot settings. It includes three agent types: Predictive Factor Guidance Agents, which prioritize key predictive factors to guide knowledge extraction and enhance the effectiveness of compressed urban knowledge in LLMs; Reliable UrbanInfo Extraction Agents, which improve robustness by comparing multiple outputs, validating consistency, and re-extracting when conflicts occur; and Multi-UrbanInfo Inference Agents, which integrate extracted multi-source information across dimensions for prediction. Experiments on running-amount prediction and urban perception across Tokyo, Milan, and Seattle demonstrate that Urban-MAS substantially reduces errors compared to single-LLM baselines. Ablation studies indicate that Predictive Factor Guidance Agents are most critical for enhancing predictive performance, positioning Urban-MAS as a scalable paradigm for human-centered urban AI prediction. Code is available on the project website: https://github.com/THETUREHOOHA/UrbanMAS
[899] Sherlock: Reliable and Efficient Agentic Workflow Execution
Yeonju Ro, Haoran Qiu, Íñigo Goiri, Rodrigo Fonseca, Ricardo Bianchini, Aditya Akella, Zhangyang Wang, Mattan Erez, Esha Choukse
Main category: cs.MA
TL;DR: Sherlock is a system that uses counterfactual analysis to identify error-prone nodes in LLM workflows and selectively applies cost-optimal verifiers only where needed, with speculative execution to reduce latency.
Details
Motivation: Agentic workflows using LLMs are error-prone, and verifying every step introduces significant latency and cost overheads. There's a need to identify which nodes deserve verification, select appropriate verifiers, and minimize latency impact.
Method: Sherlock uses counterfactual analysis on agentic workflows to identify error-prone nodes and selectively attaches cost-optimal verifiers only where necessary. It speculatively executes downstream tasks to reduce latency while verification runs in background.
Result: Sherlock delivers 18.3% accuracy gain on average, reduces workflow execution time by up to 48.7% over non-speculative execution, and lowers verification cost by 26.0% compared to Monte Carlo search-based method.
Conclusion: Principled, fault-aware verification effectively balances efficiency and reliability in agentic workflows, demonstrating that selective verification with speculative execution can significantly improve performance while maintaining accuracy.
Abstract: With the increasing adoption of large language models (LLM), agentic workflows, which compose multiple LLM calls with tools, retrieval, and reasoning steps, are increasingly replacing traditional applications. However, such workflows are inherently error-prone: incorrect or partially correct output at one step can propagate or even amplify through subsequent stages, compounding the impact on the final output. Recent work proposes integrating verifiers that validate LLM output or actions, such as self-reflection, debate, or LLM-as-a-judge mechanisms. Yet, verifying every step introduces significant latency and cost overheads. In this work, we seek to answer three key questions: which nodes in a workflow are most error-prone and thus deserve costly verification, how to select the most appropriate verifier for each node, and how to use verification with minimal impact to latency? Our solution, Sherlock, addresses these using counterfactual analysis on agentic workflows to identify error-prone nodes and selectively attaching cost-optimal verifiers only where necessary. At runtime, Sherlock speculatively executes downstream tasks to reduce latency overhead, while verification runs in the background. If verification fails, execution is rolled back to the last verified output. Compared to the non-verifying baseline, Sherlock delivers an 18.3% accuracy gain on average across benchmarks. Sherlock reduces workflow execution time by up to 48.7% over non-speculative execution and lowers verification cost by 26.0% compared to the Monte Carlo search-based method, demonstrating that principled, fault-aware verification effectively balances efficiency and reliability in agentic workflows.
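The control flow described here (verify in the background, speculate on the unverified output, roll back on failure) can be sketched in a few lines. The toy below is only a conceptual illustration under stated assumptions: the step and verifier functions are placeholders, and the comment marks where downstream work would overlap with verification in a real system:

```python
from concurrent.futures import ThreadPoolExecutor

def run_workflow(steps, verifiers, initial, executor):
    state, last_verified = initial, initial
    for step, verify in zip(steps, verifiers):
        out = step(state)
        if verify is None:                      # node judged low-risk: no verifier attached
            state = out
            continue
        future = executor.submit(verify, out)   # verification runs in the background
        state = out                             # speculate: downstream work may proceed on `out`
        # ... downstream steps could start here while verification is in flight ...
        if future.result():                     # join before committing the node
            last_verified = state
        else:
            state = last_verified               # verification failed: roll back
    return state

with ThreadPoolExecutor(max_workers=2) as pool:
    steps = [lambda s: s + " ->plan", lambda s: s + " ->answer"]
    verifiers = [None, lambda out: "plan" in out]
    print(run_workflow(steps, verifiers, "query", pool))
```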
[900] Spatial Crowdsourcing-based Task Allocation for UAV-assisted Maritime Data Collection
Xiaoling Han, Bin Lin, Zhenyu Na, Bowen Li, Chaoyue Zhang, Ran Zhang
Main category: cs.MA
TL;DR: This paper proposes a spatial crowdsourcing-based task allocation algorithm (SC-MDC-TA) for UAV-assisted maritime data collection that optimizes task allocation based on spatial-temporal requirements and reduces task completion time and energy consumption.
Details
Motivation: The increasing diversity and complexity of maritime data collection tasks require effective task allocation methods for UAV-assisted operations in variable maritime service scenarios.
Method: Developed an SC-based MDC network model and designed SC-MDC-TA algorithm using quality estimation (SINR and energy consumption) and reverse auction to minimize task waiting time while ensuring timely completion.
Result: Simulation results show the algorithm effectively allocates tasks across various MDC scenarios and reduces task completion time and UAV energy consumption compared to benchmarks.
Conclusion: The proposed SC-MDC-TA algorithm provides an effective solution for task allocation in UAV-assisted maritime data collection networks, improving efficiency and reducing resource consumption.
Abstract: Driven by the unceasing development of maritime services, tasks of unmanned aerial vehicle (UAV)-assisted maritime data collection (MDC) are becoming increasingly diverse, complex and personalized. As a result, effective task allocation for MDC is becoming increasingly critical. In this work, integrating the concept of spatial crowdsourcing (SC), we develop an SC-based MDC network model and investigate the task allocation problem for UAV-assisted MDC. In variable maritime service scenarios, tasks are allocated to UAVs based on the spatial and temporal requirements of the tasks, as well as the mobility of the UAVs. To address this problem, we design an SC-based task allocation algorithm for the MDC (SC-MDC-TA). The quality estimation is utilized to assess and regulate task execution quality by evaluating signal to interference plus noise ratio and the UAV energy consumption. The reverse auction is employed to potentially reduce the task waiting time as much as possible while ensuring timely completion. Additionally, we establish typical task allocation scenarios based on maritime service requirements indicated by electronic navigational charts. Simulation results demonstrate that the proposed SC-MDC-TA algorithm effectively allocates tasks for various MDC scenarios. Furthermore, compared to the benchmark, the SC-MDC-TA algorithm can also reduce the task completion time and lower the UAV energy consumption.
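A reverse auction of this kind has a simple skeleton: each task is announced, every UAV bids an estimated cost, and the lowest feasible bid wins. The sketch below uses an illustrative cost model (flight-distance energy plus queue delay) and deadline check; these are assumptions for exposition, not the paper's formulation:

```python
import math

def bid(uav, task):
    dist = math.dist(uav["pos"], task["pos"])
    finish_time = uav["busy_until"] + dist / uav["speed"]
    if finish_time > task["deadline"]:
        return None                                          # cannot finish in time
    return dist * uav["energy_per_km"] + uav["busy_until"]   # energy cost + waiting cost

def reverse_auction(uavs, tasks):
    allocation = {}
    for task in sorted(tasks, key=lambda t: t["deadline"]):  # most urgent task first
        bids = [(bid(u, task), u) for u in uavs]
        bids = [(b, u) for b, u in bids if b is not None]
        if not bids:
            continue                                         # task stays unassigned
        _, winner = min(bids, key=lambda x: x[0])
        allocation[task["id"]] = winner["id"]
        winner["busy_until"] += math.dist(winner["pos"], task["pos"]) / winner["speed"]
        winner["pos"] = task["pos"]
    return allocation

uavs = [{"id": "uav1", "pos": (0, 0), "speed": 10.0, "energy_per_km": 1.0, "busy_until": 0.0},
        {"id": "uav2", "pos": (5, 5), "speed": 12.0, "energy_per_km": 1.2, "busy_until": 0.0}]
tasks = [{"id": "t1", "pos": (1, 1), "deadline": 2.0},
         {"id": "t2", "pos": (6, 6), "deadline": 3.0}]
print(reverse_auction(uavs, tasks))
```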
[901] AgentGit: A Version Control Framework for Reliable and Scalable LLM-Powered Multi-Agent Systems
Yang Li, Siqi Ping, Xiyu Chen, Xiaojian Qi, Zigan Wang, Ye Luo, Xiaowei Zhang
Main category: cs.MA
TL;DR: AgentGit is a framework that brings Git-like rollback and branching capabilities to multi-agent systems, reducing redundant computation and improving reliability and scalability.
Details
Motivation: Current multi-agent systems struggle with reliability and scalability on complex tasks, needing better error recovery and exploration capabilities.
Method: Built as an infrastructure layer on LangGraph, AgentGit supports state commit, revert, and branching operations, allowing agents to efficiently traverse and compare multiple trajectories.
Result: AgentGit significantly reduces redundant computation, lowers runtime and token usage, and supports parallel exploration across branches, outperforming LangGraph, AutoGen, and Agno baselines.
Conclusion: AgentGit provides a practical path to more robust multi-agent system design, enabling error recovery, safe exploration, iterative debugging, and A/B testing in collaborative AI systems.
Abstract: With the rapid progress of large language models (LLMs), LLM-powered multi-agent systems (MAS) are drawing increasing interest across academia and industry. However, many current MAS frameworks struggle with reliability and scalability, especially on complex tasks. We present AgentGit, a framework that brings Git-like rollback and branching to MAS workflows. Built as an infrastructure layer on top of LangGraph, AgentGit supports state commit, revert, and branching, allowing agents to traverse, compare, and explore multiple trajectories efficiently. To evaluate AgentGit, we designed an experiment that optimizes target agents by selecting better prompts. We ran a multi-step A/B test against three baselines – LangGraph, AutoGen, and Agno – on a real-world task: retrieving and analyzing paper abstracts. Results show that AgentGit significantly reduces redundant computation, lowers runtime and token usage, and supports parallel exploration across multiple branches, enhancing both reliability and scalability in MAS development. This work offers a practical path to more robust MAS design and enables error recovery, safe exploration, iterative debugging, and A/B testing in collaborative AI systems.
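The commit/revert/branch primitives can be pictured as a tiny version store over agent state. The class below mimics that concept in plain Python for illustration; its names and methods are assumptions, not the AgentGit API:

```python
import copy, itertools

class StateRepo:
    """Minimal Git-like versioning of agent state: commit, branch, checkout, revert."""
    _ids = itertools.count()

    def __init__(self, initial_state):
        root = {"id": next(self._ids), "parent": None, "state": copy.deepcopy(initial_state)}
        self.commits = {root["id"]: root}
        self.branches = {"main": root["id"]}
        self.head = "main"

    def commit(self, state):
        parent = self.branches[self.head]
        node = {"id": next(self._ids), "parent": parent, "state": copy.deepcopy(state)}
        self.commits[node["id"]] = node
        self.branches[self.head] = node["id"]
        return node["id"]

    def branch(self, name, from_commit=None):
        self.branches[name] = from_commit if from_commit is not None else self.branches[self.head]

    def checkout(self, name):
        self.head = name
        return copy.deepcopy(self.commits[self.branches[name]]["state"])

    def revert(self, commit_id):
        self.branches[self.head] = commit_id
        return copy.deepcopy(self.commits[commit_id]["state"])

repo = StateRepo({"messages": []})
c1 = repo.commit({"messages": ["draft with prompt A"]})
repo.branch("try-prompt-b", from_commit=c1)     # explore an alternative prompt
repo.checkout("try-prompt-b")
repo.commit({"messages": ["draft with prompt A", "refined with prompt B"]})
print(repo.checkout("main"))                    # main branch still holds the prompt-A state
```

Because earlier states are preserved rather than recomputed, an A/B test of two prompts only pays for the work that differs between branches, which is the source of the token and runtime savings reported above.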
[902] Predictive Auxiliary Learning for Belief-based Multi-Agent Systems
Qinwei Huang, Stefan Wang, Simon Khan, Garrett Katz, Qinru Qiu
Main category: cs.MA
TL;DR: BEPAL is a multi-agent reinforcement learning framework that uses auxiliary predictive tasks to improve learning efficiency and stability in partially observable environments.
Details
Motivation: Most multi-agent systems rely only on rewards for policy training, which may not effectively aggregate information from observations, communications, and reward signals in partially observable environments.
Method: BEPAL follows centralized training with decentralized execution. Each agent learns a belief model that predicts unobservable state information (other agents’ rewards or motion directions) alongside its policy model through auxiliary training objectives.
Result: BEPAL achieves an average 16% performance improvement in predator-prey environment and Google Research Football, with more stable convergence compared to baseline methods.
Conclusion: Auxiliary predictive learning stabilizes MARL training and improves overall performance by enriching hidden state representations with information beyond immediate reward maximization.
Abstract: The performance of multi-agent reinforcement learning (MARL) in partially observable environments depends on effectively aggregating information from observations, communications, and reward signals. While most existing multi-agent systems primarily rely on rewards as the only feedback for policy training, our research shows that introducing auxiliary predictive tasks can significantly enhance learning efficiency and stability. We propose Belief-based Predictive Auxiliary Learning (BEPAL), a framework that incorporates auxiliary training objectives to support policy optimization. BEPAL follows the centralized training with decentralized execution paradigm. Each agent learns a belief model that predicts unobservable state information, such as other agents’ rewards or motion directions, alongside its policy model. By enriching hidden state representations with information that does not directly contribute to immediate reward maximization, this auxiliary learning process stabilizes MARL training and improves overall performance. We evaluate BEPAL in the predator-prey environment and Google Research Football, where it achieves an average improvement of about 16 percent in performance metrics and demonstrates more stable convergence compared to baseline methods.
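The auxiliary objective is the key mechanism: the agent's recurrent belief state feeds both the policy head and a predictor of quantities it cannot observe (here, other agents' rewards), and the prediction error is added to the policy loss. A hedged PyTorch sketch follows; the sizes, the surrogate policy loss, and the 0.5 auxiliary weight are illustrative assumptions:

```python
import torch
import torch.nn as nn

class BeliefAgent(nn.Module):
    def __init__(self, obs_dim, n_actions, n_other_agents, hidden=64):
        super().__init__()
        self.gru = nn.GRUCell(obs_dim, hidden)
        self.policy_head = nn.Linear(hidden, n_actions)
        self.belief_head = nn.Linear(hidden, n_other_agents)  # predicts others' rewards

    def forward(self, obs, h):
        h = self.gru(obs, h)
        return self.policy_head(h), self.belief_head(h), h

agent = BeliefAgent(obs_dim=12, n_actions=5, n_other_agents=2)
obs = torch.randn(8, 12)                    # batch of observations
h = torch.zeros(8, 64)
logits, predicted_rewards, h = agent(obs, h)

actions = torch.randint(0, 5, (8,))
advantages = torch.randn(8)                 # stand-in for computed advantages
true_other_rewards = torch.randn(8, 2)      # available only during centralized training

log_probs = torch.log_softmax(logits, dim=-1)[torch.arange(8), actions]
policy_loss = -(log_probs * advantages).mean()
aux_loss = nn.functional.mse_loss(predicted_rewards, true_other_rewards)
loss = policy_loss + 0.5 * aux_loss         # auxiliary weight is a hyperparameter
loss.backward()
```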
[903] Credit Network Modeling and Analysis via Large Language Models
Enbo Sun, Yongzhao Wang, Hao Zhou
Main category: cs.MA
TL;DR: LLMs are used to construct credit networks from financial statements and analyze them for optimal financial operations like portfolio compression and debt removal.
Details
Motivation: To leverage LLMs for translating financial statements into credit networks and analyzing network structures to improve financial system performance.
Method: Use LLMs to translate individual firm financial statements into credit networks, aggregate them, detect inconsistencies with human intervention, and apply financial operations to synthetic and real-world datasets.
Result: LLMs effectively translate financial statements into credit networks with diverse topologies and generate coherent reasoning for optimal execution of financial operations to enhance network performance.
Conclusion: LLMs demonstrate strong capabilities in constructing and analyzing credit networks, providing effective strategies for financial operations to maximize total assets in the network.
Abstract: We investigate the application of large language models (LLMs) to construct credit networks from firms’ textual financial statements and to analyze the resulting network structures. We start with using LLMs to translate each firm’s financial statement into a credit network that pertains solely to that firm. These networks are then aggregated to form a comprehensive credit network representing the whole financial system. During this process, the inconsistencies in financial statements are automatically detected and human intervention is involved. We demonstrate that this translation process is effective across financial statements corresponding to credit networks with diverse topological structures. We further investigate the reasoning capabilities of LLMs in analyzing credit networks and determining optimal strategies for executing financial operations to maximize network performance measured by the total assets of firms, which is an inherently combinatorial optimization challenge. To demonstrate this capability, we focus on two financial operations: portfolio compression and debt removal, applying them to both synthetic and real-world datasets. Our findings show that LLMs can generate coherent reasoning and recommend effective executions of these operations to enhance overall network performance.
[904] From Pixels to Cooperation Multi Agent Reinforcement Learning based on Multimodal World Models
Sureyya Akin, Kavita Srivastava, Prateek B. Kapoor, Pradeep G. Sethi, Sunita Q. Patel, Rahu Srivastava
Main category: cs.MA
TL;DR: Proposes a Multimodal World Model (MWM) framework for sample-efficient multi-agent reinforcement learning from high-dimensional sensory inputs like pixels and audio, achieving orders-of-magnitude better sample efficiency than model-free MARL baselines.
Details
Motivation: Model-free MARL algorithms struggle with sample inefficiency when learning from high-dimensional multimodal sensory inputs due to challenges in representation learning, partial observability, and credit assignment.
Method: Uses a shared generative Multimodal World Model trained to learn compressed latent representations by fusing distributed multimodal observations via scalable attention. Then trains MARL policies entirely within this latent space, decoupling representation from policy learning.
Result: Achieves orders-of-magnitude greater sample efficiency than state-of-the-art model-free MARL baselines, shows multimodal fusion is essential for sensory asymmetry tasks, and provides superior robustness to sensor dropout.
Conclusion: The MWM-MARL framework effectively addresses sample inefficiency in multimodal MARL by decoupling representation and policy learning through a shared world model, enabling practical real-world deployment with sensor robustness.
Abstract: Learning cooperative multi-agent policies directly from high-dimensional, multimodal sensory inputs like pixels and audio is notoriously sample-inefficient. Model-free Multi-Agent Reinforcement Learning (MARL) algorithms struggle with the joint challenge of representation learning, partial observability, and credit assignment. To address this, we propose a novel framework based on a shared, generative Multimodal World Model (MWM). Our MWM is trained to learn a compressed latent representation of the environment’s dynamics by fusing distributed, multimodal observations from all agents using a scalable attention-based mechanism. Subsequently, we leverage this learned MWM as a fast, “imagined” simulator to train cooperative MARL policies (e.g., MAPPO) entirely within its latent space, decoupling representation learning from policy learning. We introduce a new set of challenging multimodal, multi-agent benchmarks built on a 3D physics simulator. Our experiments demonstrate that our MWM-MARL framework achieves orders-of-magnitude greater sample efficiency compared to state-of-the-art model-free MARL baselines. We further show that our proposed multimodal fusion is essential for task success in environments with sensory asymmetry and that our architecture provides superior robustness to sensor-dropout, a critical feature for real-world deployment.
[905] An Explanation-oriented Inquiry Dialogue Game for Expert Collaborative Recommendations
Qurat-ul-ain Shaheen, Katarzyna Budzynska, Carles Sierra
Main category: cs.MA
TL;DR: A requirement analysis for collaborative medical dialogues and an inquiry dialogue game that enables medical experts to collaboratively make recommendations while generating explainable reasoning traces.
Details
Motivation: To incorporate explainability into multiagent system design for medical collaboration, allowing experts with different knowledge bases to work together effectively.
Method: Developed an inquiry dialogue game with explanation-based illocutionary forces, implemented as a prototype web-application, and evaluated through a formative user study.
Result: The user study confirmed that the dialogue game meets medical experts’ collaboration needs and provides insights on the value of dialogue-based communication tools in medicine.
Conclusion: The inquiry dialogue game successfully enables collaborative medical decision-making while maintaining explainability through rich reasoning traces.
Abstract: This work presents a requirement analysis for collaborative dialogues among medical experts and an inquiry dialogue game based on this analysis for incorporating explainability into multiagent system design. The game allows experts with different knowledge bases to collaboratively make recommendations while generating rich traces of the reasoning process through combining explanation-based illocutionary forces in an inquiry dialogue. The dialogue game was implemented as a prototype web-application and evaluated against the specification through a formative user study. The user study confirms that the dialogue game meets the needs for collaboration among medical experts. It also provides insights on the real-life value of dialogue-based communication tools for the medical community.
[906] Learning what to say and how precisely: Efficient Communication via Differentiable Discrete Communication Learning
Aditya Kapoor, Yash Bhisikar, Benjamin Freed, Jan Peters, Mingfei Sun
Main category: cs.MA
TL;DR: The paper presents a generalized Differentiable Discrete Communication Learning (DDCL) framework that enables MARL agents to learn bit-level message precision optimization, achieving significant bandwidth reduction while maintaining or improving task performance.
Details
Motivation: Current MARL communication approaches are limited to binary gating (whether to communicate) and cannot optimize message precision at bit-level due to gradient flow issues from discretization.
Method: Extends DDCL to support unbounded signals, creating a universal plug-and-play layer for MARL architectures that enables end-to-end optimization of discrete messages with dynamic precision modulation.
Result: Achieves over 10x bandwidth reduction while matching or exceeding task performance across four state-of-the-art MARL algorithms, and shows that simple Transformer-based policies with DDCL can match complex specialized architectures.
Conclusion: Demonstrates the ‘Bitter Lesson’ in MARL communication - simple architectures with learned communication protocols can outperform complex bespoke designs, questioning the need for specialized communication mechanisms.
Abstract: Effective communication in multi-agent reinforcement learning (MARL) is critical for success but constrained by bandwidth, yet past approaches have been limited to complex gating mechanisms that only decide whether to communicate, not how precisely. Learning to optimize message precision at the bit-level is fundamentally harder, as the required discretization step breaks gradient flow. We address this by generalizing Differentiable Discrete Communication Learning (DDCL), a framework for end-to-end optimization of discrete messages. Our primary contribution is an extension of DDCL to support unbounded signals, transforming it into a universal, plug-and-play layer for any MARL architecture. We verify our approach with three key results. First, through a qualitative analysis in a controlled environment, we demonstrate how agents learn to dynamically modulate message precision according to the informational needs of the task. Second, we integrate our variant of DDCL into four state-of-the-art MARL algorithms, showing it reduces bandwidth by over an order of magnitude while matching or exceeding task performance. Finally, we provide direct evidence for the "Bitter Lesson" in MARL communication: a simple Transformer-based policy leveraging DDCL matches the performance of complex, specialized architectures, questioning the necessity of bespoke communication designs.
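A common way to make message discretization differentiable is to quantize in the forward pass while passing gradients straight through, with a learnable precision (number of quantization levels) per message dimension. The sketch below illustrates that generic idea under stated assumptions; it is not the paper's exact DDCL layer:

```python
import torch
import torch.nn as nn

class DiscreteChannel(nn.Module):
    """Quantize messages with a learnable per-dimension precision (straight-through)."""
    def __init__(self, msg_dim, max_bits=8):
        super().__init__()
        self.bits_logit = nn.Parameter(torch.zeros(msg_dim))  # learnable precision
        self.max_bits = max_bits

    def forward(self, msg):                      # msg assumed roughly in [-1, 1]
        bits = 1.0 + torch.sigmoid(self.bits_logit) * (self.max_bits - 1)
        levels = 2.0 ** bits                     # (soft) number of quantization levels
        quantized = torch.round(msg * levels) / levels
        # straight-through: quantized values forward, identity gradient backward
        return msg + (quantized - msg).detach(), bits

channel = DiscreteChannel(msg_dim=4)
msg = torch.tanh(torch.randn(8, 4, requires_grad=True))
sent, bits = channel(msg)
loss = sent.pow(2).mean() + 0.01 * bits.mean()   # task loss + bandwidth penalty
loss.backward()
print(bits.detach())                             # learned bits per message dimension
```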
[907] MARFT: Multi-Agent Reinforcement Fine-Tuning
Junwei Liao, Muning Wen, Jun Wang, Weinan Zhang
Main category: cs.MA
TL;DR: This paper proposes MARFT (Multi-Agent Reinforcement Fine-Tuning), a novel paradigm for fine-tuning LLM-based Multi-Agent Systems using reinforcement learning techniques, addressing the limitations of traditional MARL approaches.
Details
Motivation: LLM-based Multi-Agent Systems show impressive capabilities but lack effective fine-tuning methods using foundational RL techniques. Traditional MARL methods face challenges when directly applied to LaMAS due to their unique characteristics.
Method: The authors introduce MARFT framework with Flex-MG game formulation that aligns with LaMAS optimization, providing conceptual foundations, key distinctions from MARL, and practical implementation strategies including open-source code.
Result: A robust and scalable MARFT framework is developed with complete open-source implementation available, bridging theoretical foundations with practical methodologies for LaMAS optimization.
Conclusion: This work serves as a roadmap for advancing MARFT toward resilient and adaptive solutions in agentic systems, addressing real-world application challenges and providing foundational framework for future research.
Abstract: LLM-based Multi-Agent Systems have demonstrated remarkable capabilities in addressing complex, agentic tasks, from generating high-quality presentation slides to even conducting sophisticated scientific research. Meanwhile, RL has been widely recognized for its effectiveness in enhancing agent intelligence, but limited research has investigated the fine-tuning of LaMAS using foundational RL techniques. Moreover, the direct application of MARL methods to LaMAS introduces significant challenges, stemming from the unique characteristics and mechanisms inherent to LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes a novel paradigm termed Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce a brand-new MG called Flex-MG, which aligns with the LaMAS optimization in real-world applications and a universal algorithmic framework tailored specifically for LaMAS, outlining the conceptual foundations, key distinctions, and practical implementation strategies. We review the evolution from RL to RFT, setting the stage for a parallel analysis in the multi-agent domain. In the context of LaMAS, we elucidate critical differences between MARL and MARFT. These differences motivate a transition toward a LaMAS-oriented formulation of RFT. Central to this work is a robust and scalable MARFT framework. We detail the core algorithm and provide a complete, open-source implementation to facilitate adoption and further research. The latter sections of the paper explore real-world application perspectives and opening challenges in MARFT. By bridging theoretical underpinnings with practical methodologies, this work serves as a roadmap for researchers seeking to advance MARFT toward resilient and adaptive solutions in agentic systems. Our implementation of the proposed framework is publicly available at: https://github.com/jwliao-ai/MARFT.
[908] KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
Hancheng Ye, Zhengqi Gao, Mingyuan Ma, Qinsi Wang, Yuzhe Fu, Ming-Yu Chung, Yueqian Lin, Zhijian Liu, Jianyi Zhang, Danyang Zhuo, Yiran Chen
Main category: cs.MA
TL;DR: KVCOMM is a training-free framework that enables efficient KV-cache reuse in multi-agent LLM systems by aligning cache offsets of overlapping contexts, achieving up to 7.8x speedup without quality degradation.
Details
Motivation: Multi-agent LLM systems suffer from substantial overhead due to repeated reprocessing of overlapping contexts across agents, as standard KV caching cannot be directly reused due to diverging prefixes from agent-specific context extensions.
Method: KVCOMM estimates and adjusts KV-caches for shared content by referencing a pool of cached examples (anchors) that store observed cache deviations under varying prefixes. The anchor pool is maintained and updated online for dynamic adaptation.
Result: KVCOMM achieves over 70% reuse rate across diverse multi-agent workloads and up to 7.8x speedup in a five-agent setting, reducing TTFT from ~430 ms to ~55 ms without quality degradation.
Conclusion: KVCOMM effectively addresses the offset variance challenge in multi-agent KV-cache reuse, enabling significant performance improvements in multi-agent LLM systems while maintaining output quality.
Abstract: Multi-agent large language model (LLM) systems are increasingly adopted for complex language processing tasks that require communication and coordination among agents. However, these systems often suffer substantial overhead from repeated reprocessing of overlapping contexts across agents. In typical pipelines, once an agent receives a message from its predecessor, the full context, including prior turns, must be reprocessed from scratch, leading to inefficient processing. While key-value (KV) caching is an effective solution for avoiding redundant computation in single-agent settings where prefixes remain unchanged, it cannot be directly reused in multi-agent scenarios due to diverging prefixes introduced by agent-specific context extensions. We identify that the core challenge lies in the offset variance of KV-caches across agents. To address this, we propose KVCOMM, a training-free framework that enables efficient prefilling in multi-agent inference by reusing KV-caches and aligning cache offsets of overlapping contexts under diverse prefix contexts. KVCOMM estimates and adjusts KV-caches for shared content by referencing a pool of cached examples, termed anchors, that store observed cache deviations under varying prefixes. The anchor pool is maintained and updated online, allowing dynamic adaptation to distinct user requests and context structures. KVCOMM achieves over 70% reuse rate across diverse multi-agent workloads, including retrieval-augmented generation, math reasoning, and collaborative coding tasks, all without quality degradation. Particularly, when each fully-connected agent receives 1K input tokens with 512 prefix tokens and 512 output tokens under a five-agent setting, KVCOMM achieves up to 7.8x speedup compared to the standard prefill pipeline, reducing TTFT from ~430 ms to ~55 ms.
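A heavily simplified way to picture the anchor mechanism: the KV-cache of shared content computed under one prefix is reused under a different prefix by adding a deviation looked up from the most similar cached anchor, instead of re-running prefill. The numpy sketch below is purely conceptual; real KV-caches are per-layer, per-head tensors inside the model, and the similarity measure here is an assumption:

```python
import numpy as np

def estimate_kv(shared_kv_under_ref_prefix, new_prefix_emb, anchors):
    """anchors: list of (prefix_embedding, observed_kv_deviation) pairs collected online."""
    sims = [new_prefix_emb @ emb /
            (np.linalg.norm(new_prefix_emb) * np.linalg.norm(emb) + 1e-9)
            for emb, _ in anchors]
    _, deviation = anchors[int(np.argmax(sims))]    # nearest anchor by prefix similarity
    return shared_kv_under_ref_prefix + deviation   # adjusted cache, no recomputation

rng = np.random.default_rng(0)
shared_kv = rng.normal(size=(16, 64))               # 16 cached positions, dim 64
anchors = [(rng.normal(size=32), rng.normal(scale=0.1, size=(16, 64))) for _ in range(4)]
new_prefix = rng.normal(size=32)
print(estimate_kv(shared_kv, new_prefix, anchors).shape)   # (16, 64)
```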
cs.MM
[909] LongCat-Flash-Omni Technical Report
Meituan LongCat Team, Bairui Wang, Bayan, Bin Xiao, Bo Zhang, Bolin Rong, Borun Chen, Chang Wan, Chao Zhang, Chen Huang, Chen Chen, Chen Chen, Chengxu Yang, Chengzuo Yang, Cong Han, Dandan Peng, Delian Ruan, Detai Xin, Disong Wang, Dongchao Yang, Fanfan Liu, Fengjiao Chen, Fengyu Yang, Gan Dong, Gang Huang, Gang Xu, Guanglu Wan, Guoqiang Tan, Guoqiao Yu, Haibo Qiu, Hao Lu, Hongbo Liu, Hongyu Xiang, Jiaheng Wu, Jian Yang, Jiaxing Liu, Jing Huang, Jingang Wang, Jinrui Ding, Juchao Jiang, Jun Kuang, Jun Wang, Junhui Mei, Ke Ding, Kefeng Zhang, Lei Chen, Liang Shi, Limeng Qiao, Liming Zheng, Lin Ma, Liuyang Guo, Liya Ma, Luying Sun, Man Gao, Mengshen Zhu, Miao Cao, Minliang Lin, Nuo Xu, Peng Shi, Qi Zhang, Qian Fang, Qian Wang, Qian Yang, Quanxiu Wang, Rongxiang Weng, Rongxin Guo, Ruoxuan Liang, Senbin Yang, Shanbo Xu, Shanglin Lei, Shengze Ye, Shimin Chen, Shuaiqi Chen, Shujie Hu, Shuo Li, Siqi Yang, Siyu Xu, Siyu Ren, Song Li, Songxiang Liu, Tianhao Bai, Tianye Dai, Wei Hong, Wei Wang, Weixiao Zhao, Wengang Cao, Wenlong Zhu, Wenlong He, Xi Su, Xi Nan, Xiaohan Zhao, Xiaohao Wang, Xiaoyu Zhao, Xiaoyu Wang, Xiaoyu Li, Xin Pan, Xin Chen, Xiusong Sun, Xu Xiang, Xudong Xing, Xuezhi Cao, Xunliang Cai, Yang Yang, Yanli Tan, Yao Yao, Yerui Sun, Yi Chen, Yifan Lu, Yin Gong, Yining Zhang, Yitian Chen, Yiyang Gan, Yuchen Tang, Yuchen Xie, Yueqian Wang, Yuewen Zheng, Yufei Zhang, Yufeng Zhong, Yulei Qian, Yuqi Peng, Yuwei Jiang, Zeyang Hu, Zheng Zhang, Zhengkun Tian, Zhiqing Hong, Zhixiong Zeng, Zhuqi Mi, Ziran Li, Ziwen Wang, Ziyi Zhao, Ziyuan Zhuang, Zizhe Zhao
Main category: cs.MM
TL;DR: LongCat-Flash-Omni is a 560B parameter open-source omni-modal model that achieves state-of-the-art real-time audio-visual interaction through progressive training and efficient MoE architecture.
Details
Motivation: To develop a comprehensive multimodal model that can handle real-time audio-visual interactions while maintaining strong unimodal capabilities, addressing the challenges of large-scale multimodal training.
Method: Uses curriculum-inspired progressive training, Shortcut-connected Mixture-of-Experts architecture with zero-computation experts, modality-decoupled parallelism for training infrastructure, and integrates multimodal perception and speech reconstruction modules.
Result: Achieves state-of-the-art performance on omni-modal benchmarks among open-source models, delivers competitive results across text, image, video, and audio tasks, and maintains low-latency real-time interaction despite 560B parameters (27B activated).
Conclusion: The model demonstrates exceptional efficiency in multimodal training while achieving comprehensive capabilities across multiple modalities, providing a foundation for future research in large-scale multimodal AI systems.
Abstract: We introduce LongCat-Flash-Omni, a state-of-the-art open-source omni-modal model with 560 billion parameters, excelling at real-time audio-visual interaction. By adopting a curriculum-inspired progressive training strategy that transitions from simpler to increasingly complex modality sequence modeling tasks, LongCat-Flash-Omni attains comprehensive multimodal capabilities while maintaining strong unimodal capability. Building upon LongCat-Flash, which adopts a high-performance Shortcut-connected Mixture-of-Experts (MoE) architecture with zero-computation experts, LongCat-Flash-Omni integrates efficient multimodal perception and speech reconstruction modules. Despite its immense size of 560B parameters (with 27B activated), LongCat-Flash-Omni achieves low-latency real-time audio-visual interaction. For training infrastructure, we developed a modality-decoupled parallelism scheme specifically designed to manage the data and model heterogeneity inherent in large-scale multimodal training. This innovative approach demonstrates exceptional efficiency by sustaining over 90% of the throughput achieved by text-only training. Extensive evaluations show that LongCat-Flash-Omni achieves state-of-the-art performance on omni-modal benchmarks among open-source models. Furthermore, it delivers highly competitive results across a wide range of modality-specific tasks, including text, image, and video understanding, as well as audio understanding and generation. We provide a comprehensive overview of the model architecture design, training procedures, and data strategies, and open-source the model to foster future research and development in the community.
[910] Predicting Encoding Energy from Low-Pass Anchors for Green Video Streaming
Zoha Azimi, Reza Farahani, Vignesh V Menon, Christian Timmerer
Main category: cs.MM
TL;DR: Lightweight energy prediction method for video encoding that uses lower-resolution anchor encodings to estimate energy consumption, achieving over 50% energy savings with minimal quality degradation.
Details
Motivation: Video streaming dominates Internet traffic but raises energy efficiency and carbon emission concerns, requiring trade-offs between energy consumption and Quality of Experience (QoE).
Method: Uses reference encodings at lower resolutions (anchors) to predict energy consumption of high-resolution videos, eliminating exhaustive per-segment measurements. Automatically selects encoding parameters like resolution and QP to maintain perceptual quality within VMAF limits.
Result: For only 1.68 average VMAF score reduction (below Just Noticeable Difference threshold), achieved 51.22% encoding energy savings and 53.54% decoding energy savings compared to no quality degradation scenario.
Conclusion: The proposed method provides an effective trade-off between energy efficiency and video quality, enabling substantial energy savings with imperceptible quality degradation in video streaming.
Abstract: Video streaming now represents the dominant share of Internet traffic, as ever-higher-resolution content is distributed across a growing range of heterogeneous devices to sustain user Quality of Experience (QoE). However, this trend raises significant concerns about energy efficiency and carbon emissions, requiring methods to provide a trade-off between energy and QoE. This paper proposes a lightweight energy prediction method that estimates the energy consumption of high-resolution video encodings using reference encodings generated at lower resolutions (so-called anchors), eliminating the need for exhaustive per-segment energy measurements, a process that is infeasible at scale. We automatically select encoding parameters, such as resolution and quantization parameter (QP), to achieve substantial energy savings while maintaining perceptual quality, as measured by the Video Multimethod Assessment Fusion (VMAF) metric, within acceptable limits. We implement and evaluate our approach with the open-source VVenC encoder on 100 video sequences from the Inter4K dataset across multiple encoding settings. Results show that, for an average VMAF score reduction of only 1.68, which stays below the Just Noticeable Difference (JND) threshold, our method achieves 51.22% encoding energy savings and 53.54% decoding energy savings compared to a scenario with no quality degradation.
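As a toy stand-in for the anchor idea: measure encoding energy only at a low anchor resolution, then predict the energy at the target resolution with a simple model fitted on a handful of calibration segments. The log-linear fit, the made-up joule figures, and the resolutions below are illustrative assumptions, not the paper's predictor:

```python
import numpy as np

# Calibration segments: energy measured at the 540p anchor and at the 2160p target.
calib_anchor_energy = np.array([12.0, 15.0, 9.0, 20.0])     # joules at 540p (measured)
calib_target_energy = np.array([55.0, 70.0, 40.0, 95.0])    # joules at 2160p (measured)

# Fit log(target) = a * log(anchor) + b on the calibration segments.
a, b = np.polyfit(np.log(calib_anchor_energy), np.log(calib_target_energy), 1)

def predict_target_energy(anchor_energy_joules):
    """Predict high-resolution encoding energy from an anchor measurement alone."""
    return float(np.exp(a * np.log(anchor_energy_joules) + b))

# New segment: only the cheap 540p anchor encode is measured.
print(predict_target_energy(14.0))   # predicted 2160p encoding energy in joules
```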
[911] Rhythm in the Air: Vision-based Real-Time Music Generation through Gestures
Barathi Subramanian, Rathinaraja Jeyaraj, Anand Paul, Kapilya Gangadharan
Main category: cs.MM
TL;DR: This paper presents a vision-based dynamic gesture recognition system for real-time music composition using a custom dataset and a multi-layer attention-based GRU model that achieves 96.83% accuracy.
Details
Motivation: To enable seamless human-computer interaction for music creation through gestures without physical touch, advancing HCI experiences and providing new ways for people to interact with music.
Method: Created a custom gesture dataset with 15,000+ samples across 21 classes (7 musical notes at 3 pitch levels) and developed a multi-layer attention-based GRU (MLA-GRU) model that uses GRU for temporal pattern learning and attention layers to focus on musically relevant gesture segments.
Result: MLA-GRU significantly outperformed classical GRU, achieving 96.83% accuracy compared to baseline’s 86.7%, with superior efficiency and processing speed suitable for interactive applications.
Conclusion: The proposed system enables innovative gesture-based music interaction, advances HCI experiences, and demonstrates MLA-GRU’s effectiveness for swift and precise gesture recognition in real-time applications.
Abstract: Gesture recognition is an essential component of human-computer interaction (HCI), facilitating seamless interconnectivity between users and computer systems without physical touch. This paper introduces an innovative application of vision-based dynamic gesture recognition (VDGR) for real-time music composition through gestures. To implement this application, we generate a custom gesture dataset that encompasses over 15000 samples across 21 classes, incorporating 7 musical notes each manifesting at three distinct pitch levels. To effectively deal with the modest volume of training data and to accurately discern and prioritize complex gesture sequences for music creation, we develop a multi-layer attention-based gated recurrent unit (MLA-GRU) model, in which gated recurrent unit (GRU) is used to learn temporal patterns from the observed sequence and an attention layer is employed to focus on musically pertinent gesture segments. Our empirical studies demonstrate that MLA-GRU significantly surpasses the classical GRU model, achieving a remarkable accuracy of 96.83% compared to the baseline’s 86.7%. Moreover, our approach exhibits superior efficiency and processing speed, which are crucial for interactive applications. Using our proposed system, we believe that people will interact with music in a new and exciting way. It not only advances HCI experiences but also highlights MLA-GRU’s effectiveness in scenarios demanding swift and precise gesture recognition.
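The model family described (a stacked GRU with an attention layer that pools over time before classification) has a compact generic form. The PyTorch sketch below is a hedged illustration of that architecture; the feature dimension, layer sizes, and attention form are assumptions rather than the paper's configuration, while the 21 output classes match the 7 notes at 3 pitch levels:

```python
import torch
import torch.nn as nn

class AttentionGRU(nn.Module):
    """Stacked GRU over a gesture sequence, attention-pooled before classification."""
    def __init__(self, feat_dim=42, hidden=128, layers=2, n_classes=21):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.attn = nn.Linear(hidden, 1)            # scores each time step
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                           # x: (batch, time, feat_dim)
        h, _ = self.gru(x)                          # (batch, time, hidden)
        weights = torch.softmax(self.attn(h), dim=1)
        context = (weights * h).sum(dim=1)          # attention-weighted summary over time
        return self.fc(context)

model = AttentionGRU()
frames = torch.randn(8, 30, 42)                     # 8 sequences of 30 frames of hand keypoints
logits = model(frames)
print(logits.shape)                                 # (8, 21): 7 notes x 3 pitch levels
```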
[912] EV-NVC: Efficient Variable bitrate Neural Video Compression
Yongcun Hu, Yingzhen Zhai, Jixiang Luo, Wenrui Dai, Dell Zhang, Hongkai Xiong, Xuelong Li
Main category: cs.MM
TL;DR: The paper proposes an efficient variable bitrate neural video codec (EV-NVC) with piecewise linear sampler and long-short-term feature fusion module to improve rate-distortion performance and context modeling.
Details
Motivation: Training neural video codecs with variable rate is challenging due to complex training strategies and model structure.
Method: Uses piecewise linear sampler (PLS) for better rate-distortion in high bitrate range, long-short-term feature fusion module (LSTFFM) for enhanced context modeling, and mixed-precision training with detailed stage-specific strategies.
Result: Reduces BD-rate by 30.56% compared to HM-16.25 within low-delay mode.
Conclusion: The proposed EV-NVC with PLS and LSTFFM effectively improves neural video coding performance through optimized training strategies and architectural enhancements.
Abstract: Training neural video codec (NVC) with variable rate is a highly challenging task due to its complex training strategies and model structure. In this paper, we train an efficient variable bitrate neural video codec (EV-NVC) with the piecewise linear sampler (PLS) to improve the rate-distortion performance in high bitrate range, and the long-short-term feature fusion module (LSTFFM) to enhance the context modeling. Besides, we introduce mixed-precision training and discuss the different training strategies for each stage in detail to fully evaluate its effectiveness. Experimental results show that our approach reduces the BD-rate by 30.56% compared to HM-16.25 within low-delay mode.
eess.AS
[913] NaturalVoices: A Large-Scale, Spontaneous and Emotional Podcast Dataset for Voice Conversion
Zongyang Du, Shreeram Suresh Chandra, Ismail Rasim Ulgen, Aurosweta Mahapatra, Ali N. Salman, Carlos Busso, Berrak Sisman
Main category: eess.AS
TL;DR: NaturalVoices is a large-scale spontaneous podcast dataset for emotion-aware voice conversion, containing 5,049 hours of real-life speech with comprehensive annotations.
Details
Motivation: Existing speech datasets are acted, limited in scale, and fail to capture expressive richness of real-life communication, creating a gap for voice conversion research.
Method: Created a large-scale spontaneous podcast dataset with automatic annotations for emotion, speech quality, transcripts, speaker identity, and sound events, plus an open-source annotation pipeline.
Result: The dataset supports development of robust VC models that produce natural, expressive speech, while revealing limitations of current architectures on spontaneous data.
Conclusion: NaturalVoices is both a valuable resource and challenging benchmark for advancing voice conversion field, enabling emotion-aware VC research.
Abstract: Everyday speech conveys far more than words, it reflects who we are, how we feel, and the circumstances surrounding our interactions. Yet, most existing speech datasets are acted, limited in scale, and fail to capture the expressive richness of real-life communication. With the rise of large neural networks, several large-scale speech corpora have emerged and been widely adopted across various speech processing tasks. However, the field of voice conversion (VC) still lacks large-scale, expressive, and real-life speech resources suitable for modeling natural prosody and emotion. To fill this gap, we release NaturalVoices (NV), the first large-scale spontaneous podcast dataset specifically designed for emotion-aware voice conversion. It comprises 5,049 hours of spontaneous podcast recordings with automatic annotations for emotion (categorical and attribute-based), speech quality, transcripts, speaker identity, and sound events. The dataset captures expressive emotional variation across thousands of speakers, diverse topics, and natural speaking styles. We also provide an open-source pipeline with modular annotation tools and flexible filtering, enabling researchers to construct customized subsets for a wide range of VC tasks. Experiments demonstrate that NaturalVoices supports the development of robust and generalizable VC models capable of producing natural, expressive speech, while revealing limitations of current architectures when applied to large-scale spontaneous data. These results suggest that NaturalVoices is both a valuable resource and a challenging benchmark for advancing the field of voice conversion. Dataset is available at: https://huggingface.co/JHU-SmileLab
[914] MULTI-Bench: A Multi-Turn Interactive Benchmark for Assessing Emotional Intelligence ability of Spoken Dialogue Models
Yayue Deng, Guoqiang Hu, Haiyang Sun, Xiangyu Zhang, Haoyang Zhang, Fei Tian, Xuerui Yang, Gang Yu, Eng Siong Chng
Main category: eess.AS
TL;DR: Multi-Bench is the first benchmark for evaluating Spoken Dialogue Models in multi-turn interactive dialogue with emotional intelligence, featuring hierarchical tasks and showing current models perform well on basic understanding but need improvement in advanced emotional reasoning.
Details
Motivation: Current SDM benchmarks focus mainly on single-turn exchanges, leaving multi-turn interactive conversations with emotional intelligence underexplored.
Method: Multi-Bench uses a hierarchical structure with basic track (emotion understanding/reasoning) and advanced track (emotion support/application), comprising five tasks and 3.2K samples with reproducible evaluation framework.
Result: Evaluation of six SDMs shows they achieve good performance on basic understanding tasks but have room for improvement in advanced multi-turn interactive dialogue and reasoning, particularly in emotion awareness and application.
Conclusion: Current SDMs need further development for advanced emotional intelligence in multi-turn interactive dialogues, with Multi-Bench providing a comprehensive evaluation framework for this purpose.
Abstract: Spoken Dialogue Models (SDMs) have advanced rapidly, yet their ability to sustain genuinely interactive multi-turn conversations remains underexplored, as most benchmarks focus on single-turn exchanges. We introduce Multi-Bench, the first benchmark explicitly designed to evaluate SDMs in multi-turn interactive dialogue with an emphasis on emotional intelligence. Multi-Bench employs a hierarchical structure with a basic track for emotion understanding and reasoning and an advanced track for emotion support and application. It comprises five carefully designed tasks and about 3.2K samples, ranging from emotion recognition to complex reasoning and interactive dialogue, supported by a reproducible evaluation framework. We evaluate six representative SDMs on eight subsets of Multi-Bench. Results show that while current SDMs achieve good performance on basic understanding tasks, they still have room for improvement in advanced multi-turn interactive dialogue and reasoning-related tasks, particularly in emotion awareness and application.
[915] WhisperVC: Target Speaker-Controllable Mandarin Whisper-to-Speech Conversion
Dong Liu, Ming Li
Main category: eess.AS
TL;DR: WhisperVC is a three-stage framework for Mandarin whisper-to-speech conversion that achieves near ground-truth quality while maintaining speaker similarity.
Details
Motivation: Whispered speech lacks vocal-fold excitation and exhibits reduced energy and shifted formant frequencies, making natural and intelligible voice reconstruction highly challenging.
Method: Three-stage framework: 1) Fine-tuned Content Encoder with Whisper-large V3 and Conformer-based variational autoencoder with soft-DTW alignment; 2) Deterministic Length-Channel Aligner and duration-free FastSpeech 2 model; 3) Fine-tuned HiFi-GAN vocoder on predicted mel-spectrograms.
Result: Achieves near ground-truth quality (DNSMOS 3.11, UTMOS 2.52, CER 18.67%) while maintaining speaker similarity (cosine 0.76) and robust performance under whisper-only inference on AISHELL6-Whisper corpus.
Conclusion: WhisperVC effectively addresses the challenges of whisper-to-speech conversion and demonstrates high-quality voice reconstruction capabilities.
Abstract: Whispered speech lacks vocal-fold excitation and exhibits reduced energy and shifted formant frequencies, making natural and intelligible voice reconstruction highly challenging. To address this issue, we propose WhisperVC, a three-stage framework for Mandarin whisper-to-speech (W2S) conversion. Stage 1 employs a fine-tuned Content Encoder based on the OpenAI Whisper-large V3 model and a Conformer-based variational autoencoder with soft-DTW alignment to learn domain-invariant and temporally consistent representations. Stage 2 introduces a deterministic Length-Channel Aligner and a duration-free FastSpeech 2 model conditioned on speaker embeddings for controllable timbre and stable prosody. Stage 3 fine-tunes a HiFi-GAN vocoder on predicted mel-spectrograms to synthesize high-fidelity waveforms. Experiments on the AISHELL6-Whisper corpus demonstrate that WhisperVC achieves near ground-truth quality (DNSMOS 3.11, UTMOS 2.52, CER 18.67%), while maintaining speaker similarity (cosine 0.76) and robust performance under whisper-only inference.
[916] Towards General Auditory Intelligence: Large Multimodal Models for Machine Listening and Speaking
Siyin Wang, Zengrui Jin, Changli Tang, Qiujia Li, Bo Li, Chen Chen, Yuchen Hu, Wenyi Yu, Yixuan Li, Jimin Zhuang, Yudong Yang, Mingqiu Wang, Michael Han, Yifan Ding, Junwen Bai, Tom Ouyang, Shuo-yiin Chang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Lu Lu, Guangzhi Sun, Zhehuai Chen, Ji Wu, Bowen Zhou, Yuxuan Wang, Tara Sainath, Yonghui Wu, Chao Zhang
Main category: eess.AS
TL;DR: Survey on integrating audio into LLMs for enhanced comprehension, generation, interaction, and multimodal understanding towards audio-native AGI systems.
Details
Motivation: Computer audition needs to evolve beyond traditional paradigms to leverage foundation models for comprehensive understanding, natural generation, and human-like interaction in the era of LLMs and AGI.Method: Comprehensive review and analysis of recent progress in four key areas: audio comprehension, audio generation, speech-based interaction, and audio-visual understanding.
Result: LLMs are reshaping audio perception and reasoning, enabling deeper semantic understanding of sound, expressive audio generation, human-like spoken interaction, and enhanced multimodal intelligence through audio-visual fusion.
Conclusion: Identifies critical challenges and future directions for building audio-native AGI systems capable of perceiving, understanding, and interacting through sound as naturally as humans.
Abstract: In the era of large language models (LLMs) and artificial general intelligence (AGI), computer audition must evolve beyond traditional paradigms to fully leverage the capabilities of foundation models, towards more comprehensive understanding, more natural generation and more human-like interaction. Audio, as a modality rich in semantic, emotional, and contextual cues, plays a vital role in achieving naturalistic and embodied machine intelligence. This survey provides a comprehensive review of recent progress in integrating audio into LLMs, with a focus on four key areas: audio comprehension, audio generation, speech-based interaction, and audio-visual understanding. We analyze how LLMs are reshaping audio perception and reasoning, enabling systems to understand sound at a deeper semantic level, generate expressive audio outputs, and engage in human-like spoken interaction. Furthermore, we explore how the fusion of audio and visual modalities enhances situational awareness and cross-modal reasoning, pushing the boundaries of multimodal intelligence. This survey not only synthesizes existing research but also identifies critical challenges and future directions for building audio-native AGI systems capable of perceiving, understanding, and interacting through sound as naturally as humans do.
[917] AudioNet: Supervised Deep Hashing for Retrieval of Similar Audio Events
Sagar Dutta, Vipul Arora
Main category: eess.AS
TL;DR: AudioNet is a supervised deep hashing method that generates binary hash codes for efficient retrieval of similar audio events using audio queries, achieving state-of-the-art performance on multiple datasets.
Details
Motivation: To develop an efficient method for retrieving similar audio events using deep learning-based hashing, addressing the need for effective audio event retrieval systems.Method: Uses a deep learning system with discrete gradient propagation to optimize binary hash codes, incorporating a novel loss function with weighted contrastive and pairwise loss components plus hashcode balancing.
Result: Achieves high retrieval performance on multiple standard datasets, setting new benchmarks and showing effectiveness even with imbalanced datasets.
Conclusion: AudioNet establishes a strong baseline for future studies on efficient audio event retrieval using deep audio embeddings, demonstrating promising performance and systematic benefits.
Abstract: This work presents a supervised deep hashing method for retrieving similar audio events. The proposed method, named AudioNet, is a deep-learning-based system for efficient hashing and retrieval of similar audio events using an audio example as a query. AudioNet achieves high retrieval performance on multiple standard datasets by generating binary hash codes for similar audio events, setting new benchmarks in the field, and highlighting its efficacy and effectiveness compared to other hashing methods. Through comprehensive experiments on standard datasets, our research represents a pioneering effort in evaluating the retrieval performance of similar audio events. A novel loss function is proposed which incorporates weighted contrastive and weighted pairwise loss along with hashcode balancing to improve the efficiency of audio event retrieval. The method adopts discrete gradient propagation, which allows gradients to be propagated through discrete variables during backpropagation. This enables the network to optimize the discrete hash codes using standard gradient-based optimization algorithms, which are typically used for continuous variables. The proposed method showcases promising retrieval performance, as evidenced by the experimental results, even when dealing with imbalanced datasets. The systematic analysis conducted in this study further supports the significant benefits of the proposed method in retrieval performance across multiple datasets. The findings presented in this work establish a baseline for future studies on the efficient retrieval of similar audio events using deep audio embeddings.
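The "discrete gradient propagation" described above is commonly realized with a straight-through estimator; the PyTorch sketch below shows one plausible form, with the BinaryHash name and the placeholder loss being assumptions rather than the authors' code.

```python
# Straight-through estimator for binary hash codes: discrete in the forward
# pass, identity gradient in the backward pass.
import torch

class BinaryHash(torch.autograd.Function):
    @staticmethod
    def forward(ctx, logits):
        # Discrete step: map real-valued embeddings to {-1, +1} hash codes.
        return torch.sign(logits)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pass gradients through sign() unchanged, so
        # standard optimizers can still update the continuous embedding.
        return grad_output

embeddings = torch.randn(8, 64, requires_grad=True)  # audio embeddings
codes = BinaryHash.apply(embeddings)                  # 64-bit hash codes
loss = codes.sum()                                    # placeholder loss to show gradient flow
loss.backward()                                       # gradients reach `embeddings`
```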
[918] Leveraging Language Information for Target Language Extraction
Mehmet Sinan Yıldırım, Ruijie Tao, Wupeng Wang, Junyi Ao, Haizhou Li
Main category: eess.AS
TL;DR: A novel end-to-end framework that leverages speech pre-trained models to extract target language speech from multilingual mixtures, achieving significant improvements in extraction quality.
Details
Motivation: Conventional extraction systems lack prior language knowledge, while human auditory systems excel at this task. Speech pre-trained models can provide the missing language knowledge to improve extraction performance.Method: Proposed an end-to-end framework that uses language knowledge from speech pre-trained models to guide the extraction model in capturing target language characteristics. Created the first publicly available multilingual dataset for Target Language Extraction.
Result: Achieved improvements of 1.22 dB in SI-SNR for English extraction and 1.12 dB for German extraction from mixtures containing both languages.
Conclusion: Leveraging language knowledge from speech pre-trained models effectively improves target language extraction quality from multilingual speech mixtures.
Abstract: Target Language Extraction aims to extract speech in a specific language from a mixture waveform that contains multiple speakers speaking different languages. The human auditory system is adept at performing this task with the knowledge of the particular language. However, the performance of the conventional extraction systems is limited by the lack of this prior knowledge. Speech pre-trained models, which capture rich linguistic and phonetic representations from large-scale in-the-wild corpora, can provide this missing language knowledge to these systems. In this work, we propose a novel end-to-end framework to leverage language knowledge from speech pre-trained models. This knowledge is used to guide the extraction model to better capture the target language characteristics, thereby improving extraction quality. To demonstrate the effectiveness of our proposed approach, we construct the first publicly available multilingual dataset for Target Language Extraction. Experimental results show that our method achieves improvements of 1.22 dB and 1.12 dB in SI-SNR for English and German extraction, respectively, from mixtures containing both languages.
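SI-SNR, the metric used to report the 1.22 dB and 1.12 dB gains above, has a standard definition that can be computed as follows; this is a reference implementation, not code from the paper.

```python
# Scale-invariant SNR (SI-SNR) in dB for two 1-D waveforms of equal length.
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to remove scale differences.
    s_target = (torch.dot(estimate, target) / (target.pow(2).sum() + eps)) * target
    e_noise = estimate - s_target
    return 10 * torch.log10(s_target.pow(2).sum() / (e_noise.pow(2).sum() + eps))

print(si_snr(torch.randn(16000), torch.randn(16000)))
```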
[919] Aligning Speech to Languages to Enhance Code-switching Speech Recognition
Hexin Liu, Xiangyu Zhang, Haoyang Zhang, Leibny Paola Garcia, Andy W. H. Khong, Eng Siong Chng, Shinji Watanabe
Main category: eess.AS
TL;DR: This paper proposes a language alignment loss (LAL) for code-switching speech recognition that aligns acoustic features to pseudo-language labels learned during ASR training, enabling frame-level language identification without annotations. It also introduces LLM-based generative error correction guided by linguistic hints from LAL outputs.
Details
Motivation: Code-switching in speech causes language confusion for automatic speech recognition systems, requiring methods to handle mixed-language scenarios without explicit language annotations.Method: Proposes language alignment loss (LAL) that aligns acoustic features to pseudo-language labels learned from ASR decoder during training. Also employs large language models via generative error correction with linguistic hints derived from LAL outputs and decoded hypotheses.
Result: LAL improves CS-ASR performance for both hybrid CTC/attention and Whisper models on SEAME and ASRU 2019 datasets with negligible parameter increase. Achieves 8.6% relative improvement on ASRU dataset and 14.1%/5.5% improvements with LLM-based error correction on ASRU/SEAME test sets respectively.
Conclusion: Language alignment loss effectively addresses code-switching ASR challenges by enabling frame-level language identification without annotations and balancing bilingual data during training, while LLM-based error correction with linguistic hints further enhances performance.
Abstract: Code-switching (CS) refers to the switching of languages within a speech signal and results in language confusion for automatic speech recognition (ASR). To address language confusion, we propose a language alignment loss (LAL) that aligns acoustic features to pseudo-language labels learned from the ASR decoder during ASR training. This approach enables frame-level language identification without the need for frame-level language annotations. To further tackle the complex token alternatives for language modeling in bilingual scenarios, we propose to employ large language models via a generative error correction method. A linguistic hint, derived from LAL outputs and decoded hypotheses, is introduced to guide the prompting and enhance the LLM-based generative error correction for CS-ASR. The proposed methods are evaluated on the SEAME dataset and data from the ASRU 2019 Mandarin-English code-switching speech recognition challenge. The incorporation of the proposed language alignment loss improves CS-ASR performance for both hybrid CTC/attention and Whisper models on both datasets, with only a negligible increase in the number of parameters. This work also highlights the efficacy of language alignment loss in balancing primary-language-dominant bilingual data during training, with an 8.6% relative improvement on the ASRU dataset compared to the baseline model. Performance evaluation using large language models reveals the advantage of the linguistic hint by achieving 14.1% and 5.5% relative improvement on test sets of the ASRU and SEAME datasets, respectively.
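A minimal sketch of a frame-level language alignment loss in the spirit of LAL is shown below: acoustic encoder frames are projected to language logits and trained against pseudo-language labels derived from the ASR decoder. The projection layer, label source, and tensor shapes are illustrative assumptions.

```python
# Frame-level language alignment loss sketch: cross-entropy between projected
# acoustic frames and pseudo-language labels (no frame-level annotations needed).
import torch
import torch.nn.functional as F

def language_alignment_loss(encoder_frames, pseudo_lang_labels, lang_proj):
    """
    encoder_frames:     (batch, frames, dim) acoustic encoder outputs
    pseudo_lang_labels: (batch, frames) integer ids, e.g. {0: Mandarin, 1: English}
    lang_proj:          nn.Linear mapping dim -> num_languages
    """
    logits = lang_proj(encoder_frames)                       # (B, T, L)
    return F.cross_entropy(logits.transpose(1, 2), pseudo_lang_labels)

lang_proj = torch.nn.Linear(256, 2)   # adds only a negligible number of parameters
frames = torch.randn(4, 100, 256)
labels = torch.randint(0, 2, (4, 100))
print(language_alignment_loss(frames, labels, lang_proj))
```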
[920] Instance-Specific Test-Time Training for Speech Editing in the Wild
Taewoo Kim, Uijong Lee, Hayoung Park, Choongsang Cho, Nam In Park, Young Han Lee
Main category: eess.AS
TL;DR: A test-time training method for speech editing that uses instance-specific adaptation to handle diverse acoustic conditions, employing direct supervision from ground-truth features and indirect supervision via auxiliary losses for smooth acoustic transitions.
Details
Motivation: Previous speech editing systems struggle with unseen and diverse acoustic conditions in real-world scenarios, leading to degraded performance.Method: Instance-specific test-time training with direct supervision from ground-truth acoustic features in unedited regions and indirect supervision in edited regions using auxiliary losses based on duration constraints and phoneme prediction.
Result: Outperforms existing speech editing systems in both objective and subjective evaluations on in-the-wild benchmark datasets.
Conclusion: The proposed method effectively handles diverse acoustic conditions, mitigates bandwidth discontinuity, enables precise speech rate control, and improves editing performance in real-world scenarios.
Abstract: Speech editing systems aim to naturally modify speech content while preserving acoustic consistency and speaker identity. However, previous studies often struggle to adapt to unseen and diverse acoustic conditions, resulting in degraded editing performance in real-world scenarios. To address this, we propose an instance-specific test-time training method for speech editing in the wild. Our approach employs direct supervision from ground-truth acoustic features in unedited regions and indirect supervision in edited regions via auxiliary losses based on duration constraints and phoneme prediction. This strategy mitigates the bandwidth discontinuity problem in speech editing, ensuring smooth acoustic transitions between unedited and edited regions. Additionally, it enables precise control over speech rate by adapting the model to target durations via mask length adjustment during test-time training. Experiments on in-the-wild benchmark datasets demonstrate that our method outperforms existing speech editing systems in both objective and subjective evaluations.
[921] Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges
Samuele Cornell, Christoph Boeddeker, Taejin Park, He Huang, Desh Raj, Matthew Wiesner, Yoshiki Masuyama, Xuankai Chang, Zhong-Qiu Wang, Stefano Squartini, Paola Garcia, Shinji Watanabe
Main category: eess.AS
TL;DR: The CHiME-7/8 challenges focused on multi-channel speech recognition and diarization, revealing key trends: shift to end-to-end ASR systems, continued reliance on guided source separation, importance of accurate diarization refinement, weak correlation between transcription quality and downstream tasks, and persistent challenges in transcribing spontaneous speech.
Details
Motivation: To advance state-of-the-art in distant speech recognition by addressing multi-channel, generalizable joint ASR and diarization of conversational speech through community challenges.Method: Organized CHiME-7/8 challenges with 9 teams submitting 32 systems, analyzed participant submissions, evaluated using specific metrics and datasets with baseline systems.
Result: Key findings include: transition to end-to-end ASR systems, continued use of guided source separation over neural SSE, importance of diarization refinement, weak correlation between transcription quality and downstream tasks, and persistent challenges in spontaneous speech transcription.
Conclusion: Despite progress, accurately transcribing spontaneous speech in challenging environments remains difficult, and current neural speech separation techniques still struggle with complex scenarios, while downstream evaluation metrics may not fully capture transcription quality.
Abstract: The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. With participation from 9 teams submitting 32 diverse systems, these challenges have contributed to state-of-the-art research in the field. This paper outlines the challenges’ design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions. From this analysis it emerges that: 1) Most participants use end-to-end (e2e) ASR systems, whereas hybrid systems were prevalent in previous CHiME challenges. This transition is mainly due to the availability of robust large-scale pre-trained models, which lowers the data burden for e2e-ASR. 2) Despite recent advances in neural speech separation and enhancement (SSE), all teams still heavily rely on guided source separation, suggesting that current neural SSE techniques are still unable to reliably deal with complex scenarios and different recording setups. 3) All best systems employ diarization refinement via target-speaker diarization techniques. Accurate speaker counting in the first diarization pass is thus crucial to avoid compounding errors and CHiME-8 DASR participants especially focused on this part. 4) Downstream evaluation via meeting summarization can correlate weakly with transcription quality due to the remarkable effectiveness of large-language models in handling errors. On the NOTSOFAR-1 scenario, even systems with over 50% time-constrained minimum permutation WER can perform roughly on par with the most effective ones (around 11%). 5) Despite recent progress, accurately transcribing spontaneous speech in challenging acoustic environments remains difficult, even when using computationally intensive system ensembles.
[922] DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models
Kevin Wilkinghoff, Zheng-Hua Tan
Main category: eess.AS
TL;DR: DSpAST is a novel audio encoder that learns disentangled representations of spatial audio with only 0.2% additional parameters, significantly outperforming SpatialAST in spatial audio reasoning tasks.
Details
Motivation: Current spatial audio encoders struggle to capture all required information (sound event types, direction, distance) in a single encoder, leading to worse performance compared to task-specific encoders.Method: Developed DSpAST based on SpatialAST architecture, focusing on learning disentangled representations of spatial audio with minimal parameter increase.
Result: Experiments on SpatialSoundQA with BAT system show DSpAST significantly outperforms SpatialAST in spatial audio reasoning tasks.
Conclusion: DSpAST successfully addresses the limitations of single audio encoders by learning disentangled representations while maintaining parameter efficiency.
Abstract: Reasoning about spatial audio with large language models requires a spatial audio encoder as an acoustic front-end to obtain audio embeddings for further processing. Such an encoder needs to capture all information required to detect the type of sound events, as well as the direction and distance of their corresponding sources. Accomplishing this with a single audio encoder is demanding as the information required for each of these tasks is mostly independent of each other. As a result, the performance obtained with a single encoder is often worse than when using task-specific audio encoders. In this work, we present DSpAST, a novel audio encoder based on SpatialAST that learns disentangled representations of spatial audio while having only 0.2% additional parameters. Experiments on SpatialSoundQA with the spatial audio reasoning system BAT demonstrate that DSpAST significantly outperforms SpatialAST.
[923] Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs
Anand, Umberto Cappellazzo, Stavros Petridis, Maja Pantic
Main category: eess.AS
TL;DR: This paper studies attention sinks and massive activations in multimodal speech recognition LLMs, identifies their patterns across ASR, VSR, and AVSR tasks, and proposes a decorrelation loss to mitigate these phenomena while improving performance under high downsampling.
Details
Motivation: To understand the internal dynamics of fine-tuned LLMs in multimodal speech recognition and address the limitations in current understanding of attention sinks and massive activations in this domain.Method: Conducted detailed analysis of audio-visual LLMs to identify attention sinks and massive activations, then introduced a simple decorrelation loss that reduces cosine similarity between BOS and other tokens to mitigate these phenomena.
Result: Identified attention sinks and massive activations at BOS and intermediate low-semantic tokens across ASR, VSR, and AVSR. Showed that massive activations originate in MLP layers and correspond to fixed feature indices. The decorrelation loss effectively mitigated intermediate sinks and massive activations while improving WER under high audio-visual feature downsampling.
Conclusion: The study provides insights into multimodal LLM dynamics and demonstrates that decorrelation loss can effectively address attention sinks and massive activations while maintaining or improving performance in speech recognition tasks.
Abstract: Large language models (LLMs) have recently advanced auditory speech recognition (ASR), visual speech recognition (VSR), and audio-visual speech recognition (AVSR). However, understanding of their internal dynamics under fine-tuning remains limited. In natural language processing, recent work has revealed attention sinks, tokens that attract disproportionately high attention, and associated massive activations in which some features of sink tokens exhibit huge activation in LLMs. In this work, we are the first to study these phenomena in multimodal speech recognition. Through a detailed analysis of audio-visual LLMs, we identify attention sinks and massive activations not only at the BOS token but also at intermediate low-semantic tokens across ASR, VSR, and AVSR. We show that massive activations originate in the MLP layers and correspond to fixed feature indices across all sink tokens. We further show that intermediate sink tokens exhibit high cosine similarity to the BOS token, thereby amplifying attention and activation. Building on these insights, we introduce a simple decorrelation loss that reduces cosine similarity between BOS and other tokens, effectively mitigating intermediate sinks and massive activations. Furthermore, our method improves word error rate (WER) under high audio-visual feature downsampling while remaining stable at lower downsampling rates.
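The decorrelation idea can be sketched as a penalty on the cosine similarity between the BOS hidden state and all other token states; the squared form and the placement in the training objective below are assumptions, not the paper's exact loss.

```python
# Decorrelation penalty sketch: discourage non-BOS tokens from aligning with
# the BOS hidden state, so intermediate tokens stop acting as attention sinks.
import torch
import torch.nn.functional as F

def decorrelation_loss(hidden_states: torch.Tensor) -> torch.Tensor:
    """hidden_states: (batch, seq_len, dim); position 0 is assumed to be the BOS token."""
    bos = hidden_states[:, :1, :]                      # (B, 1, D)
    others = hidden_states[:, 1:, :]                   # (B, T-1, D)
    cos = F.cosine_similarity(bos.expand_as(others), others, dim=-1)
    return cos.pow(2).mean()                           # push other tokens towards orthogonality

hidden = torch.randn(2, 50, 768, requires_grad=True)
decorrelation_loss(hidden).backward()  # would be added, with some weight, to the ASR loss
```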
eess.IV
[924] Towards Reliable Pediatric Brain Tumor Segmentation: Task-Specific nnU-Net Enhancements
Xiaolong Li, Zhi-Qin John Xu, Yan Ren, Tianming Qiu, Xiaowen Wang
Main category: eess.IV
TL;DR: An advanced nnU-Net framework with widened residual encoder, SE attention, 3D depthwise separable convolutions, specificity-driven regularization, and Gaussian weight initialization achieved top performance in pediatric brain tumor segmentation on BraTS 2025 Task-6.
Details
Motivation: Pediatric brain tumor segmentation in mpMRI faces challenges due to limited data, high anatomical variability, and heterogeneous imaging across institutions, requiring specialized solutions.Method: Enhanced nnU-Net with widened residual encoder with squeeze-and-excitation attention, 3D depthwise separable convolutions, specificity-driven regularization, Gaussian weight initialization, and postprocessing steps.
Result: Achieved first place on BraTS 2025 Task-6 validation leaderboard with lesion-wise Dice scores: 0.759 (CC), 0.967 (ED), 0.826 (ET), 0.910 (NET), 0.928 (TC), and 0.928 (WT).
Conclusion: The proposed advanced nnU-Net framework effectively addresses pediatric brain tumor segmentation challenges and demonstrates state-of-the-art performance on the largest public pediatric high-grade glioma dataset.
Abstract: Accurate segmentation of pediatric brain tumors in multi-parametric magnetic resonance imaging (mpMRI) is critical for diagnosis, treatment planning, and monitoring, yet faces unique challenges due to limited data, high anatomical variability, and heterogeneous imaging across institutions. In this work, we present an advanced nnU-Net framework tailored for BraTS 2025 Task-6 (PED), the largest public dataset of pre-treatment pediatric high-grade gliomas. Our contributions include: (1) a widened residual encoder with squeeze-and-excitation (SE) attention; (2) 3D depthwise separable convolutions; (3) a specificity-driven regularization term; and (4) small-scale Gaussian weight initialization. We further refine predictions with two postprocessing steps. Our models achieved first place on the Task-6 validation leaderboard, attaining lesion-wise Dice scores of 0.759 (CC), 0.967 (ED), 0.826 (ET), 0.910 (NET), 0.928 (TC) and 0.928 (WT).
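Two of the listed building blocks, 3D squeeze-and-excitation attention and 3D depthwise separable convolution, are sketched below as they are commonly implemented; channel sizes and the reduction ratio are illustrative rather than the authors' settings.

```python
# Common implementations of a 3D SE gate and a 3D depthwise separable conv.
import torch
import torch.nn as nn

class SE3D(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)                 # squeeze
        self.fc = nn.Sequential(                            # excitation
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c = x.shape[:2]
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1, 1)
        return x * w                                        # channel re-weighting

class DepthwiseSeparableConv3d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv3d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)
        self.pointwise = nn.Conv3d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 16, 64, 64)                          # (B, C, D, H, W)
print(SE3D(32)(x).shape, DepthwiseSeparableConv3d(32, 64)(x).shape)
```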
[925] Investigating Label Bias and Representational Sources of Age-Related Disparities in Medical Segmentation
Aditya Parikh, Sneha Das, Aasa Feragen
Main category: eess.IV
TL;DR: Algorithmic bias in medical imaging segmentation disproportionately affects younger patients in breast cancer detection, primarily due to intrinsic learning challenges rather than label quality issues.
Details
Motivation: To understand the causes of algorithmic bias in medical segmentation tasks, particularly in breast cancer imaging where younger patients experience significant performance disparities, and to investigate whether bias stems from label quality or inherent image characteristics.Method: Audited the MAMA-MIA dataset to establish baseline age-related bias, conducted controlled experiments to test hypotheses about bias origins, and systematically refuted explanations related to label quality sensitivity and case difficulty imbalance.
Result: Revealed a ‘Biased Ruler effect’ where flawed validation labels misrepresent actual model bias, demonstrated that younger patient cases are intrinsically harder to learn, and showed that balancing training data by difficulty fails to mitigate disparities.
Conclusion: Systemic bias is learned and amplified from machine-generated labels, and achieving fairness requires addressing qualitative distributional differences rather than merely balancing case counts.
Abstract: Algorithmic bias in medical imaging can perpetuate health disparities, yet its causes remain poorly understood in segmentation tasks. While fairness has been extensively studied in classification, segmentation remains underexplored despite its clinical importance. In breast cancer segmentation, models exhibit significant performance disparities against younger patients, commonly attributed to physiological differences in breast density. We audit the MAMA-MIA dataset, establishing a quantitative baseline of age-related bias in its automated labels, and reveal a critical Biased Ruler effect where systematically flawed labels for validation misrepresent a model’s actual bias. However, whether this bias originates from lower-quality annotations (label bias) or from fundamentally more challenging image characteristics remains unclear. Through controlled experiments, we systematically refute hypotheses that the bias stems from label quality sensitivity or quantitative case difficulty imbalance. Balancing training data by difficulty fails to mitigate the disparity, revealing that younger patient cases are intrinsically harder to learn. We provide direct evidence that systemic bias is learned and amplified when training on biased, machine-generated labels, a critical finding for automated annotation pipelines. This work introduces a systematic framework for diagnosing algorithmic bias in medical segmentation and demonstrates that achieving fairness requires addressing qualitative distributional differences rather than merely balancing case counts.
[926] Image-based ground distance detection for crop-residue-covered soil
Baochao Wang, Xingyu Zhang, Qingtao Zong, Alim Pulatov, Shuqi Shang, Dongwei Wang
Main category: eess.IV
TL;DR: An image-based method using 3D and RGB cameras to accurately measure ground distance through crop residue by distinguishing soil from residue areas, achieving ±3mm precision for seeding depth control in conservation agriculture.
Details
Motivation: Current distance sensors (laser, ultrasonic, mechanical) cannot differentiate between crop residues and soil, making precise seeding depth control impossible in conservation agriculture where soil is covered with crop residues.Method: Uses 3D camera for depth image and RGB camera for color image simultaneously. Color image distinguishes residue and soil areas to generate a mask, which is applied to depth image to exclude residue areas and calculate ground distance from soil only.
Result: Method is feasible for real-time implementation with measurement error within ±3mm, enabling precise ground distance detection through crop residue coverage.
Conclusion: This approach can be applied in conservation agriculture machinery for precision depth seeding and other depth-control applications like transplanting or tillage.
Abstract: Conservation agriculture features a soil surface covered with crop residues, which brings benefits such as improved soil health and water conservation. However, one significant challenge in conservation agriculture lies in precisely controlling the seeding depth on soil covered with crop residues. This is constrained by the lack of ground distance information, since current distance measurement techniques, like laser, ultrasonic, or mechanical displacement sensors, are incapable of differentiating whether the distance information comes from the residue or the soil. This paper presents an image-based method to obtain ground distance information for crop-residue-covered soil. The method uses a 3D camera and an RGB camera to capture a depth image and a color image simultaneously. The color image is used to distinguish residue areas from soil areas and to generate a mask image. The mask is applied to the depth image so that only soil-area depth information is used to calculate the ground distance, while residue areas are recognized and excluded from ground distance detection. Experiments show that this distance measurement method is feasible for real-time implementation, with a measurement error within plus or minus 3 mm. It can be applied in conservation agriculture machinery for precision-depth seeding, as well as in other depth-control applications such as transplanting or tillage.
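The masking step can be illustrated with a short OpenCV/NumPy sketch; the HSV threshold used to separate residue from soil here is only a placeholder for the paper's actual color-based segmentation, and the input arrays are synthetic stand-ins for an aligned RGB/depth frame pair.

```python
# Sketch: build a soil mask from the color image and use only soil pixels of
# the depth image to estimate ground distance.
import cv2
import numpy as np

def ground_distance_mm(color_bgr: np.ndarray, depth_mm: np.ndarray) -> float:
    """Estimate soil-to-camera distance, ignoring pixels covered by residue."""
    hsv = cv2.cvtColor(color_bgr, cv2.COLOR_BGR2HSV)
    # Placeholder rule: straw-like residue tends to be brighter/yellower than soil.
    residue_mask = cv2.inRange(hsv, np.array([15, 40, 120]), np.array([40, 255, 255]))
    soil_mask = cv2.bitwise_not(residue_mask)
    soil_depths = depth_mm[(soil_mask > 0) & (depth_mm > 0)]
    return float(np.median(soil_depths))   # robust estimate of the ground distance

# Synthetic stand-ins for one aligned RGB/depth frame pair.
color = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
depth = np.random.uniform(300, 500, (480, 640)).astype(np.float32)   # millimetres
print(f"ground distance: {ground_distance_mm(color, depth):.1f} mm")
```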
[927] GDROS: A Geometry-Guided Dense Registration Framework for Optical-SAR Images under Large Geometric Transformations
Zixuan Sun, Shuaifeng Zhi, Ruize Li, Jingyuan Xia, Yongxiang Liu, Weidong Jiang
Main category: eess.IV
TL;DR: GDROS is a geometry-guided dense registration framework for optical-SAR image pairs that addresses modal discrepancy challenges through cross-modal feature extraction and geometric constraints.
Details
Motivation: Registration of optical and SAR images is challenging due to severe nonlinear radiometric differences, geometric distortions, and noise variations. Existing methods struggle with reliable registration under large geometric transformations.Method: Extract cross-modal deep features using CNN-Transformer hybrid module, build multi-scale 4D correlation volume for pixel-wise dense correspondences, and apply least squares regression to geometrically constrain the optical flow field with affine transformation.
Result: Extensive experiments on WHU-Opt-SAR, OS, and UBCv2 datasets show GDROS significantly outperforms state-of-the-art methods across all metrics and different spatial resolutions.
Conclusion: GDROS provides robust performance for optical-SAR image registration by leveraging global cross-modal interactions and geometric guidance, effectively handling modal discrepancy challenges.
Abstract: Registration of optical and synthetic aperture radar (SAR) remote sensing images serves as a critical foundation for image fusion and visual navigation tasks. This task is particularly challenging because of their modal discrepancy, primarily manifested as severe nonlinear radiometric differences (NRD), geometric distortions, and noise variations. Under large geometric transformations, existing classical template-based and sparse keypoint-based strategies struggle to achieve reliable registration results for optical-SAR image pairs. To address these limitations, we propose GDROS, a geometry-guided dense registration framework leveraging global cross-modal image interactions. First, we extract cross-modal deep features from optical and SAR images through a CNN-Transformer hybrid feature extraction module, upon which a multi-scale 4D correlation volume is constructed and iteratively refined to establish pixel-wise dense correspondences. Subsequently, we implement a least squares regression (LSR) module to geometrically constrain the predicted dense optical flow field. Such geometry guidance mitigates prediction divergence by directly imposing an estimated affine transformation on the final flow predictions. Extensive experiments have been conducted on three representative datasets with different spatial resolutions (WHU-Opt-SAR, OS, and UBCv2), demonstrating the robust performance of our proposed method across imaging resolutions. Qualitative and quantitative results show that GDROS significantly outperforms current state-of-the-art methods in all metrics. Our source code will be released at: https://github.com/Zi-Xuan-Sun/GDROS.
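The least-squares regression (LSR) constraint can be sketched as fitting a single affine transform to the correspondences implied by the predicted flow and regenerating a geometrically consistent flow from it; variable names and shapes below are illustrative, not the GDROS code.

```python
# Fit an affine transform to a dense flow field by least squares, then
# regenerate a flow field that exactly follows that affine model.
import numpy as np

def fit_affine_from_flow(flow: np.ndarray) -> np.ndarray:
    """flow: (H, W, 2) predicted flow. Returns a 2x3 affine matrix."""
    h, w, _ = flow.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], axis=1)   # (N, 3)
    dst = src[:, :2] + flow.reshape(-1, 2)                             # (N, 2)
    sol, *_ = np.linalg.lstsq(src, dst, rcond=None)                    # (3, 2)
    return sol.T                                                        # (2, 3)

def affine_flow(affine: np.ndarray, h: int, w: int) -> np.ndarray:
    """Flow field implied by the fitted affine transform."""
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], axis=1)
    warped = pts @ affine.T
    return (warped - pts[:, :2]).reshape(h, w, 2)

pred_flow = np.random.randn(64, 64, 2)
A = fit_affine_from_flow(pred_flow)
constrained_flow = affine_flow(A, 64, 64)
```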
[928] Been There, Scanned That: Nostalgia-Driven LiDAR Compression for Self-Driving Cars
Ali Khalid, Jaiaid Mobin, Sumanth Rao Appala, Avinash Maurya, Stephany Berrio Perez, M. Mustafa Rafique, Fawad Ahmad
Main category: eess.IV
TL;DR: DejaView is a compression system for autonomous vehicle LiDAR data that exploits long-term temporal redundancies over days/months, achieving 210x compression with 15cm error.
Details
Motivation: Autonomous vehicles generate terabytes of LiDAR data daily, creating high network and storage costs for cloud transfer and analysis.Method: Uses diff operations to represent point clouds as deltas relative to past 3D data, leveraging that vehicles repeatedly traverse the same routes.
Result: Achieves 210x compression ratio with only 15cm reconstruction error using two months of real LiDAR data.
Conclusion: Long-term temporal redundancies in autonomous vehicle routes enable highly effective compression, significantly reducing data transfer and storage costs.
Abstract: An autonomous vehicle can generate several terabytes of sensor data per day. A significant portion of this data consists of 3D point clouds produced by depth sensors such as LiDARs. This data must be transferred to cloud storage, where it is utilized for training machine learning models or conducting analyses, such as forensic investigations in the event of an accident. To reduce network and storage costs, this paper introduces DejaView. Although prior work uses interframe redundancies to compress data, DejaView searches for and uses redundancies on larger temporal scales (days and months) for more effective compression. We designed DejaView with the insight that the operating area of autonomous vehicles is limited and that vehicles mostly traverse the same routes daily. Consequently, the 3D data they collect daily is likely similar to the data they have captured in the past. To capture this, the core of DejaView is a diff operation that compactly represents point clouds as deltas w.r.t. 3D data from the past. Using two months of LiDAR data, an end-to-end implementation of DejaView can compress point clouds by a factor of 210 at a reconstruction error of only 15 cm.
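A toy version of the "diff against the past" idea is sketched below: voxelize today's scan and keep only points whose voxels are absent from a historical map of the same route. The voxel size and data layout are assumptions; DejaView's actual diff operation is more sophisticated.

```python
# Toy delta encoding of a LiDAR scan against an accumulated historical map.
import numpy as np

def voxelize(points: np.ndarray, voxel: float = 0.2) -> set:
    """points: (N, 3) LiDAR points in metres -> set of occupied voxel indices."""
    return set(map(tuple, np.floor(points / voxel).astype(np.int64)))

def delta_encode(today: np.ndarray, history: set, voxel: float = 0.2) -> np.ndarray:
    """Keep only points whose voxel is not already in the historical map."""
    keys = np.floor(today / voxel).astype(np.int64)
    new_mask = np.array([tuple(k) not in history for k in keys])
    return today[new_mask]

history_map = voxelize(np.random.rand(100_000, 3) * 50)   # accumulated past scans
scan_today = np.random.rand(120_000, 3) * 50
delta = delta_encode(scan_today, history_map)
print(f"stored {len(delta)} of {len(scan_today)} points")
```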
[929] Deep Generative Models for Enhanced Vitreous OCT Imaging
Simone Sarrocco, Philippe C. Cattin, Peter M. Maloca, Paul Friedrich, Philippe Valmaggia
Main category: eess.IV
TL;DR: Deep learning models were evaluated for enhancing vitreous OCT image quality and reducing acquisition time. cDDPM showed the best clinical performance despite U-Net having superior quantitative metrics, demonstrating potential for fourfold acquisition time reduction.
Details
Motivation: To improve vitreous optical coherence tomography (OCT) image quality while significantly reducing acquisition time, which is important for clinical efficiency and patient comfort.Method: Used multiple DL models including cDDPMs, BBDMs, U-Net, Pix2Pix, and VQ-GAN to generate high-quality SD vitreous OCT images from lower-quality inputs. Evaluated using image quality metrics (PSNR, SSIM, LPIPS) and Visual Turing Tests by ophthalmologists.
Result: U-Net achieved best quantitative metrics (PSNR: 30.230, SSIM: 0.820), but cDDPM performed best in clinical evaluation with highest Visual Turing Test ranking (3.07), 32.9% fool rate, and 85.7% anatomical preservation. cDDPM generated vitreous regions more similar to reference than true ART1 or ART10 B-scans.
Conclusion: Discrepancies exist between quantitative metrics and clinical evaluation, requiring combined assessment. cDDPM shows strong potential for generating clinically meaningful vitreous OCT images while reducing acquisition time fourfold, with promise for clinical integration.
Abstract: Purpose: To evaluate deep learning (DL) models for enhancing vitreous optical coherence tomography (OCT) image quality and reducing acquisition time. Methods: Conditional Denoising Diffusion Probabilistic Models (cDDPMs), Brownian Bridge Diffusion Models (BBDMs), U-Net, Pix2Pix, and Vector-Quantised Generative Adversarial Network (VQ-GAN) were used to generate high-quality spectral-domain (SD) vitreous OCT images. Inputs were SD ART10 images, and outputs were compared to pseudoART100 images obtained by averaging ten ART10 images per eye location. Model performance was assessed using image quality metrics and Visual Turing Tests, where ophthalmologists ranked generated images and evaluated anatomical fidelity. The best model’s performance was further tested within the manually segmented vitreous on newly acquired data. Results: U-Net achieved the highest Peak Signal-to-Noise Ratio (PSNR: 30.230) and Structural Similarity Index Measure (SSIM: 0.820), followed by cDDPM. For Learned Perceptual Image Patch Similarity (LPIPS), Pix2Pix (0.697) and cDDPM (0.753) performed best. In the first Visual Turing Test, cDDPM ranked highest (3.07); in the second (best model only), cDDPM achieved a 32.9% fool rate and 85.7% anatomical preservation. On newly acquired data, cDDPM generated vitreous regions more similar in PSNR to the ART100 reference than true ART1 or ART10 B-scans and achieved higher PSNR on whole images when conditioned on ART1 than ART10. Conclusions: Results reveal discrepancies between quantitative metrics and clinical evaluation, highlighting the need for combined assessment. cDDPM showed strong potential for generating clinically meaningful vitreous OCT images while reducing acquisition time fourfold. Translational Relevance: cDDPMs show promise for clinical integration, supporting faster, higher-quality vitreous imaging. Dataset and code will be made publicly available.
[930] Evaluating Video Quality Metrics for Neural and Traditional Codecs using 4K/UHD-1 Videos
Benjamin Herb, Rakesh Rao Ramachandra Rao, Steve Göring, Alexander Raake
Main category: eess.IV
TL;DR: This paper presents a subjective quality assessment study comparing traditional (AV1, VVC) and neural video codecs (DCVC-FM, DCVC-RT) to evaluate the validity of existing quality metrics for neural video compression.
Details
Motivation: With neural video codecs emerging as alternatives to traditional methods, it's important to determine whether existing quality metrics remain valid for evaluating their performance, as few studies have systematically investigated this using well-designed subjective tests.Method: Conducted subjective quality assessment using 6 source videos encoded at 4 resolutions with 9 QP values, resulting in 216 sequences rated by 30 participants. Evaluated full-reference, hybrid, and no-reference quality metrics on the subjective data.
Result: VMAF and AVQBits|H0|f showed strong Pearson correlation, FasterVQA performed best among no-reference metrics, and PSNR had highest Spearman correlation for within-sequence comparisons. No significant performance differences in metric reliability were observed between traditional and neural codecs.
Conclusion: Existing quality metrics remain applicable to neural video codecs, with no significant reliability differences compared to traditional codecs. The dataset will be publicly available to support further research.
Abstract: With neural video codecs (NVCs) emerging as promising alternatives for traditional compression methods, it is increasingly important to determine whether existing quality metrics remain valid for evaluating their performance. However, few studies have systematically investigated this using well-designed subjective tests. To address this gap, this paper presents a subjective quality assessment study using two traditional (AV1 and VVC) and two variants of a neural video codec (DCVC-FM and DCVC-RT). Six source videos (8-10 seconds each, 4K/UHD-1, 60 fps) were encoded at four resolutions (360p to 2160p) using nine different QP values, resulting in 216 sequences that were rated in a controlled environment by 30 participants. These results were used to evaluate a range of full-reference, hybrid, and no-reference quality metrics to assess their applicability to the induced quality degradations. The objective quality assessment results show that VMAF and AVQBits|H0|f demonstrate strong Pearson correlation, while FasterVQA performed best among the tested no-reference metrics. Furthermore, PSNR shows the highest Spearman rank order correlation for within-sequence comparisons across the different codecs. Importantly, no significant performance differences in metric reliability are observed between traditional and neural video codecs across the tested metrics. The dataset, consisting of source videos, encoded videos, and both subjective and quality metric scores will be made publicly available following an open-science approach (https://github.com/Telecommunication-Telemedia-Assessment/AVT-VQDB-UHD-1-NVC).
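The reported correlation analysis can be reproduced with SciPy as sketched below; the MOS and metric arrays are random placeholders, not the study's data, and in practice PLCC is often computed after a logistic mapping of metric scores.

```python
# Pearson (PLCC) and Spearman (SROCC) correlation between metric scores and MOS.
import numpy as np
from scipy.stats import pearsonr, spearmanr

mos = np.random.uniform(1, 5, size=216)                        # one MOS per sequence
metric = np.clip(mos * 20 + np.random.randn(216) * 5, 0, 100)  # stand-in metric scores

plcc, _ = pearsonr(metric, mos)    # linear agreement
srocc, _ = spearmanr(metric, mos)  # rank / monotonic agreement
print(f"PLCC={plcc:.3f}  SROCC={srocc:.3f}")
```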
[931] Learned Adaptive Kernels for High-Fidelity Image Downscaling
Piyush Narhari Pise, Sanjay Ghosh
Main category: eess.IV
TL;DR: ADK-Net is a deep CNN framework for supervised image downscaling that learns adaptive, spatially-varying resampling kernels independently for each RGB color channel, achieving state-of-the-art performance.
Details
Motivation: Classic image downscaling methods often cause blurring or aliasing, and existing learning-based approaches don't fully address channel-specific characteristics needed for maximal fidelity against ground-truth low-resolution images.Method: Uses a hierarchical ResNet-based architecture with parallel channel-specific kernel generators to predict spatially-varying adaptive resampling kernels for each pixel and RGB channel, trained end-to-end with L1 reconstruction loss.
Result: Establishes new state-of-the-art on standard benchmarks including RealSR dataset, with significant improvements in PSNR and SSIM metrics compared to existing methods.
Conclusion: ADK-Net effectively addresses channel interdependencies in image downscaling and demonstrates superior performance through its adaptive kernel prediction approach.
Abstract: Image downscaling is a fundamental operation in image processing, crucial for adapting high-resolution content to various display and storage constraints. While classic methods often introduce blurring or aliasing, recent learning-based approaches offer improved adaptivity. However, achieving maximal fidelity against ground-truth low-resolution (LR) images, particularly by accounting for channel-specific characteristics, remains an open challenge. This paper introduces ADK-Net (Adaptive Downscaling Kernel Network), a novel deep convolutional neural network framework for high-fidelity supervised image downscaling. ADK-Net explicitly addresses channel interdependencies by learning to predict spatially-varying, adaptive resampling kernels independently for each pixel and uniquely for each color channel (RGB). The architecture employs a hierarchical design featuring a ResNet-based feature extractor and parallel channel-specific kernel generators, themselves composed of ResNet-based trunk and branch sub-modules, enabling fine-grained kernel prediction. Trained end-to-end using an L1 reconstruction loss against ground-truth LR data, ADK-Net effectively learns the target downscaling transformation. Extensive quantitative and qualitative experiments on standard benchmarks, including the RealSR dataset, demonstrate that ADK-Net establishes a new state-of-the-art in supervised image downscaling, yielding significant improvements in PSNR and SSIM metrics compared to existing learning-based and traditional methods.
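Applying predicted per-pixel, per-channel kernels can be sketched with torch.nn.functional.unfold as below; the kernel-prediction network itself is omitted, and the 4x scale and 8x8 kernel size are illustrative choices, not necessarily those of ADK-Net.

```python
# Apply spatially-varying, channel-specific resampling kernels to downscale an image.
import torch
import torch.nn.functional as F

def apply_adaptive_kernels(hr: torch.Tensor, kernels: torch.Tensor, scale: int = 4) -> torch.Tensor:
    """
    hr:      (B, 3, H, W) high-resolution image
    kernels: (B, 3, k*k, H//scale, W//scale) predicted kernels (softmax-normalized)
    """
    b, c, h, w = hr.shape
    k = int(kernels.shape[2] ** 0.5)
    out_h, out_w = h // scale, w // scale
    # Extract a k x k patch around every output location, per channel.
    patches = F.unfold(hr, kernel_size=k, stride=scale, padding=(k - scale) // 2)
    patches = patches.view(b, c, k * k, out_h, out_w)
    # Weighted sum of each patch with its own predicted kernel.
    return (patches * kernels).sum(dim=2)

hr = torch.randn(1, 3, 256, 256)
kernels = torch.softmax(torch.randn(1, 3, 64, 64, 64), dim=2)   # 8x8 kernels per output pixel
lr = apply_adaptive_kernels(hr, kernels, scale=4)
print(lr.shape)   # torch.Size([1, 3, 64, 64])
```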
[932] Symmetric Entropy-Constrained Video Coding for Machines
Yuxiao Sun, Meiqin Liu, Chao Yao, Qi Tang, Jian Jin, Weisi Lin, Frederic Dufaux, Yao Zhao
Main category: eess.IV
TL;DR: SEC-VCM is a symmetric entropy-constrained video coding framework for machines that aligns video codec with visual backbones to preserve semantics and discard MVS-irrelevant information, achieving SOTA rate-task performance with significant bitrate savings.
Details
Motivation: Existing VCM methods bind codecs to specific downstream models, requiring retraining and limiting generalization in multi-task scenarios. Unified VCM frameworks using VB/VFM mainly maintain semantic consistency but don't directly link video coding with understanding under VB/VFM guidance.Method: Proposes SEC-VCM with bi-directional entropy-constraint (BiEC) mechanism ensuring symmetry between video decoding and VB encoding by suppressing conditional entropy, and semantic-pixel dual-path fusion (SPDF) module injecting pixel-level priors into reconstruction.
Result: Achieves SOTA rate-task performance with significant bitrate savings: 37.4% on video instance segmentation, 29.8% on video object segmentation, 46.2% on object detection, and 44.9% on multiple object tracking compared to VTM.
Conclusion: The framework successfully establishes symmetric alignment between video codec and VB, enabling explicit handling of semantic information beneficial to MVS while squeezing useless information, demonstrating superior performance across multiple video understanding tasks.
Abstract: As video transmission increasingly serves machine vision systems (MVS) instead of human vision systems (HVS), video coding for machines (VCM) has become a critical research topic. Existing VCM methods often bind codecs to specific downstream models, requiring retraining or supervised data, thus limiting generalization in multi-task scenarios. Recently, unified VCM frameworks have employed visual backbones (VB) and visual foundation models (VFM) to support multiple video understanding tasks with a single codec. They mainly utilize VB/VFM to maintain semantic consistency or suppress non-semantic information, but seldom explore how to directly link video coding with understanding under VB/VFM guidance. Hence, we propose a Symmetric Entropy-Constrained Video Coding framework for Machines (SEC-VCM). It establishes a symmetric alignment between the video codec and VB, allowing the codec to leverage VB’s representation capabilities to preserve semantics and discard MVS-irrelevant information. Specifically, a bi-directional entropy-constraint (BiEC) mechanism ensures symmetry between the process of video decoding and VB encoding by suppressing conditional entropy. This helps the codec to explicitly handle semantic information beneficial to MVS while squeezing useless information. Furthermore, a semantic-pixel dual-path fusion (SPDF) module injects pixel-level priors into the final reconstruction. Through semantic-pixel fusion, it suppresses artifacts harmful to MVS and improves machine-oriented reconstruction quality. Experimental results show our framework achieves state-of-the-art (SOTA) rate-task performance, with significant bitrate savings over VTM on video instance segmentation (37.4%), video object segmentation (29.8%), object detection (46.2%), and multiple object tracking (44.9%). We will release our code soon.
[933] DeepHQ: Learned Hierarchical Quantizer for Progressive Deep Image Coding
Jooyoung Lee, Se Yoon Jeong, Munchurl Kim
Main category: eess.IV
TL;DR: Proposes a neural network-based progressive image coding method with learned quantization step sizes and selective compression for improved efficiency.
Details
Motivation: Existing progressive image coding methods use handcrafted quantization hierarchies, leading to sub-optimal compression efficiency.Method: Uses learned quantization step sizes for each quantization layer and incorporates selective compression of essential representation components per layer.
Result: Achieves significantly higher coding efficiency than existing approaches with decreased decoding time and reduced model size.
Conclusion: The proposed method with learned quantization and selective compression outperforms traditional progressive coding approaches.
Abstract: Unlike fixed- or variable-rate image coding, progressive image coding (PIC) aims to compress various qualities of images into a single bitstream, increasing the versatility of bitstream utilization and providing high compression efficiency compared to simulcast compression. Research on neural network (NN)-based PIC is in its early stages, mainly focusing on applying varying quantization step sizes to the transformed latent representations in a hierarchical manner. These approaches are designed to compress only the progressively added information as the quality improves, considering that a wider quantization interval for lower-quality compression includes multiple narrower sub-intervals for higher-quality compression. However, the existing methods are based on handcrafted quantization hierarchies, resulting in sub-optimal compression efficiency. In this paper, we propose an NN-based progressive coding method that first utilizes quantization step sizes learned for each quantization layer. We also incorporate selective compression, with which only the essential representation components are compressed for each quantization layer. We demonstrate that our method achieves significantly higher coding efficiency than the existing approaches with decreased decoding time and reduced model size. The source code is publicly available at https://github.com/JooyoungLeeETRI/DeepHQ
[934] FIPER: Factorized Features for Robust Image Super-Resolution and Compression
Yang-Che Sun, Cheng Yu Yeo, Ernie Chu, Jun-Cheng Chen, Yu-Lun Liu
Main category: eess.IV
TL;DR: A unified Factorized Features representation for low-level vision tasks like Super-Resolution and Image Compression, using basis-coefficient decomposition and explicit frequency formulation to capture structural components and multi-scale features.
Details
Motivation: Shared principles between SISR and Image Compression tasks - both require recovering and preserving fine image details, whether by enhancing resolution or reconstructing compressed data.Method: Uses basis-coefficient decomposition and explicit frequency formulation to capture structural components and multi-scale visual features. Replaces simple feature maps with Factorized Features and leverages mergeable-basis property for multi-frame compression optimization.
Result: State-of-the-art performance with 204.4% average relative PSNR improvement over baseline in Super-Resolution and 9.35% BD-rate reduction in Image Compression compared to previous SOTA.
Conclusion: Factorized Features provide a unified representation that effectively addresses core challenges of both low-level vision tasks, demonstrating broad generalizability and superior performance.
Abstract: In this work, we propose using a unified representation, termed Factorized Features, for low-level vision tasks, where we test on Single Image Super-Resolution (SISR) and Image Compression. These tasks share a common principle: both require recovering and preserving fine image details, whether by enhancing resolution for SISR or reconstructing compressed data for Image Compression. Unlike previous methods that mainly focus on network architecture, our proposed approach utilizes a basis-coefficient decomposition as well as an explicit formulation of frequencies to capture structural components and multi-scale visual features in images, which addresses the core challenges of both tasks. We replace the representation of prior models from simple feature maps with Factorized Features to validate the potential for broad generalizability. In addition, we further optimize the compression pipeline by leveraging the mergeable-basis property of our Factorized Features, which consolidates shared structures on multi-frame compression. Extensive experiments show that our unified representation delivers state-of-the-art performance, achieving an average relative improvement of 204.4% in PSNR over the baseline in Super-Resolution (SR) and 9.35% BD-rate reduction in Image Compression compared to the previous SOTA. Project page: https://jayisaking.github.io/FIPER/
[935] LanPaint: Training-Free Diffusion Inpainting with Asymptotically Exact and Fast Conditional Sampling
Candi Zheng, Yuan Lan, Yang Wang
Main category: eess.IV
TL;DR: LanPaint is a training-free method for partial conditional sampling in diffusion models that uses Langevin dynamics to enable fast, accurate inpainting without backpropagation.
Details
Motivation: Existing methods for partial conditional sampling in diffusion models suffer from intractable inverse problems, require expensive backpropagation, or are incompatible with fast ODE-based samplers, limiting their practical use.Method: Leverages carefully designed Langevin dynamics to enable training-free, asymptotically exact partial conditional sampling for ODE-based and rectified flow diffusion models, using fast Monte Carlo sampling without backpropagation.
Result: Achieves superior performance with precise partial conditioning and visually coherent inpainting across diverse tasks, demonstrating accurate distributional matching.
Conclusion: LanPaint provides an efficient, training-free solution for partial conditional sampling that overcomes limitations of prior approaches and works with fast ODE-based samplers.
Abstract: Diffusion models excel at joint pixel sampling for image generation but lack efficient training-free methods for partial conditional sampling (e.g., inpainting with known pixels). Prior work typically formulates this as an intractable inverse problem, relying on coarse variational approximations, heuristic losses requiring expensive backpropagation, or slow stochastic sampling. These limitations preclude: (1) accurate distributional matching in inpainting results, (2) efficient inference modes without gradients, (3) compatibility with fast ODE-based samplers. To address these limitations, we propose LanPaint: a training-free, asymptotically exact partial conditional sampling method for ODE-based and rectified flow diffusion models. By leveraging carefully designed Langevin dynamics, LanPaint enables fast, backpropagation-free Monte Carlo sampling. Experiments demonstrate that our approach achieves superior performance with precise partial conditioning and visually coherent inpainting across diverse tasks.
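A generic Langevin-style refinement step for inpainting is sketched below to illustrate the mechanism: move unknown pixels along the model's score while pinning known pixels to their observed values. LanPaint's exact dynamics, step sizes, and coupling to ODE samplers differ; the score function and step size here are placeholders.

```python
# Generic (unadjusted) Langevin update for inpainting with a score model.
import torch

def langevin_inpaint_step(x, score_fn, t, known, mask, step=1e-3):
    """
    x:        current sample (B, C, H, W)
    score_fn: callable returning grad_x log p_t(x) at noise level t
    known:    observed image values; mask: 1 where pixels are known
    """
    noise = torch.randn_like(x)
    x = x + 0.5 * step * score_fn(x, t) + (step ** 0.5) * noise  # Langevin update
    return mask * known + (1 - mask) * x                          # enforce known pixels

# Placeholder score function (a pretrained diffusion model in practice).
score_fn = lambda x, t: -x
x = torch.randn(1, 3, 64, 64)
known = torch.zeros_like(x)
mask = torch.zeros_like(x)
mask[..., :32] = 1.0                                              # left half is known
for _ in range(10):
    x = langevin_inpaint_step(x, score_fn, t=0.5, known=known, mask=mask)
```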
[936] Anatomically Guided Motion Correction for Placental IVIM Parameter Estimation with Accelerated Sampling Method
Mbaimou Auxence Ngremmadji, Freddy Odille, Charline Bertholdt, Marine Beaumont, Olivier Morel, Bailiang Chen
Main category: eess.IV
TL;DR: A novel motion correction framework for placental IVIM MRI using super-resolution anatomical data and accelerated Bayesian fitting with pCN sampling, improving accuracy and reducing scan time.
Details
Motivation: IVIM MRI for placental assessment requires prolonged scans and is sensitive to maternal/fetal motion, which affects parameter estimation accuracy.Method: Two-step motion correction using SRR anatomical reference data and accelerated Bayesian fitting with a preconditioned Crank-Nicolson sampling strategy.
Result: Motion correction reduced mean absolute fitting error from 4.14 to 3.02, and pCN sampling accelerated parameter estimation by 39% while maintaining accuracy.
Conclusion: The proposed method enables fast and reliable IVIM parameter estimation in challenging prenatal MRI scenarios.
Abstract: Intravoxel incoherent motion (IVIM) is a diffusion-weighted magnetic resonance imaging (MRI) method that may be applied to the placenta to help diagnose abnormal pregnancies. IVIM requires prolonged scan times, followed by a model-based estimation procedure. Maternal or fetal motion during the scan affects the accuracy of this estimation. In this work, we proposed to address this challenging motion correction and data fitting problem by using additional anatomical information that is routinely collected at the beginning of the examination. Super-resolution reconstruction (SRR) was applied to these anatomical data, to provide a patient-specific, 3D isotropic, anatomic reference. Our first contribution is a novel framework with a two-step motion correction that uses both IVIM and the SRR anatomic data, accounting for both intra- and inter-scan, non-rigid motion. Our second contribution is an automation and acceleration of the IVIM data fitting, using a state-of-the-art Bayesian-type algorithm, modified with a preconditioned Crank-Nicolson (pCN) sampling strategy. The accuracy of the IVIM parameter fitting was improved by the proposed motion correction strategy, as assessed by the mean absolute fitting error in the region of interest, which was 4.14 before and 3.02 after correction (arbitrary units of signal intensity). The novel sampling strategy accelerated parameter estimation by 39% on average, with the same accuracy as that of the conventional Bayesian approach. In conclusion, the proposed method may be applied to obtain fast and reliable IVIM parameter estimates in challenging scenarios such as prenatal MRI.
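The bi-exponential IVIM signal model and a single pCN proposal step are sketched below for reference; the prior, likelihood, and parameterization used in the paper are not reproduced, and the toy misfit term in the example is a stand-in.

```python
# IVIM signal model and one preconditioned Crank-Nicolson (pCN) Metropolis step.
import numpy as np

def ivim_signal(b, S0, f, D_star, D):
    """Bi-exponential IVIM model: perfusion fraction f, pseudo-diffusion D*, diffusion D."""
    return S0 * (f * np.exp(-b * D_star) + (1 - f) * np.exp(-b * D))

def pcn_step(theta, neg_log_lik, beta=0.2, prior_std=1.0, rng=np.random):
    """One pCN step for a Gaussian prior N(0, prior_std^2 I); prior terms cancel."""
    proposal = np.sqrt(1 - beta ** 2) * theta + beta * prior_std * rng.standard_normal(theta.shape)
    log_accept = neg_log_lik(theta) - neg_log_lik(proposal)
    return proposal if np.log(rng.uniform()) < log_accept else theta

b_values = np.array([0, 50, 200, 800])
print(ivim_signal(b_values, S0=1.0, f=0.3, D_star=0.05, D=0.0015))

# Toy run with a stand-in quadratic misfit term.
nll = lambda th: 0.5 * np.sum(th ** 2)
theta = np.zeros(3)
for _ in range(100):
    theta = pcn_step(theta, nll)
```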
[937] Rethinking Glaucoma Calibration: Voting-Based Binocular and Metadata Integration
Taejin Jeong, Joohyeok Kim, Jaehoon Joo, Seong Jae Hwang
Main category: eess.IV
TL;DR: V-ViT is a voting-based Vision Transformer framework that improves calibration in glaucoma diagnosis by integrating binocular information and metadata, addressing overconfidence issues through an iterative dropout-based voting system.
Details
Motivation: Glaucoma diagnosis suffers from significant subjectivity and model overconfidence, which can lead to fatal issues like overdiagnosis or missed critical diseases. Existing calibration methods overlook glaucoma's systemic associations and diagnostic subjectivity.Method: Proposed V-ViT framework integrates patient’s binocular information and metadata, and uses an iterative dropout-based Voting System to maximize calibration performance and mitigate diagnostic subjectivity.
Result: Achieved state-of-the-art performance across all metrics, including primary calibration metrics. Effectively resolves overconfidence issues in glaucoma diagnosis predictions.
Conclusion: V-ViT provides highly reliable predictions for clinical use in glaucoma diagnosis by addressing calibration and overconfidence problems through its voting-based approach and integration of multiple data sources.
Abstract: Glaucoma is a major cause of irreversible blindness, with significant diagnostic subjectivity. This inherent uncertainty, combined with the overconfidence of models optimized solely for accuracy, can lead to fatal issues such as overdiagnosis or missed critical diseases. To ensure clinical trust, model calibration is essential for reliable predictions, yet research in this field remains limited. Existing calibration studies have overlooked glaucoma’s systemic associations and high diagnostic subjectivity. To overcome these limitations, we propose V-ViT (Voting-based ViT), a framework that enhances calibration by integrating a patient’s binocular information and metadata. Furthermore, to mitigate diagnostic subjectivity, V-ViT utilizes an iterative dropout-based Voting System to maximize calibration performance. The proposed framework achieved state-of-the-art performance across all metrics, including the primary calibration metrics. Our results demonstrate that V-ViT effectively resolves the issue of overconfidence in predictions in glaucoma diagnosis, providing highly reliable predictions for clinical use. Our source code is available at https://github.com/starforTJ/V-ViT.
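A minimal sketch of a dropout-based voting loop in the spirit of the Voting System is given below; the number of votes and the soft-averaging aggregation are assumptions, and the binocular/metadata fusion of the full framework is omitted.

```python
import torch

@torch.no_grad()
def dropout_vote(model, x, n_votes=10):
    """Run several stochastic forward passes with dropout kept active and
    average the resulting class probabilities as 'votes'."""
    model.eval()
    for m in model.modules():                       # re-enable only the dropout layers
        if isinstance(m, torch.nn.Dropout):
            m.train()
    probs = [torch.softmax(model(x), dim=-1) for _ in range(n_votes)]
    mean_prob = torch.stack(probs).mean(dim=0)      # averaged votes give a softer, better-calibrated score
    return mean_prob, mean_prob.argmax(dim=-1)
```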
[938] Accelerating Volumetric Medical Image Annotation via Short-Long Memory SAM 2
Yuwen Chen, Zafer Yildiz, Qihang Li, Yaqian Chen, Haoyu Dong, Hanxue Gu, Nicholas Konz, Maciej A. Mazurowski
Main category: eess.IV
TL;DR: SLM-SAM 2 improves medical image annotation by using separate short-term and long-term memory banks to reduce error propagation in volumetric segmentation, outperforming SAM 2 with significant Dice score improvements and 60.575% faster correction time.
Details
Motivation: Manual annotation of volumetric medical images (MRI, CT) is labor-intensive. While SAM 2 offers potential for speeding up annotation through mask propagation, it suffers from error propagation issues, especially at boundary regions.Method: Proposed Short-Long Memory SAM 2 (SLM-SAM 2) with distinct short-term and long-term memory banks and separate attention modules to improve segmentation accuracy and reduce error propagation.
Result: SLM-SAM 2 outperforms SAM 2 on four public datasets, achieving average Dice improvements of 0.14 (5 volumes) and 0.10 (1 volume), with 60.575% reduction in mask correction time per volume.
Conclusion: SLM-SAM 2 represents a significant step toward more accurate automated annotation of medical images for segmentation model development, with stronger resistance to over-propagation.
Abstract: Manual annotation of volumetric medical images, such as magnetic resonance imaging (MRI) and computed tomography (CT), is a labor-intensive and time-consuming process. Recent advancements in foundation models for video object segmentation, such as Segment Anything Model 2 (SAM 2), offer a potential opportunity to significantly speed up the annotation process by manually annotating one or a few slices and then propagating target masks across the entire volume. However, the performance of SAM 2 in this context varies. Our experiments show that relying on a single memory bank and attention module is prone to error propagation, particularly at boundary regions where the target is present in the previous slice but absent in the current one. To address this problem, we propose Short-Long Memory SAM 2 (SLM-SAM 2), a novel architecture that integrates distinct short-term and long-term memory banks with separate attention modules to improve segmentation accuracy. We evaluate SLM-SAM 2 on four public datasets covering organs, bones, and muscles across MRI, CT, and ultrasound videos. We show that the proposed method markedly outperforms the default SAM 2, achieving an average Dice Similarity Coefficient improvement of 0.14 and 0.10 in the scenarios when 5 volumes and 1 volume are available for the initial adaptation, respectively. SLM-SAM 2 also exhibits stronger resistance to over-propagation, reducing the time required to correct propagated masks by 60.575% per volume compared to SAM 2, making a notable step toward more accurate automated annotation of medical images for segmentation model development.
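The short-/long-term memory idea can be sketched schematically as follows, assuming placeholder `encode_memory` and `predict_mask` functions; this does not reproduce SAM 2's memory attention, only the bookkeeping of two separate banks during slice-by-slice propagation.

```python
from collections import deque

def propagate_volume(volume, initial_mask, encode_memory, predict_mask,
                     short_len=4, long_stride=8):
    """Propagate a mask through a volume using a small FIFO of recent slices
    (short-term memory) plus sparse, widely spaced anchors (long-term memory)."""
    short_mem = deque(maxlen=short_len)          # tracks rapid anatomical change
    long_mem = []                                # stable context, resists drift
    masks = [initial_mask]
    first = encode_memory(volume[0], initial_mask)
    short_mem.append(first)
    long_mem.append(first)
    for i in range(1, len(volume)):
        mask = predict_mask(volume[i], list(short_mem), long_mem)
        masks.append(mask)
        mem = encode_memory(volume[i], mask)
        short_mem.append(mem)
        if i % long_stride == 0:                 # only occasionally promote to long-term memory
            long_mem.append(mem)
    return masks
```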
[939] Autoadaptive Medical Segment Anything Model
Tyler Ward, Meredith K. Owen, O’Kira Coleman, Brian Noehren, Abdullah-Al-Zubaer Imran
Main category: eess.IV
TL;DR: ADA-SAM is a multitask learning framework for medical image segmentation that uses class activation maps from an auxiliary classifier to guide SAM-based semi-supervised segmentation, with a novel gradient feedback mechanism connecting segmentation and classification branches.
Details
Motivation: Manual annotation for medical image segmentation is expensive, time-consuming, and error-prone, creating a need for accurate, automatic, and annotation-efficient training methods.Method: Proposes ADA-SAM framework with: 1) Class activation maps from auxiliary classifier guiding SAM-based semi-supervised segmentation, 2) Novel gradient feedback mechanism using segmentation gradients to improve classification predictions.
Result: Outperforms both fully-supervised and semi-supervised baselines by double digits in limited label settings on real-world clinical rehabilitation data.
Conclusion: ADA-SAM provides an effective solution for annotation-efficient medical image segmentation through its multitask learning approach and gradient feedback mechanism.
Abstract: Medical image segmentation is a key task in the imaging workflow, influencing many image-based decisions. Traditional, fully-supervised segmentation models rely on large amounts of labeled training data, typically obtained through manual annotation, which can be an expensive, time-consuming, and error-prone process. This signals a need for accurate, automatic, and annotation-efficient methods of training these models. We propose ADA-SAM (automated, domain-specific, and adaptive segment anything model), a novel multitask learning framework for medical image segmentation that leverages class activation maps from an auxiliary classifier to guide the predictions of the semi-supervised segmentation branch, which is based on the Segment Anything (SAM) framework. Additionally, our ADA-SAM model employs a novel gradient feedback mechanism to create a learnable connection between the segmentation and classification branches by using the segmentation gradients to guide and improve the classification predictions. We validate ADA-SAM on real-world clinical data collected during rehabilitation trials, and demonstrate that our proposed method outperforms both fully-supervised and semi-supervised baselines by double digits in limited label settings. Our code is available at: https://github.com/tbwa233/ADA-SAM.
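As an illustration of CAM-guided prompting, the hypothetical snippet below converts an auxiliary classifier's class activation map into a point prompt for a SAM-style segmentation branch; `class_weights` and the prompt format are assumed interfaces, and the gradient feedback mechanism is not shown.

```python
import torch
import torch.nn.functional as F

def cam_to_point_prompt(features, class_weights, target_class, image_size):
    """Turn classifier activations into a positive point prompt.
    features: (C, h, w) backbone activations; class_weights: (num_classes, C)."""
    cam = torch.einsum("c,chw->hw", class_weights[target_class], features)
    cam = F.relu(cam)
    cam = F.interpolate(cam[None, None], size=image_size,
                        mode="bilinear", align_corners=False)[0, 0]
    idx = torch.argmax(cam)                       # most activated location
    y, x = divmod(idx.item(), image_size[1])
    return (x, y)                                 # point prompt in (x, y) pixel coordinates
```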
[940] Joint Lossless Compression and Steganography for Medical Images via Large Language Models
Pengcheng Zheng, Xiaorong Pu, Kecheng Chen, Jiaxin Huang, Meng Yang, Bai Feng, Yazhou Ren, Jianan Jiang
Main category: eess.IV
TL;DR: A novel joint lossless compression and steganography framework for medical images that securely embeds privacy messages while maintaining high compression performance and efficiency.
Details
Motivation: Existing LLM-based compressors for medical images have unsatisfactory trade-offs between compression performance and efficiency, and overlook security aspects critical in medical scenarios.Method: Uses adaptive modalities decomposition to partition images into global/local segments, implements dual-path lossless compression with segmented message steganography in local path, and employs anatomical priors-based low-rank adaptation (A-LoRA) fine-tuning.
Result: Extensive experiments demonstrate superiority in compression ratios, efficiency, and security compared to existing methods.
Conclusion: The proposed framework effectively addresses the compression-security trade-off in medical imaging and will be made publicly available.
Abstract: Recently, large language models (LLMs) have driven promising progress in lossless image compression. However, directly adopting existing paradigms for medical images suffers from an unsatisfactory trade-off between compression performance and efficiency. Moreover, existing LLM-based compressors often overlook the security of the compression process, which is critical in modern medical scenarios. To this end, we propose a novel joint lossless compression and steganography framework. Inspired by bit plane slicing (BPS), we find it feasible to securely embed privacy messages into medical images in an invisible manner. Based on this insight, an adaptive modalities decomposition strategy is first devised to partition the entire image into two segments, providing global and local modalities for subsequent dual-path lossless compression. During this dual-path stage, we innovatively propose a segmented message steganography algorithm within the local modality path to ensure the security of the compression process. Coupled with the proposed anatomical priors-based low-rank adaptation (A-LoRA) fine-tuning strategy, extensive experimental results demonstrate the superiority of our proposed method in terms of compression ratios, efficiency, and security. The source code will be made publicly available.
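For intuition about the bit-plane-slicing insight, here is a toy least-significant-bit embedding sketch; the paper's segmented steganography operates inside an LLM-based dual-path compression pipeline and is considerably more involved.

```python
import numpy as np

def embed_bits(image, message_bits):
    """Hide a bit string in the least-significant bit plane of a uint8 image."""
    flat = image.flatten().copy()
    if len(message_bits) > flat.size:
        raise ValueError("message too long for this carrier image")
    bits = np.array(message_bits, dtype=np.uint8)
    flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits   # overwrite LSBs with message bits
    return flat.reshape(image.shape)

def extract_bits(stego_image, n_bits):
    """Recover the first n_bits from the LSB plane."""
    return (stego_image.flatten()[:n_bits] & 1).tolist()
```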
[941] DMVFC: Deep Learning Based Functionally Consistent Tractography Fiber Clustering Using Multimodal Diffusion MRI and Functional MRI
Bocheng Guo, Jin Wang, Yijie Li, Junyi Wang, Mingyu Gao, Puming Feng, Yuqian Chen, Jarrett Rushmore, Nikos Makris, Yogesh Rathi, Lauren J O’Donnell, Fan Zhang
Main category: eess.IV
TL;DR: Deep Multi-view Fiber Clustering (DMVFC) is a novel deep learning framework that integrates geometric, microstructural, and functional information from dMRI and fMRI for more meaningful white matter fiber clustering.
Details
Motivation: Current fiber clustering methods only use geometric characteristics and neglect functional and microstructural information, which limits their ability to create functionally coherent white matter parcellations.Method: DMVFC has two main components: (1) multi-view pretraining to compute embedding features from fiber geometry, microstructure measures, and functional signals separately, and (2) collaborative fine-tuning to simultaneously refine embedding differences.
Result: DMVFC demonstrated superior performance compared to two state-of-the-art fiber clustering methods in achieving functionally meaningful and consistent white matter parcellation results.
Conclusion: The integration of multimodal information (geometric, microstructural, and functional) through deep learning enables more functionally coherent white matter parcellation for structural connectivity analysis.
Abstract: Tractography fiber clustering using diffusion MRI (dMRI) is a crucial method for white matter (WM) parcellation to enable analysis of the brain's structural connectivity in health and disease. Current fiber clustering strategies primarily use the fiber geometric characteristics (i.e., the spatial trajectories) to group similar fibers into clusters, while neglecting the functional and microstructural information of the fiber tracts. There is increasing evidence that neural activity in the WM can be measured using functional MRI (fMRI), providing potentially valuable multimodal information for fiber clustering to enhance its functional coherence. Furthermore, microstructural features such as fractional anisotropy (FA) can be computed from dMRI as additional information to ensure the anatomical coherence of the clusters. In this paper, we develop a novel deep learning fiber clustering framework, namely Deep Multi-view Fiber Clustering (DMVFC), which uses joint multi-modal dMRI and fMRI data to enable functionally consistent WM parcellation. DMVFC can effectively integrate the geometric and microstructural characteristics of the WM fibers with the fMRI BOLD signals along the fiber tracts. DMVFC includes two major components: (1) a multi-view pretraining module to compute embedding features from each source of information separately, including fiber geometry, microstructure measures, and functional signals, and (2) a collaborative fine-tuning module to simultaneously refine the differences of embeddings. In the experiments, we compare DMVFC with two state-of-the-art fiber clustering methods and demonstrate superior performance in achieving functionally meaningful and consistent WM parcellation results.
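Conceptually, multi-view fiber clustering can be reduced to fusing per-view embeddings and clustering the joint representation, as in the toy sketch below; the per-view encoders and the k-means stand-in are assumptions, since DMVFC learns the embeddings with deep multi-view pretraining and collaborative fine-tuning.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_fibers(geometry_emb, microstructure_emb, fmri_emb, n_clusters=100):
    """Fuse per-view fiber embeddings (each of shape (n_fibers, d_view))
    and cluster the joint representation."""
    views = [geometry_emb, microstructure_emb, fmri_emb]
    # normalize each view so no single modality dominates the joint space
    views = [(v - v.mean(axis=0)) / (v.std(axis=0) + 1e-8) for v in views]
    joint = np.concatenate(views, axis=1)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(joint)
    return labels
```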