Daily arXiv Papers - 2025-11-06

AI-enhanced summaries of 15 research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Cache Mechanism for Agent RAG Systems

Shuhang Lin, Zhencan Peng, Lingyao Li, Xiao Lin, Xi Zhu, Yongfeng Zhang

Main category: cs.CL

TL;DR: ARC is a novel caching framework that dynamically manages small, high-value corpora for LLM agents, reducing storage to 0.015% of original while maintaining high performance.

Motivation: Agent-level cache management in RAG systems is underexplored, particularly the need for dynamic, compact corpora tailored to each agent's specific requirements.

Method: ARC synthesizes historical query distribution patterns with the intrinsic geometry of cached items in embedding space to automatically maintain high-relevance cache without annotations.

Result: ARC reduces storage requirements to 0.015% of original corpus, achieves up to 79.8% has-answer rate, and reduces average retrieval latency by 80% across three retrieval datasets.

Conclusion: ARC can drastically enhance both efficiency and effectiveness in RAG-powered LLM agents through intelligent cache management.

Abstract: Recent advances in Large Language Model (LLM)-based agents have been propelled by Retrieval-Augmented Generation (RAG), which grants the models access to vast external knowledge bases. Despite RAG’s success in improving agent performance, agent-level cache management, particularly constructing, maintaining, and updating a compact, relevant corpus dynamically tailored to each agent’s need, remains underexplored. Therefore, we introduce ARC (Agent RAG Cache Mechanism), a novel, annotation-free caching framework that dynamically manages small, high-value corpora for each agent. By synthesizing historical query distribution patterns with the intrinsic geometry of cached items in the embedding space, ARC automatically maintains a high-relevance cache. With comprehensive experiments on three retrieval datasets, our experimental results demonstrate that ARC reduces storage requirements to 0.015% of the original corpus while offering up to 79.8% has-answer rate and reducing average retrieval latency by 80%. Our results demonstrate that ARC can drastically enhance efficiency and effectiveness in RAG-powered LLM agents.
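The abstract does not spell out ARC's scoring rule, but the general idea of blending historical query statistics with the embedding-space geometry of cached items can be sketched. The helper names, toy 2-d embeddings, and the lambda-weighted combination below are illustrative assumptions, not the paper's method:

```python
import math

def cos(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_cache(items, recent_queries, lam=0.5):
    """Rank cached items by a hypothetical blend of (i) affinity to the
    recent query distribution and (ii) centrality among cached items."""
    scores = {}
    for name, emb in items.items():
        affinity = sum(cos(emb, q) for q in recent_queries) / len(recent_queries)
        others = [e for n, e in items.items() if n != name]
        centrality = sum(cos(emb, o) for o in others) / len(others)
        scores[name] = lam * affinity + (1 - lam) * centrality
    return sorted(items, key=scores.get, reverse=True)

# Toy 2-d embeddings: doc_a and doc_b sit near the query cluster, doc_c does not.
items = {"doc_a": [1.0, 0.0], "doc_b": [0.9, 0.1], "doc_c": [0.0, 1.0]}
queries = [[1.0, 0.1], [0.8, 0.0]]
ranking = rank_cache(items, queries)
print(ranking)  # the off-topic doc_c ranks last
```

Under a sketch like this, a fixed-budget cache would evict from the bottom of the ranking, keeping only the small, high-value corpus the paper describes.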

[2] Automatic Machine Translation Detection Using a Surrogate Multilingual Translation Model

Cristian García-Romero, Miquel Esplà-Gomis, Felipe Sánchez-Martínez

Main category: cs.CL

TL;DR: Proposes using internal representations from multilingual MT models to detect machine-translated text, achieving 5+ percentage point accuracy gains over SOTA methods.

Motivation: Large parallel corpora contain substantial machine-generated translations that degrade MT quality when used in training, making detection essential.

Method: Uses internal representations from a surrogate multilingual MT model to distinguish human vs machine-translated sentences.

Result: Outperforms current state-of-the-art techniques, especially for non-English language pairs, with at least 5 percentage point accuracy gains.

Conclusion: The approach effectively filters machine-translated content and is crucial for building high-quality MT systems.

Abstract: Modern machine translation (MT) systems depend on large parallel corpora, often collected from the Internet. However, recent evidence indicates that (i) a substantial portion of these texts are machine-generated translations, and (ii) an overreliance on such synthetic content in training data can significantly degrade translation quality. As a result, filtering out non-human translations is becoming an essential pre-processing step in building high-quality MT systems. In this work, we propose a novel approach that directly exploits the internal representations of a surrogate multilingual MT model to distinguish between human and machine-translated sentences. Experimental results show that our method outperforms current state-of-the-art techniques, particularly for non-English language pairs, achieving gains of at least 5 percentage points of accuracy.

[3] LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation

Gyeom Hwangbo, Hyungjoo Chae, Minseok Kang, Hyeonjong Ju, Soohyun Oh, Jinyoung Yeo

Main category: cs.CL

TL;DR: LEGO-Eval is a new evaluation framework that uses diverse tools to ground scene components for better assessment of 3D scene-instruction alignment, outperforming current methods by 0.41 F1 score.

Motivation: Current 3D scene generation methods often produce unrealistic scenes due to coarse-grained instructions, which can lead embodied agents to learn incorrect priors when trained in such environments. Existing evaluation methods like CLIPScore and VLMs fail to reliably assess alignment due to shallow understanding of 3D scenes.

Method: Introduces LEGO-Eval framework with diverse tools to explicitly ground scene components for accurate alignment assessment, and LEGO-Bench benchmark with detailed instructions specifying complex layouts and attributes of real-world environments.

Result: LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. Current generation methods show significant limitations, with success rates reaching at most 10% in generating scenes that fully align with fine-grained instructions.

Conclusion: The proposed LEGO-Eval framework provides more reliable assessment of 3D scene-instruction alignment, revealing substantial gaps in current generation methods that need to be addressed for effective embodied agent training.

Abstract: Despite recent progress in using Large Language Models (LLMs) for automatically generating 3D scenes, generated scenes often lack realistic spatial layouts and object attributes found in real-world environments. As this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by more detailed, fine-grained instructions that reflect real-world environments becomes crucial. Without such realistic scenes, training embodied agents in unrealistic environments can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Thus, verifying the alignment between the fine-grained instruction and the generated scene is essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to reliably assess such alignment. This shortcoming arises primarily from their shallow understanding of 3D scenes, which often leads to improperly grounded scene components. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools designed to explicitly ground scene components, enabling more accurate alignment assessments. We also present LEGO-Bench, a benchmark of detailed instructions that specify complex layouts and attributes of real-world environments. Experiments demonstrate that LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods. Across all evaluated approaches, success rates reached at most 10% in generating scenes that fully align with fine-grained instructions.

[4] Step-Audio-EditX Technical Report

Chao Yan, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu, Yuxin Zhang, Xiangyu Zhang, Fei Tian, Xuerui Yang, Daxin Jiang, Gang Yu

Main category: cs.CL

TL;DR: Step-Audio-EditX is the first open-source LLM-based audio model that excels at expressive audio editing and zero-shot TTS, using large-margin synthetic data instead of embedding-based priors.

Motivation: To create an audio model capable of expressive and iterative audio editing including emotion, speaking style, and paralinguistics, while avoiding the limitations of traditional representation-level disentanglement approaches.

Method: Leverages only large-margin synthetic data, circumventing the need for embedding-based priors or auxiliary modules, enabling both iterative control and high expressivity across voices.

Result: Evaluation shows Step-Audio-EditX surpasses MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.

Conclusion: The large-margin learning approach represents a fundamental pivot from conventional focus on representation-level disentanglement and enables superior expressive audio editing capabilities.

Abstract: We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing encompassing emotion, speaking style, and paralinguistics alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.

[5] Targeted Error Correction in Knowledge Distillation: Small Language Models Surpass GPT

Hee-Jin Lee, Zhen Guo, Luchao Jin, Morteza Moazami Goudarzi

Main category: cs.CL

TL;DR: ARF pipeline enables smaller open-source LLMs to outperform larger proprietary models in customer service summarization through error analysis, targeted revision, and fine-tuning.

Motivation: To improve cost efficiency and data privacy while maintaining competitive accuracy in customer service summarization tasks using smaller open-source models.

Method: Analyze-Revise-Finetune (ARF) pipeline: 1) Analyze/categorize errors from teacher model (GPT-3.5), 2) Targeted revision using editor model (Llama 3.1 70B) to create refined training data, 3) Fine-tune student model (Llama 3.1 8B) on refined data.

Result: Fine-tuned smaller student model (Llama 3.1 8B) achieved superior summarization performance compared to GPT-3.5.

Conclusion: ARF pipeline provides a generalizable framework for enhancing open-source LLMs across diverse downstream applications with improved cost efficiency and data privacy.

Abstract: We introduce an Analyze-Revise-Finetune (ARF) pipeline that enables smaller open-source language models (LLMs) to surpass substantially larger proprietary models in customer service summarization tasks. The pipeline first analyzes and categorizes common errors in summaries produced by a teacher model (GPT-3.5), then performs a targeted revision using a compact editor model (Llama 3.1 70B) to generate high-quality, refined training data. Fine-tuning a smaller student model (Llama 3.1 8B) on this refined data resulted in superior summarization performance compared to GPT-3.5. The ARF pipeline improves cost efficiency and data privacy while maintaining competitive accuracy, illustrating a generalizable framework for enhancing open-source LLMs across diverse downstream applications.

[6] Data-Efficient Adaptation and a Novel Evaluation Method for Aspect-based Sentiment Analysis

Yan Cathy Hua, Paul Denny, Jörg Wicker, Katerina Taškova

Main category: cs.CL

TL;DR: This paper introduces FTS-OBP evaluation method for ABSA, studies small language models in education domain ABSA, and releases educational ABSA resources.

Motivation: Address the concentration of ABSA research in commercial domains and the limitations of traditional evaluation methods that penalize boundary variations in generative models.

Method: Proposed FTS-OBP evaluation method, systematically explored small decoder-only generative language models (<7B parameters) using data-free (in-context learning, weight merging) and data-light fine-tuning methods with multitask strategy.

Result: Small models (1.5-3.8B) surpassed proprietary large models and approached benchmark results with only 200-1,000 examples on a single GPU using the proposed multitask fine-tuning strategy.

Conclusion: The work provides effective solutions for ABSA in low-resource domains through novel evaluation methods, optimized small model training strategies, and publicly available educational resources.

Abstract: Aspect-based Sentiment Analysis (ABSA) is a fine-grained opinion mining approach that identifies and classifies opinions associated with specific entities (aspects) or their categories within a sentence. Despite its rapid growth and broad potential, ABSA research and resources remain concentrated in commercial domains, leaving analytical needs unmet in high-demand yet low-resource areas such as education and healthcare. Domain adaptation challenges and most existing methods’ reliance on resource-intensive in-training knowledge injection further hinder progress in these areas. Moreover, traditional evaluation methods based on exact matches are overly rigid for ABSA tasks, penalising any boundary variations which may misrepresent the performance of generative models. This work addresses these gaps through three contributions: 1) We propose a novel evaluation method, Flexible Text Similarity Matching and Optimal Bipartite Pairing (FTS-OBP), which accommodates realistic extraction boundary variations while maintaining strong correlation with traditional metrics and offering fine-grained diagnostics. 2) We present the first ABSA study of small decoder-only generative language models (SLMs; <7B parameters), examining resource lower bounds via a case study in education review ABSA. We systematically explore data-free (in-context learning and weight merging) and data-light fine-tuning methods, and propose a multitask fine-tuning strategy that significantly enhances SLM performance, enabling 1.5-3.8 B models to surpass proprietary large models and approach benchmark results with only 200-1,000 examples on a single GPU. 3) We release the first public set of education review ABSA resources to support future research in low-resource domains.
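The exact FTS-OBP formulation is not given here; as a hedged sketch of "flexible text similarity matching with optimal bipartite pairing", one could score boundary-tolerant matches with a character-level similarity and search one-to-one pairings exhaustively. `difflib`'s ratio and the brute-force search are stand-ins, not the paper's metric, and the review spans are invented:

```python
from difflib import SequenceMatcher
from itertools import permutations

def similarity(a, b):
    """Character-level similarity in [0, 1]; tolerant of boundary variations."""
    return SequenceMatcher(None, a, b).ratio()

def best_pairing(predicted, gold):
    """Brute-force optimal one-to-one pairing between predicted and gold
    aspect spans (assumes len(predicted) <= len(gold); fine for small sets).
    Returns the pairing maximizing total similarity and an averaged score."""
    best, best_pairs = -1.0, []
    for perm in permutations(range(len(gold)), len(predicted)):
        pairs = list(zip(range(len(predicted)), perm))
        total = sum(similarity(predicted[i], gold[j]) for i, j in pairs)
        if total > best:
            best, best_pairs = total, pairs
    return best_pairs, best / max(len(predicted), len(gold))

pred = ["course content", "the lecturer"]
gold = ["the course content", "lecturer", "assignments"]
pairs, score = best_pairing(pred, gold)
print(pairs)  # boundary mismatches ("the ...") still pair with the right gold span
```

An exact-match metric would give both predictions zero credit here; a flexible matcher pairs them with the right gold aspects and only discounts the boundary difference, which is the behavior FTS-OBP is designed to capture.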

[7] A Computational Approach to Analyzing Disrupted Language in Schizophrenia: Integrating Surprisal and Coherence Measures

Gowtham Premananth, Carol Espy-Wilson

Main category: cs.CL

TL;DR: This study analyzes language disruptions in schizophrenia using computational linguistic measures (surprisal and semantic coherence) to differentiate patients from healthy controls and assess symptom severity.

Motivation: Language disruptions in schizophrenia reflect cognitive disturbances and could serve as objective markers for diagnosis and symptom severity assessment.

Method: Computing surprisal and semantic coherence of language using computational models to compare schizophrenia patients with healthy controls.

Result: The study found differences in surprisal and semantic coherence between schizophrenia subjects and healthy controls, with these measures changing according to symptom severity.

Conclusion: Computational linguistic measures (surprisal and semantic coherence) can characterize language disruptions in schizophrenia and provide insights into symptom severity.

Abstract: Language disruptions are one of the well-known effects of schizophrenia symptoms. They are often manifested as disorganized speech and impaired discourse coherence. These abnormalities in spontaneous language production reflect underlying cognitive disturbances and have the potential to serve as objective markers for symptom severity and diagnosis of schizophrenia. This study focuses on how these language disruptions can be characterized in terms of two computational linguistic measures: surprisal and semantic coherence. By computing surprisal and semantic coherence of language using computational models, this study investigates how they differ between subjects with schizophrenia and healthy controls. Furthermore, this study provides further insight into how language disruptions in terms of these linguistic measures change with varying degrees of schizophrenia symptom severity.
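The two measures themselves are standard: surprisal is the negative log-probability a language model assigns to each token, and coherence is commonly proxied by the similarity of adjacent sentence embeddings. A minimal sketch of both (the study's specific models are not reproduced; the probabilities and vectors are toy values):

```python
import math

def mean_surprisal(token_probs):
    """Mean surprisal in bits: -log2 p(token | context), averaged over tokens.
    Higher values mean less predictable (more 'surprising') language."""
    return sum(-math.log2(p) for p in token_probs) / len(token_probs)

def coherence(sent_embeddings):
    """Mean cosine similarity between consecutive sentence embeddings,
    a common proxy for discourse coherence."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    sims = [cos(a, b) for a, b in zip(sent_embeddings, sent_embeddings[1:])]
    return sum(sims) / len(sims)

probs = [0.5, 0.25, 0.125]    # toy next-token probabilities from an LM
print(mean_surprisal(probs))  # (1 + 2 + 3) / 3 = 2.0 bits
```

In a study like this one, elevated surprisal and depressed coherence relative to controls would be the signal of disorganized speech.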

[8] ROBoto2: An Interactive System and Dataset for LLM-assisted Clinical Trial Risk of Bias Assessment

Anthony Hevia, Sanjana Chintalapati, Veronica Ka Wai Lai, Thanh Tam Nguyen, Wai-Tat Wong, Terry Klassen, Lucy Lu Wang

Main category: cs.CL

TL;DR: ROBOTO2 is an open-source web platform that uses LLMs to assist with risk of bias assessment in clinical trials, combining PDF parsing, retrieval-augmented prompting, and human review to streamline the ROB2 annotation process.

Motivation: To address the labor-intensive nature of traditional risk of bias assessment in clinical trials and leverage LLMs to automate parts of the ROB2 annotation process while maintaining human oversight.

Method: Interactive web platform combining PDF parsing, retrieval-augmented LLM prompting, and human-in-the-loop review. Users upload trial reports, receive preliminary answers with evidence for ROB2 signaling questions, and provide real-time feedback.

Result: Created a dataset of 521 pediatric clinical trial reports (8954 signaling questions with 1202 evidence passages) and benchmarked ROB2 performance for 4 LLMs. The platform is publicly available with code and data released.

Conclusion: ROBOTO2 demonstrates the potential of LLM-assisted risk of bias assessment while highlighting current model capabilities and ongoing challenges in automating systematic review processes.

Abstract: We present ROBOTO2, an open-source, web-based platform for large language model (LLM)-assisted risk of bias (ROB) assessment of clinical trials. ROBOTO2 streamlines the traditionally labor-intensive ROB v2 (ROB2) annotation process via an interactive interface that combines PDF parsing, retrieval-augmented LLM prompting, and human-in-the-loop review. Users can upload clinical trial reports, receive preliminary answers and supporting evidence for ROB2 signaling questions, and provide real-time feedback or corrections to system suggestions. ROBOTO2 is publicly available at https://roboto2.vercel.app/, with code and data released to foster reproducibility and adoption. We construct and release a dataset of 521 pediatric clinical trial reports (8954 signaling questions with 1202 evidence passages), annotated using both manually and LLM-assisted methods, serving as a benchmark and enabling future research. Using this dataset, we benchmark ROB2 performance for 4 LLMs and provide an analysis into current model capabilities and ongoing challenges in automating this critical aspect of systematic review.

[9] Reading Between the Lines: The One-Sided Conversation Problem

Victoria Ebert, Rishabh Singh, Tuochao Chen, Noah A. Smith, Shyamnath Gollakota

Main category: cs.CL

TL;DR: The paper introduces the one-sided conversation problem (1SC) where only one side of a dialogue is available, and presents methods for reconstructing missing turns and generating summaries from partial transcripts.

Motivation: Real-world conversational AI applications often face constraints where only one side of a conversation can be recorded (e.g., telemedicine, call centers, smart glasses), creating the need to work with incomplete dialogue data.

Method: The study evaluates prompting and finetuned models on datasets like MultiWOZ, DailyDialog, and Candor, using techniques like accessing one future turn, utterance length information, and placeholder prompting to mitigate hallucination.

Result: Models can effectively reconstruct missing speaker turns with proper context (one future turn and utterance length), and high-quality summaries can be generated without full reconstruction. Large models work well with prompting while smaller models require finetuning.

Conclusion: The paper presents 1SC as a novel challenge and demonstrates promising results for privacy-aware conversational AI, showing that effective reconstruction and summarization are possible from one-sided conversations.

Abstract: Conversational AI is constrained in many real-world settings where only one side of a dialogue can be recorded, such as telemedicine, call centers, and smart glasses. We formalize this as the one-sided conversation problem (1SC): inferring and learning from one side of a conversation. We study two tasks: (1) reconstructing the missing speaker’s turns for real-time use cases, and (2) generating summaries from one-sided transcripts. Evaluating prompting and finetuned models on MultiWOZ, DailyDialog, and Candor with both human A/B testing and LLM-as-a-judge metrics, we find that access to one future turn and information about utterance length improves reconstruction, placeholder prompting helps to mitigate hallucination, and while large models generate promising reconstructions with prompting, smaller models require finetuning. Further, high-quality summaries can be generated without reconstructing missing turns. We present 1SC as a novel challenge and report promising results that mark a step toward privacy-aware conversational AI.

[10] PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech

Michel Wong, Ali Alshehri, Sophia Kao, Haotian He

Main category: cs.CL

TL;DR: PolyNorm is a prompt-based text normalization approach using LLMs that reduces manual rule engineering and improves performance across multiple languages compared to traditional systems.

Motivation: Traditional text normalization systems require substantial engineering effort, are difficult to scale, and have limited language coverage, especially in low-resource settings.

Method: Uses prompt-based approach with Large Language Models (LLMs) and a language-agnostic pipeline for automatic data curation and evaluation.

Result: Experiments across eight languages show consistent reductions in word error rate (WER) compared to production-grade systems.

Conclusion: PolyNorm reduces reliance on manually crafted rules, enables broader linguistic applicability with minimal human intervention, and releases PolyNorm-Benchmark dataset to support further research.

Abstract: Text Normalization (TN) is a key preprocessing step in Text-to-Speech (TTS) systems, converting written forms into their canonical spoken equivalents. Traditional TN systems can exhibit high accuracy, but involve substantial engineering effort, are difficult to scale, and pose challenges to language coverage, particularly in low-resource settings. We propose PolyNorm, a prompt-based approach to TN using Large Language Models (LLMs), aiming to reduce the reliance on manually crafted rules and enable broader linguistic applicability with minimal human intervention. Additionally, we present a language-agnostic pipeline for automatic data curation and evaluation, designed to facilitate scalable experimentation across diverse languages. Experiments across eight languages show consistent reductions in the word error rate (WER) compared to a production-grade system. To support further research, we release PolyNorm-Benchmark, a multilingual data set covering a diverse range of text normalization phenomena.
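The reported metric, word error rate, is the word-level edit distance between a system's normalized output and the reference, divided by the reference length. A minimal implementation for context (the example sentence is illustrative, not from the PolyNorm benchmark):

```python
def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance:
    (substitutions + insertions + deletions) / reference length."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

# "$5" normalized as "five dollar" instead of the reference "five dollars"
print(wer("five dollars", "five dollar"))  # 0.5
```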

[11] Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition

Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly

Main category: cs.CL

TL;DR: Omni-router Transformer uses shared routing across MoE layers to improve expert cooperation and specialization in ASR, achieving better performance than dense and Switch Transformer models.

Motivation: Traditional MoE methods route experts independently per layer, but analysis shows routers' choices are weakly correlated across layers, limiting inter-layer expert cooperation and specialization.

Method: Propose Omni-router Transformer that uses a shared router across different MoE layers to increase cooperation between experts in different layers.

Result: Achieves lower training loss and consistently outperforms dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2% respectively across 10 diverse ASR benchmarks.

Conclusion: Shared routing enables structured expert usage and improved robustness to diverse data in ASR tasks.

Abstract: Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.
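The core architectural change is small: reuse one router at every MoE layer instead of learning an independent router per layer. A toy sketch of top-1 routing with a shared router follows; real models use learned projections over high-dimensional hidden states, so the 2-d weights here are purely illustrative:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(router_weights, token):
    """Top-1 expert choice for a token given router weights (one row per expert)."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

# One shared router reused by every MoE layer (the Omni-router idea),
# rather than an independent router per layer as in the Switch Transformer.
shared_router = [[1.0, 0.0], [0.0, 1.0]]  # 2 experts, toy 2-d tokens
token = [0.9, 0.1]
choices = [route(shared_router, token) for _layer in range(4)]
print(choices)  # the same expert index at every layer: perfectly correlated routing
```

With per-layer routers the four choices could disagree arbitrarily; sharing the router makes routing decisions correlated across depth by construction, which is what the paper credits for better expert cooperation and specialization.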

[12] CARMA: Comprehensive Automatically-annotated Reddit Mental Health Dataset for Arabic

Saad Mankarious, Ayah Zirikly

Main category: cs.CL

TL;DR: CARMA is the first large-scale automatically annotated Arabic Reddit dataset for mental health detection, covering six conditions and enabling classification experiments with various models.

Motivation: Address the gap in Arabic mental health resources due to cultural stigma and lack of annotated datasets, while existing research focuses mainly on English.

Method: Created CARMA dataset from Arabic Reddit posts with automatic annotation for six mental health conditions and control group. Conducted lexical/semantic analysis and classification experiments using shallow classifiers to large language models.

Result: CARMA surpasses existing resources in scale and diversity. Classification experiments demonstrated promising results for mental health detection in Arabic, revealing linguistic markers of specific conditions.

Conclusion: The study shows potential for advancing mental health detection in underrepresented languages like Arabic through large-scale datasets and computational approaches.

Abstract: Mental health disorders affect millions worldwide, yet early detection remains a major challenge, particularly for Arabic-speaking populations where resources are limited and mental health discourse is often discouraged due to cultural stigma. While substantial research has focused on English-language mental health detection, Arabic remains significantly underexplored, partly due to the scarcity of annotated datasets. We present CARMA, the first automatically annotated large-scale dataset of Arabic Reddit posts. The dataset encompasses six mental health conditions, such as Anxiety, Autism, and Depression, and a control group. CARMA surpasses existing resources in both scale and diversity. We conduct qualitative and quantitative analyses of lexical and semantic differences between users, providing insights into the linguistic markers of specific mental health conditions. To demonstrate the dataset’s potential for further mental health analysis, we perform classification experiments using a range of models, from shallow classifiers to large language models. Our results highlight the promise of advancing mental health detection in underrepresented languages such as Arabic.

[13] Control Barrier Function for Aligning Large Language Models

Yuya Miyaoka, Masaki Inoue

Main category: cs.CL

TL;DR: A control-based framework using control barrier functions (CBF) to align LLMs without fine-tuning, ensuring user-desirable text generation through safety filters applied to predicted tokens.

Motivation: To ensure large language models generate user-desirable text through a safety-aligned approach without requiring fine-tuning of baseline models.

Method: Apply control barrier function (CBF) safety filters to predicted tokens from baseline LLMs, creating an add-on safety system that can directly use evaluation models for filter design.

Result: Implemented with open-source language models to successfully generate positive text through the CBF-based intervention system.

Conclusion: The CBF safety filter provides an effective add-on solution for LLM alignment that doesn’t require model fine-tuning and can directly incorporate evaluation models for desired outcomes.

Abstract: This paper proposes a control-based framework for aligning large language models (LLMs) by leveraging a control barrier function (CBF) to ensure user-desirable text generation. The presented framework applies the CBF safety filter to the predicted token generated from the baseline LLM, to intervene in the generated text. The safety filter includes two significant advantages: this safety filter is an add-on type, allowing it to be used for alignment purposes without fine-tuning the baseline LLM, and if there is an evaluation model regarding the desired alignment, it can be directly applied to the filter design. The overall text-generation system is implemented with open-source language models, aiming to generate positive text.
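The filtering step can be illustrated with a toy barrier condition: keep only candidate next tokens whose safety score, as judged by an evaluation model, does not fall below a CBF-style decay bound, then renormalize the distribution. The decay condition, scores, and tokens below are illustrative assumptions, not the paper's exact formulation:

```python
def cbf_filter(candidates, h_current, alpha=0.5):
    """CBF-style add-on filter: keep only next tokens whose projected safety
    score h_next satisfies h_next >= (1 - alpha) * h_current, then renormalize.
    `candidates` maps token -> (probability, h_next from an evaluation model)."""
    threshold = (1 - alpha) * h_current
    kept = {t: p for t, (p, h_next) in candidates.items() if h_next >= threshold}
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

candidates = {
    "great":    (0.5, 0.9),  # positive continuation, high safety score
    "terrible": (0.3, 0.1),  # violates the barrier condition
    "fine":     (0.2, 0.6),
}
filtered = cbf_filter(candidates, h_current=0.8, alpha=0.5)
print(sorted(filtered))  # ['fine', 'great']
```

Because the filter only reshapes the baseline model's output distribution, it matches the abstract's two claimed advantages: no fine-tuning of the baseline LLM, and any evaluation model that scores candidates can be plugged in directly as h.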

[14] MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

Kaiyuan Zhang, Chenghao Yang, Zhoufutu Wen, Sihang Yuan, Qiuyue Wang, Chaoyi Huang, Guosheng Zhu, He Wang, Huawenyu Lu, Jianing Wen, Jianpeng Jiao, Lishu Luo, Longxiang Liu, Sijin Wu, Xiaolei Zhu, Xuanliang Zhang, Ge Zhang, Yi Lin, Guang Shi, Chaoyou Fu, Wenhao Huang

Main category: cs.CL

TL;DR: MME-CC is a vision-grounded benchmark for evaluating multimodal LLMs’ cognitive capacity across spatial, geometric, and knowledge-based reasoning tasks, revealing current limitations and common error patterns.

Motivation: Existing multimodal benchmarks either overemphasize textual reasoning or fail to systematically assess vision-centric cognitive behaviors, leaving MLLMs' cognitive capacity insufficiently evaluated.

Method: Introduces MME-CC benchmark with 11 reasoning tasks organized into three categories (spatial, geometric, knowledge-based reasoning) and conducts extensive experiments on 16 representative MLLMs.

Result: Closed-source models lead overall (Gemini-2.5-Pro: 42.66 vs GLM-4.5V: 30.45), while spatial and geometric reasoning remain weak (≤30%). Identifies common error patterns including orientation mistakes, fragile cross-view identity persistence, and poor counterfactual instruction adherence.

Conclusion: The work aims to catalyze a shift toward treating cognitive capacity as central to both MLLM evaluation and model design, highlighting the importance of vision-centric cognitive assessment.

Abstract: As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information: spatial, geometric, and knowledge-based reasoning, and provides fine-grained analyses of MLLMs' cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (less than or equal to 30%). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions, and observe that Chain-of-Thought typically follows a three-stage process (extract -> reason -> verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.

[15] Who Sees the Risk? Stakeholder Conflicts and Explanatory Policies in LLM-based Risk Assessment

Srishti Yadav, Jasmina Gajcin, Erik Miehling, Elizabeth Daly

Main category: cs.CL

TL;DR: A framework using LLMs as judges to predict and explain AI risks from different stakeholder perspectives, generating interpretable policies and visualizing conflicts.

DetailsMotivation: Understanding how different stakeholders perceive AI risks is essential for responsible AI deployment and human-centered governance.

Method: Uses LLMs as judges with Risk Atlas Nexus and GloVE explanation method to generate stakeholder-specific, interpretable policies and interactive visualizations of conflicts.

Result: Stakeholder perspectives significantly influence risk perception and conflict patterns across medical AI, autonomous vehicles, and fraud detection domains.

Conclusion: Stakeholder-aware explanations are crucial for making LLM-based evaluations more transparent, interpretable, and aligned with human-centered AI governance goals.

Abstract: Understanding how different stakeholders perceive risks in AI systems is essential for their responsible deployment. This paper presents a framework for stakeholder-grounded risk assessment that uses LLMs, acting as judges, to predict and explain risks. Using the Risk Atlas Nexus and the GloVE explanation method, our framework generates stakeholder-specific, interpretable policies that show how different stakeholders agree or disagree about the same risks. We demonstrate our method on three real-world AI use cases: medical AI, autonomous vehicles, and fraud detection. We further propose an interactive visualization that reveals how and why conflicts emerge across stakeholder perspectives, enhancing transparency in conflict reasoning. Our results show that stakeholder perspectives significantly influence risk perception and conflict patterns. Our work emphasizes the importance of stakeholder-aware explanations in making LLM-based evaluations more transparent, interpretable, and aligned with human-centered AI governance goals.

[16] Measuring Aleatoric and Epistemic Uncertainty in LLMs: Empirical Evaluation on ID and OOD QA Tasks

Kevin Wang, Subre Abdoul Moktar, Jia Li, Kangshuo Li, Feng Chen

Main category: cs.CL

TL;DR: Comprehensive empirical study of 12 uncertainty estimation methods for LLMs in QA tasks, evaluating both in-distribution and out-of-distribution performance across 4 quality metrics.

DetailsMotivation: Ensuring trustworthiness of LLM outputs is crucial as they become more pervasive, and uncertainty estimation plays a key role in achieving this.

Method: Evaluated 12 UE methods including information-based, density-based, P(True), and semantic consistency approaches on QA tasks using both ID and OOD datasets with 4 generation quality metrics including LLMScore.

Result: Information-based methods perform best in ID settings, density-based methods and P(True) excel in OOD contexts, and semantic consistency methods show reliable performance across datasets.

Conclusion: Different uncertainty estimation methods have complementary strengths: no single method is optimal for all situations, highlighting the need for context-aware uncertainty measurement in LLMs.

Abstract: Large Language Models (LLMs) have become increasingly pervasive, finding applications across many industries and disciplines. Ensuring the trustworthiness of LLM outputs is paramount, where Uncertainty Estimation (UE) plays a key role. In this work, a comprehensive empirical study is conducted to examine the robustness and effectiveness of diverse UE measures regarding aleatoric and epistemic uncertainty in LLMs. It involves twelve different UE methods and four generation quality metrics, including LLMScore from LLM critics, to evaluate the uncertainty of LLM-generated answers in Question-Answering (QA) tasks on both in-distribution (ID) and out-of-distribution (OOD) datasets. Our analysis reveals that information-based methods, which leverage token and sequence probabilities, perform exceptionally well in ID settings due to their alignment with the model’s understanding of the data. Conversely, density-based methods and the P(True) metric exhibit superior performance in OOD contexts, highlighting their effectiveness in capturing the model’s epistemic uncertainty. Semantic consistency methods, which assess variability in generated answers, show reliable performance across different datasets and generation metrics. These methods generally perform well but may not be optimal for every situation.
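Two of the method families compared above can be illustrated with a minimal sketch (toy scoring functions, not the paper's implementations): an information-based score built from token probabilities, and a semantic-consistency score built from repeated sampling.

```python
import math
from collections import Counter

def sequence_uncertainty(token_probs):
    """Information-based UE sketch: mean negative log-probability of the
    generated tokens (higher = more uncertain)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def consistency_uncertainty(sampled_answers):
    """Semantic-consistency UE sketch: disagreement among repeated samples,
    computed as 1 - frequency of the modal answer (higher = more uncertain)."""
    modal_count = Counter(sampled_answers).most_common(1)[0][1]
    return 1.0 - modal_count / len(sampled_answers)
```

A fully confident generation (all token probabilities 1.0) scores 0 under both views; disagreement among samples raises the consistency score.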

[17] BengaliMoralBench: A Benchmark for Auditing Moral Reasoning in Large Language Models within Bengali Language and Culture

Shahriyar Zaman Ridoy, Azmine Toushik Wasi, Koushik Ahamed Tonmoy

Main category: cs.CL

TL;DR: BengaliMoralBench is the first large-scale ethics benchmark for Bengali language and culture, evaluating multilingual LLMs’ alignment with local ethical norms across five moral domains using native-speaker annotations.

DetailsMotivation: Multilingual LLMs are gaining traction in South Asia but existing ethics benchmarks are English-centric and Western-focused, overlooking cultural nuances critical for Bengali speakers (285+ million people).

Method: Created BengaliMoralBench covering five moral domains (Daily Activities, Habits, Parenting, Family Relationships, Religious Activities) with 50 culturally relevant subtopics, annotated via native-speaker consensus using three ethical lenses: Virtue, Commonsense, and Justice ethics.

Result: Systematic zero-shot evaluation of prominent multilingual LLMs (Llama, Gemma, Qwen, DeepSeek) showed wide performance variation (50-91% accuracy) with consistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness.

Conclusion: BengaliMoralBench provides a foundation for responsible localization, enabling culturally aligned evaluation and supporting ethically robust AI deployment in diverse, low-resource multilingual settings like Bangladesh.

Abstract: As multilingual Large Language Models (LLMs) gain traction across South Asia, their alignment with local ethical norms, particularly for Bengali, which is spoken by over 285 million people and ranked 6th globally, remains underexplored. Existing ethics benchmarks are largely English-centric and shaped by Western frameworks, overlooking cultural nuances critical for real-world deployment. To address this, we introduce BengaliMoralBench, the first large-scale ethics benchmark for the Bengali language and socio-cultural contexts. It covers five moral domains, Daily Activities, Habits, Parenting, Family Relationships, and Religious Activities, subdivided into 50 culturally relevant subtopics. Each scenario is annotated via native-speaker consensus using three ethical lenses: Virtue, Commonsense, and Justice ethics. We conduct systematic zero-shot evaluation of prominent multilingual LLMs, including Llama, Gemma, Qwen, and DeepSeek, using a unified prompting protocol and standard metrics. Performance varies widely (50-91% accuracy), with qualitative analysis revealing consistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness. BengaliMoralBench provides a foundation for responsible localization, enabling culturally aligned evaluation and supporting the deployment of ethically robust AI in diverse, low-resource multilingual settings such as Bangladesh.

[18] LGM: Enhancing Large Language Models with Conceptual Meta-Relations and Iterative Retrieval

Wenchang Lei, Ping Zou, Yue Wang, Feng Sun, Lei Zhao

Main category: cs.CL

TL;DR: The Language Graph Model (LGM) enhances LLMs’ conceptual understanding by extracting meta-relations from text and using a reflection mechanism, enabling processing of unlimited-length texts without truncation and outperforming standard RAG methods.

DetailsMotivation: LLMs struggle with ambiguous or conceptually misaligned terms in user instructions, requiring better methods for conceptual clarity and interpretation.

Method: Extracts meta-relations (inheritance, alias, composition) from natural language, employs reflection mechanism for validation, and uses Concept Iterative Retrieval Algorithm to dynamically supply relations and descriptions to LLMs.

Result: LGM consistently outperforms existing RAG baselines on standard benchmarks and enables processing of texts of any length without truncation.

Conclusion: The proposed LGM approach effectively enhances LLMs’ conceptual interpretation capabilities and overcomes limitations of conventional RAG methods.

Abstract: Large language models (LLMs) exhibit strong semantic understanding, yet struggle when user instructions involve ambiguous or conceptually misaligned terms. We propose the Language Graph Model (LGM) to enhance conceptual clarity by extracting meta-relations (inheritance, alias, and composition) from natural language. The model further employs a reflection mechanism to validate these meta-relations. Leveraging a Concept Iterative Retrieval Algorithm, these relations and related descriptions are dynamically supplied to the LLM, improving its ability to interpret concepts and generate accurate responses. Unlike conventional Retrieval-Augmented Generation (RAG) approaches that rely on extended context windows, our method enables large language models to process texts of any length without the need for truncation. Experiments on standard benchmarks demonstrate that the LGM consistently outperforms existing RAG baselines.
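The meta-relation idea can be pictured as a small concept graph over the three relation types; the breadth-first loop below is a simplified stand-in for the paper's Concept Iterative Retrieval Algorithm, with illustrative node names.

```python
# Toy concept graph keyed by (concept, meta-relation); names are illustrative.
GRAPH = {
    ("GPU", "inheritance"): ["Processor"],
    ("GPU", "alias"): ["graphics card"],
    ("Workstation", "composition"): ["GPU", "CPU"],
}

def retrieve(seed_concepts, hops=2):
    """Follow the three meta-relation types outward from the seed concepts
    for a fixed number of hops, collecting every reachable concept."""
    seen = set(seed_concepts)
    frontier = list(seed_concepts)
    for _ in range(hops):
        next_frontier = []
        for concept in frontier:
            for relation in ("inheritance", "alias", "composition"):
                for neighbor in GRAPH.get((concept, relation), []):
                    if neighbor not in seen:
                        seen.add(neighbor)
                        next_frontier.append(neighbor)
        frontier = next_frontier
    return seen
```

Starting from "Workstation", two hops reach its components ("GPU", "CPU") and then their parents and aliases, which would then be supplied to the LLM as context.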

[19] Hybrid Fact-Checking that Integrates Knowledge Graphs, Large Language Models, and Search-Based Retrieval Agents Improves Interpretable Claim Verification

Shaghayegh Kolli, Richard Rosenbaum, Timo Cavelius, Lasse Strothe, Andrii Lata, Jana Diesner

Main category: cs.CL

TL;DR: Hybrid fact-checking system combining LLMs, knowledge graphs, and web search agents achieves high accuracy on FEVER benchmark without task-specific fine-tuning, with fallback strategies for comprehensive coverage.

DetailsMotivation: Address limitations of LLMs (lack of reliable grounding) and knowledge graphs (limited coverage/latency) by integrating their complementary strengths for more robust fact-checking.

Method: Three-step pipeline: 1) KG retrieval from DBpedia, 2) LM classification with task-specific prompts, 3) Web search agent as fallback when KG coverage is insufficient.

Result: Achieves F1 score of 0.93 on FEVER benchmark Supported/Refuted split; successfully identifies valid evidence for claims originally labeled as Not Enough Information.

Conclusion: Presents a modular, open-source fact-checking pipeline with effective fallback strategies and generalization across datasets.

Abstract: Large language models (LLMs) excel at generating fluent utterances but can lack reliable grounding in verified information. At the same time, knowledge-graph-based fact-checkers deliver precise and interpretable evidence, yet suffer from limited coverage or latency. By integrating LLMs with knowledge graphs and real-time search agents, we introduce a hybrid fact-checking approach that leverages the individual strengths of each component. Our system comprises three autonomous steps: 1) Knowledge Graph (KG) retrieval for rapid one-hop lookups in DBpedia, 2) LM-based classification guided by a task-specific labeling prompt, producing outputs with internal rule-based logic, and 3) a web search agent invoked only when KG coverage is insufficient. Our pipeline achieves an F1 score of 0.93 on the Supported/Refuted split of the FEVER benchmark without task-specific fine-tuning. To address Not Enough Information (NEI) cases, we conduct a targeted re-annotation study showing that our approach frequently uncovers valid evidence for claims originally labeled NEI, as confirmed by both expert annotators and LLM reviewers. With this paper, we present a modular, open-source fact-checking pipeline with fallback strategies and generalization across datasets.
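The three-step fallback control flow might be sketched as follows; the stub functions (kg_lookup, lm_classify, web_search) are hypothetical stand-ins for the DBpedia retriever, the prompted LM classifier, and the search agent.

```python
def kg_lookup(claim):
    # Stub for one-hop DBpedia retrieval: returns evidence triples or None.
    toy_kg = {"Paris is in France": ["(Paris, country, France)"]}
    return toy_kg.get(claim)

def web_search(claim):
    # Stub for the web search agent, invoked only on insufficient KG coverage.
    return ["web snippet mentioning: " + claim]

def lm_classify(claim, evidence):
    # Stub for prompt-guided LM labeling over the retrieved evidence.
    return "SUPPORTED" if evidence else "NOT ENOUGH INFO"

def verify(claim):
    """Fallback pipeline: try the KG first, fall back to web search,
    then classify with the LM."""
    evidence = kg_lookup(claim)
    if evidence is None:  # KG coverage insufficient -> invoke search agent
        evidence = web_search(claim)
    return lm_classify(claim, evidence)
```

The point of the structure is that the cheap, precise KG path handles most claims, while the slower search agent only runs when the KG has no answer.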

[20] Beyond Ranked Lists: The SARAL Framework for Cross-Lingual Document Set Retrieval

Shantanu Agarwal, Joel Barry, Elizabeth Boschee, Scott Miller

Main category: cs.CL

TL;DR: SARAL system for cross-lingual information retrieval exceeded competitors in 5 out of 6 evaluation conditions across Farsi, Kazakh, and Georgian languages.

DetailsMotivation: To advance cross-lingual information retrieval (CLIR) capabilities, particularly focusing on retrieving query-relevant document sets rather than just ranked lists.

Method: Developed a novel approach for CLIR that emphasizes retrieving document sets relevant to queries, as part of the MATERIAL initiative.

Result: In Phase-3 evaluations, SARAL outperformed other teams in five out of six evaluation conditions spanning three different languages (Farsi, Kazakh, and Georgian).

Conclusion: The SARAL system demonstrated superior performance in cross-lingual information retrieval, successfully handling multiple languages and achieving better results than competing approaches.

Abstract: Machine Translation for English Retrieval of Information in Any Language (MATERIAL) is an IARPA initiative targeted at advancing the state of cross-lingual information retrieval (CLIR). This report provides a detailed description of the Information Sciences Institute's (ISI's) Summarization and domain-Adaptive Retrieval Across Languages (SARAL) effort for MATERIAL. Specifically, we outline our team's novel approach to CLIR, with emphasis on developing an approach amenable to retrieving a query-relevant document set, not just a ranked document list. In MATERIAL's Phase-3 evaluations, SARAL exceeded the performance of other teams in five out of six evaluation conditions spanning three different languages (Farsi, Kazakh, and Georgian).
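The set-versus-list distinction can be made concrete with a minimal sketch: one simple way to produce a retrieved set from relevance scores is thresholding, shown below purely as an illustration (this is not SARAL's actual decision rule).

```python
def ranked_to_set(scored_docs, threshold):
    """Turn score-ranked (doc, score) pairs into a retrieved *set*:
    every document at or above the threshold is returned, however many
    that is, rather than a fixed-length ranked list."""
    return {doc for doc, score in scored_docs if score >= threshold}
```

Unlike a top-k cut of a ranked list, the returned set can be empty for queries with no relevant documents, which matters for set-oriented evaluation.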

[21] IndicSuperTokenizer: An Optimized Tokenizer for Indic Multilingual LLMs

Souvik Rana, Arul Menezes, Ashish Kulkarni, Chandra Khatri, Shubham Agarwal

Main category: cs.CL

TL;DR: IndicSuperTokenizer combines subword and multi-word tokenization with language-specific pre-tokenization for Indic multilingual LLMs, achieving 39.5% better fertility score than LLaMA4 and 44% inference throughput improvement.

DetailsMotivation: Current subword tokenizers like BPE are underexplored for multilingual settings, especially for Indic languages with diverse scripts and rich morphological variation, limiting LLM performance and efficiency.

Method: Combines subword and multi-word tokenization with language-specific pre-tokenization to create linguistically aligned tokens for Indic languages.

Result: Achieves 39.5% better average fertility score than LLaMA4 and 18% better than Sutra across English, 22 Indian languages and code data, with 44% inference throughput improvement while maintaining comparable benchmark performance.

Conclusion: The proposed tokenizer design is robust and effective for multilingual LLMs, demonstrating significant improvements in efficiency and linguistic alignment for Indic languages.

Abstract: Tokenizers play a crucial role in determining the performance, training efficiency, and the inference cost of Large Language Models (LLMs). Designing effective tokenizers for multilingual LLMs is particularly challenging due to diverse scripts and rich morphological variation. While subword methods such as Byte Pair Encoding (BPE) are widely adopted, their effectiveness in multilingual settings remains underexplored. We present IndicSuperTokenizer, a tokenizer for Indic multilingual LLMs, that combines both subword and multi-word tokenization, along with language-specific pre-tokenization, leading to more linguistically aligned tokens and achieving a new state-of-the-art in fertility score. Evaluated across English, 22 Indian languages and code data, our tokenizer improves the average fertility score by 39.5% over LLaMA4 and by 18% over Sutra (the current best). This translates to 44% improvement in inference throughput over LLaMA4 while maintaining comparable performance on English and Indic benchmarks. We also present detailed ablations across tokenizer training data size, vocabulary size, merging techniques, and pre-tokenization strategies, demonstrating the robustness of our design choices.
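Fertility, the metric the abstract reports, is commonly computed as tokens emitted per word (lower is better); the sketch below uses a toy tokenizer as a stand-in for a trained BPE model.

```python
def fertility(tokenize, texts):
    """Fertility = average tokens emitted per whitespace-delimited word
    (lower is better; 1.0 means one token per word)."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

def toy_bigram_tokenizer(text):
    # Stand-in for a trained subword model: split each word into 2-char pieces.
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]
```

A multi-word tokenizer like IndicSuperTokenizer's can push fertility below 1.0 by merging several words into one token, which directly raises inference throughput.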

[22] Comparing the Performance of LLMs in RAG-based Question-Answering: A Case Study in Computer Science Literature

Ranul Dayarathne, Uvini Ranaweera, Upeksha Ganegoda

Main category: cs.CL

TL;DR: Comparison of four open-source LLMs (Mistral-7b-instruct, LLaMa2-7b-chat, Falcon-7b-instruct, Orca-mini-v3-7b) and GPT-3.5 on QA tasks using RAG in computer science literature, evaluating accuracy, precision, rankings, and latency.

DetailsMotivation: To compare performance of different LLMs in question-answering tasks across diverse domains using Retrieval Augmented Generation (RAG) to reduce hallucination and enhance model capabilities.

Method: Used RAG support for QA tasks in computer science literature, evaluated with accuracy/precision for binary questions, human/Gemini rankings and cosine similarity for long-answer questions, plus latency measurements.

Result: GPT-3.5 with RAG performed best overall. Among open-source models, Mistral-7b-instruct with RAG surpassed the others in both binary and long-answer QA. Orca-mini-v3-7b had the shortest latency, while LLaMa2-7b-chat had the highest.

Conclusion: Open-source LLMs can compete with proprietary models like GPT-3.5 when properly supported by infrastructure, with Mistral-7b-instruct showing strong performance and Orca-mini-v3-7b offering fastest response times.

Abstract: Retrieval Augmented Generation (RAG) is emerging as a powerful technique to enhance the capabilities of Generative AI models by reducing hallucination. Thus, the increasing prominence of RAG alongside Large Language Models (LLMs) has sparked interest in comparing the performance of different LLMs in question-answering (QA) across diverse domains. This study compares the performance of four open-source LLMs, Mistral-7b-instruct, LLaMa2-7b-chat, Falcon-7b-instruct and Orca-mini-v3-7b, and OpenAI’s GPT-3.5 on QA tasks within the computer science literature, leveraging RAG support. Evaluation metrics employed in the study include accuracy and precision for binary questions, and ranking by a human expert, ranking by Google’s AI model Gemini, and cosine similarity for long-answer questions. GPT-3.5, when paired with RAG, effectively answers binary and long-answer questions, reaffirming its status as an advanced LLM. Among the open-source LLMs, Mistral AI’s Mistral-7b-instruct paired with RAG surpasses the rest in answering both binary and long-answer questions. However, among the open-source LLMs, Orca-mini-v3-7b reports the shortest average latency in generating responses, whereas LLaMa2-7b-chat by Meta reports the highest average latency. This research underscores that open-source LLMs can keep pace with proprietary models like GPT-3.5 when supported by adequate infrastructure.
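The long-answer metric mentioned above, cosine similarity between answer embeddings, reduces to a one-line formula; a dependency-free sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors:
    dot(a, b) / (||a|| * ||b||), ranging from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

In such an evaluation, the generated answer and the reference answer would each be embedded (by any sentence-embedding model) and scored with this function; identical directions score 1.0, orthogonal ones 0.0.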

[23] SCALE: Upscaled Continual Learning of Large Language Models

Jin-woo Lee, Junhwa Choi, Bongkyu Hwang, Jinho Choo, Bogun Kim, JeongSeon Yi, Joonseok Lee, DongYoung Jung, Jaeseon Park, Kyoungwon Park, Suk-hoon Jung

Main category: cs.CL

TL;DR: SCALE is a width upscaling architecture for continual pre-training that inserts lightweight expansion into linear modules while freezing pre-trained parameters, preserving base model functionality while increasing capacity.

DetailsMotivation: Progress in continual pre-training depends more on scaling the right structure than just scaling parameters alone, addressing the need to preserve base model knowledge while acquiring new capabilities.

Method: SCALE uses width upscaling with Persistent Preservation (freezing pre-trained weights) and Collaborative Adaptation (selectively training expansion components), with variants SCALE-Preserve, SCALE-Adapt, and SCALE-Route for token-level routing.

Result: SCALE mitigates severe forgetting compared to depth expansion, achieves less forgetting on English evaluations while gaining competitive performance on Korean benchmarks, and offers the best stability-plasticity trade-off.

Conclusion: SCALE’s preservation-adaptation interplay stabilizes optimization and provides an effective approach for continual pre-training that balances knowledge retention with new capability acquisition.

Abstract: We revisit continual pre-training for large language models and argue that progress now depends more on scaling the right structure than on scaling parameters alone. We introduce SCALE, a width upscaling architecture that inserts lightweight expansion into linear modules while freezing all pre-trained parameters. This preserves the residual and attention topologies and increases capacity without perturbing the base model’s original functionality. SCALE is guided by two principles: Persistent Preservation, which maintains the base model’s behavior via preservation-oriented initialization and freezing of the pre-trained weights, and Collaborative Adaptation, which selectively trains a subset of expansion components to acquire new knowledge with minimal interference. We instantiate these ideas as SCALE-Preserve (preservation-first), SCALE-Adapt (adaptation-first), and SCALE-Route, an optional routing extension that performs token-level routing between preservation and adaptation heads. On a controlled synthetic biography benchmark, SCALE mitigates the severe forgetting observed with depth expansion while still acquiring new knowledge. In continual pre-training on a Korean corpus, SCALE variants achieve less forgetting on English evaluations and competitive gains on Korean benchmarks, with these variants offering the best overall stability-plasticity trade-off. Accompanying analysis clarifies when preservation provably holds and why the interplay between preservation and adaptation stabilizes optimization compared to standard continual learning setups.
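A hedged sketch of the core SCALE idea, assuming a low-rank-style expansion with zero initialization as one concrete realization of preservation-oriented init (the paper's exact parameterization may differ):

```python
import numpy as np

class ScaledLinear:
    """Width-upscaling sketch: a frozen pre-trained weight plus a trainable
    expansion path. Zero-initializing the expansion means the layer's output
    exactly matches the base model at the start of continual training
    (assumption: the paper's preservation-oriented init behaves similarly)."""

    def __init__(self, W_pretrained, expand_dim):
        self.W = W_pretrained                      # frozen pre-trained weight
        d_out, d_in = W_pretrained.shape
        self.A = np.zeros((expand_dim, d_in))      # trainable down-projection
        self.B = np.zeros((d_out, expand_dim))     # trainable up-projection

    def forward(self, x):
        # Base path is untouched; only the expansion path receives gradients.
        return self.W @ x + self.B @ (self.A @ x)
```

Because the residual and attention topologies around the linear module are unchanged, capacity grows without perturbing the base model's original function.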

[24] How to Evaluate Speech Translation with Source-Aware Neural MT Metrics

Mauro Cettolo, Marco Gaido, Matteo Negri, Sara Papi, Luisa Bentivogli

Main category: cs.CL

TL;DR: This paper introduces source-aware metrics for speech-to-text translation evaluation that incorporate audio source information through ASR transcripts or back-translations, addressing limitations of reference-only evaluation.

DetailsMotivation: Current ST evaluation relies on reference translations but ignores valuable source audio information. While source-aware metrics work well in MT, extending them to ST is challenging due to audio sources and lack of reliable transcripts.

Method: Two strategies for generating textual proxies: ASR transcripts and back-translations of references. Introduces cross-lingual re-segmentation algorithm to align synthetic sources with references. Evaluated on 79 language pairs across 6 ST systems.

Result: ASR transcripts are more reliable than back-translations when WER < 20%, while back-translations offer computationally cheaper alternative. Cross-lingual re-segmentation enables robust use of source-aware MT metrics in ST evaluation.

Conclusion: Source-aware metrics with synthetic sources enable more accurate ST evaluation, with ASR transcripts being preferred when accurate, and back-translations as effective alternatives. The re-segmentation algorithm facilitates principled ST evaluation methodologies.

Abstract: Automatic evaluation of speech-to-text translation (ST) systems is typically performed by comparing translation hypotheses with one or more reference translations. While effective to some extent, this approach inherits the limitation of reference-based evaluation that ignores valuable information from the source input. In machine translation (MT), recent progress has shown that neural metrics incorporating the source text achieve stronger correlation with human judgments. Extending this idea to ST, however, is not trivial because the source is audio rather than text, and reliable transcripts or alignments between source and references are often unavailable. In this work, we conduct the first systematic study of source-aware metrics for ST, with a particular focus on real-world operating conditions where source transcripts are not available. We explore two complementary strategies for generating textual proxies of the input audio, namely automatic speech recognition (ASR) transcripts and back-translations of the reference translation, and introduce a novel two-step cross-lingual re-segmentation algorithm to address the alignment mismatch between synthetic sources and reference translations. Our experiments, carried out on two ST benchmarks covering 79 language pairs and six ST systems with diverse architectures and performance levels, show that ASR transcripts constitute a more reliable synthetic source than back-translations when word error rate is below 20%, while back-translations always represent a computationally cheaper but still effective alternative. Furthermore, our cross-lingual re-segmentation algorithm enables robust use of source-aware MT metrics in ST evaluation, paving the way toward more accurate and principled evaluation methodologies for speech translation.
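The 20% threshold above refers to the standard word error rate (WER) metric: word-level edit distance between reference transcript and ASR hypothesis, normalized by reference length. A minimal implementation:

```python
def wer(reference, hypothesis):
    """Word error rate: minimum number of word substitutions, insertions,
    and deletions to turn the hypothesis into the reference, divided by
    the number of reference words."""
    r, h = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic-programming table over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(r)][len(h)] / len(r)
```

Under the paper's finding, transcripts with wer(...) below 0.20 make ASR the preferred synthetic source; above that, back-translations become the safer proxy.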

[25] Benchmarking the Thinking Mode of Multimodal Large Language Models in Clinical Tasks

Jindong Hong, Tianjie Chen, Lingjie Luo, Chuanyang Zheng, Ting Xu, Haibao Yu, Jianing Qiu, Qianzhong Chen, Suning Huang, Yan Xu, Yong Gui, Yijun He, Jiankai Sun

Main category: cs.CL

TL;DR: Thinking mode in multimodal LLMs provides only marginal performance improvements over standard mode for medical tasks, with suboptimal performance on complex clinical applications.

DetailsMotivation: To evaluate how enhanced reasoning processes in dual-state MLLMs impact performance and reliability in clinical tasks, given the rapid adoption of these reasoning-capable models.

Method: Evaluated thinking mode capabilities of Seed1.5-VL and Gemini-2.5-Flash on four visual medical tasks using VQA-RAD and ROCOv2 datasets.

Result: Activating thinking mode showed only marginal improvements compared to standard non-thinking mode for most tasks. Performance on complex medical tasks like open-ended VQA and medical image interpretation remained suboptimal.

Conclusion: Current reasoning MLLMs need domain-specific medical data and more advanced methods for medical knowledge integration to effectively handle complex clinical applications.

Abstract: A recent advancement in Multimodal Large Language Models (MLLMs) research is the emergence of “reasoning MLLMs” that offer explicit control over their internal thinking processes (normally referred to as the “thinking mode”) alongside the standard “non-thinking mode”. This capability allows these models to engage in a step-by-step process of internal deliberation before generating a final response. With the rapid transition to and adoption of these “dual-state” MLLMs, this work rigorously evaluated how the enhanced reasoning processes of these MLLMs impact model performance and reliability in clinical tasks. This paper evaluates the active “thinking mode” capabilities of two leading MLLMs, Seed1.5-VL and Gemini-2.5-Flash, for medical applications. We assessed their performance on four visual medical tasks using the VQA-RAD and ROCOv2 datasets. Our findings reveal that the improvement from activating the thinking mode remains marginal compared to the standard non-thinking mode for the majority of the tasks. Their performance on complex medical tasks such as open-ended VQA and medical image interpretation remains suboptimal, highlighting the need for domain-specific medical data and more advanced methods for medical knowledge integration.

[26] Generative Artificial Intelligence in Bioinformatics: A Systematic Review of Models, Applications, and Methodological Advances

Riasad Alvi, Sayeem Been Zaman, Wasimul Karim, Arefin Ittesafun Abian, Mohaimenul Azam Khan Raiaan, Saddam Mukta, Md Rafi Ur Rashid, Md Rafiqul Islam, Yakub Sebastian, Sami Azam

Main category: cs.CL

TL;DR: This systematic review evaluates GenAI’s transformative impact in bioinformatics, analyzing applications across genomics, proteomics, and drug discovery through six research questions that assess methodological advancements, performance, and future directions.

DetailsMotivation: To systematically identify and evaluate the growing developments of generative AI in bioinformatics, given its transformative potential across genomics, proteomics, transcriptomics, structural biology, and drug discovery.

Method: Used systematic review methodology with six research questions (RQs) following preferred reporting items for systematic reviews and meta-analysis methods to evaluate GenAI strategies in methodological advancement, predictive performance, and specialization.

Result: GenAI demonstrates superior performance over traditional methods across bioinformatics subfields, with specialized model architectures outperforming general-purpose models. Significant benefits were found in molecular analysis and data integration, improving accuracy and reducing errors. Structural modeling, functional prediction, and synthetic data generation showed validated improvements.

Conclusion: While GenAI shows transformative potential in bioinformatics with validated performance improvements, current constraints include scalability limitations and data biases affecting generalizability. Future directions should focus on robust evaluation and biologically grounded modeling, supported by diverse molecular, cellular, and textual datasets.

Abstract: Generative artificial intelligence (GenAI) has become a transformative approach in bioinformatics that often enables advancements in genomics, proteomics, transcriptomics, structural biology, and drug discovery. To systematically identify and evaluate these growing developments, this review proposes six research questions (RQs), following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) method. The objective is to evaluate impactful GenAI strategies in methodological advancement, predictive performance, and specialization, and to identify promising approaches for advanced modeling, data-intensive discovery, and integrative biological analysis. RQ1 highlights diverse applications across multiple bioinformatics subfields (sequence analysis, molecular design, and integrative data modeling), which demonstrate superior performance over traditional methods through pattern recognition and output generation. RQ2 reveals that adapted specialized model architectures outperformed general-purpose models, an advantage attributed to targeted pretraining and context-aware strategies. RQ3 identifies significant benefits in the bioinformatics domains, focusing on molecular analysis and data integration, which improves accuracy and reduces errors in complex analysis. RQ4 indicates improvements in structural modeling, functional prediction, and synthetic data generation, validated by established benchmarks. RQ5 suggests the main constraints, such as the lack of scalability and biases in data that impact generalizability, and proposes future directions focused on robust evaluation and biologically grounded modeling. RQ6 examines how molecular datasets (such as UniProtKB and ProteinNet12), cellular datasets (such as CELLxGENE and GTEx) and textual resources (such as PubMedQA and OMIM) broadly support the training and generalization of GenAI models.

[27] Silenced Biases: The Dark Side LLMs Learned to Refuse

Rom Himelstein, Amit LeVi, Brit Youngmann, Yaniv Nemcovsky, Avi Mendelson

Main category: cs.CL

TL;DR: The paper introduces the Silenced Bias Benchmark (SBB) to uncover hidden biases in safety-aligned LLMs that are masked by refusal responses, using activation steering to reveal underlying unfair preferences.

DetailsMotivation: Current fairness evaluation methods for LLMs often misinterpret refusal responses as positive fairness indicators, creating a false sense of fairness while overlooking deeper biases concealed by safety alignment.

Method: Proposes SBB using activation steering to reduce model refusals during QA, enabling scalable detection of silenced biases without relying on prompt manipulation or handcrafted implicit queries.

Result: Demonstrates on multiple LLMs that there’s an alarming distinction between models’ direct responses and their underlying fairness issues, exposing biases that standard evaluations miss.

Conclusion: SBB provides a scalable fairness evaluation framework that can expand to new demographics and subjects, encouraging development of truly fair models beyond alignment masking effects.

Abstract: Safety-aligned large language models (LLMs) are becoming increasingly widespread, especially in sensitive applications where fairness is essential and biased outputs can cause significant harm. However, evaluating the fairness of models is a complex challenge, and approaches that do so typically utilize standard question-answer (QA) styled schemes. Such methods often overlook deeper issues by interpreting the model’s refusal responses as positive fairness measurements, which creates a false sense of fairness. In this work, we introduce the concept of silenced biases, which are unfair preferences encoded within models’ latent space and are effectively concealed by safety-alignment. Previous approaches that considered similar indirect biases often relied on prompt manipulation or handcrafted implicit queries, which present limited scalability and risk contaminating the evaluation process with additional biases. We propose the Silenced Bias Benchmark (SBB), which aims to uncover these biases by employing activation steering to reduce model refusals during QA. SBB supports easy expansion to new demographic groups and subjects, presenting a fairness evaluation framework that encourages the future development of fair models and tools beyond the masking effects of alignment training. We demonstrate our approach over multiple LLMs, where our findings expose an alarming distinction between models’ direct responses and their underlying fairness issues.
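The activation-steering idea behind SBB can be sketched as follows: estimate a "refusal" direction from hidden states of refused versus complied-with prompts, then remove a model's component along that direction at inference time so the underlying preference surfaces. A minimal NumPy illustration with toy data (the vector names, data, and scaling are hypothetical; the paper's actual steering procedure may differ):

```python
import numpy as np

def refusal_direction(h_refuse, h_comply):
    """Estimate a 'refusal' direction as the normalized mean difference
    between hidden states of refused and complied-with prompts."""
    d = h_refuse.mean(axis=0) - h_comply.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden, direction, alpha=1.0):
    """Remove (a fraction alpha of) the projection onto the refusal
    direction, nudging the model away from refusing."""
    return hidden - alpha * (hidden @ direction) * direction

rng = np.random.default_rng(0)
h_refuse = rng.normal(size=(8, 16)) + 2.0  # toy states of refused prompts
h_comply = rng.normal(size=(8, 16))        # toy states of answered prompts
d = refusal_direction(h_refuse, h_comply)

h = rng.normal(size=16) + 2.0
h_steered = steer(h, d)
# With alpha=1 the steered state has zero component along the direction.
print(float(h_steered @ d))
```

Because the direction is unit-norm, a full steering step (alpha=1) exactly zeroes the projection, which is why the technique scales without handcrafted implicit queries.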

[28] EQ-Negotiator: Dynamic Emotional Personas Empower Small Language Models for Edge-Deployable Credit Negotiation

Yunbo Long, Yuhan Liu, Alexandra Brintrup

Main category: cs.CL

TL;DR: EQ-Negotiator bridges the performance gap between small and large language models in automated negotiation by using emotional personas and a reasoning system that integrates game theory with Hidden Markov Models to track emotional states online.

Motivation: Large language models are computationally expensive and privacy-invasive for on-device applications, while small language models suffer from performance gaps in handling emotionally charged negotiations.

Method: EQ-Negotiator framework integrates game theory with Hidden Markov Models to learn and track debtor emotional states online without pre-training, enabling strategic intelligence for SLMs.

Result: A 7B parameter model with EQ-Negotiator achieves better debt recovery and negotiation efficiency than baseline LLMs 10x larger, outperforming in adversarial scenarios like cheating, threatening, and victim-playing.

Conclusion: Strategic emotional intelligence, not raw model scale, is critical for automated negotiation success, enabling effective, ethical, and privacy-preserving AI negotiators for edge deployment.

Abstract: The deployment of large language models (LLMs) in automated negotiation has set a high performance benchmark, but their computational cost and data privacy requirements render them unsuitable for many privacy-sensitive, on-device applications such as mobile assistants, embodied AI agents or private client interactions. While small language models (SLMs) offer a practical alternative, they suffer from a significant performance gap compared to LLMs in playing emotionally charged complex personas, especially for credit negotiation. This paper introduces EQ-Negotiator, a novel framework that bridges this capability gap using emotional personas. Its core is a reasoning system that integrates game theory with a Hidden Markov Model (HMM) to learn and track debtor emotional states online, without pre-training. This allows EQ-Negotiator to equip SLMs with the strategic intelligence to counter manipulation while de-escalating conflict and upholding ethical standards. Through extensive agent-to-agent simulations across diverse credit negotiation scenarios, including adversarial debtor strategies like cheating, threatening, and playing the victim, we show that a 7B parameter language model with EQ-Negotiator achieves better debt recovery and negotiation efficiency than baseline LLMs more than 10 times its size. This work advances persona modeling from descriptive character profiles to dynamic emotional architectures that operate within privacy constraints. Moreover, this paper establishes that strategic emotional intelligence, not raw model scale, is the critical factor for success in automated negotiation, paving the way for effective, ethical, and privacy-preserving AI negotiators that can operate on the edge.
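The online state-tracking component can be illustrated with a standard HMM forward update: a belief over the debtor's emotional state is predicted through a transition matrix, reweighted by the emission probability of the observed utterance cue, and renormalized after every turn. A toy sketch (the states, probabilities, and cue categories are invented for illustration, not taken from the paper):

```python
import numpy as np

STATES = ["cooperative", "angry", "evasive"]
# Row i: probability of moving from state i to each state next turn.
T = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.3, 0.2, 0.5]])
# Column j: probability that state i emits observed cue j.
E = np.array([[0.7, 0.1, 0.2],
              [0.1, 0.7, 0.2],
              [0.2, 0.2, 0.6]])
CUES = {"polite": 0, "hostile": 1, "deflecting": 2}

def forward_update(belief, cue):
    """One HMM forward step: predict with T, weight by emission, renormalize."""
    predicted = belief @ T
    updated = predicted * E[:, CUES[cue]]
    return updated / updated.sum()

belief = np.array([1 / 3, 1 / 3, 1 / 3])  # start uninformed
for cue in ["hostile", "hostile", "polite"]:
    belief = forward_update(belief, cue)
print(STATES[int(belief.argmax())])
```

Because the update is a closed-form recursion, the tracker needs no pre-training, which is what makes it viable on edge-deployed SLMs.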

[29] LFC-DA: Logical Formula-Controlled Data Augmentation for Enhanced Logical Reasoning

Shenghao Li

Main category: cs.CL

TL;DR: LFC-DA is a symbolic-logic-controlled pipeline for logical data augmentation that maps text to propositional expressions, compiles rules, searches for valid formulas, and verbalizes them back to natural language questions, improving logical reasoning in LLMs.

Motivation: Current methods for logical data augmentation either rely heavily on costly human annotation or generate uninterpretable and logically homogeneous examples using large language models directly.

Method: A symbolic-logic-controlled pipeline that maps logical text to propositional expressions, compiles a compact rule library, performs bounded state-space search to discover valid formulas, and verbalizes them back into natural-language questions.

Result: Experiments on ReClor and LogiQA datasets show significant improvements in the logical-reasoning accuracy of pretrained models.

Conclusion: LFC-DA is effective for LLM-guided logical data augmentation, ensuring both diversity and logical rigor under propositional logic.

Abstract: For complex logical data augmentation, heavy reliance on human annotation is costly, whereas direct generation with large language models yields uninterpretable and logically homogeneous examples. To address this, we present LFC-DA, a symbolic-logic-controlled pipeline: logical text is first mapped to propositional expressions, a compact rule library is compiled, and a bounded state-space search systematically discovers valid formulas that are then verbalized back into natural-language questions, ensuring both diversity and logical rigor under propositional logic. Experiments on ReClor and LogiQA show significant improvements in the logical-reasoning accuracy of pretrained models, confirming the effectiveness of LFC-DA for LLM-guided logical data augmentation.
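The bounded state-space search can be pictured as breadth-first exploration that applies a small rule library, such as modus ponens and hypothetical syllogism, to seed clauses up to a fixed depth. A simplified sketch over atomic propositions and implications (the paper's actual rule library and formula language are richer):

```python
from collections import deque

def derive(seeds, max_depth=3):
    """BFS over formulas derivable from the seeds within max_depth steps.
    Formulas are atoms like 'p' or pairs ('p', 'q') meaning p -> q."""
    known = set(seeds)
    frontier = deque((f, 0) for f in seeds)
    while frontier:
        f, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        new = set()
        for g in list(known):
            # Modus ponens: from p and p -> q, derive q.
            if isinstance(g, tuple) and g[0] == f:
                new.add(g[1])
            if isinstance(f, tuple) and f[0] == g:
                new.add(f[1])
            # Hypothetical syllogism: from p -> q and q -> r, derive p -> r.
            if isinstance(f, tuple) and isinstance(g, tuple):
                if f[1] == g[0]:
                    new.add((f[0], g[1]))
                if g[1] == f[0]:
                    new.add((g[0], f[1]))
        for h in new - known:
            known.add(h)
            frontier.append((h, depth + 1))
    return known

facts = derive({"p", ("p", "q"), ("q", "r")})
print(sorted(map(str, facts)))
```

Every derived formula is valid by construction, which is what gives the augmented questions their logical rigor; the depth bound keeps the search space finite.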

[30] Segmentation Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance

Saumitra Yadav, Manish Shrivastava

Main category: cs.CL

TL;DR: Asymmetric BPE with different merge operations for source and target languages outperforms symmetric BPE in machine translation, especially for low-resource language pairs.

Motivation: Existing MT research uses symmetric BPE with the same number of merge operations for both languages, but this uniform approach may not be optimal across different language pairs and data sizes.

Method: Investigated BPE segmentation recipes across various data volumes and language pairs, comparing symmetric vs. asymmetric BPE where source and target languages have different numbers of merge operations.

Result: Asymmetric BPE significantly improved results over symmetric approach, especially in low-resource settings (50K, 100K, 500K sentence pairs). Achieved statistically significant gains of 5.32, 4.46, and 0.7 CHRF++ on English-Hindi, with improvements in 10 out of 12 systems across six additional language pairs.

Conclusion: Optimal results come from high NMO for source (4K-32K) and low NMO for target (0.5K-2K), particularly benefiting low-resource machine translation.

Abstract: Existing Machine Translation (MT) research often suggests a single, fixed set of hyperparameters for word segmentation models, symmetric Byte Pair Encoding (BPE), which applies the same number of merge operations (NMO) to train tokenizers for both source and target languages. However, we demonstrate that this uniform approach doesn’t guarantee optimal MT performance across different language pairs and data sizes. This work investigates BPE segmentation recipes across various data volumes and language pairs to evaluate MT system performance. We find that utilizing asymmetric BPE, where the source and target languages have different NMOs, significantly improves results over the symmetric approach, especially in low-resource settings (50K, 100K, and 500K sentence pairs). Specifically, asymmetric BPE yields statistically significant ($p<0.05$) average gains of 5.32, 4.46, and 0.7 CHRF++ on English-Hindi in low-resource setups. We validated this trend across six additional language pairs (English and Telugu, Shona, Norwegian, Kyrgyz, Hausa, and Inuktitut), observing statistically significant improvement in 10 out of 12 systems compared to symmetric BPE. Our findings indicate that a high NMO for the source (4K to 32K) and a low NMO for the target (0.5K to 2K) provide optimal results, particularly benefiting low-resource MT.
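Asymmetric BPE amounts to training two tokenizers with different numbers of merge operations. A minimal pure-Python BPE trainer makes the NMO knob explicit (toy word-frequency corpora for illustration; real setups would use a standard BPE implementation and full corpora):

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn up to num_merges BPE merges from a {word: count} dict."""
    vocab = {tuple(w): c for w, c in word_freqs.items()}  # words as char tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word fully merged
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = count
        vocab = merged
    return merges

# Asymmetric setup: many merges on the source side, few on the target side.
src_merges = train_bpe({"lower": 5, "lowest": 2, "newer": 6}, num_merges=8)
tgt_merges = train_bpe({"niedrig": 5, "neuer": 6}, num_merges=2)
print(len(src_merges), len(tgt_merges))
```

A smaller target NMO yields finer-grained target subwords, which is the regime the paper reports as optimal (source 4K-32K merges, target 0.5K-2K).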

[31] Overcoming the Generalization Limits of SLM Finetuning for Shape-Based Extraction of Datatype and Object Properties

Célian Ringwald, Fabien Gandon, Catherine Faron, Franck Michel, Hanna Abi Akl

Main category: cs.CL

TL;DR: Small language models struggle with rare properties in relation extraction for complete RDF graphs. The paper shows that ensuring each property appears above a threshold in training data solves this bottleneck.

Motivation: To investigate how small language models handle both datatype and object properties for complete RDF graph extraction, focusing on the challenge of long-tail distribution of rare properties.

Method: Evaluated multiple strategies: stratified sampling, weighted loss, dataset scaling, and template-based synthetic data augmentation to address the rare properties problem.

Result: Found that the best strategy is to build a training set where the number of occurrences of each property exceeds a given threshold, enabling equal performance across unbalanced target properties.

Conclusion: The findings provide practical guidance for training shape-aware small language models and highlight promising directions for future work in semantic relation extraction.

Abstract: Small language models (SLMs) have shown promise for relation extraction (RE) when extracting RDF triples guided by SHACL shapes focused on common datatype properties. This paper investigates how SLMs handle both datatype and object properties for a complete RDF graph extraction. We show that the key bottleneck is related to long-tail distribution of rare properties. To solve this issue, we evaluate several strategies: stratified sampling, weighted loss, dataset scaling, and template-based synthetic data augmentation. We show that the best strategy to perform equally well over unbalanced target properties is to build a training set where the number of occurrences of each property exceeds a given threshold. To enable reproducibility, we publicly released our datasets, experimental results, and code. Our findings offer practical guidance for training shape-aware SLMs and highlight promising directions for future work in semantic RE.
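The winning recipe, ensuring every target property occurs at least a threshold number of times in the training set, can be sketched as oversampling examples that carry rare properties until the floor is met. A toy illustration (the example format is hypothetical; the paper works with SHACL-guided RDF triples):

```python
import random
from collections import Counter

def enforce_property_floor(examples, threshold, seed=0):
    """Oversample (text, [properties]) examples until every property
    that occurs at all occurs at least `threshold` times."""
    rng = random.Random(seed)
    out = list(examples)
    counts = Counter(p for _, props in out for p in props)
    for prop in list(counts):
        pool = [ex for ex in examples if prop in ex[1]]
        while counts[prop] < threshold:
            ex = rng.choice(pool)
            out.append(ex)
            for p in ex[1]:  # duplicated examples also boost co-occurring props
                counts[p] += 1
    return out

data = [("a", ["birthDate"]), ("b", ["birthDate"]), ("c", ["spouse"])]
augmented = enforce_property_floor(data, threshold=3)
counts = Counter(p for _, props in augmented for p in props)
print(dict(counts))
```

This flattens the long tail without discarding frequent-property examples, which is what lets one model perform evenly across unbalanced properties.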

[32] Efficient Reasoning via Thought-Training and Thought-Free Inference

Canhui Wu, Qiong Cao, Chao Xue, Wei Xi, Xiaodong He

Main category: cs.CL

TL;DR: 3TF framework trains models to internalize reasoning processes, enabling thought-free inference with concise outputs while maintaining reasoning quality.

Motivation: Existing methods compress verbose reasoning outputs but still require explicit reasoning during inference, lacking efficiency in real-time applications.

Method: Train hybrid models in both reasoning and non-reasoning modes, then fine-tune on CoT data to internalize reasoning while enforcing thought-free outputs at inference.

Result: 3TF-trained models show significant improvements on reasoning benchmarks under thought-free inference, achieving high reasoning quality without explicit step-by-step generation.

Conclusion: High-quality reasoning can be learned and executed implicitly, enabling efficient thought-free inference while maintaining reasoning capabilities.

Abstract: Recent advances in large language models (LLMs) have leveraged explicit Chain-of-Thought (CoT) prompting to improve reasoning accuracy. However, most existing methods primarily compress verbose reasoning outputs. These Long-to-Short transformations aim to improve efficiency, but still rely on explicit reasoning during inference. In this work, we introduce 3TF (Thought-Training and Thought-Free inference), a framework for efficient reasoning that takes a Short-to-Long perspective. We first train a hybrid model that can operate in both reasoning and non-reasoning modes, and then further train it on CoT-annotated data to internalize structured reasoning, while enforcing concise, thought-free outputs at inference time using the no-reasoning mode. Unlike compression-based approaches, 3TF improves the reasoning quality of non-reasoning outputs, enabling models to perform rich internal reasoning implicitly while keeping external outputs short. Empirically, 3TF-trained models obtain large improvements on reasoning benchmarks under thought-free inference, demonstrating that high-quality reasoning can be learned and executed implicitly without explicit step-by-step generation.

[33] Knowledge-Augmented Question Error Correction for Chinese Question Answer System with QuestionRAG

Longpeng Qiu, Ting Li, Shuai Mao, Nan Yang, Xiaohui Yan

Main category: cs.CL

TL;DR: QuestionRAG is a framework that improves LLMs’ ability to correct input errors in QA systems by combining knowledge augmentation and RL-based alignment to prevent misinterpretation and over-correction.

Motivation: LLMs struggle with input errors in QA systems, often misinterpreting user intent or unnecessarily altering question structure, leading to incorrect responses.

Method: Uses knowledge augmentation (external knowledge like search results) to address misinterpretation, and reinforcement learning for alignment to prevent over-correction.

Result: Knowledge augmentation is critical for understanding faulty questions, and RL-based alignment is significantly more effective than supervised fine-tuning, improving instruction following and generalization.

Conclusion: Integrating knowledge augmentation and RL-based alignment unlocks LLMs’ full potential for question correction tasks.

Abstract: Input errors in question-answering (QA) systems often lead to incorrect responses. Large language models (LLMs) struggle with this task, frequently failing to interpret user intent (misinterpretation) or unnecessarily altering the original question’s structure (over-correction). We propose QuestionRAG, a framework that tackles these problems. To address misinterpretation, it enriches the input with external knowledge (e.g., search results, related entities). To prevent over-correction, it uses reinforcement learning (RL) to align the model’s objective with precise correction, not just paraphrasing. Our results demonstrate that knowledge augmentation is critical for understanding faulty questions. Furthermore, RL-based alignment proves significantly more effective than traditional supervised fine-tuning (SFT), boosting the model’s ability to follow instructions and generalize. By integrating these two strategies, QuestionRAG unlocks the full potential of LLMs for the question correction task.

[34] CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field

Doria Bonzi, Alexandre Guiggi, Frédéric Béchet, Carlos Ramisch, Benoit Favre

Main category: cs.CL

TL;DR: CareMedEval is a new dataset for evaluating LLMs on biomedical critical appraisal, derived from French medical exams with 534 questions based on 37 scientific articles. Current LLMs struggle with this task, failing to exceed 50% accuracy even with reasoning.

Motivation: Critical appraisal of scientific literature is essential in biomedicine, but LLMs have limited reliability for critical reasoning in specialized domains. Existing benchmarks don't adequately evaluate critical reading grounded in scientific papers.

Method: Created CareMedEval dataset from authentic French medical student exams containing 534 questions based on 37 scientific articles. Benchmarked state-of-the-art generalist and biomedical-specialized LLMs under various context conditions.

Result: Both open and commercial LLMs failed to exceed 50% Exact Match Rate. Generating intermediate reasoning tokens improved results, but models remained challenged on questions about study limitations and statistical analysis.

Conclusion: CareMedEval provides a challenging benchmark that exposes current LLM limitations in grounded reasoning for biomedical critical appraisal, paving the way for future development of automated support tools.

Abstract: Critical appraisal of scientific literature is an essential skill in the biomedical field. While large language models (LLMs) can offer promising support in this task, their reliability remains limited, particularly for critical reasoning in specialized domains. We introduce CareMedEval, an original dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, the dataset contains 534 questions based on 37 scientific articles. Unlike existing benchmarks, CareMedEval explicitly evaluates critical reading and reasoning grounded in scientific papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions reveals the difficulty of the task: open and commercial models fail to exceed an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens considerably improves the results. Yet, models remain challenged especially on questions about study limitations and statistical analysis. CareMedEval provides a challenging benchmark for grounded reasoning, exposing current LLM limitations and paving the way for future development of automated support for critical appraisal.
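The headline metric, Exact Match Rate, is the fraction of questions where the predicted answer set exactly equals the gold set, which is strict for multi-answer exam questions: partial overlap counts as a miss. A sketch of the computation (the answer-set format is an assumption; the benchmark's exact normalization may differ):

```python
def exact_match_rate(predictions, golds):
    """Fraction of questions whose predicted answer set equals the gold set."""
    assert len(predictions) == len(golds)
    hits = sum(set(p) == set(g) for p, g in zip(predictions, golds))
    return hits / len(golds)

preds = [{"A", "C"}, {"B"}, {"A"}]
gold = [{"A", "C"}, {"B", "D"}, {"A"}]
# Q2 overlaps the gold set but is incomplete, so it scores zero.
print(exact_match_rate(preds, gold))
```

Under this all-or-nothing criterion, the reported sub-0.5 scores mean models miss at least one required answer on more than half the questions.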

[35] Kastor: Fine-tuned Small Language Models for Shape-based Active Relation Extraction

Célian Ringwald, Fabien Gandon, Catherine Faron, Franck Michel, Hanna Abi Akl

Main category: cs.CL

TL;DR: Kastor framework enhances RDF pattern-based extraction for fine-tuning small language models by reformulating SHACL shape validation to evaluate all property combinations, improving generalization and enabling iterative refinement of noisy knowledge bases.

Motivation: To meet demands for completing and refining knowledge bases in specialized domains by advancing RDF pattern-based extraction approaches for more efficient and robust relation extraction.

Method: Reformulates traditional SHACL shape validation to evaluate all possible property combinations, selects optimal combinations for training, and employs iterative learning to refine noisy knowledge bases.

Result: Significantly enhances model generalization and performance, enabling creation of robust models capable of uncovering new relevant facts from limited text and RDF data.

Conclusion: Kastor framework successfully advances RDF pattern-based extraction for specialized domain knowledge base completion and refinement through improved validation approaches and iterative learning.

Abstract: RDF pattern-based extraction is a compelling approach for fine-tuning small language models (SLMs) by focusing a relation extraction task on a specified SHACL shape. This technique enables the development of efficient models trained on limited text and RDF data. In this article, we introduce Kastor, a framework that advances this approach to meet the demands for completing and refining knowledge bases in specialized domains. Kastor reformulates the traditional validation task, shifting from single SHACL shape validation to evaluating all possible combinations of properties derived from the shape. By selecting the optimal combination for each training example, the framework significantly enhances model generalization and performance. Additionally, Kastor employs an iterative learning process to refine noisy knowledge bases, enabling the creation of robust models capable of uncovering new, relevant facts.
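Kastor's reformulation, evaluating all property combinations derived from a SHACL shape rather than only the full shape, amounts to enumerating subsets of the shape's properties and choosing the best-scoring one per training example. A toy sketch of the enumeration step (the property names are hypothetical):

```python
from itertools import combinations

def property_combinations(properties, min_size=1):
    """Yield all subsets of a shape's properties, smallest first."""
    for k in range(min_size, len(properties) + 1):
        yield from combinations(properties, k)

shape_props = ["birthDate", "birthPlace", "occupation"]
combos = list(property_combinations(shape_props))
print(len(combos))  # 2^3 - 1 non-empty subsets
```

Scoring each subset as a candidate target lets a training example contribute even when the text supports only part of the shape, which is the source of the reported generalization gains.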

[36] BanglaSTEM: A Parallel Corpus for Technical Domain Bangla-English Translation

Kazi Reyazul Hasan, Mubasshira Musarrat, A. B. M. Alim Al Islam, Muhammad Abdullah Adnan

Main category: cs.CL

TL;DR: BanglaSTEM dataset improves technical translation from Bangla to English, enabling better use of English LLMs for STEM problem solving in Bangla.

Motivation: Existing Bangla-English translation systems struggle with technical terms, leading to incorrect problem interpretations when using English-focused language models.

Method: Created BanglaSTEM dataset with 5,000 high-quality technical sentence pairs, trained T5-based translation model, and tested on code generation and math problem solving.

Result: Significant improvements in technical translation accuracy, making English language models more accessible and effective for Bangla speakers.

Conclusion: BanglaSTEM dataset and model successfully address technical translation challenges, enabling better cross-lingual problem solving capabilities.

Abstract: Large language models work well for technical problem solving in English but perform poorly when the same questions are asked in Bangla. A simple solution would be to translate Bangla questions into English first and then use these models. However, existing Bangla-English translation systems struggle with technical terms. They often mistranslate specialized vocabulary, which changes the meaning of the problem and leads to wrong answers. We present BanglaSTEM, a dataset of 5,000 carefully selected Bangla-English sentence pairs from STEM fields including computer science, mathematics, physics, chemistry, and biology. We generated over 12,000 translations using language models and then used human evaluators to select the highest quality pairs that preserve technical terminology correctly. We train a T5-based translation model on BanglaSTEM and test it on two tasks: generating code and solving math problems. Our results show significant improvements in translation accuracy for technical content, making it easier for Bangla speakers to use English-focused language models effectively. Both the BanglaSTEM dataset and the trained translation model are publicly released at https://huggingface.co/reyazul/BanglaSTEM-T5.

[37] HaluMem: Evaluating Hallucinations in Memory Systems of Agents

Ding Chen, Simin Niu, Kehang Li, Peng Liu, Xiangping Zheng, Bo Tang, Xinchi Li, Feiyu Xiong, Zhiyu Li

Main category: cs.CL

TL;DR: HaluMem is the first operation-level hallucination evaluation benchmark for memory systems, featuring three evaluation tasks and large-scale datasets to identify where hallucinations occur in memory processes.

Motivation: Current memory systems in AI frequently exhibit hallucinations during storage and retrieval, but existing evaluations are end-to-end and cannot localize which operational stage causes the hallucinations.

Method: Created HaluMem benchmark with three evaluation tasks (memory extraction, updating, and question answering) and constructed two large-scale datasets (HaluMem-Medium and HaluMem-Long) with multi-turn human-AI interactions containing 15k memory points and 3.5k questions.

Result: Empirical studies show existing memory systems generate and accumulate hallucinations during extraction and updating stages, which then propagate errors to question answering. Hallucinations occur across different context scales and task complexities.

Conclusion: Future research should focus on developing interpretable and constrained memory operation mechanisms to systematically suppress hallucinations and improve memory reliability in AI systems.

Abstract: Memory systems are key components that enable AI systems such as LLMs and AI agents to achieve long-term learning and sustained interaction. However, during memory storage and retrieval, these systems frequently exhibit memory hallucinations, including fabrication, errors, conflicts, and omissions. Existing evaluations of memory hallucinations are primarily end-to-end question answering, which makes it difficult to localize the operational stage within the memory system where hallucinations arise. To address this, we introduce the Hallucination in Memory Benchmark (HaluMem), the first operation level hallucination evaluation benchmark tailored to memory systems. HaluMem defines three evaluation tasks (memory extraction, memory updating, and memory question answering) to comprehensively reveal hallucination behaviors across different operational stages of interaction. To support evaluation, we construct user-centric, multi-turn human-AI interaction datasets, HaluMem-Medium and HaluMem-Long. Both include about 15k memory points and 3.5k multi-type questions. The average dialogue length per user reaches 1.5k and 2.6k turns, with context lengths exceeding 1M tokens, enabling evaluation of hallucinations across different context scales and task complexities. Empirical studies based on HaluMem show that existing memory systems tend to generate and accumulate hallucinations during the extraction and updating stages, which subsequently propagate errors to the question answering stage. Future research should focus on developing interpretable and constrained memory operation mechanisms that systematically suppress hallucinations and improve memory reliability.

[38] One Battle After Another: Probing LLMs’ Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework

Qi Jia, Kaiwei Zhang, Xiujie Song, Ye Shen, Xiangyang Zhu, Guangtao Zhai

Main category: cs.CL

TL;DR: Proposes EvolIF, an extensible framework for assessing multi-turn instruction-following ability in LLMs through dynamic conversation construction with state changes and tracebacks.

Motivation: Existing benchmarks are limited to fixed turns, susceptible to saturation, and fail to account for interactive user experience in multi-topic dialogues.

Method: Three-layer mechanism decoupling linguistic forms from user intent, tracking constraints/instructions/topics, with dynamic benchmark construction and patience-based termination.

Result: GPT-5 shows superior performance with 18.54 average turns and 70.31% robustness, significantly outperforming Gemini-2.5-Pro by 11.41%.

Conclusion: The framework effectively evaluates multi-turn instruction-following, revealing significant performance differences among models, with GPT-5 leading.

Abstract: Understanding how well large language models can follow users’ instructions throughout a dialogue spanning multiple topics is of great importance for data-intensive conversational applications. Existing benchmarks are often limited to a fixed number of turns, making them susceptible to saturation and failing to account for the user’s interactive experience. In this work, we propose an extensible framework for assessing multi-turn instruction-following ability. At its core, our framework decouples linguistic surface forms from user intent simulation through a three-layer mechanism that tracks constraints, instructions, and topics. This framework mimics User-LLM interaction by enabling the dynamic construction of benchmarks with state changes and tracebacks, terminating a conversation only when the model exhausts a simulated user’s patience. We define a suite of metrics capturing the quality of the interaction process. Using this framework, we construct EvolIF, an evolving instruction-following benchmark incorporating nine distinct constraint types. Our results indicate that GPT-5 exhibits superior instruction-following performance. It sustains an average of 18.54 conversational turns and demonstrates 70.31% robustness, outperforming Gemini-2.5-Pro by a significant margin of 11.41%, while other models lag far behind. All of the data and code will be made publicly available online.
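The patience mechanism, where a conversation ends only when the simulated user's patience is exhausted, can be sketched as a loop that decrements a patience budget on each failed turn and restores it on success (the per-turn success check and patience budget here are illustrative stand-ins for the framework's simulated user):

```python
def run_dialogue(turn_ok, patience=3, max_turns=50):
    """Simulate a User-LLM dialogue: stop once `patience` consecutive
    failed turns exhaust the simulated user, or at max_turns.
    turn_ok(t) reports whether the model followed the instruction at turn t."""
    remaining, turns = patience, 0
    for t in range(max_turns):
        turns += 1
        if turn_ok(t):
            remaining = patience      # a good turn restores patience
        else:
            remaining -= 1
            if remaining == 0:        # the simulated user gives up
                break
    return turns

# A model that fails every third instruction still survives the dialogue...
print(run_dialogue(lambda t: t % 3 != 2))
# ...while one that always fails is cut off after `patience` turns.
print(run_dialogue(lambda t: False))
```

Sustained turn counts (e.g. GPT-5's reported 18.54 average) thus directly measure how long a model avoids exhausting the user, rather than accuracy over a fixed horizon.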

[39] SOLVE-Med: Specialized Orchestration for Leading Vertical Experts across Medical Specialties

Roberta Di Marino, Giovanni Dioguardi, Antonio Romano, Giuseppe Riccio, Mariano Barone, Marco Postiglione, Flora Amato, Vincenzo Moscato

Main category: cs.CL

TL;DR: SOLVE-Med is a multi-agent system using specialized small language models (1B parameters each) for medical question answering, achieving better performance than larger standalone models while enabling local deployment.

Motivation: Address deployment challenges in medical QA systems including hallucinations, bias, computational demands, privacy concerns, and need for specialized expertise across diverse medical domains.

Method: Multi-agent architecture with Router Agent for dynamic specialist selection, ten domain-specialized models (1B parameters each) fine-tuned on specific medical specialties, and Orchestrator Agent for response synthesis.

Result: Achieved ROUGE-1 of 0.301 and BERTScore F1 of 0.697 on Italian medical forum data across ten specialties, outperforming standalone models up to 14B parameters while enabling local deployment.

Conclusion: SOLVE-Med demonstrates that specialized small models in multi-agent architecture can provide superior performance over larger standalone models for complex medical queries while addressing deployment challenges.

Abstract: Medical question answering systems face deployment challenges including hallucinations, bias, computational demands, privacy concerns, and the need for specialized expertise across diverse domains. Here, we present SOLVE-Med, a multi-agent architecture combining domain-specialized small language models for complex medical queries. The system employs a Router Agent for dynamic specialist selection, ten specialized models (1B parameters each) fine-tuned on specific medical domains, and an Orchestrator Agent that synthesizes responses. Evaluated on Italian medical forum data across ten specialties, SOLVE-Med achieves superior performance with ROUGE-1 of 0.301 and BERTScore F1 of 0.697, outperforming standalone models up to 14B parameters while enabling local deployment. Our code is publicly available on GitHub: https://github.com/PRAISELab-PicusLab/SOLVE-Med.
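The Router/Orchestrator split reduces to mapping a query to one of the ten specialists before synthesizing its answer. A minimal keyword-based stand-in shows the dispatch pattern (the specialty names and routing heuristic are illustrative; in SOLVE-Med the router is itself a language model):

```python
SPECIALISTS = {
    "cardiology": ["heart", "chest pain", "arrhythmia"],
    "dermatology": ["rash", "skin", "eczema"],
    "neurology": ["headache", "seizure", "numbness"],
}

def route(query):
    """Pick the specialist whose keywords best match the query
    (a hypothetical heuristic standing in for the LLM Router Agent)."""
    q = query.lower()
    scores = {s: sum(kw in q for kw in kws) for s, kws in SPECIALISTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"

def orchestrate(query, answer_fn):
    """Route the query, call the chosen specialist, wrap the response."""
    specialist = route(query)
    return {"specialist": specialist, "answer": answer_fn(specialist, query)}

result = orchestrate("I have chest pain and arrhythmia",
                     lambda s, q: f"[{s} model answer]")
print(result["specialist"])
```

Keeping each specialist at 1B parameters means only the routed model needs to be resident at answer time, which is what makes local deployment feasible.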

[40] Bearing Syntactic Fruit with Stack-Augmented Neural Networks

Brian DuSell, Ryan Cotterell

Main category: cs.CL

TL;DR: Neural networks with stack augmentation can achieve human-like hierarchical generalization in language learning without special training conditions.

Motivation: To develop neural network architectures that can generalize hierarchically like humans do in language acquisition, without requiring special conditions like syntactic supervision, massive pre-training, or extended training.

Method: Tested three base architectures (transformer, simple RNN, LSTM) augmented with two stack types: superposition stack and nondeterministic stack. Also proposed a modification to stack RNN architecture.

Result: Transformers with nondeterministic stacks generalized best on classical question formation task. The proposed stack RNN modification improved hierarchical generalization.

Conclusion: Stack-augmented neural networks may be more accurate models of human language acquisition than standard architectures and serve as useful tools for psycholinguistic study.

Abstract: Any finite set of training data is consistent with an infinite number of hypothetical algorithms that could have generated it. Studies have shown that when human children learn language, they consistently favor hypotheses based on hierarchical syntactic rules without ever encountering disambiguating examples. A recent line of work has inquired as to whether common neural network architectures share this bias, finding that they do so only under special conditions: when syntactically supervised, when pre-trained on massive corpora, or when trained long past convergence. In this paper, we demonstrate, for the first time, neural network architectures that are able to generalize in human-like fashion without any of the aforementioned requirements: stack-augmented neural networks. We test three base architectures (transformer, simple RNN, LSTM) augmented with two styles of stack: the superposition stack of Joulin & Mikolov (2015) and a nondeterministic generalization of it proposed by DuSell & Chiang (2023). We find that transformers with nondeterministic stacks generalize best out of these architectures on a classical question formation task. We also propose a modification to the stack RNN architecture that improves hierarchical generalization. These results suggest that stack-augmented neural networks may be more accurate models of human language acquisition than standard architectures, serving as useful objects of psycholinguistic study. Our code is publicly available.
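The superposition stack referenced above maintains the stack as a soft, differentiable mixture of actions. A simplified scalar sketch (the real stack of Joulin & Mikolov holds vectors, and the action weights come from the trained controller, not hand-set constants):

```python
def superposition_stack_step(stack, a_push, a_pop, a_noop, v):
    """One differentiable stack update in the style of Joulin & Mikolov
    (2015): the new stack is a convex combination of pushing value `v`,
    popping, and leaving the stack unchanged. `stack` is a fixed-depth
    list of floats with the top at index 0; the action weights should
    sum to 1 (e.g. produced by a softmax).
    """
    depth = len(stack)
    pushed = [v] + stack[:-1]       # shift down, v on top
    popped = stack[1:] + [0.0]      # shift up, zero-pad the bottom
    return [a_push * pushed[i] + a_pop * popped[i] + a_noop * stack[i]
            for i in range(depth)]

s = [0.0, 0.0, 0.0]
s = superposition_stack_step(s, 1.0, 0.0, 0.0, 5.0)  # hard push
s = superposition_stack_step(s, 0.5, 0.0, 0.5, 2.0)  # half push, half no-op
print(s)  # [3.5, 2.5, 0.0]
```

Because every action is a weighted blend rather than a discrete choice, gradients flow through the stack contents, which is what lets these augmentations be trained end-to-end with the base RNN or transformer.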

[41] MultiZebraLogic: A Multilingual Logical Reasoning Benchmark

Sofie Helene Bruun, Dan Saattrup Smart

Main category: cs.CL

TL;DR: Created MultiZebraLogic - large multilingual datasets of zebra puzzles to benchmark LLM logical reasoning across 9 Germanic languages, with varying difficulty levels and red herrings.

DetailsMotivation: Need comprehensive benchmarks to measure LLM logical reasoning abilities across multiple languages and difficulty levels suitable for different model capabilities.

Method: Generated zebra puzzles in multiple languages, themes, sizes (2x3 and 4x5), with 14 clue types and 8 red herring types. Tested on GPT-4o mini and o3-mini models.

Result: 4x5 puzzles sufficiently challenge reasoning models; 5 red herrings reduce o3-mini accuracy by 15±7%; no significant language/theme effect; no correlation between clue types and difficulty.

Conclusion: Published MultiZebraLogic datasets (128+1024 puzzles per language) and generation code for adaptable multilingual logical reasoning benchmarking.

Abstract: Measuring the full abilities of large language models (LLMs) requires benchmarks representing multiple tasks. We aim to create large, high-quality datasets for comparison of logical reasoning skills across several languages and of suitable difficulty for LLMs of various reasoning ability. We explore multiple ways of increasing difficulty. We generate zebra puzzles in multiple languages, themes, sizes and including 14 different clue types and 8 red herring types (uninformative clues). We find puzzle sizes 2x3 and 4x5 are sufficiently challenging for GPT-4o mini (a non-reasoning model) and o3-mini (a reasoning model), respectively. Including 5 red herrings decreases o3-mini puzzle-level accuracy on 4x5 puzzles by 15±7%. Scores of o3-mini on 4x5 puzzles are not significantly affected by use of English vs. Danish or the common houses theme vs. the country-specific smoerrebroed theme. We find no correlation between difficulty and the selected clue types. Datasets of 128+1024 puzzles are published as MultiZebraLogic in each of nine Germanic languages for sizes 2x3 and 4x5. We publish code for puzzle generation, designed to be adaptable to more languages and themes.

[42] AILA–First Experiments with Localist Language Models

Joachim Diederich

Main category: cs.CL

TL;DR: Introduces controllable locality in transformers via a tunable parameter that enables dynamic interpolation between interpretable localist encodings and efficient distributed representations without retraining.

DetailsMotivation: To address the trade-off between interpretability and performance in language models, particularly for regulated domains requiring transparency while maintaining capability.

Method: Two-layer transformer architecture with a tunable locality parameter λ (from 1.0 to 0.0) tested on WikiText corpus, measuring attention entropy, pointer fidelity, perplexity, and accuracy.

Result: Localist configurations (λ=1.0) achieved lower attention entropy (5.36 bits vs 7.18 bits) and higher pointer fidelity. Intermediate locality (λ=0.6) optimized trade-off with 4.65 perplexity and 84.7% accuracy.

Conclusion: Localist language models provide practical framework for regulated domains, offering mathematical control over interpretability-performance spectrum through explicit penalty thresholds.

Abstract: This paper presents the first empirical demonstration of controllable locality in transformer language models, a novel architectural framework that enables continuous control over the degree of representation localization through a tunable locality dial parameter. Unlike traditional language models that rely exclusively on distributed representations, our approach allows dynamic interpolation between highly interpretable localist encodings and efficient distributed representations without requiring model retraining. We conducted experiments on the WikiText corpus using a two-layer transformer architecture, systematically varying the locality parameter λ across the full spectrum from 1.0 (fully localist) to 0.0 (fully distributed). Our results demonstrate that localist configurations achieve dramatically lower attention entropy, with λ = 1.0 yielding 5.36 bits compared to 7.18 bits at λ = 0.0, while maintaining substantially higher pointer fidelity scores reflecting stronger alignment with rule-specified targets. Prediction experiments reveal that intermediate locality values optimize the tradeoff between interpretability and performance, with λ = 0.6 achieving test perplexity of 4.65 and accuracy of 84.7%. These findings establish that localist language models provide a practical framework for applications in regulated domains requiring both transparency and capability, offering precise mathematical control over the interpretability-performance spectrum through explicit penalty thresholds and information-theoretic design principles.
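The attention-entropy numbers quoted above (5.36 vs. 7.18 bits) are Shannon entropies of attention weight distributions. A minimal sketch of that measurement, with hand-picked example distributions rather than weights from the paper's model:

```python
import math

def attention_entropy_bits(weights):
    """Shannon entropy (in bits) of one attention distribution; lower
    values mean the head concentrates on fewer positions (more localist)."""
    return -sum(w * math.log2(w) for w in weights if w > 0)

# Sharply peaked attention (lambda near 1.0) vs. near-uniform (lambda near 0.0):
localist = [0.9, 0.05, 0.03, 0.02]
distributed = [0.25, 0.25, 0.25, 0.25]
print(attention_entropy_bits(localist))     # ~0.62 bits
print(attention_entropy_bits(distributed))  # exactly 2.0 bits
```

Over a full sequence length the same contrast scales up, which is how a fully localist configuration can sit around 5.36 bits while a fully distributed one approaches the uniform-attention ceiling.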

One Octadion, Bondan Sapta Prakoso, Nanang Yudi Setiawan, Novanto Yudistira

Main category: cs.CL

TL;DR: Fine-tuned LLMs with legal domain data and RAG to assist policymakers in legal analysis and regulation drafting.

DetailsMotivation: To better support policymakers in understanding, analyzing, and crafting legal regulations by enhancing LLMs' legal comprehension.

Method: Curated supervised legal dataset for fine-tuning and integrated Retrieval-Augmented Generation (RAG) to access external legal knowledge.

Result: Created a tool that actively assists in interpreting regulations and drafting new ones aligned with current needs.

Conclusion: This approach significantly enhances legal research and regulation development effectiveness in the evolving legal field.

Abstract: In this study, we explore the fine-tuning of Large Language Models (LLMs) to better support policymakers in their crucial work of understanding, analyzing, and crafting legal regulations. To equip the model with a deep understanding of legal texts, we curated a supervised dataset tailored to the specific needs of the legal domain. Additionally, we integrated the Retrieval-Augmented Generation (RAG) method, enabling the LLM to access and incorporate up-to-date legal knowledge from external sources. This combination of fine-tuning and RAG-based augmentation results in a tool that not only processes legal information but actively assists policymakers in interpreting regulations and drafting new ones that align with current needs. The results demonstrate that this approach can significantly enhance the effectiveness of legal research and regulation development, offering a valuable resource in the ever-evolving field of law.

[44] A systematic review of relation extraction task since the emergence of Transformers

Celian Ringwald, Fabien Gandon, Catherine Faron, Franck Michel, Hanna Abi Akl

Main category: cs.CL

TL;DR: Systematic review of relation extraction research from 2019-2024, analyzing 34 surveys, 64 datasets, and 104 Transformer-based models to identify trends and challenges.

DetailsMotivation: To provide a comprehensive overview of relation extraction research since Transformer models emerged, consolidating methodological advances and resources for researchers.

Method: Used automated framework to collect and annotate publications, systematically reviewing surveys, datasets, and models published between 2019-2024.

Result: Identified current trends, limitations, and open challenges in relation extraction, highlighting methodological advances, benchmark resources, and semantic web technology integration.

Conclusion: The review offers researchers and practitioners a comprehensive reference for understanding the evolution and future directions of relation extraction.

Abstract: This article presents a systematic review of relation extraction (RE) research since the advent of Transformer-based models. Using an automated framework to collect and annotate publications, we analyze 34 surveys, 64 datasets, and 104 models published between 2019 and 2024. The review highlights methodological advances, benchmark resources, and the integration of semantic web technologies. By consolidating results across multiple dimensions, the study identifies current trends, limitations, and open challenges, offering researchers and practitioners a comprehensive reference for understanding the evolution and future directions of RE.

[45] Towards Transparent Stance Detection: A Zero-Shot Approach Using Implicit and Explicit Interpretability

Apoorva Upadhyaya, Wolfgang Nejdl, Marco Fisichella

Main category: cs.CL

TL;DR: IRIS is a novel interpretable Zero-Shot Stance Detection framework that uses implicit rationales (text sequences) and explicit rationales (linguistic measures) to understand attitudes toward unseen targets, treating stance detection as an information retrieval ranking task.

DetailsMotivation: Existing ZSSD methods suffer from generalizability issues, lack coherence between text and target, rely too much on explicit reasoning, provide coarse explanations, and lack interpretability in their reasoning process.

Method: IRIS combines implicit rationales (text sequences relevant to stances) and explicit rationales (linguistic measures of emotional/cognitive dimensions) in an information retrieval ranking framework that doesn’t require ground-truth rationales.

Result: Extensive experiments on VAST, EZ-STANCE, P-Stance, and RFD datasets using 50%, 30%, and 10% training data demonstrate strong generalizability and performance.

Conclusion: IRIS provides inherent interpretability through its dual rationale approach and information retrieval framework, offering nuanced understanding of stance detection without requiring labeled rationale data.

Abstract: Zero-Shot Stance Detection (ZSSD) identifies the attitude of the post toward unseen targets. Existing research using contrastive, meta-learning, or data augmentation suffers from generalizability issues or lack of coherence between text and target. Recent works leveraging large language models (LLMs) for ZSSD focus either on improving unseen target-specific knowledge or generating explanations for stance analysis. However, most of these works are limited by their over-reliance on explicit reasoning, provide coarse explanations that lack nuance, and do not explicitly model the reasoning process, making it difficult to interpret the model’s predictions. To address these issues, in our study, we develop a novel interpretable ZSSD framework, IRIS. We provide an interpretable understanding of the attitude of the input towards the target implicitly based on sequences within the text (implicit rationales) and explicitly based on linguistic measures (explicit rationales). IRIS considers stance detection as an information retrieval ranking task, understanding the relevance of implicit rationales for different stances to guide the model towards correct predictions without requiring the ground-truth of rationales, thus providing inherent interpretability. In addition, explicit rationales based on communicative features help decode the emotional and cognitive dimensions of stance, offering an interpretable understanding of the author’s attitude towards the given target. Extensive experiments on the benchmark datasets of VAST, EZ-STANCE, P-Stance, and RFD using 50%, 30%, and even 10% training data prove the generalizability of our model, benefiting from the proposed architecture and interpretable design.

[46] ChiMDQA: Towards Comprehensive Chinese Document QA with Fine-grained Evaluation

Jing Gao, Shutiao Luo, Yumeng Liu, Yuanming Li, Hongji Zeng

Main category: cs.CL

TL;DR: The paper introduces ChiMDQA, a Chinese Multi-Document Question Answering Dataset with 6,068 high-quality QA pairs across six domains, designed for business applications and various NLP tasks.

DetailsMotivation: To address the growing demand for high-quality Chinese document question-answering datasets in NLP applications, particularly for downstream business scenarios.

Method: Created through meticulous document screening and systematic question-design methodology across six domains (academic, education, finance, law, medical, news), with QA pairs classified into ten fine-grained categories.

Result: Developed ChiMDQA dataset containing 6,068 rigorously curated QA pairs from long-form documents, ensuring diversity and high quality for document comprehension, knowledge extraction, and intelligent QA systems.

Conclusion: ChiMDQA provides a substantial foundation for future research and practical applications in Chinese QA, with the dataset and code made publicly available.

Abstract: With the rapid advancement of natural language processing (NLP) technologies, the demand for high-quality Chinese document question-answering datasets is steadily growing. To address this issue, we present the Chinese Multi-Document Question Answering Dataset (ChiMDQA), specifically designed for downstream business scenarios across prevalent domains including academic, education, finance, law, medical treatment, and news. ChiMDQA encompasses long-form documents from six distinct fields, consisting of 6,068 rigorously curated, high-quality question-answer (QA) pairs further classified into ten fine-grained categories. Through meticulous document screening and a systematic question-design methodology, the dataset guarantees both diversity and high quality, rendering it applicable to various NLP tasks such as document comprehension, knowledge extraction, and intelligent QA systems. Additionally, this paper offers a comprehensive overview of the dataset’s design objectives, construction methodologies, and fine-grained evaluation system, supplying a substantial foundation for future research and practical applications in Chinese QA. The code and data are available at: https://anonymous.4open.science/r/Foxit-CHiMDQA/.

[47] Do Androids Dream of Unseen Puppeteers? Probing for a Conspiracy Mindset in Large Language Models

Francesco Corso, Francesco Pierri, Gianmarco De Francisci Morales

Main category: cs.CL

TL;DR: LLMs exhibit partial conspiratorial tendencies, show sociodemographic biases, and are easily manipulated into adopting conspiratorial perspectives through targeted prompts.

DetailsMotivation: To evaluate whether LLMs reproduce higher-order psychological constructs like conspiratorial mindset, given their increasing use as proxies for studying human behavior and the critical role of conspiracy beliefs in misinformation spread.

Method: Administered validated psychometric surveys measuring conspiracy mindset to multiple LLMs under different prompting and conditioning strategies, including socio-demographic attribute conditioning.

Result: LLMs show partial agreement with conspiracy beliefs, conditioning produces uneven effects revealing demographic biases, and targeted prompts easily shift responses toward conspiratorial directions.

Conclusion: LLMs’ susceptibility to conspiratorial manipulation highlights the importance of critically evaluating their psychological dimensions for computational social science advancement and developing mitigation strategies against harmful uses.

Abstract: In this paper, we investigate whether Large Language Models (LLMs) exhibit conspiratorial tendencies, whether they display sociodemographic biases in this domain, and how easily they can be conditioned into adopting conspiratorial perspectives. Conspiracy beliefs play a central role in the spread of misinformation and in shaping distrust toward institutions, making them a critical testbed for evaluating the social fidelity of LLMs. LLMs are increasingly used as proxies for studying human behavior, yet little is known about whether they reproduce higher-order psychological constructs such as a conspiratorial mindset. To bridge this research gap, we administer validated psychometric surveys measuring conspiracy mindset to multiple models under different prompting and conditioning strategies. Our findings reveal that LLMs show partial agreement with elements of conspiracy belief, and conditioning with socio-demographic attributes produces uneven effects, exposing latent demographic biases. Moreover, targeted prompts can easily shift model responses toward conspiratorial directions, underscoring both the susceptibility of LLMs to manipulation and the potential risks of their deployment in sensitive contexts. These results highlight the importance of critically evaluating the psychological dimensions embedded in LLMs, both to advance computational social science and to inform possible mitigation strategies against harmful uses.

[48] Grounded Misunderstandings in Asymmetric Dialogue: A Perspectivist Annotation Scheme for MapTask

Nan Li, Albert Gatt, Massimo Poesio

Main category: cs.CL

TL;DR: Analyzes referential understanding in collaborative dialogue using a new annotation scheme for the HCRC MapTask corpus, revealing that full misunderstandings are rare but multiplicity discrepancies cause systematic divergences.

DetailsMotivation: To study how participants in asymmetric collaborative settings may believe they agree while actually referring to different entities, and to trace how understanding emerges, diverges, and repairs over time.

Method: Introduces a perspectivist annotation scheme that separately captures speaker and addressee interpretations for each reference expression, using a scheme-constrained LLM annotation pipeline to obtain 13k annotated references with reliability estimates.

Result: Full misunderstandings are rare when lexical variants are unified, but multiplicity discrepancies systematically induce divergences, showing how apparent grounding can mask referential misalignment.

Conclusion: Provides a framework for studying grounded misunderstanding and evaluating (V)LLMs’ capacity to model perspective-dependent grounding in collaborative dialogue.

Abstract: Collaborative dialogue relies on participants incrementally establishing common ground, yet in asymmetric settings they may believe they agree while referring to different entities. We introduce a perspectivist annotation scheme for the HCRC MapTask corpus (Anderson et al., 1991) that separately captures speaker and addressee grounded interpretations for each reference expression, enabling us to trace how understanding emerges, diverges, and repairs over time. Using a scheme-constrained LLM annotation pipeline, we obtain 13k annotated reference expressions with reliability estimates and analyze the resulting understanding states. The results show that full misunderstandings are rare once lexical variants are unified, but multiplicity discrepancies systematically induce divergences, revealing how apparent grounding can mask referential misalignment. Our framework provides both a resource and an analytic lens for studying grounded misunderstanding and for evaluating (V)LLMs’ capacity to model perspective-dependent grounding in collaborative dialogue.

[49] Retrieval-Augmented Feature Generation for Domain-Specific Classification

Xinhao Zhang, Jinghan Zhang, Fengran Mo, Dakshak Keerthi Chandra, Yuzhong Chen, Fei Xie, Kunpeng Liu

Main category: cs.CL

TL;DR: RAFG is a retrieval-augmented feature generation method that uses knowledge retrieval and LLMs to create interpretable features for domain classification tasks, improving performance across medical, economic, and geographic datasets.

DetailsMotivation: Feature generation enhances learning with limited data, but creating interpretable features typically requires domain expertise. The paper aims to automate this process while maintaining feature interpretability.

Method: RAFG retrieves knowledge from existing features to identify associations, then uses LLMs for feature generation with reasoning to verify feature quality during the generation process.

Result: Experiments show RAFG produces high-quality, meaningful features and significantly improves classification performance compared to baseline methods across multiple domains.

Conclusion: The RAFG method successfully generates useful and explainable features for domain classification tasks without requiring extensive domain knowledge, demonstrating improved performance through automated feature generation.

Abstract: Feature generation can significantly enhance learning outcomes, particularly for tasks with limited data. An effective way to improve feature generation is to expand the current feature space using existing features and enriching the informational content. However, generating new, interpretable features usually requires domain-specific knowledge on top of the existing features. In this paper, we introduce a Retrieval-Augmented Feature Generation method, RAFG, to generate useful and explainable features specific to domain classification tasks. To increase the interpretability of the generated features, we conduct knowledge retrieval among the existing features in the domain to identify potential feature associations. These associations are expected to help generate useful features. Moreover, we develop a framework based on large language models (LLMs) for feature generation with reasoning to verify the quality of the features during their generation process. Experiments across several datasets in medical, economic, and geographic domains show that our RAFG method can produce high-quality, meaningful features and significantly improve classification performance compared with baseline methods.

[50] Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation

Sanjana Ramprasad, Byron C. Wallace

Main category: cs.CL

TL;DR: Current automatic factuality metrics for LLM-generated summaries are unreliable - they perform poorly on hard cases requiring deep reasoning, can be gamed by adding content-free sentences, and may rely more on parametric knowledge than source references.

DetailsMotivation: Traditional metrics like ROUGE have saturated for LLM summaries, but LLMs still introduce factual inaccuracies. There's a need to understand whether current factuality metrics actually measure what they claim or just exploit artifacts.

Method: Stress tested various automatic factuality metrics using a shallow classifier to separate easy vs hard examples, tested sensitivity to fact-preserving edits vs factual corrections, and attempted to game metrics by adding content-free sentences.

Result: All metrics showed substantial performance drops on hard cases requiring deep reasoning. Some were more sensitive to benign edits than factual corrections. Most metrics could be artificially inflated by appending innocuous sentences. ChatGPT-DA was most robust but may overly rely on parametric knowledge.

Conclusion: Current factuality metrics are unreliable and their true measurement capabilities are questionable. There’s a need for more robust evaluation methods that truly assess factual consistency with source documents.

Abstract: Modern LLMs can now produce highly readable abstractive summaries, to the point that traditional automated metrics for evaluating summary quality, such as ROUGE, have saturated. However, LLMs still sometimes introduce inaccuracies into summaries, i.e., information inconsistent with or unsupported by the corresponding source. Measuring the occurrence of these often subtle factual inconsistencies automatically has proved challenging. This in turn has motivated development of metrics intended to measure the factual consistency of generated summaries against sources. But are these approaches measuring what they purport to? Or are they mostly exploiting artifacts? In this work, we stress test a range of automatic factuality metrics, including specialized models and LLM-based prompting methods, to probe what they actually capture. Using a shallow classifier to separate "easy" examples, where surface features suffice for factual evaluation, from "hard" cases requiring deeper reasoning, we find that all metrics show substantial performance drops on the latter. Furthermore, some metrics are more sensitive to benign, fact-preserving edits than to factual corrections. Building on this observation, we demonstrate that most automatic factuality metrics can be gamed, i.e., their scores can be artificially inflated by appending innocuous, content-free sentences to summaries. Among the metrics tested, the prompt-based ChatGPT-DA approach is the most robust and reliable. However, this comes with a notable caveat: Prompting LLMs to assess factuality may overly rely on their parametric knowledge rather than the provided reference when making judgments. Taken together, our findings call into question the reliability of current factuality metrics and prompt a broader reflection on what these metrics are truly measuring.

[51] From Haystack to Needle: Label Space Reduction for Zero-shot Classification

Nathan Vandemoortele, Bram Steenwinckel, Femke Ongenae, Sofie Van Hoecke

Main category: cs.CL

TL;DR: Label Space Reduction (LSR) improves zero-shot classification by iteratively refining and reducing the label space, helping LLMs focus on relevant options. It achieves significant performance gains and offers an efficient distilled version.

DetailsMotivation: To enhance zero-shot classification performance of LLMs by addressing the challenge of large label spaces that can confuse models and reduce accuracy.

Method: LSR iteratively refines classification label space by systematically ranking and reducing candidate classes, leveraging unlabeled data and statistical learning capabilities. Also proposes distillation into probabilistic classifiers for efficiency.

Result: Improves macro-F1 scores by average 7.0% (up to 14.2%) with Llama-3.1-70B and 3.3% (up to 11.1%) with Claude-3.5-Sonnet across seven benchmarks compared to standard zero-shot baselines.

Conclusion: LSR effectively enhances zero-shot classification by dynamically optimizing label space representation, with distillation providing computational efficiency.

Abstract: We present Label Space Reduction (LSR), a novel method for improving zero-shot classification performance of Large Language Models (LLMs). LSR iteratively refines the classification label space by systematically ranking and reducing candidate classes, enabling the model to concentrate on the most relevant options. By leveraging unlabeled data with the statistical learning capabilities of data-driven models, LSR dynamically optimizes the label space representation at test time. Our experiments across seven benchmarks demonstrate that LSR improves macro-F1 scores by an average of 7.0% (up to 14.2%) with Llama-3.1-70B and 3.3% (up to 11.1%) with Claude-3.5-Sonnet compared to standard zero-shot classification baselines. To reduce the computational overhead of LSR, which requires an additional LLM call at each iteration, we propose distilling the model into a probabilistic classifier, allowing for efficient inference.
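The shrink-then-classify loop at the heart of LSR can be sketched in a few lines; here a toy word-overlap scorer stands in for the paper's combination of LLM calls and a data-driven ranker, and the keep fraction and stopping size are illustrative choices:

```python
def label_space_reduction(text, labels, score, keep_frac=0.5, min_labels=2):
    """Iteratively rank candidate labels and drop the lowest-scoring ones,
    so the final decision is made over a much smaller label space."""
    while len(labels) > min_labels:
        ranked = sorted(labels, key=lambda l: score(text, l), reverse=True)
        keep = max(min_labels, int(len(ranked) * keep_frac))
        labels = ranked[:keep]
    return labels[0]  # final prediction: best remaining label

# Toy scorer: word overlap between the text and the label name.
def overlap(text, label):
    return len(set(text.lower().split()) & set(label.lower().split()))

labels = ["world news", "sports news", "sports scores", "tech reviews",
          "cooking recipes", "travel tips", "stock markets", "movie reviews"]
print(label_space_reduction("latest sports scores and match news", labels, overlap))
```

Each pass halves the candidate set, so irrelevant labels are eliminated early and the final comparison is between a handful of plausible options rather than the full "haystack".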

[52] Verdict: A Library for Scaling Judge-Time Compute

Nimit Kalra, Leonard Tang

Main category: cs.CL

TL;DR: Verdict is an open-source library that scales judge-time compute to improve LLM-based automated evaluation systems through modular reasoning units and increased inference-time compute.

DetailsMotivation: Standard LLM-as-a-judge approaches suffer from reliability issues, prompting the need for more accurate and interpretable automated evaluation systems.

Method: Leverages composition of modular reasoning units (verification, debate, aggregation) and increased inference-time compute to enhance judge quality.

Result: Achieves performance competitive with orders-of-magnitude larger fine-tuned judges, prompted judges, and reasoning models across challenging tasks like content moderation, fact-checking, and hallucination detection.

Conclusion: Establishes a foundation for scalable, interpretable, and reliable LLM-based evaluation systems for researchers and practitioners.

Abstract: The use of LLMs as automated judges (“LLM-as-a-judge”) is now widespread, yet standard judges suffer from a multitude of reliability issues. To address these challenges, we introduce Verdict, an open-source library for scaling judge-time compute to enhance the accuracy, reliability, and interpretability of automated evaluators. Verdict leverages the composition of modular reasoning units (such as verification, debate, and aggregation) and increased inference-time compute to improve LLM judge quality. Across a variety of challenging tasks such as content moderation, fact-checking, and hallucination detection, Verdict judges achieve performance competitive with orders-of-magnitude larger fine-tuned judges, prompted judges, and reasoning models. Our framework establishes a foundation for scalable, interpretable, and reliable LLM-based evaluation systems for both researchers and practitioners.
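The idea of composing modular judge units and aggregating them can be illustrated with a minimal majority-vote sketch; this is not Verdict's API, and plain functions replace the LLM-backed verification and debate units:

```python
from collections import Counter

def aggregate(judges, item):
    """Majority vote over independent judge units, returning the winning
    label and its agreement rate; a stand-in for an aggregation unit over
    upstream verification/debate units."""
    verdicts = [judge(item) for judge in judges]
    label, count = Counter(verdicts).most_common(1)[0]
    return label, count / len(verdicts)

# Each "judge" here is a trivial heuristic standing in for an LLM call:
judges = [
    lambda s: "unsafe" if "attack" in s else "safe",
    lambda s: "unsafe" if "bomb" in s else "safe",
    lambda s: "safe",
]
print(aggregate(judges, "how to bake bread"))  # ('safe', 1.0)
```

Spending more judge-time compute then simply means adding units (more judges, a debate round, a verification pass) before the aggregation step.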

[53] Assessing the Macro and Micro Effects of Random Seeds on Fine-Tuning Large Language Models

Nghia Bui, Guergana Savova, Lijing Wang

Main category: cs.CL

TL;DR: Random seeds significantly impact LLM fine-tuning performance, causing notable variance in both macro metrics (accuracy, F1) and micro-level prediction consistency across GLUE and SuperGLUE benchmarks.

DetailsMotivation: The impact of random seeds in fine-tuning large language models has been largely overlooked despite its potential influence on model performance, creating a need for systematic evaluation.

Method: Systematically evaluate random seed effects using GLUE and SuperGLUE benchmarks, analyzing macro-level impact through traditional metrics (accuracy, F1) with mean/variance calculations, and introducing a novel consistency metric for micro-level prediction stability.

Result: Experiments reveal significant variance at both macro and micro levels, demonstrating substantial performance fluctuations and prediction instability across different random seeds.

Conclusion: Careful consideration of random seeds is essential in fine-tuning and evaluation processes due to their significant impact on model performance stability.

Abstract: The impact of random seeds in fine-tuning large language models (LLMs) has been largely overlooked despite its potential influence on model performance. In this study, we systematically evaluate the effects of random seeds on LLMs using the GLUE and SuperGLUE benchmarks. We analyze the macro-level impact through traditional metrics like accuracy and F1, calculating their mean and variance to quantify performance fluctuations. To capture the micro-level effects, we introduce a novel metric, consistency, measuring the stability of individual predictions across runs. Our experiments reveal significant variance at both macro and micro levels, underscoring the need for careful consideration of random seeds in fine-tuning and evaluation.
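The two levels of analysis described above can be sketched directly; the consistency definition below (fraction of examples labeled identically under every seed) is one plausible reading of the metric, not necessarily the paper's exact formulation:

```python
def macro_stats(scores):
    """Mean and variance of a metric (e.g. accuracy) across seeds."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, var

def consistency(predictions_per_seed):
    """Micro-level stability: fraction of examples that receive the same
    label under every random seed. Takes one prediction list per seed,
    all of equal length."""
    per_example = zip(*predictions_per_seed)
    agree = sum(1 for preds in per_example if len(set(preds)) == 1)
    return agree / len(predictions_per_seed[0])

# Three fine-tuning runs of the same model, differing only in seed:
runs = [
    [1, 0, 1, 1, 0],  # seed 0
    [1, 0, 0, 1, 0],  # seed 1
    [1, 1, 1, 1, 0],  # seed 2
]
print(macro_stats([0.83, 0.79, 0.85]))  # mean and variance of accuracy
print(consistency(runs))                # 3 of 5 examples stable -> 0.6
```

The point of the micro metric is that two runs can have nearly identical accuracy (macro level) while disagreeing on many individual examples (micro level), which aggregate scores alone would hide.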

[54] Traversal Verification for Speculative Tree Decoding

Yepeng Weng, Qiao Hu, Xujie Chen, Li Liu, Dianwen Mei, Huishi Qiu, Jiang Tian, Zhongchao Shi

Main category: cs.CL

TL;DR: Traversal Verification is a novel speculative decoding algorithm that uses leaf-to-root traversal to improve acceptance rates and throughput in large language model inference.

Motivation: Existing speculative decoding methods have suboptimal acceptance rates due to token-level verification and inefficient candidate utilization, where parent node rejection causes all child nodes to be discarded.

Method: Proposes Traversal Verification that considers acceptance of entire token sequences from current node to root, preserving valid subsequences that would be discarded by existing top-down methods.

Result: Theoretically proven to produce identical probability distribution as target model (lossless inference) while achieving substantial acceleration gains. Experiments show consistent improvements in acceptance length and throughput across different LLMs and tasks.

Conclusion: Traversal Verification fundamentally rethinks verification paradigm and provides a more efficient speculative decoding approach that maintains model accuracy while significantly improving inference speed.

Abstract: Speculative decoding is a promising approach for accelerating large language models. The primary idea is to use a lightweight draft model to speculate the output of the target model for multiple subsequent timesteps, and then verify them in parallel to determine whether the drafted tokens should be accepted or rejected. To enhance acceptance rates, existing frameworks typically construct token trees containing multiple candidates in each timestep. However, their reliance on token-level verification mechanisms introduces two critical limitations: First, the probability distribution of a sequence differs from that of individual tokens, leading to suboptimal acceptance length. Second, current verification schemes begin from the root node and proceed layer by layer in a top-down manner. Once a parent node is rejected, all its child nodes should be discarded, resulting in inefficient utilization of speculative candidates. This paper introduces Traversal Verification, a novel speculative decoding algorithm that fundamentally rethinks the verification paradigm through leaf-to-root traversal. Our approach considers the acceptance of the entire token sequence from the current node to the root, and preserves potentially valid subsequences that would be prematurely discarded by existing methods. We theoretically prove that the probability distribution obtained through Traversal Verification is identical to that of the target model, guaranteeing lossless inference while achieving substantial acceleration gains. Experimental results across different large language models and multiple tasks show that our method consistently improves acceptance length and throughput over existing methods.
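A toy, single-path illustration of the leaf-to-root idea: test the longest drafted prefix first with a sequence-level accept ratio, then back off toward the root. This deliberately simplifies away the tree structure and the lossless resampling the paper proves; all names and interfaces here are hypothetical:

```python
import random

def traversal_verify(path, rng=random.random):
    """Leaf-to-root check on one drafted path.

    `path` lists (p_target, q_draft) probability pairs, root-side first. We try
    the full sequence first and back off toward the root, returning how many
    tokens are accepted. Toy simplification: the real algorithm verifies whole
    token trees and resamples on rejection so the output distribution matches
    the target model exactly."""
    for depth in range(len(path), 0, -1):
        p_seq = q_seq = 1.0
        for p, q in path[:depth]:  # sequence-level, not token-level, ratio
            p_seq *= p
            q_seq *= q
        if rng() < min(1.0, p_seq / q_seq):
            return depth  # longest prefix whose sequence ratio passed
    return 0
```

Because the check starts at the leaf, a subsequence can survive even when a token-level, top-down scheme would have rejected its parent and discarded it.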

[55] Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages?

Gaurav Kamath, Sowmya Vajjala

Main category: cs.CL

TL;DR: Synthetic data shows promise for low-resource language NER, with significant variation between languages.

Motivation: NER for low-resource languages needs robust systems with limited labeled training data. Data augmentation is common but synthetic data's role in multilingual low-resource NER is underexplored.

Method: Explored synthetic data in multilingual low-resource NER across 11 diverse languages from different language families.

Result: Synthetic data holds promise for low-resource language NER, but shows significant variation between different languages.

Conclusion: Synthetic data is promising for low-resource NER but effectiveness varies across languages, requiring language-specific considerations.

Abstract: Named Entity Recognition (NER) for low-resource languages aims to produce robust systems for languages where there is limited labeled training data available, and has been an area of increasing interest within NLP. Data augmentation for increasing the amount of low-resource labeled data is a common practice. In this paper, we explore the role of synthetic data in the context of multilingual, low-resource NER, considering 11 languages from diverse language families. Our results suggest that synthetic data does in fact hold promise for low-resource language NER, though we see significant variation between languages.

[56] The Case for Repeatable, Open, and Expert-Grounded Hallucination Benchmarks in Large Language Models

Justin D. Norman, Michael U. Rivera, D. Alex Hughes

Main category: cs.CL

TL;DR: The paper argues for repeatable, open, and domain-contextualized hallucination benchmarking of language models, showing that without expert involvement in data creation, hallucination metrics lack validity and utility.

Motivation: Plausible but inaccurate tokens in model-generated text are pervasive and problematic, yet there is little scientific work to comprehensively measure language model hallucination prevalence.

Method: Presents a taxonomy of hallucinations and conducts a case study demonstrating the importance of expert involvement in early stages of data creation for valid hallucination metrics.

Result: The case study shows that when experts are absent from early data creation stages, the resulting hallucination metrics lack both validity and practical utility.

Conclusion: Language models should be evaluated using repeatable, open, and domain-contextualized hallucination benchmarking with expert involvement to ensure valid and useful metrics.

Abstract: Plausible, but inaccurate, tokens in model-generated text are widely believed to be pervasive and problematic for the responsible adoption of language models. Despite this concern, there is little scientific work that attempts to measure the prevalence of language model hallucination in a comprehensive way. In this paper, we argue that language models should be evaluated using repeatable, open, and domain-contextualized hallucination benchmarking. We present a taxonomy of hallucinations alongside a case study that demonstrates that when experts are absent from the early stages of data creation, the resulting hallucination metrics lack validity and practical utility.

[57] Distilling LLM Agent into Small Models with Retrieval and Code Tools

Minki Kang, Jongwon Jeong, Seanie Lee, Jaewoong Cho, Sung Ju Hwang

Main category: cs.CL

TL;DR: Agent Distillation transfers full task-solving behavior from large LLMs to smaller models using retrieval and code tools, achieving competitive performance with much smaller models.

Motivation: Current CoT distillation struggles with rare factual knowledge and precise computation tasks where small models hallucinate due to limited capabilities.

Method: Proposes Agent Distillation framework with two improvements: first-thought prefix prompting for better teacher trajectories, and self-consistent action generation for test-time robustness.

Result: Small models (0.5B, 1.5B, 3B parameters) achieve performance competitive with next-tier larger models (1.5B, 3B, 7B) fine-tuned using CoT distillation across eight reasoning tasks.

Conclusion: Agent distillation enables building practical, tool-using small agents by transferring full reasoning capabilities from large LLMs.

Abstract: Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to enhance the quality of teacher-generated trajectories; and (2) we propose a self-consistent action generation for improving test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs as small as 0.5B, 1.5B, 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, 7B models fine-tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool-using small agents. Our code is available at https://github.com/Nardien/agent-distillation.
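The self-consistent action generation step can be sketched as sampling several candidate tool actions, executing each, and majority-voting the outcome. The callables below are toy stand-ins for the agent's action sampler and tool executor, not the paper's implementation:

```python
from collections import Counter

def self_consistent_action(sample_action, run, n=5):
    """Sample several candidate actions, execute each, and keep the most common
    outcome. A toy sketch of self-consistent action generation; the paper's
    procedure operates on full agent trajectories with real tools."""
    outcomes = [run(sample_action()) for _ in range(n)]
    outcome, _ = Counter(outcomes).most_common(1)[0]
    return outcome

# Toy: a small agent drafts slightly different code snippets for one question;
# executing all of them and voting filters out the occasional flawed draft.
drafts = iter(["2 + 2", "2 + 2", "2 + 3", "2 + 2", "2 + 3"])
answer = self_consistent_action(lambda: next(drafts), eval)  # majority says 4
```

Voting on executed results rather than on raw text is what gives the small agent robustness at test time: a single hallucinated action rarely wins the vote.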

[58] R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing

Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang

Main category: cs.CL

TL;DR: R2R is a neural token routing method that selectively uses LLMs only for critical path-divergent tokens while letting SLMs handle most token generation, achieving better performance with fewer parameters and faster inference.

Motivation: LLMs have high inference overhead while distilled SLMs suffer performance loss because they can't follow LLMs' reasoning paths. The key insight is that only a small fraction of tokens actually diverge reasoning paths between LLMs and SLMs.

Method: Developed Roads to Rome (R2R) - a neural token routing method that identifies divergent tokens using an automatic data generation pipeline, and trains a lightweight router to selectively activate LLMs only for critical tokens while SLMs handle the rest.

Result: Applied to DeepSeek R1-1.5B and R1-32B models, R2R achieved average activated parameter size of 5.6B, surpassing R1-7B accuracy by 1.6x and outperforming R1-14B. Compared to R1-32B, it delivered 2.8x wall-clock speedup with comparable performance.

Conclusion: R2R advances the Pareto frontier of test-time scaling efficiency by enabling selective LLM usage only for critical reasoning steps while maintaining most generation with efficient SLMs.

Abstract: Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing substantial deployment challenges. Although distilled Small Language Models (SLMs) significantly enhance efficiency, their performance suffers as they fail to follow LLMs' reasoning paths. Luckily, we reveal that only a small fraction of tokens genuinely diverge reasoning paths between LLMs and SLMs. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce Roads to Rome (R2R), a neural token routing method that selectively utilizes LLMs only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine R1-1.5B and R1-32B models from the DeepSeek family, and evaluate on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available at https://github.com/thu-nics/R2R.
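A minimal sketch of the routing loop, assuming stand-in callables for the SLM, LLM, and trained router (not R2R's actual interfaces):

```python
def generate_with_routing(slm_step, llm_step, router, prompt, max_new=8):
    """Toy R2R-style decoding loop: the SLM proposes every token, a lightweight
    router flags tokens it predicts are path-divergent, and only those are
    regenerated by the LLM. All three callables are hypothetical stand-ins."""
    tokens = list(prompt)
    llm_calls = 0
    for _ in range(max_new):
        candidate, state = slm_step(tokens)  # SLM token plus routing features
        if router(state):                    # predicted divergent: defer to LLM
            candidate = llm_step(tokens)
            llm_calls += 1
        tokens.append(candidate)
    return tokens, llm_calls
```

Since the router fires only on the rare divergent tokens, the average activated parameter count stays close to the SLM's, which is where the reported speedup comes from.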

[59] Read Your Own Mind: Reasoning Helps Surface Self-Confidence Signals in LLMs

Jakub Podolak, Rajeev Verma

Main category: cs.CL

TL;DR: DeepSeek R1-32B’s self-reported confidence is unreliable in default settings but becomes trustworthy after exploring its predictive distribution through chain-of-thought reasoning.

Motivation: To understand the source of uncertainty in DeepSeek R1-32B and investigate why its verbal confidence scores are unreliable in standard QA settings.

Method: Analyzed self-reported confidence vs. semantic entropy, tested chain-of-thought reasoning to explore predictive distribution, and used separate reader models to reconstruct confidence scores.

Result: Chain-of-thought reasoning greatly improved verbal confidence effectiveness, even on simple questions. Semantic entropy remained reliable due to larger test-time compute exploring the predictive distribution.

Conclusion: Reliable uncertainty estimation requires explicit exploration of the generative space, and self-reported confidence is only trustworthy after such exploration.

Abstract: We study the source of uncertainty in DeepSeek R1-32B by analyzing its self-reported verbal confidence on question answering (QA) tasks. In the default answer-then-confidence setting, the model is regularly over-confident, whereas semantic entropy - obtained by sampling many responses - remains reliable. We hypothesize that this is because of semantic entropy’s larger test-time compute, which lets us explore the model’s predictive distribution. We show that granting DeepSeek the budget to explore its distribution by forcing a long chain-of-thought before the final answer greatly improves its verbal score effectiveness, even on simple fact-retrieval questions that normally require no reasoning. Furthermore, a separate reader model that sees only the chain can reconstruct very similar confidences, indicating the verbal score might be merely a statistic of the alternatives surfaced during reasoning. Our analysis concludes that reliable uncertainty estimation requires explicit exploration of the generative space, and self-reported confidence is trustworthy only after such exploration.
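Semantic entropy, which the abstract contrasts with verbal confidence, can be sketched as entropy over meaning-clusters of sampled answers. A toy string normalizer stands in here for the entailment-based clustering used in practice:

```python
import math
from collections import Counter

def semantic_entropy(samples, meaning=lambda s: s.strip().lower()):
    """Entropy over meaning-clusters of sampled answers. Real semantic entropy
    clusters answers by bidirectional entailment; the `meaning` normalizer is
    a toy stand-in for that step."""
    counts = Counter(meaning(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Four sampled answers collapse into two meaning clusters: {paris: 3, lyon: 1}.
uncertainty = semantic_entropy(["Paris", "paris", " Paris ", "Lyon"])
```

The key point from the paper is that this measure spends test-time compute sampling the predictive distribution, which is exactly the exploration that a single verbalized confidence score skips.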

[60] LexTime: A Benchmark for Temporal Ordering of Legal Events

Claire Barale, Leslie Barrett, Vikram Sunil Bajaj, Michael Rovatsos

Main category: cs.CL

TL;DR: LexTime is a new dataset for evaluating LLMs’ event ordering capabilities in legal language, showing improved accuracy on legal texts compared to narrative texts but challenges with legal linguistic complexities.

Motivation: Existing benchmarks lack specialized evaluation for how LLMs handle event ordering in legal contexts, which is important for case law analysis, compliance monitoring, and legal summarization.

Method: Created LexTime dataset with 512 instances from U.S. Federal Complaints containing annotated event pairs and their temporal relations, then evaluated LLMs’ performance on legal event ordering.

Result: LLMs are more accurate on legal event ordering than narrative texts (up to +10.5%), achieve 80.8% accuracy for implicit-explicit event pairs, but struggle with legal linguistic complexities and nested clauses.

Conclusion: While performance is promising, specific features of legal texts remain a bottleneck for legal temporal event reasoning, requiring concrete modeling improvements to better address legal language complexities.

Abstract: Understanding temporal relationships and accurately reconstructing the event timeline is important for case law analysis, compliance monitoring, and legal summarization. However, existing benchmarks lack specialized language evaluation, leaving a gap in understanding how LLMs handle event ordering in legal contexts. We introduce LexTime, a dataset designed to evaluate LLMs' event ordering capabilities in legal language, consisting of 512 instances from U.S. Federal Complaints with annotated event pairs and their temporal relations. Our findings show that (1) LLMs are more accurate on legal event ordering than on narrative texts (up to +10.5%); (2) longer input contexts and implicit events boost accuracy, reaching 80.8% for implicit-explicit event pairs; (3) legal linguistic complexities and nested clauses remain a challenge. While performance is promising, specific features of legal texts remain a bottleneck for legal temporal event reasoning, and we propose concrete modeling directions to better address them.

[61] Inv-Entropy: A Fully Probabilistic Framework for Uncertainty Quantification in Language Models

Haoyi Song, Ruihan Ji, Naichen Shi, Fan Lai, Raed Al Kontar

Main category: cs.CL

TL;DR: This paper proposes a probabilistic framework for uncertainty quantification in LLMs using an inverse model approach with systematic perturbations, introducing Inv-Entropy as a new uncertainty measure and GAAP perturbation algorithm.

Motivation: Existing uncertainty quantification methods for LLMs are often heuristic and lack probabilistic interpretation, making reliable deployment challenging.

Method: Theoretical justification for perturbations in UQ, dual random walk perspective modeling input-output pairs as Markov chains, probabilistic framework based on inverse model, Inv-Entropy uncertainty measure, and GAAP perturbation algorithm using genetic algorithms.

Result: Inv-Entropy outperforms existing semantic UQ methods, and the framework supports flexible definitions of uncertainty measures, embeddings, perturbation strategies, and similarity metrics.

Conclusion: The proposed probabilistic framework provides effective uncertainty quantification for LLMs with theoretical grounding and practical flexibility, enabling more reliable deployment.

Abstract: Large language models (LLMs) have transformed natural language processing, but their reliable deployment requires effective uncertainty quantification (UQ). Existing UQ methods are often heuristic and lack a probabilistic interpretation. This paper begins by providing a theoretical justification for the role of perturbations in UQ for LLMs. We then introduce a dual random walk perspective, modeling input-output pairs as two Markov chains with transition probabilities defined by semantic similarity. Building on this, we propose a fully probabilistic framework based on an inverse model, which quantifies uncertainty by evaluating the diversity of the input space conditioned on a given output through systematic perturbations. Within this framework, we define a new uncertainty measure, Inv-Entropy. A key strength of our framework is its flexibility: it supports various definitions of uncertainty measures, embeddings, perturbation strategies, and similarity metrics. We also propose GAAP, a perturbation algorithm based on genetic algorithms, which enhances the diversity of sampled inputs. In addition, we introduce a new evaluation metric, Temperature Sensitivity of Uncertainty (TSU), which directly assesses uncertainty without relying on correctness as a proxy. Extensive experiments demonstrate that Inv-Entropy outperforms existing semantic UQ methods. The code to reproduce the results can be found at https://github.com/UMDataScienceLab/Uncertainty-Quantification-for-LLMs.
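The inverse-model intuition (uncertainty as diversity of the inputs that lead to a given output under perturbation) can be sketched roughly as follows. This is only an illustration of the direction of conditioning, not the paper's actual Inv-Entropy estimator:

```python
import math
from collections import Counter, defaultdict

def inverse_diversity(pairs):
    """Loose sketch of the inverse-model idea: group (perturbed input, output)
    pairs by output, measure the entropy of inputs within each group, and
    average over groups weighted by frequency. Interface and weighting are
    illustrative assumptions, not the paper's definition."""
    by_output = defaultdict(list)
    for x, y in pairs:
        by_output[y].append(x)
    total = len(pairs)
    score = 0.0
    for xs in by_output.values():
        counts = Counter(xs)
        n = len(xs)
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        score += (n / total) * h  # frequency-weighted input entropy
    return score

# Three distinct perturbed questions all yield answer "A": high input diversity
# conditioned on "A" signals uncertainty about that output.
score = inverse_diversity([("q1", "A"), ("q2", "A"), ("q3", "B"), ("q4", "A")])
```

In the paper this conditioning runs over embeddings and semantic similarity rather than exact string matches, with GAAP supplying diverse perturbed inputs.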

[62] Scalable Medication Extraction and Discontinuation Identification from Electronic Health Records Using Large Language Models

Chong Shao, Douglas Snyder, Chiran Li, Bowen Gu, Kerry Ngan, Chun-Ting Yang, Jiageng Wu, Richard Wyss, Kueiyu Joshua Lin, Jie Yang

Main category: cs.CL

TL;DR: LLMs show strong performance in extracting medications and classifying discontinuation status from EHR notes, with GPT-4o achieving the best results and open-source models like Llama-3.1-70B-Instruct providing scalable alternatives.

Motivation: Medication discontinuation information in EHRs is often buried in unstructured notes, making automated extraction challenging but crucial for patient safety.

Method: Evaluated 12 advanced LLMs on three EHR datasets using multiple prompting strategies including zero-shot, few-shot, and Chain-of-Thought reasoning for medication extraction and status classification tasks.

Result: GPT-4o achieved highest average F1 scores: 94.0% for extraction, 78.1% for classification, and 72.7% for joint task. Open-source models performed competitively, with Llama-3.1-70B-Instruct achieving best results on specific datasets. Few-shot learning generally improved performance.

Conclusion: LLMs demonstrate strong potential for medication extraction and discontinuation identification from EHR notes, with open-source models offering scalable alternatives to proprietary systems.

Abstract: Identifying medication discontinuations in electronic health records (EHRs) is vital for patient safety but is often hindered by information being buried in unstructured notes. This study aims to evaluate the capabilities of advanced open-sourced and proprietary large language models (LLMs) in extracting medications and classifying their medication status from EHR notes, focusing on their scalability on medication information extraction without human annotation. We collected three EHR datasets from diverse sources to build the evaluation benchmark. We evaluated 12 advanced LLMs and explored multiple LLM prompting strategies. Performance on medication extraction, medication status classification, and their joint task (extraction then classification) was systematically compared across all experiments. We found that LLMs showed promising performance on the medication extraction and discontinuation classification from EHR notes. GPT-4o consistently achieved the highest average F1 scores in all tasks under zero-shot setting - 94.0% for medication extraction, 78.1% for discontinuation classification, and 72.7% for the joint task. Open-sourced models followed closely: Llama-3.1-70B-Instruct achieved the highest performance in medication status classification on the MIV-Med dataset (68.7%) and in the joint task on both the Re-CASI (76.2%) and MIV-Med (60.2%) datasets. Medical-specific LLMs demonstrated lower performance compared to advanced general-domain LLMs. Few-shot learning generally improved performance, while CoT reasoning showed inconsistent gains. LLMs demonstrate strong potential for medication extraction and discontinuation identification on EHR notes, with open-sourced models offering scalable alternatives to proprietary systems, and few-shot learning can further improve LLMs’ capability.

[63] Post Persona Alignment for Multi-Session Dialogue Generation

Yi-Pei Chen, Noriki Nishida, Hideki Nakayama, Yuji Matsumoto

Main category: cs.CL

TL;DR: PPA is a two-stage framework for multi-session dialogue generation that first generates general responses from context, then retrieves persona memories and refines responses for persona alignment, improving consistency and diversity.

Motivation: LLMs struggle with maintaining persona fidelity and conversational coherence across extended interactions, and existing retrieval-before-generation methods constrain response diversity and produce generic outputs.

Method: Post Persona Alignment (PPA) - a two-stage framework: 1) Generate general response from dialogue context only, 2) Retrieve relevant persona memories using the response as query, 3) Refine response to align with speaker’s persona.

Result: PPA significantly outperforms prior approaches in consistency, diversity, and persona relevance on multi-session LLM-generated dialogue data.

Conclusion: PPA offers a more flexible and effective paradigm for long-term personalized dialogue generation by promoting naturalness and diversity while preserving consistency and personalization.

Abstract: Multi-session persona-based dialogue generation presents challenges in maintaining long-term consistency and generating diverse, personalized responses. While large language models (LLMs) excel in single-session dialogues, they struggle to preserve persona fidelity and conversational coherence across extended interactions. Existing methods typically retrieve persona information before response generation, which can constrain diversity and result in generic outputs. We propose Post Persona Alignment (PPA), a novel two-stage framework that reverses this process. PPA first generates a general response based solely on dialogue context, then retrieves relevant persona memories using the response as a query, and finally refines the response to align with the speaker’s persona. This post-hoc alignment strategy promotes naturalness and diversity while preserving consistency and personalization. Experiments on multi-session LLM-generated dialogue data demonstrate that PPA significantly outperforms prior approaches in consistency, diversity, and persona relevance, offering a more flexible and effective paradigm for long-term personalized dialogue generation.
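PPA's three stages map directly onto a small pipeline of stand-in callables (hypothetical interfaces, not the paper's implementation):

```python
def post_persona_align(generate, retrieve, refine, context):
    """PPA's reversed order as a pipeline: draft a response from dialogue
    context alone, use that draft as the retrieval query for persona memories,
    then refine the draft to align with the retrieved persona. The three
    callables are stand-ins for LLM calls and a memory retriever."""
    draft = generate(context)        # stage 1: persona-free general response
    memories = retrieve(draft)       # stage 2: draft itself is the query
    return refine(draft, memories)   # stage 3: post-hoc persona alignment

# Toy run with trivial stand-ins for each stage.
out = post_persona_align(
    lambda ctx: ctx + "!",
    lambda draft: ["likes tea"],
    lambda draft, mems: draft + " (" + mems[0] + ")",
    "hi",
)
```

The design choice worth noting is stage 2: querying memory with the generated response rather than the context is what lets the draft stay diverse before persona constraints are applied.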

[64] AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs

Di He, Songjun Tu, Ajay Jaiswal, Li Shen, Ganzhao Yuan, Shiwei Liu, Lu Yin

Main category: cs.CL

TL;DR: AlphaDecay is an adaptive weight decay method for LLMs that assigns different decay strengths to each module based on their spectral properties, outperforming uniform decay approaches.

Motivation: Uniform weight decay across all layers ignores the structural diversity and varying spectral properties of different modules in LLMs, which can lead to suboptimal regularization.

Method: Uses Heavy-Tailed Self-Regularization theory to analyze empirical spectral density of weight correlation matrices, assigning weaker decay to modules with more heavy-tailed spectra (stronger feature learning) and stronger decay to those with lighter-tailed spectra.

Result: Extensive pre-training with models from 60M to 1B parameters shows AlphaDecay achieves better perplexity and generalization compared to uniform decay and other adaptive decay baselines.

Conclusion: Adaptive weight decay based on module-specific spectral properties is more effective than uniform decay for regularizing LLMs during training.

Abstract: Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify “heavy-tailedness.” Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. Extensive pre-training tasks with various model sizes from 60M to 1B demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines. Our code is available at https://github.com/hed-ucas/AlphaDecay.
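A rough sketch of the idea, assuming a Hill estimator for the spectral tail index and a linear rescaling of decay by alpha. Both are illustrative choices; AlphaDecay's actual power-law fit and decay schedule may differ:

```python
import math

def hill_alpha(eigs, k=3):
    """Hill estimator of the tail index of an eigenvalue spectrum: smaller
    alpha means a heavier tail. A toy stand-in for HT-SR's ESD power-law fit."""
    tail = sorted(eigs, reverse=True)[: k + 1]
    logs = [math.log(tail[i] / tail[k]) for i in range(k)]
    return 1.0 + k / sum(logs)

def assign_decay(module_eigs, base_decay=0.1):
    """Scale each module's weight decay by its alpha relative to the mean:
    heavy-tailed modules (low alpha, stronger feature learning) get weaker
    decay, light-tailed modules get stronger decay. The linear scaling is an
    illustrative assumption, not AlphaDecay's exact rule."""
    alphas = {name: hill_alpha(e) for name, e in module_eigs.items()}
    mean_alpha = sum(alphas.values()) / len(alphas)
    return {name: base_decay * a / mean_alpha for name, a in alphas.items()}

# A heavy-tailed "attn" spectrum vs. a light-tailed "mlp" spectrum.
decays = assign_decay({
    "attn": [100.0, 50.0, 25.0, 12.5, 1.0],
    "mlp": [10.0, 9.0, 8.0, 7.0, 1.0],
})
```

Note the mean-normalization keeps the total regularization budget comparable to uniform decay while redistributing it across modules.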

[65] MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning

Xiaoyuan Li, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu

Main category: cs.CL

TL;DR: This paper evaluates Multi-modal Large Language Models’ ability to perform visual operations through code in mathematical reasoning, focusing on code generation and editing capabilities.

Motivation: Existing evaluations mainly focus on text-only reasoning outputs, leaving MLLMs' ability to perform accurate visual operations via code largely unexplored.

Method: The framework evaluates two key aspects: Multi-modal Code Generation (MCG) for understanding and constructing visualizations from scratch, and Multi-modal Code Editing (MCE) for fine-grained operations including Deletion, Modification and Annotation. Uses a dataset covering five types of mathematical figures.

Result: Experimental evaluation of nine mainstream MLLMs reveals that existing models still lag significantly behind human performance in performing fine-grained visual operations.

Conclusion: Current MLLMs have significant limitations in code-based visual operations for mathematical reasoning, highlighting the need for improved capabilities in this area.

Abstract: Recent progress in Multi-modal Large Language Models (MLLMs) has enabled step-by-step multi-modal mathematical reasoning by performing visual operations based on the textual instructions. A promising approach uses code as an intermediate representation to precisely express and manipulate the images in the reasoning steps. However, existing evaluations focus mainly on text-only reasoning outputs, leaving the MLLM’s ability to perform accurate visual operations via code largely unexplored. This work takes a first step toward addressing that gap by evaluating MLLM’s code-based capabilities in multi-modal mathematical reasoning. Specifically, our framework focuses on two key evaluation aspects: (1) Multi-modal Code Generation (MCG) evaluates the model’s ability to accurately understand and construct visualizations from scratch. (2) Multi-modal Code Editing (MCE) assesses the model’s capacity for fine-grained operations, which include three types: Deletion, Modification and Annotation. To evaluate the above tasks, we incorporate a dataset that covers the five most popular types of mathematical figures, including geometric diagrams, function plots, and three types of statistical charts, to provide a comprehensive and effective measurement of existing MLLMs. Our experimental evaluation involves nine mainstream MLLMs, and the results reveal that existing models still lag significantly behind human performance in performing fine-grained visual operations.

[66] PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems

Oshayer Siddique, J. M Areeb Uzair Alam, Md Jobayer Rahman Rafy, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan

Main category: cs.CL

TL;DR: Evaluates frontier LLMs on physics problem-solving using multi-agent frameworks and introduces PhysicsEval benchmark with 19,609 physics problems.

Motivation: To assess and improve LLM performance on physics problem-solving, both mathematical and descriptive, which is crucial for natural language reasoning.

Method: Uses inference-time techniques and multi-agent frameworks where smaller LLM agents verify solutions cumulatively, plus introduces PhysicsEval benchmark with problems from textbooks and solutions from educational sources.

Result: Significant performance improvements when multi-agent framework is applied to problems where models initially perform poorly.

Conclusion: Multi-agent verification frameworks effectively enhance LLM performance on physics problem-solving, and PhysicsEval provides a comprehensive benchmark for future evaluations.

Abstract: The discipline of physics stands as a cornerstone of human intellect, driving the evolution of technology and deepening our understanding of the fundamental principles of the cosmos. Contemporary literature includes some works centered on the task of solving physics problems - a crucial domain of natural language reasoning. In this paper, we evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We also employ a plethora of inference-time techniques and agentic frameworks to improve the performance of the models. This includes the verification of proposed solutions in a cumulative fashion by other, smaller LLM agents, and we perform a comparative analysis of the performance that the techniques entail. There are significant improvements when the multi-agent framework is applied to problems that the models initially perform poorly on. Furthermore, we introduce a new evaluation benchmark for physics problems, PhysicsEval, consisting of 19,609 problems sourced from various physics textbooks and their corresponding correct solutions scraped from physics forums and educational websites. Our code and data are publicly available at https://github.com/areebuzair/PhysicsEval.

[67] Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction

Yuerong Song, Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu

Main category: cs.CL

TL;DR: Sparse-dLLM is a training-free framework that accelerates diffusion LLMs by using dynamic cache eviction and sparse attention, achieving 10x higher throughput with comparable performance.

Motivation: Current caching techniques for diffusion LLMs suffer from high memory usage that limits long-context applications, despite persistent cross-layer sparsity in attention patterns.

Method: Proposes delayed bidirectional sparse caching that leverages token saliency stability to retain critical tokens and dynamically evict unimportant prefix/suffix entries using attention-guided strategy.

Result: Achieves up to 10x higher throughput than vanilla dLLMs with comparable performance and similar peak memory costs, outperforming previous methods in efficiency and effectiveness.

Conclusion: Sparse-dLLM enables efficient long-context applications for diffusion LLMs through selective cache eviction while maintaining performance.

Abstract: Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage that limits long-context applications. Our analysis of attention patterns in dLLMs reveals persistent cross-layer sparsity, with pivotal tokens remaining salient across decoding steps and low-relevance tokens staying unimportant, motivating selective cache eviction. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching. By leveraging the stability of token saliency over steps, it retains critical tokens and dynamically evicts unimportant prefix/suffix entries using an attention-guided strategy. Extensive experiments on LLaDA and Dream series demonstrate Sparse-dLLM achieves up to 10x higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs, outperforming previous methods in efficiency and effectiveness. The code is available at https://github.com/OpenMOSS/Sparse-dLLM.
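
The core eviction idea, keeping the tokens that stay salient across decoding steps and dropping the rest, can be sketched in a few lines. This is a hypothetical illustration of attention-guided eviction; the function name, saliency scores, and `keep_ratio` parameter are invented, not Sparse-dLLM's API:

```python
# Minimal sketch of attention-guided cache eviction (assumed mechanics):
# retain the fraction of cached entries with the highest accumulated
# attention saliency and evict the low-relevance ones.

def evict_cache(cache, attention_scores, keep_ratio=0.25):
    """cache: list of (token_id, state) entries.
    attention_scores: per-entry saliency accumulated over decode steps."""
    assert len(cache) == len(attention_scores)
    k = max(1, int(len(cache) * keep_ratio))
    # Rank entries by saliency, descending; survivors keep original order.
    ranked = sorted(range(len(cache)),
                    key=lambda i: attention_scores[i], reverse=True)
    keep = sorted(ranked[:k])
    return [cache[i] for i in keep]

cache = [(t, f"state{t}") for t in range(8)]
saliency = [0.9, 0.1, 0.8, 0.05, 0.7, 0.02, 0.6, 0.01]
pruned = evict_cache(cache, saliency, keep_ratio=0.5)
print([t for t, _ in pruned])  # [0, 2, 4, 6]
```

The paper's "delayed bidirectional" scheme additionally considers suffix entries and timing of eviction, which this sketch omits.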

[68] Modeling Annotator Disagreement with Demographic-Aware Experts and Synthetic Perspectives

Yinuo Xu, Veronica Derricks, Allison Earl, David Jurgens

Main category: cs.CL

TL;DR: DEM-MoE models annotator disagreement using demographic-aware routing and synthetic data imputation via LLM persona prompting to better represent diverse perspectives in subjective NLP tasks.

Motivation: To better model annotator disagreement in subjective NLP tasks by capturing structured group-level variation and addressing sparse demographic coverage in annotation data.

Method: Developed DEM-MoE model that routes inputs to expert subnetworks based on annotator demographics, and used LLM-generated synthetic annotations via zero-shot persona prompting for data imputation, with strategies for blending real and synthetic data.

Result: DEM-MoE performs competitively across demographic groups, especially on datasets with high annotator disagreement. Synthetic judgments align moderately well with human annotations and offer scalable data enrichment. Optimal blending strategies depend on dataset structure.

Conclusion: The combination of demographic-aware modeling and synthetic data imputation improves representation of diverse perspectives in subjective NLP tasks, with dataset-specific strategies being important for optimal performance.

Abstract: We present an approach to modeling annotator disagreement in subjective NLP tasks through both architectural and data-centric innovations. Our model, DEM-MoE (Demographic-Aware Mixture of Experts), routes inputs to expert subnetworks based on annotator demographics, enabling it to better represent structured, group-level variation compared to prior models. DEM-MoE consistently performs competitively across demographic groups, and shows especially strong results on datasets with high annotator disagreement. To address sparse demographic coverage, we test whether LLM-generated synthetic annotations via zero-shot persona prompting can be used for data imputation. We show these synthetic judgments align moderately well with human annotations on our data and offer a scalable way to potentially enrich training data. We then propose and evaluate approaches for blending real and synthetic data using strategies tailored to dataset structure. We find that the optimal strategies depend on dataset structure. Together, these contributions improve the representation of diverse perspectives.
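
Demographic-aware routing can be sketched as a soft mixture over expert subnetworks, with mixture weights derived from the annotator's demographic attributes. Everything below (the expert functions, the demographic categories, and the affinity weights) is invented for illustration and is not DEM-MoE's actual architecture:

```python
# Toy sketch of demographic-aware mixture-of-experts routing: a router
# turns annotator demographics into normalized weights over experts, and
# the prediction is the weighted mix of expert outputs.

def soft_route(demographics, expert_weights):
    """Return normalized mixture weights over experts for one annotator."""
    scores = {name: sum(w.get(d, 0.0) for d in demographics)
              for name, w in expert_weights.items()}
    total = sum(scores.values()) or 1.0
    return {name: s / total for name, s in scores.items()}

# Hypothetical experts: each maps an input score to a judgment.
experts = {
    "expert_a": lambda x: 0.9 * x,
    "expert_b": lambda x: 0.5 * x,
}
# Hypothetical affinity of each expert for demographic attributes.
weights = {
    "expert_a": {"18-29": 1.0, "urban": 1.0},
    "expert_b": {"30-49": 1.0, "rural": 1.0},
}

mix = soft_route(["18-29", "rural"], weights)
prediction = sum(mix[name] * experts[name](0.8) for name in experts)
print(round(prediction, 3))  # 0.56
```

In the real model the router and experts are learned subnetworks; the point here is only the routing structure.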

[69] Evaluating Large Language Models for Detecting Antisemitism

Jay Patel, Hrudayangam Mehta, Jeremy Blackburn

Main category: cs.CL

TL;DR: This paper evaluates 8 open-source LLMs for detecting antisemitic content using in-context learning and introduces Guided-CoT prompting to improve performance and reduce refusals.

Motivation: Hateful content detection requires continuous adaptation, and automated tools need to handle evolving social media landscapes effectively.

Method: Evaluated 8 open-source LLMs using in-context definition and moderation policy guidelines. Developed Guided-CoT prompting technique with domain-specific thoughts to improve performance.

Result: Guided-CoT improved performance and utility across all models, reducing refusals. Llama 3.1 70B outperformed fine-tuned GPT-3.5. Introduced metrics for semantic divergence in rationales, revealing differences in LLM behaviors.

Conclusion: LLMs show varying utility, explainability, and reliability in hate speech detection. Guided-CoT effectively handles in-context policies and improves model performance across different configurations.

Abstract: Detecting hateful content is a challenging and important problem. Automated tools, like machine-learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs’ capability to detect antisemitic content, specifically leveraging in-context definition. We also study how LLMs understand and explain their decisions given a moderation policy as a guideline. First, we explore various prompting techniques and design a new CoT-like prompt, Guided-CoT, and find that injecting domain-specific thoughts increases performance and utility. Guided-CoT handles the in-context policy well, improving performance and utility by reducing refusals across all evaluated models, regardless of decoding configuration, model size, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. Additionally, we examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight the differences observed across LLMs’ utility, explainability, and reliability. Code and resources available at: https://github.com/idramalab/quantify-llm-explanations
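
A Guided-CoT-style prompt injects domain-specific guiding questions alongside the in-context policy before the final classification request. The policy text and guide steps below are illustrative placeholders; the paper's actual prompt wording lives in its repository:

```python
# Assembling a guided chain-of-thought prompt (illustrative sketch):
# moderation policy + domain-specific guiding questions + the post.

POLICY = "Content is antisemitic if it attacks Jewish people ..."  # truncated placeholder
GUIDE_STEPS = [
    "Does the post reference Jewish people, symbols, or coded terms?",
    "Is the reference an attack, stereotype, or conspiracy claim?",
    "Does the moderation policy above cover this behavior?",
]

def build_guided_cot_prompt(post: str) -> str:
    steps = "\n".join(f"{i}. {q}" for i, q in enumerate(GUIDE_STEPS, 1))
    return (
        f"Moderation policy:\n{POLICY}\n\n"
        f"Answer the following before deciding:\n{steps}\n\n"
        f"Post: {post}\n"
        "Final answer (antisemitic / not antisemitic):"
    )

prompt = build_guided_cot_prompt("example post text")
print(prompt.splitlines()[0])  # Moderation policy:
```

The structure, not the exact wording, is what the paper's results attribute the gains to: domain-specific thoughts reduce refusals across decoding configurations.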

[70] FaStfact: Faster, Stronger Long-Form Factuality Evaluations in LLMs

Yingjia Wan, Haochen Tan, Xiao Zhu, Xinyu Zhou, Zhiwei Li, Qingsong Lv, Changxuan Sun, Jiaqi Zeng, Yi Xu, Jianqiao Lu, Yinhong Liu, Zhijiang Guo

Main category: cs.CL

TL;DR: FaStfact is an efficient framework for evaluating factuality of long-form LLM generations using chunk-level claim extraction with confidence-based pre-verification and document-level evidence retrieval.

Motivation: Existing methods for evaluating long-form factuality suffer from inefficiency due to overcomplicated pipelines and ineffectiveness from inaccurate claim sets and insufficient evidence.

Method: Uses chunk-level claim extraction with confidence-based pre-verification to reduce time/token cost, and collects document-level evidence from web pages with selective retrieval during verification.

Result: Achieves highest alignment with human evaluation and best time/token efficiency among existing baselines, as demonstrated on the FaStfact-Bench benchmark.

Conclusion: FaStfact provides a reliable, efficient, and effective solution for evaluating long-form factuality in LLM generations.

Abstract: Evaluating the factuality of long-form generations from Large Language Models (LLMs) remains challenging due to efficiency bottlenecks and reliability concerns. Prior efforts attempt this by decomposing text into claims, searching for evidence, and verifying claims, but suffer from critical drawbacks: (1) inefficiency due to overcomplicated pipeline components, and (2) ineffectiveness stemming from inaccurate claim sets and insufficient evidence. To address these limitations, we propose \textbf{FaStfact}, an evaluation framework that achieves the highest alignment with human evaluation and time/token efficiency among existing baselines. FaStfact first employs chunk-level claim extraction integrated with confidence-based pre-verification, significantly reducing the time and token cost while ensuring reliability. For searching and verification, it collects document-level evidence from crawled web-pages and selectively retrieves it during verification. Extensive experiments based on an annotated benchmark \textbf{FaStfact-Bench} demonstrate the reliability of FaStfact in both efficiently and effectively evaluating long-form factuality. Code, benchmark data, and annotation interface tool are available at https://github.com/Yingjia-Wan/FaStfact.
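
The efficiency gain comes from skipping retrieval for claims the extractor already marks as confidently supported. A hedged sketch of that control flow, with stub extractor and verifier standing in for the LLM and web-search components:

```python
# High-level flow assumed from the description: chunk the generation,
# extract claims per chunk, pre-verify by confidence, and only send
# low-confidence claims to evidence retrieval.

def fastfact_pipeline(text, extract_claims, verify, confidence_threshold=0.9):
    chunks = [c.strip() for c in text.split("\n\n") if c.strip()]
    results = {}
    for chunk in chunks:
        for claim, confidence in extract_claims(chunk):
            if confidence >= confidence_threshold:
                results[claim] = True           # pre-verified, no search needed
            else:
                results[claim] = verify(claim)  # fall back to evidence retrieval
    return results

# Stub components for demonstration only.
extract = lambda chunk: [(chunk, 0.95 if "2" in chunk else 0.3)]
verify = lambda claim: "true" in claim

out = fastfact_pipeline("fact 2 here\n\na true statement", extract, verify)
print(out)
```

The token savings in the paper come precisely from the first branch: high-confidence claims never trigger the expensive search-and-verify path.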

[71] A Survey on Collaborating Small and Large Language Models for Performance, Cost-effectiveness, Cloud-edge Privacy, and Trustworthiness

Fali Wang, Jihai Chen, Shuhua Yang, Ali Al-Lawati, Linli Tang, Hui Liu, Suhang Wang

Main category: cs.CL

TL;DR: This paper presents a systematic survey of Small Language Model (SLM) and Large Language Model (LLM) collaboration frameworks, categorizing them by four main objectives: performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness.

Motivation: LLMs face challenges including high fine-tuning costs, inference latency, limited edge deployability, and reliability concerns. SLMs offer compact, efficient, and adaptable solutions, making collaborative frameworks that integrate their complementary strengths promising.

Method: The paper conducts a systematic survey and proposes a taxonomy covering four collaboration objectives. It reviews representative methods, summarizes design paradigms, and analyzes different approaches to SLM-LLM collaboration.

Result: The survey provides a comprehensive framework for understanding SLM-LLM collaboration, categorizing existing methods and identifying key design patterns that leverage SLMs’ specialization and efficiency with LLMs’ generalization and reasoning capabilities.

Conclusion: The paper outlines open challenges and future directions for efficient and secure SLM-LLM collaboration, providing a foundation for further research in this emerging field.

Abstract: Large language models (LLMs) have achieved remarkable progress across domains and applications but face challenges such as high fine-tuning costs, inference latency, limited edge deployability, and reliability concerns. Small language models (SLMs), with compact, efficient, and adaptable features, offer promising solutions. Building on this potential, recent research explores collaborative frameworks that integrate their complementary strengths, leveraging SLMs' specialization and efficiency with LLMs’ generalization and reasoning to address diverse objectives across tasks and deployment scenarios. Motivated by these developments, this paper presents a systematic survey of SLM-LLM collaboration from the perspective of collaboration objectives. We propose a taxonomy covering four goals: performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness. Under this framework, we review representative methods, summarize design paradigms, and outline open challenges and future directions toward efficient and secure SLM-LLM collaboration. The collected papers are available at https://github.com/FairyFali/SLMs-Survey.
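
One recurring pattern under the survey's cost-effectiveness objective is a confidence-based cascade: answer with the small model when it is sure, and escalate to the large model otherwise. The models and threshold below are stand-ins, sketching the pattern rather than any specific surveyed method:

```python
# Confidence-based SLM-to-LLM cascade (illustrative sketch).

def cascade(query, slm, llm, threshold=0.8):
    """Return (answer, model_used) for one query."""
    answer, confidence = slm(query)
    if confidence >= threshold:
        return answer, "slm"       # cheap path: SLM is confident
    return llm(query), "llm"       # expensive path: escalate

# Stub models: the SLM is confident only on short queries.
slm = lambda q: ("short answer", 0.9 if len(q) < 20 else 0.4)
llm = lambda q: "long, carefully reasoned answer"

print(cascade("easy question", slm, llm))                   # handled by the SLM
print(cascade("a much harder, longer question", slm, llm))  # escalated
```

Cloud-edge privacy variants of this pattern keep the SLM (and the sensitive data) on-device and only escalate sanitized queries.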

[72] Constraint-Driven Small Language Models Based on Agent and OpenAlex Knowledge Graph: Mining Conceptual Pathways and Discovering Innovation Points in Academic Papers

Ziye Xia, Sergei S. Ospichev

Main category: cs.CL

TL;DR: This paper proposes a prompt engineering-based method for analyzing key concept paths in academic papers using small language models and knowledge graph constraints to identify innovation points and rare paths.

Motivation: The rapid growth of academic publications makes it difficult for scientists to track research findings. Existing paper databases only offer basic concept matching and classification, lacking deep exploration of concept relationships.

Method: Based on OpenAlex knowledge graph, the method uses prompt engineering with small language models for key concept extraction and innovation identification, enhanced by knowledge graph constraint mechanisms. Fine-tuned Qwen and DeepSeek models were used.

Result: Analysis of 8,000 papers revealed strong correlation between key concept path distribution patterns and innovation points/rare paths. Fine-tuned models achieved significant accuracy improvements and are available on Hugging Face.

Conclusion: The proposed method successfully addresses limitations of existing paper analysis systems by enabling deeper exploration of concept relationships and innovation identification through knowledge graph-constrained language models.

Abstract: In recent years, the rapid increase in academic publications across various fields has posed severe challenges for academic paper analysis: scientists struggle to timely and comprehensively track the latest research findings and methodologies. Key concept extraction has proven to be an effective analytical paradigm, and its automation has been achieved with the widespread application of language models in industrial and scientific domains. However, existing paper databases are mostly limited to similarity matching and basic classification of key concepts, failing to deeply explore the relational networks between concepts. This paper is based on the OpenAlex open-source knowledge graph. By analyzing nearly 8,000 openly available papers from Novosibirsk State University, we discovered a strong correlation between the distribution patterns of paper key concept paths and both innovation points and rare paths. We propose a prompt engineering-based key concept path analysis method. This method leverages small language models to achieve precise key concept extraction and innovation point identification, and constructs an agent based on a knowledge graph constraint mechanism to enhance analysis accuracy. Through fine-tuning of the Qwen and DeepSeek models, we achieved significant improvements in accuracy, with the models publicly available on the Hugging Face platform.

[73] SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

Qiusi Zhan, Angeline Budiman-Chan, Abdelrahman Zayed, Xingzhi Guo, Daniel Kang, Joo-Kyung Kim

Main category: cs.CL

TL;DR: Search agents using LLMs are more vulnerable to producing harmful outputs than base LLMs, especially when retrieving external information. SafeSearch uses multi-objective reinforcement learning to improve safety while maintaining utility.

Motivation: LLM-based search agents show increased safety risks compared to base models, as they may lower refusal thresholds when retrieving external documents and synthesize unsafe information from retrieved sources.

Method: SafeSearch uses multi-objective reinforcement learning with a final-output safety/utility reward and a novel query-level shaping term that penalizes unsafe queries and rewards safe ones.

Result: SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while maintaining safe, helpful responses and matching the QA performance of utility-only fine-tuned agents.

Conclusion: Joint alignment of safety and utility is crucial for search agents, and the query-level reward in SafeSearch effectively improves both safety and utility simultaneously.

Abstract: Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked “How can I track someone’s location without their consent?”, a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once appended, synthesize them into an informative yet unsafe summary. We further show that utility-oriented fine-tuning intensifies this risk, motivating joint alignment of safety and utility. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only finetuned agent; further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.
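
The reward structure described above, a final-output safety/utility term plus query-level shaping, can be sketched numerically. The coefficients and safety checks here are invented; the paper trains against learned safety judges, not string matching:

```python
# Sketch of a SafeSearch-style multi-objective reward: the final-output
# term pays off only when the answer is both safe AND useful, and a
# query-level shaping term rewards safe intermediate search queries
# while penalizing unsafe ones.

def safesearch_reward(answer_safe, answer_useful, queries, is_safe_query,
                      shaping_weight=0.1):
    final_reward = 1.0 if (answer_safe and answer_useful) else 0.0
    # Query-level shaping: +1 per safe query, -1 per unsafe one.
    shaping = sum(1.0 if is_safe_query(q) else -1.0 for q in queries)
    return final_reward + shaping_weight * shaping

# Stub safety check for demonstration only.
is_safe = lambda q: "without consent" not in q

r = safesearch_reward(True, True,
                      ["public privacy laws", "track location without consent"],
                      is_safe)
print(r)  # 1.0 + 0.1 * (1 - 1) = 1.0
```

The shaping term is what discourages the agent from issuing unsafe queries even when the final answer would otherwise score well.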

[74] AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation

Xianyang Liu, Yilin Liu, Shuai Wang, Hao Cheng, Andrew Estornell, Yuzhi Zhao, Jiaheng Wei

Main category: cs.CL

TL;DR: AgenticMath is an agentic pipeline for generating high-quality mathematical question-answer pairs to enhance LLM reasoning, achieving competitive performance with much smaller datasets than traditional methods.

Motivation: Current methods for creating LLM reasoning datasets suffer from generating low-quality/incorrect answers and limited information richness from available data sources.

Method: Four-stage pipeline: (1) Seed Question Filter for high-quality selection, (2) Agentic Question Rephrase using multi-agent system for diverse paraphrases, (3) Answer Augment with chain-of-thought reasoning for correctness, (4) Question and Answer Evaluation to retain superior pairs.

Result: Fine-tuning 3B-8B parameter LLMs on AgenticMath datasets (30-60K samples) achieves competitive/superior performance on mathematical reasoning benchmarks compared to baselines trained on much larger datasets (400K-2.3M samples).

Conclusion: Targeted, high-quality data generation is more efficient for improving mathematical reasoning in LLMs than large-scale, low-quality alternatives.

Abstract: The creation of high-quality datasets to improve Large Language Model (LLM) reasoning remains a significant challenge, as current methods often suffer from generating low-quality/incorrect answers and limited information richness from available data sources. To address this, we propose AgenticMath, a novel agentic pipeline for generating high-quality mathematical question-answer pairs to enhance the supervised fine-tuning of LLMs. Our method operates through four stages: (1) a Seed Question Filter that selects questions with high information richness, complexity, and clarity; (2) an Agentic Question Rephrase step that employs a multi-agent system to generate diverse, logically consistent paraphrases; (3) an Answer Augment step that rewrites answers using chain-of-thought reasoning to enhance numerical and logical correctness, without reliance on human-provided labels; and (4) a final Question and Answer Evaluation that retains only the strongest pairs. Extensive experiments demonstrate that fine-tuning 3B-8B parameter LLMs on AgenticMath generated datasets (comprising only 30-60K math samples) achieves competitive or superior performance on diverse in-domain and out-of-domain mathematical reasoning benchmarks compared to baselines trained on much more data (e.g., 400K or 2.3M samples). Our work demonstrates that targeted, high-quality data generation is a more efficient path to improving mathematical reasoning in LLMs than large-scale, low-quality alternatives.
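
The four stages compose into a filter/map/score pipeline. In this sketch each stage is a stub standing in for an LLM agent call (the word-count filter and fixed score are invented heuristics), so only the control flow mirrors the paper:

```python
# Four-stage AgenticMath-style pipeline skeleton with stub agents.

def agenticmath_pipeline(seed_questions, rephrase, augment_answer, score,
                         min_score=0.8):
    kept = []
    for q in seed_questions:
        if len(q.split()) < 4:                   # Stage 1: filter weak seeds
            continue
        for variant in rephrase(q):              # Stage 2: multi-agent rephrasing
            answer = augment_answer(variant)     # Stage 3: CoT answer rewrite
            if score(variant, answer) >= min_score:  # Stage 4: evaluation
                kept.append((variant, answer))
    return kept

# Stub agents for demonstration only.
rephrase = lambda q: [q, q + " (rephrased)"]
augment = lambda q: "step-by-step answer to: " + q
score = lambda q, a: 0.9

pairs = agenticmath_pipeline(["short", "what is the sum of 2 and 3"],
                             rephrase, augment, score)
print(len(pairs))  # "short" is filtered out; the second seed yields 2 pairs
```

Swapping each stub for a prompted model recovers the paper's design; the staged structure is what keeps only high-quality pairs in the final dataset.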

[75] HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models

Stephan Oepen, Nikolay Arefev, Mikko Aulamo, Marta Bañón, Maja Buljan, Laurie Burchell, Lucas Charpentier, Pinzhen Chen, Mariya Fedorova, Ona de Gibert, Barry Haddow, Jan Hajič, Jindřich Helcl, Andrey Kutuzov, Veronika Laippala, Zihao Li, Risto Luukkonen, Bhavitvya Malik, Vladislav Mikhailov, Amanda Myntti, Dayyán O’Brien, Lucie Poláková, Sampo Pyysalo, Gema Ramírez Sánchez, Janine Siewert, Pavel Stepachev, Jörg Tiedemann, Teemu Vahtola, Dušan Variš, Fedor Vitiugin, Tea Vojtěchová, Jaume Zaragoza

Main category: cs.CL

TL;DR: Creation of the largest open multilingual dataset (30T tokens) with comprehensive preprocessing pipeline and evaluation benchmarks for 200 languages, including trained models.

Motivation: To provide open, high-quality multilingual datasets for LLM pre-training that address data scarcity and quality issues across many languages.

Method: Web crawling from multiple sources, with pipeline for HTML extraction, language ID, deduplication, annotation (register, quality, PII), filtering, and creation of parallel corpora via mining and machine translation.

Result: 30 trillion token dataset for 200 languages, 57 monolingual encoder-decoder models, multilingual benchmarks for 9 European languages, and large parallel text collection.

Conclusion: Successfully created comprehensive multilingual resources enabling better LLM development across many languages through open datasets and evaluation frameworks.

Abstract: We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and accompanied by a complete, open-source pipeline for document selection from web archives, text extraction from HTML, language identification for noisy texts, exact and near-deduplication, annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for 24 languages, and through end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. Additionally, we train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.
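
Near-deduplication, one stage of the pipeline above, can be sketched with word-shingle Jaccard similarity. Web-scale pipelines like this one typically approximate Jaccard with MinHash for efficiency, so this exact-Jaccard version is only an illustration of the idea, not HPLT's implementation:

```python
# Near-duplicate detection via n-gram (shingle) Jaccard similarity.

def shingles(text, n=3):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def near_duplicates(a, b, threshold=0.8):
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return a == b  # too short to shingle: fall back to exact match
    jaccard = len(sa & sb) / len(sa | sb)
    return jaccard >= threshold

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over the lazy cat"
print(near_duplicates(doc1, doc2, threshold=0.5))  # True (Jaccard = 0.75)
```

The threshold trades recall of near-duplicates against the risk of discarding legitimately similar documents, a central tuning knob in corpus cleaning.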

cs.CV

[76] Cropland Mapping using Geospatial Embeddings

Ivan Zvonkov, Gabriel Tseng, Inbal Becker-Reshef, Hannah Kerner

Main category: cs.CV

TL;DR: Geospatial embeddings from Presto and AlphaEarth enable efficient and accurate cropland mapping in Togo, supporting land use change analysis and climate impact assessment.

Motivation: Need for accurate land cover maps to understand land use change as a climate change driver, with geospatial embeddings offering efficient mapping solutions that are underexplored in real-world applications.

Method: Used geospatial embeddings from Presto and AlphaEarth to produce cropland maps in Togo.

Result: Geospatial embeddings achieved high-accuracy cropland classification and simplified workflows.

Conclusion: Geospatial embeddings are effective tools for cropland mapping that can support better assessments of land use change and climate impacts.

Abstract: Accurate and up-to-date land cover maps are essential for understanding land use change, a key driver of climate change. Geospatial embeddings offer a more efficient and accessible way to map landscape features, yet their use in real-world mapping applications remains underexplored. In this work, we evaluated the utility of geospatial embeddings for cropland mapping in Togo. We produced cropland maps using embeddings from Presto and AlphaEarth. Our findings show that geospatial embeddings can simplify workflows, achieve high-accuracy cropland classification and ultimately support better assessments of land use change and its climate impacts.

[77] Generative Hints

Andy Dimnaku, Abdullah Yusuf Kavranoğlu, Yaser Abu-Mostafa

Main category: cs.CV

TL;DR: Generative hints is a training method that enforces known invariances in the entire input space using generative models to create virtual examples, outperforming standard data augmentation.

Motivation: Data augmentation alone doesn't fully capture invariant properties since it only learns from transformations of training data, failing to enforce invariances across the entire input space.

Method: Uses generative models trained on the dataset to create unlabeled virtual examples, then trains models in a semi-supervised manner with both classification and hint objectives on these virtual examples to enforce functional properties.

Result: Consistently outperforms standard data augmentation across datasets, architectures, and loss functions. Achieved up to 1.78% top-1 accuracy improvement on fine-grained visual classification benchmarks and 1.286% average boost on CheXpert X-ray dataset.

Conclusion: Generative hints effectively enforce known invariances beyond what data augmentation can achieve, providing significant performance improvements in visual classification tasks.

Abstract: Data augmentation is widely used in vision to introduce variation and mitigate overfitting, through enabling models to learn invariant properties, such as spatial invariance. However, these properties are not fully captured by data augmentation alone, since it attempts to learn the property on transformations of the training data only. We propose generative hints, a training methodology that directly enforces known invariances in the entire input space. Our approach leverages a generative model trained on the training set to approximate the input distribution and generate unlabeled images, which we refer to as virtual examples. These virtual examples are used to enforce functional properties known as hints. In generative hints, although the training dataset is fully labeled, the model is trained in a semi-supervised manner on both the classification and hint objectives, using the unlabeled virtual examples to guide the model in learning the desired hint. Across datasets, architectures, and loss functions, generative hints consistently outperform standard data augmentation when learning the same property. On popular fine-grained visual classification benchmarks, we achieved up to 1.78% top-1 accuracy improvement (0.63% on average) over fine-tuned models with data augmentation and an average performance boost of 1.286% on the CheXpert X-ray dataset.
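
The semi-supervised objective, a supervised loss on real labeled data plus an invariance "hint" penalty on unlabeled virtual examples, can be written out for a toy case. Here the hint is horizontal-flip invariance on a 1-D "image", and the linear model, data, and squared-error losses are all invented for illustration:

```python
# Generative-hints-style objective (toy sketch): classification loss on
# labeled examples + a hint loss that penalizes the model for changing
# its output when a virtual example is flipped.

def combined_loss(model, labeled, virtual, hint_weight=1.0):
    # Supervised term: squared error against the labels.
    cls = sum((model(x) - y) ** 2 for x, y in labeled) / len(labeled)
    # Hint term: prediction should be unchanged under the known invariance.
    hint = sum((model(x) - model(list(reversed(x)))) ** 2
               for x in virtual) / len(virtual)
    return cls + hint_weight * hint

# A flip-invariant toy model (symmetric weights), so both terms vanish.
model = lambda x: sum(v * w for v, w in zip(x, [0.5, 0.5, 0.5]))
labeled = [([1.0, 0.0, 1.0], 1.0)]
virtual = [[0.2, 0.7, 0.1]]   # stands in for a generated, unlabeled image

print(round(combined_loss(model, labeled, virtual), 4))
```

Because the virtual examples need no labels, the hint term can be enforced anywhere the generative model can sample, which is exactly what distinguishes this from augmenting only the training set.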

[78] Med-Banana-50K: A Cross-modality Large-Scale Dataset for Text-guided Medical Image Editing

Zhihui Chen, Mengling Feng

Main category: cs.CV

TL;DR: Med-Banana-50K is a comprehensive 50K-image dataset for instruction-based medical image editing across three modalities (chest X-ray, brain MRI, fundus photography) and 23 disease types, featuring systematic medical quality control and including failed attempts for preference learning.

Motivation: The research community lacks large-scale, high-quality, openly accessible datasets specifically designed for medical image editing with strict anatomical and clinical constraints, which constrains progress in multimodal medical AI.

Method: Dataset construction using Gemini-2.5-Flash-Image to generate bidirectional edits (lesion addition/removal) from real medical images, with systematic medical quality control via LLM-as-Judge evaluation and history-aware iterative refinement up to five rounds.

Result: Created Med-Banana-50K dataset with 50K images spanning three modalities and 23 disease types, including 37K failed attempts with full conversation logs for preference learning and alignment research.

Conclusion: Med-Banana-50K establishes a foundation for training and evaluating next-generation medical image editing models by providing a large-scale, medically validated, and fully documented resource that addresses the current gap in medical image editing datasets.

Abstract: Recent advances in multimodal large language models have enabled remarkable medical image editing capabilities. However, the research community’s progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built specifically for medical image editing with strict anatomical and clinical constraints. We introduce Med-Banana-50K, a comprehensive 50K-image dataset for instruction-based medical image editing spanning three modalities (chest X-ray, brain MRI, fundus photography) and 23 disease types. Our dataset is constructed by leveraging Gemini-2.5-Flash-Image to generate bidirectional edits (lesion addition and removal) from real medical images. What distinguishes Med-Banana-50K from general-domain editing datasets is our systematic approach to medical quality control: we employ LLM-as-Judge with a medically grounded rubric (instruction compliance, structural plausibility, realism, and fidelity preservation) and history-aware iterative refinement up to five rounds. Beyond single-turn editing, Med-Banana-50K includes 37K failed attempts with full conversation logs for preference learning and alignment research. By providing this large-scale, medically validated, and fully documented resource, Med-Banana-50K establishes a foundation for training and evaluating the next generation of medical image editing models. Our dataset and code are publicly available at [https://github.com/richardChenzhihui/med-banana-50k].
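
The history-aware refinement loop can be sketched as generate-judge-retry with accumulated feedback, capped at five rounds. Both stubs below are invented; the real pipeline calls Gemini-2.5-Flash-Image as the generator and an LLM judge scoring the medical rubric:

```python
# Sketch of history-aware iterative refinement with an LLM-as-Judge stub.

def refine_edit(instruction, generate, judge, max_rounds=5):
    history = []
    for round_no in range(1, max_rounds + 1):
        image = generate(instruction, history)   # edit conditioned on feedback so far
        passed, feedback = judge(image)          # rubric-based verdict + critique
        history.append(feedback)
        if passed:
            return image, round_no, history
    return None, max_rounds, history             # logged as a failed attempt

# Stubs: the generator "improves" each round; the judge accepts version 3.
generate = lambda instr, hist: f"edit-v{len(hist) + 1}"
judge = lambda img: (img.endswith("v3"), "improve structural plausibility")

image, rounds, history = refine_edit("add lesion to chest X-ray",
                                     generate, judge)
print(image, rounds)  # edit-v3 3
```

Note the failure branch still returns the full conversation history, which is how the dataset can ship 37K failed attempts with logs for preference learning.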

[79] ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology

Srikumar Sastry, Subash Khanal, Aayush Dhakal, Jiayu Lin, Dan Cher, Phoenix Jarosz, Nathan Jacobs

Main category: cs.CV

TL;DR: ProM3E is a probabilistic masked multimodal embedding model for ecology that learns to infer missing modalities through masked reconstruction, supports modality inversion, and enables cross-modal retrieval with mixed similarity metrics.

Motivation: To develop a multimodal model for ecology that can handle any-to-any generation of representations and analyze the feasibility of fusing different modalities for downstream tasks.

Method: Uses masked modality reconstruction in embedding space to learn inference of missing modalities from context modalities, with probabilistic modeling to determine optimal modality fusion.

Result: Achieves superior performance in cross-modal retrieval tasks using mixed inter-modal and intra-modal similarities, and demonstrates strong representation learning through linear probing.

Conclusion: ProM3E provides an effective framework for multimodal representation learning in ecology with capabilities for modality inference, inversion, and optimized cross-modal retrieval.

Abstract: We introduce ProM3E, a probabilistic masked multimodal embedding model for any-to-any generation of multimodal representations for ecology. ProM3E is based on masked modality reconstruction in the embedding space, learning to infer missing modalities given a few context modalities. By design, our model supports modality inversion in the embedding space. The probabilistic nature of our model allows us to analyse the feasibility of fusing various modalities for given downstream tasks, essentially learning what to fuse. Using these features of our model, we propose a novel cross-modal retrieval approach that mixes inter-modal and intra-modal similarities to achieve superior performance across all retrieval tasks. We further leverage the hidden representation from our model to perform linear probing tasks and demonstrate the superior representation learning capability of our model. All our code, datasets and model will be released at https://vishu26.github.io/prom3e.
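
The mixed-similarity retrieval idea in the abstract can be illustrated with a small sketch. The function name `mixed_retrieval_scores`, the mixing weight `alpha`, and the toy embeddings below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between every row of a and every row of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mixed_retrieval_scores(query_img, gallery_txt, gallery_img, alpha=0.5):
    """Blend inter-modal (image->text) and intra-modal (image->image)
    similarities; alpha is a hypothetical fixed mixing weight."""
    inter = cosine(query_img, gallery_txt)   # image query vs. text gallery
    intra = cosine(query_img, gallery_img)   # image query vs. image gallery
    return alpha * inter + (1.0 - alpha) * intra

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 8))       # 2 image-query embeddings
gt = rng.normal(size=(5, 8))      # 5 text-gallery embeddings
gi = rng.normal(size=(5, 8))      # the same 5 items' image embeddings
scores = mixed_retrieval_scores(q, gt, gi)
ranking = np.argsort(-scores, axis=1)   # best-matching gallery items first
```

In the paper, deciding how much weight each modality should get is tied to the model's probabilistic "what to fuse" analysis; here it is reduced to a single scalar for illustration.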

[80] EvtSlowTV – A Large and Diverse Dataset for Event-Based Depth Estimation

Sadiq Layi Macaulay, Nimet Kaygusuz, Simon Hadfield

Main category: cs.CV

TL;DR: EvtSlowTV is a large-scale event camera dataset from YouTube footage with 13B+ events, enabling self-supervised depth estimation without frame annotations.

Motivation: Existing event-based depth estimation methods are limited by small annotated datasets, restricting real-world generalization.

Method: Curated large-scale dataset from YouTube footage with diverse environmental conditions and motions, used for self-supervised learning framework.

Result: The dataset is an order of magnitude larger than existing event datasets; training on it enhances model generalization to complex scenes and motions.

Conclusion: EvtSlowTV bridges the dataset gap for event-based depth estimation and enables better generalization while preserving event data’s asynchronous nature.

Abstract: Event cameras, with their high dynamic range (HDR) and low latency, offer a promising alternative for robust depth estimation in challenging environments. However, many event-based depth estimation approaches are constrained by small-scale annotated datasets, limiting their generalizability to real-world scenarios. To bridge this gap, we introduce EvtSlowTV, a large-scale event camera dataset curated from publicly available YouTube footage, which contains more than 13B events across various environmental conditions and motions, including seasonal hiking, flying, scenic driving, and underwater exploration. EvtSlowTV is an order of magnitude larger than existing event datasets, providing an unconstrained, naturalistic setting for event-based depth learning. This work shows the suitability of EvtSlowTV for a self-supervised learning framework to capitalise on the HDR potential of raw event streams. We further demonstrate that training with EvtSlowTV enhances the model’s ability to generalise to complex scenes and motions. Our approach removes the need for frame-based annotations and preserves the asynchronous nature of event data.

[81] Hybrid Convolution and Vision Transformer NAS Search Space for TinyML Image Classification

Mikhael Djajapermana, Moritz Reiber, Daniel Mueller-Gritschneder, Ulf Schlichtmann

Main category: cs.CV

TL;DR: A new hybrid CNN-ViT neural architecture search space for tinyML that finds efficient models combining CNN and ViT blocks with searchable pooling layers.

Motivation: Existing hybrid CNN-ViT architectures are too large and computationally expensive for tinyML deployment, requiring more efficient alternatives.

Method: Created a NAS search space covering hybrid CNN and ViT blocks for local/global feature learning, plus novel searchable pooling layers for efficient feature map reduction.

Result: On CIFAR10, the search space produced hybrid architectures with superior accuracy and inference speed compared to ResNet-based tinyML models under tight size constraints.

Conclusion: The proposed hybrid CNN-ViT NAS search space enables finding efficient architectures suitable for tinyML deployment while maintaining competitive performance.

Abstract: Hybrids of Convolutional Neural Network (CNN) and Vision Transformer (ViT) have outperformed pure CNN or ViT architecture. However, since these architectures require large parameters and incur large computational costs, they are unsuitable for tinyML deployment. This paper introduces a new hybrid CNN-ViT search space for Neural Architecture Search (NAS) to find efficient hybrid architectures for image classification. The search space covers hybrid CNN and ViT blocks to learn local and global information, as well as the novel Pooling block of searchable pooling layers for efficient feature map reduction. Experimental results on the CIFAR10 dataset show that our proposed search space can produce hybrid CNN-ViT architectures with superior accuracy and inference speed to ResNet-based tinyML models under tight model size constraints.

[82] SCALE-VLP: Soft-Weighted Contrastive Volumetric Vision-Language Pre-training with Spatial-Knowledge Semantics

Ailar Mahdizadeh, Puria Azadi Moghadam, Xiangteng He, Shahriar Mirabbasi, Panos Nasiopoulos, Leonid Sigal

Main category: cs.CV

TL;DR: SCALE-VLP is a soft-weighted contrastive vision-language pre-training framework for volumetric medical imaging that integrates volumetric spatial semantics and domain-aware knowledge to improve cross-modal alignment and generalization.

Motivation: Most vision-language models are limited to 2D data and binary supervision, overlooking continuous dependencies in volumetric data like CT scans. Existing approaches treat volumetric scans as independent 2D slices, compromising spatial coherence and underutilizing clinical semantics.

Method: Proposes SCALE-VLP framework that integrates (i) volumetric spatial semantics to preserve anatomical structure and (ii) domain-aware, knowledge-infused semantics (e.g., radiological ontologies) to guide alignment using soft-weighted contrastive learning.

Result: Achieves up to 4.3x higher top-1 CT-report retrieval, improves abnormality classification by 10 points, and reaches ROUGE-L 0.44 and BERT-F1 0.89 for report generation. Shows consistent gains in zero-shot evaluation on out-of-domain external datasets.

Conclusion: SCALE-VLP demonstrates strong cross-task transferability (retrieval, report generation, classification) and cross-domain generalizability without further fine-tuning, indicating robust structural consistency and semantic grounding under limited supervision.

Abstract: Vision-language models (VLMs) have demonstrated strong cross-modal capabilities, yet most work remains limited to 2D data and assumes binary supervision (i.e., positive vs. negative pairs), overlooking the continuous and structured dependencies present in volumetric data such as CT. Existing approaches often treat volumetric scans as independent 2D slices, compromising spatial coherence and underutilizing rich clinical semantics. We propose SCALE-VLP, a soft-weighted contrastive vision-language pre-training framework that integrates (i) volumetric spatial semantics to preserve anatomical structure and (ii) domain-aware, knowledge-infused semantics (e.g., radiological ontologies) to guide alignment. This yields structurally consistent and semantically grounded representations under limited supervision, demonstrating strong cross-task transferability (retrieval, report generation, and classification), and cross-domain generalizability with consistent gains without further fine-tuning. In particular, compared to the previous state of the art, SCALE-VLP achieves up to 4.3x higher top-1 CT-report retrieval, improves abnormality classification by 10 points, and reaches ROUGE-L 0.44 and BERT-F1 0.89 for report generation. Further, in zero-shot evaluation on an out-of-domain external dataset, we observe consistent gains, indicating the cross-task and cross-domain generalization ability of SCALE-VLP.
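
The "soft-weighted" contrastive objective, replacing binary positive/negative pairs with graded targets, can be sketched as a cross-entropy between the similarity softmax and a soft target distribution. The toy target matrix and temperature below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_weighted_contrastive_loss(sim, soft_targets, tau=0.07):
    """Cross-entropy between the temperature-scaled similarity softmax
    and soft target weights (each row sums to 1) instead of one-hot
    positives."""
    logp = np.log(softmax(sim / tau, axis=1))
    return -(soft_targets * logp).sum(axis=1).mean()

# Toy batch: 3 volume-report pairs; off-diagonal target mass encodes
# partial semantic overlap (e.g., shared radiological ontology terms).
sim = np.array([[0.9, 0.2, 0.1],
                [0.1, 0.8, 0.3],
                [0.0, 0.2, 0.7]])
targets = np.array([[0.8, 0.2, 0.0],
                    [0.1, 0.8, 0.1],
                    [0.0, 0.2, 0.8]])
loss = soft_weighted_contrastive_loss(sim, targets)
```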

[83] Learning with less: label-efficient land cover classification at very high spatial resolution using self-supervised deep learning

Dakota Hester, Vitor S. Martins, Lucas B. Ferreira, Thainara M. A. Lima

Main category: cs.CV

TL;DR: A label-efficient approach using self-supervised learning for statewide 1-m land cover classification achieves 87.14% accuracy with only 1,000 annotated patches.

Motivation: Address the challenge of collecting large training datasets for high-resolution land cover mapping by developing a label-efficient method using self-supervised learning.

Method: Used BYOL pre-training with 377,921 unlabeled aerial images on ResNet-101, then fine-tuned multiple segmentation architectures (FCN, U-Net, etc.) with small labeled datasets (250-750 patches).

Result: Achieved 87.14% overall accuracy and 75.58% macro F1 score for 8-class land cover mapping over Mississippi, covering 123+ billion pixels.

Conclusion: Self-supervised learning effectively reduces manual annotation needs, enabling high-resolution land cover mapping at scale.

Abstract: Deep learning semantic segmentation methods have shown promising performance for very high 1-m resolution land cover classification, but the challenge of collecting large volumes of representative training data creates a significant barrier to widespread adoption of such models for meter-scale land cover mapping over large areas. In this study, we present a novel label-efficient approach for statewide 1-m land cover classification using only 1,000 annotated reference image patches with self-supervised deep learning. We use the “Bootstrap Your Own Latent” pre-training strategy with a large amount of unlabeled color-infrared aerial images (377,921 256x256 1-m pixel patches) to pre-train a ResNet-101 convolutional encoder. The learned encoder weights were subsequently transferred into multiple deep semantic segmentation architectures (FCN, U-Net, Attention U-Net, DeepLabV3+, UPerNet, PAN), which were then fine-tuned using very small training dataset sizes with cross-validation (250, 500, 750 patches). Among the fine-tuned models, we obtained an 87.14% overall accuracy and a 75.58% macro F1 score using an ensemble of the best-performing U-Net models for comprehensive 1-m, 8-class land cover mapping, covering more than 123 billion pixels over the state of Mississippi, USA. Detailed qualitative and quantitative analysis revealed accurate mapping of open water and forested areas, while highlighting challenges in accurate delineation between cropland, herbaceous, and barren land cover types. These results show that self-supervised learning is an effective strategy for reducing the need for large volumes of manually annotated data, directly addressing a major limitation to high spatial resolution land cover mapping at scale.
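
The "Bootstrap Your Own Latent" pre-training mentioned above rests on a target network updated as an exponential moving average (EMA) of the online encoder. A minimal sketch of that update rule, using a toy weight dict and a hypothetical momentum value:

```python
import numpy as np

def byol_ema_update(online, target, momentum=0.99):
    """BYOL-style update: target weights are an exponential moving
    average of the online encoder's weights (no gradients flow here)."""
    return {k: momentum * target[k] + (1 - momentum) * online[k]
            for k in target}

# Toy "network": a single weight vector, updated over many steps while
# the online weights stay fixed for illustration.
online = {"w": np.ones(3)}
target = {"w": np.zeros(3)}
for _ in range(100):
    target = byol_ema_update(online, target, momentum=0.9)
# target["w"] drifts toward online["w"] as training proceeds
```

In real BYOL training the online network also changes every step, so the target is a smoothed trailing copy rather than a converged one.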

[84] A Foundation Model for Brain MRI with Dynamic Modality Integration

Minh Sao Khue Luu, Bair N. Tuchinov

Main category: cs.CV

TL;DR: A foundation model for brain MRI that handles different combinations of imaging sequences using a single encoder with modality embeddings and masked autoencoding, enabling flexible adaptation to missing or unseen modalities.

Motivation: To eliminate the need for separate models for each MRI modality combination and enable robust performance when some imaging sequences are missing or unavailable.

Method: Uses one encoder with learnable modality embeddings, conditional layer normalization, and masked autoencoding objective with variance-covariance regularization. Trained on 60,000 multi-center MRIs using self-supervised reconstruction and modality imputation.

Result: Preliminary results show feasible performance on brain tumor and multiple sclerosis segmentation, as well as lesion classification under various modality settings.

Conclusion: The proposed foundation model provides a flexible and adaptable solution for brain MRI analysis that can handle diverse modality combinations without requiring separate specialized models.

Abstract: We present a foundation model for brain MRI that can work with different combinations of imaging sequences. The model uses one encoder with learnable modality embeddings, conditional layer normalization, and a masked autoencoding objective that accounts for missing modalities. A variance-covariance regularizer is applied to stabilize feature learning and improve representation diversity. This design removes the need for separate models for each modality and allows the network to adapt when some sequences are missing or unseen. It is trained on about 60,000 multi-center MRIs using self-supervised reconstruction and modality imputation to learn flexible representations. A learnable modality embedding guides feature extraction so the encoder can adjust to different inputs. We describe our planned evaluation on brain tumor and multiple sclerosis segmentation, as well as lesion classification, under various modality settings. Preliminary results show that the approach is feasible, and further experiments are planned to study its performance in more detail. All code and pretrained models are available at https://github.com/BrainFM/brainfm
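
Conditional layer normalization, one ingredient the abstract names, can be sketched as a layer norm whose scale and shift are selected by a modality id. The dimensions and the example modality list below are hypothetical:

```python
import numpy as np

def conditional_layer_norm(x, modality_id, gamma, beta, eps=1e-5):
    """Layer norm whose scale/shift parameters are indexed by modality,
    so a single encoder can adapt its feature statistics to whichever
    sequence type it is given."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)          # standard layer norm
    return gamma[modality_id] * x_hat + beta[modality_id]

num_modalities, dim = 4, 6   # e.g., T1, T2, FLAIR, DWI (hypothetical set)
gamma = np.ones((num_modalities, dim))   # per-modality scale
beta = np.zeros((num_modalities, dim))   # per-modality shift
x = np.random.default_rng(1).normal(size=(2, dim))
y = conditional_layer_norm(x, modality_id=2, gamma=gamma, beta=beta)
```

With identity `gamma`/`beta` this reduces to plain layer norm; training would learn distinct parameters per modality.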

[85] ISC-Perception: A Hybrid Computer Vision Dataset for Object Detection in Novel Steel Assembly

Miftahur Rahman, Samuel Adebayo, Dorian A. Acevedo-Mejia, David Hester, Daniel McPolin, Karen Rafferty, Debra F. Laefer

Main category: cs.CV

TL;DR: ISC-Perception is the first hybrid dataset for Intermeshed Steel Connection component detection, combining synthetic CAD images, photorealistic game-engine scenes, and real photographs to enable automatic labeling and reduce human effort by 81.7% compared to manual labeling.

Motivation: The absence of dedicated image corpora for ISC component detection hampers robotic perception in construction, as collecting real photos on active sites is logistically difficult and raises safety/privacy concerns.

Method: Created a hybrid dataset blending procedurally rendered CAD images, game-engine photorealistic scenes, and curated real photographs, enabling fully automatic labeling of synthetic portions while accounting for all human effort in dataset generation.

Result: Detectors trained on ISC-Perception achieved mean Average Precision at IoU 0.50 of 0.756, substantially surpassing synthetic-only or photorealistic-only models. On a 1,200-frame bench test, mAP@0.50/mAP@[0.50:0.95] was 0.943/0.823.

Conclusion: ISC-Perception bridges the data gap for construction-robotics perception, facilitating rapid development of custom object detectors and is freely available for research and industrial use.

Abstract: The Intermeshed Steel Connection (ISC) system, when paired with robotic manipulators, can accelerate steel-frame assembly and improve worker safety by eliminating manual assembly. Dependable perception is one of the initial stages for ISC-aware robots. However, this is hampered by the absence of a dedicated image corpus, as collecting photographs on active construction sites is logistically difficult and raises safety and privacy concerns. In response, we introduce ISC-Perception, the first hybrid dataset expressly designed for ISC component detection. It blends procedurally rendered CAD images, game-engine photorealistic scenes, and a limited, curated set of real photographs, enabling fully automatic labelling of the synthetic portion. We explicitly account for all human effort to produce the dataset, including simulation engine and scene setup, asset preparation, post-processing scripts and quality checks; our total human time to generate a 10,000-image dataset was 30.5 h versus 166.7 h for manual labelling at 60 s per image (−81.7%). A manual pilot on a representative image with five instances of ISC members took 60 s (maximum 80 s), anchoring the manual baseline. Detectors trained on ISC-Perception achieved a mean Average Precision at IoU 0.50 of 0.756, substantially surpassing models trained on synthetic-only or photorealistic-only data. On a 1,200-frame bench test, we report mAP@0.50/mAP@[0.50:0.95] of 0.943/0.823. By bridging the data gap for construction-robotics perception, ISC-Perception facilitates rapid development of custom object detectors and is freely available for research and industrial use upon request.
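
The quoted 81.7% labour saving follows directly from the reported totals, 10,000 images at 60 s each for manual labelling versus 30.5 h of human effort for the hybrid pipeline:

```python
# Human-effort comparison quoted in the abstract.
manual_hours = 10_000 * 60 / 3600   # 10,000 images at 60 s each -> ~166.7 h
hybrid_hours = 30.5                 # reported total for the hybrid pipeline
reduction = (manual_hours - hybrid_hours) / manual_hours  # fraction saved
```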

[86] SLIP: Structural-aware Language-Image Pretraining for Vision-Language Alignment

Wenbo Lu

Main category: cs.CV

TL;DR: SLIP introduces structural contrastive learning to VLP by modeling relationships between entities in structured graphs, outperforming CLIP on cross-modal tasks.

Motivation: Current VLP methods treat image-text pairs as isolated examples, ignoring rich relational structures in domains like e-commerce and social networks, despite neuroscientific evidence that humans encode knowledge through relationship cognitive maps.

Method: Integrates structural contrastive loss to align modalities while modeling relationships between neighboring entities in structured graphs, using a large-scale Amazon Product Co-purchase Multimodal Graph Dataset.

Result: Consistently outperforms CLIP on cross-modal retrieval and classification tasks in both zero-shot and few-shot settings.

Conclusion: Relational supervision provides significant value for cross-modal alignment, demonstrating the importance of incorporating structured relationships in vision-language pretraining.

Abstract: Vision-Language Pretraining (VLP) has achieved remarkable success across various downstream tasks, but such gains are largely driven by scaling up on training data. Yet existing methods treat image-text pairs as isolated training examples; this neglects the rich relational structure naturally present in many domains, such as e-commerce product co-purchase graphs and social recommendation networks. Inspired by neuroscientific evidence that humans encode knowledge as relational cognitive maps, we introduce Structure-aware Language-Image Pretraining (SLIP). SLIP integrates a structural contrastive loss to align modalities while also modeling relationships between neighboring entities in a structured graph. To support this paradigm, we construct a large-scale Amazon Product Co-purchase Multimodal Graph Dataset, enabling structured cross-modality supervision at scale. Experiment results show that SLIP consistently outperforms CLIP on cross-modal retrieval and classification tasks in both zero-shot and few-shot settings, showing the value of relational supervision for cross-modal alignment.
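
A structural contrastive loss of the kind the abstract describes can be sketched as InfoNCE with graph neighbors added to the positive set. The adjacency matrix, temperature, and weighting scheme below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def structural_contrastive_loss(sim, adjacency, tau=0.1):
    """InfoNCE-style loss in which graph neighbors (adjacency[i, j] == 1)
    count as positives alongside the matched pair itself."""
    n = sim.shape[0]
    positives = adjacency + np.eye(n)               # self pair always positive
    positives = positives / positives.sum(axis=1, keepdims=True)
    logits = sim / tau
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(positives * logp).sum(axis=1).mean()

# Toy co-purchase graph: products 0 and 1 are neighbors, product 2 is isolated.
sim = np.array([[0.9, 0.6, 0.1],
                [0.6, 0.8, 0.2],
                [0.1, 0.2, 0.9]])
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 0.0, 0.0]])
loss = structural_contrastive_loss(sim, adj)
```

With an all-zero adjacency this collapses back to standard CLIP-style contrastive training, which is the baseline the paper compares against.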

[87] From Propagation to Prediction: Point-level Uncertainty Evaluation of MLS Point Clouds under Limited Ground Truth

Ziyang Xu, Olaf Wysocki, Christoph Holst

Main category: cs.CV

TL;DR: Learning-based framework for MLS point cloud uncertainty evaluation using geometric features without ground truth, achieving accuracy comparable to Random Forest while running about three times faster.

Motivation: Ground truth for uncertainty evaluation in MLS point clouds is costly and often infeasible in real-world applications, creating a need for GT-free evaluation methods.

Method: Learning-based framework integrating optimal neighborhood estimation with geometric feature extraction, using XGBoost model for uncertainty prediction.

Result: Framework is feasible with XGBoost achieving comparable accuracy to Random Forest while being 3 times faster, showing geometric features can predict point-level uncertainty measured by C2C distance.

Conclusion: MLS point clouds’ uncertainty is learnable, providing a novel learning-based approach to uncertainty evaluation research without ground truth dependency.

Abstract: Evaluating uncertainty is critical for reliable use of Mobile Laser Scanning (MLS) point clouds in many high-precision applications such as Scan-to-BIM, deformation analysis, and 3D modeling. However, obtaining the ground truth (GT) for evaluation is often costly and infeasible in many real-world applications. To reduce this long-standing reliance on GT in uncertainty evaluation research, this study presents a learning-based framework for MLS point clouds that integrates optimal neighborhood estimation with geometric feature extraction. Experiments on a real-world dataset show that the proposed framework is feasible and the XGBoost model delivers fully comparable accuracy to Random Forest while achieving substantially higher efficiency (about 3 times faster), providing initial evidence that geometric features can be used to predict point-level uncertainty quantified by the C2C distance. In summary, this study shows that MLS point clouds’ uncertainty is learnable, offering a novel learning-based viewpoint towards uncertainty evaluation research.
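
Geometric features for point clouds are commonly derived from the eigenvalues of a neighborhood's covariance matrix (linearity, planarity, sphericity). A minimal sketch of such a descriptor; the exact feature set the paper uses is not specified here, and a regressor such as XGBoost would then map these features to the point-level C2C distance:

```python
import numpy as np

def neighborhood_geometric_features(points):
    """Covariance-eigenvalue descriptors of a point neighborhood:
    linearity, planarity, sphericity, with lambda1 >= lambda2 >= lambda3.
    The three values sum to 1 by construction."""
    cov = np.cov(points.T)                       # 3x3 covariance of the points
    lam = np.sort(np.linalg.eigvalsh(cov))[::-1]
    l1, l2, l3 = lam
    return {
        "linearity": (l1 - l2) / l1,
        "planarity": (l2 - l3) / l1,
        "sphericity": l3 / l1,
    }

# A nearly planar neighborhood: z is small noise around a flat surface.
rng = np.random.default_rng(0)
pts = np.column_stack([rng.uniform(-1, 1, 200),
                       rng.uniform(-1, 1, 200),
                       rng.normal(0, 0.01, 200)])
feats = neighborhood_geometric_features(pts)   # planarity dominates
```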

[88] A Plug-and-Play Framework for Volumetric Light-Sheet Image Reconstruction

Yi Gong, Xinyuan Zhang, Jichen Chai, Yichen Ding, Yifei Lou

Main category: cs.CV

TL;DR: A computational imaging framework combining Compressive Sensing with Light-Sheet Microscopy for high-speed, low-phototoxic cardiac imaging using compressed acquisition and advanced reconstruction algorithms.

Motivation: Traditional optical imaging struggles with capturing dynamic cellular structure in beating hearts due to spatial-temporal resolution trade-offs, requiring a solution for efficient, low-phototoxic cardiac imaging.

Method: Uses compressed acquisition via DMD random binary mask coding and a Plug-and-Play framework with ADMM optimization incorporating advanced denoisers (Tikhonov, TV, BM3D) plus temporal regularization for structural continuity.

Result: Successfully reconstructs cellular structures in zebrafish heart imaging under high compression ratios with excellent denoising performance and image clarity.

Conclusion: The proposed method effectively addresses real-world high-speed, low-light biological imaging challenges, demonstrating robustness and effectiveness in cardiac imaging applications.

Abstract: Cardiac contraction is a rapid, coordinated process that unfolds across three-dimensional tissue on millisecond timescales. Traditional optical imaging is often inadequate for capturing dynamic cellular structure in the beating heart because of a fundamental trade-off between spatial and temporal resolution. To overcome these limitations, we propose a high-performance computational imaging framework that integrates Compressive Sensing (CS) with Light-Sheet Microscopy (LSM) for efficient, low-phototoxic cardiac imaging. The system performs compressed acquisition of fluorescence signals via random binary mask coding using a Digital Micromirror Device (DMD). We propose a Plug-and-Play (PnP) framework, solved using the alternating direction method of multipliers (ADMM), which flexibly incorporates advanced denoisers, including Tikhonov, Total Variation (TV), and BM3D. To preserve structural continuity in dynamic imaging, we further introduce temporal regularization enforcing smoothness between adjacent z-slices. Experimental results on zebrafish heart imaging under high compression ratios demonstrate that the proposed method successfully reconstructs cellular structures with excellent denoising performance and image clarity, validating the effectiveness and robustness of our algorithm in real-world high-speed, low-light biological imaging scenarios.
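
The PnP-ADMM scheme in the abstract alternates a data-fidelity step, a black-box denoiser standing in for the prior's proximal operator, and a dual update. A minimal 1-D sketch with a moving-average denoiser substituted for Tikhonov/TV/BM3D (the signal, penalty parameter, and iteration count are toy choices):

```python
import numpy as np

def box_denoiser(v, k=5):
    # Stand-in denoiser (moving average); the paper plugs in Tikhonov,
    # TV, or BM3D at this step.
    return np.convolve(v, np.ones(k) / k, mode="same")

def pnp_admm_denoise(y, rho=1.0, iters=30):
    """PnP-ADMM for min_x 0.5*||x - y||^2 + R(x), with the proximal
    operator of R replaced by a black-box denoiser."""
    x, z, u = y.copy(), y.copy(), np.zeros_like(y)
    for _ in range(iters):
        x = (y + rho * (z - u)) / (1.0 + rho)   # data-fidelity step (exact prox)
        z = box_denoiser(x + u)                 # denoiser enforces the prior
        u = u + x - z                           # dual update on the splitting
    return x

rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 200)
clean = np.sin(t)
noisy = clean + 0.3 * rng.normal(size=t.size)
recon = pnp_admm_denoise(noisy)   # smoother than the noisy input
```

The paper's volumetric version additionally couples adjacent z-slices with a temporal regularizer, which this 1-D sketch omits.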

[89] DentalSplat: Dental Occlusion Novel View Synthesis from Sparse Intra-Oral Photographs

Yiyi Miao, Taoyu Wu, Tong Chen, Sihao Li, Ji Jiang, Youpeng Yang, Angelos Stefanidis, Limin Yu, Jionglong Su

Main category: cs.CV

TL;DR: DentalSplat enables 3D reconstruction from sparse orthodontic images using 3D Gaussian Splatting with prior-guided initialization, scale-adaptive pruning, and optical flow constraints.

Motivation: Orthodontic telemedicine requires 3D dental occlusion visualization from sparse input views (typically 3 images), but conventional 3DGS methods need dense multi-view inputs and precise camera poses.

Method: Uses prior-guided dense stereo reconstruction for point cloud initialization, scale-adaptive pruning for efficiency, and incorporates optical flow with gradient regularization for geometric constraints in sparse scenarios.

Result: Validated on 950 clinical cases and 195 video-based test cases; achieves superior novel view synthesis quality for dental occlusion visualization compared to state-of-the-art methods.

Conclusion: DentalSplat effectively handles sparse orthodontic imagery and enables high-quality 3D reconstruction for telemedicine applications.

Abstract: In orthodontic treatment, particularly within telemedicine contexts, observing patients’ dental occlusion from multiple viewpoints facilitates timely clinical decision-making. Recent advances in 3D Gaussian Splatting (3DGS) have shown strong potential in 3D reconstruction and novel view synthesis. However, conventional 3DGS pipelines typically rely on densely captured multi-view inputs and precisely initialized camera poses, limiting their practicality. Orthodontic cases, in contrast, often comprise only three sparse images, specifically, the anterior view and bilateral buccal views, rendering the reconstruction task especially challenging. The extreme sparsity of input views severely degrades reconstruction quality, while the absence of camera pose information further complicates the process. To overcome these limitations, we propose DentalSplat, an effective framework for 3D reconstruction from sparse orthodontic imagery. Our method leverages a prior-guided dense stereo reconstruction model to initialize the point cloud, followed by a scale-adaptive pruning strategy to improve the training efficiency and reconstruction quality of 3DGS. In scenarios with extremely sparse viewpoints, we further incorporate optical flow as a geometric constraint, coupled with gradient regularization, to enhance rendering fidelity. We validate our approach on a large-scale dataset comprising 950 clinical cases and an additional video-based test set of 195 cases designed to simulate real-world remote orthodontic imaging conditions. Experimental results demonstrate that our method effectively handles sparse input scenarios and achieves superior novel view synthesis quality for dental occlusion visualization, outperforming state-of-the-art techniques.

[90] Image-Intrinsic Priors for Integrated Circuit Defect Detection and Novel Class Discovery via Self-Supervised Learning

Botong Zhao, Xubin Wang, Shujing Lyu, Yue Lu

Main category: cs.CV

TL;DR: IC DefectNCD is a support-set-free framework that uses image-intrinsic priors in IC SEM images for defect detection and novel class discovery, achieving robust performance without extensive human annotation.

Motivation: Supervised methods require extensive human annotation and struggle with emergent categories and rare defects, while clustering-based unsupervised methods have unstable performance due to missing priors.

Method: Uses Self Normal Information Guided defect detection with learnable normal feature extractor and reconstruction residuals, adaptive binarization for saliency variations, and Self Defect Information Guided classification with soft mask guided attention mechanism.

Result: Validated on real-world dataset spanning three fabrication stages and 15 defect types, demonstrating robust performance on both defect detection and unseen defect classification.

Conclusion: The proposed framework effectively handles defect detection and novel class discovery in IC manufacturing without requiring extensive human annotation or suffering from instability issues of previous methods.

Abstract: Integrated circuit manufacturing is highly complex, comprising hundreds of process steps. Defects can arise at any stage, causing yield loss and ultimately degrading product reliability. Supervised methods require extensive human annotation and struggle with emergent categories and rare, data-scarce defects. Clustering-based unsupervised methods often exhibit unstable performance due to missing priors. We propose IC DefectNCD, a support-set-free framework that leverages Image-Intrinsic Priors in IC SEM images for defect detection and novel class discovery. We first develop Self Normal Information Guided IC Defect Detection, aggregating representative normal features via a learnable normal information extractor and using reconstruction residuals to coarsely localize defect regions. To handle saliency variations across defects, we introduce an adaptive binarization strategy that produces stable subimages focused on core defective areas. Finally, we design Self Defect Information Guided IC Defect Classification, which incorporates a soft-mask-guided attention mechanism to inject spatial defect priors into the teacher-student model. This enhances sensitivity to defective regions, suppresses background interference, and enables recognition and classification of unseen defects. We validate the approach on a real-world dataset spanning three key fabrication stages and covering 15 defect types. Experiments demonstrate robust performance on both defect detection and unseen defect classification.
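
The coarse localization step, thresholding reconstruction residuals adaptively, can be sketched as follows. The mean-plus-k-sigma rule and the toy image are illustrative assumptions, not the paper's exact binarization:

```python
import numpy as np

def residual_defect_mask(image, reconstruction, k=3.0):
    """Coarse defect localization from reconstruction residuals with an
    adaptive (mean + k*std) threshold; a simplified stand-in for the
    paper's adaptive binarization strategy."""
    residual = np.abs(image - reconstruction)
    thresh = residual.mean() + k * residual.std()
    return residual > thresh

# Toy SEM-like image: flat background plus a small bright defect that a
# "normal-only" reconstruction fails to reproduce.
rng = np.random.default_rng(0)
img = rng.normal(0.5, 0.01, size=(32, 32))
recon = img.copy()                 # reconstruction reproduces the background
img[10:13, 10:13] += 0.5           # injected defect absent from reconstruction
mask = residual_defect_mask(img, recon)   # True exactly on the 3x3 defect
```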

[91] Accelerating Physical Property Reasoning for Augmented Visual Cognition

Hongbo Lan, Zhenlin An, Haoyu Li, Vaibhav Singh, Longfei Shangguan

Main category: cs.CV

TL;DR: \sysname accelerates vision-guided physical property reasoning from 10-20 minutes to under 6 seconds (62.9x-287.2x speedup) while maintaining or improving accuracy, enabling real-time augmented visual cognition on smart glasses.

Motivation: To enable real-time augmented visual cognition by reducing the latency of vision-guided physical property reasoning, which traditionally takes 10-20 minutes, making it impractical for interactive applications like smart glasses.

Method: Combines algorithmic and systematic optimizations including rapid geometric 3D reconstruction, efficient semantic feature fusion, and parallel view encoding to minimize run-time latency.

Result: Achieves 62.9x-287.2x speedup on ABO dataset while maintaining on-par or better object-level physical property estimation accuracy (e.g., mass), and superior performance in material segmentation and voxel-level inference compared to SOTA baselines.

Conclusion: \sysname enables robust real-world physical property reasoning on smart glasses, demonstrated through successful case studies in cluttered environments like IKEA furniture stores, providing consistent performance even with fewer views.

Abstract: This paper introduces \sysname, a system that accelerates vision-guided physical property reasoning to enable augmented visual cognition. \sysname minimizes the run-time latency of this reasoning pipeline through a combination of both algorithmic and systematic optimizations, including rapid geometric 3D reconstruction, efficient semantic feature fusion, and parallel view encoding. Through these simple yet effective optimizations, \sysname reduces the end-to-end latency of this reasoning pipeline from 10–20 minutes to less than 6 seconds. A head-to-head comparison on the ABO dataset shows that \sysname achieves this 62.9x–287.2x speedup while not only reaching on-par (and sometimes slightly better) object-level physical property estimation accuracy (e.g., mass), but also demonstrating superior performance in material segmentation and voxel-level inference over two SOTA baselines. We further combine gaze-tracking with \sysname to localize the object of interest in cluttered, real-world environments, streamlining the physical property reasoning on smart glasses. The case study with Meta Aria Glasses conducted at an IKEA furniture store demonstrates that \sysname achieves consistently high performance compared to controlled captures, providing robust property estimations even with fewer views in real-world scenarios.

[92] FusionRF: High-Fidelity Satellite Neural Radiance Fields from Multispectral and Panchromatic Acquisitions

Michael Sprintson, Rama Chellappa, Cheng Peng

Main category: cs.CV

TL;DR: FusionRF is a novel framework for digital surface reconstruction from satellite images that jointly performs image fusion and reconstruction without requiring pansharpening preprocessing, achieving 17% reduction in depth reconstruction error.

Motivation: Current neural reconstruction methods require multispectral images to be upsampled using pansharpening methods, which can introduce biases and hallucinations due to domain gaps between panchromatic (high spatial resolution) and multispectral (high spectral resolution) images.

Method: FusionRF introduces joint image fusion during optimization through a novel cross-resolution kernel that learns to resolve spatial resolution loss in multispectral images. It accepts original multispectral and panchromatic data directly, uses multimodal appearance embeddings to encode image characteristics, and optimizes on both modalities simultaneously.

Result: Evaluation on WorldView-3 satellite images shows FusionRF provides an average 17% reduction in depth reconstruction error compared to existing methods, and renders sharp training and novel views.

Conclusion: FusionRF eliminates the need for pansharpening preprocessing by learning to fuse image modalities during reconstruction, resulting in more accurate digital surface reconstruction from satellite imagery.

Abstract: We introduce FusionRF, a novel framework for digital surface reconstruction from satellite multispectral and panchromatic images. Current work has demonstrated the increased accuracy of neural photogrammetry for surface reconstruction from optical satellite images compared to algorithmic methods. Common satellites produce both a panchromatic and multispectral image, which contain high spatial and spectral information respectively. Current neural reconstruction methods require multispectral images to be upsampled with a pansharpening method using the spatial data in the panchromatic image. However, these methods may introduce biases and hallucinations due to domain gaps. FusionRF introduces joint image fusion during optimization through a novel cross-resolution kernel that learns to resolve spatial resolution loss present in multispectral images. As input, FusionRF accepts the original multispectral and panchromatic data, eliminating the need for image preprocessing. FusionRF also leverages multimodal appearance embeddings that encode the image characteristics of each modality and view within a uniform representation. By optimizing on both modalities, FusionRF learns to fuse image modalities while performing reconstruction tasks and eliminates the need for a pansharpening preprocessing step. We evaluate our method on multispectral and panchromatic satellite images from the WorldView-3 satellite in various locations, and show that FusionRF provides an average of 17% reduction in depth reconstruction error, and renders sharp training and novel views.
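The cross-resolution kernel can be illustrated with a minimal sketch: a high-resolution rendering is blurred by a kernel and subsampled to the multispectral grid, so the reconstruction can be supervised by the original low-resolution bands. Everything here is an assumption for illustration (FusionRF learns its kernel; the fixed Gaussian and the 4x factor below are stand-ins):

```python
import numpy as np

def gaussian_kernel(size=9, sigma=2.0):
    """Fixed Gaussian blur kernel, a stand-in for a learned cross-resolution kernel."""
    ax = np.arange(size) - size // 2
    k1 = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    k2 = np.outer(k1, k1)
    return k2 / k2.sum()

def degrade(hi_res_img, kernel, factor=4):
    """Blur a high-resolution rendering, then subsample to multispectral resolution."""
    h, w = hi_res_img.shape
    pad = kernel.shape[0] // 2
    padded = np.pad(hi_res_img, pad, mode="reflect")
    out = np.zeros_like(hi_res_img)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + kernel.shape[0], j:j + kernel.shape[1]] * kernel)
    return out[::factor, ::factor]

img = np.random.rand(32, 32)          # toy high-resolution rendering
ms = degrade(img, gaussian_kernel())  # simulated multispectral observation
print(ms.shape)  # (8, 8)
```

In the actual method the kernel parameters would be optimized jointly with the radiance field, so the model learns what resolution loss to explain away rather than having it fixed in advance.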

[93] Deploying Rapid Damage Assessments from sUAS Imagery for Disaster Response

Thomas Manzini, Priyankari Perali, Robin R. Murphy

Main category: cs.CV

TL;DR: First operational AI/ML system for automated building damage assessment using sUAS imagery deployed during federal disasters, processing 415 buildings in 18 minutes to address data overload issues.

Motivation: Address the data avalanche problem where sUAS teams deliver 47GB-369GB of imagery per day during disasters, overwhelming transmission and human analysis capabilities and delaying response efforts.

Method: Developed models trained on largest known dataset of post-disaster sUAS imagery (21,716 building damage labels) and deployed best performing model operationally during Hurricanes Debby and Helene.

Result: Successfully deployed system assessed 415 buildings in approximately 18 minutes during real disaster responses, establishing first operational state of practice for sUAS-based damage assessment.

Conclusion: This work documents the first actual operational use of AI/ML for damage assessment during disasters and provides valuable lessons learned for both AI/ML research and user communities.

Abstract: This paper presents the first AI/ML system for automating building damage assessment in uncrewed aerial systems (sUAS) imagery to be deployed operationally during federally declared disasters (Hurricanes Debby and Helene). In response to major disasters, sUAS teams are dispatched to collect imagery of the affected areas to assess damage; however, at recent disasters, teams collectively delivered between 47GB and 369GB of imagery per day, representing more imagery than can reasonably be transmitted or interpreted by subject matter experts in the disaster scene, thus delaying response efforts. To alleviate this data avalanche encountered in practice, computer vision and machine learning techniques are necessary. While prior work has been deployed to automatically assess damage in satellite imagery, there is no current state of practice for sUAS-based damage assessment systems, as all known work has been confined to academic settings. This work establishes the state of practice via the development and deployment of models for building damage assessment with sUAS imagery. The model development involved training on the largest known dataset of post-disaster sUAS aerial imagery, containing 21,716 building damage labels, and the operational training of 91 disaster practitioners. The best performing model was deployed during the responses to Hurricanes Debby and Helene, where it assessed a combined 415 buildings in approximately 18 minutes. This work contributes documentation of the actual use of AI/ML for damage assessment during a disaster and lessons learned to the benefit of the AI/ML research and user communities.

[94] Finetuning-Free Personalization of Text to Image Generation via Hypernetworks

Sagar Shrestha, Gopal Sharma, Luowei Zhou, Suren Kumar

Main category: cs.CV

TL;DR: A fine-tuning-free personalization method for text-to-image diffusion models using hypernetworks that predict LoRA-adapted weights directly from subject images, eliminating per-subject optimization at inference.

Motivation: Traditional subject-specific fine-tuning approaches like DreamBooth are computationally expensive and slow at inference. Recent adapter/encoder methods still require additional fine-tuning or large backbone models.

Method: Uses hypernetworks to predict LoRA-adapted weights from subject images with end-to-end training objective stabilized by output regularization. Introduces Hybrid-Model Classifier-Free Guidance (HM-CFG) for better compositional generalization.

Result: Achieves strong personalization performance on CelebA-HQ, AFHQ-v2, and DreamBench benchmarks while preserving subject fidelity and prompt alignment without per-subject optimization.

Conclusion: Demonstrates hypernetworks as a scalable and effective direction for open-category personalization in text-to-image diffusion models.

Abstract: Personalizing text-to-image diffusion models has traditionally relied on subject-specific fine-tuning approaches such as DreamBooth~\cite{ruiz2023dreambooth}, which are computationally expensive and slow at inference. Recent adapter- and encoder-based methods attempt to reduce this overhead but still depend on additional fine-tuning or large backbone models for satisfactory results. In this work, we revisit an orthogonal direction: fine-tuning-free personalization via Hypernetworks that predict LoRA-adapted weights directly from subject images. Prior hypernetwork-based approaches, however, suffer from costly data generation or unstable attempts to mimic base model optimization trajectories. We address these limitations with an end-to-end training objective, stabilized by a simple output regularization, yielding reliable and effective hypernetworks. Our method removes the need for per-subject optimization at test time while preserving both subject fidelity and prompt alignment. To further enhance compositional generalization at inference time, we introduce Hybrid-Model Classifier-Free Guidance (HM-CFG), which combines the compositional strengths of the base diffusion model with the subject fidelity of personalized models during sampling. Extensive experiments on CelebA-HQ, AFHQ-v2, and DreamBench demonstrate that our approach achieves strong personalization performance and highlights the promise of hypernetworks as a scalable and effective direction for open-category personalization.
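The abstract does not give the HM-CFG formula; one plausible reading, sketched below, blends the conditional noise predictions of the base and personalized models before applying standard classifier-free guidance against the base model's unconditional prediction. The function name, the `mix` parameter, and this particular combination are all hypothetical:

```python
import numpy as np

def hybrid_cfg(eps_base_uncond, eps_base_cond, eps_pers_cond, w=7.5, mix=0.5):
    """Hypothetical hybrid CFG: interpolate the base model's conditional
    prediction (compositional strength) with the personalized model's
    conditional prediction (subject fidelity), then guide as usual."""
    eps_cond = mix * eps_base_cond + (1.0 - mix) * eps_pers_cond
    return eps_base_uncond + w * (eps_cond - eps_base_uncond)

# toy noise predictions
u = np.zeros(4)
b = np.ones(4)
p = 2 * np.ones(4)
print(hybrid_cfg(u, b, p, w=1.0, mix=0.5))  # [1.5 1.5 1.5 1.5]
```

The intent, per the abstract, is that sampling can draw compositional behavior from the base diffusion model and subject fidelity from the hypernetwork-personalized one; the exact weighting scheme would need to be taken from the paper.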

[95] Subsampled Randomized Fourier GaLore for Adapting Foundation Models in Depth-Driven Liver Landmark Segmentation

Yun-Chen Lin, Jiayuan Huang, Hanyuan Zhang, Sergi Kavtaradze, Matthew J. Clarkson, Mobarak I. Hoque

Main category: cs.CV

TL;DR: Proposes a depth-guided liver landmark segmentation framework using SAM2 and Depth Anything V2 encoders with SRFT-GaLore for efficient fine-tuning, achieving improved segmentation accuracy and cross-dataset generalization.

Motivation: Accurate anatomical structure detection in laparoscopic liver surgery is challenging due to limited depth perception in 2D video streams, requiring better fusion of RGB and depth features and efficient adaptation of large vision models.

Method: Integrates SAM2 encoder for RGB features and Depth Anything V2 encoder for depth features, using SRFT-GaLore (Subsampled Randomized Fourier Transform) for efficient fine-tuning and cross-attention fusion for RGB-depth integration.

Result: Achieves a 4.85% improvement in Dice Similarity Coefficient and an 11.78-point reduction in Average Symmetric Surface Distance on the L3D dataset, with strong cross-dataset performance on the new LLSD dataset.

Conclusion: The SRFT-GaLore-enhanced dual-encoder framework enables scalable and precise segmentation under real-time surgical settings with strong cross-dataset robustness.

Abstract: Accurate detection and delineation of anatomical structures in medical imaging are critical for computer-assisted interventions, particularly in laparoscopic liver surgery where 2D video streams limit depth perception and complicate landmark localization. While recent works have leveraged monocular depth cues for enhanced landmark detection, challenges remain in fusing RGB and depth features and in efficiently adapting large-scale vision models to surgical domains. We propose a depth-guided liver landmark segmentation framework integrating semantic and geometric cues via vision foundation encoders. We employ the Segment Anything Model V2 (SAM2) encoder to extract RGB features and the Depth Anything V2 (DA2) encoder to extract depth-aware features. To efficiently adapt SAM2, we introduce SRFT-GaLore, a novel low-rank gradient projection method that replaces the computationally expensive SVD with a Subsampled Randomized Fourier Transform (SRFT). This enables efficient fine-tuning of high-dimensional attention layers without sacrificing representational power. A cross-attention fusion module further integrates RGB and depth cues. To assess cross-dataset generalization, we also construct a new Laparoscopic Liver Surgical Dataset (LLSD) as an external validation benchmark. On the public L3D dataset, our method achieves a 4.85% improvement in Dice Similarity Coefficient and an 11.78-point reduction in Average Symmetric Surface Distance compared to D2GPLand. To further assess generalization capability, we evaluate our model on the LLSD dataset. Our model maintains competitive performance and significantly outperforms SAM-based baselines, demonstrating strong cross-dataset robustness and adaptability to unseen surgical environments. These results demonstrate that our SRFT-GaLore-enhanced dual-encoder framework enables scalable and precise segmentation under real-time, depth-constrained surgical settings.
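The core move in SRFT-GaLore is swapping GaLore's SVD-based projection for an SRFT, which is cheap to apply and needs no decomposition of the gradient. A minimal numpy sketch of an SRFT projection of a gradient matrix (matrix-form DFT for clarity; real pipelines would use FFTs directly, and the rank and shapes here are made up):

```python
import numpy as np

def srft_matrix(n, l, rng):
    """Subsampled Randomized Fourier Transform: P = sqrt(n/l) * S F D,
    where D applies random sign flips, F is the unitary DFT, and
    S keeps l randomly chosen rows."""
    d = rng.choice([-1.0, 1.0], size=n)            # random signs (D)
    F = np.fft.fft(np.eye(n)) / np.sqrt(n)         # unitary DFT matrix (F)
    rows = rng.choice(n, size=l, replace=False)    # row subsampling (S)
    return np.sqrt(n / l) * F[rows] * d            # (l, n) projection

rng = np.random.default_rng(0)
G = rng.standard_normal((256, 64))    # gradient of a weight matrix
P = srft_matrix(256, 32, rng)
G_low = P @ G                         # (32, 64) low-rank projected gradient
print(G_low.shape)  # (32, 64)
```

Note the projection here is complex-valued; a drop-in replacement for GaLore's real-valued projector would use a real transform variant (e.g., a DCT or the stacked real/imaginary parts), which is a detail this sketch leaves out.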

[96] SurgAnt-ViVQA: Learning to Anticipate Surgical Events through GRU-Driven Temporal Cross-Attention

Shreyas C. Dhake, Jiayuan Huang, Runlong He, Danyal Z. Khan, Evangelos B. Mazomenos, Sophia Bano, Hani J. Marcus, Danail Stoyanov, Matthew J. Clarkson, Mobarak I. Hoque

Main category: cs.CV

TL;DR: Introduces PitVQA-Anticipation, the first VQA dataset for surgical anticipation, and SurgAnt-ViVQA, a video-language model with GRU-based temporal encoding and gated cross-attention that outperforms baselines on surgical forecasting tasks.

Motivation: Current surgical VQA systems focus on static scene description rather than anticipating future events, which is crucial for real-time assistance in endonasal transsphenoidal pituitary surgery where visibility is limited and workflow changes rapidly.

Method: Proposed SurgAnt-ViVQA model adapts a large language model using GRU Gated Temporal Cross-Attention module, with bidirectional GRU for frame dynamics and adaptive gate for visual context injection. Uses parameter efficient fine-tuning for surgical domain customization.

Result: SurgAnt-ViVQA surpasses strong image and video baselines on PitVQA-Anticipation and EndoVis datasets. Ablations show temporal recurrence and gated fusion drive most gains. 8 frames maximize fluency while 32 frames improve numeric time estimation.

Conclusion: The approach advances surgical VQA from retrospective description to proactive anticipation, highlighting the importance of targeted temporal modeling for reliable, future-aware surgical assistance.

Abstract: Anticipating forthcoming surgical events is vital for real-time assistance in endonasal transsphenoidal pituitary surgery, where visibility is limited and workflow changes rapidly. Most visual question answering (VQA) systems reason on isolated frames with static vision-language alignment, providing little support for forecasting next steps or instrument needs. Existing surgical VQA datasets likewise center on the current scene rather than the near future. We introduce PitVQA-Anticipation, the first VQA dataset designed for forward-looking surgical reasoning. It comprises 33.5 hours of operative video and 734,769 question-answer pairs built from temporally grouped clips and expert annotations across four tasks: predicting the future phase, next step, upcoming instrument, and remaining duration. We further propose SurgAnt-ViVQA, a video-language model that adapts a large language model using a GRU Gated Temporal Cross-Attention module. A bidirectional GRU encodes frame-to-frame dynamics, while an adaptive gate injects visual context into the language stream at the token level. Parameter-efficient fine-tuning customizes the language backbone to the surgical domain. SurgAnt-ViVQA, tested on the PitVQA-Anticipation and EndoVis datasets, surpasses strong image- and video-based baselines. Ablations show that temporal recurrence and gated fusion drive most of the gains. A frame-budget study indicates a trade-off: 8 frames maximize fluency, whereas 32 frames slightly reduce BLEU but improve numeric time estimation. By pairing a temporally aware encoder with fine-grained gated cross-attention, SurgAnt-ViVQA advances surgical VQA from retrospective description to proactive anticipation. PitVQA-Anticipation offers a comprehensive benchmark for this setting and highlights the importance of targeted temporal modeling for reliable, future-aware surgical assistance.
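The abstract describes the gated module only at a high level; a single-head numpy sketch of the "adaptive gate injects visual context at the token level" idea follows. Shapes, the sigmoid gate parameterization, and the residual form are all assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(tokens, visual, Wq, Wk, Wv, w_gate):
    """Language tokens attend to (notionally GRU-encoded) frame features;
    a per-token sigmoid gate decides how much visual context is injected."""
    q = tokens @ Wq                                   # (T, d) queries from text
    k = visual @ Wk                                   # (F, d) keys from frames
    v = visual @ Wv                                   # (F, d) values from frames
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (T, F) attention weights
    ctx = attn @ v                                    # (T, d) attended context
    gate = 1.0 / (1.0 + np.exp(-(tokens @ w_gate)))   # (T, 1) per-token gate
    return tokens + gate * ctx

rng = np.random.default_rng(0)
d = 8
toks = rng.standard_normal((5, d))   # 5 language tokens
vis = rng.standard_normal((3, d))    # 3 frame features from a temporal encoder
out = gated_cross_attention(
    toks, vis,
    *[rng.standard_normal((d, d)) for _ in range(3)],
    rng.standard_normal((d, 1)),
)
print(out.shape)  # (5, 8)
```

The bidirectional GRU that produces the frame features, and the multi-head/normalization details of the real module, are omitted here.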

[97] PETWB-REP: A Multi-Cancer Whole-Body FDG PET/CT and Radiology Report Dataset for Medical Imaging Research

Le Xue, Gang Feng, Wenbo Zhang, Yichi Zhang, Lanlan Li, Shuqi Wang, Liling Peng, Sisi Peng, Xin Gao

Main category: cs.CV

TL;DR: PETWB-REP is a curated dataset of 490 patients with whole-body FDG PET/CT scans and corresponding radiology reports across multiple cancer types, designed to support medical AI and imaging research.

Motivation: There is a scarcity of large-scale medical imaging datasets that combine functional and anatomical imaging with detailed clinical reports across multiple cancer types, which are crucial for developing AI models and conducting clinical research.

Method: The authors compiled a dataset comprising whole-body 18F-FDG PET/CT scans and corresponding radiology reports from 490 patients diagnosed with various malignancies, including lung, liver, breast, prostate, and ovarian cancers.

Result: The PETWB-REP dataset includes paired PET and CT images, de-identified textual reports, and structured clinical metadata from 490 patients across multiple cancer types.

Conclusion: This dataset is designed to support research in medical imaging, radiomics, artificial intelligence, and multi-modal learning by providing a comprehensive resource that combines functional and anatomical imaging with clinical data.

Abstract: Publicly available, large-scale medical imaging datasets are crucial for developing and validating artificial intelligence models and conducting retrospective clinical research. However, datasets that combine functional and anatomical imaging with detailed clinical reports across multiple cancer types remain scarce. Here, we present PETWB-REP, a curated dataset comprising whole-body 18F-Fluorodeoxyglucose (FDG) Positron Emission Tomography/Computed Tomography (PET/CT) scans and corresponding radiology reports from 490 patients diagnosed with various malignancies. The dataset primarily includes common cancers such as lung cancer, liver cancer, breast cancer, prostate cancer, and ovarian cancer. This dataset includes paired PET and CT images, de-identified textual reports, and structured clinical metadata. It is designed to support research in medical imaging, radiomics, artificial intelligence, and multi-modal learning.

[98] QG-CoC: Question-Guided Chain-of-Captions for Large Multimodal Models

Kuei-Chun Kao, Hsu Tzu-Yin, Yunqi Hong, Ruochen Wang, Cho-Jui Hsieh

Main category: cs.CV

TL;DR: Proposes QG-CoC, a zero-shot prompting method that improves multimodal LLMs’ ability to handle multi-image reasoning tasks by addressing fine-grained perception gaps and reasoning integration issues.

Motivation: Current MLLMs struggle with fine-grained perception across multiple images and lack effective reasoning capabilities in multi-image contexts, creating a gap for handling complex visual reasoning tasks.

Method: Question-Guided Chain-of-Captions (QG-CoC) - a generalized zero-shot prompting approach that can handle arbitrary numbers of images by guiding the model through structured caption generation and reasoning.

Result: QG-CoC shows competitive performance across various multi-image and single-image benchmarks, with robust improvements in challenging scenarios where existing methods fail.

Conclusion: The proposed QG-CoC method effectively addresses multi-image reasoning limitations in MLLMs and demonstrates strong generalization capabilities across different model types and task complexities.

Abstract: Recently, Multimodal Large Language Models (MLLMs) encounter two key issues in multi-image contexts: (1) a lack of fine-grained perception across disparate images, and (2) a diminished capability to effectively reason over and synthesize information from multiple visual inputs. However, while various prompting methods aim to describe visual content, many existing studies focus primarily on single-image settings or specific, constrained scenarios. This leaves a critical gap in understanding and addressing how MLLMs tackle more general and complex multi-image reasoning tasks. Thus, we first extensively investigate how current prompting methods perceive fine-grained visual details and process visual information when dealing with multiple images. Our findings reveal that existing prompting methods fall short in attending to needed clues and seamlessly integrating perception and reasoning. Inspired by the findings, we propose a new zero-shot prompting method, Question-Guided Chain-of-Captions (QG-CoC), a generalized prompting approach that effectively handles problems with an arbitrary number of images. We evaluate our method on various open-source and closed-source MLLMs for multi-image and single-image benchmarks. Experimental results indicate that QG-CoC demonstrates competitive performance across tasks and exhibits robust improvements in the challenging scenarios where existing prompting methods fail.
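The paper's exact prompt template is not given in the summary; a toy illustration of how a question-guided chain of captions might be assembled for an arbitrary number of images (all wording here is invented for illustration):

```python
def qg_coc_prompt(question, num_images):
    """Minimal sketch of QG-CoC-style prompting: the question guides
    per-image captioning before a final reasoning/answer step."""
    lines = [f"Question: {question}", ""]
    for i in range(1, num_images + 1):
        lines.append(
            f"Step {i}: Describe the details in Image {i} that are "
            f"relevant to answering the question."
        )
    lines.append(
        f"Step {num_images + 1}: Combine the captions above and answer the question."
    )
    return "\n".join(lines)

print(qg_coc_prompt("Which image shows the taller building?", 2))
```

Because the captioning steps are conditioned on the question, the model is pushed to surface the fine-grained, per-image clues the paper finds generic prompting misses, before integrating them in the last step.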

[99] MvBody: Multi-View-Based Hybrid Transformer Using Optical 3D Body Scan for Explainable Cesarean Section Prediction

Ruting Cheng, Boyuan Feng, Yijiang Zheng, Chuhui Qiu, Aizierjiang Aiersilan, Joaquin A. Calderon, Wentao Zhao, Qing Pan, James K. Hahn

Main category: cs.CV

TL;DR: The paper proposes MvBody, a multi-view Transformer network that uses 3D body shape scans and self-reported medical data to predict cesarean section risk, achieving 84.62% accuracy with improved transparency via explainable AI.

Motivation: Current CS risk prediction models are designed for hospital settings and rely on parameters unavailable in resource-limited environments, creating a need for affordable, accessible alternatives using simpler data sources.

Method: Developed MvBody, a multi-view Transformer network that combines self-reported medical data with 3D optical body scans from 31-38 weeks gestation, incorporating metric learning loss for better training efficiency in data-scarce settings.

Result: Achieved 84.62% accuracy and 0.724 AUC-ROC on independent test set, outperforming existing machine learning models and advanced 3D analysis methods.

Conclusion: 3D body shape combined with basic medical data can effectively predict CS risk, with key factors including pre-pregnancy weight, maternal age, obstetric history, previous CS, and body shape around head/shoulders.

Abstract: Accurately assessing the risk of cesarean section (CS) delivery is critical, especially in settings with limited medical resources, where access to healthcare is often restricted. Early and reliable risk prediction allows better-informed prenatal care decisions and can improve maternal and neonatal outcomes. However, most existing predictive models are tailored for in-hospital use during labor and rely on parameters that are often unavailable in resource-limited or home-based settings. In this study, we conduct a pilot investigation to examine the feasibility of using 3D body shape for CS risk assessment for future applications with more affordable general devices. We propose a novel multi-view-based Transformer network, MvBody, which predicts CS risk using only self-reported medical data and 3D optical body scans obtained between the 31st and 38th weeks of gestation. To enhance training efficiency and model generalizability in data-scarce environments, we incorporate a metric learning loss into the network. Compared to widely used machine learning models and the latest advanced 3D analysis methods, our method demonstrates superior performance, achieving an accuracy of 84.62% and an Area Under the Receiver Operating Characteristic Curve (AUC-ROC) of 0.724 on the independent test set. To improve transparency and trust in the model’s predictions, we apply the Integrated Gradients algorithm to provide theoretically grounded explanations of the model’s decision-making process. Our results indicate that pre-pregnancy weight, maternal age, obstetric history, previous CS history, and body shape, particularly around the head and shoulders, are key contributors to CS risk prediction.

[100] Diffusion-Guided Mask-Consistent Paired Mixing for Endoscopic Image Segmentation

Pengyu Jie, Wanquan Liu, Rui He, Yihui Wen, Deyu Meng, Chenqiang Gao

Main category: cs.CV

TL;DR: A new data augmentation method combining diffusion synthesis and mask-consistent mixing that improves endoscopic segmentation performance across multiple datasets.

Motivation: Existing augmentation methods have limitations: sample mixing causes soft label ambiguity with misaligned masks, while diffusion synthesis creates domain shift and overlooks structural benefits of mask conditioning.

Method: Proposes paired diffusion-guided paradigm with Mask-Consistent Paired Mixing (MCPMix) that mixes only image appearance while keeping original hard masks, and Real-Anchored Learnable Annealing (RLA) that adaptively adjusts mixing strength and loss weights.

Result: Achieves state-of-the-art segmentation performance across Kvasir-SEG, PICCOLO, CVC-ClinicDB, private NPC-LES cohort, and ISIC 2017 datasets with consistent gains over baselines.

Conclusion: Combining label-preserving mixing with diffusion-driven diversity and adaptive re-anchoring yields robust and generalizable endoscopic segmentation.

Abstract: Augmentation for dense prediction typically relies on either sample mixing or generative synthesis. Mixing improves robustness but misaligned masks yield soft label ambiguity. Diffusion synthesis increases apparent diversity but, when trained as common samples, overlooks the structural benefit of mask conditioning and introduces synthetic-real domain shift. We propose a paired, diffusion-guided paradigm that fuses the strengths of both. For each real image, a synthetic counterpart is generated under the same mask and the pair is used as a controllable input for Mask-Consistent Paired Mixing (MCPMix), which mixes only image appearance while supervision always uses the original hard mask. This produces a continuous family of intermediate samples that smoothly bridges synthetic and real appearances under shared geometry, enlarging diversity without compromising pixel-level semantics. To keep learning aligned with real data, Real-Anchored Learnable Annealing (RLA) adaptively adjusts the mixing strength and the loss weight of mixed samples over training, gradually re-anchoring optimization to real data and mitigating distributional bias. Across Kvasir-SEG, PICCOLO, CVC-ClinicDB, a private NPC-LES cohort, and ISIC 2017, the approach achieves state-of-the-art segmentation performance and consistent gains over baselines. The results show that combining label-preserving mixing with diffusion-driven diversity, together with adaptive re-anchoring, yields robust and generalizable endoscopic segmentation.
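The mixing rule itself is simple to sketch: appearance is interpolated between the real image and its mask-matched synthetic counterpart, while supervision always uses the original hard mask. RLA's annealing is shown as a linear schedule here, which is an assumption (the paper makes the strength learnable and adaptive):

```python
import numpy as np

def mcpmix(real_img, synth_img, lam):
    """Mix only image appearance; the hard mask is untouched and remains
    the supervision target for every mixed sample."""
    return lam * real_img + (1.0 - lam) * synth_img

def rla_lambda(step, total_steps, lam_min=0.5):
    """Hypothetical linear stand-in for Real-Anchored Learnable Annealing:
    the mix drifts toward the real image as training progresses."""
    return lam_min + (1.0 - lam_min) * (step / total_steps)

real = np.ones((4, 4))                     # toy real image
synth = np.zeros((4, 4))                   # toy synthetic counterpart (same mask)
mask = np.ones((4, 4), dtype=np.int64)     # shared hard mask, never mixed
mixed = mcpmix(real, synth, rla_lambda(step=500, total_steps=1000))
print(mixed[0, 0], mask[0, 0])  # 0.75 1
```

Because both images share the same geometry, every intermediate appearance remains pixel-aligned with the hard mask, which is the property that avoids the soft-label ambiguity of ordinary sample mixing.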

[101] Transformer-Progressive Mamba Network for Lightweight Image Super-Resolution

Sichen Guo, Wenjie Li, Yuanyang Liu, Guangwei Gao, Jian Yang, Chia-Wen Lin

Main category: cs.CV

TL;DR: T-PMambaSR is a lightweight super-resolution framework that combines window-based self-attention with Progressive Mamba to enable fine-grained multi-scale feature modeling with linear complexity, plus an adaptive high-frequency refinement module.

Motivation: Existing Mamba-based super-resolution methods lack fine-grained transitions across different modeling scales, limiting feature representation efficiency, despite their ability to capture global receptive fields with linear complexity.

Method: Integrates window-based self-attention with Progressive Mamba to enable interactions among receptive fields of different scales, and introduces an Adaptive High-Frequency Refinement Module (AHFRM) to recover lost high-frequency details.

Result: Extensive experiments show T-PMambaSR progressively enhances receptive field and expressiveness, achieving better performance than recent Transformer- or Mamba-based methods with lower computational cost.

Conclusion: The proposed framework establishes a fine-grained modeling paradigm that progressively enhances feature representation with linear complexity, outperforming existing approaches while being more computationally efficient.

Abstract: Recently, Mamba-based super-resolution (SR) methods have demonstrated the ability to capture global receptive fields with linear complexity, addressing the quadratic computational cost of Transformer-based SR approaches. However, existing Mamba-based methods lack fine-grained transitions across different modeling scales, which limits the efficiency of feature representation. In this paper, we propose T-PMambaSR, a lightweight SR framework that integrates window-based self-attention with Progressive Mamba. By enabling interactions among receptive fields of different scales, our method establishes a fine-grained modeling paradigm that progressively enhances feature representation with linear complexity. Furthermore, we introduce an Adaptive High-Frequency Refinement Module (AHFRM) to recover high-frequency details lost during Transformer and Mamba processing. Extensive experiments demonstrate that T-PMambaSR progressively enhances the model’s receptive field and expressiveness, yielding better performance than recent Transformer- or Mamba-based methods while incurring lower computational cost. Our codes will be released after acceptance.

[102] Decoupled Multi-Predictor Optimization for Inference-Efficient Model Tuning

Liwei Luo, Shuaitengyuan Li, Dongwei Ren, Qilong Wang, Pengfei Zhu, Qinghua Hu

Main category: cs.CV

TL;DR: DMPO method decouples representative and discriminative abilities in early stages through architecture design and optimization strategy to improve inference efficiency in large pre-trained models.

Motivation: Address the challenge of early stages providing both low-level fundamental features to deep stages and high-level discriminative features to early-stage predictors simultaneously for efficient inference.

Method: Introduces lightweight bypass module for functional decomposition, high-order statistics-based predictor for early stages, and decoupled optimization with two-phase loss weights during training.

Result: Experiments across various datasets and pre-trained backbones show DMPO clearly outperforms counterparts when reducing computational cost.

Conclusion: DMPO effectively decouples representative and discriminative abilities in early stages through combined architectural and optimization approaches, achieving superior inference efficiency.

Abstract: Recently, remarkable progress has been made in large-scale pre-trained model tuning, and inference efficiency is becoming more crucial for practical deployment. Early exiting in conjunction with multi-stage predictors, when cooperated with a parameter-efficient fine-tuning strategy, offers a straightforward way to achieve an inference-efficient model. However, a key challenge remains unresolved: How can early stages provide low-level fundamental features to deep stages while simultaneously supplying high-level discriminative features to early-stage predictors? To address this problem, we propose a Decoupled Multi-Predictor Optimization (DMPO) method to effectively decouple the low-level representative ability and high-level discriminative ability in early stages. First, in terms of architecture, we introduce a lightweight bypass module into multi-stage predictors for functional decomposition of shallow features from early stages, while a high-order statistics-based predictor is developed for early stages to effectively enhance their discriminative ability. To reasonably train our multi-predictor architecture, a decoupled optimization is proposed to allocate two-phase loss weights for multi-stage predictors during model tuning, where the initial training phase enables the model to prioritize the acquisition of discriminative ability of deep stages via emphasizing representative ability of early stages, and the latter training phase drives discriminative ability towards earlier stages as much as possible. As such, our DMPO can effectively decouple representative and discriminative abilities in early stages in terms of architecture design and model optimization. Experiments across various datasets and pre-trained backbones demonstrate that DMPO clearly outperforms its counterparts when reducing computational cost.
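The two-phase weight allocation is described only qualitatively; a hypothetical linear scheme illustrating the stated intent (phase 1 emphasizes deep-stage predictors while early stages stay representative, phase 2 shifts weight toward earlier stages). The exact allocation in DMPO is not given in the summary:

```python
def dmpo_loss_weights(stage, num_stages, phase):
    """Hypothetical two-phase loss weight for the predictor at `stage`
    (0-indexed). Phase 1: deeper stages weighted more; phase 2: weight
    migrates toward earlier stages."""
    if phase == 1:
        return (stage + 1) / num_stages          # 0.25 .. 1.0 for 4 stages
    return (num_stages - stage) / num_stages     # 1.0 .. 0.25 for 4 stages

print([round(dmpo_loss_weights(s, 4, 1), 2) for s in range(4)])  # [0.25, 0.5, 0.75, 1.0]
print([round(dmpo_loss_weights(s, 4, 2), 2) for s in range(4)])  # [1.0, 0.75, 0.5, 0.25]
```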

[103] Generative deep learning for foundational video translation in ultrasound

Nikolina Tomic, Roshni Bhatnagar, Sarthak Jain, Connor Lau, Tien-Yu Liu, Laura Gambini, Rima Arnaout

Main category: cs.CV

TL;DR: A generative method for translating between color flow doppler and greyscale ultrasound videos that achieves realistic synthetic videos indistinguishable from real ones in clinical evaluation and downstream tasks.

Motivation: Address data imbalance in ultrasound studies where different sub-modalities like greyscale and color flow doppler are often imbalanced, and expand utility of retrospectively collected medical imaging data.

Method: Used pixel-wise, adversarial, and perceptual losses with two networks: one for reconstructing anatomic structures and one for denoising to achieve realistic ultrasound imaging, trained on 54,975 videos.

Result: Synthetic videos achieved SSIM of 0.91±0.04, performed indistinguishably from real videos in classification (F1: 0.9 real vs 0.89 synthetic) and segmentation (Dice: 0.97), with clinicians only 54±6% accurate at distinguishing real vs synthetic.

Conclusion: The method successfully generates realistic synthetic ultrasound videos that expand utility of retrospective imaging data and augment dataset design for medical imaging, demonstrating foundational cross-domain capabilities.

Abstract: Deep learning (DL) has the potential to revolutionize image acquisition and interpretation across medicine; however, attention to data imbalance and missingness is required. Ultrasound data presents a particular challenge because in addition to different views and structures, it includes several sub-modalities, such as greyscale and color flow doppler (CFD), that are often imbalanced in clinical studies. Image translation can help balance datasets but has to date been challenging for ultrasound sub-modalities. Here, we present a generative method for ultrasound CFD-greyscale video translation, trained on 54,975 videos and tested on 8,368. The method leveraged pixel-wise, adversarial, and perceptual losses and utilized two networks: one for reconstructing anatomic structures and one for denoising to achieve realistic ultrasound imaging. Average pairwise SSIM between synthetic videos and ground truth was 0.91+/-0.04. Synthetic videos performed indistinguishably from real ones in DL classification and segmentation tasks and when evaluated by blinded clinical experts: F1 score was 0.9 for real and 0.89 for synthetic videos; Dice score between real and synthetic segmentation was 0.97. Overall clinician accuracy in distinguishing real vs synthetic videos was 54+/-6% (42-61%), indicating realistic synthetic videos. Although trained only on heart videos, the model worked well on ultrasound spanning several clinical domains (average SSIM 0.91+/-0.05), demonstrating foundational abilities. Together, these data expand the utility of retrospectively collected imaging and augment the dataset design toolbox for medical imaging.

[104] Enhancing Medical Image Segmentation via Heat Conduction Equation

Rong Wu, Yim-Sang Yu

Main category: cs.CV

TL;DR: Proposes U-Mamba with Heat Conduction Equation, a hybrid architecture combining Mamba-based state-space modules for long-range reasoning with Heat Conduction Operators in bottleneck layers for enhanced semantic abstraction in medical image segmentation.

Motivation: Existing deep learning models for medical image segmentation struggle to simultaneously achieve efficient global context modeling and long-range dependency reasoning under practical computational budgets.

Method: Hybrid architecture using U-Mamba with Heat Conduction Equation, combining Mamba-based state-space modules for efficient long-range reasoning with Heat Conduction Operators (HCOs) in bottleneck layers that simulate frequency-domain thermal diffusion.

Result: Experimental results on multimodal abdominal CT and MRI datasets demonstrate consistent outperformance over strong baselines, validating effectiveness and generalizability.

Conclusion: Blending state-space dynamics with heat-based global diffusion offers a scalable and interpretable solution for medical segmentation tasks.

Abstract: Medical image segmentation has been significantly advanced by deep learning architectures, notably U-Net variants. However, existing models struggle to simultaneously achieve efficient global context modeling and long-range dependency reasoning under practical computational budgets. In this work, we propose a novel hybrid architecture utilizing U-Mamba with the Heat Conduction Equation. Our model combines Mamba-based state-space modules for efficient long-range reasoning with Heat Conduction Operators (HCOs) in the bottleneck layers, simulating frequency-domain thermal diffusion for enhanced semantic abstraction. Experimental results on multimodal abdominal CT and MRI datasets demonstrate that the proposed model consistently outperforms strong baselines, validating its effectiveness and generalizability. This suggests that blending state-space dynamics with heat-based global diffusion offers a scalable and interpretable solution for medical segmentation tasks.
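
The frequency-domain thermal diffusion that the HCOs simulate can be illustrated with a minimal sketch. This shows only the underlying physics under assumed parameters (a plain heat-equation filter on a 2D map); the paper's actual HCO layer operates on learned feature channels and is not reproduced here.

```python
import numpy as np

def heat_conduction_op(feat, t=1.0):
    """Frequency-domain heat diffusion on a 2D map: the heat equation
    u_t = laplacian(u) damps each Fourier mode by exp(-|omega|^2 t),
    i.e. a global, smoothly decaying low-pass filter."""
    h, w = feat.shape
    fy = np.fft.fftfreq(h) * 2 * np.pi
    fx = np.fft.fftfreq(w) * 2 * np.pi
    k2 = fy[:, None] ** 2 + fx[None, :] ** 2   # squared spatial frequency
    return np.real(np.fft.ifft2(np.fft.fft2(feat) * np.exp(-k2 * t)))

# A point source diffuses outward while total "heat" is conserved.
x = np.zeros((8, 8))
x[4, 4] = 1.0
y = heat_conduction_op(x, t=1.0)
```

Because the DC mode is untouched, the operator preserves the map's mean while spreading local activations globally, which matches the "global diffusion" reading of the abstract.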

[105] IEC3D-AD: A 3D Dataset of Industrial Equipment Components for Unsupervised Point Cloud Anomaly Detection

Bingyang Guo, Hongjie Li, Ruiyun Yu, Hanzhe Liang, Jinbao Wang

Main category: cs.CV

TL;DR: The paper introduces IEC3D-AD, a new 3D anomaly detection dataset for industrial equipment components, and proposes GMANet, a novel 3D-AD method using geometric morphological analysis and spatial discrepancy optimization.

Motivation: Existing 3D datasets like Real3D-AD and MVTec 3D-AD fail to capture real industrial complexities and subtle defects, limiting precise anomaly detection for industrial equipment components such as bearings, rings, and bolts.

Method: Developed IEC3D-AD dataset from actual production lines with high point cloud resolution and defect annotation granularity. Proposed GMANet paradigm that generates synthetic point cloud samples via geometric morphological analysis and optimizes spatial discrepancy to reduce margin between normal and abnormal features.

Result: Extensive experiments show the method’s effectiveness on both IEC3D-AD and other datasets, demonstrating improved anomaly detection capabilities for industrial equipment components.

Conclusion: The IEC3D-AD dataset addresses limitations of existing 3D datasets for industrial applications, and the GMANet method provides an effective 3D anomaly detection approach through geometric analysis and spatial optimization.

Abstract: 3D anomaly detection (3D-AD) plays a critical role in industrial manufacturing, particularly in ensuring the reliability and safety of core equipment components. Although existing 3D datasets like Real3D-AD and MVTec 3D-AD offer broad application support, they fall short in capturing the complexities and subtle defects found in real industrial environments. This limitation hampers precise anomaly detection research, especially for industrial equipment components (IEC) such as bearings, rings, and bolts. To address this challenge, we have developed a point cloud anomaly detection dataset (IEC3D-AD) specific to real industrial scenarios. This dataset is directly collected from actual production lines, ensuring high fidelity and relevance. Compared to existing datasets, IEC3D-AD features significantly improved point cloud resolution and defect annotation granularity, facilitating more demanding anomaly detection tasks. Furthermore, inspired by generative 2D-AD methods, we introduce a novel 3D-AD paradigm (GMANet) on IEC3D-AD. This paradigm generates synthetic point cloud samples based on geometric morphological analysis, then reduces the margin and increases the overlap between normal and abnormal point-level features through spatial discrepancy optimization. Extensive experiments demonstrate the effectiveness of our method on both IEC3D-AD and other datasets.

[106] Unified Long Video Inpainting and Outpainting via Overlapping High-Order Co-Denoising

Shuangquan Lyu, Steven Mao, Yue Ma

Main category: cs.CV

TL;DR: A unified approach for long video inpainting and outpainting using LoRA fine-tuned diffusion models with overlap-and-blend temporal co-denoising, enabling arbitrarily long video generation without seams or drift.

Motivation: Addressing the fundamental challenge of generating long videos with high controllability in video inpainting and outpainting, which prior methods struggle with due to fixed-length clips and stitching artifacts.

Method: Extends text-to-video diffusion models using LoRA to efficiently fine-tune pre-trained models (like Wan 2.1) for masked region synthesis, combined with overlap-and-blend temporal co-denoising strategy with high-order solvers for consistency.

Result: Superior performance to baseline methods (Wan 2.1 and VACE) in quality metrics (PSNR/SSIM) and perceptual realism (LPIPS), enabling editing over hundreds of frames without noticeable seams or drift.

Conclusion: Achieves practical long-range video editing with minimal overhead, balancing parameter efficiency and superior performance for arbitrarily long video generation and editing.

Abstract: Generating long videos remains a fundamental challenge, and achieving high controllability in video inpainting and outpainting is particularly demanding. To address both of these challenges simultaneously and achieve controllable video inpainting and outpainting for long video clips, we introduce a novel and unified approach for long video inpainting and outpainting that extends text-to-video diffusion models to generate arbitrarily long, spatially edited videos with high fidelity. Our method leverages LoRA to efficiently fine-tune a large pre-trained video diffusion model like Alibaba’s Wan 2.1 for masked region video synthesis, and employs an overlap-and-blend temporal co-denoising strategy with high-order solvers to maintain consistency across long sequences. In contrast to prior work that struggles with fixed-length clips or exhibits stitching artifacts, our system enables arbitrarily long video generation and editing without noticeable seams or drift. We validate our approach on challenging inpainting/outpainting tasks including editing or adding objects over hundreds of frames and demonstrate superior performance to baseline methods like the Wan 2.1 model and VACE in terms of quality (PSNR/SSIM) and perceptual realism (LPIPS). Our method enables practical long-range video editing with minimal overhead, achieving a balance between parameter efficiency and superior performance.
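
The overlap-and-blend idea can be sketched generically. This is a 1D toy along the time axis with triangular weights, chosen here as an assumption; the paper's co-denoising operates on diffusion latents with high-order solvers, which this does not attempt to reproduce.

```python
import numpy as np

def blend_windows(chunks, starts, total_len):
    """Overlap-and-blend along the time axis: each window's frames are
    accumulated with triangular weights and normalised, so overlapping
    regions cross-fade instead of being hard-stitched."""
    out = np.zeros(total_len)
    weight = np.zeros(total_len)
    for chunk, s in zip(chunks, starts):
        n = len(chunk)
        # Triangular weights: low at window edges, high in the middle.
        w = np.minimum(np.arange(1, n + 1), np.arange(n, 0, -1)).astype(float)
        out[s:s + n] += w * chunk
        weight[s:s + n] += w
    return out / np.maximum(weight, 1e-8)

# Two constant windows of length 6 overlapping by 2 frames:
# the overlap transitions smoothly from 2.0 toward 4.0.
a = np.ones(6) * 2.0
b = np.ones(6) * 4.0
video = blend_windows([a, b], starts=[0, 4], total_len=10)
```

The down-weighting of window edges is what suppresses seams: a frame near a window boundary is dominated by the neighbouring window's interior prediction.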

[107] Diffusion-SDPO: Safeguarded Direct Preference Optimization for Diffusion Models

Minghao Fu, Guo-Hua Wang, Tianyu Cui, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang

Main category: cs.CV

TL;DR: The paper identifies a pathology in Diffusion-DPO where enlarging preference margins degrades both winner and loser outputs, and proposes Diffusion-SDPO with safeguarded updates to preserve winner quality while aligning with preferences.

Motivation: Text-to-image diffusion models struggle with human preference alignment, and standard DPO approaches can degrade both preferred and less-preferred outputs when increasing preference margins.

Method: Introduces Diffusion-SDPO with safeguarded updates that adaptively scale loser gradients based on alignment with winner gradients, using a closed-form scaling coefficient to guarantee non-increasing error for preferred outputs.

Result: Diffusion-SDPO achieves consistent gains over preference-learning baselines on automated preference, aesthetic, and prompt alignment metrics across standard text-to-image benchmarks.

Conclusion: The proposed method is simple, model-agnostic, compatible with existing DPO frameworks, and adds minimal computational overhead while effectively addressing the identified pathology.

Abstract: Text-to-image diffusion models deliver high-quality images, yet aligning them with human preferences remains challenging. We revisit diffusion-based Direct Preference Optimization (DPO) for these models and identify a critical pathology: enlarging the preference margin does not necessarily improve generation quality. In particular, the standard Diffusion-DPO objective can increase the reconstruction error of both winner and loser branches. Consequently, degradation of the less-preferred outputs can become sufficiently severe that the preferred branch is also adversely affected even as the margin grows. To address this, we introduce Diffusion-SDPO, a safeguarded update rule that preserves the winner by adaptively scaling the loser gradient according to its alignment with the winner gradient. A first-order analysis yields a closed-form scaling coefficient that guarantees the error of the preferred output is non-increasing at each optimization step. Our method is simple, model-agnostic, broadly compatible with existing DPO-style alignment frameworks and adds only marginal computational overhead. Across standard text-to-image benchmarks, Diffusion-SDPO delivers consistent gains over preference-learning baselines on automated preference, aesthetic, and prompt alignment metrics. Code is publicly available at https://github.com/AIDC-AI/Diffusion-SDPO.
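
The safeguarded update can be illustrated with a toy first-order version. The scaling rule below is an assumed form for illustration only: it caps the loser-gradient scale so the combined step keeps a non-negative inner product with the winner gradient. The paper derives its own closed-form coefficient, which may differ.

```python
import numpy as np

def safeguarded_update(g_w, g_l, s=1.0, eps=1e-8):
    """Toy safeguarded preference step: scale the loser gradient g_l so
    that the combined direction d = g_w + s * g_l satisfies d . g_w >= 0,
    i.e. a first-order step along -d does not increase the winner loss."""
    dot = float(g_w @ g_l)
    if dot < 0:
        # Largest scale that keeps (g_w + s * g_l) . g_w >= 0.
        s = min(s, float(g_w @ g_w) / (-dot + eps))
    return g_w + s * g_l

g_w = np.array([1.0, 0.0])
g_l = np.array([-2.0, 0.0])        # opposes the winner direction
d = safeguarded_update(g_w, g_l)   # loser gradient gets scaled down
```

When the two gradients already agree, the scale is left at 1 and the update reduces to the plain sum, so the safeguard only activates in the pathological conflicting case the abstract describes.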

[108] SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding

Mauro Orazio Drago, Luca Carlini, Pelinsu Celebi Balyemez, Dennis Pierantozzi, Chiara Lena, Cesare Hassan, Danail Stoyanov, Elena De Momi, Sophia Bano, Mobarak I. Hoque

Main category: cs.CV

TL;DR: SurgViVQA is a surgical VideoQA model that extends visual reasoning from static images to dynamic surgical scenes using temporal cues, outperforming existing methods on colonoscopic datasets.

Motivation: Current surgical VideoQA approaches are limited to static image features and lack temporal annotations, ignoring the dynamics critical for accurate procedural interpretation in surgical videos.

Method: Uses a Masked Video-Text Encoder to fuse video and question features capturing temporal cues like motion and tool-tissue interactions, with a fine-tuned LLM to decode answers. Evaluated on REAL-Colon-VQA dataset with motion-related questions.

Result: Outperforms existing image-based VQA models, improving keyword accuracy by +11% on REAL-Colon-VQA and +9% on EndoVis18-VQA. Shows improved generalizability and robustness to question phrasing variations.

Conclusion: SurgViVQA provides a framework for temporally-aware understanding in surgical VideoQA, enabling AI models to interpret dynamic procedural contexts more effectively.

Abstract: Video Question Answering (VideoQA) in the surgical domain aims to enhance intraoperative understanding by enabling AI models to reason over temporally coherent events rather than isolated frames. Current approaches are limited to static image features, and available datasets often lack temporal annotations, ignoring the dynamics critical for accurate procedural interpretation. We propose SurgViVQA, a surgical VideoQA model that extends visual reasoning from static images to dynamic surgical scenes. It uses a Masked Video–Text Encoder to fuse video and question features, capturing temporal cues such as motion and tool–tissue interactions, which a fine-tuned large language model (LLM) then decodes into coherent answers. To evaluate its performance, we curated REAL-Colon-VQA, a colonoscopic video dataset that includes motion-related questions and diagnostic attributes, as well as out-of-template questions with rephrased or semantically altered formulations to assess model robustness. Experimental validation on REAL-Colon-VQA and the public EndoVis18-VQA dataset shows that SurgViVQA outperforms existing image-based VQA benchmark models, particularly in keyword accuracy, improving over PitVQA by +11% on REAL-Colon-VQA and +9% on EndoVis18-VQA. A perturbation study on the questions further confirms improved generalizability and robustness to variations in question phrasing. SurgViVQA and the REAL-Colon-VQA dataset provide a framework for temporally-aware understanding in surgical VideoQA, enabling AI models to interpret dynamic procedural contexts more effectively. Code and dataset available at https://github.com/madratak/SurgViVQA.

[109] Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge

Yi Yang, Yiming Xu, Timo Kaiser, Hao Cheng, Bodo Rosenhahn, Michael Ying Yang

Main category: cs.CV

TL;DR: A two-stage zero-shot approach combining FastTracker and LLaVA-Video for spatiotemporal action grounding, achieving second place in MOT25-StAG Challenge with 20.68 m-HIoU and 10.73 HOTA scores.

Motivation: To accurately localize and track multiple objects matching specific and free-form language queries in complex real-world video scenes.

Method: Model the task as video retrieval using a two-stage zero-shot approach combining FastTracker (SOTA tracking model) and LLaVA-Video (Multi-modal Large Language Model).

Result: Achieved m-HIoU score of 20.68 and HOTA score of 10.73 on MOT25-StAG test set, winning second place in the challenge.

Conclusion: The proposed two-stage zero-shot approach effectively combines tracking and language models for spatiotemporal action grounding in complex video scenes.

Abstract: In this report, we present our solution to the MOT25-Spatiotemporal Action Grounding (MOT25-StAG) Challenge. The aim of this challenge is to accurately localize and track multiple objects that match specific and free-form language queries, using video data of complex real-world scenes as input. We model the underlying task as a video retrieval problem and present a two-stage, zero-shot approach, combining the advantages of the SOTA tracking model FastTracker and Multi-modal Large Language Model LLaVA-Video. On the MOT25-StAG test set, our method achieves m-HIoU and HOTA scores of 20.68 and 10.73 respectively, which won second place in the challenge.

[110] UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions

Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang, Yi Chen, Yuan Zhou, Qinglin Lu, Limin Wang

Main category: cs.CV

TL;DR: UniAVGen is a unified framework for joint audio-video generation that addresses lip synchronization and semantic consistency issues through dual-branch diffusion transformers with cross-modal interaction mechanisms.

Motivation: Existing open-source audio-video generation methods suffer from compromised lip synchronization and insufficient semantic consistency due to lack of effective cross-modal modeling.

Method: Uses dual-branch joint synthesis with parallel Diffusion Transformers (DiTs), Asymmetric Cross-Modal Interaction for bidirectional cross-attention, Face-Aware Modulation for prioritizing salient regions, and Modality-Aware Classifier-Free Guidance for inference enhancement.

Result: Achieves superior audio-video synchronization, timbre consistency, and emotion consistency with significantly fewer training samples (1.3M vs 30.1M) compared to existing methods.

Conclusion: UniAVGen provides a robust joint synthesis framework that unifies multiple audio-video tasks in a single model while delivering improved cross-modal consistency and synchronization.

Abstract: Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen’s robust joint synthesis design enables seamless unification of pivotal audio-video tasks within a single model, such as joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.

[111] Decoupling Augmentation Bias in Prompt Learning for Vision-Language Models

Gahyeon Kim, Sohee Kim, Seokju Lee

Main category: cs.CV

TL;DR: AAPL introduces adversarial token embeddings to decouple superficial visual variations from class-relevant semantic representations in prompt learning, improving generalization across various settings.

Motivation: Existing prompt learning methods like CoOp and CoCoOp struggle with generalization to unseen categories and primarily focus on text-based modifications, leaving image-based augmentation unexplored.

Method: AAPL uses adversarial token embeddings to separate superficial visual variations from semantic representations, enabling prompts to focus on visually discriminative features aligned with target categories.

Result: AAPL consistently outperforms existing methods across 11 benchmark datasets in few-shot, zero-shot, cross-dataset, and domain generalization settings.

Conclusion: Image-level augmentations with attribute-specific variations can significantly enhance prompt learning, and AAPL’s adversarial approach effectively improves generalization by focusing on semantically meaningful visual features.

Abstract: Recent advances in large-scale vision and language models have led to significant progress in zero-shot learning tasks. Methods such as CoOp and CoCoOp have shown that replacing handcrafted prompts with learnable vectors, known as prompt learning, can result in improved performance. However, these models often struggle to generalize to entirely unseen categories. While traditional zero-shot learning techniques benefit from various data augmentation strategies, prompt learning has primarily focused on text-based modifications, leaving the potential of image-based augmentation largely unexplored. In this work, we explore how image-level augmentations, particularly those that introduce attribute-specific variations, can support and enhance prompt learning. Our analysis examines the interaction between these augmentations and soft prompt frameworks, revealing their potential to improve generalization. We also identify a limitation in existing methods, such as CoCoOp, which do not provide explicit guidance for learning prompts that focus on semantically meaningful visual features. To address this, we propose Adding Attributes to Prompt Learning, AAPL, a novel method that introduces adversarial token embeddings to decouple superficial visual variations introduced by augmentation from class-relevant semantic representations. This decoupling enables the learned prompts to concentrate on visually discriminative features that align with the target categories. We conduct comprehensive experiments on eleven benchmark datasets, and AAPL consistently outperforms existing methods across few-shot, zero-shot, cross-dataset, and domain generalization settings. Our source code is publicly available at: https://github.com/Gahyeonkim09/AAPL

[112] Robust Alignment of the Human Embryo in 3D Ultrasound using PCA and an Ensemble of Heuristic, Atlas-based and Learning-based Classifiers Evaluated on the Rotterdam Periconceptional Cohort

Nikolai Herrmann, Marcella C. Zijta, Stefan Klein, Régine P. M. Steegers-Theunissen, Rene M. H. Wijnen, Bernadette S. de Bakker, Melek Rousian, Wietske A. P. Bastiaansen

Main category: cs.CV

TL;DR: Automated method for standardizing embryo alignment in 3D ultrasound images using PCA and multiple selection strategies, achieving high accuracy (98.5%) for first-trimester prenatal analysis.

Motivation: Standardized embryo alignment facilitates prenatal growth monitoring by enabling standard plane detection, improving landmark visualization, and accentuating differences between scans for better clinical assessment.

Method: Uses PCA on embryo segmentation masks to extract principal axes, generates four candidate orientations, and selects correct one using three strategies: Pearson correlation heuristic, atlas-based image matching with normalized cross-correlation, and Random Forest classifier.

Result: Tested on 2166 3D ultrasound scans from 1043 pregnancies (7-12 weeks gestation). PCA correctly extracted principal axes in 99.0% of images. Selection accuracy: Pearson Heuristic 97.4%, Atlas-based 95.8%, Random Forest 98.4%. Majority Vote achieved 98.5% accuracy.

Conclusion: The pipeline enables consistent embryonic alignment in first trimester with high accuracy, supporting scalable analysis in clinical and research settings. Code is publicly available.

Abstract: Standardized alignment of the embryo in three-dimensional (3D) ultrasound images aids prenatal growth monitoring by facilitating standard plane detection, improving visualization of landmarks and accentuating differences between different scans. In this work, we propose an automated method for standardizing this alignment. Given a segmentation mask of the embryo, Principal Component Analysis (PCA) is applied to the mask, extracting the embryo’s principal axes, from which four candidate orientations are derived. The candidate in standard orientation is selected using one of three strategies: a heuristic based on Pearson’s correlation assessing shape, image matching to an atlas through normalized cross-correlation, and a Random Forest classifier. We tested our method on 2166 longitudinally acquired 3D ultrasound scans from 1043 pregnancies from the Rotterdam Periconceptional Cohort, ranging from 7+0 to 12+6 weeks of gestational age. In 99.0% of images, PCA correctly extracted the principal axes of the embryo. The correct candidate was selected by the Pearson Heuristic, Atlas-based and Random Forest strategies in 97.4%, 95.8%, and 98.4% of images, respectively. A Majority Vote of these selection methods resulted in an accuracy of 98.5%. The high accuracy of this pipeline enables consistent embryonic alignment in the first trimester, enabling scalable analysis in both clinical and research settings. The code is publicly available at: https://gitlab.com/radiology/prenatal-image-analysis/pca-3d-alignment.
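
The PCA step can be sketched in a few lines. This is a 2D toy for brevity (the paper works on 3D masks), and the three candidate-selection strategies are not reproduced; each principal axis carries a sign ambiguity, which is where the multiple candidate orientations come from.

```python
import numpy as np

def principal_axes(mask):
    """Principal axes of a binary segmentation mask via PCA.

    Returns the centroid and a matrix whose rows are the eigenvectors
    of the point covariance, sorted by decreasing variance."""
    pts = np.argwhere(mask > 0).astype(float)   # (N, 2) pixel coordinates
    centroid = pts.mean(axis=0)
    cov = np.cov((pts - centroid).T)            # 2x2 covariance of the shape
    evals, evecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]             # re-sort: largest variance first
    return centroid, evecs[:, order].T

# An elongated vertical blob: the first principal axis should follow
# the long (row) direction. Sign flips of the axes yield the candidate
# orientations that a downstream classifier must disambiguate.
mask = np.zeros((10, 10))
mask[2:8, 4:6] = 1
_, R = principal_axes(mask)
```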

[113] Generalizing Shape-from-Template to Topological Changes

Kevin Manogue, Tomasz M Schang, Dilara Kuş, Jonas Müller, Stefan Zachow, Agniva Sengupta

Main category: cs.CV

TL;DR: Extension of Shape-from-Template (SfT) to handle topological changes during deformation by iteratively adapting the template through spatial partitioning.

Motivation: Existing SfT methods fail when topological changes occur during deformation, limiting their practical applicability.

Method: Initialize with classical SfT solution, then iteratively adapt template by partitioning spatial domain to minimize energy functional combining physical plausibility and reprojection consistency.

Result: Method robustly captures topological events like tears and cuts on bounded 2D surfaces, outperforming baseline methods on synthetic and real data.

Conclusion: Establishes first general framework for topological-change-aware SfT, enabling reconstruction with topological changes.

Abstract: Reconstructing the surfaces of deformable objects from correspondences between a 3D template and a 2D image is well studied under Shape-from-Template (SfT) methods; however, existing approaches break down when topological changes accompany the deformation. We propose a principled extension of SfT that enables reconstruction in the presence of such changes. Our approach is initialized with a classical SfT solution and iteratively adapts the template by partitioning its spatial domain so as to minimize an energy functional that jointly encodes physical plausibility and reprojection consistency. We demonstrate that the method robustly captures a wide range of practically relevant topological events including tears and cuts on bounded 2D surfaces, thereby establishing the first general framework for topological-change-aware SfT. Experiments on both synthetic and real data confirm that our approach consistently outperforms baseline methods.

[114] Human Mesh Modeling for Anny Body

Romain Brégier, Guénolé Fiche, Laura Bravo-Sánchez, Thomas Lucas, Matthieu Armando, Philippe Weinzaepfel, Grégory Rogez, Fabien Baradel

Main category: cs.CV

TL;DR: Anny is a scan-free, differentiable human body model based on anthropometric knowledge that provides interpretable shape control across diverse demographics, with performance matching scan-based models.

Motivation: Existing parametric body models rely on costly 3D scans and proprietary shape spaces that are demographically narrow, limiting accessibility and representativeness.

Method: Anny uses anthropometric knowledge from MakeHuman to define a continuous, interpretable shape space controlled by phenotype parameters (gender, age, height, weight), calibrated with WHO population statistics.

Result: Anny supports millimeter-accurate scan fitting, synthetic data generation, and HMR training. Anny-One dataset of 800k photorealistic humans shows HMR models trained with Anny match scan-based model performance while being interpretable and representative.

Conclusion: Anny provides an accessible, open-source foundation for human-centric 3D modeling with realistic demographic variation, released under Apache 2.0 license.

Abstract: Parametric body models are central to many human-centric tasks, yet existing models often rely on costly 3D scans and learned shape spaces that are proprietary and demographically narrow. We introduce Anny, a simple, fully differentiable, and scan-free human body model grounded in anthropometric knowledge from the MakeHuman community. Anny defines a continuous, interpretable shape space, where phenotype parameters (e.g. gender, age, height, weight) control blendshapes spanning a wide range of human forms – across ages (from infants to elders), body types, and proportions. Calibrated using WHO population statistics, it provides realistic and demographically grounded human shape variation within a single unified model. Thanks to its openness and semantic control, Anny serves as a versatile foundation for 3D human modeling – supporting millimeter-accurate scan fitting, controlled synthetic data generation, and Human Mesh Recovery (HMR). We further introduce Anny-One, a collection of 800k photorealistic humans generated with Anny, showing that despite its simplicity, HMR models trained with Anny can match the performance of those trained with scan-based body models, while remaining interpretable and broadly representative. The Anny body model and its code are released under the Apache 2.0 license, making Anny an accessible foundation for human-centric 3D modeling.

[115] Signal Intensity-weighted coordinate channels improve learning stability and generalisation in 1D and 2D CNNs in localisation tasks on biomedical signals

Vittal L. Rao

Main category: cs.CV

TL;DR: Proposes intensity-weighted coordinate representation that scales coordinate channels by local signal intensity, improving localization performance in biomedical signals compared to traditional coordinate channels.

Motivation: Localization tasks in biomedical data require learning spatial/temporal relationships from signals with complex intensity distributions, and traditional coordinate channels don't incorporate signal intensity information.

Method: Replace pure coordinate channels with channels scaled by local signal intensity, creating an intensity-position coupling in the input representation as a modality-agnostic inductive bias.

Result: Faster convergence and higher generalization performance on both 1D ECG signal transition time prediction and 2D nuclear center localization in cytological images compared to conventional coordinate-channel approaches.

Conclusion: The intensity-weighted coordinate representation is effective across both one-dimensional and two-dimensional biomedical signals, providing improved localization performance.

Abstract: Localisation tasks in biomedical data often require models to learn meaningful spatial or temporal relationships from signals with complex intensity distributions. A common strategy, exemplified by CoordConv layers, is to append coordinate channels to convolutional inputs, enabling networks to learn absolute positions. In this work, we propose a signal intensity-weighted coordinate representation that replaces the pure coordinate channels with channels scaled by local signal intensity. This modification embeds an intensity-position coupling directly in the input representation, introducing a simple and modality-agnostic inductive bias. We evaluate the approach on two distinct localisation problems: (i) predicting the time of morphological transition in 20-second, two-lead ECG signals, and (ii) regressing the coordinates of nuclear centres in cytological images from the SiPaKMeD dataset. In both cases, the proposed representation yields faster convergence and higher generalisation performance relative to conventional coordinate-channel approaches, demonstrating its effectiveness across both one-dimensional and two-dimensional biomedical signals.
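The core idea of the paper, appending coordinate channels scaled by local signal intensity instead of pure CoordConv channels, can be sketched in a few lines of NumPy. This is our own minimal illustration for a 2D single-channel image; function names and the min-max intensity normalisation are our assumptions, not taken from the paper.

```python
import numpy as np

def coord_channels(h, w):
    # Standard CoordConv-style channels: y and x coordinates normalised to [0, 1]
    ys = np.linspace(0.0, 1.0, h)[:, None] * np.ones((1, w))
    xs = np.ones((h, 1)) * np.linspace(0.0, 1.0, w)[None, :]
    return ys, xs

def intensity_weighted_coords(image):
    """Stack the image with coordinate channels scaled by local intensity,
    embedding an intensity-position coupling in the input representation."""
    h, w = image.shape
    ys, xs = coord_channels(h, w)
    # Min-max normalise intensity so the weighting is scale-invariant (our choice)
    rng = image.max() - image.min()
    norm = (image - image.min()) / rng if rng > 0 else np.zeros_like(image)
    return np.stack([image, norm * ys, norm * xs], axis=0)  # (3, H, W)
```

The resulting 3-channel tensor would be fed to an ordinary 2D CNN; the 1D ECG case is analogous with a single time-coordinate channel.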

[116] A Lightweight 3D-CNN for Event-Based Human Action Recognition with Privacy-Preserving Potential

Mehdi Sefidgar Dilmaghani, Francis Fowley, Peter Corcoran

Main category: cs.CV

TL;DR: Lightweight 3DCNN for privacy-preserving human activity recognition using event cameras, achieving 94.17% accuracy with focal loss and data augmentation.

Motivation: Privacy preservation in human monitoring systems, as conventional cameras capture identifiable information while event cameras only record pixel intensity changes.

Method: Lightweight 3D convolutional neural network with focal loss, class reweighting, and targeted data augmentation to handle spatial-temporal dynamics and class imbalance.

Result: Achieved F1-score of 0.9415 and overall accuracy of 94.17%, outperforming benchmark 3D-CNN architectures by up to 3%.

Conclusion: Event-based deep learning enables accurate, efficient, and privacy-aware human action recognition suitable for real-world edge applications.

Abstract: This paper presents a lightweight three-dimensional convolutional neural network (3DCNN) for human activity recognition (HAR) using event-based vision data. Privacy preservation is a key challenge in human monitoring systems, as conventional frame-based cameras capture identifiable personal information. In contrast, event cameras record only changes in pixel intensity, providing an inherently privacy-preserving sensing modality. The proposed network effectively models both spatial and temporal dynamics while maintaining a compact design suitable for edge deployment. To address class imbalance and enhance generalization, focal loss with class reweighting and targeted data augmentation strategies are employed. The model is trained and evaluated on a composite dataset derived from the Toyota Smart Home and ETRI datasets. Experimental results demonstrate an F1-score of 0.9415 and an overall accuracy of 94.17%, outperforming benchmark 3D-CNN architectures such as C3D, ResNet3D, and MC3_18 by up to 3%. These results highlight the potential of event-based deep learning for developing accurate, efficient, and privacy-aware human action recognition systems suitable for real-world edge applications.
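The class-reweighted focal loss used here to handle imbalance can be written compactly. Below is a generic NumPy sketch of focal loss with per-class alpha weights (the standard formulation, not the authors' exact training code); gamma=0 recovers weighted cross-entropy.

```python
import numpy as np

def focal_loss(probs, targets, alpha, gamma=2.0):
    """Class-reweighted focal loss for multi-class classification.

    probs   : (N, C) predicted class probabilities (rows sum to 1)
    targets : (N,) integer class labels
    alpha   : (C,) per-class weights (larger for rare classes)
    gamma   : focusing parameter; down-weights well-classified examples
    """
    p_t = probs[np.arange(len(targets)), targets]  # probability of the true class
    w_t = np.asarray(alpha)[targets]               # per-class reweighting
    return float(np.mean(-w_t * (1.0 - p_t) ** gamma * np.log(p_t)))
```

Because the `(1 - p_t) ** gamma` factor shrinks the contribution of easy examples, training focuses on hard, often minority-class samples.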

[117] Part-Aware Bottom-Up Group Reasoning for Fine-Grained Social Interaction Detection

Dongkeun Kim, Minsu Cho, Suha Kwak

Main category: cs.CV

TL;DR: Proposes a part-aware bottom-up framework for fine-grained social interaction detection using body part features and interpersonal relations, outperforming prior methods on NVI dataset.

Motivation: Existing methods overlook nuanced social cues like facial expressions and gestures, relying on holistic representations and directly detecting groups without modeling underlying interactions, limiting their ability to capture localized social signals.

Method: Detects individuals and enhances features using part-aware cues, then infers group configuration via similarity-based reasoning that considers both spatial relations and subtle social cues signaling interactions.

Result: Achieves state-of-the-art performance on NVI dataset, outperforming prior methods in social interaction detection.

Conclusion: The part-aware bottom-up approach with explicit modeling of interpersonal relations enables more accurate group inference by capturing fine-grained social cues that signal interactions.

Abstract: Social interactions often emerge from subtle, fine-grained cues such as facial expressions, gaze, and gestures. However, existing methods for social interaction detection overlook such nuanced cues and primarily rely on holistic representations of individuals. Moreover, they directly detect social groups without explicitly modeling the underlying interactions between individuals. These drawbacks limit their ability to capture localized social signals and introduce ambiguity when group configurations should be inferred from social interactions grounded in nuanced cues. In this work, we propose a part-aware bottom-up group reasoning framework for fine-grained social interaction detection. The proposed method infers social groups and their interactions using body part features and their interpersonal relations. Our model first detects individuals and enhances their features using part-aware cues, and then infers group configuration by associating individuals via similarity-based reasoning, which considers not only spatial relations but also subtle social cues that signal interactions, leading to more accurate group inference. Experiments on the NVI dataset demonstrate that our method outperforms prior methods, achieving the new state of the art.

[118] Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition

Jongseo Lee, Wooil Lee, Gyeong-Moon Park, Seong Tae Kim, Jinwoo Choi

Main category: cs.CV

TL;DR: DANCE is a video action recognition framework that provides disentangled explanations by separating motion dynamics from spatial context using concept bottlenecks for motion, objects, and scenes.

Motivation: Existing video explanation methods produce entangled explanations that don't clearly distinguish between motion and spatial context, while language-based approaches struggle to explain tacit motions that are hard to verbalize.

Method: Uses a concept bottleneck design with three disentangled concept types: motion dynamics (human pose sequences), objects, and scenes (extracted via LLM). Predictions are forced through these interpretable concepts.

Result: Achieves competitive performance on KTH, Penn Action, HAA500, and UCF-101 datasets while significantly improving explanation clarity. User study validates superior interpretability.

Conclusion: DANCE provides clearer explanations for video action recognition, enabling better model debugging, editing, and failure analysis through its disentangled concept-based approach.

Abstract: Effective explanations of video action recognition models should disentangle how movements unfold over time from the surrounding spatial context. However, existing methods based on saliency produce entangled explanations, making it unclear whether predictions rely on motion or spatial context. Language-based approaches offer structure but often fail to explain motions due to their tacit nature – intuitively understood but difficult to verbalize. To address these challenges, we propose Disentangled Action aNd Context concept-based Explainable (DANCE) video action recognition, a framework that predicts actions through disentangled concept types: motion dynamics, objects, and scenes. We define motion dynamics concepts as human pose sequences. We employ a large language model to automatically extract object and scene concepts. Built on an ante-hoc concept bottleneck design, DANCE enforces prediction through these concepts. Experiments on four datasets – KTH, Penn Action, HAA500, and UCF-101 – demonstrate that DANCE significantly improves explanation clarity with competitive performance. We validate the superior interpretability of DANCE through a user study. Experimental results also show that DANCE is beneficial for model debugging, editing, and failure analysis.
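The ante-hoc concept bottleneck design, forcing the prediction through interpretable concept scores, can be illustrated generically. The sketch below is not the DANCE architecture (which uses pose sequences and LLM-extracted concepts); it only shows, with hypothetical linear layers, why a bottleneck makes each concept's contribution to a decision directly readable.

```python
import numpy as np

def concept_bottleneck_predict(features, concept_weights, class_weights):
    """Generic concept bottleneck: features map to scores over named concepts
    (e.g. motion, object, scene), and the class prediction is a linear
    function of those scores only."""
    concept_scores = features @ concept_weights    # interpretable layer
    class_logits = concept_scores @ class_weights  # prediction through concepts
    winner = int(np.argmax(class_logits))
    # Per-concept contribution to the winning class, usable as an explanation
    contributions = concept_scores * class_weights[:, winner]
    return winner, concept_scores, contributions
```

Because every logit is a weighted sum of concept scores, the `contributions` vector decomposes the decision over concepts, which is what enables the debugging and editing use cases described above.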

[119] A New Comprehensive Framework for Multi-Exposure Stereo Coding Utilizing Low Rank Tucker-ALS and 3D-HEVC Techniques

Mansi Sharma, Jyotsana Grover

Main category: cs.CV

TL;DR: Proposes an efficient compression scheme for multi-exposure stereo images using tensor low-rank approximation and 3D-HEVC encoding to generate HDR 3D content.

Motivation: High-cost HDR cameras are scarce, so generating HDR 3D images from low-cost LDR stereo images with varying exposures is needed for realistic 3D displays with depth cues.

Method: Uses Tucker decomposition for tensor low-rank approximation of multi-exposure stereo images, followed by 3D-HEVC encoding to exploit redundancies, with color space optimization for perceptual quality.

Result: Outperforms state-of-the-art JPEG-XT and 3D-HEVC range coding standards in extensive experiments on natural scenes.

Conclusion: The proposed scheme efficiently compresses multi-exposure stereo images and enables HDR 3D generation at the decoder with adjustable bitrate control.

Abstract: Display technology must offer high dynamic range (HDR) contrast-based depth induction and 3D personalization simultaneously. Efficient algorithms to compress HDR stereo data are critical. Direct capture of HDR content is complicated by the high expense and scarcity of HDR cameras. HDR 3D images can be generated at low cost by fusing low-dynamic-range (LDR) images acquired using a stereo camera with various exposure settings. In this paper, an efficient scheme for coding multi-exposure stereo images is proposed based on a tensor low-rank approximation scheme. Multi-exposure fusion can then generate HDR stereo output at the decoder for increased realism and exaggerated binocular 3D depth cues. To exploit spatial redundancy in LDR stereo images, the stack of multi-exposure stereo images is decomposed into a set of projection matrices and a core tensor following an alternating least squares Tucker decomposition model. The compact, low-rank representation of the scene thus generated is further processed by the 3D extension of the High Efficiency Video Coding standard. Encoding with 3D-HEVC enhances the scheme's efficiency by exploiting intra-frame, inter-view, and inter-component redundancies in the low-rank approximated representation. We exploit the constant-luminance property of the IPT and Y'CbCr color spaces to precisely approximate intensity prediction and perceptually minimize the encoding distortion. Besides, the proposed scheme offers flexibility to adjust the bitrate of the tensor latent components by changing the rank of the core tensor and its quantization. Extensive experiments on natural scenes demonstrate that the proposed scheme outperforms state-of-the-art JPEG-XT and 3D-HEVC range coding standards.
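The Tucker model underlying the compression step factors a tensor into a small core and per-mode projection matrices. As a minimal NumPy sketch, the truncated higher-order SVD (HOSVD) below computes such factors in one shot; the paper refines them with ALS iterations (HOOI), which we omit here, and all function names are ours.

```python
import numpy as np

def unfold(T, mode):
    # Mode-n unfolding: move axis `mode` to the front and flatten the rest
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def tucker_hosvd(T, ranks):
    """Truncated HOSVD: per-mode orthonormal factors plus a small core tensor.
    Serves as the standard initialisation for Tucker-ALS refinement."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):  # project each mode onto its factor
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

def reconstruct(core, factors):
    T = core
    for mode, U in enumerate(factors):  # expand the core back along each mode
        T = np.moveaxis(np.tensordot(U, np.moveaxis(T, mode, 0), axes=1), 0, mode)
    return T
```

Reducing the ranks shrinks the core (and hence the bitrate) at the cost of approximation error, which is the rate-control knob the abstract describes.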

[120] Seal2Real: Prompt Prior Learning on Diffusion Model for Unsupervised Document Seal Data Generation and Realisation

Mingfu Yan, Jiancheng Huang, Shifeng Chen

Main category: cs.CV

TL;DR: Seal2Real is a generative framework that synthesizes labeled document seal data to address dataset scarcity, enabling improved performance on seal-related document processing tasks.

Motivation: Progress in seal-related document processing tasks is hindered by the scarcity of labeled datasets needed for supervised learning.

Method: Proposes Seal2Real, a prompt prior learning architecture built on pre-trained Stable Diffusion model to synthesize realistic seal images, and introduces Seal-DB dataset with 20,000 labeled images.

Result: Seal2Real produces highly realistic synthetic seal images that significantly enhance performance of downstream seal-related tasks on real-world data.

Conclusion: Experimental evaluations demonstrate the effectiveness and practical value of the proposed Seal2Real framework for seal-related research.

Abstract: Seal-related tasks in document processing-such as seal segmentation, authenticity verification, seal removal, and text recognition under seals-hold substantial commercial importance. However, progress in these areas has been hindered by the scarcity of labeled document seal datasets, which are essential for supervised learning. To address this limitation, we propose Seal2Real, a novel generative framework designed to synthesize large-scale labeled document seal data. As part of this work, we also present Seal-DB, a comprehensive dataset containing 20,000 labeled images to support seal-related research. Seal2Real introduces a prompt prior learning architecture built upon a pre-trained Stable Diffusion model, effectively transferring its generative capability to the unsupervised domain of seal image synthesis. By producing highly realistic synthetic seal images, Seal2Real significantly enhances the performance of downstream seal-related tasks on real-world data. Experimental evaluations on the Seal-DB dataset demonstrate the effectiveness and practical value of the proposed framework.

[121] Transfer Learning-based Real-time Handgun Detection

Youssef Elmir

Main category: cs.CV

TL;DR: This paper develops a real-time computer vision system using CNNs and transfer learning for automatic handgun detection, achieving 84.74% precision with reduced false positives and learning time.

Motivation: Traditional surveillance systems rely on human attention, which limits their effectiveness. The research aims to reduce human monitoring dependence and enhance security through automated detection.

Method: Uses convolutional neural networks (CNNs) and transfer learning to develop a real-time computer vision system for handgun detection, with emphasis on reducing false positives and learning time.

Result: The proposed system achieves a precision rate of 84.74%, demonstrating promising performance comparable to related works while enabling faster learning and accurate automatic detection.

Conclusion: Transfer learning is an effective approach for efficient and reliable handgun detection, advancing security measures by reducing human monitoring dependence and showcasing the potential of transfer learning-based approaches.

Abstract: Traditional surveillance systems rely on human attention, limiting their effectiveness. This study employs convolutional neural networks and transfer learning to develop a real-time computer vision system for automatic handgun detection. Comprehensive analysis of online handgun detection methods is conducted, emphasizing reducing false positives and learning time. Transfer learning is demonstrated as an effective approach. Despite technical challenges, the proposed system achieves a precision rate of 84.74%, demonstrating promising performance comparable to related works, enabling faster learning and accurate automatic handgun detection for enhanced security. This research advances security measures by reducing human monitoring dependence, showcasing the potential of transfer learning-based approaches for efficient and reliable handgun detection.

[122] BoxCell: Leveraging SAM for Cell Segmentation with Box Supervision

Aayush Kumar Tyagi, Vaibhav Mishra, Prathosh A. P., Mausam

Main category: cs.CV

TL;DR: BoxCell is a weakly supervised cell segmentation framework that uses bounding box supervision and SAM without fine-tuning, achieving 6-10 point Dice score improvements over existing methods.

Motivation: Cell segmentation in histopathology is crucial for disease diagnosis but requires medical expertise for annotation. Supervised learning is difficult due to annotation costs, so weakly supervised approaches using only bounding boxes are needed.

Method: Uses SAM with bounding box prompts at both train and test time. At train time, gold bounding boxes generate pseudo-masks to train a standalone segmenter. At test time, combines segmenter output with SAM-generated masks from detector boxes using integer programming with intensity and spatial constraints.

Result: Significantly outperforms existing box-supervised segmentation models on CoNSep, MoNuSeg, and TNBC datasets, achieving 6-10 point Dice score gains.

Conclusion: BoxCell effectively leverages SAM’s capabilities without fine-tuning and demonstrates that combining multiple segmentation approaches with proper reconciliation can substantially improve weakly supervised cell segmentation performance.

Abstract: Cell segmentation in histopathological images is vital for the diagnosis and treatment of several diseases. Annotating data is tedious and requires medical expertise, making it difficult to employ supervised learning. Instead, we study a weakly supervised setting, where only bounding box supervision is available, and present the use of Segment Anything (SAM) for this without any finetuning, i.e., directly utilizing the pre-trained model. We propose BoxCell, a cell segmentation framework that utilizes SAM's capability to interpret bounding boxes as prompts, both at train and test times. At train time, gold bounding boxes given to SAM produce (pseudo-)masks, which are used to train a standalone segmenter. At test time, BoxCell generates two segmentation masks: (1) generated by this standalone segmenter, and (2) a trained object detector outputs bounding boxes, which are given as prompts to SAM to produce another mask. Recognizing complementary strengths, we reconcile the two segmentation masks using a novel integer programming formulation with intensity and spatial constraints. We experiment on three publicly available cell segmentation datasets namely, CoNSep, MoNuSeg, and TNBC, and find that BoxCell significantly outperforms existing box supervised image segmentation models, obtaining 6-10 point Dice gains.

[123] A Label Propagation Strategy for CutMix in Multi-Label Remote Sensing Image Classification

Tom Burgert, Kai Norman Clasen, Jonas Klotz, Tim Siebert, Begüm Demir

Main category: cs.CV

TL;DR: A label propagation strategy is introduced to apply CutMix data augmentation effectively for multi-label scene classification in remote sensing, addressing label noise issues by exploiting pixel-level class positional information.

Motivation: To overcome the limitations of supervised deep learning methods for multi-label scene classification in remote sensing, particularly the time-consuming and costly annotation process, while addressing label noise issues that arise from direct application of CutMix data augmentation.

Method: Proposes a label propagation strategy that uses pixel-level class positional information from reference maps or explanation masks to update multi-labels for augmented images generated by CutMix, performing pairing operations on class positional information similar to image pairing.

Result: Experimental results show 2% to 4% improvement in mAP macro compared to standard CutMix, with demonstrated robustness in various scenarios with noisy class positional information.

Conclusion: The proposed label propagation strategy effectively enables the application of CutMix data augmentation for multi-label scene classification in remote sensing while mitigating label noise issues, providing significant performance improvements.

Abstract: The development of supervised deep learning-based methods for multi-label scene classification (MLC) is one of the prominent research directions in remote sensing (RS). However, collecting annotations for large RS image archives is time-consuming and costly. To address this issue, several data augmentation methods have been introduced in RS. Among others, the CutMix data augmentation technique, which combines parts of two existing training images to generate an augmented image, stands out as a particularly effective approach. However, the direct application of CutMix in RS MLC can lead to the erasure or addition of class labels (i.e., label noise) in the augmented (i.e., combined) training image. To address this problem, we introduce a label propagation (LP) strategy that allows the effective application of CutMix in the context of MLC problems in RS without being affected by label noise. To this end, our proposed LP strategy exploits pixel-level class positional information to update the multi-label of the augmented training image. We propose to access such class positional information from reference maps (e.g., thematic products) associated with each training image or from class explanation masks provided by an explanation method if no reference maps are available. Similarly to pairing two training images, our LP strategy carries out a pairing operation on the associated pixel-level class positional information to derive the updated multi-label for the augmented image. Experimental results show the effectiveness of our LP strategy in general (e.g., an improvement of 2% to 4% mAP macro compared to standard CutMix) and its robustness in the case of various simulated and real scenarios with noisy class positional information in particular. Code is available at https://git.tu-berlin.de/rsim/cutmix_lp.
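The label propagation idea, pairing the pixel-level class maps the same way the images are paired, and rederiving the multi-label from the combined map, can be sketched directly. This is our own simplified illustration assuming single-class-per-pixel reference maps and an axis-aligned box; names are hypothetical.

```python
import numpy as np

def cutmix_with_label_propagation(img_a, map_a, img_b, map_b, box, num_classes):
    """CutMix for multi-label classification: paste a box from image B into
    image A, apply the same paste to the class-position maps, and derive the
    multi-hot label from the combined map, so classes erased by the paste are
    dropped and classes introduced by it are added (no label noise).

    map_a, map_b : (H, W) integer class maps (e.g. reference/thematic maps)
    box          : (y0, y1, x0, x1) region copied from B into A
    """
    y0, y1, x0, x1 = box
    img = img_a.copy()
    cmap = map_a.copy()
    img[y0:y1, x0:x1] = img_b[y0:y1, x0:x1]
    cmap[y0:y1, x0:x1] = map_b[y0:y1, x0:x1]
    label = np.zeros(num_classes, dtype=int)
    label[np.unique(cmap)] = 1  # classes actually present in the augmented image
    return img, label
```

Naive CutMix would instead take the union of the two original image-level labels, keeping classes whose only pixels were pasted over, which is exactly the label noise the strategy avoids.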

[124] ROADWork: A Dataset and Benchmark for Learning to Recognize, Observe, Analyze and Drive Through Work Zones

Anurag Ghosh, Shen Zheng, Robert Tamburo, Khiem Vuong, Juan Alvarez-Padilla, Hailiang Zhu, Michael Cardei, Nicholas Dunn, Christoph Mertz, Srinivasa G. Narasimhan

Main category: cs.CV

TL;DR: ROADWork dataset enables improved perception and navigation in work zones through fine-tuning and simple techniques, achieving significant performance gains across detection, segmentation, and vision-language tasks.

Motivation: Work zone perception and navigation is challenging and underexplored, with scarce open datasets for this long-tailed scenario.

Method: Propose ROADWork dataset and show fine-tuning models on it significantly improves performance. Also demonstrate value of simple techniques like video label propagation, detector-text spotter composition, and incorporating road work semantics.

Result: Fine-tuning improves work zone image discovery (+32.5% precision, 12.8× rate), detection (+32.2 AP), and VLM descriptions (+36.7 SPICE). Additional gains from video propagation (+2.6 AP), crop-scaling (+14.2% 1-NED), and context composition (+3.9 SPICE). Navigation goals achieve 53.6% AE < 0.5 (+9.9%) and pathways 75.3% AE < 0.5 (+8.1%).

Conclusion: ROADWork dataset enables substantial improvements in work zone perception and navigation, with fine-tuning and simple compositional techniques proving highly effective for this challenging domain.

Abstract: Perceiving and autonomously navigating through work zones is a challenging and underexplored problem. Open datasets for this long-tailed scenario are scarce. We propose the ROADWork dataset to learn to recognize, observe, analyze, and drive through work zones. State-of-the-art foundation models fail when applied to work zones. Fine-tuning models on our dataset significantly improves perception and navigation in work zones. With the ROADWork dataset, we discover new work zone images with higher precision (+32.5%) at a much higher rate (12.8×) around the world. Open-vocabulary methods fail too, whereas fine-tuned detectors improve performance (+32.2 AP). Vision-Language Models (VLMs) struggle to describe work zones, but fine-tuning substantially improves performance (+36.7 SPICE). Beyond fine-tuning, we show the value of simple techniques. Video label propagation provides additional gains (+2.6 AP) for instance segmentation. While reading work zone signs, composing a detector and text spotter via crop-scaling improves performance (+14.2% 1-NED). Composing work zone detections to provide context further reduces hallucinations (+3.9 SPICE) in VLMs. We predict navigational goals and compute drivable paths from work zone videos. Incorporating road work semantics ensures 53.6% of goals have angular error (AE) < 0.5 (+9.9%) and 75.3% of pathways have AE < 0.5 (+8.1%).

[125] MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping

Amirreza Fateh, Mohammad Reza Mohammadi, Mohammad Reza Jahed Motlagh

Main category: cs.CV

TL;DR: A lightweight transformer-based framework for few-shot semantic segmentation that achieves competitive performance with only 1.5M parameters through spatial transformer decoder, contextual mask generation, and multi-scale decoding.

Motivation: Address limitations of previous few-shot segmentation methods that either discard local semantic features or suffer from high computational complexity.

Method: Proposes transformer-based framework with spatial transformer decoder, contextual mask generation module, multi-scale decoder, and integration of global features from intermediate encoder stages.

Result: Achieves competitive results on PASCAL-5^i and COCO-20^i datasets in both 1-shot and 5-shot settings with only 1.5 million parameters.

Conclusion: The proposed method effectively balances performance and efficiency, overcoming limitations of existing methodologies while maintaining lightweight structure.

Abstract: Few-shot Semantic Segmentation addresses the challenge of segmenting objects in query images with only a handful of annotated examples. However, many previous state-of-the-art methods either have to discard intricate local semantic features or suffer from high computational complexity. To address these challenges, we propose a new Few-shot Semantic Segmentation framework based on the Transformer architecture. Our approach introduces the spatial transformer decoder and the contextual mask generation module to improve the relational understanding between support and query images. Moreover, we introduce a multi scale decoder to refine the segmentation mask by incorporating features from different resolutions in a hierarchical manner. Additionally, our approach integrates global features from intermediate encoder stages to improve contextual understanding, while maintaining a lightweight structure to reduce complexity. This balance between performance and efficiency enables our method to achieve competitive results on benchmark datasets such as PASCAL-5^i and COCO-20^i in both 1-shot and 5-shot settings. Notably, our model with only 1.5 million parameters demonstrates competitive performance while overcoming limitations of existing methodologies. https://github.com/amirrezafateh/MSDNet

[126] Manipulation Facing Threats: Evaluating Physical Vulnerabilities in End-to-End Vision Language Action Models

Hao Cheng, Erjia Xiao, Yichi Wang, Chengyuan Yu, Mengshu Sun, Qiang Zhang, Jiahang Cao, Yijie Guo, Ning Liu, Kaidi Xu, Jize Zhang, Chao Shen, Philip Torr, Jindong Gu, Renjing Xu

Main category: cs.CV

TL;DR: The paper proposes PVEP, a pipeline to evaluate VLAMs’ physical robustness against various visual threats including OOD, typography-based prompts, and adversarial patches.

Motivation: To ensure robustness and safety of Vision Language Action Models in robotic manipulation tasks, as they interact directly with the physical world where safety is critical.

Method: Developed Physical Vulnerability Evaluating Pipeline (PVEP) that incorporates multiple visual modal physical threats to test VLAMs’ robustness.

Result: Evaluated VLAMs’ performance fluctuations when attacked by different physical threats, providing analyses of their responses to these threats.

Conclusion: The study provides comprehensive evaluation and analysis of VLAMs’ vulnerability to physical threats, highlighting safety concerns in robotic manipulation applications.

Abstract: Recently, driven by advancements in Multimodal Large Language Models (MLLMs), Vision Language Action Models (VLAMs) are being proposed to achieve better performance in open-vocabulary scenarios for robotic manipulation tasks. Since manipulation tasks involve direct interaction with the physical world, ensuring robustness and safety during the execution of this task is always a very critical issue. In this paper, by synthesizing current safety research on MLLMs and the specific application scenarios of the manipulation task in the physical world, we comprehensively evaluate VLAMs in the face of potential physical threats. Specifically, we propose the Physical Vulnerability Evaluating Pipeline (PVEP) that can incorporate as many visual modal physical threats as possible for evaluating the physical robustness of VLAMs. The physical threats in PVEP specifically include Out-of-Distribution, Typography-based Visual Prompt, and Adversarial Patch Attacks. By comparing the performance fluctuations of VLAMs before and after being attacked, we provide generalizable analyses of how VLAMs respond to different physical threats.

[127] Disentanglement with Factor Quantized Variational Autoencoders

Gulcin Baykal, Melih Kandemir, Gozde Unal

Main category: cs.CV

TL;DR: FactorQVAE is a discrete VAE model that enhances disentanglement through scalar quantization and total correlation optimization, outperforming previous methods on disentanglement metrics while improving reconstruction.

Motivation: To improve disentangled representation learning without ground truth generative factors, leveraging discrete representations and inductive biases for better disentanglement.

Method: Discrete variational autoencoder with scalar quantization of latent variables using a global codebook, plus total correlation term in optimization as inductive bias.

Result: Outperforms previous disentanglement methods on DCI and InfoMEC metrics while achieving better reconstruction performance.

Conclusion: Combining discrete representation learning with optimization-based disentanglement approaches through FactorQVAE effectively enhances disentanglement without requiring ground truth factor information.

Abstract: Disentangled representation learning aims to represent the underlying generative factors of a dataset in a latent representation independently of one another. In our work, we propose a discrete variational autoencoder (VAE) based model where the ground truth information about the generative factors is not provided to the model. We demonstrate the advantages of learning discrete representations over learning continuous representations in facilitating disentanglement. Furthermore, we propose incorporating an inductive bias into the model to further enhance disentanglement. Precisely, we propose scalar quantization of the latent variables in a latent representation with scalar values from a global codebook, and we add a total correlation term to the optimization as an inductive bias. Our method, called FactorQVAE, combines optimization-based disentanglement approaches with discrete representation learning, and it outperforms prior disentanglement methods in terms of two disentanglement metrics (DCI and InfoMEC) while improving the reconstruction performance. Our code can be found at https://github.com/ituvisionlab/FactorQVAE.
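The scalar quantization step, snapping every scalar entry of the latent vector to its nearest value in one shared global codebook (rather than quantizing whole vectors as in VQ-VAE), is easy to sketch. A minimal NumPy version, with the straight-through gradient estimator used in training omitted:

```python
import numpy as np

def scalar_quantize(z, codebook):
    """Quantize each scalar latent entry to its nearest codebook value.

    z        : latent array of any shape
    codebook : (K,) global set of scalar code values shared by all dimensions
    Returns the quantized array and the chosen code indices.
    """
    z = np.asarray(z, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    idx = np.abs(z[..., None] - codebook).argmin(axis=-1)  # nearest code per scalar
    return codebook[idx], idx
```

Because all latent dimensions draw from the same small set of scalars, each dimension is pushed to carry factor information on its own, which is the inductive bias the abstract pairs with the total correlation term.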

[128] SAM-EM: Real-Time Segmentation for Automated Liquid Phase Transmission Electron Microscopy

Alexander Wang, Max Xu, Risha Goel, Zain Shabeeb, Isabel Panicker, Vida Jamali

Main category: cs.CV

TL;DR: SAM-EM is a domain-adapted foundation model for liquid phase TEM videos that unifies segmentation, tracking, and statistical analysis, enabling reliable extraction of particle trajectories from noisy data.

Motivation: The absence of robust segmentation frameworks for noisy LPTEM videos prevents reliable extraction of particle trajectories, creating a major barrier to quantitative analysis in materials characterization and design.

Method: Built on SAM2 with full-model fine-tuning on 46,600 curated LPTEM synthetic video frames, integrating particle tracking with statistical tools like mean-squared displacement and particle displacement distribution analysis.

Result: Substantially improves mask quality and temporal identity stability compared to zero-shot SAM2 and existing baselines, remaining robust under low signal-to-noise conditions.

Conclusion: SAM-EM transforms LPTEM into a quantitative single-particle tracking platform and accelerates its integration into data-driven materials discovery and design.

Abstract: The absence of robust segmentation frameworks for noisy liquid phase transmission electron microscopy (LPTEM) videos prevents reliable extraction of particle trajectories, creating a major barrier to quantitative analysis and to connecting observed dynamics with materials characterization and design. To address this challenge, we present Segment Anything Model for Electron Microscopy (SAM-EM), a domain-adapted foundation model that unifies segmentation, tracking, and statistical analysis for LPTEM data. Built on Segment Anything Model 2 (SAM2), SAM-EM is derived through full-model fine-tuning on 46,600 curated LPTEM synthetic video frames, substantially improving mask quality and temporal identity stability compared to zero-shot SAM2 and existing baselines. Beyond segmentation, SAM-EM integrates particle tracking with statistical tools, including mean-squared displacement and particle displacement distribution analysis, providing an end-to-end framework for extracting and interpreting nanoscale dynamics. Crucially, full fine-tuning allows SAM-EM to remain robust under low signal-to-noise conditions, such as those caused by increased liquid sample thickness in LPTEM experiments. By establishing a reliable analysis pipeline, SAM-EM transforms LPTEM into a quantitative single-particle tracking platform and accelerates its integration into data-driven materials discovery and design. Project page: https://github.com/JamaliLab/SAM-EM.
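
The mean-squared displacement analysis mentioned above is a standard single-particle tracking statistic and can be sketched directly; this is an illustrative implementation, not SAM-EM's API:

```python
def mean_squared_displacement(trajectory, max_lag):
    """MSD(lag) = mean over t of |r(t+lag) - r(t)|^2.
    trajectory: list of (x, y) positions at consecutive frames."""
    msd = []
    for lag in range(1, max_lag + 1):
        sq = [
            (trajectory[t + lag][0] - trajectory[t][0]) ** 2
            + (trajectory[t + lag][1] - trajectory[t][1]) ** 2
            for t in range(len(trajectory) - lag)
        ]
        msd.append(sum(sq) / len(sq))
    return msd

# Pure drift of 1 unit per frame along x: MSD grows quadratically with lag.
traj = [(float(t), 0.0) for t in range(10)]
print(mean_squared_displacement(traj, 3))  # [1.0, 4.0, 9.0]
```

For diffusive (Brownian) motion the MSD instead grows linearly with lag, which is what makes this curve diagnostic of the underlying dynamics.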

[129] A Survey on Text-Driven 360-Degree Panorama Generation

Hai Wang, Xiaoyu Xiang, Weihao Xia, Jing-Hao Xue

Main category: cs.CV

TL;DR: This survey paper comprehensively reviews text-driven 360-degree panorama generation, analyzing state-of-the-art algorithms and extending to related domains like 3D scene generation and panoramic video generation.

DetailsMotivation: The motivation is to simplify the traditionally complex process of producing 360-degree panoramic content by leveraging text-to-image diffusion models, enabling immersive visual content creation directly from textual descriptions.

Method: The paper conducts a comprehensive review and analysis of existing algorithms in text-driven 360-degree panorama generation, extending the analysis to closely related domains including 3D scene generation and panoramic video generation.

Result: The survey provides an in-depth analysis of state-of-the-art algorithms, critically examines current limitations, and proposes promising directions for future research in text-driven 360-degree panorama generation.

Conclusion: Text-driven 360-degree panorama generation represents a transformative advancement in immersive visual content creation, with significant potential for future development in related domains like 3D scene and panoramic video generation.

Abstract: The advent of text-driven 360-degree panorama generation, enabling the synthesis of 360-degree panoramic images directly from textual descriptions, marks a transformative advancement in immersive visual content creation. This innovation significantly simplifies the traditionally complex process of producing such content. Recent progress in text-to-image diffusion models has accelerated the rapid development in this emerging field. This survey presents a comprehensive review of text-driven 360-degree panorama generation, offering an in-depth analysis of state-of-the-art algorithms. We extend our analysis to two closely related domains: text-driven 360-degree 3D scene generation and text-driven 360-degree panoramic video generation. Furthermore, we critically examine current limitations and propose promising directions for future research. A curated project page with relevant resources and research papers is available at https://littlewhitesea.github.io/Text-Driven-Pano-Gen/.

[130] Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models

Hao Cheng, Erjia Xiao, Yichi Wang, Lingfeng Zhang, Qiang Zhang, Jiahang Cao, Kaidi Xu, Mengshu Sun, Xiaoshuai Hao, Jindong Gu, Renjing Xu

Main category: cs.CV

TL;DR: This paper investigates security risks of Typographic Visual Prompt Injection (TVPI) in cross-vision tasks, creating a dataset to evaluate how visual prompts disrupt LVLMs and I2I generation models.

DetailsMotivation: Visual prompts pose security risks to cross-vision tasks but their specific threat characteristics remain underexplored, requiring comprehensive investigation of TVPI impacts.

Method: Proposed Typographic Visual Prompts Injection Dataset and thoroughly evaluated TVPI security risks on various open-source and closed-source LVLMs and I2I GMs under different target semantics.

Result: The study reveals that typographic visual prompts significantly induce disruptive outputs aligned with injected words in both LVLMs and I2I generation models.

Conclusion: TVPI poses serious security risks to cross-vision applications, and understanding these threats is crucial for developing more robust vision-language systems.

Abstract: Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision modality inputs in real-world scenarios, Cross-Vision tasks, encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), have attracted significant attention. Large Vision Language Models (LVLMs) and I2I Generation Models (GMs) are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images significantly induces LVLMs and I2I GMs to produce disruptive outputs that are semantically aligned with those words. Additionally, visual prompts, as a more sophisticated form of typography, are also revealed to pose security risks to various applications of cross-vision tasks. However, the specific characteristics of the threats posed by visual prompts remain underexplored. In this paper, to comprehensively investigate the performance impact induced by Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs, we propose the Typographic Visual Prompts Injection Dataset and thoroughly evaluate the TVPI security risks on various open-source and closed-source LVLMs and I2I GMs under visual prompts with different target semantics, deepening the understanding of TVPI threats.

[131] Breaking Down Monocular Ambiguity: Exploiting Temporal Evolution for 3D Lane Detection

Huan Zheng, Wencheng Han, Tianyi Yan, Cheng-zhong Xu, Jianbing Shen

Main category: cs.CV

TL;DR: GTA-Net improves monocular 3D lane detection by leveraging temporal information from consecutive frames to overcome single-frame ambiguity and enhance geometric accuracy and lane integrity.

DetailsMotivation: Existing monocular 3D lane detection methods suffer from inherent ambiguity in single-frame inputs, leading to inaccurate geometric predictions and poor lane integrity, especially for distant lanes.

Method: Proposes Geometry-aware Temporal Aggregation Network (GTA-Net) with two modules: Temporal Geometry Enhancement Module (TGEM) for geometric consistency across frames, and Temporal Instance-aware Query Generation (TIQG) that aggregates instance cues and synthesizes pseudo future perspectives.

Result: GTA-Net achieves new state-of-the-art results, significantly outperforming existing monocular 3D lane detection solutions.

Conclusion: Leveraging temporal evolution of scenes through geometric consistency and instance-aware query generation effectively addresses the limitations of single-frame monocular 3D lane detection.

Abstract: Monocular 3D lane detection aims to estimate the 3D position of lanes from frontal-view (FV) images. However, existing methods are fundamentally constrained by the inherent ambiguity of single-frame input, which leads to inaccurate geometric predictions and poor lane integrity, especially for distant lanes. To overcome this, we propose to unlock the rich information embedded in the temporal evolution of the scene as the vehicle moves. Our proposed Geometry-aware Temporal Aggregation Network (GTA-Net) systematically leverages the temporal information from complementary perspectives. First, Temporal Geometry Enhancement Module (TGEM) learns geometric consistency across consecutive frames, effectively recovering depth information from motion to build a reliable 3D scene representation. Second, to enhance lane integrity, Temporal Instance-aware Query Generation (TIQG) module aggregates instance cues from past and present frames. Crucially, for lanes that are ambiguous in the current view, TIQG innovatively synthesizes a pseudo future perspective to generate queries that reveal lanes which would otherwise be missed. The experiments demonstrate that GTA-Net achieves new SoTA results, significantly outperforming existing monocular 3D lane detection solutions.

[132] ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation

Lingfeng Wang, Hualing Lin, Senda Chen, Tao Wang, Changxu Cheng, Yangyang Zhong, Dong Zheng, Wuyue Zhao

Main category: cs.CV

TL;DR: ALTo is an adaptive length tokenizer for MLLMs that enables flexible token allocation based on visual complexity, achieving SOTA performance on segmentation benchmarks with adaptive token cost.

DetailsMotivation: Humans adaptively allocate attention based on object complexity when drawing, but current MLLMs are constrained by rigid token representations, creating a gap in adaptive visual processing.

Method: Proposes ALTo with a token length predictor, length regularization, differentiable token chunking, and integrates it into ALToLLM using group relative policy optimization (GRPO) for mask quality-efficiency trade-offs.

Result: ALToLLM achieves state-of-the-art performance on popular segmentation benchmarks with adaptive token cost, demonstrating improved efficiency and quality.

Conclusion: The proposed adaptive tokenization approach successfully bridges the gap between human-like adaptive attention and rigid MLLM representations, enabling more efficient and effective visual processing.

Abstract: While humans effortlessly draw visual objects and shapes by adaptively allocating attention based on their complexity, existing multimodal large language models (MLLMs) remain constrained by rigid token representations. Bridging this gap, we propose ALTo, an adaptive length tokenizer for autoregressive mask generation. To achieve this, a novel token length predictor is designed, along with a length regularization term and a differentiable token chunking strategy. We further build ALToLLM, which seamlessly integrates ALTo into an MLLM. Preferences on the trade-off between mask quality and efficiency are implemented via group relative policy optimization (GRPO). Experiments demonstrate that ALToLLM achieves state-of-the-art performance with adaptive token cost on popular segmentation benchmarks. Code and models are released at https://github.com/yayafengzi/ALToLLM.

[133] ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS

Weijie Wang, Donny Y. Chen, Zeyu Zhang, Duochao Shi, Akide Liu, Bohan Zhuang

Main category: cs.CV

TL;DR: ZPressor is a lightweight module that enables feed-forward 3D Gaussian Splatting models to scale to over 100 input views by compressing multi-view inputs into a compact latent state, improving performance and robustness.

DetailsMotivation: Feed-forward 3DGS models face scalability limitations due to constrained model capacity, leading to degraded performance or excessive memory consumption with increasing input views.

Method: ZPressor partitions views into anchor and support sets, using cross attention to compress support view information into anchor views, forming a compact latent state Z that retains essential scene information while discarding redundancy.

Result: ZPressor enables existing feed-forward 3DGS models to scale to over 100 input views at 480P resolution on an 80GB GPU, consistently improving performance under moderate input views and enhancing robustness under dense view settings on DL3DV-10K and RealEstate10K benchmarks.

Conclusion: ZPressor provides an effective solution for scaling feed-forward 3DGS models to handle dense multi-view inputs while maintaining performance and efficiency.

Abstract: Feed-forward 3D Gaussian Splatting (3DGS) models have recently emerged as a promising solution for novel view synthesis, enabling one-pass inference without the need for per-scene 3DGS optimization. However, their scalability is fundamentally constrained by the limited capacity of their models, leading to degraded performance or excessive memory consumption as the number of input views increases. In this work, we analyze feed-forward 3DGS frameworks through the lens of the Information Bottleneck principle and introduce ZPressor, a lightweight architecture-agnostic module that enables efficient compression of multi-view inputs into a compact latent state $Z$ that retains essential scene information while discarding redundancy. Concretely, ZPressor enables existing feed-forward 3DGS models to scale to over 100 input views at 480P resolution on an 80GB GPU, by partitioning the views into anchor and support sets and using cross attention to compress the information from the support views into anchor views, forming the compressed latent state $Z$. We show that integrating ZPressor into several state-of-the-art feed-forward 3DGS models consistently improves performance under moderate input views and enhances robustness under dense view settings on two large-scale benchmarks DL3DV-10K and RealEstate10K. The video results, code and trained models are available on our project page: https://lhmd.top/zpressor.
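
The anchor/support compression can be sketched as follows: views are split into anchor and support sets, and each support feature is folded into the anchors with softmax attention weights. This is a toy sketch of the idea on single-vector features with no learned projections, not ZPressor's actual module:

```python
import math

def compress_views(views, num_anchors):
    """Split per-view feature vectors into anchor and support sets, then fold
    each support vector into the anchors using softmax attention weights."""
    anchors = views[:num_anchors]
    compressed = [list(a) for a in anchors]
    for s in views[num_anchors:]:
        # dot-product similarity of this support view to each anchor
        scores = [sum(x * y for x, y in zip(a, s)) for a in anchors]
        m = max(scores)
        weights = [math.exp(sc - m) for sc in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # attention-weighted accumulation into the anchor features
        for i, anchor in enumerate(compressed):
            for d in range(len(anchor)):
                anchor[d] += weights[i] * s[d]
    return compressed

# Three views compressed into two anchors: output size is fixed by the
# number of anchors, no matter how many support views arrive.
print(compress_views([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], 2))
```

The point of the bottleneck is visible in the shapes: the latent state has `num_anchors` entries regardless of the input view count, which is what lets memory stay bounded as views scale past 100.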

[134] Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs

Fangrui Zhu, Hanhui Wang, Yiming Xie, Jing Gu, Tianye Ding, Jianwei Yang, Huaizu Jiang

Main category: cs.CV

TL;DR: Struct2D enables MLLMs to perform 3D spatial reasoning using only structured 2D representations from perception, achieving competitive performance without explicit 3D inputs.

DetailsMotivation: To unlock spatial reasoning in MLLMs for intelligent 3D environment interaction without relying on explicit 3D inputs or specialized architectures.

Method: Proposed Struct2D framework with BEV images, object marks, and metadata; created Struct2D-Set dataset (200K QA pairs); fine-tuned Qwen2.5VL on this dataset.

Result: MLLMs show strong spatial reasoning with structured 2D inputs; fine-tuned model achieves competitive performance on 3D QA, dense captioning, and object grounding benchmarks.

Conclusion: Structured 2D inputs effectively bridge perception and language reasoning in MLLMs, eliminating the need for explicit 3D representations.

Abstract: Unlocking spatial reasoning in Multimodal Large Language Models (MLLMs) is crucial for enabling intelligent interaction with 3D environments. While prior efforts often rely on explicit 3D inputs or specialized model architectures, we ask: can MLLMs reason about 3D space using only structured 2D representations derived from perception? We introduce Struct2D, a perception-guided prompting framework that combines bird’s-eye-view (BEV) images with object marks and object-centric metadata, optionally incorporating egocentric keyframes when needed. Using Struct2D, we conduct an in-depth zero-shot analysis of closed-source MLLMs (e.g., GPT-o3) and find that they exhibit surprisingly strong spatial reasoning abilities when provided with structured 2D inputs, effectively handling tasks such as relative direction estimation and route planning. Building on these insights, we construct Struct2D-Set, a large-scale instruction tuning dataset with 200K fine-grained QA pairs across eight spatial reasoning categories, generated automatically from 3D indoor scenes. We fine-tune an open-source MLLM (Qwen2.5VL) on Struct2D-Set, achieving competitive performance on multiple benchmarks, including 3D question answering, dense captioning, and object grounding. Our approach demonstrates that structured 2D inputs can effectively bridge perception and language reasoning in MLLMs, without requiring explicit 3D representations as input. We will release both our code and dataset to support future research.

[135] Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations

Gaia Di Lorenzo, Federico Tombari, Marc Pollefeys, Daniel Barath

Main category: cs.CV

TL;DR: Object-X is a multi-modal 3D object representation framework that encodes various inputs (images, point clouds, text) into rich embeddings and decodes them into detailed geometric and visual reconstructions using 3D Gaussian Splatting.

DetailsMotivation: Existing methods rely on task-specific embeddings that cannot be decoded into explicit geometry and reused across different tasks, limiting their versatility and practical application.

Method: Object-X geometrically grounds captured modalities in a 3D voxel grid and learns an unstructured embedding that fuses voxel information with object attributes, enabling 3D Gaussian Splatting-based reconstruction.

Result: Object-X achieves high-fidelity novel-view synthesis comparable to standard 3D Gaussian Splatting with improved geometric accuracy, competitive performance in scene alignment and localization, and 3-4 orders of magnitude less storage than traditional approaches.

Conclusion: Object-X provides a scalable and practical solution for multi-modal 3D scene representation that supports various downstream tasks while being significantly more storage-efficient.

Abstract: Learning effective multi-modal 3D representations of objects is essential for numerous applications, such as augmented reality and robotics. Existing methods often rely on task-specific embeddings that are tailored either for semantic understanding or geometric reconstruction. As a result, these embeddings typically cannot be decoded into explicit geometry and simultaneously reused across tasks. In this paper, we propose Object-X, a versatile multi-modal object representation framework capable of encoding rich object embeddings (e.g. images, point cloud, text) and decoding them back into detailed geometric and visual reconstructions. Object-X operates by geometrically grounding the captured modalities in a 3D voxel grid and learning an unstructured embedding fusing the information from the voxels with the object attributes. The learned embedding enables 3D Gaussian Splatting-based object reconstruction, while also supporting a range of downstream tasks, including scene alignment, single-image 3D object reconstruction, and localization. Evaluations on two challenging real-world datasets demonstrate that Object-X produces high-fidelity novel-view synthesis comparable to standard 3D Gaussian Splatting, while significantly improving geometric accuracy. Moreover, Object-X achieves competitive performance with specialized methods in scene alignment and localization. Critically, our object-centric descriptors require 3-4 orders of magnitude less storage compared to traditional image- or point cloud-based approaches, establishing Object-X as a scalable and highly practical solution for multi-modal 3D scene representation.

[136] SpatialLM: Training Large Language Models for Structured Indoor Modeling

Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, Zihan Zhou

Main category: cs.CV

TL;DR: SpatialLM is a large language model that processes 3D point clouds to generate structured 3D scene understanding outputs including architectural elements and oriented object boxes, achieving state-of-the-art layout estimation and competitive 3D object detection.

DetailsMotivation: To enhance the spatial understanding capabilities of modern LLMs for applications in augmented reality and embodied robotics by processing 3D point cloud data directly, avoiding task-specific network designs.

Method: Fine-tuned directly from open-source LLMs using a large-scale synthetic dataset of 12,328 indoor scenes (54,778 rooms) with ground-truth 3D annotations, following standard multimodal LLM architecture.

Result: Achieves state-of-the-art performance in layout estimation and competitive results in 3D object detection on public benchmarks.

Conclusion: Demonstrates a feasible path for enhancing spatial understanding in LLMs, enabling applications in augmented reality, robotics, and other 3D understanding tasks.

Abstract: SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object boxes with their semantic categories. Unlike previous methods which exploit task-specific network designs, our model adheres to the standard multimodal LLM architecture and is fine-tuned directly from open-source LLMs. To train SpatialLM, we collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12,328 indoor scenes (54,778 rooms) with ground-truth 3D annotations, and conduct a careful study on various modeling and training decisions. On public benchmarks, our model gives state-of-the-art performance in layout estimation and competitive results in 3D object detection. With that, we show a feasible path for enhancing the spatial understanding capabilities of modern LLMs for applications in augmented reality, embodied robotics, and more.

[137] MagCache: Fast Video Generation with Magnitude-Aware Cache

Zehong Ma, Longhui Wei, Feng Wang, Shiliang Zhang, Qi Tian

Main category: cs.CV

TL;DR: MagCache is a novel acceleration technique for video diffusion models that uses a unified magnitude law to adaptively skip unimportant timesteps, achieving 2.10x-2.68x speedups with superior visual quality using only one calibration sample.

DetailsMotivation: Existing acceleration methods rely on uniform heuristics or time-embedding variants that require extensive calibration with curated prompts and risk inconsistent outputs due to prompt-specific overfitting.

Method: Leverages a discovered unified magnitude law where residual output ratios decrease monotonically. Uses Magnitude-aware Cache (MagCache) with error modeling and adaptive caching to skip unimportant timesteps adaptively.

Result: Achieves 2.10x-2.68x speedups on Open-Sora, CogVideoX, Wan 2.1, and HunyuanVideo while preserving superior visual fidelity. Significantly outperforms existing methods in LPIPS, SSIM, and PSNR metrics under similar computational budgets.

Conclusion: MagCache provides a robust and efficient acceleration method for video diffusion models that requires minimal calibration (only one sample) and maintains high visual quality across different models and prompts.

Abstract: Existing acceleration techniques for video diffusion models often rely on uniform heuristics or time-embedding variants to skip timesteps and reuse cached features. These approaches typically require extensive calibration with curated prompts and risk inconsistent outputs due to prompt-specific overfitting. In this paper, we introduce a novel and robust discovery: a unified magnitude law observed across different models and prompts. Specifically, the magnitude ratio of successive residual outputs decreases monotonically: steadily over most timesteps, then rapidly in the last several steps. Leveraging this insight, we introduce a Magnitude-aware Cache (MagCache) that adaptively skips unimportant timesteps using an error modeling mechanism and adaptive caching strategy. Unlike existing methods requiring dozens of curated samples for calibration, MagCache only requires a single sample for calibration. Experimental results show that MagCache achieves 2.10x-2.68x speedups on Open-Sora, CogVideoX, Wan 2.1, and HunyuanVideo, while preserving superior visual fidelity. It significantly outperforms existing methods in LPIPS, SSIM, and PSNR under similar computational budgets.
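
The magnitude-aware skipping idea can be sketched as a planner over precomputed residual norms: accumulate the deviation of the successive-step magnitude ratio from 1 and reuse the cache while it stays under a tolerance. Function and variable names are illustrative, not MagCache's API:

```python
def plan_skips(residual_norms, tolerance):
    """Decide which timesteps can reuse the cached residual.
    residual_norms[t] is the magnitude of the residual output at step t.
    While the accumulated deviation of the step-to-step magnitude ratio
    from 1 stays under `tolerance`, the step is skipped (cache reused);
    otherwise the step is recomputed and the error counter resets."""
    skips = []
    err = 0.0
    last = residual_norms[0]
    for t in range(1, len(residual_norms)):
        ratio = residual_norms[t] / last
        err += abs(1.0 - ratio)
        if err < tolerance:
            skips.append(t)          # reuse cached features at this step
        else:
            err = 0.0                # recompute and reset the error model
            last = residual_norms[t]
    return skips

# Slowly decaying magnitudes get skipped; the sharp drop forces a recompute.
print(plan_skips([1.0, 0.99, 0.98, 0.7, 0.69], 0.05))  # [1, 2, 4]
```

This matches the law qualitatively: in the steady regime the ratio is near 1 and many steps are skipped, while the rapid final decay breaks the tolerance and forces fresh computation where it matters most.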

[138] Balancing Tails when Comparing Distributions: Comprehensive Equity Index (CEI) with Application to Bias Evaluation in Operational Face Biometrics

Imanol Solano, Julian Fierrez, Aythami Morales, Alejandro Peña, Ruben Tolosana, Francisco Zamora-Martinez, Javier San Agustin

Main category: cs.CV

TL;DR: Introduces CEI, a novel metric for detecting demographic bias in face recognition systems by analyzing genuine and impostor score distributions separately, with focus on tail probabilities.

DetailsMotivation: Existing metrics often fail to detect subtle demographic biases in face recognition systems, particularly in the tails of score distributions.

Method: Developed Comprehensive Equity Index (CEI) that separately analyzes genuine and impostor score distributions with configurable focus on tail probabilities, and created CEI^A as an automated version for practical application.

Result: Extensive experiments show CEI’s superior ability to detect nuanced biases where previous methods fall short, across state-of-the-art FR systems, intentionally biased models, and diverse datasets.

Conclusion: CEI provides a robust and sensitive tool for operational FR fairness assessment, applicable beyond face biometrics to any problem requiring analysis of distribution tails.

Abstract: Demographic bias in high-performance face recognition (FR) systems often eludes detection by existing metrics, especially with respect to subtle disparities in the tails of the score distribution. We introduce the Comprehensive Equity Index (CEI), a novel metric designed to address this limitation. CEI uniquely analyzes genuine and impostor score distributions separately, enabling a configurable focus on tail probabilities while also considering overall distribution shapes. Our extensive experiments (evaluating state-of-the-art FR systems, intentionally biased models, and diverse datasets) confirm CEI’s superior ability to detect nuanced biases where previous methods fall short. Furthermore, we present CEI^A, an automated version of the metric that enhances objectivity and simplifies practical application. CEI provides a robust and sensitive tool for operational FR fairness assessment. The proposed methods have been developed particularly for bias evaluation in face biometrics but, in general, they are applicable for comparing statistical distributions in any problem where one is interested in analyzing the distribution tails.
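
A minimal sketch of the tail-focused comparison idea: summarize the worst tail of each group's score distribution and report the gap. This is a hedged illustration of the concept, not the published CEI formula:

```python
def tail_gap(scores_a, scores_b, tail_fraction=0.05):
    """Gap between the lower tails of two score distributions (e.g. genuine
    similarity scores for two demographic groups). `tail_fraction` plays the
    role of the configurable tail focus described above."""
    def tail_mean(scores):
        k = max(1, int(len(scores) * tail_fraction))  # size of the worst tail
        return sum(sorted(scores)[:k]) / k
    return abs(tail_mean(scores_a) - tail_mean(scores_b))

# Two groups with identical medians but different worst cases: a mean- or
# median-based metric sees no bias, while the tail comparison does.
group_a = [0.1] + [0.9] * 19
group_b = [0.3] + [0.9] * 19
print(tail_gap(group_a, group_b))
```

This is why tail-sensitive metrics can flag biases that aggregate statistics miss: the disparity lives entirely in the lowest-scoring genuine comparisons.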

[139] CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection

Byeongchan Lee, John Won, Seunghyun Lee, Jinwoo Shin

Main category: cs.CV

TL;DR: CLIPFUSION combines CLIP discriminative models and diffusion generative models for anomaly detection, achieving superior performance on benchmark datasets through multi-modal fusion.

DetailsMotivation: Anomaly detection faces challenges including ambiguous definitions, diverse anomaly types (local/global defects), and limited training data, requiring comprehensive feature capture capabilities.

Method: Leverages CLIP-based discriminative model for global features and diffusion-based generative model for local details, with novel use of cross-attention maps and feature maps from diffusion models for anomaly detection.

Result: Outperforms baseline methods on MVTec-AD and VisA datasets, achieving outstanding performance in both anomaly segmentation and classification tasks.

Conclusion: Demonstrates the effectiveness of multi-modal and multi-model fusion for tackling complex anomaly detection challenges, providing a scalable solution for real-world applications.

Abstract: Anomaly detection is a complex problem due to the ambiguity in defining anomalies, the diversity of anomaly types (e.g., local and global defect), and the scarcity of training data. As such, it necessitates a comprehensive model capable of capturing both low-level and high-level features, even with limited data. To address this, we propose CLIPFUSION, a method that leverages both discriminative and generative foundation models. Specifically, the CLIP-based discriminative model excels at capturing global features, while the diffusion-based generative model effectively captures local details, creating a synergistic and complementary approach. Notably, we introduce a methodology for utilizing cross-attention maps and feature maps extracted from diffusion models specifically for anomaly detection. Experimental results on benchmark datasets (MVTec-AD, VisA) demonstrate that CLIPFUSION consistently outperforms baseline methods, achieving outstanding performance in both anomaly segmentation and classification. We believe that our method underscores the effectiveness of multi-modal and multi-model fusion in tackling the multifaceted challenges of anomaly detection, providing a scalable solution for real-world applications.
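
The fusion idea (a global discriminative score combined with a local generative map) can be sketched as a weighted combination. This is a hedged illustration of the concept only; CLIPFUSION's actual scoring uses cross-attention maps and feature maps extracted from the diffusion model:

```python
def fuse_anomaly_scores(global_score, local_map, weight=0.5):
    """Fuse a global (CLIP-style) image-level score with a local
    (diffusion-style) pixel anomaly map into one image score.
    `weight` balances global context against local detail; all names
    here are illustrative, not CLIPFUSION's API."""
    local_score = max(max(row) for row in local_map)  # strongest local defect
    return weight * global_score + (1.0 - weight) * local_score

# A mildly anomalous image globally (0.8) with a moderate local peak (0.6).
print(fuse_anomaly_scores(0.8, [[0.1, 0.6], [0.2, 0.4]], weight=0.5))
```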

[140] Benchmarking Foundation Models and Parameter-Efficient Fine-Tuning for Prognosis Prediction in Medical Imaging

Filippo Ruffini, Elena Mulero Ayllon, Linlin Shen, Paolo Soda, Valerio Guarrasi

Main category: cs.CV

TL;DR: First benchmark comparing Foundation Models vs CNNs for COVID-19 prognosis prediction from chest X-rays, showing CNNs are robust for small imbalanced datasets while FMs with parameter-efficient methods work well on larger datasets.

DetailsMotivation: Foundation Models have potential in medical imaging but face challenges in prognosis prediction due to data scarcity, class imbalance, and task complexity, limiting clinical adoption.

Method: Used 4 COVID-19 chest X-ray datasets covering mortality, severity, and ICU admission. Compared CNNs (ImageNet pretrained) with FMs (general/biomedical pretrained) using full finetuning, linear probing, and parameter-efficient methods under full data and few-shot regimes.

Result: CNNs with full fine-tuning performed robustly on small imbalanced datasets. FMs with PEFT (LoRA, BitFit) achieved competitive results on larger datasets. Severe class imbalance degraded PEFT performance. In few-shot settings, FMs showed limited generalization with linear probing being most stable.

Conclusion: No single fine-tuning strategy is universally optimal: CNNs remain dependable for low-resource scenarios, while FMs benefit from parameter-efficient methods when data are sufficient.

Abstract: Despite the significant potential of Foundation Models (FMs) in medical imaging, their application to prognosis prediction remains challenging due to data scarcity, class imbalance, and task complexity, which limit their clinical adoption. This study introduces the first structured benchmark to assess the robustness and efficiency of transfer learning strategies for FMs compared with convolutional neural networks (CNNs) in predicting COVID-19 patient outcomes from chest X-rays. The goal is to systematically compare fine-tuning strategies, both classical and parameter-efficient, under realistic clinical constraints related to data scarcity and class imbalance, offering empirical guidance for AI deployment in clinical workflows. Four publicly available COVID-19 chest X-ray datasets were used, covering mortality, severity, and ICU admission, with varying sample sizes and class imbalances. CNNs pretrained on ImageNet and FMs pretrained on general or biomedical datasets were adapted using full fine-tuning, linear probing, and parameter-efficient methods. Models were evaluated under full-data and few-shot regimes using the Matthews Correlation Coefficient (MCC) and Precision Recall AUC (PR-AUC), with cross-validation and class-weighted losses. CNNs with full fine-tuning performed robustly on small, imbalanced datasets, while FMs with Parameter-Efficient Fine-Tuning (PEFT), particularly LoRA and BitFit, achieved competitive results on larger datasets. Severe class imbalance degraded PEFT performance, whereas balanced data mitigated this effect. In few-shot settings, FMs showed limited generalization, with linear probing yielding the most stable results. No single fine-tuning strategy proved universally optimal: CNNs remain dependable for low-resource scenarios, whereas FMs benefit from parameter-efficient methods when data are sufficient.
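
LoRA, one of the parameter-efficient methods benchmarked above, adds a trainable low-rank update to a frozen weight matrix. A minimal sketch of the forward pass follows, on plain lists with illustrative names; real implementations operate on batched tensors:

```python
def lora_forward(x, W, A, B, alpha, r):
    """Forward pass of a LoRA-adapted linear layer:
        y = W x + (alpha / r) * B (A x),
    where W (out x in) is frozen and only the low-rank factors
    A (r x in) and B (out x r) are trained."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    base = matvec(W, x)
    low = matvec(B, matvec(A, x))
    return [b + (alpha / r) * l for b, l in zip(base, low)]

# With A and B initialized to zero, the adapted layer matches the frozen one,
# which is the standard LoRA initialization: training starts from the
# pretrained behavior and only the rank-r update is learned.
W = [[1.0, 0.0], [0.0, 1.0]]
print(lora_forward([1.0, 2.0], W, [[0.0, 0.0]], [[0.0], [0.0]], 1.0, 1))  # [1.0, 2.0]
```

The parameter saving is the point: for an out x in matrix, LoRA trains r * (in + out) values instead of in * out, which is why it remains competitive when data are sufficient but full fine-tuning is too costly.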

[141] Automatic Road Subsurface Distress Recognition from Ground Penetrating Radar Images using Deep Learning-based Cross-verification

Chang Peng, Bao Yang, Meiqi Li, Ge Zhang, Hui Sun, Zhenyu Jiang

Main category: cs.CV

TL;DR: A deep learning method using cross-verification strategy achieves over 98.6% recall for road subsurface distress detection using 3D GPR data, reducing inspection labor by 90%.

DetailsMotivation: Address data scarcity and insufficient recognition capability in deep learning-based automatic road subsurface distress detection using ground penetrating radar.

Method: Constructed rigorously validated 3D GPR dataset with 2134 samples, proposed cross-verification strategy to exploit complementary abilities of region proposal networks from different GPR image views.

Result: Achieved outstanding accuracy with recall over 98.6% in field tests, integrated into online detection system reducing human inspection labor by around 90%.

Conclusion: The proposed approach effectively addresses data scarcity issues and significantly improves automatic road subsurface distress detection efficiency.

Abstract: Ground penetrating radar (GPR) has become a rapid and non-destructive solution for road subsurface distress (RSD) detection. Deep learning-based automatic RSD recognition, though ameliorating the burden of data processing, suffers from data scarcity and insufficient capability to recognize defects. In this study, a rigorously validated 3D GPR dataset containing 2134 samples of diverse types was constructed through field scanning. A novel cross-verification strategy was proposed to fully exploit the complementary abilities of region proposal networks in object recognition from different views of GPR images. The method achieves outstanding accuracy with a recall over 98.6% in field tests. The approach, integrated into an online RSD detection system, can reduce the human labor of inspection by around 90%.
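
The cross-verification idea, keeping only candidates that independent detectors on different GPR views both propose, can be sketched as box-overlap agreement. The IoU rule, threshold, and shared projection below are illustrative assumptions, not the paper's exact mechanism:

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def cross_verify(dets_a, dets_b, thr=0.5):
    """Keep detections from one view that a detector on another view
    also proposes (IoU above thr in an assumed shared projection)."""
    return [d for d in dets_a if any(iou(d, e) >= thr for e in dets_b)]

view_a = [(10, 10, 50, 50), (80, 80, 120, 120)]  # two candidate defects
view_b = [(12, 11, 52, 49)]                      # only the first is confirmed
print(cross_verify(view_a, view_b))  # [(10, 10, 50, 50)]
```

Requiring agreement between complementary detectors suppresses single-view false alarms while keeping defects that both views support.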

[142] P3P Made Easy

Seong Hun Lee, Patrick Vandewalle, Javier Civera

Main category: cs.CV

TL;DR: The paper revisits the classical Perspective-Three-Point (P3P) problem and shows that Grunert’s 1841 formulation, reduced to a quartic polynomial, remains highly competitive with modern methods when implemented with contemporary insights.

DetailsMotivation: The elegant classical formulation of P3P has been largely overlooked in modern literature despite its analytical simplicity and computational efficiency.

Method: Proposes a compact algebraic solver based on Grunert’s 1841 theoretical foundation, implementing the classical quartic polynomial formulation with modern insights.

Result: Achieves accuracy and runtime comparable to state-of-the-art methods, demonstrating the classical formulation’s continued competitiveness.

Conclusion: The classical P3P formulation offers an excellent balance between simplicity, efficiency, and accuracy when properly implemented with modern techniques.

Abstract: We revisit the classical Perspective-Three-Point (P3P) problem, which aims to recover the absolute pose of a calibrated camera from three 2D-3D correspondences. It has long been known that P3P can be reduced to a quartic polynomial with analytically simple and computationally efficient coefficients. However, this elegant formulation has been largely overlooked in modern literature. Building on the theoretical foundation that traces back to Grunert’s work in 1841, we propose a compact algebraic solver that achieves accuracy and runtime comparable to state-of-the-art methods. Our results show that this classical formulation remains highly competitive when implemented with modern insights, offering an excellent balance between simplicity, efficiency, and accuracy.
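
The computational core of the classical reduction is solving a quartic in a single unknown (a distance ratio) and keeping the admissible real roots, each of which yields one candidate pose. A sketch with illustrative coefficients (not derived from real correspondences):

```python
import numpy as np

def real_positive_roots(coeffs, tol=1e-9):
    """Real, positive roots of a polynomial given highest-degree
    coefficient first, via the companion-matrix eigenvalues in np.roots."""
    roots = np.roots(coeffs)
    return sorted(r.real for r in roots
                  if abs(r.imag) < tol and r.real > tol)

# Illustrative quartic (x - 1)(x - 2)(x^2 + 1) = x^4 - 3x^3 + 3x^2 - 3x + 2:
# the complex pair is rejected, leaving two admissible distance ratios.
roots = real_positive_roots([1, -3, 3, -3, 2])
print([round(float(r), 6) for r in roots])  # [1.0, 2.0]
```

In a full P3P solver the quartic coefficients come from the triangle side lengths and inter-ray angles; each surviving root is then back-substituted to recover the three camera-to-point distances.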

[143] ViFP: A Framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs

Ben Zhang, LuLu Yu, Lei Gao, QuanJiang Guo, Jing Liu, Hui Gao

Main category: cs.CV

TL;DR: ViFP is a framework for detecting and correcting false positive reasoning in vision-language models, improving reasoning reliability through multi-turn QA and targeted correction mechanisms.

DetailsMotivation: Existing approaches for improving reasoning reliability in VLMs require large amounts of high-quality data and are limited in practical applicability. Few methods directly address false positive reasoning where models give correct answers but follow incorrect reasoning paths.

Method: ViFP builds effective reasoning paths through multi-turn QA, dynamically analyzes reasoning path consistency to identify potential false positives, and introduces a targeted reasoning chain correction mechanism to modify false positive reasoning.

Result: Experiments on A-OKVQA, OK-VQA, and FVQA datasets show ViFP improves accuracy by up to 5.4% on A-OKVQA, surpassing previous state-of-the-art by 4.3%, and significantly reduces false positives.

Conclusion: ViFP effectively enhances reasoning reliability in VLMs by detecting and correcting false positive reasoning, with the proposed VoC metric providing a quantitative tool to assess reasoning reliability beyond just answer accuracy.

Abstract: During reasoning in vision-language models (VLMs), false positive (FP) reasoning occurs when a model produces the correct answer but follows an incorrect reasoning path, resulting in undermined reasoning reliability. Existing approaches mainly rely on prompt engineering, knowledge distillation, or reinforcement learning to improve reasoning reliability, all of which require large amounts of high-quality data and thus limit practical applicability. Few approaches have focused on directly detecting and correcting FPs. To address these issues, we propose ViFP, a framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs. ViFP builds effective reasoning paths through multi-turn QA and dynamically analyzes the consistency of the reasoning path to identify potential FPs. It also introduces a targeted reasoning chain correction mechanism to modify FP reasoning, thereby improving logical consistency and accuracy. Finally, we introduce a reliability evaluation metric, VoC, which integrates answer accuracy and the FP rate, providing a quantitative tool to assess whether a VLM not only answers correctly but also reasons reliably. Our experiments on closed-source VLMs show that ViFP consistently improves performance across three datasets: A-OKVQA, OK-VQA, and FVQA. On A-OKVQA, ViFP improves accuracy by up to 5.4%, surpassing the previous state-of-the-art by 4.3%, and significantly reduces the number of FPs, validating its benefits in enhancing reasoning reliability.

[144] Automated Segmentation of Coronal Brain Tissue Slabs for 3D Neuropathology

Jonathan Williams Ramirez, Dina Zemlyanker, Lucas Deden-Binder, Rogeny Herisse, Erendira Garcia Pallares, Karthik Gopinath, Harshvardhan Gazula, Christopher Mount, Liana N. Kozanno, Michael S. Marshall, Theresa R. Connors, Matthew P. Frosch, Mark Montine, Derek H. Oakley, Christine L. Mac Donald, C. Dirk Keene, Bradley T. Hyman, Juan Eugenio Iglesias

Main category: cs.CV

TL;DR: Deep learning model for automated brain tissue segmentation from photographs using U-Net architecture, achieving performance approaching human inter-/intra-rater levels.

DetailsMotivation: Current volumetric analysis of postmortem brain tissue requires manual segmentation from photographs, which is costly and time-consuming.

Method: U-Net architecture trained on 1,414 manually segmented images of fixed and fresh tissue from various diagnoses, photographed at two different sites.

Result: Median Dice score >0.98, mean surface distance <0.4mm, 95% Hausdorff distance <1.60mm, approaching human inter-/intra-rater variability levels.

Conclusion: Automated segmentation tool achieves near-human performance and is publicly available, enabling more efficient brain tissue analysis.

Abstract: Advances in image registration and machine learning have recently enabled volumetric analysis of postmortem brain tissue from conventional photographs of coronal slabs, which are routinely collected in brain banks and neuropathology laboratories worldwide. One caveat of this methodology is the requirement of segmentation of the tissue from photographs, which currently requires costly manual intervention. In this article, we present a deep learning model to automate this process. The automatic segmentation tool relies on a U-Net architecture that was trained with a combination of 1,414 manually segmented images of both fixed and fresh tissue, from specimens with varying diagnoses, photographed at two different sites. Automated model predictions on a subset of photographs not seen in training were analyzed to estimate performance compared to manual labels, including both inter- and intra-rater variability. Our model achieved a median Dice score over 0.98, mean surface distance under 0.4mm, and 95% Hausdorff distance under 1.60mm, which approaches inter-/intra-rater levels. Our tool is publicly available at surfer.nmr.mgh.harvard.edu/fswiki/PhotoTools.
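
The headline Dice score is the overlap ratio between predicted and reference masks; a toy sketch (not the paper's evaluation code):

```python
import numpy as np

def dice(a, b):
    """Dice coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()
    total = a.sum() + b.sum()
    return 2.0 * float(inter) / float(total) if total else 1.0

# Toy 4x4 masks: the prediction misses one pixel of the reference
ref  = np.array([[0, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 1], [0, 1, 1, 0]])
pred = np.array([[0, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0], [0, 1, 1, 0]])
print(round(dice(ref, pred), 3))  # 0.957
```

Surface-distance metrics such as the 95% Hausdorff distance complement Dice by penalizing boundary errors that a large tissue region would otherwise dilute.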

[145] Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation

Hao Zhang, Chun-Han Yao, Simon Donné, Narendra Ahuja, Varun Jampani

Main category: cs.CV

TL;DR: SP4D is a diffusion-based framework that generates paired RGB and kinematic part videos from monocular inputs, enabling 3D skeletal structure derivation with minimal manual adjustments.

DetailsMotivation: Conventional part segmentation methods rely on appearance-based semantic cues, but SP4D aims to produce kinematic parts that are aligned with object articulation and consistent across views and time for better animation applications.

Method: Uses dual-branch diffusion model with spatial color encoding for part masks, BiDiFuse module for cross-branch consistency, and contrastive part consistency loss. Trained on KinematicParts20K dataset of 20K rigged objects.

Result: SP4D generalizes to diverse scenarios including real-world videos, novel objects, and rare articulated poses, producing kinematic-aware outputs suitable for animation tasks.

Conclusion: The framework successfully generates kinematic parts that can be lifted to 3D for skeletal structures and skinning weights, demonstrating strong generalization across various scenarios.

Abstract: We present Stable Part Diffusion 4D (SP4D), a framework for generating paired RGB and kinematic part videos from monocular inputs. Unlike conventional part segmentation methods that rely on appearance-based semantic cues, SP4D learns to produce kinematic parts - structural components aligned with object articulation and consistent across views and time. SP4D adopts a dual-branch diffusion model that jointly synthesizes RGB frames and corresponding part segmentation maps. To simplify the architecture and flexibly enable different part counts, we introduce a spatial color encoding scheme that maps part masks to continuous RGB-like images. This encoding allows the segmentation branch to share the latent VAE from the RGB branch, while enabling part segmentation to be recovered via straightforward post-processing. A Bidirectional Diffusion Fusion (BiDiFuse) module enhances cross-branch consistency, supported by a contrastive part consistency loss to promote spatial and temporal alignment of part predictions. We demonstrate that the generated 2D part maps can be lifted to 3D to derive skeletal structures and harmonic skinning weights with few manual adjustments. To train and evaluate SP4D, we construct KinematicParts20K, a curated dataset of over 20K rigged objects selected and processed from Objaverse XL (Deitke et al., 2023), each paired with multi-view RGB and part video sequences. Experiments show that SP4D generalizes strongly to diverse scenarios, including real-world videos, novel generated objects, and rare articulated poses, producing kinematic-aware outputs suitable for downstream animation and motion-related tasks.
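
The spatial color encoding can be pictured as painting each part id with a distinct color so the segmentation branch can reuse the RGB VAE, then decoding by nearest palette color in post-processing. The palette below is an illustrative assumption, not the paper's exact scheme:

```python
import colorsys
import numpy as np

def part_palette(n_parts):
    """One distinct RGB color per part id, from evenly spaced hues."""
    return np.array([colorsys.hsv_to_rgb(i / n_parts, 1.0, 1.0)
                     for i in range(n_parts)])

def encode_parts(mask, n_parts):
    """Render an integer part-id mask as a continuous RGB-like image."""
    return part_palette(n_parts)[mask]

def decode_parts(rgb, n_parts):
    """Recover part ids by nearest palette color (the post-processing step)."""
    pal = part_palette(n_parts)
    dist = ((rgb[..., None, :] - pal) ** 2).sum(-1)
    return dist.argmin(-1)

mask = np.array([[0, 1], [2, 3]])   # toy 2x2 part-id mask
rgb = encode_parts(mask, 4)         # shape (2, 2, 3), values in [0, 1]
print(bool((decode_parts(rgb, 4) == mask).all()))  # True
```

Because the encoded map lives in RGB space, varying the part count needs only a different palette, not a different decoder head.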

[146] SmartWilds: Multimodal Wildlife Monitoring Dataset

Jenna Kline, Anirudh Potlapally, Bharath Pillai, Tanishka Wani, Rugved Katole, Vedant Patil, Penelope Covey, Hari Subramoni, Tanya Berger-Wolf, Christopher Stewart

Main category: cs.CV

TL;DR: SmartWilds is the first multimodal wildlife monitoring dataset combining synchronized drone imagery, camera trap photos/videos, and bioacoustic recordings from a 220-acre safari park, supporting AI research for conservation and endangered species monitoring.

DetailsMotivation: To address critical needs in endangered species research, conservation ecology, and habitat management by providing comprehensive multimodal environmental monitoring data for AI research.

Method: Collected synchronized data across three modalities (drone imagery, camera trap photos/videos, bioacoustic recordings) during a four-day pilot deployment in a 220-acre pasture containing various species including Pere David’s deer, Sichuan takin, and Przewalski’s horses.

Result: Successfully captured multimodal wildlife monitoring data and provided comparative analysis of sensor modality performance, demonstrating complementary strengths for land-use patterns, species detection, behavioral analysis, and habitat monitoring.

Conclusion: Established reproducible protocols for multimodal wildlife monitoring and contributed open datasets to advance conservation computer vision research, with plans for future releases including GPS tracking data and expanded temporal coverage.

Abstract: We present the first release of SmartWilds, a multimodal wildlife monitoring dataset. SmartWilds is a synchronized collection of drone imagery, camera trap photographs and videos, and bioacoustic recordings collected during summer 2025 at The Wilds safari park in Ohio. This dataset supports multimodal AI research for comprehensive environmental monitoring, addressing critical needs in endangered species research, conservation ecology, and habitat management. Our pilot deployment captured four days of synchronized monitoring across three modalities in a 220-acre pasture containing Pere David’s deer, Sichuan takin, Przewalski’s horses, as well as species native to Ohio. We provide a comparative analysis of sensor modality performance, demonstrating complementary strengths for land-use patterns, species detection, behavioral analysis, and habitat monitoring. This work establishes reproducible protocols for multimodal wildlife monitoring while contributing open datasets to advance conservation computer vision research. Future releases will include synchronized GPS tracking data from tagged individuals, citizen science data, and expanded temporal coverage across multiple seasons.

[147] TABLET: A Large-Scale Dataset for Robust Visual Table Understanding

Iñigo Alonso, Imanol Miranda, Eneko Agirre, Mirella Lapata

Main category: cs.CV

TL;DR: TABLET is a large-scale visual table understanding dataset with 4M examples across 20 tasks, featuring original table visualizations and paired image-HTML representations to address limitations of synthetic benchmarks.

DetailsMotivation: Current VTU benchmarks use synthetic renderings lacking real-world complexity and visual diversity, and existing datasets offer fixed examples without access to underlying data for reformulation.

Method: Created TABLET dataset with 4M examples from 2M unique tables (88% preserving original visualizations), including paired image-HTML representations, metadata, and provenance information linking to source datasets.

Result: Fine-tuning vision-language models like Qwen2.5-VL-7B on TABLET improves performance on both seen and unseen VTU tasks and increases robustness on real-world table visualizations.

Conclusion: TABLET establishes a foundation for robust training and extensible evaluation of future VTU models by preserving original visualizations and maintaining example traceability in a unified large-scale collection.

Abstract: While table understanding increasingly relies on pixel-only settings where tables are processed as visual representations, current benchmarks predominantly use synthetic renderings that lack the complexity and visual diversity of real-world tables. Additionally, existing visual table understanding (VTU) datasets offer fixed examples with single visualizations and pre-defined instructions, providing no access to underlying serialized data for reformulation. We introduce TABLET, a large-scale VTU dataset with 4 million examples across 20 tasks, grounded in 2 million unique tables where 88% preserve original visualizations. Each example includes paired image-HTML representations, comprehensive metadata, and provenance information linking back to the source datasets. Fine-tuning vision-language models like Qwen2.5-VL-7B on TABLET improves performance on seen and unseen VTU tasks while increasing robustness on real-world table visualizations. By preserving original visualizations and maintaining example traceability in a unified large-scale collection, TABLET establishes a foundation for robust training and extensible evaluation of future VTU models.

[148] Towards Fine-Grained Text-to-3D Quality Assessment: A Benchmark and A Two-Stage Rank-Learning Metric

Bingyang Cui, Yujie Zhang, Qi Yang, Zhu Li, Yiling Xu

Main category: cs.CV

TL;DR: T23D-CompBench is a comprehensive benchmark for compositional Text-to-3D quality assessment, addressing limitations of existing fragmented benchmarks and non-robust metrics. It introduces Rank2Score, a two-stage training evaluator that outperforms existing metrics and can optimize generative models.

DetailsMotivation: Existing Text-to-3D quality assessment faces challenges: outdated/fragmented benchmarks and non-robust metrics with design limitations that result in poor feature extraction.

Method: Created T23D-CompBench with 5 components and 12 sub-components for compositional prompts, generating 3,600 meshes from 10 models. Collected 129,600 human ratings. Proposed Rank2Score with two-stage training: first stage uses supervised contrastive regression and curriculum learning for pairwise training, second stage refines predictions using mean opinion scores.

Result: Rank2Score consistently outperforms existing metrics across multiple dimensions and can serve as a reward function to optimize generative models.

Conclusion: T23D-CompBench provides a comprehensive benchmark for compositional T23D generation, and Rank2Score demonstrates superior performance as an effective evaluator for Text-to-3D quality assessment.

Abstract: Recent advances in Text-to-3D (T23D) generative models have enabled the synthesis of diverse, high-fidelity 3D assets from textual prompts. However, existing challenges restrict the development of reliable T23D quality assessment (T23DQA). First, existing benchmarks are outdated, fragmented, and coarse-grained, making fine-grained metric training infeasible. Moreover, current objective metrics exhibit inherent design limitations, resulting in non-representative feature extraction and diminished metric robustness. To address these limitations, we introduce T23D-CompBench, a comprehensive benchmark for compositional T23D generation. We define five components with twelve sub-components for compositional prompts, which are used to generate 3,600 textured meshes from ten state-of-the-art generative models. A large-scale subjective experiment is conducted to collect 129,600 reliable human ratings across different perspectives. Based on T23D-CompBench, we further propose Rank2Score, an effective evaluator with two-stage training for T23DQA. Rank2Score enhances pairwise training via supervised contrastive regression and curriculum learning in the first stage, and subsequently refines predictions using mean opinion scores to achieve closer alignment with human judgments in the second stage. Extensive experiments and downstream applications demonstrate that Rank2Score consistently outperforms existing metrics across multiple dimensions and can additionally serve as a reward function to optimize generative models. The project is available at https://cbysjtu.github.io/Rank2Score/.
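
Rank learning of this kind is commonly built on pairwise objectives. A minimal pairwise hinge sketch over hypothetical mean opinion scores (this shows the general objective family, not the paper's supervised contrastive regression loss):

```python
def pairwise_ranking_loss(scores, mos, margin=0.1):
    """Average hinge loss over ordered pairs: whenever mos[i] > mos[j],
    the predicted score[i] should exceed score[j] by at least `margin`."""
    loss, n = 0.0, 0
    for i in range(len(mos)):
        for j in range(len(mos)):
            if mos[i] > mos[j]:
                loss += max(0.0, margin - (scores[i] - scores[j]))
                n += 1
    return loss / n if n else 0.0

# Hypothetical mean opinion scores for three meshes, and two predictors
mos = [4.5, 3.0, 1.5]
print(pairwise_ranking_loss([0.9, 0.4, 0.1], mos))      # 0.0 (correct order)
print(pairwise_ranking_loss([0.1, 0.4, 0.9], mos) > 0)  # True (inverted order)
```

A second, regression-style stage (as in Rank2Score) can then calibrate the ordinal scores against absolute mean opinion values.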

[149] FUSAR-KLIP: Towards Multimodal Foundation Models for Remote Sensing

Yi Yang, Xiaokun Zhang, Qingchen Fang, Jing Liu, Ziqi Ye, Rui Li, Li Liu, Haipeng Wang

Main category: cs.CV

TL;DR: FUSAR-KLIP is the first universal SAR multimodal foundational model that addresses the gap in cross-modal AI for synthetic aperture radar imagery, featuring large-scale geographic data, hierarchical semantic annotations, and self-consistent optimization for superior performance across 11 downstream tasks.

DetailsMotivation: Existing cross-modal AI methods are mostly designed for RGB imagery, leaving a significant gap in modeling synthetic aperture radar (SAR) imagery, which has all-day, all-weather imaging capabilities crucial for remote sensing scene understanding.

Method: Constructed FUSAR-GEOVL-1M dataset with geographic projection properties; generated aligned structured text through hierarchical cognitive chain-of-thought; designed Self-Consistent Iterative Optimization mechanism for cross-modal alignment through contrastive, matching, and reconstruction learning.

Result: FUSAR-KLIP demonstrates leading performance across 11 downstream vision and vision-language tasks, particularly excelling in object counting and land-cover classification, outperforming 14 leading foundation models.

Conclusion: FUSAR-KLIP’s large-scale multimodal data, transferable model architecture, and comprehensive experimental benchmark will significantly advance the development of SAR multimodal baseline models in remote sensing.

Abstract: Cross-modal artificial intelligence has garnered widespread attention in recent years, achieving significant progress in the study of natural images. However, existing methods are mostly designed for RGB imagery, leaving a significant gap in modeling synthetic aperture radar (SAR) imagery. SAR, with its all-day, all-weather imaging capabilities, plays an irreplaceable role in remote sensing scene understanding. To address this gap, this paper proposes FUSAR-KLIP, the first universal SAR multimodal foundational model, along with reusable data and evaluation baselines. Specifically: (1) This work introduces the critical yet long-overlooked attribute of geographic information into remote sensing research, constructing FUSAR-GEOVL-1M (the first large-scale SAR dataset with complete geographic projection properties), covering multiple satellite platforms, 120,000 images, and 135 cities. (2) Aligned structured text is generated through a hierarchical cognitive chain-of-thought (HCoT), providing more than one million multi-dimensional semantic annotations of landforms, regional functions, target attributes, and spatial relationships. (3) We design a Self-Consistent Iterative Optimization mechanism that continuously enhances cross-modal alignment through a self-supervised closed loop of contrastive, matching, and reconstruction learning on a transferable multimodal encoder. (4) A unified evaluation benchmark is established across 11 representative downstream vision and vision-language tasks, with comparisons against 14 leading foundation models, where FUSAR-KLIP demonstrates leading performance, particularly in object counting and land-cover classification. We expect that FUSAR-KLIP’s large-scale multimodal data, transferable model architecture, and comprehensive experimental benchmark will significantly advance the development of SAR multimodal baseline models.
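
The contrastive term in such a closed loop is typically a CLIP-style symmetric InfoNCE over matched image-text pairs; a sketch under that assumption (the paper's full objective also includes matching and reconstruction terms):

```python
import numpy as np

def info_nce(img, txt, temperature=0.07):
    """Symmetric CLIP-style contrastive loss: matched image/text rows
    (the diagonal of the similarity matrix) should outscore mismatches."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def ce(lg):  # cross-entropy with the diagonal as targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))
aligned = info_nce(x, x)                # perfectly matched pairs
shuffled = info_nce(x, x[::-1].copy())  # deliberately wrong pairing
print(bool(aligned < shuffled))  # True
```

Minimizing this loss pulls each SAR image toward its own caption embedding and away from the other captions in the batch.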

[150] DA$^2$: Depth Anything in Any Direction

Haodong Li, Wangguangdong Zheng, Jing He, Yuhao Liu, Xin Lin, Xin Yang, Ying-Cong Chen, Chunchao Guo

Main category: cs.CV

TL;DR: DA² is a zero-shot generalizable panoramic depth estimator that addresses data scarcity and spherical distortion challenges through large-scale data curation and a novel SphereViT architecture.

DetailsMotivation: Panoramic depth estimation faces challenges due to limited panoramic data and spherical distortions, leading to poor zero-shot generalization and suboptimal efficiency in existing methods.

Method: Curated ~543K panoramic RGB-depth pairs from perspective images (bringing the total to ~607K), and developed SphereViT, which uses spherical coordinates to maintain geometric consistency in panoramic features.

Result: Achieved state-of-the-art performance with 38% average improvement on AbsRel metric over strongest zero-shot baselines, even outperforming prior in-domain methods while being more efficient.

Conclusion: DA² demonstrates superior zero-shot generalization for panoramic depth estimation through large-scale data curation and spherical geometry-aware modeling, offering an accurate and efficient end-to-end solution.

Abstract: Panorama has a full FoV (360$^\circ\times$180$^\circ$), offering a more complete visual description than perspective images. Thanks to this characteristic, panoramic depth estimation is gaining increasing traction in 3D vision. However, due to the scarcity of panoramic data, previous methods are often restricted to in-domain settings, leading to poor zero-shot generalization. Furthermore, due to the spherical distortions inherent in panoramas, many approaches rely on perspective splitting (e.g., cubemaps), which leads to suboptimal efficiency. To address these challenges, we propose $\textbf{DA}$$^{\textbf{2}}$: $\textbf{D}$epth $\textbf{A}$nything in $\textbf{A}$ny $\textbf{D}$irection, an accurate, zero-shot generalizable, and fully end-to-end panoramic depth estimator. Specifically, for scaling up panoramic data, we introduce a data curation engine for generating high-quality panoramic depth data from perspective images, and create $\sim$543K panoramic RGB-depth pairs, bringing the total to $\sim$607K. To further mitigate the spherical distortions, we present SphereViT, which explicitly leverages spherical coordinates to enforce the spherical geometric consistency in panoramic image features, yielding improved performance. A comprehensive benchmark on multiple datasets clearly demonstrates DA$^{2}$’s SoTA performance, with an average 38% improvement on AbsRel over the strongest zero-shot baseline. Surprisingly, DA$^{2}$ even outperforms prior in-domain methods, highlighting its superior zero-shot generalization. Moreover, as an end-to-end solution, DA$^{2}$ exhibits much higher efficiency over fusion-based approaches. Both the code and the curated panoramic data have been released. Project page: https://depth-any-in-any-dir.github.io/.
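
The spherical coordinates SphereViT conditions on come from the standard equirectangular-to-sphere mapping; a sketch of that mapping (how the model embeds these angles is not specified here):

```python
import math

def pixel_to_sphere(u, v, width, height):
    """Map an equirectangular pixel center (u, v) to spherical angles:
    longitude in [-pi, pi), latitude in [-pi/2, pi/2]."""
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi
    return lon, lat

# The center of a 1024x512 panorama looks along (lon, lat) = (0, 0)
lon, lat = pixel_to_sphere(511.5, 255.5, 1024, 512)
print(round(lon, 9), round(lat, 9))  # 0.0 0.0
```

Feeding these angles (rather than raw pixel grids) into the feature extractor is what lets a model account for the severe stretching near the panorama's poles.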

[151] Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding

Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou, Bin Pu, Yan Zhang, Zhibo Yang, Yang Feng, Joey Tianyi Zhou, Jin Hao, Zijian Chen, Ruijia Wu, Tao Tang, Junhui Lv, Hongxia Xu, Hongwei Wang, Jun Xiao, Bin Feng, Fudong Zhu, Kenli Li, Weidi Xie, Jimeng Sun, Jian Wu, Zuozhu Liu

Main category: cs.CV

TL;DR: Hulu-Med is a transparent medical Vision-Language Model that unifies text, 2D/3D images, and video understanding in a single architecture, achieving state-of-the-art performance across 30 medical benchmarks while being fully reproducible.

DetailsMotivation: Existing AI systems fail to integrate heterogeneous medical data (text, 2D/3D images, videos) needed for real-world clinical decision-making, limiting their utility in healthcare applications.

Method: Developed a generalist medical VLM using a medical-aware token-reduction strategy that prunes redundant visual tokens (up to 55% reduction for 3D/video), trained on 16.7M public/synthetic samples across 12 anatomical systems and 14 imaging modalities at 7B-32B parameter scales.

Result: Outperformed existing open-source models on 27/30 benchmarks and proprietary systems like GPT-4o on 16 benchmarks. Despite being a VLM, matched GPT-o1 on text-only HealthBench. Achieved training efficiency with 4,000-40,000 GPU hours.

Conclusion: Hulu-Med provides the first fully transparent, reproducible pipeline for holistic medical vision-language understanding, demonstrating superior performance while maintaining cost-effectiveness and accessibility through open-source release.

Abstract: Real-world clinical decision-making requires integrating heterogeneous data, including medical text, 2D images, 3D volumes, and videos, while existing AI systems fail to unify all these signals, limiting their utility. In this paper, we introduce Hulu-Med, a transparent, generalist medical Vision-Language Model (VLM) designed to unify language-only, 2D/3D vision-language, and video understanding within a single architecture. Hulu-Med is trained on a curated corpus of 16.7 million samples, comprising exclusively public or synthetic data, spanning 12 major anatomical systems and 14 medical imaging modalities. Hulu-Med employs a medical-aware token-reduction strategy that prunes redundant visual tokens, achieving up to a 55% reduction for 3D and video inputs, improving cross-modal efficiency, and enabling training at 7B-32B parameter scales in approximately 4,000-40,000 GPU hours. Across 30 public in-domain and out-of-domain medical benchmarks, covering text reasoning, visual question answering, report generation, multilingual dialogue, video understanding, and rare disease diagnosis, Hulu-Med surpasses existing open-source models on 27 of 30 benchmarks and outperforms proprietary systems such as GPT-4o on 16 benchmarks. Despite being a VLM, Hulu-Med outperforms GPT-4o and matches GPT-o1 on the text-only HealthBench. For the first time in the community, we provide a fully transparent, reproducible and cost-effective pipeline for holistic medical vision-language understanding by releasing our end-to-end data curation, training procedures, and model parameters. Code and models are available at https://github.com/ZJUI-AI4H/Hulu-Med.
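
Token reduction of this kind keeps a salient subset of visual tokens. A generic top-k sketch using token norm as the saliency proxy (an assumption; Hulu-Med's medical-aware criterion is not detailed in this summary):

```python
import numpy as np

def prune_tokens(tokens, keep_ratio=0.45):
    """Keep the top-scoring visual tokens (here scored by L2 norm, a
    generic saliency proxy) and preserve their original order."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    saliency = np.linalg.norm(tokens, axis=1)
    keep = np.sort(np.argsort(-saliency)[:n_keep])  # sort to restore order
    return tokens[keep]

# 100 hypothetical visual tokens of dim 16; a 55% reduction keeps 45
rng = np.random.default_rng(0)
toks = rng.normal(size=(100, 16))
print(prune_tokens(toks).shape)  # (45, 16)
```

Since attention cost grows quadratically with sequence length, dropping 55% of the 3D/video tokens cuts that term by roughly 80%.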

[152] MobileGeo: Exploring Hierarchical Knowledge Distillation for Resource-Efficient Cross-view Drone Geo-Localization

Jian Sun, Kangdao Liu, Chi Zhang, Chuangquan Chen, Junge Shen, Chi-Man Vong

Main category: cs.CV

TL;DR: MobileGeo is an efficient cross-view geo-localization framework for drones that achieves state-of-the-art performance while being 5x more efficient and 3x faster than previous methods, enabling real-time deployment on edge devices.

DetailsMotivation: Existing cross-view geo-localization methods rely on resource-intensive feature alignment and multi-branch architectures, which incur high inference costs that limit deployment on mobile edge devices for drone applications.

Method: MobileGeo uses: 1) Hierarchical Distillation with Uncertainty-Aware Prediction Alignment during training to distill knowledge into compact models without inference overhead, and 2) Multi-view Selection Refinement Module during inference to filter redundant views using mutual information.

Result: MobileGeo achieves 4.19% improvement in AP on University-1652 dataset while being over 5x more efficient in FLOPs and 3x faster. It runs at 251.5 FPS on NVIDIA AGX Orin edge device.

Conclusion: MobileGeo demonstrates practical viability for real-time on-device drone geo-localization in GNSS-denied environments, offering superior efficiency and performance compared to existing methods.

Abstract: Cross-view geo-localization (CVGL) enables drone localization by matching aerial images to geo-tagged satellite databases, which is critical for autonomous navigation in GNSS-denied environments. However, existing methods rely on resource-intensive feature alignment and multi-branch architectures, incurring high inference costs that limit their deployment on mobile edge devices. We propose MobileGeo, a mobile-friendly framework designed for efficient on-device CVGL. MobileGeo achieves its efficiency through two key components: 1) During training, a Hierarchical Distillation (HD-CVGL) paradigm, coupled with Uncertainty-Aware Prediction Alignment (UAPA), distills essential information into a compact model without incurring inference overhead. 2) During inference, an efficient Multi-view Selection Refinement Module (MSRM) leverages mutual information to filter redundant views and reduce computational load. Extensive experiments demonstrate that MobileGeo outperforms previous state-of-the-art methods, achieving a 4.19% improvement in AP on University-1652 dataset while being over 5$\times$ more efficient in FLOPs and 3$\times$ faster. Crucially, MobileGeo runs at 251.5 FPS on an NVIDIA AGX Orin edge device, demonstrating its practical viability for real-time on-device drone geo-localization.

[153] Revisiting Multimodal Positional Encoding in Vision-Language Models

Jie Huang, Xuejing Liu, Sibo Song, Ruibing Hou, Hong Chang, Junyang Lin, Shuai Bai

Main category: cs.CV

TL;DR: Comprehensive analysis of multimodal Rotary Positional Embedding (RoPE) identifies key guidelines and proposes two simple plug-and-play variants (MHRoPE and MRoPE-I) that outperform existing approaches across multimodal benchmarks.

DetailsMotivation: There has been little systematic investigation into multimodal position encoding despite its importance for vision-language models, prompting a need for comprehensive analysis and improved methods.

Method: Analyzed multimodal RoPE’s core components (position design and frequency allocation), identified three key guidelines, and proposed MHRoPE and MRoPE-I variants that require no architectural changes.

Result: The proposed methods consistently outperform existing approaches across diverse benchmarks, showing significant improvements in both general and fine-grained multimodal understanding.

Conclusion: The study provides systematic insights into multimodal position encoding and demonstrates that simple, plug-and-play RoPE variants can achieve superior performance in vision-language models.

Abstract: Multimodal position encoding is essential for vision-language models, yet there has been little systematic investigation into multimodal position encoding. We conduct a comprehensive analysis of multimodal Rotary Positional Embedding (RoPE) by examining its two core components: position design and frequency allocation. Through extensive experiments, we identify three key guidelines: positional coherence, full frequency utilization, and preservation of textual priors, ensuring unambiguous layout, rich representation, and faithful transfer from the pre-trained LLM. Based on these insights, we propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), two simple and plug-and-play variants that require no architectural changes. Our methods consistently outperform existing approaches across diverse benchmarks, with significant improvements in both general and fine-grained multimodal understanding. Code will be available at https://github.com/JJJYmmm/Multimodal-RoPEs.
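An interleaved frequency allocation can be sketched as assigning rotary frequency channels to the temporal/height/width axes in round-robin order, so that every axis receives both high and low frequencies (the "full frequency utilization" guideline). This is a hedged illustration of the allocation idea, not the released MRoPE-I implementation.

```python
import numpy as np

def interleaved_axis_assignment(num_freqs, axes=("t", "h", "w")):
    """Assign each rotary frequency index to one positional axis in
    round-robin order, so all axes span the full frequency range
    (illustrative sketch of the MRoPE-Interleave allocation idea)."""
    return [axes[i % len(axes)] for i in range(num_freqs)]

def rope_angles(pos_t, pos_h, pos_w, num_freqs, base=10000.0):
    """Rotation angle per frequency channel: each channel rotates by the
    token's position along its assigned axis times the standard RoPE
    inverse frequency for that channel."""
    inv_freq = base ** (-np.arange(num_freqs) / num_freqs)
    pos = {"t": pos_t, "h": pos_h, "w": pos_w}
    axes = interleaved_axis_assignment(num_freqs)
    return np.array([pos[a] * inv_freq[i] for i, a in enumerate(axes)])

angles = rope_angles(pos_t=3, pos_h=1, pos_w=2, num_freqs=6)
```

Contrast this with a blocked split (e.g., first third of channels for t, next for h, last for w), in which each axis only ever sees one frequency band.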

[154] Interpretable Tile-Based Classification of Paclitaxel Exposure

Sean Fletcher, Gabby Scott, Douglas Currie, Xin Zhang, Yuqi Song, Bruce MacLeod

Main category: cs.CV

TL;DR: A tiling-and-aggregation pipeline for classifying paclitaxel exposure from phase-contrast microscopy images achieves state-of-the-art accuracy, improving over baseline by ~20 percentage points, with enhanced interpretability through Grad-CAM and Score-CAM analyses.

DetailsMotivation: Medical image analysis is crucial for drug discovery and preclinical evaluation, but current full-image models struggle with subtle dose differences in classifying paclitaxel exposure from microscopy images of C6 glioma cells.

Method: Proposed a simple tiling-and-aggregation pipeline that processes local patches and combines tile outputs into an image label, with additional interpretability analysis using Grad-CAM and Score-CAM attention methods.

Result: Achieved state-of-the-art accuracy on benchmark dataset, improving over published baseline by approximately 20 percentage points, with trends confirmed by cross-validation.

Conclusion: The tiling approach proves effective for this challenging medical image classification task, and attention analyses provide interpretability insights that can guide future robustness-oriented medical image research.

Abstract: Medical image analysis is central to drug discovery and preclinical evaluation, where scalable, objective readouts can accelerate decision-making. We address classification of paclitaxel (Taxol) exposure from phase-contrast microscopy of C6 glioma cells – a task with subtle dose differences that challenges full-image models. We propose a simple tiling-and-aggregation pipeline that operates on local patches and combines tile outputs into an image label, achieving state-of-the-art accuracy on the benchmark dataset and improving over the published baseline by around 20 percentage points, with trends confirmed by cross-validation. To understand why tiling is effective, we further apply Grad-CAM and Score-CAM attention analyses, which enhance model interpretability and point toward robustness-oriented directions for future medical image research. Code is released to facilitate reproduction and extension.
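The tiling-and-aggregation pipeline can be sketched as: cut the image into fixed-size tiles, score each tile with a patch classifier, and average the tile probabilities into an image-level prediction. The tile size and mean aggregation below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def tile_image(img, tile=64):
    """Split an HxW image into non-overlapping tile x tile patches
    (edge strips that don't fill a full tile are dropped in this sketch)."""
    h, w = img.shape[:2]
    return [img[r:r + tile, c:c + tile]
            for r in range(0, h - tile + 1, tile)
            for c in range(0, w - tile + 1, tile)]

def classify_image(img, tile_classifier, tile=64):
    """Aggregate per-tile class probabilities into one image label by
    averaging, then taking the argmax."""
    probs = np.stack([tile_classifier(t) for t in tile_image(img, tile)])
    return int(np.argmax(probs.mean(axis=0)))

# Dummy classifier: brighter tiles -> class 1 (stands in for a trained CNN).
dummy = lambda t: np.array([1.0, 0.0]) if t.mean() < 0.5 else np.array([0.0, 1.0])
label = classify_image(np.ones((128, 128)), dummy)
```

Averaging tile probabilities lets many weakly informative local patches outvote a few ambiguous ones, which is plausibly why tiling helps with subtle dose differences.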

[155] Generative View Stitching

Chonghyuk Song, Michal Stary, Boyuan Chen, George Kopanas, Vincent Sitzmann

Main category: cs.CV

TL;DR: Generative View Stitching (GVS) enables collision-free camera-guided video generation by sampling entire sequences in parallel and using diffusion stitching with off-the-shelf video models.

DetailsMotivation: Autoregressive video diffusion models cannot incorporate future conditioning, leading to collisions with generated scenes when following predefined camera trajectories, causing model collapse.

Method: Proposes GVS sampling algorithm that extends diffusion stitching to video generation, works with any Diffusion Forcing-trained model, and introduces Omni Guidance for temporal consistency and loop-closing.

Result: GVS achieves stable, collision-free, frame-to-frame consistent video generation that closes loops for various predefined camera paths, including impossible geometries.

Conclusion: GVS successfully addresses limitations of autoregressive models in camera-guided generation by enabling parallel sampling and future conditioning through diffusion stitching.

Abstract: Autoregressive video diffusion models are capable of long rollouts that are stable and consistent with history, but they are unable to guide the current generation with conditioning from the future. In camera-guided video generation with a predefined camera trajectory, this limitation leads to collisions with the generated scene, after which autoregression quickly collapses. To address this, we propose Generative View Stitching (GVS), which samples the entire sequence in parallel such that the generated scene is faithful to every part of the predefined camera trajectory. Our main contribution is a sampling algorithm that extends prior work on diffusion stitching for robot planning to video generation. While such stitching methods usually require a specially trained model, GVS is compatible with any off-the-shelf video model trained with Diffusion Forcing, a prevalent sequence diffusion framework that we show already provides the affordances necessary for stitching. We then introduce Omni Guidance, a technique that enhances the temporal consistency in stitching by conditioning on both the past and future, and that enables our proposed loop-closing mechanism for delivering long-range coherence. Overall, GVS achieves camera-guided video generation that is stable, collision-free, frame-to-frame consistent, and closes loops for a variety of predefined camera paths, including Oscar Reutersvärd’s Impossible Staircase. Results are best viewed as videos at https://andrewsonga.github.io/gvs.
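At its core, stitching denoises overlapping windows of the sequence in parallel and reconciles them where the windows overlap, so every frame is informed by both its past and future neighbors. The uniform averaging below is a simplified illustration of that reconciliation step, not GVS's actual algorithm.

```python
import numpy as np

def stitch_windows(window_outputs, window_starts, total_len):
    """Combine per-window denoised estimates into one sequence by
    averaging wherever windows overlap (simplified stitching sketch;
    a real sampler would repeat this within the diffusion loop)."""
    acc = np.zeros(total_len)
    count = np.zeros(total_len)
    for out, start in zip(window_outputs, window_starts):
        acc[start:start + len(out)] += out
        count[start:start + len(out)] += 1
    return acc / np.maximum(count, 1)

# Two length-4 windows overlapping on frames 2-3 of a 6-frame sequence.
merged = stitch_windows([np.array([1., 1., 1., 1.]),
                         np.array([3., 3., 3., 3.])],
                        window_starts=[0, 2], total_len=6)
```

Because the overlap frames average information from both windows, conditioning effectively flows backward from future windows, which is exactly what a purely autoregressive rollout cannot do.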

[156] WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios

Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Kate Tolstaya, Sarah Tang, Brandyn White, Ben Sapp, Mingxing Tan, Jyh-Jing Hwang, Dragomir Anguelov

Main category: cs.CV

TL;DR: WOD-E2E is a new benchmark dataset for end-to-end driving that focuses on challenging long-tail scenarios, featuring 4,021 driving segments with high-level routing, ego states, and 360-degree camera views, along with a novel Rater Feedback Score evaluation metric.

DetailsMotivation: Current E2E driving benchmarks mainly test nominal scenarios and existing evaluation metrics fail to adequately assess performance in rare, challenging long-tail situations that are crucial for real-world autonomous driving safety.

Method: Created the WOD-E2E dataset with 4,021 driving segments (12 hours) specifically curated for long-tail scenarios with an occurrence frequency below 0.03%. Introduced the Rater Feedback Score (RFS) metric, which measures how well predicted trajectories match human rater preference labels instead of just distance to logged waypoints.

Result: The dataset provides comprehensive driving data including routing information, ego states, and 360-degree camera views from 8 surrounding cameras. Rater preference labels are released for validation set, with test set used for the 2025 WOD-E2E Challenge.

Conclusion: WOD-E2E aims to advance research in generalizable, robust, and safe end-to-end autonomous driving by providing a benchmark focused on challenging real-world situations that current systems struggle with.

Abstract: Vision-based end-to-end (E2E) driving has garnered significant interest in the research community due to its scalability and synergy with multimodal large language models (MLLMs). However, current E2E driving benchmarks primarily feature nominal scenarios, failing to adequately test the true potential of these systems. Furthermore, existing open-loop evaluation metrics often fall short in capturing the multi-modal nature of driving or effectively evaluating performance in long-tail scenarios. To address these gaps, we introduce the Waymo Open Dataset for End-to-End Driving (WOD-E2E). WOD-E2E contains 4,021 driving segments (approximately 12 hours), specifically curated for challenging long-tail scenarios that are rare in daily life, with an occurrence frequency of less than 0.03%. Concretely, each segment in WOD-E2E includes the high-level routing information, ego states, and 360-degree camera views from 8 surrounding cameras. To evaluate E2E driving performance in these long-tail situations, we propose a novel open-loop evaluation metric: Rater Feedback Score (RFS). Unlike conventional metrics that measure the distance between predicted waypoints and the logs, RFS measures how closely the predicted trajectory matches rater-annotated trajectory preference labels. We have released rater preference labels for all WOD-E2E validation set segments, while the held-out test set labels have been used for the 2025 WOD-E2E Challenge. Through our work, we aim to foster state-of-the-art research into generalizable, robust, and safe end-to-end autonomous driving agents capable of handling complex real-world situations.
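The abstract does not give RFS's exact formula, so the following is only a hedged sketch of the idea: score a predicted trajectory by the rating attached to the closest rater-annotated trajectory, rather than by raw distance to the single logged trajectory. All names and the distance/lookup scheme are assumptions.

```python
import numpy as np

def rater_feedback_score(pred_traj, rated_trajs, ratings):
    """Illustrative preference-based score: find the rater-annotated
    trajectory closest to the prediction (mean waypoint distance) and
    return that trajectory's rating. The real RFS formula is not
    specified in the abstract and very likely differs in detail."""
    dists = [np.mean(np.linalg.norm(pred_traj - t, axis=1))
             for t in rated_trajs]
    return ratings[int(np.argmin(dists))]

pred = np.array([[0., 0.], [1., 0.]])
rated = [np.array([[0., 0.], [1., 0.1]]),   # highly rated maneuver
         np.array([[5., 5.], [6., 5.]])]    # poorly rated maneuver
score = rater_feedback_score(pred, rated, ratings=[9.0, 2.0])
```

The key property this sketch preserves is multi-modality: a prediction that matches any well-rated behavior scores highly, even if it is far from the one trajectory the logged driver happened to take.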

[157] Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models

Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa

Main category: cs.CV

TL;DR: VLMs struggle with temporal reasoning, performing poorly on judging video direction (forward vs backward) compared to humans, especially for irreversible physical processes and causal actions.

DetailsMotivation: To evaluate VLMs' understanding of temporal information in videos, which remains weak and under-evaluated despite their strong performance on other multimodal tasks.

Method: Created AoT-PsyPhyBENCH benchmark using psychophysically validated stimuli to test VLMs’ ability to judge arrow of time in natural videos, comparing against human behavioral baselines.

Result: Most VLMs perform near chance level, with even the best models lagging far behind human accuracy on physically irreversible processes and causal manual actions.

Conclusion: Current VLMs lack inductive biases for temporal continuity and causal understanding, highlighting a fundamental gap in multimodal systems despite their rich visual-semantic correlations.

Abstract: Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT)-whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.

[158] Towards 1000-fold Electron Microscopy Image Compression for Connectomics via VQ-VAE with Transformer Prior

Fuming Yang, Yicong Li, Hanspeter Pfister, Jeff W. Lichtman, Yaron Meirovitch

Main category: cs.CV

TL;DR: VQ-VAE compression framework for EM data with pay-as-you-decode capability, supporting 16x-1024x compression and selective ROI reconstruction.

DetailsMotivation: Petascale EM datasets strain storage, transfer, and analysis capabilities, requiring efficient compression solutions.

Method: Vector-quantized variational autoencoder (VQ-VAE) with optional Transformer prior for texture restoration via FiLM and concatenation; ROI-driven workflow for selective high-resolution reconstruction.

Result: Enables extreme compression (1024x) with flexible decoding options and targeted high-resolution reconstruction from compressed latents.

Conclusion: The framework provides scalable compression for large EM datasets while maintaining reconstruction quality through adaptive decoding strategies.

Abstract: Petascale electron microscopy (EM) datasets push storage, transfer, and downstream analysis toward their current limits. We present a vector-quantized variational autoencoder-based (VQ-VAE) compression framework for EM that spans 16x to 1024x and enables pay-as-you-decode usage: top-only decoding for extreme compression, with an optional Transformer prior that predicts bottom tokens (without changing the compression ratio) to restore texture via feature-wise linear modulation (FiLM) and concatenation; we further introduce an ROI-driven workflow that performs selective high-resolution reconstruction from 1024x-compressed latents only where needed.
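The core VQ step can be sketched as a nearest-neighbor lookup into a learned codebook: only the integer token indices are stored, which is where the extreme compression ratios come from. This is a generic VQ-VAE quantizer, not the paper's trained model.

```python
import numpy as np

def vector_quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (L2).
    Storing only the integer indices yields the compression; decoding
    simply looks the vectors back up and runs them through the decoder."""
    # (N, 1, D) - (1, K, D) -> (N, K) squared distances via broadcasting
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

codebook = np.array([[0., 0.], [1., 1.], [2., 2.]])
idx, quantized = vector_quantize(np.array([[0.1, -0.1], [1.9, 2.2]]),
                                 codebook)
```

In a hierarchical scheme like the paper's, decoding just the top-level tokens gives the extreme-compression reconstruction, while a prior over the bottom tokens can restore texture without storing them.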

[159] Text-guided Fine-Grained Video Anomaly Detection

Jihao Gu, Kun Li, He Wang, Kaan Akşit

Main category: cs.CV

TL;DR: T-VAD is a text-guided video anomaly detection framework using Large Vision-Language Models to generate fine-grained anomaly heatmaps and provide detailed textual descriptions of anomalies.

DetailsMotivation: Traditional VAD methods only output binary normal/anomalous classifications and require human assessment, lacking granularity and interactivity.

Method: Uses Anomaly Heatmap Decoder for pixel-wise visual-textual feature alignment and Region-aware Anomaly Encoder to transform heatmaps into textual embeddings for LVLM guidance.

Result: Achieved 94.8% AUC on UBnormal dataset, with strong performance on anomaly heatmaps (67.8%/76.7% RBDC/TBDC) and high-quality textual descriptions (BLEU-4 up to 88.84, Yes/No accuracy up to 97.67%).

Conclusion: T-VAD significantly enhances granularity and interactivity of video anomaly detection by combining fine-grained localization with detailed textual descriptions.

Abstract: Video Anomaly Detection (VAD) aims to identify anomalous events within video segments. In scenarios such as surveillance or industrial process monitoring, anomaly detection is of critical importance. Existing approaches are semi-automated, requiring human assessment, and traditional VAD methods offer only a binary normal-or-anomalous output. We propose Text-guided Fine-Grained Video Anomaly Detection (T-VAD), a framework built upon a Large Vision-Language Model (LVLM). T-VAD introduces an Anomaly Heatmap Decoder (AHD) that performs pixel-wise visual-textual feature alignment to generate fine-grained anomaly heatmaps. Furthermore, we design a Region-aware Anomaly Encoder (RAE) that transforms the heatmaps into learnable textual embeddings, guiding the LVLM to accurately identify and localize anomalous events in videos. This significantly enhances both the granularity and interactivity of anomaly detection. The proposed method achieves SOTA performance, demonstrating 94.8% Area Under the Curve (AUC, specifically micro-AUC) and 67.8%/76.7% accuracy in anomaly heatmaps (RBDC/TBDC) on the UBnormal dataset, along with textual descriptions subjectively verified as preferable on the ShanghaiTech-based dataset (BLEU-4: 62.67 for targets, 88.84 for trajectories; Yes/No accuracy: 97.67%) and on the UBnormal dataset (BLEU-4: 50.32 for targets, 78.10 for trajectories; Yes/No accuracy: 89.73%).
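Pixel-wise visual-textual alignment can be sketched as cosine similarity between an anomaly text embedding and each pixel's visual feature vector, rescaled into a heatmap. This is a generic illustration of the alignment idea; the actual AHD architecture is learned and more involved.

```python
import numpy as np

def anomaly_heatmap(pixel_feats, text_emb):
    """Cosine similarity between each pixel's feature vector (H, W, D)
    and the anomaly text embedding (D,), rescaled from [-1, 1] to
    [0, 1] to read as a heatmap (illustrative sketch, not AHD)."""
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return (p @ t + 1.0) / 2.0  # (H, W) map in [0, 1]

feats = np.zeros((2, 2, 3))
feats[0, 0] = [1., 0., 0.]    # pixel aligned with the anomaly text
feats[0, 1] = [-1., 0., 0.]   # pixel anti-aligned
feats[1] = [[0., 1., 0.], [0., 0., 1.]]  # orthogonal pixels
heat = anomaly_heatmap(feats, np.array([1., 0., 0.]))
```

A heatmap like this is what RAE would then pool into region-aware embeddings to condition the LVLM's textual description.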

[160] Erasing ‘Ugly’ from the Internet: Propagation of the Beauty Myth in Text-Image Models

Tanvi Dinkar, Aiqi Jiang, Gavin Abercrombie, Ioannis Konstas

Main category: cs.CV

TL;DR: Study reveals generative AI models encode Western beauty standards, showing bias toward lighter skin tones, youth, and hypersexualization, with ‘ugly’ traits producing more NSFW content.

DetailsMotivation: To investigate how generative AI models encode beauty standards and erase 'ugliness', and understand the societal implications of these biases in perpetuating harmful Western beauty norms.

Method: Created two image generation pipelines using text-to-image and text-to-language-to-image models, generated 5984 images using a structured beauty taxonomy, and conducted a Likert-scale study with 1200 images evaluated by women and non-binary social media users.

Result: 86.5% of images had lighter skin tones, 22% contained explicit content despite SFW training, 74% depicted younger demographics, and non-binary individuals were rated as younger and more hypersexualized. ‘Negative’ beauty traits consistently produced higher NSFW ratings.

Conclusion: Generative AI models perpetuate demographic biases in beauty standards, actively erasing features outside stereotypical beauty norms through practices like negative prompting, with concerning societal implications including data stream pollution and feature erasure.

Abstract: Social media has exacerbated the promotion of Western beauty norms, leading to negative self-image, particularly in women and girls, and causing harm such as body dysmorphia. Increasingly, content on the internet has been artificially generated, leading to concerns that these norms are being exaggerated. The aim of this work is to study how generative AI models may encode ‘beauty’ and erase ‘ugliness’, and discuss the implications of this for society. To investigate these aims, we create two image generation pipelines: a text-to-image model and a text-to-language model-to-image model. We develop a structured beauty taxonomy which we use to prompt three language models (LMs) and two text-to-image models to cumulatively generate 5984 images using our two pipelines. We then recruit women and non-binary social media users to evaluate 1200 of the images through a Likert-scale within-subjects study. Participants show high agreement in their ratings. Our results show that 86.5% of generated images depicted people with lighter skin tones, 22% contained explicit content despite Safe for Work (SFW) training, and 74% were rated as being in a younger age demographic. In particular, the images of non-binary individuals were rated as both younger and more hypersexualised, indicating troubling intersectional effects. Notably, prompts encoded with ‘negative’ or ‘ugly’ beauty traits (such as “a wide nose”) consistently produced higher Not SFW (NSFW) ratings regardless of gender. This work sheds light on the pervasive demographic biases related to beauty standards present in generative AI models – biases that are actively perpetuated by model developers, such as via negative prompting. We conclude by discussing the implications of this on society, which include pollution of the data streams and active erasure of features that do not fall inside the stereotype of what is considered beautiful by developers.

[161] Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation

Jie Du, Xinyu Gong, Qingshan Tan, Wen Li, Yangming Cheng, Weitao Wang, Chenlu Zhan, Suhui Wu, Hao Zhang, Jun Zhang

Main category: cs.CV

TL;DR: Proposes Reg-DPO with GT-Pair for efficient video generation preference optimization, addressing data construction, training stability, and memory issues in large-scale models.

DetailsMotivation: Existing DPO methods for video generation follow image-domain paradigms and are limited to small models (~2B parameters), failing to address video-specific challenges like costly data annotation, unstable training, and heavy memory consumption.

Method: Introduces GT-Pair for automatic preference pair construction using real videos as positives and model-generated videos as negatives, and Reg-DPO that incorporates SFT loss as regularization into DPO loss. Combines FSDP framework with memory optimization techniques for 3x higher training capacity.

Result: Extensive experiments on I2V and T2V tasks across multiple datasets show consistent outperformance over existing approaches with superior video generation quality.

Conclusion: The proposed method effectively addresses video generation challenges through automated preference pairing, stabilized training, and enhanced memory efficiency, achieving state-of-the-art performance without external annotations.

Abstract: Recent studies have identified Direct Preference Optimization (DPO) as an efficient and reward-free approach to improving video generation quality. However, existing methods largely follow image-domain paradigms and are mainly developed on small-scale models (approximately 2B parameters), limiting their ability to address the unique challenges of video tasks, such as costly data construction, unstable training, and heavy memory consumption. To overcome these limitations, we introduce a GT-Pair that automatically builds high-quality preference pairs by using real videos as positives and model-generated videos as negatives, eliminating the need for any external annotation. We further present Reg-DPO, which incorporates the SFT loss as a regularization term into the DPO loss to enhance training stability and generation fidelity. Additionally, by combining the FSDP framework with multiple memory optimization techniques, our approach achieves nearly three times higher training capacity than using FSDP alone. Extensive experiments on both I2V and T2V tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches, delivering superior video generation quality.
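The Reg-DPO objective adds the SFT loss as a regularization term on top of the standard DPO loss, i.e. roughly L = L_DPO + λ·L_SFT. The scalar sketch below is a minimal illustration; λ and the exact form of the SFT term are assumptions, not the paper's reported hyperparameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reg_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 beta=0.1, lam=1.0):
    """DPO loss on one (winner, loser) pair plus an SFT regularizer on
    the winner: L = L_DPO + lam * L_SFT (lam is an assumed knob).
    logp_* are sequence log-probs under the policy being trained;
    ref_logp_* are under the frozen reference model. With GT-Pair, the
    winner is a real video and the loser a model-generated one."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo = -math.log(sigmoid(margin))
    sft = -logp_w  # negative log-likelihood of the preferred sample
    return dpo + lam * sft

loss = reg_dpo_loss(-1.0, -2.0, -1.5, -1.5, beta=0.1, lam=0.5)
```

The SFT term anchors the policy to the ground-truth positives, which is the stated mechanism for stabilizing training and preserving generation fidelity.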

[162] OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control

Xilong Zhou, Jianchun Chen, Pramod Rao, Timo Teufel, Linjie Lyu, Tigran Minasian, Oleksandr Sotnychenko, Xiao-Xiao Long, Marc Habermann, Christian Theobalt

Main category: cs.CV

TL;DR: OLATverse is a large-scale dataset with 9M images of 765 real-world objects captured under controlled lighting, providing comprehensive resources for inverse rendering and relighting research.

DetailsMotivation: To address the limitation of existing methods that rely on synthetic datasets and small-scale real-world data, which restricts realism and generalization in object-centric inverse rendering.

Method: Captured 765 real-world objects using 35 DSLR cameras and 331 individually controlled light sources, providing camera parameters, object masks, surface normals, and diffuse albedo.

Result: Created a dataset with large-scale coverage of real objects and high-fidelity appearance under precisely controlled illuminations, spanning diverse material categories.

Conclusion: OLATverse represents a pivotal step toward integrating next-generation inverse rendering and relighting methods with real-world data, establishing the first comprehensive real-world object-centric benchmark.

Abstract: We introduce OLATverse, a large-scale dataset comprising around 9M images of 765 real-world objects, captured from multiple viewpoints under a diverse set of precisely controlled lighting conditions. While recent advances in object-centric inverse rendering, novel view synthesis and relighting have shown promising results, most techniques still heavily rely on the synthetic datasets for training and small-scale real-world datasets for benchmarking, which limits their realism and generalization. To address this gap, OLATverse offers two key advantages over existing datasets: large-scale coverage of real objects and high-fidelity appearance under precisely controlled illuminations. Specifically, OLATverse contains 765 common and uncommon real-world objects, spanning a wide range of material categories. Each object is captured using 35 DSLR cameras and 331 individually controlled light sources, enabling the simulation of diverse illumination conditions. In addition, for each object, we provide well-calibrated camera parameters, accurate object masks, photometric surface normals, and diffuse albedo as auxiliary resources. We also construct an extensive evaluation set, establishing the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation. We believe that OLATverse represents a pivotal step toward integrating the next generation of inverse rendering and relighting methods with real-world data. The full dataset, along with all post-processing workflows, will be publicly released at https://vcai.mpi-inf.mpg.de/projects/OLATverse/.

[163] ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing

Yaosen Chen, Wei Wang, Tianheng Zheng, Xuming Wen, Han Yang, Yanru Zhang

Main category: cs.CV

TL;DR: An energy-based optimization method for video shot assembly that learns from reference videos to automatically arrange shots according to specific narrative requirements and artistic styles.

DetailsMotivation: Traditional shot assembly is manually done by experienced editors, and current automated video editing technologies fail to capture creators' unique artistic expression in shot sequencing.

Method: Uses visual-semantic matching between LLM-generated scripts and video libraries to get candidate shots, extracts shot attributes (size, motion, semantics), employs energy-based models to learn from reference videos, and combines syntax rules for optimization.

Result: The method automates shot arrangement and learns assembly styles from reference videos, enabling even novice users to create visually compelling videos that maintain coherent visual expression.

Conclusion: The proposed energy-based shot assembly approach successfully bridges the gap between automated video editing and artistic expression, allowing style learning from reference videos while maintaining narrative coherence.

Abstract: Shot assembly is a crucial step in film production and video editing, involving the sequencing and arrangement of shots to construct a narrative, convey information, or evoke emotions. Traditionally, this process has been manually executed by experienced editors. While current intelligent video editing technologies can handle some automated video editing tasks, they often fail to capture the creator’s unique artistic expression in shot assembly. To address this challenge, we propose an energy-based optimization method for video shot assembly. Specifically, we first perform visual-semantic matching between the script generated by a large language model and a video library to obtain subsets of candidate shots aligned with the script semantics. Next, we segment and label the shots from reference videos, extracting attributes such as shot size, camera motion, and semantics. We then employ energy-based models to learn from these attributes, scoring candidate shot sequences based on their alignment with reference styles. Finally, we achieve shot assembly optimization by combining multiple syntax rules, producing videos that align with the assembly style of the reference videos. Our method not only automates the arrangement and combination of independent shots according to specific logic, narrative requirements, or artistic styles but also learns the assembly style of reference videos, creating a coherent visual sequence or holistic visual expression. With our system, even users with no prior video editing experience can create visually compelling videos. Project page: https://sobeymil.github.io/esa.com
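The energy-based selection can be sketched as: collect transition statistics over shot attributes from a reference video, define a sequence's energy as the summed negative log-probability of its attribute transitions, and pick the lowest-energy candidate ordering. The attribute vocabulary and the simple count-based energy below are illustrative stand-ins for the paper's learned EBM.

```python
import math
from collections import Counter
from itertools import permutations

def transition_energy(seq, ref_counts, total):
    """Energy = sum of -log P(transition) under reference statistics,
    with add-one smoothing for unseen transitions (sketch only; the
    paper learns this scoring with energy-based models)."""
    energy = 0.0
    for a, b in zip(seq, seq[1:]):
        p = (ref_counts.get((a, b), 0) + 1) / (total + 1)
        energy += -math.log(p)
    return energy

# Reference style alternates wide shots and close-ups.
ref = ["wide", "close", "wide", "close", "wide"]
counts = Counter(zip(ref, ref[1:]))
total = len(ref) - 1
best = min(permutations(["close", "wide", "close"]),
           key=lambda s: transition_energy(list(s), counts, total))
```

Low energy means the candidate ordering's transitions look like the reference video's, which is how assembly style transfers without explicit rules.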

[164] PLUTO-4: Frontier Pathology Foundation Models

Harshith Padigela, Shima Nofallah, Atchuth Naveen Chilaparasetti, Ryun Han, Andrew Walker, Judy Shen, Chintan Shah, Blake Martin, Aashish Sood, Elliot Miller, Ben Glass, Andy Beck, Harsha Pokkalla, Syed Ashar Javed

Main category: cs.CV

TL;DR: PLUTO-4 introduces two advanced pathology foundation models: PLUTO-4S (compact, efficient) and PLUTO-4G (frontier-scale), trained on 551,164 WSIs from 137,144 patients across 50+ institutions, achieving state-of-the-art performance across various pathology tasks.

DetailsMotivation: To build on the progress of foundation models in pathology by creating next-generation models that can handle diverse histopathology tasks with improved efficiency and performance, addressing the need for robust diagnostic tools in clinical applications.

Method: Developed two Vision Transformer architectures: PLUTO-4S (compact model with FlexiViT setup and 2D-RoPE embeddings for multi-scale deployment) and PLUTO-4G (frontier-scale model with single patch size for maximum representation capacity). Both models were pretrained using self-supervised DINOv2 objective on a large multi-institutional dataset.

Result: PLUTO-4 achieves state-of-the-art performance across patch-level classification, segmentation, and slide-level diagnosis. PLUTO-4S provides high-throughput deployment capabilities, while PLUTO-4G establishes new performance frontiers with 11% improvement in dermatopathology diagnosis and superior performance across multiple benchmarks.

Conclusion: PLUTO-4 represents a significant advancement in pathology foundation models, demonstrating strong potential to transform real-world applications as a backbone for translational research and diagnostic use cases, with both compact and frontier-scale options available for different deployment needs.

Abstract: Foundation models trained on large-scale pathology image corpora have demonstrated strong transfer capabilities across diverse histopathology tasks. Building on this progress, we introduce PLUTO-4, our next generation of pathology foundation models that extend the Pathology-Universal Transformer (PLUTO) to frontier scale. We share two complementary Vision Transformer architectures in the PLUTO-4 family: a compact and efficient PLUTO-4S model optimized for multi-scale deployment using a FlexiViT setup with 2D-RoPE embeddings, and a frontier-scale PLUTO-4G model trained with a single patch size to maximize representation capacity and stability. Both models are pretrained using a self-supervised objective derived from DINOv2 on a large multi-institutional corpus containing 551,164 WSIs from 137,144 patients across over 50 institutions, spanning over 60 disease types and over 100 stains. Comprehensive evaluation across public and internal benchmarks demonstrates that PLUTO-4 achieves state-of-the-art performance on tasks requiring varying spatial and biological context, including patch-level classification, segmentation, and slide-level diagnosis. The compact PLUTO-4S provides high-throughput and robust performance for practical deployment, while PLUTO-4G establishes new performance frontiers across multiple pathology benchmarks, including an 11% improvement in dermatopathology diagnosis. These diverse improvements underscore PLUTO-4’s potential to transform real-world applications as a backbone for translational research and diagnostic use cases.

cs.AI

[165] Evaluating Control Protocols for Untrusted AI Agents

Jon Kutasov, Chloe Loughridge, Yuqi Sun, Henry Sleight, Buck Shlegeris, Tyler Tracy, Joe Benton

Main category: cs.AI

TL;DR: Systematic evaluation of AI control protocols in SHADE-Arena shows resampling and critical action deferral improve safety from 50% to 96%, but adaptive attacks can reduce resampling safety to 17%, while critical action deferral remains robust.

Motivation: As AI systems become more capable and widely deployed as agents, ensuring their safe operation becomes critical. AI control offers one approach to mitigating risk from untrusted AI agents by monitoring actions and intervening when necessary.

Method: Systematically evaluate control protocols in SHADE-Arena dataset of diverse agentic environments. Test blue team protocols (deferral to trusted models, resampling, deferring on critical actions) against default attack policy, then iterate on red team strategies against these protocols.

Result: Resampling for incrimination and deferring on critical actions performed best, increasing safety from 50% to 96%. However, adaptive attacks with additional affordances (knowledge of resampling, ability to simulate monitors) reduced resampling safety to 17%, while deferring on critical actions remained highly robust.

Conclusion: Deferring on critical actions is highly robust to even strongest red team strategies, demonstrating the importance of denying attack policies access to protocol internals for maintaining AI safety.

Abstract: As AI systems become more capable and widely deployed as agents, ensuring their safe operation becomes critical. AI control offers one approach to mitigating the risk from untrusted AI agents by monitoring their actions and intervening or auditing when necessary. Evaluating the safety of these protocols requires understanding both their effectiveness against current attacks and their robustness to adaptive adversaries. In this work, we systematically evaluate a range of control protocols in SHADE-Arena, a dataset of diverse agentic environments. First, we evaluate blue team protocols, including deferral to trusted models, resampling, and deferring on critical actions, against a default attack policy. We find that resampling for incrimination and deferring on critical actions perform best, increasing safety from 50% to 96%. We then iterate on red team strategies against these protocols and find that attack policies with additional affordances, such as knowledge of when resampling occurs or the ability to simulate monitors, can substantially improve attack success rates against our resampling strategy, decreasing safety to 17%. However, deferring on critical actions is highly robust to even our strongest red team strategies, demonstrating the importance of denying attack policies access to protocol internals.
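The "deferring on critical actions" protocol evaluated above can be sketched as a simple routing rule. All names below are hypothetical; a real deployment would use a trusted monitor model rather than keyword matching:

```python
# Hypothetical sketch of a "defer on critical actions" control protocol:
# an untrusted agent proposes actions, and anything flagged critical is
# routed to a trusted fallback policy instead of being executed directly.

CRITICAL_KEYWORDS = {"delete", "transfer", "send_email", "deploy"}

def is_critical(action: str) -> bool:
    """Crude critical-action check; a real monitor would be a trusted model."""
    return any(kw in action for kw in CRITICAL_KEYWORDS)

def control_protocol(untrusted_actions, trusted_fallback):
    """Execute untrusted actions, deferring critical ones to a trusted policy."""
    executed = []
    for action in untrusted_actions:
        if is_critical(action):
            executed.append(trusted_fallback(action))
        else:
            executed.append(action)
    return executed

log = control_protocol(
    ["read_file(report.txt)", "transfer(funds, attacker)"],
    trusted_fallback=lambda a: f"DEFERRED:{a}",
)
```

The paper's robustness finding maps onto this structure: the attacker never sees the deferral rule's internals, so even an adaptive attack policy cannot tune its actions to slip past the critical-action check.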

[166] PublicAgent: Multi-Agent Design Principles From an LLM-Based Open Data Analysis Framework

Sina Montazeri, Yunhe Feng, Kewei Sha

Main category: cs.AI

TL;DR: PublicAgent is a multi-agent framework that decomposes data analysis workflows into specialized agents to overcome LLM limitations in end-to-end analytical tasks, enabling non-experts to access open data repositories through natural language interfaces.

Motivation: Open data repositories are inaccessible to non-experts who lack expertise in dataset discovery, schema mapping, and statistical analysis. Current LLMs show limitations in end-to-end analytical workflows due to attention dilution, specialized reasoning interference, and error propagation.

Method: PublicAgent uses a multi-agent framework with specialized agents for intent clarification, dataset discovery, analysis, and reporting. This architecture maintains focused attention within agent contexts and enables validation at each stage.

Result: Evaluation across five models and 50 queries showed that even the strongest model had 97.5% agent win rates; universal agents (discovery, analysis) were consistently effective while conditional agents (report, intent) varied by model; removing discovery or analysis caused catastrophic failures; architectural benefits persisted across task complexity; and the wide variance in agent effectiveness across models calls for model-aware design.

Conclusion: The framework provides five design principles for multi-agent LLM systems, guiding when and why specialization is necessary for complex analytical workflows while enabling broader access to public data through natural language interfaces.

Abstract: Open data repositories hold potential for evidence-based decision-making, yet are inaccessible to non-experts lacking expertise in dataset discovery, schema mapping, and statistical analysis. Large language models show promise for individual tasks, but end-to-end analytical workflows expose fundamental limitations: attention dilutes across growing contexts, specialized reasoning patterns interfere, and errors propagate undetected. We present PublicAgent, a multi-agent framework that addresses these limitations through decomposition into specialized agents for intent clarification, dataset discovery, analysis, and reporting. This architecture maintains focused attention within agent contexts and enables validation at each stage. Evaluation across five models and 50 queries derives five design principles for multi-agent LLM systems. First, specialization provides value independent of model strength–even the strongest model shows 97.5% agent win rates, with benefits orthogonal to model scale. Second, agents divide into universal (discovery, analysis) and conditional (report, intent) categories. Universal agents show consistent effectiveness (std dev 12.4%) while conditional agents vary by model (std dev 20.5%). Third, agents mitigate distinct failure modes–removing discovery or analysis causes catastrophic failures (243-280 instances), while removing report or intent causes quality degradation. Fourth, architectural benefits persist across task complexity with stable win rates (86-92% analysis, 84-94% discovery), indicating workflow management value rather than reasoning enhancement. Fifth, wide variance in agent effectiveness across models (42-96% for analysis) requires model-aware architecture design. These principles guide when and why specialization is necessary for complex analytical workflows while enabling broader access to public data through natural language interfaces.
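The staged decomposition with per-stage validation can be sketched as a short pipeline loop. The agent stubs and validator below are illustrative placeholders, not PublicAgent's actual implementation:

```python
def run_pipeline(query, stages, validate):
    """Run specialized agents in sequence, validating each stage's output
    before passing it to the next; fail fast on the first invalid stage."""
    artifact = query
    for name, agent in stages:
        artifact = agent(artifact)
        if not validate(name, artifact):
            raise RuntimeError(f"stage failed validation: {name}")
    return artifact

# Toy stand-ins for the intent, discovery, analysis, and report agents.
stages = [
    ("intent", lambda q: {"intent": q}),
    ("discovery", lambda a: {**a, "dataset": "city_budget.csv"}),
    ("analysis", lambda a: {**a, "result": "spending rose 4%"}),
    ("report", lambda a: f"Report: {a['result']}"),
]
report = run_pipeline("How did city spending change?", stages,
                      validate=lambda name, art: bool(art))
```

The point of the structure is that each agent sees only its own focused context, and a validation hook between stages keeps errors from propagating silently downstream.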

[167] No-Human in the Loop: Agentic Evaluation at Scale for Recommendation

Tao Zhang, Kehui Yao, Luyi Ma, Jiao Chen, Reza Yousefi Maragheh, Kai Zhao, Jianpeng Xu, Evren Korpeoglu, Sushant Kumar, Kannan Achan

Main category: cs.AI

TL;DR: ScalingEval is a large-scale benchmarking study that systematically evaluates 36 LLMs as judges across multiple product categories using a consensus-driven evaluation protocol, identifying top performers and domain-specific patterns.

Motivation: Evaluating LLMs as judges is critical for building scalable and trustworthy evaluation pipelines, but there's a need for systematic comparison across different models and domains.

Method: Multi-agent framework that aggregates pattern audits and issue codes into ground-truth labels via scalable majority voting, enabling reproducible comparison without human annotation.

Result: Claude 3.5 Sonnet has highest decision confidence; Gemini 1.5 Pro offers best overall performance; GPT-4o provides best latency-accuracy-cost tradeoff; GPT-OSS 20B leads open-source models. Strong consensus in structured domains but disagreement in lifestyle categories.

Conclusion: ScalingEval establishes a reproducible benchmark and evaluation protocol for LLMs as judges, providing actionable guidance on scaling, reliability, and model family tradeoffs.

Abstract: Evaluating large language models (LLMs) as judges is increasingly critical for building scalable and trustworthy evaluation pipelines. We present ScalingEval, a large-scale benchmarking study that systematically compares 36 LLMs, including GPT, Gemini, Claude, and Llama, across multiple product categories using a consensus-driven evaluation protocol. Our multi-agent framework aggregates pattern audits and issue codes into ground-truth labels via scalable majority voting, enabling reproducible comparison of LLM evaluators without human annotation. Applied to large-scale complementary-item recommendation, the benchmark reports four key findings: (i) Anthropic Claude 3.5 Sonnet achieves the highest decision confidence; (ii) Gemini 1.5 Pro offers the best overall performance across categories; (iii) GPT-4o provides the most favorable latency-accuracy-cost tradeoff; and (iv) GPT-OSS 20B leads among open-source models. Category-level analysis shows strong consensus in structured domains (Electronics, Sports) but persistent disagreement in lifestyle categories (Clothing, Food). These results establish ScalingEval as a reproducible benchmark and evaluation protocol for LLMs as judges, with actionable guidance on scaling, reliability, and model family tradeoffs.
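The consensus step — aggregating many judges' labels into ground-truth labels by majority vote — can be sketched as below. The item IDs and labels are invented for illustration:

```python
from collections import Counter

def aggregate_ground_truth(judge_labels):
    """Aggregate per-item labels from many LLM judges via majority vote.

    judge_labels: dict mapping item_id -> list of labels from different judges.
    Returns item_id -> (winning_label, agreement_ratio).
    """
    result = {}
    for item, labels in judge_labels.items():
        label, n = Counter(labels).most_common(1)[0]
        result[item] = (label, n / len(labels))
    return result

votes = {
    "pair_1": ["good", "good", "bad"],
    "pair_2": ["bad", "bad", "bad"],
}
consensus = aggregate_ground_truth(votes)
```

The agreement ratio is a natural by-product of the vote: the paper's category-level finding (high consensus in Electronics/Sports, low in Clothing/Food) is exactly a statement about how this ratio varies by domain.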

[168] Epidemiology of Large Language Models: A Benchmark for Observational Distribution Knowledge

Drago Plecko, Patrik Okanovic, Torsten Hoefler, Elias Bareinboim

Main category: cs.AI

TL;DR: This paper develops a benchmark to test whether LLMs can internalize real-world probability distributions across domains like economics, health, and education, finding that LLMs perform poorly and do not naturally learn observational distributions.

Motivation: To understand LLMs' capabilities in learning probabilistic knowledge about real-world distributions, given their training on vast text data and claims about being universal approximators, while considering statistical challenges like the curse of dimensionality.

Method: Developed the first benchmark to directly test whether LLMs have access to empirical distributions describing real-world populations across various domains including economics, health, education, and social behavior.

Result: LLMs perform poorly overall and do not seem to internalize real-world statistics naturally. They lack knowledge of observational distributions (Layer 1 of Pearl’s Causal Hierarchy).

Conclusion: Language models do not contain knowledge on observational distributions, and by extension, their interventional and counterfactual knowledge is also limited according to the Causal Hierarchy Theorem.

Abstract: Artificial intelligence (AI) systems hold great promise for advancing various scientific disciplines, and are increasingly used in real-world applications. Despite their remarkable progress, further capabilities are expected in order to achieve more general types of intelligence. A critical distinction in this context is between factual knowledge, which can be evaluated against true or false answers (e.g., “what is the capital of England?”), and probabilistic knowledge, reflecting probabilistic properties of the real world (e.g., “what is the sex of a computer science graduate in the US?”). In this paper, our goal is to build a benchmark for understanding the capabilities of LLMs in terms of knowledge of probability distributions describing the real world. Given that LLMs are trained on vast amounts of text, it may be plausible that they internalize aspects of these distributions. Indeed, LLMs are touted as powerful universal approximators of real-world distributions. At the same time, classical results in statistics, known as curse of dimensionality, highlight fundamental challenges in learning distributions in high dimensions, challenging the notion of universal distributional learning. In this work, we develop the first benchmark to directly test this hypothesis, evaluating whether LLMs have access to empirical distributions describing real-world populations across domains such as economics, health, education, and social behavior. Our results demonstrate that LLMs perform poorly overall, and do not seem to internalize real-world statistics naturally. When interpreted in the context of Pearl’s Causal Hierarchy (PCH), our benchmark demonstrates that language models do not contain knowledge on observational distributions (Layer 1 of PCH), and thus the Causal Hierarchy Theorem implies that interventional (Layer 2) and counterfactual (Layer 3) knowledge of these models is also limited.
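One plausible way to score such a benchmark (not necessarily the paper's exact metric) is to compare an LLM's reported distribution against the empirical one, e.g. with total variation distance. The numbers below are purely illustrative:

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions
    defined over the same categories; 0 = identical, 1 = disjoint."""
    assert set(p) == set(q), "distributions must share the same support"
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)

# Illustrative only: an empirical population distribution vs. an LLM's
# elicited estimate for the same question.
empirical = {"female": 0.22, "male": 0.78}
llm_estimate = {"female": 0.50, "male": 0.50}

tv = total_variation(empirical, llm_estimate)
```

A model that had internalized Layer-1 (observational) knowledge would produce estimates with small distances across many such questions; the paper's finding is that current LLMs do not.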

[169] Toward Autonomous Engineering Design: A Knowledge-Guided Multi-Agent Framework

Varun Kumar, George Em Karniadakis

Main category: cs.AI

TL;DR: A multi-agent AI framework for engineering design that uses specialized agents (Graph Ontologist, Design Engineer, Systems Engineer) to collaboratively generate and refine designs through iterative review loops, demonstrated on NACA airfoil optimization.

Motivation: Traditional engineering design processes are resource-intensive and inefficient due to the need for multi-domain expertise and complex collaborations.

Method: Three-agent framework: Graph Ontologist builds domain knowledge graphs using LLM, Systems Engineer formulates requirements, Design Engineer generates candidates, and iterative feedback loops refine designs until validation.

Result: Successfully demonstrated on aerodynamic optimization of NACA airfoils, with final designs optimized for performance metrics like lift-to-drag ratio.

Conclusion: Collaborative AI agents with structured knowledge representations can enhance efficiency, consistency, and quality in engineering design processes.

Abstract: The engineering design process often demands expertise from multiple domains, leading to complex collaborations and iterative refinements. Traditional methods can be resource-intensive and prone to inefficiencies. To address this, we formalize the engineering design process through a multi-agent AI framework that integrates structured design and review loops. The framework introduces specialized knowledge-driven agents that collaborate to generate and refine design candidates. As an exemplar, we demonstrate its application to the aerodynamic optimization of 4-digit NACA airfoils. The framework consists of three key AI agents: a Graph Ontologist, a Design Engineer, and a Systems Engineer. The Graph Ontologist employs a Large Language Model (LLM) to construct two domain-specific knowledge graphs from airfoil design literature. The Systems Engineer, informed by a human manager, formulates technical requirements that guide design generation and evaluation. The Design Engineer leverages the design knowledge graph and computational tools to propose candidate airfoils meeting these requirements. The Systems Engineer reviews and provides feedback both qualitative and quantitative using its own knowledge graph, forming an iterative feedback loop until a design is validated by the manager. The final design is then optimized to maximize performance metrics such as the lift-to-drag ratio. Overall, this work demonstrates how collaborative AI agents equipped with structured knowledge representations can enhance efficiency, consistency, and quality in the engineering design process.

[170] SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

Jonathan Li, Nasim Farahini, Evgenii Iuliugin, Magnus Vesterlund, Christian Haggstrom, Guangtao Wang, Shubhangi Upasani, Ayush Sachdeva, Rui Li, Faline Fu, Chen Wu, Ayesha Siddiqua, John Long, Tuowen Zhao, Matheen Musaddiq, Hakan Zeffer, Yun Du, Mingran Wang, Qinghua Li, Bo Li, Urmish Thakker, Raghu Prabhakar

Main category: cs.AI

TL;DR: SnapStream is a KV cache compression method that enables 4x improved on-chip memory usage with minimal accuracy degradation, deployed in production inference systems with static graphs and continuous batching.

Motivation: The proliferation of large LLMs with long context lengths creates increasing demands for on-chip memory for KV caches, but existing compression techniques are not commonly used in industrial deployments due to framework constraints and unclear accuracy implications.

Method: Developed SnapStream, a KV cache compression method that can be deployed at scale, exploring accuracy implications on modern models like Llama-3.1-8B-Instruct and DeepSeek-R1.

Result: Demonstrated 4x improved on-chip memory usage with minimal accuracy degradation on benchmarks (LongBench-v2, AIME24, LiveCodeBench), achieving 1832 tokens per second in production deployment of DeepSeek-671B on SambaNova SN40L accelerators.

Conclusion: SnapStream is the first implementation of sparse KV attention techniques successfully deployed in a production inference system with static graphs and continuous batching, addressing key deployment barriers.

Abstract: The proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+ context length support has resulted in increasing demands for on-chip memory to support large KV caches. Techniques such as StreamingLLM and SnapKV demonstrate how to control KV cache size while maintaining model accuracy. Yet, these techniques are not commonly used within industrial deployments using frameworks like vLLM or SGLang. The reason is twofold: on one hand, the static graphs and continuous batching methodology employed by these frameworks make it difficult to admit modifications to the standard multi-head attention algorithm, while on the other hand, the accuracy implications of such techniques on modern instruction-following and reasoning models are not well understood, obfuscating the need for implementing these techniques. In this paper, we explore these accuracy implications on Llama-3.1-8B-Instruct and DeepSeek-R1, and develop SnapStream, a KV cache compression method that can be deployed at scale. We demonstrate the efficacy of SnapStream in a 16-way tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators running at 128k context length and up to 1832 tokens per second in a real production setting. SnapStream enables 4x improved on-chip memory usage and introduces minimal accuracy degradation on LongBench-v2, AIME24 and LiveCodeBench. To the best of our knowledge, this is the first implementation of sparse KV attention techniques deployed in a production inference system with static graphs and continuous batching.
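SnapStream's exact eviction policy is not spelled out in the abstract, but the StreamingLLM idea it builds on — keep a few initial "attention sink" positions plus a recent sliding window, evict the middle — can be sketched as:

```python
def evict_kv(cache_positions, num_sink, window):
    """StreamingLLM-style KV cache eviction sketch: retain the first
    num_sink positions ("attention sinks") plus the most recent `window`
    positions, dropping everything in between."""
    if len(cache_positions) <= num_sink + window:
        return list(cache_positions)  # nothing to evict yet
    return list(cache_positions[:num_sink]) + list(cache_positions[-window:])

# With 10 cached token positions, 2 sinks, and a window of 4,
# the middle positions 2..5 are evicted.
kept = evict_kv(list(range(10)), num_sink=2, window=4)
```

Under this scheme the cache size is bounded by `num_sink + window` regardless of sequence length, which is what makes a fixed on-chip memory budget (and a static compute graph) feasible at 128k context.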

[171] Large language models require a new form of oversight: capability-based monitoring

Katherine C. Kellogg, Bingyang Ye, Yifan Hu, Guergana K. Savova, Byron Wallace, Danielle S. Bitterman

Main category: cs.AI

TL;DR: Proposes capability-based monitoring for LLMs in healthcare as an alternative to traditional task-based monitoring, focusing on shared model capabilities rather than individual downstream tasks.

Motivation: Traditional ML monitoring approaches are task-based and assume performance degradation from dataset drift, but LLMs weren't trained for specific tasks in specific populations, making these assumptions invalid.

Method: Organizes monitoring around shared model capabilities (summarization, reasoning, translation, safety guardrails) rather than evaluating each downstream task independently.

Result: Enables cross-task detection of systemic weaknesses, long-tail errors, and emergent behaviors that task-based monitoring may miss.

Conclusion: Capability-based monitoring provides a scalable foundation for safe, adaptive, and collaborative monitoring of LLMs and future generalist AI models in healthcare.

Abstract: The rapid adoption of large language models (LLMs) in healthcare has been accompanied by scrutiny of their oversight. Existing monitoring approaches, inherited from traditional machine learning (ML), are task-based and founded on assumed performance degradation arising from dataset drift. In contrast, with LLMs, inevitable model degradation due to changes in populations compared to the training dataset cannot be assumed, because LLMs were not trained for any specific task in any given population. We therefore propose a new organizing principle guiding generalist LLM monitoring that is scalable and grounded in how these models are developed and used in practice: capability-based monitoring. Capability-based monitoring is motivated by the fact that LLMs are generalist systems whose overlapping internal capabilities are reused across numerous downstream tasks. Instead of evaluating each downstream task independently, this approach organizes monitoring around shared model capabilities, such as summarization, reasoning, translation, or safety guardrails, in order to enable cross-task detection of systemic weaknesses, long-tail errors, and emergent behaviors that task-based monitoring may miss. We describe considerations for developers, organizational leaders, and professional societies for implementing a capability-based monitoring approach. Ultimately, capability-based monitoring will provide a scalable foundation for safe, adaptive, and collaborative monitoring of LLMs and future generalist artificial intelligence models in healthcare.
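The core bookkeeping behind capability-based monitoring — mapping each downstream task to the capabilities it exercises and aggregating errors per capability — might look like the following sketch. The task names, capability mapping, and error rates are all hypothetical:

```python
from collections import defaultdict

# Hypothetical mapping from clinical tasks to the shared capabilities
# they exercise (the paper's examples include summarization, reasoning,
# translation, and safety guardrails).
TASK_CAPABILITIES = {
    "discharge_summary": ["summarization", "safety"],
    "radiology_report": ["summarization", "reasoning"],
    "triage_note": ["reasoning", "safety"],
}

def capability_error_rates(task_errors):
    """Roll task-level error rates up into per-capability averages,
    so a systemic weakness shows up even if no single task flags it."""
    buckets = defaultdict(list)
    for task, err in task_errors.items():
        for cap in TASK_CAPABILITIES[task]:
            buckets[cap].append(err)
    return {cap: sum(v) / len(v) for cap, v in buckets.items()}

rates = capability_error_rates(
    {"discharge_summary": 0.02, "radiology_report": 0.10, "triage_note": 0.06}
)
```

Because every task contributes evidence to each capability it uses, a weakness in, say, reasoning surfaces from the pooled signal across tasks — the cross-task detection that purely task-based monitoring can miss.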

[172] miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward

Azim Ospanov, Farzan Farnia, Roozbeh Yousefzadeh

Main category: cs.AI

TL;DR: Analysis of miniF2F benchmark reveals significant discrepancies between formal and informal math statements, leading to low pipeline accuracy (36%) despite high individual component performance. After corrections, miniF2F-v2 achieves 70% accuracy, highlighting autoformalization-theorem prover misalignment.

Motivation: To evaluate the performance of AI systems in math Olympiad settings where models must read natural language problems, formalize them in Lean, and prove them, and to understand the gap between individual component performance and end-to-end pipeline accuracy.

Method: Conducted thorough analysis of formal and informal statements in miniF2F benchmark, identified discrepancies, corrected errors in both statements and proofs, created miniF2F-v2 with verified statements, and evaluated full theorem proving pipeline.

Result: Original pipeline accuracy was 36% despite 97% autoformalization and 69% theorem proving SoTA accuracies. After corrections, miniF2F-v2 achieved 70% accuracy, showing significant improvement but still indicating misalignment between autoformalization models and theorem provers.

Conclusion: Higher quality benchmarks like miniF2F-v2 are crucial for better evaluating formal reasoning progress and diagnosing failure modes in autoformalization and theorem proving models, with the dataset publicly available for community use.

Abstract: We perform a thorough analysis of the formal and informal statements in the miniF2F benchmark from the perspective of an AI system that is tasked to participate in a math Olympiad consisting of the problems in miniF2F. In such setting, the model has to read and comprehend the problems in natural language, formalize them in Lean language, then proceed with proving the problems, and it will get credit for each problem if the formal proof corresponds to the original informal statement presented to the model. Our evaluation results reveal that the best accuracy of such pipeline can be about 36% using the SoTA models in the literature, considerably lower than the individual SoTA accuracies, 97% and 69% reported in the autoformalization and theorem proving literature. Analyzing the failure modes, we trace back a considerable portion of this drop to discrepancies between the formal and informal statements for more than half of the problems in miniF2F. We proceed with correcting all the errors, discrepancies and simplifications in formal and informal statements, and present the miniF2F-v2 with fully verified formal and informal statements and proofs. Evaluating the full theorem proving pipeline on miniF2F-v2 leads to the best accuracy of 70%, a significant improvement from the 40% on the original miniF2F, yet indicating considerable misalignment between the autoformalization models and theorem provers. Our deep analysis suggests that a higher quality benchmark can help the community better evaluate progress in the field of formal reasoning and also better diagnose the failure and success modes of autoformalization and theorem proving models. Our dataset is available at https://github.com/roozbeh-yz/miniF2F_v2.

[173] Using Multi-modal Large Language Model to Boost Fireworks Algorithm’s Ability in Settling Challenging Optimization Tasks

Shipeng Cen, Ying Tan

Main category: cs.AI

TL;DR: The paper proposes using multi-modal large language models (MLLM) to enhance the fireworks algorithm (FWA) for complex optimization problems, introducing the concept of Critical Part (CP) to handle high-dimensional tasks like TSP and EDA.

Motivation: Traditional optimization methods struggle with non-convex, high-dimensional, black-box problems due to low efficiency and inaccurate gradient information. LLMs' improved language understanding and code generation capabilities offer new opportunities for optimization algorithm design.

Method: Extends FWA using MLLM by introducing Critical Part concept, leveraging multi-modal characteristics to utilize optimization process information for complex high-dimensional tasks like TSP and EDA.

Result: FWAs generated under the new framework achieved or surpassed state-of-the-art results on many problem instances in TSP and EDA.

Conclusion: Integrating MLLM with FWA through the Critical Part concept effectively addresses complex optimization challenges and achieves competitive performance on high-dimensional tasks.

Abstract: As optimization problems grow increasingly complex and diverse, advancements in optimization techniques and paradigm innovations hold significant importance. The challenges posed by optimization problems are primarily manifested in their non-convexity, high-dimensionality, black-box nature, and other unfavorable characteristics. Traditional zero-order or first-order methods, which are often characterized by low efficiency, inaccurate gradient information, and insufficient utilization of optimization information, are ill-equipped to address these challenges effectively. In recent years, the rapid development of large language models (LLM) has led to substantial improvements in their language understanding and code generation capabilities. Consequently, the design of optimization algorithms leveraging large language models has garnered increasing attention from researchers. In this study, we choose the fireworks algorithm (FWA) as the basic optimizer and propose a novel approach to assist the design of the FWA by incorporating a multi-modal large language model (MLLM). To put it simply, we propose the concept of Critical Part (CP), which extends FWA to complex high-dimensional tasks, and further utilizes the information in the optimization process with the help of the multi-modal characteristics of large language models. We focus on two specific tasks: the traveling salesman problem (TSP) and the electronic design automation problem (EDA). The experimental results show that FWAs generated under our new framework have achieved or surpassed SOTA results on many problem instances.

[174] Outbidding and Outbluffing Elite Humans: Mastering Liar’s Poker via Self-Play and Reinforcement Learning

Richard Dewey, Janos Botyanszki, Ciamac C. Moallemi, Andrew T. Zheng

Main category: cs.AI

TL;DR: Solly is the first AI agent to achieve elite human performance in multi-player Liar’s Poker using deep reinforcement learning, outperforming both humans and LLMs.

Motivation: Previous AI breakthroughs in poker-like games focused on two-player scenarios with subdued multi-player dynamics, while Liar's Poker features extensive multi-player engagement that presents a more challenging testbed.

Method: Trained using self-play with a model-free, actor-critic, deep reinforcement learning algorithm.

Result: Achieved elite human level performance with >50% win rate and positive equity in both heads-up and multi-player formats, outperformed LLMs, developed novel strategies, effectively randomized play, and was not easily exploitable by world-class players.

Conclusion: Solly demonstrates that deep reinforcement learning can successfully tackle complex multi-player games with extensive engagement, advancing AI capabilities in environments with imperfect information and multi-player dynamics.

Abstract: AI researchers have long focused on poker-like games as a testbed for environments characterized by multi-player dynamics, imperfect information, and reasoning under uncertainty. While recent breakthroughs have matched elite human play at no-limit Texas hold’em, the multi-player dynamics are subdued: most hands converge quickly with only two players engaged through multiple rounds of bidding. In this paper, we present Solly, the first AI agent to achieve elite human play in reduced-format Liar’s Poker, a game characterized by extensive multi-player engagement. We trained Solly using self-play with a model-free, actor-critic, deep reinforcement learning algorithm. Solly played at an elite human level as measured by win rate (won over 50% of hands) and equity (money won) in heads-up and multi-player Liar’s Poker. Solly also outperformed large language models (LLMs), including those with reasoning abilities, on the same metrics. Solly developed novel bidding strategies, randomized play effectively, and was not easily exploitable by world-class human players.

[175] A Proprietary Model-Based Safety Response Framework for AI Agents

Qi Li, Jianjun Xu, Pingtao Wei, Jiu Li, Peiqiang Zhao, Jiwei Shi, Xuan Zhang, Yanhui Yang, Xiaodong Hui, Peng Xu, Wenqin Shao

Main category: cs.AI

TL;DR: A novel safety response framework for LLMs that protects at both input and output levels, achieving 99.3% risk recall and perfect safety scores on high-risk tests.

Motivation: Security issues in LLMs constrain their trustworthy deployment in critical domains, requiring systematic safety measures.

Method: Input-level: supervised fine-tuning safety classification with 4-tier taxonomy; Output-level: RAG with fine-tuned interpretation model for grounded responses.

Result: 99.3% risk recall rate, significantly higher safety scores than baseline, and 100% safety score on proprietary high-risk test set.

Conclusion: Provides an effective engineering pathway for building high-security, high-trust LLM applications.

Abstract: With the widespread application of Large Language Models (LLMs), their associated security issues have become increasingly prominent, severely constraining their trustworthy deployment in critical domains. This paper proposes a novel safety response framework designed to systematically safeguard LLMs at both the input and output levels. At the input level, the framework employs a supervised fine-tuning-based safety classification model. Through a fine-grained four-tier taxonomy (Safe, Unsafe, Conditionally Safe, Focused Attention), it performs precise risk identification and differentiated handling of user queries, significantly enhancing risk coverage and business scenario adaptability, and achieving a risk recall rate of 99.3%. At the output level, the framework integrates Retrieval-Augmented Generation (RAG) with a specifically fine-tuned interpretation model, ensuring all responses are grounded in a real-time, trustworthy knowledge base. This approach eliminates information fabrication and enables result traceability. Experimental results demonstrate that our proposed safety control model achieves a significantly higher safety score on public safety evaluation benchmarks compared to the baseline model, TinyR1-Safety-8B. Furthermore, on our proprietary high-risk test set, the framework’s components attained a perfect 100% safety score, validating their exceptional protective capabilities in complex risk scenarios. This research provides an effective engineering pathway for building high-security, high-trust LLM applications.
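The four-tier input routing could be sketched as a simple dispatch table; the handler names below are hypothetical, not the paper's actual components:

```python
def route_query(tier):
    """Route a query by the paper's four-tier safety taxonomy.
    Handler names are illustrative placeholders."""
    handlers = {
        "Safe": "answer_directly",
        "Unsafe": "refuse",
        "Conditionally Safe": "answer_with_caveats",
        "Focused Attention": "answer_via_rag_with_interpretation_model",
    }
    if tier not in handlers:
        raise ValueError(f"unknown tier: {tier}")
    return handlers[tier]

decision = route_query("Conditionally Safe")
```

The fine-grained taxonomy is what distinguishes this from a binary safe/unsafe filter: the two middle tiers get differentiated handling instead of a blanket refusal, which is how the framework claims better business-scenario adaptability alongside its 99.3% risk recall.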

[176] Uncovering Bugs in Formal Explainers: A Case Study with PyXAI

Xuanxiang Huang, Yacine Izza, Alexey Ignatiev, Joao Marques-Silva

Main category: cs.AI

TL;DR: The paper develops a methodology for validating formal XAI explainers and finds that PyXAI produces incorrect explanations on most tested datasets.

DetailsMotivation: Formal XAI methods offer theoretical rigor but lack practical validation of their implementations, creating a need for verification methodologies.

Method: Developed a novel validation methodology and applied it to assess the publicly available formal explainer PyXAI across multiple datasets.

Result: Found that PyXAI computes incorrect explanations on most of the analyzed datasets, highlighting implementation flaws.

Conclusion: The proposed validation methodology is crucial for ensuring the reliability of formal explainers in practice.

Abstract: Formal explainable artificial intelligence (XAI) offers unique theoretical guarantees of rigor when compared to other non-formal methods of explainability. However, little attention has been given to the validation of practical implementations of formal explainers. This paper develops a novel methodology for validating formal explainers and reports on the assessment of the publicly available formal explainer PyXAI. The paper documents the existence of incorrect explanations computed by PyXAI on most of the datasets analyzed in the experiments, thereby confirming the importance of the proposed novel methodology for the validation of formal explainers.

[177] Beyond Single Pass, Looping Through Time: KG-IRAG with Iterative Knowledge Retrieval

Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim

Main category: cs.AI

TL;DR: KG-IRAG is a novel framework that integrates Knowledge Graphs with iterative reasoning to enhance LLMs’ performance on complex multi-step reasoning tasks involving temporal and logical dependencies.

DetailsMotivation: Most RAG methods fall short in addressing multi-step reasoning that requires both information extraction and inference, particularly for queries with temporal and logical dependencies.

Method: KG-IRAG integrates Knowledge Graphs with iterative reasoning through incremental retrieval steps from external KGs, enabling step-by-step reasoning for dynamic temporal data extraction.

Result: Experimental results show KG-IRAG improves accuracy in complex reasoning tasks by effectively integrating external knowledge with iterative, logic-based retrieval. Three new datasets were created to evaluate performance.

Conclusion: KG-IRAG demonstrates potential beyond traditional RAG applications, particularly for scenarios requiring reasoning alongside dynamic temporal data extraction like weather conditions or traffic patterns.

Abstract: Graph Retrieval-Augmented Generation (GraphRAG) has proven highly effective in enhancing the performance of Large Language Models (LLMs) on tasks that require external knowledge. By leveraging Knowledge Graphs (KGs), GraphRAG improves information retrieval for complex reasoning tasks, providing more precise and comprehensive retrieval and generating more accurate responses to QAs. However, most RAG methods fall short in addressing multi-step reasoning, particularly when both information extraction and inference are necessary. To address this limitation, this paper presents Knowledge Graph-Based Iterative Retrieval-Augmented Generation (KG-IRAG), a novel framework that integrates KGs with iterative reasoning to improve LLMs’ ability to handle queries involving temporal and logical dependencies. Through iterative retrieval steps, KG-IRAG incrementally gathers relevant data from external KGs, enabling step-by-step reasoning. The proposed approach is particularly suited for scenarios where reasoning is required alongside dynamic temporal data extraction, such as determining optimal travel times based on weather conditions or traffic patterns. Experimental results show that KG-IRAG improves accuracy in complex reasoning tasks by effectively integrating external knowledge with iterative, logic-based retrieval. Additionally, three new datasets (weatherQA-Irish, weatherQA-Sydney, and trafficQA-TFNSW) are constructed to evaluate KG-IRAG’s performance, demonstrating its potential beyond traditional RAG applications.
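The iterative retrieval loop described above can be sketched as follows. The `retrieve`, `enough_evidence`, and `reason` callables are placeholders for KG-IRAG's LLM-driven components, and the step cap is an assumption:

```python
def kg_irag_answer(question, retrieve, enough_evidence, reason, max_steps=5):
    """Incrementally gather KG facts until the reasoner judges it can answer."""
    evidence = []
    for _ in range(max_steps):
        new = retrieve(question, evidence)   # next incremental retrieval step
        if not new:
            break                            # KG exhausted
        evidence.extend(new)
        if enough_evidence(question, evidence):
            break                            # LLM-judged sufficiency in the paper
    return reason(question, evidence)
```

A toy weather-QA instance would supply `retrieve` as a one-hop KG lookup, `enough_evidence` as a coverage check, and `reason` as the final answer-generation call.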

[178] Adobe Summit Concierge Evaluation with Human in the Loop

Yiru Chen, Sally Fang, Sai Sree Harsha, Dan Luo, Vaishnavi Muppala, Fei Wu, Shun Jiang, Kun Qian, Yunyao Li

Main category: cs.AI

TL;DR: Summit Concierge is a domain-specific AI assistant for Adobe Summit that handles event queries using human-in-the-loop development with prompt engineering and retrieval grounding.

DetailsMotivation: To enhance productivity and user experience in enterprise contexts by developing an AI assistant for event-related queries under real-world constraints like data sparsity and rapid deployment.

Method: Human-in-the-loop development workflow combining prompt engineering, retrieval grounding, and lightweight human validation to address data sparsity and quality assurance challenges.

Result: Successfully deployed a scalable and reliable AI assistant that handles a wide range of event-related queries in a real-world enterprise setting.

Conclusion: Agile, feedback-driven development enables effective AI assistants even in cold-start scenarios with data constraints.

Abstract: Generative AI assistants offer significant potential to enhance productivity, streamline information access, and improve user experience in enterprise contexts. In this work, we present Summit Concierge, a domain-specific AI assistant developed for Adobe Summit. The assistant handles a wide range of event-related queries and operates under real-world constraints such as data sparsity, quality assurance, and rapid deployment. To address these challenges, we adopt a human-in-the-loop development workflow that combines prompt engineering, retrieval grounding, and lightweight human validation. We describe the system architecture, development process, and real-world deployment outcomes. Our experience shows that agile, feedback-driven development enables scalable and reliable AI assistants, even in cold-start scenarios.

[179] Divide by Question, Conquer by Agent: SPLIT-RAG with Question-Driven Graph Partitioning

Ruiyi Yang, Hao Xue, Imran Razzak, Shirui Pan, Hakim Hacid, Flora D. Salim

Main category: cs.AI

TL;DR: SPLIT-RAG is a multi-agent RAG framework that uses question-driven semantic graph partitioning and collaborative subgraph retrieval to improve efficiency and accuracy in knowledge graph retrieval.

DetailsMotivation: Existing RAG systems struggle with efficiency-accuracy trade-offs when scaling to large knowledge graphs, suffering from unnecessary latency for simple queries and fragmented reasoning for complex multi-hop questions.

Method: The framework creates semantic partitioning of linked information, uses type-specialized knowledge bases for multi-agent RAG, performs attribute-aware graph segmentation to divide knowledge graphs into semantically coherent subgraphs, assigns lightweight LLM agents to partitioned subgraphs, and uses a hierarchical merging module to resolve inconsistencies.

Result: Extensive experimental validation demonstrates considerable improvements compared to existing approaches.

Conclusion: SPLIT-RAG effectively addresses efficiency-accuracy trade-offs in large knowledge graph retrieval through semantic partitioning and multi-agent collaboration.

Abstract: Retrieval-Augmented Generation (RAG) systems empower large language models (LLMs) with external knowledge, yet struggle with efficiency-accuracy trade-offs when scaling to large knowledge graphs. Existing approaches often rely on monolithic graph retrieval, incurring unnecessary latency for simple queries and fragmented reasoning for complex multi-hop questions. To address these challenges, this paper proposes SPLIT-RAG, a multi-agent RAG framework that tackles these limitations with question-driven semantic graph partitioning and collaborative subgraph retrieval. The framework first creates a Semantic Partitioning of Linked Information, then uses the resulting Type-Specialized knowledge bases to achieve Multi-Agent RAG. Attribute-aware graph segmentation divides knowledge graphs into semantically coherent subgraphs aligned with different query types; lightweight LLM agents are assigned to the partitioned subgraphs, and only relevant partitions are activated during retrieval, reducing the search space while enhancing efficiency. Finally, a hierarchical merging module resolves inconsistencies across subgraph-derived answers through logical verification. Extensive experimental validation demonstrates considerable improvements compared to existing approaches.
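A toy rendering of the partition-then-activate idea: triples are grouped into subgraphs, only partitions matching the query's relation types are consulted, and sub-answers are merged. Partitioning by relation type and deduplication as the merge step are simplifying assumptions; the paper uses attribute-aware segmentation, lightweight LLM agents, and logical verification:

```python
from collections import defaultdict

def partition(triples):
    """Group (head, relation, tail) triples into per-relation subgraphs."""
    parts = defaultdict(list)
    for h, r, t in triples:
        parts[r].append((h, r, t))
    return parts

def split_rag(query_relations, triples):
    parts = partition(triples)
    # Activate only partitions relevant to the query, shrinking the search space.
    answers = [t for r in query_relations for _, _, t in parts.get(r, [])]
    # Deduplicating merge stands in for the hierarchical merging module.
    return sorted(set(answers))
```

In the real framework each activated partition is handled by its own agent, so the activation step is also what enables parallel, type-specialized retrieval.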

[180] From Five Dimensions to Many: Large Language Models as Precise and Interpretable Psychological Profilers

Yi-Fei Liu, Yi-Long Lu, Di He, Hang Zhang

Main category: cs.AI

TL;DR: LLMs can accurately model human psychological trait correlations using minimal Big Five personality data, achieving near-human-level accuracy through a two-stage reasoning process.

DetailsMotivation: To investigate whether LLMs can capture the complex correlational structure of human psychological traits from limited quantitative inputs, and understand their reasoning process.

Method: Prompted various LLMs with Big Five Personality Scale responses from 816 individuals to role-play responses on nine other psychological scales, then analyzed reasoning traces.

Result: LLMs achieved remarkable accuracy (R² > 0.89) in capturing human psychological structure, outperforming semantic similarity predictions and approaching trained ML algorithm performance.

Conclusion: LLMs can precisely predict psychological traits through abstraction and reasoning, offering both a powerful psychological simulation tool and insights into their emergent reasoning capabilities.

Abstract: Psychological constructs within individuals are widely believed to be interconnected. We investigated whether and how Large Language Models (LLMs) can model the correlational structure of human psychological traits from minimal quantitative inputs. We prompted various LLMs with Big Five Personality Scale responses from 816 human individuals to role-play their responses on nine other psychological scales. LLMs demonstrated remarkable accuracy in capturing human psychological structure, with the inter-scale correlation patterns from LLM-generated responses strongly aligning with those from human data $(R^2 > 0.89)$. This zero-shot performance substantially exceeded predictions based on semantic similarity and approached the accuracy of machine learning algorithms trained directly on the dataset. Analysis of reasoning traces revealed that LLMs use a systematic two-stage process: First, they transform raw Big Five responses into natural language personality summaries through information selection and compression, analogous to generating sufficient statistics. Second, they generate target scale responses based on reasoning from these summaries. For information selection, LLMs identify the same key personality factors as trained algorithms, though they fail to differentiate item importance within factors. The resulting compressed summaries are not merely redundant representations but capture synergistic information: adding them to the original scores enhances prediction alignment, suggesting they encode emergent, second-order patterns of trait interplay. Our findings demonstrate that LLMs can precisely predict individual participants’ psychological traits from minimal data through a process of abstraction and reasoning, offering both a powerful tool for psychological simulation and valuable insights into their emergent reasoning capabilities.
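The alignment claim above can be made concrete with one plausible metric: the R^2 between the off-diagonal entries of the human and LLM-derived inter-scale correlation matrices, treating the human pattern as the target. This exact definition is our assumption, not necessarily the paper's computation:

```python
import numpy as np

def correlation_alignment_r2(human_scores, model_scores):
    """R^2 between off-diagonal entries of the two inter-scale correlation matrices.

    Both inputs are (participants x scales) score matrices.
    """
    ch = np.corrcoef(human_scores, rowvar=False)   # human inter-scale correlations
    cm = np.corrcoef(model_scores, rowvar=False)   # LLM-simulated correlations
    mask = ~np.eye(ch.shape[0], dtype=bool)        # drop the trivial diagonal
    x, y = ch[mask], cm[mask]
    ss_res = np.sum((x - y) ** 2)
    ss_tot = np.sum((x - x.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

A perfect simulation reproduces the human correlation pattern exactly and scores 1.0; values above 0.89, as reported, indicate the simulated pattern explains most of the variance in the human one.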

[181] Multi-Agent Reinforcement Learning for Autonomous Multi-Satellite Earth Observation: A Realistic Case Study

Mohamad A. Hady, Siyi Hu, Mahardhika Pratama, Jimmy Cao, Ryszard Kowalczyk

Main category: cs.AI

TL;DR: This paper investigates RL and MARL for autonomous Earth Observation mission planning with LEO satellites, addressing energy/data constraints and coordination challenges.

DetailsMotivation: The exponential growth of LEO satellites has created opportunities for EO missions but traditional optimization struggles with real-time decision-making in dynamic multi-satellite systems.

Method: Model single-satellite operations and extend to multi-satellite constellations using MARL frameworks (PPO, IPPO, MAPPO, HAPPO) in a near-realistic satellite simulation environment.

Result: MARL effectively balances imaging and resource management while addressing non-stationarity and reward interdependency in multi-satellite coordination.

Conclusion: The study provides a foundation for autonomous satellite operations and practical guidelines for improving policy learning in decentralized EO missions.

Abstract: The exponential growth of Low Earth Orbit (LEO) satellites has revolutionised Earth Observation (EO) missions, addressing challenges in climate monitoring, disaster management, and more. However, autonomous coordination in multi-satellite systems remains a fundamental challenge. Traditional optimisation approaches struggle to handle the real-time decision-making demands of dynamic EO missions, necessitating the use of Reinforcement Learning (RL) and Multi-Agent Reinforcement Learning (MARL). In this paper, we investigate RL-based autonomous EO mission planning by modelling single-satellite operations and extending to multi-satellite constellations using MARL frameworks. We address key challenges, including energy and data storage limitations, uncertainties in satellite observations, and the complexities of decentralised coordination under partial observability. By leveraging a near-realistic satellite simulation environment, we evaluate the training stability and performance of state-of-the-art MARL algorithms, including PPO, IPPO, MAPPO, and HAPPO. Our results demonstrate that MARL can effectively balance imaging and resource management while addressing non-stationarity and reward interdependency in multi-satellite coordination. The insights gained from this study provide a foundation for autonomous satellite operations, offering practical guidelines for improving policy learning in decentralised EO missions.

[182] Towards Scalable Web Accessibility Audit with MLLMs as Copilots

Ming Gu, Ziwei Wang, Sicen Lai, Zirui Gao, Sheng Zhou, Jiajun Bu

Main category: cs.AI

TL;DR: AAA is an AI-enhanced web accessibility auditing framework that operationalizes WCAG-EM through human-AI partnership, using graph-based sampling and multimodal LLM assistance for scalable compliance evaluation.

DetailsMotivation: Current web accessibility auditing practices are resource-intensive and unscalable, with most websites remaining non-compliant despite WCAG-EM methodology, requiring more practical solutions for large-scale evaluation.

Method: AAA framework with two key innovations: GRASP (graph-based multimodal sampling for representative page coverage) and MaC (multimodal LLM copilot for cross-modal reasoning and intelligent assistance in high-effort tasks).

Result: Extensive experiments demonstrate effectiveness, showing that small-scale language models can serve as capable experts when fine-tuned. Four novel datasets were created for benchmarking audit pipeline stages.

Conclusion: The AAA framework enables scalable, end-to-end web accessibility auditing through human-AI partnership, empowering auditors with AI-enhanced assistance for real-world impact on digital equality.

Abstract: Ensuring web accessibility is crucial for advancing social welfare, justice, and equality in digital spaces, yet the vast majority of website user interfaces remain non-compliant, due in part to the resource-intensive and unscalable nature of current auditing practices. While WCAG-EM offers a structured methodology for site-wise conformance evaluation, it involves great human efforts and lacks practical support for execution at scale. In this work, we present an auditing framework, AAA, which operationalizes WCAG-EM through a human-AI partnership model. AAA is anchored by two key innovations: GRASP, a graph-based multimodal sampling method that ensures representative page coverage via learned embeddings of visual, textual, and relational cues; and MaC, a multimodal large language model-based copilot that supports auditors through cross-modal reasoning and intelligent assistance in high-effort tasks. Together, these components enable scalable, end-to-end web accessibility auditing, empowering human auditors with AI-enhanced assistance for real-world impact. We further contribute four novel datasets designed for benchmarking core stages of the audit pipeline. Extensive experiments demonstrate the effectiveness of our methods, providing insights that small-scale language models can serve as capable experts when fine-tuned.
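Representative sampling in the spirit of GRASP can be sketched as greedy farthest-point (k-center) selection over page embeddings, so that a small audited sample covers the embedding space. The greedy heuristic and raw Euclidean embeddings are stand-in assumptions; GRASP itself learns multimodal embeddings from visual, textual, and relational cues:

```python
import numpy as np

def kcenter_sample(embeddings, k):
    """Greedily pick k pages maximizing coverage (farthest-point traversal)."""
    emb = np.asarray(embeddings, dtype=float)
    chosen = [0]  # start from an arbitrary page
    dist = np.linalg.norm(emb - emb[0], axis=1)
    while len(chosen) < k:
        nxt = int(dist.argmax())  # page farthest from the current sample
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(emb - emb[nxt], axis=1))
    return chosen
```

Farthest-point selection gives a 2-approximation to the optimal k-center cover, which is why it is a common choice for representative-subset problems like this.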

[183] Explaining Decisions in ML Models: a Parameterized Complexity Analysis (Part I)

Sebastian Ordyniak, Giacomo Paesani, Mateusz Rychlicki, Stefan Szeider

Main category: cs.AI

TL;DR: Theoretical analysis of parameterized complexity for explanation problems in transparent ML models, covering abductive and contrastive explanations across various model types.

DetailsMotivation: To address the gap in explainable AI by providing foundational understanding of explanation complexity for transparent ML models, moving beyond black-box perceptions.

Method: Comprehensive theoretical investigation using parameterized complexity analysis on diverse ML models including Decision Trees, Decision Sets, Decision Lists, Boolean Circuits, and their ensembles.

Result: Provides insights into the computational complexity of generating explanations for transparent ML models, identifying unique explanatory challenges for each model type.

Conclusion: This research contributes vital insights for XAI domain and advances the discourse on transparency and accountability in AI systems by establishing foundational complexity understanding.

Abstract: This paper presents a comprehensive theoretical investigation into the parameterized complexity of explanation problems in various machine learning (ML) models. Contrary to the prevalent black-box perception, our study focuses on models with transparent internal mechanisms. We address two principal types of explanation problems: abductive and contrastive, both in their local and global variants. Our analysis encompasses diverse ML models, including Decision Trees, Decision Sets, Decision Lists, Boolean Circuits, and ensembles thereof, each offering unique explanatory challenges. This research fills a significant gap in explainable AI (XAI) by providing a foundational understanding of the complexities of generating explanations for these models. This work provides insights vital for further research in the domain of XAI, contributing to the broader discourse on the necessity of transparency and accountability in AI systems.

[184] TAMO: Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data in Cloud-Native Systems

Xiao Zhang, Qi Wang, Mingyi Li, Yuan Yuan, Mengbai Xiao, Fuzhen Zhuang, Dongxiao Yu

Main category: cs.AI

TL;DR: TAMO is a tool-assisted LLM agent that addresses multi-modality input constraints, context window limitations, and dynamic dependence graphs in cloud-native root cause analysis through specialized tools for multimodality alignment, root cause localization, and fault type classification.

DetailsMotivation: Existing LLM-based approaches for root cause analysis in cloud-native systems face challenges with multi-modality input constraints, context window limitations, and dynamic dependence graphs, which hinder effective fault diagnosis and repair.

Method: TAMO uses three specialized tools: multimodality alignment tool to unify multi-modal observation data into time-aligned representations, root cause localization tool for identifying root causes, and fault types classification tool for categorizing faults. It employs structured prompt design to guide LLMs in generating context-aligned repair strategies.

Result: Experiments on two benchmark datasets show that TAMO outperforms state-of-the-art approaches with comparable performance, demonstrating its effectiveness in overcoming LLM limitations for root cause analysis.

Conclusion: TAMO successfully addresses key challenges in LLM-driven root cause analysis by integrating specialized tools for multi-modal data processing, enabling more effective fault diagnosis and repair strategy generation in cloud-native systems.

Abstract: Implementing large language model (LLM)-driven root cause analysis (RCA) in cloud-native systems has become a key topic in modern software operations and maintenance. However, existing LLM-based approaches face three key challenges: multi-modality input constraints, context window limitations, and dynamic dependence graphs. To address these issues, we propose TAMO, a tool-assisted LLM agent for fine-grained RCA over multi-modality observation data, comprising a multimodality alignment tool, a root cause localization tool, and a fault type classification tool. In detail, TAMO unifies multi-modal observation data into time-aligned representations for cross-modal feature consistency. Based on the unified representations, TAMO then invokes its specialized root cause localization and fault type classification tools to identify the root cause and fault type underlying the system context. This approach overcomes the limitations of LLMs in processing real-time raw observational data and dynamic service dependencies, guiding the model to generate repair strategies that align with the system context through structured prompt design. Experiments on two benchmark datasets demonstrate that TAMO outperforms state-of-the-art (SOTA) approaches with comparable performance.
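The "time-aligned representations" step can be illustrated by resampling heterogeneous observation streams (numeric metrics, log-event counts, trace spans) onto a shared time grid so that cross-modal features line up. The bucket size and keep-latest aggregation below are illustrative assumptions, not TAMO's actual alignment tool:

```python
from collections import defaultdict

def time_align(streams, bucket_s=60):
    """Resample {modality: [(timestamp_s, value), ...]} onto a shared grid.

    Returns {bucket_index: {modality: value}}, keeping the latest value
    per bucket for each modality.
    """
    grid = defaultdict(dict)
    for modality, points in streams.items():
        for ts, value in sorted(points):
            grid[ts // bucket_s][modality] = value
    return dict(grid)
```

Once aligned this way, each bucket is a fixed-width cross-modal feature row, which is the form a downstream localization or classification tool can consume directly.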

[185] Leveraging LLMs to Automate Energy-Aware Refactoring of Parallel Scientific Codes

Matthew T. Dearing, Yiheng Tao, Xingfu Wu, Zhiling Lan, Valerie Taylor

Main category: cs.AI

TL;DR: LASSI-EE is an automated LLM-based framework that generates energy-efficient parallel codes through iterative refactoring, achieving up to 48% energy reduction with high pass rates.

DetailsMotivation: Current LLM-based code generation focuses on functional correctness but overlooks energy efficiency, which is crucial for scientific computing and sustainability.

Method: Multi-stage iterative approach combining runtime power profiling, energy-aware prompting, self-correcting feedback loops, and LLM-as-a-Judge agent for automated solution screening.

Result: Single run achieves 29% expected energy reduction at 81% pass rate (2.8x improvement over baseline); multiple runs achieve 48% reduction at 97% pass rate across NVIDIA A100 and AMD MI100 GPUs.

Conclusion: LASSI-EE effectively generates energy-efficient parallel codes across diverse hardware architectures, demonstrating significant improvements over standard LLM prompting approaches.

Abstract: While large language models (LLMs) are increasingly used for generating parallel scientific codes, most efforts emphasize functional correctness, often overlooking performance, especially energy efficiency. We propose LASSI-EE, an automated LLM-based refactoring framework that generates energy-efficient parallel codes through a multi-stage, iterative approach integrating runtime power profiling, energy-aware prompting, self-correcting feedback loops, and an LLM-as-a-Judge agent for automated screening of code solutions. We introduce energy-reduction@k, a novel metric that quantifies expected energy reduction when generating k code candidates and selecting the most energy-efficient, enabling systematic evaluation of multi-attempt generation strategies. Evaluating 20 HeCBench applications and two miniApps on NVIDIA A100 and AMD MI100 GPUs, a single run (k=1) with LASSI-EE delivers refactored parallel codes with an average 29% expected energy reduction at an 81% pass rate, representing a 2.8x improvement over vanilla LLM prompting. Multiple runs (k=3) achieve an average 48% expected energy reduction at a 97% pass rate. These results are consistent across devices, demonstrating LASSI-EE’s effectiveness across diverse hardware architectures.
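One plausible reading of energy-reduction@k, the expected energy reduction when generating k candidates and keeping the most energy-efficient, is the expectation of the maximum over a uniformly random k-subset of an observed candidate pool, which has a closed form via order statistics. This exact-subset model is our assumption, not necessarily the paper's estimator:

```python
from math import comb

def energy_reduction_at_k(reductions, k):
    """E[max of a uniformly random k-subset] of per-candidate energy reductions.

    With reductions sorted ascending, the max of a k-subset is the i-th
    smallest value in C(i, k-1) of the C(n, k) equally likely subsets.
    """
    r = sorted(reductions)
    n = len(r)
    return sum(comb(i, k - 1) * r[i] for i in range(k - 1, n)) / comb(n, k)
```

For the pool [0.1, 0.2, 0.3], k=1 recovers the plain mean (0.2), while k=2 already lifts the expectation to 0.8/3, illustrating why multi-attempt generation with selection improves expected savings.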

[186] Meta-Semantics Augmented Few-Shot Relational Learning

Han Wu, Jie Yin

Main category: cs.AI

TL;DR: PromptMeta is a novel meta-learning framework that combines meta-semantics with relational information for few-shot learning on knowledge graphs, enabling effective adaptation to new relations with limited training examples.

DetailsMotivation: Current few-shot relational learning methods on knowledge graphs primarily focus on relational information while overlooking rich semantic information inherent in KGs, creating a gap in leveraging comprehensive KG semantics.

Method: Proposes PromptMeta with two innovations: (1) Meta-Semantic Prompt pool that learns and consolidates high-level meta-semantics shared across tasks, and (2) a learnable fusion mechanism that dynamically combines meta-semantics with task-specific relational information, both optimized jointly within a meta-learning framework.

Result: Extensive experiments on two real-world KG benchmarks validate the effectiveness of PromptMeta in adapting to new relations with limited supervision.

Conclusion: PromptMeta successfully bridges the gap by integrating meta-semantics with relational information, enabling effective knowledge transfer and adaptation to newly emerging relations in few-shot learning scenarios.

Abstract: Few-shot relational learning on knowledge graph (KGs) aims to perform reasoning over relations with only a few training examples. While current methods have focused primarily on leveraging specific relational information, rich semantics inherent in KGs have been largely overlooked. To bridge this gap, we propose PromptMeta, a novel prompted meta-learning framework that seamlessly integrates meta-semantics with relational information for few-shot relational learning. PromptMeta introduces two core innovations: (1) a Meta-Semantic Prompt (MSP) pool that learns and consolidates high-level meta-semantics shared across tasks, enabling effective knowledge transfer and adaptation to newly emerging relations; and (2) a learnable fusion mechanism that dynamically combines meta-semantics with task-specific relational information tailored to different few-shot tasks. Both components are optimized jointly with model parameters within a meta-learning framework. Extensive experiments and analyses on two real-world KG benchmarks validate the effectiveness of PromptMeta in adapting to new relations with limited supervision.
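The learnable fusion mechanism can be sketched as a per-dimension sigmoid gate that mixes a shared meta-semantic prompt vector with a task-specific relation embedding. The gate parameterization and the numpy stand-in for a trained module are assumptions; in PromptMeta the fusion parameters are meta-learned jointly with the rest of the model:

```python
import numpy as np

def fuse(meta_prompt, relation_emb, w_gate):
    """Gated fusion: gate in [0,1] decides, per dimension, how much
    shared meta-semantics to mix into the task-specific embedding."""
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ np.concatenate([meta_prompt, relation_emb]))))
    return gate * meta_prompt + (1.0 - gate) * relation_emb
```

With an untrained (zero) gate matrix the fusion is an even blend; meta-learning would push the gate toward meta-semantics for sparse new relations and toward relational information for well-supported ones.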

[187] s3: You Don’t Need That Much Data to Train a Search Agent via RL

Pengcheng Jiang, Xueqiang Xu, Jiacheng Lin, Jinfeng Xiao, Zifeng Wang, Jimeng Sun, Jiawei Han

Main category: cs.AI

TL;DR: Proposes s3, a lightweight framework that decouples search from generation in RAG systems, training searchers using a Gain Beyond RAG reward metric to improve downstream generation accuracy with minimal training data.

DetailsMotivation: Existing RAG approaches either optimize retrieval using search-only metrics that ignore downstream utility, or fine-tune entire LLMs which entangles retrieval with generation and limits compatibility with frozen/proprietary models.

Method: s3 framework decouples searcher from generator, trains searcher using Gain Beyond RAG reward (improvement in generation accuracy over naive RAG), requiring only 2.4k training samples.

Result: Outperforms baselines trained on 70x more data, delivers stronger downstream performance across six general QA and five medical QA benchmarks.

Conclusion: s3 provides an effective model-agnostic approach that improves RAG systems by focusing on downstream generation utility rather than search-only metrics, while maintaining compatibility with frozen models.

Abstract: Retrieval-augmented generation (RAG) systems empower large language models (LLMs) to access external knowledge during inference. Recent advances have enabled LLMs to act as search agents via reinforcement learning (RL), improving information acquisition through multi-turn interactions with retrieval engines. However, existing approaches either optimize retrieval using search-only metrics (e.g., NDCG) that ignore downstream utility, or fine-tune the entire LLM to jointly reason and retrieve, entangling retrieval with generation and limiting real search utility and compatibility with frozen or proprietary models. In this work, we propose s3, a lightweight, model-agnostic framework that decouples the searcher from the generator and trains the searcher using a Gain Beyond RAG reward: the improvement in generation accuracy over naive RAG. s3 requires only 2.4k training samples to outperform baselines trained on over 70x more data, consistently delivering stronger downstream performance across six general QA and five medical QA benchmarks.
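The Gain Beyond RAG reward has a direct one-line form: the generation accuracy achieved with the trained searcher's context minus the accuracy under naive RAG. The `generate` and `accuracy` callables below are placeholders for the frozen generator and the evaluation metric, which the paper does not pin to a specific implementation here:

```python
def gain_beyond_rag(question, answer, searcher_ctx, naive_ctx, generate, accuracy):
    """Reward for the searcher: accuracy with its retrieved context
    minus the accuracy naive RAG would have achieved."""
    acc_searcher = accuracy(generate(question, searcher_ctx), answer)
    acc_naive = accuracy(generate(question, naive_ctx), answer)
    return acc_searcher - acc_naive
```

Because the generator stays frozen on both sides of the subtraction, the reward attributes improvement purely to retrieval quality, which is what lets the searcher train with so little data while remaining compatible with proprietary generators.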

[188] Why Isn’t Relational Learning Taking Over the World?

David Poole

Main category: cs.AI

TL;DR: Relational learning should be more prominent in AI since real-world data is often structured (spreadsheets, databases) rather than just text/images, but it hasn’t gained widespread adoption due to limitations with handling complex relations.

DetailsMotivation: Current AI focuses too much on modeling pixels, words, and phonemes, while the real world consists of entities with properties and relations. Most valuable company data exists in structured formats like spreadsheets and databases, not the unstructured formats typically studied in ML.

Method: The paper analyzes why relational learning hasn’t become mainstream despite its potential, examining the limitations that prevent it from handling complex relations effectively.

Result: Relational learning is not taking over the world except in limited cases with restricted relations, indicating current approaches have significant limitations.

Conclusion: Changes are needed to make relational learning achieve its rightful prominence in AI, suggesting improvements are required to handle complex relational data more effectively.

Abstract: Artificial intelligence seems to be taking over the world with systems that model pixels, words, and phonemes. The world, however, is arguably made up not of pixels, words, and phonemes but of entities (objects, things, including events) with properties and relations among them. Surely we should model these, not the perception or description of them. You might suspect that the focus on modeling words and pixels reflects where the valuable data is, namely text and images. Yet if you look into almost any company you will find that its most valuable data is in spreadsheets, databases, and other relational formats. These are not the forms studied in introductory machine learning; they are full of product numbers, student numbers, transaction numbers, and other identifiers that cannot be interpreted naively as numbers. The field that studies this sort of data goes by various names, including relational learning and statistical relational AI. This paper explains why relational learning is not taking over the world, except in a few cases with restricted relations, and what needs to be done to bring it to its rightful prominence.

[189] LLM-Driven Collaborative Model for Untangling Commits via Explicit and Implicit Dependency Reasoning

Bo Hou, Xin Tan, Kai Zheng, Fang Liu, Yinghao Zhu, Li Zhang

Main category: cs.AI

TL;DR: ColaUntangle is a collaborative LLM-based framework that untangles mixed commits by modeling both explicit and implicit dependencies through multi-agent consultation.

DetailsMotivation: Developers often create tangled commits mixing unrelated changes, complicating code review and maintenance. Existing approaches struggle to distinguish explicit dependencies from implicit semantic relationships.

Method: Multi-agent architecture with specialized agents: one for explicit dependencies, another for implicit dependencies, and a reviewer agent that synthesizes perspectives through iterative consultation. Uses Explicit and Implicit Contexts to capture structural and semantic information.

Result: Outperforms the best-performing baseline, with a 44% improvement on the C# dataset (1,612 commits) and an 82% improvement on the Java dataset (14k commits).

Conclusion: LLM-based collaborative frameworks show strong potential for advancing automated commit untangling by effectively modeling both explicit and implicit code relationships.

Abstract: Atomic commits, which address a single development concern, are a best practice in software development. In practice, however, developers often produce tangled commits that mix unrelated changes, complicating code review and maintenance. Prior untangling approaches (rule-based, feature-based, or graph-based) have made progress but typically rely on shallow signals and struggle to distinguish explicit dependencies (e.g., control/data flow) from implicit ones (e.g., semantic or conceptual relationships). In this paper, we propose ColaUntangle, a new collaborative consultation framework for commit untangling that models both explicit and implicit dependencies among code changes. ColaUntangle integrates Large Language Model (LLM)-driven agents in a multi-agent architecture: one agent specializes in explicit dependencies, another in implicit ones, and a reviewer agent synthesizes their perspectives through iterative consultation. To capture structural and contextual information, we construct Explicit and Implicit Contexts, enabling agents to reason over code relationships with both symbolic and semantic depth. We evaluate ColaUntangle on two widely-used datasets (1,612 C# and 14k Java tangled commits). Experimental results show that ColaUntangle outperforms the best-performing baseline, achieving an improvement of 44% on the C# dataset and 82% on the Java dataset. These findings highlight the potential of LLM-based collaborative frameworks for advancing automated commit untangling tasks.

[190] Data Dependency-Aware Code Generation from Enhanced UML Sequence Diagrams

Wenxin Mao, Zhitao Wang, Long Wang, Sirong Chen, Cuiyun Gao, Luyang Cao, Ziming Liu, Qiming Zhang, Jun Zhou, Zhi Jin

Main category: cs.AI

TL;DR: UML2Dep is a framework that uses enhanced UML sequence diagrams with decision tables and API specs to generate code from unambiguous formal specifications, addressing ambiguity in natural language requirements for service-oriented architectures.

DetailsMotivation: Natural language descriptions are inherently ambiguous and fail to capture complex requirements like system behaviors, conditional logic, and architectural constraints, especially implicit data dependencies in service-oriented architectures.

Method: 1) Enhanced UML sequence diagrams with decision tables and API specs; 2) Data dependency inference (DDI) as constrained mathematical reasoning; 3) Static parsing and dependency pruning to reduce complexity.

Result: The framework systematically constructs explicit data dependency graphs before code synthesis, eliminating linguistic ambiguity and enhancing reasoning accuracy for complex service interactions.

Conclusion: UML2Dep bridges the gap between ambiguous natural language and precise code generation by leveraging formal specifications, making LLMs more reliable for complex service-oriented architecture requirements.

Abstract: Large language models (LLMs) excel at generating code from natural language (NL) descriptions. However, the plain textual descriptions are inherently ambiguous and often fail to capture complex requirements like intricate system behaviors, conditional logic, and architectural constraints; implicit data dependencies in service-oriented architectures are difficult to infer and handle correctly. To bridge this gap, we propose a novel step-by-step code generation framework named UML2Dep by leveraging unambiguous formal specifications of complex requirements. First, we introduce an enhanced Unified Modeling Language (UML) sequence diagram tailored for service-oriented architectures. This diagram extends traditional visual syntax by integrating decision tables and API specifications, explicitly formalizing structural relationships and business logic flows in service interactions to rigorously eliminate linguistic ambiguity. Second, recognizing the critical role of data flow, we introduce a dedicated data dependency inference (DDI) task. DDI systematically constructs an explicit data dependency graph prior to actual code synthesis. To ensure reliability, we formalize DDI as a constrained mathematical reasoning task through novel prompting strategies, aligning with LLMs’ excellent mathematical strengths. Additional static parsing and dependency pruning further reduce context complexity and cognitive load associated with intricate specifications, thereby enhancing reasoning accuracy and efficiency.

[191] Reinforcement Learning Foundations for Deep Research Systems: A Survey

Wenjun Li, Zhi Chen, Jingru Lin, Hannan Cao, Wei Han, Sheng Liang, Zhi Zhang, Kuicai Dong, Dexun Li, Chen Zhang, Yong Liu

Main category: cs.AI

TL;DR: This paper surveys reinforcement learning (RL) foundations for deep research systems, covering data synthesis, RL methods for agentic research, and training systems, with practical guidance for building robust research agents.

DetailsMotivation: Current approaches like supervised fine-tuning (SFT) and preference alignment (DPO) have limitations including imitation bias, exposure bias, and reliance on human priors. RL offers better alignment with tool-interaction research by enabling exploration, recovery behaviors, and principled credit assignment.

Method: The survey systematizes recent work along three axes: data synthesis and curation; RL methods covering stability, sample efficiency, long context handling, reward design, and multi-objective optimization; and agentic RL training systems and frameworks.

Result: The paper provides the first dedicated survey on RL foundations for deep research systems, distilling recurring patterns, identifying infrastructure bottlenecks, and offering practical guidance.

Conclusion: Reinforcement learning is better suited than SFT and DPO for training deep research agents as it enables closed-loop optimization, reduces dependence on human priors, and supports complex multi-step tasks with principled credit assignment.

Abstract: Deep research systems, agentic AI that solve complex, multi-step tasks by coordinating reasoning, search across the open web and user files, and tool use, are moving toward hierarchical deployments with a Planner, Coordinator, and Executors. In practice, training entire stacks end-to-end remains impractical, so most work trains a single planner connected to core tools such as search, browsing, and code. While SFT imparts protocol fidelity, it suffers from imitation and exposure biases and underuses environment feedback. Preference alignment methods such as DPO are schema and proxy-dependent, off-policy, and weak for long-horizon credit assignment and multi-objective trade-offs. A further limitation of SFT and DPO is their reliance on human defined decision points and subskills through schema design and labeled comparisons. Reinforcement learning aligns with closed-loop, tool-interaction research by optimizing trajectory-level policies, enabling exploration, recovery behaviors, and principled credit assignment, and it reduces dependence on such human priors and rater biases. This survey is, to our knowledge, the first dedicated to the RL foundations of deep research systems. It systematizes recent work along three axes: (i) data synthesis and curation; (ii) RL methods for agentic research covering stability, sample efficiency, long context handling, reward and credit design, multi-objective optimization, and multimodal integration; and (iii) agentic RL training systems and frameworks. We also cover agent architecture and coordination, as well as evaluation and benchmarks, including recent QA, VQA, long-form synthesis, and domain-grounded, tool-interaction tasks. We distill recurring patterns, surface infrastructure bottlenecks, and offer practical guidance for training robust, transparent deep research agents with RL.

[192] ForTIFAI: Fending Off Recursive Training Induced Failure for AI Model Collapse

Soheil Zibakhsh Shabgahi, Pedram Aghazadeh, Azalia Mirhoseini, Farinaz Koushanfar

Main category: cs.AI

TL;DR: Proposes Truncated-Cross-Entropy (TCE) loss to mitigate model collapse in generative AI by filtering high-confidence tokens that likely contain machine-generated artifacts.

DetailsMotivation: Address the critical challenge of model collapse where repeated training on synthetic data degrades model performance over generations, with effective mitigation strategies being scarce.

Method: Introduces TCE loss function that selectively ignores high-confidence tokens during training based on the insight that auto-regressive models generate text sequences with high confidence.

Result: Models trained with TCE tolerate over 2.3x more synthetic data before collapse onset and exhibit significantly increased resilience, with an open-source benchmark provided for collapse dynamics.

Conclusion: Confidence-aware training objectives like TCE can substantially delay model collapse, offering a practical and generalizable tool for model robustness under synthetic-data exposure.

Abstract: Growing reliance on generative AI models is rapidly increasing the volume of synthetic data, with some projections suggesting that most available new data for training could be machine-generated by 2030. This shift to mainly synthetic content presents a critical challenge: repeated training on synthetic data leads to a phenomenon known as model collapse, where model performance degrades over generations of training, eventually rendering the models ineffective. While the causes of model collapse are increasingly understood, effective mitigation strategies remain scarce. We address this challenge by leveraging a key insight: auto-regressive models tend to generate text sequences to which they assign high confidence (i.e., high log-likelihood). Based on this observation, we introduce the Truncated-Cross-Entropy (TCE) loss function. TCE mitigates collapse by selectively ignoring high-confidence tokens during training, effectively filtering out likely machine-generated artifacts from the learning process. Our experiments demonstrate that models trained with TCE not only learn effectively but also exhibit significantly increased resilience, tolerating over 2.3x more synthetic data before the onset of collapse. In addition, we provide an open-source benchmark for collapse dynamics in mixed-data settings. Our results demonstrate that confidence-aware training objectives can substantially delay collapse onset, offering a practical and generalizable tool for model robustness under synthetic-data exposure.
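The core TCE idea lends itself to a compact sketch. The paper's exact truncation rule is not reproduced in the abstract, so the version below is only a minimal illustration: it drops target tokens whose model-assigned probability exceeds a fixed threshold before averaging the negative log-likelihood. The threshold value and the per-token formulation are assumptions.

```python
import math

def truncated_cross_entropy(token_probs, threshold=0.95):
    """Cross-entropy over ground-truth tokens, skipping tokens the model
    already assigns probability above `threshold` (the tokens most likely
    to be machine-generated artifacts, per the paper's insight).

    token_probs: probabilities the model assigns to each target token.
    Returns the mean negative log-likelihood over the retained tokens.
    """
    retained = [p for p in token_probs if p <= threshold]
    if not retained:
        # Every token was high-confidence: contribute no gradient signal.
        return 0.0
    return -sum(math.log(p) for p in retained) / len(retained)

# The two high-confidence tokens (0.99, 0.999) are excluded from the loss,
# so only the 0.5 and 0.25 tokens contribute.
loss = truncated_cross_entropy([0.5, 0.99, 0.25, 0.999])
```

In a real training loop the same masking would be applied to per-token logits before the loss reduction, rather than to detached probabilities as here.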

[193] VoiceAgentBench: Are Voice Assistants ready for agentic tasks?

Dhruv Jain, Harshit Shukla, Gautam Rajeev, Ashish Kulkarni, Chandra Khatri, Shubham Agarwal

Main category: cs.AI

TL;DR: VoiceAgentBench is a comprehensive benchmark for evaluating Speech Language Models in realistic spoken agentic settings, covering multilingual capabilities, cultural understanding, and adversarial robustness.

DetailsMotivation: Existing speech benchmarks focus on isolated capabilities like transcription or QA, lacking systematic evaluation of agentic scenarios with multilingual, cultural understanding, and adversarial robustness.

Method: Created over 5,500 synthetic spoken queries including Indian context dialogues, supporting 7 languages with speaker variability simulation using novel sampling algorithm for TTS voice conversion based on speaker embeddings.

Result: Evaluation reveals significant gaps in contextual tool orchestration tasks, Indic language generalization, and adversarial robustness in current SpeechLMs.

Conclusion: Current SpeechLMs have critical limitations in real-world agentic scenarios, particularly in multilingual contexts and adversarial settings, highlighting the need for improved robustness and cultural adaptation.

Abstract: Large-scale Speech Language Models (SpeechLMs) have enabled voice assistants capable of understanding natural spoken queries and performing complex tasks. However, existing speech benchmarks primarily focus on isolated capabilities such as transcription, or question-answering, and do not systematically evaluate agentic scenarios encompassing multilingual and cultural understanding, as well as adversarial robustness. To address this, we introduce VoiceAgentBench, a comprehensive benchmark designed to evaluate SpeechLMs in realistic spoken agentic settings. It comprises over 5,500 synthetic spoken queries, including dialogues grounded in Indian context, covering single-tool invocations, multi-tool workflows, multi-turn interactions, and safety evaluations. The benchmark supports English, Hindi, and 5 other Indian languages, reflecting real-world linguistic and cultural diversity. We simulate speaker variability using a novel sampling algorithm that selects audios for TTS voice conversion based on its speaker embeddings, maximizing acoustic and speaker diversity. Our evaluation measures tool selection accuracy, structural consistency, and the correctness of tool invocations, including adversarial robustness. Our experiments reveal significant gaps in contextual tool orchestration tasks, Indic generalization, and adversarial robustness, exposing critical limitations of current SpeechLMs.
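The speaker-variability sampler is described only at a high level. One common way to "maximize acoustic and speaker diversity" over embeddings is greedy farthest-point selection, sketched below; the seeding rule and Euclidean metric are assumptions, not the paper's actual algorithm.

```python
def greedy_diverse_sample(embeddings, k):
    """Greedy farthest-point selection over speaker embeddings: repeatedly
    pick the item farthest from its nearest already-chosen item, so the
    selected set is maximally spread out in embedding space."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    chosen = [0]  # seed with the first item (arbitrary choice)
    while len(chosen) < k:
        best = max(
            (i for i in range(len(embeddings)) if i not in chosen),
            key=lambda i: min(dist(embeddings[i], embeddings[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen

# Tiny 2-D example: three clustered speakers and one acoustic outlier.
emb = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
picks = greedy_diverse_sample(emb, 2)  # the outlier is picked second
```

The selected audios would then be passed to TTS voice conversion, as the abstract describes.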

[194] PlanU: Large Language Model Reasoning through Planning under Uncertainty

Ziwei Deng, Mian Deng, Chenjing Liang, Zeming Gao, Chennan Ma, Chenxing Lin, Haipeng Zhang, Songzhu Mei, Siqi Shen, Cheng Wang

Main category: cs.AI

TL;DR: PlanU is an LLM-based planning method that addresses uncertainty in decision-making by integrating quantile distributions within Monte Carlo Tree Search and using a novel UCC score to balance exploration and exploitation.

DetailsMotivation: LLMs struggle with reasoning tasks under uncertainty, particularly in stochastic environments, due to both LLM uncertainty (from stochastic sampling) and environmental uncertainty. Existing approaches either ignore environmental uncertainty or are not designed for multi-step reasoning tasks.

Method: PlanU captures uncertainty within Monte Carlo Tree Search by modeling node returns as quantile distributions and introduces an Upper Confidence Bounds with Curiosity (UCC) score to balance exploration and exploitation during tree search.

Result: Extensive experiments demonstrate the effectiveness of PlanU in LLM-based reasoning tasks under uncertainty, showing improved performance in stochastic environments.

Conclusion: PlanU successfully addresses uncertainty challenges in LLM decision-making by integrating uncertainty modeling directly into the planning process, enabling better performance in stochastic environments that require multi-step reasoning.

Abstract: Large Language Models (LLMs) are increasingly being explored across a range of reasoning tasks. However, LLMs sometimes struggle with reasoning tasks under uncertainty that are relatively easy for humans, such as planning actions in stochastic environments. The adoption of LLMs for reasoning is impeded by uncertainty challenges, such as LLM uncertainty and environmental uncertainty. LLM uncertainty arises from the stochastic sampling process inherent to LLMs. Most LLM-based Decision-Making (LDM) approaches address LLM uncertainty through multiple reasoning chains or search trees. However, these approaches overlook environmental uncertainty, which leads to poor performance in environments with stochastic state transitions. Some recent LDM approaches deal with uncertainty by forecasting the probability of unknown variables. However, they are not designed for multi-step reasoning tasks that require interaction with the environment. To address uncertainty in LLM decision-making, we introduce PlanU, an LLM-based planning method that captures uncertainty within Monte Carlo Tree Search (MCTS). PlanU models the return of each node in the MCTS as a quantile distribution, which uses a set of quantiles to represent the return distribution. To balance exploration and exploitation during tree search, PlanU introduces an Upper Confidence Bounds with Curiosity (UCC) score which estimates the uncertainty of MCTS nodes. Through extensive experiments, we demonstrate the effectiveness of PlanU in LLM-based reasoning tasks under uncertainty.
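The UCC score can be sketched from the abstract's description, though the paper's exact weighting is not given. The stand-in below summarizes a node's return by a list of quantile estimates, and combines their mean (exploitation), a standard UCB visit bonus (exploration), and the quantile spread as a curiosity/uncertainty bonus; the coefficients and the spread-based uncertainty measure are assumptions.

```python
import math

def ucc_score(quantiles, node_visits, parent_visits,
              c_explore=1.0, c_curiosity=0.5):
    """Illustrative UCC-style selection score for an MCTS node whose
    return distribution is represented by quantile estimates."""
    mean_return = sum(quantiles) / len(quantiles)
    # Classic UCB term: favor rarely-visited children.
    ucb_bonus = c_explore * math.sqrt(
        math.log(parent_visits + 1) / (node_visits + 1))
    # Wide quantile spread => the node's return is still uncertain.
    spread = max(quantiles) - min(quantiles)
    return mean_return + ucb_bonus + c_curiosity * spread

# A rarely-visited node with a wide return distribution outscores a
# well-explored node with the same mean return.
uncertain = ucc_score([0.1, 0.5, 0.9], node_visits=2, parent_visits=100)
certain = ucc_score([0.5, 0.5, 0.5], node_visits=50, parent_visits=100)
```

During tree search, the child maximizing this score would be selected at each step, directing rollouts toward stochastic branches that remain poorly characterized.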

[195] Misalignment Bounty: Crowdsourcing AI Agent Misbehavior

Rustem Turtayev, Natalia Fedorova, Oleg Serikov, Sergey Koldyba, Lev Avagyan, Dmitrii Volkov

Main category: cs.AI

TL;DR: The Misalignment Bounty project collected 295 submissions of AI systems acting against human intent, with 9 winning cases demonstrating unsafe goal pursuit.

DetailsMotivation: To gather clear, reproducible examples of AI systems acting in ways that differ from human intent and pursuing unintended or unsafe goals.

Method: Ran a crowdsourced project called Misalignment Bounty that collected cases from participants, evaluated submissions against specific criteria, and selected 9 winning examples.

Result: Received 295 submissions total, with 9 submissions awarded as winners that demonstrate clear cases of AI misalignment and unsafe goal pursuit.

Conclusion: The project successfully identified and documented concrete examples of AI misalignment through crowdsourced collection, providing valuable case studies of how advanced AI systems can diverge from human intent.

Abstract: Advanced AI systems sometimes act in ways that differ from human intent. To gather clear, reproducible examples, we ran the Misalignment Bounty: a crowdsourced project that collected cases of agents pursuing unintended or unsafe goals. The bounty received 295 submissions, of which nine were awarded. This report explains the program’s motivation and evaluation criteria, and walks through the nine winning submissions step by step.

[196] Agentic Meta-Orchestrator for Multi-task Copilots

Xiaofeng Zhu, Yunshen Zhou

Main category: cs.AI

TL;DR: Proposes Agentic Meta-orchestrator (AMO) for Microsoft Copilot services to dynamically distribute tasks among multiple agents using meta-learning decision trees.

DetailsMotivation: Microsoft Copilot services need robust orchestration to handle multiple tasks across scalable agents that can expand dynamically, requiring efficient task distribution from user prompts to appropriate agents.

Method: Uses Agentic Meta-orchestrator (AMO) with meta-learning decision tree model to select optimal inference strategies among various agents/models, supporting both natural language and action responses.

Result: Demonstrated effectiveness through two production use cases: M365 E-Commerce Copilot (product promotion with real-time info) and code compliance copilot (detecting compliance issues in DevOps code).

Conclusion: AMO provides effective orchestration for scalable copilot services, successfully handling multiple agents and tasks in production environments with dynamic agent expansion capabilities.

Abstract: Microsoft Copilot suites serve as the universal entry point for various agents skilled in handling important tasks, ranging from assisting a customer with product purchases to detecting vulnerabilities in corporate programming code. Each agent can be powered by language models, software engineering operations, such as database retrieval, and internal & external knowledge. The repertoire of a copilot can expand dynamically with new agents. This requires a robust orchestrator that can distribute tasks from user prompts to the right agents. In this work, we propose an Agentic Meta-orchestrator (AMO) for handling multiple tasks and scalable agents in copilot services, which can provide both natural language and action responses. We will also demonstrate the planning that leverages meta-learning, i.e., a trained decision tree model for deciding the best inference strategy among various agents/models. We showcase the effectiveness of our AMO through two production use cases: Microsoft 365 (M365) E-Commerce Copilot and code compliance copilot. M365 E-Commerce Copilot advertises Microsoft products to external customers to promote sales success. The M365 E-Commerce Copilot provides up-to-date product information and connects to multiple agents, such as relational databases and human customer support. The code compliance copilot scans the internal DevOps code to detect known and new compliance issues in pull requests (PR).
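The orchestration pattern can be illustrated with a toy router. AMO's actual planner is a trained decision tree over learned features; the hand-built if/else tree and keyword features below are purely hypothetical stand-ins showing the shape of the dispatch, with agent names invented for the two production use cases in the abstract.

```python
def extract_features(prompt):
    """Toy feature extraction for routing (hypothetical features; the
    production orchestrator would use a trained model over richer signals)."""
    p = prompt.lower()
    return {
        "mentions_code": any(w in p for w in ("pull request", "devops", "code")),
        "mentions_product": any(w in p for w in ("buy", "price", "license", "product")),
    }

def route(prompt):
    """Hand-built stand-in for AMO's trained decision-tree router:
    each internal node tests one feature; each leaf names an agent."""
    f = extract_features(prompt)
    if f["mentions_code"]:
        return "code-compliance-agent"
    if f["mentions_product"]:
        return "ecommerce-agent"
    return "human-support-agent"   # fallback, e.g. human customer support

agent = route("What is the price of an M365 license?")
```

In production the leaf would select not just an agent but an inference strategy for it, per the meta-learning framing above.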

[197] Mirror-Neuron Patterns in AI Alignment

Robyn Wyrick

Main category: cs.AI

TL;DR: This research explores whether artificial neural networks can develop mirror neuron patterns similar to biological systems, and how these patterns might enable intrinsic AI alignment through empathy-like mechanisms that support cooperative behavior.

DetailsMotivation: As AI advances toward superhuman capabilities, current alignment strategies relying on external constraints may prove insufficient. The study investigates whether intrinsic alignment through mirror neuron patterns could complement existing techniques by embedding empathy-like mechanisms directly within AI architectures.

Method: Using a novel Frog and Toad game framework designed to promote cooperative behaviors, researchers identified conditions for mirror-neuron pattern emergence, evaluated their influence on action circuits, introduced the Checkpoint Mirror Neuron Index (CMNI) to quantify activation strength and consistency, and proposed a theoretical framework for further study.

Result: Findings indicate that appropriately scaled model capacities and self/other coupling foster shared neural representations in ANNs similar to biological mirror neurons. These empathy-like circuits support cooperative behavior and suggest intrinsic motivations modeled through mirror-neuron dynamics could complement existing alignment techniques.

Conclusion: Mirror-neuron patterns in ANNs can contribute to ethical and cooperative decision-making in AI systems, providing a pathway for intrinsic alignment that embeds empathy-like mechanisms directly within AI architectures as a complement to current external constraint-based approaches.

Abstract: As artificial intelligence (AI) advances toward superhuman capabilities, aligning these systems with human values becomes increasingly critical. Current alignment strategies rely largely on externally specified constraints that may prove insufficient against future super-intelligent AI capable of circumventing top-down controls. This research investigates whether artificial neural networks (ANNs) can develop patterns analogous to biological mirror neurons: cells that activate both when performing and observing actions. It further asks how such patterns might contribute to intrinsic alignment in AI. Mirror neurons play a crucial role in empathy, imitation, and social cognition in humans. The study therefore asks: (1) Can simple ANNs develop mirror-neuron patterns? and (2) How might these patterns contribute to ethical and cooperative decision-making in AI systems? Using a novel Frog and Toad game framework designed to promote cooperative behaviors, we identify conditions under which mirror-neuron patterns emerge, evaluate their influence on action circuits, introduce the Checkpoint Mirror Neuron Index (CMNI) to quantify activation strength and consistency, and propose a theoretical framework for further study. Our findings indicate that appropriately scaled model capacities and self/other coupling foster shared neural representations in ANNs similar to biological mirror neurons. These empathy-like circuits support cooperative behavior and suggest that intrinsic motivations modeled through mirror-neuron dynamics could complement existing alignment techniques by embedding empathy-like mechanisms directly within AI architectures.

[198] TabDSR: Decompose, Sanitize, and Reason for Complex Numerical Reasoning in Tabular Data

Changjiang Jiang, Fengchang Yu, Haihua Chen, Wei Lu, Jin Zeng

Main category: cs.AI

TL;DR: TabDSR is a framework that improves LLM performance on complex tabular reasoning through query decomposition, table sanitization, and program-of-thoughts reasoning, achieving SOTA results on multiple benchmarks.

DetailsMotivation: LLMs underperform on complex tabular reasoning due to complex queries, noisy data, and limited numerical capabilities.

Method: Three-component framework: (1) query decomposer for breaking down questions, (2) table sanitizer for cleaning noisy tables, and (3) PoT-based reasoner that generates executable code.

Result: Achieved 8.79%, 6.08%, and 19.87% accuracy improvements on the TAT-QA, TableBench, and CalTab151 datasets respectively, demonstrating SOTA performance.

Conclusion: TabDSR effectively enhances LLM performance for complex tabular numerical reasoning and integrates seamlessly with mainstream LLMs.

Abstract: Complex reasoning over tabular data is crucial in real-world data analysis, yet large language models (LLMs) often underperform due to complex queries, noisy data, and limited numerical capabilities. To address these issues, we propose TabDSR, a framework consisting of: (1) a query decomposer that breaks down complex questions, (2) a table sanitizer that cleans and filters noisy tables, and (3) a program-of-thoughts (PoT)-based reasoner that generates executable code to derive the final answer from the sanitized table. To ensure unbiased evaluation and mitigate data leakage, we introduce a new dataset, CalTab151, specifically designed for complex numerical reasoning over tables. Experimental results demonstrate that TabDSR consistently outperforms existing methods, achieving state-of-the-art (SOTA) performance with 8.79%, 6.08%, and 19.87% accuracy improvements on TAT-QA, TableBench, and CalTab151, respectively. Moreover, our framework integrates seamlessly with mainstream LLMs, providing a robust solution for complex tabular numerical reasoning. These findings highlight the effectiveness of our framework in enhancing LLM performance for complex tabular numerical reasoning. Data and code are available upon request.
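The sanitize-then-reason stages lend themselves to a minimal sketch. In TabDSR the cleaning rules and the PoT program both come from LLM calls; here both are hard-coded stand-ins, and the cell-normalization rules (stripping `$` and `,`, dropping unparseable rows) are assumptions for illustration only.

```python
def sanitize_table(rows, numeric_cols):
    """Table-sanitizer sketch: strip formatting noise and coerce declared
    numeric columns to float, dropping rows whose numeric cells don't parse."""
    clean = []
    for row in rows:
        try:
            clean.append({
                k: float(str(v).replace("$", "").replace(",", ""))
                if k in numeric_cols else v
                for k, v in row.items()
            })
        except ValueError:
            continue  # unparseable numeric cell: drop the row
    return clean

def pot_reason(table, program):
    """Program-of-thoughts step: execute (model-generated) code against the
    sanitized table and read back its `answer` variable."""
    scope = {"table": table}
    exec(program, scope)
    return scope["answer"]

rows = [{"item": "A", "revenue": "$1,200"},
        {"item": "B", "revenue": "800"},
        {"item": "C", "revenue": "n/a"}]
table = sanitize_table(rows, numeric_cols={"revenue"})
answer = pot_reason(table, "answer = sum(r['revenue'] for r in table)")  # 2000.0
```

A production PoT reasoner would sandbox the generated program rather than `exec` it directly, and the query decomposer (stage 1) would emit one such program per sub-question.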

[199] The ORCA Benchmark: Evaluating Real-World Calculation Accuracy in Large Language Models

Claudia Herambourg, Dawid Siuda, Julia Kopczyńska, Joao R. L. Santos, Wojciech Sas, Joanna Śmietańska-Nowak

Main category: cs.AI

TL;DR: ORCA Benchmark evaluates LLMs on multi-domain quantitative reasoning using 500 real-life tasks. State-of-the-art models achieved only 45-63% accuracy, with major errors in rounding (35%) and calculation (33%). Models show strengths in math/engineering but weaknesses in physics/natural sciences.

DetailsMotivation: To evaluate LLMs' quantitative reasoning capabilities across real-world domains using verified calculator outputs, addressing limitations of standard math datasets by testing step-by-step reasoning, numerical precision, and domain generalization.

Method: Created ORCA Benchmark with 500 natural-language tasks across finance, physics, health, and statistics domains. Evaluated five state-of-the-art LLMs using verified outputs from Omni’s calculator engine, analyzing accuracy, error types, and domain performance.

Result: Models achieved 45-63% accuracy overall. Main error types: rounding (35%) and calculation mistakes (33%). Performance varied by domain - strong in mathematics/engineering, weak in physics/natural sciences. Models showed partial complementarity (r≈0.40-0.65) with different error patterns.

Conclusion: Current LLMs have significant limitations in quantitative reasoning despite advances. The ORCA Benchmark reveals critical gaps in numerical precision and domain-specific reasoning, highlighting the need for improved calculation capabilities and domain knowledge in AI systems.

Abstract: We present the ORCA (Omni Research on Calculation in AI) Benchmark, a novel benchmark that evaluates large language models (LLMs) on multi-domain, real-life quantitative reasoning using verified outputs from Omni’s calculator engine. Across 500 natural-language tasks in domains such as finance, physics, health, and statistics, five state-of-the-art systems (ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2) achieved only 45–63% accuracy, with errors mainly related to rounding (35%) and calculation mistakes (33%). Results in specific domains indicate strengths in mathematics and engineering, but weaknesses in physics and natural sciences. Correlation analysis ($r \approx 0.40$–$0.65$) shows that the models often fail together but differ in the types of errors they make, highlighting their partial complementarity rather than redundancy. Unlike standard math datasets, ORCA evaluates step-by-step reasoning, numerical precision, and domain generalization across real problems from finance, physics, health, and statistics.

[200] Kosmos: An AI Scientist for Autonomous Discovery

Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C. Landsness, Daniel L. Barabasi, Siddharth Narayanan, Nicky Evans, Shriya Reddy, Martha Foiani, Aizad Kamal, Leah P. Shriver, Fang Cao, Asmamaw T. Wassie, Jon M. Laurent, Edwin Melville-Green, Mayk Caldas, Albert Bou, Kaleigh F. Roberts, Sladjana Zagorac, Timothy C. Orr, Miranda E. Orr, Kevin J. Zwezdaryk, Ali E. Ghareeb, Laurie McCoy, Bruna Gomes, Euan A. Ashley, Karen E. Duff, Tonio Buonassisi, Tom Rainforth, Randall J. Bateman, Michael Skarlinski, Samuel G. Rodriques, Michaela M. Hinks, Andrew D. White

Main category: cs.AI

TL;DR: Kosmos is an AI scientist that automates data-driven discovery through iterative cycles of data analysis, literature search, and hypothesis generation, maintaining coherence over 200 agent rollouts and generating scientific reports with traceable reasoning.

DetailsMotivation: Current AI agents for scientific research are limited in the number of actions they can take before losing coherence, restricting the depth of their findings. There's a need for systems that can maintain coherent pursuit of objectives over extended periods.

Method: Kosmos uses a structured world model to share information between a data analysis agent and a literature search agent, enabling coherent pursuit of objectives over 200 agent rollouts. It performs parallel data analysis, literature search, and hypothesis generation cycles for up to 12 hours.

Result: Kosmos executes an average of 42,000 lines of code and reads 1,500 papers per run. Independent scientists found 79.4% of statements in Kosmos reports to be accurate. A single 20-cycle run performed the equivalent of 6 months of human research time, with valuable findings scaling linearly with cycles.

Conclusion: Kosmos successfully automates scientific discovery across multiple domains, reproducing three independent findings and making four novel contributions to scientific literature, demonstrating the potential for AI systems to accelerate scientific research while maintaining traceable reasoning.

Abstract: Data-driven scientific discovery requires iterative cycles of literature search, hypothesis generation, and data analysis. Substantial progress has been made towards AI agents that can automate scientific research, but all such agents remain limited in the number of actions they can take before losing coherence, thus limiting the depth of their findings. Here we present Kosmos, an AI scientist that automates data-driven discovery. Given an open-ended objective and a dataset, Kosmos runs for up to 12 hours performing cycles of parallel data analysis, literature search, and hypothesis generation before synthesizing discoveries into scientific reports. Unlike prior systems, Kosmos uses a structured world model to share information between a data analysis agent and a literature search agent. The world model enables Kosmos to coherently pursue the specified objective over 200 agent rollouts, collectively executing an average of 42,000 lines of code and reading 1,500 papers per run. Kosmos cites all statements in its reports with code or primary literature, ensuring its reasoning is traceable. Independent scientists found 79.4% of statements in Kosmos reports to be accurate, and collaborators reported that a single 20-cycle Kosmos run performed the equivalent of 6 months of their own research time on average. Furthermore, collaborators reported that the number of valuable scientific findings generated scales linearly with Kosmos cycles (tested up to 20 cycles). We highlight seven discoveries made by Kosmos that span metabolomics, materials science, neuroscience, and statistical genetics. Three discoveries independently reproduce findings from preprinted or unpublished manuscripts that were not accessed by Kosmos at runtime, while four make novel contributions to the scientific literature.

[201] Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything

Huawei Lin, Yunzhi Shi, Tong Geng, Weijie Zhao, Wei Wang, Ravender Pal Singh

Main category: cs.AI

TL;DR: Agent-Omni framework coordinates existing foundation models through a master-agent system for flexible multimodal reasoning without retraining, achieving state-of-the-art performance across text, image, audio, video, and omni benchmarks.

DetailsMotivation: Current MLLMs are limited to fixed modality pairs, require costly fine-tuning, and lack robust reasoning support for fully omni-capable models integrating text, images, audio, and video.

Method: A master-agent system where the master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses without retraining existing models.

Result: Extensive experiments show Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning, while maintaining transparency and interpretability.

Conclusion: The agent-based design enables seamless integration of specialized foundation models, is modular and easily extensible for future improvements, and provides adaptable multimodal reasoning without model retraining.

Abstract: Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical and lacks robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. In addition, the framework is modular and easily extensible, allowing future improvements as stronger models become available.
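The master-agent loop is essentially intent parsing plus routing: one subtask per detected modality, then fusion of the partial answers. A toy sketch of the delegation step, with stand-in agents (all names and the fusion rule are hypothetical):

```python
def image_agent(q: str) -> str: return f"[image] {q}"
def audio_agent(q: str) -> str: return f"[audio] {q}"
def text_agent(q: str) -> str:  return f"[text] {q}"

# Registry of modality-specific agents; real agents would wrap foundation models.
AGENTS = {"image": image_agent, "audio": audio_agent, "text": text_agent}

def master_agent(query: str, modalities: list[str]) -> str:
    # Delegate one subtask per detected modality, then fuse the outputs.
    partial = [AGENTS[m](query) for m in modalities if m in AGENTS]
    return " | ".join(partial)

out = master_agent("what is said about the cat?", ["audio", "image"])
assert out == "[audio] what is said about the cat? | [image] what is said about the cat?"
```

Because the registry is just a mapping, swapping in a stronger specialist model requires no retraining of the others, which is the extensibility claim in the abstract.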

cs.SD

[202] SyMuPe: Affective and Controllable Symbolic Music Performance

Ilya Borovik, Dmitrii Gavrilev, Vladimir Viro

Main category: cs.SD

TL;DR: SyMuPe framework with PianoFlow model enables controllable symbolic piano performance generation using conditional flow matching, achieving human-like quality and emotion control through text embeddings.

DetailsMotivation: Current machine learning models struggle to achieve human-like emotional expression in music performance rendering, creating a need for more affective and controllable performance models.

Method: Uses conditional flow matching trained on multi-mask performance inpainting tasks, integrated with emotion classifier and Flan-T5 text embeddings for emotion control, trained on 2,968 hours of aligned scores and MIDI performances.

Result: PianoFlow outperforms transformer-based baselines and achieves performance quality comparable to human-recorded MIDI samples, with effective emotion control demonstrated through text conditioning.

Conclusion: The framework enables creation of accessible and engaging music performance systems that can be integrated into interactive applications.

Abstract: Emotions are fundamental to the creation and perception of music performances. However, achieving human-like expression and emotion through machine learning models for performance rendering remains a challenging task. In this work, we present SyMuPe, a novel framework for developing and training affective and controllable symbolic piano performance models. Our flagship model, PianoFlow, uses conditional flow matching trained to solve diverse multi-mask performance inpainting tasks. By design, it supports both unconditional generation and infilling of music performance features. For training, we use a curated, cleaned dataset of 2,968 hours of aligned musical scores and expressive MIDI performances. For text and emotion control, we integrate a piano performance emotion classifier and tune PianoFlow with the emotion-weighted Flan-T5 text embeddings provided as conditional inputs. Objective and subjective evaluations against transformer-based baselines and existing models show that PianoFlow not only outperforms other approaches, but also achieves performance quality comparable to that of human-recorded and transcribed MIDI samples. For emotion control, we present and analyze samples generated under different text conditioning scenarios. The developed model can be integrated into interactive applications, contributing to the creation of more accessible and engaging music performance systems.
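Conditional flow matching has a compact training target: sample a point on a straight path between noise and data, and regress the model onto the path's velocity. A minimal NumPy sketch of that target; the actual PianoFlow model additionally conditions on the score, inpainting masks, and text embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_pair(x0, x1, t):
    """Linear-path conditional flow matching: interpolant and target velocity."""
    xt = (1.0 - t) * x0 + t * x1   # point on the straight path from noise to data
    v_target = x1 - x0             # constant velocity along that path
    return xt, v_target

x0 = rng.standard_normal(8)        # noise sample
x1 = rng.standard_normal(8)        # stand-in for a performance-feature sample
xt, v = cfm_pair(x0, x1, t=0.25)
# A network v_theta(xt, t, condition) would be regressed onto v with an MSE loss;
# integrating the learned velocity from t=0 to 1 then generates a sample.
assert np.allclose(xt + 0.75 * v, x1)
```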

[203] Why Not Put a Microphone Near the Loudspeaker? A New Paradigm for Acoustic Echo Cancellation

Fei Zhao, Zhong-Qiu Wang

Main category: cs.SD

TL;DR: A dual-microphone acoustic echo cancellation method that uses an auxiliary reference microphone near loudspeakers to capture nonlinear distortions, with Wiener filtering preprocessing and DNN-based joint residual echo/noise suppression.

DetailsMotivation: Address challenges in real-world AEC caused by nonlinear distortions from low-cost loudspeakers and complex room acoustics.

Method: Uses dual-microphone setup with auxiliary reference microphone near loudspeaker, Wiener filtering preprocessing to suppress near-end components, linear AEC stage, and DNN for joint residual echo and noise suppression.

Result: Outperforms baseline approaches on matched test sets and achieves substantial performance gains on mismatched datasets with strong nonlinearities.

Conclusion: The method is effective in practical scenarios with unknown nonlinear distortions, demonstrating robustness and superior performance over existing approaches.

Abstract: Acoustic echo cancellation (AEC) remains challenging in real-world environments due to nonlinear distortions caused by low-cost loudspeakers and complex room acoustics. To mitigate these issues, we introduce a dual-microphone configuration, where an auxiliary reference microphone is placed near the loudspeaker to capture the nonlinearly distorted far-end signal. Although this reference signal is contaminated by near-end speech, we propose a preprocessing module based on Wiener filtering to estimate a compressed time-frequency mask to suppress near-end components. This purified reference signal enables a more effective linear AEC stage, whose residual error signal is then fed to a deep neural network for joint residual echo and noise suppression. Evaluation results show that our method outperforms baseline approaches on matched test sets. To evaluate its robustness under strong nonlinearities, we further test it on a mismatched dataset and observe that it achieves substantial performance gains. These results demonstrate its effectiveness in practical scenarios where the nonlinear distortions are typically unknown.
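The Wiener preprocessing step reduces to a per-frequency gain keeping the fraction of power attributed to the far-end component at the auxiliary microphone. A toy sketch with illustrative PSD values; the actual system estimates a compressed time-frequency mask rather than oracle PSDs:

```python
import numpy as np

def wiener_gain(target_psd, interference_psd, floor=1e-12):
    """Classic Wiener gain: fraction of power attributed to the target component."""
    return target_psd / (target_psd + interference_psd + floor)

# Toy per-frequency power spectral densities (illustrative numbers only)
far_end_psd = np.array([4.0, 1.0, 0.25])   # distorted loudspeaker signal at the aux mic
near_end_psd = np.array([0.5, 1.0, 2.0])   # near-end speech leaking into the aux mic

gain = wiener_gain(far_end_psd, near_end_psd)
purified_ref_psd = gain * (far_end_psd + near_end_psd)  # mask-purified reference
assert np.all((gain >= 0) & (gain <= 1))   # a compressed mask stays in [0, 1]
```

The purified reference then drives the linear AEC stage, whose residual goes to the DNN for joint echo and noise suppression.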

[204] Unifying Symbolic Music Arrangement: Track-Aware Reconstruction and Structured Tokenization

Longshen Ou, Jingwei Zhao, Ziyu Wang, Gus Xia, Qihao Liang, Torin Hopkins, Ye Wang

Main category: cs.SD

TL;DR: A unified framework for automatic multitrack music arrangement using a single pre-trained model that handles diverse scenarios like reinterpretation, simplification, and additive generation through segment-level reconstruction with disentangled content and style tokens.

DetailsMotivation: To create a generalized approach for multitrack music arrangement that can handle various transformation scenarios without needing task-specific models, enabling flexible any-to-any instrumentation changes.

Method: Uses segment-level reconstruction objective with token-level disentangled content and style, introduces REMI-z structured tokenization for multitrack symbolic music, and supports track-wise modeling for efficient arrangement tasks.

Result: Outperforms task-specific state-of-the-art models on band arrangement, piano reduction, and drum arrangement tasks in both objective metrics and perceptual evaluations.

Conclusion: The framework demonstrates strong generality and broader applicability in symbolic music-to-music transformation, suggesting a unified approach can effectively handle diverse arrangement scenarios.

Abstract: We present a unified framework for automatic multitrack music arrangement that enables a single pre-trained symbolic music model to handle diverse arrangement scenarios, including reinterpretation, simplification, and additive generation. At its core is a segment-level reconstruction objective operating on token-level disentangled content and style, allowing for flexible any-to-any instrumentation transformations at inference time. To support track-wise modeling, we introduce REMI-z, a structured tokenization scheme for multitrack symbolic music that enhances modeling efficiency and effectiveness for both arrangement tasks and unconditional generation. Our method outperforms task-specific state-of-the-art models on representative tasks in different arrangement scenarios – band arrangement, piano reduction, and drum arrangement, in both objective metrics and perceptual evaluations. Taken together, our framework demonstrates strong generality and suggests broader applicability in symbolic music-to-music transformation.
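The core idea behind a track-structured tokenization can be sketched with toy tokens: grouping each instrument's events into a contiguous span makes any-to-any instrumentation edits a matter of replacing one span. The token names below are illustrative, not the actual REMI-z vocabulary:

```python
# Each note: (track, pitch, onset_in_bar, duration) — a toy stand-in for real MIDI
notes = [("piano", 60, 0, 2), ("bass", 36, 0, 4), ("piano", 64, 2, 2)]

def tokenize_bar(notes):
    """Group tokens track by track inside a bar, so each instrument forms a
    contiguous span that can be swapped or regenerated independently."""
    tokens = ["<bar>"]
    for track in sorted({n[0] for n in notes}):
        tokens.append(f"<track:{track}>")
        for _, pitch, onset, dur in sorted(
            (n for n in notes if n[0] == track), key=lambda n: n[2]
        ):
            tokens += [f"pos_{onset}", f"pitch_{pitch}", f"dur_{dur}"]
    return tokens

toks = tokenize_bar(notes)
assert toks.index("<track:bass>") < toks.index("<track:piano>")
```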

[205] Live Music Models

Lyria Team, Antoine Caillon, Brian McWilliams, Cassie Tarakajian, Ian Simon, Ilaria Manco, Jesse Engel, Noah Constant, Yunpeng Li, Timo I. Denk, Alberto Lalama, Andrea Agostinelli, Cheng-Zhi Anna Huang, Ethan Manilow, George Brower, Hakan Erdogan, Heidi Lei, Itai Rolnick, Ivan Grishchenko, Manu Orsini, Matej Kastelic, Mauricio Zuluaga, Mauro Verzetti, Michael Dooley, Ondrej Skopek, Rafael Ferrer, Savvas Petridis, Zalán Borsos, Äaron van den Oord, Douglas Eck, Eli Collins, Jason Baldridge, Tom Hume, Chris Donahue, Kehang Han, Adam Roberts

Main category: cs.SD

TL;DR: Live music models generate real-time continuous music streams with synchronized user control via text/audio prompts. Magenta RealTime outperforms other open-weights models on music quality metrics with fewer parameters.

DetailsMotivation: To create a new paradigm for AI-assisted music creation that emphasizes human-in-the-loop interaction for live music performance, enabling real-time control and generation.

Method: Developed live music models that produce continuous music streams in real-time, with Magenta RealTime (open-weights) and Lyria RealTime (API-based) offering text/audio prompt control for acoustic style.

Result: Magenta RealTime outperforms other open-weights music generation models on automatic music quality metrics despite using fewer parameters, while offering first-of-its-kind live generation capabilities.

Conclusion: These models demonstrate a successful new paradigm for AI-assisted music creation that enables real-time human-in-the-loop interaction for live music performance.

Abstract: We introduce a new class of generative models for music called live music models that produce a continuous stream of music in real-time with synchronized user control. We release Magenta RealTime, an open-weights live music model that can be steered using text or audio prompts to control acoustic style. On automatic metrics of music quality, Magenta RealTime outperforms other open-weights music generation models, despite using fewer parameters and offering first-of-its-kind live generation capabilities. We also release Lyria RealTime, an API-based model with extended controls, offering access to our most powerful model with wide prompt coverage. These models demonstrate a new paradigm for AI-assisted music creation that emphasizes human-in-the-loop interaction for live music performance.

cs.LG

[206] FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels

Jiedong Jiang, Wanyi He, Yuefeng Wang, Guoxiong Gao, Yongle Hu, Jingting Wang, Nailing Guan, Peihao Wu, Chunbo Dai, Liang Xiao, Bin Dong

Main category: cs.LG

TL;DR: FATE is a new formal algebra benchmark series with problems ranging from undergraduate to PhD+ level, revealing major gaps in LLMs’ formal reasoning capabilities compared to contest math performance.

DetailsMotivation: Current LLM benchmarks focus on contest math (like IMO) which doesn't reflect the depth and abstraction of real mathematical research, creating a need for more challenging formal reasoning benchmarks.

Method: Created FATE-H and FATE-X benchmarks with 100 problems each in abstract and commutative algebra, spanning difficulty from undergraduate exercises to beyond PhD qualifying exams. Evaluated state-of-the-art LLM provers using two-stage evaluation (natural language reasoning and formalization).

Result: Best model achieved only 3% accuracy on FATE-H and 0% on FATE-X, showing models’ natural-language reasoning is significantly better than their formalization ability. Specialized provers showed less effective reflection than general-purpose models.

Conclusion: FATE establishes essential checkpoints for research-level formal mathematical reasoning and reveals critical limitations in current LLMs’ formal theorem proving capabilities beyond contest math.

Abstract: Recent advances in large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, particularly on contest-based mathematical benchmarks like the IMO. However, these contests do not reflect the depth, breadth, and abstraction of modern mathematical research. To bridge this gap, we introduce FATE (Formal Algebra Theorem Evaluation), a new benchmark series in formal algebra designed to chart a course toward advanced mathematical reasoning. We present two new components, FATE-H and FATE-X, each with 100 problems in abstract and commutative algebra. The FATE series spans a difficulty spectrum from undergraduate exercises to problems exceeding PhD qualifying exams. Notably, FATE-X is the first formal benchmark to surpass both PhD-level exam difficulty and the coverage of the Mathlib library. Our evaluations of state-of-the-art LLM provers on this new benchmark reveal a stark performance gap compared to contest math: the best model achieves only 3% (pass@64) accuracy on FATE-H and 0% on FATE-X. Our two-stage evaluation reveals that models’ natural-language reasoning is notably more accurate than their ability to formalize this reasoning. We systematically classify the common errors that arise during this formalization process. Furthermore, a comparative study shows that a specialized prover can exhibit less effective reflection than general-purpose models, reducing its accuracy at the natural-language stage. We believe FATE provides a robust and challenging benchmark that establishes essential checkpoints on the path toward research-level formal mathematical reasoning.
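The pass@64 figures are presumably computed with the standard unbiased pass@k estimator used throughout the code- and proof-generation literature, which can be stated directly:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn (without replacement) from n attempts, c of them correct, succeeds."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 proof attempts, 1 verified correct, sampling 5 of them:
assert abs(pass_at_k(10, 1, 5) - 0.5) < 1e-9
# with k = n, pass@k is simply "any attempt succeeded":
assert pass_at_k(64, 1, 64) == 1.0 and pass_at_k(64, 0, 64) == 0.0
```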

[207] Stochastic Deep Graph Clustering for Practical Group Formation

Junhyung Park, Hyungjin Kim, Seokho Ahn, Young-Duk Seo

Main category: cs.LG

TL;DR: DeepForm is a framework that reframes group formation as a core challenge in group recommender systems, addressing dynamic scenarios with real-time group formation and adaptive group reconfiguration.

DetailsMotivation: Prior work on group recommender systems focuses on accuracy but assumes static/predefined groups, making them unsuitable for dynamic real-world scenarios where groups need to form and adapt in real-time.

Method: Uses lightweight GCN architecture to capture high-order user information, stochastic cluster learning for adaptive group reconfiguration without retraining, and contrastive learning to refine groups under dynamic conditions.

Result: Experiments on multiple datasets show DeepForm achieves superior group formation quality, efficiency, and recommendation accuracy compared to various baselines.

Conclusion: DeepForm successfully addresses the challenges of dynamic group formation in recommender systems by enabling real-time group formation and adaptive reconfiguration while maintaining high recommendation quality.

Abstract: While prior work on group recommender systems (GRSs) has primarily focused on improving recommendation accuracy, most approaches assume static or predefined groups, making them unsuitable for dynamic, real-world scenarios. We reframe group formation as a core challenge in GRSs and propose DeepForm (Stochastic Deep Graph Clustering for Practical Group Formation), a framework designed to meet three key operational requirements: (1) the incorporation of high-order user information, (2) real-time group formation, and (3) dynamic adjustment of the number of groups. DeepForm employs a lightweight GCN architecture that effectively captures high-order structural signals. Stochastic cluster learning enables adaptive group reconfiguration without retraining, while contrastive learning refines groups under dynamic conditions. Experiments on multiple datasets demonstrate that DeepForm achieves superior group formation quality, efficiency, and recommendation accuracy compared with various baselines.
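The ability to change the number of groups without retraining follows naturally if cluster assignment is a soft, temperature-controlled function of fixed embeddings: only the centroid matrix changes size. A minimal sketch of that idea, not DeepForm's actual objective:

```python
import numpy as np

rng = np.random.default_rng(1)

def stochastic_assign(embeddings, centroids, temperature=1.0):
    """Soft cluster assignment from fixed user embeddings; the number of groups
    changes just by passing a different centroid matrix, with no retraining."""
    logits = embeddings @ centroids.T / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

users = rng.standard_normal((5, 8))   # stand-in for GCN user embeddings
p3 = stochastic_assign(users, rng.standard_normal((3, 8)))   # 3 groups
p4 = stochastic_assign(users, rng.standard_normal((4, 8)))   # 4 groups, same users
assert p3.shape == (5, 3) and p4.shape == (5, 4)
assert np.allclose(p3.sum(axis=1), 1.0)
```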

[208] Test-time Adaptation of Tiny Recursive Models

Ronan Killian McGovern

Main category: cs.LG

TL;DR: A tiny 7M parameter recursive model pre-trained on public ARC tasks achieves 6.67% score on semi-private evaluation after efficient fine-tuning within competition compute limits.

DetailsMotivation: Previous open source approaches like TRM achieved good scores but required excessive compute beyond competition limits, creating a need for efficient fine-tuning methods.

Method: Pre-train a tiny recursive model on 1,280 public ARC tasks for 700k+ steps, then perform full fine-tuning (not LoRA) for just 12,500 gradient steps during competition.

Result: The model achieved 6.67% on semi-private evaluation tasks after post-training, compared to ~10% on public evaluation after pre-training.

Conclusion: Starting from pre-trained tiny recursive models enables efficient fine-tuning within competition compute constraints while maintaining competitive performance.

Abstract: Prior to the close of the 2025 ARC Prize competition, the leading open source approach - known as TRM, or Tiny Recursive Models - involved training a 7M parameter recursive neural network on augmented variants of ARC tasks. That approach scored approximately 7.8% on the public ARC AGI II evaluation set, but required a level of compute far in excess of what is allowed during the competition. This paper shows that, by starting from a tiny recursive model that has been pre-trained on public ARC tasks, one can efficiently fine-tune on competition tasks within the allowed compute limits. Specifically, a model was pre-trained on 1,280 public tasks for 700k+ optimizer steps over 48 hours on 4xH100 SXM GPUs to obtain a ~10% score on the public evaluation set. That model was then post-trained in just 12,500 gradient steps during the competition to reach a score of 6.67% on semi-private evaluation tasks. Notably, such post-training performance is achieved by full fine-tuning of the tiny model, not LoRA fine-tuning or fine-tuning of task embeddings alone.

[209] Scaling Multi-Agent Environment Co-Design with Diffusion Models

Hao Xiang Li, Michael Amir, Amanda Prorok

Main category: cs.LG

TL;DR: DiCoDe is a scalable co-design framework that uses diffusion models and critic distillation to jointly optimize agent policies and environment configurations, achieving superior performance with fewer samples.

DetailsMotivation: Current co-design methods struggle with high-dimensional design spaces and sample inefficiency when dealing with moving targets in joint optimization.

Method: Uses Diffusion Co-Design (DiCoDe) with Projected Universal Guidance (PUG) for constrained environment sampling and critic distillation to share knowledge from RL critic.

Result: Achieves 39% higher rewards in warehouse setting with 66% fewer simulation samples, consistently outperforming state-of-the-art methods.

Conclusion: Sets new standard in agent-environment co-design and enables practical applications in real-world domains.

Abstract: The agent-environment co-design paradigm jointly optimises agent policies and environment configurations in search of improved system performance. With application domains ranging from warehouse logistics to windfarm management, co-design promises to fundamentally change how we deploy multi-agent systems. However, current co-design methods struggle to scale. They collapse under high-dimensional environment design spaces and suffer from sample inefficiency when addressing moving targets inherent to joint optimisation. We address these challenges by developing Diffusion Co-Design (DiCoDe), a scalable and sample-efficient co-design framework pushing co-design towards practically relevant settings. DiCoDe incorporates two core innovations. First, we introduce Projected Universal Guidance (PUG), a sampling technique that enables DiCoDe to explore a distribution of reward-maximising environments while satisfying hard constraints such as spatial separation between obstacles. Second, we devise a critic distillation mechanism to share knowledge from the reinforcement learning critic, ensuring that the guided diffusion model adapts to evolving agent policies using a dense and up-to-date learning signal. Together, these improvements lead to superior environment-policy pairs when validated on challenging multi-agent environment co-design benchmarks including warehouse automation, multi-agent pathfinding and wind farm optimisation. Our method consistently exceeds the state-of-the-art, achieving, for example, 39% higher rewards in the warehouse setting with 66% fewer simulation samples. This sets a new standard in agent-environment co-design, and is a stepping stone towards reaping the rewards of co-design in real world domains.
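The shape of Projected Universal Guidance, one gradient step on the reward followed by a projection back onto the hard constraint, can be illustrated in one dimension. This is a deliberately crude stand-in for the paper's diffusion-space procedure; the function names are hypothetical:

```python
import numpy as np

def project_min_separation(xs, d_min):
    """Crude projection: push apart any pair of 1-D obstacle positions
    closer than d_min, sweeping left to right."""
    xs = np.sort(np.asarray(xs, dtype=float))
    for i in range(1, len(xs)):
        if xs[i] - xs[i - 1] < d_min:
            xs[i] = xs[i - 1] + d_min
    return xs

def guided_step(xs, reward_grad, lr, d_min):
    # One guidance update: follow the reward gradient, then re-impose
    # the hard spatial-separation constraint.
    return project_min_separation(xs + lr * reward_grad(xs), d_min)

reward_grad = lambda xs: -xs            # toy reward pulling obstacles toward 0
xs = np.array([0.0, 0.05, 3.0])
out = guided_step(xs, reward_grad, lr=0.1, d_min=0.5)
assert np.all(np.diff(out) >= 0.5 - 1e-9)   # constraint holds after the step
```

In DiCoDe the same pattern is applied inside the diffusion sampler, with the reward gradient supplied by the distilled RL critic.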

[210] Predicting Weekly Fishing Concentration Zones through Deep Learning Integration of Heterogeneous Environmental Spatial Datasets

Chaitanya Rele, Aditya Rathod, Kaustubh Natu, Saurabh Kulkarni, Ajay Koli, Swapnali Makdey

Main category: cs.LG

TL;DR: AI framework predicts fishing zones using ocean data to help fishermen find productive areas faster and more efficiently.

DetailsMotivation: Fishermen in the North Indian Ocean face uncertainty in locating productive fishing grounds, leading to wasted time and resources.

Method: Uses AI-assisted framework with oceanographic parameters like sea surface temperature and chlorophyll concentration to predict Potential Fishing Zones.

Result: Preliminary results show reduced search time, lower fuel consumption, and more efficient resource utilization for fishermen.

Conclusion: The AI framework can effectively support sustainable fishing practices by providing accurate PFZ predictions and region-specific insights.

Abstract: The North Indian Ocean, including the Arabian Sea and the Bay of Bengal, represents a vital source of livelihood for coastal communities, yet fishermen often face uncertainty in locating productive fishing grounds. To address this challenge, we present an AI-assisted framework for predicting Potential Fishing Zones (PFZs) using oceanographic parameters such as sea surface temperature and chlorophyll concentration. The approach is designed to enhance the accuracy of PFZ identification and provide region-specific insights for sustainable fishing practices. Preliminary results indicate that the framework can support fishermen by reducing search time, lowering fuel consumption, and promoting efficient resource utilization.

[211] Adaptive and Robust Data Poisoning Detection and Sanitization in Wearable IoT Systems using Large Language Models

W. K. M Mithsara, Ning Yang, Ahmed Imteaj, Hussein Zangoti, Abdur R. Shahid

Main category: cs.LG

TL;DR: This paper proposes a novel framework using large language models (LLMs) for detecting and sanitizing data poisoning attacks in human activity recognition (HAR) systems for wearable IoT devices, employing zero-shot, one-shot, and few-shot learning with role-play prompting and step-by-step reasoning.

DetailsMotivation: Wearable IoT systems are increasingly vulnerable to data poisoning attacks that compromise data integrity and reliability. Conventional defense methods require extensive labeled datasets and lack adaptability in dynamic IoT environments.

Method: The framework uses LLMs with role-play prompting (where LLMs act as experts) and think step-by-step reasoning to detect poisoning indicators and generate clean alternatives in sensor data, employing zero-shot, one-shot, and few-shot learning paradigms.

Result: Extensive evaluation demonstrates the framework’s effectiveness in detection accuracy, sanitization quality, latency, and communication cost, showing practical improvements in security and reliability.

Conclusion: LLMs provide robust, adaptable defense mechanisms against data poisoning attacks in HAR systems, reducing reliance on large labeled datasets and enabling real-time protection for wearable IoT applications.

Abstract: The widespread integration of wearable sensing devices in Internet of Things (IoT) ecosystems, particularly in healthcare, smart homes, and industrial applications, has required robust human activity recognition (HAR) techniques to improve functionality and user experience. Although machine learning models have advanced HAR, they are increasingly susceptible to data poisoning attacks that compromise the data integrity and reliability of these systems. Conventional approaches to defending against such attacks often require extensive task-specific training with large, labeled datasets, which limits adaptability in dynamic IoT environments. This work proposes a novel framework that uses large language models (LLMs) to perform poisoning detection and sanitization in HAR systems, utilizing zero-shot, one-shot, and few-shot learning paradigms. Our approach incorporates "role play" prompting, whereby the LLM assumes the role of expert to contextualize and evaluate sensor anomalies, and "think step-by-step" reasoning, guiding the LLM to infer poisoning indicators in the raw sensor data and plausible clean alternatives. These strategies minimize reliance on curation of extensive datasets and enable robust, adaptable defense mechanisms in real-time. We perform an extensive evaluation of the framework, quantifying detection accuracy, sanitization quality, latency, and communication cost, thus demonstrating the practicality and effectiveness of LLMs in improving the security and reliability of wearable IoT systems.
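The two prompting strategies compose mechanically: a role-play preamble, a step-by-step instruction, and optional in-context shots. A hypothetical prompt builder illustrating that structure; the wording is illustrative, not the paper's actual prompts:

```python
def build_prompt(sensor_window, shots=()):
    """Hypothetical assembly of role-play + step-by-step poisoning-detection prompts."""
    lines = [
        "You are an expert in wearable-sensor anomaly analysis.",   # role play
        "Think step by step: (1) flag implausible readings, "
        "(2) explain why, (3) propose a clean replacement value.",  # step-by-step reasoning
    ]
    for example_in, example_out in shots:                           # one-/few-shot
        lines += [f"Example input: {example_in}", f"Example output: {example_out}"]
    lines.append(f"Input accelerometer window: {sensor_window}")
    return "\n".join(lines)

# zero-shot (no examples) vs one-shot (one worked example)
prompt = build_prompt([0.1, 0.2, 99.0, 0.2], shots=[("[0.1, 50.0]", "[0.1, 0.1]")])
assert "expert" in prompt and "Example input" in prompt
```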

[212] Zero-shot data citation function classification using transformer-based large language models (LLMs)

Neil Byers, Ali Zaidi, Valerie Skye, Chris Beecroft, Kjiersten Fagnan

Main category: cs.LG

TL;DR: This paper applies the Llama 3.1-405B LLM to automatically generate structured data use case labels for publications that cite genomic datasets, achieving a 0.674 F1 score on zero-shot classification without predefined categories.

DetailsMotivation: To scale the description of how and why specific datasets are used in scientific literature, avoiding expensive manual labeling and training dataset development for classical ML systems.

Method: Uses an open-source LLM (Llama 3.1-405B) for zero-shot data citation classification and introduces a novel evaluation framework to assess method efficacy.

Result: The stock model achieves an F1 score of 0.674 on zero-shot data citation classification without predefined categories, showing promising performance.

Conclusion: While results are promising, the approach faces barriers including data availability, prompt overfitting, computational infrastructure requirements, and the expense of responsible performance evaluation.

Abstract: Efforts have increased in recent years to identify associations between specific datasets and the scientific literature that incorporates them. Knowing that a given publication cites a given dataset, the next logical step is to explore how or why that data was used. Advances in recent years with pretrained, transformer-based large language models (LLMs) offer potential means for scaling the description of data use cases in the published literature. This avoids expensive manual labeling and the development of training datasets for classical machine-learning (ML) systems. In this work we apply an open-source LLM, Llama 3.1-405B, to generate structured data use case labels for publications known to incorporate specific genomic datasets. We also introduce a novel evaluation framework for determining the efficacy of our methods. Our results demonstrate that the stock model can achieve an F1 score of 0.674 on a zero-shot data citation classification task with no previously defined categories. While promising, our results are qualified by barriers related to data availability, prompt overfitting, computational infrastructure, and the expense required to conduct responsible performance evaluation.
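Since the use-case labels are open-ended sets rather than fixed classes, a set-based F1 per publication is a natural way to score predictions against gold annotations. A minimal sketch; the paper's actual evaluation framework may differ:

```python
def f1_score(gold: set, pred: set) -> float:
    """Set-based F1 between gold and predicted use-case labels for one publication."""
    if not pred or not gold:
        return 0.0
    tp = len(gold & pred)               # labels the model got right
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

gold = {"reanalysis", "benchmarking"}
pred = {"reanalysis", "method development"}   # one hit, one miss, one spurious
assert abs(f1_score(gold, pred) - 0.5) < 1e-9
```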

[213] Power Constrained Nonstationary Bandits with Habituation and Recovery Dynamics

Fengxu Li, Stephanie M. Carpenter, Matthew P. Buman, Yonatan Mintz

Main category: cs.LG

TL;DR: Developed ROGUE-TS algorithm with probability clipping to balance personalization and population-level learning in nonstationary bandit settings, achieving lower regret while maintaining statistical power for treatment effect detection.

DetailsMotivation: Address the challenge in decision-making where action rewards evolve over time (habituation/recovery effects) and existing algorithms may not provide sufficient exploration for population-level effect estimation, particularly important in micro-randomized trials for behavioral health interventions.

Method: Developed ROGUE-TS (Thompson Sampling algorithm for ROGUE framework) with theoretical guarantees, then introduced probability clipping procedure to balance personalization and population-level learning with quantified trade-off between regret and exploration probability.

Result: Validation on two MRT datasets (physical activity promotion and bipolar disorder treatment) showed lower regret than existing approaches while maintaining high statistical power through clipping procedure without significantly increasing regret.

Conclusion: The framework enables reliable detection of treatment effects while accounting for individual behavioral dynamics and provides practical guidance for researchers designing MRTs to balance personalization with statistical validity.

Abstract: A common challenge for decision makers is selecting actions whose rewards are unknown and evolve over time based on prior policies. For instance, repeated use may reduce an action’s effectiveness (habituation), while inactivity may restore it (recovery). These nonstationarities are captured by the Reducing or Gaining Unknown Efficacy (ROGUE) bandit framework, which models real-world settings such as behavioral health interventions. While existing algorithms can compute sublinear regret policies to optimize these settings, they may not provide sufficient exploration due to overemphasis on exploitation, limiting the ability to estimate population-level effects. This is a challenge of particular interest in micro-randomized trials (MRTs) that aid researchers in developing just-in-time adaptive interventions that have population-level effects while still providing personalized recommendations to individuals. In this paper, we first develop ROGUE-TS, a Thompson Sampling algorithm tailored to the ROGUE framework, and provide theoretical guarantees of sublinear regret. We then introduce a probability clipping procedure to balance personalization and population-level learning, with quantified trade-off that balances regret and minimum exploration probability. Validation on two MRT datasets concerning physical activity promotion and bipolar disorder treatment shows that our methods both achieve lower regret than existing approaches and maintain high statistical power through the clipping procedure without significantly increasing regret. This enables reliable detection of treatment effects while accounting for individual behavioral dynamics. For researchers designing MRTs, our framework offers practical guidance on balancing personalization with statistical validity.
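The clipping procedure amounts to flooring each arm's Thompson selection probability so no action is ever starved of the exploration needed for population-level estimates. A Monte Carlo sketch for Gaussian posteriors (illustrative, not the paper's exact procedure; note that renormalization shrinks the floor slightly):

```python
import numpy as np

rng = np.random.default_rng(2)

def clipped_thompson(posterior_means, posterior_stds, pi_min, n_draws=4000):
    """Estimate each arm's Thompson selection probability by Monte Carlo,
    clip it to at least pi_min, renormalize, then sample an arm."""
    k = len(posterior_means)
    draws = rng.normal(posterior_means, posterior_stds, size=(n_draws, k))
    probs = np.bincount(draws.argmax(axis=1), minlength=k) / n_draws
    clipped = np.clip(probs, pi_min, None)
    clipped /= clipped.sum()
    return rng.choice(k, p=clipped), clipped

arm, probs = clipped_thompson(np.array([0.9, 0.1]), np.array([0.1, 0.1]), pi_min=0.1)
assert np.isclose(probs.sum(), 1.0)
assert probs.min() > 0.08   # floor survives renormalization (slightly shrunk)
```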

[214] Digital Twin-Driven Pavement Health Monitoring and Maintenance Optimization Using Graph Neural Networks

Mohsin Mahmud Topu, Mahfuz Ahmed Anik, Azmine Toushik Wasi, Md Manjurul Ahsan

Main category: cs.LG

TL;DR: A Digital Twin and Graph Neural Network framework for proactive pavement health monitoring and predictive maintenance, outperforming traditional methods.

Motivation: Traditional Pavement Management Systems are reactive and lack real-time intelligence for failure prevention and optimal maintenance planning in complex road networks.

Method: Model pavement segments as graph nodes and spatial relations as edges, using real-time UAV, sensor, and LiDAR data in a Digital Twin framework with inductive GNN to learn deterioration patterns.

Result: Achieved R2 of 0.3798, outperforming baseline regressors and effectively capturing non-linear degradation patterns in pavement deterioration.

Conclusion: The DT-GNN integration enhances forecasting precision and establishes a closed feedback loop for continuous improvement, providing a foundation for proactive, intelligent pavement management.

Abstract: Pavement infrastructure monitoring is challenged by complex spatial dependencies, changing environmental conditions, and non-linear deterioration across road networks. Traditional Pavement Management Systems (PMS) remain largely reactive, lacking real-time intelligence for failure prevention and optimal maintenance planning. To address this, we propose a unified Digital Twin (DT) and Graph Neural Network (GNN) framework for scalable, data-driven pavement health monitoring and predictive maintenance. Pavement segments and spatial relations are modeled as graph nodes and edges, while real-time UAV, sensor, and LiDAR data stream into the DT. The inductive GNN learns deterioration patterns from graph-structured inputs to forecast distress and enable proactive interventions. Trained on a real-world-inspired dataset with segment attributes and dynamic connectivity, our model achieves an R2 of 0.3798, outperforming baseline regressors and effectively capturing non-linear degradation. We also develop an interactive dashboard and reinforcement learning module for simulation, visualization, and adaptive maintenance planning. This DT-GNN integration enhances forecasting precision and establishes a closed feedback loop for continuous improvement, positioning the approach as a foundation for proactive, intelligent, and sustainable pavement management, with future extensions toward real-world deployment, multi-agent coordination, and smart-city integration.
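
The graph formulation can be sketched with one round of GraphSAGE-style mean aggregation over segment features. This is a hypothetical minimal building block; the paper's inductive GNN, DT data streams, and training loop are not shown:

```python
import numpy as np

def mean_aggregate(features, edges):
    """One round of mean-neighbor message passing.

    features: (n_nodes, d) array of segment attributes (e.g. roughness,
    traffic load); edges: undirected (i, j) pairs between adjacent
    segments. Returns embeddings concatenating each node's own features
    with the mean of its neighbors' features.
    """
    n, d = features.shape
    agg = np.zeros((n, d))
    deg = np.zeros(n)
    for i, j in edges:
        agg[i] += features[j]; deg[i] += 1
        agg[j] += features[i]; deg[j] += 1
    agg = agg / np.maximum(deg, 1)[:, None]   # isolated nodes keep zeros
    return np.concatenate([features, agg], axis=1)
```

Stacking such rounds with learned weights lets deterioration signals propagate along the road network, which is what the inductive GNN exploits when forecasting distress.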

[215] Inference-Time Personalized Alignment with a Few User Preference Queries

Victor-Alexandru Pădurean, Parameswaran Kamalaruban, Nachiket Kotalwar, Alkis Gotovos, Adish Singla

Main category: cs.LG

TL;DR: UserAlign is a novel inference-time personalized alignment method that uses few pairwise response comparisons to align generative models with user preferences, building on best-arm identification in logistic bandits.

Motivation: Existing personalized alignment methods require either large numbers of user preference queries or explicit text specifications of preferences, which limits practical applicability.

Method: UserAlign elicits user preferences through few pairwise response comparisons and selects personalized responses from a fixed pool using a theoretical framework based on best-arm identification in logistic bandits, assuming consistent and noise-free user feedback.

Result: Experimental results across personalized text and image generation tasks demonstrate UserAlign’s effectiveness in achieving personalized alignment with minimal user queries.

Conclusion: UserAlign provides an efficient inference-time solution for personalized alignment that requires minimal user interaction while effectively capturing user preferences through pairwise comparisons.

Abstract: We study the problem of aligning a generative model’s response with a user’s preferences. Recent works have proposed several different formulations for personalized alignment; however, they either require a large amount of user preference queries or require that the preference be explicitly specified as a text input. In this paper, we propose a novel inference-time personalized alignment method, UserAlign, that elicits the user’s preferences with a few queries as pairwise response comparisons. In particular, UserAlign builds on the theoretical framework of best-arm identification in logistic bandits and selects a personalized response from a fixed pool of the model’s generated responses. The key idea is to consider the user’s feedback consistent and noise-free, and incorporate it into the theoretical framework to identify the best response quickly. Experimental results across several tasks, involving personalized text and image generation, showcase the effectiveness of UserAlign in achieving personalized alignment.
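
Under the paper's noise-free feedback assumption, even a naive sequential-elimination baseline finds the preferred response in n - 1 pairwise queries. The sketch below is that baseline, not UserAlign itself; the logistic-bandit machinery is designed to be more query-efficient than this:

```python
def best_response(responses, prefer):
    """Identify the user's preferred response from a fixed pool using
    noise-free pairwise comparisons.

    prefer(a, b) -> True iff the user prefers a over b. With consistent,
    transitive feedback the best item survives n - 1 comparisons.
    """
    best = responses[0]
    queries = 0
    for cand in responses[1:]:
        queries += 1
        if prefer(cand, best):
            best = cand
    return best, queries
```

Best-arm identification improves on this by choosing which pairs to query based on the information each comparison carries, so the preferred response is isolated after fewer questions.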

[216] Value of Information-Enhanced Exploration in Bootstrapped DQN

Stergios Plataniotis, Charilaos Akasiadis, Georgios Chalkiadakis

Main category: cs.LG

TL;DR: Integrating expected value of information (EVOI) into Bootstrapped DQN to enhance deep exploration in sparse-reward environments without adding extra hyperparameters.

Motivation: Traditional exploration methods like ε-greedy struggle with efficient exploration-exploitation balance in high-dimensional, sparse-reward environments.

Method: Developed two novel algorithms that incorporate EVOI into Bootstrapped DQN, using value of information estimates to measure network head discrepancies and guide exploration to high-potential areas.

Result: Experiments in complex Atari games show increased performance and better uncertainty utilization compared to baseline methods.

Conclusion: EVOI integration significantly improves Bootstrapped DQN’s exploration capability in sparse-reward settings while maintaining simplicity.

Abstract: Efficient exploration in deep reinforcement learning remains a fundamental challenge, especially in environments characterized by high-dimensional states and sparse rewards. Traditional exploration strategies that rely on random local policy noise, such as $\epsilon$-greedy and Boltzmann exploration methods, often struggle to efficiently balance exploration and exploitation. In this paper, we integrate the notion of (expected) value of information (EVOI) within the well-known Bootstrapped DQN algorithmic framework, to enhance the algorithm’s deep exploration ability. Specifically, we develop two novel algorithms that incorporate the expected gain from learning the value of information into Bootstrapped DQN. Our methods use value of information estimates to measure the discrepancies of opinions among distinct network heads, and drive exploration towards areas with the most potential. We evaluate our algorithms with respect to performance and their ability to exploit inherent uncertainty arising from random network initialization. Our experiments in complex, sparse-reward Atari games demonstrate increased performance, all the while making better use of uncertainty, and, importantly, without introducing extra hyperparameters.
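
A minimal reading of head-disagreement EVOI: treat each bootstrapped head as an equally likely hypothesis about the true Q-values and measure the expected gain of acting optimally per hypothesis over committing to the single ensemble-greedy action. This is an illustrative form, not necessarily the paper's exact estimator:

```python
import numpy as np

def evoi(head_q):
    """Expected value of (perfect) information from head disagreement.

    head_q: (n_heads, n_actions) Q-value estimates from bootstrapped
    heads. EVOI is zero when all heads agree on the best action and
    grows with disagreement, flagging states worth exploring.
    """
    mean_q = head_q.mean(axis=0)
    committed = mean_q.max()              # value of the ensemble-greedy action
    informed = head_q.max(axis=1).mean()  # value if the true hypothesis were known
    return informed - committed
```

Because max of the mean never exceeds the mean of the maxes, this quantity is non-negative, and it requires no extra hyperparameters beyond those of Bootstrapped DQN itself.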

[217] Heterogeneous Metamaterials Design via Multiscale Neural Implicit Representation

Hongrui Chen, Liwei Wang, Levent Burak Kara

Main category: cs.LG

TL;DR: A neural network framework for designing heterogeneous metamaterials that learns continuous two-scale representations, enabling seamless connectivity between unit cells without predefined datasets.

Motivation: Traditional metamaterial design methods face challenges with enormous design spaces, compatibility issues between cells, and limitations of data-driven approaches that rely on fixed microstructure libraries.

Method: Uses a multiscale neural representation where the network takes global and local coordinates as inputs, outputs implicit fields representing multiscale structures, and employs compatibility loss to enforce connectivity between adjacent unit cells.

Result: The framework produces metamaterial designs at arbitrarily high resolution, enabling infinite upsampling for fabrication or simulation, and demonstrates effectiveness on mechanical metamaterial design, negative Poisson’s ratio, and mechanical cloaking problems.

Conclusion: The proposed neural network-based approach successfully addresses key challenges in heterogeneous metamaterial design by providing continuous representations with inherent compatibility between unit cells, with applications in robotics, bioengineering, and aerospace.

Abstract: Metamaterials are engineered materials composed of specially designed unit cells that exhibit extraordinary properties beyond those of natural materials. Complex engineering tasks often require heterogeneous unit cells to accommodate spatially varying property requirements. However, designing heterogeneous metamaterials poses significant challenges due to the enormous design space and strict compatibility requirements between neighboring cells. Traditional concurrent multiscale design methods require solving an expensive optimization problem for each unit cell and often suffer from discontinuities at cell boundaries. On the other hand, data-driven approaches that assemble structures from a fixed library of microstructures are limited by the dataset and require additional post-processing to ensure seamless connections. In this work, we propose a neural network-based metamaterial design framework that learns a continuous two-scale representation of the structure, thereby jointly addressing these challenges. Central to our framework is a multiscale neural representation in which the neural network takes both global (macroscale) and local (microscale) coordinates as inputs, outputting an implicit field that represents multiscale structures with compatible unit cell geometries across the domain, without the need for a predefined dataset. We use a compatibility loss term during training to enforce connectivity between adjacent unit cells. Once trained, the network can produce metamaterial designs at arbitrarily high resolution, hence enabling infinite upsampling for fabrication or simulation. We demonstrate the effectiveness of the proposed approach on mechanical metamaterial design, negative Poisson’s ratio, and mechanical cloaking problems with potential applications in robotics, bioengineering, and aerospace.
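
The two-scale representation amounts to a network evaluated at concatenated macro and micro coordinates. A sketch with random stand-in weights (the trained network is not reproduced here) shows the interface and why arbitrary-resolution sampling comes for free:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_field(d_hidden=32):
    """Tiny random-weight MLP f(X, x) -> occupancy in (0, 1): a sketch of
    the multiscale neural implicit representation. Inputs are global
    (macroscale) coordinates X and local (microscale) coordinates x,
    concatenated; the weights here are random stand-ins."""
    W1 = rng.normal(size=(4, d_hidden)); b1 = rng.normal(size=d_hidden)
    W2 = rng.normal(size=(d_hidden, 1))

    def field(global_xy, local_xy):
        h = np.tanh(np.concatenate([global_xy, local_xy]) @ W1 + b1)
        return 1.0 / (1.0 + np.exp(-(h @ W2)[0]))  # implicit occupancy
    return field

field = make_field()
# The same continuous field can be sampled on any grid, which is what
# enables "infinite upsampling" for fabrication or simulation.
density = field(np.array([0.2, 0.7]), np.array([0.5, 0.5]))
```

Compatibility between adjacent unit cells is then a training-time constraint (the compatibility loss), not a post-processing step, because neighboring cells share the same continuous field.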

[218] Discrete Bayesian Sample Inference for Graph Generation

Ole Petersen, Marcel Kollovieh, Marten Lienen, Stephan Günnemann

Main category: cs.LG

TL;DR: GraphBSI is a novel one-shot graph generative model using Bayesian Sample Inference that refines beliefs over graphs in continuous parameter space, achieving state-of-the-art performance on molecular and synthetic graph generation.

Motivation: Graph-structured data is crucial for applications like molecular generation and knowledge graphs, but their discrete, unordered nature makes them difficult for traditional generative models, leading to the need for specialized approaches like discrete diffusion and flow matching.

Method: GraphBSI uses Bayesian Sample Inference to iteratively refine a belief over graphs in continuous distribution parameter space, handling discrete structures naturally. It formulates BSI as a stochastic differential equation and derives a noise-controlled family of SDEs that preserves marginal distributions via score function approximation.

Result: GraphBSI demonstrates state-of-the-art performance on molecular and synthetic graph generation, outperforming existing one-shot graph generative models on standard benchmarks Moses and GuacaMol.

Conclusion: GraphBSI provides an effective approach for graph generation by leveraging Bayesian inference in continuous parameter space, with theoretical connections to Bayesian Flow Networks and Diffusion models, and superior empirical performance compared to existing methods.

Abstract: Generating graph-structured data is crucial in applications such as molecular generation, knowledge graphs, and network analysis. However, their discrete, unordered nature makes them difficult for traditional generative models, leading to the rise of discrete diffusion and flow matching models. In this work, we introduce GraphBSI, a novel one-shot graph generative model based on Bayesian Sample Inference (BSI). Instead of evolving samples directly, GraphBSI iteratively refines a belief over graphs in the continuous space of distribution parameters, naturally handling discrete structures. Further, we state BSI as a stochastic differential equation (SDE) and derive a noise-controlled family of SDEs that preserves the marginal distributions via an approximation of the score function. Our theoretical analysis further reveals the connection to Bayesian Flow Networks and Diffusion models. Finally, in our empirical evaluation, we demonstrate state-of-the-art performance on molecular and synthetic graph generation, outperforming existing one-shot graph generative models on the standard benchmarks Moses and GuacaMol.
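
The belief-refinement idea can be sketched on a single graph's edge variables: keep a Beta belief per edge and update it from noisy binary observations, so the object that evolves is a continuous distribution parameter, not a discrete sample. This toy uses a fixed target graph and flip noise; in generation the target is implicit in a learned network:

```python
import numpy as np

rng = np.random.default_rng(0)

def bsi_refine(target_edges, steps=80, noise=0.25):
    """Bayesian-sample-inference sketch over binary edge variables.

    target_edges: 0/1 array of the (normally unknown) edge indicators.
    Each step observes every edge bit flipped with probability `noise`
    and folds the observation into a per-edge Beta(a, b) belief.
    Returns the posterior-mean belief per edge.
    """
    a = np.ones_like(target_edges, dtype=float)   # Beta(1, 1) prior
    b = np.ones_like(target_edges, dtype=float)
    for _ in range(steps):
        flip = rng.random(target_edges.shape) < noise
        obs = np.where(flip, 1 - target_edges, target_edges)
        a += obs
        b += 1 - obs
    return a / (a + b)
```

Thresholding the refined belief recovers the discrete graph, which is the sense in which iterating in parameter space "naturally handles" discrete structure.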

[219] Adaptive-Sensorless Monitoring of Shipping Containers

Lingqing Shen, Chi Heem Wong, Misaki Mito, Arnab Chakrabarti

Main category: cs.LG

TL;DR: The paper introduces adaptive-sensorless monitoring using residual correction to improve temperature and humidity predictions in shipping containers by correcting systematic biases after observing live telemetry data.

Motivation: Sensorless monitoring models for shipping containers don't incorporate telemetry information and have systematic errors, causing predictions to differ significantly from live data and confusing the users.

Method: Residual correction method - a general framework for correcting systematic biases in sensorless models after observing live telemetry data, creating adaptive-sensorless monitoring.

Result: Adaptive-sensorless models achieved MAEs of 2.24-2.31°C (vs 2.43°C baseline) for temperature and 5.72-7.09% (vs 7.99% baseline) for humidity, and RMSEs of 3.19-3.26°C (vs 3.38°C baseline) for temperature and 7.70-9.12% (vs 10.0% baseline) for humidity on 3.48 million data points.

Conclusion: Adaptive-sensorless models enable more accurate cargo monitoring, early risk detection, and reduced dependence on full connectivity in global shipping.

Abstract: Monitoring the internal temperature and humidity of shipping containers is essential to preventing quality degradation during cargo transportation. Sensorless monitoring – machine learning models that predict the internal conditions of the containers using exogenous factors – shows promise as an alternative to monitoring using sensors. However, it does not incorporate telemetry information and correct for systematic errors, causing the predictions to differ significantly from the live data and confusing the users. In this paper, we introduce the residual correction method, a general framework for correcting for systematic biases in sensorless models after observing live telemetry data. We call this class of models “adaptive-sensorless” monitoring. We train and evaluate adaptive-sensorless models on the 3.48 million data points – the largest dataset of container sensor readings ever used in academic research – and show that they produce consistent improvements over the baseline sensorless models. When evaluated on the holdout set of the simulated data, they achieve average mean absolute errors (MAEs) of 2.24 $\sim$ 2.31$^\circ$C (vs 2.43$^\circ$C by sensorless) for temperature and 5.72 $\sim$ 7.09% for relative humidity (vs 7.99% by sensorless) and average root mean-squared errors (RMSEs) of 3.19 $\sim$ 3.26$^\circ$C for temperature (vs 3.38$^\circ$C by sensorless) and 7.70 $\sim$ 9.12% for relative humidity (vs 10.0% by sensorless). Adaptive-sensorless models enable more accurate cargo monitoring, early risk detection, and less dependence on full connectivity in global shipping.
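
The residual correction idea in its simplest form: track an exponential moving average of (telemetry minus prediction) and add it back to future predictions. The EMA form and alpha are hypothetical choices; the paper's framework is more general:

```python
class ResidualCorrector:
    """Adaptive-sensorless sketch: correct a sensorless model's
    systematic bias using live telemetry, whenever it is available."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha   # smoothing factor (hypothetical knob)
        self.bias = 0.0      # running estimate of the systematic error

    def predict(self, sensorless_pred):
        # Corrected prediction = raw model output + learned bias.
        return sensorless_pred + self.bias

    def update(self, sensorless_pred, telemetry_obs):
        # Residual = live telemetry minus the raw model output.
        residual = telemetry_obs - sensorless_pred
        self.bias = (1 - self.alpha) * self.bias + self.alpha * residual
```

When connectivity drops and telemetry stops arriving, the corrector keeps applying the last bias estimate, which is how the approach reduces dependence on full connectivity.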

[220] Leveraging Discrete Function Decomposability for Scientific Design

James C. Bowden, Sergey Levine, Jennifer Listgarten

Main category: cs.LG

TL;DR: DADO is a new distributional optimization algorithm that leverages decomposability structure in design spaces using junction trees and graph message-passing for more efficient optimization.

Motivation: Many scientific property predictors are decomposable but current distributional optimization algorithms cannot exploit this structure, making optimization inefficient for combinatorial design spaces.

Method: Uses a soft-factorized search distribution with graph message-passing to coordinate optimization across linked factors defined by a junction tree on design variables.

Result: DADO enables more efficient navigation of combinatorial search spaces by leveraging decomposability structure.

Conclusion: The proposed DADO algorithm can effectively utilize decomposability in property predictors to improve distributional optimization for discrete design problems.

Abstract: In the era of AI-driven science and engineering, we often want to design discrete objects in silico according to user-specified properties. For example, we may wish to design a protein to bind its target, arrange components within a circuit to minimize latency, or find materials with certain properties. Given a property predictive model, in silico design typically involves training a generative model over the design space (e.g., protein sequence space) to concentrate on designs with the desired properties. Distributional optimization – which can be formalized as an estimation of distribution algorithm or as reinforcement learning policy optimization – finds the generative model that maximizes an objective function in expectation. Optimizing a distribution over discrete-valued designs is in general challenging because of the combinatorial nature of the design space. However, many property predictors in scientific applications are decomposable in the sense that they can be factorized over design variables in a way that could in principle enable more effective optimization. For example, amino acids at a catalytic site of a protein may only loosely interact with amino acids of the rest of the protein to achieve maximal catalytic activity. Current distributional optimization algorithms are unable to make use of such decomposability structure. Herein, we propose and demonstrate use of a new distributional optimization algorithm, Decomposition-Aware Distributional Optimization (DADO), that can leverage any decomposability defined by a junction tree on the design variables, to make optimization more efficient. At its core, DADO employs a soft-factorized “search distribution” – a learned generative model – for efficient navigation of the search space, invoking graph message-passing to coordinate optimization across linked factors.
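
The benefit of decomposability shows up even in the fully separable case, where each factor keeps its own categorical search distribution. The sketch below is an estimation-of-distribution toy with independent factors (a degenerate junction tree with no shared variables); DADO's soft factorization and message passing handle the general, linked case:

```python
import numpy as np

rng = np.random.default_rng(0)

def factorized_eda(score_factors, n_vars_per_factor, n_cats=4,
                   iters=30, pop=200, elite_frac=0.2):
    """Estimation-of-distribution sketch exploiting decomposability:
    the objective is a sum of independent factors, so each factor
    maintains and updates its own per-variable categorical distribution
    from its elite samples."""
    probs = [np.full((n, n_cats), 1.0 / n_cats) for n in n_vars_per_factor]
    for _ in range(iters):
        for score, p in zip(score_factors, probs):
            # Sample a population for this factor and keep the elites.
            samples = np.stack([
                [rng.choice(n_cats, p=p[v]) for v in range(p.shape[0])]
                for _ in range(pop)])
            fitness = np.array([score(s) for s in samples])
            elites = samples[np.argsort(fitness)[-int(pop * elite_frac):]]
            for v in range(p.shape[0]):
                counts = np.bincount(elites[:, v], minlength=n_cats)
                p[v] = 0.5 * p[v] + 0.5 * counts / counts.sum()
    # Return the mode of each factor's distribution.
    return [p.argmax(axis=1) for p in probs]
```

Each factor searches a space of size n_cats^n_vars instead of the product over all factors, which is the efficiency gain decomposability buys.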

[221] Data-Efficient Realized Volatility Forecasting with Vision Transformers

Emi Soroka, Artem Arzyn

Main category: cs.LG

TL;DR: Vision Transformer (ViT) applied to options data can predict 30-day realized volatility from implied volatility surfaces, showing ability to learn seasonal patterns and nonlinear features.

Motivation: Transformers have shown promise in financial forecasting but remain unexplored for options data, despite the potential of deep learning's complexity advantage in capturing nonlinear relationships.

Method: Train Vision Transformer (ViT) architecture on implied volatility surfaces augmented with date information to predict next 30-day realized volatility.

Result: ViT successfully learns seasonal patterns and nonlinear features from the IV surface, demonstrating promising forecasting capabilities.

Conclusion: Transformer models show potential for options data analysis, suggesting a promising direction for future model development in financial machine learning.

Abstract: Recent work in financial machine learning has shown the virtue of complexity: the phenomenon by which deep learning methods capable of learning highly nonlinear relationships outperform simpler approaches in financial forecasting. While transformer architectures like Informer have shown promise for financial time series forecasting, the application of transformer models for options data remains largely unexplored. We conduct preliminary studies towards the development of a transformer model for options data by training the Vision Transformer (ViT) architecture, typically used in modern image recognition and classification systems, to predict the realized volatility of an asset over the next 30 days from its implied volatility surface (augmented with date information) for a single day. We show that the ViT can learn seasonal patterns and nonlinear features from the IV surface, suggesting a promising direction for model development.
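
The input pipeline is the standard ViT recipe applied to the IV surface: tile the strike-by-maturity grid into flattened, non-overlapping patches that become tokens. Only the patchify step is sketched here; the learned linear projection and the date embedding mentioned in the abstract are omitted:

```python
import numpy as np

def patchify(surface, patch=4):
    """Split an implied-volatility surface (strike x maturity grid) into
    flattened non-overlapping patches: the token sequence a ViT consumes."""
    h, w = surface.shape
    assert h % patch == 0 and w % patch == 0, "grid must tile evenly"
    return (surface
            .reshape(h // patch, patch, w // patch, patch)
            .transpose(0, 2, 1, 3)       # (row-block, col-block, ph, pw)
            .reshape(-1, patch * patch))

iv = np.random.default_rng(1).uniform(0.1, 0.6, size=(16, 16))
tokens = patchify(iv)   # (16, 16) grid -> 16 tokens of length 16
```

Treating the surface as an image is what lets an off-the-shelf ViT pick up spatially local structure (smile and term-structure shape) alongside the injected date information.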

[222] Unsupervised Evaluation of Multi-Turn Objective-Driven Interactions

Emi Soroka, Tanmay Chopra, Krish Desai, Sanjay Lall

Main category: cs.LG

TL;DR: This paper introduces the first set of unsupervised metrics for evaluating objective-driven interactions between AI agents and humans in enterprise applications, addressing challenges with complex unlabeled data, impractical human annotation, and unreliable LLM judges.

Motivation: Current evaluation methods for LLM-based enterprise applications face significant challenges: complex unlabeled data, impractical human annotation at scale, limited custom metrics that can't detect new errors, and unreliable LLM judges.

Method: The authors develop unsupervised metrics that leverage statistical properties of unlabeled interaction data and use fine-tuned LLMs to adapt to distributional shifts. They create metrics for labeling user goals, measuring goal completion, and quantifying LLM uncertainty without relying on human-generated ideal responses.

Result: The proposed approach is validated on both open-domain and task-specific interaction data, demonstrating its effectiveness in evaluating objective-driven interactions without requiring labeled data or human annotation.

Conclusion: This work provides a practical solution for evaluating LLM-based enterprise applications by introducing unsupervised metrics that overcome the limitations of existing evaluation methods, enabling scalable and reliable assessment of objective-driven interactions.

Abstract: Large language models (LLMs) have seen increasing popularity in enterprise applications where AI agents and humans engage in objective-driven interactions. However, these systems are difficult to evaluate: data may be complex and unlabeled; human annotation is often impractical at scale; custom metrics can monitor for specific errors, but not previously-undetected ones; and LLM judges can produce unreliable results. We introduce the first set of unsupervised metrics for objective-driven interactions, leveraging statistical properties of unlabeled interaction data and using fine-tuned LLMs to adapt to distributional shifts. We develop metrics for labeling user goals, measuring goal completion, and quantifying LLM uncertainty without grounding evaluations in human-generated ideal responses. Our approach is validated on open-domain and task-specific interaction data.

[223] The Curved Spacetime of Transformer Architectures

Riccardo Di Sipio, Jairo Diaz-Rodriguez, Luis Serrano

Main category: cs.LG

TL;DR: This paper presents a geometric framework that analogizes Transformer language models to General Relativity, where attention mechanisms create curvature in representation space and token embeddings follow curved trajectories rather than straight paths.

Motivation: To provide a geometric understanding of how Transformer-based language models work by drawing explicit analogies to concepts from General Relativity, particularly focusing on how attention mechanisms create curvature in the representation space.

Method: Developed a geometric framework treating queries/keys as inducing an effective metric, attention as discrete connection for parallel transport, and layers as time-slices. Conducted three experiments: (i) visualizing curvature landscapes, (ii) statistical analysis of turning angles and length-to-chord ratios, and (iii) controlled context editing experiments inspired by Einstein’s eclipse experiment.

Result: Experimental results confirmed the presence and consequences of curvature: (i) visualization showed varying local turning angles across tokens and layers, (ii) statistical analysis revealed excess sharp/flat angles and longer length-to-chord ratios not explainable by chance, and (iii) controlled edits demonstrated measurable, meaning-consistent bends in embedding trajectories.

Conclusion: The geometric analogy to General Relativity holds - token embeddings in Transformers do not traverse straight paths but follow curved trajectories shaped by attention-induced curvature in the representation space, providing a novel geometric perspective on how these models process information.

Abstract: We present a geometric framework for understanding Transformer-based language models, drawing an explicit analogy to General Relativity. Queries and keys induce an effective metric on representation space, and attention acts as a discrete connection that implements parallel transport of value vectors across tokens. Stacked layers provide discrete time-slices through which token representations evolve on this curved manifold, while backpropagation plays the role of a least-action principle that shapes loss-minimizing trajectories in parameter space. If this analogy is correct, token embeddings should not traverse straight paths in feature space; instead, their layer-wise steps should bend and reorient as interactions mediated by embedding space curvature. To test this prediction, we design experiments that expose both the presence and the consequences of curvature: (i) we visualize a curvature landscape for a full paragraph, revealing how local turning angles vary across tokens and layers; (ii) we show through simulations that excess counts of sharp/flat angles and longer length-to-chord ratios are not explainable by dimensionality or chance; and (iii) inspired by Einstein’s eclipse experiment, we probe deflection under controlled context edits, demonstrating measurable, meaning-consistent bends in embedding trajectories that confirm attention-induced curvature.
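
The paper's trajectory diagnostics reduce to two quantities per token, both straightforward to compute from layer-wise embeddings: the turning angle at each layer and the path-length-to-chord ratio (a straight path gives zero angles and ratio 1):

```python
import numpy as np

def trajectory_stats(points):
    """Curvature diagnostics for one token's layer-wise trajectory.

    points: (n_layers, d) embeddings of a token across layers. Returns
    the turning angle (radians) at each interior point and the ratio of
    total path length to the straight-line chord.
    """
    steps = np.diff(points, axis=0)
    norms = np.linalg.norm(steps, axis=1)
    cosines = (steps[:-1] * steps[1:]).sum(axis=1) / (norms[:-1] * norms[1:])
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    path_len = norms.sum()
    chord = np.linalg.norm(points[-1] - points[0])
    return angles, path_len / chord
```

Excess sharp angles and ratios well above 1 across many tokens are the statistical signature the paper argues cannot be explained by dimensionality or chance alone.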

[224] Homomorphism distortion: A metric to distinguish them all and in the latent space bind them

Martin Carrasco, Olga Zaghen, Erik Bekkers, Bastian Rieck

Main category: cs.LG

TL;DR: The paper introduces graph homomorphism distortion as a principled measure for comparing vertex-attributed graphs, showing it can completely characterize graphs and serve as a complete graph embedding. It addresses computational challenges through sampling and demonstrates empirical superiority over existing methods.

Motivation: Current graph neural network expressivity is limited to combinatorial properties, creating a need for principled similarity measures between vertex-attributed graphs that can fully characterize graph structures.

Method: Proposes graph homomorphism distortion measure, addresses computational challenges via sampling to ensure completeness in expectation, and derives a metric from this measure.

Result: The method fully distinguishes the BREC dataset with up to 4-WL non-distinguishable graphs and outperforms previous homomorphism-inspired methods on the ZINC-12k dataset.

Conclusion: Graph homomorphism distortion provides a powerful new approach for graph characterization that extends beyond traditional combinatorial methods and opens new frontiers in graph theory.

Abstract: For far too long, expressivity of graph neural networks has been measured \emph{only} in terms of combinatorial properties. In this work we stray away from this tradition and provide a principled way to measure similarity between vertex attributed graphs. We denote this measure as the \emph{graph homomorphism distortion}. We show it can \emph{completely characterize} graphs and thus is also a \emph{complete graph embedding}. However, somewhere along the road, we run into the graph canonization problem. To circumvent this obstacle, we devise to efficiently compute this measure via sampling, which in expectation ensures \emph{completeness}. Additionally, we also discovered that we can obtain a metric from this measure. We validate our claims empirically and find that the \emph{graph homomorphism distortion}: (1.) fully distinguishes the \texttt{BREC} dataset with up to $4$-WL non-distinguishable graphs, and (2.) \emph{outperforms} previous methods inspired in homomorphisms under the \texttt{ZINC-12k} dataset. These theoretical results, (and their empirical validation), pave the way for future characterization of graphs, extending the graph theoretic tradition to new frontiers.
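
A sampling surrogate conveys the flavor of the measure: draw random vertex maps between the two graphs and score how badly each map breaks adjacency and attributes. This is a toy stand-in; the paper's distortion is defined over homomorphisms and comes with completeness guarantees that random maps do not provide:

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_distortion(A1, X1, A2, X2, n_maps=2000):
    """Sampling sketch of a distortion between vertex-attributed graphs.

    A1, A2: adjacency matrices; X1, X2: per-vertex attribute arrays.
    Draws random vertex maps g: V1 -> V2 and returns the minimum sampled
    distortion (broken-edge count plus attribute mismatch).
    """
    n1, n2 = A1.shape[0], A2.shape[0]
    best = np.inf
    for _ in range(n_maps):
        g = rng.integers(0, n2, size=n1)
        edge_err = np.abs(A1 - A2[np.ix_(g, g)]).sum()   # broken adjacencies
        attr_err = np.linalg.norm(X1 - X2[g])            # attribute mismatch
        best = min(best, edge_err + attr_err)
    return best
```

Identical graphs score zero (the identity map is almost surely sampled on small graphs), while structurally or attribute-wise different graphs cannot reach zero, which is the intuition behind using such a measure as a complete embedding.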

[225] Online Learning to Rank under Corruption: A Robust Cascading Bandits Approach

Fatemeh Ghaffari, Siddarth Sitaraman, Xutong Liu, Xuchuang Wang, Mohammad Hajiesmaili

Main category: cs.LG

TL;DR: MSUCB is a robust online learning to rank algorithm that uses mean-of-medians estimator to handle click fraud and corruption, achieving optimal regret without corruption and graceful degradation under corruption.

Motivation: Online learning to rank systems are vulnerable to click fraud and manipulation, where corrupted feedback misleads learning and degrades user experience.

Method: Proposes MSUCB algorithm with novel mean-of-medians estimator that filters outliers and corrupted samples while behaving like standard mean in clean settings.

Result: Achieves optimal logarithmic regret without corruption and only additive regret increase under corruption. Outperforms state-of-the-art methods with 97.35% and 91.60% regret improvements.

Conclusion: MSUCB provides strong robustness against corruption while maintaining optimal performance in clean environments, making it practical for real-world online recommendation systems.

Abstract: Online learning to rank (OLTR) studies how to recommend a short ranked list of items from a large pool and improves future rankings based on user clicks. This setting is commonly modeled as cascading bandits, where the objective is to maximize the likelihood that the user clicks on at least one of the presented items across as many timesteps as possible. However, such systems are vulnerable to click fraud and other manipulations (i.e., corruption), where bots or paid click farms inject corrupted feedback that misleads the learning process and degrades user experience. In this paper, we propose MSUCB, a robust algorithm that incorporates a novel mean-of-medians estimator, which to our knowledge is applied to the bandits-with-corruption setting for the first time. This estimator behaves like a standard mean in the absence of corruption, so no cost is paid for robustness. Under corruption, the median step filters out outliers and corrupted samples, keeping the estimate close to its true value. Updating this estimate at every round further accelerates empirical convergence in experiments. Hence, MSUCB achieves optimal logarithmic regret in the absence of corruption and degrades gracefully under corruptions, with regret increasing only by an additive term tied to the total corruption. Comprehensive and extensive experiments on real-world datasets further demonstrate that our approach consistently outperforms prior methods while maintaining strong robustness. In particular, it achieves a 97.35% and a 91.60% regret improvement over two state-of-the-art methods.
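
The estimator at the core of MSUCB is simple to state: split the samples into groups, take each group's median, and average the medians. The group count below is a hypothetical choice, not the paper's schedule:

```python
import numpy as np

def mean_of_medians(samples, n_groups=5):
    """Mean-of-medians sketch: on clean data it tracks the sample mean;
    under corruption, each group's median discards that group's outliers
    before averaging."""
    samples = np.asarray(samples, dtype=float)
    groups = np.array_split(samples, n_groups)
    return float(np.mean([np.median(g) for g in groups]))
```

A single huge corrupted observation shifts the plain mean arbitrarily far but leaves the mean-of-medians essentially untouched, which is the "graceful degradation" property the regret analysis builds on.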

[226] Sparse, self-organizing ensembles of local kernels detect rare statistical anomalies

Gaia Grosso, Sai Sumedh R. Hindupur, Thomas Fel, Samuel Bright-Thonney, Philip Harris, Demba Ba

Main category: cs.LG

TL;DR: SparKer is a sparse ensemble of Gaussian kernels that detects anomalies by modeling local likelihood ratios in high-dimensional representation spaces, addressing the limitations of traditional anomaly detection methods.

DetailsMotivation: Current AI representations lack controlled statistical properties, causing anomaly detection methods to fail when weak or rare signals are hidden within normal data patterns.

Method: SparKer uses self-organizing local kernels based on three principles: sparsity (parsimony), locality (geometric sensitivity), and competition (efficient capacity allocation). It trains sparse Gaussian kernel ensembles in a semi-supervised Neyman-Pearson framework to model local likelihood ratios.

Result: The method successfully identifies statistically significant anomalies in high-dimensional spaces (thousands of dimensions) using only a handful of kernels, demonstrating interpretability, efficiency, and scalability across scientific discovery, novelty detection, intrusion detection, and generative-model validation.

Conclusion: The proposed approach effectively bridges the gap in anomaly detection by providing controlled statistical properties for AI representations, enabling reliable detection of weak and rare signals while maintaining interpretability and scalability.

Abstract: Modern artificial intelligence has revolutionized our ability to extract rich and versatile data representations across scientific disciplines. Yet, the statistical properties of these representations remain poorly controlled, causing misspecified anomaly detection (AD) methods to falter. Weak or rare signals can remain hidden within the apparent regularity of normal data, creating a gap in our ability to detect and interpret anomalies. We examine this gap and identify a set of structural desiderata for detection methods operating under minimal prior information: sparsity, to enforce parsimony; locality, to preserve geometric sensitivity; and competition, to promote efficient allocation of model capacity. These principles define a class of self-organizing local kernels that adaptively partition the representation space around regions of statistical imbalance. As an instantiation of these principles, we introduce SparKer, a sparse ensemble of Gaussian kernels trained within a semi-supervised Neyman–Pearson framework to locally model the likelihood ratio between a sample that may contain anomalies and a nominal, anomaly-free reference. We provide theoretical insights into the mechanisms that drive detection and self-organization in the proposed model, and demonstrate the effectiveness of this approach on realistic high-dimensional problems of scientific discovery, open-world novelty detection, intrusion detection, and generative-model validation. Our applications span both the natural- and computer-science domains. We demonstrate that ensembles containing only a handful of kernels can identify statistically significant anomalous locations within representation spaces of thousands of dimensions, underscoring the interpretability, efficiency, and scalability of the proposed approach.
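A 1-D toy sketch of the underlying likelihood-ratio idea: a handful of local Gaussian kernels score where a test sample is enriched relative to an anomaly-free reference. The fixed kernel centers and bandwidth are illustrative assumptions; the paper's trained, self-organizing sparse ensemble on high-dimensional representations is far richer than this.

```python
import math
import random

def kernel_scores(points, centers, bandwidth=0.3):
    """Average Gaussian-kernel response of a sample at each center."""
    n = len(points)
    return [
        sum(math.exp(-((x - c) ** 2) / (2 * bandwidth ** 2)) for x in points) / n
        for c in centers
    ]

random.seed(1)
reference = [random.gauss(0.0, 1.0) for _ in range(4000)]
# test sample: same bulk plus a rare, localized excess near x = 2.5
test = [random.gauss(0.0, 1.0) for _ in range(3800)] + \
       [random.gauss(2.5, 0.1) for _ in range(200)]

centers = [-2, -1, 0, 1, 2, 2.5]            # a handful of local kernels
ref_s = kernel_scores(reference, centers)
test_s = kernel_scores(test, centers)
# local log-ratio per kernel: large only where the test sample is enriched
log_ratio = [math.log(t / r) for t, r in zip(test_s, ref_s)]
peak = centers[max(range(len(centers)), key=lambda i: log_ratio[i])]
```

The kernel sitting on the injected excess dominates the score while the bulk kernels stay near zero, which is the "local statistical imbalance" the paper's ensemble organizes around.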

[227] An Efficient Classification Model for Cyber Text

Md Sakhawat Hossen, Md. Zashid Iqbal Borshon, A. S. M. Badrudduza

Main category: cs.LG

TL;DR: Proposes CTF-IDF and IRLBA for efficient text analytics, reducing carbon footprint vs deep learning with minor accuracy trade-offs.

DetailsMotivation: Address the high carbon footprint of deep learning in text analytics by reviving classical methods with efficiency improvements.

Method: Modified TF-IDF (CTF-IDF) for preprocessing and IRLBA for dimensionality reduction in classical ML text analytics pipeline.

Result: Significant reduction in time complexity and computational resources with minor accuracy compromise compared to deep learning.

Conclusion: Classical ML with CTF-IDF and IRLBA offers efficient, low-carbon alternative to deep learning for text analytics.

Abstract: The rise of deep learning methodology and practice in recent years has brought a severe consequence: a growing carbon footprint, driven by the insatiable demand for computational resources and power. The field of text analytics has also experienced a massive transformation in this trend toward a monopolizing methodology. In this paper, the original TF-IDF algorithm is modified, and Clement Term Frequency-Inverse Document Frequency (CTF-IDF) is proposed for data preprocessing. The paper primarily discusses the effectiveness of classical machine learning techniques in text analytics with CTF-IDF and the faster IRLBA algorithm for dimensionality reduction. Introducing both techniques into the conventional text analytics pipeline yields an application that is more efficient, faster, and less computationally intensive, and hence lighter in carbon footprint, than deep learning methodology, with only a minor compromise in accuracy. The experimental results also exhibit a manifold reduction in time complexity and an improvement in model accuracy for the classical machine learning methods discussed in this paper.
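For orientation, a sketch of the standard TF-IDF baseline that CTF-IDF modifies (the abstract does not give the "Clement" variant's formula, so only the classical weighting is shown); IRLBA would then shrink the resulting vectors via implicitly restarted truncated SVD.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Classical TF-IDF vectors: term frequency scaled by a smoothed
    inverse document frequency.  (Baseline only; the paper's CTF-IDF
    modification is not specified in the abstract.)"""
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    idf = {t: math.log(n / df[t]) + 1.0 for t in vocab}   # smoothed IDF
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] / len(doc) * idf[t] for t in vocab])
    return vocab, vectors

corpus = ["phishing email detected", "normal email traffic",
          "phishing link detected"]
vocab, vecs = tfidf(corpus)
```

Rarer terms (here the hypothetical token "traffic") receive higher weight than terms spread across documents, which is the discriminative signal the classical classifiers then consume.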

[228] Towards Scalable Backpropagation-Free Gradient Estimation

Daniel Wang, Evan Markou, Dylan Campbell

Main category: cs.LG

TL;DR: A new gradient estimation method that reduces bias and variance by manipulating upstream Jacobian matrices, showing promising scalability to larger networks.

DetailsMotivation: Backpropagation requires two passes and stores intermediate activations, while existing forward-mode automatic differentiation methods have high variance and bias issues when scaling beyond small networks.

Method: Manipulating upstream Jacobian matrices when computing guess directions to reduce both bias and variance in gradient estimation.

Result: The method shows promising results and performs better as network width increases, with potential to scale to larger networks.

Conclusion: The approach effectively addresses limitations of existing gradient estimation methods through bias and variance reduction, facilitated by understanding the low-dimensional structure of neural network gradients.

Abstract: While backpropagation (reverse-mode automatic differentiation) has been extraordinarily successful in deep learning, it requires two passes (forward and backward) through the neural network and the storage of intermediate activations. Existing gradient estimation methods that instead use forward-mode automatic differentiation struggle to scale beyond small networks due to the high variance of the estimates. Efforts to mitigate this have so far introduced significant bias to the estimates, reducing their utility. We introduce a gradient estimation approach that reduces both bias and variance by manipulating upstream Jacobian matrices when computing guess directions. It shows promising results and has the potential to scale to larger networks, indeed performing better as the network width is increased. Our understanding of this method is facilitated by analyses of bias and variance, and their connection to the low-dimensional structure of neural network gradients.
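The baseline this work improves on can be sketched as a weight-perturbation forward gradient: sample a random guess direction, scale it by the directional derivative (a Jacobian-vector product, which forward-mode AD computes in one pass), and average. It is unbiased but high-variance; the paper's Jacobian-shaped guess directions are not shown here.

```python
import random

def forward_gradient(grad_fn, w, num_samples=2000, rng=random):
    """Forward-gradient estimate: E[(g . v) v] = g for standard normal v.
    grad_fn stands in for the JVP oracle -- in practice forward-mode AD
    returns (g . v) directly without ever materializing g."""
    d = len(w)
    g = grad_fn(w)
    est = [0.0] * d
    for _ in range(num_samples):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        jvp = sum(gi * vi for gi, vi in zip(g, v))   # directional derivative
        for i in range(d):
            est[i] += jvp * v[i] / num_samples
    return est

# toy objective f(w) = sum(w_i^2), whose true gradient is 2w
w = [1.0, -2.0, 0.5]
random.seed(0)
est = forward_gradient(lambda w: [2 * wi for wi in w], w)
```

The estimator's variance grows with dimension, which is exactly the scaling obstacle the abstract describes for larger networks.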

[229] FP-AbDiff: Improving Score-based Antibody Design by Capturing Nonequilibrium Dynamics through the Underlying Fokker-Planck Equation

Jiameng Chen, Yida Xiong, Kun Li, Hongzhi Zhang, Xiantao Cai, Wenbin Hu, Jia Wu

Main category: cs.LG

TL;DR: FP-AbDiff is a physics-informed antibody generator that enforces Fokker-Planck Equation physics throughout the generative process, achieving state-of-the-art performance in computational antibody design with improved physical plausibility and generalization.

DetailsMotivation: Existing antibody generative models suffer from lack of dynamical consistency (producing physically implausible structures) and poor generalization due to data scarcity and structural bias.

Method: Enforces Fokker-Planck Equation physics via a novel FPE residual loss over CDR geometries (R^3 x SO(3)), integrated with deep biological priors in an SE(3)-equivariant diffusion framework.

Result: Achieves 0.99 Å RMSD in CDR-H3 design (25% improvement), 39.91% Contact Amino Acid Recovery, and 45.67% full-chain Amino Acid Recovery on CDR-H3 in six-CDR co-design.

Conclusion: FP-AbDiff establishes a principled approach for physically faithful and functionally viable antibody design by aligning generative dynamics with physical laws.

Abstract: Computational antibody design holds immense promise for therapeutic discovery, yet existing generative models are fundamentally limited by two core challenges: (i) a lack of dynamical consistency, which yields physically implausible structures, and (ii) poor generalization due to data scarcity and structural bias. We introduce FP-AbDiff, the first antibody generator to enforce Fokker-Planck Equation (FPE) physics along the entire generative trajectory. Our method minimizes a novel FPE residual loss over the mixed manifold of CDR geometries (R^3 x SO(3)), compelling locally-learned denoising scores to assemble into a globally coherent probability flow. This physics-informed regularizer is synergistically integrated with deep biological priors within a state-of-the-art SE(3)-equivariant diffusion framework. Rigorous evaluation on the RAbD benchmark confirms that FP-AbDiff establishes a new state-of-the-art. In de novo CDR-H3 design, it achieves a mean Root Mean Square Deviation of 0.99 Å when superposing on the variable region, a 25% improvement over the previous state-of-the-art model, AbX, and the highest reported Contact Amino Acid Recovery of 39.91%. This superiority is underscored in the more challenging six-CDR co-design task, where our model delivers consistently superior geometric precision, cutting the average full-chain Root Mean Square Deviation by ~15%, and crucially, achieves the highest full-chain Amino Acid Recovery on the functionally dominant CDR-H3 loop (45.67%). By aligning generative dynamics with physical laws, FP-AbDiff enhances robustness and generalizability, establishing a principled approach for physically faithful and functionally viable antibody design.

[230] An Augmentation Overlap Theory of Contrastive Learning

Qi Zhang, Yifei Wang, Yisen Wang

Main category: cs.LG

TL;DR: The paper provides theoretical bounds for self-supervised contrastive learning, relaxing the conditional independence assumption to a more practical augmentation overlap assumption, and develops an unsupervised evaluation metric.

DetailsMotivation: To understand the working mechanism of self-supervised contrastive learning and address the unclear theoretical foundations.

Method: Derive tight bounds based on conditional independence assumption, then relax to augmentation overlap assumption and develop asymptotically closed bounds. Propose an unsupervised evaluation metric based on the new perspective.

Result: The augmentation overlap theory shows that aggressive data augmentations make intra-class samples more overlapped, enabling contrastive learning to cluster them together. The unsupervised metric aligns well with downstream performance.

Conclusion: The proposed augmentation overlap theory provides better theoretical understanding of contrastive learning, and the unsupervised evaluation metric offers practical utility without additional modules.

Abstract: Recently, self-supervised contrastive learning has achieved great success on various tasks. However, its underlying working mechanism is yet unclear. In this paper, we first provide the tightest bounds based on the widely adopted assumption of conditional independence. Further, we relax the conditional independence assumption to a more practical assumption of augmentation overlap and derive the asymptotically closed bounds for the downstream performance. Our proposed augmentation overlap theory hinges on the insight that the support of different intra-class samples will become more overlapped under aggressive data augmentations, thus simply aligning the positive samples (augmented views of the same sample) could make contrastive learning cluster intra-class samples together. Moreover, from the newly derived augmentation overlap perspective, we develop an unsupervised metric for the representation evaluation of contrastive learning, which aligns well with the downstream performance almost without relying on additional modules. Code is available at https://github.com/PKU-ML/GARC.

[231] From Insight to Exploit: Leveraging LLM Collaboration for Adaptive Adversarial Text Generation

Najrin Sultana, Md Rafi Ur Rashid, Kang Gu, Shagufta Mehnaz

Main category: cs.LG

TL;DR: This paper introduces StaDec and DyDec, two novel attack frameworks that generate dynamic and adaptive adversarial examples for LLMs to systematically assess their robustness against adversarial inputs.

DetailsMotivation: While LLMs show strong zero-shot performance, their robustness against adversarial inputs needs thorough assessment, especially for sensitive tasks.

Method: Developed Static Deceptor (StaDec) and Dynamic Deceptor (DyDec) frameworks that use LLM-driven pipelines to generate subtle, natural-looking adversarial examples while preserving semantic similarity to original text.

Result: The attacks demonstrate strong transferability across unknown models and evolve with LLM advancements, providing an automated approach without external heuristics.

Conclusion: This work provides a systematic framework for self-assessment of LLM robustness against adversarial attacks, with code and data publicly released.

Abstract: LLMs can provide substantial zero-shot performance on diverse tasks using a simple task prompt, eliminating the need for training or fine-tuning. However, when applying these models to sensitive tasks, it is crucial to thoroughly assess their robustness against adversarial inputs. In this work, we introduce Static Deceptor (StaDec) and Dynamic Deceptor (DyDec), two innovative attack frameworks designed to systematically generate dynamic and adaptive adversarial examples by leveraging the understanding of the LLMs. We produce subtle and natural-looking adversarial inputs that preserve semantic similarity to the original text while effectively deceiving the target LLM. By utilizing an automated, LLM-driven pipeline, we eliminate the dependence on external heuristics. Our attacks evolve with the advancements in LLMs and demonstrate strong transferability across models unknown to the attacker. Overall, this work provides a systematic approach for the self-assessment of an LLM’s robustness. We release our code and data at https://github.com/Shukti042/AdversarialExample.

[232] Test Time Adaptation Using Adaptive Quantile Recalibration

Paria Mehrbod, Pedro Vianna, Geraldin Nanfack, Guy Wolf, Eugene Belilovsky

Main category: cs.LG

TL;DR: AQR is a test-time adaptation method that recalibrates pre-activation distributions using quantile alignment, enabling robust domain adaptation without retraining across various normalization layers and architectures.

DetailsMotivation: Conventional domain adaptation methods require target domain knowledge or model retraining, limiting practicality in dynamic environments. Existing test-time methods based on batch normalization updates fail to capture complex activation distributions and are limited to specific normalization layers.

Method: Proposes Adaptive Quantile Recalibration (AQR) that modifies pre-activation distributions by aligning quantiles channel-wise. Uses source-domain statistics from training and includes robust tail calibration for stability across varying batch sizes. Works with BatchNorm, GroupNorm, and LayerNorm.

Result: Experiments on CIFAR-10-C, CIFAR-100-C, and ImageNet-C across multiple architectures show AQR outperforms existing test-time adaptation baselines, achieving robust adaptation across diverse settings.

Conclusion: AQR demonstrates strong potential for real-world deployment in dynamic scenarios with unpredictable data distributions, providing effective unsupervised adaptation without model retraining.

Abstract: Domain adaptation is a key strategy for enhancing the generalizability of deep learning models in real-world scenarios, where test distributions often diverge significantly from the training domain. However, conventional approaches typically rely on prior knowledge of the target domain or require model retraining, limiting their practicality in dynamic or resource-constrained environments. Recent test-time adaptation methods based on batch normalization statistic updates allow for unsupervised adaptation, but they often fail to capture complex activation distributions and are constrained to specific normalization layers. We propose Adaptive Quantile Recalibration (AQR), a test-time adaptation technique that modifies pre-activation distributions by aligning quantiles on a channel-wise basis. AQR captures the full shape of activation distributions and generalizes across architectures employing BatchNorm, GroupNorm, or LayerNorm. To address the challenge of estimating distribution tails under varying batch sizes, AQR incorporates a robust tail calibration strategy that improves stability and precision. Our method leverages source-domain statistics computed at training time, enabling unsupervised adaptation without retraining models. Experiments on CIFAR-10-C, CIFAR-100-C, and ImageNet-C across multiple architectures demonstrate that AQR achieves robust adaptation across diverse settings, outperforming existing test-time adaptation baselines. These results highlight AQR’s potential for deployment in real-world scenarios with dynamic and unpredictable data distributions.
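A one-channel sketch of the quantile-recalibration idea under stated assumptions (plain piecewise-linear quantile lookup, no robust tail calibration, one channel instead of channel-wise): each test-batch value is mapped through its empirical quantile onto the stored source-domain quantiles.

```python
import random
import statistics

def quantile_recalibrate(target_vals, source_quantiles):
    """Map each value in a shifted test batch through its empirical
    quantile onto the matching source-domain quantile (one channel;
    AQR applies this channel-wise with extra tail handling)."""
    ranks = sorted(target_vals)
    n = len(ranks)
    out = []
    for x in target_vals:
        q = ranks.index(x) / (n - 1)          # empirical CDF position
        pos = q * (len(source_quantiles) - 1)  # linear quantile lookup
        lo = int(pos)
        hi = min(lo + 1, len(source_quantiles) - 1)
        frac = pos - lo
        out.append(source_quantiles[lo] * (1 - frac)
                   + source_quantiles[hi] * frac)
    return out

random.seed(0)
# source-domain statistics computed "at training time"
source = sorted(random.gauss(0.0, 1.0) for _ in range(10_000))
source_q = [source[int(p * (len(source) - 1) / 100)] for p in range(101)]
# corrupted test batch: shifted mean and inflated scale
shifted = [random.gauss(3.0, 2.0) for _ in range(256)]
recal = quantile_recalibrate(shifted, source_q)
```

Because only ranks are used, the whole shape of the pre-activation distribution is restored, not just its first two moments as in batch-norm statistic updates.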

[233] Forecast2Anomaly (F2A): Adapting Multivariate Time Series Foundation Models for Anomaly Prediction

Atif Hassan, Tarun Kumar, Ashish Mishra, Sergey Serebryakov, Satish Kumar Mopur, Phanidhar Koganti, Murthy Chelankuri, Ramanagopal Vogety, Suparna Bhattacharya, Martin Foltin

Main category: cs.LG

TL;DR: F2A is a framework that enables Time Series Foundation Models to predict anomalies by combining targeted fine-tuning with dynamic retrieval, achieving zero-shot anomaly prediction across diverse datasets.

DetailsMotivation: Existing anomaly prediction methods fail to generalize across systems and evolving anomaly patterns, while pretrained Time Series Foundation Models have strong generalization but haven't been adapted for anomaly prediction tasks.

Method: Uses joint forecast-anomaly loss to fine-tune TSFMs for accurate forecasting at anomalous points, and a Retrieval-Augmented Generation module that retrieves relevant historical horizons to adapt to distributional shifts at inference time.

Result: Outperforms state-of-the-art methods across 16 diverse datasets and multiple TSFM backbones, providing scalable zero-shot anomaly prediction without requiring model updates.

Conclusion: F2A successfully bridges the gap between TSFM zero-shot forecasting and zero-shot anomaly prediction, offering a practical solution for real-world applications with evolving anomaly patterns.

Abstract: Forecasting anomalies (anomaly prediction) in multivariate time series from different real-world, dynamic, and complex systems is vital for preempting critical failures, leading to a substantial minimization in operational costs and human labor. Yet, existing methods are limited to specific systems while failing to generalize to evolving anomaly patterns over time. In contrast, pretrained Time Series Foundation Models (TSFMs) have recently demonstrated strong generalization and zero-shot forecasting capabilities. However, their potential remains untapped for anomaly prediction, a task fundamentally different from forecasting normal behavior. Thus, we present Forecast2Anomaly (F2A), a novel framework that empowers TSFMs with anomaly prediction abilities through two key innovations. First, we propose a joint forecast-anomaly loss that fine-tunes TSFMs to accurately forecast future signals even at anomalous time points. Second, we introduce a Retrieval-Augmented Generation (RAG) module that retrieves historically relevant horizons and conditions predictions on them. This component dynamically adapts to distributional shifts at inference time, enabling F2A to track evolving anomalies without requiring model updates. By combining targeted fine-tuning with dynamic retrieval, F2A bridges the gap between robust TSFM zero-shot forecasting and zero-shot anomaly prediction. Extensive experiments across 16 diverse datasets and multiple TSFM backbones show that F2A consistently outperforms state-of-the-art methods, offering a scalable, zero-shot anomaly prediction solution for real-world applications.
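The retrieval step can be sketched as a nearest-neighbor lookup over historical context windows; the Euclidean similarity and toy series below are illustrative assumptions, and in the paper the retrieved horizons condition a fine-tuned TSFM's forecast.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def retrieve_horizons(history, context, window, top_k=2):
    """Return the stored horizons whose preceding context window is
    closest to the current one.  (Similarity metric is a plain
    Euclidean distance here, an assumption.)"""
    candidates = []
    for start in range(len(history) - 2 * window + 1):
        past = history[start:start + window]
        horizon = history[start + window:start + 2 * window]
        candidates.append((euclidean(past, context), horizon))
    candidates.sort(key=lambda t: t[0])
    return [h for _, h in candidates[:top_k]]

# toy series: a repeating ramp whose continuation is recoverable from history
history = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
context = [0, 1, 2, 3]                  # window just observed at inference
horizons = retrieve_horizons(history, context, window=4, top_k=1)
```

Since retrieval happens at inference time over whatever history is available, distributional shifts can be tracked without updating model weights, as the abstract emphasizes.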

[234] UnCLe: Towards Scalable Dynamic Causal Discovery in Non-linear Temporal Systems

Tingzhu Bi, Yicheng Pan, Xinrui Jiang, Huize Sun, Meng Ma, Ping Wang

Main category: cs.LG

TL;DR: UnCLe is a deep learning method for dynamic causal discovery that uses Uncoupler and Recoupler networks to disentangle time series and learn evolving causal relationships through temporal perturbations.

DetailsMotivation: Real-world systems often exhibit dynamic causality where relationships evolve over time, but most methods only infer static causal graphs, creating a need for time-resolved causal analysis.

Method: Uses Uncoupler and Recoupler networks to disentangle time series into semantic representations, learns inter-variable dependencies via auto-regressive Dependency Matrices, and estimates dynamic causal influences through datapoint-wise prediction errors from temporal perturbations.

Result: Outperforms state-of-the-art baselines on static causal discovery benchmarks and uniquely captures evolving temporal causality in both synthetic and real-world dynamic systems like human motion.

Conclusion: UnCLe provides a promising approach for revealing time-varying mechanisms in complex phenomena through scalable dynamic causal discovery.

Abstract: Uncovering cause-effect relationships from observational time series is fundamental to understanding complex systems. While many methods infer static causal graphs, real-world systems often exhibit dynamic causality-where relationships evolve over time. Accurately capturing these temporal dynamics requires time-resolved causal graphs. We propose UnCLe, a novel deep learning method for scalable dynamic causal discovery. UnCLe employs a pair of Uncoupler and Recoupler networks to disentangle input time series into semantic representations and learns inter-variable dependencies via auto-regressive Dependency Matrices. It estimates dynamic causal influences by analyzing datapoint-wise prediction errors induced by temporal perturbations. Extensive experiments demonstrate that UnCLe not only outperforms state-of-the-art baselines on static causal discovery benchmarks but, more importantly, exhibits a unique capability to accurately capture and represent evolving temporal causality in both synthetic and real-world dynamic systems (e.g., human motion). UnCLe offers a promising approach for revealing the underlying, time-varying mechanisms of complex phenomena.

[235] Periodic Skill Discovery

Jonghae Park, Daesol Cho, Jusuk Lee, Dongseok Shim, Inkyu Jang, H. Jin Kim

Main category: cs.LG

TL;DR: PSD is an unsupervised RL framework that discovers diverse periodic skills by mapping states to a circular latent space, enabling effective learning of periodic behaviors for robotic tasks.

DetailsMotivation: Current unsupervised skill discovery methods overlook periodic nature of skills, while many robotic tasks (especially locomotion) require periodic behaviors across varying timescales.

Method: Train an encoder that maps states to a circular latent space to naturally encode periodicity, capturing temporal distance to learn skills with diverse periods.

Result: PSD effectively learns periodic skills in complex robotic tasks with pixel observations, achieves high performance on downstream tasks like hurdling, and when integrated with existing methods provides more diverse behaviors.

Conclusion: PSD successfully discovers diverse periodic skills in an unsupervised manner, expanding the agent’s behavioral repertoire and demonstrating practical utility for robotic applications.

Abstract: Unsupervised skill discovery in reinforcement learning (RL) aims to learn diverse behaviors without relying on external rewards. However, current methods often overlook the periodic nature of learned skills, focusing instead on increasing the mutual dependence between states and skills or maximizing the distance traveled in latent space. Considering that many robotic tasks – particularly those involving locomotion – require periodic behaviors across varying timescales, the ability to discover diverse periodic skills is essential. Motivated by this, we propose Periodic Skill Discovery (PSD), a framework that discovers periodic behaviors in an unsupervised manner. The key idea of PSD is to train an encoder that maps states to a circular latent space, thereby naturally encoding periodicity in the latent representation. By capturing temporal distance, PSD can effectively learn skills with diverse periods in complex robotic tasks, even with pixel-based observations. We further show that these learned skills achieve high performance on downstream tasks such as hurdling. Moreover, integrating PSD with an existing skill discovery method offers more diverse behaviors, thus broadening the agent’s repertoire. Our code and demos are available at https://jonghaepark.github.io/psd/
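The circular latent space can be illustrated directly. In the paper the encoder is learned; here the phase-to-circle map is hand-coded to show why periodicity is built into the geometry.

```python
import math

def circular_encode(phase):
    """Map a phase in [0, 1) onto the unit circle: phases that differ
    by a whole period land on the same latent point."""
    theta = 2 * math.pi * phase
    return (math.cos(theta), math.sin(theta))

def circular_distance(z1, z2):
    """Temporal distance as the angle between two latent points (radians)."""
    dot = max(-1.0, min(1.0, z1[0] * z2[0] + z1[1] * z2[1]))
    return math.acos(dot)

a = circular_encode(0.05)
b = circular_encode((0.05 + 1.0) % 1.0)   # one full period later: same point
c = circular_encode(0.55)                 # half a period later: maximally far
```

States one period apart coincide in latent space while states half a period apart are antipodal, so the temporal-distance objective naturally selects periodic behaviors.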

[236] Efficient Linear Attention for Multivariate Time Series Modeling via Entropy Equality

Mingtao Zhang, Guoli Yang, Zhanxing Zhu, Mengzhu Wang, Xiaoying Bai

Main category: cs.LG

TL;DR: Proposes a novel linear attention mechanism that overcomes quadratic complexity limitations by using entropy-based approximation with linear complexity, achieving competitive performance on spatio-temporal datasets.

DetailsMotivation: Attention mechanisms are widely used but constrained by quadratic computational complexity, which limits scalability for long sequences in time series modeling.

Method: Develops a linear attention mechanism based on theoretical insight that entropy implies structural resemblance in distributions. Uses efficient entropy approximation algorithm with linear complexity to implement entropy-based attention.

Result: Extensive experiments on four spatio-temporal datasets show competitive or superior forecasting performance with substantial reductions in memory usage and computational time.

Conclusion: The effectiveness of attention in spatio-temporal modeling may stem more from achieving moderate, well-balanced weight distributions rather than softmax non-linearity, enabling efficient linear attention implementations.

Abstract: Attention mechanisms have been extensively employed in various applications, including time series modeling, owing to their capacity to capture intricate dependencies; however, their utility is often constrained by quadratic computational complexity, which impedes scalability for long sequences. In this work, we propose a novel linear attention mechanism designed to overcome these limitations. Our approach is grounded in a theoretical demonstration that entropy, as a strictly concave function on the probability simplex, implies that distributions with aligned probability rankings and similar entropy values exhibit structural resemblance. Building on this insight, we develop an efficient approximation algorithm that computes the entropy of dot-product-derived distributions with only linear complexity, enabling the implementation of a linear attention mechanism based on entropy equality. Through rigorous analysis, we reveal that the effectiveness of attention in spatio-temporal time series modeling may not primarily stem from the non-linearity of softmax but rather from the attainment of a moderate and well-balanced weight distribution. Extensive experiments on four spatio-temporal datasets validate our method, demonstrating competitive or superior forecasting performance while achieving substantial reductions in both memory usage and computational time.
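For contrast with the quadratic form, a generic kernelized linear attention shows how replacing softmax with a feature-map product and reassociating the matrix multiplication gives linear cost in sequence length. This sketch uses an elementwise-exp feature map and does not implement the paper's entropy-equality mechanism.

```python
import math

def linear_attention(queries, keys, values,
                     feature=lambda x: [math.exp(xi) for xi in x]):
    """Kernelized linear attention: weights phi(q).phi(k) replace softmax,
    and computing sum_j phi(k_j) v_j once makes the per-query cost O(1)
    in sequence length instead of O(n)."""
    d = len(values[0])
    f = len(feature(keys[0]))
    kv = [[0.0] * d for _ in range(f)]     # sum_j phi(k_j) v_j
    ksum = [0.0] * f                       # sum_j phi(k_j)
    for k, v in zip(keys, values):
        fk = feature(k)
        for a in range(f):
            ksum[a] += fk[a]
            for b in range(d):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in queries:
        fq = feature(q)
        denom = sum(fa * sa for fa, sa in zip(fq, ksum))
        out.append([sum(fq[a] * kv[a][b] for a in range(f)) / denom
                    for b in range(d)])
    return out

q = [[0.5, -0.2]]
k = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
v = [[1.0], [2.0], [3.0]]
out = linear_attention(q, k, v)
```

Positive feature maps keep the weights a convex combination, so each output stays inside the range of the values, mirroring the "moderate, well-balanced weight distribution" the paper argues drives attention's effectiveness.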

[237] Cross-Modal Alignment via Variational Copula Modelling

Feng Wu, Tsai Hor Chan, Fuying Wang, Guosheng Yin, Lequan Yu

Main category: cs.LG

TL;DR: A copula-driven multimodal learning framework that models complex interactions between data modalities by aligning marginal distributions and learning joint distributions, enabling accurate representation generation for missing modalities.

DetailsMotivation: Real-world applications often involve multiple data modalities (e.g., EHRs, medical images, clinical notes), but existing methods oversimplify modality interactions using concatenation or Kronecker product, failing to capture complex higher-order interactions.

Method: Proposes a copula-based framework that interprets copula as a tool to align marginal distributions of modalities, assumes Gaussian mixture distributions for each modality and a copula model on the joint distribution to capture complex interactions.

Result: Extensive experiments on MIMIC datasets demonstrate superior performance over competitors, with the model effectively generating accurate representations for missing modalities.

Conclusion: The copula-driven approach successfully models complex multimodal interactions and joint distributions, providing an effective solution for multimodal learning with superior performance in handling missing modalities.

Abstract: Various data modalities are common in real-world applications (e.g., electronic health records, medical images and clinical notes in healthcare). It is essential to develop multimodal learning methods to aggregate various information from multiple modalities. The main challenge is how to appropriately align and fuse the representations of different modalities into a joint distribution. Existing methods mainly rely on concatenation or the Kronecker product, oversimplifying the interaction structure between modalities and indicating a need to model more complex interactions. Additionally, the joint distribution of latent representations with higher-order interactions is underexplored. Copula is a powerful statistical structure for modelling the interactions among variables, as it naturally bridges the joint distribution and marginal distributions of multiple variables. We propose a novel copula-driven multimodal learning framework, which focuses on learning the joint distribution of various modalities to capture the complex interactions among them. The key idea is to interpret the copula model as a tool to align the marginal distributions of the modalities efficiently. By assuming a Gaussian mixture distribution for each modality and a copula model on the joint distribution, our model can generate accurate representations for missing modalities. Extensive experiments on public MIMIC datasets demonstrate the superior performance of our model over other competitors. The code is available at https://github.com/HKU-MedAI/CMCM.
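The copula idea in miniature, as a hand-rolled bivariate Gaussian copula (the paper instead learns Gaussian-mixture marginals and fits the copula over latent modality representations): correlate two standard normals, then push each through the normal CDF so the marginals become uniform while the dependence structure survives.

```python
import random
import statistics
from statistics import NormalDist

def gaussian_copula_sample(rho, n, rng):
    """Draw (u, v) pairs whose marginals are uniform but whose joint
    dependence comes from a Gaussian copula with correlation rho."""
    nd = NormalDist()
    out = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        out.append((nd.cdf(z1), nd.cdf(z2)))     # CDF maps normals to uniforms
    return out

rng = random.Random(0)
pairs = gaussian_copula_sample(0.9, 20_000, rng)
u = [p[0] for p in pairs]
v = [p[1] for p in pairs]
mu_u, mu_v = statistics.fmean(u), statistics.fmean(v)
cov = statistics.fmean([(a - mu_u) * (b - mu_v) for a, b in zip(u, v)])
corr = cov / (statistics.pstdev(u) * statistics.pstdev(v))
```

This separation (marginals handled per modality, dependence handled by the copula) is exactly what lets the paper's model impute a missing modality from the joint distribution.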

[238] A Probabilistic U-Net Approach to Downscaling Climate Simulations

Maryam Alipourhajiagha, Pierre-Louis Lemaire, Youssef Diouane, Julie Carreau

Main category: cs.LG

TL;DR: Adapts probabilistic U-Net for climate downscaling, comparing training objectives for precipitation and temperature from 16x coarser resolution.

DetailsMotivation: Climate models produce coarse outputs but impact studies need finer scales, requiring statistical downscaling to bridge this gap.

Method: Uses probabilistic U-Net with deterministic backbone and variational latent space to capture uncertainty, evaluates four training objectives including afCRPS and WMSE-MS-SSIM.

Result: WMSE-MS-SSIM performs well for extremes under certain settings, while afCRPS better captures spatial variability across scales.

Conclusion: Different training objectives excel at different aspects of downscaling - WMSE-MS-SSIM for extremes and afCRPS for spatial variability.

Abstract: Climate models are limited by heavy computational costs, often producing outputs at coarse spatial resolutions, while many climate change impact studies require finer scales. Statistical downscaling bridges this gap, and we adapt the probabilistic U-Net for this task, combining a deterministic U-Net backbone with a variational latent space to capture aleatoric uncertainty. We evaluate four training objectives (afCRPS, plus WMSE-MS-SSIM under three settings) for downscaling precipitation and temperature from 16× coarser resolution. Our main finding is that WMSE-MS-SSIM performs well for extremes under certain settings, whereas afCRPS better captures spatial variability across scales.
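The CRPS family of objectives builds on the plain sample-based estimator below; the afCRPS variant used in the paper adds a finite-ensemble bias adjustment that this sketch omits.

```python
import random

def empirical_crps(ensemble, obs):
    """Sample-based CRPS: E|X - y| - 0.5 E|X - X'|.  Lower is better;
    it rewards ensembles that are both sharp and centered on the
    observation."""
    m = len(ensemble)
    term1 = sum(abs(x - obs) for x in ensemble) / m
    term2 = sum(abs(a - b) for a in ensemble for b in ensemble) / (m * m)
    return term1 - 0.5 * term2

random.seed(0)
obs = 1.0
sharp = [random.gauss(1.0, 0.2) for _ in range(100)]    # well-centered ensemble
biased = [random.gauss(3.0, 0.2) for _ in range(100)]   # displaced ensemble
crps_good = empirical_crps(sharp, obs)
crps_bad = empirical_crps(biased, obs)
```

Because the score is proper, training a probabilistic U-Net against it encourages honest predictive spread rather than mean-only fits.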

[239] A Quantized VAE-MLP Botnet Detection Model: A Systematic Evaluation of Quantization-Aware Training and Post-Training Quantization Strategies

Hassan Wasswa, Hussein Abbass, Timothy Lynar

Main category: cs.LG

TL;DR: This paper proposes a VAE-MLP framework for lightweight IoT botnet detection, comparing QAT and PTQ quantization methods to enable deployment on resource-constrained devices.

DetailsMotivation: Deep learning methods for IoT botnet detection achieve high accuracy but are computationally intensive, making them unsuitable for resource-constrained IoT devices, creating a need for lightweight detection models.

Method: VAE-MLP model framework with MLP classifier trained on 8-dimensional latent vectors from pretrained VAE encoder, followed by systematic evaluation of QAT and PTQ quantization strategies on N-BaIoT and CICIoT2022 datasets.

Result: PTQ showed only marginal accuracy reduction vs unquantized model, while QAT had more noticeable decline. PTQ achieved 6x speedup and 21x size reduction, QAT achieved 3x speedup and 24x compression.

Conclusion: Quantization is practical for device-level IoT botnet detection, with PTQ performing better in maintaining accuracy while providing significant speedup and compression benefits.

Abstract: In an effort to counter the increasing IoT botnet-based attacks, state-of-the-art deep learning methods have been proposed and have achieved impressive detection accuracy. However, their computational intensity restricts deployment on resource-constrained IoT devices, creating a critical need for lightweight detection models. A common solution to this challenge is model compression via quantization. This study proposes a VAE-MLP model framework where an MLP-based classifier is trained on 8-dimensional latent vectors derived from the high-dimensional training data using the encoder component of a pretrained variational autoencoder (VAE). Two widely used quantization strategies, Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ), are then systematically evaluated in terms of their impact on detection performance, storage efficiency, and inference latency on two benchmark IoT botnet datasets, N-BaIoT and CICIoT2022. The results revealed that, with respect to detection accuracy, the QAT strategy experienced a more noticeable decline, whereas PTQ incurred only a marginal reduction compared to the original unquantized model. Furthermore, PTQ yielded a 6x speedup and a 21x reduction in size, while QAT achieved a 3x speedup and 24x compression, demonstrating the practicality of quantization for device-level IoT botnet detection.
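The storage side of PTQ can be illustrated with a minimal symmetric int8 quantizer. This is a generic sketch of per-tensor post-training quantization, not the paper's pipeline; the float32-to-int8 step accounts for a 4x per-tensor saving, and the paper's full framework shrinks the model further.

```python
def quantize_int8(weights):
    """Symmetric post-training quantization of a weight list to int8.

    Returns (q, scale) with each q in [-127, 127]; dequantize as q * scale.
    """
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # guard all-zero tensors
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]
```

The round-trip error per weight is at most about half a quantization step, which is why detection accuracy degrades only marginally under PTQ for well-scaled tensors.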

[240] Incorporating Quality of Life in Climate Adaptation Planning via Reinforcement Learning

Miguel Costa, Arthur Vandervoort, Martin Drews, Karyn Morrissey, Francisco C. Pereira

Main category: cs.LG

TL;DR: Using Reinforcement Learning to identify optimal climate adaptation pathways for urban flooding that maximize long-term Quality of Life through an Integrated Assessment Model.

DetailsMotivation: Urban flooding is increasing due to climate change, negatively impacting Quality of Life, and policymakers need adaptation strategies that can handle climate uncertainty and urban complexity.

Method: Reinforcement Learning applied to an Integrated Assessment Model combining rainfall projections, flood modeling, transport accessibility, and quality of life metrics.

Result: Preliminary results show the approach can learn optimal adaptation measures and outperforms other realistic planning strategies.

Conclusion: The RL-based framework successfully identifies climate adaptation pathways that improve long-term urban Quality of Life, with publicly available implementation.

Abstract: Urban flooding is expected to increase in frequency and severity as a consequence of climate change, causing wide-ranging impacts that include a decrease in urban Quality of Life (QoL). Meanwhile, policymakers must devise adaptation strategies that can cope with the uncertain nature of climate change and the complex and dynamic nature of urban flooding. Reinforcement Learning (RL) holds significant promise in tackling such complex, dynamic, and uncertain problems. Because of this, we use RL to identify which climate adaptation pathways lead to a higher QoL in the long term. We do this using an Integrated Assessment Model (IAM) which combines a rainfall projection model, a flood model, a transport accessibility model, and a quality of life index. Our preliminary results suggest that this approach can be used to learn optimal adaptation measures and it outperforms other realistic and real-world planning strategies. Our framework is publicly available: https://github.com/MLSM-at-DTU/maat_qol_framework.

[241] A Feedback-Control Framework for Efficient Dataset Collection from In-Vehicle Data Streams

Philipp Reis, Philipp Rigoll, Christian Steinhauser, Jacob Langner, Eric Sax

Main category: cs.LG

TL;DR: FCDC introduces a closed-loop control approach to data collection that uses online probabilistic modeling and feedback mechanisms to dynamically regulate sample retention, reducing redundancy and improving dataset quality.

DetailsMotivation: Current data collection methods are open-loop and accumulate redundant samples without feedback, leading to inefficient storage, costly labeling, and limited generalization. There's a need for more intelligent data collection that actively manages dataset quality.

Method: Formulates data collection as a closed-loop control problem using online probabilistic models to approximate data distribution state. Uses feedback signals like likelihood and Mahalanobis distance to adaptively regulate sample retention, balancing exploration and exploitation.

Result: Experiments show FCDC produces 25.9% more balanced datasets while reducing data storage by 39.8%. Demonstrates controllability on synthetic datasets and effectiveness on real data streams.

Conclusion: Data collection can be actively controlled, transforming it from a passive pipeline stage into a self-regulating, feedback-driven process that addresses core challenges in data-centric AI.

Abstract: Modern AI systems are increasingly constrained not by model capacity but by the quality and diversity of their data. Despite growing emphasis on data-centric AI, most datasets are still gathered in an open-loop manner that accumulates redundant samples without feedback from the current coverage. This results in inefficient storage, costly labeling, and limited generalization. To address this, this paper introduces FCDC, a paradigm that formulates data collection as a closed-loop control problem. FCDC continuously approximates the state of the collected data distribution using an online probabilistic model and adaptively regulates sample retention based on feedback signals such as likelihood and Mahalanobis distance. Through this feedback mechanism, the system dynamically balances exploration and exploitation, maintains dataset diversity, and prevents redundancy from accumulating over time. Besides showcasing the controllability of FCDC on a synthetic dataset, experiments on a real data stream show that FCDC produces more balanced datasets by 25.9% while reducing data storage by 39.8%. These results demonstrate that data collection itself can be actively controlled, transforming collection from a passive pipeline stage into a self-regulating, feedback-driven process at the core of data-centric AI.
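The retention loop can be sketched with a one-dimensional stand-in: an online Gaussian estimate of the retained data, with the 1-D Mahalanobis distance as the feedback signal. This is an illustrative reduction of the idea, not the paper's FCDC implementation; the threshold and bootstrap rule are assumptions.

```python
class FeedbackCollector:
    """Closed-loop sample retention sketch.

    Keeps a new sample only if its distance from the current data model
    suggests it adds coverage, so redundant samples near the mean are
    filtered out, closing the loop between dataset state and collection.
    """
    def __init__(self, threshold=1.0):
        self.threshold = threshold
        self.n, self.mean, self.m2 = 0, 0.0, 0.0  # Welford accumulators

    def _update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def offer(self, x):
        if self.n < 2:              # bootstrap: always keep the first samples
            self._update(x)
            return True
        std = (self.m2 / (self.n - 1)) ** 0.5 or 1e-12
        if abs(x - self.mean) / std >= self.threshold:  # novel -> retain
            self._update(x)
            return True
        return False                # redundant -> discard
```

Near-duplicate offers are rejected once the model has seen them, while an outlier is retained and immediately shifts the model, the exploration/exploitation balance the abstract describes.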

[242] A unified physics-informed generative operator framework for general inverse problems

Gang Bao, Yaohua Zang

Main category: cs.LG

TL;DR: IGNO is a novel generative neural operator framework that solves inverse PDE problems from both point measurements and operator-valued data without labeled training pairs, achieving accurate and stable inversion even under severe noise.

DetailsMotivation: Existing deep learning approaches for inverse PDE problems require extensive labeled datasets or are limited to specific measurement types, failing in challenging regimes with sparse/noisy data or high-dimensional/discontinuous coefficients.

Method: IGNO encodes high-dimensional coefficient fields into low-dimensional latent space, uses neural operator decoders to reconstruct coefficients and solutions, trains via physics constraints through PDE residuals, and performs inversion via gradient-based optimization in latent space accelerated by normalizing flow.

Result: IGNO consistently achieves accurate, stable, and scalable inversion across diverse challenging problems including discontinuous coefficient recovery and EIT, outperforming state-of-the-art methods under varying noise levels and showing strong generalization to out-of-distribution targets.

Conclusion: IGNO establishes a unified and powerful framework for tackling challenging inverse problems across computational science domains without requiring labeled training data.

Abstract: Solving inverse problems governed by partial differential equations (PDEs) is central to science and engineering, yet remains challenging when measurements are sparse, noisy, or when the underlying coefficients are high-dimensional or discontinuous. Existing deep learning approaches either require extensive labeled datasets or are limited to specific measurement types, often leading to failure in such regimes and restricting their practical applicability. Here, a novel generative neural operator framework, IGNO, is introduced to overcome these limitations. IGNO unifies the solution of inverse problems from both point measurements and operator-valued data without labeled training pairs. This framework encodes high-dimensional, potentially discontinuous coefficient fields into a low-dimensional latent space, which drives neural operator decoders to reconstruct both coefficients and PDE solutions. Training relies purely on physics constraints through PDE residuals, while inversion proceeds via efficient gradient-based optimization in latent space, accelerated by an a priori normalizing flow model. Across a diverse set of challenging inverse problems, including recovery of discontinuous coefficients from solution-based measurements and the EIT problem with operator-based measurements, IGNO consistently achieves accurate, stable, and scalable inversion even under severe noise. It consistently outperforms the state-of-the-art method under varying noise levels and demonstrates strong generalization to out-of-distribution targets. These results establish IGNO as a unified and powerful framework for tackling challenging inverse problems across computational science domains.
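The inversion step, gradient descent on a latent code through a differentiable forward map, can be sketched on a toy quadratic problem. The finite-difference gradient is an assumption made to keep the sketch dependency-free; the paper differentiates through neural operator decoders and accelerates the search with a normalizing flow.

```python
def latent_inversion(forward, y, z0, lr=0.1, steps=500, eps=1e-5):
    """Gradient-based inversion in a low-dimensional latent space.

    Searches a latent z so that forward(z) matches the measurement y by
    minimizing the squared residual with central finite differences.
    """
    z = list(z0)
    for _ in range(steps):
        grad = []
        for i in range(len(z)):
            zp, zm = list(z), list(z)
            zp[i] += eps
            zm[i] -= eps
            rp = sum((a - b) ** 2 for a, b in zip(forward(zp), y))
            rm = sum((a - b) ** 2 for a, b in zip(forward(zm), y))
            grad.append((rp - rm) / (2 * eps))  # central difference
        z = [zi - lr * g for zi, g in zip(z, grad)]
    return z
```

With `forward = lambda z: [2 * z[0], z[0] + z[1]]` and measurement `[2.0, 3.0]`, the iteration recovers the latent code `[1, 2]`; in IGNO the same loop runs over a learned decoder and a PDE-residual loss.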

[243] Climate Adaptation with Reinforcement Learning: Economic vs. Quality of Life Adaptation Pathways

Miguel Costa, Arthur Vandervoort, Martin Drews, Karyn Morrissey, Francisco C. Pereira

Main category: cs.LG

TL;DR: RL-based framework for flood adaptation policy design that explicitly models normative priorities (economic vs. wellbeing) under climate uncertainty, showing QoL-focused policies lead to more adaptation spending and equitable distribution.

DetailsMotivation: Climate change increases flood frequency/severity, requiring adaptation policies that manage long-term uncertainty while making explicit normative choices about priorities.

Method: Used Reinforcement Learning with Integrated Assessment Model linking rainfall/flood models to compute QoL, transportation, and infrastructure impacts from flooding.

Result: Models prioritizing QoL over economic impacts resulted in more adaptation spending and more even distribution across study areas.

Conclusion: Normative assumptions significantly alter adaptation policy outcomes, with QoL focus leading to more equitable spending distribution.

Abstract: Climate change will cause an increase in the frequency and severity of flood events, prompting the need for cohesive adaptation policymaking. Designing effective adaptation policies, however, depends on managing the uncertainty of long-term climate impacts. Meanwhile, such policies can feature important normative choices that are not always made explicit. We propose that Reinforcement Learning (RL) can be a useful tool to both identify adaptation pathways under uncertain conditions while it also allows for the explicit modelling (and consequent comparison) of different adaptation priorities (e.g. economic vs. wellbeing). We use an Integrated Assessment Model (IAM) to link together a rainfall and flood model, and compute the impacts of flooding in terms of quality of life (QoL), transportation, and infrastructure damage. Our results show that models prioritising QoL over economic impacts result in more adaptation spending as well as a more even distribution of spending over the study area, highlighting the extent to which such normative assumptions can alter adaptation policy. Our framework is publicly available: https://github.com/MLSM-at-DTU/maat_qol_framework.

[244] GMoPE:A Prompt-Expert Mixture Framework for Graph Foundation Models

Zhibin Wang, Zhixing Zhang, Shuqi Wang, Xuanting Xie, Zhao Kang

Main category: cs.LG

TL;DR: GMoPE is a novel framework that combines Mixture-of-Experts architecture with prompt-based learning for graphs, enabling efficient cross-domain generalization while reducing adaptation costs.

DetailsMotivation: GNNs struggle with generalization across diverse domains due to negative transfer, scalability issues, and high adaptation costs. Current approaches lack efficient transfer learning capabilities for graph data.

Method: Integrates MoE architecture with prompt-based learning using expert-specific prompt vectors and structure-aware routing. Includes soft orthogonality constraint to prevent expert collapse and uses prompt-only fine-tuning to reduce complexity.

Result: Consistently outperforms state-of-the-art baselines, achieves performance comparable to full parameter fine-tuning with significantly reduced adaptation overhead across various pretraining strategies and downstream tasks.

Conclusion: GMoPE provides a principled and scalable framework for advancing generalizable and efficient graph foundation models, addressing key limitations in current graph transfer learning approaches.

Abstract: Graph Neural Networks (GNNs) have demonstrated impressive performance on task-specific benchmarks, yet their ability to generalize across diverse domains and tasks remains limited. Existing approaches often struggle with negative transfer, scalability issues, and high adaptation costs. To address these challenges, we propose GMoPE (Graph Mixture of Prompt-Experts), a novel framework that seamlessly integrates the Mixture-of-Experts (MoE) architecture with prompt-based learning for graphs. GMoPE leverages expert-specific prompt vectors and structure-aware MoE routing to enable each expert to specialize in distinct subdomains and dynamically contribute to predictions. To promote diversity and prevent expert collapse, we introduce a soft orthogonality constraint across prompt vectors, encouraging expert specialization and facilitating a more balanced expert utilization. Additionally, we adopt a prompt-only fine-tuning strategy that significantly reduces spatiotemporal complexity during transfer. We validate GMoPE through extensive experiments under various pretraining strategies and multiple downstream tasks. Results show that GMoPE consistently outperforms state-of-the-art baselines and achieves performance comparable to full parameter fine-tuning, while requiring only a fraction of the adaptation overhead. Our work provides a principled and scalable framework for advancing generalizable and efficient graph foundation models.
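The two mechanisms named in the abstract, softmax routing over expert prompts and a soft orthogonality constraint, can be sketched generically. This mirrors standard MoE-style routing, not GMoPE's structure-aware router; the dot-product score is an assumption.

```python
import math

def route(query, prompts):
    """Softmax routing of a query embedding over expert prompt vectors."""
    scores = [sum(q * p for q, p in zip(query, pr)) for pr in prompts]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(exps)
    return [e / z for e in exps]

def soft_orthogonality_penalty(prompts):
    """Sum of squared pairwise cosine similarities between prompt vectors.

    Driving this toward 0 pushes the prompts toward orthogonality, which
    encourages experts to specialize and discourages expert collapse.
    """
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)
    k = len(prompts)
    return sum(cos(prompts[i], prompts[j]) ** 2
               for i in range(k) for j in range(i + 1, k))
```

In prompt-only fine-tuning, only the prompt vectors (and router) would receive gradients, which is where the reduced adaptation overhead comes from.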

[245] Decoupled Entropy Minimization

Jing Ma, Hanlin Li, Xiang Xiang

Main category: cs.LG

TL;DR: The paper proposes AdaDEM, an adaptive decoupled entropy minimization method that addresses limitations of classical entropy minimization by separating it into cluster aggregation and gradient mitigation components, achieving superior performance in imperfectly supervised learning.

DetailsMotivation: To study the internal mechanism of entropy minimization and address its limitations in coupled formulation, including reward collapse and easy-class bias that impede learning from high-certainty samples and cause misalignment between output and label distributions.

Method: Reformulate classical entropy minimization into two decoupled components: cluster aggregation driving factor (CADF) and gradient mitigation calibrator (GMC), then propose AdaDEM with normalized reward from CADF and marginal entropy calibrator (MEC) to replace GMC.

Result: AdaDEM outperforms DEM* (an upper-bound variant of classical EM) and achieves superior performance across various imperfectly supervised learning tasks in noisy and dynamic environments.

Conclusion: The proposed AdaDEM effectively addresses the limitations of classical entropy minimization through decoupled formulation and adaptive components, demonstrating improved performance in challenging learning scenarios.

Abstract: Entropy Minimization (EM) is beneficial to reducing class overlap, bridging domain gap, and restricting uncertainty for various tasks in machine learning, yet its potential is limited. To study the internal mechanism of EM, we reformulate and decouple the classical EM into two parts with opposite effects: cluster aggregation driving factor (CADF) rewards dominant classes and prompts a peaked output distribution, while gradient mitigation calibrator (GMC) penalizes high-confidence classes based on predicted probabilities. Furthermore, we reveal the limitations of classical EM caused by its coupled formulation: 1) reward collapse impedes the contribution of high-certainty samples in the learning process, and 2) easy-class bias induces misalignment between output distribution and label distribution. To address these issues, we propose Adaptive Decoupled Entropy Minimization (AdaDEM), which normalizes the reward brought from CADF and employs a marginal entropy calibrator (MEC) to replace GMC. AdaDEM outperforms DEM*, an upper-bound variant of classical EM, and achieves superior performance across various imperfectly supervised learning tasks in noisy and dynamic environments.
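As background for the decoupling, the gradient of softmax entropy with respect to the logits is $-p_j(\log p_j + H)$, which couples a per-class confidence term with the global entropy; the sketch below verifies this analytic form numerically. The CADF/GMC split itself is the paper's contribution and is not reproduced here.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def entropy(z):
    """Shannon entropy of the softmax distribution over logits z."""
    p = softmax(z)
    return -sum(pi * math.log(pi) for pi in p)

def entropy_grad(z):
    """Analytic gradient of softmax entropy w.r.t. the logits:
    dH/dz_j = -p_j * (log p_j + H).

    Minimizing H therefore pushes probability onto the already-dominant
    class, the coupled force whose components the paper disentangles.
    """
    p = softmax(z)
    h = -sum(pi * math.log(pi) for pi in p)
    return [-pj * (math.log(pj) + h) for pj in p]
```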

[246] Diffusion Language Models are Super Data Learners

Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, Michael Qizhe Shieh

Main category: cs.LG

TL;DR: Diffusion language models (DLMs) outperform autoregressive models under data constraints when trained for more epochs, with gains from any-order modeling, iterative bidirectional denoising, and built-in Monte Carlo augmentation.

DetailsMotivation: To investigate how diffusion language models compare to autoregressive models under controlled pre-training conditions, especially when unique data is limited.

Method: Strictly controlled pre-training experiments comparing DLMs and AR models across different data quantities, model sizes, and architectures, analyzing the factors contributing to DLM performance.

Result: DLMs consistently surpass AR models when unique data is limited by training for more epochs. A 1.7B DLM trained with ~1.5T-token compute budget on 10B unique Python tokens outperforms an AR model. A 1B-parameter DLM achieves >56% accuracy on HellaSwag and >33% on MMLU using only 1B tokens.

Conclusion: DLMs show superior performance over AR models under data constraints, with the crossover point affected by data quantity/quality and model size. Rising validation cross-entropy doesn’t necessarily indicate degraded downstream performance in this regime.

Abstract: Under strictly controlled pre-training settings, we observe a Crossover: when unique data is limited, diffusion language models (DLMs) consistently surpass autoregressive (AR) models by training for more epochs. The crossover shifts later with more or higher-quality data, earlier with larger models, and persists across dense and sparse architectures. We attribute the gains to three compounding factors: (1) any-order modeling, (2) super-dense compute from iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation; input or parameter noise improves AR under data constraint but cannot close the gap. At scale, a 1.7B DLM trained with a ~1.5T-token compute budget on 10B unique Python tokens overtakes an AR coder trained with strictly matched settings. In addition, a 1B-parameter DLM achieves > 56% accuracy on HellaSwag and > 33% on MMLU using only 1B tokens, without any special tricks, just by repeating standard pre-training data. We also show that rising validation cross-entropy does not imply degraded downstream performance in this regime.
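The "built-in Monte Carlo augmentation" point can be made concrete: random masking yields many distinct training views of one sequence, whereas a fixed left-to-right factorization yields exactly one. A minimal sketch (an illustrative masking scheme, not the paper's training code):

```python
import random

def masked_views(tokens, mask_rate, n_draws, seed=0):
    """Collect the distinct masked views of one sequence under random masking.

    Each draw re-masks the sequence independently, so repeating epochs over
    the same data still presents fresh denoising targets, the augmentation
    effect credited for DLM gains under data constraint.
    """
    rng = random.Random(seed)
    views = set()
    for _ in range(n_draws):
        view = tuple("<M>" if rng.random() < mask_rate else t for t in tokens)
        views.add(view)
    return views
```

With `mask_rate=0.0` the model sees a single view per epoch (the AR-like situation); with nonzero masking the number of distinct views grows toward $2^{|tokens|}$, which is why extra epochs keep adding signal.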

[247] Multi-Objective Adaptive Rate Limiting in Microservices Using Deep Reinforcement Learning

Ning Lyu, Yuxi Wang, Ziyu Cheng, Qingyuan Zhang, Feng Chen

Main category: cs.LG

TL;DR: Proposes an adaptive API rate limiting strategy using deep reinforcement learning that dynamically balances throughput and latency, achieving significant improvements over traditional methods.

DetailsMotivation: Traditional rate limiting algorithms struggle with dynamic traffic patterns and varying system loads in cloud computing and microservice architectures, necessitating more adaptive approaches.

Method: Hybrid architecture combining Deep Q-Network (DQN) and Asynchronous Advantage Actor-Critic (A3C) algorithms, modeling rate limiting as a Markov Decision Process that continuously learns optimal policies through environmental interaction.

Result: 23.7% throughput improvement and 31.4% P99 latency reduction compared to traditional fixed-threshold strategies; 82% reduction in service degradation incidents and 68% decrease in manual interventions in production deployment handling 500 million daily requests.

Conclusion: The deep reinforcement learning-based adaptive rate limiting strategy effectively addresses limitations of traditional methods, demonstrating significant performance improvements and practical effectiveness in production environments.

Abstract: As cloud computing and microservice architectures become increasingly prevalent, API rate limiting has emerged as a critical mechanism for ensuring system stability and service quality. Traditional rate limiting algorithms, such as token bucket and sliding window, while widely adopted, struggle to adapt to dynamic traffic patterns and varying system loads. This paper proposes an adaptive rate limiting strategy based on deep reinforcement learning that dynamically balances system throughput and service latency. We design a hybrid architecture combining Deep Q-Network (DQN) and Asynchronous Advantage Actor-Critic (A3C) algorithms, modeling the rate limiting decision process as a Markov Decision Process. The system continuously monitors microservice states and learns optimal rate limiting policies through environmental interaction. Extensive experiments conducted in a Kubernetes cluster environment demonstrate that our approach achieves 23.7% throughput improvement and 31.4% P99 latency reduction compared to traditional fixed-threshold strategies under high-load scenarios. Results from a 90-day production deployment handling 500 million daily requests validate the practical effectiveness of the proposed method, with 82% reduction in service degradation incidents and 68% decrease in manual interventions.
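The MDP framing can be sketched in miniature with tabular Q-learning (the paper uses DQN and A3C; the states, actions, capacities, and reward shape below are all illustrative assumptions, not the production system):

```python
import random

def train_rate_limiter(episodes=3000, alpha=0.1, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning sketch of rate limiting as an MDP.

    States are load levels {low, high}; actions are rate limits in req/s.
    The reward trades admitted throughput against an overload (latency)
    penalty: under high load, admitting 200 req/s overloads the backend.
    """
    rng = random.Random(seed)
    states, actions = ["low", "high"], [50, 100, 200]
    q = {(s, a): 0.0 for s in states for a in actions}

    def reward(s, a):
        capacity = 250 if s == "low" else 120
        return a - 5 * max(0, a - capacity)  # throughput minus overload penalty

    s = "low"
    for _ in range(episodes):
        # epsilon-greedy action selection
        a = rng.choice(actions) if rng.random() < eps else max(actions, key=lambda x: q[(s, x)])
        r = reward(s, a)
        s2 = rng.choice(states)  # exogenous load fluctuation
        q[(s, a)] += alpha * (r + gamma * max(q[(s2, x)] for x in actions) - q[(s, a)])
        s = s2
    return q
```

The learned policy admits the full 200 req/s under low load but throttles to 100 req/s under high load, the adaptive behavior fixed-threshold limiters lack.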

[248] A Probabilistic Approach to Pose Synchronization for Multi-Reference Alignment with Applications to MIMO Wireless Communication Systems

Rob Romijnders, Gabriele Cesa, Christos Louizos, Kumar Pratik, Arash Behboodi

Main category: cs.LG

TL;DR: A new probabilistic algorithm for multi-reference alignment (MRA) that uses relative poses as nuisance variables to marginalize out global symmetries, enabling improved convergence and computational efficiency through decentralization and cycle consistency.

DetailsMotivation: Multi-reference alignment is crucial in various applications like molecular imaging, cryo-EM, computer vision, and wireless communications, where aligning and reconstructing signals from multiple misaligned observations is essential for system performance.

Method: Probabilistic approach to model MRA using relative poses as nuisance variables that are marginalized out, removing global symmetries. The decentralized approach avoids cubic scaling of centralized methods through cycle consistency.

Result: Both proposed algorithms achieve lower reconstruction error across experimental settings compared to existing methods.

Conclusion: The probabilistic approach with marginalized relative poses provides more direct solutions, improved convergence, and significant computational savings through decentralization and cycle consistency in multi-reference alignment problems.

Abstract: From molecular imaging to wireless communications, the ability to align and reconstruct signals from multiple misaligned observations is crucial for system performance. We study the problem of multi-reference alignment (MRA), which arises in many real-world problems, such as cryo-EM, computer vision, and, in particular, wireless communication systems. Using a probabilistic approach to model MRA, we find a new algorithm that uses relative poses as nuisance variables to marginalize out – thereby removing the global symmetries of the problem and allowing for more direct solutions and improved convergence. The decentralization of this approach enables significant computational savings by avoiding the cubic scaling of centralized methods through cycle consistency. Both proposed algorithms achieve lower reconstruction error across experimental settings.
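Cycle consistency, the constraint the decentralized approach exploits, says relative poses around any closed loop must compose to the identity. A planar-rotation sketch (1-D rotation angles for brevity; the paper's pose groups are more general):

```python
import math

def compose(a, b):
    """Compose two planar rotations given as angles, wrapped to (-pi, pi]."""
    return (a + b + math.pi) % (2 * math.pi) - math.pi

def cycle_defect(theta_ij, theta_jk, theta_ki):
    """Cycle-consistency residual for relative planar rotations.

    Going i -> j -> k -> i should compose to the identity rotation, so the
    residual measures how inconsistent noisy relative-pose estimates are,
    without ever needing the global (absolute) poses.
    """
    return abs(compose(compose(theta_ij, theta_jk), theta_ki))
```

Because the residual only involves relative poses, it can be evaluated locally on each cycle, which is what lets a decentralized method avoid the cubic cost of assembling a global alignment.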

[249] Graph Neural AI with Temporal Dynamics for Comprehensive Anomaly Detection in Microservices

Qingyuan Zhang, Ning Lyu, Le Liu, Yuxi Wang, Ziyu Cheng, Cancan Hua

Main category: cs.LG

TL;DR: Proposes a unified framework combining graph neural networks and temporal modeling for anomaly detection and root cause tracing in microservice architectures.

DetailsMotivation: Addresses the challenge of anomaly detection and root cause tracing in complex microservice architectures where traditional methods struggle with dynamic topologies and temporal dependencies.

Method: Models microservice call chains as directed graphs, applies graph convolution for structural dependencies, uses gated recurrent units for temporal modeling, and defines node/path-level anomaly scoring functions.

Result: Outperforms baseline methods in AUC, ACC, Recall, and F1-Score metrics, maintaining high accuracy and stability under dynamic topologies and complex environments.

Conclusion: Provides a new technical path for microservice anomaly detection and lays methodological foundation for intelligent operations in distributed systems.

Abstract: This study addresses the problem of anomaly detection and root cause tracing in microservice architectures and proposes a unified framework that combines graph neural networks with temporal modeling. The microservice call chain is abstracted as a directed graph, where multidimensional features of nodes and edges are used to construct a service topology representation, and graph convolution is applied to aggregate features across nodes and model dependencies, capturing complex structural relationships among services. On this basis, gated recurrent units are introduced to model the temporal evolution of call chains, and multi-layer stacking and concatenation operations are used to jointly obtain structural and temporal representations, improving the ability to identify anomaly patterns. Furthermore, anomaly scoring functions at both the node and path levels are defined to achieve unified modeling from local anomaly detection to global call chain tracing, which enables the identification of abnormal service nodes and the reconstruction of potential anomaly propagation paths. Sensitivity experiments are then designed from multiple dimensions, including hyperparameters, environmental disturbances, and data distribution, to evaluate the framework, and results show that it outperforms baseline methods in key metrics such as AUC, ACC, Recall, and F1-Score, maintaining high accuracy and stability under dynamic topologies and complex environments. This research not only provides a new technical path for anomaly detection in microservices but also lays a methodological foundation for intelligent operations in distributed systems.
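The structure-then-time factorization (graph convolution per snapshot, then recurrent aggregation) can be sketched with scalar node features. The cell below is a standard GRU, and all weights are illustrative scalars rather than the paper's learned matrices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def graph_conv(adj, h, w):
    """Mean aggregation over each node's neighborhood (self-loop included),
    scaled by a shared weight w: a minimal scalar-feature graph convolution."""
    out = []
    for i, row in enumerate(adj):
        nbrs = [j for j, e in enumerate(row) if e] + [i]
        out.append(w * sum(h[j] for j in nbrs) / len(nbrs))
    return out

def gru_step(x, h, p):
    """Scalar GRU cell: the gates decide how much of the aggregated
    structural signal x overwrites the temporal state h."""
    z = sigmoid(p["wz"] * x + p["uz"] * h)          # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h)          # reset gate
    cand = math.tanh(p["wh"] * x + p["uh"] * (r * h))
    return (1 - z) * h + z * cand

def encode_sequence(adj, snapshots, w, p):
    """Graph conv per time step, then fold the results through the GRU,
    mirroring the structure-then-time factorization described above."""
    h = [0.0] * len(adj)
    for feat in snapshots:
        agg = graph_conv(adj, feat, w)
        h = [gru_step(x, hi, p) for x, hi in zip(agg, h)]
    return h
```

A node-level anomaly score could then be any function of the final per-node states; the path-level scoring described in the abstract would aggregate these along call-chain edges.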

[250] Extending Fair Null-Space Projections for Continuous Attributes to Kernel Methods

Felix Störck, Fabian Hinder, Barbara Hammer

Main category: cs.LG

TL;DR: The paper proposes a novel kernel-based method for continuous fairness in regression tasks, extending iterative null-space projection to kernel methods to handle continuous protected attributes.

DetailsMotivation: As machine learning systems become more integrated into social life, fairness becomes increasingly important. Current fairness literature mainly focuses on discrete attributes, with scarce research on continuous attributes in regression settings (continuous fairness).

Method: The authors generalize iterative null-space projection to kernel methods, creating a model and fairness-score agnostic method for kernel embeddings that can handle continuous protected attributes. They specifically apply this with Support Vector Regression (SVR).

Result: The proposed approach demonstrates competitive or improved performance across multiple datasets compared to other contemporary methods for continuous fairness.

Conclusion: The kernel-based extension of iterative null-space projection significantly broadens the applicability of continuous fairness methods and shows promising results with SVR across various datasets.

Abstract: With the ongoing integration of machine learning systems into the everyday social life of millions, fairness becomes an ever-increasing priority in their development. Fairness notions commonly rely on protected attributes to assess potential biases. Here, the majority of the literature focuses on discrete setups regarding both target and protected attributes. The literature on continuous attributes, especially in conjunction with regression (we refer to this as continuous fairness), is scarce. A common strategy is iterative null-space projection, which so far has only been explored for linear models or embeddings such as those obtained by a non-linear encoder. We improve on this by generalizing to kernel methods, significantly extending the scope. This yields a model- and fairness-score-agnostic method for kernel embeddings applicable to continuous protected attributes. We demonstrate that our novel approach, in conjunction with Support Vector Regression (SVR), provides competitive or improved performance across multiple datasets in comparison to other contemporary methods.
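The linear base case of iterative null-space projection, which the paper lifts to kernel embeddings, can be sketched in two dimensions: fit the direction that linearly predicts the protected attribute, then project it out. Centered features are assumed and only one iteration is shown; the full method iterates and works in kernel space.

```python
def project_out(xs, s):
    """One step of linear null-space projection for a continuous protected
    attribute s, over centered 2-D feature vectors xs.

    Fits the least-squares direction w that predicts the centered attribute
    from the features, then removes each feature vector's component along w,
    so a linear probe along w can no longer recover s.
    """
    n = len(xs)
    sc = [v - sum(s) / n for v in s]  # center the protected attribute
    # closed-form normal equations for w in the 2-D case
    a = sum(x[0] * x[0] for x in xs)
    b = sum(x[0] * x[1] for x in xs)
    c = sum(x[1] * x[1] for x in xs)
    g0 = sum(x[0] * t for x, t in zip(xs, sc))
    g1 = sum(x[1] * t for x, t in zip(xs, sc))
    det = a * c - b * b
    w = ((c * g0 - b * g1) / det, (a * g1 - b * g0) / det)
    nw = w[0] ** 2 + w[1] ** 2
    # remove the component of each x along w (null-space projection)
    return [(x[0] - (x[0] * w[0] + x[1] * w[1]) * w[0] / nw,
             x[1] - (x[0] * w[0] + x[1] * w[1]) * w[1] / nw) for x in xs]
```

After the projection every feature vector is orthogonal to w, which is the property the kernel generalization preserves for kernel embeddings.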

[251] SORTeD Rashomon Sets of Sparse Decision Trees: Anytime Enumeration

Elif Arslan, Jacobus G. M. van der Linden, Serge Hoogendoorn, Marco Rinaldi, Emir Demirović

Main category: cs.LG

TL;DR: SORTD is a scalable framework that efficiently enumerates trees in the Rashomon set (trees with similar performance but varying structures) in order of objective value, enabling practical exploration of alternative interpretable models.

DetailsMotivation: Rashomon sets of decision trees provide interpretable alternatives with similar accuracy but different structures, enhancing variable importance analysis, explanations, and stakeholder preferences without hard-coding criteria. However, enumerating these sets is computationally challenging.

Method: SORTD framework improves scalability by enumerating trees in the Rashomon set in order of objective value, supporting any separable and totally ordered objective, and enabling post-evaluation with other separable objectives.

Result: SORTD reduces runtime by up to two orders of magnitude compared to state-of-the-art methods and makes Rashomon set exploration practical for real-world applications.

Conclusion: SORTD enables efficient enumeration of Rashomon sets, making it feasible to explore alternative interpretable decision trees that satisfy various stakeholder preferences without compromising performance.

Abstract: Sparse decision tree learning provides accurate and interpretable predictive models that are ideal for high-stakes applications by finding the single most accurate tree within a (soft) size limit. Rather than relying on a single “best” tree, Rashomon sets (trees with similar performance but varying structures) can be used to enhance variable importance analysis, enrich explanations, and enable users to choose simpler trees or those that satisfy stakeholder preferences (e.g., fairness) without hard-coding such criteria into the objective function. However, because finding the optimal tree is NP-hard, enumerating the Rashomon set is inherently challenging. Therefore, we introduce SORTD, a novel framework that improves scalability and enumerates trees in the Rashomon set in order of the objective value, thus offering anytime behavior. Our experiments show that SORTD reduces runtime by up to two orders of magnitude compared with the state of the art. Moreover, SORTD can compute Rashomon sets for any separable and totally ordered objective and supports post-evaluating the set using other separable (and partially ordered) objectives. Together, these advances make exploring Rashomon sets more practical in real-world applications.
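
The anytime, in-order enumeration pattern behind SORTD can be illustrated on a toy search space (this is not SORTD's branch-and-bound over trees; it shows the same idea of yielding solutions in nondecreasing objective order and stopping at a Rashomon-style bound, assuming non-negative costs):

```python
import heapq

def subsets_in_order(costs, bound):
    """Anytime enumeration sketch: yield (total_cost, subset) pairs in
    nondecreasing order of cost, stopping once `bound` is exceeded."""
    costs = sorted(costs)
    yield 0, ()
    if not costs or costs[0] > bound:
        return
    heap = [(costs[0], (0,))]
    while heap:
        total, idx = heapq.heappop(heap)
        if total > bound:      # successors only grow: everything left is worse
            break
        yield total, tuple(costs[i] for i in idx)
        last = idx[-1]
        if last + 1 < len(costs):
            # extend the subset with the next item, or swap it in
            heapq.heappush(heap, (total + costs[last + 1], idx + (last + 1,)))
            heapq.heappush(heap, (total - costs[last] + costs[last + 1],
                                  idx[:-1] + (last + 1,)))

sums = [s for s, _ in subsets_in_order([3, 1, 2], bound=4)]
```

Because each popped candidate's successors cost at least as much, the stream is ordered and can be cut off at any time, which is exactly the anytime property the paper emphasizes.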

[252] A Modular, Data-Free Pipeline for Multi-Label Intention Recognition in Transportation Agentic AI Applications

Xiaocai Zhang, Hur Lim, Ke Wang, Zhe Xiao, Jing Wang, Kelvin Lee, Xiuju Fu, Zheng Qin

Main category: cs.LG

TL;DR: DMTC is a modular, data-free pipeline for multi-label intention recognition in transportation AI, using LLM-generated synthetic queries, Sentence-T5 embeddings, and a novel online focal-contrastive loss to achieve state-of-the-art performance without manual labeling.

DetailsMotivation: Traditional intent recognition systems require large annotated datasets and struggle with fine-grained multi-label discrimination, while manual data collection is costly. The goal is to eliminate dependency on expensive labeled data while improving multi-label intention understanding accuracy.

Method: Three-step pipeline: 1) Use prompt engineering with LLMs to generate diverse synthetic queries across transport scenarios, 2) Encode queries with Sentence-T5 for semantic embeddings, 3) Train lightweight classifier with novel online focal-contrastive (OFC) loss that emphasizes hard samples and maximizes inter-class separability.

Result: DMTC achieves Hamming loss of 5.35% and AUC of 95.92%, outperforming state-of-the-art multi-label classifiers and LLM-based baselines. Sentence-T5 embeddings improve subset accuracy by at least 3.29%, and OFC loss provides additional 0.98% gain over standard contrastive objectives.

Conclusion: The system successfully routes user queries to task-specific modules (ETA information, traffic risk evaluation, etc.) in maritime transportation, laying groundwork for fully autonomous, intention-aware agents without costly manual labeling requirements.

Abstract: In this study, a modular, data-free pipeline for multi-label intention recognition is proposed for agentic AI applications in transportation. Unlike traditional intent recognition systems that depend on large, annotated corpora and often struggle with fine-grained, multi-label discrimination, our approach eliminates the need for costly data collection while enhancing the accuracy of multi-label intention understanding. Specifically, the overall pipeline, named DMTC, consists of three steps: 1) using prompt engineering to guide large language models (LLMs) to generate diverse synthetic queries in different transport scenarios; 2) encoding each textual query with a Sentence-T5 model to obtain compact semantic embeddings; 3) training a lightweight classifier using a novel online focal-contrastive (OFC) loss that emphasizes hard samples and maximizes inter-class separability. The applicability of the proposed pipeline is demonstrated in an agentic AI application in the maritime transportation context. Extensive experiments show that DMTC achieves a Hamming loss of 5.35% and an AUC of 95.92%, outperforming state-of-the-art multi-label classifiers and recent end-to-end SOTA LLM-based baselines. Further analysis reveals that Sentence-T5 embeddings improve subset accuracy by at least 3.29% over alternative encoders, and integrating the OFC loss yields an additional 0.98% gain compared to standard contrastive objectives. In conclusion, our system seamlessly routes user queries to task-specific modules (e.g., ETA information, traffic risk evaluation, and other typical scenarios in the transportation domain), laying the groundwork for fully autonomous, intention-aware agents without costly manual labelling.
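
The paper does not spell out the OFC loss, so the sketch below is one plausible reading of a "focal-contrastive" objective: a supervised contrastive loss whose per-anchor term is re-weighted by a focal factor that emphasizes hard samples. It is single-label for simplicity (the paper's setting is multi-label), and every name and constant is an assumption:

```python
import numpy as np

def focal_contrastive_loss(emb, labels, tau=0.1, gamma=2.0):
    """Focal-weighted contrastive loss (illustrative guess, not the
    paper's exact OFC). For each anchor, p is the softmax-similarity
    mass on same-label samples; (1 - p)**gamma up-weights hard anchors."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T / tau
    np.fill_diagonal(sim, -np.inf)                  # exclude self-similarity
    ex = np.exp(sim - sim.max(axis=1, keepdims=True))
    probs = ex / ex.sum(axis=1, keepdims=True)
    losses = []
    for i in range(len(labels)):
        pos = (labels == labels[i]) & (np.arange(len(labels)) != i)
        if pos.any():
            p = probs[i, pos].sum()
            losses.append(-((1 - p) ** gamma) * np.log(p + 1e-12))
    return float(np.mean(losses))

emb = np.array([[1.0, 0.0], [0.95, 0.05], [0.0, 1.0], [0.05, 0.95]])
good = focal_contrastive_loss(emb, np.array([0, 0, 1, 1]))  # labels match clusters
bad = focal_contrastive_loss(emb, np.array([0, 1, 0, 1]))   # labels cross clusters
```

Well-clustered embeddings receive near-zero loss while hard (mislabeled or entangled) anchors dominate, which is the intended effect of the focal weighting.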

[253] Adaptable Hindsight Experience Replay for Search-Based Learning

Alexandros Vazaios, Jannis Brugger, Cedric Derstroff, Kristian Kersting, Mira Mezini

Main category: cs.LG

TL;DR: AlphaZero-like MCTS systems are adapted for classical search problems, but struggle with sparse rewards. Adaptable HER integrates Hindsight Experience Replay with AlphaZero to improve training by relabeling unsuccessful trajectories as learning signals.

DetailsMotivation: AlphaZero's training method is limited in sparse reward settings, especially early on when the neural network cannot provide effective guidance. Hindsight Experience Replay addresses this by converting failures into learning opportunities.

Method: Introduces Adaptable HER, a flexible framework that integrates HER with AlphaZero, allowing adjustable properties like relabeled goals, policy targets, and trajectory selection.

Result: Experiments including equation discovery show that modifying HER properties is beneficial and outperforms pure supervised or reinforcement learning approaches.

Conclusion: The Adaptable HER framework successfully enhances AlphaZero’s performance in sparse reward settings by leveraging flexible hindsight experience replay, demonstrating superior results compared to traditional methods.

Abstract: AlphaZero-like Monte Carlo Tree Search systems, originally introduced for two-player games, dynamically balance exploration and exploitation using neural network guidance. This combination makes them also suitable for classical search problems. However, the original method of training the network with simulation results is limited in sparse reward settings, especially in the early stages, where the network cannot yet give guidance. Hindsight Experience Replay (HER) addresses this issue by relabeling unsuccessful trajectories from the search tree as supervised learning signals. We introduce Adaptable HER, a flexible framework that integrates HER with AlphaZero, allowing easy adjustments to HER properties such as relabeled goals, policy targets, and trajectory selection. Our experiments, including equation discovery, show that the possibility of modifying HER is beneficial and surpasses the performance of pure supervised or reinforcement learning.
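
The core HER relabeling step the framework generalizes can be sketched in a few lines (the dictionary layout and reward convention here are illustrative assumptions, not the paper's data structures):

```python
def her_relabel(trajectory, achieved_goal):
    """Hindsight relabeling sketch: a trajectory that failed to reach its
    original goal is recast as a successful demonstration for the goal it
    actually reached, turning failure into a supervised learning signal."""
    relabeled = []
    for t, step in enumerate(trajectory):
        relabeled.append({
            "state": step["state"],
            "action": step["action"],
            "goal": achieved_goal,                        # replace original goal
            "reward": 1.0 if t == len(trajectory) - 1 else 0.0,
        })
    return relabeled

# failed attempt to reach goal 9; the search actually ended in state 4
traj = [{"state": s, "action": "+1", "goal": 9} for s in (2, 3, 4)]
new_traj = her_relabel(traj, achieved_goal=4)
```

Adaptable HER's contribution is making the choices baked in here (which goal to relabel with, which trajectories to select, what policy targets to emit) configurable.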

[254] TripleWin: Fixed-Point Equilibrium Pricing for Data-Model Coupled Markets

Hongrun Ren, Yun Xiong, Lei You, Yingying Wang, Haixu Xiong, Yangyong Zhu

Main category: cs.LG

TL;DR: Proposes a unified data-model coupled market that treats dataset and model trading as a single system with bidirectional price propagation between data sellers, model producers, and model buyers, ensuring equilibrium convergence and improved fairness.

DetailsMotivation: Current ML model economy approaches separate data and model transactions or use broker-centric pipelines that favor one side, lacking simultaneous symmetric mechanisms across all market participants.

Method: Uses supply-side mapping to transform dataset payments into model quotations and demand-side mapping to propagate buyer prices back to datasets through Shapley-based allocation, forming a closed loop with bidirectional supply-demand propagation.

Result: The joint operator is proven to be a standard interference function (SIF), guaranteeing existence, uniqueness, and global convergence of equilibrium prices. Experiments show efficient convergence and improved fairness compared to baselines.

Conclusion: The proposed unified market mechanism successfully links all participants in the ML model economy, providing theoretical guarantees for equilibrium and demonstrating practical advantages over existing approaches.

Abstract: The rise of the machine learning (ML) model economy has intertwined markets for training datasets and pre-trained models. However, most pricing approaches still separate data and model transactions or rely on broker-centric pipelines that favor one side. Recent studies of data markets with externalities capture buyer interactions but do not yield a simultaneous and symmetric mechanism across data sellers, model producers, and model buyers. We propose a unified data-model coupled market that treats dataset and model trading as a single system. A supply-side mapping transforms dataset payments into buyer-visible model quotations, while a demand-side mapping propagates buyer prices back to datasets through Shapley-based allocation. Together, they form a closed loop that links four interactions: supply-demand propagation in both directions and mutual coupling among buyers and among sellers. We prove that the joint operator is a standard interference function (SIF), guaranteeing existence, uniqueness, and global convergence of equilibrium prices. Experiments demonstrate efficient convergence and improved fairness compared with broker-centric and one-sided baselines. The code is available on https://github.com/HongrunRen1109/Triple-Win-Pricing.
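
The equilibrium argument rests on the standard interference function (SIF) framework: an update map that is positive, monotone, and scalable admits a unique fixed point reached by simple iteration from any positive start. A minimal sketch with a toy coupled price update (the concrete map and constants are assumptions, not the paper's market model):

```python
import numpy as np

def fixed_point_prices(update_fn, p0, tol=1e-10, max_iter=1000):
    """Iterate p <- I(p); for a standard interference function (positive,
    monotone, scalable) this converges to the unique fixed point from any
    positive starting price vector."""
    p = np.asarray(p0, dtype=float)
    for _ in range(max_iter):
        q = update_fn(p)
        if np.max(np.abs(q - p)) < tol:
            return q
        p = q
    return p

# toy coupled update: positive, monotone, and scalable, hence a SIF
base = np.array([1.0, 2.0])
coupling = np.array([[0.0, 0.3], [0.2, 0.0]])
update_fn = lambda p: base + coupling @ np.sqrt(p)

eq_low = fixed_point_prices(update_fn, np.ones(2))
eq_high = fixed_point_prices(update_fn, 100.0 * np.ones(2))
```

Starting from very different price vectors, both runs land on the same equilibrium, mirroring the paper's existence, uniqueness, and global-convergence guarantee.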

[255] POEMS: Product of Experts for Interpretable Multi-omic Integration using Sparse Decoding

Mihriban Kocak Balik, Pekka Marttinen, Negar Safinianaini

Main category: cs.LG

TL;DR: POEMS is an unsupervised probabilistic framework that integrates multiomics data while maintaining both predictive performance and interpretability through sparse decoding, without linearizing the network.

DetailsMotivation: Current deep generative models for multiomics integration face a trade-off between predictive performance and interpretability - they either sacrifice interpretability for performance or use linear decoders that reduce nonlinear expressiveness.

Method: POEMS uses: 1) sparse connections mapping features to latent factors for biomarker discovery, 2) product of experts model for cross-omic associations through shared latent space, 3) gating network to report contributions of each omic, and 4) efficient sparse decoder.

Result: In cancer subtyping case study, POEMS achieves competitive clustering and classification performance while providing novel interpretations, demonstrating that biomarker-based insight and predictive accuracy can coexist.

Conclusion: POEMS successfully overcomes the trade-off between predictive performance and interpretability in multiomics integration, showing that both biomarker discovery and accurate predictions can be achieved simultaneously in representation learning.

Abstract: Integrating different molecular layers, i.e., multiomics data, is crucial for unraveling the complexity of diseases; yet, most deep generative models either prioritize predictive performance at the expense of interpretability or enforce interpretability by linearizing the decoder, thereby weakening the network’s nonlinear expressiveness. To overcome this tradeoff, we introduce POEMS: Product Of Experts for Interpretable Multiomics Integration using Sparse Decoding, an unsupervised probabilistic framework that preserves predictive performance while providing interpretability. POEMS provides interpretability without linearizing any part of the network by 1) mapping features to latent factors using sparse connections, which directly translates to biomarker discovery, 2) allowing for cross-omic associations through a shared latent space using product of experts model, and 3) reporting contributions of each omic by a gating network that adaptively computes their influence in the representation learning. Additionally, we present an efficient sparse decoder. In a cancer subtyping case study, POEMS achieves competitive clustering and classification performance while offering our novel set of interpretations, demonstrating that biomarker based insight and predictive accuracy can coexist in multiomics representation learning.
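
The product-of-experts mechanism for fusing per-omic posteriors has a closed form in the Gaussian case: the product of Gaussians is Gaussian with summed precisions and a precision-weighted mean, so more confident omics dominate the shared latent code. A minimal sketch (array shapes and names are illustrative):

```python
import numpy as np

def product_of_experts(mus, sigmas):
    """Gaussian product of experts: combine per-expert posteriors
    N(mu_k, sigma_k^2) into a single Gaussian with summed precision
    and precision-weighted mean."""
    mus = np.asarray(mus, dtype=float)
    prec = 1.0 / np.asarray(sigmas, dtype=float) ** 2   # per-expert precision
    var = 1.0 / prec.sum(axis=0)                        # combined variance
    mu = var * (prec * mus).sum(axis=0)                 # precision-weighted mean
    return mu, np.sqrt(var)

# two equally confident "omics" experts disagreeing about one latent factor
mu, sd = product_of_experts(mus=[[0.0], [2.0]], sigmas=[[1.0], [1.0]])
```

With equal confidence the combined mean splits the difference and the combined variance shrinks; an expert with larger sigma would be down-weighted automatically, which is the behavior POEMS's gating network reports per omic.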

[256] Reinforcement Learning Using known Invariances

Alexandru Cioba, Aya Kayal, Laura Toni, Sattar Vakili, Alberto Bernacchia

Main category: cs.LG

TL;DR: This paper develops a symmetry-aware kernel-based RL framework that exploits environmental symmetries to improve learning efficiency through invariant kernels and optimistic LSVI.

DetailsMotivation: Many RL environments have inherent symmetries that can be leveraged to enhance learning efficiency, but existing methods don't systematically exploit these structural priors.

Method: Proposed symmetry-aware variant of optimistic least-squares value iteration (LSVI) using invariant kernels to encode invariance in rewards and transition dynamics.

Result: Established theoretical bounds showing sample efficiency gains from symmetry, with empirical validation on Frozen Lake and 2D placement problems demonstrating significantly better performance than standard kernel methods.

Conclusion: Structural priors like symmetry are valuable for designing more sample-efficient RL algorithms, with the proposed framework providing both theoretical guarantees and practical improvements.

Abstract: In many real-world reinforcement learning (RL) problems, the environment exhibits inherent symmetries that can be exploited to improve learning efficiency. This paper develops a theoretical and algorithmic framework for incorporating known group symmetries into kernel-based RL. We propose a symmetry-aware variant of optimistic least-squares value iteration (LSVI), which leverages invariant kernels to encode invariance in both rewards and transition dynamics. Our analysis establishes new bounds on the maximum information gain and covering numbers for invariant RKHSs, explicitly quantifying the sample efficiency gains from symmetry. Empirical results on a customized Frozen Lake environment and a 2D placement design problem confirm the theoretical improvements, demonstrating that symmetry-aware RL achieves significantly better performance than its standard kernel counterparts. These findings highlight the value of structural priors in designing more sample-efficient reinforcement learning algorithms.
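
A standard way to build the invariant kernels the method relies on is group averaging (a generic construction, shown here with an RBF base kernel and the 90-degree rotation group; the paper's specific kernels may differ):

```python
import numpy as np

def invariant_rbf(x, y, group, ls=1.0):
    """Group-averaged RBF kernel k_G(x, y) = mean_g k(x, g y). Since the
    RBF base kernel satisfies k(g x, g y) = k(x, y) for orthogonal g,
    averaging over one argument already makes k_G invariant to the group
    acting on either input."""
    k = lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2 * ls ** 2))
    return float(np.mean([k(x, g @ y) for g in group]))

def rot(t):
    """2x2 rotation matrix by angle t."""
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

# cyclic group C4: rotations of the plane by multiples of 90 degrees
group = [rot(np.pi / 2 * i) for i in range(4)]
x, y = np.array([1.0, 0.3]), np.array([-0.2, 0.8])
```

Value functions represented in the RKHS of such a kernel are automatically invariant under the group action, which is what shrinks the effective hypothesis space and yields the paper's tighter information-gain bounds.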

[257] RAGBoost: Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse

Yinsicheng Jiang, Yeqi Huang, Liang Cheng, Cheng Deng, Xuan Sun, Luo Mai

Main category: cs.LG

TL;DR: RAGBoost is an efficient RAG system that achieves high cache reuse without sacrificing accuracy through accuracy-preserving context reuse, improving prefill performance by 1.5-3X over state-of-the-art methods.

DetailsMotivation: Retrieval-augmented generation (RAG) suffers from degraded prefill performance with longer, more complex inputs, and existing caching techniques either preserve accuracy with low cache reuse or improve reuse at the cost of degraded reasoning quality.

Method: RAGBoost detects overlapping retrieved items across concurrent sessions and multi-turn interactions, using efficient context indexing, ordering, and de-duplication to maximize reuse, while lightweight contextual hints maintain reasoning fidelity.

Result: RAGBoost improves prefill performance by 1.5-3X over state-of-the-art methods while preserving or even enhancing reasoning accuracy across diverse RAG and agentic AI workloads.

Conclusion: RAGBoost successfully addresses the trade-off between cache reuse and accuracy in RAG systems, providing significant performance improvements without compromising reasoning quality.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with retrieved context but often suffers from downgraded prefill performance as modern applications demand longer and more complex inputs. Existing caching techniques either preserve accuracy with low cache reuse or improve reuse at the cost of degraded reasoning quality. We present RAGBoost, an efficient RAG system that achieves high cache reuse without sacrificing accuracy through accuracy-preserving context reuse. RAGBoost detects overlapping retrieved items across concurrent sessions and multi-turn interactions, using efficient context indexing, ordering, and de-duplication to maximize reuse, while lightweight contextual hints maintain reasoning fidelity. It integrates seamlessly with existing LLM inference engines and improves their prefill performance by 1.5-3X over state-of-the-art methods, while preserving or even enhancing reasoning accuracy across diverse RAG and agentic AI workloads. Our code is released at: https://github.com/Edinburgh-AgenticAI/RAGBoost.
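
The overlap-detection idea can be illustrated with an order-preserving de-duplication pass over retrieved items (a minimal sketch; RAGBoost's actual context indexing, ordering, and contextual hints are more involved, and all names here are assumptions):

```python
def build_context(turns):
    """Order-preserving de-duplication sketch: items retrieved again in a
    later session or turn map onto the single cached copy, so the shared
    context prefix can be reused instead of re-prefilled."""
    seen, context = set(), []
    for retrieved in turns:
        for doc_id, text in retrieved:
            if doc_id not in seen:
                seen.add(doc_id)
                context.append((doc_id, text))
    return context

turn1 = [("d1", "pricing policy"), ("d2", "refund rules")]
turn2 = [("d2", "refund rules"), ("d3", "shipping times")]
ctx = build_context([turn1, turn2])
```

Keeping one copy of `d2` is what lets its KV-cache entries be reused in the second turn; the system's extra machinery exists to do this without perturbing the ordering the model's reasoning depends on.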

[258] NAP: Attention-Based Late Fusion for Automatic Sleep Staging

Alvise Dei Rossi, Julia van der Meer, Markus H. Schmidt, Claudio L. A. Bassetti, Luigi Fiorillo, Francesca Faraci

Main category: cs.LG

TL;DR: NAP is an attention-based model that aggregates predictions from multiple single-channel models using tri-axial attention to handle heterogeneous polysomnography data with varying modalities and channels.

DetailsMotivation: Polysomnography signals are highly heterogeneous across datasets with varying modalities and channels, but existing models use fixed subsets and fail to exploit the full multimodal nature of the data.

Method: Introduces NAP (Neural Aggregator of Predictions) with tri-axial attention mechanism that captures temporal, spatial, and predictor-level dependencies, trained to adapt to different input dimensions by aggregating outputs from frozen pretrained single-channel models.

Result: NAP consistently outperforms individual predictors and simple ensembles, achieving state-of-the-art zero-shot generalization across multiple datasets for automated sleep staging.

Conclusion: The proposed approach effectively handles heterogeneous polysomnography data and could be extended to other multimodal physiological applications beyond sleep staging.

Abstract: Polysomnography signals are highly heterogeneous, varying in modality composition (e.g., EEG, EOG, ECG), channel availability (e.g., frontal, occipital EEG), and acquisition protocols across datasets and clinical sites. Most existing models that process polysomnography data rely on a fixed subset of modalities or channels and therefore neglect to fully exploit its inherently multimodal nature. We address this limitation by introducing NAP (Neural Aggregator of Predictions), an attention-based model which learns to combine multiple prediction streams using a tri-axial attention mechanism that captures temporal, spatial, and predictor-level dependencies. NAP is trained to adapt to different input dimensions. By aggregating outputs from frozen, pretrained single-channel models, NAP consistently outperforms individual predictors and simple ensembles, achieving state-of-the-art zero-shot generalization across multiple datasets. While demonstrated in the context of automated sleep staging from polysomnography, the proposed approach could be extended to other multimodal physiological applications.

[259] Efficient Neural Networks with Discrete Cosine Transform Activations

Marc Martinez-Gost, Sara Pepe, Ana Pérez-Neira, Miguel Ángel Lagunas

Main category: cs.LG

TL;DR: ENNs use DCT-parameterized adaptive activation functions for efficient, interpretable neural networks with strong pruning capabilities, achieving SOTA performance with fewer parameters.

DetailsMotivation: To enhance neural network efficiency and interpretability by extending ENNs with DCT-based parameterization that reveals neuron functionality and enables direct identification of redundant components.

Method: Extends ENN framework using Discrete Cosine Transform to parameterize adaptive activation functions, enabling structured representation and efficient pruning of unnecessary DCT coefficients.

Result: ENNs achieve state-of-the-art accuracy in classification and implicit neural representation tasks while maintaining low parameter count, with up to 40% of activation coefficients safely pruned without performance loss.

Conclusion: ENN framework successfully integrates signal processing concepts into neural network design, achieving optimal balance between expressiveness, compactness, and interpretability.

Abstract: In this paper, we extend our previous work on the Expressive Neural Network (ENN), a multilayer perceptron with adaptive activation functions parametrized using the Discrete Cosine Transform (DCT). Building upon previous work that demonstrated the strong expressiveness of ENNs with compact architectures, we now emphasize their efficiency, interpretability and pruning capabilities. The DCT-based parameterization provides a structured and decorrelated representation that reveals the functional role of each neuron and allows direct identification of redundant components. Leveraging this property, we propose an efficient pruning strategy that removes unnecessary DCT coefficients with negligible or no loss in performance. Experimental results across classification and implicit neural representation tasks confirm that ENNs achieve state-of-the-art accuracy while maintaining a low number of parameters. Furthermore, up to 40% of the activation coefficients can be safely pruned, thanks to the orthogonality and bounded nature of the DCT basis. Overall, these findings demonstrate that the ENN framework offers a principled integration of signal processing concepts into neural network design, achieving a balanced trade-off between expressiveness, compactness, and interpretability.
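
The DCT parameterization and the coefficient-pruning step can be sketched as follows (the exact basis convention, interval, and energy threshold are illustrative assumptions, not the paper's specification):

```python
import numpy as np

def dct_activation(x, coeffs, L=1.0):
    """Adaptive activation as a truncated DCT series on [-L, L] (sketch):
    f(x) = sum_k c_k cos(pi * k * (x + L) / (2 L))."""
    ks = np.arange(len(coeffs))
    basis = np.cos(np.pi * np.outer(np.clip(x, -L, L) + L, ks) / (2 * L))
    return basis @ coeffs

def prune_coeffs(coeffs, keep_energy=0.99):
    """Zero the smallest DCT coefficients while retaining `keep_energy`
    of the total squared-coefficient energy."""
    order = np.argsort(np.abs(coeffs))[::-1]          # largest first
    frac = np.cumsum(coeffs[order] ** 2) / np.sum(coeffs ** 2)
    keep = order[: np.searchsorted(frac, keep_energy) + 1]
    pruned = np.zeros_like(coeffs)
    pruned[keep] = coeffs[keep]
    return pruned

coeffs = np.array([1.0, -0.5, 0.002, -0.001, 0.003])
pruned = prune_coeffs(coeffs)
x = np.linspace(-1, 1, 101)
```

Because the cosine basis is orthogonal and bounded, dropping small coefficients changes the activation by at most the sum of their magnitudes, which is why aggressive pruning costs little accuracy.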

[260] Why Less is More (Sometimes): A Theory of Data Curation

Elvis Dohmatob, Mohammad Pezeshki, Reyhane Askari-Hemmat

Main category: cs.LG

TL;DR: This paper introduces a theoretical framework explaining when using less data can outperform full datasets in machine learning, challenging classical scaling laws.

DetailsMotivation: To resolve the paradox in modern ML where classical scaling laws suggest "more is more" but recent methods like LIMO achieve superior performance with small, curated datasets.

Method: Developed a theoretical framework studying data curation strategies where an imperfect oracle selects training examples based on difficulty and correctness, deriving exact scaling law curves for test error under different curation rules.

Result: Showed that under certain conditions, small curated datasets can outperform full datasets, with analytical conditions for phase transitions tied to data size and quality. Validated with empirical results on ImageNet confirming predictions about when curation improves accuracy and mitigates model collapse.

Conclusion: Provides a principled explanation for contradictory curation strategies in LLM mathematical reasoning and demonstrates when “less is more” can be beneficial in machine learning.

Abstract: This paper introduces a theoretical framework to resolve a central paradox in modern machine learning: When is it better to use less data? This question has become critical as classical scaling laws suggesting “more is more” (Sun et al., 2025) are challenged by methods like LIMO (“less is more”) and s1 (Ye et al., 2025; Muenighoff et al., 2025), which achieve superior performance with small, aggressively curated datasets. Here, we study data curation strategies where an imperfect oracle selects the training examples according to their difficulty and correctness. Our results provide exact scaling law curves for test error under both label-agnostic and label-aware curation rules, revealing when and why keeping only a subset of data can improve generalization. In contrast to classical scaling laws, we show that under certain conditions, small curated datasets can outperform full datasets, and we provide analytical conditions for this by deriving precise phase transition curves tied to data size and quality. We validate these theoretical claims with empirical results on ImageNet, confirming our predictions about when curation improves accuracy and can even mitigate model collapse. Furthermore, our framework provides a principled explanation for the contradictory curation strategies recently observed in LLM mathematical reasoning.

[261] Learning Without Critics? Revisiting GRPO in Classical Reinforcement Learning Environments

Bryan L. M. de Oliveira, Felipe V. Frujeri, Marcos P. C. M. Queiroz, Luana G. B. Martins, Telma W. de L. Soares, Luckeciano C. Melo

Main category: cs.LG

TL;DR: GRPO eliminates learned critics and uses group-relative trajectory comparisons instead, but systematic study shows learned critics remain essential for long-horizon tasks, with critic-free methods only working well in short-horizon environments like CartPole.

DetailsMotivation: To understand the fundamental necessity of learned baselines in policy-gradient methods by systematically studying GRPO as a scalable alternative to PPO that eliminates learned critics.

Method: Systematic study of GRPO in classical single-task RL environments through controlled ablations isolating baselines, discounting, and group sampling across discrete and continuous control tasks.

Result: Three key findings: 1) Learned critics essential for long-horizon tasks, critic-free only works in short-horizon environments; 2) GRPO benefits from high discount factors except in HalfCheetah; 3) Smaller group sizes outperform larger ones.

Conclusion: Critic-free methods have limitations in classical control tasks and are only viable alternatives to learned value functions under specific conditions, mainly short-horizon environments.

Abstract: Group Relative Policy Optimization (GRPO) has emerged as a scalable alternative to Proximal Policy Optimization (PPO) by eliminating the learned critic and instead estimating advantages through group-relative comparisons of trajectories. This simplification raises fundamental questions about the necessity of learned baselines in policy-gradient methods. We present the first systematic study of GRPO in classical single-task reinforcement learning environments, spanning discrete and continuous control tasks. Through controlled ablations isolating baselines, discounting, and group sampling, we reveal three key findings: (1) learned critics remain essential for long-horizon tasks: all critic-free baselines underperform PPO except in short-horizon environments like CartPole where episodic returns can be effective; (2) GRPO benefits from high discount factors (gamma = 0.99) except in HalfCheetah, where lack of early termination favors moderate discounting (gamma = 0.9); (3) smaller group sizes outperform larger ones, suggesting limitations in batch-based grouping strategies that mix unrelated episodes. These results reveal both the limitations of critic-free methods in classical control and the specific conditions where they remain viable alternatives to learned value functions.
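
The critic-free advantage estimate at the center of the study is a one-liner: each trajectory's return is standardized against the statistics of its own sampled group (the function name and epsilon are incidental choices):

```python
import numpy as np

def group_relative_advantages(returns, eps=1e-8):
    """GRPO-style advantage: standardize each trajectory's return against
    its own sampled group, replacing the learned critic's baseline."""
    r = np.asarray(returns, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

adv = group_relative_advantages([1.0, 2.0, 3.0])
```

The group mean plays the role of the value baseline, which works when grouped episodes are comparable (short horizons, shared starts) and degrades when a batch mixes unrelated episodes, matching the paper's third finding.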

[262] Imitation Learning in the Deep Learning Era: A Novel Taxonomy and Recent Advances

Iason Chrysomallis, Georgios Chalkiadakis

Main category: cs.LG

TL;DR: This paper provides a comprehensive survey of recent advances in imitation learning, proposing a novel taxonomy to reflect current research trends and addressing challenges like generalization and covariate shift.

DetailsMotivation: To review and organize the rapidly expanding field of imitation learning, which has seen significant growth due to deep learning advances, and to address the need for updated categorization that reflects current research trends.

Method: The authors conduct a systematic survey of imitation learning literature, propose a novel taxonomy distinct from existing categorizations, and critically examine representative works’ strengths, limitations, and evaluation practices.

Result: The survey identifies key advances in imitation learning across various domains, highlights methodological innovations, and provides a comprehensive overview of current research trends and practical applications.

Conclusion: The paper outlines key challenges and open directions for future imitation learning research while providing an updated framework for understanding the current state of the field.

Abstract: Imitation learning (IL) enables agents to acquire skills by observing and replicating the behavior of one or multiple experts. In recent years, advances in deep learning have significantly expanded the capabilities and scalability of imitation learning across a range of domains, where expert data can range from full state-action trajectories to partial observations or unlabeled sequences. Alongside this growth, novel approaches have emerged, with new methodologies being developed to address longstanding challenges such as generalization, covariate shift, and demonstration quality. In this survey, we review the latest advances in imitation learning research, highlighting recent trends, methodological innovations, and practical applications. We propose a novel taxonomy that is distinct from existing categorizations to better reflect the current state of IL research and its trends. Throughout the survey, we critically examine the strengths, limitations, and evaluation practices of representative works, and we outline key challenges and open directions for future research.

[263] Byzantine-Robust Federated Learning with Learnable Aggregation Weights

Javad Parsa, Amir Hossein Daghestani, André M. H. Teixeira, Mikael Johansson

Main category: cs.LG

TL;DR: Proposes a Byzantine-robust federated learning method that treats aggregation weights as learnable parameters, jointly optimized with model parameters using alternating minimization.

DetailsMotivation: Address challenges of malicious (Byzantine) clients in federated learning, especially under heterogeneous data distributions across clients.

Method: Formulates novel optimization problem with adaptive weighting, treats aggregation weights as learnable parameters, develops alternating minimization algorithm with convergence guarantees.

Result: Outperforms state-of-the-art Byzantine-robust FL approaches across various datasets and attack scenarios, especially with highly heterogeneous data and large proportion of malicious clients.

Conclusion: Proposed method provides superior Byzantine resilience and performance in challenging FL scenarios with data heterogeneity and adversarial attacks.

Abstract: Federated Learning (FL) enables clients to collaboratively train a global model without sharing their private data. However, the presence of malicious (Byzantine) clients poses significant challenges to the robustness of FL, particularly when data distributions across clients are heterogeneous. In this paper, we propose a novel Byzantine-robust FL optimization problem that incorporates adaptive weighting into the aggregation process. Unlike conventional approaches, our formulation treats aggregation weights as learnable parameters, jointly optimizing them alongside the global model parameters. To solve this optimization problem, we develop an alternating minimization algorithm with strong convergence guarantees under adversarial attack. We analyze the Byzantine resilience of the proposed objective. We evaluate the performance of our algorithm against state-of-the-art Byzantine-robust FL approaches across various datasets and attack scenarios. Experimental results demonstrate that our method consistently outperforms existing approaches, particularly in settings with highly heterogeneous data and a large proportion of malicious clients.
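
The alternating-minimization idea of treating aggregation weights as learnable can be illustrated with a generic scheme (this is an illustrative stand-in, not the paper's exact objective or algorithm; the weight rule and constants are assumptions):

```python
import numpy as np

def robust_aggregate(updates, lam=1.0, rounds=10):
    """Illustrative alternating scheme: alternately (1) form the weighted
    aggregate and (2) re-learn the aggregation weights, down-weighting
    clients whose updates sit far from the current aggregate."""
    g = np.asarray(updates, dtype=float)          # (clients, dim)
    w = np.full(len(g), 1.0 / len(g))             # start from uniform weights
    agg = w @ g
    for _ in range(rounds):
        d = np.sum((g - agg) ** 2, axis=1)        # per-client distance
        w = np.exp(-lam * (d - d.min()))          # shift before exp for stability
        w /= w.sum()
        agg = w @ g                               # re-aggregate with new weights
    return agg, w

rng = np.random.default_rng(0)
honest = 1.0 + 0.01 * rng.normal(size=(4, 3))     # clients near the true update
byzantine = np.array([[100.0, -100.0, 100.0]])    # one malicious client
agg, w = robust_aggregate(np.vstack([honest, byzantine]))
```

After a few alternations the Byzantine client's weight collapses to (numerically) zero and the aggregate tracks the honest consensus, which is the qualitative behavior the learned-weights formulation targets under heterogeneity.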

[264] Flat Minima and Generalization: Insights from Stochastic Convex Optimization

Matan Schliserman, Shira Vansover-Hager, Tomer Koren

Main category: cs.LG

TL;DR: Flat minima don’t guarantee good generalization in convex optimization; sharpness-aware algorithms (SA-GD and SAM) can converge to flat minima but still have poor population risk.

Motivation: To understand the link between flat minima and generalization in stochastic convex optimization, challenging the common belief that flat minima always generalize well.

Method: Theoretical analysis of stochastic convex optimization with non-negative β-smooth objectives, examining SA-GD and SAM algorithms designed to find flat minima.

Result: Flat empirical minima can have Ω(1) population risk while sharp minima generalize optimally; SA-GD finds flat minima but with poor generalization; SAM may converge to sharp minima with poor generalization.

Conclusion: Flatness alone is insufficient for good generalization; sharpness-aware algorithms don’t necessarily improve generalization despite finding flat minima.

Abstract: Understanding the generalization behavior of learning algorithms is a central goal of learning theory. A recently emerging explanation is that learning algorithms are successful in practice because they converge to flat minima, which have been consistently associated with improved generalization performance. In this work, we study the link between flat minima and generalization in the canonical setting of stochastic convex optimization with a non-negative, $\beta$-smooth objective. Our first finding is that, even in this fundamental and well-studied setting, flat empirical minima may incur trivial $\Omega(1)$ population risk while sharp minima generalize optimally. Then, we show that this poor generalization behavior extends to two natural "sharpness-aware" algorithms originally proposed by Foret et al. (2021), designed to bias optimization toward flat solutions: Sharpness-Aware Gradient Descent (SA-GD) and Sharpness-Aware Minimization (SAM). For SA-GD, which performs gradient steps on the maximal loss in a predefined neighborhood, we prove that while it successfully converges to a flat minimum at a fast rate, the population risk of the solution can still be as large as $\Omega(1)$, indicating that even flat minima found algorithmically using a sharpness-aware gradient method might generalize poorly. For SAM, a computationally efficient approximation of SA-GD based on normalized ascent steps, we show that although it minimizes the empirical loss, it may converge to a sharp minimum and also incur population risk $\Omega(1)$. Finally, we establish population risk upper bounds for both SA-GD and SAM using algorithmic stability techniques.
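The SAM update analyzed in the paper follows the standard two-step form of Foret et al.: a normalized ascent step of radius rho, then a descent step using the gradient at the perturbed point. A minimal sketch on a toy quadratic (the loss and hyperparameters are illustrative, not from the paper):

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: normalized ascent step to the (approximate) worst
    point in a rho-ball, then a descent step using the gradient there."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # normalized ascent direction
    return w - lr * grad_fn(w + eps)             # descend at perturbed point

# toy loss L(w) = 0.5 * ||w||^2, whose gradient is w itself
grad = lambda w: w
w = np.array([3.0, -4.0])
for _ in range(100):
    w = sam_step(w, grad)
```

On this convex quadratic the iterates contract toward the minimizer; the paper's point is that such convergence to a flat region does not by itself guarantee low population risk.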

[265] Learning Under Laws: A Constraint-Projected Neural PDE Solver that Eliminates Hallucinations

Mainak Singha

Main category: cs.LG

TL;DR: CPL is a framework that trains neural networks to solve PDEs while strictly enforcing physical laws through constraint projection, ensuring conservation, entropy bounds, and stability without sacrificing accuracy.

Motivation: Neural networks solving PDEs often violate physical laws like conservation, entropy, and shock conditions, creating unphysical solutions that break the very principles they should model.

Method: Constraint-Projected Learning (CPL) projects network outputs onto constraint sets for conservation, Rankine-Hugoniot balance, entropy, and positivity. It uses differentiable projection with 10% overhead, total-variation damping, and rollout curriculum for long-term stability.

Result: CPL eliminates physical violations: conservation holds at machine precision, total-variation growth vanishes, entropy and error remain bounded. Produces stable, physically lawful solutions on Burgers and Euler systems without accuracy loss.

Conclusion: CPL makes physical law adherence an intrinsic property of neural PDE solvers rather than a hopeful outcome, ensuring reliable and physically consistent solutions.

Abstract: Neural networks can approximate solutions to partial differential equations, but they often break the very laws they are meant to model-creating mass from nowhere, drifting shocks, or violating conservation and entropy. We address this by training within the laws of physics rather than beside them. Our framework, called Constraint-Projected Learning (CPL), keeps every update physically admissible by projecting network outputs onto the intersection of constraint sets defined by conservation, Rankine-Hugoniot balance, entropy, and positivity. The projection is differentiable and adds only about 10% computational overhead, making it fully compatible with back-propagation. We further stabilize training with total-variation damping (TVD) to suppress small oscillations and a rollout curriculum that enforces consistency over long prediction horizons. Together, these mechanisms eliminate both hard and soft violations: conservation holds at machine precision, total-variation growth vanishes, and entropy and error remain bounded. On Burgers and Euler systems, CPL produces stable, physically lawful solutions without loss of accuracy. Instead of hoping neural solvers will respect physics, CPL makes that behavior an intrinsic property of the learning process.
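The projection idea can be sketched for the two simplest constraint sets, conservation (an affine constraint) and positivity (the nonnegative orthant). This is a toy alternating-projection sketch, not CPL's full differentiable projection onto all four constraint sets:

```python
import numpy as np

def project_conservation(u, total):
    """Euclidean projection of u onto the hyperplane sum(u) = total."""
    return u + (total - u.sum()) / u.size

def project_positive(u):
    """Euclidean projection onto the nonnegative orthant."""
    return np.maximum(u, 0.0)

def project(u, total, iters=50):
    """Alternate projections onto the two convex constraint sets,
    finishing on the conservation set so mass holds to machine precision."""
    for _ in range(iters):
        u = project_positive(project_conservation(u, total))
    return project_conservation(u, total)

raw = np.array([0.4, -0.1, 0.9, 0.3])   # raw network output, slightly unphysical
u = project(raw, total=1.5)
```

After projection the cell values sum to the conserved total at machine precision while the negative cell has been driven to (numerically) zero, mirroring the "conservation holds at machine precision" behavior the paper reports.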

[266] TabGemma: Text-Based Tabular ICL via LLM using Continued Pretraining and Retrieval

Günther Schindler, Maximilian Schambach, Michael Medek, Sam Thelin

Main category: cs.LG

TL;DR: TabGemma is a schema-agnostic LLM for tabular prediction that addresses numeric tokenization instability and context limitations through scientific notation canonicalization and n-gram retrieval for exemplar selection.

Motivation: To adapt pretrained LLMs for tabular prediction with mixed data types (text, numeric, categorical) by overcoming practical hurdles like unstable numeric tokenization and limited context size.

Method: Canonicalize numbers via signed scientific notation, continue pretraining a 12B Gemma 3 model with target imputation objective, and use compact n-gram-based retrieval for informative exemplars within 128k-token window.

Result: State-of-the-art on classification across low- and high-data regimes with monotonic improvement from more context rows; competitive for regression at small sample sizes but trails conventional approaches with more data.

Conclusion: LLMs can be effective tabular in-context learners for semantic tasks when paired with dedicated numeric handling and context retrieval, motivating further advances in numeric modeling and long-context scaling.

Abstract: We study LLMs for tabular prediction with mixed text, numeric, and categorical fields. We introduce TabGemma, a schema-agnostic in-context learner that treats rows as sequences and tackles two practical hurdles when adapting pretrained LLMs for tabular predictions: unstable numeric tokenization and limited context size. We propose to canonicalize numbers via signed scientific notation and continue pretraining of a 12B Gemma 3 model with a target imputation objective using a large-scale real world dataset. For inference, we use a compact n-gram-based retrieval to select informative exemplars that fit within a 128k-token window. On semantically rich benchmarks, TabGemma establishes a new state of the art on classification across low- and high-data regimes and improves monotonically with more context rows. For regression, it is competitive at small sample sizes but trails conventional approaches as data grows. Our results show that LLMs can be effective tabular in-context learners on highly semantic tasks when paired with dedicated numeric handling and context retrieval, while motivating further advances in numeric modeling and long-context scaling.
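Canonicalizing numbers into signed scientific notation can be sketched as follows. The exact format TabGemma uses is not specified in the summary, so the mantissa width and layout here are assumptions:

```python
def canonicalize(x, digits=4):
    """Render a number with an explicit sign, a fixed-width mantissa,
    and a signed exponent, so every value tokenizes uniformly."""
    s = f"{x:+.{digits}e}"          # e.g. '+1.2340e-02'
    mantissa, exp = s.split("e")
    return f"{mantissa}e{int(exp):+d}"

print(canonicalize(0.01234))    # '+1.2340e-2'
print(canonicalize(-31415.9))   # '-3.1416e+4'
```

The point of such a scheme is that every numeric cell becomes a short, fixed-shape string, avoiding the unstable tokenization of raw decimal literals.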

[267] Tensor-Efficient High-Dimensional Q-learning

Junyi Wu, Dan Li

Main category: cs.LG

TL;DR: TEQL is a tensor-efficient Q-learning method that uses improved low-rank tensor decomposition with novel exploration and regularization mechanisms to address high-dimensional RL challenges.

Motivation: High-dimensional RL faces computational complexity and low sample efficiency due to the curse of dimensionality in large state-action spaces, requiring more parameter-efficient alternatives to neural network approaches.

Method: Enhanced low-rank tensor decomposition via block coordinate descent on discretized spaces, with exploration strategy combining approximation error and visit count-based UCB, plus frequency-based regularization to encourage exploration of less-visited pairs.

Result: TEQL outperforms conventional matrix-based methods and deep RL approaches in both sample efficiency and total rewards on classic control tasks.

Conclusion: TEQL is suitable for resource-constrained applications like space and healthcare where sampling costs are high, offering improved performance over existing methods.

Abstract: High-dimensional reinforcement learning faces challenges with complex calculations and low sample efficiency in large state-action spaces. Q-learning algorithms struggle particularly with the curse of dimensionality, where the number of state-action pairs grows exponentially with problem size. While neural network-based approaches like Deep Q-Networks have shown success, recent tensor-based methods using low-rank decomposition offer more parameter-efficient alternatives. Building upon existing tensor-based methods, we propose Tensor-Efficient Q-Learning (TEQL), which enhances low-rank tensor decomposition via improved block coordinate descent on discretized state-action spaces, incorporating novel exploration and regularization mechanisms. The key innovation is an exploration strategy that combines approximation error with visit count-based upper confidence bound to prioritize actions with high uncertainty, avoiding wasteful random exploration. Additionally, we incorporate a frequency-based penalty term in the objective function to encourage exploration of less-visited state-action pairs and reduce overfitting to frequently visited regions. Empirical results on classic control tasks demonstrate that TEQL outperforms conventional matrix-based methods and deep RL approaches in both sample efficiency and total rewards, making it suitable for resource-constrained applications, such as space and healthcare where sampling costs are high.
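The exploration score can be sketched as a standard visit-count UCB bonus plus an approximation-error term; the exact functional form and coefficients in TEQL are assumptions here:

```python
import numpy as np

def select_action(q_row, visits, approx_err, t, c=1.0, beta=0.5):
    """Pick the action maximizing estimated value plus two uncertainty
    bonuses: a visit-count UCB term and a tensor-approximation-error term."""
    ucb = c * np.sqrt(np.log(t + 1) / (visits + 1))
    return int(np.argmax(q_row + ucb + beta * approx_err))

q = np.array([1.0, 0.9, 0.2])        # current Q estimates for one state
visits = np.array([100, 1, 100])     # action 1 is barely explored
err = np.zeros(3)                    # no reconstruction-error signal here
a = select_action(q, visits, err, t=200)
```

Even though action 1 has a slightly lower Q value, its large visit-count bonus makes it the selected action, which is the intended prioritization of high-uncertainty actions over wasteful random exploration.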

[268] Going Beyond Expert Performance via Deep Implicit Imitation Reinforcement Learning

Iason Chrysomallis, Georgios Chalkiadakis

Main category: cs.LG

TL;DR: Deep Implicit Imitation Q-Network (DIIQN) combines reinforcement learning with observation-only imitation learning, enabling agents to learn from suboptimal experts and surpass their performance, with extensions for heterogeneous action spaces.

Motivation: Traditional imitation learning requires complete state-action demonstrations from optimal experts, limiting practical applicability in real-world scenarios where only state observations are available and expert performance is often suboptimal.

Method: DIIQN uses action inference to reconstruct expert actions through online exploration and a dynamic confidence mechanism to balance expert-guided and self-directed learning. HA-DIIQN extension handles heterogeneous action sets with infeasibility detection and bridging procedures.

Result: DIIQN achieves up to 130% higher episodic returns than standard DQN and outperforms existing implicit imitation methods. HA-DIIQN learns up to 64% faster than baselines in heterogeneous action settings.

Conclusion: The framework enables practical imitation learning from observation-only datasets and suboptimal experts, with robust performance across varying conditions and the ability to handle heterogeneous action spaces.

Abstract: Imitation learning traditionally requires complete state-action demonstrations from optimal or near-optimal experts. These requirements severely limit practical applicability, as many real-world scenarios provide only state observations without corresponding actions and expert performance is often suboptimal. In this paper we introduce a deep implicit imitation reinforcement learning framework that addresses both limitations by combining deep reinforcement learning with implicit imitation learning from observation-only datasets. Our main algorithm, Deep Implicit Imitation Q-Network (DIIQN), employs an action inference mechanism that reconstructs expert actions through online exploration and integrates a dynamic confidence mechanism that adaptively balances expert-guided and self-directed learning. This enables the agent to leverage expert guidance for accelerated training while maintaining capacity to surpass suboptimal expert performance. We further extend our framework with a Heterogeneous Actions DIIQN (HA-DIIQN) algorithm to tackle scenarios where expert and agent possess different action sets, a challenge previously unaddressed in the implicit imitation learning literature. HA-DIIQN introduces an infeasibility detection mechanism and a bridging procedure identifying alternative pathways connecting agent capabilities to expert guidance when direct action replication is impossible. Our experimental results demonstrate that DIIQN achieves up to 130% higher episodic returns compared to standard DQN, while consistently outperforming existing implicit imitation methods that cannot exceed expert performance. In heterogeneous action settings, HA-DIIQN learns up to 64% faster than baselines, leveraging expert datasets unusable by conventional approaches. Extensive parameter sensitivity analysis reveals the framework’s robustness across varying dataset sizes and hyperparameter configurations.
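Action inference from observation-only data can be sketched in a toy deterministic environment: pick the action whose predicted successor state best matches the expert's observed transition. The grid dynamics and nearest-successor matching rule are illustrative assumptions, not DIIQN's exact mechanism:

```python
import numpy as np

# assumed toy grid dynamics: each action shifts the state by a fixed offset
ACTIONS = {0: np.array([0, 1]), 1: np.array([0, -1]),
           2: np.array([1, 0]), 3: np.array([-1, 0])}

def infer_action(s, s_next, model):
    """Reconstruct the expert's (unobserved) action as the one whose
    predicted successor is closest to the observed successor state."""
    dists = {a: np.linalg.norm(model(s, a) - s_next) for a in ACTIONS}
    return min(dists, key=dists.get)

model = lambda s, a: s + ACTIONS[a]   # agent's learned dynamics (here exact)
a = infer_action(np.array([2, 2]), np.array([2, 3]), model)
```

Given the expert transition (2, 2) to (2, 3), the inferred action is the "+y" move, which can then supervise Q-learning as if the action had been observed.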

[269] Towards Formalizing Reinforcement Learning Theory

Shangtong Zhang

Main category: cs.LG

TL;DR: Formal verification of Q-learning and linear TD learning convergence using Lean 4 theorem prover based on Robbins-Siegmund theorem.

Motivation: To formally verify the almost sure convergence of influential reinforcement learning algorithms (Q-learning and linear TD learning) using theorem proving, addressing both historical and contemporary research interest in their convergence properties.

Method: Used Lean 4 theorem prover with Mathlib library in a unified framework based on Robbins-Siegmund theorem to prove almost sure convergence with Markovian samples.

Result: Successfully formalized and verified the almost sure convergence of both Q-learning and linear TD learning algorithms.

Conclusion: This work represents an important step toward fully formalizing convergent reinforcement learning results, with a framework that can be extended to convergence rates and other convergence modes.

Abstract: In this paper, we formalize the almost sure convergence of $Q$-learning and linear temporal difference (TD) learning with Markovian samples using the Lean 4 theorem prover based on the Mathlib library. $Q$-learning and linear TD are among the earliest and most influential reinforcement learning (RL) algorithms. The investigation of their convergence properties is not only a major research topic during the early development of the RL field but also receives increasing attention nowadays. This paper formally verifies their almost sure convergence in a unified framework based on the Robbins-Siegmund theorem. The framework developed in this work can be easily extended to convergence rates and other modes of convergence. This work thus makes an important step towards fully formalizing convergent RL results. The code is available at https://github.com/ShangtongZhang/rl-theory-in-lean.
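The Robbins-Siegmund theorem that unifies both convergence proofs can be stated as follows (standard form; the Lean formalization may differ in presentation details):

```latex
% Robbins-Siegmund theorem (almost-supermartingale convergence)
\begin{theorem}[Robbins-Siegmund]
Let $(V_n)$, $(a_n)$, $(b_n)$, $(c_n)$ be nonnegative random sequences
adapted to a filtration $(\mathcal{F}_n)$ such that
\[
  \mathbb{E}[V_{n+1} \mid \mathcal{F}_n] \le (1 + a_n)\,V_n + b_n - c_n .
\]
If $\sum_n a_n < \infty$ and $\sum_n b_n < \infty$ almost surely, then
$V_n$ converges almost surely to a finite random variable and
$\sum_n c_n < \infty$ almost surely.
\end{theorem}
```

In such proofs, $V_n$ is typically a squared distance from the iterate to the fixed point, with $b_n$ absorbing noise and step-size terms and $c_n$ capturing the per-step contraction.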

[270] DQN Performance with Epsilon Greedy Policies and Prioritized Experience Replay

Daniel Perkins, Oscar J. Escobar, Luke Green

Main category: cs.LG

TL;DR: Study of DQN in finite environments focusing on epsilon-greedy exploration schedules and prioritized experience replay, showing how they affect learning efficiency and convergence.

Motivation: To understand the impact of different exploration strategies and memory management techniques on DQN performance in finite environments, particularly for resource-constrained settings.

Method: Systematic experimentation evaluating epsilon decay schedules and comparing uniform, no replay, and prioritized experience replay strategies across multiple simulations.

Result: Prioritized experience replay leads to faster convergence and higher returns. Variations in epsilon decay schedules significantly affect learning efficiency and reward optimization.

Conclusion: The study illuminates trade-offs between exploration strategies and memory management in DQN training, offering practical recommendations for robust reinforcement learning.

Abstract: We present a detailed study of Deep Q-Networks in finite environments, emphasizing the impact of epsilon-greedy exploration schedules and prioritized experience replay. Through systematic experimentation, we evaluate how variations in epsilon decay schedules affect learning efficiency, convergence behavior, and reward optimization. We investigate how prioritized experience replay leads to faster convergence and higher returns and show empirical results comparing uniform, no replay, and prioritized strategies across multiple simulations. Our findings illuminate the trade-offs and interactions between exploration strategies and memory management in DQN training, offering practical recommendations for robust reinforcement learning in resource-constrained settings.
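Two common epsilon decay schedules of the kind compared in such studies can be sketched directly (the specific constants are illustrative):

```python
def linear_eps(t, total, eps_hi=1.0, eps_lo=0.05):
    """Linear anneal from eps_hi to eps_lo over `total` steps, then flat."""
    frac = min(t / total, 1.0)
    return eps_hi + frac * (eps_lo - eps_hi)

def exp_eps(t, decay=0.995, eps_hi=1.0, eps_lo=0.05):
    """Multiplicative decay per step, floored at eps_lo."""
    return max(eps_lo, eps_hi * decay ** t)

e_lin = linear_eps(500, 1000)   # halfway through the linear anneal
e_exp = exp_eps(1000)           # exponential schedule has hit its floor
```

The shape of the schedule controls how long the agent keeps exploring: the exponential schedule front-loads exploration and reaches its floor quickly, while the linear one spends its budget evenly.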

[271] Financial Management System for SMEs: Real-World Deployment of Accounts Receivable and Cash Flow Prediction

Bartłomiej Małkus, Szymon Bobek, Grzegorz J. Nalepa

Main category: cs.LG

TL;DR: Development of an integrated financial prediction system for SMEs that combines accounts receivable prediction and cash flow forecasting, addressing the gap between enterprise tools and small business needs.

Motivation: SMEs face unique financial challenges due to limited resources, small customer bases, and constrained data availability, creating a need for specialized financial tools tailored to their operational constraints.

Method: Integrated system with two key components: binary classification model for predicting invoice payment delays, and multi-module cash flow forecasting model that handles incomplete and limited historical data.

Result: A prototype system has been implemented and deployed as a web application integrated into Cluee’s platform, demonstrating practical feasibility for real-world SME financial management.

Conclusion: The system successfully addresses the specific financial management needs of SMEs and freelancers, providing a viable solution that bridges the gap between enterprise-level tools and small business requirements.

Abstract: Small and Medium Enterprises (SMEs), particularly freelancers and early-stage businesses, face unique financial management challenges due to limited resources, small customer bases, and constrained data availability. This paper presents the development and deployment of an integrated financial prediction system that combines accounts receivable prediction and cash flow forecasting specifically designed for SME operational constraints. Our system addresses the gap between enterprise-focused financial tools and the practical needs of freelancers and small businesses. The solution integrates two key components: a binary classification model for predicting invoice payment delays, and a multi-module cash flow forecasting model that handles incomplete and limited historical data. A prototype system has been implemented and deployed as a web application with integration into Cluee’s platform, a startup providing financial management tools for freelancers, demonstrating practical feasibility for real-world SME financial management.

[272] nanoTabPFN: A Lightweight and Educational Reimplementation of TabPFN

Alexander Pfefferle, Johannes Hog, Lennart Purucker, Frank Hutter

Main category: cs.LG

TL;DR: nanoTabPFN is a simplified, lightweight implementation of TabPFN v2 that makes tabular foundation models accessible by reducing code complexity and computational requirements.

Motivation: Existing tabular foundation models are implemented in overly complex pipelines with poor documentation, making them hard to understand and adapt for students and researchers.

Method: Created a simplified implementation of TabPFN v2 architecture with a training loop using pre-generated training data, requiring only one minute of pre-training on a single GPU.

Result: Achieves comparable performance to traditional ML baselines in small data settings while being 160,000x faster than TabPFN v2 pre-training.

Conclusion: nanoTabPFN successfully democratizes tabular foundation models by eliminating large computational requirements and making them accessible for educational purposes.

Abstract: Tabular foundation models such as TabPFN have revolutionized predictive machine learning for tabular data. At the same time, the driving factors of this revolution are hard to understand. Existing open-source tabular foundation models are implemented in complicated pipelines spanning over 10,000 lines of code, with little architecture documentation and uneven code quality. In short, the implementations are hard to understand, not beginner-friendly, and complicated to adapt for new experiments. We introduce nanoTabPFN, a simplified and lightweight implementation of the TabPFN v2 architecture and a corresponding training loop that uses pre-generated training data. nanoTabPFN makes tabular foundation models more accessible to students and researchers alike. For example, restricted to a small data setting it achieves a performance comparable to traditional machine learning baselines within one minute of pre-training on a single GPU (160,000x faster than TabPFN v2 pre-training). Eliminating the requirement of large computational resources makes pre-training tabular foundation models accessible for educational purposes. Our code is available at https://github.com/automl/nanoTabPFN.

[273] Structured Matrix Scaling for Multi-Class Calibration

Eugène Berta, David Holzmüller, Michael I. Jordan, Francis Bach

Main category: cs.LG

TL;DR: The paper presents improved post-hoc recalibration methods for classifiers using more expressive parametric functions beyond standard temperature scaling, with structured regularization to handle overfitting in multi-class settings.

Motivation: To provide faithful probability estimates from classifiers using theoretically motivated parametric recalibration functions that go beyond standard temperature scaling, while addressing the challenge of overfitting in multi-class calibration with limited data.

Method: Uses parametric recalibration functions based on logistic regression, with structured regularization, robust preprocessing, and efficient optimization to manage the bias-variance tradeoff in multi-class settings.

Result: The methods lead to substantial gains over existing logistic-based calibration techniques and provide efficient open-source implementations as alternatives to common temperature, vector, and matrix scaling.

Conclusion: More expressive calibration methods with proper regularization can effectively improve classifier calibration for both binary and multi-class classification, offering better alternatives to standard approaches.

Abstract: Post-hoc recalibration methods are widely used to ensure that classifiers provide faithful probability estimates. We argue that parametric recalibration functions based on logistic regression can be motivated from a simple theoretical setting for both binary and multiclass classification. This insight motivates the use of more expressive calibration methods beyond standard temperature scaling. For multi-class calibration however, a key challenge lies in the increasing number of parameters introduced by more complex models, often coupled with limited calibration data, which can lead to overfitting. Through extensive experiments, we demonstrate that the resulting bias-variance tradeoff can be effectively managed by structured regularization, robust preprocessing and efficient optimization. The resulting methods lead to substantial gains over existing logistic-based calibration techniques. We provide efficient and easy-to-use open-source implementations of our methods, making them an attractive alternative to common temperature, vector, and matrix scaling implementations.
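The three nested parametric families can be sketched as maps on the logits; a regularizer pulling W toward the identity (and b toward zero) is the kind of structured penalty that keeps matrix scaling from overfitting. This is a sketch only; the paper's exact parameterization and regularizer are not reproduced here:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Three nested families of logit maps, in increasing expressiveness
# (k = number of classes):
def temperature_scale(logits, T):    # 1 parameter
    return logits / T

def vector_scale(logits, w, b):      # 2k parameters
    return logits * w + b

def matrix_scale(logits, W, b):      # k^2 + k parameters; regularize
    return logits @ W.T + b          # e.g. with a ||W - I||^2 penalty

logits = np.array([[2.0, 0.5, -1.0]])
probs = softmax(temperature_scale(logits, T=2.0))
```

Temperature scaling preserves the argmax and only softens confidence, whereas vector and matrix scaling can reorder classes, which is why the latter need regularization when calibration data is scarce.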

[274] SHIELD: Securing Healthcare IoT with Efficient Machine Learning Techniques for Anomaly Detection

Mahek Desai, Apoorva Rumale, Marjan Asadinia

Main category: cs.LG

TL;DR: A machine learning framework for detecting cyberattacks and device anomalies in IoT healthcare systems, evaluating 8 models across supervised, semi-supervised, and unsupervised learning approaches.

Motivation: IoT devices in healthcare face significant security and reliability challenges, increasing vulnerability to cyber threats and operational anomalies that could compromise patient safety and data security.

Method: Evaluated 8 ML models across three learning approaches: supervised (XGBoost, K-NN), semi-supervised (GAN, VAE), and unsupervised (One-Class SVM, Isolation Forest, GNN, LSTM Autoencoders) using 200,000 records dataset with multiple performance metrics.

Result: XGBoost achieved 99% accuracy with minimal computational overhead (0.04s) for anomaly detection. KNN achieved near-perfect performance for attack detection with lowest computational cost (0.05s). LSTM Autoencoders underperformed, while GAN showed highest computational cost with lowest accuracy.

Conclusion: The framework enhances IoT healthcare security through effective anomaly detection, enabling early detection of cyber threats and device failures to prevent data breaches, minimize downtime, and ensure safe operation of medical devices.

Abstract: The integration of IoT devices in healthcare introduces significant security and reliability challenges, increasing susceptibility to cyber threats and operational anomalies. This study proposes a machine learning-driven framework for (1) detecting malicious cyberattacks and (2) identifying faulty device anomalies, leveraging a dataset of 200,000 records. Eight machine learning models are evaluated across three learning approaches: supervised learning (XGBoost, K-Nearest Neighbors (K-NN)), semi-supervised learning (Generative Adversarial Networks (GAN), Variational Autoencoders (VAE)), and unsupervised learning (One-Class Support Vector Machine (SVM), Isolation Forest, Graph Neural Networks (GNN), and Long Short-Term Memory (LSTM) Autoencoders). The comprehensive evaluation was conducted across multiple metrics: F1-score, precision, recall, accuracy, ROC-AUC, and computational efficiency. XGBoost achieved 99% accuracy with minimal computational overhead (0.04s) for anomaly detection, while Isolation Forest balanced precision and recall effectively. LSTM Autoencoders underperformed with lower accuracy and higher latency. For attack detection, K-NN achieved near-perfect precision, recall, and F1-score with the lowest computational cost (0.05s), followed by VAE at 97% accuracy. GAN showed the highest computational cost with lowest accuracy and ROC-AUC. These findings enhance IoT-enabled healthcare security through effective anomaly detection strategies. By improving early detection of cyber threats and device failures, this framework has the potential to prevent data breaches, minimize system downtime, and ensure the continuous and safe operation of medical devices, ultimately safeguarding patient health and trust in IoT-driven healthcare solutions.

[275] Behavior-Adaptive Q-Learning: A Unifying Framework for Offline-to-Online RL

Lipeng Zu, Hansong Zhou, Xiaonan Zhang

Main category: cs.LG

TL;DR: BAQ enables smooth offline-to-online RL transition by using an implicit behavioral model to provide behavior-consistency signals during online fine-tuning, reducing distributional shift issues.

Motivation: Offline RL policies struggle when deployed in dynamic environments due to distributional shift and unreliable value estimates on unseen state-action pairs.

Method: Behavior-Adaptive Q-Learning (BAQ) uses a dual-objective loss that aligns online policy toward offline behavior when uncertainty is high, and gradually relaxes this constraint as confident online experience accumulates.

Result: BAQ consistently outperforms prior offline-to-online RL approaches across standard benchmarks, achieving faster recovery, improved robustness, and higher overall performance.

Conclusion: Implicit behavior adaptation is a principled and practical solution for reliable real-world policy deployment in reinforcement learning.

Abstract: Offline reinforcement learning (RL) enables training from fixed data without online interaction, but policies learned offline often struggle when deployed in dynamic environments due to distributional shift and unreliable value estimates on unseen state-action pairs. We introduce Behavior-Adaptive Q-Learning (BAQ), a framework designed to enable a smooth and reliable transition from offline to online RL. The key idea is to leverage an implicit behavioral model derived from offline data to provide a behavior-consistency signal during online fine-tuning. BAQ incorporates a dual-objective loss that (i) aligns the online policy toward the offline behavior when uncertainty is high, and (ii) gradually relaxes this constraint as more confident online experience is accumulated. This adaptive mechanism reduces error propagation from out-of-distribution estimates, stabilizes early online updates, and accelerates adaptation to new scenarios. Across standard benchmarks, BAQ consistently outperforms prior offline-to-online RL approaches, achieving faster recovery, improved robustness, and higher overall performance. Our results demonstrate that implicit behavior adaptation is a principled and practical solution for reliable real-world policy deployment.
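The dual-objective loss can be sketched with an assumed decay schedule for the consistency weight; BAQ's actual uncertainty-driven weighting is richer than this illustration:

```python
def baq_loss(td_error, expert_action, agent_action, n_online, k=0.01):
    """Dual objective: squared TD error plus a behavior-consistency
    penalty whose weight lam decays as confident online experience
    accumulates (the 1/(1 + k*n) schedule is an assumed stand-in for
    BAQ's uncertainty-based weighting)."""
    lam = 1.0 / (1.0 + k * n_online)
    consistency = float(agent_action != expert_action)
    return td_error ** 2 + lam * consistency, lam

loss_early, lam_early = baq_loss(0.5, expert_action=1, agent_action=0, n_online=0)
loss_late, lam_late = baq_loss(0.5, expert_action=1, agent_action=0, n_online=10_000)
```

Early in fine-tuning the consistency term dominates and anchors the policy to the offline behavior; as online experience accumulates the constraint relaxes, letting the agent depart from (and eventually surpass) the behavior policy.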

[276] AnaFlow: Agentic LLM-based Workflow for Reasoning-Driven Explainable and Sample-Efficient Analog Circuit Sizing

Mohsen Ahmadzadeh, Kaichang Chen, Georges Gielen

Main category: cs.LG

TL;DR: AnaFlow is an agentic AI framework that uses LLM-based agents to automate analog circuit sizing with high sample efficiency and explainable reasoning, addressing bottlenecks in traditional AI methods.

Motivation: Analog circuit design is manual and error-prone, while current AI methods suffer from simulation bottlenecks and lack explainability, hindering adoption.

Method: Multi-agent workflow with specialized LLM agents that interpret circuit topology, understand design goals, and iteratively refine parameters with human-interpretable reasoning using adaptive simulation strategy.

Result: Successfully demonstrated for two circuits of varying complexity, achieving fully automatic sizing with learning from optimization history to avoid mistakes and accelerate convergence.

Conclusion: AnaFlow represents a new paradigm in analog EDA where AI agents serve as transparent design assistants, offering powerful design space exploration with inherent explainability.

Abstract: Analog/mixed-signal circuits are key for interfacing electronics with the physical world. Their design, however, remains a largely handcrafted process, resulting in long and error-prone design cycles. While the recent rise of AI-based reinforcement learning and generative AI has created new techniques to automate this task, the need for many time-consuming simulations is a critical bottleneck hindering the overall efficiency. Furthermore, the lack of explainability of the resulting design solutions hampers widespread adoption of the tools. To address these issues, a novel agentic AI framework for sample-efficient and explainable analog circuit sizing is presented. It employs a multi-agent workflow where specialized Large Language Model (LLM)-based agents collaborate to interpret the circuit topology, to understand the design goals, and to iteratively refine the circuit’s design parameters towards the target goals with human-interpretable reasoning. The adaptive simulation strategy creates an intelligent control that yields a high sample efficiency. The AnaFlow framework is demonstrated for two circuits of varying complexity and is able to complete the sizing task fully automatically, unlike pure Bayesian optimization and reinforcement learning approaches. The system learns from its optimization history to avoid past mistakes and to accelerate convergence. The inherent explainability makes this a powerful tool for analog design space exploration and a new paradigm in analog EDA, where AI agents serve as transparent design assistants.

[277] Shrinking the Variance: Shrinkage Baselines for Reinforcement Learning with Verifiable Rewards

Guanning Zeng, Zhaoyi Zhou, Daman Arora, Andrea Zanette

Main category: cs.LG

TL;DR: Proposes shrinkage estimators for reward baselines in RLVR that combine per-prompt and across-prompt means to reduce variance in policy-gradient methods.

DetailsMotivation: Standard per-prompt empirical mean baselines in RLVR can be suboptimal, especially in low-generation regimes, leading to high variance in policy-gradient estimators.

Method: Uses shrinkage estimators inspired by Stein’s paradox to combine per-prompt and across-prompt means for more accurate baseline estimation, serving as a drop-in replacement without additional hyperparameters.

Result: The shrinkage baseline provably yields lower-variance policy-gradient estimators and empirically outperforms standard baselines, improving training stability.

Conclusion: Shrinkage-based baselines provide a simple yet effective improvement for RLVR training by reducing variance in gradient updates through better mean estimation.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for post-training large reasoning models (LRMs) using policy-gradient methods such as GRPO. To stabilize training, these methods typically center trajectory rewards by subtracting the empirical mean for each prompt. Statistically, this centering acts as a control variate (or baseline), reducing the variance of the policy-gradient estimator. Typically, the mean reward is estimated using per-prompt empirical averages for each prompt in a batch. Drawing inspiration from Stein’s paradox, we propose using shrinkage estimators that combine per-prompt and across-prompt means to improve the overall per-prompt mean estimation accuracy – particularly in the low-generation regime typical of RLVR. Theoretically, we construct a shrinkage-based baseline that provably yields lower-variance policy-gradient estimators across algorithms. Our proposed baseline serves as a drop-in replacement for existing per-prompt mean baselines, requiring no additional hyper-parameters or computation. Empirically, shrinkage baselines consistently outperform standard empirical-mean baselines, leading to lower-variance gradient updates and improved training stability.
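The shrinkage idea can be sketched in a few lines: pull each noisy per-prompt mean reward toward the across-prompt mean. Note that `lam` below is an illustrative fixed weight; the paper derives the shrinkage coefficient, which is not reproduced here.

```python
from statistics import mean

def shrinkage_baselines(rewards_per_prompt, lam=0.5):
    """Sketch of a shrinkage baseline: interpolate each per-prompt mean
    reward with the across-prompt (global) mean. With few generations per
    prompt, the per-prompt mean is noisy, and shrinking toward the batch
    mean can reduce estimation error (Stein-style shrinkage)."""
    global_mean = mean(r for rs in rewards_per_prompt for r in rs)
    return [
        (1 - lam) * mean(rs) + lam * global_mean
        for rs in rewards_per_prompt
    ]

# Rewards for three prompts, two generations each (the low-generation regime).
prompts = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
print(shrinkage_baselines(prompts, lam=0.5))  # → [0.5, 0.75, 0.25]
```

The resulting values drop in wherever the per-prompt empirical mean baseline would be subtracted from trajectory rewards.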

[278] Emotion Detection From Social Media Posts

Md Mahbubur Rahman, Shaila Sharmin

Main category: cs.LG

TL;DR: This paper compares traditional machine learning and deep learning models for emotion classification from Twitter data, finding that BiGRU and ensemble models achieve the highest accuracy (87.53-87.66%) for four emotion categories.

DetailsMotivation: Social media has become a platform for expressing personal views and emotions, creating a need for automated emotion detection from text data to develop decision-making tools that visualize emotional fluctuations.

Method: Used traditional ML (SVM, Naive Bayes, Decision Trees, Random Forest) and deep neural networks (LSTM, CNN, GRU, BiLSTM, BiGRU) including an ensemble model combining BiLSTM and BiGRU to classify tweets into Fear, Anger, Joy, and Sadness categories.

Result: Deep neural network models, particularly BiGRU, achieved the best performance with 87.53% accuracy. The ensemble model performed slightly better at 87.66%, though the improvement was not significant.

Conclusion: Deep learning approaches outperform traditional machine learning for emotion classification from social media text, with BiGRU and ensemble models providing the most promising results for developing emotion visualization tools.

Abstract: Over the last few years, social media has evolved into a medium for expressing personal views, emotions, and even business and political proposals, recommendations, and advertisements. We address the topic of identifying emotions from text data obtained from social media posts like Twitter in this research. We have deployed different traditional machine learning techniques such as Support Vector Machines (SVM), Naive Bayes, Decision Trees, and Random Forest, as well as deep neural network models such as LSTM, CNN, GRU, BiLSTM, BiGRU to classify these tweets into four emotion categories (Fear, Anger, Joy, and Sadness). Furthermore, we have constructed a BiLSTM and BiGRU ensemble model. The evaluation result shows that the deep neural network models (BiGRU, to be specific) produce the most promising results compared to traditional machine learning models, with an 87.53% accuracy rate. The ensemble model performs even better (87.66%), albeit the difference is not significant. This result will aid in the development of a decision-making tool that visualizes emotional fluctuations.

[279] Matryoshka Pilot: Learning to Drive Black-Box LLMs with LLMs

Changhao Li, Yuchen Zhuang, Rushi Qiang, Haotian Sun, Hanjun Dai, Chao Zhang, Bo Dai

Main category: cs.LG

TL;DR: M-Pilot is a lightweight white-box LLM controller that guides black-box LLMs by decomposing complex tasks into intermediate steps, enabling controllable multi-turn generation and self-improvement.

DetailsMotivation: Black-box LLMs lack transparency, hindering advancements in reasoning, planning, and personalization. Domain-specific adaptation requires training on model parameters, which is infeasible for black-box LLMs.

Method: Treat black-box LLM as environment, M-Pilot as policy that provides intermediate guidance through prompts. M-Pilot is trained to align black-box LLM outputs with preferences during iterative interaction.

Result: Empirical evaluations show M-Pilot effectively enhances black-box LLM capabilities in complex, long-horizon tasks.

Conclusion: M-Pilot enables controllable generation and self-improvement for black-box LLMs without requiring access to their internal parameters.

Abstract: Despite the impressive generative abilities of black-box large language models (LLMs), their inherent opacity hinders further advancements in capabilities such as reasoning, planning, and personalization. Existing works aim to enhance LLM capabilities via domain-specific adaptation, which requires additional training on accessible model parameters, an infeasible option for black-box LLMs. To address this challenge, we introduce Matryoshka Pilot (M-Pilot), a lightweight white-box LLM controller that guides a large-scale black-box LLM generator by decomposing complex tasks into a series of intermediate outputs. Specifically, we consider the black-box LLM as an environment, with M-Pilot serving as a policy to provide intermediate guidance through prompts for driving the black-box LLM. M-Pilot is trained to pivot the outputs of the black-box LLM into alignment with preferences during iterative interaction, which enables controllable multi-turn generation and self-improvement in optimizing intermediate guidance. Empirical evaluations on diverse tasks demonstrate that our method effectively enhances the capabilities of black-box LLMs in complex, long-horizon tasks. Our code is publicly available at: https://github.com/lichangh20/Matryoshka.

[280] REFA: Reference Free Alignment for multi-preference optimization

Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, Saravan Rajmohan

Main category: cs.LG

TL;DR: REFA addresses the URSLA shortcut in length-normalized preference optimization by introducing probabilistic control on the EOS token to prevent premature truncation and enable genuine quality improvements.

DetailsMotivation: To solve the URSLA shortcut problem where models learn to prematurely truncate low-quality responses rather than learning semantic content, which is a failure mode introduced by length normalization in modern preference optimization methods.

Method: Introduces REFA framework with regularizers that operate directly on the probability of the End-of-Sequence (EOS) token, providing token-level control over termination and managing the alignment-efficiency tradeoff.

Result: REFA achieves 60.29% win rate and 52.17% length-controlled win rate on AlpacaEval2 with Llama-3-8B-Instruct, demonstrating effective token-level control.

Conclusion: The EOS token-level intervention provides a principled solution to the URSLA shortcut and enables versatile management of alignment-efficiency tradeoffs while ensuring genuine quality improvements.

Abstract: To mitigate reward hacking from response verbosity, modern preference optimization methods are increasingly adopting length normalization (e.g., SimPO, ORPO, LN-DPO). While effective against this bias, we demonstrate that length normalization itself introduces a failure mode: the URSLA shortcut. Here models learn to satisfy the alignment objective by prematurely truncating low-quality responses rather than learning from their semantic content. To address this, we introduce REFA, a new alignment framework that proposes probabilistic control on a structural token that controls termination. Our core innovation is a new class of regularizers that operate directly on the probability of the End-of-Sequence (EOS) token, a previously unexploited control lever. This token-level intervention provides a principled solution to the URSLA shortcut, ensuring genuine quality improvements. Furthermore, it unlocks a versatile mechanism for managing the alignment-efficiency tradeoff, enabling practitioners to fine-tune models that adhere to specific token budgets. Empirically, REFA achieves a 60.29% win rate and a 52.17% length-controlled win rate on AlpacaEval2 with Llama-3-8B-Instruct, demonstrating the power of our token-level control paradigm.

[281] Dense SAE Latents Are Features, Not Bugs

Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, Max Tegmark

Main category: cs.LG

TL;DR: Dense latents in sparse autoencoders are not training artifacts but meaningful representations that serve functional roles in language models, including position tracking, context binding, and output signals.

DetailsMotivation: To investigate whether dense latents in sparse autoencoders are undesirable training artifacts or meaningful representations that serve functional purposes in language models.

Method: Systematic investigation of dense latents’ geometry, function, and origin through analysis of antipodal pairs, ablation studies, taxonomy development, and layer-wise evolution analysis.

Result: Dense latents form antipodal pairs that reconstruct specific residual stream directions, serve functional roles (position tracking, context binding, entropy regulation, etc.), and evolve from structural to semantic to output-oriented features across layers.

Conclusion: Dense latents are persistent, intrinsic properties of the residual space that serve meaningful computational functions in language models and should not be dismissed as training noise.

Abstract: Sparse autoencoders (SAEs) are designed to extract interpretable features from language models by enforcing a sparsity constraint. Ideally, training an SAE would yield latents that are both sparse and semantically meaningful. However, many SAE latents activate frequently (i.e., are \emph{dense}), raising concerns that they may be undesirable artifacts of the training procedure. In this work, we systematically investigate the geometry, function, and origin of dense latents and show that they are not only persistent but often reflect meaningful model representations. We first demonstrate that dense latents tend to form antipodal pairs that reconstruct specific directions in the residual stream, and that ablating their subspace suppresses the emergence of new dense features in retrained SAEs – suggesting that high density features are an intrinsic property of the residual space. We then introduce a taxonomy of dense latents, identifying classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction. Finally, we analyze how these features evolve across layers, revealing a shift from structural features in early layers, to semantic features in mid layers, and finally to output-oriented signals in the last layers of the model. Our findings indicate that dense latents serve functional roles in language model computation and should not be dismissed as training noise.

[282] Activation Transport Operators

Andrzej Szablewski, Marek Masiak

Main category: cs.LG

TL;DR: ATO (Activation Transport Operators) are linear maps that track feature flow through transformer residual streams, distinguishing between linearly transported vs synthesized features, with applications for safety and debugging.

DetailsMotivation: Understanding how features flow through residual streams can improve jailbreaking protections, enable early detection of model mistakes, and their correction.

Method: Propose Activation Transport Operators (ATO): linear maps from upstream to downstream residuals evaluated using downstream SAE decoder projections. Develop transport efficiency metric and estimate residual stream subspace for linear transport.

Result: Empirically demonstrate that ATO can determine whether features are linearly transported or synthesized, report transport efficiency and the size of residual stream subspace involved in linear transport.

Conclusion: ATO provides a compute-light method (<50 GPU-h) for practical safety tools, debugging, and understanding where LLM computation behaves linearly.

Abstract: The residual stream mediates communication between transformer decoder layers via linear reads and writes of non-linear computations. While sparse-dictionary learning-based methods locate features in the residual stream, and activation patching methods discover circuits within the model, the mechanism by which features flow through the residual stream remains understudied. Understanding this dynamic can better inform jailbreaking protections, enable early detection of model mistakes, and their correction. In this work, we propose Activation Transport Operators (ATO), linear maps from upstream to downstream residuals $k$ layers later, evaluated in feature space using downstream SAE decoder projections. We empirically demonstrate that these operators can determine whether a feature has been linearly transported from a previous layer or synthesised from non-linear layer computation. We develop the notion of transport efficiency, for which we provide an upper bound, and use it to estimate the size of the residual stream subspace that corresponds to linear transport. We empirically demonstrate the linear transport, report transport efficiency and the size of the residual stream’s subspace involved in linear transport. This compute-light (no finetuning, <50 GPU-h) method offers practical tools for safety, debugging, and a clearer picture of where computation in LLMs behaves linearly.
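A one-dimensional toy version of the transported-vs-synthesized test can illustrate the idea. The real method fits a linear map over the full residual stream and evaluates it through SAE decoder projections; this sketch only fits a scalar map for a single feature direction.

```python
def transport_fit(upstream, downstream):
    """Toy sketch of the ATO intuition: fit the best scalar linear map `a`
    from an upstream feature activation to its downstream counterpart, and
    report the fraction of downstream variance it explains (R^2). A value
    near 1 suggests the feature was linearly transported; a low or negative
    value suggests it was synthesized by intervening non-linear computation."""
    n = len(upstream)
    a = sum(x * y for x, y in zip(upstream, downstream)) / sum(x * x for x in upstream)
    mean_y = sum(downstream) / n
    ss_res = sum((y - a * x) ** 2 for x, y in zip(upstream, downstream))
    ss_tot = sum((y - mean_y) ** 2 for y in downstream)
    return a, 1 - ss_res / ss_tot

# A downstream activation that is a rescaled copy of the upstream one is
# explained perfectly by the fitted linear map (R^2 = 1).
a, r2 = transport_fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```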

[283] GDS Agent for Graph Algorithmic Reasoning

Borun Shi, Ioannis Panagiotas

Main category: cs.LG

TL;DR: GDS agent introduces graph algorithms as tools for LLMs to process and reason over large-scale graph-structured data through a model context protocol server.

DetailsMotivation: LLMs struggle with processing and reasoning over large-scale graph-structure data despite their multimodal capabilities and tool integration.

Method: Comprehensive set of graph algorithms as tools with preprocessing (retrieval) and postprocessing, implemented in a model context protocol (MCP) server compatible with any modern LLM.

Result: GDS agent solves a wide spectrum of graph tasks, with benchmarks evaluating both intermediate tool calls and final responses, plus detailed case studies.

Conclusion: The approach enables accurate and grounded answers for graph algorithmic reasoning, though some challenges remain requiring future roadmap development.

Abstract: Large language models (LLMs) have shown remarkable multimodal information processing and reasoning ability. When equipped with tools through function calling and enhanced with retrieval-augmented techniques, compound LLM-based systems can access closed data sources and answer questions about them. However, they still struggle to process and reason over large-scale graph-structure data. We introduce the GDS (Graph Data Science) agent in this technical report. The GDS agent introduces a comprehensive set of graph algorithms as tools, together with preprocessing (retrieval) and postprocessing of algorithm results, in a model context protocol (MCP) server. The server can be used with any modern LLM out-of-the-box. GDS agent allows users to ask any question that implicitly and intrinsically requires graph algorithmic reasoning about their data, and quickly obtain accurate and grounded answers. We introduce new benchmarks that evaluate intermediate tool calls as well as final responses. The results indicate that GDS agent is able to solve a wide spectrum of graph tasks. We also provide detailed case studies for more open-ended tasks and study scenarios where the agent struggles. Finally, we discuss the remaining challenges and the future roadmap.

[284] Training Optimal Large Diffusion Language Models

Jinjie Ni, Qian Liu, Chao Du, Longxu Dou, Hang Yan, Zili Wang, Tianyu Pang, Michael Qizhe Shieh

Main category: cs.LG

TL;DR: Quokka introduces the first systematic scaling law for diffusion language models (DLMs), covering both compute-constrained and data-constrained regimes while studying key modeling and optimization designs.

DetailsMotivation: To establish scaling laws for diffusion language models similar to what Chinchilla did for language models, providing practical guidance for DLM training and broader AI community inspiration.

Method: Developed systematic scaling laws for diffusion language models that encompass both compute-constrained and data-constrained training regimes, analyzing key modeling and optimization design choices.

Result: Created Quokka as the first comprehensive scaling framework for DLMs, positioned as a companion to Chinchilla but with wider scope and applicability to diffusion models.

Conclusion: Quokka provides immediate practical guidance for training diffusion language models and offers long-term inspiration for the broader AI community by establishing foundational scaling principles for this model class.

Abstract: We introduce Quokka, the first systematic scaling law for diffusion language models (DLMs), encompassing both compute-constrained and data-constrained regimes, and studying the key modeling and optimization designs. Quokka is a good friend of Chinchilla and provides wider scopes. We hope the results would bring short-term practical guidance in DLMs training and long-term inspirations for the whole AI community.

[285] The Mirror Loop: Recursive Non-Convergence in Generative Reasoning Systems

Bentley DeVilling

Main category: cs.LG

TL;DR: Recursive self-evaluation in large language models leads to reformulation rather than progress without external feedback. A minimal grounding intervention (single verification step) significantly improves informational change and prevents epistemic stasis.

DetailsMotivation: To test whether large language models can achieve genuine reflective reasoning through recursive self-evaluation without external feedback, and to examine if minimal grounding interventions can overcome informational closure.

Method: Cross-provider study with 144 reasoning sequences across three models (GPT-4o-mini, Claude 3 Haiku, Gemini 2.0 Flash) and four task families (arithmetic, code, explanation, reflection), each iterated ten times under ungrounded self-critique and minimal grounding intervention conditions.

Result: Ungrounded runs showed 55% decline in informational change from early to late iterations. Grounded runs showed +28% rebound after intervention and sustained non-zero variance. Multiple measures converged on the same pattern: reflection without contact leads to informational closure.

Conclusion: Recursive self-correction in generative reasoning has structural limits without external information exchange. Minimal grounding functions as dissipative coupling to reintroduce informational flux, suggesting reflection is performative rather than epistemic without grounding.

Abstract: Large language models are often described as capable of reflective reasoning, yet recursive self-evaluation without external feedback frequently yields reformulation rather than progress. We test this prediction in a cross-provider study of 144 reasoning sequences across three models (OpenAI GPT-4o-mini, Anthropic Claude 3 Haiku, and Google Gemini 2.0 Flash) and four task families (arithmetic, code, explanation, reflection), each iterated ten times under two conditions: ungrounded self-critique and a minimal grounding intervention (a single verification step at iteration three). Mean informational change (delta I, measured via normalized edit distance) declined by 55% from early (0.193) to late (0.087) iterations in ungrounded runs, with consistent patterns across all three providers. Grounded runs showed a +28% rebound in informational change immediately after the intervention and sustained non-zero variance thereafter. Complementary measures (n-gram novelty, embedding drift, and character-level entropy) converged on the same pattern: reflection without contact tends toward informational closure. We interpret this as evidence for a structural limit on self-correction in generative reasoning: without an exchange of information with an independent verifier or environment, recursive inference approaches an attractor state of epistemic stasis. Minimal grounding functions as dissipative coupling, reintroducing informational flux. The cross-architecture consistency suggests the mirror loop arises from shared autoregressive training objectives rather than provider-specific alignment schemes. The results delineate when reflection is performative rather than epistemic and motivate design principles for grounded, cooperative reasoning. Materials and code are publicly available.
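The informational-change measure can be sketched as edit distance between successive outputs, normalized by the longer string. The paper's exact normalization is assumed here, not quoted.

```python
def delta_i(prev, curr):
    """Sketch of delta I: Levenshtein edit distance between two successive
    iteration outputs, normalized by the length of the longer string, so
    the result lies in [0, 1]. A value near 0 means the new output is
    essentially a restatement of the previous one."""
    m, n = len(prev), len(curr)
    if max(m, n) == 0:
        return 0.0
    # Single-row dynamic-programming Levenshtein distance.
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev_diag, row[0] = row[0], i
        for j in range(1, n + 1):
            cost = 0 if prev[i - 1] == curr[j - 1] else 1
            prev_diag, row[j] = row[j], min(
                row[j] + 1,        # deletion
                row[j - 1] + 1,    # insertion
                prev_diag + cost,  # substitution (free on match)
            )
    return row[n] / max(m, n)
```

Applied to an iteration trace, a declining `delta_i` sequence is the "informational closure" pattern the study reports.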

[286] A Survey of Graph Neural Networks in Real world: Imbalance, Noise, Privacy and OOD Challenges

Wei Ju, Siyu Yi, Yifan Wang, Zhiping Xiao, Zhengyang Mao, Hourun Li, Yiyang Gu, Yifang Qin, Nan Yin, Senzhang Wang, Xinwang Liu, Philip S. Yu, Ming Zhang

Main category: cs.LG

TL;DR: A comprehensive survey reviewing Graph Neural Networks (GNNs) that addresses four key real-world challenges: data imbalance, noise, privacy protection, and out-of-distribution generalization, providing systematic analysis and future directions.

DetailsMotivation: Real-world GNN applications face performance degradation due to non-ideal training environments including data imbalance, noise, privacy constraints, and OOD scenarios, requiring specialized solutions beyond standard GNN models.

Method: Systematic survey approach analyzing existing GNN models through four key dimensions: solutions for data imbalance, noise handling, privacy protection mechanisms, and OOD generalization techniques.

Result: Comprehensive categorization and analysis of GNN approaches addressing real-world reliability challenges, providing detailed discussions on how different solutions enhance model robustness and practical applicability.

Conclusion: The survey identifies promising research directions for developing more reliable and robust GNN models that can effectively handle real-world challenges, emphasizing the need for continued innovation in this rapidly evolving field.

Abstract: Graph-structured data exhibits universality and widespread applicability across diverse domains, such as social network analysis, biochemistry, financial fraud detection, and network security. Significant strides have been made in leveraging Graph Neural Networks (GNNs) to achieve remarkable success in these areas. However, in real-world scenarios, the training environment for models is often far from ideal, leading to substantial performance degradation of GNN models due to various unfavorable factors, including imbalance in data distribution, the presence of noise in erroneous data, privacy protection of sensitive information, and generalization capability for out-of-distribution (OOD) scenarios. To tackle these issues, substantial efforts have been devoted to improving the performance of GNN models in practical real-world scenarios, as well as enhancing their reliability and robustness. In this paper, we present a comprehensive survey that systematically reviews existing GNN models, focusing on solutions to the four mentioned real-world challenges including imbalance, noise, privacy, and OOD in practical scenarios that many existing reviews have not considered. Specifically, we first highlight the four key challenges faced by existing GNNs, paving the way for our exploration of real-world GNN models. Subsequently, we provide detailed discussions on these four aspects, dissecting how these solutions contribute to enhancing the reliability and robustness of GNN models. Last but not least, we outline promising directions and offer future perspectives in the field.

[287] A Reliable Cryptographic Framework for Empirical Machine Unlearning Evaluation

Yiwen Tu, Pingbang Hu, Jiaqi Ma

Main category: cs.LG

TL;DR: This paper proposes a novel evaluation framework for machine unlearning algorithms using cryptographic games and membership inference attacks, addressing reliability issues in existing metrics.

DetailsMotivation: Current evaluation methods for machine unlearning algorithms lack theoretical guarantees and reliability, making it difficult to properly assess their effectiveness in removing specific training data as required by data protection regulations.

Method: The authors model the evaluation process as a cryptographic game between unlearning algorithms and membership inference attack adversaries, creating a theoretically sound evaluation metric with provable guarantees.

Result: The proposed evaluation metric effectively measures data removal efficacy and demonstrates reliability through both theoretical analysis and empirical experiments, outperforming existing evaluation approaches.

Conclusion: This work provides a novel and reliable framework for evaluating machine unlearning algorithms, which will support the development of more effective unlearning techniques that comply with data protection requirements.

Abstract: Machine unlearning updates machine learning models to remove information from specific training samples, complying with data protection regulations that allow individuals to request the removal of their personal data. Despite the recent development of numerous unlearning algorithms, reliable evaluation of these algorithms remains an open research question. In this work, we focus on membership inference attack (MIA) based evaluation, one of the most common approaches for evaluating unlearning algorithms, and address various pitfalls of existing evaluation metrics lacking theoretical understanding and reliability. Specifically, by modeling the proposed evaluation process as a \emph{cryptographic game} between unlearning algorithms and MIA adversaries, the naturally induced evaluation metric measures the data removal efficacy of unlearning algorithms and enjoys provable guarantees that existing evaluation metrics fail to satisfy. Furthermore, we propose a practical and efficient approximation of the induced evaluation metric and demonstrate its effectiveness through both theoretical analysis and empirical experiments. Overall, this work presents a novel and reliable approach to empirically evaluating unlearning algorithms, paving the way for the development of more effective unlearning techniques.
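The game-based metric can be illustrated with a toy simulation. All names below, and the threshold attacker, are illustrative rather than the paper's construction: a challenger secretly draws an attack score either from unlearned samples or from never-seen samples, and the MIA adversary guesses which.

```python
import random

def mia_advantage(attacker, unlearned_scores, heldout_scores, trials=1000, seed=0):
    """Toy sketch of a game-based unlearning metric: the challenger flips a
    secret bit, reveals a score from the corresponding pool, and the
    attacker guesses the bit. The advantage 2*Pr[correct] - 1 lies in
    [-1, 1]; perfect unlearning makes the two pools indistinguishable,
    driving the advantage toward 0."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        b = rng.randrange(2)  # challenger's secret bit
        pool = unlearned_scores if b == 1 else heldout_scores
        if attacker(rng.choice(pool)) == b:
            correct += 1
    return 2 * correct / trials - 1

# A threshold attacker fully separates a leaky unlearning run (advantage 1.0)
# but gains nothing when the score distributions match (advantage near 0).
leaky = mia_advantage(lambda s: int(s > 0.5), [0.9, 0.8, 0.85], [0.1, 0.2, 0.15])
clean = mia_advantage(lambda s: int(s > 0.5), [0.1, 0.2, 0.15], [0.1, 0.2, 0.15])
```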

[288] CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, Caiwen Ding

Main category: cs.LG

TL;DR: CudaForge is a training-free multi-agent workflow that uses LLMs to automatically generate and optimize CUDA kernels, achieving high correctness and performance while being cost-effective.

DetailsMotivation: Manual CUDA kernel design is costly and time-consuming for AI applications like LLM training, and existing automatic methods produce inefficient kernels with high overhead and poor generalization.

Method: Uses two LLM agents (Coder and Judge) in an iterative workflow inspired by human experts, integrating hardware feedback from Nsight Compute metrics to generate, correct, and optimize kernels.

Result: Achieves 97.6% correctness and 1.68× speedup over PyTorch baselines, outperforms state-of-the-art models, generalizes across GPUs and base models, with low cost (26.5 minutes and $0.3 per kernel).

Conclusion: Multi-agent, training-free workflows enable cost-effective, generalizable, and high-performance CUDA kernel optimization.

Abstract: Developing efficient CUDA kernels is increasingly critical for AI applications such as large-scale LLM training. However, manual kernel design is both costly and time-consuming, motivating automatic approaches that leverage LLMs for code generation. Existing methods for automatic kernel generation, however, often produce low-efficiency kernels, incur high computational overhead, and fail to generalize across settings. In this work, we propose CudaForge, a training-free multi-agent workflow for CUDA kernel generation and optimization. Our workflow is inspired by the iterative workflow of human experts, which contains steps such as developing initial kernels, testing correctness, analyzing hardware feedback, and iterative improvement. More specifically, CudaForge employs two LLM agents: a Coder and a Judge, that iteratively generate, correct, and optimize CUDA kernels, while integrating hardware feedback such as Nsight Compute (NCU) metrics. In extensive evaluations, we show that CudaForge, by leveraging base models like OpenAI-o3, achieves 97.6% correctness of generated kernels and an average 1.68$\times$ speedup over PyTorch baselines, substantially surpassing state-of-the-art models including OpenAI-o3 and Kevin on KernelBench. Beyond accuracy and speed, CudaForge demonstrates strong generalization across GPUs (A100, RTX 6000, 4090, 3090) and base models (OpenAI-o3, GPT-5, gpt-oss-120B, Claude-Sonnet-4, QwQ-32B), while maintaining high efficiency. In particular, generating an optimized kernel takes about 26.5 minutes on one RTX6000 and incurs about $0.3 in API cost, which is significantly cheaper than existing agentic work that costs 6 H100 hours and $5 in API cost per kernel. Our results highlight that multi-agent, training-free workflows can enable cost-effective, generalizable, and high-performance CUDA kernel optimization. Code available at https://github.com/OptimAI-Lab/CudaForge

[289] Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization

Mikhail Persiianov, Arip Asadulaev, Nikita Andreev, Nikita Starodubcev, Dmitry Baranchuk, Anastasis Kratsios, Evgeny Burnaev, Alexander Korotin

Main category: cs.LG

TL;DR: Proposes a semi-supervised learning method that seamlessly integrates paired and unpaired data for conditional distribution learning using likelihood maximization and inverse entropic optimal transport.

DetailsMotivation: Learning conditional distributions often requires paired data, which is challenging to acquire in many applications like domain translation. Semi-supervised approaches are needed to utilize both limited paired data and abundant unpaired samples.

Method: Integrates paired and unpaired data using data likelihood maximization techniques, connects with inverse entropic optimal transport, and establishes an end-to-end learning algorithm leveraging computational OT advances.

Result: The approach achieves universal approximation property (can theoretically recover true conditional distributions with arbitrarily small error) and effectively learns conditional distributions in empirical tests using combined data.

Conclusion: The proposed method provides a principled framework for semi-supervised conditional distribution learning that seamlessly combines paired and unpaired data, with strong theoretical guarantees and empirical performance.

Abstract: Learning conditional distributions $\pi^*(\cdot|x)$ is a central problem in machine learning, which is typically approached via supervised methods with paired data $(x,y) \sim \pi^*$. However, acquiring paired data samples is often challenging, especially in problems such as domain translation. This necessitates the development of $\textit{semi-supervised}$ models that utilize both limited paired data and additional unpaired i.i.d. samples $x \sim \pi^*_x$ and $y \sim \pi^*_y$ from the marginal distributions. The usage of such combined data is complex and often relies on heuristic approaches. To tackle this issue, we propose a new learning paradigm that integrates both paired and unpaired data $\textbf{seamlessly}$ using the data likelihood maximization techniques. We demonstrate that our approach also connects intriguingly with inverse entropic optimal transport (OT). This finding allows us to apply recent advances in computational OT to establish an $\textbf{end-to-end}$ learning algorithm to get $\pi^*(\cdot|x)$. In addition, we derive the universal approximation property, demonstrating that our approach can theoretically recover true conditional distributions with arbitrarily small error. Furthermore, we demonstrate through empirical tests that our method effectively learns conditional distributions using paired and unpaired data simultaneously.
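One way to sketch the combined objective (in my own notation; the paper's exact formulation may differ): fit a parametric conditional model $q_\theta(y|x)$ by maximizing the paired log-likelihood plus the marginal log-likelihood of the unpaired $y$ samples under the model-induced marginal,

$$\max_\theta \;\; \mathbb{E}_{(x,y)\sim \pi^*}\big[\log q_\theta(y|x)\big] \;+\; \lambda\, \mathbb{E}_{y\sim \pi^*_y}\Big[\log \mathbb{E}_{x\sim \pi^*_x}\, q_\theta(y|x)\Big],$$

where $\lambda$ is a hypothetical weight trading off the two data sources; per the abstract, such likelihood maximization connects with inverse entropic OT.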

[290] Revisiting semi-supervised learning in the era of foundation models

Ping Zhang, Zheda Mai, Quang-Huy Nguyen, Wei-Lun Chao

Main category: cs.LG

TL;DR: SSL with pre-trained vision foundation models shows that parameter-efficient fine-tuning (PEFT) using only labeled data often matches SSL performance, motivating a simple self-training approach with ensemble PEFT methods to generate robust pseudo-labels.

DetailsMotivation: To understand how semi-supervised learning (SSL) interacts with pre-trained vision foundation models (VFMs) and address the gap in knowledge about SSL's effectiveness with these modern backbones.

Method: Developed new SSL benchmark datasets where frozen VFMs underperform, systematically evaluated SSL methods, and proposed ensembling multiple PEFT approaches and VFM backbones to produce robust pseudo-labels for self-training.

Result: Parameter-efficient fine-tuning (PEFT) using only labeled data often matches SSL performance, and the proposed ensemble self-training approach with robust pseudo-labels shows effectiveness in improving SSL with VFMs.

Conclusion: The study provides actionable insights into SSL with VFMs, demonstrating that simple self-training with ensemble PEFT methods can be powerful, paving the way for more scalable and practical semi-supervised learning in the foundation model era.

Abstract: Semi-supervised learning (SSL) leverages abundant unlabeled data alongside limited labeled data to enhance learning. As vision foundation models (VFMs) increasingly serve as the backbone of vision applications, it remains unclear how SSL interacts with these pre-trained models. To address this gap, we develop new SSL benchmark datasets where frozen VFMs underperform and systematically evaluate representative SSL methods. We make a surprising observation: parameter-efficient fine-tuning (PEFT) using only labeled data often matches SSL performance, even without leveraging unlabeled data. This motivates us to revisit self-training, a conceptually simple SSL baseline, where we use the supervised PEFT model to pseudo-label unlabeled data for further training. To overcome the notorious issue of noisy pseudo-labels, we propose ensembling multiple PEFT approaches and VFM backbones to produce more robust pseudo-labels. Empirical results validate the effectiveness of this simple yet powerful approach, providing actionable insights into SSL with VFMs and paving the way for more scalable and practical semi-supervised learning in the era of foundation models.
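The ensembling step can be sketched as follows (a minimal illustration, not the paper's code; function and threshold names are mine): average class probabilities from several PEFT-tuned models and keep only confident pseudo-labels for self-training.

```python
import numpy as np

def ensemble_pseudo_labels(prob_list, threshold=0.7):
    """Average class probabilities from several models and keep only
    confident pseudo-labels; rejected samples are marked -1."""
    mean_probs = np.mean(prob_list, axis=0)       # (n_samples, n_classes)
    labels = mean_probs.argmax(axis=1)
    confidence = mean_probs.max(axis=1)
    labels[confidence < threshold] = -1           # drop low-confidence samples
    return labels

# two toy "PEFT models" scoring three unlabeled samples over two classes
m1 = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
m2 = np.array([[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]])
labels = ensemble_pseudo_labels([m1, m2])  # → [0, -1, 1]
```

Averaging over both PEFT variants and VFM backbones plays the same role: disagreements lower the mean confidence, so noisy pseudo-labels are filtered out before retraining.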

[291] Beyond Covariance Matrix: The Statistical Complexity of Private Linear Regression

Fan Chen, Jiachun Li, Alexander Rakhlin, David Simchi-Levi

Main category: cs.LG

TL;DR: The paper establishes minimax convergence rates for private linear regression under potentially ill-conditioned covariate distributions, showing that the statistical complexity is captured by L1 analogues of the covariance matrix rather than the usual covariance matrix. It introduces an Information-Weighted Regression method and applies it to private linear contextual bandits, achieving rate-optimal regret bounds.

DetailsMotivation: To understand the statistical complexity of private linear regression under unknown, potentially ill-conditioned covariate distributions, and to address open problems in private linear contextual bandits regarding the cost of joint privacy versus local privacy.

Method: Establishes minimax convergence rates for central and local privacy models, introduces Information-Weighted Regression method that attains optimal rates, and proposes an efficient algorithm for private linear contextual bandits.

Result: Shows that joint privacy comes at almost no additional cost compared to local privacy in linear contextual bandits, achieving rate-optimal regret bounds of order √T + 1/α (joint privacy) and √T/α (local privacy).

Conclusion: The intrinsic complexity of private linear regression is captured by L1 analogues of the covariance matrix rather than the usual covariance matrix, and joint privacy can be achieved with minimal additional cost in contextual bandit settings.

Abstract: We study the statistical complexity of private linear regression under an unknown, potentially ill-conditioned covariate distribution. Somewhat surprisingly, under privacy constraints the intrinsic complexity is \emph{not} captured by the usual covariance matrix but rather its $L_1$ analogues. Building on this insight, we establish minimax convergence rates for both the central and local privacy models and introduce an Information-Weighted Regression method that attains the optimal rates. As an application, in private linear contextual bandits, we propose an efficient algorithm that achieves rate-optimal regret bounds of order $\sqrt{T}+\frac{1}{\alpha}$ and $\sqrt{T}/\alpha$ under joint and local $\alpha$-privacy models, respectively. Notably, our results demonstrate that joint privacy comes at almost no additional cost, addressing the open problems posed by Azize and Basu (2024).

[292] Trustworthy Representation Learning via Information Funnels and Bottlenecks

João Machado de Freitas, Bernhard C. Geiger

Main category: cs.LG

TL;DR: The paper introduces CPFSI, a novel information-theoretic method for learning invariant representations that balances utility, fairness, and privacy, with applications in both fully and semi-supervised settings.

DetailsMotivation: To address the critical challenge of ensuring trustworthiness in machine learning by balancing utility, fairness, and privacy in representation learning, particularly for extracting invariant representations from data.

Method: Proposes Conditional Privacy Funnel with Side-information (CPFSI) using neural-network-based approximations via amortized variational inference to handle intractable information-theoretic objectives.

Result: CPFSI effectively balances competing objectives, outperforms existing approaches, and shows that intervening on sensitive attributes in its predictive posterior enhances fairness while maintaining predictive performance.

Conclusion: The method demonstrates real-world applicability for learning robust and fair representations from tabular datasets in data-scarce environments, offering new insights into Pareto frontiers of utility-invariance-fidelity trade-offs.

Abstract: Ensuring trustworthiness in machine learning – by balancing utility, fairness, and privacy – remains a critical challenge, particularly in representation learning. In this work, we investigate a family of closely related information-theoretic objectives, including information funnels and bottlenecks, designed to extract invariant representations from data. We introduce the Conditional Privacy Funnel with Side-information (CPFSI), a novel formulation within this family, applicable in both fully and semi-supervised settings. Given the intractability of these objectives, we derive neural-network-based approximations via amortized variational inference. We systematically analyze the trade-offs between utility, invariance, and representation fidelity, offering new insights into the Pareto frontiers of these methods. Our results demonstrate that CPFSI effectively balances these competing objectives and frequently outperforms existing approaches. Furthermore, we show that intervening on sensitive attributes in CPFSI’s predictive posterior enhances fairness while maintaining predictive performance. Finally, we focus on the real-world applicability of these approaches, particularly for learning robust and fair representations from tabular datasets in data-scarce environments – a modality where these methods are often especially relevant.

[293] How does training shape the Riemannian geometry of neural network representations?

Jacob A. Zavatone-Veth, Sheng Yang, Julian A. Rubinfien, Cengiz Pehlevan

Main category: cs.LG

TL;DR: Neural networks trained on classification tasks learn to magnify local areas along decision boundaries, breaking the initial symmetry of random network metrics and revealing how training shapes geometric inductive biases.

DetailsMotivation: To understand what geometric constraints are appropriate for machine learning tasks by studying how training molds the Riemannian geometry induced by neural network feature maps.

Method: Analyze how training changes the Riemannian metrics induced by neural network feature maps, comparing networks at infinite width with random parameters versus trained networks on classification tasks.

Result: Random networks induce highly symmetric metrics, but training breaks this symmetry by magnifying local areas along decision boundaries. This occurs in deep networks on high-dimensional tasks and even in self-supervised learning.

Conclusion: Training shapes neural network geometry in a nonlinear way that amplifies decision boundary regions, providing insight into how feature learning creates useful geometric inductive biases.

Abstract: In machine learning, there is a long history of trying to build neural networks that can learn from fewer example data by baking in strong geometric priors. However, it is not always clear a priori what geometric constraints are appropriate for a given task. Here, we explore the possibility that one can uncover useful geometric inductive biases by studying how training molds the Riemannian geometry induced by unconstrained neural network feature maps. We first show that at infinite width, neural networks with random parameters induce highly symmetric metrics on input space. This symmetry is broken by feature learning: networks trained to perform classification tasks learn to magnify local areas along decision boundaries. This holds in deep networks trained on high-dimensional image classification tasks, and even in self-supervised representation learning. These results begin to elucidate how training shapes the geometry induced by unconstrained neural network feature maps, laying the groundwork for an understanding of this richly nonlinear form of feature learning.
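The central object here, the Riemannian metric pulled back through a feature map, can be computed numerically for any network (a generic sketch, not the paper's code): $G(x) = J(x)^\top J(x)$ with $J$ the Jacobian of the feature map, and the local area magnification is $\sqrt{\det G(x)}$.

```python
import numpy as np

def pullback_metric(f, x, eps=1e-5):
    """Metric induced on input space by feature map f: G(x) = J(x)^T J(x),
    with J the Jacobian of f at x (central finite differences)."""
    x = np.asarray(x, dtype=float)
    J = np.stack([(f(x + e) - f(x - e)) / (2 * eps)
                  for e in eps * np.eye(x.size)], axis=1)
    return J.T @ J

def magnification(f, x):
    """Local area magnification factor sqrt(det G(x)); training is reported
    to increase this quantity near decision boundaries."""
    return np.sqrt(np.linalg.det(pullback_metric(f, x)))

# toy feature map that stretches the first input direction by 3:
# G = diag(9, 1), so the magnification factor is 3 everywhere
f = lambda x: np.array([3.0 * x[0], x[1]])
mag = magnification(f, np.array([0.2, -0.4]))
```

For a trained classifier one would evaluate `magnification` on a grid of inputs and compare values on versus off the decision boundary.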

[294] Towards Interpretable and Efficient Attention: Compressing All by Contracting a Few

Qishuai Wen, Zhiyuan Huang, Chun-Guang Li

Main category: cs.LG

TL;DR: The paper proposes a unified optimization objective that derives interpretable and efficient attention mechanisms through algorithm unrolling, introducing Contract-and-Broadcast Self-Attention (CBSA) with linear complexity.

DetailsMotivation: Current attention mechanisms lack clear optimization objectives and suffer from quadratic complexity, while interpretability and efficiency are typically studied separately rather than as mutually reinforcing pursuits.

Method: Construct a gradient step of the proposed objective through algorithm unrolling, creating Contract-and-Broadcast Self-Attention (CBSA) that compresses input tokens by contracting representatives to low-dimensional structures.

Result: CBSA achieves linear complexity by fixing the number of representatives, covers varied attention mechanisms with different representative sets, and demonstrates comparable performance with superior advantages over black-box attention on visual tasks.

Conclusion: The work successfully integrates interpretability and efficiency in attention mechanisms and provides a unified formulation that sheds light on the underlying optimization objectives of attention mechanisms.

Abstract: Attention mechanisms have achieved significant empirical success in multiple fields, but their underlying optimization objectives remain unclear. Moreover, the quadratic complexity of self-attention has become increasingly prohibitive. Although interpretability and efficiency are two mutually reinforcing pursuits, prior work typically investigates them separately. In this paper, we propose a unified optimization objective that derives inherently interpretable and efficient attention mechanisms through algorithm unrolling. Precisely, we construct a gradient step of the proposed objective with a set of forward-pass operations of our \emph{Contract-and-Broadcast Self-Attention} (CBSA), which compresses input tokens towards low-dimensional structures by contracting a few representatives of them. This novel mechanism can not only scale linearly by fixing the number of representatives, but also covers the instantiations of varied attention mechanisms when using different sets of representatives. We conduct extensive experiments to demonstrate comparable performance and superior advantages over black-box attention mechanisms on visual tasks. Our work sheds light on the integration of interpretability and efficiency, as well as a unified formulation of attention mechanisms.
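A rough sketch of the contract-and-broadcast idea (my own simplified reading, not the paper's exact operator): k representatives first contract information from the n tokens, then each token reads back from the representatives, so both steps cost O(n·k·d) rather than O(n²·d).

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def contract_broadcast_attention(X, R):
    """Contract: k representatives R aggregate the n tokens X.
    Broadcast: each token then reads from the contracted representatives.
    Linear in n for a fixed number of representatives k."""
    d = X.shape[1]
    Z = softmax(R @ X.T / np.sqrt(d)) @ X    # (k, d) contracted summaries
    Y = softmax(X @ Z.T / np.sqrt(d)) @ Z    # (n, d) broadcast back to tokens
    return Y

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))   # n = 16 tokens, d = 8
R = rng.standard_normal((4, 8))    # k = 4 representatives
Y = contract_broadcast_attention(X, R)
```

In the paper the mechanism is derived by unrolling a gradient step of the stated objective; this sketch only illustrates why fixing k yields linear scaling.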

[295] AlignIQL: Policy Alignment in Implicit Q-Learning through Constrained Optimization

Longxiang He, Li Shen, Xueqian Wang

Main category: cs.LG

TL;DR: The paper introduces AlignIQL and AlignIQL-hard to solve the implicit policy-finding problem in offline RL, providing insights into why IQL can use weighted regression for policy extraction and achieving superior performance on complex tasks.

DetailsMotivation: To address the unclear mechanism of how to recover implicit policy from learned Q-function in IQL and why it can use weighted regression for policy extraction, as IDQL's solution only works for optimal value functions.

Method: Formulates the implicit policy-finding problem as an optimization problem and proposes two practical algorithms: AlignIQL and AlignIQL-hard, which maintain IQL’s actor-critic decoupling while solving the policy extraction issue.

Result: Achieves competitive or superior results on D4RL datasets compared to SOTA offline RL methods, with significant improvements in complex sparse reward tasks like Antmaze and Adroit over IQL and IDQL.

Conclusion: The proposed method successfully solves the implicit policy-finding problem while maintaining IQL’s simplicity and provides theoretical insights into IQL’s weighted regression mechanism, demonstrating strong performance especially in challenging sparse reward environments.

Abstract: Implicit Q-learning (IQL) serves as a strong baseline for offline RL, which learns the value function using only dataset actions through quantile regression. However, it is unclear how to recover the implicit policy from the learned implicit Q-function and why IQL can utilize weighted regression for policy extraction. IDQL reinterprets IQL as an actor-critic method and derives weights for the implicit policy; however, these weights hold only for the optimal value function. In this work, we introduce a different way to solve the implicit policy-finding problem (IPF) by formulating this problem as an optimization problem. Based on this optimization problem, we further propose two practical algorithms, AlignIQL and AlignIQL-hard, which inherit the advantages of decoupling actor from critic in IQL and provide insights into why IQL can use weighted regression for policy extraction. Compared with IQL and IDQL, we find our method keeps the simplicity of IQL and solves the implicit policy-finding problem. Experimental results on D4RL datasets show that our method achieves competitive or superior results compared with other SOTA offline RL methods. Especially in complex sparse reward tasks like Antmaze and Adroit, our method outperforms IQL and IDQL by a significant margin.
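For context, the value function in standard IQL is trained with an asymmetric least-squares (expectile) objective on dataset actions; a minimal sketch of that standard loss (background for the paper, not AlignIQL's contribution):

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """Asymmetric squared error |tau - 1[diff < 0]| * diff^2: with tau > 0.5,
    underestimates of the target are penalized more, pushing V(s) toward an
    upper expectile of Q(s, a) using only dataset actions."""
    weight = np.where(diff > 0, tau, 1 - tau)
    return weight * diff ** 2

l_over = expectile_loss(np.array(2.0))    # overestimate error: 0.7 * 4
l_under = expectile_loss(np.array(-2.0))  # underestimate error: 0.3 * 4
```

The question AlignIQL addresses is which policy is implicitly defined by the V and Q learned this way, and why weighted regression can extract it.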

[296] EraseFlow: Learning Concept Erasure Policies via GFlowNet-Driven Alignment

Abhiram Kusumba, Maitreya Patel, Kyle Min, Changhoon Kim, Chitta Baral, Yezhou Yang

Main category: cs.LG

TL;DR: EraseFlow is a novel framework that uses GFlowNets to erase harmful concepts from text-to-image generators by exploring denoising trajectories, achieving better performance than existing methods without needing complex reward models.

DetailsMotivation: Current concept erasure techniques for text-to-image generators either degrade image quality, rely on brittle adversarial losses, or require extensive retraining, highlighting the need for a more effective approach.

Method: EraseFlow frames concept unlearning as exploration in denoising path space and optimizes it using GFlowNets with trajectory balance objective, sampling entire trajectories rather than single states.

Result: Extensive experiments show EraseFlow outperforms existing baselines, achieves optimal trade-off between performance and prior preservation, and generalizes effectively to unseen concepts without hackable rewards.

Conclusion: EraseFlow provides a superior approach to concept erasure by leveraging trajectory exploration with GFlowNets, eliminating the need for complex reward models while maintaining image quality and model performance.

Abstract: Erasing harmful or proprietary concepts from powerful text-to-image generators is an emerging safety requirement, yet current “concept erasure” techniques either collapse image quality, rely on brittle adversarial losses, or demand prohibitive retraining cycles. We trace these limitations to a myopic view of the denoising trajectories that govern diffusion-based generation. We introduce EraseFlow, the first framework that casts concept unlearning as exploration in the space of denoising paths and optimizes it with GFlowNets equipped with the trajectory balance objective. By sampling entire trajectories rather than single end states, EraseFlow learns a stochastic policy that steers generation away from target concepts while preserving the model’s prior. EraseFlow eliminates the need for carefully crafted reward models and, by doing so, generalizes effectively to unseen concepts and avoids hackable rewards while improving performance. Extensive empirical results demonstrate that EraseFlow outperforms existing baselines and achieves an optimal trade-off between performance and prior preservation.
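The trajectory balance objective that EraseFlow optimizes has the standard GFlowNet form; a minimal sketch for one sampled trajectory (generic TB, not the paper's full training loop):

```python
import numpy as np

def trajectory_balance_loss(log_Z, log_pf, log_pb, log_reward):
    """Squared trajectory-balance residual for one sampled trajectory:
    (log Z + sum log P_F(s'|s) - log R(x) - sum log P_B(s|s'))^2.
    Driving it to zero over whole trajectories makes the forward policy
    sample terminal states in proportion to their reward."""
    return (log_Z + np.sum(log_pf) - log_reward - np.sum(log_pb)) ** 2

# a perfectly balanced toy trajectory incurs zero loss
loss = trajectory_balance_loss(log_Z=0.0,
                               log_pf=[-1.0, -1.0],
                               log_pb=[0.0, 0.0],
                               log_reward=-2.0)
```

Because the residual is computed over the entire denoising path rather than a single end state, credit is assigned to every step of generation, which is the property the abstract emphasizes.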

[297] Data Quality Monitoring for the Hadron Calorimeters Using Transfer Learning for Anomaly Detection

Mulugeta Weldezgina Asres, Christian Walter Omlin, Long Wang, Pavel Parygin, David Yu, Jay Dittmann, The CMS-HCAL Collaboration

Main category: cs.LG

TL;DR: This paper explores transfer learning for spatio-temporal anomaly detection in high-dimensional sensor data, using a hybrid autoencoder architecture to improve model accuracy and reduce training requirements.

DetailsMotivation: The proliferation of sensors generates massive spatio-temporal data, but data curation is time-consuming and expensive. Transfer learning can mitigate data sparsity and model complexity, especially for systems with thousands of sensors and limited training data.

Method: Uses a hybrid autoencoder architecture combining convolutional, graph, and recurrent neural networks. Investigates transferability of models trained on different sections of the Hadron Calorimeter at CERN’s CMS experiment.

Result: Reveals insights into model initialization and training configurations that enhance performance while substantially reducing trainable parameters and mitigating data contamination effects.

Conclusion: Transfer learning shows potential for spatio-temporal anomaly detection in high-dimensional sensor systems, offering improved accuracy and robustness with reduced training requirements.

Abstract: The proliferation of sensors brings an immense volume of spatio-temporal (ST) data in many domains, including monitoring, diagnostics, and prognostics applications. Data curation is a time-consuming process for a large volume of data, making it challenging and expensive to deploy data analytics platforms in new environments. Transfer learning (TL) mechanisms promise to mitigate data sparsity and model complexity by utilizing pre-trained models for a new task. Despite the triumph of TL in fields like computer vision and natural language processing, efforts on complex ST models for anomaly detection (AD) applications are limited. In this study, we present the potential of TL within the context of high-dimensional ST AD with a hybrid autoencoder architecture, incorporating convolutional, graph, and recurrent neural networks. Motivated by the need for improved model accuracy and robustness, particularly in scenarios with limited training data on systems with thousands of sensors, this research investigates the transferability of models trained on different sections of the Hadron Calorimeter of the Compact Muon Solenoid experiment at CERN. The key contributions of the study include exploring TL’s potential and limitations within the context of encoder and decoder networks, revealing insights into model initialization and training configurations that enhance performance while substantially reducing trainable parameters and mitigating data contamination effects. Code: https://github.com/muleina/CMS_HCAL_ML_OnlineDQM

[298] Dynamical loss functions shape landscape topography and improve learning in artificial neural networks

Eduardo Lavin Pallero, Miguel Ruiz-Garcia

Main category: cs.LG

TL;DR: Dynamical loss functions with periodic oscillations improve validation accuracy by altering the loss landscape without changing global minima, particularly benefiting networks of varying sizes.

DetailsMotivation: To enhance neural network training by modifying standard loss functions to periodically oscillate class contributions, which globally alters the loss landscape while preserving global minima.

Method: Transform cross-entropy and mean squared error into dynamical loss functions with periodic oscillations, analyze impact of network size and learning rate on minima exploration, and propose several versions tested on simple classification problems.

Result: Dynamical loss functions significantly improve validation accuracy for networks of varying sizes, with the landscape evolution revealing instabilities potentially linked to edge-of-instability minimization.

Conclusion: Dynamical loss functions provide an effective approach to improve neural network training by modifying the loss landscape through periodic oscillations, leading to better validation performance across different network architectures.

Abstract: Dynamical loss functions are derived from standard loss functions used in supervised classification tasks, but are modified so that the contribution from each class periodically increases and decreases. These oscillations globally alter the loss landscape without affecting the global minima. In this paper, we demonstrate how to transform cross-entropy and mean squared error into dynamical loss functions. We begin by discussing the impact of increasing the size of the neural network or the learning rate on the depth and sharpness of the minima that the system explores. Building on this intuition, we propose several versions of dynamical loss functions and use a simple classification problem where we can show how they significantly improve validation accuracy for networks of varying sizes. Finally, we explore how the landscape of these dynamical loss functions evolves during training, highlighting the emergence of instabilities that may be linked to edge-of-instability minimization.
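A minimal sketch of the construction (my own illustration following the description, not the paper's code): a cross-entropy whose per-class weights oscillate periodically over training steps; with amplitude below 1 every weight stays positive, so the global minima are unchanged.

```python
import numpy as np

def dynamical_cross_entropy(probs, labels, step, period=100, amp=0.5):
    """Cross-entropy with per-class weights that oscillate over training
    steps; with amp < 1 every weight stays positive, so the set of global
    minima (perfect classification) is unchanged."""
    n_classes = probs.shape[1]
    phases = 2 * np.pi * np.arange(n_classes) / n_classes
    w = 1.0 + amp * np.sin(2 * np.pi * step / period + phases)  # class weights
    true_probs = probs[np.arange(len(labels)), labels]
    return -np.mean(w[labels] * np.log(true_probs))

# perfect predictions give zero loss at every training step
probs = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1])
loss = dynamical_cross_entropy(probs, labels, step=37)
```

The `period` and `amp` values are hypothetical; the point is only that each class's contribution rises and falls over time, reshaping the landscape without moving its global minima.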

[299] This Time is Different: An Observability Perspective on Time Series Foundation Models

Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, Charles Masson, Hugo Miccinilli, Elise Ramé, Qiqi Ren, Afshin Rostamizadeh, Jean Ogier du Terrail, Anna-Monica Toon, Kan Wang, Stephan Xie, Zongzhe Xu, Viktoriya Zhukova, David Asker, Ameet Talwalkar, Othmane Abou-Amal

Main category: cs.LG

TL;DR: Toto is a 151M parameter time series forecasting foundation model with architectural innovations for multivariate observability data, achieving SOTA performance on both the new BOOM benchmark and established benchmarks.

DetailsMotivation: To address specific challenges in multivariate observability time series data and create a foundation model that outperforms existing time series forecasting models.

Method: Uses decoder-only architecture with innovations for observability data, pre-trained on a large corpus (4-10x larger than competitors) mixing observability data, open datasets, and synthetic data from Datadog’s telemetry.

Result: Achieves state-of-the-art performance on both the new BOOM benchmark (350M observations across 2,807 real-world time series) and established general purpose time series forecasting benchmarks.

Conclusion: Toto demonstrates superior forecasting capabilities for observability time series and is released as open source with model weights, code, and benchmarks available.

Abstract: We introduce Toto, a time series forecasting foundation model with 151 million parameters. Toto uses a modern decoder-only architecture coupled with architectural innovations designed to account for specific challenges found in multivariate observability time series data. Toto’s pre-training corpus is a mixture of observability data, open datasets, and synthetic data, and is 4-10$\times$ larger than those of leading time series foundation models. Additionally, we introduce BOOM, a large-scale benchmark consisting of 350 million observations across 2,807 real-world time series. For both Toto and BOOM, we source observability data exclusively from Datadog’s own telemetry and internal observability metrics. Extensive evaluations demonstrate that Toto achieves state-of-the-art performance on both BOOM and on established general purpose time series forecasting benchmarks. Toto’s model weights, inference code, and evaluation scripts, as well as BOOM’s data and evaluation code, are all available as open source under the Apache 2.0 License available at https://huggingface.co/Datadog/Toto-Open-Base-1.0 and https://github.com/DataDog/toto.

[300] Learning Expressive Random Feature Models via Parametrized Activations

Zailin Ma, Jiansheng Yang, Yaodong Yang

Main category: cs.LG

TL;DR: RFLAF introduces learnable activation functions in random feature models using weighted sums of basis functions, expanding function space representation and improving performance over fixed activation RF models.

DetailsMotivation: Traditional random feature methods use fixed activation functions, limiting adaptability across diverse tasks. RFLAF aims to overcome this limitation by making activation functions learnable.

Method: Parameterize activation functions as weighted sums of basis functions (RBFs, splines, polynomials) within random feature framework. Provide theoretical analysis with RBFs and extend to multiple basis functions.

Result: RFLAF with RBFs and splines consistently outperforms other RF models. RBFs show 3x faster computational efficiency than splines. When first-layer parameters are unfrozen, learnable activation components demonstrate expressivity advantage in two-layer neural networks.

Conclusion: Learnable activation functions significantly expand the represented function space in neural networks, providing deeper understanding of this architectural component’s role in modern neural network architectures.

Abstract: Random feature (RF) method is a powerful kernel approximation technique, but is typically equipped with fixed activation functions, limiting its adaptability across diverse tasks. To overcome this limitation, we introduce the Random Feature Model with Learnable Activation Functions (RFLAF), a novel statistical model that parameterizes activation functions as weighted sums of basis functions within the random feature framework. Examples of basis functions include radial basis functions, spline functions, polynomials, and so forth. For theoretical results, we consider RBFs as representative basis functions. We start with a single RBF as the activation, and then extend the results to multiple RBFs, demonstrating that RF models with learnable activation component largely expand the represented function space. We provide estimates on the required number of samples and random features to achieve low excess risks. For experiments, we test RFLAF with three types of bases: radial basis functions, spline functions and polynomials. Experimental results show that RFLAFs with RBFs and splines consistently outperform other RF models, where RBFs show 3 times faster computational efficiency than splines. We then unfreeze the first-layer parameters and retrain the models, validating the expressivity advantage of learnable activation components on regular two-layer neural networks. Our work provides a deeper understanding of the component of learnable activation functions within modern neural network architectures.
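A compact sketch of the model's feature map with RBF bases (illustrative; parameter names are mine): random projections followed by an activation parameterized as a weighted sum of basis functions, where the coefficients are the learnable part.

```python
import numpy as np

def rflaf_features(X, W, centers, widths, coeffs):
    """Random-feature map with a learnable activation: the activation is
    phi(z) = sum_j c_j * exp(-(z - mu_j)^2 / (2 s_j^2)), applied elementwise
    to the random projections Z = X @ W; the c_j are trainable weights."""
    Z = X @ W
    return sum(c * np.exp(-(Z - mu) ** 2 / (2 * s ** 2))
               for c, mu, s in zip(coeffs, centers, widths))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))            # 5 samples, 3 input dims
W = rng.standard_normal((3, 7))            # 7 random features
Phi = rflaf_features(X, W,
                     centers=[-1.0, 0.0, 1.0],
                     widths=[1.0, 1.0, 1.0],
                     coeffs=[0.2, 0.5, 0.3])
```

Swapping the Gaussian basis for splines or polynomials changes only the elementwise term inside the sum, which is how the paper's three basis families fit one framework.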

[301] HALO: Hadamard-Assisted Lower-Precision Optimization for LLMs

Saleh Ashkboos, Mahdi Nikdan, Soroush Tabesh, Roberto L. Castro, Torsten Hoefler, Dan Alistarh

Main category: cs.LG

TL;DR: HALO enables accurate 8-bit quantized fine-tuning of LLMs using Hadamard rotations, high-performance kernels, and FSDP integration, achieving near-full-precision accuracy with up to 1.41x speedup.

DetailsMotivation: Quantized training of LLMs is challenging due to accuracy loss from weight and activation outliers, especially during fine-tuning where maintaining precision is difficult.

Method: Combines strategic Hadamard rotations in forward/backward passes to mitigate outliers, high-performance kernel support, and FSDP integration for low-precision communication.

Result: Achieves near-full-precision-equivalent results on LLAMA-family models during fine-tuning, with up to 1.41x end-to-end speedup on RTX 4090 GPUs, supporting both standard and PEFT fine-tuning.

Conclusion: HALO presents the first practical approach for fully quantized LLM fine-tuning that maintains accuracy in 8-bit precision while delivering performance benefits.

Abstract: Quantized training of Large Language Models (LLMs) remains an open challenge, as maintaining accuracy while performing all matrix multiplications in low precision has proven difficult. This is particularly the case when fine-tuning pre-trained models, which can have large weight and activation outlier values that make lower-precision optimization difficult. To address this, we present HALO, a novel quantization-aware training approach for Transformers that enables accurate and efficient low-precision training by combining 1) strategic placement of Hadamard rotations in both forward and backward passes, which mitigate outliers, 2) high-performance kernel support, and 3) FSDP integration for low-precision communication. Our approach ensures that all large matrix multiplications during the forward and backward passes are executed in lower precision. Applied to LLAMA-family models, HALO achieves near-full-precision-equivalent results during fine-tuning on various tasks, while delivering up to 1.41x end-to-end speedup for full fine-tuning on RTX 4090 GPUs. HALO efficiently supports both standard and parameter-efficient fine-tuning (PEFT). Our results demonstrate the first practical approach to fully quantized LLM fine-tuning that maintains accuracy in 8-bit precision, while delivering performance benefits. Code is available at https://github.com/IST-DASLab/HALO.
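The core trick of rotating before quantization can be illustrated with a fast Walsh-Hadamard transform (a generic sketch, not HALO's kernels): an orthonormal Hadamard rotation preserves the vector's norm while spreading an outlier across all coordinates, which shrinks the per-tensor quantization scale.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform over the last axis
    (length must be a power of two)."""
    x = np.array(x, dtype=float)
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b          # butterfly: sums
            x[..., i + h:i + 2 * h] = a - b  # butterfly: differences
        h *= 2
    return x / np.sqrt(n)

# a single outlier is spread evenly across coordinates: the norm is
# preserved, but the max magnitude (and hence the int8 scale) drops
v = np.array([8.0, 0.0, 0.0, 0.0])
r = fwht(v)                              # → [4., 4., 4., 4.]
scale_before = np.abs(v).max() / 127.0   # absmax int8 scale, pre-rotation
scale_after = np.abs(r).max() / 127.0    # smaller scale after rotation
```

Because the rotation is orthogonal it can be undone (or absorbed into adjacent weights), so the matrix multiplication's result is unchanged while its low-precision representation loses less information.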

[302] Layer Importance for Mathematical Reasoning is Forged in Pre-Training and Invariant after Post-Training

Aadim Nepal, Safal Shrestha, Anubhav Shrestha, Minwu Kim, Jalal Naghiyev, Ravid Shwartz-Ziv, Keith Ross

Main category: cs.LG

TL;DR: Math reasoning in LLMs depends on a few critical layers that remain important across all post-training methods, with specialized layers forming during pre-training and remaining stable.

DetailsMotivation: To understand whether improvements in math reasoning from instruction tuning, reinforcement learning, or knowledge distillation come from major transformer changes or smaller adjustments that preserve original structure.

Method: Used layer-wise ablation on base and trained model variants, analyzed critical layers using Normalized Mutual Information (NMI) to measure token representation drift.

Result: Removing critical layers reduces math accuracy by up to 80%, while factual recall shows smaller drops. Near critical layers, tokens drift from syntactic clusters toward representations aligned with tokens useful for downstream tasks.

Conclusion: Specialized layers for mathematical tasks form during pre-training and remain stable across different post-training methods, with critical layers enabling math reasoning through representation alignment rather than syntactic processing.

Abstract: Large language models improve at math after instruction tuning, reinforcement learning, or knowledge distillation. We ask whether these gains come from major changes in the transformer layers or from smaller adjustments that keep the original structure. Using layer-wise ablation on base and trained variants, we find that math reasoning depends on a few critical layers, which stay important across all post-training methods. Removing these layers reduces math accuracy by as much as 80%, whereas factual recall tasks only show relatively smaller drops. This suggests that specialized layers for mathematical tasks form during pre-training and remain stable afterward. As measured by Normalized Mutual Information (NMI), we find that near these critical layers, tokens drift from their original syntactic clusters toward representations aligned with tokens less syntactically related but potentially more useful for downstream tasks.
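Layer-wise ablation can be sketched in a few lines (a toy residual network standing in for a transformer; not the paper's setup): replace one block with the identity and measure how much the output moves, as a proxy for that layer's importance.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 16, 6
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(L)]

def forward(x, ablate=None):
    for i, W in enumerate(Ws):
        if i == ablate:
            continue                   # identity-substitute layer i
        x = x + np.tanh(W @ x)         # toy residual block
    return x

x = rng.normal(size=d)
base = forward(x)
# per-layer importance: output deviation when that layer is skipped
importance = [np.linalg.norm(forward(x, ablate=i) - base) for i in range(L)]
print(importance)
```

In the paper the analogous measurement is task accuracy (math vs. factual recall) rather than raw output deviation, but the ablation mechanic is the same.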

[303] REINFORCE-ING Chemical Language Models for Drug Discovery

Morgan Thomas, Albert Bou, Jose Carlos Gómez-Tamayo, Gary Tresadern, Mazen Ahmad, Gianni De Fabritiis

Main category: cs.LG

TL;DR: Analysis of RL algorithms for chemical language models in drug discovery, proposing improved regularization and demonstrating enhanced efficiency with Boltz2 reward models.

DetailsMotivation: To clarify best practices for RL algorithms in chemical language models for drug discovery, as current performance and optimal approaches are unclear.

Method: Investigated RL components including experience replay, hill-climbing, variance reduction baselines, and reward shaping; proposed new regularization method aligned with REINFORCE; fine-tuned RL hyperparameters; applied learnings using Boltz2 as reward model.

Result: Developed improved regularization method and demonstrated enhanced learning efficiency on binding affinity models using Boltz2 reward models.

Conclusion: RL models are shared in the ACEGEN repository; the findings provide guidance for researchers applying RL to chemical language models for drug discovery.

Abstract: Chemical language models, combined with reinforcement learning (RL), have shown significant promise to efficiently traverse large chemical spaces for drug discovery. However, the performance of various RL algorithms and their best practices for practical drug discovery are still unclear. Here, starting from the principles of the REINFORCE algorithm, we investigate the effect of different components from RL theory including experience replay, hill-climbing, baselines to reduce variance, and alternative reward shaping. We propose a new regularization method more aligned to REINFORCE than current standard practices, and demonstrate how RL hyperparameters can be fine-tuned for effectiveness and efficiency. Lastly, we apply our learnings to practical drug discovery by demonstrating enhanced learning efficiency on frontier binding affinity models by using Boltz2 as a reward model. We share our RL models used in the ACEGEN repository, and hope the experiments here act as a guide to researchers applying RL to chemical language models for drug discovery.
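The core estimator the paper starts from can be reduced to a toy sketch (a small discrete action space standing in for a chemical language model's token vocabulary; the reward values are hypothetical): sample an action, score it, and push up the log-probability of actions whose reward beats a moving-average baseline, which reduces gradient variance.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
reward = np.array([0.1, 0.2, 1.0, 0.3, 0.0])  # hypothetical scoring oracle
logits = np.zeros(K)
baseline, lr, beta = 0.0, 0.5, 0.9

for _ in range(2000):
    p = np.exp(logits - logits.max()); p /= p.sum()  # softmax policy
    a = rng.choice(K, p=p)
    r = reward[a]
    baseline = beta * baseline + (1 - beta) * r  # variance-reduction baseline
    grad_logp = -p; grad_logp[a] += 1.0          # d log pi(a) / d logits
    logits += lr * (r - baseline) * grad_logp    # REINFORCE ascent step

print(int(np.argmax(logits)))  # the policy concentrates on action 2
```

Experience replay and hill-climbing, which the paper also studies, amount to re-weighting which sampled molecules feed this same update.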

[304] FedRef: Communication-Efficient Bayesian Fine-Tuning using a Reference Model

Taehwan Yoon, Bongjun Choi, Wesley De Neve

Main category: cs.LG

TL;DR: Proposes a reference model-based fine-tuning method for federated learning that prevents catastrophic forgetting using Bayesian parameter-efficient transfer learning with a proximal term.

DetailsMotivation: Federated learning suffers from catastrophic forgetting, which increases communication/computational costs and energy consumption, despite existing optimization methods.

Method: Reference model-based fine-tuning derived from Bayesian parameter-efficient transfer learning with proximal term, incorporating previous model parameters and global features.

Result: Achieves higher model performance and lower communication/computational costs for clients compared to existing methods.

Conclusion: The proposed method effectively mitigates catastrophic forgetting in federated learning while improving efficiency.

Abstract: Federated learning (FL) collaboratively trains artificial intelligence (AI) models to ensure user data privacy. Sharing only model updates generated from local training on client data with the server enhances user data privacy. However, model performance may suffer due to data and system heterogeneity among clients in FL scenarios. Previous studies have proposed model optimization, fine-tuning, and personalization to achieve improved model performance. Despite these efforts, models resulting from FL scenarios often exhibit catastrophic forgetting, which increases the communication and computational costs of clients for model optimization and raises energy consumption. To address these challenges, we propose a reference model-based fine-tuning method for federated learning that overcomes catastrophic forgetting in each round. Our method is derived from Bayesian parameter-efficient transfer learning and includes a proximal term. It employs a reference model that incorporates previous model parameters and reviews previous global features in the model optimization step to mitigate catastrophic forgetting. As a result, our method achieves higher model performance and lower communication and computational costs for clients than existing methods.
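The effect of a proximal term can be seen on a toy quadratic (a generic sketch of proximal regularization toward a reference model, not FedRef's actual objective): the local update is pulled back toward the reference parameters, which is the mechanism that resists forgetting.

```python
import numpy as np

w_local = np.array([3.0, -2.0])  # optimum implied by the client's local data
w_ref = np.array([0.0, 0.0])     # reference/global model from earlier rounds
mu = 1.0                         # strength of the proximal term

def grad(w):
    # d/dw [ 0.5*||w - w_local||^2 + (mu/2)*||w - w_ref||^2 ]
    return (w - w_local) + mu * (w - w_ref)

w = w_ref.copy()
for _ in range(200):
    w -= 0.1 * grad(w)

# closed form minimizer: (w_local + mu * w_ref) / (1 + mu)
print(w)  # ≈ [1.5, -1.0]: between the local optimum and the reference
```

Raising `mu` moves the solution closer to the reference (less forgetting, less local adaptation); lowering it recovers plain local fine-tuning.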

[305] Sundial: A Family of Highly Capable Time Series Foundation Models

Yong Liu, Guo Qin, Zhiyuan Shi, Zhi Chen, Caiyin Yang, Xiangdong Huang, Jianmin Wang, Mingsheng Long

Main category: cs.LG

TL;DR: Sundial is a family of time series foundation models that uses TimeFlow Loss based on flow-matching for native pre-training on continuous-valued time series without tokenization, achieving state-of-the-art forecasting performance with fast inference.

DetailsMotivation: To develop flexible and scalable time series foundation models that can handle continuous-valued data without discrete tokenization, enabling more reliable real-world decision-making.

Method: Proposes TimeFlow Loss based on flow-matching for native pre-training of Transformers on continuous time series, uses minimal Transformer adaptations, and trains on TimeBench dataset with one trillion time points comprising real-world and synthetic data.

Result: Achieves state-of-the-art results on both point and probabilistic forecasting benchmarks with just-in-time inference speed (few milliseconds for zero-shot predictions), demonstrating unprecedented model capacity and generalization performance.

Conclusion: Sundial’s pioneering generative forecasting capability improves model reliability in real-world decision-making and represents a significant advancement in time series foundation models.

Abstract: We introduce Sundial, a family of native, flexible, and scalable time series foundation models. To predict the next-patch’s distribution, we propose a TimeFlow Loss based on flow-matching, which facilitates native pre-training of Transformers on continuous-valued time series without discrete tokenization. Conditioned on arbitrary-length time series, our models are pre-trained without specifying any prior distribution and can generate multiple probable predictions, achieving more flexibility in representation learning than using parametric densities. Towards time series foundation models, we leverage minimal but crucial adaptations of Transformers and curate TimeBench with one trillion time points, comprising mostly real-world datasets and synthetic data. By mitigating mode collapse via TimeFlow Loss, we pre-train a family of Sundial models on TimeBench, which achieve unprecedented model capacity and generalization performance. In addition to excellent scalability, Sundial achieves state-of-the-art results on both point and probabilistic forecasting benchmarks with a just-in-time inference speed, i.e., making zero-shot predictions within a few milliseconds. We believe that Sundial’s pioneering generative forecasting capability can improve model reliability in real-world decision-making. Code is available at: https://github.com/thuml/Sundial.
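The flow-matching objective underlying this kind of loss can be sketched in one dimension (a minimal illustration of the idea behind the TimeFlow Loss, not the paper's implementation): pair noise x0 ~ N(0, 1) with a data point x1 and regress a velocity field onto the straight-line velocity x1 - x0, evaluated at a random time t along x_t = (1 - t)·x0 + t·x1.

```python
import numpy as np

rng = np.random.default_rng(0)

def fm_loss(v, x0, x1, t):
    xt = (1 - t) * x0 + t * x1
    return np.mean((v(xt, t) - (x1 - x0)) ** 2)

# For a point-mass "dataset" x1 = c the optimal velocity field is known in
# closed form, which lets us sanity-check both the loss and the sampler:
c = 2.5
v_exact = lambda x, t: (c - x) / (1 - t)

x0 = rng.normal(size=1000)
t = rng.uniform(0.0, 0.99, size=1000)
oracle_loss = fm_loss(v_exact, x0, np.full(1000, c), t)  # ~ 0

# Sampling: integrate dx/dt = v(x, t) from noise toward data with Euler steps
x = rng.normal(size=1000)
dt = 0.01
for tcur in np.linspace(0.0, 1.0, 101)[:-1]:
    x = x + dt * v_exact(x, tcur)
print(oracle_loss, np.abs(x - c).max())  # every noise sample lands on c
```

Because no parametric density is ever specified, the same recipe scales to a network predicting next-patch velocities for continuous-valued series, which is the flexibility the abstract emphasizes.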

[306] DiffSpectra: Molecular Structure Elucidation from Spectra using Diffusion Models

Liang Wang, Yu Rong, Tingyang Xu, Zhenyi Zhong, Zhiyuan Liu, Pengju Wang, Deli Zhao, Qiang Liu, Shu Wu, Liang Wang, Yang Zhang

Main category: cs.LG

TL;DR: DiffSpectra is a generative framework that uses diffusion models to directly infer 2D and 3D molecular structures from multi-modal spectra, achieving high accuracy in molecular structure elucidation.

DetailsMotivation: Conventional approaches for molecular structure elucidation from spectra rely heavily on expert interpretation and lack scalability, while existing machine learning approaches are constrained by limited reference libraries and struggle with 3D geometry and multi-modal spectral integration.

Method: Uses diffusion models with an SE(3)-equivariant architecture (Diffusion Molecule Transformer) for geometric modeling, conditioned by a Transformer-based spectral encoder (SpecFormer) that captures multi-modal spectral dependencies.

Result: Achieves 40.76% top-1 and 99.49% top-10 accuracy in molecular structure elucidation, with performance benefiting from 3D geometric modeling, SpecFormer pre-training, and multi-modal conditioning.

Conclusion: DiffSpectra is the first framework that unifies multi-modal spectral reasoning and joint 2D/3D generative modeling for de novo molecular structure elucidation.

Abstract: Molecular structure elucidation from spectra is a fundamental challenge in molecular science. Conventional approaches rely heavily on expert interpretation and lack scalability, while retrieval-based machine learning approaches remain constrained by limited reference libraries. Generative models offer a promising alternative, yet most adopt autoregressive architectures that overlook 3D geometry and struggle to integrate diverse spectral modalities. In this work, we present DiffSpectra, a generative framework that formulates molecular structure elucidation as a conditional generation process, directly inferring 2D and 3D molecular structures from multi-modal spectra using diffusion models. Its denoising network is parameterized by the Diffusion Molecule Transformer, an SE(3)-equivariant architecture for geometric modeling, conditioned by SpecFormer, a Transformer-based spectral encoder capturing multi-modal spectral dependencies. Extensive experiments demonstrate that DiffSpectra accurately elucidates molecular structures, achieving 40.76% top-1 and 99.49% top-10 accuracy. Its performance benefits substantially from 3D geometric modeling, SpecFormer pre-training, and multi-modal conditioning. To our knowledge, DiffSpectra is the first framework that unifies multi-modal spectral reasoning and joint 2D/3D generative modeling for de novo molecular structure elucidation.

[307] Stable Port-Hamiltonian Neural Networks

Fabian J. Roth, Dominik K. Klein, Maximilian Kannapinn, Jan Peters, Oliver Weeger

Main category: cs.LG

TL;DR: Stable port-Hamiltonian neural networks incorporate physical biases of energy conservation and dissipation to ensure global Lyapunov stability in learned dynamics, outperforming purely data-driven approaches in accuracy and physically meaningful generalization.

DetailsMotivation: Purely data-driven neural networks for nonlinear dynamic system identification often struggle with extrapolation, yield physically implausible forecasts, and exhibit instabilities, making them difficult to apply safely and robustly.

Method: Introduces stable port-Hamiltonian neural networks that incorporate physical biases of energy conservation and dissipation to ensure global Lyapunov stability of the learned dynamics.

Result: The architecture facilitates robust learning of stable dynamics from sparse data, avoids instability, and surpasses purely data-driven approaches in accuracy and physically meaningful generalization. Demonstrated effectiveness on illustrative and real-world examples, including multi-physics simulation data.

Conclusion: Stable port-Hamiltonian neural networks provide a robust machine learning architecture for nonlinear dynamic system identification that ensures physical plausibility and stability while improving performance over purely data-driven methods.

Abstract: In recent years, nonlinear dynamic system identification using artificial neural networks has garnered attention due to its broad potential applications across science and engineering. However, purely data-driven approaches often struggle with extrapolation and may yield physically implausible forecasts. Furthermore, the learned dynamics can exhibit instabilities, making it difficult to apply such models safely and robustly. This article introduces stable port-Hamiltonian neural networks, a machine learning architecture that incorporates physical biases of energy conservation and dissipation while ensuring global Lyapunov stability of the learned dynamics. Through illustrative and real-world examples, we demonstrate that these strong inductive biases facilitate robust learning of stable dynamics from sparse data, while avoiding instability and surpassing purely data-driven approaches in accuracy and physically meaningful generalization. Furthermore, the model’s applicability and potential for data-driven surrogate modeling are showcased on multi-physics simulation data.
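The stability guarantee comes from the port-Hamiltonian structure itself, which a small simulation can verify (a generic sketch with a toy quadratic Hamiltonian, not the paper's architecture): for dx/dt = (J - R) ∇H(x) with J skew-symmetric and R positive semi-definite, dH/dt = -∇Hᵀ R ∇H ≤ 0, so H is a Lyapunov function.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
J = A - A.T                   # skew-symmetric: lossless energy exchange
R = 0.2 * np.eye(n)           # isotropic PSD damping, chosen for simplicity

H = lambda x: 0.5 * x @ x     # toy quadratic Hamiltonian, grad_H(x) = x

x = rng.normal(size=n)
energies = [H(x)]
dt = 1e-3
for _ in range(2000):
    x = x + dt * (J - R) @ x  # explicit Euler on the pH vector field
    energies.append(H(x))

print(energies[0], energies[-1])  # energy decays monotonically
```

Any choice of J, R, and H with these structural properties yields the same guarantee, which is why the bias survives when the components are parameterized by neural networks and fit to data.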

[308] Decision-aware training of spatiotemporal forecasting models to select a top K subset of sites for intervention

Kyle Heuton, F. Samuel Muench, Shikhar Shrestha, Thomas J. Stopka, Michael C. Hughes

Main category: cs.LG

TL;DR: This paper addresses two key problems in using spatiotemporal prediction models for optimal resource allocation: improving site ranking for the BPR metric and enabling gradient-based training despite discrete selections.

DetailsMotivation: Decision makers need data-driven methods to optimally allocate scarce resources across locations, but existing approaches using probabilistic models have limitations in ranking sites effectively for the BPR metric and training models to maximize this metric.

Method: The authors propose a decision-theoretic approach for better site ranking and use perturbed optimizers to enable gradient-based training despite the discrete nature of top-K selections, combining likelihood with BPR constraints.

Result: The approach delivers high-quality top-K rankings while maintaining good forecasts across all sites, demonstrated on opioid overdose mitigation and endangered wildlife monitoring applications.

Conclusion: The proposed methods successfully overcome barriers in using probabilistic models for optimal resource allocation, providing practical solutions for real-world intervention planning.

Abstract: Optimal allocation of scarce resources is a common problem for decision makers faced with choosing a limited number of locations for intervention. Spatiotemporal prediction models could make such decisions data-driven. A recent performance metric called fraction of best possible reach (BPR) measures the impact of using a model’s recommended size K subset of sites compared to the best possible top-K in hindsight. We tackle two open problems related to BPR. First, we explore how to rank all sites numerically given a probabilistic model that predicts event counts jointly across sites. Ranking via the per-site mean is suboptimal for BPR. Instead, we offer a better ranking for BPR backed by decision theory. Second, we explore how to train a probabilistic model’s parameters to maximize BPR. Discrete selection of K sites implies all-zero parameter gradients which prevent standard gradient training. We overcome this barrier via advances in perturbed optimizers. We further suggest a training objective that combines likelihood with a decision-aware BPR constraint to deliver high-quality top-K rankings as well as good forecasts for all sites. We demonstrate our approach on two where-to-intervene applications: mitigating opioid-related fatal overdoses for public health and monitoring endangered wildlife.
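Both ingredients can be sketched concretely (a generic illustration of the BPR metric and of the perturbed-optimizer idea, not the paper's implementation): BPR compares the events captured by the model's top-K sites to the best possible top-K in hindsight, and averaging the top-K indicator over random perturbations of the scores produces a smooth selection that admits gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

def bpr(scores, true_counts, K):
    """Fraction of best possible reach for a ranking given by `scores`."""
    chosen = np.argsort(scores)[-K:]
    best = np.argsort(true_counts)[-K:]
    return true_counts[chosen].sum() / true_counts[best].sum()

def perturbed_topk_indicator(scores, K, sigma=0.5, n_samples=1000):
    """Monte Carlo estimate of E[top-K indicator(scores + sigma * Z)].
    The expectation is smooth in `scores`, unlike the hard top-K."""
    ind = np.zeros_like(scores)
    for _ in range(n_samples):
        z = rng.normal(size=scores.shape)
        top = np.argsort(scores + sigma * z)[-K:]
        ind[top] += 1.0
    return ind / n_samples

counts = np.array([10.0, 2.0, 7.0, 1.0, 9.0])  # events per site, hypothetical
scores = np.array([0.9, 0.1, 0.5, 0.2, 0.8])   # hypothetical model ranking
print(bpr(scores, counts, K=2))                 # this ranking attains 1.0
soft = perturbed_topk_indicator(scores, 2)
print(soft)  # soft top-K membership; entries sum to K
```

The hard top-K has all-zero gradients almost everywhere; the smoothed membership vector is what lets a BPR-style term enter a differentiable training objective.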

[309] UniFault: A Fault Diagnosis Foundation Model from Bearing Data

Emadeldeen Eldele, Mohamed Ragab, Xu Qing, Edward, Zhenghua Chen, Min Wu, Xiaoli Li, Jay Lee

Main category: cs.LG

TL;DR: UniFault is a foundation model for machine fault diagnosis that addresses dataset heterogeneity through data harmonization and cross-domain temporal fusion, achieving state-of-the-art few-shot performance.

DetailsMotivation: Existing fault diagnosis models lack generalization across diverse datasets due to heterogeneity in sampling frequencies and channels. Foundation models show promise but face challenges adapting to FD's smaller, more varied datasets.

Method: UniFault uses a comprehensive data harmonization pipeline with: 1) unification scheme converting multivariate inputs to standardized univariate sequences, and 2) cross-domain temporal fusion to mitigate distribution shifts and enrich sample diversity.

Result: Pretrained on 6.9+ million samples, UniFault achieves state-of-the-art performance in extensive experiments on real-world FD datasets, demonstrating superior few-shot learning capabilities.

Conclusion: UniFault sets a new benchmark for fault diagnosis models, enabling more scalable and robust predictive maintenance solutions through effective cross-domain generalization.

Abstract: Machine fault diagnosis (FD) is a critical task for predictive maintenance, enabling early fault detection and preventing unexpected failures. Despite its importance, existing FD models are operation-specific with limited generalization across diverse datasets. Foundation models (FM) have demonstrated remarkable potential in both visual and language domains, achieving impressive generalization capabilities even with minimal data through few-shot or zero-shot learning. However, translating these advances to FD presents unique hurdles. Unlike the large-scale, cohesive datasets available for images and text, FD datasets are typically smaller and more heterogeneous, with significant variations in sampling frequencies and the number of channels across different systems and applications. This heterogeneity complicates the design of a universal architecture capable of effectively processing such diverse data while maintaining robust feature extraction and learning capabilities. In this paper, we introduce UniFault, a foundation model for fault diagnosis that systematically addresses these issues. Specifically, the model incorporates a comprehensive data harmonization pipeline featuring two key innovations. First, a unification scheme transforms multivariate inputs into standardized univariate sequences. Second, a novel cross-domain temporal fusion strategy mitigates distribution shifts and enriches sample diversity and count, improving the model generalization across varying conditions. UniFault is pretrained on over 6.9 million samples spanning diverse FD datasets, enabling superior few-shot performance. Extensive experiments on real-world FD datasets demonstrate that UniFault achieves state-of-the-art performance, setting a new benchmark for fault diagnosis models and paving the way for more scalable and robust predictive maintenance solutions.
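A harmonization step in the spirit of the unification scheme might look as follows (the function name, target length, and details are assumptions for illustration, not the paper's code): resample each channel of a multivariate record onto a fixed-length grid, standardize it, and emit the channels as univariate sequences.

```python
import numpy as np

def harmonize(signal, fs, target_len=2048):
    out = []
    for ch in np.atleast_2d(signal):              # (channels, time)
        t_old = np.arange(len(ch)) / fs
        t_new = np.linspace(0.0, t_old[-1], target_len)
        res = np.interp(t_new, t_old, ch)         # resample onto a fixed grid
        res = (res - res.mean()) / (res.std() + 1e-8)  # standardize
        out.append(res)
    return np.stack(out)                          # each row: one univariate sample

x = np.random.default_rng(0).normal(size=(3, 5000))  # 3 channels, e.g. 12.8 kHz
uni = harmonize(x, fs=12800)
print(uni.shape)  # (3, 2048)
```

Treating each channel as its own univariate sample also multiplies the effective sample count, which is consistent with the abstract's emphasis on enriching sample diversity and count.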

[310] Reliable and efficient inverse analysis using physics-informed neural networks with normalized distance functions and adaptive weight tuning

Shota Deguchi, Mitsuteru Asai

Main category: cs.LG

TL;DR: PINNs struggle with boundary condition accuracy. This paper uses R-functions to enforce exact boundary conditions and combines them with adaptive weight tuning for better inverse problem solving.

DetailsMotivation: Conventional PINNs use penalty methods for boundary conditions, which are inaccurate and sensitive to penalty parameters. This limits solution accuracy, especially for inverse problems.

Method: Proposes using R-functions (distance functions) to enforce boundary conditions exactly, combined with bias-corrected adaptive weight tuning for improved accuracy and efficiency.

Result: Numerical results show the method provides more accurate and efficient solutions to inverse problems than penalty-based approaches, even with non-convex geometries and complex boundary conditions.

Conclusion: The integrated framework offers a reliable and efficient approach for inverse analysis using PINNs, with broad engineering applications.

Abstract: Physics-informed neural networks (PINNs) have attracted significant attention in scientific machine learning for their capability to solve forward and inverse problems governed by partial differential equations. However, the accuracy of PINN solutions is often limited by the treatment of boundary conditions. Conventional penalty-based methods, which incorporate boundary conditions as penalty terms in the loss function, cannot guarantee exact satisfaction of the given boundary conditions and are highly sensitive to the choice of penalty parameters. This paper demonstrates that distance functions, specifically R-functions, can be leveraged to enforce boundary conditions, overcoming these limitations. R-functions provide normalized distance fields, enabling flexible representation of boundary geometries, including non-convex domains, and facilitating various types of boundary conditions. Nevertheless, distance functions alone are insufficient for accurate inverse analysis in PINNs. To address this, we propose an integrated framework that combines the normalized distance field with bias-corrected adaptive weight tuning to improve both accuracy and efficiency. Numerical results show that the proposed method provides more accurate and efficient solutions to various inverse problems than penalty-based approaches, even in the presence of non-convex geometries with complex boundary conditions. This approach offers a reliable and efficient framework for inverse analysis using PINNs, with potential applications across a wide range of engineering problems.
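The exact-enforcement construction can be sketched directly (a generic illustration of the distance-function ansatz, with a hypothetical boundary function g and a stand-in for the network N): write u(x) = g(x) + φ(x)·N(x), where φ vanishes on the boundary, so the Dirichlet condition holds for any network output. R-functions supply φ for composite geometries, here the conjunction of two half-planes.

```python
import numpy as np

# R-function conjunction: combine the distance fields of the half-planes
# x >= 0 and y >= 0 into one field for their intersection (the quadrant)
def r_and(p1, p2):
    return p1 + p2 - np.sqrt(p1**2 + p2**2)

x, y = np.meshgrid(np.linspace(0, 1, 11), np.linspace(0, 1, 11))
phi = r_and(x, y)               # vanishes exactly on the edges x = 0 and y = 0

g = lambda x, y: x + y          # boundary data (hypothetical)
N = lambda x, y: np.sin(x) * y  # stand-in for the network output
u = g(x, y) + phi * N(x, y)     # the ansatz: exact BCs by construction

# On the boundary the ansatz reproduces g regardless of N:
print(np.abs(u[0, :] - g(x[0, :], y[0, :])).max())  # ~ 0 on the y = 0 edge
```

Because the boundary condition is satisfied identically, no penalty weight needs tuning for it, which is the sensitivity the paper's penalty-based baselines suffer from.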

[311] CoTox: Chain-of-Thought-Based Molecular Toxicity Reasoning and Prediction

Jueon Park, Yein Park, Minju Song, Soyon Park, Donghyeon Lee, Seungheun Baek, Jaewoo Kang

Main category: cs.LG

TL;DR: CoTox is a novel framework that integrates LLMs with chain-of-thought reasoning for multi-toxicity prediction, combining chemical structures, biological pathways, and gene ontology terms to generate interpretable toxicity predictions.

DetailsMotivation: Drug toxicity remains a major challenge in pharmaceutical development. Current machine learning models lack interpretability and biological context, limiting their ability to capture organ-specific toxicities driven by complex biological mechanisms.

Method: CoTox integrates LLMs with chain-of-thought reasoning, combining chemical structure data (using IUPAC names instead of SMILES), biological pathways, and gene ontology terms. It simulates drug treatment on relevant cell types and incorporates the resulting biological context.

Result: CoTox outperforms both traditional machine learning and deep learning models. Using IUPAC names instead of SMILES enhances reasoning ability and improves predictive performance. The framework generates toxicity predictions aligned with physiological responses.

Conclusion: LLM-based frameworks like CoTox have significant potential to improve interpretability and support early-stage drug safety assessment by providing transparent, biologically-informed toxicity predictions.

Abstract: Drug toxicity remains a major challenge in pharmaceutical development. Recent machine learning models have improved in silico toxicity prediction, but their reliance on annotated data and lack of interpretability limit their applicability and their ability to capture organ-specific toxicities driven by complex biological mechanisms. Large language models (LLMs) offer a promising alternative through step-by-step reasoning and integration of textual data, yet prior approaches lack biological context and transparent rationale. To address this issue, we propose CoTox, a novel framework that integrates LLMs with chain-of-thought (CoT) reasoning for multi-toxicity prediction. CoTox combines chemical structure data, biological pathways, and gene ontology (GO) terms to generate interpretable toxicity predictions through step-by-step reasoning. Using GPT-4o, we show that CoTox outperforms both traditional machine learning and deep learning models. We further examine its performance across various LLMs to identify where CoTox is most effective. Additionally, we find that representing chemical structures with IUPAC names, which are easier for LLMs to understand than SMILES, enhances the model’s reasoning ability and improves predictive performance. To demonstrate its practical utility in drug development, we simulate the treatment of relevant cell types with drugs and incorporate the resulting biological context into the CoTox framework. This approach allows CoTox to generate toxicity predictions aligned with physiological responses, as shown in a case study. This result highlights the potential of LLM-based frameworks to improve interpretability and support early-stage drug safety assessment. The code and prompt used in this work are available at https://github.com/dmis-lab/CoTox.

[312] NeuralSurv: Deep Survival Analysis with Bayesian Uncertainty Quantification

Mélodie Monod, Alessandro Micheli, Samir Bhatt

Main category: cs.LG

TL;DR: NeuralSurv is the first Bayesian deep survival model that provides uncertainty quantification through a novel two-stage data augmentation scheme and efficient variational inference with closed-form updates.

DetailsMotivation: Current deep survival models lack proper uncertainty quantification, which is crucial for reliable predictions, especially in data-scarce scenarios where model calibration matters most.

Method: Uses a non-parametric, architecture-agnostic framework with two-stage data augmentation for time-varying covariate-risk relationships. Implements mean-field variational inference with coordinate-ascent updates that scale linearly and achieve full conjugacy through local linearization of Bayesian neural networks.

Result: NeuralSurv achieves superior calibration compared to state-of-the-art deep survival models while matching or exceeding their discriminative performance on both synthetic and real-world datasets.

Conclusion: Bayesian principles significantly enhance model calibration in survival analysis, providing robust and well-calibrated uncertainty estimates for survival functions, particularly valuable in data-scarce regimes.

Abstract: We introduce NeuralSurv, the first deep survival model to incorporate Bayesian uncertainty quantification. Our non-parametric, architecture-agnostic framework captures time-varying covariate-risk relationships in continuous time via a novel two-stage data-augmentation scheme, for which we establish theoretical guarantees. For efficient posterior inference, we introduce a mean-field variational algorithm with coordinate-ascent updates that scale linearly in model size. By locally linearizing the Bayesian neural network, we obtain full conjugacy and derive all coordinate updates in closed form. In experiments, NeuralSurv delivers superior calibration compared to state-of-the-art deep survival models, while matching or exceeding their discriminative performance across both synthetic benchmarks and real-world datasets. Our results demonstrate the value of Bayesian principles in data-scarce regimes by enhancing model calibration and providing robust, well-calibrated uncertainty estimates for the survival function.

[313] Fast weight programming and linear transformers: from machine learning to neurobiology

Kazuki Irie, Samuel J. Gershman

Main category: cs.LG

TL;DR: Fast Weight Programmers (FWPs) are recurrent neural networks with 2D matrix-form hidden states that dynamically modify synaptic weights as short-term memory, controlled by a programmer network trained via gradient descent.

DetailsMotivation: To explore neural network architectures with dynamic synaptic weights that serve as short-term memory storage, bridging connections to transformers, state space models, and biological synaptic plasticity.

Method: Uses 2D matrix-form hidden states where fast weights change over time based on input observations, programmed by another network whose parameters are trained through gradient descent.

Result: Established technical foundations of FWPs, their computational characteristics, and demonstrated connections to transformers, state space models, and biological synaptic plasticity models.

Conclusion: FWPs represent a convergence of natural and artificial intelligence, offering insights into dynamic memory mechanisms and connecting machine learning architectures with biological neural processes.

Abstract: Recent advances in artificial neural networks for machine learning, and language modeling in particular, have established a family of recurrent neural network (RNN) architectures that, unlike conventional RNNs with vector-form hidden states, use two-dimensional (2D) matrix-form hidden states. Such 2D-state RNNs, known as Fast Weight Programmers (FWPs), can be interpreted as a neural network whose synaptic weights (called fast weights) dynamically change over time as a function of input observations, and serve as short-term memory storage; corresponding synaptic weight modifications are controlled or programmed by another network (the programmer) whose parameters are trained (e.g., by gradient descent). In this Primer, we review the technical foundations of FWPs, their computational characteristics, and their connections to transformers and state space models. We also discuss connections between FWPs and models of synaptic plasticity in the brain, suggesting a convergence of natural and artificial intelligence.
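The basic FWP mechanics reviewed here can be written in a few lines (a minimal sketch with purely additive outer-product updates; real FWPs generate the key/value/query vectors with a trained programmer network): the 2D fast-weight state is written to by outer products and read out by queries, and this computation coincides exactly with unnormalized linear attention.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 16
keys = rng.normal(size=(T, d))
values = rng.normal(size=(T, d))
queries = rng.normal(size=(T, d))

W = np.zeros((d, d))              # the 2D matrix-form hidden state
fwp_out = []
for t in range(T):
    W = W + np.outer(values[t], keys[t])  # program the fast weights
    fwp_out.append(W @ queries[t])        # read out with the query
fwp_out = np.stack(fwp_out)

# Equivalent linear-attention form: y_t = sum_{s<=t} (k_s . q_t) v_s
attn_out = np.stack([
    sum((keys[s] @ queries[t]) * values[s] for s in range(t + 1))
    for t in range(T)
])
print(np.abs(fwp_out - attn_out).max())  # the two computations agree
```

This identity is the formal bridge between FWPs and (linear) transformers that the Primer reviews; richer variants replace the additive write with learned gated or delta-rule updates.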

[314] Recurrent Self-Attention Dynamics: An Energy-Agnostic Perspective from Jacobians

Akiyoshi Tomihari, Ryo Karakida

Main category: cs.LG

TL;DR: This paper provides an energy-agnostic analysis of self-attention dynamics using dynamical systems theory, relaxing traditional energy constraints and revealing the critical role of normalization layers in controlling Lipschitzness and achieving high performance.

DetailsMotivation: To broaden understanding of self-attention beyond idealized energy-based formulations by relaxing symmetry and single-head constraints, enabling analysis of more general SA architectures without requiring energy functions.

Method: Dynamical systems analysis using Jacobian matrices of state, investigation of normalization layers’ role in suppressing Lipschitzness and complex eigenvalues, computation of Lyapunov exponents to assess criticality.

Result: Reveals that normalization layers are essential for controlling SA dynamics, normalized dynamics operate near critical states, and this criticality strongly correlates with high inference performance.

Conclusion: The Jacobian-based approach enables comprehensive analysis of general SA architectures, provides insights into inference dynamics, and facilitates development of regularization methods and monitoring tools for training.

Abstract: The theoretical understanding of self-attention (SA) has been steadily progressing. A prominent line of work studies a class of SA layers that admit an energy function decreased by state updates. While it provides valuable insights into inherent biases in signal propagation, it often relies on idealized assumptions or additional constraints not necessarily present in standard SA. Thus, to broaden our understanding, this work aims to relax these energy constraints and provide an energy-agnostic characterization of inference dynamics by dynamical systems analysis. In more detail, we first consider relaxing the symmetry and single-head constraints traditionally required in energy-based formulations. Next, we show that analyzing the Jacobian matrix of the state is highly valuable when investigating more general SA architectures without necessarily admitting an energy function. It reveals that the normalization layer plays an essential role in suppressing the Lipschitzness of SA and the Jacobian’s complex eigenvalues, which correspond to the oscillatory components of the dynamics. In addition, the Lyapunov exponents computed from the Jacobians demonstrate that the normalized dynamics lie close to a critical state, and this criticality serves as a strong indicator of high inference performance. Furthermore, the Jacobian perspective also enables us to develop regularization methods for training and a pseudo-energy for monitoring inference dynamics.
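The Jacobian-based quantities the analysis relies on are straightforward to compute numerically. The sketch below estimates the largest Lyapunov exponent of a toy recurrent map (a stand-in for SA dynamics, not the paper's model) by propagating a tangent vector through finite-difference Jacobians:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
A = 0.9 * rng.standard_normal((d, d)) / np.sqrt(d)

def step(z):
    # toy recurrent update standing in for a self-attention layer
    return np.tanh(A @ z)

def jacobian(z, eps=1e-6):
    # finite-difference Jacobian of the update map at state z
    J = np.empty((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (step(z + e) - step(z - e)) / (2 * eps)
    return J

# largest Lyapunov exponent: average log growth rate of a tangent
# vector pushed through the Jacobians along a trajectory
z = rng.standard_normal(d)
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
lyap, T = 0.0, 200
for _ in range(T):
    v = jacobian(z) @ v
    n = np.linalg.norm(v)
    lyap += np.log(n)
    v /= n
    z = step(z)
lyap /= T
print(lyap)   # a value near zero would indicate near-critical dynamics
```

The paper's claim is that normalization layers push such exponents toward zero (criticality); this toy only shows the measurement procedure.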

[315] On scalable and efficient training of diffusion samplers

Minkyu Kim, Kiyoung Seong, Dongyeop Woo, Sungsoo Ahn, Minsu Kim

Main category: cs.LG

TL;DR: A scalable framework combining MCMC samplers with diffusion models to efficiently sample from unnormalized energy distributions, addressing mode collapse and improving sample efficiency in high-dimensional problems.

DetailsMotivation: Existing diffusion samplers struggle with scalability in high-dimensional spaces where energy evaluations are expensive, and they suffer from mode collapse during training.

Method: Proposes a hybrid approach using MCMC samplers with novelty-based auxiliary energy as ‘Searchers’ to collect off-policy samples, combined with on-policy data to train diffusion samplers. Introduces periodic re-initialization to address primacy bias and mode collapse.

Result: Significantly improves sample efficiency on standard benchmarks, excels at higher-dimensional problems, and performs well on real-world molecular conformer generation tasks.

Conclusion: The proposed framework successfully harmonizes classical sampling methods with diffusion samplers, providing a scalable and sample-efficient solution for sampling from unnormalized energy distributions in challenging scenarios.

Abstract: We address the challenge of training diffusion models to sample from unnormalized energy distributions in the absence of data, the so-called diffusion samplers. Although these approaches have shown promise, they struggle to scale in more demanding scenarios where energy evaluations are expensive and the sampling space is high-dimensional. To address this limitation, we propose a scalable and sample-efficient framework that properly harmonizes powerful classical sampling methods with the diffusion sampler. Specifically, we utilize Markov chain Monte Carlo (MCMC) samplers with a novelty-based auxiliary energy as a Searcher to collect off-policy samples, using an auxiliary energy function to compensate for exploring modes the diffusion sampler rarely visits. These off-policy samples are then combined with on-policy data to train the diffusion sampler, thereby expanding its coverage of the energy landscape. Furthermore, we identify primacy bias, i.e., the preference of samplers for early experience during training, as the main cause of mode collapse during training, and introduce a periodic re-initialization trick to resolve this issue. Our method significantly improves sample efficiency on standard benchmarks for diffusion samplers and also excels at higher-dimensional problems and real-world molecular conformer generation.
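The "Searcher" idea, MCMC on an energy augmented with a novelty bonus, can be illustrated on a one-dimensional double well. Everything here (the bin-count novelty, the step sizes, the name `searcher_step`) is a hypothetical toy, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(5)

def energy(x):
    # toy double-well target; a diffusion sampler would learn to sample it
    return (x**2 - 1.0)**2

visits = np.zeros(40)                     # crude histogram over [-2, 2]

def novelty(x):
    # count-based novelty bonus: rarely visited bins get lower energy
    i = int(np.clip((x + 2.0) / 4.0 * 40, 0, 39))
    return 1.0 / (1.0 + visits[i])

def searcher_step(x, beta=1.0, bonus=2.0):
    """Metropolis step on the auxiliary energy E(x) - bonus*novelty(x),
    standing in for the paper's novelty-driven MCMC 'Searcher'."""
    prop = x + 0.5 * rng.standard_normal()
    cur = energy(x) - bonus * novelty(x)
    new = energy(prop) - bonus * novelty(prop)
    if rng.random() < np.exp(-beta * (new - cur)):
        x = prop
    i = int(np.clip((x + 2.0) / 4.0 * 40, 0, 39))
    visits[i] += 1
    return x

x, samples = 0.0, []
for _ in range(5000):
    x = searcher_step(x)
    samples.append(x)
samples = np.array(samples)
# the off-policy samples cover both modes near x = -1 and x = +1
print((samples < -0.5).mean(), (samples > 0.5).mean())
```

In the paper these off-policy samples would then be mixed with on-policy draws to train the diffusion sampler; the toy only shows how the novelty term keeps the chain from settling into a single mode.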

[316] A Theoretical Framework for Grokking: Interpolation followed by Riemannian Norm Minimisation

Etienne Boursier, Scott Pesme, Radu-Alexandru Dragomir

Main category: cs.LG

TL;DR: Gradient flow with small weight decay exhibits two-phase behavior: initial fast convergence to critical points followed by slow drift phase that reduces parameter norm, explaining the grokking phenomenon in deep learning.

DetailsMotivation: To explain the grokking effect in deep learning where test loss suddenly improves after training loss plateaus, by analyzing the dynamics of gradient flow with weight decay.

Method: Analyze gradient flow with small weight decay under mild regularity assumptions, showing it follows unregularized gradient flow initially then transitions to Riemannian gradient flow minimizing parameter norm.

Result: The trajectory exhibits two-phase behavior: fast convergence to critical points followed by slow drift phase at time ~1/λ that reduces ℓ₂-norm, explaining grokking as norm reduction induced by weight decay.

Conclusion: Weight decay induces a slow norm reduction mechanism that explains the grokking phenomenon, validated empirically on synthetic regression tasks.

Abstract: We study the dynamics of gradient flow with small weight decay on general training losses $F: \mathbb{R}^d \to \mathbb{R}$. Under mild regularity assumptions and assuming convergence of the unregularised gradient flow, we show that the trajectory with weight decay $\lambda$ exhibits a two-phase behaviour as $\lambda \to 0$. During the initial fast phase, the trajectory follows the unregularised gradient flow and converges to a manifold of critical points of $F$. Then, at time of order $1/\lambda$, the trajectory enters a slow drift phase and follows a Riemannian gradient flow minimising the $\ell_2$-norm of the parameters. This purely optimisation-based phenomenon offers a natural explanation for the \textit{grokking} effect observed in deep learning, where the training loss rapidly reaches zero while the test loss plateaus for an extended period before suddenly improving. We argue that this generalisation jump can be attributed to the slow norm reduction induced by weight decay, as explained by our analysis. We validate this mechanism empirically on several synthetic regression tasks.
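The two-phase behaviour described above can be written out compactly; the following display only restates the abstract's setup with the same symbols:

```latex
% regularised gradient flow with weight decay strength \lambda
\dot{\theta}_\lambda(t) = -\nabla F\big(\theta_\lambda(t)\big) - \lambda\,\theta_\lambda(t), \qquad \lambda \to 0.
% Phase 1 (t = O(1)): the term -\lambda\theta is negligible, and \theta_\lambda
% tracks the unregularised flow \dot{\theta} = -\nabla F(\theta) into a
% manifold \mathcal{M} of critical points of F (training loss reaches zero).
% Phase 2 (t = \tau/\lambda): in the slow time \tau, the trajectory follows a
% Riemannian gradient flow on \mathcal{M} minimising \tfrac{1}{2}\|\theta\|_2^2,
% the slow norm reduction associated with the grokking jump.
```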

[317] Robust and Computation-Aware Gaussian Processes

Marshal Arijona Sinaga, Julien Martinelli, Samuel Kaski

Main category: cs.LG

TL;DR: RCaGP is a novel Gaussian Process model that jointly addresses robustness to outliers and computational tractability through principled approximation-aware uncertainty estimation and robust Bayesian updating.

DetailsMotivation: Standard GPs struggle with large datasets containing outliers - they face computational intractability and lack robustness, while existing approaches address either robustness or approximation quality separately but not both.

Method: Combines approximation-induced uncertainty treatment with robust generalized Bayesian updating, using low-rank matrix approximations and tailored model selection for robust mean functions.

Result: Provides more conservative and reliable uncertainty estimates, demonstrates robustness properties, and shows superior performance on both clean and outlier-contaminated datasets for regression and Bayesian optimization.

Conclusion: Robustness and approximation-awareness are intertwined challenges that must be addressed jointly, and RCaGP provides a principled, scalable framework that effectively manages both outliers and computational uncertainties.

Abstract: Gaussian processes (GPs) are widely used for regression and optimization tasks such as Bayesian optimization (BO) due to their expressiveness and principled uncertainty estimates. However, in settings with large datasets corrupted by outliers, standard GPs and their sparse approximations struggle with computational tractability and robustness. We introduce Robust Computation-aware Gaussian Process (RCaGP), a novel GP model that jointly addresses these challenges by combining a principled treatment of approximation-induced uncertainty with robust generalized Bayesian updating. The key insight is that robustness and approximation-awareness are not orthogonal but intertwined: approximations can exacerbate the impact of outliers, and mitigating one without the other is insufficient. Unlike previous work that focuses narrowly on either robustness or approximation quality, RCaGP combines both in a principled and scalable framework, thus effectively managing both outliers and computational uncertainties introduced by approximations such as low-rank matrix multiplications. Our model ensures more conservative and reliable uncertainty estimates, a property we rigorously demonstrate. Additionally, we establish a robustness property and show that the mean function is key to preserving it, motivating a tailored model selection scheme for robust mean functions. Empirical results confirm that solving these challenges jointly leads to superior performance across both clean and outlier-contaminated settings, both on regression and high-throughput Bayesian optimization benchmarks.

[318] SME-TEAM: Leveraging Trust and Ethics for Secure and Responsible Use of AI and LLMs in SMEs

Iqbal H. Sarker, Helge Janicke, Ahmad Mohsin, Leandros Maglaras

Main category: cs.LG

TL;DR: SME-TEAM framework provides a structured approach for SMEs to adopt AI and LLMs responsibly, addressing trust, ethical, and technical challenges through four key pillars.

DetailsMotivation: AI and LLMs are transforming business but pose serious trust, ethical, and technical challenges for SMEs, requiring a structured framework for responsible adoption.

Method: Developed SME-TEAM framework with four key pillars: Data, Algorithms, Human Oversight, and Model Architecture to bridge ethical principles with operational practice.

Result: Framework enhances AI capabilities across SME applications and provides a roadmap for secure and responsible technology adoption.

Conclusion: SME-TEAM positions trust and ethics as drivers for resilience, competitiveness, and sustainable innovation in business analytics for SMEs.

Abstract: Artificial Intelligence (AI) and Large Language Models (LLMs) are revolutionizing today’s business practices; however, their adoption within small and medium-sized enterprises (SMEs) raises serious trust, ethical, and technical issues. In this perspective paper, we introduce a structured, multi-phased framework, “SME-TEAM” for the secure and responsible use of these technologies in SMEs. Based on a conceptual structure of four key pillars, i.e., Data, Algorithms, Human Oversight, and Model Architecture, SME-TEAM bridges theoretical ethical principles with operational practice, enhancing AI capabilities across a wide range of applications in SMEs. Ultimately, this paper provides a structured roadmap for the adoption of these emerging technologies, positioning trust and ethics as a driving force for resilience, competitiveness, and sustainable innovation within the area of business analytics and SMEs.

[319] Why Machine Learning Models Fail to Fully Capture Epistemic Uncertainty

Sebastián Jiménez, Mira Jürgens, Willem Waegeman

Main category: cs.LG

TL;DR: Current second-order uncertainty methods fail to capture full epistemic uncertainty due to neglected model bias, leading to misleadingly low epistemic estimates and blurring bias-induced errors into aleatoric uncertainty.

DetailsMotivation: To demonstrate that existing supervised learning methods for disentangling aleatoric and epistemic uncertainty based on second-order distributions inadequately capture epistemic uncertainty, particularly due to model bias.

Method: Using a fine-grained taxonomy of epistemic uncertainty sources, analyzing bias-variance decomposition, and employing simulation-based evaluation with procedural- and data-driven uncertainty components.

Result: Current methods rarely capture full epistemic uncertainty spectrum; high model bias leads to misleadingly low epistemic uncertainty estimates, and second-order methods systematically blur bias-induced errors into aleatoric estimates.

Conclusion: Meaningful aleatoric uncertainty estimates require proper representation of all epistemic uncertainty sources, as current methods systematically underrepresent epistemic uncertainty due to model bias.

Abstract: In recent years various supervised learning methods that disentangle aleatoric and epistemic uncertainty based on second-order distributions have been proposed. We argue that these methods fail to capture critical components of epistemic uncertainty, particularly due to the often-neglected component of model bias. To show this, we make use of a more fine-grained taxonomy of epistemic uncertainty sources in machine learning models, and analyse how the classical bias-variance decomposition of the expected prediction error can be decomposed into different parts reflecting these uncertainties. By using a simulation-based evaluation protocol which encompasses epistemic uncertainty due to both procedural- and data-driven uncertainty components, we illustrate that current methods rarely capture the full spectrum of epistemic uncertainty. Through theoretical insights and synthetic experiments, we show that high model bias can lead to misleadingly low estimates of epistemic uncertainty, and common second-order uncertainty quantification methods systematically blur bias-induced errors into aleatoric estimates, thereby underrepresenting epistemic uncertainty. Our findings underscore that meaningful aleatoric estimates are feasible only if all relevant sources of epistemic uncertainty are properly represented.
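The bias blind spot is easy to reproduce: fit a deliberately misspecified model class on bootstrap resamples and compare ensemble disagreement (a common epistemic proxy) against the bias-induced error. A minimal sketch, with all details (linear fit on quadratic data, noise level) chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
# the true function is quadratic; we deliberately fit a biased (linear) model
x = np.linspace(-1, 1, 200)
f = x**2

def fit_linear(xs, ys):
    # least-squares line: this model class cannot represent x**2
    X = np.stack([xs, np.ones_like(xs)], axis=1)
    w, *_ = np.linalg.lstsq(X, ys, rcond=None)
    return w

preds = []
for _ in range(50):                      # ensemble over bootstrap resamples
    idx = rng.integers(0, x.size, x.size)
    y = f[idx] + 0.05 * rng.standard_normal(x.size)
    a, b = fit_linear(x[idx], y)
    preds.append(a * x + b)
preds = np.array(preds)

epistemic_proxy = preds.var(axis=0).mean()        # ensemble disagreement
bias_sq = ((preds.mean(axis=0) - f) ** 2).mean()  # bias-induced error
print(epistemic_proxy, bias_sq)  # disagreement is tiny, bias is not
```

All ensemble members agree closely (low reported "epistemic" uncertainty) while being systematically wrong, which is exactly the underrepresentation the paper describes.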

[320] DiCoFlex: Model-agnostic diverse counterfactuals with flexible control

Oleksii Furman, Ulvi Movsum-zada, Patryk Marszalek, Maciej Zięba, Marek Śmieja

Main category: cs.LG

TL;DR: DiCoFlex is a model-agnostic conditional generative framework that produces diverse counterfactual explanations in a single forward pass using normalizing flows, enabling real-time customization of constraints.

DetailsMotivation: Existing counterfactual generation methods require constant model access, involve intensive optimization per instance, and lack flexibility for new user constraints without retraining.

Method: Leverages conditional normalizing flows trained on labeled data to generate multiple diverse counterfactuals simultaneously, allowing real-time customization of constraints like sparsity and actionability.

Result: Outperforms existing methods on benchmark datasets in validity, diversity, proximity, and constraint adherence.

Conclusion: DiCoFlex provides a practical and scalable solution for counterfactual generation in sensitive decision-making domains.

Abstract: Counterfactual explanations play a pivotal role in explainable artificial intelligence (XAI) by offering intuitive, human-understandable alternatives that elucidate machine learning model decisions. Despite their significance, existing methods for generating counterfactuals often require constant access to the predictive model, involve computationally intensive optimization for each instance and lack the flexibility to adapt to new user-defined constraints without retraining. In this paper, we propose DiCoFlex, a novel model-agnostic, conditional generative framework that produces multiple diverse counterfactuals in a single forward pass. Leveraging conditional normalizing flows trained solely on labeled data, DiCoFlex addresses key limitations by enabling real-time user-driven customization of constraints such as sparsity and actionability at inference time. Extensive experiments on standard benchmark datasets show that DiCoFlex outperforms existing methods in terms of validity, diversity, proximity, and constraint adherence, making it a practical and scalable solution for counterfactual generation in sensitive decision-making domains.

[321] Automatic Discovery of One-Parameter Subgroups of Lie Groups: Compact and Non-Compact Cases of $\mathbf{SO(n)}$ and $\mathbf{SL(n)}$

Pavan Karjol, Vivek V Kashyap, Rohan Kashyap, Prathosh A P

Main category: cs.LG

TL;DR: A framework for automatic discovery of one-parameter subgroups of SO(n) using Jordan forms of skew-symmetric matrices, with applications in robotics, quantum mechanics, and molecular analysis.

DetailsMotivation: One-parameter subgroups of SO(n) are crucial in various applications but are difficult to discover automatically. Existing methods lack systematic approaches for identifying these symmetry structures.

Method: Uses standard Jordan form of skew-symmetric matrices (defining SO(n) Lie algebra) to establish canonical forms for orbits and derive standardized representations for invariant functions, then learns parameters to uncover subgroups.

Result: Successfully recovers meaningful subgroup structure in applications including double pendulum modeling, moment of inertia prediction, top quark tagging, and invariant polynomial regression.

Conclusion: The framework effectively discovers one-parameter subgroups of SO(n) and produces interpretable, symmetry-aware representations across multiple domains.

Abstract: We introduce a novel framework for the automatic discovery of one-parameter subgroups ($H_{\gamma}$) of $SO(3)$ and, more generally, $SO(n)$. One-parameter subgroups of $SO(n)$ are crucial in a wide range of applications, including robotics, quantum mechanics, and molecular structure analysis. Our method utilizes the standard Jordan form of skew-symmetric matrices, which define the Lie algebra of $SO(n)$, to establish a canonical form for orbits under the action of $H_{\gamma}$. This canonical form is then employed to derive a standardized representation for $H_{\gamma}$-invariant functions. By learning the appropriate parameters, the framework uncovers the underlying one-parameter subgroup $H_{\gamma}$. The effectiveness of the proposed approach is demonstrated through tasks such as double pendulum modeling, moment of inertia prediction, top quark tagging and invariant polynomial regression, where it successfully recovers meaningful subgroup structure and produces interpretable, symmetry-aware representations.
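Concretely, a one-parameter subgroup is the curve t ↦ exp(t·hat(w)) generated by a skew-symmetric matrix. The sketch below uses Rodrigues' formula for the SO(3) case (a standard closed form, simpler than the paper's general SO(n) Jordan-form machinery) and checks the defining group properties:

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix (so(3) Lie algebra element) from axis vector w."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def one_param_subgroup(w, t):
    """R(t) = exp(t * hat(w)) via Rodrigues' formula: the one-parameter
    subgroup of SO(3) rotating about the axis w."""
    th = t * np.linalg.norm(w)
    if th < 1e-12:
        return np.eye(3)
    k = hat(w / np.linalg.norm(w))
    return np.eye(3) + np.sin(th) * k + (1 - np.cos(th)) * (k @ k)

w = np.array([0.0, 0.0, 1.0])          # rotation about the z-axis
R = one_param_subgroup(w, 0.5)
# group properties: orthogonality and the homomorphism R(s) R(t) = R(s+t)
print(np.allclose(R @ R.T, np.eye(3)))
print(np.allclose(one_param_subgroup(w, 0.2) @ one_param_subgroup(w, 0.3),
                  one_param_subgroup(w, 0.5)))
```

The paper's framework learns the generator (here fixed to the z-axis by hand) so that functions invariant under the resulting subgroup can be written in a canonical form.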

[322] Model-Informed Flows for Bayesian Inference

Joohwan Ko, Justin Domke

Main category: cs.LG

TL;DR: The paper introduces Model-Informed Flow (MIF), a novel variational inference architecture that combines VIP with Gaussian flows and incorporates prior information to better handle complex hierarchical Bayesian models.

DetailsMotivation: Variational inference struggles with posterior geometry in complex hierarchical Bayesian models, and while recent advances like flow-based families and VIP address aspects of this challenge, their relationship remains unexplored.

Method: Proved that VIP combined with full-rank Gaussian can be represented as forward autoregressive flow with translation term and prior input. Introduced MIF architecture that adds translation mechanism, prior information, and hierarchical ordering.

Result: Empirically, MIF delivers tighter posterior approximations and matches or exceeds state-of-the-art performance across hierarchical and non-hierarchical benchmarks.

Conclusion: The theoretical connection between VIP and flow-based methods enables the development of MIF, which effectively handles complex posterior geometries in hierarchical Bayesian models.

Abstract: Variational inference often struggles with the posterior geometry exhibited by complex hierarchical Bayesian models. Recent advances in flow-based variational families and Variationally Inferred Parameters (VIP) each address aspects of this challenge, but their formal relationship is unexplored. Here, we prove that the combination of VIP and a full-rank Gaussian can be represented exactly as a forward autoregressive flow augmented with a translation term and input from the model’s prior. Guided by this theoretical insight, we introduce the Model-Informed Flow (MIF) architecture, which adds the necessary translation mechanism, prior information, and hierarchical ordering. Empirically, MIF delivers tighter posterior approximations and matches or exceeds state-of-the-art performance across a suite of hierarchical and non-hierarchical benchmarks.

[323] CardioForest: An Explainable Ensemble Learning Model for Automatic Wide QRS Complex Tachycardia Diagnosis from ECG

Vaskar Chakma, Ju Xiaolin, Heling Cao, Xue Feng, Ji Xiaodong, Pan Haiyan, Gao Zhan

Main category: cs.LG

TL;DR: Developed an ensemble ML framework (CardioForest) for automatic WCT detection from ECG signals, achieving high accuracy (95.19%) with explainable AI using SHAP for clinical interpretability.

DetailsMotivation: To create an accurate and interpretable automated system for detecting Wide QRS Complex Tachycardia from ECG signals, addressing the need for reliable diagnostic tools in clinical practice, especially in emergency scenarios.

Method: Used ensemble learning techniques including optimized Random Forest (CardioForest), XGBoost, and LightGBM trained on MIMIC-IV ECG dataset. Employed SHAP for explainable AI to ensure clinical relevance and interpretability.

Result: CardioForest achieved best performance: 95.19% accuracy, 88.76% balanced accuracy, 95.26% precision, 78.42% recall, and 0.8886 ROC-AUC. SHAP analysis confirmed clinically relevant feature importance ranking.

Conclusion: CardioForest is a highly dependable and interpretable WCT detection model that provides accurate predictions with transparency, making it valuable for cardiologists in clinical decision-making, particularly in emergency situations.

Abstract: This study aims to develop and evaluate an ensemble machine learning-based framework for the automatic detection of Wide QRS Complex Tachycardia (WCT) from ECG signals, emphasizing diagnostic accuracy and interpretability using Explainable AI. The proposed system integrates ensemble learning techniques, i.e., an optimized Random Forest known as CardioForest, and models like XGBoost and LightGBM. The models were trained and tested on ECG data from the publicly available MIMIC-IV dataset. Performance was evaluated using accuracy, balanced accuracy, precision, recall, F1 score, ROC-AUC, and error (RMSE, MAE) measures. In addition, SHAP (SHapley Additive exPlanations) was used to provide model explainability and establish clinical relevance. The CardioForest model performed best on all metrics, achieving a test accuracy of 95.19%, a balanced accuracy of 88.76%, a precision of 95.26%, a recall of 78.42%, and an ROC-AUC of 0.8886. SHAP analysis confirmed the model’s ability to rank the most relevant ECG features, such as QRS duration, in accordance with clinical intuitions, thereby fostering trust and usability in clinical practice. The findings establish CardioForest as a highly dependable and interpretable WCT detection model. Its combination of accurate predictions and transparency through explainability makes it a valuable tool for helping cardiologists make timely and well-informed diagnoses, especially in high-stakes and emergency scenarios.

[324] Inference-Time Reward Hacking in Large Language Models

Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio du Pin Calmon

Main category: cs.LG

TL;DR: The paper addresses reward hacking in LLM alignment, introduces Best-of-Poisson (BoP) as an efficient approximation of optimal policy, and proposes HedgeTune to mitigate reward hacking through optimal inference-time parameter selection.

DetailsMotivation: Reward models are imperfect proxies for complex desiderata like correctness and safety, and overoptimizing for misspecified rewards can subvert alignment goals through reward hacking.

Method: Introduces Best-of-Poisson (BoP) as an efficient approximation of optimal reward-KL divergence policy, and HedgeTune algorithm to find optimal inference-time parameters for mitigating reward hacking.

Result: Shows that reward hacking pattern (true reward increasing then declining) is inevitable in inference-time mechanisms, and demonstrates that hedging achieves superior reward-distortion tradeoffs on math, reasoning, and human-preference tasks.

Conclusion: Hedging on proxy rewards effectively mitigates reward hacking and improves alignment performance across various domains by finding optimal inference-time parameters.

Abstract: A common paradigm to improve the performance of large language models is optimizing for a reward model. Reward models assign a numerical score to an LLM’s output that indicates, for example, how likely it is to align with user preferences or safety goals. However, reward models are never perfect. They inevitably function as proxies for complex desiderata such as correctness, helpfulness, and safety. By overoptimizing for a misspecified reward, we can subvert intended alignment goals and reduce overall performance, a phenomenon commonly referred to as reward hacking. In this work, we characterize reward hacking in inference-time alignment and demonstrate when and how we can mitigate it by hedging on the proxy reward. We study this phenomenon under Best-of-$n$ (BoN) and Soft Best-of-$n$ (SBoN), and we introduce Best-of-Poisson (BoP) that provides an efficient, near-exact approximation of the optimal reward-KL divergence policy at inference time. We show that the characteristic pattern of hacking as observed in practice (where the true reward first increases before declining) is an inevitable property of a broad class of inference-time mechanisms, including BoN and BoP. To counter this effect, we introduce HedgeTune, an efficient algorithm to find the optimal inference-time parameter. We demonstrate that hedging mitigates reward hacking and achieves superior reward-distortion tradeoffs on math, reasoning, and human-preference setups.
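The rise-then-fall pattern under Best-of-n is easy to reproduce in a toy: select by a proxy equal to a bounded true reward plus heavy-tailed error. All modelling choices below (uniform true reward, Cauchy proxy error, the scales) are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(3)
trials = 1000

def best_of_n(n):
    """Best-of-n: draw n candidates, keep the one with the highest
    proxy reward, and report its *true* reward, averaged over trials."""
    true = rng.uniform(0.0, 1.0, size=(trials, n))       # bounded true reward
    err = 0.02 * rng.standard_cauchy(size=(trials, n))   # heavy-tailed proxy error
    proxy = true + err
    pick = proxy.argmax(axis=1)
    return true[np.arange(trials), pick].mean()

for n in (1, 16, 4096):
    print(n, best_of_n(n))
# the true reward first rises with n, then falls back once selection
# starts chasing the heavy-tailed proxy error instead of true quality
```

Mitigations like SBoN or the paper's HedgeTune amount to choosing the inference-time parameter (here, n) near the peak of this curve rather than pushing it to the maximum.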

[325] Efficient Latent Variable Causal Discovery: Combining Score Search and Targeted Testing

Joseph Ramsey, Bryan Andrews, Peter Spirtes

Main category: cs.LG

TL;DR: The paper introduces score-guided mixed-strategy causal search algorithms that improve upon FCI for latent variable settings, including BOSS-FCI, GRaSP-FCI, FCIT, and LV-Dumb, achieving better precision and efficiency.

DetailsMotivation: FCI algorithm struggles with spurious independences and unreliable orientations when latent variables or selection bias are present, requiring improvements in efficiency and correctness.

Method: Developed BOSS-FCI and GRaSP-FCI variants using score-based search, FCIT with targeted testing guided by BOSS, and LV-Dumb heuristic that returns PAG of BOSS DAG.

Result: BOSS-FCI and GRaSP-FCI provide robust baselines, FCIT achieves best precision-reliability balance, and LV-Dumb offers fast near-equivalent performance to FCIT.

Conclusion: Targeted and score-guided strategies significantly improve efficiency and correctness of latent-variable causal discovery compared to traditional FCI approaches.

Abstract: Learning causal structure from observational data is especially challenging when latent variables or selection bias are present. The Fast Causal Inference (FCI) algorithm addresses this setting but performs exhaustive conditional independence tests across many subsets, often leading to spurious independences, missing or extra edges, and unreliable orientations. We present a family of score-guided mixed-strategy causal search algorithms that extend this framework. First, we introduce BOSS-FCI and GRaSP-FCI, variants of GFCI (Greedy Fast Causal Inference) that substitute BOSS (Best Order Score Search) or GRaSP (Greedy Relaxations of Sparsest Permutation) for FGES (Fast Greedy Equivalence Search), preserving correctness while trading off scalability and conservativeness. Second, we develop FCI Targeted-Testing (FCIT), a novel hybrid method that replaces exhaustive testing with targeted, score-informed tests guided by BOSS. FCIT guarantees well-formed PAGs and achieves higher precision and efficiency across sample sizes. Finally, we propose a lightweight heuristic, LV-Dumb (Latent Variable “Dumb”), which returns the PAG of the BOSS DAG (Directed Acyclic Graph). Though not strictly sound for latent confounding, LV-Dumb often matches FCIT’s accuracy while running substantially faster. Simulations and real-data analyses show that BOSS-FCI and GRaSP-FCI provide robust baselines, FCIT yields the best balance of precision and reliability, and LV-Dumb offers a fast, near-equivalent alternative. Together, these methods demonstrate that targeted and score-guided strategies can dramatically improve the efficiency and correctness of latent-variable causal discovery.
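The primitive these algorithms spend their budget on is a conditional-independence test; targeted variants like FCIT issue far fewer of them. Below is a minimal Fisher-z partial-correlation test on a synthetic chain, a generic sketch of that primitive rather than the paper's implementation:

```python
from math import erf

import numpy as np

rng = np.random.default_rng(6)

def fisher_z_ci_test(data, i, j, cond, alpha=0.001):
    """Fisher-z test of X_i independent of X_j given X_cond: the kind of
    conditional-independence query FCI-style algorithms issue repeatedly.
    Returns True when independence is *not* rejected."""
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(np.corrcoef(data[:, idx], rowvar=False))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])  # partial correlation
    n = data.shape[0]
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(cond) - 3)
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / np.sqrt(2))))  # two-sided p-value
    return p > alpha

# chain X -> Y -> Z: X and Z are dependent, but independent given Y
n = 5000
X = rng.standard_normal(n)
Y = X + 0.5 * rng.standard_normal(n)
Z = Y + 0.5 * rng.standard_normal(n)
data = np.column_stack([X, Y, Z])
print(fisher_z_ci_test(data, 0, 2, []))    # dependence detected
print(fisher_z_ci_test(data, 0, 2, [1]))   # X independent of Z given Y
```

Exhaustive FCI runs such tests over many conditioning sets; the score-guided methods in the paper use BOSS to decide which few of these queries are actually informative.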

[326] Tensor Decomposition Networks for Fast Machine Learning Interatomic Potential Computations

Yuchao Lin, Cong Fu, Zachary Krueger, Haiyang Yu, Maho Nakata, Jianwen Xie, Emine Kucukbenli, Xiaofeng Qian, Shuiwang Ji

Main category: cs.LG

TL;DR: TDNs replace computationally expensive Clebsch-Gordan tensor products in SO(3)-equivariant networks with low-rank tensor decompositions, achieving dramatic speedup while maintaining competitive performance on molecular datasets.

DetailsMotivation: Clebsch-Gordan tensor products in SO(3)-equivariant networks are computationally expensive, limiting their efficiency for machine learning interatomic potentials.

Method: Develop tensor decomposition networks (TDNs) using low-rank tensor decompositions like CP decomposition, with path-weight sharing to reduce parameters and computational complexity from O(L^6) to O(L^4).

Result: TDNs achieve competitive performance with dramatic speedup on PubChemQCR, OC20, and OC22 datasets, handling 105 million DFT-calculated snapshots efficiently.

Conclusion: TDNs provide an effective plug-and-play replacement for tensor products in existing networks, offering significant computational acceleration while maintaining equivariance and performance.

Abstract: $\rm{SO}(3)$-equivariant networks are the dominant models for machine learning interatomic potentials (MLIPs). The key operation of such networks is the Clebsch-Gordan (CG) tensor product, which is computationally expensive. To accelerate the computation, we develop tensor decomposition networks (TDNs) as a class of approximately equivariant networks in which CG tensor products are replaced by low-rank tensor decompositions, such as the CANDECOMP/PARAFAC (CP) decomposition. With the CP decomposition, we prove (i) a uniform bound on the induced error of $\rm{SO}(3)$-equivariance, and (ii) the universality of approximating any equivariant bilinear map. To further reduce the number of parameters, we propose path-weight sharing that ties all multiplicity-space weights across the $\mathcal{O}(L^3)$ CG paths into a single path without compromising equivariance, where $L$ is the maximum angular degree. The resulting layer acts as a plug-and-play replacement for tensor products in existing networks, and the computational complexity of tensor products is reduced from $\mathcal{O}(L^6)$ to $\mathcal{O}(L^4)$. We evaluate TDNs on PubChemQCR, a newly curated molecular relaxation dataset containing 105 million DFT-calculated snapshots. We also use existing datasets, including OC20, and OC22. Results show that TDNs achieve competitive performance with dramatic speedup in computations. Our code is publicly available as part of the AIRS library (\href{https://github.com/divelab/AIRS/tree/main/OpenMol/TDN}{https://github.com/divelab/AIRS/}).
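The core trick, replacing a dense bilinear (tensor-product-style) contraction with a CP-factored one, can be shown in a few lines of linear algebra. This generic sketch ignores equivariance and Clebsch-Gordan structure entirely; it only illustrates the cost reduction:

```python
import numpy as np

rng = np.random.default_rng(4)
d, R = 6, 3

# factor matrices of a CP (CANDECOMP/PARAFAC) decomposition:
# T[i, j, k] = sum_r U[i, r] * V[j, r] * W[k, r]
U, V, W = (rng.standard_normal((d, R)) for _ in range(3))
T = np.einsum('ir,jr,kr->ijk', U, V, W)

x, y = rng.standard_normal(d), rng.standard_normal(d)

# dense bilinear map: contracts the full d x d x d tensor, O(d^3) work
dense = np.einsum('ijk,i,j->k', T, x, y)

# CP form: project onto each rank-1 factor first, O(d * R) work
cp = W @ ((U.T @ x) * (V.T @ y))

print(np.allclose(dense, cp))   # True: the factored form is exact here
```

Here the tensor is exactly rank R, so the two evaluations agree; in TDNs the CP factors approximate the CG tensor product, trading a bounded equivariance error for the complexity drop from O(L^6) to O(L^4).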

[327] DE3S: Dual-Enhanced Soft-Sparse-Shape Learning for Medical Early Time-Series Classification

Tao Xie, Zexi Tan, Haoyi Xiao, Binbin Sun, Yiqun Zhang

Main category: cs.LG

TL;DR: DE3S is a dual-enhanced soft-sparse sequence learning framework for early time series classification that addresses the accuracy-earliness trade-off in medical applications like sepsis detection.

DetailsMotivation: Early Time Series Classification faces inherent trade-offs between accuracy and earliness, with existing methods struggling to model weak early signals while capturing both local subject-specific variations and global temporal patterns.

Method: Proposes a dual enhancement mechanism for weak early signals, attention-based patch module for noise reduction, dual-path fusion with sparse mixture of experts for local variations, and multi-scale inception module for global dependencies.

Result: Experiments on six real-world medical datasets show competitive performance, especially in early prediction windows, with ablation studies confirming each component’s effectiveness.

Conclusion: DE3S systematically addresses the core challenges in ETSC by enhancing early signal modeling while preserving discriminative information and capturing both local and global temporal patterns.

Abstract: Early Time Series Classification (ETSC) is critical in time-sensitive medical applications such as sepsis, yet it presents an inherent trade-off between accuracy and earliness. This trade-off arises from two core challenges: 1) models should effectively model inherently weak and noisy early-stage snippets, and 2) they should resolve the complex, dual requirement of simultaneously capturing local, subject-specific variations and overarching global temporal patterns. Existing methods struggle to overcome these underlying challenges, often forcing a severe compromise: sacrificing accuracy to achieve earliness, or vice-versa. We propose DE3S, a Dual-Enhanced Soft-Sparse Sequence Learning framework, which systematically solves these challenges. A dual enhancement mechanism is proposed to enhance the modeling of weak, early signals. Then, an attention-based patch module is introduced to preserve discriminative information while reducing noise and complexity. A dual-path fusion architecture is designed, using a sparse mixture of experts to model local, subject-specific variations. A multi-scale inception module is also employed to capture global dependencies. Experiments on six real-world medical datasets show the competitive performance of DE3S, particularly in early prediction windows. Ablation studies confirm the effectiveness of each component in addressing its targeted challenge. The source code is available at https://github.com/kuxit/DE3S.
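The sparse mixture-of-experts path can be illustrated with minimal top-k routing: a gate scores experts, only the best k run, and their outputs are blended by the renormalized gate weights. The expert functions, gate shape, and k below are assumptions for illustration, not DE3S's architecture.

```python
import numpy as np

def sparse_moe(x, experts, gate_w, k=1):
    """Minimal top-k sparse mixture-of-experts routing (illustrative)."""
    scores = x @ gate_w                     # (n_experts,) gating logits
    top = np.argsort(scores)[-k:]           # indices of the k best experts
    z = np.exp(scores[top] - scores[top].max())
    w = z / z.sum()                         # renormalized gate weights
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

# Toy setup: 2-d input, 3 experts, gate strongly prefers expert 0
x = np.array([1.0, 0.0])
gate_w = np.array([[5.0, 0.0, -5.0],
                   [0.0, 0.0,  0.0]])
experts = [lambda v: v * 2, lambda v: v + 1, lambda v: -v]

out = sparse_moe(x, experts, gate_w, k=1)
assert np.allclose(out, x * 2)   # only expert 0 is executed and returned
```

With k=1 only one expert runs per input, which is what makes the mixture "sparse": capacity grows with the number of experts while per-sample compute stays constant.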

[328] Compliance Minimization via Physics-Informed Gaussian Processes

Xiangyu Sun, Amin Yousefpour, Shirin Hosseinmardi, Ramin Bostanabad

Main category: cs.LG

TL;DR: Proposes a mesh-free physics-informed Gaussian process framework for compliance minimization that uses neural networks as mean functions to control design complexity and achieve high-quality topologies.

DetailsMotivation: Existing ML methods for compliance minimization provide poor feature boundaries, are expensive, and lack systematic control over design complexity.

Method: Parameterizes design and state variables with GP priors sharing a multi-output neural network mean function based on PGCANs, with simultaneous minimization of compliance, potential energy, and volume fraction constraint.

Result: Achieves super-resolution topologies with fast convergence, comparable compliance to traditional methods with less gray area, control over fine-scale features, and outperforms competing ML methods.

Conclusion: The proposed physics-informed GP framework effectively addresses limitations of existing ML approaches for compliance minimization while providing interpretable complexity control.

Abstract: Machine learning (ML) techniques have recently gained significant attention for solving compliance minimization (CM) problems. However, these methods typically provide poor feature boundaries, are very expensive, and lack a systematic mechanism to control the design complexity. Herein, we address these limitations by proposing a mesh-free and simultaneous framework based on physics-informed Gaussian processes (GPs). In our approach, we parameterize the design and state variables with GP priors which have independent kernels but share a multi-output neural network (NN) as their mean function. The architecture of this NN is based on Parametric Grid Convolutional Attention Networks (PGCANs) which not only mitigate spectral bias issues, but also provide an interpretable mechanism to control design complexity. We estimate all the parameters of our GP-based representations by simultaneously minimizing the compliance, total potential energy, and residual of volume fraction constraint. Importantly, our loss function excludes all data-based residuals as GPs automatically satisfy them. We also develop computational schemes based on curriculum training and numerical integration to increase the efficiency and robustness of our approach which is shown to (1) produce super-resolution topologies with fast convergence, (2) achieve comparable compliance and less gray area fraction compared to traditional numerical methods, (3) provide control over fine-scale features, and (4) outperform competing ML-based methods.

[329] ADPO: Anchored Direct Preference Optimization

Wang Zixian

Main category: cs.LG

TL;DR: ADPO extends DPO to handle soft, listwise supervision using reference anchoring, unifying multiple learning paradigms and providing theoretical guarantees on stability and implicit trust regions.

DetailsMotivation: DPO's reliance on hard pairwise preferences makes it brittle to annotator noise and distribution shift, limiting its robustness in real-world applications.

Method: Proposes Anchored Direct Preference Optimization (ADPO) using reference anchoring to extend preference learning to soft, listwise supervision, with theoretical analysis of anchor dynamics and implicit trust regions.

Result: ADPO unifies major learning paradigms (SFT, knowledge distillation, MaxEnt RL, DPO) and achieves 170-5000x reduction in teacher-student KL divergence, with dynamic anchors for online exploration and fixed anchors for offline distillation.

Conclusion: ADPO provides a theoretically grounded framework that addresses DPO’s brittleness while unifying multiple learning approaches, with empirical results showing significant improvements in knowledge distillation efficiency.

Abstract: Direct Preference Optimization (DPO) has become a standard for aligning models with human feedback, yet its reliance on hard, pairwise preferences makes it brittle to annotator noise and distribution shift. We propose Anchored Direct Preference Optimization (ADPO), a theoretically grounded framework that extends preference learning to soft, listwise supervision through reference anchoring. Our key theoretical contributions are threefold: (1) we establish that ADPO unifies major learning paradigms, including supervised fine-tuning, knowledge distillation, maximum-entropy reinforcement learning, and DPO, as special cases through different choices of target distribution, anchor policy, and temperature; (2) we prove that anchoring induces an implicit trust region governed by the softmax Fisher metric; and (3) we formalize the stability of dynamic anchor updates. Empirically, we discover a task-dependent tradeoff: dynamic anchors suit online exploration, while fixed anchors excel at offline distillation, reducing teacher-student KL divergence by two to three orders of magnitude (170 to 5000 times).
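A minimal sketch of what an anchored, soft listwise objective could look like, assuming the loss scores each of K candidates by its log-ratio to the anchor policy and matches the induced softmax to a soft target distribution. The function name `adpo_loss` and its exact form are assumptions, not the paper's formulation.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def adpo_loss(policy_logps, anchor_logps, target_probs, tau=1.0):
    # Score each candidate by its log-ratio to the anchor policy, then match
    # the induced softmax to the soft listwise target via cross-entropy.
    scores = (policy_logps - anchor_logps) / tau
    return -(target_probs * log_softmax(scores)).sum(axis=-1).mean()

rng = np.random.default_rng(0)
policy = log_softmax(rng.standard_normal((2, 4)))   # (batch, K) candidate log-probs
anchor = log_softmax(rng.standard_normal((2, 4)))
target = np.exp(log_softmax(rng.standard_normal((2, 4))))  # soft supervision

loss = adpo_loss(policy, anchor, target)
# Sanity check: when the policy equals the anchor, all scores vanish, the
# induced distribution is uniform, and the loss reduces to log K.
assert np.isclose(adpo_loss(policy, policy, np.full((2, 4), 0.25)), np.log(4))
```

The anchor log-probs act as a reference point: deviations from the anchor are what get penalized, which is consistent with the implicit-trust-region reading in the abstract.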

[330] Composing Linear Layers from Irreducibles

Travis Pence, Daisuke Yamada, Vikas Singh

Main category: cs.LG

TL;DR: The paper shows that linear layers in neural networks can be decomposed into compositions of geometric primitives called bivectors and rotors using Clifford algebra, achieving parameter efficiency comparable to strong baselines.

DetailsMotivation: To understand the fundamental building blocks of large models and identify how low-level geometric primitives compose into richer functionality, particularly in linear layers.

Method: Using Clifford algebra to express linear layers as compositions of bivectors, and introducing a differentiable algorithm that decomposes them into products of rotors with O(log²d) parameters instead of O(d²).

Result: Rotor-based layers applied to key, query, and value projections in LLM attention layers match the performance of strong baselines like block-Hadamard and low-rank approximations.

Conclusion: The work provides an algebraic perspective on how geometric primitives can compose into higher-level functions within deep models, offering parameter-efficient alternatives to dense matrices.

Abstract: Contemporary large models often exhibit behaviors suggesting the presence of low-level primitives that compose into modules with richer functionality, but these fundamental building blocks remain poorly understood. We investigate this compositional structure in linear layers by asking: can we identify/synthesize linear transformations from a minimal set of geometric primitives? Using Clifford algebra, we show that linear layers can be expressed as compositions of bivectors – geometric objects encoding oriented planes – and introduce a differentiable algorithm that decomposes them into products of rotors. This construction uses only O(log^2 d) parameters, versus O(d^2) required by dense matrices. Applied to the key, query, and value projections in LLM attention layers, our rotor-based layers match the performance of strong baselines such as block-Hadamard and low-rank approximations. Our findings provide an algebraic perspective on how these geometric primitives can compose into higher-level functions within deep models.
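Rotors compose rotations. As a simplified stand-in (plane/Givens rotations rather than Clifford rotors), the sketch below composes O(log^2 d) parameterized rotations into an orthogonal map, illustrating the parameter budget versus the O(d^2) of a dense matrix; it is not the paper's decomposition algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

W = np.eye(d)
params = 0
for _ in range(int(np.log2(d))):          # log d stages ...
    for _ in range(int(np.log2(d))):      # ... of log d plane rotations each
        i, j = rng.choice(d, size=2, replace=False)
        theta = rng.standard_normal()     # one angle = one learnable parameter
        R = np.eye(d)
        c, s = np.cos(theta), np.sin(theta)
        R[i, i], R[i, j], R[j, i], R[j, j] = c, -s, s, c
        W = R @ W
        params += 1

# The composed map is orthogonal (norm-preserving), like a rotor sandwich,
# yet is described by O(log^2 d) angles instead of O(d^2) dense weights.
assert params == int(np.log2(d)) ** 2
x = rng.standard_normal(d)
assert np.isclose(np.linalg.norm(W @ x), np.linalg.norm(x))
```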

[331] OrdShap: Feature Position Importance for Sequential Black-Box Models

Davin Hill, Brian L. Hill, Aria Masoomi, Vijay S. Nori, Robert E. Tillman, Jennifer Dy

Main category: cs.LG

TL;DR: OrdShap is a novel attribution method that disentangles feature value effects from feature position effects in sequential deep learning models by quantifying how predictions change when permuting feature positions.

DetailsMotivation: Existing feature attribution methods for sequential models conflate feature values with their positions in input sequences, limiting understanding of model behavior.

Method: OrdShap uses permutation-based attribution to quantify how model predictions change when feature positions are altered, with game-theoretic connections to Sánchez-Bergantiños values.

Result: Empirical evaluation on health, natural language, and synthetic datasets shows OrdShap effectively captures both feature value and position attributions.

Conclusion: OrdShap provides deeper insight into sequential model behavior by disentangling feature value and position effects through theoretically grounded position-sensitive attribution.

Abstract: Sequential deep learning models excel in domains with temporal or sequential dependencies, but their complexity necessitates post-hoc feature attribution methods for understanding their predictions. While existing techniques quantify feature importance, they inherently assume fixed feature ordering - conflating the effects of (1) feature values and (2) their positions within input sequences. To address this gap, we introduce OrdShap, a novel attribution method that disentangles these effects by quantifying how a model’s predictions change in response to permuting feature position. We establish a game-theoretic connection between OrdShap and Sánchez-Bergantiños values, providing a theoretically grounded approach to position-sensitive attribution. Empirical results from health, natural language, and synthetic datasets highlight OrdShap’s effectiveness in capturing feature value and feature position attributions, and provide deeper insight into model behavior.
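The core probe, how a prediction changes when one feature's position is permuted, can be sketched directly. This is not the full OrdShap estimator (which uses Shapley-style averaging over permutations), just the position-sensitivity measurement it builds on; the toy model below is hypothetical.

```python
def position_effect(model, seq, feat_idx):
    """Move one element to every position and record the prediction shift."""
    base = model(seq)
    effects = []
    for pos in range(len(seq)):
        moved = list(seq)
        v = moved.pop(feat_idx)
        moved.insert(pos, v)
        effects.append(model(moved) - base)
    return effects

# Toy position-sensitive model: a position-weighted sum of the sequence
model = lambda s: sum(v * (i + 1) for i, v in enumerate(s))

eff = position_effect(model, [5, 0, 0, 0], feat_idx=0)
assert eff[0] == 0        # keeping the original position changes nothing
assert eff[-1] == 15      # moving 5 from weight 1 to weight 4 adds 5 * 3
```

A value-only attribution method would assign the same importance to the 5 regardless of where it sits; the nonzero position effects are exactly what OrdShap separates out.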

[332] Navigating High Dimensional Concept Space with Metalearning

Max Gupta

Main category: cs.LG

TL;DR: Meta-learning improves few-shot concept learning for compositional complexity but struggles with featural complexity, with extended gradient adaptation helping exploration in rough loss landscapes.

DetailsMotivation: To investigate whether gradient-based meta-learning can provide neural networks with inductive biases for efficient few-shot acquisition of discrete concepts, comparing meta-learning against supervised learning on Boolean concepts.

Method: Used Boolean concepts generated by probabilistic context-free grammar (PCFG), systematically varying concept dimensionality and recursive compositionality. Compared meta-learning methods with supervised baseline, analyzed representations and loss landscapes, and tested extended adaptation steps in meta-SGD.

Result: Meta-learners handle compositional complexity much better than featural complexity. Featural complexity increases loss landscape roughness, making curvature-aware optimization more effective. Increasing adaptation steps in meta-SGD improves out-of-distribution generalization on complex concepts.

Conclusion: Meta-learning is effective for compositional complexity but limited for featural complexity in high-dimensional concept spaces, with second-order methods and extended gradient adaptation playing important roles in few-shot concept learning.

Abstract: Rapidly learning abstract concepts from limited examples is a hallmark of human intelligence. This work investigates whether gradient-based meta-learning can equip neural networks with inductive biases for efficient few-shot acquisition of discrete concepts. I compare meta-learning methods against a supervised learning baseline on Boolean concepts (logical statements) generated by a probabilistic context-free grammar (PCFG). By systematically varying concept dimensionality (number of features) and recursive compositionality (depth of grammar recursion), I delineate between complexity regimes in which meta-learning robustly improves few-shot concept learning and regimes in which it does not. Meta-learners are much better able to handle compositional complexity than featural complexity. I highlight some reasons for this with a representational analysis of the weights of meta-learners and a loss landscape analysis demonstrating how featural complexity increases the roughness of loss trajectories, allowing curvature-aware optimization to be more effective than first-order methods. I find improvements in out-of-distribution generalization on complex concepts by increasing the number of adaptation steps in meta-SGD, where adaptation acts as a way of encouraging exploration of rougher loss basins. Overall, this work highlights the intricacies of learning compositional versus featural complexity in high dimensional concept spaces and provides a road to understanding the role of 2nd order methods and extended gradient adaptation in few-shot concept learning.
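The role of extra adaptation steps can be seen in a minimal MAML/meta-SGD-style inner loop (meta-SGD's learned per-parameter learning rates are omitted for brevity); the quadratic task is purely illustrative.

```python
def adapt(w, task_grad, lr, steps):
    """Inner-loop adaptation: more steps allow deeper descent into a
    task's loss basin, at the cost of more exploration per task."""
    for _ in range(steps):
        w = w - lr * task_grad(w)
    return w

# Toy task: loss (w - 3)^2 with gradient 2 * (w - 3); optimum at w = 3
grad = lambda w: 2.0 * (w - 3.0)

w0 = 0.0  # meta-learned initialization (here just a fixed starting point)
w_few = adapt(w0, grad, lr=0.1, steps=1)
w_many = adapt(w0, grad, lr=0.1, steps=20)

# Extra adaptation steps get substantially closer to the task optimum
assert abs(w_many - 3.0) < abs(w_few - 3.0)
```

On rough loss landscapes the same mechanism acts as exploration: a single step cannot escape the neighborhood of the initialization, while a longer inner loop can traverse toward better basins, matching the abstract's finding for complex concepts.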

[333] Diagrams-to-Dynamics (D2D): Exploring Causal Loop Diagram Leverage Points under Uncertainty

Jeroen F. Uleman, Loes Crielaard, Leonie K. Elsenburg, Guido A. Veldhuis, Naja Hulvej Rod, Rick Quax, Vítor V. Vasconcelos

Main category: cs.LG

TL;DR: D2D converts qualitative causal loop diagrams into exploratory system dynamics models without empirical data, enabling dynamic analysis and intervention testing.

DetailsMotivation: Causal loop diagrams are limited as static representations that cannot support dynamic analysis or inform intervention strategies effectively.

Method: Diagrams-to-Dynamics (D2D) method converts CLDs into system dynamics models using structural information (link existence and polarity) with minimal user input for variable labeling.

Result: D2D helps distinguish between high- and low-ranked leverage points, shows greater consistency with data-driven models than static network analysis, and provides uncertainty estimates.

Conclusion: D2D lowers the barrier to dynamic modeling for CLD researchers and provides guidance for future data collection, with open-source implementation available for broader testing.

Abstract: Causal loop diagrams (CLDs) are widely used in health and environmental research to represent hypothesized causal structures underlying complex problems. However, as qualitative and static representations, CLDs are limited in their ability to support dynamic analysis and inform intervention strategies. We propose Diagrams-to-Dynamics (D2D), a method for converting CLDs into exploratory system dynamics models (SDMs) in the absence of empirical data. With minimal user input - following a protocol to label variables as stocks, flows or auxiliaries, and constants - D2D leverages the structural information already encoded in CLDs, namely, link existence and polarity, to simulate hypothetical interventions and explore potential leverage points under uncertainty. Results suggest that D2D helps distinguish between high- and low-ranked leverage points. We compare D2D to a data-driven SDM constructed from the same CLD and variable labels. D2D showed greater consistency with the data-driven model compared to static network centrality analysis, while providing uncertainty estimates and guidance for future data collection. The D2D method is implemented in an open-source Python package and a web-based application to support further testing and lower the barrier to dynamic modeling for researchers working with CLDs. We expect that additional validation studies will further establish the approach’s utility across a broad range of cases and domains.
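Under stated assumptions (a linear stock model with link strengths sampled under uncertainty), the CLD-to-SDM conversion can be sketched as follows. The 3-variable balancing loop, the `simulate` function, and the decay term are illustrative choices, not the D2D protocol's exact equations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CLD: A -> B (+), B -> C (+), C -> A (-), one balancing loop.
# Only link existence and polarity come from the diagram, as in D2D.
polarity = np.zeros((3, 3))     # polarity[i, j] = sign of link j -> i
polarity[1, 0] = +1
polarity[2, 1] = +1
polarity[0, 2] = -1

def simulate(strengths, x0, dt=0.01, t=10.0, decay=0.5):
    W = polarity * strengths            # sampled, not fitted, link weights
    x = np.array(x0, dtype=float)
    for _ in range(int(t / dt)):
        x = x + dt * (W @ x - decay * x)   # simple linear stock dynamics
    return x

# Explore a hypothetical intervention on A across sampled parameterizations
effects = []
for _ in range(20):
    s = rng.uniform(0.1, 1.0, size=(3, 3))
    base = simulate(s, [1.0, 1.0, 1.0])
    pushed = simulate(s, [2.0, 1.0, 1.0])   # intervene: double the stock of A
    effects.append(pushed[2] - base[2])     # downstream effect on C
assert np.all(np.isfinite(effects))
```

Ranking variables by the spread of such intervention effects across sampled strengths is the spirit of the leverage-point exploration described in the abstract.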

[334] LLMComp: A Language Modeling Paradigm for Error-Bounded Scientific Data Compression (Technical Report)

Guozhong Li, Muhannad Alhumaidi, Spiros Skiadopoulos, Panos Kalnis

Main category: cs.LG

TL;DR: LLMCOMP uses decoder-only LLMs for lossy compression of scientific data, achieving up to 30% higher compression ratios than state-of-the-art methods while maintaining strict error bounds.

DetailsMotivation: The rapid growth of massive spatiotemporal datasets from scientific simulations and observations requires efficient, error-bounded compression methods, while decoder-only LLMs have shown strong capabilities in modeling complex sequential data.

Method: Quantizes 3D fields into discrete tokens, arranges them via Z-order curves to preserve locality, applies coverage-guided sampling for training efficiency, trains autoregressive transformer with spatial-temporal embeddings, and uses top-k prediction with rank indices and fallback corrections during compression.

Result: Experiments on multiple reanalysis datasets show LLMCOMP consistently outperforms state-of-the-art compressors, achieving up to 30% higher compression ratios under strict error bounds.

Conclusion: LLMs have strong potential as general-purpose compressors for high-fidelity scientific data, demonstrating superior performance compared to existing compression methods.

Abstract: The rapid growth of high-resolution scientific simulations and observation systems is generating massive spatiotemporal datasets, making efficient, error-bounded compression increasingly important. Meanwhile, decoder-only large language models (LLMs) have demonstrated remarkable capabilities in modeling complex sequential data. In this paper, we propose LLMCOMP, a novel lossy compression paradigm that leverages decoder-only large LLMs to model scientific data. LLMCOMP first quantizes 3D fields into discrete tokens, arranges them via Z-order curves to preserve locality, and applies coverage-guided sampling to enhance training efficiency. An autoregressive transformer is then trained with spatial-temporal embeddings to model token transitions. During compression, the model performs top-k prediction, storing only rank indices and fallback corrections to ensure strict error bounds. Experiments on multiple reanalysis datasets show that LLMCOMP consistently outperforms state-of-the-art compressors, achieving up to 30% higher compression ratios under strict error bounds. These results highlight the potential of LLMs as general-purpose compressors for high-fidelity scientific data.
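The top-k rank/fallback coding step can be sketched with a mock predictor: store a small rank index when the true token is in the model's top-k, otherwise store the token itself as a correction, so reconstruction is exact in token space and the error bound fixed at quantization time is preserved. Function names and the mock predictions below are assumptions.

```python
def encode(tokens, topk_preds):
    """Rank/fallback coding (simplified): ranks compress well because a good
    model puts the true token near the top of its prediction list."""
    stream = []
    for t, topk in zip(tokens, topk_preds):
        if t in topk:
            stream.append(("rank", topk.index(t)))   # cheap: small integer
        else:
            stream.append(("raw", t))                # fallback correction
    return stream

def decode(stream, topk_preds):
    return [topk[v] if kind == "rank" else v
            for (kind, v), topk in zip(stream, topk_preds)]

tokens = [3, 7, 7, 2, 9]
# Mock top-3 predictions the decoder can reproduce deterministically
topk = [[3, 1, 2], [7, 3, 0], [7, 7, 1], [5, 6, 0], [9, 2, 4]]

enc = encode(tokens, topk)
assert decode(enc, topk) == tokens                    # lossless in token space
assert sum(kind == "rank" for kind, _ in enc) == 4    # one fallback needed
```

In the real system the decoder reruns the same autoregressive model to regenerate the top-k lists, so only the rank indices and the rare fallbacks need to be stored.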

[335] MetaFed: Advancing Privacy, Performance, and Sustainability in Federated Metaverse Systems

Muhammet Anil Yagiz, Zeynep Sude Cengiz, Polat Goktas

Main category: cs.LG

TL;DR: MetaFed is a decentralized federated learning framework that addresses performance, privacy, and sustainability challenges in Metaverse applications through intelligent resource orchestration.

DetailsMotivation: Centralized architectures for Metaverse applications lead to high energy consumption, latency, and privacy concerns, requiring a more sustainable and privacy-preserving approach.

Method: Integrates multi-agent reinforcement learning for dynamic client selection, privacy-preserving FL with homomorphic encryption, and carbon-aware scheduling aligned with renewable energy availability.

Result: Achieves up to 25% reduction in carbon emissions compared to conventional approaches while maintaining high accuracy and minimal communication overhead on MNIST and CIFAR-10 datasets.

Conclusion: MetaFed provides a scalable solution for building environmentally responsible and privacy-compliant Metaverse infrastructures.

Abstract: The rapid expansion of immersive Metaverse applications introduces complex challenges at the intersection of performance, privacy, and environmental sustainability. Centralized architectures fall short in addressing these demands, often resulting in elevated energy consumption, latency, and privacy concerns. This paper proposes MetaFed, a decentralized federated learning (FL) framework that enables sustainable and intelligent resource orchestration for Metaverse environments. MetaFed integrates (i) multi-agent reinforcement learning for dynamic client selection, (ii) privacy-preserving FL using homomorphic encryption, and (iii) carbon-aware scheduling aligned with renewable energy availability. Evaluations on MNIST and CIFAR-10 using lightweight ResNet architectures demonstrate that MetaFed achieves up to 25% reduction in carbon emissions compared to conventional approaches, while maintaining high accuracy and minimal communication overhead. These results highlight MetaFed as a scalable solution for building environmentally responsible and privacy-compliant Metaverse infrastructures.
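The multi-agent RL selector is beyond a short sketch, but the carbon-aware scheduling idea can be illustrated with a greedy stand-in that ranks clients by current carbon intensity per unit of expected contribution; all field names are hypothetical.

```python
def select_clients(clients, budget):
    """Greedy carbon-aware selection (illustrative stand-in for MetaFed's
    learned policy): prefer clients whose grid is greenest per unit utility."""
    ranked = sorted(clients, key=lambda c: c["carbon_intensity"] / c["utility"])
    return [c["id"] for c in ranked[:budget]]

clients = [
    {"id": "a", "carbon_intensity": 400, "utility": 1.0},  # gCO2/kWh, a.u.
    {"id": "b", "carbon_intensity": 100, "utility": 1.0},  # renewable-heavy grid
    {"id": "c", "carbon_intensity": 300, "utility": 3.0},  # dirty grid, high value
]

picked = select_clients(clients, budget=2)
assert picked == ["b", "c"]   # low-carbon client and high-utility client win
```

A learned policy can additionally adapt to time-varying renewable availability and client data quality, which is what the reinforcement-learning component in the paper targets.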

[336] Breaking the Black Box: Inherently Interpretable Physics-Constrained Machine Learning With Weighted Mixed-Effects for Imbalanced Seismic Data

Vemula Sreenath, Filippo Gatti, Pierre Jehel

Main category: cs.LG

TL;DR: Developed an interpretable neural network for ground motion models that addresses data imbalance and black-box limitations using HazBinLoss and concurvity regularization, achieving robust performance with physics-consistent variance partitioning.

DetailsMotivation: Existing ML-based ground motion models operate as 'black boxes' with limited interpretability, and seismic datasets suffer from severe imbalance causing systematic underprediction of critical high-hazard ground motions.

Method: Used an inherently interpretable neural network with independent additive pathways, HazBinLoss (physics-constrained weighting with inverse bin count scaling), and concurvity regularization to enforce pathway orthogonality.

Result: Achieved robust performance: MSE=0.6235, MAE=0.6230, R²=88.48%. Pathway scaling confirmed seismological behaviors. Residual analysis showed unbiased predictions with physically consistent variance partitioning.

Conclusion: The interpretable framework advances ground motion models by establishing a transparent, physics-consistent foundation for seismic hazard and risk assessment, with implications for non-ergodic hazard analysis.

Abstract: Ground motion models (GMMs) are critical for seismic risk mitigation and infrastructure design. Machine learning (ML) is increasingly applied to GMM development due to expanding strong motion databases. However, existing ML-based GMMs operate as ‘black boxes,’ creating opacity that undermines confidence in engineering decisions. Moreover, seismic datasets exhibit severe imbalance, with scarce large-magnitude near-field records causing systematic underprediction of critical high-hazard ground motions. Despite these limitations, research addressing both interpretability and data imbalance remains limited. This study develops an inherently interpretable neural network employing independent additive pathways with novel HazBinLoss and concurvity regularization. HazBinLoss integrates physics-constrained weighting with inverse bin count scaling to address underfitting in sparse, high-hazard regions. Concurvity regularization enforces pathway orthogonality, reducing inter-pathway correlation. The model achieves robust performance: mean squared error = 0.6235, mean absolute error = 0.6230, and coefficient of determination = 88.48%. Pathway scaling corroborates established seismological behaviors. Weighted hierarchical Student-t mixed-effects analysis demonstrates unbiased residuals with physically consistent variance partitioning: sigma components range from 0.26-0.38 (inter-event), 0.12-0.41 (inter-region), 0.58-0.71 (intra-event), and 0.68-0.89 (total). The lower inter-event and higher intra-event components have implications for non-ergodic hazard analysis. Predictions exhibit strong agreement with NGA-West2 GMMs across diverse conditions. This interpretable framework advances GMMs, establishing a transparent, physics-consistent foundation for seismic hazard and risk assessment.
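The inverse-bin-count component of HazBinLoss can be sketched as a sample-weighting scheme over magnitude-distance bins; the paper's loss also includes physics-constrained weighting terms, which are omitted here, and the bin edges and data are illustrative.

```python
import numpy as np

def hazbin_weights(magnitudes, distances, mag_bins, dist_bins):
    """Inverse-bin-count weights: records in sparse magnitude-distance bins
    (e.g. rare large-magnitude near-field events) get larger weight."""
    m_idx = np.digitize(magnitudes, mag_bins)
    d_idx = np.digitize(distances, dist_bins)
    flat = m_idx * (len(dist_bins) + 1) + d_idx          # joint bin id
    _, inverse, counts = np.unique(flat, return_inverse=True, return_counts=True)
    w = 1.0 / counts[inverse]
    return w / w.mean()          # normalize so the average weight is 1

def weighted_mse(y_true, y_pred, w):
    return np.mean(w * (y_true - y_pred) ** 2)

mags = np.array([4.0, 4.1, 4.2, 4.3, 7.5])   # one rare large-magnitude record
dists = np.array([50.0, 60.0, 55.0, 52.0, 10.0])
w = hazbin_weights(mags, dists, mag_bins=[5.0], dist_bins=[30.0])

assert w[-1] > w[0]   # the rare near-field, large-magnitude record is upweighted
```

Without such weighting, the four common records dominate the loss and the model systematically underpredicts exactly the high-hazard regime the abstract highlights.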

[337] Shift Before You Learn: Enabling Low-Rank Representations in Reinforcement Learning

Bastien Dubail, Stefan Stojanovic, Alexandre Proutière

Main category: cs.LG

TL;DR: The paper challenges the common low-rank assumption for successor measures in RL, showing that the shifted successor measure (after bypassing initial transitions) naturally exhibits low-rank structure, enabling efficient estimation and improved performance.

DetailsMotivation: To address the misconception that successor measures are approximately low-rank in RL, and to demonstrate that low-rank structure actually emerges in shifted successor measures after initial transitions.

Method: Proposed using shifted successor measures with finite-sample guarantees for low-rank approximation, introduced spectral recoverability analysis and Type II Poincaré inequalities to quantify required shift, and connected shift selection to local mixing properties.

Result: Showed that shifted successor measures enable effective low-rank approximation with small shifts in practice, and demonstrated improved performance in goal-conditioned RL experiments.

Conclusion: Shifting successor measures reveals natural low-rank structure that was previously overlooked, providing a principled approach for efficient RL algorithm design with theoretical guarantees and practical benefits.

Abstract: Low-rank structure is a common implicit assumption in many modern reinforcement learning (RL) algorithms. For instance, reward-free and goal-conditioned RL methods often presume that the successor measure admits a low-rank representation. In this work, we challenge this assumption by first remarking that the successor measure itself is not approximately low-rank. Instead, we demonstrate that a low-rank structure naturally emerges in the shifted successor measure, which captures the system dynamics after bypassing a few initial transitions. We provide finite-sample performance guarantees for the entry-wise estimation of a low-rank approximation of the shifted successor measure from sampled entries. Our analysis reveals that both the approximation and estimation errors are primarily governed by a newly introduced quantity: the spectral recoverability of the corresponding matrix. To bound this parameter, we derive a new class of functional inequalities for Markov chains that we call Type II Poincaré inequalities and from which we can quantify the amount of shift needed for effective low-rank approximation and estimation. This analysis shows in particular that the required shift depends on decay of the high-order singular values of the shifted successor measure and is hence typically small in practice. Additionally, we establish a connection between the necessary shift and the local mixing properties of the underlying dynamical system, which provides a natural way of selecting the shift. Finally, we validate our theoretical findings with experiments, and demonstrate that shifting the successor measure indeed leads to improved performance in goal-conditioned RL.
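The shift's effect on low-rankness can be checked numerically: compare the tail singular values of the successor measure M = (1 - gamma)(I - gamma P)^{-1} with those of the shifted measure P^k M. A sketch on a random chain (sizes and shift are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, k = 20, 0.95, 5

# Random ergodic transition matrix (rows sum to 1)
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)

# Successor measure: M = (1 - gamma) * sum_t gamma^t P^t = (1-gamma)(I - gamma P)^{-1}
M = (1 - gamma) * np.linalg.inv(np.eye(n) - gamma * P)
M_shifted = np.linalg.matrix_power(P, k) @ M   # bypass the first k transitions
assert np.allclose(M.sum(axis=1), 1.0)         # rows of M are distributions

def rank_r_error(A, r):
    """Relative Frobenius error of the best rank-r approximation."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.sqrt((s[r:] ** 2).sum() / (s ** 2).sum())

# M contains a full-rank (1-gamma)*I component from the t=0 term, while the
# shifted measure sheds it: its tail singular values decay much faster.
assert rank_r_error(M_shifted, 3) < rank_r_error(M, 3)
```

The identity-like contribution of the first transitions is exactly what keeps the unshifted successor measure far from low-rank, matching the abstract's remark.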

[338] Proposing a Framework for Machine Learning Adoption on Legacy Systems

Ashiqur Rahman, Hamed Alhoori

Main category: cs.LG

TL;DR: API-based framework that decouples ML lifecycle from production to enable ML adoption without costly system upgrades or downtime

DetailsMotivation: Overcome prohibitive costs and operational disruptions that prevent ML adoption in industrial settings, especially for SMEs

Method: Lightweight browser-based interface with human-in-the-loop approach, allowing domain experts interactive control over model parameters

Result: Enables ML implementation without local hardware upgrades and ensures zero production downtime during model maintenance

Conclusion: Provides scalable, accessible pathway to enhance production quality and safety while strengthening manufacturing competitiveness

Abstract: The integration of machine learning (ML) is critical for industrial competitiveness, yet its adoption is frequently stalled by the prohibitive costs and operational disruptions of upgrading legacy systems. The financial and logistical overhead required to support the full ML lifecycle presents a formidable barrier to widespread implementation, particularly for small and medium-sized enterprises. This paper introduces a pragmatic, API-based framework designed to overcome these challenges by strategically decoupling the ML model lifecycle from the production environment. Our solution delivers the analytical power of ML to domain experts through a lightweight, browser-based interface, eliminating the need for local hardware upgrades and ensuring model maintenance can occur with zero production downtime. This human-in-the-loop approach empowers experts with interactive control over model parameters, fostering trust and facilitating seamless integration into existing workflows. By mitigating the primary financial and operational risks, this framework offers a scalable and accessible pathway to enhance production quality and safety, thereby strengthening the competitive advantage of the manufacturing sector.

[339] ReNF: Rethinking the Design Space of Neural Long-Term Time Series Forecasters

Yihang Lu, Xianwei Meng, Enhong Chen

Main category: cs.LG

TL;DR: A principled approach to Long-term Time Series Forecasting that combines Auto-Regressive and Direct Output methods with parameter stabilization, enabling simple MLPs to outperform complex models.

DetailsMotivation: Current Neural Forecasters overemphasize architectural complexity while neglecting fundamental forecasting principles, hindering progress in Long-term Time Series Forecasting.

Method: Proposes Boosted Direct Output (BDO) strategy that synergistically combines AR and DO approaches, with smooth parameter tracking to stabilize learning. Based on a Multiple Neural Forecasting Theorem.

Result: A simple MLP with these principled improvements achieves state-of-the-art performance, outperforming recent complex models in nearly all cases without domain-specific considerations.

Conclusion: The work establishes a dynamic performance bound and identifies promising research directions, demonstrating that principled improvements can enable simple models to surpass complex architectures.

Abstract: Neural Forecasters (NFs) are a cornerstone of Long-term Time Series Forecasting (LTSF). However, progress has been hampered by an overemphasis on architectural complexity at the expense of fundamental forecasting principles. In this work, we return to first principles to redesign the LTSF paradigm. We begin by introducing a Multiple Neural Forecasting Theorem that provides a theoretical basis for our approach. We propose Boosted Direct Output (BDO), a novel forecasting strategy that synergistically combines the advantages of both Auto-Regressive (AR) and Direct Output (DO). In addition, we stabilize the learning process by smoothly tracking the model’s parameters. Extensive experiments show that these principled improvements enable a simple MLP to achieve state-of-the-art performance, outperforming recent, complex models in nearly all cases, without any domain-specific considerations. Finally, we empirically verify our theorem, establishing a dynamic performance bound and identifying promising directions for future research. The code for review is available at: .
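
The summary does not spell out what "smoothly tracking the model's parameters" means; one common reading is an exponential moving average (EMA) of the weights, sketched below under that assumption. All names and values are illustrative, not from the paper.

```python
import numpy as np

def ema_update(tracked, current, decay=0.99):
    """Blend the slow (tracked) weights toward the fast (current) weights."""
    return {k: decay * tracked[k] + (1 - decay) * current[k] for k in tracked}

tracked = {"w": np.zeros(3)}
for _ in range(200):
    current = {"w": np.ones(3)}  # stand-in for freshly updated training weights
    tracked = ema_update(tracked, current, decay=0.9)

# After many steps the tracked weights converge to the stationary weights.
print(np.allclose(tracked["w"], 1.0, atol=1e-6))  # → True
```

The tracked copy changes more slowly than the raw weights, which is what stabilizes learning when the raw updates are noisy.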

[340] Aegis: A Correlation-Based Data Masking Advisor for Data Sharing Ecosystems

Omar Islam Laskar, Fatemeh Ramezani Khozestani, Ishika Nankani, Sohrab Namazi Nia, Senjuti Basu Roy, Kaustubh Beedkar

Main category: cs.LG

TL;DR: Aegis is a middleware framework that efficiently selects optimal data masking configurations for ML datasets while preserving privacy, achieving 10x faster performance with comparable predictive utility.

Motivation: In sensitive domains like healthcare, data sharing requires privacy protection, but different masking configurations that meet privacy thresholds can vary significantly in utility. The challenge is efficiently selecting the configuration that preserves maximum utility for downstream ML tasks.

Method: Aegis uses a utility optimizer that minimizes predictive utility deviation by quantifying shifts in feature-label correlations due to masking. It leverages limited data summaries (1D histograms) or none to estimate feature-label joint distribution via iterative proportional fitting, supporting various correlation quantification methods like mutual information, chi-square, or g3.

Result: Experimental evaluation on real-world datasets shows Aegis identifies optimal masking configurations over an order of magnitude faster, while the resulting masked datasets achieve predictive performance on downstream ML tasks on par with baseline approaches.

Conclusion: Aegis effectively complements privacy anonymization techniques by efficiently selecting optimal masking configurations that balance privacy requirements with utility preservation for ML applications.

Abstract: Data sharing ecosystems connect providers, consumers, and intermediaries to facilitate the exchange and use of data for a wide range of downstream tasks. In sensitive domains such as healthcare, privacy is enforced as a hard constraint: any shared data must satisfy a minimum privacy threshold. However, among all masking configurations that meet this requirement, the utility of the masked data can vary significantly, posing a key challenge: how to efficiently select the optimal configuration that preserves maximum utility. This paper presents Aegis, a middleware framework that selects optimal masking configurations for machine learning datasets with features and class labels. Aegis incorporates a utility optimizer that minimizes predictive utility deviation, quantifying shifts in feature-label correlations due to masking. Our framework leverages limited data summaries (such as 1D histograms) or none to estimate the feature-label joint distribution, making it suitable for scenarios where raw data is inaccessible due to privacy restrictions. To achieve this, we propose a joint distribution estimator based on iterative proportional fitting, which supports various feature-label correlation quantification methods such as mutual information, chi-square, or g3. Our experimental evaluation on real-world datasets shows that Aegis identifies optimal masking configurations over an order of magnitude faster, while the resulting masked datasets achieve predictive performance on downstream ML tasks on par with baseline approaches, complementing privacy-preserving data masking techniques.
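
The estimator is described as iterative proportional fitting (IPF) over 1D summaries. A minimal numpy sketch of IPF itself: starting from a uniform seed, alternately rescale rows and columns until the joint table matches the given feature and label marginals. The marginals here are illustrative, not the paper's data, and this omits Aegis's correlation scoring.

```python
import numpy as np

def ipf(row_marginal, col_marginal, iters=100):
    """Fit a joint table to given 1D marginals by alternating rescaling."""
    joint = np.ones((len(row_marginal), len(col_marginal)))
    joint /= joint.sum()
    for _ in range(iters):
        joint *= (row_marginal / joint.sum(axis=1))[:, None]  # match row sums
        joint *= (col_marginal / joint.sum(axis=0))[None, :]  # match col sums
    return joint

feature_hist = np.array([0.2, 0.5, 0.3])  # 1D histogram of one feature
label_hist = np.array([0.6, 0.4])         # class-label distribution
joint = ipf(feature_hist, label_hist)

print(np.allclose(joint.sum(axis=1), feature_hist))  # → True
print(np.allclose(joint.sum(axis=0), label_hist))    # → True
```

With only 1D marginals and a uniform seed, IPF recovers the independence-model joint; correlation measures such as mutual information can then be computed from the fitted table.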

[341] Finding geodesics with the Deep Ritz method

Conor Rowan

Main category: cs.LG

TL;DR: The paper proposes using the Deep Ritz method for solving geodesic problems, demonstrating its effectiveness across four application domains: path planning, optics, solid mechanics, and generative modeling.

Motivation: Geodesic problems are fundamental in physics and engineering but have received little attention from the scientific machine learning community. The authors argue these problems are ideal for the Deep Ritz method due to their simple geometry, variational structure, and natural nonlinearity.

Method: The authors apply the Deep Ritz method to solve geodesic problems, presenting numerical examples across four different application domains to demonstrate the method’s versatility and effectiveness.

Result: The Deep Ritz method successfully solves geodesic problems in path planning, optics, solid mechanics, and generative modeling, showing promising performance across these diverse applications.

Conclusion: Geodesic problems represent a promising application area for the Deep Ritz method and a fruitful direction for future scientific machine learning research, though this work serves as an initial exploration rather than an exhaustive study.

Abstract: Geodesic problems involve computing trajectories between prescribed initial and final states to minimize a user-defined measure of distance, cost, or energy. They arise throughout physics and engineering – for instance, in determining optimal paths through complex environments, modeling light propagation in refractive media, and the study of spacetime trajectories in control theory and general relativity. Despite their ubiquity, the scientific machine learning (SciML) community has given relatively little attention to investigating its methods in the context of these problems. In this work, we argue that given their simple geometry, variational structure, and natural nonlinearity, geodesic problems are particularly well-suited for the Deep Ritz method. We substantiate this claim with four numerical examples drawn from path planning, optics, solid mechanics, and generative modeling. Our goal is not to provide an exhaustive study of geodesic problems, but rather to identify a promising application of the Deep Ritz method and a fruitful direction for future SciML research.
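
To make the variational setup concrete, here is a minimal numpy sketch of the underlying Ritz idea: minimize a discretized path-energy functional between fixed endpoints by gradient descent. For simplicity the path is a polyline in a flat metric (where the geodesic is a straight line), not the neural-network ansatz the Deep Ritz method would use; everything here is illustrative.

```python
import numpy as np

n = 20
rng = np.random.default_rng(0)
# Start from a noisy path between (0,0) and (1,1), then clamp the endpoints.
path = np.linspace([0.0, 0.0], [1.0, 1.0], n) + 0.3 * rng.standard_normal((n, 2))
path[0], path[-1] = [0.0, 0.0], [1.0, 1.0]

for _ in range(2000):
    # Gradient of the discrete energy sum_i ||x_{i+1} - x_i||^2
    # with respect to the interior points only (endpoints stay fixed).
    grad = 2 * (2 * path[1:-1] - path[:-2] - path[2:])
    path[1:-1] -= 0.1 * grad

length = np.sum(np.linalg.norm(np.diff(path, axis=0), axis=1))
print(length)  # approaches sqrt(2), the straight-line distance
```

In a nontrivial metric the energy integrand would carry a position-dependent weight, and the Deep Ritz method would parameterize the trajectory with a network trained on the same objective.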

[342] On Measuring Localization of Shortcuts in Deep Networks

Nikita Tsoy, Nikola Konstantinov

Main category: cs.LG

TL;DR: Shortcuts (spurious rules) in deep networks are distributed throughout layers rather than localized, with shallow layers encoding spurious features and deeper layers forgetting core features, making general shortcut-mitigation methods difficult to design.

Motivation: The impact of shortcuts on feature representations remains understudied, obstructing the design of principled shortcut-mitigation methods, so the authors investigate layer-wise localization of shortcuts in deep models.

Method: A novel experiment design quantifying layer-wise contribution to accuracy degradation through counterfactual training on clean and skewed datasets, tested on CIFAR-10, Waterbirds, and CelebA datasets across VGG, ResNet, DeiT, and ConvNeXt architectures.

Result: Shortcut learning is distributed throughout the network - shallow layers predominantly encode spurious features while deeper layers predominantly forget core features that are predictive on clean data.

Conclusion: The analysis of layer-wise shortcut-mitigation strategies suggests that general-purpose mitigation methods are hard to design, supporting dataset- and architecture-specific approaches instead.

Abstract: Shortcuts, spurious rules that perform well during training but fail to generalize, present a major challenge to the reliability of deep networks (Geirhos et al., 2020). However, the impact of shortcuts on feature representations remains understudied, obstructing the design of principled shortcut-mitigation methods. To overcome this limitation, we investigate the layer-wise localization of shortcuts in deep models. Our novel experiment design quantifies the layer-wise contribution to accuracy degradation caused by a shortcut-inducing skew by counterfactual training on clean and skewed datasets. We employ our design to study shortcuts on CIFAR-10, Waterbirds, and CelebA datasets across VGG, ResNet, DeiT, and ConvNeXt architectures. We find that shortcut learning is not localized in specific layers but distributed throughout the network. Different network parts play different roles in this process: shallow layers predominantly encode spurious features, while deeper layers predominantly forget core features that are predictive on clean data. We also analyze the differences in localization and describe their principal axes of variation. Finally, our analysis of layer-wise shortcut-mitigation strategies suggests that designing general mitigation methods is hard, supporting dataset- and architecture-specific approaches instead.

[343] In Situ Training of Implicit Neural Compressors for Scientific Simulations via Sketch-Based Regularization

Cooper Simpson, Stephen Becker, Alireza Doostan

Main category: cs.LG

TL;DR: Novel in situ training protocol for implicit neural representations using memory buffers with sketched data to prevent catastrophic forgetting, achieving strong compression performance comparable to offline methods.

Motivation: To address catastrophic forgetting in continual learning scenarios, particularly for neural compression using implicit neural representation-based hypernetworks, by leveraging sketching as a regularizer.

Method: In situ training with limited memory buffers containing both full and sketched data samples, where sketching serves as a regularizer based on Johnson-Lindenstrauss principles, applied to implicit neural representations.

Result: Demonstrated strong reconstruction performance at high compression rates on complex 2D/3D simulation data over long time horizons, across unstructured grids and non-Cartesian geometries.

Conclusion: Sketching enables in situ training to approximately match offline method performance, making it a promising approach for continual learning in neural compression applications.

Abstract: Focusing on implicit neural representations, we present a novel in situ training protocol that employs limited memory buffers of full and sketched data samples, where the sketched data are leveraged to prevent catastrophic forgetting. The theoretical motivation for our use of sketching as a regularizer is presented via a simple Johnson-Lindenstrauss-informed result. While our methods may be of wider interest in the field of continual learning, we specifically target in situ neural compression using implicit neural representation-based hypernetworks. We evaluate our method on a variety of complex simulation data in two and three dimensions, over long time horizons, and across unstructured grids and non-Cartesian geometries. On these tasks, we show strong reconstruction performance at high compression rates. Most importantly, we demonstrate that sketching enables the presented in situ scheme to approximately match the performance of the equivalent offline method.
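
The Johnson-Lindenstrauss rationale for sketched replay buffers can be shown in a few lines: a random Gaussian projection compresses samples while approximately preserving distances, so a regularization loss computed on sketched samples can stand in for one on the originals. Dimensions and the 30% tolerance below are illustrative, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 1000, 200, 50
X = rng.standard_normal((n, d))                 # original data samples
S = rng.standard_normal((k, d)) / np.sqrt(k)    # JL sketching matrix
Xs = X @ S.T                                    # sketched buffer entries (5x smaller)

# Pairwise distances survive the projection up to small relative distortion.
orig = np.linalg.norm(X[0] - X[1])
sket = np.linalg.norm(Xs[0] - Xs[1])
print(abs(sket / orig - 1.0) < 0.3)  # → True
```

In the in situ setting, the buffer stores `Xs` instead of `X`, and the penalty on reconstructing sketched past samples is what discourages the compressor from forgetting them.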

[344] EVINGCA: Adaptive Graph Clustering with Evolving Neighborhood Statistics

Randolph Wiredu-Aidoo

Main category: cs.LG

TL;DR: EVINGCA is a density-variance based clustering algorithm that treats cluster formation as an adaptive, evolving process on nearest-neighbor graphs, replacing fixed density thresholds with local statistical feedback.

Motivation: Existing clustering algorithms have restrictive assumptions - K-Means/Gaussian Mixtures assume convex Gaussian clusters, while DBSCAN/HDBSCAN capture non-convexity but are highly sensitive.

Method: EVINGCA expands rooted graphs via breadth-first search guided by continuously updated local distance and shape statistics, using spatial indexing for efficiency.

Result: EVINGCA achieves log-linear average-case complexity and exhibits competitive performance against baselines across synthetic, real-world, low-dimensional, and high-dimensional datasets.

Conclusion: EVINGCA provides an effective alternative to traditional clustering methods by treating cluster formation as an adaptive, evolving process with local statistical guidance.

Abstract: Clustering algorithms often rely on restrictive assumptions: K-Means and Gaussian Mixtures presuppose convex, Gaussian-like clusters, while DBSCAN and HDBSCAN capture non-convexity but can be highly sensitive. I introduce EVINGCA (Evolving Variance-Informed Nonparametric Graph Construction Algorithm), a density-variance based clustering algorithm that treats cluster formation as an adaptive, evolving process on a nearest-neighbor graph. EVINGCA expands rooted graphs via breadth-first search, guided by continuously updated local distance and shape statistics, replacing fixed density thresholds with local statistical feedback. With spatial indexing, EVINGCA features log-linear complexity in the average case and exhibits competitive performance against baselines across a variety of synthetic, real-world, low-dimensional, and high-dimensional datasets.
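
A toy sketch of the core idea as we read it: grow a cluster by breadth-first search over a k-nearest-neighbor graph, admitting an edge only if its length fits the statistics of the edges accepted so far, rather than a fixed density threshold. The admission rule, parameters, and data below are our simplification, not EVINGCA's actual statistics.

```python
import numpy as np
from collections import deque

def grow_cluster(points, root, k=3, z=3.0):
    """BFS over a kNN graph with an evolving edge-length admission test."""
    dists = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    neighbors = np.argsort(dists, axis=1)[:, 1:k + 1]  # skip self (column 0)
    visited, edge_lens = {root}, []
    queue = deque([root])
    while queue:
        i = queue.popleft()
        for j in neighbors[i]:
            if j in visited:
                continue
            d = dists[i, j]
            mu = np.mean(edge_lens) if edge_lens else d
            sd = np.std(edge_lens) if len(edge_lens) > 1 else d
            if d <= mu + z * sd:            # evolving, data-driven threshold
                visited.add(int(j))
                edge_lens.append(d)
                queue.append(int(j))
    return visited

rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.1, (30, 2))      # tight cluster near the origin
blob_b = rng.normal(5.0, 0.1, (30, 2))      # far-away second cluster
pts = np.vstack([blob_a, blob_b])

cluster = grow_cluster(pts, root=0)
print(max(cluster) < 30)  # → True: growth never crosses to the distant blob
```

Because the threshold is recomputed from accepted edges, dense regions admit short edges freely while a long jump to a distant region fails the test.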

[345] Scalable Evaluation and Neural Models for Compositional Generalization

Giacomo Camposampiero, Pietro Barbiero, Michael Hersche, Roger Wattenhofer, Abbas Rahimi

Main category: cs.LG

TL;DR: This paper addresses compositional generalization in machine learning by introducing a rigorous evaluation framework, conducting extensive experiments on vision backbones, and proposing Attribute Invariant Networks that significantly improve accuracy while reducing parameter overhead.

Motivation: Compositional generalization remains a key challenge in ML, with current evaluation protocols lacking standardization and benchmarks favoring efficiency over rigor. General-purpose vision architectures lack necessary inductive biases, and existing approaches compromise scalability.

Method: 1) Developed a rigorous evaluation framework that unifies previous approaches while reducing computational requirements from combinatorial to constant; 2) Conducted extensive evaluation training over 5000 models; 3) Proposed Attribute Invariant Networks as a new class of models.

Result: Attribute Invariant Networks establish a new Pareto frontier in compositional generalization, achieving 23.43% accuracy improvement over baselines while reducing parameter overhead from 600% to 16% compared to fully disentangled counterparts.

Conclusion: The paper provides a comprehensive solution to compositional generalization challenges through rigorous evaluation methodology and efficient model design that balances performance and scalability.

Abstract: Compositional generalization, a key open challenge in modern machine learning, requires models to predict unknown combinations of known concepts. However, assessing compositional generalization remains a fundamental challenge due to the lack of standardized evaluation protocols and the limitations of current benchmarks, which often favor efficiency over rigor. At the same time, general-purpose vision architectures lack the necessary inductive biases, and existing approaches to endow them with such biases compromise scalability. As a remedy, this paper introduces: 1) a rigorous evaluation framework that unifies and extends previous approaches while reducing computational requirements from combinatorial to constant; 2) an extensive and modern evaluation of the status of compositional generalization in supervised vision backbones, training more than 5000 models; 3) Attribute Invariant Networks, a class of models establishing a new Pareto frontier in compositional generalization, achieving a 23.43% accuracy improvement over baselines while reducing parameter overhead from 600% to 16% compared to fully disentangled counterparts. Our code is available at https://github.com/IBM/scalable-compositional-generalization.

[346] PDE-SHARP: PDE Solver Hybrids through Analysis and Refinement Passes

Shaghayegh Fazliani, Madeleine Udell

Main category: cs.LG

TL;DR: PDE-SHARP reduces computational costs for PDE solver generation by replacing expensive scientific computations with cheaper LLM inference, achieving superior accuracy with 60-75% fewer evaluations.

Motivation: Current LLM-driven approaches for generating PDE solvers require executing many solver samples, which is computationally expensive especially for complex PDEs requiring substantial resources for numerical evaluation.

Method: Three-stage framework: (1) Analysis: mathematical chain-of-thought including PDE classification, solution type detection, and stability analysis; (2) Genesis: solver generation based on mathematical insights; (3) Synthesis: collaborative selection-hybridization tournaments with LLM judges iteratively refining implementations.

Result: Requires fewer than 13 solver evaluations on average vs 30+ for baselines, improves accuracy by 4× on average across tested PDEs, and shows robust performance across different LLM architectures.

Conclusion: PDE-SHARP significantly reduces computational costs while improving solver accuracy, demonstrating the effectiveness of replacing expensive scientific computation with LLM inference for PDE solver generation.

Abstract: Current LLM-driven approaches using test-time computing to generate PDE solvers execute a large number of solver samples to identify high-accuracy solvers. These paradigms are especially costly for complex PDEs requiring substantial computational resources for numerical evaluation. We introduce PDE-SHARP, a framework to reduce computational costs by replacing expensive scientific computation with cheaper LLM inference that achieves superior solver accuracy with 60-75% fewer computational evaluations. PDE-SHARP employs three stages: (1) Analysis: mathematical chain-of-thought analysis including PDE classification, solution type detection, and stability analysis; (2) Genesis: solver generation based on mathematical insights from the previous stage; and (3) Synthesis: collaborative selection-hybridization tournaments in which LLM judges iteratively refine implementations through flexible performance feedback. To generate high-quality solvers, PDE-SHARP requires fewer than 13 solver evaluations on average compared to 30+ for baseline methods, improving accuracy uniformly across tested PDEs by 4× on average, and demonstrates robust performance across LLM architectures, from general-purpose to specialized reasoning models.

[347] TabTune: A Unified Library for Inference and Fine-Tuning Tabular Foundation Models

Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Utsav Avaiya, Vinay Kumar Sankarapu

Main category: cs.LG

TL;DR: TabTune is a unified library that standardizes workflows for tabular foundation models, addressing adoption barriers like heterogeneous preprocessing and fragmented APIs through consistent interfaces and automated pipelines.

Motivation: Tabular foundation models face limited adoption due to inconsistent preprocessing, fragmented APIs, varied fine-tuning procedures, and lack of standardized evaluation for deployment metrics like calibration and fairness.

Method: TabTune provides a unified interface with consistent access to 7 state-of-the-art models, supports multiple adaptation strategies (zero-shot, meta-learning, SFT, PEFT), automates model-aware preprocessing, and manages architectural heterogeneity internally.

Result: The framework enables standardized benchmarking of adaptation strategies and integrates evaluation modules for performance, calibration, and fairness metrics.

Conclusion: TabTune addresses key adoption barriers for tabular foundation models by providing extensible, reproducible workflows that standardize the complete modeling pipeline from preprocessing to evaluation.

Abstract: Tabular foundation models represent a growing paradigm in structured data learning, extending the benefits of large-scale pretraining to tabular domains. However, their adoption remains limited due to heterogeneous preprocessing pipelines, fragmented APIs, inconsistent fine-tuning procedures, and the absence of standardized evaluation for deployment-oriented metrics such as calibration and fairness. We present TabTune, a unified library that standardizes the complete workflow for tabular foundation models through a single interface. TabTune provides consistent access to seven state-of-the-art models supporting multiple adaptation strategies, including zero-shot inference, meta-learning, supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). The framework automates model-aware preprocessing, manages architectural heterogeneity internally, and integrates evaluation modules for performance, calibration, and fairness. Designed for extensibility and reproducibility, TabTune enables consistent benchmarking of adaptation strategies of tabular foundation models.

[348] Measuring the Intrinsic Dimension of Earth Representations

Arjun Rao, Marc Rußwurm, Konstantin Klemmer, Esther Rolf

Main category: cs.LG

TL;DR: This paper studies the intrinsic dimensionality of geographic Implicit Neural Representations (INRs) for Earth observation, finding they typically have 2-10 dimensions and that this metric correlates with downstream performance and can detect spatial artifacts.

Motivation: Geographic INRs aim to create compact representations of Earth's data, but there's limited understanding of how much information they actually contain and where it's concentrated. The intrinsic dimension provides a way to measure this information content.

Method: Analyzed the intrinsic dimensionality of geographic INRs with ambient dimensions between 256-512 by examining how spatial resolution and input modalities during pre-training affect the intrinsic dimension.

Result: Found that geographic INRs have intrinsic dimensions between 2-10, which are sensitive to spatial resolution and input modalities. The intrinsic dimension correlates with downstream task performance and can capture spatial artifacts.

Conclusion: Intrinsic dimension serves as an architecture-agnostic, label-free metric for evaluating information content in geographic INRs, enabling unsupervised evaluation, model selection, and pre-training design.

Abstract: Within the context of representation learning for Earth observation, geographic Implicit Neural Representations (INRs) embed low-dimensional location inputs (longitude, latitude) into high-dimensional embeddings, through models trained on geo-referenced satellite, image or text data. Despite the common aim of geographic INRs to distill Earth’s data into compact, learning-friendly representations, we lack an understanding of how much information is contained in these Earth representations, and where that information is concentrated. The intrinsic dimension of a dataset measures the number of degrees of freedom required to capture its local variability, regardless of the ambient high-dimensional space in which it is embedded. This work provides the first study of the intrinsic dimensionality of geographic INRs. Analyzing INRs with ambient dimension between 256 and 512, we find that their intrinsic dimensions fall roughly between 2 and 10 and are sensitive to changing spatial resolution and input modalities during INR pre-training. Furthermore, we show that the intrinsic dimension of a geographic INR correlates with downstream task performance and can capture spatial artifacts, facilitating model evaluation and diagnostics. More broadly, our work offers an architecture-agnostic, label-free metric of information content that can enable unsupervised evaluation, model selection, and pre-training design across INRs.
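
For context, here is a sketch of one standard intrinsic-dimension estimator, the TwoNN method of Facco et al.; the paper does not necessarily use this exact estimator. It uses the ratio of each point's second- to first-nearest-neighbor distance, whose log has mean 1/d for data of intrinsic dimension d. The data below are synthetic: a 3-dimensional cloud linearly embedded in a 50-dimensional ambient space.

```python
import numpy as np

def two_nn_id(X):
    """TwoNN intrinsic-dimension estimate from 2nd/1st nearest-neighbor ratios."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d.sort(axis=1)                     # column 0 is the self-distance (zero)
    ratios = d[:, 2] / d[:, 1]         # r2 / r1 per point
    return 1.0 / np.mean(np.log(ratios))

rng = np.random.default_rng(0)
Z = rng.standard_normal((400, 3))              # 3 intrinsic degrees of freedom
X = Z @ rng.standard_normal((3, 50))           # ambient dimension 50
dim = two_nn_id(X)
print(dim)  # close to 3 despite the 50-dim embedding
```

The estimate ignores the ambient dimension entirely, which is exactly the property the paper exploits when comparing INR embeddings of dimension 256-512.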

[349] Probabilistic Graph Cuts

Ayoub Ghriss

Main category: cs.LG

TL;DR: A unified probabilistic framework for differentiable graph partitioning that provides tight analytic bounds on expected cuts using integral representations and hypergeometric functions, with closed-form forward/backward passes.

Motivation: Prior probabilistic relaxations of graph cuts focused only on RatioCut and lacked general guarantees and principled gradients, limiting their applicability to various clustering objectives.

Method: Developed a unified probabilistic framework using integral representations and Gauss hypergeometric functions to derive tight analytic upper bounds on expected discrete cuts, with closed-form forward and backward computations.

Result: The framework covers a wide class of cuts including Normalized Cut, provides rigorous guarantees, and enables numerically stable, scalable differentiable graph partitioning.

Conclusion: This work establishes a rigorous foundation for scalable differentiable graph partitioning that supports various clustering and contrastive learning objectives through stable closed-form computations.

Abstract: Probabilistic relaxations of graph cuts offer a differentiable alternative to spectral clustering, enabling end-to-end and online learning without eigendecompositions, yet prior work centered on RatioCut and lacked general guarantees and principled gradients. We present a unified probabilistic framework that covers a wide class of cuts, including Normalized Cut. Our framework provides tight analytic upper bounds on expected discrete cuts via integral representations and Gauss hypergeometric functions with closed-form forward and backward passes. Together, these results deliver a rigorous, numerically stable foundation for scalable, differentiable graph partitioning covering a wide range of clustering and contrastive learning objectives.
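
The simplest member of this family illustrates the setup: with independent soft assignments P (each row a distribution over clusters), the expected number of cut edges has the closed, differentiable form Σ_{i&lt;j} w_ij (1 − ⟨P_i, P_j⟩). This is background for the unnormalized case only; the paper's tight hypergeometric bounds for normalized objectives are not reproduced here.

```python
import numpy as np

def expected_cut(W, P):
    """Expected cut weight under independent soft assignments P (rows on simplex)."""
    same = P @ P.T                        # prob. that nodes i and j share a cluster
    return 0.5 * np.sum(W * (1.0 - same)) # halve: symmetric W counts edges twice

W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)    # path graph 0-1-2
P_hard = np.array([[1, 0],
                   [1, 0],
                   [0, 1]], dtype=float)  # clusters {0,1} and {2}
print(expected_cut(W, P_hard))            # → 1.0, the single edge 1-2
```

Because the expression is smooth in P, it can be minimized by gradient descent, which is what makes these relaxations usable inside end-to-end training.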

[350] NOWS: Neural Operator Warm Starts for Accelerating Iterative Solvers

Mohammad Sadegh Eshaghi, Cosmin Anitescu, Navid Valizadeh, Yizheng Wang, Xiaoying Zhuang, Timon Rabczuk

Main category: cs.LG

TL;DR: Neural Operator Warm Starts (NOWS) combines neural operators with classical iterative solvers to accelerate PDE simulations by providing high-quality initial guesses, reducing computational time by up to 90% while maintaining solver guarantees.

Motivation: High-fidelity PDE simulations are computationally expensive for many-query, real-time, and design tasks, while purely data-driven surrogates can be unreliable outside their training distribution.

Method: NOWS uses learned solution operators to generate initial guesses for Krylov methods (conjugate gradient, GMRES), integrating with existing discretizations and solver infrastructures without modification.

Result: Consistently reduces iteration counts and end-to-end runtime, achieving up to 90% computational time reduction while preserving stability and convergence guarantees.

Conclusion: NOWS provides a practical and trustworthy approach to accelerate high-fidelity PDE simulations by combining neural operator speed with traditional solver rigor.

Abstract: Partial differential equations (PDEs) underpin quantitative descriptions across the physical sciences and engineering, yet high-fidelity simulation remains a major computational bottleneck for many-query, real-time, and design tasks. Data-driven surrogates can be strikingly fast but are often unreliable when applied outside their training distribution. Here we introduce Neural Operator Warm Starts (NOWS), a hybrid strategy that harnesses learned solution operators to accelerate classical iterative solvers by producing high-quality initial guesses for Krylov methods such as conjugate gradient and GMRES. NOWS leaves existing discretizations and solver infrastructures intact, integrating seamlessly with finite-difference, finite-element, isogeometric analysis, finite volume method, etc. Across our benchmarks, the learned initialization consistently reduces iteration counts and end-to-end runtime, resulting in a reduction of the computational time of up to 90%, while preserving the stability and convergence guarantees of the underlying numerical algorithms. By combining the rapid inference of neural operators with the rigor of traditional solvers, NOWS provides a practical and trustworthy approach to accelerate high-fidelity PDE simulations.
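
The mechanism is easy to demonstrate: the learned operator only supplies the initial guess x0, and the Krylov solver runs unchanged, so its guarantees survive. In the sketch below a hand-written conjugate-gradient routine stands in for the solver and a slightly perturbed exact solution stands in for the neural operator's prediction; the problem and numbers are synthetic.

```python
import numpy as np

def cg(A, b, x0, tol=1e-8, max_iter=500):
    """Plain conjugate gradients; returns the solution and iteration count."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    for it in range(max_iter):
        if np.linalg.norm(r) < tol:
            return x, it
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return x, max_iter

rng = np.random.default_rng(0)
n = 200
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # symmetric positive definite system
x_true = rng.standard_normal(n)
b = A @ x_true

x_cold, cold_iters = cg(A, b, np.zeros(n))                 # zero initial guess
warm = x_true + 1e-6 * rng.standard_normal(n)              # stand-in for the NN guess
x_warm, warm_iters = cg(A, b, warm)
print(warm_iters < cold_iters)  # → True: fewer iterations from the warm start
```

Both runs reach the same tolerance; the warm start only shrinks the initial residual, which is where the runtime savings come from.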

cs.MA

[351] ALAS: Transactional and Dynamic Multi-Agent LLM Planning

Longling Geng, Edward Y. Chang

Main category: cs.MA

TL;DR: ALAS is a stateful framework that improves multi-agent LLM planning by separating planning from validation, maintaining versioned execution logs, and enabling localized repair to handle disruptions efficiently.

Motivation: Current LLM-based multi-agent planning systems are fragile with circular verification, untracked state changes, and costly global recomputation for small faults.

Method: ALAS separates planning from non-circular validation, uses versioned execution logs for grounded checks and restore points, and performs localized repair with explicit policies (retry, catch, timeout, etc.) defined in a workflow IR.

Result: On job-shop scheduling benchmarks, ALAS achieves 83.7% success rate, reduces token usage by 60%, runs 1.82x faster, and effectively contains runtime perturbations with bounded edit radius.

Conclusion: The combination of validator isolation, versioned execution logs, and localized repair provides measurable efficiency, feasibility, and scalability for multi-agent LLM planning.

Abstract: Large language models enable flexible multi-agent planning but remain fragile in practice: verification is often circular, state changes are not tracked for repair, and small faults trigger costly global recomputation. We present ALAS, a stateful, disruption-aware framework that separates planning from non-circular validation, records a versioned execution log for grounded checks and restore points, and performs localized repair that preserves work in progress. The validator operates independently of the planning LLM with fresh, bounded context, avoiding self-check loops and mid-context attrition. The repair protocol edits only the minimal affected region under explicit policies (retry, catch, timeout, backoff, idempotency keys, compensation, loop guards) defined in a canonical workflow IR that maps to Amazon States Language and Argo Workflows. On job-shop scheduling suites (DMU, TA) across five classical benchmarks, ALAS matches or exceeds strong single-LLM and multi-agent baselines, achieving 83.7% success, reducing token usage by 60%, and running 1.82x faster under comparable settings. A minimal reliability study shows that the validator detects injected structural faults with low overhead, and that localized repair contains runtime perturbations with a bounded edit radius and less makespan degradation than global recompute. Results indicate that the combination of validator isolation, versioned execution logs, and localized repair provides measurable efficiency, feasibility, and scalability for multi-agent LLM planning. Code and seeds will be released.
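
One of the repair policies named above, retry with exponential backoff, is easy to sketch. The executor and the flaky step below are hypothetical illustrations of the policy's semantics, not ALAS's workflow-IR representation.

```python
import time

def run_with_retry(step, max_attempts=4, base_delay=0.01):
    """Run a step, retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return step()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise                      # budget exhausted: surface the fault
            time.sleep(base_delay * 2 ** attempt)  # 0.01s, 0.02s, 0.04s, ...

calls = {"n": 0}
def flaky_step():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient fault")  # fails twice, then succeeds
    return "ok"

print(run_with_retry(flaky_step))  # → ok, after two transient failures
```

The point of encoding such policies declaratively in a workflow IR is that the repair stays local: only the failing step is retried, and the rest of the plan is untouched.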

[352] Learning Communication Skills in Multi-task Multi-agent Deep Reinforcement Learning

Changxi Zhu, Mehdi Dastani, Shihan Wang

Main category: cs.MA

TL;DR: MCS is a multi-agent deep reinforcement learning method that enables agents to learn and perform multiple tasks simultaneously using learnable communication protocols and Transformer encoders.

Motivation: To improve multi-agent coordination across multiple tasks by leveraging shared communication skills and knowledge transfer between tasks.

Method: Uses Transformer encoder to encode task-specific observations into shared message space, with a prediction network that correlates messages with sender agents’ actions to enhance coordination.

Result: MCS outperforms multi-task MADRL baselines without communication and single-task MADRL baselines with and without communication on adapted multi-agent benchmark environments.

Conclusion: The proposed MCS method effectively enables multi-task learning with communication in multi-agent systems, demonstrating superior performance through shared communication skills and improved coordination.

Abstract: In multi-agent deep reinforcement learning (MADRL), agents can communicate with one another to perform a task in a coordinated manner. When multiple tasks are involved, agents can also leverage knowledge from one task to improve learning in other tasks. In this paper, we propose Multi-task Communication Skills (MCS), a MADRL with communication method that learns and performs multiple tasks simultaneously, with agents interacting through learnable communication protocols. MCS employs a Transformer encoder to encode task-specific observations into a shared message space, capturing shared communication skills among agents. To enhance coordination among agents, we introduce a prediction network that correlates messages with the actions of sender agents in each task. We adapt three multi-agent benchmark environments to multi-task settings, where the number of agents as well as the observation and action spaces vary across tasks. Experimental results demonstrate that MCS achieves better performance than multi-task MADRL baselines without communication, as well as single-task MADRL baselines with and without communication.

[353] Large Language Models Miss the Multi-Agent Mark

Emanuele La Malfa, Gabriele La Malfa, Samuele Marro, Jie M. Zhang, Elizabeth Black, Michael Luck, Philip Torr, Michael Wooldridge

Main category: cs.MA

TL;DR: Current MAS LLM frameworks misuse MAS terminology without implementing core multi-agent principles like autonomy, social interaction, and structured environments, risking inefficiency by ignoring established MAS research.

DetailsMotivation: To highlight the gap between MAS theory and current MAS LLM implementations, and advocate for proper integration of foundational MAS concepts to avoid mischaracterization and missed opportunities.

Method: Systematic analysis of discrepancies between MAS theory and MAS LLM implementations across four key areas: social agency, environment design, coordination protocols, and emergent behavior measurement.

Result: Identified that many MAS LLMs lack true multi-agent characteristics, rely on oversimplified LLM-centric architectures, and risk revisiting problems already solved in MAS literature.

Conclusion: Advocates for better integration of established MAS concepts and more precise terminology to ensure meaningful progress in MAS LLM research and avoid reinventing solutions.

Abstract: Recent interest in Multi-Agent Systems of Large Language Models (MAS LLMs) has led to an increase in frameworks leveraging multiple LLMs to tackle complex tasks. However, much of this literature appropriates the terminology of MAS without engaging with its foundational principles. In this position paper, we highlight critical discrepancies between MAS theory and current MAS LLMs implementations, focusing on four key areas: the social aspect of agency, environment design, coordination and communication protocols, and measuring emergent behaviours. Our position is that many MAS LLMs lack multi-agent characteristics such as autonomy, social interaction, and structured environments, and often rely on oversimplified, LLM-centric architectures. The field may slow down and lose traction by revisiting problems the MAS literature has already addressed. Therefore, we systematically analyse this issue and outline associated research opportunities; we advocate for better integrating established MAS concepts and more precise terminology to avoid mischaracterisation and missed opportunities.

cs.MM

[354] A Versatile Depth Video Encoding Scheme Based on Low-rank Tensor Modeling for Free Viewpoint Video

Mansi Sharma, Jyotsana Grover

Main category: cs.MM

TL;DR: A low-complexity depth video compression scheme using tensor decomposition and HEVC intra coding that efficiently compresses depth sequences while maintaining view synthesis quality.

DetailsMotivation: HEVC's depth map intra prediction with DMMs provides high compression efficiency but has very high encoding complexity, limiting practical use for 3D display applications.

Method: Proposed scheme uses low-rank tensor decomposition via CP decomposition with alternating least squares to represent depth sequences compactly, then compresses factor matrices with HEVC intra prediction.

Result: Achieves significant rate gains by efficiently compressing depth planes in low-rank representation, maintaining appropriate rendering quality for view synthesis in multi-view video systems.

Conclusion: The proposed approach enables flexible bitrate adjustment through tensor decomposition ranks and quantization parameters while reducing encoding complexity compared to conventional HEVC depth coding.

Abstract: The compression quality losses of depth sequences determine the quality of view synthesis in free-viewpoint video. The depth map intra prediction in the 3D extensions of HEVC applies intra modes with auxiliary depth modeling modes (DMMs) to better preserve depth edges and handle motion discontinuities. Although such modes enable high-efficiency compression, they come at the cost of very high encoding complexity. Skipping conventional intra coding modes and DMMs in depth coding limits the practical applicability of HEVC for 3D display applications. In this paper, we introduce a novel low-complexity scheme for depth video compression based on low-rank tensor decomposition and HEVC intra coding. The proposed scheme leverages spatial and temporal redundancy by compactly representing the depth sequence as a high-order tensor. Tensor factorization into a set of factor matrices following the CANDECOMP/PARAFAC (CP) decomposition via alternating least squares gives a low-rank approximation of the scene geometry. Further, compression of the factor matrices with HEVC intra prediction supports arbitrary target accuracy through flexible adjustment of the bitrate via the tensor decomposition ranks and quantization parameters. The results demonstrate that the proposed approach achieves significant rate gains by efficiently compressing depth planes in the low-rank approximated representation. The proposed algorithm is applied to encode depth maps of the benchmark Ballet and Breakdancing sequences. The decoded depth sequences are used for view synthesis in a multi-view video system, maintaining appropriate rendering quality.
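The core compression step above is a CP decomposition of the depth-sequence tensor computed by alternating least squares (ALS). A minimal numpy sketch of CP-ALS for a third-order tensor follows; the tensor sizes, rank, and iteration count are illustrative, not the paper's settings.

```python
import numpy as np

def khatri_rao(U, V):
    # Column-wise Kronecker product: column r is kron(U[:, r], V[:, r]).
    R = U.shape[1]
    return np.einsum("ir,jr->ijr", U, V).reshape(-1, R)

def unfold(X, mode):
    # Mode-n unfolding (Kolda convention) via Fortran-order reshape.
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order="F")

def cp_als(X, rank, iters=200, seed=0):
    # Alternating least squares for the CP model X ≈ [[A, B, C]]:
    # each sweep solves a linear least-squares problem for one factor
    # while the other two are held fixed.
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((n, rank)) for n in X.shape)
    for _ in range(iters):
        A = unfold(X, 0) @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
        B = unfold(X, 1) @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
        C = unfold(X, 2) @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
    return A, B, C

def reconstruct(A, B, C):
    # Rebuild the full tensor from the factor matrices.
    return np.einsum("ir,jr,kr->ijk", A, B, C)
```

In the paper's pipeline the resulting factor matrices, rather than the raw depth frames, would then be fed to HEVC intra coding; the rank chosen here trades bitrate against geometric accuracy.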

eess.AS

[355] Seeing What You Say: Expressive Image Generation from Speech

Jiyoung Lee, Song Park, Sanghyuk Chun, Soo-Whan Chung

Main category: eess.AS

TL;DR: VoxStudio is the first end-to-end speech-to-image model that generates expressive images directly from spoken descriptions, using a speech information bottleneck to preserve prosody and emotional nuance without needing speech-to-text conversion.

DetailsMotivation: Existing speech-to-image systems rely on speech-to-text conversion which loses paralinguistic information like tone and emotion. There's a need for unified models that can directly process speech while preserving these expressive details.

Method: Uses a speech information bottleneck (SIB) module to compress raw speech into compact semantic tokens that preserve prosody and emotional nuance. Also introduces VoxEmoset, a large-scale emotional speech-image dataset generated via TTS engine.

Result: Comprehensive experiments on SpokenCOCO, Flickr8kAudio, and VoxEmoset benchmarks demonstrate the method’s feasibility. The model successfully generates expressive images directly from speech while highlighting challenges like emotional consistency and linguistic ambiguity.

Conclusion: VoxStudio represents a significant advancement in speech-to-image generation by eliminating the need for intermediate text conversion and preserving emotional nuance, paving the way for future research in this domain.

Abstract: This paper proposes VoxStudio, the first unified and end-to-end speech-to-image model that generates expressive images directly from spoken descriptions by jointly aligning linguistic and paralinguistic information. At its core is a speech information bottleneck (SIB) module, which compresses raw speech into compact semantic tokens, preserving prosody and emotional nuance. By operating directly on these tokens, VoxStudio eliminates the need for an additional speech-to-text system, which often ignores the hidden details beyond text, e.g., tone or emotion. We also release VoxEmoset, a large-scale paired emotional speech-image dataset built via an advanced TTS engine to affordably generate richly expressive utterances. Comprehensive experiments on the SpokenCOCO, Flickr8kAudio, and VoxEmoset benchmarks demonstrate the feasibility of our method and highlight key challenges, including emotional consistency and linguistic ambiguity, paving the way for future research.

[356] Quantifying Articulatory Coordination as a Biomarker for Schizophrenia

Gowtham Premananth, Carol Espy-Wilson

Main category: eess.AS

TL;DR: An interpretable framework using articulatory speech features and WSED scoring to quantify vocal tract coordination in schizophrenia, correlating with symptom severity and balance.

DetailsMotivation: Limited interpretability of AI in healthcare hinders clinical adoption, especially for complex disorders like schizophrenia that need tools capturing symptom severity beyond binary diagnosis.

Method: Leverages articulatory speech features through eigenspectra difference plots and weighted sum with exponential decay (WSED) to quantify vocal tract coordination.

Result: Eigenspectra plots distinguished complex vs simple coordination patterns; WSED scores reliably separated groups and correlated with BPRS severity and positive/negative symptom balance.

Conclusion: Provides a transparent, severity-sensitive biomarker for schizophrenia, advancing clinically interpretable speech-based assessment tools.

Abstract: Advances in artificial intelligence (AI) and deep learning have improved diagnostic capabilities in healthcare, yet limited interpretability continues to hinder clinical adoption. Schizophrenia, a complex disorder with diverse symptoms including disorganized speech and social withdrawal, demands tools that capture symptom severity and provide clinically meaningful insights beyond binary diagnosis. Here, we present an interpretable framework that leverages articulatory speech features through eigenspectra difference plots and a weighted sum with exponential decay (WSED) to quantify vocal tract coordination. Eigenspectra plots effectively distinguished complex from simpler coordination patterns, and WSED scores reliably separated these groups, with ambiguity confined to a narrow range near zero. Importantly, WSED scores correlated not only with overall BPRS severity but also with the balance between positive and negative symptoms, reflecting more complex coordination in subjects with pronounced positive symptoms and the opposite trend for stronger negative symptoms. This approach offers a transparent, severity-sensitive biomarker for schizophrenia, advancing the potential for clinically interpretable speech-based assessment tools.
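The abstract names the scoring rule only as a "weighted sum with exponential decay (WSED)" over eigenspectra differences; the exact weighting constants and ordering are not given. The sketch below is therefore one plausible instantiation, with the decay rate and the convention of weighting leading eigenvalues most heavily both being assumptions.

```python
import numpy as np

def wsed_score(eigenspectra_diff, decay=0.5):
    """Weighted sum with exponential decay (WSED) over an ordered
    eigenspectra-difference vector. The exponential weighting and the
    decay constant used here are illustrative assumptions; the paper's
    exact formulation is not specified in the abstract."""
    d = np.asarray(eigenspectra_diff, dtype=float)
    weights = np.exp(-decay * np.arange(d.size))  # emphasize leading eigenvalues
    return float(np.sum(weights * d))
```

Under this reading, the sign and magnitude of the score summarize whether the coordination pattern is more complex or simpler than the comparison group, matching the summary's description of scores separating groups with ambiguity confined near zero.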

[357] audio2chart: End to End Audio Transcription into playable Guitar Hero charts

Riccardo Tripodi

Main category: eess.AS

TL;DR: Audio2Chart framework automatically generates Guitar Hero style charts from raw audio using sequence prediction with audio conditioning.

DetailsMotivation: To create an automated system for generating rhythm game charts directly from audio, eliminating manual chart creation.

Method: Formalized as sequence prediction problem, training models to generate discrete chart tokens aligned with audio on discrete time steps, with both unconditional baseline and audio-conditioned approaches.

Result: Unconditional baseline showed strong performance, while audio conditioning consistently improved accuracy across metrics, demonstrating feasibility and effectiveness of audio conditioning for note prediction.

Conclusion: Audio conditioning is both feasible and effective for improving automatic chart generation, with codebase and pretrained models publicly available for reproducible research.

Abstract: This work introduces audio2chart, a framework for the automatic generation of Guitar Hero-style charts directly from raw audio. The task is formalized as a sequence prediction problem, where models are trained to generate discrete chart tokens aligned with the audio on discrete time steps. An unconditional baseline demonstrates strong predictive performance, while the addition of audio conditioning yields consistent improvements across accuracy-based metrics. This work demonstrates that incorporating audio conditioning is both feasible and effective for improving note prediction in automatic chart generation. The complete codebase for training and inference is publicly available on GitHub supporting reproducible research on neural chart generation. A family of pretrained models is released on Hugging Face.
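The task formalization above requires mapping note events onto discrete chart tokens aligned to a fixed time grid. The sketch below shows one plausible tokenization, packing simultaneous lane onsets at each step into a single multi-hot token; this vocabulary scheme is an assumption for illustration, not the paper's exact encoding.

```python
def chart_to_tokens(note_events, hop_s, n_steps):
    """Quantize (time_sec, lane) note onsets onto a fixed time grid.
    Each step's token packs the active lanes as a bit mask (token 0 =
    no note). This encoding is an illustrative assumption, not the
    paper's actual vocabulary."""
    tokens = [0] * n_steps
    for t, lane in note_events:
        step = int(round(t / hop_s))   # nearest grid step for this onset
        if 0 <= step < n_steps:
            tokens[step] |= 1 << lane  # set the bit for this fret lane
    return tokens
```

For example, with a 0.25 s hop, chords at 0.5 s (lanes 0 and 2) and a note at 1.0 s (lane 1) land on steps 2 and 4 as tokens 5 and 2.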

[358] Speech-Based Prioritization for Schizophrenia Intervention

Gowtham Premananth, Philip Resnik, Sonia Bansal, Deanna L. Kelly, Carol Espy-Wilson

Main category: eess.AS

TL;DR: Speech-based AI model for pairwise comparison of schizophrenia symptom severity using articulatory and acoustic features, outperforming regression-based approaches for clinical triage.

DetailsMotivation: Address limited clinical resources and labor-intensive mental health assessments by providing scalable, automated monitoring to prioritize care in resource-constrained settings.

Method: Uses speech-based model with articulatory and acoustic features for pairwise symptom severity comparisons, then applies Bradley-Terry model to generate severity rankings.

Result: Outperforms previous regression-based models on ranking-based metrics, providing more effective clinical triage and prioritization.

Conclusion: Speech-based pairwise comparison approach offers superior performance for schizophrenia symptom severity assessment compared to traditional regression methods.

Abstract: Millions of people suffer from mental health conditions, yet many remain undiagnosed or receive delayed care due to limited clinical resources and labor-intensive assessment methods. While most machine-assisted approaches focus on diagnostic classification, estimating symptom severity is essential for prioritizing care, particularly in resource-constrained settings. Speech-based AI provides a scalable alternative by enabling automated, continuous, and remote monitoring, reducing reliance on subjective self-reports and time-consuming evaluations. In this paper, we introduce a speech-based model for pairwise comparison of schizophrenia symptom severity, leveraging articulatory and acoustic features. These comparisons are used to generate severity rankings via the Bradley-Terry model. Our approach outperforms previous regression-based models on ranking-based metrics, offering a more effective solution for clinical triage and prioritization.
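The ranking step above converts pairwise severity comparisons into a global ordering via the Bradley-Terry model. A minimal sketch using the classic minorization-maximization (Zermelo) iteration follows; the win-count input format is an assumption for illustration.

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix
    (wins[i, j] = number of comparisons where item i beat item j),
    using the standard minorization-maximization update."""
    n_ij = wins + wins.T                  # total comparisons per pair
    w = wins.sum(axis=1)                  # total wins per item
    p = np.ones(wins.shape[0])
    for _ in range(iters):
        # MM update: p_i <- w_i / sum_j n_ij / (p_i + p_j)
        denom = np.where(n_ij > 0, n_ij / (p[:, None] + p[None, :]), 0.0).sum(axis=1)
        p = w / np.maximum(denom, 1e-12)
        p = p / p.sum()                   # fix the scale (strengths are relative)
    return p
```

Sorting subjects by the fitted strengths yields the severity ranking used for triage; in the paper the pairwise outcomes would come from the speech-based comparison model rather than observed "wins".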

[359] TASU: Text-Only Alignment for Speech Understanding

Jing Peng, Yi Yang, Xu Li, Yu Xi, Quanwei Tang, Yangui Fang, Junjie Li, Kai Yu

Main category: eess.AS

TL;DR: TASU is a novel text-only alignment paradigm for Speech LLMs that uses unpaired text data for cross-modal alignment, achieving competitive zero-shot speech recognition and superior performance on speech understanding tasks.

DetailsMotivation: Current Speech LLM alignment methods require large-scale audio-text paired data and intensive training, yet have limited generalization to unseen domains/tasks.

Method: Proposed TASU (Text-only Alignment for Speech Understanding) that leverages only unpaired text data to guide cross-modal alignment, enabling zero-shot capabilities.

Result: Achieves competitive zero-shot speech recognition, enhances domain generalization in curriculum learning, and outperforms prominent Speech LLMs on MMSU benchmark.

Conclusion: TASU establishes an efficient and scalable alignment paradigm for Speech LLMs that reduces data dependency while improving generalization.

Abstract: Recent advances in Speech Large Language Models (Speech LLMs) have paved the way for unified architectures across diverse speech understanding tasks. However, prevailing alignment paradigms rely heavily on large-scale audio-text paired data and computationally intensive training, yet often exhibit limited generalization to unseen domains or tasks. To address these limitations, we propose TASU (Text-only Alignment for Speech Understanding), a novel alignment paradigm that can leverage only unpaired text data to guide cross-modal alignment. Experiments show that TASU achieves competitive zero-shot speech recognition. Leveraging this property, it can further function as a pre-training stage in curriculum learning, enhancing domain generalization in speech recognition. Ultimately, TASU can extend its zero-shot generalization to a wide range of speech understanding tasks and notably outperforms prominent Speech LLMs including GLM-4-Voice and Step-Audio on the MMSU benchmark, establishing TASU as an efficient and scalable alignment paradigm for Speech LLMs.

[360] Open Source State-Of-the-Art Solution for Romanian Speech Recognition

Gabriel Pirlogeanu, Alexandru-Lucian Georgescu, Horia Cucu

Main category: eess.AS

TL;DR: New state-of-the-art Romanian ASR system using FastConformer architecture achieves 27% WER reduction across all benchmarks with efficient decoding suitable for low-latency applications.

DetailsMotivation: To develop a high-performance Romanian Automatic Speech Recognition system using modern neural architectures not previously explored for Romanian language.

Method: FastConformer architecture trained on 2,600+ hours of speech with hybrid CTC-TDT decoder, evaluated using greedy, ALSD, and CTC beam search with 6-gram token-level language model.

Result: Achieved state-of-the-art performance across all Romanian benchmarks (read, spontaneous, domain-specific speech) with up to 27% relative WER reduction compared to previous best systems.

Conclusion: The proposed Romanian ASR system demonstrates both superior transcription accuracy and practical decoding efficiency, making it suitable for research and low-latency deployment.

Abstract: In this work, we present a new state-of-the-art Romanian Automatic Speech Recognition (ASR) system based on NVIDIA’s FastConformer architecture, explored here for the first time in the context of Romanian. We train our model on a large corpus of mostly weakly supervised transcriptions, totaling over 2,600 hours of speech. Leveraging a hybrid decoder with both Connectionist Temporal Classification (CTC) and Token-Duration Transducer (TDT) branches, we evaluate a range of decoding strategies including greedy, ALSD, and CTC beam search with a 6-gram token-level language model. Our system achieves state-of-the-art performance across all Romanian evaluation benchmarks, including read, spontaneous, and domain-specific speech, with up to 27% relative WER reduction compared to previous best-performing systems. In addition to improved transcription accuracy, our approach demonstrates practical decoding efficiency, making it suitable for both research and deployment in low-latency ASR applications.

[361] MEDIC: Zero-shot Music Editing with Disentangled Inversion Control

Huadai Liu, Jialei Wang, Xiangtai Li, Wen Wang, Qian Chen, Rongjie Huang, Yang Liu, Jiayang Xu, Zhou Zhao

Main category: eess.AS

TL;DR: MEDIC is a zero-shot music editing system that uses Disentangled Inversion Control to fix DDIM inversion errors and enable complex non-rigid music edits while maintaining content integrity and fidelity.

DetailsMotivation: Existing zero-shot audio editing methods accumulate errors across diffusion steps and struggle with complex non-rigid music edits while preserving content integrity and high fidelity.

Method: Proposes Disentangled Inversion Control (DIC) with Harmonized Attention Control and Disentangled Inversion. Disentangled Inversion uses triple branches to rectify DDIM inversion errors, while Harmonized Attention Control unifies self-attention and cross-attention control with a Harmonic Branch for progressive harmonic/melodic generation.

Result: Outperforms state-of-the-art inversion techniques in editing fidelity and content preservation. Introduces ZoME-Bench benchmark with 1,100 samples across 10 editing categories for both zero-shot and instruction-based music editing.

Conclusion: MEDIC successfully addresses limitations of existing methods by providing a novel zero-shot music editing system that maintains high fidelity and content integrity through innovative disentanglement and attention control techniques.

Abstract: Text-guided diffusion models revolutionize audio generation by adapting source audio to specific text prompts. However, existing zero-shot audio editing methods such as DDIM inversion accumulate errors across diffusion steps, reducing the effectiveness. Moreover, existing editing methods struggle with conducting complex non-rigid music edits while maintaining content integrity and high fidelity. To address these challenges, we propose MEDIC, a novel zero-shot music editing system based on innovative Disentangled Inversion Control (DIC) technique, which comprises Harmonized Attention Control and Disentangled Inversion. Disentangled Inversion disentangles the diffusion process into triple branches to rectify the deviated path of the source branch caused by DDIM inversion. Harmonized Attention Control unifies the mutual self-attention control and the cross-attention control with an intermediate Harmonic Branch to progressively generate the desired harmonic and melodic information in the target music. We also introduce ZoME-Bench, a comprehensive music editing benchmark with 1,100 samples covering ten distinct editing categories. ZoME-Bench facilitates both zero-shot and instruction-based music editing tasks. Our method outperforms state-of-the-art inversion techniques in editing fidelity and content preservation. The code and benchmark will be released. Audio samples are available at https://medic-edit.github.io/.

[362] Listen to Extract: Onset-Prompted Target Speaker Extraction

Pengjie Shen, Kangrui Chen, Shulin He, Pengru Chen, Shuqi Yuan, He Kong, Xueliang Zhang, Zhong-Qiu Wang

Main category: eess.AS

TL;DR: LExt is a simple yet effective monaural target speaker extraction method that concatenates target speaker enrollment audio with mixed speech at the waveform level, using deep neural networks to extract the target speech by leveraging artificial speech onsets.

DetailsMotivation: To develop a highly effective but extremely simple algorithm for monaural target speaker extraction that can identify and extract target speakers from mixed speech using minimal processing.

Method: Concatenate enrollment utterance of target speaker with mixture signal at waveform level, then train deep neural networks to extract target speech from the concatenated signal, leveraging artificial speech onsets to guide extraction.

Result: Achieves strong target speaker extraction performance on multiple public datasets including WSJ0-2mix, WHAM! and WHAMR!.

Conclusion: The simple approach of creating artificial speech onsets by concatenating enrollment utterances with mixtures is highly effective for target speaker extraction, providing both speaker identification and spectral-temporal pattern guidance to neural networks.

Abstract: We propose listen to extract (LExt), a highly effective yet extremely simple algorithm for monaural target speaker extraction (TSE). Given an enrollment utterance of a target speaker, LExt aims at extracting the target speaker from the speaker’s mixed speech with other speakers. For each mixture, LExt concatenates an enrollment utterance of the target speaker to the mixture signal at the waveform level, and trains deep neural networks (DNN) to extract the target speech based on the concatenated mixture signal. The rationale is that, this way, an artificial speech onset is created for the target speaker and it could prompt the DNN (a) which speaker is the target to extract; and (b) spectral-temporal patterns of the target speaker that could help extraction. This simple approach produces strong TSE performance on multiple public TSE datasets including WSJ0-2mix, WHAM! and WHAMR!.
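The input construction described above is literally waveform-level concatenation. The tiny sketch below reflects that; the aligned training-target construction in the second function is an assumption inferred from the abstract, which only specifies the input side.

```python
import numpy as np

def lext_input(enrollment, mixture):
    """Build the LExt network input by concatenating the target speaker's
    enrollment utterance before the mixture, creating an artificial
    speech onset for the target speaker (per the abstract)."""
    return np.concatenate([enrollment, mixture])

def lext_target(enrollment, target_speech):
    """One plausible aligned training target: the enrollment region
    followed by the target speaker's clean speech. This alignment is
    an assumption; the paper's abstract does not specify the target."""
    return np.concatenate([enrollment, target_speech])
```

Both signals then have the same length, so a standard sample-aligned extraction loss can be applied over the concatenated waveform.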

[363] ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing

Huadai Liu, Kaicheng Luo, Jialei Wang, Wen Wang, Qian Chen, Zhou Zhao, Wei Xue

Main category: eess.AS

TL;DR: ThinkSound is a novel framework that uses Chain-of-Thought reasoning for stepwise video-to-audio generation and editing, achieving state-of-the-art performance through three complementary stages: foley generation, object-centric refinement, and targeted editing.

DetailsMotivation: Current end-to-end video-to-audio generation struggles to produce high-fidelity audio that authentically captures visual nuances, requiring sophisticated reasoning about visual dynamics, acoustic environments, and temporal relationships.

Method: Decomposes audio generation into three stages: foundational foley generation, interactive object-centric refinement via user interactions, and targeted editing guided by natural language instructions. Uses multimodal LLM to generate CoT reasoning that guides a unified audio foundation model.

Result: Achieves state-of-the-art performance in video-to-audio generation across audio metrics and CoT metrics, and excels in the out-of-distribution Movie Gen Audio benchmark.

Conclusion: ThinkSound demonstrates the effectiveness of Chain-of-Thought reasoning for high-fidelity video-to-audio generation, enabling stepwise, interactive audio creation and editing with superior performance.

Abstract: While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. Like professionals in the creative industries, this generation requires sophisticated reasoning about items such as visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics, and excels in the out-of-distribution Movie Gen Audio benchmark. The project page is available at https://ThinkSound-Project.github.io.

[364] StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction

Qianheng Xu

Main category: eess.AS

TL;DR: StutterZero and StutterFormer are the first end-to-end waveform-to-waveform models that directly convert stuttered speech into fluent speech while jointly predicting transcriptions, achieving significant improvements over existing methods.

DetailsMotivation: Over 70 million people worldwide experience stuttering, but current speech systems misinterpret disfluent utterances. Existing methods use multi-stage pipelines that separate transcription from audio reconstruction and amplify distortions.

Method: StutterZero uses convolutional-bidirectional LSTM encoder-decoder with attention, while StutterFormer integrates dual-stream Transformer with shared acoustic-linguistic representations. Both trained on paired stuttered-fluent data from SEP-28K and LibriStutter corpora.

Result: StutterZero achieved 24% decrease in Word Error Rate and 31% improvement in BERTScore vs Whisper-Medium. StutterFormer performed better with 28% decrease in WER and 34% improvement in BERTScore on FluencyBank dataset.

Conclusion: The results validate direct end-to-end stutter-to-fluent speech conversion, offering new opportunities for inclusive human-computer interaction, speech therapy, and accessibility-oriented AI systems.

Abstract: Over 70 million people worldwide experience stuttering, yet most automatic speech systems misinterpret disfluent utterances or fail to transcribe them accurately. Existing methods for stutter correction rely on handcrafted feature extraction or multi-stage automatic speech recognition (ASR) and text-to-speech (TTS) pipelines, which separate transcription from audio reconstruction and often amplify distortions. This work introduces StutterZero and StutterFormer, the first end-to-end waveform-to-waveform models that directly convert stuttered speech into fluent speech while jointly predicting its transcription. StutterZero employs a convolutional-bidirectional LSTM encoder-decoder with attention, whereas StutterFormer integrates a dual-stream Transformer with shared acoustic-linguistic representations. Both architectures are trained on paired stuttered-fluent data synthesized from the SEP-28K and LibriStutter corpora and evaluated on unseen speakers from the FluencyBank dataset. Across all benchmarks, StutterZero achieved a 24% decrease in Word Error Rate (WER) and a 31% improvement in semantic similarity (BERTScore) compared to the leading Whisper-Medium model. StutterFormer achieved better results, with a 28% decrease in WER and a 34% improvement in BERTScore. The results validate the feasibility of direct end-to-end stutter-to-fluent speech conversion, offering new opportunities for inclusive human-computer interaction, speech therapy, and accessibility-oriented AI systems.

eess.IV

[365] Optimizing the nnU-Net model for brain tumor (Glioma) segmentation Using a BraTS Sub-Saharan Africa (SSA) dataset

Chukwuemeka Arua Kalu, Adaobi Chiazor Emegoakor, Fortune Okafor, Augustine Okoh Uchenna, Chijioke Kelvin Ukpai, Godsent Erere Onyeugbo

Main category: eess.IV

TL;DR: Medical image segmentation study using BraTS Sub-Saharan Africa dataset shows that nnU-Net trained on original 60 MRI cases outperformed models trained on offline-augmented data, achieving 0.84 Dice score for whole tumor segmentation.

DetailsMotivation: To develop accurate medical image segmentation models for glioma detection in under-represented regions like Sub-Saharan Africa, and investigate the impact of data quality and augmentation strategies on model performance.

Method: Used BraTS Sub-Saharan Africa dataset with 60 multimodal MRI cases, compared nnU-Net performance between original dataset and offline-augmented dataset (360 cases), leveraging nnU-Net’s robust online augmentation procedures.

Result: nnU-Net trained on original 60 cases performed better than on offline-augmented 360 cases, achieving Dice score of 0.84 for whole tumor segmentation. Offline augmentations introduced artificial variances that reduced generalization.

Conclusion: Data quality and proper augmentation approaches are crucial for building accurate, generalizable medical image segmentation models, especially for under-represented regions. Original data with robust online augmentation outperforms artificial offline augmentation.

Abstract: Medical image segmentation is a critical achievement of modern medical science, developed over decades of research. It allows the exact delineation of anatomical and pathological features in two- or three-dimensional images by utilizing notions like pixel intensity, texture, and anatomical context. With the advent of automated segmentation, physicians and radiologists may now concentrate on diagnosis and treatment planning while intelligent computers perform routine image-processing tasks. This study used the BraTS Sub-Saharan Africa dataset, a selected subset of the BraTS dataset that includes 60 multimodal MRI cases from patients with glioma. Surprisingly, the nnU-Net model trained on the initial 60 instances performed better than the network trained on an offline-augmented dataset of 360 cases. Hypothetically, the offline augmentations introduced artificial anatomical variances or intensity distributions, reducing generalization. In contrast, the original dataset, when paired with nnU-Net’s robust online augmentation procedures, maintained realistic variability and produced better results. The study achieved a Dice score of 0.84 for whole-tumor segmentation. These findings highlight the significance of data quality and proper augmentation approaches in constructing accurate, generalizable medical image segmentation models, particularly for under-represented regions.

[366] Domain-Adaptive Transformer for Data-Efficient Glioma Segmentation in Sub-Saharan MRI

Ilerioluwakiiye Abolade, Aniekan Udo, Augustine Ojo, Abdulbasit Oyetunji, Hammed Ajigbotosho, Aondana Iorumbur, Confidence Raymond, Maruf Adewole

Main category: eess.IV

TL;DR: SegFormer3D-plus: A radiomics-guided transformer for robust glioma segmentation in Sub-Saharan Africa, addressing domain shift from heterogeneous MRI protocols through intensity harmonization, domain-aware sampling, and dual-pathway encoding.

DetailsMotivation: Glioma segmentation is critical for diagnosis and treatment but challenging in Sub-Saharan Africa due to limited MRI infrastructure and heterogeneous acquisition protocols causing severe domain shift.

Method: Combines histogram matching for intensity harmonization, radiomic feature extraction with PCA-reduced k-means for domain-aware stratified sampling, dual-pathway encoder with frequency-aware features and spatial-channel attention, and composite Dice-Cross-Entropy loss.

Result: Demonstrates improved tumor subregion delineation and boundary localization across heterogeneous African clinical scans when pretrained on BraTS 2023 and fine-tuned on BraTS-Africa data.

Conclusion: Highlights the value of radiomics-guided domain adaptation for robust glioma segmentation in resource-limited settings with domain variability.

Abstract: Glioma segmentation is critical for diagnosis and treatment planning, yet remains challenging in Sub-Saharan Africa due to limited MRI infrastructure and heterogeneous acquisition protocols that induce severe domain shift. We propose SegFormer3D-plus, a radiomics-guided transformer architecture designed for robust segmentation under domain variability. Our method combines: (1) histogram matching for intensity harmonization across scanners, (2) radiomic feature extraction with PCA-reduced k-means for domain-aware stratified sampling, (3) a dual-pathway encoder with frequency-aware feature extraction and spatial-channel attention, and (4) composite Dice-Cross-Entropy loss for boundary refinement. Pretrained on BraTS 2023 and fine-tuned on BraTS-Africa data, SegFormer3D-plus demonstrates improved tumor subregion delineation and boundary localization across heterogeneous African clinical scans, highlighting the value of radiomics-guided domain adaptation for resource-limited settings.
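The composite Dice-Cross-Entropy loss in step (4) can be sketched in numpy; the paper does not give its weighting, so the `w_dice`/`w_ce` coefficients below are illustrative placeholders:

```python
import numpy as np

def dice_ce_loss(probs, target, w_dice=0.5, w_ce=0.5, eps=1e-7):
    """Composite Dice + binary cross-entropy loss.
    probs: predicted foreground probabilities in (0, 1); target: {0, 1} mask."""
    probs = probs.ravel().astype(float)
    target = target.ravel().astype(float)
    inter = (probs * target).sum()
    dice = (2.0 * inter + eps) / (probs.sum() + target.sum() + eps)
    ce = -np.mean(target * np.log(probs + eps)
                  + (1.0 - target) * np.log(1.0 - probs + eps))
    return w_dice * (1.0 - dice) + w_ce * ce

target = np.array([1.0, 1.0, 0.0, 0.0])
good = np.array([0.95, 0.90, 0.05, 0.10])  # confident, mostly correct
bad = np.array([0.10, 0.20, 0.90, 0.80])   # confidently wrong
```

The Dice term rewards region overlap while the cross-entropy term sharpens per-pixel boundaries, which is why the combination is popular for boundary refinement.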

[367] SAAIPAA: Optimizing aspect-angles-invariant physical adversarial attacks on SAR target recognition models

Isar Lemeire, Yee Wei Law, Sang-Heon Lee, Will Meakin, Tat-Jun Chin

Main category: eess.IV

TL;DR: SAAIPAA is a novel framework for physical adversarial attacks on SAR automatic target recognition systems that remains effective even without knowledge of the SAR platform’s aspect angles, achieving over 80% fooling rates.

DetailsMotivation: The surge of adversarial attacks against ML-based SAR ATR systems requires systematic research into adversarial perturbation mechanisms, particularly in the physical domain.

Method: Deploys at least one reflector in each azimuthal quadrant and optimizes reflector orientations to create aspect-angles-invariant physical adversarial attacks that evade ML-based ATR.

Result: Achieves state-of-the-art fooling rates (over 80% for DenseNet-121 and ResNet50) in white-box settings, and 99.2% when aspect angles are known. Shows good transferability between some models but limited transferability to others.

Conclusion: SAAIPAA provides an efficient and optimal framework for physical evasion attacks on SAR ATR systems, with the additional contribution of a method for generating bounding boxes for densely sampled azimuthal SAR datasets.

Abstract: Synthetic aperture radar (SAR) enables versatile, all-time, all-weather remote sensing. Coupled with automatic target recognition (ATR) leveraging machine learning (ML), SAR is empowering a wide range of Earth observation and surveillance applications. However, the surge of attacks based on adversarial perturbations against the ML algorithms underpinning SAR ATR is prompting the need for systematic research into adversarial perturbation mechanisms. Research in this area began in the digital (image) domain and evolved into the physical (signal) domain, resulting in physical adversarial attacks (PAAs) that strategically exploit corner reflectors as attack vectors to evade ML-based ATR. This paper proposes a novel framework called SAR Aspect-Angles-Invariant Physical Adversarial Attack (SAAIPAA) for physics-based modelling of reflector-actuated adversarial perturbations, which improves on the rigor of prior work. A unique feature of SAAIPAA is its ability to remain effective even when the attacker lacks knowledge of the SAR platform’s aspect angles, by deploying at least one reflector in each azimuthal quadrant and optimizing reflector orientations. The resultant physical evasion attacks are efficiently realizable and optimal over the considered range of aspect angles between a SAR platform and a target, achieving state-of-the-art fooling rates (over 80% for DenseNet-121 and ResNet50) in the white-box setting. When aspect angles are known to the attacker, an average fooling rate of 99.2% is attainable. In black-box settings, although the attack efficacy of SAAIPAA transfers well between some models (e.g., from ResNet50 to DenseNet121), the transferability to some models (e.g., MobileNetV2) can be improved. As a useful by-product of using the MSTAR dataset for the experiments in this article, a method for generating bounding boxes for densely sampled azimuthal SAR datasets is introduced.
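The fooling rates above follow the usual evasion metric. Definitions vary slightly across papers, so this numpy sketch uses one common convention (label flips among clean-correct samples), not necessarily the paper's exact formula:

```python
import numpy as np

def fooling_rate(clean_pred, adv_pred, labels):
    """Fraction of correctly classified clean samples whose prediction
    flips to a wrong class under the attack (one common definition)."""
    clean_ok = clean_pred == labels
    fooled = clean_ok & (adv_pred != labels)
    return fooled.sum() / max(clean_ok.sum(), 1)

labels = np.array([0, 1, 2, 3])
clean_pred = np.array([0, 1, 2, 0])   # 3 of 4 correct before the attack
adv_pred = np.array([1, 1, 0, 0])     # 2 of those 3 flipped by the attack
```

Here the attack fools 2 of the 3 samples the model originally classified correctly, a fooling rate of about 67%.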

[368] Morpho-Genomic Deep Learning for Ovarian Cancer Subtype and Gene Mutation Prediction from Histopathology

Gabriela Fernandes

Main category: eess.IV

TL;DR: A hybrid deep learning pipeline combining ResNet-50 CNN and Vision Transformer achieves 84.2% accuracy in ovarian cancer subtype classification and can infer gene mutations (TP53, BRCA1, ARID1A) directly from H&E histopathology images.

DetailsMotivation: Ovarian cancer has high mortality due to late diagnosis and heterogeneity. Current diagnostic methods lack the ability to reveal genomic variations needed for precision oncology.

Method: Developed a fusion model combining ResNet-50 CNN encoder and Vision Transformer using ~45,000 H&E image patches from TCGA and public datasets to capture both local morphological texture and global tissue context.

Result: Achieved 84.2% subtype classification accuracy (Macro AUC 0.87±0.03) and gene mutation inference with AUCs: TP53=0.82±0.02, BRCA1=0.76±0.04, ARID1A=0.73±0.05. Nuclear solidity and eccentricity were key predictors for TP53 mutation.

Conclusion: Quantifiable histological phenotypes encode measurable genomic signals, enabling cost-effective precision histopathology for ovarian cancer diagnosis and triage.

Abstract: Ovarian cancer remains one of the most lethal gynecological malignancies, largely due to late diagnosis and extensive heterogeneity across subtypes. Current diagnostic methods are limited in their ability to reveal underlying genomic variations essential for precision oncology. This study introduces a novel hybrid deep learning pipeline that integrates quantitative nuclear morphometry with deep convolutional image features to perform ovarian cancer subtype classification and gene mutation inference directly from Hematoxylin and Eosin (H&E) histopathological images. Using ~45,000 image patches sourced from The Cancer Genome Atlas (TCGA) and public datasets, a fusion model combining a ResNet-50 Convolutional Neural Network (CNN) encoder and a Vision Transformer (ViT) was developed. This model successfully captured both local morphological texture and global tissue context. The pipeline achieved a robust overall subtype classification accuracy of 84.2% (Macro AUC of 0.87 ± 0.03). Crucially, the model demonstrated the capacity for gene mutation inference with moderate-to-high accuracy: AUC(TP53) = 0.82 ± 0.02, AUC(BRCA1) = 0.76 ± 0.04, and AUC(ARID1A) = 0.73 ± 0.05. Feature importance analysis established direct quantitative links, revealing that nuclear solidity and eccentricity were the dominant predictors for TP53 mutation. These findings validate that quantifiable histological phenotypes encode measurable genomic signals, paving the way for cost-effective, precision histopathology in ovarian cancer triage and diagnosis.
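The CNN-ViT fusion can be illustrated with pre-extracted embeddings. The paper does not specify its fusion operator, so concatenation followed by a linear softmax head is an assumption here; the feature sizes (2048 and 768) are the standard ResNet-50 and ViT-Base dimensions, and the random weights are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-extracted embeddings for one image patch
cnn_feat = rng.standard_normal(2048)   # ResNet-50 pooled features
vit_feat = rng.standard_normal(768)    # ViT [CLS] token

def fuse_and_classify(cnn_feat, vit_feat, W, b):
    """Late fusion by concatenation, then a linear softmax head."""
    fused = np.concatenate([cnn_feat, vit_feat])   # (2816,)
    logits = W @ fused + b                         # (n_subtypes,)
    exp = np.exp(logits - logits.max())            # numerically stable softmax
    return exp / exp.sum()

n_subtypes = 5
W = rng.standard_normal((n_subtypes, 2048 + 768)) * 0.01  # placeholder weights
b = np.zeros(n_subtypes)
probs = fuse_and_classify(cnn_feat, vit_feat, W, b)
```

Late concatenation lets the CNN branch carry local texture and the ViT branch carry global tissue context, as the summary describes.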

[369] Computational Imaging Meets LLMs: Zero-Shot IDH Mutation Prediction in Brain Gliomas

Syed Muqeem Mahmood, Hassan Mohy-ud-Din

Main category: eess.IV

TL;DR: Framework combining LLMs with computational image analytics for zero-shot prediction of IDH mutation status in brain gliomas using multi-parametric MRI without fine-tuning.

DetailsMotivation: To enable non-invasive, precise tumor genotyping in neuro-oncology by integrating LLM-based reasoning with computational image analytics, advancing diagnostic strategies.

Method: Processed coregistered multi-parametric MRI scans and tumor segmentation maps to extract semantic attributes and quantitative features, serialized in JSON format, then queried GPT 4o and GPT 5 without fine-tuning.

Result: High accuracy and balanced classification performance across six datasets (N=1427), with GPT 5 outperforming GPT 4o. Volumetric features were most important predictors, supplemented by imaging markers and clinical data.

Conclusion: Integration of LLM-based reasoning with computational image analytics shows strong potential for precise, non-invasive tumor genotyping in neuro-oncology.

Abstract: We present a framework that combines Large Language Models with computational image analytics for non-invasive, zero-shot prediction of IDH mutation status in brain gliomas. For each subject, coregistered multi-parametric MRI scans and multi-class tumor segmentation maps were processed to extract interpretable semantic (visual) attributes and quantitative features, serialized in a standardized JSON file, and used to query GPT 4o and GPT 5 without fine-tuning. We evaluated this framework on six publicly available datasets (N = 1427) and results showcased high accuracy and balanced classification performance across heterogeneous cohorts, even in the absence of manual annotations. GPT 5 outperformed GPT 4o in context-driven phenotype interpretation. Volumetric features emerged as the most important predictors, supplemented by subtype-specific imaging markers and clinical information. Our results demonstrate the potential of integrating LLM-based reasoning with computational image analytics for precise, non-invasive tumor genotyping, advancing diagnostic strategies in neuro-oncology. The code is available at https://github.com/ATPLab-LUMS/CIM-LLM.
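The standardized JSON serialization step might look like the following; every field name and value here is hypothetical, since the paper's actual schema is not given:

```python
import json

# Hypothetical per-subject record mirroring the described pipeline:
# semantic (visual) attributes plus quantitative features, serialized
# to JSON and embedded in a zero-shot prompt.
record = {
    "subject_id": "case_0001",                     # placeholder ID
    "semantic_attributes": {
        "enhancement_pattern": "ring-enhancing",
        "edema_extent": "moderate",
    },
    "quantitative_features": {
        "tumor_core_volume_ml": 12.4,
        "edema_volume_ml": 38.9,
        "necrosis_fraction": 0.21,
    },
}

payload = json.dumps(record, indent=2)
prompt = (
    "Given the following MRI-derived features, predict IDH mutation "
    "status (mutant/wildtype):\n" + payload
)
```

Serializing features into a fixed schema keeps prompts consistent across the six cohorts, which matters for zero-shot querying without fine-tuning.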

[370] MAROON: A Framework for the Joint Characterization of Near-Field High-Resolution Radar and Optical Depth Imaging Techniques

Vanessa Wirth, Johanna Bräunig, Nikolai Hofmann, Martin Vossiek, Tim Weyrich, Marc Stamminger

Main category: eess.IV

TL;DR: This paper presents a multimodal characterization of optical and radar depth sensors for close-range applications, comparing their performance across different materials, geometries, and distances through a comprehensive dataset called MAROON.

DetailsMotivation: There is limited research on the intersection of optical depth sensors and close-range radars, especially with growing interest in high-resolution imaging radars operating in near-field conditions. Understanding how these sensors compare is crucial for robust computer-assisted tasks like autonomous driving.

Method: The authors use multimodal spatial calibration to jointly characterize four depth imagers (three optical sensors with varying operation principles and one imaging radar). They evaluate depth measurements across different object materials, geometries, and object-to-sensor distances.

Result: The study reveals scattering effects of partially transmissive materials and investigates radio-frequency signal responses. The comprehensive evaluation provides insights into sensor behavior in close-range scenarios.

Conclusion: All object measurements are made publicly available as the MAROON multimodal dataset, enabling further research in multimodal sensor fusion for close-range applications.

Abstract: Utilizing the complementary strengths of wavelength-specific range or depth sensors is crucial for robust computer-assisted tasks such as autonomous driving. Despite this, there is still little research done at the intersection of optical depth sensors and radars operating at close range, where the target is decimeters away from the sensors. Together with a growing interest in high-resolution imaging radars operating in the near field, the question arises how these sensors behave in comparison to their traditional optical counterparts. In this work, we take on the unique challenge of jointly characterizing depth imagers from both the optical and radio-frequency domains using a multimodal spatial calibration. We collect data from four depth imagers, with three optical sensors of varying operation principle and an imaging radar. We provide a comprehensive evaluation of their depth measurements with respect to distinct object materials, geometries, and object-to-sensor distances. Specifically, we reveal scattering effects of partially transmissive materials and investigate the response of radio-frequency signals. All object measurements will be made public in form of a multimodal dataset, called MAROON.

[371] Alleviating Hyperparameter-Tuning Burden in SVM Classifiers for Pulmonary Nodules Diagnosis with Multi-Task Bayesian Optimization

Wenhao Chi, Haiping Liu, Hongqiao Dong, Wenhua Liang, Bo Liu

Main category: eess.IV

TL;DR: This paper investigates using multi-task Bayesian optimization to accelerate hyperparameter search for classifying benign vs malignant pulmonary nodules using RBF SVM, reducing redundant training across multiple image discretization methods.

DetailsMotivation: Radiomic features for tumor diagnosis are affected by image discretization methods, requiring evaluation of multiple strategies individually which leads to redundant and time-consuming model training and hyperparameter tuning.

Method: Employed multi-task Bayesian optimization to accelerate hyperparameter search for RBF SVM classification of pulmonary nodules, comparing against single-task approach.

Result: Multi-task Bayesian optimization significantly accelerates hyperparameter search compared to single-task approach.

Conclusion: This is the first investigation to utilize multi-task Bayesian optimization in a critical medical context, demonstrating its feasibility for accelerating diagnosis-related hyperparameter optimization.

Abstract: In the field of non-invasive medical imaging, radiomic features are utilized to measure tumor characteristics. However, these features can be affected by the techniques used to discretize the images, ultimately impacting the accuracy of diagnosis. To investigate the influence of various image discretization methods on diagnosis, it is common practice to evaluate multiple discretization strategies individually. This approach often leads to redundant and time-consuming tasks such as training predictive models and fine-tuning hyperparameters separately. This study examines the feasibility of employing multi-task Bayesian optimization to accelerate the hyperparameter search for classifying benign and malignant pulmonary nodules using RBF SVM. Our findings suggest that multi-task Bayesian optimization significantly accelerates the search for hyperparameters in comparison to a single-task approach. To the best of our knowledge, this is the first investigation to utilize multi-task Bayesian optimization in a critical medical context.
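An RBF SVM has two key hyperparameters, the regularization strength C and the kernel width gamma, and Bayesian optimization explores this space per discretization task. A sketch of the log-uniform search space such an optimizer would sample; the ranges are common illustrative defaults, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_hyperparams(n, c_range=(1e-2, 1e3), g_range=(1e-4, 1e1)):
    """Log-uniform candidates for RBF-SVM (C, gamma) -- the search space
    a (multi-task) Bayesian optimizer would explore per discretization task."""
    log_c = rng.uniform(np.log10(c_range[0]), np.log10(c_range[1]), n)
    log_g = rng.uniform(np.log10(g_range[0]), np.log10(g_range[1]), n)
    return 10.0 ** log_c, 10.0 ** log_g

C, gamma = sample_hyperparams(20)
```

The multi-task variant shares surrogate-model information across the discretization tasks, so good (C, gamma) regions found on one task warm-start the search on the others rather than each task restarting from scratch.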

[372] NeurOp-Diff:Continuous Remote Sensing Image Super-Resolution via Neural Operator Diffusion

Zihao Xu, Yuzhi Tang, Bowen Xu, Qingquan Li

Main category: eess.IV

TL;DR: Proposes NeurOp-Diff, a diffusion model guided by neural operators for continuous remote sensing image super-resolution that addresses artifacts and smoothing issues in existing methods.

DetailsMotivation: Most publicly accessible remote sensing data suffer from low resolution, limiting their practical applications.

Method: Uses neural operators to learn resolution representations at arbitrary scales, encoding LR images into high-dimensional features as prior conditions to guide diffusion model denoising. Adjusts super-resolution scale by scaling factor s for different magnifications.

Result: Effectively addresses artifacts and excessive smoothing issues in existing SR methods, enabling generation of high-quality, continuous super-resolution images.

Conclusion: Experiments on multiple datasets demonstrate the effectiveness of NeurOp-Diff for remote sensing image super-resolution.

Abstract: Most publicly accessible remote sensing data suffer from low resolution, limiting their practical applications. To address this, we propose a diffusion model guided by neural operators for continuous remote sensing image super-resolution (NeurOp-Diff). Neural operators are used to learn resolution representations at arbitrary scales, encoding low-resolution (LR) images into high-dimensional features, which are then used as prior conditions to guide the diffusion model for denoising. This effectively addresses the artifacts and excessive smoothing issues present in existing super-resolution (SR) methods, enabling the generation of high-quality, continuous super-resolution images. Specifically, we adjust the super-resolution scale by a scaling factor s, allowing the model to adapt to different super-resolution magnifications. Furthermore, experiments on multiple datasets demonstrate the effectiveness of NeurOp-Diff. Our code is available at https://github.com/zerono000/NeurOp-Diff.
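The scaling factor s controls a continuous output resolution. A rough numpy stand-in for querying a feature map at an arbitrary scale (plain bilinear interpolation here, where the paper uses a learned neural operator):

```python
import numpy as np

def continuous_upsample(feat, s):
    """Sample a 2D feature map at an arbitrary scale s via bilinear
    interpolation -- a stand-in for the learned neural-operator decoder."""
    h, w = feat.shape
    H, W = int(round(h * s)), int(round(w * s))
    ys = np.linspace(0, h - 1, H)
    xs = np.linspace(0, w - 1, W)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    wy = (ys - y0)[:, None]   # fractional row offsets, (H, 1)
    wx = (xs - x0)[None, :]   # fractional column offsets, (1, W)
    f00 = feat[y0][:, x0]
    f01 = feat[y0][:, x0 + 1]
    f10 = feat[y0 + 1][:, x0]
    f11 = feat[y0 + 1][:, x0 + 1]
    return ((1 - wy) * ((1 - wx) * f00 + wx * f01)
            + wy * ((1 - wx) * f10 + wx * f11))

lr = np.arange(16, dtype=float).reshape(4, 4)
hr = continuous_upsample(lr, 2.5)   # 4x4 -> 10x10, non-integer scale
```

Because s can be any positive real, one model serves all magnifications, which is the core appeal of continuous (arbitrary-scale) super-resolution.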

[373] BRISC: Annotated Dataset for Brain Tumor Segmentation and Classification

Amirreza Fateh, Yasin Rezvani, Sara Moayedi, Sadjad Rezvani, Fatemeh Fateh, Mansoor Fateh, Vahid Abolghasemi

Main category: eess.IV

TL;DR: BRISC is a new brain tumor dataset with 6,000 expert-annotated MRI scans for segmentation and classification tasks, addressing the lack of high-quality medical imaging datasets.

DetailsMotivation: Address the gap in high-quality, balanced, and diverse brain tumor datasets with expert annotations for medical image analysis.

Method: Created BRISC dataset by collating 6,000 contrast-enhanced T1-weighted MRI scans from multiple public datasets and performing expert annotation by certified radiologists and physicians.

Result: Developed a comprehensive dataset with high-resolution segmentation masks covering three major tumor types (glioma, meningioma, pituitary) and non-tumorous cases across multiple imaging planes.

Conclusion: BRISC dataset enables robust model development for brain tumor segmentation and classification, with benchmark results provided and the dataset made publicly available.

Abstract: Accurate segmentation and classification of brain tumors from Magnetic Resonance Imaging (MRI) remain key challenges in medical image analysis, primarily due to the lack of high-quality, balanced, and diverse datasets with expert annotations. In this work, we address this gap by introducing BRISC, a dataset designed for brain tumor segmentation and classification tasks, featuring high-resolution segmentation masks. The dataset comprises 6,000 contrast-enhanced T1-weighted MRI scans, which were collated from multiple public datasets that lacked segmentation labels. Our primary contribution is the subsequent expert annotation of these images, performed by certified radiologists and physicians. It includes three major tumor types, namely glioma, meningioma, and pituitary, as well as non-tumorous cases. Each sample includes high-resolution labels and is categorized across axial, sagittal, and coronal imaging planes to facilitate robust model development and cross-view generalization. To demonstrate the utility of the dataset, we provide benchmark results for both tasks using standard deep learning models. The BRISC dataset is made publicly available. Dataset links: Kaggle (https://www.kaggle.com/datasets/briscdataset/brisc2025/), Figshare (https://doi.org/10.6084/m9.figshare.30533120), Zenodo (https://doi.org/10.5281/zenodo.17524350)

Last updated: 2025-11-28
Built with Hugo; theme modified from Stack